
Merge #2775

Design about scaling across multiple cores.

Conflicts:
	doc/design/resolver/01-scaling-across-cores (taken from the branch)
	doc/design/resolver/02-mixed-recursive-authority-setup (taken from master)
Michal 'vorner' Vaner 12 years ago
parent commit 1095506737
2 changed files with 592 additions and 10 deletions
  doc/design/resolver/01-scaling-across-cores   +336 -10
  doc/design/resolver/03-cache-algorithm.txt    +256 -0

+ 336 - 10
doc/design/resolver/01-scaling-across-cores

@@ -1,7 +1,9 @@
-01-scaling-across-cores
+Scaling across (many) cores
+===========================
+
+Problem statement
+-----------------
 
 
-Introduction
-------------
 The general issue is how to ensure that the resolver scales.
 
 
 Currently resolvers are CPU bound, and it seems likely that both
@@ -10,12 +12,336 @@ scaling will need to be across multiple cores.
 
 
 How can we best scale a recursive resolver across multiple cores?
 
 
-Some possible solutions:
+Image of how resolution looks
+-----------------------------
+
+                               Receive the query. @# <------------------------\
+                                       |                                      |
+                                       |                                      |
+                                       v                                      |
+                                 Parse it, etc. $                             |
+                                       |                                      |
+                                       |                                      |
+                                       v                                      |
+                              Look into the cache. $#                         |
+       Cry  <---- No <---------- Is it there? -----------> Yes ---------\     |
+        |                            ^                                  |     |
+ Prepare upstream query $            |                                  |     |
+        |                            |                                  |     |
+        v                            |                                  |     |
+  Send an upstream query (#)         |                                  |     |
+        |                            |                                  |     |
+        |                            |                                  |     |
+        v                            |                                  |     |
+    Wait for answer @(#)             |                                  |     |
+        |                            |                                  |     |
+        v                            |                                  |     |
+       Parse $                       |                                  |     |
+        |                            |                                  |     |
+        v                            |                                  |     |
+   Is it enough? $ ----> No ---------/                                  |     |
+        |                                                               |     |
+       Yes                                                              |     |
+        |                                                               |     |
+        \-----------------------> Build answer $ <----------------------/     |
+                                        |                                     |
+                                        |                                     |
+                                        v                                     |
+                                   Send answer # -----------------------------/
+
+This is a simplified version, however. There may be other tasks (validation,
+for example) which are not drawn, mostly for simplicity, as they don't
+introduce new problems. Validation would be done as part of some
+computational task and could trigger more lookups in the cache or more
+upstream queries.
+
+Also, multiple queries may generate the same upstream query, so they should be
+aggregated together somehow.
+
+Legend
+~~~~~~
+ * $ - CPU intensive
+ * @ - Waiting for external event
+ * # - Possible interaction with other tasks
+
+Goals
+-----
+ * Run the CPU intensive tasks in multiple threads to allow concurrency.
+ * Minimise waiting for locks.
+ * Don't require too much memory.
+ * Minimise the number of upstream queries (both because they are slow and
+   expensive and also because we don't want to eat too much bandwidth and spam
+   the authoritative servers).
+ * A design simple enough that it can be implemented.
+
+Naïve version
+-------------
+
+Let's look at possible approaches and list their pros and cons. Many of the
+simple versions would not really work, but let's have a look at them anyway,
+because thinking about them might bring some solutions for the real versions.
+
+We take one query, handle it fully, with blocking waits for the answers. After
+this is done, we take another. The cache is private to each process.
+
+Advantages:
+
+ * Very simple.
+ * No locks.
+
+Disadvantages:
+
+ * To scale across cores, we need to run *a lot* of processes, since they'd be
+   waiting for something most of their time. That means a lot of memory eaten,
+   because each one has its own cache. Also, running so many processes may be
+   problematic in itself; processes are not very cheap.
+ * Many things would be asked multiple times, because the caches are not
+   shared.
+
+Threads
+~~~~~~~
+
+Some of the problems could be solved by using threads, but that would not
+improve things much, since threads are not really cheap either (starting
+several hundred threads might not be a good idea).
+
+Also, threads bring other problems. When we still assume separate caches (for
+caches, see below), we need to ensure safe access to logging, configuration,
+network, etc. These could become a bottleneck (e.g. if we lock every time we
+read a packet from the network and there are many threads, they'll just fight
+over the lock).
+
+Supercache
+~~~~~~~~~~
+
+The problem with the cache could be solved by placing a ``supercache'' between
+the resolvers and the Internet. That one would do almost no processing: it
+would just take the query, look it up in the cache and either answer from the
+cache or forward the query to the external world. It would then store the
+answer and forward it back.
+
+The cache, if single-threaded, could be a bottleneck. To solve that, there are
+several possible approaches:
+
+Layered cache::
+  Each process has its own small cache, which catches many queries. Then, a
+  group of processes shares another level of bigger cache, which catches most
+  of the queries that get past the private caches. We further group them and
+  each level handles fewer queries from each process, so they can keep up.
+  However, with each level, we add some overhead to do another lookup.
+Segmented cache::
+  We have several caches of the same level, in parallel. Before asking a
+  cache, we hash the query and use the hash to decide which cache to ask. Only
+  that cache would have the answer, if any, and each cache could run in a
+  separate process.
+  The only problem is, could there be a pattern of queries that would skew to
+  use only one cache while the rest would be idle?
+Shared cache access::
+  A cache would be accessed by multiple processes/threads. See below for
+  details, but there's a risk of lock contention on the cache (it depends on
+  the data structure).
+
+Upstream queries
+~~~~~~~~~~~~~~~~
+
+Before doing an upstream query, we look into the cache to ensure we don't
+already have the information. When we get the answer, we want to update the
+cache.
+
+This suggests the upstream queries are tightly coupled with the cache. Now,
+when we have several cache processes/threads, each can have its own set of
+open sockets, not shared with the other caches, to do the lookups. This way we
+can avoid locking the upstream network communication.
+
+Also, we can have three conceptual states for data in the cache, and act
+differently when it is requested.
+
+Present::
+  If it is available, in positive or negative version, we just provide the
+  answer right away.
+Not present::
+  The continuation of processing is queued somehow (blocked/callback is
+  stored/whatever). An upstream query is sent and we get to the next state.
+Waiting for answer::
+  If another query for the same thing arrives, we just queue it the same way
+  and keep waiting. When the answer comes, all the queued tasks are resumed.
+  If the TTL > 0, we store the answer and set it to ``present''.
+
+We want to aggregate upstream queries anyway; using the cache for it saves
+some more processing and possibly some locking.
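+
+As a rough illustration of the three states, here is a minimal sketch in C++
+(the names, the callback type and the use of C++11 are assumptions made for
+illustration; a real implementation would also handle TTLs, negative answers
+and the locking or message passing around the entry):
+
+  #include <functional>
+  #include <string>
+  #include <vector>
+
+  // One cache entry for a single question (qname + qtype).
+  struct CacheEntry {
+      enum State { NOT_PRESENT, WAITING_FOR_ANSWER, PRESENT } state;
+      std::string answer;                       // rendered answer, if PRESENT
+      std::vector<std::function<void(const std::string&)> > waiters;
+      CacheEntry() : state(NOT_PRESENT) {}
+  };
+
+  // Called when a query needs this entry. Returns true if the caller can
+  // continue right away, false if it was queued behind an upstream query.
+  bool lookup(CacheEntry& e, std::function<void(const std::string&)> resume) {
+      switch (e.state) {
+      case CacheEntry::PRESENT:                 // answer already cached
+          resume(e.answer);
+          return true;
+      case CacheEntry::WAITING_FOR_ANSWER:      // someone already asked upstream
+          e.waiters.push_back(resume);          // just join the queue
+          return false;
+      case CacheEntry::NOT_PRESENT:             // we are the first one
+      default:
+          e.waiters.push_back(resume);
+          e.state = CacheEntry::WAITING_FOR_ANSWER;
+          // an upstream query would be sent here
+          return false;
+      }
+  }
+
+  // Called when the upstream answer arrives.
+  void answerArrived(CacheEntry& e, const std::string& answer, bool cacheable) {
+      if (cacheable) {                          // TTL > 0
+          e.answer = answer;
+          e.state = CacheEntry::PRESENT;
+      } else {
+          e.state = CacheEntry::NOT_PRESENT;
+      }
+      for (size_t i = 0; i < e.waiters.size(); ++i) {
+          e.waiters[i](answer);                 // resume every queued task
+      }
+      e.waiters.clear();
+  }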
+
+Multiple parallel queries
+-------------------------
+
+It seems obvious we can't afford to have a thread or process for each
+outstanding query. We need to handle multiple queries in each one at any given
+time.
+
+Coroutines
+~~~~~~~~~~
+
+The OS-level threads might be too expensive, but coroutines might be cheap
+enough. That way, we could still write code that would be easy to read,
+but limit the number of OS threads to a reasonable number.
+
+In this model, when a query comes, a new coroutine/user-level thread is created
+for it. We use special reads and writes whenever there's an operation that
+could block. These reads and writes would internally schedule the operation
+and switch to another coroutine (if there's any ready to be executed).
+
+Each thread/process maintains its own set of coroutines and they do not
+migrate. This way, the queue of coroutines is kept lock-less, as well as any
+private caches. Only the shared caches are protected by a lock.
+
+[NOTE]
+The `coro` unit we have in the current code is *not* considered a coroutine
+library here. We would need a coroutine library where we have a real stack for
+each coroutine and we switch the stacks on a coroutine switch. That is possible
+with a reasonable amount of dark magic (see `ucontext.h`, for example, but
+there are surely some higher-level libraries for that).
+
+There is some trouble with multiple coroutines waiting on the same event, like
+the same upstream query (possibly even coroutines from different threads), but
+it should be possible to solve.
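+
+As an illustration of the stack-switching idea, here is a minimal sketch using
+POSIX `ucontext.h` (all the names are made up; a real library would keep a
+scheduler queue of ready coroutines instead of the hard-coded switches below):
+
+  #include <ucontext.h>
+  #include <iostream>
+  #include <vector>
+
+  static ucontext_t scheduler_ctx;   // context of the scheduler loop
+  static ucontext_t query_ctx;       // context of one query coroutine
+
+  // Pretend "blocking" wait: instead of blocking, switch back to the scheduler.
+  void async_wait() {
+      swapcontext(&query_ctx, &scheduler_ctx);   // yield; resumed later
+  }
+
+  void query_coroutine() {
+      std::cout << "parse query, cache miss, send upstream query\n";
+      async_wait();                              // waiting for the upstream answer
+      std::cout << "build and send the answer\n";
+      // falling off the end returns to uc_link (the scheduler)
+  }
+
+  int main() {
+      std::vector<char> stack(64 * 1024);        // dedicated stack for the coroutine
+
+      getcontext(&query_ctx);
+      query_ctx.uc_stack.ss_sp = stack.data();
+      query_ctx.uc_stack.ss_size = stack.size();
+      query_ctx.uc_link = &scheduler_ctx;        // where to go when it finishes
+      makecontext(&query_ctx, query_coroutine, 0);
+
+      swapcontext(&scheduler_ctx, &query_ctx);   // start the coroutine
+      std::cout << "scheduler: coroutine yielded, answer arrives...\n";
+      swapcontext(&scheduler_ctx, &query_ctx);   // resume it
+      std::cout << "scheduler: coroutine finished\n";
+      return 0;
+  }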
+
+Event-based
+~~~~~~~~~~~
+
+We use events (`asio` and stuff) for writing it. Each outstanding query is an
+object with some callbacks on it. When we would do a possibly blocking
+operation, we schedule a callback to happen once the operation finishes.
+
+This is more lightweight than the coroutines (the query objects will be smaller
+than the stacks for coroutines), but it is harder to write and to read.
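+
+A minimal sketch of the callback style (the `EventLoop` type and all the names
+here are hypothetical stand-ins; with real `asio`, the async_* calls would be
+the socket/timer operations taking these callbacks as handlers):
+
+  #include <functional>
+  #include <iostream>
+  #include <memory>
+  #include <string>
+
+  // Hypothetical event loop; the "async" calls just invoke the callback
+  // immediately so the sketch stays self-contained.
+  struct EventLoop {
+      void asyncCacheLookup(const std::string& qname,
+                            std::function<void(bool, std::string)> cb) {
+          cb(false, "");                          // pretend: cache miss
+      }
+      void asyncUpstreamQuery(const std::string& qname,
+                              std::function<void(std::string)> cb) {
+          cb("answer for " + qname);              // pretend: upstream answered
+      }
+      void sendAnswer(const std::string& answer) {
+          std::cout << "sending: " << answer << "\n";
+      }
+  };
+
+  // One outstanding query; it lives as long as some callback references it.
+  class Query : public std::enable_shared_from_this<Query> {
+  public:
+      Query(EventLoop& loop, const std::string& qname) :
+          loop_(loop), qname_(qname)
+      {}
+      void start() {
+          std::shared_ptr<Query> self = shared_from_this();
+          loop_.asyncCacheLookup(qname_, [self](bool hit, std::string answer) {
+              if (hit) {
+                  self->loop_.sendAnswer(answer);
+              } else {
+                  self->askUpstream();
+              }
+          });
+      }
+  private:
+      void askUpstream() {
+          std::shared_ptr<Query> self = shared_from_this();
+          loop_.asyncUpstreamQuery(qname_, [self](std::string answer) {
+              // the cache update would go here
+              self->loop_.sendAnswer(answer);
+          });
+      }
+      EventLoop& loop_;
+      std::string qname_;
+  };
+
+  int main() {
+      EventLoop loop;
+      std::make_shared<Query>(loop, "www.example.org")->start();
+      return 0;
+  }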
+
+[NOTE]
+Do not consider cross-breeding the models. That leads to space-time distortions
+and brain damage. Implementing one on top of the other is OK, but mixing them
+in the same bit of code is a way to the madhouse.
+
+Landlords and peasants
+~~~~~~~~~~~~~~~~~~~~~~
+
+In both the coroutines and event-based models, the cache and other shared
+things are easier to imagine as objects the working threads fight over to hold
+for a short while. In this model, it is easier to imagine each such shared
+object as something owned by a landlord that doesn't let anyone else touch it,
+but you can send requests to him.
+
+A query is an object once again, with some kind of state machine.
+
+Then there are two kinds of threads. The peasants just do the heavy
+work. There's a global work-queue for peasants. Once a peasant is idle, it
+comes to the queue and picks up a handful of queries from there. It does as
+much on each query as possible without requiring any shared resource.
+
+The other kind, the landlords, each have a resource to watch over. So we would
+have a cache (or several parts of a cache), the sockets for accepting queries
+and answering them, and possibly more. Each of these would have a separate landlord
+thread and a queue of tasks to do on the resource (look up something, send an
+answer...).
+
+Similarly, the landlord would take a handful of tasks from its queue and start
+handling them. It would possibly produce some more tasks for the peasants.
+
+The point here is, all the synchronisation is done on the queues, not on the
+shared resources themselves. Also, we would append to a queue only once the
+whole batch was completed. By tweaking the size of the batch, we could balance
+the lock contention, throughput and RTT. The append/remove would be a quick
+operation, and the cost of locks would be amortized over the larger number of
+queries handled per lock operation.
+
+The possible downside is, a query needs to travel across several threads
+during its lifetime. It might turn out that moving the query between cores is
+faster than accessing the cache from several threads, since the query is
+smaller, but it might be slower as well.
+
+It would be critical to have some kind of queue that is fast to append to and
+fast to take the first n items out of. Also, the tasks in the queues can be
+just abstract `boost::function<void (Worker&)>` functors, and each worker
+would just iterate through the queue, calling each functor. The parameter
+would allow easy generation of more tasks for other queues (they would be
+stored privately first, and appended to the remote queues at the end of the
+batch).
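+
+A minimal sketch of such a batched queue (a sketch only; `Worker`, the batch
+size and the use of std::function instead of boost::function are assumptions
+made for illustration):
+
+  #include <algorithm>
+  #include <deque>
+  #include <functional>
+  #include <map>
+  #include <mutex>
+  #include <vector>
+
+  class Worker;                                    // forward declaration
+  typedef std::function<void(Worker&)> Task;       // one unit of work
+
+  // All locking happens once per batch, not once per task.
+  class BatchQueue {
+  public:
+      void appendBatch(std::vector<Task>& batch) {
+          std::lock_guard<std::mutex> lock(mutex_);
+          queue_.insert(queue_.end(), batch.begin(), batch.end());
+          batch.clear();
+      }
+      // Take out up to max_batch tasks in one locked operation.
+      std::vector<Task> takeBatch(size_t max_batch) {
+          std::lock_guard<std::mutex> lock(mutex_);
+          const size_t count = std::min(max_batch, queue_.size());
+          std::vector<Task> batch(queue_.begin(), queue_.begin() + count);
+          queue_.erase(queue_.begin(), queue_.begin() + count);
+          return batch;
+      }
+  private:
+      std::mutex mutex_;
+      std::deque<Task> queue_;
+  };
+
+  // A worker (peasant or landlord) collects tasks for other queues privately
+  // and pushes them out only when the whole batch is done.
+  class Worker {
+  public:
+      explicit Worker(BatchQueue& own_queue) : own_queue_(own_queue) {}
+      void runOneBatch(size_t max_batch = 32) {
+          std::vector<Task> batch = own_queue_.takeBatch(max_batch);
+          for (size_t i = 0; i < batch.size(); ++i) {
+              batch[i](*this);                     // may call sendLater()
+          }
+          // Flush tasks generated for other queues, one lock per target queue.
+          for (std::map<BatchQueue*, std::vector<Task> >::iterator it =
+                   outgoing_.begin(); it != outgoing_.end(); ++it) {
+              it->first->appendBatch(it->second);
+          }
+          outgoing_.clear();
+      }
+      void sendLater(BatchQueue& target, const Task& task) {
+          outgoing_[&target].push_back(task);
+      }
+  private:
+      BatchQueue& own_queue_;
+      std::map<BatchQueue*, std::vector<Task> > outgoing_;
+  };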
+
+Also, if we wanted to generate multiple parallel upstream queries from a single
+query, we would need to be careful. A query object would not have a lock on
+itself and the upstream queries could end up in different caches/threads. To
+protect the original query, we would add another landlord that would aggregate
+answers together and let the query continue processing once it got enough
+answers. That way, the answers would all be pushed to the same thread and
+could not fight over the query.
+
+[NOTE]
+This model would work only with threads, not processes.
+
+Shared caches
+-------------
+
+While it seems it is good to have some sort of L1 cache with pre-rendered
+answers (according to measurements in the #2777 ticket), we probably need some
+kind of larger shared cache.
+
+If we had just a single shared cache protected by a lock, there'd be a lot of
+contention on that lock.
+
+Partitioning the cache
+~~~~~~~~~~~~~~~~~~~~~~
+
+We split the cache into parts, either into layers or into parallel segments
+selected by a hash. Taken to the extreme, a lock on each hash bucket would be
+this kind of partitioning, though that might waste resources (how expensive is
+it to create a lock?).
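+
+A minimal sketch of the hash-partitioned idea, with one mutex per partition
+(the types and the partition count are arbitrary placeholders):
+
+  #include <array>
+  #include <functional>
+  #include <mutex>
+  #include <string>
+  #include <unordered_map>
+
+  class PartitionedCache {
+  public:
+      static const size_t kPartitions = 64;        // tunable
+
+      bool lookup(const std::string& key, std::string& value) {
+          Partition& p = partitionFor(key);
+          std::lock_guard<std::mutex> lock(p.mutex); // contention only within
+          std::unordered_map<std::string, std::string>::const_iterator it =
+              p.data.find(key);                      // one partition
+          if (it == p.data.end()) {
+              return false;
+          }
+          value = it->second;
+          return true;
+      }
+
+      void store(const std::string& key, const std::string& value) {
+          Partition& p = partitionFor(key);
+          std::lock_guard<std::mutex> lock(p.mutex);
+          p.data[key] = value;
+      }
+
+  private:
+      struct Partition {
+          std::mutex mutex;
+          std::unordered_map<std::string, std::string> data;
+      };
+
+      Partition& partitionFor(const std::string& key) {
+          return partitions_[std::hash<std::string>()(key) % kPartitions];
+      }
+
+      std::array<Partition, kPartitions> partitions_;
+  };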
+
+Landlords
+~~~~~~~~~
+
+The landlords do the synchronization themselves. Still, the cache would need
+to be partitioned.
+
+RCU
+~~~
+
+The RCU is a lock-less synchronization mechanism. An item is accessed through a
+pointer.  An updater creates a copy of the structure (in our case, it would be
+content of single hash bucket) and then atomically replaces the pointer. The
+readers from before have the old version, the new ones get the new version.
+When all the old readers die out, the old copy is reclaimed. Also, the
+reclamation can AFAIK be postponed to a later, more idle time, or offloaded to
+a different thread.
+
+We could use it for the cache ‒ in the fast track, we would just read the
+cache. In the slow one, we would have to wait in a queue for the update to be
+done, in a single updater thread (because we don't really want to be updating
+the same cell twice at the same time).
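+
+A very rough sketch of the RCU idea using std::shared_ptr (an illustration
+only; the atomic_load/atomic_store free functions are the C++11 way to swap a
+shared_ptr atomically, and a real implementation would use a proper RCU
+library or hand-rolled reclamation):
+
+  #include <map>
+  #include <memory>
+  #include <mutex>
+  #include <string>
+
+  typedef std::map<std::string, std::string> Bucket; // content of one hash bucket
+
+  class RcuBucket {
+  public:
+      RcuBucket() : current_(std::make_shared<Bucket>()) {}
+
+      // Fast track: lock-less read of the current version; the returned
+      // shared_ptr keeps the old copy alive until the reader drops it.
+      std::shared_ptr<const Bucket> read() const {
+          return std::atomic_load(&current_);
+      }
+
+      // Slow track: copy, modify the copy, publish it atomically. Updates
+      // are serialized (the "single updater" from the text).
+      void update(const std::string& key, const std::string& value) {
+          std::lock_guard<std::mutex> lock(update_mutex_);
+          std::shared_ptr<const Bucket> old = std::atomic_load(&current_);
+          std::shared_ptr<Bucket> copy = std::make_shared<Bucket>(*old);
+          (*copy)[key] = value;
+          std::atomic_store(&current_, std::shared_ptr<const Bucket>(copy));
+      }
+
+  private:
+      std::shared_ptr<const Bucket> current_;
+      std::mutex update_mutex_;
+  };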
+
+Proposals
+---------
+
+In either case, we would have some kind of L1 cache with pre-rendered answers.
+For these proposals (except the third), we wouldn't care if we split the cache
+into parallel chunks or layers.
+
+Hybrid RCU/Landlord
+~~~~~~~~~~~~~~~~~~~
+
+The landlord approach, except that read-only accesses to the cache are done
+directly by the peasants. Only if they don't find what they want do they
+append a task to the landlord's queue. The landlord would be doing the RCU
+updates. It could happen that by the time the landlord gets to the task the
+answer is already there, but that would not matter much.
+
+Network access would be done by the landlords.
+
+Coroutines+RCU
+~~~~~~~~~~~~~~
+
+We would do the coroutines, and reads from the shared cache would go without
+locking. When writing, we would have to lock.
+
+To avoid locking, each worker thread would have its own set of upstream
+sockets and we would dup() the client-facing sockets for each worker so we
+don't have to lock those either.
 
 
-a. Multiple processes with independent caches
-b. Multiple processes with shared cache
-c. A mix of independent/shared cache
-d. Thread variations of the above
+Multiple processes with coroutines and RCU
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 
-All of these may be complicated by NUMA architectures (with
-faster/slower access to specific RAM).
+This would need the layered cache. The upper caches would be mapped into local
+memory for read-only access. Each cache would be a separate process. That
+process would do the updates ‒ if the answer was not there, the process would
+be asked by some kind of IPC to pull it from the upstream cache or from the
+network.

+ 256 - 0
doc/design/resolver/03-cache-algorithm.txt

@@ -0,0 +1,256 @@
+03-cache-algorithm
+
+Introduction
+------------
+Cache performance may be important for the resolver. It might not be
+critical. We need to research this.
+
+One key question is: given a specific cache hit rate, how much of an
+impact does cache performance have?
+
+For example, if we have 90% cache hit rate, will we still be spending
+most of our time in system calls or in looking things up in our cache?
+
+There are several ways we can consider figuring this out, including
+measuring this in existing resolvers (BIND 9, Unbound) or modeling
+with specific values.
+
+Once we know how critical the cache performance is, we can consider
+which algorithm is best for that. If it is very critical, then a
+custom algorithm designed for DNS caching makes sense. If it is not,
+then we can consider using an STL-based data structure.
+
+Effectiveness of Cache
+----------------------
+
+First, I'll try to answer the introductory questions.
+
+In a simplified model, we can express the fraction of the total running time
+spent answering queries directly from the cache (the rest being spent on
+recursive resolution due to cache misses) as follows:
+
+A = r*Q2 / (r*Q2 + (1-r)*Q1)
+where
+A: fraction of run time spent answering queries directly from the cache
+   (0 <= A <= 1)
+r: cache hit rate (0<=r<=1)
+Q1: max qps of the server with 100% cache hit
+Q2: max qps of the server with 0% cache hit
+
+Q1 can be measured easily for a given data set; measuring Q2 is tricky
+in general (it requires many external queries with unreliable
+results), but we can still have some not-so-unrealistic numbers
+through controlled simulation.
+
+As a data point for these values, see previous experimental results
+of mine:
+https://lists.isc.org/pipermail/bind10-dev/2012-July/003628.html
+
+Looking at the "ideal" server implementation (no protocol overhead)
+with the setups of 90% and 85% cache hit rates, 1 recursion on cache
+miss, and the possible maximum total throughput, we can deduce
+Q1 and Q2, which are 170591qps and 60138qps respectively.
+
+This means, with a 90% cache hit rate (r = 0.9), the server would spend
+76% of its run time receiving queries and answering them directly from
+the cache: 0.9*60138/(0.9*60138 + 0.1*170591) = 0.76.
+
+I also ran more realistic experiments: using BIND 9.9.2 and unbound
+1.4.19 in the "forward only" mode with crafted query data and a
+forwarded-to server to emulate the situations of 100% and 0% cache hit
+rates.  I then measured the max response throughput using a
+queryperf-like tool.  In both cases Q2 is about 28% of Q1 (I'm not
+showing specific numbers to avoid unnecessary discussion about
+specific performance of existing servers; it's out of scope of this
+memo).  Using Q2 = 0.28*Q1, the above equation with a 90% cache hit rate
+gives: A = 0.9*0.28 / (0.9*0.28 + 0.1) = 0.716. So the server will
+spend about 72% of its running time to answer queries directly from
+the cache.
+
+Of course, these experimental results are too simplified.  First, in
+these experiments we assumed only one external query is needed on a
+cache miss.  In general it can be more; however, it may not actually
+be too optimistic either: in another research result of mine:
+http://bind10.isc.org/wiki/ResolverPerformanceResearch
+a more detailed analysis using a real query sample and tracing what an
+actual resolver would do suggested we'd need about 1.44 to 1.63
+external queries per cache miss on average.
+
+Still, of course, the real world cases are not that simple: in reality
+we'd need to deal with timeouts, slower remote servers, unexpected
+intermediate results, etc.  DNSSEC validating resolvers will clearly
+need to do more work.
+
+So, in real world deployments Q2 should be much smaller than Q1.
+Here are some specific cases of the relationship between Q1 and Q2 for
+a given A (assuming r = 0.9):
+
+70%: Q2 = 0.26 * Q1
+60%: Q2 = 0.17 * Q1
+50%: Q2 = 0.11 * Q1
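+
+(These ratios follow from solving the above equation for Q2/Q1:
+Q2/Q1 = A*(1-r) / (r*(1-A)); e.g. for A = 0.7 and r = 0.9 this is
+0.7*0.1 / (0.9*0.3) = 0.26.)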
+
+So, even if "recursive resolution is 10 times heavier" than the cache
+only case, we can assume the server spends half of its run time
+answering queries directly from the cache at a cache hit rate of
+90%.  I think this is a reasonably safe assumption.
+
+Now, assuming a number of 50% or more, does this suggest we should
+highly optimize the cache?  Opinions may vary on this point, but I
+personally think the answer is yes.  I've written an experimental
+cache only implementation that employs the idea of fully-rendered
+cached data.  On one test machine (2.20GHz AMD64, using a single
+core), a queryperf-like benchmark shows it can handle over 180Kqps,
+while BIND 9.9.2 handles just 41Kqps.  The experimental
+implementation skips some features necessary for a production server,
+and cache management itself is always an inevitable bottleneck, so the
+production version wouldn't be that fast, but it still suggests it may
+not be very difficult to reach over 100Kqps in a production environment
+including recursive resolution overhead.
+
+Cache Types
+-----------
+
+1. Record cache
+
+Conceptually, any recursive resolver (with cache) implementation would
+have a cache for RRs (or RRsets in the modern version of the protocol)
+given in responses to its external queries.  In BIND 9, it's called the
+"cached DB", using an in-memory rbt-like tree.  unbound calls it
+"rrset cache", which is implemented as a hash table.
+
+2. Delegation cache
+
+Recursive server implementations would also have a cache to determine
+the deepest zone cut for a given query name in the recursion process.
+Neither BIND 9 nor unbound has a separate cache for this purpose;
+basically they try to find an NS RRset from the "record cache" whose
+owner name best matches the given query name.
+
+3. Remote server cache
+
+In addition, a recursive server implementation may maintain a cache
+for information about remote authoritative servers.  Both BIND 9 and
+unbound conceptually have this type of cache, although there are some
+non-negligible differences in details.  BIND 9's implementation of
+this cache is called ADB.  It's a hash table whose key is a domain name,
+and each entry stores corresponding IPv6/v4 addresses; another data
+structure for each address stores averaged RTT for the address,
+lameness information, EDNS availability, etc.  unbound's
+implementation is called "infrastructure cache".  It's a hash table
+keyed with IP addresses whose entries store similar information as
+that in BIND 9's per address ADB entry.  In unbound a remote server's
+address must be determined by looking up the record cache (rrset cache
+in unbound terminology); unlike BIND 9's ADB, there's no direct
+shortcut from a server's domain name to IP addresses.
+
+4. Full response cache
+
+unbound has an additional cache layer, called the "message cache".
+It's a hash table whose hash key is the query parameters (essentially qname
+and type) and whose entry is a sequence of record (rrset) cache entries.
+This sequence constructs a complete response to the corresponding
+query, so it helps optimize building a response message by skipping
+record cache lookups for each section (answer/authority/additional) of the
+response message.  PowerDNS recursor has (seemingly) the same concept
+called "packet cache" (but I don't know its implementation details
+very much).
+
+BIND 9 doesn't have this type of cache; it always looks into the
+record cache to build a complete response to a given query.
+
+Miscellaneous General Requirements
+----------------------------------
+
+- Minimize contention between threads (if threaded)
+- Cache purge policy: normally only a very small part of cached DNS
+  information will be reused, and the information that is reused is reused
+  very heavily.  So an LRU-like algorithm should generally work well, but
+  we'll also need to honor DNS TTLs (a sketch of one such combination
+  follows below).
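+
+A minimal sketch of an LRU cache that also honors TTLs (the names and the
+std::list/std::unordered_map layout are just one obvious way to do it, not a
+proposal for the final data structure):
+
+  #include <chrono>
+  #include <list>
+  #include <string>
+  #include <unordered_map>
+
+  class LruTtlCache {
+  public:
+      typedef std::chrono::steady_clock Clock;
+
+      explicit LruTtlCache(size_t capacity) : capacity_(capacity) {}
+
+      void store(const std::string& key, const std::string& value,
+                 std::chrono::seconds ttl) {
+          remove(key);                              // replace any old entry
+          order_.push_front(key);                   // newest is most recent
+          Entry e = { value, Clock::now() + ttl, order_.begin() };
+          entries_[key] = e;
+          if (entries_.size() > capacity_) {        // evict least recently used
+              const std::string victim = order_.back();
+              remove(victim);
+          }
+      }
+
+      bool lookup(const std::string& key, std::string& value) {
+          std::unordered_map<std::string, Entry>::iterator it = entries_.find(key);
+          if (it == entries_.end()) {
+              return false;
+          }
+          if (Clock::now() >= it->second.expiry) {  // expired by DNS TTL
+              remove(key);
+              return false;
+          }
+          // Move to the front of the LRU order.
+          order_.splice(order_.begin(), order_, it->second.position);
+          value = it->second.value;
+          return true;
+      }
+
+  private:
+      struct Entry {
+          std::string value;
+          Clock::time_point expiry;
+          std::list<std::string>::iterator position;
+      };
+
+      void remove(const std::string& key) {
+          std::unordered_map<std::string, Entry>::iterator it = entries_.find(key);
+          if (it != entries_.end()) {
+              order_.erase(it->second.position);
+              entries_.erase(it);
+          }
+      }
+
+      size_t capacity_;
+      std::list<std::string> order_;                // most recently used first
+      std::unordered_map<std::string, Entry> entries_;
+  };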
+
+Random Ideas for BIND 10
+------------------------
+
+Below are specific random ideas for BIND 10.  Some are based on
+experimental results with reasonably realistic data; some others are
+mostly a guess.
+
+1. Fully rendered response cache
+
+Some real world query samples show that a very small portion of all
+queries are very popular and are asked very often; the rest are rarely
+reused, if at all.  Two different data sets show the top 10,000 queries
+would cover around 80% of total queries, regardless of the total query
+volume.  This suggests the idea of having a small, highly optimized full
+response cache.
+
+I tried this idea in the jinmei-l1cache branch.  It's a hash table
+keyed with a tuple of query name and type whose entry stores a fully
+rendered, wire-format response image (answer section only, assuming
+the "minimal-responses" option).  It also maintains offsets to each
+RR, so it can easily update TTLs when necessary or rotate RRs if
+optionally requested.  If neither TTL adjustment nor RR rotation is
+required, query handling is just a hash table lookup and a copy of the
+pre-rendered data.  An experimental benchmark showed it ran very fast:
+more than 4 times faster than BIND 9, and even much faster than other
+implementations that have a full response cache (although, as usual, the
+comparison is not entirely fair).
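+
+A rough sketch of what such a cache entry and table could look like (the
+field names are made up here and do not necessarily match the
+jinmei-l1cache branch):
+
+  #include <stdint.h>
+  #include <string>
+  #include <unordered_map>
+  #include <utility>
+  #include <vector>
+
+  // One pre-rendered answer, ready to be copied into a response.
+  struct PrerenderedResponse {
+      std::vector<uint8_t> wire;          // rendered answer section, wire format
+      std::vector<uint16_t> rr_offsets;   // offset of each RR, for TTL updates
+                                          // and optional RR rotation
+      std::vector<uint32_t> original_ttls;
+      // plus the render time, to compute how much to decrement the TTLs
+  };
+
+  // Key: query name (canonical case) + query type.
+  typedef std::pair<std::string, uint16_t> QueryKey;
+
+  struct QueryKeyHash {
+      size_t operator()(const QueryKey& k) const {
+          return std::hash<std::string>()(k.first) ^ k.second;
+      }
+  };
+
+  typedef std::unordered_map<QueryKey, PrerenderedResponse, QueryKeyHash>
+      ResponseCache;
+
+  // Answering from this cache is then essentially: look up (qname, qtype);
+  // if found, copy `wire`, patch the TTLs using rr_offsets, prepend the
+  // query-specific header and send it.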
+
+Also, the cache size is quite small; the run time memory footprint of
+this server process was just about 5MB.  So, I think it's reasonable
+to have each process/thread have its own copy of this cache to
+completely eliminate contention.  Also, if we can keep the cache size
+this small, it would be easier to dump it to a file on shutdown and
+reuse it on restart.  This will be quite effective (if the downtime is
+reasonably short) because the cached data are expected to be highly
+popular.
+
+2. Record cache
+
+For the normal record cache, I don't have a particular idea beyond
+something obvious, like a hash table to map from query parameters to
+corresponding RRset (or negative information).  But I guess this cache
+should be shared by multiple threads.  That will help reconstruct the
+full response cache data on TTL expiration more efficiently.  And, if
+shared, the data structure should be chosen so that contention
+overhead can be minimized.  In general, I guess something like a hash
+table is more suitable than a tree-like structure in that sense.
+
+There are other points to discuss for this cache related to other types
+of cache (see below).
+
+3. Separate delegation cache
+
+One thing I'm guessing is that it may make sense if we have a separate
+cache structure for delegation data.  It's conceptually a set of NS
+RRs so we can identify the best (longest) matching one for a given
+query name.
+
+Analysis of some sets of query data showed that the vast majority of
+end clients' queries are for A and AAAA (not surprisingly).  So, even
+if we separate this cache from the record cache, the additional
+overhead (both for memory and fetch) will probably (hopefully) be
+marginal.  Separating caches will also help reduce contention between
+threads.  It *might* also help improve lookup performance because this
+can be optimized for longest match search.
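+
+One possible shape of such a structure, kept deliberately simple (an
+illustration only; this is not how BIND 9 or unbound do it, and a real
+version would use wire-format names and handle the root properly):
+
+  #include <map>
+  #include <string>
+  #include <vector>
+
+  // The NS names for one zone cut.
+  typedef std::vector<std::string> NameServers;
+
+  class DelegationCache {
+  public:
+      void addZoneCut(const std::string& zone, const NameServers& ns) {
+          cuts_[zone] = ns;
+      }
+
+      // Find the deepest (longest-matching) known zone cut for qname by
+      // stripping leading labels until a stored cut matches.
+      const NameServers* findDeepestCut(std::string qname) const {
+          for (;;) {
+              std::map<std::string, NameServers>::const_iterator it =
+                  cuts_.find(qname);
+              if (it != cuts_.end()) {
+                  return &it->second;
+              }
+              const std::string::size_type dot = qname.find('.');
+              if (dot == std::string::npos) {
+                  return 0;                     // nothing cached; use root hints
+              }
+              qname = qname.substr(dot + 1);    // strip the leftmost label
+          }
+      }
+
+  private:
+      std::map<std::string, NameServers> cuts_;
+  };
+
+For example, with a cut stored for "example.org", a lookup for
+"www.example.org" first tries "www.example.org" and then finds
+"example.org".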
+
+4. Remote server cache without involving the record cache
+
+Likewise, it may make sense to maintain the remote server cache
+separately from the record cache.  I guess these AAAA and A records
+are rarely queried directly by end clients, so, like the case of the
+delegation cache, it's possible that the data sets are mostly disjoint.
+Also, for this purpose the RRsets don't have to have a higher trust rank
+(per RFC2181 5.4.1): glue or additional data are okay, and, by separating
+these from the record cache, we can avoid accidental promotion of these
+data to trustworthy answers and returning them to clients (BIND 9 had
+this type of bug before).
+
+Custom vs Existing Library (STL etc)
+------------------------------------
+
+It may have to be discussed, but I guess in many cases we end up
+introducing custom implementations because these caches should be
+highly performance sensitive, directly related to our core business, and
+also have to be memory efficient.  But in some subcomponents we may
+be able to benefit from existing generic libraries.