
Merge #2775

Design about scaling across multiple cores.

Conflicts:
	doc/design/resolver/01-scaling-across-cores (taken from the branch)
	doc/design/resolver/02-mixed-recursive-authority-setup (taken from master)
Michal 'vorner' Vaner 12 years ago
parent commit 1095506737
2 changed files with 592 additions and 10 deletions
  doc/design/resolver/01-scaling-across-cores   +336 -10
  doc/design/resolver/03-cache-algorithm.txt    +256 -0

+ 336 - 10
doc/design/resolver/01-scaling-across-cores

@@ -1,7 +1,9 @@
-01-scaling-across-cores
+Scaling across (many) cores
+===========================
+
+Problem statement
+-----------------
 
 
-Introduction
-------------
 The general issue is how to ensure that the resolver scales.
 
 
 Currently resolvers are CPU bound, and it seems likely that both
@@ -10,12 +12,336 @@ scaling will need to be across multiple cores.
 
 
 How can we best scale a recursive resolver across multiple cores?
 
 
-Some possible solutions:
+Image of how resolution looks
+-----------------------------
+
+                               Receive the query. @# <------------------------\
+                                       |                                      |
+                                       |                                      |
+                                       v                                      |
+                                 Parse it, etc. $                             |
+                                       |                                      |
+                                       |                                      |
+                                       v                                      |
+                              Look into the cache. $#                         |
+       Cry  <---- No <---------- Is it there? -----------> Yes ---------\     |
+        |                            ^                                  |     |
+ Prepare upstream query $            |                                  |     |
+        |                            |                                  |     |
+        v                            |                                  |     |
+  Send an upstream query (#)         |                                  |     |
+        |                            |                                  |     |
+        |                            |                                  |     |
+        v                            |                                  |     |
+    Wait for answer @(#)             |                                  |     |
+        |                            |                                  |     |
+        v                            |                                  |     |
+       Parse $                       |                                  |     |
+        |                            |                                  |     |
+        v                            |                                  |     |
+   Is it enough? $ ----> No ---------/                                  |     |
+        |                                                               |     |
+       Yes                                                              |     |
+        |                                                               |     |
+        \-----------------------> Build answer $ <----------------------/     |
+                                        |                                     |
+                                        |                                     |
+                                        v                                     |
+                                   Send answer # -----------------------------/
+
+This is a simplified version, however. There may be other tasks (validation,
+for example) which are not drawn, mostly for simplicity, as they don't
+introduce new problems. Validation would be done as part of some
+computational task and could trigger more lookups in the cache or more
+upstream queries.
+
+Also, multiple queries may generate the same upstream query, so they should be
+aggregated together somehow.
+
+Legend
+~~~~~~
+ * $ - CPU intensive
+ * @ - Waiting for external event
+ * # - Possible interaction with other tasks
+
+Goals
+-----
+ * Run the CPU intensive tasks in multiple threads to allow concurrency.
+ * Minimise waiting for locks.
+ * Don't require too much memory.
+ * Minimise the number of upstream queries (both because they are slow and
+   expensive and also because we don't want to eat too much bandwidth and spam
+   the authoritative servers).
+ * A design simple enough that it can be implemented.
+
+Naïve version
+-------------
+
+Let's look at possible approaches and list their pros and cons. Many of the
+simple versions would not really work, but let's have a look at them anyway,
+because thinking about them might bring some solutions for the real versions.
+
+We take one query, handle it fully, with blocking waits for the answers. After
+this is done, we take another. The cache is private to each process.
+
+Advantages:
+
+ * Very simple.
+ * No locks.
+
+Disadvantages:
+
+ * To scale across cores, we need to run *a lot* of processes, since they'd be
+   waiting for something most of their time. That means a lot of memory eaten,
+   because each one has its own cache. Also, running so many processes may be
+   problematic in itself; processes are not very cheap.
+ * Many things would be asked multiple times, because the caches are not
+   shared.
+
+Threads
+~~~~~~~
+
+Some of the problems could be solved by using threads, but that would not
+improve things much, since threads are not really cheap either (starting
+several hundred threads might not be a good idea).
+
+Also, threads bring other problems. When we still assume separate caches (for
+caches, see below), we need to ensure safe access to logging, configuration,
+network, etc. These could become a bottleneck (e.g. if we lock every time we
+read a packet from the network and there are many threads, they'll just fight
+over the lock).
+
+Supercache
+~~~~~~~~~~
+
+The problem with the cache could be solved by placing a ``supercache'' between
+the resolvers and the Internet. That one would do almost no processing: it
+would just take the query, look it up in the cache and either answer from the
+cache or forward the query to the external world. It would then store the
+answer and forward it back.
+
+The cache, if single-threaded, could be a bottleneck. To solve that, there are
+several possible approaches:
+
+Layered cache::
+  Each process has its own small cache, which catches many queries. Then, a
+  group of processes shares another level of bigger cache, which catches most
+  of the queries that get past the private caches. We further group them and
+  each level handles fewer queries from each process, so they can keep up.
+  However, with each level, we add some overhead to do another lookup.
+Segmented cache::
+  We have several caches of the same level, in parallel. Before asking a
+  cache, we hash the query and use the hash to decide which cache to ask. Only
+  that cache would have the answer, if any, and each cache could run in a
+  separate process.
+  The only problem is, could there be a pattern of queries that would skew to
+  use only one cache while the rest would be idle?
+Shared cache access::
+  A cache would be accessed by multiple processes/threads. See below for
+  details, but there's a risk of lock contention on the cache (it depends on
+  the data structure).
+
+Upstream queries
+~~~~~~~~~~~~~~~~
+
+Before doing an upstream query, we look into the cache to ensure we don't
+already have the information. When we get the answer, we want to update the
+cache.
+
+This suggests the upstream queries are tightly coupled with the cache. Now,
+when we have several cache processes/threads, each can have its own set of
+open sockets, not shared with the other caches, to do the lookups. This way we
+can avoid locking the upstream network communication.
+
+Also, we can have three conceptual states for data in the cache, and act
+differently when it is requested.
+
+Present::
+  If it is available, in positive or negative version, we just provide the
+  answer right away.
+Not present::
+  The continuation of processing is queued somehow (blocked/callback is
+  stored/whatever). An upstream query is sent and we get to the next state.
+Waiting for answer::
+  If another query for the same thing arrives, we just queue it the same way
+  and keep waiting. When the answer comes, all the queued tasks are resumed.
+  If the TTL > 0, we store the answer and set it to ``present''.
+
+We want to aggregate upstream queries anyway; using the cache for it saves
+some more processing and possibly some locking.
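+
+As a rough illustration of the three states, here is a minimal sketch in C++
+(the names, the callback type and the use of C++11 are assumptions made for
+illustration; a real implementation would also handle TTLs, negative answers
+and the locking or message passing around the entry):
+
+  #include <functional>
+  #include <string>
+  #include <vector>
+
+  // One cache entry for a single question (qname + qtype).
+  struct CacheEntry {
+      enum State { NOT_PRESENT, WAITING_FOR_ANSWER, PRESENT } state;
+      std::string answer;                       // rendered answer, if PRESENT
+      std::vector<std::function<void(const std::string&)> > waiters;
+      CacheEntry() : state(NOT_PRESENT) {}
+  };
+
+  // Called when a query needs this entry. Returns true if the caller can
+  // continue right away, false if it was queued behind an upstream query.
+  bool lookup(CacheEntry& e, std::function<void(const std::string&)> resume) {
+      switch (e.state) {
+      case CacheEntry::PRESENT:                 // answer already cached
+          resume(e.answer);
+          return true;
+      case CacheEntry::WAITING_FOR_ANSWER:      // someone already asked upstream
+          e.waiters.push_back(resume);          // just join the queue
+          return false;
+      case CacheEntry::NOT_PRESENT:             // we are the first one
+      default:
+          e.waiters.push_back(resume);
+          e.state = CacheEntry::WAITING_FOR_ANSWER;
+          // an upstream query would be sent here
+          return false;
+      }
+  }
+
+  // Called when the upstream answer arrives.
+  void answerArrived(CacheEntry& e, const std::string& answer, bool cacheable) {
+      if (cacheable) {                          // TTL > 0
+          e.answer = answer;
+          e.state = CacheEntry::PRESENT;
+      } else {
+          e.state = CacheEntry::NOT_PRESENT;
+      }
+      for (size_t i = 0; i < e.waiters.size(); ++i) {
+          e.waiters[i](answer);                 // resume every queued task
+      }
+      e.waiters.clear();
+  }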
+
+Multiple parallel queries
+-------------------------
+
+It seems obvious we can't afford to have a thread or process for each
+outstanding query. We need to handle multiple queries in each one at any given
+time.
+
+Coroutines
+~~~~~~~~~~
+
+The OS-level threads might be too expensive, but coroutines might be cheap
+enough. That way, we could still write code that would be easy to read,
+but limit the number of OS threads to a reasonable number.
+
+In this model, when a query comes, a new coroutine/user-level thread is created
+for it. We use special reads and writes whenever there's an operation that
+could block. These reads and writes would internally schedule the operation
+and switch to another coroutine (if there's any ready to be executed).
+
+Each thread/process maintains its own set of coroutines and they do not
+migrate. This way, the queue of coroutines is kept lock-less, as well as any
+private caches. Only the shared caches are protected by a lock.
+
+[NOTE]
+The `coro` unit we have in the current code is *not* considered a coroutine
+library here. We would need a coroutine library where we have a real stack for
+each coroutine and we switch the stacks on a coroutine switch. That is possible
+with a reasonable amount of dark magic (see `ucontext.h`, for example, but
+there are surely some higher-level libraries for that).
+
+There is some trouble with multiple coroutines waiting on the same event, like
+the same upstream query (possibly even coroutines from different threads), but
+it should be possible to solve.
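+
+As an illustration of the stack-switching idea, here is a minimal sketch using
+POSIX `ucontext.h` (all the names are made up; a real library would keep a
+scheduler queue of ready coroutines instead of the hard-coded switches below):
+
+  #include <ucontext.h>
+  #include <iostream>
+  #include <vector>
+
+  static ucontext_t scheduler_ctx;   // context of the scheduler loop
+  static ucontext_t query_ctx;       // context of one query coroutine
+
+  // Pretend "blocking" wait: instead of blocking, switch back to the scheduler.
+  void async_wait() {
+      swapcontext(&query_ctx, &scheduler_ctx);   // yield; resumed later
+  }
+
+  void query_coroutine() {
+      std::cout << "parse query, cache miss, send upstream query\n";
+      async_wait();                              // waiting for the upstream answer
+      std::cout << "build and send the answer\n";
+      // falling off the end returns to uc_link (the scheduler)
+  }
+
+  int main() {
+      std::vector<char> stack(64 * 1024);        // dedicated stack for the coroutine
+
+      getcontext(&query_ctx);
+      query_ctx.uc_stack.ss_sp = stack.data();
+      query_ctx.uc_stack.ss_size = stack.size();
+      query_ctx.uc_link = &scheduler_ctx;        // where to go when it finishes
+      makecontext(&query_ctx, query_coroutine, 0);
+
+      swapcontext(&scheduler_ctx, &query_ctx);   // start the coroutine
+      std::cout << "scheduler: coroutine yielded, answer arrives...\n";
+      swapcontext(&scheduler_ctx, &query_ctx);   // resume it
+      std::cout << "scheduler: coroutine finished\n";
+      return 0;
+  }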
+
+Event-based
+~~~~~~~~~~~
+
+We use events (`asio` and stuff) for writing it. Each outstanding query is an
+object with some callbacks on it. When we would do a possibly blocking
+operation, we schedule a callback to happen once the operation finishes.
+
+This is more lightweight than the coroutines (the query objects will be smaller
+than the stacks for coroutines), but it is harder to write and to read.
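+
+A minimal sketch of the callback style (the `EventLoop` type and all the names
+here are hypothetical stand-ins; with real `asio`, the async_* calls would be
+the socket/timer operations taking these callbacks as handlers):
+
+  #include <functional>
+  #include <iostream>
+  #include <memory>
+  #include <string>
+
+  // Hypothetical event loop; the "async" calls just invoke the callback
+  // immediately so the sketch stays self-contained.
+  struct EventLoop {
+      void asyncCacheLookup(const std::string& qname,
+                            std::function<void(bool, std::string)> cb) {
+          cb(false, "");                          // pretend: cache miss
+      }
+      void asyncUpstreamQuery(const std::string& qname,
+                              std::function<void(std::string)> cb) {
+          cb("answer for " + qname);              // pretend: upstream answered
+      }
+      void sendAnswer(const std::string& answer) {
+          std::cout << "sending: " << answer << "\n";
+      }
+  };
+
+  // One outstanding query; it lives as long as some callback references it.
+  class Query : public std::enable_shared_from_this<Query> {
+  public:
+      Query(EventLoop& loop, const std::string& qname) :
+          loop_(loop), qname_(qname)
+      {}
+      void start() {
+          std::shared_ptr<Query> self = shared_from_this();
+          loop_.asyncCacheLookup(qname_, [self](bool hit, std::string answer) {
+              if (hit) {
+                  self->loop_.sendAnswer(answer);
+              } else {
+                  self->askUpstream();
+              }
+          });
+      }
+  private:
+      void askUpstream() {
+          std::shared_ptr<Query> self = shared_from_this();
+          loop_.asyncUpstreamQuery(qname_, [self](std::string answer) {
+              // the cache update would go here
+              self->loop_.sendAnswer(answer);
+          });
+      }
+      EventLoop& loop_;
+      std::string qname_;
+  };
+
+  int main() {
+      EventLoop loop;
+      std::make_shared<Query>(loop, "www.example.org")->start();
+      return 0;
+  }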
+
+[NOTE]
+Do not consider cross-breeding the models. That leads to space-time distortions
+and brain damage. Implementing one on top of the other is OK, but mixing them
+in the same bit of code is a way to the madhouse.
+
+Landlords and peasants
+~~~~~~~~~~~~~~~~~~~~~~
+
+In both the coroutines and event-based models, the cache and other shared
+things are easier to imagine as objects the working threads fight over to hold
+for a short while. In this model, it is easier to imagine each such shared
+object as something owned by a landlord that doesn't let anyone else touch it,
+but you can send requests to him.
+
+A query is an object once again, with some kind of state machine.
+
+Then there are two kinds of threads. The peasants just do the heavy
+work. There's a global work-queue for peasants. Once a peasant is idle, it
+comes to the queue and picks up a handful of queries from there. It does as
+much on each query as possible without requiring any shared resource.
+
+The other kind, the landlords, each have a resource to watch over. So we would
+have a cache (or several parts of a cache), the sockets for accepting queries
+and answering them, and possibly more. Each of these would have a separate landlord
+thread and a queue of tasks to do on the resource (look up something, send an
+answer...).
+
+Similarly, the landlord would take a handful of tasks from its queue and start
+handling them. It would possibly produce some more tasks for the peasants.
+
+The point here is, all the synchronisation is done on the queues, not on the
+shared resources themselves. Also, we would append to a queue only once the
+whole batch was completed. By tweaking the size of the batch, we could balance
+the lock contention, throughput and RTT. The append/remove would be a quick
+operation, and the cost of locks would be amortized over the larger number of
+queries handled per lock operation.
+
+The possible downside is, a query needs to travel across several threads
+during its lifetime. It might turn out that moving the query between cores is
+faster than accessing the cache from several threads, since the query is
+smaller, but it might be slower as well.
+
+It would be critical to have some kind of queue that is fast to append to and
+fast to take the first n items out of. Also, the tasks in the queues can be
+just abstract `boost::function<void (Worker&)>` functors, and each worker
+would just iterate through the queue, calling each functor. The parameter
+would allow easy generation of more tasks for other queues (they would be
+stored privately first, and appended to the remote queues at the end of the
+batch).
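+
+A minimal sketch of such a batched queue (a sketch only; `Worker`, the batch
+size and the use of std::function instead of boost::function are assumptions
+made for illustration):
+
+  #include <algorithm>
+  #include <deque>
+  #include <functional>
+  #include <map>
+  #include <mutex>
+  #include <vector>
+
+  class Worker;                                    // forward declaration
+  typedef std::function<void(Worker&)> Task;       // one unit of work
+
+  // All locking happens once per batch, not once per task.
+  class BatchQueue {
+  public:
+      void appendBatch(std::vector<Task>& batch) {
+          std::lock_guard<std::mutex> lock(mutex_);
+          queue_.insert(queue_.end(), batch.begin(), batch.end());
+          batch.clear();
+      }
+      // Take out up to max_batch tasks in one locked operation.
+      std::vector<Task> takeBatch(size_t max_batch) {
+          std::lock_guard<std::mutex> lock(mutex_);
+          const size_t count = std::min(max_batch, queue_.size());
+          std::vector<Task> batch(queue_.begin(), queue_.begin() + count);
+          queue_.erase(queue_.begin(), queue_.begin() + count);
+          return batch;
+      }
+  private:
+      std::mutex mutex_;
+      std::deque<Task> queue_;
+  };
+
+  // A worker (peasant or landlord) collects tasks for other queues privately
+  // and pushes them out only when the whole batch is done.
+  class Worker {
+  public:
+      explicit Worker(BatchQueue& own_queue) : own_queue_(own_queue) {}
+      void runOneBatch(size_t max_batch = 32) {
+          std::vector<Task> batch = own_queue_.takeBatch(max_batch);
+          for (size_t i = 0; i < batch.size(); ++i) {
+              batch[i](*this);                     // may call sendLater()
+          }
+          // Flush tasks generated for other queues, one lock per target queue.
+          for (std::map<BatchQueue*, std::vector<Task> >::iterator it =
+                   outgoing_.begin(); it != outgoing_.end(); ++it) {
+              it->first->appendBatch(it->second);
+          }
+          outgoing_.clear();
+      }
+      void sendLater(BatchQueue& target, const Task& task) {
+          outgoing_[&target].push_back(task);
+      }
+  private:
+      BatchQueue& own_queue_;
+      std::map<BatchQueue*, std::vector<Task> > outgoing_;
+  };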
+
+Also, if we wanted to generate multiple parallel upstream queries from a single
+query, we would need to be careful. A query object would not have a lock on
+itself and the upstream queries could end up in different caches/threads. To
+protect the original query, we would add another landlord that would aggregate
+answers together and let the query continue processing once it got enough
+answers. That way, the answers would all be pushed to the same thread and
+could not fight over the query.
+
+[NOTE]
+This model would work only with threads, not processes.
+
+Shared caches
+-------------
+
+While it seems it is good to have some sort of L1 cache with pre-rendered
+answers (according to measurements in the #2777 ticket), we probably need some
+kind of larger shared cache.
+
+If we had just a single shared cache protected by a lock, there'd be a lot of
+contention on that lock.
+
+Partitioning the cache
+~~~~~~~~~~~~~~~~~~~~~~
+
+We split the cache into parts, either into layers or into parallel segments
+selected by a hash. Taken to the extreme, a lock on each hash bucket would be
+this kind of partitioning, though that might waste resources (how expensive is
+it to create a lock?).
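+
+A minimal sketch of the hash-partitioned idea, with one mutex per partition
+(the types and the partition count are arbitrary placeholders):
+
+  #include <array>
+  #include <functional>
+  #include <mutex>
+  #include <string>
+  #include <unordered_map>
+
+  class PartitionedCache {
+  public:
+      static const size_t kPartitions = 64;        // tunable
+
+      bool lookup(const std::string& key, std::string& value) {
+          Partition& p = partitionFor(key);
+          std::lock_guard<std::mutex> lock(p.mutex); // contention only within
+          std::unordered_map<std::string, std::string>::const_iterator it =
+              p.data.find(key);                      // one partition
+          if (it == p.data.end()) {
+              return false;
+          }
+          value = it->second;
+          return true;
+      }
+
+      void store(const std::string& key, const std::string& value) {
+          Partition& p = partitionFor(key);
+          std::lock_guard<std::mutex> lock(p.mutex);
+          p.data[key] = value;
+      }
+
+  private:
+      struct Partition {
+          std::mutex mutex;
+          std::unordered_map<std::string, std::string> data;
+      };
+
+      Partition& partitionFor(const std::string& key) {
+          return partitions_[std::hash<std::string>()(key) % kPartitions];
+      }
+
+      std::array<Partition, kPartitions> partitions_;
+  };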
+
+Landlords
+~~~~~~~~~
+
+The landlords do the synchronization themselves. Still, the cache would need
+to be partitioned.
+
+RCU
+~~~
+
+The RCU is a lock-less synchronization mechanism. An item is accessed through a
+pointer.  An updater creates a copy of the structure (in our case, it would be
+content of single hash bucket) and then atomically replaces the pointer. The
+readers from before have the old version, the new ones get the new version.
+When all the old readers die out, the old copy is reclaimed. Also, the
+reclamation can AFAIK be postponed to a later, more idle time, or offloaded to
+a different thread.
+
+We could use it for the cache ‒ in the fast track, we would just read the
+cache. In the slow one, we would have to wait in a queue for the update to be
+done, in a single updater thread (because we don't really want to be updating
+the same cell twice at the same time).
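+
+A very rough sketch of the RCU idea using std::shared_ptr (an illustration
+only; the atomic_load/atomic_store free functions are the C++11 way to swap a
+shared_ptr atomically, and a real implementation would use a proper RCU
+library or hand-rolled reclamation):
+
+  #include <map>
+  #include <memory>
+  #include <mutex>
+  #include <string>
+
+  typedef std::map<std::string, std::string> Bucket; // content of one hash bucket
+
+  class RcuBucket {
+  public:
+      RcuBucket() : current_(std::make_shared<Bucket>()) {}
+
+      // Fast track: lock-less read of the current version; the returned
+      // shared_ptr keeps the old copy alive until the reader drops it.
+      std::shared_ptr<const Bucket> read() const {
+          return std::atomic_load(&current_);
+      }
+
+      // Slow track: copy, modify the copy, publish it atomically. Updates
+      // are serialized (the "single updater" from the text).
+      void update(const std::string& key, const std::string& value) {
+          std::lock_guard<std::mutex> lock(update_mutex_);
+          std::shared_ptr<const Bucket> old = std::atomic_load(&current_);
+          std::shared_ptr<Bucket> copy = std::make_shared<Bucket>(*old);
+          (*copy)[key] = value;
+          std::atomic_store(&current_, std::shared_ptr<const Bucket>(copy));
+      }
+
+  private:
+      std::shared_ptr<const Bucket> current_;
+      std::mutex update_mutex_;
+  };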
+
+Proposals
+---------
+
+In either case, we would have some kind of L1 cache with pre-rendered answers.
+For these proposals (except the third), we wouldn't care if we split the cache
+into parallel chunks or layers.
+
+Hybrid RCU/Landlord
+~~~~~~~~~~~~~~~~~~~
+
+The landlord approach, except that read-only accesses to the cache are done
+directly by the peasants. Only if they don't find what they want do they
+append a task to the landlord's queue. The landlord would be doing the RCU
+updates. It could happen that by the time the landlord gets to the task the
+answer is already there, but that would not matter much.
+
+Network access would be done by the landlords.
+
+Coroutines+RCU
+~~~~~~~~~~~~~~
+
+We would do the coroutines, and reads from the shared cache would go without
+locking. When writing, we would have to lock.
+
+To avoid locking, each worker thread would have its own set of upstream
+sockets and we would dup() the client-facing sockets for each worker so we
+don't have to lock those either.
 
 
-a. Multiple processes with independent caches
-b. Multiple processes with shared cache
-c. A mix of independent/shared cache
-d. Thread variations of the above
+Multiple processes with coroutines and RCU
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 
-All of these may be complicated by NUMA architectures (with
-faster/slower access to specific RAM).
+This would need the layered cache. The upper caches would be mapped into local
+memory for read-only access. Each cache would be a separate process. That
+process would do the updates ‒ if the answer was not there, the process would
+be asked by some kind of IPC to pull it from the upstream cache or from the
+network.

+ 256 - 0
doc/design/resolver/03-cache-algorithm.txt

@@ -0,0 +1,256 @@
+03-cache-algorithm
+
+Introduction
+------------
+Cache performance may be important for the resolver. It might not be
+critical. We need to research this.
+
+One key question is: given a specific cache hit rate, how much of an
+impact does cache performance have?
+
+For example, if we have 90% cache hit rate, will we still be spending
+most of our time in system calls or in looking things up in our cache?
+
+There are several ways we can consider figuring this out, including
+measuring this in existing resolvers (BIND 9, Unbound) or modeling
+with specific values.
+
+Once we know how critical the cache performance is, we can consider
+which algorithm is best for that. If it is very critical, then a
+custom algorithm designed for DNS caching makes sense. If it is not,
+then we can consider using an STL-based data structure.
+
+Effectiveness of Cache
+----------------------
+
+First, I'll try to answer the introductory questions.
+
+In a simplified model, we can express the fraction of the total running time
+spent answering queries directly from the cache (the rest being spent on
+recursive resolution due to cache misses) as follows:
+
+A = r*Q2 / (r*Q2 + (1-r)*Q1)
+where
+A: fraction of run time spent answering queries directly from the cache
+   (0 <= A <= 1)
+r: cache hit rate (0<=r<=1)
+Q1: max qps of the server with 100% cache hit
+Q2: max qps of the server with 0% cache hit
+
+Q1 can be measured easily for a given data set; measuring Q2 is tricky
+in general (it requires many external queries with unreliable
+results), but we can still have some not-so-unrealistic numbers
+through controlled simulation.
+
+As a data point for these values, see previous experimental results
+of mine:
+https://lists.isc.org/pipermail/bind10-dev/2012-July/003628.html
+
+Looking at the "ideal" server implementation (no protocol overhead)
+with the setups of 90% and 85% cache hit rates, 1 recursion on cache
+miss, and the possible maximum total throughput, we can deduce
+Q1 and Q2, which are 170591qps and 60138qps respectively.
+
+This means, with a 90% cache hit rate (r = 0.9), the server would spend
+76% of its run time receiving queries and answering them directly from
+the cache: 0.9*60138/(0.9*60138 + 0.1*170591) = 0.76.
+
+I also ran more realistic experiments: using BIND 9.9.2 and unbound
+1.4.19 in the "forward only" mode with crafted query data and a
+forwarded-to server to emulate the situations of 100% and 0% cache hit
+rates.  I then measured the max response throughput using a
+queryperf-like tool.  In both cases Q2 is about 28% of Q1 (I'm not
+showing specific numbers to avoid unnecessary discussion about
+specific performance of existing servers; it's out of scope of this
+memo).  Using Q2 = 0.28*Q1, the above equation with a 90% cache hit rate
+gives: A = 0.9*0.28 / (0.9*0.28 + 0.1) = 0.716. So the server will
+spend about 72% of its running time to answer queries directly from
+the cache.
+
+Of course, these experimental results are too simplified.  First, in
+these experiments we assumed only one external query is needed on a
+cache miss.  In general it can be more; however, it may not actually
+be too optimistic either: in another research result of mine:
+http://bind10.isc.org/wiki/ResolverPerformanceResearch
+a more detailed analysis using a real query sample and tracing what an
+actual resolver would do suggested we'd need about 1.44 to 1.63
+external queries per cache miss on average.
+
+Still, of course, the real world cases are not that simple: in reality
+we'd need to deal with timeouts, slower remote servers, unexpected
+intermediate results, etc.  DNSSEC validating resolvers will clearly
+need to do more work.
+
+So, in real world deployments Q2 should be much smaller than Q1.
+Here are some specific cases of the relationship between Q1 and Q2 for
+a given A (assuming r = 0.9):
+
+70%: Q2 = 0.26 * Q1
+60%: Q2 = 0.17 * Q1
+50%: Q2 = 0.11 * Q1
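+
+(These ratios follow from solving the above equation for Q2/Q1:
+Q2/Q1 = A*(1-r) / (r*(1-A)); e.g. for A = 0.7 and r = 0.9 this is
+0.7*0.1 / (0.9*0.3) = 0.26.)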
+
+So, even if "recursive resolution is 10 times heavier" than the cache
+only case, we can assume the server spends half of its run time
+answering queries directly from the cache at a cache hit rate of
+90%.  I think this is a reasonably safe assumption.
+
+Now, assuming a number of 50% or more, does this suggest we should
+highly optimize the cache?  Opinions may vary on this point, but I
+personally think the answer is yes.  I've written an experimental
+cache only implementation that employs the idea of fully-rendered
+cached data.  On one test machine (2.20GHz AMD64, using a single
+core), a queryperf-like benchmark shows it can handle over 180Kqps,
+while BIND 9.9.2 handles just 41Kqps.  The experimental
+implementation skips some features necessary for a production server,
+and cache management itself is always an inevitable bottleneck, so the
+production version wouldn't be that fast, but it still suggests it may
+not be very difficult to reach over 100Kqps in a production environment
+including recursive resolution overhead.
+
+Cache Types
+-----------
+
+1. Record cache
+
+Conceptually, any recursive resolver (with cache) implementation would
+have a cache for RRs (or RRsets in the modern version of the protocol)
+given in responses to its external queries.  In BIND 9, it's called the
+"cached DB", using an in-memory rbt-like tree.  unbound calls it
+"rrset cache", which is implemented as a hash table.
+
+2. Delegation cache
+
+Recursive server implementations would also have a cache to determine
+the deepest zone cut for a given query name in the recursion process.
+Neither BIND 9 nor unbound has a separate cache for this purpose;
+basically they try to find an NS RRset from the "record cache" whose
+owner name best matches the given query name.
+
+3. Remote server cache
+
+In addition, a recursive server implementation may maintain a cache
+for information about remote authoritative servers.  Both BIND 9 and
+unbound conceptually have this type of cache, although there are some
+non-negligible differences in details.  BIND 9's implementation of
+this cache is called ADB.  It's a hash table whose key is a domain name,
+and each entry stores corresponding IPv6/v4 addresses; another data
+structure for each address stores averaged RTT for the address,
+lameness information, EDNS availability, etc.  unbound's
+implementation is called "infrastructure cache".  It's a hash table
+keyed with IP addresses whose entries store similar information as
+that in BIND 9's per address ADB entry.  In unbound a remote server's
+address must be determined by looking up the record cache (rrset cache
+in unbound terminology); unlike BIND 9's ADB, there's no direct
+shortcut from a server's domain name to IP addresses.
+
+4. Full response cache
+
+unbound has an additional cache layer, called the "message cache".
+It's a hash table whose hash key is the query parameters (essentially qname
+and type) and whose entry is a sequence of record (rrset) cache entries.
+This sequence constructs a complete response to the corresponding
+query, so it helps optimize building a response message by skipping
+record cache lookups for each section (answer/authority/additional) of the
+response message.  PowerDNS recursor has (seemingly) the same concept
+called "packet cache" (but I don't know its implementation details
+very much).
+
+BIND 9 doesn't have this type of cache; it always looks into the
+record cache to build a complete response to a given query.
+
+Miscellaneous General Requirements
+----------------------------------
+
+- Minimize contention between threads (if threaded)
+- Cache purge policy: normally only a very small part of cached DNS
+  information will be reused, and the information that is reused is reused
+  very heavily.  So an LRU-like algorithm should generally work well, but
+  we'll also need to honor DNS TTLs (a sketch of one such combination
+  follows below).
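+
+A minimal sketch of an LRU cache that also honors TTLs (the names and the
+std::list/std::unordered_map layout are just one obvious way to do it, not a
+proposal for the final data structure):
+
+  #include <chrono>
+  #include <list>
+  #include <string>
+  #include <unordered_map>
+
+  class LruTtlCache {
+  public:
+      typedef std::chrono::steady_clock Clock;
+
+      explicit LruTtlCache(size_t capacity) : capacity_(capacity) {}
+
+      void store(const std::string& key, const std::string& value,
+                 std::chrono::seconds ttl) {
+          remove(key);                              // replace any old entry
+          order_.push_front(key);                   // newest is most recent
+          Entry e = { value, Clock::now() + ttl, order_.begin() };
+          entries_[key] = e;
+          if (entries_.size() > capacity_) {        // evict least recently used
+              const std::string victim = order_.back();
+              remove(victim);
+          }
+      }
+
+      bool lookup(const std::string& key, std::string& value) {
+          std::unordered_map<std::string, Entry>::iterator it = entries_.find(key);
+          if (it == entries_.end()) {
+              return false;
+          }
+          if (Clock::now() >= it->second.expiry) {  // expired by DNS TTL
+              remove(key);
+              return false;
+          }
+          // Move to the front of the LRU order.
+          order_.splice(order_.begin(), order_, it->second.position);
+          value = it->second.value;
+          return true;
+      }
+
+  private:
+      struct Entry {
+          std::string value;
+          Clock::time_point expiry;
+          std::list<std::string>::iterator position;
+      };
+
+      void remove(const std::string& key) {
+          std::unordered_map<std::string, Entry>::iterator it = entries_.find(key);
+          if (it != entries_.end()) {
+              order_.erase(it->second.position);
+              entries_.erase(it);
+          }
+      }
+
+      size_t capacity_;
+      std::list<std::string> order_;                // most recently used first
+      std::unordered_map<std::string, Entry> entries_;
+  };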
+
+Random Ideas for BIND 10
+------------------------
+
+Below are specific random ideas for BIND 10.  Some are based on
+experimental results with reasonably realistic data; some others are
+mostly a guess.
+
+1. Fully rendered response cache
+
+Some real world query samples show that a very small portion of all
+queries are very popular and are asked very often; the rest are rarely
+reused, if at all.  Two different data sets show the top 10,000 queries
+would cover around 80% of total queries, regardless of the total query
+volume.  This suggests the idea of having a small, highly optimized full
+response cache.
+
+I tried this idea in the jinmei-l1cache branch.  It's a hash table
+keyed with a tuple of query name and type whose entry stores a fully
+rendered, wire-format response image (answer section only, assuming
+the "minimal-responses" option).  It also maintains offsets to each
+RR, so it can easily update TTLs when necessary or rotate RRs if
+optionally requested.  If neither TTL adjustment nor RR rotation is
+required, query handling is just a hash table lookup and a copy of the
+pre-rendered data.  An experimental benchmark showed it ran very fast:
+more than 4 times faster than BIND 9, and even much faster than other
+implementations that have a full response cache (although, as usual, the
+comparison is not entirely fair).
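+
+A rough sketch of what such a cache entry and table could look like (the
+field names are made up here and do not necessarily match the
+jinmei-l1cache branch):
+
+  #include <stdint.h>
+  #include <string>
+  #include <unordered_map>
+  #include <utility>
+  #include <vector>
+
+  // One pre-rendered answer, ready to be copied into a response.
+  struct PrerenderedResponse {
+      std::vector<uint8_t> wire;          // rendered answer section, wire format
+      std::vector<uint16_t> rr_offsets;   // offset of each RR, for TTL updates
+                                          // and optional RR rotation
+      std::vector<uint32_t> original_ttls;
+      // plus the render time, to compute how much to decrement the TTLs
+  };
+
+  // Key: query name (canonical case) + query type.
+  typedef std::pair<std::string, uint16_t> QueryKey;
+
+  struct QueryKeyHash {
+      size_t operator()(const QueryKey& k) const {
+          return std::hash<std::string>()(k.first) ^ k.second;
+      }
+  };
+
+  typedef std::unordered_map<QueryKey, PrerenderedResponse, QueryKeyHash>
+      ResponseCache;
+
+  // Answering from this cache is then essentially: look up (qname, qtype);
+  // if found, copy `wire`, patch the TTLs using rr_offsets, prepend the
+  // query-specific header and send it.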
+
+Also, the cache size is quite small; the run time memory footprint of
+this server process was just about 5MB.  So, I think it's reasonable
+to have each process/thread have its own copy of this cache to
+completely eliminate contention.  Also, if we can keep the cache size
+this small, it would be easier to dump it to a file on shutdown and
+reuse it on restart.  This will be quite effective (if the downtime is
+reasonably short) because the cached data are expected to be highly
+popular.
+
+2. Record cache
+
+For the normal record cache, I don't have a particular idea beyond
+something obvious, like a hash table to map from query parameters to
+corresponding RRset (or negative information).  But I guess this cache
+should be shared by multiple threads.  That will help reconstruct the
+full response cache data on TTL expiration more efficiently.  And, if
+shared, the data structure should be chosen so that contention
+overhead can be minimized.  In general, I guess something like a hash
+table is more suitable than a tree-like structure in that sense.
+
+There are other points to discuss for this cache related to other types
+of cache (see below).
+
+3. Separate delegation cache
+
+One thing I'm guessing is that it may make sense if we have a separate
+cache structure for delegation data.  It's conceptually a set of NS
+RRs so we can identify the best (longest) matching one for a given
+query name.
+
+Analysis of some sets of query data showed that the vast majority of
+end clients' queries are for A and AAAA (not surprisingly).  So, even
+if we separate this cache from the record cache, the additional
+overhead (both for memory and fetch) will probably (hopefully) be
+marginal.  Separating caches will also help reduce contention between
+threads.  It *might* also help improve lookup performance because this
+can be optimized for longest match search.
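+
+One possible shape of such a structure, kept deliberately simple (an
+illustration only; this is not how BIND 9 or unbound do it, and a real
+version would use wire-format names and handle the root properly):
+
+  #include <map>
+  #include <string>
+  #include <vector>
+
+  // The NS names for one zone cut.
+  typedef std::vector<std::string> NameServers;
+
+  class DelegationCache {
+  public:
+      void addZoneCut(const std::string& zone, const NameServers& ns) {
+          cuts_[zone] = ns;
+      }
+
+      // Find the deepest (longest-matching) known zone cut for qname by
+      // stripping leading labels until a stored cut matches.
+      const NameServers* findDeepestCut(std::string qname) const {
+          for (;;) {
+              std::map<std::string, NameServers>::const_iterator it =
+                  cuts_.find(qname);
+              if (it != cuts_.end()) {
+                  return &it->second;
+              }
+              const std::string::size_type dot = qname.find('.');
+              if (dot == std::string::npos) {
+                  return 0;                     // nothing cached; use root hints
+              }
+              qname = qname.substr(dot + 1);    // strip the leftmost label
+          }
+      }
+
+  private:
+      std::map<std::string, NameServers> cuts_;
+  };
+
+For example, with a cut stored for "example.org", a lookup for
+"www.example.org" first tries "www.example.org" and then finds
+"example.org".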
+
+4. Remote server cache without involving the record cache
+
+Likewise, it may make sense to maintain the remote server cache
+separately from the record cache.  I guess these AAAA and A records
+are rarely queried directly by end clients, so, like the case of the
+delegation cache, it's possible that the data sets are mostly disjoint.
+Also, for this purpose the RRsets don't have to have a higher trust rank
+(per RFC2181 5.4.1): glue or additional data are okay, and, by separating
+these from the record cache, we can avoid accidental promotion of these
+data to trustworthy answers and returning them to clients (BIND 9 had
+this type of bug before).
+
+Custom vs Existing Library (STL etc)
+------------------------------------
+
+It may have to be discussed, but I guess in many cases we end up
+introducing custom implementations because these caches should be
+highly performance sensitive, directly related to our core business, and
+also have to be memory efficient.  But in some subcomponents we may
+be able to benefit from existing generic libraries.