12 years ago · a713b09fa8
--- a/doc/design/resolver/03-cache-algorithm.txt
+++ b/doc/design/resolver/03-cache-algorithm.txt
@@ -71,7 +71,7 @@ the cache.
 
																 Of course, these experimental results are too simplified.  First, in
															
 
																 these experiments we assumed only one external query is needed on
															
 
																 cache miss.  In general it can be more; however, it may not actually
															
 
																-too optimistic either: in my another research result:
															
 
																+be too optimistic either: in another research result of mine:
															
 
																 http://bind10.isc.org/wiki/ResolverPerformanceResearch
															
 
																 In the more detailed analysis using real query sample and tracing what
															
 
																 an actual resolver would do, it looked we'd need about 1.44 to 1.63
															
@@ -107,3 +107,150 @@ and cache management itself is always inevitable bottleneck, so the
 
																 production version wouldn't be that fast, but it still suggests it may
															
 
																 not be very difficult to reach over 100Kqps in production environment
															
 
																 including recursive resolution overhead.
															
 
																+
															
 
																+Cache Types
															
 
																+-----------
															
 
																+
															
 
																+1. Record cache
															
 
																+
															
 
																+Conceptually, any recursive resolver (with cache) implementation would
															
 
																+have a cache for RRs (or RRsets in the modern version of the protocol) given
															
 
																+in responses to its external queries.  In BIND 9, it's called the
															
 
																+"cached DB", using an in-memory rbt-like tree.  unbound calls it
															
 
																+"rrset cache", which is implemented as a hash table.
															
 
																+
															
 
																+2. Delegation cache
															
 
																+
															
 
																+Recursive server implementations would also have a cache to determine
															
 
																+the deepest zone cut for a given query name in the recursion process.
															
 
																+Neither BIND 9 nor unbound has a separate cache for this purpose;
															
 
																+basically they try to find an NS RRset from the "record cache" whose
															
 
																+owner name best matches the given query name.
															
 
																+
															
 
																+3. Remote server cache
															
 
																+
															
 
																+In addition, a recursive server implementation may maintain a cache
															
 
																+for information of remote authoritative servers.  Both BIND 9 and
															
 
																+unbound conceptually have this type of cache, although there are some
															
 
																+non-negligible differences in details.  BIND 9's implementation of
															
 
																+this cache is called ADB.  It's a hash table whose key is a domain name,
															
 
																+and each entry stores corresponding IPv6/v4 addresses; another data
															
 
																+structure for each address stores averaged RTT for the address,
															
 
																+lameness information, EDNS availability, etc.  unbound's
															
 
																+implementation is called "infrastructure cache".  It's a hash table
															
 
																+keyed with IP addresses whose entries store information similar to
															
 
																+that in BIND 9's per-address ADB entry.  In unbound a remote server's
															
 
																+address must be determined by looking up the record cache (rrset cache
															
 
																+in unbound terminology); unlike BIND 9's ADB, there's no direct
															
 
																+shortcut from a server's domain name to IP addresses.
															
 
																+
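The difference between the two lookup paths can be sketched as follows.
This is only an illustrative model, not the actual BIND 9 or unbound
code: the class names, the field names and the smoothing factor are all
assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AddressInfo:
    """Per-address state; the fields are hypothetical, modeled on the text."""
    srtt_ms: Optional[float] = None   # averaged (smoothed) RTT
    lame: bool = False
    edns_ok: bool = True

    def update_rtt(self, sample_ms, alpha=0.25):
        # Exponential averaging; alpha is an arbitrary illustrative value.
        if self.srtt_ms is None:
            self.srtt_ms = sample_ms
        else:
            self.srtt_ms = (1 - alpha) * self.srtt_ms + alpha * sample_ms


class NameKeyedServerCache:
    """ADB-like model: server name -> addresses, plus per-address info."""

    def __init__(self):
        self.addresses = {}   # server name -> list of address strings
        self.info = {}        # address -> AddressInfo

    def add(self, name, address):
        self.addresses.setdefault(name.lower(), []).append(address)
        self.info.setdefault(address, AddressInfo())

    def lookup(self, name):
        # Direct shortcut from a server name to its addresses and their state.
        return [(a, self.info[a]) for a in self.addresses.get(name.lower(), [])]


class AddressKeyedServerCache:
    """Infrastructure-cache-like model: keyed by address only."""

    def __init__(self, record_cache):
        self.record_cache = record_cache   # (name, type) -> list of addresses
        self.info = {}                     # address -> AddressInfo

    def lookup(self, name):
        # Addresses must first be found through the record cache.
        addrs = []
        for rrtype in ("AAAA", "A"):
            addrs += self.record_cache.get((name.lower(), rrtype), [])
        return [(a, self.info.setdefault(a, AddressInfo())) for a in addrs]
```

In the second model every server selection costs at least one extra
record cache lookup per address family, which is the missing "shortcut"
mentioned above.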
															
 
																+4. Full response cache
															
 
																+
															
 
																+unbound has an additional cache layer, called the "message cache".
															
 
																+It's a hash table whose hash key is the query parameters (essentially qname
															
 
																+and type) and whose entry is a sequence of references to record (rrset) cache entries.
															
 
																+This sequence constructs a complete response to the corresponding
															
 
																+query, so it would help optimize building a response message by skipping
															
 
																+the record cache for each section (answer/authority/additional) of the
															
 
																+response message.  PowerDNS recursor has (seemingly) the same concept
															
 
																+called "packet cache" (but I don't know much about its implementation
															
 
																+details).
															
 
																+
															
 
																+BIND 9 doesn't have this type of cache; it always looks into the
															
 
																+record cache to build a complete response to a given query.
															
 
																+
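The relationship between a message cache and the record cache described
above might be modeled roughly like this.  It is a simplified sketch of
the concept only, not unbound's implementation; the class name, the
method names and the plain-dict record cache are assumptions.

```python
class MessageCache:
    """Maps (qname, qtype) to per-section lists of record cache keys."""

    SECTIONS = ("answer", "authority", "additional")

    def __init__(self, rrset_cache):
        self.rrset_cache = rrset_cache   # (name, type) -> rrset data
        self.messages = {}               # (qname, qtype) -> {section: [keys]}

    def store(self, qname, qtype, answer=(), authority=(), additional=()):
        sections = (answer, authority, additional)
        self.messages[(qname.lower(), qtype)] = {
            name: list(keys) for name, keys in zip(self.SECTIONS, sections)
        }

    def build_response(self, qname, qtype):
        entry = self.messages.get((qname.lower(), qtype))
        if entry is None:
            return None                  # miss: fall back to the record cache
        response = {}
        for section, keys in entry.items():
            rrsets = []
            for key in keys:
                if key not in self.rrset_cache:
                    return None          # a referenced rrset is gone
                rrsets.append((key, self.rrset_cache[key]))
            response[section] = rrsets
        return response
```

On a hit, a single message cache lookup yields all three sections; the
per-section search of the record cache is needed only on a miss.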
															
 
																+Miscellaneous General Requirements
															
 
																+----------------------------------
															
 
																+
															
 
																+- Minimize contention between threads (if threaded)
															
 
																+- Cache purge policy: normally only a very small part of cached DNS
															
 
																+  information will be reused, and those reused are very heavily
															
 
																+  reused.  So an LRU-like algorithm should generally work well, but we'll
															
 
																+  also need to honor DNS TTL.
															
 
																+
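A purge policy that combines LRU eviction with TTL expiration could look
like the following minimal sketch.  The class name, the injectable clock
and the single-threaded design are assumptions; thread contention is not
addressed here.

```python
import time
from collections import OrderedDict


class LruTtlCache:
    """Bounded cache: evicts the least recently used entry, honors TTLs."""

    def __init__(self, max_entries, clock=time.monotonic):
        self.max_entries = max_entries
        self.clock = clock
        self.entries = OrderedDict()   # key -> (expires_at, value)

    def put(self, key, value, ttl):
        self.entries[key] = (self.clock() + ttl, value)
        self.entries.move_to_end(key)          # mark as most recently used
        while len(self.entries) > self.max_entries:
            self.entries.popitem(last=False)   # drop least recently used

    def get(self, key):
        item = self.entries.get(key)
        if item is None:
            return None
        expires_at, value = item
        if self.clock() >= expires_at:
            del self.entries[key]              # TTL expired: never serve it
            return None
        self.entries.move_to_end(key)          # refresh recency on a hit
        return value
```

Expired entries are removed lazily on lookup in this sketch; a real
implementation might also sweep them in the background.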
															
 
																+Random Ideas for BIND 10
															
 
																+------------------------
															
 
																+
															
 
																+Below are specific random ideas for BIND 10.  Some are based on
															
 
																+experimental results with reasonably realistic data; some others are
															
 
																+mostly a guess.
															
 
																+
															
 
																+1. Fully rendered response cache
															
 
																+
															
 
																+Some real-world query samples show that a very small portion of all
															
 
																+queries is very popular and is queried over and over; the
															
 
																+rest is rarely reused, if at all.  Two different data sets show the top
															
 
																+10,000 queries would cover around 80% of total queries, regardless
															
 
																+of the size of the total queries.  This suggests an idea of having a
															
 
																+small, highly optimized full response cache.
															
 
																+
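The kind of measurement behind this observation can be reproduced on any
query log with a few lines.  The function name and the representation of
a query as a (name, type) tuple are assumptions for illustration; no
real data set is implied.

```python
from collections import Counter


def top_n_coverage(queries, n):
    """Fraction of all queries covered by the n most popular ones.

    'queries' is a list of hashable query keys, e.g. (qname, qtype).
    """
    if not queries:
        return 0.0
    counts = Counter(queries)
    covered = sum(count for _, count in counts.most_common(n))
    return covered / len(queries)
```

Running this with n=10000 over samples of different sizes would show
whether the coverage stays near the same level as the sample grows.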
															
 
																+I tried this idea in the jinmei-l1cache branch.  It's a hash table
															
 
																+keyed with a tuple of query name and type whose entry stores fully
															
 
																+rendered, wire-format response image (answer section only, assuming
															
 
																+the "minimal-responses" option).  It also maintains offsets to each
															
 
																+RR, so it can easily update TTLs when necessary or rotate RRs if
															
 
																+optionally requested.  If neither TTL adjustment nor RR rotation is
															
 
																+required, query handling is just to look up the hash table and copy the
															
 
																+pre-rendered data.  An experimental benchmark showed it ran very fast;
															
 
																+more than 4 times faster than BIND 9, and even much faster than other
															
 
																+implementations that have full response cache (although, as usual, the
															
 
																+comparison is not entirely fair).
															
 
																+
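The idea of keeping a pre-rendered image together with per-RR offsets
can be illustrated as below.  This is not the code from the
jinmei-l1cache branch; it is a minimal sketch that assumes A records
only, no name compression and no RR rotation, and all names in it are
hypothetical.

```python
import ipaddress
import struct


def encode_name(name):
    """Encode a domain name in uncompressed DNS wire format."""
    out = bytearray()
    for label in name.rstrip(".").split("."):
        if label:
            raw = label.encode("ascii")
            out.append(len(raw))
            out += raw
    out.append(0)                       # root label terminates the name
    return bytes(out)


class RenderedAnswer:
    """Pre-rendered answer section plus the offset of each RR's TTL."""

    def __init__(self, name, addresses, ttl, now):
        self.image = bytearray()
        self.ttl_offsets = []
        self.expires_at = now + ttl
        owner = encode_name(name)
        for addr in addresses:
            self.image += owner
            self.image += struct.pack("!HH", 1, 1)     # TYPE=A, CLASS=IN
            self.ttl_offsets.append(len(self.image))   # remember TTL position
            self.image += struct.pack("!IH", ttl, 4)   # TTL, RDLENGTH
            self.image += ipaddress.IPv4Address(addr).packed

    def render(self, now):
        # Patch only the TTL fields in place; everything else is reused.
        remaining = max(0, int(self.expires_at - now))
        for offset in self.ttl_offsets:
            struct.pack_into("!I", self.image, offset, remaining)
        return bytes(self.image)
```

If TTL adjustment is not required, render() reduces to copying the
stored image, which is the fast path described above.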
															
 
																+Also, the cache size is quite small; the run time memory footprint of
															
 
																+this server process was just about 5MB.  So, I think it's reasonable
															
 
																+to have each process/thread have its own copy of this cache to
															
 
																+completely eliminate contention.  Also, if we can keep the cache size
															
 
																+this small, it would be easier to dump it to a file on shutdown and
															
 
																+reuse it on restart.  This will be quite effective (if the downtime is
															
 
																+reasonably short) because the cached data are expected to be highly
															
 
																+popular.
															
 
																+
															
 
																+2. Record cache
															
 
																+
															
 
																+For the normal record cache, I don't have a particular idea beyond
															
 
																+something obvious, like a hash table to map from query parameters to
															
 
																+corresponding RRset (or negative information).  But I guess this cache
															
 
																+should be shared by multiple threads.  That will help reconstruct the
															
 
																+full response cache data on TTL expiration more efficiently.  And, if
															
 
																+shared, the data structure should be chosen so that contention
															
 
																+overhead can be minimized.  In general, I guess something like hash
															
 
																+tables is more suitable than a tree-like structure in that sense.
															
 
																+
															
 
																+There are other points to discuss for this cache related to other types
															
 
																+of cache (see below).
															
 
																+
															
 
																+3. Separate delegation cache
															
 
																+
															
 
																+One thing I'm guessing is that it may make sense if we have a separate
															
 
																+cache structure for delegation data.  It's conceptually a set of NS
															
 
																+RRs so we can identify the best (longest) matching one for a given
															
 
																+query name.
															
 
																+
															
 
																+Analysis of some sets of query data showed the vast majority of
															
 
																+end clients' queries are for A and AAAA (not surprisingly).  So, even
															
 
																+if we separate this cache from the record cache, the additional
															
 
																+overhead (both for memory and fetch) will probably (hopefully) be
															
 
																+marginal.  Separating caches will also help reduce contention between
															
 
																+threads.  It *might* also help improve lookup performance because this
															
 
																+can be optimized for longest match search.
															
 
																+
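A longest-match lookup over a dedicated delegation cache could be
sketched like this.  It assumes a plain hash table keyed by zone name
and probes it once per label; the class and method names are
hypothetical, and a real implementation might use a structure better
suited to this search.

```python
def normalize(name):
    """Lowercase a name and strip the trailing dot; the root becomes ''."""
    return name.rstrip(".").lower()


class DelegationCache:
    """Zone name -> NS names, searched for the deepest enclosing zone cut."""

    def __init__(self):
        self.zones = {}

    def add(self, zone, nameservers):
        self.zones[normalize(zone)] = list(nameservers)

    def find_deepest(self, qname):
        name = normalize(qname)
        labels = name.split(".") if name else []
        # Try the full name first, then strip one leading label at a time,
        # ending with the root zone (empty string).
        for i in range(len(labels) + 1):
            candidate = ".".join(labels[i:])
            if candidate in self.zones:
                return candidate or ".", self.zones[candidate]
        return None
```

Because this table holds only NS data, the walk never touches the A and
AAAA entries that dominate the record cache.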
															
 
																+4. Remote server cache without involving the record cache
															
 
																+
															
 
																+Likewise, it may make sense to maintain the remote server cache
															
 
																+separately from the record cache.  I guess these AAAA and A records
															
 
																+are rarely queried by end clients, so, like the case of the delegation
															
 
																+cache, it's possible that the data sets are mostly disjoint.  Also, for
															
 
																+this purpose the RRsets don't have to have higher trust rank (per
															
 
																+RFC2181 5.4.1): glue or additional are okay, and, by separating these
															
 
																+from the record cache, we can avoid accidental promotion of these data
															
 
																+to trustworthy answers and returning them to clients (BIND 9 had this
															
 
																+type of bug before).
															
 
																+
															
 
																+Custom vs Existing Library (STL etc)
															
 
																+------------------------------------
															
 
																+
															
 
																+It may have to be discussed, but I guess in many cases we end up
															
 
																+introducing custom implementations because these caches should be
															
 
																+highly performance sensitive, directly related to our core business, and
															
 
																+also have to be memory efficient.  But in some subcomponents we may
															
 
																+be able to benefit from existing generic libraries.