03-cache-algorithm

Introduction
------------
Cache performance may be important for the resolver, but it might not
be critical; we need to research this.  One key question is: given a
specific cache hit rate, how much of an impact does cache performance
have?  For example, with a 90% cache hit rate, will we still be
spending most of our time in system calls, or in looking things up in
our cache?

There are several ways we can consider figuring this out, including
measuring this in existing resolvers (BIND 9, Unbound) or modeling
with specific values.

Once we know how critical cache performance is, we can consider which
algorithm is best for it.  If it is very critical, then a custom
algorithm designed for DNS caching makes sense.  If it is not, then we
can consider using an STL-based data structure.
Effectiveness of Cache
----------------------
First, I'll try to answer the introductory questions.

In a simplified model, we can express the fraction of total running
time spent answering queries directly from the cache (the rest being
spent on recursive resolution due to cache misses) as follows:

  A = r*Q2 / (r*Q2 + (1-r)*Q1)

where

  A:  fraction of time spent answering queries from the cache per unit
      time (such as sec; 0 <= A <= 1)
  r:  cache hit rate (0 <= r <= 1)
  Q1: max qps of the server with 100% cache hit
  Q2: max qps of the server with 0% cache hit

Q1 can be measured easily for a given data set; measuring Q2 is tricky
in general (it requires many external queries with unreliable
results), but we can still get some not-so-unrealistic numbers through
controlled simulation.
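As a quick sanity check of the model (not part of any measurement),
here is a trivial C++ helper that evaluates A; the function name is
mine:

    #include <iostream>

    // Evaluates the model above: A = r*Q2 / (r*Q2 + (1-r)*Q1).
    double cacheTimeFraction(double r, double q1, double q2) {
        return (r * q2) / (r * q2 + (1 - r) * q1);
    }

    int main() {
        // Q1/Q2 from the experiment cited just below; prints ~0.76.
        std::cout << cacheTimeFraction(0.9, 170591, 60138) << "\n";
        return 0;
    }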
As a data point for these values, see previous experimental results of
mine:
https://lists.isc.org/pipermail/bind10-dev/2012-July/003628.html

Looking at the "ideal" server implementation (no protocol overhead),
set up with 90% and 85% cache hit rates and 1 recursion per cache
miss, and with the possible maximum total throughput, we can deduce Q1
and Q2: 170591 qps and 60138 qps respectively.  This means that with a
90% cache hit rate (r = 0.9) the server would spend 76% of its run
time receiving queries and answering them directly from the cache:
0.9*60138 / (0.9*60138 + 0.1*170591) = 0.76.
I also ran more realistic experiments: using BIND 9.9.2 and unbound
1.4.19 in "forward only" mode with crafted query data and a forwarded
server to emulate the situations of 100% and 0% cache hit rates, I
measured the max response throughput using a queryperf-like tool.  In
both cases Q2 is about 28% of Q1 (I'm not showing specific numbers to
avoid unnecessary discussion about the specific performance of
existing servers; that's out of scope for this memo).  Using
Q2 = 0.28*Q1, the above equation with a 90% cache hit rate gives:
A = 0.9*0.28 / (0.9*0.28 + 0.1) = 0.716.  So the server will spend
about 72% of its running time answering queries directly from the
cache.
Of course, these experimental results are oversimplified.  First, in
these experiments we assumed only one external query is needed on a
cache miss.  In general it can be more, though this may not actually
be too optimistic either: in another research result of mine:
http://bind10.isc.org/wiki/ResolverPerformanceResearch
a more detailed analysis using a real query sample, tracing what an
actual resolver would do, suggested we'd need about 1.44 to 1.63
external queries per cache miss on average.

Still, of course, real world cases are not that simple: in reality
we'd need to deal with timeouts, slower remote servers, unexpected
intermediate results, etc.  DNSSEC validating resolvers will clearly
need to do more work.
So in real world deployments Q2 should be much smaller than Q1.  Here
are some specific cases of the relationship between Q1 and Q2 for a
given A (assuming r = 0.9; these follow from solving the above
equation for Q2, i.e. Q2 = A*(1-r) / (r*(1-A)) * Q1):

  A = 70%: Q2 = 0.26 * Q1
  A = 60%: Q2 = 0.17 * Q1
  A = 50%: Q2 = 0.11 * Q1

So even if "recursive resolution is 10 times heavier" than the cache
only case, we can assume the server spends half of its run time
answering queries directly from the cache at a cache hit rate of 90%.
I think this is a reasonably safe assumption.
Now, assuming a figure of 50% or more, does this suggest we should
highly optimize the cache?  Opinions may vary on this point, but I
personally think the answer is yes.  I've written an experimental
cache-only implementation that employs the idea of fully rendered
cached data.  On one test machine (2.20GHz AMD64, using a single
core), a queryperf-like benchmark shows it can handle over 180Kqps,
while BIND 9.9.2 handles just 41Kqps.  The experimental implementation
skips some features necessary for a production server, and cache
management itself is always an inevitable bottleneck, so the
production version wouldn't be that fast; but it still suggests it may
not be very difficult to reach over 100Kqps in a production
environment, including recursive resolution overhead.
Cache Types
-----------
1. Record cache

Conceptually, any recursive resolver (with cache) implementation would
have a cache for RRs (or RRsets, in the modern version of the
protocol) given in responses to its external queries.  In BIND 9 it's
called the "cached DB", using an in-memory rbt-like tree; unbound
calls it the "rrset cache", which is implemented as a hash table.
2. Delegation cache

Recursive server implementations would also have a cache to determine
the deepest zone cut for a given query name in the recursion process.
Neither BIND 9 nor unbound has a separate cache for this purpose;
basically, they try to find an NS RRset in the "record cache" whose
owner name best matches the given query name.
3. Remote server cache

In addition, a recursive server implementation may maintain a cache of
information about remote authoritative servers.  Both BIND 9 and
unbound conceptually have this type of cache, although there are some
non-negligible differences in the details.  BIND 9's implementation of
this cache is called ADB.  It's a hash table whose key is a domain
name, and each entry stores the corresponding IPv6/v4 addresses;
another data structure for each address stores the averaged RTT for
the address, lameness information, EDNS availability, etc.  unbound's
implementation is called the "infrastructure cache".  It's a hash
table keyed with IP addresses whose entries store information similar
to that in BIND 9's per-address ADB entries.  In unbound a remote
server's address must be determined by looking up the record cache
(the "rrset cache" in unbound terminology); unlike BIND 9's ADB,
there's no direct shortcut from a server's domain name to its IP
addresses.
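Just to make the shape of this cache concrete, here is an illustrative
C++ sketch (all names are mine, and the fields are simplified from the
descriptions above; real implementations track much more):

    #include <chrono>
    #include <string>
    #include <unordered_map>
    #include <vector>

    // Illustrative per-address entry, in the style of BIND 9's
    // per-address ADB data described above.
    struct RemoteServerEntry {
        std::string address;                   // IPv6/v4 address (textual here)
        std::chrono::milliseconds smoothedRtt; // averaged RTT to this address
        bool edns0Supported;                   // EDNS availability
        bool lame;                             // lameness information
    };

    // BIND 9 style: keyed by the server's domain name, each entry
    // holding that server's addresses.  unbound instead keys its
    // "infrastructure cache" directly on the IP address.
    using RemoteServerCache =
        std::unordered_map<std::string, std::vector<RemoteServerEntry>>;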
4. Full response cache

unbound has an additional cache layer, called the "message cache".
It's a hash table whose hash key is the query parameters (essentially
qname and type) and whose entry is a sequence of references to record
(rrset) cache entries.  This sequence forms a complete response to the
corresponding query, so it helps optimize building a response message
by skipping record cache lookups for each section
(answer/authority/additional) of the response message.  PowerDNS
Recursor has (seemingly) the same concept, called the "packet cache"
(though I don't know its implementation details very well).

BIND 9 doesn't have this type of cache; it always looks into the
record cache to build a complete response to a given query.
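A hypothetical shape of such a message cache entry (names are mine;
unbound's actual internals surely differ in detail):

    #include <cstdint>
    #include <memory>
    #include <string>
    #include <vector>

    struct RRsetEntry {};  // stand-in for a record (rrset) cache entry

    // Hypothetical message-cache entry: a complete response expressed
    // as a sequence of references into the record cache, per section.
    struct MessageCacheEntry {
        std::vector<std::shared_ptr<RRsetEntry>> answer;
        std::vector<std::shared_ptr<RRsetEntry>> authority;
        std::vector<std::shared_ptr<RRsetEntry>> additional;
    };

    // Hash key: the query parameters (essentially qname and type).
    struct MessageKey {
        std::string qname;
        uint16_t qtype;
    };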
Miscellaneous General Requirements
----------------------------------

- Minimize contention between threads (if threaded).
- Cache purge policy: normally only a very small part of cached DNS
  information will be reused, and what is reused is reused very
  heavily.  So an LRU-like algorithm should generally work well, but
  we'll also need to honor DNS TTLs; see the sketch below.
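A minimal C++ sketch of that combined policy (LRU eviction, but
entries past their TTL are never served; all names are mine, and a
real shared cache would also need locking):

    #include <chrono>
    #include <list>
    #include <optional>
    #include <string>
    #include <unordered_map>

    // Minimal sketch: LRU eviction combined with DNS TTL expiry.
    // Not thread safe; a shared cache would need locking/sharding.
    class LruTtlCache {
    public:
        explicit LruTtlCache(size_t capacity) : capacity_(capacity) {}

        void put(const std::string& key, std::string value,
                 std::chrono::seconds ttl) {
            auto it = map_.find(key);
            if (it != map_.end()) {         // replace an existing entry
                lru_.erase(it->second);
                map_.erase(it);
            }
            lru_.push_front(
                {key, std::move(value),
                 std::chrono::steady_clock::now() + ttl});
            map_[key] = lru_.begin();
            if (map_.size() > capacity_) {  // evict the least recently used
                map_.erase(lru_.back().key);
                lru_.pop_back();
            }
        }

        std::optional<std::string> get(const std::string& key) {
            auto it = map_.find(key);
            if (it == map_.end()) return std::nullopt;
            if (std::chrono::steady_clock::now() >= it->second->expiry) {
                lru_.erase(it->second);     // honor the DNS TTL
                map_.erase(it);
                return std::nullopt;
            }
            lru_.splice(lru_.begin(), lru_, it->second);  // mark as MRU
            return it->second->value;
        }

    private:
        struct Entry {
            std::string key;
            std::string value;
            std::chrono::steady_clock::time_point expiry;
        };
        size_t capacity_;
        std::list<Entry> lru_;
        std::unordered_map<std::string, std::list<Entry>::iterator> map_;
    };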
Random Ideas for BIND 10
------------------------

Below are specific random ideas for BIND 10.  Some are based on
experimental results with reasonably realistic data; others are mostly
guesses.
1. Fully rendered response cache

Some real world query samples show that a very small portion of all
queries are very popular and repeated many times, while the rest are
rarely reused, if ever.  Two different data sets show that the top
10,000 queries cover around 80% of the total queries, regardless of
the total number of queries.  This suggests the idea of having a
small, highly optimized full response cache.

I tried this idea in the jinmei-l1cache branch.  It's a hash table
keyed with a tuple of query name and type whose entry stores a fully
rendered, wire-format response image (answer section only, assuming
the "minimal-responses" option).  It also maintains offsets to each
RR, so it can easily update TTLs when necessary, or rotate RRs if
optionally requested.  If neither TTL adjustment nor RR rotation is
required, query handling is just a hash table lookup and a copy of the
pre-rendered data.  An experimental benchmark showed it ran very fast:
more than 4 times faster than BIND 9, and even much faster than other
implementations that have a full response cache (although, as usual,
the comparison is not entirely fair).
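Schematically, an entry of this cache could look like the following
(names mine; simplified from the branch's actual code, which isn't
reproduced here):

    #include <cstdint>
    #include <vector>

    // Schematic entry of the fully rendered response cache: a
    // pre-rendered wire-format answer section, plus the offset of
    // each RR within it so TTLs can be patched (or RRs rotated) in
    // place.
    struct RenderedResponse {
        std::vector<uint8_t> wireImage;   // pre-rendered answer section
        std::vector<uint16_t> rrOffsets;  // offset of each RR in wireImage
    };

In the fast path, answering is then just a hash lookup plus a copy of
wireImage into the response buffer.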
Also, the cache size is quite small; the run time memory footprint of
this server process was only about 5MB.  So I think it's reasonable to
have each process/thread keep its own copy of this cache to completely
eliminate contention.  And if we can keep the cache size this small,
it would be easy to dump it to a file on shutdown and reuse it on
restart.  This would be quite effective (if the downtime is reasonably
short) because the cached data are expected to be highly popular.
2. Record cache

For the normal record cache, I don't have any particular idea beyond
the obvious, like a hash table mapping query parameters to the
corresponding RRset (or negative information).  But I guess this cache
should be shared by multiple threads, as that will help reconstruct
full response cache data more efficiently on TTL expiration.  If
shared, the data structure should be chosen so that contention
overhead is minimized; in that sense, I guess something like a hash
table is more suitable than a tree-like structure.

There are other points to discuss for this cache in relation to the
other cache types (see the following items).
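On the contention point, one common approach is lock striping:
partition the hash table by hash value and give each partition its own
lock, so threads only contend when their keys fall into the same
partition.  A minimal sketch (names mine; shown with a plain string
key for simplicity):

    #include <array>
    #include <mutex>
    #include <optional>
    #include <string>
    #include <unordered_map>

    // Sketch of a lock-striped hash map: threads contend only when
    // their keys hash to the same shard, not on one global lock.
    template <typename Value, size_t NumShards = 64>
    class ShardedMap {
    public:
        void put(const std::string& key, Value value) {
            Shard& s = shardFor(key);
            std::lock_guard<std::mutex> lock(s.mutex);
            s.map[key] = std::move(value);
        }

        std::optional<Value> get(const std::string& key) {
            Shard& s = shardFor(key);
            std::lock_guard<std::mutex> lock(s.mutex);
            auto it = s.map.find(key);
            if (it == s.map.end()) return std::nullopt;
            return it->second;
        }

    private:
        struct Shard {
            std::mutex mutex;
            std::unordered_map<std::string, Value> map;
        };
        Shard& shardFor(const std::string& key) {
            return shards_[std::hash<std::string>{}(key) % NumShards];
        }
        std::array<Shard, NumShards> shards_;
    };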
3. Separate delegation cache

One thing I'm guessing is that it may make sense to have a separate
cache structure for delegation data.  It's conceptually a set of NS
RRs organized so we can identify the best (longest) match for a given
query name.

Analysis of some sets of query data showed that the vast majority of
end clients' queries are for A and AAAA (not surprisingly).  So even
if we separate this cache from the record cache, the additional
overhead (both for memory and fetches) will probably (hopefully) be
marginal.  Separating the caches will also help reduce contention
between threads.  It *might* also help improve lookup performance
because this cache can be optimized for longest match searches.
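For example (a sketch with my own naming, assuming absolute,
case-normalized textual names), longest match can be implemented even
with a plain hash table by stripping leftmost labels until an entry is
found; a structure specialized for longest match could do better
still:

    #include <string>
    #include <unordered_map>

    struct Delegation {};  // stand-in for an NS RRset (plus glue)

    // Find the deepest cached zone cut for 'name' by stripping the
    // leftmost label until a cached delegation (or the root) is hit.
    const Delegation* findDeepestCut(
        const std::unordered_map<std::string, Delegation>& cache,
        std::string name) {
        for (;;) {
            auto it = cache.find(name);
            if (it != cache.end()) return &it->second;
            if (name == ".") return nullptr;  // nothing cached, not even root
            const size_t dot = name.find('.');
            name = (dot == std::string::npos || dot + 1 == name.size())
                       ? "."
                       : name.substr(dot + 1);
        }
    }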
4. Remote server cache without involving the record cache

Likewise, it may make sense to maintain the remote server cache
separately from the record cache.  I guess the AAAA and A records for
these servers are rarely queried by end clients, so, as in the case of
the delegation cache, it's possible that the data sets are mostly
disjoint.  Also, for this purpose the RRsets don't have to have a
higher trust rank (per RFC 2181 Section 5.4.1): glue or additional
section data are okay, and by separating these from the record cache
we can avoid accidentally promoting such data to trustworthy answers
and returning them to clients (BIND 9 has had this type of bug
before).
Custom vs Existing Library (STL etc)
------------------------------------

This may have to be discussed further, but I guess in many cases we'll
end up introducing custom implementations, because these caches are
highly performance sensitive, directly related to our core business,
and also have to be memory efficient.  But in some subcomponents we
may be able to benefit from existing generic libraries.