Scaling across (many) cores
===========================

Problem statement
-----------------

The general issue is how to ensure that the resolver scales.

Resolvers are currently CPU bound, and since it seems unlikely that either
instructions-per-cycle or CPU frequency will increase radically, scaling will
need to happen across multiple cores.

How can we best scale a recursive resolver across multiple cores?

What resolution looks like
--------------------------

                         Receive the query. @# <-------------------------------\
                              |                                                 |
                              v                                                 |
                         Parse it, etc. $                                       |
                              |                                                 |
                              v                                                 |
                         Look into the cache. $#                                |
                              |                                                 |
                              v                                                 |
Cry <---- No <---------- Is it there? -----------> Yes -----------\             |
          |                    ^                                   |            |
          v                    |                                   |            |
Prepare upstream query $       |                                   |            |
          |                    |                                   |            |
          v                    |                                   |            |
Send an upstream query (#)     |                                   |            |
          |                    |                                   |            |
          v                    |                                   |            |
Wait for answer @(#)           |                                   |            |
          |                    |                                   |            |
          v                    |                                   |            |
Parse $                        |                                   |            |
          |                    |                                   |            |
          v                    |                                   |            |
Is it enough? $ ----> No ------/                                   |            |
          |                                                        |            |
         Yes                                                       |            |
          |                                                        |            |
          \----------------------> Build answer $ <---------------/             |
                                         |                                      |
                                         v                                      |
                                    Send answer # ------------------------------/

This is a simplified version, however. There may be other tasks (validation,
for example) which are not drawn, mostly for simplicity, as they don't add new
kinds of problems. Validation would be done as part of some computational task
and it could trigger more lookups in the cache or more upstream queries.

Also, multiple queries may generate the same upstream query, so they should be
aggregated together somehow.

Legend
~~~~~~
* $ - CPU intensive
* @ - Waiting for an external event
* # - Possible interaction with other tasks

Goals
-----
* Run the CPU-intensive tasks in multiple threads to allow concurrency.
* Minimize waiting for locks.
* Don't require too much memory.
* Minimize the number of upstream queries (both because they are slow and
  expensive and because we don't want to eat too much bandwidth or spam the
  authoritative servers).
* A design simple enough that it can be implemented.

Naïve version
-------------

Let's look at possible approaches and list their pros and cons. Many of the
simple versions would not really work, but let's have a look at them anyway,
because thinking about them might suggest solutions for the real versions.

We take one query and handle it fully, with blocking waits for the answers.
Only after this is done do we take another. The cache is private to each
process.

Advantages:

* Very simple.
* No locks.

Disadvantages:

* To scale across cores, we need to run *a lot* of processes, since they'd be
  waiting for something most of the time. That means a lot of memory eaten,
  because each one has its own cache. Also, running so many processes may be
  problematic; processes are not very cheap.
* Many things would be asked multiple times, because the caches are not
  shared.

Threads
~~~~~~~

Some of the problems could be solved by using threads, but that would not
improve things much, since threads are not really cheap either (starting
several hundred threads might not be a good idea).

Also, threads bring other problems. Even when we still assume separate caches
(for shared caches, see below), we need to ensure safe access to logging,
configuration, the network, etc. These could become a bottleneck (e.g. if we
lock every time we read a packet from the network, then with many threads
they'll just fight over the lock).

Supercache
~~~~~~~~~~

The problem with the cache could be solved by placing a ``supercache'' between
the resolvers and the Internet. That one would do almost no processing: it
would just take the query, look it up in the cache and either answer from the
cache or forward the query to the external world. It would store the answer
and forward it back.

The cache, if single-threaded, could be a bottleneck. There are several
possible approaches to solve that:

Layered cache::
  Each process has its own small cache, which catches many queries. Then a
  group of processes shares another level of bigger cache, which catches most
  of the queries that get past the private caches. We further group them and
  each level handles fewer queries from each process, so it can keep up.
  However, each level adds some overhead for another lookup.
Segmented cache::
  We have several caches of the same level, in parallel. When we want to ask a
  cache, we hash the query and the hash decides which cache to ask (see the
  sketch after this list). Only that cache can hold the answer, if any, and
  each could run in a separate process. The only problem is, could there be a
  pattern of queries skewed so that only one cache is used while the rest sit
  idle?
Shared cache access::
  A cache would be accessed by multiple processes/threads. See below for
  details, but there's a risk of lock contention on the cache (it depends on
  the data structure).
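
The segment selection itself is cheap. A rough sketch of how a query could be
routed to its segment (the names here are invented for illustration, this is
not existing code):

[source,cpp]
----
// Hash-based routing to one of several parallel cache segments.  Each
// segment could live in its own process or thread; a given (name, type)
// always maps to the same segment, so only that one can hold the answer.
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

struct CacheSegment {
    // the real cache data structure would live here
};

class SegmentedCache {
public:
    explicit SegmentedCache(std::size_t count) : segments_(count) {}

    std::size_t segmentFor(const std::string& qname, int qtype) const {
        const std::size_t h = std::hash<std::string>()(qname) ^
                              (std::hash<int>()(qtype) << 1);
        return h % segments_.size();
    }

    CacheSegment& segment(std::size_t index) { return segments_[index]; }

private:
    std::vector<CacheSegment> segments_;
};
----

Whether real query streams hash evenly enough (the skew question above) is
something we would have to measure.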

Upstream queries
~~~~~~~~~~~~~~~~

Before doing an upstream query, we look into the cache to make sure we don't
already have the information. When we get the answer, we want to update the
cache.

This suggests the upstream queries are tightly coupled with the cache. Now,
when we have several cache processes/threads, each can have its own set of
open sockets, not shared with the other caches, to do the lookups. This way we
can avoid locking the upstream network communication.

Also, we can have three conceptual states for data in the cache, and act
differently when it is requested:

Present::
  If it is available, in positive or negative form, we just provide the answer
  right away.
Not present::
  The continuation of processing is queued somehow (blocked/callback is
  stored/whatever). An upstream query is sent and we move to the next state.
Waiting for answer::
  If another query for the same thing arrives, we just queue it the same way
  and keep waiting. When the answer comes, all the queued tasks are resumed.
  If the TTL > 0, we store the answer and set the state to ``present''.

We want to aggregate upstream queries anyway; using the cache for it saves
some more processing and possibly some locks.
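
A rough sketch of such a three-state entry and the aggregation of queries
waiting for the same upstream answer (invented names, locking ignored, not
existing code):

[source,cpp]
----
// One cache entry with the three conceptual states.  Queries that miss the
// cache are parked as continuations; a single upstream query serves all of
// them when the answer arrives.
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <vector>

struct CacheEntry {
    enum State { NOT_PRESENT, PRESENT, WAITING_FOR_ANSWER };
    CacheEntry() : state(NOT_PRESENT) {}
    State state;
    std::string answer;                                   // valid if PRESENT
    std::vector<std::function<void(const std::string&)> > waiters;
};

class Cache {
public:
    // 'resume' is the queued continuation of the query's processing.
    void lookup(const std::string& key,
                std::function<void(const std::string&)> resume) {
        CacheEntry& e = entries_[key];
        if (e.state == CacheEntry::PRESENT) {
            resume(e.answer);                   // answer right away
        } else if (e.state == CacheEntry::WAITING_FOR_ANSWER) {
            e.waiters.push_back(resume);        // aggregate with earlier query
        } else {
            e.state = CacheEntry::WAITING_FOR_ANSWER;
            e.waiters.push_back(resume);
            sendUpstreamQuery(key);             // only one upstream query sent
        }
    }

    // Called when the upstream answer for 'key' arrives.
    void answerArrived(const std::string& key, const std::string& answer,
                       unsigned ttl) {
        CacheEntry& e = entries_[key];
        if (ttl > 0) {                          // cacheable: keep it
            e.state = CacheEntry::PRESENT;
            e.answer = answer;
        } else {
            e.state = CacheEntry::NOT_PRESENT;
        }
        for (std::size_t i = 0; i < e.waiters.size(); ++i) {
            e.waiters[i](answer);               // resume all queued tasks
        }
        e.waiters.clear();
    }

private:
    void sendUpstreamQuery(const std::string&) {
        // would serialize the query and send it over this cache's own socket
    }
    std::map<std::string, CacheEntry> entries_;
};
----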

Multiple parallel queries
-------------------------

It seems obvious we can't afford a thread or process for each outstanding
query. Each one needs to handle multiple queries at any given time.

Coroutines
~~~~~~~~~~

OS-level threads might be too expensive, but coroutines might be cheap enough.
That way we could still write code that is easy to read, but limit the number
of OS threads to a reasonable number.

In this model, when a query comes, a new coroutine/user-level thread is
created for it. We use special reads and writes whenever there's an operation
that could block. These reads and writes internally schedule the operation and
switch to another coroutine (if there's any ready to be executed).

Each thread/process maintains its own set of coroutines and they do not
migrate. This way, the queue of coroutines is kept lock-less, as are any
private caches. Only the shared caches are protected by a lock.

[NOTE]
The `coro` unit we have in the current code is *not* considered a coroutine
library here. We would need a coroutine library where each coroutine has a
real stack and we switch the stacks on a coroutine switch. That is possible
with a reasonable amount of dark magic (see `ucontext.h`, for example, but
there are surely higher-level libraries for that).

There is some trouble with multiple coroutines waiting on the same event, like
the same upstream query (possibly even coroutines from different threads), but
it should be possible to solve.
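
To illustrate the kind of dark magic meant above, a minimal stack-switching
sketch with `ucontext.h` (just the mechanism; the real scheduler, the special
reads/writes and the per-thread coroutine queues are left out):

[source,cpp]
----
// Minimal stack switching with ucontext.h: one coroutine with its own
// stack, suspended and resumed by the "scheduler" (here just main()).
#include <ucontext.h>
#include <cstdio>

static ucontext_t main_ctx, coro_ctx;

// In the real thing, this would be hidden inside the special read/write
// wrappers: park the current coroutine and return to the scheduler.
static void yield() {
    swapcontext(&coro_ctx, &main_ctx);
}

// Body of one query-handling coroutine.
static void handleQuery() {
    std::printf("parse the query, look into the cache\n");
    yield();                        // pretend we wait for an upstream answer
    std::printf("answer arrived, build and send the response\n");
}

int main() {
    static char stack[64 * 1024];   // private stack for the coroutine

    getcontext(&coro_ctx);
    coro_ctx.uc_stack.ss_sp = stack;
    coro_ctx.uc_stack.ss_size = sizeof(stack);
    coro_ctx.uc_link = &main_ctx;   // where to continue when it finishes
    makecontext(&coro_ctx, handleQuery, 0);

    swapcontext(&main_ctx, &coro_ctx);   // run until the first yield()
    std::printf("scheduler: could run other coroutines here\n");
    swapcontext(&main_ctx, &coro_ctx);   // resume the parked coroutine
    return 0;
}
----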

Event-based
~~~~~~~~~~~

We use events (`asio` and the like) for writing it. Each outstanding query is
an object with some callbacks on it. Whenever we would do a possibly blocking
operation, we schedule a callback to happen once the operation finishes.

This is more lightweight than the coroutines (the query objects will be
smaller than the stacks for coroutines), but it is harder to write and to
read.

[NOTE]
Do not consider cross-breeding the models. That leads to space-time
distortions and brain damage. Implementing one on top of the other is OK, but
mixing them in the same bit of code is the way to the madhouse.
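
A sketch of what such a query object could look like. The interfaces here
(`CacheIface`, `QueryTask`) are invented for illustration; the real thing
would sit on top of an event loop such as `asio`:

[source,cpp]
----
// One outstanding query in the event-based model: no stack of its own,
// just a bit of state plus callbacks invoked when asynchronous operations
// complete.
#include <cstdio>
#include <string>

// Hypothetical asynchronous cache interface; asyncLookup() never blocks,
// it only schedules the lookup and calls back later.
class CacheIface {
public:
    class Callback {
    public:
        virtual ~Callback() {}
        virtual void cacheDone(bool found, const std::string& answer) = 0;
    };
    virtual ~CacheIface() {}
    virtual void asyncLookup(const std::string& key, Callback* cb) = 0;
};

class QueryTask : public CacheIface::Callback {
public:
    QueryTask(CacheIface& cache, const std::string& qname) :
        cache_(cache), qname_(qname)
    {}

    void start() {
        cache_.asyncLookup(qname_, this);   // schedule and return immediately
    }

    // Invoked by the event loop when the cache lookup completes.
    virtual void cacheDone(bool found, const std::string& answer) {
        if (found) {
            sendAnswer(answer);
        } else {
            sendUpstreamQuery();            // would register another callback
        }
    }

private:
    void sendAnswer(const std::string& answer) {
        std::printf("would render and send: %s\n", answer.c_str());
    }
    void sendUpstreamQuery() {
        std::printf("would send an upstream query for %s\n", qname_.c_str());
    }
    CacheIface& cache_;
    std::string qname_;
};
----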

Landlords and peasants
~~~~~~~~~~~~~~~~~~~~~~

In both the coroutine and event-based models, the cache and other shared
things are easiest to imagine as objects the working threads fight over to
hold for a short while. In this model, it is easier to imagine each such
shared object as something owned by a landlord that doesn't let anyone else
onto it, but to whom you can send requests.

A query is an object once again, with some kind of state machine.

Then there are two kinds of threads. The peasants are there just to do the
heavy work. There's a global work-queue for peasants. Once a peasant is idle,
it comes to the queue and picks up a handful of queries from there. It does as
much on each query as possible without requiring any shared resource.

The other kind, the landlords, each have a resource to watch over. So we would
have a cache (or several parts of a cache), the sockets for accepting queries
and answering them, possibly more. Each of these would have a separate
landlord thread and a queue of tasks to perform on the resource (look up
something, send an answer...).

Similarly, a landlord would take a handful of tasks from its queue and start
handling them. It would possibly produce some more tasks for the peasants.

The point here is, all the synchronisation is done on the queues, not on the
shared resources themselves. And we would append to a queue only once the
whole batch was completed. By tweaking the size of the batch, we could balance
lock contention, throughput and RTT. The append/remove would be a quick
operation, and the cost of locks would amortize over the larger number of
queries handled per lock operation.

The possible downside is that a query needs to travel across several threads
during its lifetime. It might turn out that it is faster to move the query
between cores than to access the cache from several threads, since the query
is smaller, but it might be slower as well.

It would be critical to make some kind of queue that is fast to append to and
fast to take the first n items out of. Also, the tasks in the queues can be
just abstract `boost::function<void (Worker&)>` functors, and each worker
would just iterate through the queue, calling each functor. The parameter is
there to allow easy generation of more tasks for other queues (they would be
stored privately first, and appended to the remote queues at the end of the
batch).
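
A sketch of such a batched queue and of one peasant iteration follows. It uses
the C++11 `std::function`/`std::mutex` as stand-ins for the `boost`
equivalents mentioned above; the names (`TaskQueue`, `Worker`) and the batch
size are made up:

[source,cpp]
----
// Tasks are abstract functors; all locking happens on the queues and each
// lock operation covers a whole batch.
#include <algorithm>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <vector>

class Worker;
typedef std::function<void(Worker&)> Task;

class Worker {
public:
    // Tasks generated while running a batch are buffered here and pushed
    // to the remote queues only after the batch is done.
    std::vector<Task> outgoing;
};

class TaskQueue {
public:
    // Append a whole batch under a single lock acquisition.
    void appendBatch(const std::vector<Task>& batch) {
        std::lock_guard<std::mutex> guard(lock_);
        queue_.insert(queue_.end(), batch.begin(), batch.end());
    }

    // Take up to max_count tasks out, again under a single lock.
    std::vector<Task> takeBatch(std::size_t max_count) {
        std::lock_guard<std::mutex> guard(lock_);
        const std::size_t n = std::min(max_count, queue_.size());
        std::vector<Task> batch(queue_.begin(), queue_.begin() + n);
        queue_.erase(queue_.begin(), queue_.begin() + n);
        return batch;
    }

private:
    std::mutex lock_;
    std::deque<Task> queue_;
};

// One iteration of a peasant: grab a handful of queries, do as much CPU
// work on each as possible, then hand the generated tasks to a landlord.
void peasantIteration(Worker& self, TaskQueue& work_queue,
                      TaskQueue& landlord_queue) {
    std::vector<Task> batch = work_queue.takeBatch(16);  // tweakable size
    for (std::size_t i = 0; i < batch.size(); ++i) {
        batch[i](self);                     // may push into self.outgoing
    }
    landlord_queue.appendBatch(self.outgoing);
    self.outgoing.clear();
}
----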

Also, if we wanted to generate multiple parallel upstream queries from a
single query, we would need to be careful. A query object would not have a
lock on itself and the upstream queries could end up in different
caches/threads. To protect the original query, we would add another landlord
that would aggregate the answers together and let the query continue
processing once it had enough of them. That way, the answers would all be
pushed to the same thread and could not fight over the query.

[NOTE]
This model would work only with threads, not processes.

Shared caches
-------------

While it seems good to have some sort of L1 cache with pre-rendered answers
(according to the measurements in ticket #2777), we probably need some kind of
larger shared cache as well.

If we had just a single shared cache protected by a lock, there'd be a lot of
contention on that lock.

Partitioning the cache
~~~~~~~~~~~~~~~~~~~~~~

We split the cache into parts, either by layers or by parallel segments we
switch between by a hash. Taken to the extreme, a lock on each hash bucket
would be this kind of partitioning, though that might waste resources (how
expensive is it to create a lock?).
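
One middle ground between a single big lock and a lock per bucket is a fixed
pool of locks shared by the buckets (lock striping). A rough sketch with
invented names, ignoring the real bucket data structure:

[source,cpp]
----
// A smaller, fixed pool of locks guards a larger number of hash buckets.
// Different stripes can be locked by different threads at the same time,
// without paying for one lock per bucket.
#include <cstddef>
#include <functional>
#include <mutex>
#include <string>
#include <vector>

class StripedCache {
public:
    StripedCache(std::size_t buckets, std::size_t stripes) :
        buckets_(buckets), locks_(stripes)
    {}

    // Store under the lock guarding this key's stripe only; the other
    // stripes stay available to other threads.
    void store(const std::string& key, const std::string& answer) {
        const std::size_t b = std::hash<std::string>()(key) % buckets_.size();
        std::lock_guard<std::mutex> guard(locks_[b % locks_.size()]);
        buckets_[b] = answer;   // stands in for the real bucket structure
    }

private:
    std::vector<std::string> buckets_;
    std::vector<std::mutex> locks_;
};
----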

Landlords
~~~~~~~~~

The landlords do the synchronization themselves. Still, the cache would need
to be partitioned.

RCU
~~~

RCU is a lock-less synchronization mechanism. An item is accessed through a
pointer. An updater creates a copy of the structure (in our case it would be
the content of a single hash bucket) and then atomically replaces the pointer.
Readers from before the swap keep the old version, new ones get the new
version. When all the old readers die out, the old copy is reclaimed. Also,
the reclamation can AFAIK be postponed to a later time when we are slightly
more idle, or to a different thread.

We could use it for the cache ‒ on the fast track, we would just read the
cache. On the slow one, we would have to wait in a queue for the update to be
done, in a single updater thread (because we don't really want to be updating
the same cell twice at the same time).
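
The pointer-swap part can be approximated in user space, for instance with the
atomic `shared_ptr` operations from C++11. This is only an illustration of the
idea (not real RCU, names made up): readers grab the current pointer without
locking, the single updater builds a copy and atomically publishes it, and the
old copy is reclaimed once the last reader drops it.

[source,cpp]
----
// RCU-style read path for one hash bucket: lock-less reads, copy-and-swap
// updates done from a single updater thread.
#include <map>
#include <memory>
#include <string>

typedef std::map<std::string, std::string> Bucket;

class RcuBucket {
public:
    RcuBucket() : current_(std::make_shared<Bucket>()) {}

    // Fast track: lock-less read of the current version.
    std::shared_ptr<const Bucket> read() const {
        return std::atomic_load(&current_);
    }

    // Slow track: meant to be called from the single updater thread only,
    // so two updates of the same bucket never race.
    void update(const std::string& key, const std::string& answer) {
        std::shared_ptr<const Bucket> old = std::atomic_load(&current_);
        std::shared_ptr<Bucket> copy = std::make_shared<Bucket>(*old);
        (*copy)[key] = answer;                    // modify the private copy
        std::atomic_store(&current_, std::shared_ptr<const Bucket>(copy));
    }

private:
    std::shared_ptr<const Bucket> current_;
};
----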

Proposals
---------

In either case, we would have some kind of L1 cache with pre-rendered answers.
For these proposals (except the third), it doesn't matter whether we split the
cache into parallel chunks or into layers.

Hybrid RCU/Landlord
~~~~~~~~~~~~~~~~~~~

The landlord approach, except that read-only accesses to the cache are done
directly by the peasants. Only if they don't find what they want do they
append a task to the queue of the landlord. The landlord does the RCU updates.
It could happen that by the time the landlord gets to the task the answer is
already there, but that would not matter much.

Network access would be done from the landlords.

Coroutines+RCU
~~~~~~~~~~~~~~

We would use the coroutines, and reads from the shared cache would go without
locking. When doing a write, we would have to lock.

To avoid locking, each worker thread would have its own set of upstream
sockets, and we would dup the sockets from users so we don't have to lock
those either.

Multiple processes with coroutines and RCU
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This would need the layered cache. The upper caches would be mapped into local
memory for read-only access. Each cache would be a separate process. That
process would do the updates ‒ if the answer was not there, the process would
be asked by some kind of IPC to pull it from the upstream cache or from the
network.