[2775] Possible approaches to threads

Michal 'vorner' Vaner 12 years ago
parent
commit
a756e890b5
1 changed file with 219 additions and 15 deletions
doc/design/resolver/01-scaling-across-cores

@@ -1,7 +1,9 @@
-01-scaling-across-cores
+Scaling across (many) cores
+===========================
+
+Problem statement
+-----------------
 
-Introduction
-------------
 The general issue is how to ensure that the resolver scales.
 
 Currently resolvers are CPU bound, and it seems likely that both
@@ -10,17 +12,8 @@ scaling will need to be across multiple cores.
 
 How can we best scale a recursive resolver across multiple cores?
 
-Some possible solutions:
-
-a. Multiple processes with independent caches
-b. Multiple processes with shared cache
-c. A mix of independent/shared cache
-d. Thread variations of the above
-
-All of these may be complicated by NUMA architectures (with
-faster/slower access to specific RAM).
-
-How does resolution look like:
+How resolution works
+--------------------
 
                                Receive the query. @# <------------------------\
                                        |                                      |
@@ -56,7 +49,218 @@ How does resolution look like:
                                         v                                     |
                                    Send answer # -----------------------------/
 
-Legend:
+This is a simplified version, however. There may be other tasks (validation,
+for example) which are not drawn, mostly for simplicity, as they don't
+introduce new problems. Validation would be done as part of some computational
+task and could do more lookups in the cache or more upstream queries.
+
+Also, multiple queries may generate the same upstream query, so they should be
+aggregated together somehow.
+
+Legend
+~~~~~~
  * $ - CPU intensive
  * @ - Waiting for external event
  * # - Possible interaction with other tasks
+
+Goals
+-----
+ * Run the CPU intensive tasks in multiple threads to allow concurrency.
+ * Minimise waiting for locks.
+ * Don't require too much memory.
+ * Minimise the number of upstream queries (both because they are slow and
+   expensive and also because we don't want to eat too much bandwidth or spam
+   the authoritative servers).
+ * Keep the design simple enough that it can be implemented.
+
+Naïve version
+-------------
+
+Let's look at possible approaches and list their pros and cons. Many of the
+simple versions would not really work, but let's have a look at them anyway,
+because thinking about them might suggest solutions for the real versions.
+
+We take one query, handle it fully, with blocking waits for the answers. After
+this is done, we take another. The cache is private to each process.
+
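+A minimal sketch of that loop follows; the helper functions are hypothetical
+stand-ins, and a real process would of course loop forever and consult its
+private cache inside `resolve`.
+
+----
+#include <cstdio>
+#include <string>
+
+// Hypothetical stand-ins for the real networking and resolution code.
+static std::string receiveQuery() { return "www.example.org./IN/A"; }
+static std::string resolve(const std::string& query) {
+    // May block on upstream queries; nothing else runs in the meantime.
+    return "answer for " + query;
+}
+static void sendAnswer(const std::string& answer) {
+    std::printf("%s\n", answer.c_str());
+}
+
+int main() {
+    // Each process runs this loop independently, with its own private cache.
+    for (int i = 0; i < 3; ++i) {         // forever in reality
+        std::string q = receiveQuery();   // wait for a client
+        std::string a = resolve(q);       // handle it fully, blocking as needed
+        sendAnswer(a);                    // only then take the next query
+    }
+    return 0;
+}
+----
+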
+Advantages:
+
+ * Very simple.
+ * No locks.
+
+Disadvantages:
+
+ * To scale across cores, we need to run *a lot* of processes, since they'd be
+   waiting for something most of their time. That means a lot of memory eaten,
+   because each one has its own cache. Also, running so many processes may be
+   problematic, since processes are not very cheap.
+ * Many things would be asked multiple times, because the caches are not
+   shared.
+
+Threads
+~~~~~~~
+
+Some of the problems could be solved by using threads, but they would not
+improve things much, since threads are not really cheap either (starting
+several hundred threads might not be a good idea).
+
+Also, threads bring other problems. Even when we still assume separate caches
+(for caches, see below), we need to ensure safe access to logging,
+configuration, the network, etc. These could become a bottleneck (e.g. if we
+take a lock every time we read a packet from the network, then with many
+threads they'll just fight over that lock).
+
+Supercache
+~~~~~~~~~~
+
+The problem with the cache could be solved by placing a ``supercache'' between
+the resolvers and the Internet. That one would do almost no processing: it
+would just take the query, look it up in the cache and either answer from the
+cache or forward the query to the external world. It would then store the
+answer and forward it back.
+
+The cache, if single-threaded, could be a bottleneck. There are several
+possible approaches to solve that:
+
+Layered cache::
+  Each process has its own small cache, which catches many queries. Then, a
+  group of processes shares another, bigger level of cache, which catches most
+  of the queries that get past the private caches. We group them further and
+  each level handles fewer queries from each process, so it can keep up.
+  However, with each level, we add some overhead to do another lookup.
+Segmented cache::
+  We have several caches of the same level, in parallel. When we ask a cache,
+  we hash the query and the hash decides which cache to ask. Only that cache
+  would have the answer, if any, and each cache could run in a separate
+  process (see the sketch after this list). The only problem is, could there
+  be a pattern of queries that skews the load to only one cache while the
+  rest sit idle?
+Shared cache access::
+  A cache would be accessed by multiple processes/threads. See below for
+  details, but there's a risk of lock contention on the cache (it depends on
+  the data structure).
+
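+For illustration, a minimal sketch of how the segmented variant could pick a
+cache, assuming the shards are selected by hashing the query name. The
+`CacheShard` type and the shard count are hypothetical, not taken from any
+existing code.
+
+----
+#include <functional>
+#include <string>
+#include <vector>
+
+// Hypothetical per-shard cache; each instance would live in its own
+// process or thread, so no locking is needed inside it.
+struct CacheShard {
+    // ... lookup/store operations ...
+};
+
+// Pick the shard responsible for a given query. Everyone uses the same
+// function, so an answer is only ever stored and looked up in one place.
+size_t shardFor(const std::string& query, size_t shard_count) {
+    return std::hash<std::string>()(query) % shard_count;
+}
+
+int main() {
+    std::vector<CacheShard> shards(4);
+    size_t idx = shardFor("www.example.org./IN/A", shards.size());
+    CacheShard& shard = shards[idx];   // only this shard is asked
+    (void)shard;                       // a real resolver would query it now
+    return 0;
+}
+----
+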
+Upstream queries
+~~~~~~~~~~~~~~~~
+
+Before doing an upstream query, we look into the cache to ensure we don't have
+the information yet. When we get the answer, we want to update the cache.
+
+This suggests the upstream queries are tightly coupled with the cache. Now,
+when we have several cache processes/threads, each can have its own set of
+open sockets, not shared with the other caches, to do the lookups. This way we
+can avoid locking the upstream network communication.
+
+Also, we can have three conceptual states for data in the cache, and act
+differently when the data is requested.
+
+Present::
+  If it is available, as either a positive or a negative answer, we just
+  provide it right away.
+Not present::
+  The continuation of processing is queued somehow (blocked/callback is
+  stored/whatever). An upstream query is sent and we get to the next state.
+Waiting for answer::
+  If another query for the same thing arrives, we just queue it the same way
+  and keep waiting. When the answer comes, all the queued tasks are resumed.
+  If the TTL > 0, we store the answer and set it to ``present''.
+
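+To make the three states concrete, here is a small sketch of what a cache
+entry could look like; the names (`CacheEntry`, `Continuation`) are invented
+for illustration and the real structure would hold more (the answer, the TTL,
+and so on).
+
+----
+#include <functional>
+#include <list>
+
+// Hypothetical continuation: a blocked query waiting to be resumed.
+typedef std::function<void ()> Continuation;
+
+struct CacheEntry {
+    enum State {
+        PRESENT,             // positive or negative answer is stored
+        NOT_PRESENT,         // nothing known yet
+        WAITING_FOR_ANSWER   // an upstream query is already in flight
+    };
+    State state;
+    std::list<Continuation> waiting;   // queries queued on this entry
+
+    CacheEntry() : state(NOT_PRESENT) {}
+
+    // Called when the upstream answer arrives: resume everything queued
+    // and, if the TTL allows, keep the answer around as PRESENT.
+    void answerArrived(bool cacheable) {
+        state = cacheable ? PRESENT : NOT_PRESENT;
+        for (const Continuation& c : waiting) {
+            c();
+        }
+        waiting.clear();
+    }
+};
+
+int main() {
+    CacheEntry entry;
+    entry.state = CacheEntry::WAITING_FOR_ANSWER;
+    // Another query for the same thing just queues itself and keeps waiting.
+    entry.waiting.push_back([]() { /* resumed processing would go here */ });
+    entry.answerArrived(true);   // TTL > 0, so the entry becomes PRESENT
+    return 0;
+}
+----
+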
+We want to aggregate upstream queries anyway; using the cache for it saves
+some more processing and possibly some locking.
+
+Multiple parallel queries
+-------------------------
+
+It seems obvious we can't afford to have a thread or process for each
+outstanding query. We need to handle multiple queries in each one at any given
+time.
+
+Coroutines
+~~~~~~~~~~
+
+The OS-level threads might be too expensive, but coroutines might be cheap
+enough. That way, we could still write code that is easy to read, but limit
+the number of OS threads to a reasonable number.
+
+In this model, when a query comes, a new coroutine/user-level thread is created
+for it. We use special reads and writes whenever there's an operation that
+could block. These reads and writes would internally schedule the operation
+and switch to another coroutine (if any is ready to be executed).
+
+Each thread/process maintains its own set of coroutines and they do not
+migrate. This way, the queue of coroutines is kept lock-less, as well as any
+private caches. Only the shared caches are protected by a lock.
+
+[NOTE]
+The `coro` unit we have in the current code is *not* considered a coroutine
+library here. We would need a coroutine library where each coroutine has a
+real stack and the stacks are switched on a coroutine switch. That is possible
+with a reasonable amount of dark magic (see `ucontext.h`, for example, but there
+are surely some higher-level libraries for that).
+
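+For a taste of the dark magic, a minimal sketch of a stack-switching coroutine
+built on `ucontext.h`; the single hard-coded coroutine and all the names are
+purely illustrative, a real scheduler would manage many of these.
+
+----
+#include <ucontext.h>
+#include <cstdio>
+#include <vector>
+
+static ucontext_t scheduler_ctx;   // the scheduler's own context
+static ucontext_t query_ctx;       // one coroutine handling one query
+
+// Body of a query-handling coroutine. Where it would block on the network,
+// it yields back to the scheduler instead.
+static void handleQuery() {
+    std::printf("parse query, look into the cache\n");
+    swapcontext(&query_ctx, &scheduler_ctx);   // yield: waiting for upstream
+    std::printf("answer arrived, send the reply\n");
+}   // falling off the end returns to uc_link (the scheduler)
+
+int main() {
+    std::vector<char> stack(64 * 1024);        // a real stack for the coroutine
+    getcontext(&query_ctx);
+    query_ctx.uc_stack.ss_sp = &stack[0];
+    query_ctx.uc_stack.ss_size = stack.size();
+    query_ctx.uc_link = &scheduler_ctx;        // where to go when it finishes
+    makecontext(&query_ctx, handleQuery, 0);
+
+    swapcontext(&scheduler_ctx, &query_ctx);   // run until the first yield
+    std::printf("scheduler: coroutine blocked, other coroutines could run\n");
+    swapcontext(&scheduler_ctx, &query_ctx);   // resume it with the "answer"
+    return 0;
+}
+----
+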
+There is some trouble with multiple coroutines waiting on the same event, like
+the same upstream query (possibly even coroutines from different threads), but
+it should be possible to solve.
+
+Event-based
+~~~~~~~~~~~
+
+We use events (`asio` and the like) to write it. Each outstanding query is an
+object with some callbacks on it. Whenever we would do a possibly blocking
+operation, we schedule a callback to happen once the operation finishes.
+
+This is more lightweight than the coroutines (the query objects will be smaller
+than the stacks for coroutines), but it is harder to write and read.
+
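+A rough sketch of the callback style follows; it is not tied to `asio` or any
+particular event library, and all the names are invented for illustration (the
+"upstream" here completes immediately instead of being deferred by an event
+loop).
+
+----
+#include <functional>
+#include <iostream>
+#include <string>
+
+// Hypothetical outstanding query: instead of a whole stack, it only carries
+// the state it needs plus the callback to run when the blocking step is done.
+class QueryTask {
+public:
+    explicit QueryTask(const std::string& qname) : qname_(qname) {}
+
+    void start() {
+        std::cout << "lookup " << qname_ << " in the cache: miss\n";
+        // Schedule the possibly blocking operation and register what should
+        // happen once it completes; a real event loop would call it later.
+        sendUpstream([](const std::string& answer) {
+            std::cout << "got " << answer << ", send the reply\n";
+        });
+    }
+
+private:
+    // Stand-in for the event library; calls the callback right away.
+    void sendUpstream(std::function<void (const std::string&)> on_answer) {
+        on_answer("192.0.2.1");
+    }
+
+    std::string qname_;
+};
+
+int main() {
+    QueryTask task("www.example.org");
+    task.start();
+    return 0;
+}
+----
+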
+[NOTE]
+Do not consider cross-breeding the models. That leads to space-time distortions
+and brain damage. Implementing one on top of the other is OK, but mixing them
+in the same bit of code is the way to the madhouse.
+
+Landlords and peasants
+~~~~~~~~~~~~~~~~~~~~~~
+
+In both the coroutine and event-based models, the cache and other shared
+things are easiest to imagine as objects the working threads fight over to
+hold for a short while. In this model, it is easier to imagine each such
+shared object as something owned by a landlord who doesn't let anyone else
+touch it, but to whom you can send requests.
+
+A query is an object once again, with some kind of state machine.
+
+Then there are two kinds of threads. The peasants just do the heavy work.
+There's a global work-queue for the peasants. Once a peasant is idle, it comes
+to the queue and picks up a handful of queries from there. It does as much of
+each query as possible without requiring any shared resource.
+
+The other kind, the landlords, each have a resource to watch over. So we would
+have a cache (or several parts of cache), the sockets for accepting queries and
+answering them, possibly more. Each of these would have a separate landlord
+thread and a queue of tasks to do on the resource (look up something, send an
+answer...).
+
+Similarly, the landlord would take a handful of tasks from its queue and start
+handling them. It would possibly produce some more tasks for the peasants.
+
+The point here is, all the synchronisation is done on the queues, not on the
+shared resources themselves. And we would append to the queues only once the
+whole batch was completed. By tweaking the size of the batch, we could balance
+the lock contention, throughput and RTT. The append/remove would be a quick
+operation, and the cost of the locks would amortise over the larger number of
+queries handled per lock operation.
+
+The possible downside is, a query needs to travel across several threads during
+its lifetime. It might turn out to be faster to move the query between cores
+than to access the cache from several threads, since the query is smaller than
+the cache, but it might be slower as well.
+
+It would be critical to make some kind of queue that is fast to append to and
+fast to take the first n items out of. Also, the tasks in the queues can be
+just abstract `boost::function<void (Worker&)>` functors, and each worker would
+just iterate through the queue, calling each functor. The parameter would allow
+easy generation of more tasks for other queues (they would be stored privately
+first, and appended to the remote queues at the end of the batch).
+
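+As an illustration of that queue, here is a minimal sketch, using
+`std::function` in place of the `boost::function` mentioned above; the
+`Worker` class, the batch size and the way tasks are produced are all made up
+for the example.
+
+----
+#include <deque>
+#include <functional>
+#include <mutex>
+#include <vector>
+
+class Worker;                                  // a peasant or a landlord
+typedef std::function<void (Worker&)> Task;    // one unit of work
+
+// A queue that is locked only once per batch, both when appending a whole
+// batch of new tasks and when taking the first n tasks out.
+class TaskQueue {
+public:
+    void appendBatch(std::vector<Task>& batch) {
+        std::lock_guard<std::mutex> guard(lock_);
+        tasks_.insert(tasks_.end(), batch.begin(), batch.end());
+        batch.clear();
+    }
+    std::vector<Task> takeBatch(size_t n) {
+        std::lock_guard<std::mutex> guard(lock_);
+        if (n > tasks_.size()) {
+            n = tasks_.size();
+        }
+        std::vector<Task> result(tasks_.begin(), tasks_.begin() + n);
+        tasks_.erase(tasks_.begin(), tasks_.begin() + n);
+        return result;
+    }
+private:
+    std::mutex lock_;
+    std::deque<Task> tasks_;
+};
+
+class Worker {
+public:
+    // Tasks produced while working are kept private and appended to the
+    // remote queue only once the whole batch is finished.
+    void runBatch(TaskQueue& own_queue, TaskQueue& remote_queue) {
+        std::vector<Task> batch = own_queue.takeBatch(16);
+        for (size_t i = 0; i < batch.size(); ++i) {
+            batch[i](*this);                   // may add to produced_
+        }
+        remote_queue.appendBatch(produced_);
+    }
+    std::vector<Task> produced_;
+};
+
+int main() {
+    TaskQueue peasant_queue, landlord_queue;
+
+    std::vector<Task> initial;
+    initial.push_back([](Worker& w) {
+        // Peasant work; afterwards ask the landlord (cache) for a lookup.
+        w.produced_.push_back([](Worker&) { /* cache lookup would go here */ });
+    });
+    peasant_queue.appendBatch(initial);
+
+    Worker peasant;
+    peasant.runBatch(peasant_queue, landlord_queue);
+    return 0;
+}
+----
+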
+[NOTE]
+This model would work only with threads, not processes.
+
+TODO: The shared caches