
[2775] Possible approaches to threads

Michal 'vorner' Vaner 12 years ago
parent commit a756e890b5
1 changed file with 219 additions and 15 deletions

+ 219 - 15
doc/design/resolver/01-scaling-across-cores

@@ -1,7 +1,9 @@
-01-scaling-across-cores
+Scaling across (many) cores
+===========================
+
+Problem statement
+-----------------
 
-Introduction
-------------
 The general issue is how to ensure that the resolver scales.
 
 Currently resolvers are CPU bound, and it seems likely that both
@@ -10,17 +12,8 @@ scaling will need to be across multiple cores.
 
 How can we best scale a recursive resolver across multiple cores?
 
-Some possible solutions:
-
-a. Multiple processes with independent caches
-b. Multiple processes with shared cache
-c. A mix of independent/shared cache
-d. Thread variations of the above
-
-All of these may be complicated by NUMA architectures (with
-faster/slower access to specific RAM).
-
-How does resolution look like:
+What resolution looks like
+--------------------------
 
                                Receive the query. @# <------------------------\
                                        |                                      |
@@ -56,7 +49,218 @@ How does resolution look like:
                                         v                                     |
                                    Send answer # -----------------------------/
 
-Legend:
+This is a simplified version, however. There may be other tasks (validation,
+for example) which are not drawn, mostly for simplicity, as they don't
+introduce new problems. Validation would be done as part of some computational
+task and could do more lookups in the cache or more upstream queries.
+
+Also, multiple queries may generate the same upstream query, so they should be
+aggregated somehow.
+
+Legend
+~~~~~~
  * $ - CPU intensive
  * @ - Waiting for external event
  * # - Possible interaction with other tasks
+
+Goals
+-----
+ * Run the CPU intensive tasks in multiple threads to allow concurrency.
+ * Minimise waiting for locks.
+ * Don't require too much memory.
+ * Minimise the number of upstream queries (both because they are slow and
+   expensive and because we don't want to eat too much bandwidth or spam the
+   authoritative servers).
+ * A design simple enough that it can be implemented.
+
+Naïve version
+-------------
+
+Let's look at possible approaches and list their pros and cons. Many of the
+simple versions would not really work, but let's have a look at them anyway,
+because thinking about them might suggest solutions for the real versions.
+
+We take one query, handle it fully, with blocking waits for the answers. After
+this is done, we take another. The cache is private to each process.
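+
+Purely for illustration, the loop might look like the following sketch (the
+types and calls are hypothetical placeholders, not existing code):
+
+[source,cpp]
+----
+// Naive worker: one query at a time, every wait blocks, cache is private.
+void worker_loop(Cache& private_cache) {
+    for (;;) {
+        Query q = receive_query();          // blocks until a query arrives
+        Answer a;
+        if (!private_cache.lookup(q, a)) {  // cache miss
+            a = ask_upstream(q);            // blocks on the network
+            private_cache.store(q, a);
+        }
+        send_answer(q, a);
+    }
+}
+----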
+
+Advantages:
+
+ * Very simple.
+ * No locks.
+
+Disadvantages:
+
+ * To scale across cores, we need to run *a lot* of processes, since they'd be
+   waiting for something most of the time. That means a lot of memory eaten,
+   because each one has its own cache. Also, running so many processes may be
+   problematic; processes are not very cheap.
+ * Many things would be asked multiple times, because the caches are not
+   shared.
+
+Threads
+~~~~~~~
+
+Some of the problems could be solved by using threads, but that would not
+improve things much, since threads are not really cheap either (starting
+several hundred threads might not be a good idea).
+
+Also, threads bring other problems. Even when we still assume separate caches
+(for caches, see below), we need to ensure safe access to logging,
+configuration, the network, etc. These could become a bottleneck (e.g. if we
+take a lock every time we read a packet from the network, then with many
+threads they'll just fight over the lock).
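+
+To make the contention point concrete, a tiny hypothetical sketch (the socket
+type and `read_packet` call are invented for the example):
+
+[source,cpp]
+----
+#include <mutex>
+
+std::mutex recv_mutex;  // shared by all worker threads
+
+// Every thread serialises here, so with many threads the lock, not the CPU,
+// becomes the limit.
+Query receive_query(Socket& shared_socket) {
+    std::lock_guard<std::mutex> guard(recv_mutex);
+    return shared_socket.read_packet();
+}
+----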
+
+Supercache
+~~~~~~~~~~
+
+The problem with the cache could be solved by placing a ``supercache'' between
+the resolvers and the Internet. That one would do almost no processing; it
+would just take the query, look it up in the cache and either answer from the
+cache or forward the query to the external world. It would store the answer
+and forward it back.
+
+The cache, if single-threaded, could be a bottleneck. There are several
+possible approaches to solve that:
+
+Layered cache::
+  Each process has its own small cache, which catches many queries. Then, a
+  group of processes shares another level of bigger cache, which catches most
+  of the queries that get past the private caches. We group them further and
+  each level handles fewer queries from each process, so they can keep up.
+  However, with each level, we add some overhead to do another lookup.
+Segmented cache::
+  We have several caches of the same level, in parallel. When we ask a cache,
+  we hash the query and decide which cache to ask by the hash. Only that cache
+  would have the answer, if any, and each cache could run in a separate
+  process (see the sketch after this list). The only problem is, could there
+  be a pattern of queries that would skew to use only one cache while the rest
+  stay idle?
+Shared cache access::
+  A cache would be accessed by multiple processes/threads. See below for
+  details, but there's a risk of lock contention on the cache (it depends on
+  the data structure).
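+
+As a rough illustration of the segmented approach, a small sketch of how a
+query might be mapped to a cache segment (the hash mixing and the function
+name are made up for the example):
+
+[source,cpp]
+----
+#include <cstdint>
+#include <functional>
+#include <string>
+
+// Pick the cache segment for a query by hashing its name and type. Only that
+// segment ever stores or answers the query, so segments need no shared lock.
+size_t segment_for(const std::string& qname, uint16_t qtype, size_t segments) {
+    size_t h = std::hash<std::string>()(qname) ^ (qtype * 0x9e3779b9u);
+    return h % segments;
+}
+----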
+
+Upstream queries
+~~~~~~~~~~~~~~~~
+
+Before doing an upstream query, we look into the cache to ensure we don't have
+the information yet. When we get the answer, we want to update the cache.
+
+This suggests the upstream queries are tightly coupled with the cache. Now,
+when we have several cache processes/threads, each can have a set of open
+sockets, not shared with the other caches, to do the lookups. This way we can
+avoid locking the upstream network communication.
+
+Also, we can have three conceptual states for data in cache, and act
+differently when it is requested.
+
+Present::
+  If it is available, in positive or negative form, we just provide the answer
+  right away.
+Not present::
+  The continuation of processing is queued somehow (blocked/callback is
+  stored/whatever). An upstream query is sent and we get to the next state.
+Waiting for answer::
+  If another query for the same thing arrives, we just queue it the same way
+  and keep waiting. When the answer comes, all the queued tasks are resumed.
+  If the TTL > 0, we store the answer and set it to ``present''.
+
+We want to do aggregation of upstream queries anyway; using the cache for it
+saves some more processing and possibly some locks.
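+
+A rough sketch of such a cache entry with its three states (the `Answer` type
+and `send_upstream_query` call are hypothetical; single-threaded for
+simplicity):
+
+[source,cpp]
+----
+#include <functional>
+#include <vector>
+
+// One cache entry: either we have an answer, or we are waiting for one and
+// queue every task that asks for it in the meantime.
+struct CacheEntry {
+    enum State { NOT_PRESENT, WAITING, PRESENT } state = NOT_PRESENT;
+    Answer answer;                                   // valid when PRESENT
+    std::vector<std::function<void(const Answer&)>> waiters;
+
+    // Called when a task needs this record.
+    void request(std::function<void(const Answer&)> resume) {
+        switch (state) {
+        case PRESENT:     resume(answer); break;             // answer now
+        case WAITING:     waiters.push_back(resume); break;  // just queue it
+        case NOT_PRESENT: waiters.push_back(resume);
+                          state = WAITING;
+                          send_upstream_query();             // hypothetical
+                          break;
+        }
+    }
+
+    // Called when the upstream answer arrives.
+    void answered(const Answer& a, bool cacheable) {
+        if (cacheable) { answer = a; state = PRESENT; }
+        else           { state = NOT_PRESENT; }
+        for (auto& w : waiters) w(a);                        // resume tasks
+        waiters.clear();
+    }
+};
+----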
+
+Multiple parallel queries
+-------------------------
+
+It seems obvious we can't afford to have a thread or process for each
+outstanding query. We need to handle multiple queries in each one at any given
+time.
+
+Coroutines
+~~~~~~~~~~
+
+The OS-level threads might be too expensive, but coroutines might be cheap
+enough. That way, we could still write code that is easy to read, but limit
+the number of OS threads to a reasonable number.
+
+In this model, when a query comes, a new coroutine/user-level thread is created
+for it. We use special reads and writes whenever there's an operation that
+could block. These reads and writes would internally schedule the operation
+and switch to another coroutine (if there's any ready to be executed).
+
+Each thread/process maintains its own set of coroutines and they do not
+migrate. This way, the queue of coroutines is kept lock-less, as well as any
+private caches. Only the shared caches are protected by a lock.
+
+[NOTE]
+The `coro` unit we have in the current code is *not* considered a coroutine
+library here. We would need a coroutine library where each coroutine has a
+real stack and we switch the stacks on a coroutine switch. That is possible
+with a reasonable amount of dark magic (see `ucontext.h`, for example, but
+there are surely some higher-level libraries for that).
+
+There is some trouble with multiple coroutines waiting on the same event, like
+the same upstream query (possibly even coroutines from different threads), but
+it should be possible to solve.
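+
+For a taste of the stack switching involved, a self-contained `ucontext.h`
+toy (one coroutine and a trivial scheduler; an illustration, not proposed
+code):
+
+[source,cpp]
+----
+#include <ucontext.h>
+#include <iostream>
+#include <vector>
+
+static ucontext_t scheduler_ctx, query_ctx;
+
+// The "query" coroutine: yields back to the scheduler while it would be
+// waiting for an upstream answer.
+void handle_query() {
+    std::cout << "query: cache miss, waiting for upstream -> yield\n";
+    swapcontext(&query_ctx, &scheduler_ctx);
+    std::cout << "query: answer arrived, sending response\n";
+}
+
+int main() {
+    std::vector<char> stack(64 * 1024);       // a real, separate stack
+    getcontext(&query_ctx);
+    query_ctx.uc_stack.ss_sp = stack.data();
+    query_ctx.uc_stack.ss_size = stack.size();
+    query_ctx.uc_link = &scheduler_ctx;       // where to go when it finishes
+    makecontext(&query_ctx, handle_query, 0);
+
+    swapcontext(&scheduler_ctx, &query_ctx);  // start the coroutine
+    std::cout << "scheduler: upstream answer came, resuming query\n";
+    swapcontext(&scheduler_ctx, &query_ctx);  // resume it
+    return 0;
+}
+----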
+
+Event-based
+~~~~~~~~~~~
+
+We use events (`asio` and the like) for writing it. Each outstanding query is
+an object with some callbacks on it. When we would do a possibly blocking
+operation, we schedule a callback to happen once the operation finishes.
+
+This is more lightweight than the coroutines (the query objects will be
+smaller than the stacks for coroutines), but it is harder to write and read.
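+
+A rough sketch of the shape of such a query object (the async cache and
+upstream calls are hypothetical placeholders, not real `asio` API):
+
+[source,cpp]
+----
+#include <functional>
+
+// One outstanding query: instead of blocking, every step schedules the next
+// one as a callback that runs once the operation finishes.
+struct QueryTask {
+    Query query;
+
+    void start(Cache& cache) {
+        cache.async_lookup(query, [this](bool hit, const Answer& answer) {
+            if (hit) {
+                send_answer(answer);
+            } else {
+                async_ask_upstream(query, [this](const Answer& answer) {
+                    send_answer(answer);
+                });
+            }
+        });
+    }
+
+    void send_answer(const Answer& answer);  // would write to the client
+};
+----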
+
+[NOTE]
+Do not consider cross-breeding the models. That leads to space-time distortions
+and brain damage. Implementing one on top of the other is OK, but mixing them
+in the same bit of code is the way to the madhouse.
+
+Landlords and peasants
+~~~~~~~~~~~~~~~~~~~~~~
+
+In both the coroutine and event-based models, the cache and other shared
+things are easiest to imagine as objects the working threads fight over to
+hold for a short while. In this model, each such shared object is instead
+owned by a landlord that doesn't let anyone else touch it, but you can send
+requests to it.
+
+A query is an object once again, with some kind of state machine.
+
+Then there are two kinds of threads. The peasants just do the heavy work.
+There's a global work-queue for peasants. Once a peasant is idle, it comes to
+the queue and picks up a handful of queries from there. It does as much of
+each query as possible without requiring any shared resource.
+
+The other kind, the landlords, each have a resource to watch over. So we would
+have a cache (or several parts of the cache), the sockets for accepting
+queries and answering them, possibly more. Each of these would have a separate
+landlord thread and a queue of tasks to do on the resource (look up something,
+send an answer...).
+
+Similarly, the landlord would take a handful of tasks from its queue and start
+handling them. It would possibly produce some more tasks for the peasants.
+
+The point here is, all the synchronisation is done on the queues, not on the
+shared resources themselves. And we would append to a queue only once the
+whole batch was completed. By tweaking the size of the batch, we could balance
+the lock contention, throughput and RTT. The append/remove would be a quick
+operation, and the cost of locks would amortise over the larger number of
+queries handled per lock operation.
+
+The possible downside is, a query needs to travel across several threads
+during its lifetime. It might turn out it is faster to move the query between
+cores than to access the cache from several threads, since the query is
+smaller, but it might be slower as well.
+
+It would be critical to have some kind of queue that is fast to append to and
+fast to take the first n items out of. Also, the tasks in the queues can be
+just abstract `boost::function<void (Worker&)>` functors, and each worker
+would just iterate through the queue, calling each functor. The parameter
+would allow easy generation of more tasks for other queues (they would be
+stored privately first, and appended to the remote queues at the end of the
+batch; see the sketch below).
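+
+A sketch of what such a batched queue could look like (written with the
+`std::` equivalents of the functor above; `Worker` and the batch size are
+left abstract):
+
+[source,cpp]
+----
+#include <deque>
+#include <functional>
+#include <mutex>
+#include <vector>
+
+class Worker;
+typedef std::function<void(Worker&)> Task;
+
+// Tasks are appended and taken out in batches, so the lock is paid once per
+// batch instead of once per task.
+class TaskQueue {
+public:
+    // Append a whole privately collected batch at once, then clear it.
+    void append_batch(std::vector<Task>& batch) {
+        std::lock_guard<std::mutex> guard(mutex_);
+        for (size_t i = 0; i < batch.size(); ++i)
+            tasks_.push_back(batch[i]);
+        batch.clear();
+    }
+
+    // Take out up to n tasks for one worker to chew through.
+    std::vector<Task> take_batch(size_t n) {
+        std::lock_guard<std::mutex> guard(mutex_);
+        std::vector<Task> batch;
+        while (!tasks_.empty() && batch.size() < n) {
+            batch.push_back(tasks_.front());
+            tasks_.pop_front();
+        }
+        return batch;
+    }
+
+private:
+    std::mutex mutex_;
+    std::deque<Task> tasks_;
+};
+----
+
+A worker would call `take_batch`, run each task (each task may push new tasks
+into a private vector), and only then `append_batch` those to the other
+queues.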
+
+[NOTE]
+This model would work only with threads, not processes.
+
+TODO: The shared caches