Scaling across (many) cores
===========================

Problem statement
-----------------

The general issue is how to ensure that the resolver scales.

Currently resolvers are CPU bound, and it seems likely that neither
instructions-per-cycle nor CPU frequency will increase radically, so the
scaling will need to happen across multiple cores.

How can we best scale a recursive resolver across multiple cores?

Image of how resolution looks
-----------------------------

Receive the query. @# <---------------------------------------------------------\
        |                                                                        |
        v                                                                        |
Parse it, etc. $                                                                 |
        |                                                                        |
        v                                                                        |
Look into the cache. $#                                                          |
Cry <---- No <---------- Is it there? -----------> Yes ---------\               |
 |                            ^                                 |               |
 v                            |                                 |               |
Prepare upstream query $      |                                 |               |
 |                            |                                 |               |
 v                            |                                 |               |
Send an upstream query (#)    |                                 |               |
 |                            |                                 |               |
 v                            |                                 |               |
Wait for answer @(#)          |                                 |               |
 |                            |                                 |               |
 v                            |                                 |               |
Parse $                       |                                 |               |
 |                            |                                 |               |
 v                            |                                 |               |
Is it enough? $ ----> No -----/                                 |               |
 |                                                              |               |
Yes                                                             |               |
 |                                                              |               |
 \------------------------> Build answer $ <--------------------/               |
                                 |                                              |
                                 v                                              |
Send answer # -------------------------------------------------------------------/

This is a simplified version, however. There may be other tasks (validation,
for example) which are not drawn, mostly for simplicity, since they don't
introduce new problems: the validation would be done as part of some
computational task and could trigger more lookups in the cache or more
upstream queries.

Also, multiple client queries may generate the same upstream query, so those
should be aggregated together somehow.

Legend
~~~~~~

* $ - CPU intensive
* @ - Waiting for external event
* # - Possible interaction with other tasks

Goals
-----

* Run the CPU intensive tasks in multiple threads to allow concurrency.
* Minimize waiting for locks.
* Don't require too much memory.
* Minimize the number of upstream queries (both because they are slow and
  expensive and because we don't want to eat too much bandwidth or spam the
  authoritative servers).
* A design simple enough that it can actually be implemented.

Naïve version
-------------

Let's look at possible approaches and list their pros and cons. Many of the
simple versions would not really work, but let's have a look at them anyway,
because thinking about them might suggest solutions for the real versions.

We take one query, handle it fully with blocking waits for the answers, and
only then take another. The cache is private to each process.

Advantages:

* Very simple.
* No locks.

Disadvantages:

* To scale across cores, we need to run *a lot* of processes, since they'd be
  waiting for something most of the time. That means a lot of memory consumed,
  because each one has its own cache. Also, running so many processes may be
  problematic; processes are not exactly cheap.
* Many things would be asked multiple times, because the caches are not
  shared.

Threads
~~~~~~~

Some of the problems could be solved by using threads instead, but that would
not improve things much, since threads are not really cheap either (starting
several hundred threads is probably not a good idea).

Also, threads bring other problems. Even when we still assume separate caches
(for shared caches, see below), we need to ensure safe access to logging,
configuration, the network, etc. These could become a bottleneck (e.g. if we
take a lock every time we read a packet from the network, then with many
threads they'll just fight over the lock).

Supercache
~~~~~~~~~~

The problem with the cache could be solved by placing a ``supercache'' between
the resolvers and the Internet. That one would do almost no processing: it
would just take the query, look it up in the cache and either answer from the
cache or forward the query to the external world, storing the answer and
forwarding it back.

The cache, if single-threaded, could be a bottleneck. There are several
possible approaches to that:

Layered cache::
  Each process has its own small cache, which catches many queries. Then a
  group of processes shares another level of bigger cache, which catches most
  of the queries that get past the private caches. We group these further, so
  each level handles fewer queries from each process and can keep up. However,
  each level adds some overhead for the extra lookup.
Segmented cache::
  We have several caches of the same level, in parallel. When we want to ask a
  cache, we hash the query and use the hash to decide which cache to ask (see
  the sketch after this list). Only that cache would hold the answer, if any,
  and each could run in a separate process. The only problem is, could there
  be a pattern of queries that would skew the load to only one cache while the
  rest sat idle?
Shared cache access::
  A cache would be accessed by multiple processes/threads. See below for
  details, but there's a risk of lock contention on the cache (it depends on
  the data structure).
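
For the segmented variant, the routing itself is cheap. A minimal sketch of
what it could look like (the shard count and the use of `std::hash` are
illustrative assumptions, not measured choices):

[source,cpp]
----
#include <cstdint>
#include <functional>
#include <string>

// Hypothetical routing: hash the query name and type and pick one of a few
// parallel caches.  Each shard could live in its own thread or process.
const std::size_t kShards = 8;   // assumed number of parallel caches

std::size_t shard_for(const std::string& qname, uint16_t qtype) {
    // If real query patterns turn out to be skewed, a stronger mix of the
    // name and type may be needed here.
    std::size_t h = std::hash<std::string>()(qname) ^ (qtype * 0x9e3779b9u);
    return h % kShards;
}
----

Whether a real query mix spreads evenly over such a hash is exactly the open
question raised above.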

Upstream queries
~~~~~~~~~~~~~~~~

Before sending an upstream query, we look into the cache to make sure we don't
have the information already, and when we get the answer, we want to update
the cache. This suggests the upstream queries are tightly coupled with the
cache. Now, when we have several cache processes/threads, each can keep its
own set of open sockets, not shared with the other caches, to do the lookups.
This way we can avoid locking the upstream network communication.

Also, we can have three conceptual states for data in the cache, and act
differently when it is requested:

Present::
  If it is available, in positive or negative form, we just provide the answer
  right away.
Not present::
  The continuation of processing is queued somehow (blocked/callback is
  stored/whatever). An upstream query is sent and we move to the next state.
Waiting for answer::
  If another query for the same thing arrives, we just queue it the same way
  and keep waiting. When the answer comes, all the queued tasks are resumed.
  If the TTL > 0, we store the answer and set the state to ``present''.

We want to aggregate upstream queries anyway, and using the cache for it saves
some more processing and possibly some locks. A sketch of these states
follows.
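
A minimal sketch of what such a cache entry could look like (the names and the
string-typed answer are illustrative only, not an existing interface):

[source,cpp]
----
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical cache entry implementing the three conceptual states.
struct CacheEntry {
    enum State { NOT_PRESENT, WAITING_FOR_ANSWER, PRESENT };
    State state;
    std::string answer;                                   // valid when PRESENT
    std::vector<std::function<void(const std::string&)> > waiters;
    CacheEntry() : state(NOT_PRESENT) {}
};

// Called for every query that needs this entry.  send_upstream is whatever
// schedules the real upstream query; resume continues the original query.
void lookup(CacheEntry& e, std::function<void(const std::string&)> resume,
            std::function<void()> send_upstream) {
    switch (e.state) {
    case CacheEntry::PRESENT:
        resume(e.answer);                   // answer right away
        break;
    case CacheEntry::NOT_PRESENT:
        e.waiters.push_back(resume);        // queue the continuation
        e.state = CacheEntry::WAITING_FOR_ANSWER;
        send_upstream();                    // exactly one upstream query
        break;
    case CacheEntry::WAITING_FOR_ANSWER:
        e.waiters.push_back(resume);        // aggregate with the running query
        break;
    }
}

// Called when the upstream answer arrives; cacheable means TTL > 0.
void answer_arrived(CacheEntry& e, const std::string& answer, bool cacheable) {
    for (std::size_t i = 0; i < e.waiters.size(); ++i) {
        e.waiters[i](answer);               // resume all queued tasks
    }
    e.waiters.clear();
    if (cacheable) {
        e.answer = answer;
        e.state = CacheEntry::PRESENT;
    } else {
        e.state = CacheEntry::NOT_PRESENT;
    }
}
----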

Multiple parallel queries
-------------------------

It seems obvious we can't afford a thread or process per outstanding query.
Each one will need to handle multiple queries at any given time.

Coroutines
~~~~~~~~~~

OS-level threads might be too expensive, but coroutines might be cheap enough.
That way we could still write code that is easy to read, while limiting the
number of OS threads to a reasonable number.

In this model, when a query comes in, a new coroutine/user-level thread is
created for it. We use special reads and writes whenever there's an operation
that could block. These reads and writes internally schedule the operation and
switch to another coroutine (if there's any ready to be executed).

Each thread/process maintains its own set of coroutines and they do not
migrate. This way, the queue of coroutines is kept lock-less, as are any
private caches. Only the shared caches are protected by a lock.

[NOTE]
The `coro` unit we have in the current code is *not* considered a coroutine
library here. We would need a coroutine library where each coroutine has a
real stack of its own and the stacks are switched on a coroutine switch. That
is possible with a reasonable amount of dark magic (see `ucontext.h`, for
example, but there are surely higher-level libraries for that).

There is some trouble with multiple coroutines waiting on the same event, such
as the same upstream query (possibly even coroutines from different threads),
but it should be possible to solve.
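
For illustration, a minimal sketch of the kind of stack switching meant here,
using plain `ucontext.h` (a real coroutine library would wrap this; the stack
size and the names are arbitrary assumptions):

[source,cpp]
----
#include <cstdio>
#include <ucontext.h>

static ucontext_t main_ctx, coro_ctx;
static char coro_stack[64 * 1024];   // dedicated stack for the coroutine

// The coroutine body: instead of blocking on the upstream answer, it yields
// back to the scheduler (the main context) and is resumed later.
static void coroutine_body() {
    std::printf("coroutine: sending upstream query\n");
    swapcontext(&coro_ctx, &main_ctx);        // "block": yield to the scheduler
    std::printf("coroutine: answer arrived, resuming\n");
}

int main() {
    getcontext(&coro_ctx);
    coro_ctx.uc_stack.ss_sp = coro_stack;
    coro_ctx.uc_stack.ss_size = sizeof(coro_stack);
    coro_ctx.uc_link = &main_ctx;             // where to go when the body returns
    makecontext(&coro_ctx, coroutine_body, 0);

    swapcontext(&main_ctx, &coro_ctx);        // run until the coroutine "blocks"
    std::printf("scheduler: other coroutines could run here\n");
    swapcontext(&main_ctx, &coro_ctx);        // resume once the "answer" is ready
    return 0;
}
----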

Event-based
~~~~~~~~~~~

We use events (`asio` and friends) to write it. Each outstanding query is an
object with some callbacks on it. When we do a possibly blocking operation, we
schedule a callback to happen once the operation finishes.

This is more lightweight than the coroutines (the query objects will be
smaller than coroutine stacks), but it is harder to write and to read.
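
A rough sketch of the shape such a query object might take (all the names here
are made up for illustration; the real thing would presumably hook the
callbacks into `asio` handlers):

[source,cpp]
----
#include <iostream>
#include <string>

// Hypothetical outstanding-query object: just some state plus callbacks,
// no stack of its own.
struct QueryTask {
    std::string qname;
    std::string answer;

    // Entry point: on a cache miss, schedule the upstream query and register
    // a callback instead of blocking.
    template <typename Cache, typename Upstream>
    void start(Cache& cache, Upstream& upstream) {
        if (cache.lookup(qname, answer)) {
            sendAnswer();                      // fast path, straight from cache
        } else {
            upstream.asyncQuery(qname, [this](const std::string& a) {
                answer = a;                    // invoked later by the event loop
                sendAnswer();
            });
        }
    }

    void sendAnswer() { std::cout << qname << " -> " << answer << "\n"; }
};
----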

[NOTE]
Do not consider cross-breeding the models. That leads to space-time
distortions and brain damage. Implementing one on top of the other is OK, but
mixing them in the same bit of code is the way to the madhouse.

Landlords and peasants
~~~~~~~~~~~~~~~~~~~~~~

In both the coroutine and event-based models, the cache and other shared
things are easiest to imagine as objects the working threads fight over to
hold for a short while. In this model, it is easier to imagine each such
shared object as something owned by a landlord who doesn't let anyone else
touch it, but to whom you can send requests.

A query is an object once again, with some kind of state machine.

Then there are two kinds of threads. The peasants are there just to do the
heavy work. There's a global work-queue for peasants; once a peasant is idle,
it comes to the queue and picks up a handful of queries from it. It does as
much on each query as possible without requiring any shared resource.

The other kind, the landlords, each have a resource to watch over. So we would
have a cache (or several parts of a cache), the sockets for accepting queries
and answering them, and possibly more. Each of these would have a separate
landlord thread and a queue of tasks to perform on the resource (look up
something, send an answer, ...).

Similarly, a landlord would take a handful of tasks from its queue and start
handling them, possibly producing some more tasks for the peasants.

The point here is that all the synchronisation is done on the queues, not on
the shared resources themselves. And we would append to a queue only once the
whole batch was completed. By tweaking the size of the batch, we could balance
lock contention, throughput and RTT. The append/remove would be a quick
operation, and the cost of the locks would amortize over the larger number of
queries handled per lock operation.

The possible downside is that a query needs to travel across several threads
during its lifetime. It might turn out that it is faster to move the query
between cores than to access the cache from several threads, since the query
is smaller, but it might be slower as well.

It would be critical to have some kind of queue that is fast to append to and
fast to take the first n items out of. The tasks in the queues can be just
abstract `boost::function<void (Worker&)>` functors, and each worker would
simply iterate through its batch, calling each functor. The parameter allows
easy generation of more tasks for other queues (they would be stored privately
first and appended to the remote queues at the end of the batch).
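
A minimal sketch of such a batched queue (`std::function` and `std::mutex` are
used here so the example stands alone; the `boost::function<void (Worker&)>`
functors suggested above would work the same way):

[source,cpp]
----
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <vector>

class Worker;   // a peasant or landlord thread

// Hypothetical batched task queue: the lock is taken once per batch of tasks,
// not once per task, which is the whole point of the model.
class TaskQueue {
public:
    typedef std::function<void(Worker&)> Task;

    // Called by another thread at the end of its batch: one lock, many tasks.
    void appendBatch(std::vector<Task>& tasks) {
        std::lock_guard<std::mutex> guard(lock_);
        for (std::size_t i = 0; i < tasks.size(); ++i) {
            queue_.push_back(tasks[i]);
        }
        tasks.clear();
    }

    // Called by the owning thread: take out the first n tasks in one go.
    std::vector<Task> takeBatch(std::size_t n) {
        std::lock_guard<std::mutex> guard(lock_);
        std::vector<Task> batch;
        while (n-- > 0 && !queue_.empty()) {
            batch.push_back(queue_.front());
            queue_.pop_front();
        }
        return batch;
    }

private:
    std::mutex lock_;
    std::deque<Task> queue_;
};
----

A worker's main loop would then be roughly ``take a batch, run each functor
with itself as the argument, collect any tasks generated for other queues
locally, and append them to the remote queues only after the whole batch is
done''.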

Also, if we wanted to generate multiple parallel upstream queries from a
single query, we would need to be careful. A query object would have no lock
on itself, and the upstream queries could end up in different caches/threads.
To protect the original query, we would add another landlord that aggregates
the answers together and lets the query continue processing once it has enough
of them. That way the answers are all pushed to the same thread and cannot
fight over the query.

[NOTE]
This model would work only with threads, not processes.

Shared caches
-------------

While it seems good to have some sort of L1 cache with pre-rendered answers
(according to the measurements in ticket #2777), we probably need some kind of
larger shared cache as well.

If we had just a single shared cache protected by a lock, there'd be a lot of
contention on that lock.

Partitioning the cache
~~~~~~~~~~~~~~~~~~~~~~

We split the cache into parts, either into layers or into parallel chunks
selected by a hash. Taken to the extreme, a lock on each hash bucket would be
this kind of partitioning, though that might waste resources (how expensive is
it to create a lock?).
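
A small sketch of the in-between option, one lock per stripe of buckets rather
than one per bucket or one global lock (the stripe count and the string-typed
values are illustrative assumptions):

[source,cpp]
----
#include <cstddef>
#include <functional>
#include <mutex>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical striped cache: each key maps to one stripe, and only that
// stripe's mutex is taken, so unrelated lookups don't contend.
class StripedCache {
public:
    explicit StripedCache(std::size_t stripes = 64)
        : locks_(stripes), maps_(stripes) {}

    bool get(const std::string& key, std::string& value) {
        const std::size_t i = stripeFor(key);
        std::lock_guard<std::mutex> guard(locks_[i]);
        auto it = maps_[i].find(key);
        if (it == maps_[i].end()) {
            return false;
        }
        value = it->second;
        return true;
    }

    void put(const std::string& key, const std::string& value) {
        const std::size_t i = stripeFor(key);
        std::lock_guard<std::mutex> guard(locks_[i]);
        maps_[i][key] = value;
    }

private:
    std::size_t stripeFor(const std::string& key) const {
        return std::hash<std::string>()(key) % maps_.size();
    }

    std::vector<std::mutex> locks_;   // mutexes are small, but not free
    std::vector<std::unordered_map<std::string, std::string> > maps_;
};
----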

Landlords
~~~~~~~~~

With landlords, the synchronization is done by the landlord threads
themselves. Still, the cache would need to be partitioned.

RCU
~~~

RCU (read-copy-update) is a lock-less synchronization mechanism. An item is
accessed through a pointer. An updater creates a copy of the structure (in our
case, it would be the content of a single hash bucket) and then atomically
replaces the pointer. Readers that arrived before the swap keep the old
version, new ones get the new version. When all the old readers are gone, the
old copy is reclaimed. Also, the reclamation can AFAIK be postponed to a time
when we are slightly more idle, or to a different thread.

We could use it for the cache ‒ on the fast path, we would just read the
cache. On the slow path, we would have to wait in a queue to do the update, in
a single updater thread (because we don't really want to be updating the same
cell twice at the same time).
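
A rough approximation of the idea using `shared_ptr` reference counting in
place of real RCU grace periods (the bucket type is made up; a genuine RCU
implementation would avoid even the atomic reference count on the read path):

[source,cpp]
----
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical content of a single hash bucket, published via a pointer.
struct Bucket {
    std::vector<std::pair<std::string, std::string> > entries;
};

class RcuishBucket {
public:
    RcuishBucket() : current_(std::make_shared<Bucket>()) {}

    // Fast path: a reader grabs the current snapshot and keeps seeing it,
    // even if an update is published while it is still reading.
    std::shared_ptr<const Bucket> read() const {
        return std::atomic_load(&current_);
    }

    // Slow path, called from the single updater thread only:
    // copy the bucket, modify the copy, atomically replace the pointer.
    // The old copy is freed once the last reader drops its snapshot.
    void update(const std::string& key, const std::string& value) {
        std::shared_ptr<Bucket> old = std::atomic_load(&current_);
        std::shared_ptr<Bucket> copy = std::make_shared<Bucket>(*old);
        copy->entries.push_back(std::make_pair(key, value));
        std::atomic_store(&current_, copy);
    }

private:
    std::shared_ptr<Bucket> current_;
};
----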

Proposals
---------

In any case, we would have some kind of L1 cache with pre-rendered answers.
For these proposals (except the third), it doesn't matter whether we split the
shared cache into parallel chunks or into layers.

Hybrid RCU/Landlord
~~~~~~~~~~~~~~~~~~~

The landlord approach, except that read-only accesses to the cache are done
directly by the peasants. Only if they don't find what they want do they
append the task to the queue of the landlord. The landlord would be doing the
RCU updates. It could happen that, by the time the landlord gets to the task,
the answer is already there, but that would not matter much.

Access to the network would be from the landlords.

Coroutines+RCU
~~~~~~~~~~~~~~

We would use the coroutines, and reads from the shared cache would go without
locking. When doing a write, we would have to lock.

To avoid locking, each worker thread would have its own set of upstream
sockets, and we would dup the sockets from the users, so we don't have to lock
those either.

Multiple processes with coroutines and RCU
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This would need the layered cache. The upper caches would be mapped into local
memory for read-only access. Each cache would be a separate process; the
process would do the updates ‒ if the answer was not there, the process would
be asked, by some kind of IPC, to pull it from the upstream cache or from the
network.