tl;dr AdRoll’s real-time bidding system produces tens of billions events daily. We store these events in TrailDBs , a fast embedded database that we recently open-sourced . By leveraging a cool new feature in the Linux kernel , you can now query TrailDBs directly in Amazon S3 without having to store the database locally, like here:
Tens of Trillions of Events in Amazon S3
We have been very happy with our data processing pipeline , built on top of Docker, Luigi, Spot Instances, and Amazon S3. It allows us to process hundreds of terabytes of data stored in TrailDBs cost-efficiently with low operational overhead.
A typical job at AdRoll downloads a set of TrailDB shards from S3, processes them, and pushes results back to S3. This model works fine for batch jobs that need to query a large amount of data, similar to a full table scan in RDBMS. Over time, our TrailDB shards have grown in size (more rows) and they include a wider variety of information (more columns). Consequently, an increasing number of queries need to access only a subset of rows and columns.
Having to download gigabytes of unneeded data to find the proverbial needle in a haystack, expressed as a highly selective query, is quite expensive. A more efficient solution would download only the specific bytes that matter for the query. Traditional data warehouses address this issue by carefully managing both storage and query optimization. In contrast, our TrailDB-on-AWS architecture unbundles storage from querying for ease of scalability and operation, similar to many other big data systems.
However, we can still leverage a traditional database technique to solve the problem, namely page management . Instead of downloading a full shard, we can fetch only the pages that are required to execute the query.
User-Space Page Fault Handling
Cache and page managers are very non-trivial subsystems of modern databases . The topic is an inherently complex one but it is further complicated by the complex interplay between the kernel and the database . After all, the kernel has a very non-trivial subsystem of its own for page management that doesn’t always work in perfect harmony with its database counterpart.
It is a valid question whether the database even needs a virtual memory system of its own, instead of just letting the kernel take care of the job . In practice, it is not easy to make the kernel-managed paging perform as well as a custom subsystem given how much the database knows about the data and its access patterns in contrast to the kernel.
Enter a very promising new feature in the Linux kernel, user-space page fault handling . This feature allows a user-space application, such as the TrailDB library, to handle page faults, that is, a certain piece of data being missing from the process’ address space, as it best sees fit. In particular, this feature allows us to fetch page-size (4KB) chunks of data from S3 with practically no impact to the existing codebase.
While user-space page fault handling might not be flexible enough yet to replace Postgres’ venerable page management subsystem, it allowed us to implement this key piece of functionality in TrailDB in less than 2000 lines of code while working harmoniously with the kernel. As demonstrated by this case, it is extremely powerful to be able to embed application-specific policies in kernel mechanisms that take care of heavy lifting such as page fault handling, and, in other cases, file systems , and packet processing .
The API for user-space page fault handling is surprisingly straightforward. If you are curious, you can try userfaultfd() yourself by following this hello world tutorial .
TrailDB External Data Architecture
The solution implemented in TrailDB is fully generic and allows data to be fetched from any source, not just Amazon S3. The block servers that handle interfacing with the data sources are pluggable and implemented outside the TrailDB library, so they can be conveniently implemented in any programming language.
Interfacing with S3 is implemented by a block server written in Go, traildb-s3-server . In the diagram below, this is the “server process.” The TrailDB application, “client process,” communicates with the server using simple requests sent over TCP.
The purple box is an unmodified application using the TrailDB library. The blue boxes are responsible for paging and caching. The yellow box is an external data source, in our case Amazon S3.
Let’s walk through a page fault situation, that is, what happens when TrailDB tries to access a piece of external data like s3://traildb.io/data/wikipedia-small.tdb that is managed by the user-space page fault handler:
1. Page fault generated
If tdb_open() is called with a non-local URL, e.g. s3:// , memory regions that are normally memory-mapped with a file are mapped as MAP_ANONYMOUS and the region is registered to the user-space page fault handler.
When the region is accessed by TrailDB as usual, a page fault is generated which blocks the app. Instead of the page fault being handled by the kernel, it sends a message to the userfaultfd file descriptor indicating what address was requested.
2. User-space page fault handler triggered
When tdb_open() was called with a non-local URL, also a separate thread was launched to handle page faults. The thread uses poll() to wait for incoming messages in the userfaultfd file descriptor.
When a message is received, the virtual address requested is translated to an offset in the TrailDB file. Next, the handler checks if the offset corresponds to the latest block received from the server. If it does, we already have the required data which can be copied back to the app.
If the offset doesn’t resolve to the latest block, we must request a new block from the block server. A simple message is prepared which contains the URL of the TrailDB, an offset, and the minimum number of bytes requested, typically 4KB. This message is then sent over a TCP connection to the server.
3. Block server handles the request
The block server is expected to return bytes at the requested offset.
For efficiency reasons, our traildb-s3-server caches blocks received from S3 locally. When a request is received, first we need to check if the block exists in the local cache, possibly as a sub-range of a previous cached larger block.
source blockcache.go .
If the block is found, the server sends a response that includes a local path and an offset where the client can find the requested block. Currently we assume that both the client and the server share a common filesystem.
source s3tdb.go .
4. Block server fetches a block from S3
If the block is not found, we must fetch it from its original source, in this case from Amazon S3. This is accomplished using the standard AWS SDK for Go , which, besides downloading data, handles authentication transparently, in our case using an instance-specific IAM role.
S3 supports standard HTTP range requests, which allows us to request an exact range of bytes from S3. Since there is quite a high constant overhead and a monetary price associated to each GET request in S3, it makes sense to request a larger block instead of the single 4KB page that was originally requested by the client. You can read more about the effect of the block size below.
After a successful download, the block is added to the cache, and a response is sent to the client as in Step 3.
source s3tdb.go .
5. The requested block is memory-mapped by the client
The page fault handler receives a path and an offset to the requested block from the server. The block is memory mapped by the handler.
6. The requested page is delivered
The page fault handler thread uses UDDFIO_COPY to copy the requested page back to the app’s address space atomically. Once this call finishes, the app thread unblocks and it proceeds to process data as usual.
As described above, handling a single page fault may require up to six steps which involve a separate thread, a TCP connection, a server, and an HTTPS request to S3. Is this going to be too slow for realistic workloads?
Below we have benchmarked the S3-direct solution versus the baseline approach that involves downloading the full TrailDB file first to a local disk. The test data is a 100MB snapshot of Wikipedia edit history , as seen in the TrailDB tutorial . This TrailDB contains 6.3M trails. The benchmarks were performed on an r3.8xlarge EC2 instance in the same region where the data is located in S3.
First, consider a simple case where you want to access the first X trails in the TrailDB. In the chart below the green line shows the constant cost of the baseline: it takes about 1.6 seconds to download the full TrailDB. After downloading it, there is practically no cost for accessing the first 0-2000 trails.
The orange line shows the cost for performing the same operation using S3 directly with a 16MB block size. Accessing the first 0-2000 trails involves downloading a single 16MB block, which takes about 500ms.
The blue line shows the cost for using S3 directly with a 16KB block size. The small block size yields the lowest latency to the first results: You can get results for the first 0-10 trails in about 100ms, which is a 16x speedup compared to the baseline! However, the more trails are processed the more small blocks need to be fetched. The overhead adds up so that at about 200 trails it is cheaper to use the baseline approach instead.
It is expected that sequential access favors large block sizes. However, the original needle in a haystack question is more about random access.
Like above, the baseline approach first downloads the full TrailDB and then access X trails randomly. Again, the cost to download dominates the total cost since a small 100MB TrailDB fits in memory easily.