(The problem with disk-oriented databases: maintaining a buffer pool.) DBMSs invariably maintain a buffer pool of blocks in main memory for faster access.
When an executing query attempts to read a disk block, the DBMS first checks to see whether the block already exists in this buffer pool.
If not, a block is evicted to make room for the needed one.
There is substantial overhead to managing the buffer pool, since blocks have to be pinned in main memory and the system must maintain an eviction policy (e.g., least recently used).
As has been noted previously, when all data fits in main memory, the cost of maintaining a buffer pool is nearly one-third of all the CPU cycles used by the DBMS.
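The pin-check-evict cycle described above can be sketched in a few lines. This is a minimal illustration of buffer pool bookkeeping, not the implementation of any particular DBMS; the class name and callback interface are assumptions made for the example.

```python
from collections import OrderedDict

class BufferPool:
    """Minimal sketch of the buffer pool lookup/eviction cycle.
    Names and structure are illustrative, not from any real DBMS."""

    def __init__(self, capacity, read_from_disk):
        self.capacity = capacity
        self.read_from_disk = read_from_disk  # callback: block_id -> block data
        self.frames = OrderedDict()           # block_id -> block, in LRU order

    def get_block(self, block_id):
        if block_id in self.frames:
            self.frames.move_to_end(block_id)  # hit: mark as most recently used
            return self.frames[block_id]
        if len(self.frames) >= self.capacity:
            self.frames.popitem(last=False)    # miss: evict least recently used
        block = self.read_from_disk(block_id)  # fetch the needed block
        self.frames[block_id] = block
        return block
```

Even in this toy form, every access pays for the existence check and the LRU maintenance, which is the per-cycle overhead the paragraph above refers to.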
(This motivates main memory databases, which avoid the buffer pool's large overhead.) The expense of managing disk-resident data has fostered a class of new DBMSs that put the entire database in main memory and thus have no buffer pool.
TimesTen was an early proponent of this approach, and more recent examples include H-Store [2, 18], MemSQL, and RAMCloud.
H-Store (and its commercial version, VoltDB) performs significantly better than disk-based DBMSs on standard OLTP benchmarks because of this main memory orientation,
as well as because it avoids the overhead of concurrency control and heavyweight data logging.
(The problem with main memory databases: exceeding memory capacity forces slow disk reads.) The fundamental problem with main memory DBMSs, however, is that this improved performance is only achievable when the database is smaller than the amount of physical memory available in the system.
If the database does not fit in memory, then the operating system will start to page virtual memory, and main memory accesses will cause page faults.
Because page faults are transparent to the user (in this case, the main memory DBMS), transaction execution stalls while the page is fetched from disk.
This is a significant problem in a DBMS, like H-Store, that executes transactions serially without the use of heavyweight locking and latching.
Because of this, all main memory DBMSs warn users not to exceed the amount of real memory.
If memory is exceeded (or if it might be at some point in the future), then a user must either
(1) provision new hardware and migrate their database to a larger cluster, or (2) fall back to a traditional disk-based system, with its inherent performance problems.
(One widely adopted way to improve a disk-based database's performance is an external cache.) One widely adopted performance enhancer is to use a main memory distributed cache, such as Memcached, in front of a disk-based DBMS.
Under this two-tier architecture, the application first looks in the cache for the tuple of interest.
If this tuple is not in the cache, then the application executes a query in the DBMS to fetch the desired data.
Once the application receives this data from the DBMS, it updates the cache for fast access in the future.
Whenever a tuple is modified in the database, the application must invalidate its cache entry
so that the next time it is accessed the application will retrieve the current version from the DBMS.
Many notable web sites, such as Facebook, use a large cluster of Memcached nodes in front of their sharded MySQL installation.
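The two-tier lookup and invalidation protocol described above is the classic cache-aside pattern. The following sketch uses plain dictionaries to stand in for Memcached and the DBMS; the function names are illustrative assumptions.

```python
def read_tuple(key, cache, dbms):
    """Cache-aside read: check the cache first, fall back to the DBMS,
    and populate the cache for future accesses."""
    value = cache.get(key)
    if value is None:
        value = dbms[key]      # cache miss: query the back-end DBMS
        cache[key] = value     # remember the result for fast future reads
    return value

def write_tuple(key, value, cache, dbms):
    """Cache-aside write: update the DBMS, then invalidate the stale
    cache entry so the next read fetches the current version."""
    dbms[key] = value
    cache.pop(key, None)
```

Note that this synchronization logic lives entirely in the application, which is precisely the second problem with the two-tier model discussed next.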
There are two problems with this two-tier model. First, data objects may reside both in the cache (in main memory format) and in the DBMS buffer pool (in disk format).
This double buffering of data is a waste of resources.
The second issue is that it requires developers to embed logic in their application to keep the two systems independently synchronized.
For example, when an object is modified, the update is sent to the back-end DBMS.
But now the states of the object in the DBMS and in the cache are different.
If the application requires up-to-date values, the application must also update the object in the cache.
To overcome these problems, we present a new architecture for main memory DBMSs that we call anti-caching.
In a DBMS with anti-caching, when memory is exhausted, the DBMS gathers the “coldest” tuples and writes them to disk with minimal translation from their main memory format,
thereby freeing up space for more recently accessed tuples.
As such, the “hotter” data resides in main memory, while the colder data resides on disk in the anti-cache portion of the system.
Unlike a traditional DBMS architecture, tuples do not reside in both places;
each tuple is either in memory or in a disk block, but never in both places at the same time.
In this new architecture, main memory, rather than disk, becomes the primary storage location.
Rather than starting with data on disk and reading hot data into the cache, data starts in memory and cold data is evicted to the anti-cache on disk.
(The mechanism is similar to virtual memory.) This approach is similar to virtual memory swapping in operating systems (OS).
With virtual memory, when the amount of data exceeds the amount of available memory, cold data is written out to disk in pages, typically in least recently used (LRU) order.
When the evicted page is accessed, it is read back in, possibly causing other pages to be evicted.
This allows the amount of virtual memory to exceed the amount of physical memory allocated to a process.
Similarly, anti-caching allows the amount of data to exceed the available memory by evicting cold data to disk in blocks.
If data access is skewed, the working set will remain in main memory.
(Advantages compared to virtual memory.) With anti-caching, it is the responsibility of the DBMS to read and write data as needed.
An alternative is to let the virtual memory system do the paging of the data to and from disk; indeed, prior systems have taken this approach.
However, anti-caching has several advantages over virtual memory in the context of a main memory DBMS.
In particular, it provides fine-grained control of the data evicted to disk and non-blocking reads of evicted data from disk.
These two main differences are described in detail below:
Fine-Grained Eviction: A key advantage of anti-caching over virtual memory in the context of a main memory DBMS is the granularity at which data can be evicted.
In anti-caching, eviction decisions are made at the tuple level. This means that the coldest tuples will be written to disk.
In virtual memory, the OS makes eviction decisions at the page level. A virtual memory page is likely to be significantly larger than a typical OLTP tuple.
Thus, each page selected for eviction will contain multiple tuples, each with potentially varying levels of coldness.
A single hot tuple on a page will cause the entire page to be hot and kept in memory, even if the other tuples are cold.
It is best to make evictions at the same level of granularity that the data is accessed, which in a DBMS is at the tuple level.
Anti-caching provides a method for this finer-grained control of evicted data by building pages of cold tuples only.
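The memory cost of the coarser granularity is easy to quantify. The following toy calculation assumes a hypothetical workload of 10 pages with 4 tuples each, exactly one of them hot per page; the numbers are illustrative, not measurements.

```python
# Hypothetical workload: 10 pages, 4 tuples per page, one hot tuple per page.
pages = [["hot", "cold", "cold", "cold"] for _ in range(10)]

# Page-level eviction (virtual memory): any page containing a hot tuple
# must stay resident, so every tuple on such a page stays in memory.
resident_page_level = sum(len(p) for p in pages if "hot" in p)

# Tuple-level eviction (anti-caching): only the hot tuples stay resident.
resident_tuple_level = sum(1 for p in pages for t in p if t == "hot")
```

Under this skew, page-level eviction pins all 40 tuples in memory while tuple-level eviction keeps only the 10 hot ones, a 4x difference in resident cold data.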
Non-Blocking Fetches: Another difference is how evicted data is retrieved when it is needed.
In a virtual memory system, the OS blocks a process when it incurs a page fault from reading a memory address that is on disk.
For certain DBMSs [29, 34], this means that no transactions are executed while the virtual memory page is being fetched from disk.
In an anti-caching DBMS, a transaction that accesses evicted data is simply aborted and then restarted at a later point once the data that it needs is retrieved from disk.
In the meantime, the DBMS continues to execute other transactions without blocking.
Lastly, since every page fault triggers a disk read, queries that access multiple evicted pages will page fault several times in a sequential fashion.
We instead use a pre-pass execution phase that attempts to identify all evicted blocks needed by a transaction, which allows all blocks to be read together.
(Two assumptions: queries and indexes fit in memory.) Our anti-cache design is based on two key assumptions.
Foremost is that our current prototype restricts the scope of queries to fit in main memory.
We do not consider this a significant hindrance, since such large queries are uncommon in OLTP workloads.
The other design assumption is that all indexes fit in memory.
The trade-offs of using large secondary indexes are a well-studied topic in database optimization, and we do not believe that this requirement is overly restrictive.
We propose alternative designs to obviate the need to keep secondary indexes in memory.
H-STORE SYSTEM OVERVIEW
Given these observations, H-Store is designed to efficiently execute OLTP workloads on main memory-only nodes [18, 29].
As shown in Fig. 2, an H-Store node is a single physical computer system that manages one or more partitions.
A partition is a disjoint subset of the data.
Each partition is assigned a single-threaded execution engine at its node that is responsible for executing transactions and queries for that partition. (One single-threaded execution engine per partition.)
(H-Store is primarily optimized for stored-procedure transactions.) Although H-Store supports ad hoc queries, it is primarily optimized to execute transactions as stored procedures.
In this paper, we use the term transaction to refer to an invocation of a stored procedure.
Stored procedures are an effective way to optimize OLTP applications because they execute entirely at the data node, thereby reducing the number of round-trips between the client and the database.
A stored procedure contains control code (i.e., application logic) that invokes pre-defined parameterized SQL commands.
A client application initiates a transaction by sending a request to any node in the cluster.
Each transaction request contains the name of a stored procedure and the input parameters for that procedure’s control code.
H-Store assumes a workload of transactions with the following composition:
Single-Partition Transactions: In this case, there is a database design that allocates the various partitions of each table to nodes in such a way that most transactions are local to a single node.
Looking up a banking account balance or a purchase order is an example of a single-partition transaction.
A single-partition transaction is examined in the user-space H-Store client library, where parameters are substituted to form a runnable transaction.
The user-level library is aware of H-Store's partitioning scheme, so the transaction can be sent to the correct node, where it is executed from beginning to end without any blocking.
Hence, single-partition transactions are serialized at each node, and any application that consists entirely of single-partition transactions will obtain maximum parallelism.
Multi-Partition Transactions: These transactions consist of multiple phases, each of which must be completed before the next phase begins.
Moreover, one or more of the phases touches multiple partitions.
Each H-Store transaction is given a unique transaction ID, based on the time it arrived in the system.
Standard clock-skew algorithms are used to keep the various CPU clocks synchronized.
If a transaction with a higher transaction ID has already arrived at a node, then the incoming transaction is refused.
In this way transactions are synchronized in timestamp order at the various nodes, without the need for any deadlock detection.
Multi-Partition transactions use an extension of this protocol, where each local executor cannot run other transactions until the multi-partition transaction finishes execution.
This scheme gives good throughput for workloads with a preponderance of single-partition transactions.
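The admission rule above (refuse a transaction if one with a higher ID has already arrived at the node) can be sketched as a one-field scheduler. The class name and interface are assumptions made for illustration, not H-Store's actual API.

```python
class PartitionScheduler:
    """Sketch of the timestamp-ordering rule described above: an incoming
    transaction is refused if one with a higher ID has already arrived.
    Illustrative only; the real protocol also covers multi-partition cases."""

    def __init__(self):
        self.highest_seen = -1   # highest transaction ID admitted so far

    def admit(self, txn_id):
        if txn_id < self.highest_seen:
            return False         # refused: would violate timestamp order
        self.highest_seen = txn_id
        return True
```

Because each partition simply refuses out-of-order arrivals, no lock table or deadlock detector is needed, which is the point of the paragraph above.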
To ensure that all modifications to the database are durable and persistent, each DBMS node continuously writes asynchronous snapshots of the entire database to disk at fixed intervals [21, 29].
In between these snapshots, the DBMS writes out a record to a command log for each transaction that completes successfully.
The DBMS combines multiple records together and writes them in a group to amortize the cost of writing to disk [16, 34].
Any modifications that are made by a transaction are not visible to the application until this record has been written.
This record only contains the original request information sent from the client, which is more lightweight than record-level logging.
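The group-commit behavior described above (combining multiple command-log records into one disk write) can be sketched as a small buffer. All names are illustrative; the real system additionally delays making a transaction's results visible until its record is durable.

```python
class CommandLog:
    """Sketch of group commit for a command log: per-transaction request
    records are buffered and persisted as one group to amortize the cost
    of writing to disk. Names and group size are illustrative."""

    def __init__(self, write_batch, group_size=4):
        self.write_batch = write_batch  # callback that persists one batch
        self.group_size = group_size
        self.pending = []

    def append(self, request):
        self.pending.append(request)
        if len(self.pending) >= self.group_size:
            self.flush()

    def flush(self):
        if self.pending:
            self.write_batch(list(self.pending))  # one disk write per group
            self.pending = []
```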
ANTI-CACHING SYSTEM MODEL
We call our architecture anti-caching since it is the opposite architecture to the traditional DBMS buffer pool approach.
The disk is used as a place to spill cold tuples when the size of the database exceeds the size of main memory.
As stated earlier, unlike normal caching, a tuple is never copied: it lives in either main memory or the disk-based anti-cache.
At runtime, the DBMS monitors the amount of main memory used by the database.
When the size of the database relative to the amount of available memory on the node exceeds some administrator-defined threshold,
the DBMS “evicts” cold data to the anti-cache in order to make space for new data.
To do this, the DBMS constructs a fixed-size block that contains the least recently used (LRU) tuples from the database and writes that block to the anti-cache.
It then updates a memory-resident catalog that keeps track of every tuple that was evicted. (The in-memory catalog must track all evicted tuples.)
When a transaction accesses one of these evicted tuples, the DBMS switches that transaction into a “pre-pass” mode to learn about all of the tuples that the transaction needs.
After this pre-pass is complete, the DBMS then aborts that transaction (rolling back any changes that it may have made) and holds it while the system retrieves the tuples in the background.
Once the data has been merged back into the in-memory tables, the transaction is released and restarted.
We now describe the underlying storage architecture of our anti-cache implementation.
We then discuss the process of evicting cold data from memory and storing it in the non-volatile anti-cache.
Then, we describe how the DBMS retrieves data from the anti-cache.
All of the DBMS’s operations on the anti-cache are transactional and any changes are both persistent and durable.
The anti-cache storage manager within each partition contains three components: (The structure has three parts.)
(1) a disk-resident hash table that stores evicted blocks of tuples, called the Block Table,
(2) an in-memory Evicted Table that maps evicted tuples to block ids, and
(3) an in-memory LRU Chain of tuples for each table.
As with all tables and indexes in H-Store, these data structures do not require any latches, since only one transaction is allowed to access them at a time. (Latch-free because H-Store executes transactions serially.)
One of the trade-offs that we need to consider is the storage overhead of this bookkeeping, given that the main goal of evicting tuples is to free up memory.
Obviously the amount of memory used to keep track of evicted tuples should only be a small fraction of the memory gained from evicting tuples.
Our current implementation also requires that all of the database’s primary key and secondary indexes fit in memory. We explore this issue further in Section 5.6.
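For orientation, the three components above can be held in one bookkeeping structure. This is a plain-Python sketch with illustrative layouts; in the real system the Block Table's values live on disk and the LRU Chain is embedded in tuple headers rather than stored separately.

```python
class AntiCacheStorage:
    """Sketch of the per-partition anti-cache bookkeeping named above.
    All container choices are illustrative simplifications."""

    def __init__(self):
        self.block_table = {}    # block_id -> serialized block (disk-resident)
        self.evicted_table = {}  # tuple_id -> (block_id, offset within block)
        self.lru_chain = {}      # table name -> [tuple ids, coldest first]
```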
Block Table: This is a hash table that maintains the blocks of tuples that have been evicted from the DBMS's main memory storage.
Each block is the same fixed size and is assigned a unique 4-byte key.
A block’s header contains the identifier for the single table that its tuples were evicted from and the timestamp when the block was created.
The body of the block contains the serialized evicted tuples from a single table.
Every tuple stored in a block is prefixed with its size and is serialized in a format that closely resembles its in-memory format (as opposed to a format specifically designed for disk-based storage). (The serialized format is close to the in-memory format.)
The key portion of the Block Table stays in memory while its values (i.e., the block data) are stored on disk without OS or file-system caching.
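The block layout described above (a header naming the source table and creation time, followed by size-prefixed serialized tuples) can be sketched with `struct`. The exact byte widths chosen here are assumptions for the example, not H-Store's actual on-disk format.

```python
import struct

HEADER = "<IQ"  # table id (4 bytes), creation timestamp (8 bytes)

def build_block(table_id, timestamp, tuples):
    """Serialize a block: header, then each tuple prefixed with its size."""
    body = b"".join(struct.pack("<I", len(t)) + t for t in tuples)
    return struct.pack(HEADER, table_id, timestamp) + body

def read_block(block):
    """Parse a block back into its header fields and tuple payloads."""
    table_id, timestamp = struct.unpack_from(HEADER, block, 0)
    tuples, pos = [], struct.calcsize(HEADER)
    while pos < len(block):
        (size,) = struct.unpack_from("<I", block, pos)
        pos += 4
        tuples.append(block[pos:pos + size])
        pos += size
    return table_id, timestamp, tuples
```

Because the tuple bytes are stored near their in-memory form, deserialization on un-eviction is little more than this loop plus pointer fix-ups.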
Evicted Table: The Evicted Table keeps track of the tuples that have been written out to blocks on disk.
When a tuple is evicted, the DBMS removes it from the regular storage space for tables and adds it to a dynamically-constructed block that is then stored in the Block Table.
Each evicted tuple in a block is assigned a 4-byte identifier that corresponds to its offset in the block it resides in.
The DBMS updates any indexes containing evicted tuples to reference the Evicted Table.
As discussed in Section 3.4, the Evicted Table ensures that the DBMS is able to identify all of the evicted tuples that are needed by a transaction.
LRU Chain: Lastly, H-Store also maintains an in-memory list of all the tuples for each table in LRU order.
This allows the DBMS to quickly ascertain at runtime the least recently used tuples to combine into a new block for eviction.
The LRU Chain is a doubly-linked list where each tuple points to the next and previous most-recently used tuple for its table.
Tuples are added to the tail of the chain whenever they are accessed, modified, or inserted by a transaction.
When a tuple is read or updated, it is first removed from its original location in the chain and inserted at the back.
The tuples that were previously adjacent to it in the chain are then linked to each other.
Rather than maintain a separate data structure for the LRU Chain, the DBMS embeds the pointers directly in the tuples’ headers .
To reduce the memory overhead of this, the pointer for each tuple is a 4-byte offset of that record in its table’s memory at that partition (instead of an 8-byte address location).
To reduce the CPU overhead of tracking the total ordering of each table's LRU Chain, the DBMS selects a fraction of the transactions to monitor at runtime. (Only a sample of transactions updates the LRU chain, reducing CPU usage.)
The selected transactions are used to update data in the LRU Chain.
Because hot tuples are, by definition, accessed more frequently, they are more likely to be accessed in the transactions sampled and thus are more likely to be updated in the LRU Chain.
The rate at which transactions are sampled is controlled by parameter a, where 0 < a < 1.
We explore the effect of sampling and other trade-offs in Section 5.4.
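The sampled LRU maintenance above can be sketched as follows. A Python list stands in for the doubly-linked chain embedded in tuple headers, and the class name and interface are illustrative assumptions.

```python
import random

class SampledLRUChain:
    """Sketch of a per-table LRU chain with transaction sampling: only a
    fraction `a` of transactions update the chain, yet hot tuples still
    migrate toward the tail with high probability because they are
    accessed more often."""

    def __init__(self, a=0.01, rng=random.random):
        self.a = a
        self.rng = rng
        self.chain = []   # head = coldest, tail = most recently used

    def access(self, tuple_id):
        if self.rng() >= self.a:
            return                       # this transaction was not sampled
        if tuple_id in self.chain:
            self.chain.remove(tuple_id)  # unlink from its current position
        self.chain.append(tuple_id)      # re-insert at the hot tail

    def pop_coldest(self):
        return self.chain.pop(0)         # head of the chain is evicted first
```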
(Tables can be flagged so that they are never evicted.) In addition, there are often tables that are accessed frequently and should not be allowed to be evicted to disk (e.g., small lookup tables).
Because these tables would be considered hot, it is unlikely that any portion of such a table would be evicted to disk.
Still, there is added overhead in maintaining the LRU chain for such tables.
To eliminate it, tables must be explicitly flagged as evictable during schema creation.
Any table not labeled as evictable will not maintain an LRU chain and will remain entirely in main memory.
Ideally, our architecture would be able to maintain a single global ordering of tuples in the system, thus globally tracking hot and cold data.
However, the costs of maintaining a single chain across partitions would be prohibitively expensive due to the added costs of inter-partition communication.
Instead, our system maintains a separate LRU Chain per table that is local to a partition.
Thus, in order to evict data the DBMS must determine (1) what tables to evict data from and (2) the amount of data that should be evicted from a given table.
For our initial implementation, the DBMS answers these questions by the relative skew of accesses to tables.
The amount of data accessed at each table is monitored, and the amount of data evicted from each table is inversely proportional to the amount of data accessed in the table since the last eviction.
Thus, the hotter a table is, the less data will be evicted.
For the benchmarks tested, this approach is sufficient, but we expect to consider more sophisticated schemes in the future.
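The skew-based policy above amounts to splitting the eviction budget in inverse proportion to per-table access counts. A minimal sketch, with an illustrative function name and interface:

```python
def eviction_shares(bytes_to_evict, accessed_bytes):
    """Split an eviction budget across tables: each table's share is
    proportional to 1 / (data accessed in it since the last eviction),
    so hotter tables evict less."""
    inverse = {t: 1.0 / n for t, n in accessed_bytes.items()}
    total = sum(inverse.values())
    return {t: bytes_to_evict * w / total for t, w in inverse.items()}
```

For example, a table accessed three times as heavily as another receives one-third as much of the eviction budget.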
After determining how much data to evict from each table, H-Store executes special single-partition transactions that select tuples for eviction and write blocks to disk.
Since transactions are executed one-at-a-time at each partition, these eviction transactions automatically block all other transactions at their target partition without needing any additional locking mechanisms.
When the eviction transaction executes, it creates a new block by popping tuples off the head of the target table’s LRU Chain.
For each tuple being evicted, H-Store copies its data into the eviction block buffer.
It then adds an entry to the Evicted Table and updates all indexes to point to this entry instead of the original tuple location.
Each tuple in the Evicted Table includes a special evicted flag in its header that enables the DBMS to recognize when a transaction accesses evicted data.
This eviction process continues until the block is full, at which point the transaction will create the next block.
The process stops once the transaction has evicted the requisite amount of data from each table.
Groups of blocks are written out in a single sequential write.
For example, if the table is asked to evict a set of n blocks, it will create each of the n blocks independently, and only when all n blocks have been created will it write the result to disk in one sequential write.
It is also important to note that the state of the database is consistent during the eviction process.
Although indexes are updated and the tuple is removed from the original table before the block is written to disk,
the single-threaded nature of the execution engine means that no other transactions access these changes until the special transaction finishes.
Other transactions will not execute until the entire set of blocks requested for eviction are written to disk.
Also, at no point during this process is data un-recoverable if the DBMS crashes (see Section 3.6).
Main memory DBMSs, like H-Store, owe their performance advantage to processing algorithms that assume that data is in main memory.
But any system will slow down if a disk read must be processed in the middle of a transaction.
This means that we need to avoid stalling transaction execution at a partition whenever a transaction accesses an evicted tuple.
We now describe how this is accomplished with anti-caching.
A query can access evicted data through either an index or a sequential look-up (i.e., a full table scan).
For the latter, the DBMS will need to store the entire table in memory, which may exceed the physical memory available.
We discuss this problem in Section 6.1.
For index look-up queries, the system searches the target index to find the keys that match the query’s predicate.
Each key in the index points to a tuple that is either in the normal table storage or in the Evicted Table.
If none of the accessed tuples are evicted, then the DBMS allows the transaction to continue.
If evicted data is needed, the transaction will then enter a special phase to determine exactly which data is needed and where that data exists on disk.
Pre-pass Phase: A transaction enters the pre-pass phase if evicted data is needed to continue execution.
The goal of the pre-pass phase is to determine all of the evicted data that the transaction needs to access so that it can be retrieved together.
To do this, the transaction executes as normal, except that the DBMS checks the evicted flag for each tuple that it accesses to determine whether the tuple has been evicted.
If it has, then the DBMS records the evicted tuple’s block ID and offset from the Block Table (see Fig. 3).
When pre-pass has finished execution, the DBMS rolls back any changes that the transaction made at any partition and
then re-queues the transaction along with the list of evicted tuple identifiers that it attempted to access during the pre-pass phase.
Also, during the pre-pass phase, any in-memory tuples that the transaction accesses are updated in the LRU Chain to reduce the likelihood that these tuples are evicted before the transaction is re-queued. (The pre-pass also refreshes these tuples in the LRU chain, preventing data needed on re-execution from being evicted in the meantime.)
This minimizes the possibility of a transaction being restarted multiple times due to evicted data.
Although it is small, the overhead of aborting and restarting transactions is not zero.
Thus, in the pre-pass phase, the DBMS attempts to identify all of the data that a transaction needs by allowing that transaction to continue executing after it encounters an evicted tuple.
This allows the DBMS to batch fetch requests and minimize the possibility of restarting a transaction multiple times.
In contrast, in the event of a page fault in virtual memory, execution halts for each individual evicted page access.
(In some cases one pre-pass cannot identify all of the evicted tuples a transaction needs, so multiple passes may be required.)
For some transactions, it is not possible for the DBMS to discover all of the data that it needs in a single pre-pass.
This can occur if the non-indexed values of an evicted tuple are needed to retrieve additional tuples in the same transaction.
In this case, the initial pre-pass phase will determine all evicted data that is not dependent on currently evicted data.
Once this data is successfully merged and the transaction is restarted, this unevicted data will be used to resolve any data dependencies and determine if any additional data needs to be unevicted.
From our experience, however, we believe that such scenarios are rare.
The more typical access pattern is that a transaction retrieves the key of a record from a secondary index,
in which case the DBMS will still be able to run the transaction in the pre-pass phase because the indexes always remain in memory.
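The pre-pass/restart loop described above can be sketched as follows. The protocol here, where the transaction is modeled as a callable returning the set of evicted block ids it touched (empty when it can run to completion), is an assumption made for illustration, not H-Store's actual API.

```python
def run_with_prepass(txn, fetch_blocks, max_restarts=5):
    """Sketch of pre-pass execution: run the transaction while collecting
    evicted-block references instead of blocking on them; roll back, fetch
    the needed blocks in one batch, then restart the transaction."""
    for _ in range(max_restarts):
        missing = txn()           # pre-pass: collect evicted-block references
        if not missing:
            return "committed"    # touched no evicted data: ran to completion
        fetch_blocks(missing)     # one batched, non-blocking disk read
    return "aborted"              # gave up after repeated restarts
```

The loop also models the multi-pass case: if un-evicted data exposes new dependencies, the next iteration collects and fetches those blocks.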
We next describe how the DBMS retrieves the evicted tuples identified during the pre-pass and merges them back into the system’s in-memory storage.
After aborting a transaction that attempts to access evicted tuples, the DBMS schedules the retrieval of the blocks that the transaction needs from the Block Table in two steps.
The system first issues a non-blocking read to retrieve the blocks from disk.
This operation is performed by a separate thread while regular transactions continue to execute at that partition.
The DBMS stages these retrieved blocks in a separate buffer that is not accessible to queries.
Any transaction that attempts to access an evicted tuple in one of these blocks is aborted as if the data was still on disk.
Once the requested blocks are retrieved, the aborted transaction is then rescheduled.
Before it starts, the DBMS performs a “stop-and-copy” operation whereby all transactions are blocked at that partition
while the unevicted tuples are merged from the staging buffer back into the regular table storage.
It then removes all of the entries for these retrieved tuples from the Evicted Table and updates the table's indexes to point to the real tuples.
The key issue that we must consider during this step is how much data to merge from a retrieved block back into the in-memory storage.
For example, the DBMS can choose to merge all of the tuples from the recently retrieved block or just the tuple(s) that the previous transaction attempted to access that caused the block to be retrieved in the first place.
We now discuss two different solutions for this problem. We compare the efficacy and performance of these approaches in Section 5.1.
Block-Merging: The simplest method is for the DBMS to merge the entire retrieved block back into the regular table storage.
All of the tuples in the block are inserted back into the in-memory table.
The requested tuple(s) are placed at the back of the table’s LRU Chain.
Conversely, any tuples not needed by pending transactions are added to the front (i.e., cold end) of the LRU Chain,
which means that they are more likely to be chosen for eviction in the next round.
This ensures that only the tuples that were needed by the transaction that caused the block to be un-evicted become hot, whereas the rest of the block is still considered cold.
After the DBMS merges the tuples from the block, it can delete that block from the Evicted Table.
The overhead of merging all the tuples from the un-evicted block can be significant, especially if only a single tuple is needed from the block and all of the other tuples are re-evicted shortly thereafter.
In the worst case, there is a continuous un-eviction/re-eviction cycle, where unwanted tuples are brought into the system and then immediately re-evicted.
Tuple-Merging: To avoid this oscillation, an alternative strategy is to merge only the tuples that caused the block to be read from disk.
When a block is retrieved from disk, the DBMS extracts only the tuples that are needed from that block
(based on their offsets stored in the Evicted Table) and then only merges those tuples back into the in-memory table.
Once the desired tuples are merged, the fetched block is then discarded without updating the block on disk.
This reduces the time of merging tuples back into their tables and updating their indexes.
This means that there are now two versions of the tuple: the current one in memory and a stale one in the anti-cache on disk.
But since the DBMS removes the merged tuples' entries from the Evicted Table, all subsequent look-ups of these tuples will use the in-memory version.
If this block is ever fetched again, the stale entries of the already unevicted tuples are ignored.
Over time, these “holes” in the blocks accumulate.
This means the amount of valid data that is retrieved in each block is reduced.
We employ a lazy block compaction algorithm during the merge process.
This compaction works by tracking the number of holes in each of the blocks in the Block Table.
When the DBMS retrieves a block from disk, it checks whether the number of holes in a block is above a threshold.
If it is, then the DBMS will merge the entire block back into memory, just as with the block-merge strategy.
We discuss more sophisticated approaches in Section 6.2.
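The tuple-merge strategy with lazy compaction can be sketched in one function. The interface (a per-block hole count passed in and returned) is an illustrative assumption; the real system tracks holes in its Block Table metadata.

```python
def merge_block(block_tuples, needed_offsets, holes, hole_threshold, table):
    """Sketch of tuple-merging with lazy compaction: normally only the
    requested tuples are merged back into the table, leaving 'holes' for
    the now-stale on-disk copies; once a block's hole count would cross
    the threshold, the whole block is merged as in block-merging.
    Returns the block's new hole count (0 means discard the block)."""
    if holes + len(needed_offsets) > hole_threshold:
        table.extend(block_tuples)            # block-merge: take every tuple
        return 0
    for off in needed_offsets:                # tuple-merge: only what was needed
        table.append(block_tuples[off])
    return holes + len(needed_offsets)        # stale entries remain on disk
```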
Our anti-caching model also supports distributed transactions.
H-Store switches a distributed transaction into “pre-pass” mode, just as it does for a single-partition transaction, when it attempts to access evicted tuples at any one of its partitions.
The transaction is aborted and not requeued until it receives a notification that all of the blocks that it needs have been retrieved from the nodes in the cluster.
The system ensures that any in-memory tuples that the transaction also accessed at any partition are not evicted during the time that it takes for each node to retrieve the blocks from disk.
Snapshots & Recovery
Persistence and durability in disk-based systems is typically achieved using a combination of on-disk data and logging.
In a main memory DBMS, however, other techniques such as snapshots and command logging [22, 29] are used.
This does not change for a DBMS with anti-caching, except that now the system must also snapshot the additional data structures discussed in Section 3.1.
To do this, the DBMS serializes all the contents of the regular tables and index data, as well as the contents of the Evicted Table, and writes it to disk.
At the same time, the DBMS also makes a copy of the Block Table on disk as it existed when the snapshot began.
No evictions are allowed to occur in the middle of a snapshot.
To recover after a crash, the DBMS loads in the last snapshot from disk.
This restores the tables, indexes, Block Table, and Evicted Table as they existed before the crash.
The DBMS then replays the transactions in the command log that were created after this snapshot was taken.
With this process, all anti-caching data is persistent and the exact state of a system is recoverable in the event of a crash.
Making a snapshot of the Block Table could be prohibitively expensive for large data sizes.
Instead of making copies for each checkpoint, the DBMS takes delta snapshots.
Because the data within a block in the Block Table is not updated, the DBMS just checks to see which blocks were added or removed from the Block Table since the last snapshot.
This technique greatly reduces the amount of data copied with each snapshot invocation.
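Because blocks are immutable, the delta check described above reduces to a set difference over block ids. A minimal sketch with an illustrative function name:

```python
def block_table_delta(prev_block_ids, current_block_ids):
    """Delta snapshot of the Block Table: since blocks are never updated
    in place, the snapshot only needs to record which block ids were
    added or removed since the previous snapshot."""
    added = current_block_ids - prev_block_ids
    removed = prev_block_ids - current_block_ids
    return added, removed
```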