index+log: implementations

My take on index+log systems like WiscKey is that they are neither better nor worse than an LSM – it all depends on your workload. But I am certain that we know much more about an LSM than about the index+log systems. Hopefully that changes over time as some of them are thriving.

The first index+log system that I read about was Berkeley DB Java Edition. The design paper is worth reading. Since then there have been a few more implementations and papers that I describe here. This list is probably incomplete: Bitcask, ForestDB, WiscKey, HashKV, TitanDB and RocksDB BlobDB.

At this point the systems that are getting into production, TitanDB and BadgerDB, use an LSM for the index. I wonder whether an index structure that supports update-in-place would be better, especially when the index must be cached: I expect the CPU read-amp for an LSM to be about 2X larger than for a b-tree, and a b-tree's support for update-in-place makes it easier to relocate values during GC.

While I like index+log systems I think that papers and marketing tend to overstate LSM write-amp. For production RocksDB I usually see write-amp between 10 and 20. I expect that index+log could achieve something closer to 5. This paper from CMU explains one reason why per-level write-amp in an LSM is less than the per-level fanout (less than 10). Write skew is another reason.
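
To make those numbers concrete, here is a back-of-envelope sketch in Python. The formulas are the standard ones (a per-level write-amp summed across levels for leveled compaction, and the 1 / (1 - u) cleaning cost for a log collected at utilization u); the level count, per-level write-amp and live fraction below are illustrative assumptions, not measurements.

    # Rough write-amp arithmetic; the numbers are assumptions, not measurements.

    def lsm_write_amp(levels, per_level_wa, wal_and_flush=2):
        # Leveled compaction: WAL write + memtable flush, then a per-level
        # write-amp for each level, which the CMU paper argues is below the fanout.
        return wal_and_flush + levels * per_level_wa

    def log_gc_write_amp(live_fraction):
        # Log-structured GC in steady state: if segments are cleaned while a
        # fraction u of their bytes is still live, each new byte costs about
        # 1 / (1 - u) writes including the copies done by GC.
        return 1.0 / (1.0 - live_fraction)

    print(lsm_write_amp(levels=4, per_level_wa=4))   # 18, in the 10 to 20 range
    print(log_gc_write_amp(live_fraction=0.8))       # ~5, the "closer to 5" estimate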

The systems

Bitcask was part of the Riak effort.

  • The index is an in-memory hash table. The index isn't durable, so the entire log must be scanned on startup to rebuild it, whether after a crash or a clean shutdown. The slow startup is a problem.
  • The value log is circular and GC copies live values from the tail to the head of the log. Liveness is determined by an index search. 
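
A toy sketch of that design in Python, assuming a single append-only log file rather than the real circular log, and ignoring deletes, crashes and concurrency. The class and record format are mine, not Bitcask's.

    # Toy Bitcask-style store: in-memory hash index over an append-only value log.
    import os, struct

    HEADER = struct.Struct(">II")             # key length, value length

    class BitcaskLike:
        def __init__(self, path):
            self.path = path
            self.log = open(path, "a+b")
            self.index = {}                   # key -> (value offset, value length)
            self._rebuild_index()             # the slow startup: scan the whole log

        def _rebuild_index(self):
            self.log.seek(0)
            offset = 0
            while header := self.log.read(HEADER.size):
                klen, vlen = HEADER.unpack(header)
                key = self.log.read(klen)
                self.log.seek(vlen, os.SEEK_CUR)          # skip the value bytes
                self.index[key] = (offset + HEADER.size + klen, vlen)
                offset += HEADER.size + klen + vlen

        def put(self, key, value):
            self.log.seek(0, os.SEEK_END)
            offset = self.log.tell()
            self.log.write(HEADER.pack(len(key), len(value)) + key + value)
            self.log.flush()
            self.index[key] = (offset + HEADER.size + len(key), len(value))

        def get(self, key):
            ptr = self.index.get(key)
            if ptr is None:
                return None
            voff, vlen = ptr
            return os.pread(self.log.fileno(), vlen, voff)

        def gc(self):
            # Copy live values into a new log. Liveness is an index search: a
            # record is copied only if the index still points at it.
            new_path = self.path + ".gc"
            with open(new_path, "wb") as new_log:
                new_index, out, offset = {}, 0, 0
                self.log.seek(0)
                while header := self.log.read(HEADER.size):
                    klen, vlen = HEADER.unpack(header)
                    key = self.log.read(klen)
                    value = self.log.read(vlen)
                    if self.index.get(key) == (offset + HEADER.size + klen, vlen):
                        new_log.write(HEADER.pack(klen, vlen) + key + value)
                        new_index[key] = (out + HEADER.size + klen, vlen)
                        out += HEADER.size + klen + vlen
                    offset += HEADER.size + klen + vlen
            self.log.close()
            os.replace(new_path, self.path)
            self.log = open(self.path, "a+b")
            self.index = new_index

Even this toy shows the startup problem: nothing persists the index, so every open pays for a full scan of the log.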

ForestDB was a finalist in the SIGMOD 2011 student programming contest. Eventually the project and its creator moved to Couchbase. It is worth reading about here and on the github page. I published blog posts that compare ForestDB and RocksDB: 1, 2, 3 and 4. Google finds more interesting things to read.

  • The index is a space-efficient trie.
  • The value log might have log segments. GC copies live values to the head of the log. Liveness is determined by an index search.

WiscKey is described as an LSM with key-value separation, and it made the term key-value separation popular. I put it in the index+log family of index structures.

  • The index is an LSM. There is no redo log for the index as it can be recovered from the head of the value log.
  • Kudos for many references to amplification factors. The paper uses bytes read for read-amp. I prefer to consider both IO and CPU for read-amp, with key comparisons for CPU and storage reads for IO.
  • It doesn’t mention that it has more cache-amp than an LSM, but few papers mention that problem. Shrinking the LSM by keeping large values separate doesn’t make the index parts of it (filter and index blocks) easier to cache as they are already separate from the data blocks. There is more to cache with index+log as I describe here.
  • It claims to overcome the (worst-case) problem of one storage IO per KV pair on a range scan by fetching in parallel (see the sketch after this list). Assuming the storage device has enough spare IO this might hide the problem but it doesn’t fix it. With many workloads there isn’t spare IO, and extra IO for reads also means extra CPU for decompression.
  • The value log is circular and single-threaded GC copies live values to the head of the log. Liveness is determined by an index search. I assume that multi-threaded GC is feasible.
  • The paper isn’t clear about the total write-amp that might occur from both the first write to the value log and GC that follows.
  • Compression isn’t explained.
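
Here is a minimal sketch of key-value separation and the parallel range scan, assuming a plain dict as a stand-in for the LSM index and a thread pool as a stand-in for the paper's parallel IO; none of the names or formats below come from WiscKey.

    # Key-value separation: the index stores (offset, length) pointers into a value log.
    import os
    from concurrent.futures import ThreadPoolExecutor

    class KVSeparatedStore:
        def __init__(self, path):
            self.vlog = open(path, "a+b")
            self.index = {}                   # key -> (offset, length); stands in for the LSM

        def put(self, key, value):
            self.vlog.seek(0, os.SEEK_END)
            offset = self.vlog.tell()
            self.vlog.write(len(key).to_bytes(4, "big") + key + value)
            self.vlog.flush()
            self.index[key] = (offset + 4 + len(key), len(value))

        def _read_value(self, ptr):
            offset, length = ptr
            # os.pread lets many threads read the same file descriptor concurrently.
            return os.pread(self.vlog.fileno(), length, offset)

        def get(self, key):
            ptr = self.index.get(key)
            return self._read_value(ptr) if ptr else None

        def range_scan(self, lo, hi, parallel_io=8):
            # Keys come back in order from the index, but each value needs its own
            # read from the value log. The thread pool models WiscKey's parallel
            # fetch: it can hide the per-value IO when the device has spare IO
            # capacity, but the IO (and any decompression) still happens.
            keys = sorted(k for k in self.index if lo <= k < hi)
            with ThreadPoolExecutor(max_workers=parallel_io) as pool:
                values = pool.map(lambda k: self._read_value(self.index[k]), keys)
            return list(zip(keys, values))

GC for this layout is what the bullets above describe: scan the value log, do an index search to decide whether each record is still the live version, and copy live records to the head of the log.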

BadgerDB is a golang implementation, and much more, of the WiscKey paper.

  • It has many features and many production use cases. This is impressive. 
  • GC is scheduled by the user. Based on Options.NumCompactors I assume it can be multi-threaded.
  • The docs state that the LSM can be served from RAM because the values are elsewhere. That is true but I don’t consider it a feature. It must be in RAM to avoid IO from the liveness queries done by GC. An LSM isn’t a monolithic thing: there are index blocks, data blocks and filter blocks, and most of the LSM (the data blocks from the max level) doesn’t have to be cached.
  • There is extra work on reads to find values that have been moved by GC. See the comments about BadgerDB here.

HashKV is an interesting paper that avoids index queries during GC.

  • Hash-based data grouping distributes KV pairs by hash into one of N logs. GC is probably done by scanning a log twice: once to get the keys and a second time to relocate the live values (see the sketch after this list). A value is live when the most recent key is not a tombstone. A value might be live when needed for a snapshot. GC doesn’t do index searches, so the index doesn’t have to be cached to make GC efficient, but you might want to cache it to avoid doing index IO on queries, and this index is still much larger than the block index for an LSM.
  • Hotness awareness copies cold values to a cold segment to avoid repeated GC for a value that doesn’t get updated or deleted. A header for the value is kept in the non-cold log.
  • Small values are stored inline in the LSM.
  • I am curious if more log groups means more write-amp. See my comment about fsync in a previous post.
  • I am curious whether managing the hash buckets is easy. The goal is to make sure that keys for a segment group fit in memory. The range and number of buckets must change over time. Does this have anything in common with linear and extendible hashing?
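
My reading of the hash-grouped GC, as a toy sketch: keys are routed to one of N in-memory logs by hash, and collecting a group scans only that log (twice), never the index. The constants and names are mine, and snapshots, hotness awareness and inlined small values aren't modeled.

    # Toy sketch of hash-grouped logs with two-pass GC that needs no index search.
    import hashlib

    NUM_GROUPS = 16
    TOMBSTONE = None                    # a delete is written as (key, TOMBSTONE)

    # Each group is modeled as a list of (key, value) records in write order.
    logs = [[] for _ in range(NUM_GROUPS)]

    def group_of(key):
        return int.from_bytes(hashlib.sha1(key).digest()[:4], "big") % NUM_GROUPS

    def put(key, value):
        logs[group_of(key)].append((key, value))

    def gc_group(group):
        log = logs[group]
        # Pass 1: remember the position of the most recent record for each key.
        latest = {key: pos for pos, (key, _) in enumerate(log)}
        # Pass 2: keep a record only if it is the most recent one for its key and
        # is not a tombstone. The log itself answers the liveness question.
        logs[group] = [(k, v) for pos, (k, v) in enumerate(log)
                       if latest[k] == pos and v is not TOMBSTONE]

    put(b"a", b"1"); put(b"a", b"2"); put(b"b", TOMBSTONE)
    for g in range(NUM_GROUPS):
        gc_group(g)
    print(logs[group_of(b"a")])         # only (b"a", b"2") survives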

TitanDB is part of TiDB, and TiDB is thriving.

  • A WAL is used for new writes. This might make it easier to compress data on the first write to the value logs.
  • GC appears to do index searches to determine liveness.
  • Compression is per-record. I hope it supports per-block compression in the future.
  • It might let the user tune between space-amp and write-amp via discardable_ratio.
  • It is compatible with most of the RocksDB API.

RocksDB BlobDB is an extension to RocksDB that uses log segments for large values and stores small values in the LSM. GC copies live values and liveness is determined by an index search.

Future work

Future work for index+log systems includes:

  • Determine whether a b-tree is better than an LSM for the index structure
  • Determine whether the HashKV solution is the best way to avoid liveness queries during GC.
  • If an LSM is used for the index structure, determine efficient ways to support relocating values during GC without extra overheads and complexity during read processing.
  • Determine whether clever things can be done during GC.
    • Block compression is easier than on first write.
    • Hot/cold value separation is possible (see HashKV). This is an example of generational GC, even if we rarely mention that for index structures.
    • Values in a log segment can be ordered by key rather than by time of write during GC. GC can also merge multiple ordered segments to create longer sorted runs. I wonder if it is then possible to use block indexes (key+pointer per block rather than per row) to reduce cache-amp for such log segments.
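
A rough sketch of that last idea, under my own assumptions about the format (this is not an existing system): GC writes the live records sorted by key and keeps one (first key, offset) entry per block, so the per-segment index is per-block rather than per-row.

    # GC output: a key-sorted segment plus a sparse block index (one entry per block).
    import bisect

    BLOCK_SIZE = 4096

    def write_sorted_segment(live_records):
        # live_records: (key, value) pairs that GC decided to keep.
        data = bytearray()
        block_index = []                          # (first key in block, block offset)
        block_start, first_key = 0, None
        for key, value in sorted(live_records):
            rec = (len(key).to_bytes(4, "big") + key +
                   len(value).to_bytes(4, "big") + value)
            if first_key is None:
                first_key = key
            elif len(data) - block_start + len(rec) > BLOCK_SIZE:
                block_index.append((first_key, block_start))
                block_start, first_key = len(data), key
            data += rec
        if first_key is not None:
            block_index.append((first_key, block_start))
        return bytes(data), block_index

    def block_for(block_index, key):
        # The candidate block is the last one whose first key is <= the search key;
        # that block is then scanned, so the index costs one entry per block.
        first_keys = [k for k, _ in block_index]
        i = bisect.bisect_right(first_keys, key) - 1
        return block_index[max(i, 0)][1]

With that layout a lookup into a log segment touches one block-index entry per block rather than one index entry per row, which is the cache-amp reduction the bullet asks about.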
