Part I – Foundations of Data Systems
Chapter 1 – Reliable, Scalable, and Maintainable Applications
Reliability / 可用性
The system should continue to work correctly (performing the correct function at the desired level of performance) even in the face of adversity (hardware or soft‐ ware faults, and even human error).
Note that a fault is not the same as a failure . A fault is usually defined as one component of the system deviating from its spec, whereas a failure is when the system as a whole stops providing the required service to the user. 注意区分 fault 和 failure
- hardware faults：硬盘坏了
- software errors：内存泄漏
- human errors：运维操作失误
Scalability / 可扩展性
As the system grows (in data volume, traffic volume, or complexity), there should be reasonable ways of dealing with that growth.
Even if a system is working reliably today, that doesn’t mean it will necessarily work reliably in the future
. One common reason for degradation is increased load.
Scalability is the term we use to describe a system’s ability to cope with increased load.
Note, however, that it is not a one-dimensional label that we can attach to a system: it is meaningless to say “X is scalable” or “Y doesn’t scale.”
Rather, discussing scalability means considering questions like “If the system grows in a particular way, what are our options for coping with the growth?” and “How can we add computing resources to handle the additional load?”
First, we need to succinctly describe the current load on the system; only then can we discuss growth questions. 首先要明确现状
You can look at it in two ways:
- When you increase a load parameter and keep the system resources (CPU, mem‐ ory, network bandwidth, etc.) unchanged, how is the performance of your system affected?
- When you increase a load parameter, how much do you need to increase the resources if you want to keep performance unchanged?
percentile：For example, if the 95th percentile response time is 1.5 seconds, that means 95 out of 100 requests take less than 1.5 seconds, and 5 out of 100 requests take 1.5 seconds or more.
Approaches for Coping with Load
An architecture that is appropriate for one level of load is unlikely to cope with 10 times that load.
- vertical scaling, moving to a more powerful machine
- horizontal scaling, distributing the load across multiple smaller machines
An architecture that scales well for a particular application is built around assumptions of which operations will be common and which will be rare. 没有万能的架构，都是根据场景而来的，不能照搬
In an early-stage startup or an unpro‐ ven product it’s usually more important to be able to iterate quickly on product features than it is to scale to some hypothetical future load. 前期先重点关注feature
Maintainability / 可维护性
Over time, many different people will work on the system (engineering and oper‐ ations, both maintaining current behavior and adapting the system to new use cases), and they should all be able to work on it productively
Simplicity: Managing Complexity
accidental complexity: we define complexity as accidental if it is not inherent in the problem that the software solves (as seen by the users) but arises only from the implementation.
CHAPTER 2 – Data Models and Query Languages
each layer hides the complexity of the layers below it by providing a clean data model.
The Object-Relational Mismatch
Most application development today is done in object-oriented programming lan‐ guages, which leads to a common criticism of the SQL data model: if data is stored in relational tables, an awkward translation layer is required between the objects in the application code and the database model of tables, rows, and columns. The discon‐ nect between the models is sometimes called an impedance mismatch
However, when it comes to representing many-to-one and many-to-many relation‐ ships, relational and document databases are not fundamentally different: in both cases, the related item is referenced by a unique identifier, which is called a foreign key in the relational model and a document reference in the document model . That identifier is resolved at read time by using a join or follow-up queries.
如何选择？realational model OR document model？
If the data in your application has a document-like structure (i.e., a tree of one-to-many relationships, where typically the entire tree is loaded at once), then it’s probably a good idea to use a document model.
However, if your application does use many-to-many relationships, the document model becomes less appealing. It’s possible to reduce the need for joins by denormal‐ izing, but then the application code needs to do additional work to keep the denor‐ malized data consistent. Joins can be emulated in application code by making multiple requests to the database, but that also moves complexity into the application and is usually slower than a join performed by specialized code inside the database. In such cases, using a document model can lead to significantly more complex appli‐ cation code and worse performance
Schema flexibility in the document model
Document databases are sometimes called schemaless, but that’s misleading, as the code that reads the data usually assumes some kind of structure—i.e., there is an implicit schema, but it is not enforced by the database. 所谓的schemaless
Storage and Retrieval
- log-structured storage engines
- page-oriented storage engines, such as B-trees
In this book, log
is used in the more general sense: an append-only
sequence of records. It doesn’t have to be human-readable; it might be binary and intended only for other programs to read.
An index is an additional structure that is derived from the primary data
. Many databases allow you to add and remove indexes, and this doesn’t affect the contents of the database; it only affects the performance of queries. Maintaining additional structures incurs overhead, especially on writes.
keep an in-memory hash map
where every key is mapped to a byte offset in the data file—the location at which the value can be found.
In this kind of workload, there are a lot of writes, but there are not too many distinct keys—you have a large number of writes per key, but it’s feasible to keep all keys in memory. 适用场景是distinct key的数量不多，而且大部分操作是针对同一个key的update
- The hash table must fit in memory, so if you have a very large number of keys, you’re out of luck. 不适用于有很多key的场景
- Range queries are not efficient. 区间查询效率不高，因为只能loop所有的key
SSTables and LSM-Trees
SSTable(Sorted String Table): 日志中的记录是按
Merging segments is simple and efficient. 因为每个日志文件都是按
In order to find a particular key in the file, you no longer need to keep an index of all the keys in memory. 不需要将所有的
LSM-Trees： 是 Log-Structured Merge-Tree 的缩写。Storage engines that are based on this principle of merging and compacting sorted files are often called LSM storage engines
Lucene（ES用到的索引引擎）也是用的类似的数据结构来实现的全文搜索。 This is implemented with a key-value structure where the key is a word (a term) and the value is the list of IDs of all the documents that contain the word (the postings list).
Even though there are many subtleties, the basic idea of LSM-trees—keeping a cascade of SSTables that are merged in the background—is simple and effective
. Even when the dataset is much bigger than the available memory it continues to work well. Since data is stored in sorted order, you can efficiently perform range queries
(scanning all keys above some minimum and up to some maximum), and because the disk writes are sequential the LSM-tree can support remarkably high write throughput.
与 log-structured indexes 的区别：
The log-structured indexes we saw earlier break the database down into variable-size segments
, typically several megabytes or more in size, and always write a segment sequentially
B-trees break the database down into fixed-size blocks or pages
, traditionally 4 KB in size (sometimes bigger), and read or write one page at a time. This design corresponds more closely to the underlying hardware, as disks are also arranged in fixed-size blocks.
- The basic underlying write operation of a B-tree is to overwrite a page on disk with new data. This is in stark contrast to log-structured indexes such as LSM-trees, which only append to files (and eventually delete obsolete files) but never modify files in place. B-tree 会 in-place update，而 LSM-tree只会 append（这是函数式编程的风格）
The number of references to child pages in one page of the B-tree is called the branching factor
. For example, in Figure 3-6 the branching factor is six. In practice, the branching factor depends on the amount of space required to store the page references and the range boundaries, but typically it is several hundred.
如何update某个key / val呢？
从 root 开始，类似二分查找，直到找到包含这个key的page，然后更新即可，注意更新是以
a B-tree with n keys always has a depth of O(log n). Most databases can fit into a B-tree that is three or four levels
deep, so you don’t need to follow many page references to find the page you are looking for. ( A four-level tree of 4 KB pages with a branching factor of 500 can store up to 256 TB
当数据库突然挂掉时，B-tree 仍然也会丢数据，解决方案仍然也是万金油：append-only log，不过在这里是叫：write-ahead log（WAL, also known as a redo log）This is an append-only file to which every B-tree modification must be written before it can be applied to the pages of the tree itself. When the database comes back up after a crash, this log is used to restore the B-tree back to a consistent state.
Comparing B-Trees and LSM-Trees
As a rule of thumb, LSM-trees are typically faster for writes, whereas B-trees are thought to be faster for reads
. Reads are typically slower on LSM-trees because they have to check several different data structures and SSTables at different stages of compaction.
LSM tree 的优势：
- LSM tree 的写放大（write amplification）相对于b-tree要小
- LSM tree 更容易被压缩，且不容易出现文件碎片，磁盘利用率更高
LSM tree 的劣势
- compaction process 会可能会影响到读和写，导致读写性能不稳定，而 B-tree 是很稳定的。
- 在极端情况下，compaction process 的速度会赶不上记录被写入的速度，导致文件系统被写爆（需要一个监控）
- LSM tree 会出现一个 key 存在于多个 segments 文件的情况，而 b-tree 能保证一个key只在一个 index 文件中，因而也更容易锁住它，更容易实现事务。
Other Indexing Structures
Storing values within the index
The key in an index is the thing that queries search for, but the value can be one of two things: it could be the actual row (document, vertex) in question, or it could be a reference to the row stored elsewhere.
In the latter case, the place where rows are stored is known as a heap file
, and it stores data in no particular order. The heap file approach is common because it avoids duplicating data when multiple secondary indexes are present: each index just references a location in the heap file, and the actual data is kept in one place. 这种方式可以节省空间，比如多个二级索引可以复用一个记录指针，真实记录只在 heap 中存一份即可
In some situations, the extra hop from the index to the heap file is too much of a per‐ formance penalty for reads, so it can be desirable to store the indexed row directly within an index
. This is known as a clustered index
A compromise between a clustered index (storing all row data within the index) and a nonclustered index (storing only references to the data within the index) is known as a covering index or index with included columns
, which stores some of a table’s columns within the index . This allows some queries to be answered by using the index alone. 将部分字段的value一起写在index里，若查询的就是其中的字段则可以直接返回，very smart!
The most common type of multi-column index is called a concatenated index, which simply combines several fields into one key by appending one column to another (the index definition specifies in which order the fields are concatenated)
注意这跟上面说的 covering index 不太一样，这应该是我们平常理解的
Keeping everything in memory
in-memory databases: Redis / Memcached
Counterintuitively, the performance advantage of in-memory databases is not due to the fact that they don’t need to read from disk. Even a disk-based storage engine may never need to read from disk if you have enough memory, because the operating system caches recently used disk blocks in memory anyway. Rather, they can be faster because they can avoid the overheads of encoding in-memory data structures in a form that can be written to disk. 确实挺反直觉的，原来内存数据库快是快在了“避免了从内存数据结构编码成字节”。一般直觉会认为，传统依赖硬盘的数据库的慢是因为它们要读盘、写盘，但其实由于操作系统的
online transaction processing (OLTP) vs online analytic processing (OLAP)
In the early days of business data processing, a write to the database typically corre‐ sponded to a commercial transaction taking place: making a sale, placing an order with a supplier, paying an employee’s salary, etc. As databases expanded into areas that didn’t involve money changing hands, the term transaction nevertheless stuck, referring to a group of reads and writes that form a logical unit
An application typically looks up a small number of records by some key, using an index. Records are inserted or updated based on the user’s input. Because these applications are interactive, the access pattern became known as online transaction processing (OLTP).
However, databases also started being increasingly used for data analytics, which has very different access patterns. Usually an analytic query needs to scan over a huge number of records, only reading a few columns per record
, and calculates aggregate statistics (such as count, sum, or average) rather than returning the raw data to the user.
These OLTP systems are usually expected to be highly available
and to process transactions with low latency
, since they are often critical to the operation of the business. Database administrators therefore closely guard their OLTP databases. They are usually reluctant to let business analysts run ad hoc analytic queries on an OLTP database, since those queries are often expensive, scanning large parts of the dataset, which can harm the performance of concurrently executing transactions. 分析型查询不要在业务库上做
A data warehouse, by contrast, is a separate database that analysts can query to their hearts’ content, without affecting OLTP operations . The data warehouse contains a read-only copy of the data in all the various OLTP systems in the company
. Data is extracted from OLTP databases (using either a periodic data dump or a continuous stream of updates), transformed into an analysis-friendly schema, cleaned up, and then loaded into the data warehouse. This process of getting data into the warehouse is known as Extract–Transform–Load (ETL)
A big advantage of using a separate data warehouse, rather than querying OLTP sys‐ tems directly for analytics, is that the data warehouse can be optimized for analytic access patterns. 把 olap 数据库独立出来的一个好处是，可以针对 olap 类查询做优化
Schemas for Analytics
数据仓库中主要用到的数据模型是：star schema(also known as dimensional modeling)，翻译成
比较合适，里面主要分为两种表：fact table 和 dimension tables。感觉也没什么特别，就是一张主表，关联了多个子表。
The name “star schema” comes from the fact that when the table relationships are visualized, the fact table is in the middle, surrounded by its dimension tables; the connections to these tables are like the rays of a star.
In a typical data warehouse, tables are often very wide: fact tables often have over 100 columns, sometimes several hundred. 数据仓库中的表的字段数量通常很大，超过100，甚至几百个
分析 olap 的查询场景：
In most OLTP databases, storage is laid out in a row-oriented fashion: all the values from one row of a table are stored next to each other. 在大部分 oltp 数据库中， 数据是按行存储的，那么虽然我们只需要几个字段，但查询时需要先将这100多个字段先从硬盘中load进内存，然后再返回这几个需要的字段，就是说不管是 select 几个字段，都得把这一整行先读出来。。
其实就是为了规避这个问题，反其道而行之，将数据按列存储，所有记录的某个字段存为一个文件，那么当 select 某个字段时，直接返回这个文件即可。
The column-oriented storage layout relies on each column file containing the rows in the same order. Thus, if you need to reassemble an entire row, you can take the 23rd entry from each of the individual column files and put them together to form the 23rd row of the table. 但需要注意的是，所有列文件里的记录顺序需要是一致的，要不就没法返回某条完整的记录了。
Take a look at the sequences of values for each column in Figure 3-10: they often look quite repetitive
, which is a good sign for compression. Depending on the data in the column, different compression techniques can be used. One technique that is particularly effective in data warehouses is bitmap encoding
Often, the number of distinct values in a column is small compared to the number of rows (for example, a retailer may have billions of sales transactions, but only 100,000 distinct products). We can now take a column with n distinct values and turn it into n separate bitmaps: one bitmap for each distinct value, with one bit for each row. The bit is 1 if the row has that value, and 0 if not.
If n is very small (for example, a country column may have approximately 200 dis‐ tinct values), those bitmaps can be stored with one bit per row. But if n is bigger, there will be a lot of zeros in most of the bitmaps (we say that they are sparse). In that case, the bitmaps can additionally be run-length encoded. 如果 n 很大，导致 bitmap 很稀疏，还可以用
Writing to Column-Oriented Storage
策略跟 LSM-tree 一样，先写到内存中的balanced tree 中，再定期写入文件
Aggregation: Data Cubes and Materialized Views
Another aspect of data warehouses that is worth mentioning briefly is materialized aggregates. As discussed earlier, data warehouse queries often involve an aggregate function, such as COUNT, SUM, AVG, MIN, or MAX in SQL. If the same aggregates are used by many different queries, it can be wasteful to crunch through the raw data every time. Why not cache some of the counts or sums that queries use most often? 先算一个临时结果
Disk seek time is often the bottleneck in OLTP.
Disk bandwidth (not seek time) is often the bottleneck in OLAP
Chapter 4 – Encoding and Evolution
- Backward compatibility / 新代码需要兼容老结构
- Forward compatibility / 老代码需要兼容新结构
Formats for Encoding Data
Programs usually work with data in (at least) two different representations:
- In memory, data is kept in objects, structs, lists, arrays, hash tables, trees, and so on. These data structures are optimized for efficient access and manipulation by the CPU (typically using pointers).
When you want to write data to a file or send it over the network, you have to encode it as some kind of self-contained sequence of bytes
(for example, a JSON document).
Thus, we need some kind of translation between the two representations. The trans‐ lation from the in-memory representation to a byte sequence is called encoding (also known as serialization or marshalling), and the reverse is called decoding (parsing, deserialization, unmarshalling) 哈哈，编码与解码、序列化和反序列化，说的都是一个东西
JSON, XML, and Binary Variants
JSON distinguishes strings and numbers, but it doesn’t distinguish integers and floating-point numbers, and it doesn’t specify a precision. JSON不能区分整数和浮点数，也不能指定精度
JSON and XML have good support for Unicode character strings (i.e., human- readable text), but they don’t support binary strings (sequences of bytes without a character encoding) 不支持 binary strings，解决方案是 base64
JSON 的二进制变种：MessagePack, BSON, BJSON, UBJSON, BISON, and Smile
The binary encoding is 66 bytes long, which is only a little less than the 81 bytes taken by the textual JSON encoding (with whitespace removed). All the binary encodings of JSON are similar in this regard. It’s not clear whether such a small space reduction (and perhaps a speedup in parsing) is worth the loss of human-readability. 这是 MessagePack 的情况，可以看到，二进制版本并没有比普通版本减少多少空间占用。
Thrift and Protocol Buffers
The big difference compared to Figure 4-1 is that there are no field names (userName, favoriteNumber, interests). Instead, the encoded data contains field tags, which are numbers (1, 2, and 3). 跟上面的二进制变体的最大区别是，不需要字段名称，只要一个tag，因为可以提前定义好 schema
当新增和移除字段时，Thrift 和 Protocol Buffers 都可以做到 backward and forward compatibility
Modes of Dataflow
Dataflow Through Databases
In a database, the process that writes to the database encodes the data, and the pro‐ cess that reads from the database decodes it. 有意思，这也算一种数据流动的方式
Dataflow Through Services: REST and RPC
This way of building applications has traditionally been called a service- oriented architecture (SOA), more recently refined and rebranded as microservices architecture. 哈哈，同样的东西换个说法又出来了
REST vs SOAP
REST is not a protocol, but rather a design philosophy that builds upon the principles of HTTP [34, 35]. It emphasizes simple data formats, using URLs for identifying resources and using HTTP features for cache control, authentication, and content type negotiation. REST 的特点是能借用 http 的就用
By contrast, SOAP is an XML-based protocol for making network API requests.vii Although it is most commonly used over HTTP, it aims to be independent from HTTP and avoids using most HTTP features. Instead, it comes with a sprawling and complex multitude of related standards. SOAP 的特点是能不用 http 的就不用，啥都自己搞
The problems with remote procedure calls (RPCs)
作者认为 RPC 从根本上就是 flawed，它的出发点是好的，想要把通过网络的访问封装成本地调用的效果，但网络请求跟本地请求有很大差异，这是绕不过去的：
Custom RPC protocols with a binary encoding format can achieve better perfor‐ mance than something generic like JSON over REST. However, a RESTful API has other significant advantages: it is good for experimentation and debugging (you can simply make requests to it using a web browser or the command-line tool curl, without any code generation or software installation), it is supported by all main‐ stream programming languages and platforms, and there is a vast ecosystem of tools available (servers, caches, load balancers, proxies, firewalls, monitoring, debugging tools, testing tools, etc.) RPC 的有点是效率高点，而 REST 的巨大优势是：测试和 debug 方便，有大量的工具
- It can act as a buffer if the recipient is unavailable or overloaded, and thus improve system reliability.
- It can automatically redeliver messages to a process that has crashed, and thus prevent messages from being lost.
- It avoids the sender needing to know the IP address and port number of the recipient (which is particularly useful in a cloud deployment where virtual machines often come and go).
- It allows one message to be sent to several recipients.
- It logically decouples the sender from the recipient (the sender just publishes messages and doesn’t care who consumes them).