C++ folly 库解读之 Fbstring:一个完美替代 std::string 的库(下)

书接上文 : C++ folly库解读(一) Fbstring —— 一个完美替代std::string的库(上)

  • 字符串拷贝

    • copySmall

    • copyMedium

    • copyLarge

  • 析构

  • COW

    • non-const operator

    • non-const 与 const

  • Realloc

  • 其他

    • __builtin_expect

    • CMOV 指令

    • builtin_unreachable && assume(0)

    • jemalloc

    • find 算法

    • 判断大小端

字符串拷贝

同初始化,也是根据不同的字符串类型,调用不同的函数:

  fbstring_core(const fbstring_core& rhs) {
assert(&rhs != this);
switch (rhs.category()) {
case Category::isSmall:
copySmall(rhs);
break;
case Category::isMedium:
copyMedium(rhs);
break;
case Category::isLarge:
copyLarge(rhs);
break;
default:
folly::assume_unreachable();
}
}

copySmall

template <class Char>
inline void fbstring_core<Char>:
:copySmall(const fbstring_core& rhs) {
// Just write the whole thing, don't look at details. In
// particular we need to copy capacity anyway because we want
// to set the size (don't forget that the last character,
// which stores a short string's length, is shared with the
// ml_.capacity field).

ml_ = rhs.ml_;
}

正如注释中所说,虽然 small strings 的情况下,字符串存储在 small 中,但是我们只需要把 ml
直接赋值即可,因为在一个 union 中。

copyMedium

template <class Char>
FOLLY_NOINLINE inline void fbstring_core<Char>:
:copyMedium(
const fbstring_core& rhs) {
// Medium strings are copied eagerly. Don't forget to allocate
// one extra Char for the null terminator.
auto const allocSize = goodMallocSize((1 + rhs.ml_.size_) * sizeof(Char));
ml_.data_ = static_cast<Char*>(checkedMalloc(allocSize));
// Also copies terminator.
fbstring_detail::podCopy(
rhs.ml_.data_, rhs.ml_.data_ + rhs.ml_.size_ + 1, ml_.data_);
ml_.size_ = rhs.ml_.size_;
ml_.setCapacity(allocSize / sizeof(Char) - 1, Category::isMedium);
}

medium strings 是 eager copy,所以就是”深拷贝”:

  • 为字符串分配空间、拷贝

  • 赋值 size、capacity、category.

copyLarge

template <class Char>
FOLLY_NOINLINE inline void fbstring_core<Char>:
:copyLarge(
const fbstring_core& rhs) {
// Large strings are just refcounted
ml_ = rhs.ml_;
RefCounted::incrementRefs(ml_.data_);
}

large strings 的 copy 过程很直观,因为是 COW 方式:

  • 直接赋值 ml,内含指向共享字符串的指针。

  • 共享字符串的引用计数加 1。

incrementRefs 和内部调用 fromData 这两个个函数值得看一下:

static RefCounted* fromData(Char* p) {
return static_cast<RefCounted*>(static_cast<void*>(
static_cast<unsigned char*>(static_cast<void*>(p)) -
getDataOffset()));
}

static void incrementRefs(Char* p) {
fromData(p)->refCount_.fetch_add(1, std::memory_order_acq_rel);
}

因为 ml 中指向的是 RefCounted 的 data
[1],所以我们需要通过 fromData 来找到 data_所属的 RefCounted 的地址。我把 fromData 函数内的运算拆开:

static RefCounted * fromData(Char * p) {
// 转换data_[1]的地址
void* voidDataAddr = static_cast<void*>(p);
unsigned char* unsignedDataAddr = static_cast<unsigned char*>(voidDataAddr);

// 获取data_[1]在结构体的偏移量再相减,得到的就是所属RefCounted的地址
unsigned char* unsignedRefAddr = unsignedDataAddr - getDataOffset();

void* voidRefAddr = static_cast<void*>(unsignedRefAddr);
RefCounted* refCountAddr = static_cast<RefCounted*>(voidRefAddr);

return refCountAddr;
}

值得关注的是如何转换不同类型结构体的指针并做运算,这里的做法是: Char* -> void* -> unsigned char* -> 与size_t做减法 -> void * -> RefCounted*

析构

~fbstring_core() noexcept {
if (category() == Category::isSmall) {
return;
}
destroyMediumLarge();
}
  • 如果是 small 类型,直接返回,因为利用的是栈空间。

  • 否则,针对 medium 和 large,调用 destroyMediumLarge。

FOLLY_MALLOC_NOINLINE void destroyMediumLarge() noexcept {
auto const c = category();
FBSTRING_ASSERT(c != Category::isSmall);
if (c == Category::isMedium) {
free(ml_.data_);
} else {
RefCounted::decrementRefs(ml_.data_);
}
}
  • medium :free 动态分配的字符串内存即可。

  • large : 调用 decrementRefs,针对共享字符串进行操作。

static void decrementRefs(Char * p) {
auto const dis = fromData(p);
size_t oldcnt = dis->refCount_.fetch_sub(1, std::memory_order_acq_rel);
FBSTRING_ASSERT(oldcnt > 0);
if (oldcnt == 1) {
free(dis);
}
}

逻辑也很清晰:先对引用计数减 1,如果本身就只有 1 个引用,那么直接 free 掉整个 RefCounted。

COW

最重要的一点,也是 large strings 独有的,就是 COW. 任何针对字符串写的操作,都会触发 COW,包括前面举过的[]操作,例如:

  • non-constat(size_n)

  • non-constoperator[](size_type pos “”)

  • operator+

  • append

  • ……

我们举个例子,比如 non-const
operator[](size_type pos “”)

non-const operator[](size_type pos “”)

reference operator[](size_type pos "") {
return *(begin() + pos);
}

iterator begin() {
return store_.mutableData();
}

来重点看下 mutableData() :

Char* mutableData() {
switch (category()) {
case Category::isSmall:
return small_;
case Category::isMedium:
return ml_.data_;
case Category::isLarge:
return mutableDataLarge();
}
fbstring_detail::assume_unreachable();
}

template <class Char>
inline Char* fbstring_core<Char>:
:mutableDataLarge() {
FBSTRING_ASSERT(category() == Category::isLarge);
if (RefCounted::refs(ml_.data_) > 1) { // Ensure unique.
unshare();
}
return ml_.data_;
}

同样是分三种情况。small 和 medium 直接返回字符串的地址,large 会调用 mutableDataLarge(), 可以看出,如果引用数大于 1,会进行 unshare 操作 :

void unshare(size_t minCapacity = 0);

template <class Char>
FOLLY_MALLOC_NOINLINE inline void fbstring_core<Char>:
:unshare(
size_t minCapacity) {
FBSTRING_ASSERT(category() == Category::isLarge);
size_t effectiveCapacity = std::max(minCapacity, ml_.capacity());
auto const newRC = RefCounted::create(&effectiveCapacity);
// If this fails, someone placed the wrong capacity in an
// fbstring.
FBSTRING_ASSERT(effectiveCapacity >= ml_.capacity());
// Also copies terminator.
fbstring_detail::podCopy(ml_.data_, ml_.data_ + ml_.size_ + 1, newRC->data_);
RefCounted::decrementRefs(ml_.data_);
ml_.data_ = newRC->data_;
ml_.setCapacity(effectiveCapacity, Category::isLarge);
// size_ remains unchanged.
}

基本思路:

  • 创建新的 RefCounted。

  • 拷贝字符串。

  • 对原有的共享字符串减少一个引用 decrementRefs,这个函数在上面的析构小节里分析过。

  • 设置 ml_的 data、capacity、category.

  • 注意此时还不会设置 size,因为还不知道应用程序对字符串进行什么修改。

non-const 与 const

大家可能注意到了,上面的 at 和[]强调了 non-const,这是因为 const-qualifer 针对这两个调用不会触发 COW ,还以[]为例:

// C++11 21.4.5 element access:
const_reference operator[](size_type pos "") const {
return *(begin() + pos);
}

const_iterator begin() const {
return store_.data();
}

// In C++11 data() and c_str() are 100% equivalent.
const Char* data() const { return c_str(); }

const Char* c_str() const {
const Char* ptr = ml_.data_;
// With this syntax, GCC and Clang generate a CMOV instead of a branch.
ptr = (category() == Category::isSmall) ? small_ : ptr;
return ptr;
}

可以看出区别,non-const 版本的 begin()中调用的是 mutableData(),而 const-qualifer 版本调用的是 data() -> c_str(),而 c_str()直接返回的字符串地址。

所以,当字符串用到[]、at 且不需要写操作时,最好用 const-qualifer.

我们拿 folly 自带的 benchmark 工具
[1]
测试一下:

#include "folly/Benchmark.h"
#include "folly/FBString.h"
#include "folly/container/Foreach.h"

using namespace std;
using namespace folly;

BENCHMARK(nonConstFbstringAt, n) {
::folly::fbstring str(
"fbstring is a drop-in replacement for std::string. The main benefit of fbstring is significantly increased "
"performance on virtually all important primitives. This is achieved by using a three-tiered storage strategy "
"and by cooperating with the memory allocator. In particular, fbstring is designed to detect use of jemalloc and "
"cooperate with it to achieve significant improvements in speed and memory usage.")
;
FOR_EACH_RANGE(i, 0, n) {
char &s = str[2];
doNotOptimizeAway(s);
}
}

BENCHMARK_DRAW_LINE();

BENCHMARK_RELATIVE(constFbstringAt, n) {
const ::folly::fbstring str(
"fbstring is a drop-in replacement for std::string. The main benefit of fbstring is significantly increased "
"performance on virtually all important primitives. This is achieved by using a three-tiered storage strategy "
"and by cooperating with the memory allocator. In particular, fbstring is designed to detect use of jemalloc and "
"cooperate with it to achieve significant improvements in speed and memory usage.")
;
FOR_EACH_RANGE(i, 0, n) {
const char &s = str[2];
doNotOptimizeAway(s);
}
}
int main() { runBenchmarks(); }

结果是 constFbstringAt 比 nonConstFbstringAt 快了 175%

============================================================================
delve_folly/main.cc relative time/iter iters/s
============================================================================
nonConstFbstringAt 39.85ns 25.10M
----------------------------------------------------------------------------
constFbstringAt 175.57% 22.70ns 44.06M
============================================================================

Realloc

reserve、operator+等操作,可能会涉及到内存重新分配,最终调用的都是 memory/Malloc.h 中的 smartRealloc:

inline void* checkedRealloc(void* ptr, size_t size) {
void* p = realloc(ptr, size);
if (!p) {
throw_exception<std::bad_alloc>();
}
return p;
}

/**
* This function tries to reallocate a buffer of which only the first
* currentSize bytes are used. The problem with using realloc is that
* if currentSize is relatively small _and_ if realloc decides it
* needs to move the memory chunk to a new buffer, then realloc ends
* up copying data that is not used. It's generally not a win to try
* to hook in to realloc() behavior to avoid copies - at least in
* jemalloc, realloc() almost always ends up doing a copy, because
* there is little fragmentation / slack space to take advantage of.
*/

FOLLY_MALLOC_CHECKED_MALLOC FOLLY_NOINLINE inline void* smartRealloc(
void* p,
const size_t currentSize,
const size_t currentCapacity,
const size_t newCapacity)
{
assert(p);
assert(currentSize <= currentCapacity &&
currentCapacity < newCapacity);

auto const slack = currentCapacity - currentSize;
if (slack * 2 > currentSize) {
// Too much slack, malloc-copy-free cycle:
auto const result = checkedMalloc(newCapacity);
std::memcpy(result, p, currentSize);
free(p);
return result;
}
// If there's not too much slack, we realloc in hope of coalescing
return checkedRealloc(p, newCapacity);
}

从注释和代码看为什么函数起名叫 smart
Realloc :

  • 如果(the currentCapacity – currentSize) _ 2 > currentSize,即 currentSize < 2/3 _ capacity,说明当前分配的内存利用率较低,此时认为如果使用 realloc 并且 realloc 决定拷贝当前内存到新内存,成本会高于直接 malloc(newCapacity) + memcpy + free(old_memory)。

  • 否则直接 realloc.

其他

__builtin_expect

给编译器提供分支预测信息。原型为:

long __builtin_expect (long exp, long c)

表达式的返回值为 exp 的值,跟 c 无关。我们预期 exp 的值是 c。例如下面的例子,我们预期 x 的值是 0,所以这里提示编译器,只有很小的几率会调用到 foo()

if (__builtin_expect (x, 0))
foo ();

再比如判断指针是否为空:

if (__builtin_expect (ptr != NULL, 1))
foo (*ptr);

在 fbstring 中也用到了 builtin_expect,例如创建 RefCounted 的函数 (FOLLY_LIKELY 包装了一下
builtin_expect):

#if __GNUC__
#define FOLLY_DETAIL_BUILTIN_EXPECT(b, t) (__builtin_expect(b, t))
#else
#define FOLLY_DETAIL_BUILTIN_EXPECT(b, t) b
#endif

// Likeliness annotations
//
// Useful when the author has better knowledge than the compiler of whether
// the branch condition is overwhelmingly likely to take a specific value.
//
// Useful when the author has better knowledge than the compiler of which code
// paths are designed as the fast path and which are designed as the slow path,
// and to force the compiler to optimize for the fast path, even when it is not
// overwhelmingly likely.

#define FOLLY_LIKELY(x) FOLLY_DETAIL_BUILTIN_EXPECT((x), 1)
#define FOLLY_UNLIKELY(x) FOLLY_DETAIL_BUILTIN_EXPECT((x), 0)

static RefCounted* create(const Char* data, size_t* size) {
const size_t effectiveSize = *size;
auto result = create(size);
if (FOLLY_LIKELY(effectiveSize > 0)) { // __builtin_expect
fbstring_detail::podCopy(data, data + effectiveSize, result->data_);
}
return result;
}

从汇编角度来说,会将可能性更大的汇编紧跟着前面的汇编,防止无效指令的加载。可以参考:

  • likely() and unlikely()
    [2]

  • What is the advantage of GCC’s \_\_builtin_expect in if else statements?
    [3]

CMOV 指令

conditional move,条件传送。类似于 MOV 指令,但是依赖于 RFLAGS 寄存器内的状态。如果条件没有满足,该指令不会有任何效果。

CMOV 的优点是可以避免分支预测,避免分支预测错误对 CPU 流水线的影响。详细可以看这篇文档: amd-cmovcc.pdf
[4]

fbstring 在一些场景会提示编译器生成 CMOV 指令,例如:

const Char* c_str() const {
const Char* ptr = ml_.data_;
// With this syntax, GCC and Clang generate a CMOV instead of a branch.
ptr = (category() == Category::isSmall) ? small_ : ptr;
return ptr;
}

builtin_unreachable && assume(0)

如果程序执行到了 builtin_unreachable 和
assume(0) ,那么会出现未定义的行为。例如**builtin_unreachable 出现在一个不会返回的函数后面,而且这个函数没有声明为 **attribute\_\_((noreturn))
。例如6.59 Other Built-in Functions Provided by GCC给出的例子 :

void function_that_never_returns (void);

int g (int c)
{
if (c)
{
return 1;
}
else
{
function_that_never_returns ();
__builtin_unreachable ();
}
}

如果不加 __builtin_unreachable ();
,会报 error: control reaches end of non-void function [-Werror=return-type]

folly 将 builtin_unreachable 和
assume(0) 封装成了 assume_unreachable

[[noreturn]] FOLLY_ALWAYS_INLINE void assume_unreachable() {
assume(false);
// Do a bit more to get the compiler to understand
// that this function really will never return.
#if defined(__GNUC__)
__builtin_unreachable();
#elif defined(_MSC_VER)
__assume(0);
#else
// Well, it's better than nothing.
std::abort();
#endif
}

在 fbstring 的一些特性场景,比如 switch 判断 category 中用到。这是上面提到过的 mutableData() :

Char* mutableData() {
switch (category()) {
case Category::isSmall:
return small_;
case Category::isMedium:
return ml_.data_;
case Category::isLarge:
return mutableDataLarge();
}
folly::assume_unreachable();
}

jemalloc

  • 官网
    [5]

  • API 文档
    [6]

  • 大致的算法和原理可以参考 facebook 的这篇博客 : Scalable memory allocation using jemalloc
    [7]

find 算法

使用的简化的 Boyer-Moore 算法,文档说明是在查找成功的情况下比 std::string 的 find 快 30 倍。benchmark 代码在 FBStringBenchmark.cpp
[8]

我自己测试的情况貌似是搜索长字符串的情况会更好些。

判断大小端

// It's MSVC, so we just have to guess ... and allow an override
#ifdef _MSC_VER
# ifdef FOLLY_ENDIAN_BE
static constexpr auto kIsLittleEndian = false;
# else
static constexpr auto kIsLittleEndian = true;
# endif
#else
static constexpr auto kIsLittleEndian =
__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__;
#endif

__BYTE_ORDER__ 为预定义宏:
[9]
,值是 ORDER_LITTLE_ENDIAN
ORDER_BIG_ENDIAN
ORDER_PDP_ENDIAN
中的一个。

一般会这么使用:

/* Test for a little-endian machine */
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__

c++20 引入了 std::endian
[10]
,判断会更加方便。

(完)

参考资料

[1]

benchmark 工具: https://github.com/facebook/folly/blob/master/folly/docs/Benchmark.md

[2]

likely() and unlikely(): https://kernelnewbies.org/FAQ/LikelyUnlikely

[3]

What is the advantage of GCC’s __builtin_expect in if else statements?: https://stackoverflow.com/questions/7346929/what-is-the-advantage-of-gccs-builtin-expect-in-if-else-statements

[4]

amd-cmovcc.pdf: https://www.cs.tufts.edu/comp/40-2011f/readings/amd-cmovcc.pdf

[5]

官网 : http://jemalloc.net/

[6]

API 文档: http://jemalloc.net/jemalloc.3.html

[7]

Scalable memory allocation using jemalloc: https://engineering.fb.com/2011/01/03/core-data/scalable-memory-allocation-using-jemalloc/

[8]

FBStringBenchmark.cpp: https://github.com/facebook/folly/blob/master/folly/test/FBStringTestBenchmarks.cpp.h#L120

[9]

BYTE_ORDER为预定义宏:: https://gcc.gnu.org/onlinedocs/cpp/Common-Predefined-Macros.html

[10]

std::endian: https://en.cppreference.com/w/cpp/types/endian

公众账号
我还没有学会写个人说明!
上一篇

美FDA称新冠病毒通过食品或食品包装传播的可能性极小

下一篇

预计Northrup的国际空间站补给货物将在当地时间周六升空

你也可能喜欢

评论已经被关闭。

插入图片