Faster PostgreSQL Counting

Everybody counts, but not always quickly. This article is a close look into how PostgreSQL optimizes counting. If you know the tricks there are ways to count rows orders of magnitude faster than you do already.
  The problem is actually underdescribed – there are several variations of counting, each with its own methods. First, think whether you need an exact count or whether an estimate suffices. Next, are you counting duplicates or just distinct values? Finally, do you want a lump count of an entire table or will you want to count only those rows matching extra criteria?
  We’ll analyze the techniques available for each situation and compare their speed and resource consumption. After learning about techniques for a single database we’ll use Citus to demonstrate how to parallelize counts in a distributed database.
  Table of Contents

  
       
  • Preparing the DB for tests
  • Counts With Duplicates
      • Exact Counts
      • Estimated Counts
          • Full Table Estimates
          • Filtered Table Estimates
  • Distinct Counts (No Duplicates)
  • Reference of Techniques
  Preparing the DB for tests

  The sections below use the following table for benchmarks.
  [code]-- create a million random numbers and strings
CREATE TABLE items AS
  SELECT
    (random()*1000000)::integer AS n,
    md5(random()::text) AS s
  FROM
    generate_series(1,1000000);

-- inform planner of big table size change
VACUUM ANALYZE;[/code]

  Counts With Duplicates

  Exact Counts

   Let’s begin at the beginning: exact counts allowing duplication over some or all of a table, good old count(*). Measuring the time to run this command provides a basis for evaluating the speed of other types of counting.
  Pgbench provides a convenient way to run a query repeatedly and collect statistics about performance.
  [code]# Tests in this article were run against PostgreSQL 9.5.4

echo "SELECT count(*) FROM items;" | pgbench -d count -t 50 -P 1 -f -

# average  84.915 ms
# stddev    5.251 ms[/code]   A note about count(1) vs count(*). One might think that count(1) would be faster because count(*) appears to consult the data for a whole row. However the opposite is true. The star symbol is meaningless here, unlike its use in SELECT *. PostgreSQL parses the expression count(*) as a special case taking no arguments. (Historically the expression ought to have been defined as count().) On the other hand count(1) takes an argument and PostgreSQL has to check at every row to see that its argument, 1, is indeed still not NULL.
   Running the above benchmark with count(1) results in:
  [code]# average  98.896 ms
# stddev    7.280 ms[/code]   However both forms, count(1) and count(*), are fundamentally slow. PostgreSQL uses multiversion concurrency control (MVCC) to ensure consistency between simultaneous transactions. This means each transaction may see different rows – and different numbers of rows – in a table. There is no single universal row count that the database could cache, so it must scan through all rows counting how many are visible. Performance for an exact count grows linearly with table size.
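   To see this concretely, here is a small illustration (a sketch, not from the original article) of two sessions arriving at different counts under MVCC; the session labels are comments, not SQL:
  [code]-- session A: take a snapshot and count
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT count(*) FROM items;   -- counts the rows visible to A's snapshot

-- session B: commit a new row while A's transaction is still open
INSERT INTO items (n, s) VALUES (42, 'new row');

-- session A again: its snapshot predates B's insert, so the count is unchanged
SELECT count(*) FROM items;
COMMIT;[/code]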
  [code]EXPLAIN SELECT count(*) FROM items;

Aggregate  (cost=20834.00..20834.01 rows=1 width=0)
  ->  Seq Scan on items  (cost=0.00..18334.00 rows=1000000 width=0)[/code]  The scan accounts for 88% of the total cost. As we double the table size the query time roughly doubles, with the cost of scanning and aggregating growing proportionally with one another.
  Rows    Avg Time
  1M      85 ms
  2M      161 ms
  4M      343 ms

  How can we make this faster? Something has to give: either we can settle for an estimated rather than exact count, or we can cache the count ourselves using a manual increasing-decreasing tally. However in the second case we have to keep a tally for each table and where clause that we want to count quickly later.
   Here’s an example of the tally approach applied to the whole items table. The following trigger-based solution is adapted from A. Elein Mustain . PostgreSQL’s MVCC will maintain consistency between the items table and a table of row counts.
  [code]BEGIN;

CREATE TABLE row_counts (
  relname   text PRIMARY KEY,
  reltuples bigint
);

-- establish initial count
INSERT INTO row_counts (relname, reltuples)
  VALUES ('items', (SELECT count(*) from items));

CREATE OR REPLACE FUNCTION adjust_count()
RETURNS TRIGGER AS
$$
   DECLARE
   BEGIN
   IF TG_OP = 'INSERT' THEN
      EXECUTE 'UPDATE row_counts set reltuples=reltuples +1 where relname = ''' || TG_RELNAME || '''';
      RETURN NEW;
   ELSIF TG_OP = 'DELETE' THEN
      EXECUTE 'UPDATE row_counts set reltuples=reltuples -1 where relname = ''' || TG_RELNAME || '''';
      RETURN OLD;
   END IF;
   END;
$$
LANGUAGE 'plpgsql';

CREATE TRIGGER items_count BEFORE INSERT OR DELETE ON items
  FOR EACH ROW EXECUTE PROCEDURE adjust_count();

COMMIT;[/code]   The speed of reading and updating the cached value is independent of the table size, and reading is very fast. However this technique shifts overhead to inserts and deletes. Without the trigger the following statement takes an average of 4.7 seconds, whereas inserts with the trigger are fifty times slower :
  [code]INSERT INTO items (n, s)
  SELECT
    (random()*1000000)::integer AS n,
    md5(random()::text) AS s
  FROM generate_series(1,1000000);[/code]
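   Reading the cached value back is then a single-row lookup against the row_counts table created above, independent of how large items grows (a quick sketch):
  [code]-- constant-time read of the maintained tally
SELECT reltuples FROM row_counts WHERE relname = 'items';[/code]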

  Estimated Counts

  Full Table Estimates

   The previous “tally” approach for caching table counts makes inserts slow. If we’re willing to accept an estimated rather than exact count we can get fast reads like the tally but with no insert degradation. To do so we can lean on estimates gathered from PostgreSQL subsystems. Two sources are the stats collector and the autovacuum daemon .
  Here are two alternatives for getting the estimate:
  [code]-- Asking the stats collector
SELECT n_live_tup
  FROM pg_stat_all_tables
WHERE relname = 'items';

-- Updated by VACUUM and ANALYZE
SELECT reltuples
  FROM pg_class
WHERE relname = 'items';[/code]  However there’s a more accurate source that is less likely to be stale. Andrew Gierth (RhodiumToad) advises:
  Remember that reltuples isn’t the estimate that the planner actually uses; the planner uses reltuples/relpages multiplied by the current number of pages.
  Here’s the intuition. As the data in a table grows, the average number of rows that fit in a physical page is likely to change less drastically than the total number of rows. We can multiply the average rows per page by up-to-date information about the current number of pages occupied by a table for a more accurate estimation of the current number of rows.
  [code]-- pg_relation_size and block size have fresh info so combine them with
-- the estimation of tuples per page
SELECT
  (reltuples/relpages) * (
    pg_relation_size('items') /
    (current_setting('block_size')::integer)
  )
  FROM pg_class where relname = 'items';[/code]

  Filtered Table Estimates

   The previous section gets estimated counts for entire tables, but is there a way to get one for only those rows matching a condition? Michael Fuhr came up with a clever trick to run EXPLAIN for a query and parse its output.
  [code]CREATE FUNCTION count_estimate(query text) RETURNS integer AS $$
DECLARE
  rec   record;
  rows  integer;
BEGIN
  FOR rec IN EXECUTE 'EXPLAIN ' || query LOOP
    rows := substring(rec."QUERY PLAN" FROM ' rows=([[:digit:]]+)');
    EXIT WHEN rows IS NOT NULL;
  END LOOP;
  RETURN rows;
END;
$$ LANGUAGE plpgsql VOLATILE STRICT;[/code]  We can use the function like this:
  [code]SELECT count_estimate('SELECT 1 FROM items WHERE n < 1000');[/code]   The accuracy of this method relies on the planner which uses several techniques to estimate the selectivity of a where clause and from there the number of rows that will be returned.
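   To get a feel for its accuracy on this dataset, one can compare the estimate with the exact filtered count (a sketch, not part of the original benchmarks):
  [code]-- planner's estimate of matching rows
SELECT count_estimate('SELECT 1 FROM items WHERE n < 1000');

-- exact count for comparison; this one actually scans the matching rows
SELECT count(*) FROM items WHERE n < 1000;[/code]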
  Distinct Counts (No Duplicates)

  Exact Counts

  Default behavior under low memory

   Count with duplicates may be slow, but count distinct is much worse. With limited working memory and no indices, PostgreSQL is unable to optimize much. In its stock configuration PostgreSQL specifies a low memory limit per concurrent query ( work_mem ). On my development machine the default was four megabytes.
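   For reference, the limit can be inspected and raised per session, though the benchmarks below stick with the default (the 64MB value is only an arbitrary example):
  [code]SHOW work_mem;           -- e.g. 4MB in a stock configuration
SET work_mem = '64MB';   -- affects only the current session
RESET work_mem;          -- return to the configured default[/code]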
  Sticking with default, here is the performance for dealing with a million rows.
  [code]# distinct counts over a million rows, using the same pgbench pattern as before
# (the measured timings from the original are not reproduced here;
#  the standard count(DISTINCT n) averaged roughly 742 ms on this dataset, see below)

echo "SELECT count(DISTINCT n) FROM items;" | pgbench -d count -t 50 -P 1 -f -
echo "SELECT count(DISTINCT s) FROM items;" | pgbench -d count -t 50 -P 1 -f -[/code]  Running EXPLAIN shows that the bulk of the query happens in the aggregate, and that running the count on a string column takes longer than the integer column:
   
[The EXPLAIN output, shown as an image in the original, is not reproduced here; the bulk of the cost sits in the Aggregate node.]

   What is happening inside the aggregate though? Its description in the explain output is opaque. We can get an idea by inquiring about a related query, select distinct rather than count distinct.

[The SELECT DISTINCT plan, shown as an image in the original, is not reproduced here; it sorts the table and then filters out the duplicates.]

   Without more work_mem or external data structures like an index, PostgreSQL merge-sorts the table between memory and disk and then iterates through the results removing duplicates, much like the classic Unix combination sort | uniq.
   Sorting takes most of the query time, especially if we select the string column s rather than the integer column n. In both cases the unique filter goes at about the same speed.
  Custom aggregate

  Thomas Vondra created a custom aggregate for counting distinct values in columns of fixed-width types (additionally the types must have at most 64 bits). Without any extra work memory or indices it beats the default sort-based counting. To install:
  
       
  • Clone the project, tvondra/count_distinct   
  • Run make install   
  • In your database: CREATE EXTENSION count_distinct;  
   Thomas explains how the aggregate works in this blog post but the short description is that it builds a sorted array of unique elements in memory, compacting as it goes.
  [code]# benchmark the custom aggregate with the same pgbench pattern as before
# (the measured timings from the original are not reproduced here)

echo "SELECT count_distinct(n) FROM items;" | pgbench -d count -t 50 -P 1 -f -[/code]  This beats the standard count distinct aggregate which took an average of 742 ms for our dataset. Note that custom extensions written in C like count_distinct are not bound by the value of work_mem. The array constructed in this extension can exceed your memory expectations.
  HashAggregate

  When all rows to be counted can fit in work_mem then PostgreSQL uses a hash table to get distinct values:
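   As a sketch of what that looks like (assuming work_mem has been raised enough for the distinct values to fit; plan costs omitted):
  [code]SET work_mem = '1GB';   -- illustrative value, large enough for this table

EXPLAIN SELECT DISTINCT n FROM items;
-- expect a HashAggregate node over a Seq Scan instead of Sort + Unique[/code]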