Cardinality Estimation that involves a join

Author : 모델광   Date : 2023-10-09 (Mon) 12:45   Views : 441
This is a brief note describing how the PostgreSQL planner determines join cardinality. In this article, I will use the simplest possible example of the calculation: a query involving a single column equi-join and perfectly constructed frequency histograms at both ends of the join. 
I will begin with two data sets that exhibit a strong skew in their data distributions. The following experiment was conducted using PostgreSQL 15.1:

CREATE UNLOGGED TABLE t1 AS
select i as ID
     , case when i < 5000 then 'a'
            when i < 8000 then 'b'
            when i < 9000 then 'c'
            when i < 9800 then 'd'
            when i < 9900 then 'e'
            when i < 9950 then 'f'
            when i < 9990 then 'g'
            else 'h'
        end as value
  from generate_series(1,10000) a(i);

CREATE UNLOGGED TABLE t2 AS
select i as ID
     , case when i < 6000 then 'a'
            when i < 7000 then 'b'
            when i < 9100 then 'c'
            when i < 9700 then 'd'
            when i < 9900 then 'e'
            when i < 9950 then 'f'
            when i < 9990 then 'g'
            else 'h'
        end as value
  from generate_series(1,10000) a(i);

analyze t1;
analyze t2;

I have created two tables, each with 10,000 rows and a highly skewed distribution. Then I gathered statistics without changing the default_statistics_target value of 100. Since the VALUE column contains only 8 distinct values, far fewer than the statistics target, PostgreSQL will possess a perfect frequency histogram for the VALUE column in both tables: every value appears in the MCV (most common values) list.
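
As a quick sanity check (assuming nothing has overridden the statistics target at the database or column level), we can confirm that the number of distinct values is far below the target, which is what allows every value to fit in the MCV list:

show default_statistics_target;        -- 100
select count(distinct value) from t1;  -- 8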

I have queried the pg_stats view to examine the statistics of both tables.

select tablename, attname, n_distinct, most_common_vals, most_common_freqs
  from pg_stats
 where tablename in ('t1','t2');
tablename|attname|n_distinct|most_common_vals |most_common_freqs                             |
---------+-------+----------+-----------------+----------------------------------------------+
t1       |id     |      -1.0|                 |NULL                                          |
t1       |value  |       8.0|{a,b,c,d,e,f,g,h}|{0.4999,0.3,0.1,0.08,0.01,0.005,0.004,0.0011} |
t2       |id     |      -1.0|                 |NULL                                          |
t2       |value  |       8.0|{a,c,b,d,e,f,g,h}|{0.5999,0.21,0.1,0.06,0.02,0.005,0.004,0.0011}|


Now I am going to run a query against pg_stats that reports, for each value, the estimated row counts (frequency × 10,000 rows) in the two tables.

with w1 as (
select mcv as value, (mcf*10000)::int as t1_value_cnt
  from pg_stats as ps cross join lateral
   unnest(most_common_vals::text::text[],most_common_freqs::text::numeric[]) as mt(mcv, mcf)
 where tablename in ('t1')
   and attname = 'value'
   )
, w2 as (
select mcv as value, (mcf*10000)::int as t2_value_cnt
  from pg_stats as ps cross join lateral
   unnest(most_common_vals::text::text[],most_common_freqs::text::numeric[]) as mt(mcv, mcf)
 where tablename in ('t2')
   and attname = 'value'
 )
select w1.value, w1.t1_value_cnt, w2.t2_value_cnt, w1.t1_value_cnt*w2.t2_value_cnt as product
     , sum(w1.t1_value_cnt*w2.t2_value_cnt) over() as total
  from w1, w2
 where w1.value = w2.value;

value|t1_value_cnt|t2_value_cnt| product|   total|
-----+------------+------------+--------+--------+
a    |        4999|        5999|29989001|35593222|
b    |        3000|        1000| 3000000|35593222|
c    |        1000|        2100| 2100000|35593222|
d    |         800|         600|  480000|35593222|
e    |         100|         200|   20000|35593222|
f    |          50|          50|    2500|35593222|
g    |          40|          40|    1600|35593222|
h    |          11|          11|     121|35593222|

We can observe that both tables have a highly skewed data distribution. Although the patterns of the two data sets are similar, the frequencies are, of course, not identical. The total I get from this calculation represents the cardinality estimate that the planner will produce for an equi-join between these two tables.
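
In formula form, this is the MCV-versus-MCV part of the calculation performed by the selectivity function eqjoinsel in the PostgreSQL source (src/backend/utils/adt/selfuncs.c): multiply the frequencies of each matching value, sum them, and scale by both row counts.

estimated rows = N1 * N2 * SUM over v of ( f1(v) * f2(v) )
               = 10000 * 10000 * (0.4999*0.5999 + 0.3*0.1 + 0.1*0.21 + 0.08*0.06
                                  + 0.01*0.02 + 0.005*0.005 + 0.004*0.004 + 0.0011*0.0011)
               = 35593222

Here N1 and N2 are the row counts of t1 and t2, and f1(v), f2(v) are the MCV frequencies of value v in each table.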

Now, let's proceed with the test:


explain
select count(*)
  from t1, t2
 where t1.value = t2.value;

Aggregate  (cost=507842.78..507842.79 rows=1 width=8)
  ->  Hash Join  (cost=270.00..418859.72 rows=35593222 width=0)
        Hash Cond: (t2.value = t1.value)
        ->  Seq Scan on t2  (cost=0.00..145.00 rows=10000 width=2)
        ->  Hash  (cost=145.00..145.00 rows=10000 width=2)
              ->  Seq Scan on t1  (cost=0.00..145.00 rows=10000 width=2)

As you can see, the planner's estimate for the hash join is 35593222, which is exactly the sum of the products of the matching value counts derived from the histogram data in the pg_stats view. For each value in t1's histogram, the histogram on t2 tells PostgreSQL how many matching rows each t1 row will contribute to the join.
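
Since ANALYZE samples up to 300 × default_statistics_target rows (30,000 here) and each table holds only 10,000 rows, the statistics were built from the entire table and the MCV frequencies are exact. We can therefore verify that the estimate matches the real row count:

select count(*)
  from t1, t2
 where t1.value = t2.value;

  count
----------
 35593222

The actual count equals the estimate because it is the same sum of products, only computed over the true value counts instead of the sampled frequencies.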

As a refinement on this example, let's add a predicate that restricts the VALUE column, allowing us to verify whether the optimizer accurately estimates the row count.

explain
select count(*)
  from t1, t2
 where t1.value = t2.value
   and t1.value = 'h';

Aggregate  (cost=341.84..341.85 rows=1 width=8)
  ->  Nested Loop  (cost=0.00..341.54 rows=121 width=0)
        ->  Seq Scan on t1  (cost=0.00..170.00 rows=11 width=2)
              Filter: (value = 'h'::text)
        ->  Materialize  (cost=0.00..170.06 rows=11 width=2)
              ->  Seq Scan on t2  (cost=0.00..170.00 rows=11 width=2)
                    Filter: (value = 'h'::text)

The planner's estimate for the nested loop join is 121. With a frequency of 0.0011 for the value 'h', each table is expected to return 0.0011 × 10,000 = 11 rows, and the join estimate is the product: 11 × 11 = 121, exactly the figure we derived from the pg_stats view.

Summary
For a very simple class of queries with perfect frequency histograms, there exists a remarkably straightforward way to calculate the row count that the PostgreSQL planner will predict: pair up the entries of the two frequency histograms by value, multiply the corresponding row counts (frequency × table row count), and sum the products.
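
To avoid re-typing the CTE query for every table pair, the calculation can be wrapped in a helper. This is a minimal sketch, not a built-in facility: the function name join_card_est is my own, it ignores schemas and the non-MCV parts of the estimate, and it assumes both join columns have complete MCV lists.

CREATE OR REPLACE FUNCTION join_card_est(tab1 text, col1 text, tab2 text, col2 text)
RETURNS numeric LANGUAGE sql STABLE AS
$$
  -- collect value/frequency pairs from each column's MCV list
  with s1 as (
    select mcv, mcf::numeric as f
      from pg_stats cross join lateral
           unnest(most_common_vals::text::text[], most_common_freqs) as u(mcv, mcf)
     where tablename = tab1 and attname = col1
  ), s2 as (
    select mcv, mcf::numeric as f
      from pg_stats cross join lateral
           unnest(most_common_vals::text::text[], most_common_freqs) as u(mcv, mcf)
     where tablename = tab2 and attname = col2
  )
  -- sum of frequency products, scaled by both tables' row counts
  select sum(s1.f * s2.f)
         * (select reltuples::numeric from pg_class where relname = tab1)
         * (select reltuples::numeric from pg_class where relname = tab2)
    from s1 join s2 on s1.mcv = s2.mcv;
$$;

-- select join_card_est('t1', 'value', 't2', 'value');  -- should return about 35593222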

Footnote
This is, of course, only a small step towards understanding how PostgreSQL uses histograms, and it covers only one type of query, one that is probably too simple to appear in a production database. However, I will address somewhat more realistic examples in future articles.


postgresdba.com