
Extracting historical data 2

Author: 모델광   Date: 2023-07-17 (Mon) 20:00   Views: 669
A recent post on a Naver Cafe inspired me to write this note.

Oracle/PostgreSQL: a summary of ways to retrieve the latest data from point-in-time history tables : Naver Cafe (naver.com)

I had previously written an article on the same topic here:

https://www.postgresdba.com/bbs/board.php?bo_table=C05&wr_id=234&page=1

While there may not be anything new for most readers in this note, sometimes presenting an old concept in a different way offers fresh insights or better comprehension.

Here is the script provided by the Naver Cafe poster to generate some data.


DROP TABLE IF EXISTS equipments_test;

CREATE TABLE equipments_test (
  equ_num NUMERIC(10,0),
  equ_nm VARCHAR(30),
  equ_tp_nm VARCHAR(10),
  equ_sts_nm VARCHAR(10),
  last_upd_dt timestamp WITHOUT time zone
);

DO $$
DECLARE
  i integer;
BEGIN
  FOR i IN 1 .. 200 LOOP
    INSERT INTO equipments_test (equ_num, equ_nm, equ_tp_nm, equ_sts_nm, last_upd_dt)
    VALUES (i::NUMERIC(10,0), md5(random()::text)::varchar(30), 'SENSOR'::varchar(10), 'ON'::varchar(10), date_trunc('second',current_timestamp)::timestamp);
  END LOOP;
END $$;

create unique index equipments_test_pk on equipments_test(equ_num);
ALTER TABLE equipments_test ADD CONSTRAINT equipments_test_pk PRIMARY KEY USING INDEX equipments_test_pk;
create index equipments_test_x01 on equipments_test(equ_tp_nm);

DROP TABLE IF EXISTS interface_test;
create table interface_test(equ_num NUMERIC(10,0), if_dt timestamp WITHOUT time zone, if_dt_seq NUMERIC(10,0), if_sts_cd varchar(10), comt varchar(50) );

DO $$
DECLARE
  z integer;
  i integer;
  j integer;
BEGIN
  FOR z IN 1 .. 200 LOOP
    FOR i IN 1 .. 100 LOOP
      FOR j IN 1 .. 100 LOOP
        INSERT INTO interface_test (equ_num, if_dt, if_dt_seq, if_sts_cd, comt)
        VALUES (z::NUMERIC(10,0), date_trunc('second',now() - INTERVAL '1 day' * i)::timestamp, j::NUMERIC(10,0), chr(65 + (j % 5))::varchar(10), md5(random()::text)::varchar(50));
      END LOOP;
    END LOOP;
  END LOOP;
END $$;

create unique index interface_test_pk on interface_test(equ_num,if_dt,if_dt_seq);
alter table interface_test add constraint interface_test_pk primary key using index interface_test_pk;

ANALYZE equipments_test;
ANALYZE interface_test;

We created two tables and inserted some data, assuming a business scenario in which we receive various pieces of equipment status information. The EQUIPMENTS_TEST table manages equipment and contains 200 rows. The INTERFACE_TEST table receives equipment status information, holding 100 days of data per piece of equipment with 100 records per day.
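As a quick sanity check (assuming the script above completed without errors), the loop bounds imply the following row counts:

```sql
SELECT count(*) FROM equipments_test;  -- expected: 200
SELECT count(*) FROM interface_test;   -- expected: 2,000,000 (200 equipments x 100 days x 100 records)
```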

Here is the query the poster presented to retrieve the most recent status data of all equipment, along with its execution plan:

select t20.equ_num, t20.equ_nm, t20.equ_sts_nm
     , t10.if_dt, t10.if_dt_seq, t10.if_sts_cd, t10.comt
from   (
        select t1.equ_num, t1.if_dt, t1.if_dt_seq, t1.if_sts_cd, t1.comt
             , row_number() over(partition by t1.equ_num order by t1.if_dt desc, t1.if_dt_seq desc) as rnum
        from   interface_test t1
        where  1 = 1
       ) t10
     , equipments_test t20  
where  1 = 1
and    t10.rnum = 1
and    t20.equ_num = t10.equ_num ;

Hash Join (actual time=3301.672..3736.212 rows=200 loops=1)
  Hash Cond: (t10.equ_num = t20.equ_num)
  Buffers: shared hit=12048 read=8574, temp read=32302 written=32361
  ->  Subquery Scan on t10 (actual time=3301.589..3735.916 rows=200 loops=1)
        Filter: (t10.rnum = 1)
        Buffers: shared hit=12045 read=8574, temp read=32302 written=32361
        ->  WindowAgg (actual time=3301.588..3735.851 rows=200 loops=1)
              Run Condition: (row_number() OVER (?) <= 1)
              Buffers: shared hit=12045 read=8574, temp read=32302 written=32361
              ->  Sort (actual time=3301.575..3550.909 rows=2000000 loops=1)
                    Sort Key: t1.equ_num, t1.if_dt DESC, t1.if_dt_seq DESC
                    Sort Method: external merge  Disk: 129240kB
                    Buffers: shared hit=12045 read=8574, temp read=32302 written=32361
                    ->  Seq Scan on interface_test t1 (actual time=0.031..134.990 rows=2000000 loops=1)
                          Buffers: shared hit=12045 read=8574
  ->  Hash (actual time=0.052..0.052 rows=200 loops=1)
        Buckets: 1024  Batches: 1  Memory Usage: 22kB
        Buffers: shared hit=3
        ->  Seq Scan on equipments_test t20 (actual time=0.007..0.026 rows=200 loops=1)
              Buffers: shared hit=3
Planning:
  Buffers: shared hit=4
Planning Time: 0.172 ms
Execution Time: 3750.834 ms

When I did the same test in PostgreSQL 15.1, I obtained the following execution plan:

Hash Join (actual time=11.490..2484.612 rows=200 loops=1)
  Hash Cond: (t10.equ_num = t20.equ_num)
  Buffers: shared hit=19795 read=30475
  ->  Subquery Scan on t10 (actual time=11.432..2484.038 rows=200 loops=1)
        Filter: (t10.rnum = 1)
        Buffers: shared hit=19793 read=30474
        ->  WindowAgg (actual time=11.431..2483.857 rows=200 loops=1)
              Run Condition: (row_number() OVER (?) <= 1)
              Buffers: shared hit=19793 read=30474
              ->  Incremental Sort (actual time=11.419..2295.185 rows=2000000 loops=1)
                    Sort Key: t1.equ_num, t1.if_dt DESC, t1.if_dt_seq DESC
                    Presorted Key: t1.equ_num
                    Full-sort Groups: 200  Sort Method: quicksort  Average Memory: 31kB  Peak Memory: 31kB
                    Pre-sorted Groups: 200  Sort Method: quicksort  Average Memory: 1322kB  Peak Memory: 1322kB
                    Buffers: shared hit=19793 read=30474
                    ->  Index Scan using interface_test_pk on interface_test t1 (actual time=0.015..304.267 rows=2000000 loops=1)
                          Buffers: shared hit=19793 read=30474
  ->  Hash (actual time=0.051..0.052 rows=200 loops=1)
        Buckets: 1024  Batches: 1  Memory Usage: 22kB
        Buffers: shared hit=2 read=1
        ->  Seq Scan on equipments_test t20 (actual time=0.004..0.026 rows=200 loops=1)
              Buffers: shared hit=2 read=1
Planning:
  Buffers: shared hit=1 read=3
Planning Time: 0.185 ms
Execution Time: 2484.764 ms

The execution plan I got is a little different from that of the poster. In PostgreSQL 15.1, the incremental sort operation was utilized, reducing the elapsed time from 3750 ms to 2484 ms.
I initially assumed that the poster conducted the experiment on PostgreSQL 12 or an earlier version, but the Run Condition node in the poster's plan implies PostgreSQL 15 or later; more likely, the incremental sort was not chosen there because of different planner settings (e.g. enable_incremental_sort) or statistics.
Two things stand out in the plan:
1) The query uses a hash join to combine the two tables.
2) The majority of the time (2295 ms out of 2484 ms) is spent on the Incremental Sort node. Although the index is accessed, the Sort operation was not eliminated.
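For readers who want to reproduce these measurements: the plans in this note match the output format of EXPLAIN with ANALYZE and BUFFERS enabled and cost estimates suppressed, which is my assumption about how they were captured:

```sql
-- Presumed capture command; COSTS OFF matches the plans above, which show no cost estimates.
-- ANALYZE actually executes the query; BUFFERS reports block I/O per node.
EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
select t20.equ_num, t20.equ_nm, t20.equ_sts_nm
     , t10.if_dt, t10.if_dt_seq, t10.if_sts_cd, t10.comt
from   (
        select t1.equ_num, t1.if_dt, t1.if_dt_seq, t1.if_sts_cd, t1.comt
             , row_number() over(partition by t1.equ_num order by t1.if_dt desc, t1.if_dt_seq desc) as rnum
        from   interface_test t1
       ) t10
     , equipments_test t20
where  t10.rnum = 1
and    t20.equ_num = t10.equ_num;
```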

As I mentioned in the previous note titled "Extracting Historical Data", the hash join method is a good choice for large datasets. However, the provided sample data did not appear to be large.
Therefore, I decided to modify the query to use the nested loop strategy.
Here is the query I have rewritten:


select t20.equ_num, t20.equ_nm, t20.equ_sts_nm
     , t10.if_dt, t10.if_dt_seq, t10.if_sts_cd, t10.comt
from  equipments_test t20
join lateral
      (select t1.equ_num, t1.if_dt, t1.if_dt_seq, t1.if_sts_cd, t1.comt
         from   interface_test t1
        where t20.equ_num = t1.equ_num
        order by if_dt desc, if_dt_seq DESC
        fetch next 1 rows only
       ) t10
   on true;

I have utilized a lateral join to retrieve the most recent data. Here is the execution plan:

Nested Loop (actual time=0.024..1.501 rows=200 loops=1)
  Buffers: shared hit=803
  ->  Seq Scan on equipments_test t20 (actual time=0.007..0.022 rows=200 loops=1)
        Buffers: shared hit=3
  ->  Limit (actual time=0.007..0.007 rows=1 loops=200)
        Buffers: shared hit=800
        ->  Index Scan Backward using interface_test_pk on interface_test t1 (actual time=0.007..0.007 rows=1 loops=200)
              Index Cond: (equ_num = t20.equ_num)
              Buffers: shared hit=800
Planning Time: 0.088 ms
Execution Time: 1.527 ms

Two notable aspects of the plan are:
1) The query uses a nested loop join to combine the two tables.
2) The query is accessing the interface_test_pk index 200 times, eliminating the WindowAgg and Incremental Sort operations.

Consequently, the block I/O dropped from 50,270 (19795+30475) to 803, and the elapsed time decreased from 2484 ms to 1.5 ms.

Conclusion
When we extract the most recent data, choosing an appropriate join method is of critical importance. For small datasets, the nested loop join is efficient, while for huge datasets, the hash join is optimal. Thus, in the given sample scenario, as the EQUIPMENTS_TEST table grows larger, the rewritten query will become slower. And at some point the original query will outperform the rewritten query.
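One way to probe that crossover point in a test environment (my own suggestion, not from the original post) is to toggle the standard planner switches and re-measure. For example, forcing the planner away from the hash join for the original query:

```sql
-- SET LOCAL confines the change to this transaction; never disable planner
-- methods globally in production.
BEGIN;
SET LOCAL enable_hashjoin = off;
EXPLAIN (ANALYZE, BUFFERS)
select t20.equ_num, t20.equ_nm, t10.if_dt, t10.if_sts_cd
from   (select t1.equ_num, t1.if_dt, t1.if_sts_cd,
               row_number() over(partition by t1.equ_num
                                 order by t1.if_dt desc, t1.if_dt_seq desc) as rnum
        from   interface_test t1) t10
     , equipments_test t20
where  t10.rnum = 1
and    t20.equ_num = t10.equ_num;
ROLLBACK;
```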

Added on the next day of publication
I apologize for the confusion in my previous comment. I made a hasty remark about the original query outperforming the rewritten query at some point. That statement is incorrect because the original query is poorly written. When I looked at the original query closely, I discovered a significant flaw. Despite having an index on the columns equ_num, if_dt, and if_dt_seq, and despite the query accessing that index, we still have the incremental sort operation in the execution plan. Something in the query prevents the optimizer from removing the sort operation. To remove the incremental sort, I made the following change to the query:

select t20.equ_num, t20.equ_nm, t20.equ_sts_nm
     , t10.if_dt, t10.if_dt_seq, t10.if_sts_cd, t10.comt
from   (
        select t1.equ_num, t1.if_dt, t1.if_dt_seq, t1.if_sts_cd, t1.comt
             , row_number() over(partition by t1.equ_num 
                               order by t1.equ_num desc, t1.if_dt desc, t1.if_dt_seq desc) as rnum
        from   interface_test t1
        where  1 = 1
       ) t10
     , equipments_test t20  
where  1 = 1
and    t10.rnum = 1
and    t20.equ_num = t10.equ_num ;

I simply added t1.equ_num desc to the ORDER BY clause. With equ_num pinned to descending order, the ordering required by the window function (equ_num DESC, if_dt DESC, if_dt_seq DESC) exactly matches a backward scan of the interface_test_pk index, so the optimizer can drop the sort node entirely.
Here is the execution plan for the modified query.

Nested Loop (actual time=0.052..491.942 rows=200 loops=1)
  Buffers: shared hit=31427 read=29093
  ->  Subquery Scan on t10 (actual time=0.048..491.052 rows=200 loops=1)
        Filter: (t10.rnum = 1)
        Buffers: shared hit=31027 read=29093
        ->  WindowAgg (actual time=0.047..490.984 rows=200 loops=1)
              Run Condition: (row_number() OVER (?) <= 1)
              Buffers: shared hit=31027 read=29093
              ->  Index Scan Backward using interface_test_pk on interface_test t1 (actual time=0.039..294.642 rows=2000000 loops=1)
                    Buffers: shared hit=31027 read=29093
  ->  Index Scan using equipments_test_pk on equipments_test t20 (actual time=0.003..0.003 rows=1 loops=200)
        Index Cond: (equ_num = t10.equ_num)
        Buffers: shared hit=400
Planning:
  Buffers: shared hit=8
Planning Time: 0.157 ms
Execution Time: 492.018 ms

Compared to the plan of the original query, a few things stand out in this execution plan:
1) The number of block I/Os increased from 50,270 (19795+30475) to 60,520 (31427+29093), even though the incremental sort was eliminated.
2) The elapsed time decreased from 2484 ms to 492 ms because the incremental sort was removed; the majority of the remaining time is now spent on the WindowAgg operation.
Whenever we encounter a sort operation, the first step is to find a way to eliminate it since sorting can often cause performance degradation.
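As an aside not covered in the original discussion, PostgreSQL's DISTINCT ON is another common idiom for this "latest row per key" pattern; it is worth benchmarking against the variants above:

```sql
-- DISTINCT ON (equ_num) keeps only the first row per equipment according to
-- the ORDER BY, i.e. the row with the largest (if_dt, if_dt_seq).
SELECT DISTINCT ON (t1.equ_num)
       t20.equ_num, t20.equ_nm, t20.equ_sts_nm,
       t1.if_dt, t1.if_dt_seq, t1.if_sts_cd, t1.comt
FROM   interface_test t1
JOIN   equipments_test t20 ON t20.equ_num = t1.equ_num
ORDER BY t1.equ_num, t1.if_dt DESC, t1.if_dt_seq DESC;
```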

   
