Association Entity & Historical Data

Author: 모델광   Date: 2023-09-16 (Sat) 09:52   Views: 376
A good, future-proof data model is one of the most challenging problems when building applications. That is especially true for applications that store historical data. There are two approaches to storing historical data: "time period modeling" and "audit modeling". Choosing the more appropriate approach requires careful analysis.

Recently, I was tasked with designing a table to store historical data regarding the assignment of stores to specific operational organizations.

The business requirements are as follows:
- Stores can have their operational organizations changed only once a day.
- The effective date of the operational organization can vary for each store.
- Each store must belong to at least one operational organization.
- An operational organization may not have any assigned stores.

The following ERD was reviewed to determine which approach, "time period modeling" or "audit modeling", is more suitable.

(ERD image not reproduced here; it showed the STORE and OPER_ORG entities with two candidate association entities between them.)
Within the above ERD, I deliberately designed two association entities. One was tailored for audit modeling and the other for time period modeling.

In the remainder of this note, I am going to assess which model is best suited for the business requirements outlined above.

At first sight, time period modeling appears less viable: PostgreSQL implements an UPDATE by writing a new row version (effectively an insert) and leaving a dead tuple behind, so closing out APP_END_DT on every change imposes the burden of frequent vacuuming. So let's generate some test data and conduct an experiment to verify this deduction.
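As an aside, this row-versioning behavior is easy to observe directly. The sketch below uses a hypothetical scratch table named mvcc_demo; the ctid values in the comments are illustrative, since actual tuple locations may vary:

```sql
-- Hypothetical scratch table to watch UPDATE create a new row version.
create table mvcc_demo(id int, end_dt date);
insert into mvcc_demo values (1, '9999-12-31');
select ctid from mvcc_demo where id = 1;  -- physical tuple location, e.g. (0,1)

update mvcc_demo set end_dt = current_date where id = 1;
select ctid from mvcc_demo where id = 1;  -- a new location, e.g. (0,2);
                                          -- the old version is dead until vacuumed
drop table mvcc_demo;
```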

drop table OPER_ORG;
create table OPER_ORG(
OPER_ID   VARCHAR(10) not NULL,
OPER_NAME VARCHAR(100) not NULL,
OPER_SP_CD VARCHAR(1) not NULL,
INSERT_ID   VARCHAR(100) not NULL
);
insert into oper_org
select i::text
     , i::text||'_OPER_ORG_NM'
     , case when mod(i,5) = 0 then 'T' else 'P' end
     , REPEAT('O',32)
  from generate_series(1,100) a(i);
create unique index oper_org_uk on oper_org(oper_id);

Here is a table designed using an audit modeling approach.

drop table STR_OPER_ORG;  
create table STR_OPER_ORG(
STORE_ID   VARCHAR(10) not NULL,
OPER_ID    VARCHAR(10) not NULL,
APP_DT     DATE not NULL,
INSERT_ID   VARCHAR(100) not NULL
);
insert into str_oper_org
select i as store_id
     , mod(i,100)+1 as oper_id
     , '20230810'::DATE
     , REPEAT('K',32)
  from generate_series(1, 90000) a(i);
insert into str_oper_org
select i as store_id
     , mod(i,100)+1 as oper_id
     , '20230810'::DATE - mod(i,10)-2
     , REPEAT('K',32)
  from generate_series(1, 90000) a(i);
insert into str_oper_org
select i as store_id
     , mod(i,100)+1 as oper_id
     , '20230810'::DATE - 20
     , REPEAT('S',32)
  from generate_series(1, 90000) a(i);
create unique index str_oper_org_uk on str_oper_org(oper_id, app_dt, store_id);
create unique index STR_OPER_OG_UK2 on STR_OPER_ORG(OPER_ID, STORE_ID, APP_DT);

Here is a table designed using a time period modeling approach.

create table STR_OPER_ORG_TP(
STORE_ID   VARCHAR(10) not NULL,
OPER_ID    VARCHAR(10) not NULL,
APP_BEGIN_DT     DATE not NULL,
APP_END_DT       DATE not null default '9999-12-31'::DATE,
INSERT_ID   VARCHAR(100) not NULL
);
insert into STR_OPER_ORG_TP
select STORE_ID, OPER_ID, APP_DT
     , coalesce(LAG(APP_DT-1) over (partition by STORE_ID, OPER_ID order by APP_DT DESC),'99991231'::DATE)
     , INSERT_ID
  from STR_OPER_ORG;
create index STR_OPER_ORG_TP_UK on STR_OPER_ORG_TP(OPER_ID, APP_END_DT, APP_BEGIN_DT, STORE_ID);
drop index if exists STR_OPER_ORG_TP_UK2;
create index STR_OPER_ORG_TP_UK2 on STR_OPER_ORG_TP(APP_END_DT, APP_BEGIN_DT, OPER_ID, STORE_ID);
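Before moving on, note what time period modeling costs at write time. When a store's organization changes, the currently open period must be closed and a new one opened, whereas audit modeling needs only a single INSERT. Here is a minimal sketch with hypothetical values (store '1' moving to organization '7' on 2023-09-01):

```sql
begin;
-- Close the currently open period for the store.
update STR_OPER_ORG_TP
   set APP_END_DT = '2023-09-01'::DATE - 1
 where STORE_ID = '1'
   and APP_END_DT = '9999-12-31'::DATE;
-- Open the new period.
insert into STR_OPER_ORG_TP(STORE_ID, OPER_ID, APP_BEGIN_DT, APP_END_DT, INSERT_ID)
values ('1', '7', '2023-09-01'::DATE, '9999-12-31'::DATE, REPEAT('K',32));
commit;
-- The UPDATE above leaves a dead tuple behind; this is the vacuuming burden
-- mentioned earlier, which insert-only audit modeling avoids.
```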

drop table STORE;
create table STORE(
STORE_ID  VARCHAR(10) not NULL,
STORE_NM  VARCHAR(100) not null,
INSERT_ID VARCHAR(200) not NULL
);
insert into store
select i::text
     , i::text||'_store_name'
     , REPEAT('S',200)
  from generate_series(1, 90000) a(i);
select * from store limit 100;
create unique index store_uk on store(store_id);

analyze STORE, STR_OPER_ORG, STR_OPER_ORG_TP, OPER_ORG;

select PG_RELATION_SIZE('oper_org') as "영업조직"
      , pg_relation_size('str_oper_org') as "매장영업조직_점이력"
      , pg_relation_size('str_oper_org_tp') as "매장영업조직_선분이력"
      , pg_relation_size('store') as "매장";

영업조직 | 매장영업조직_점이력 | 매장영업조직_선분이력 | 매장
---------+---------------------+-----------------------+---------
   16384 |            22372352 |              22806528 | 23781376

Let's assume that we need to retrieve the names of the stores assigned to organization ID '10' as of August 1, 2023.
If we have an association table following the "audit modeling" technique, we can construct a query as follows:


select C.OPER_NAME, A.STORE_NM
  from OPER_ORG C
     , (select distinct on (STORE_ID) OPER_ID, STORE_ID, APP_DT
          from STR_OPER_ORG
         where OPER_ID = '10'
           and APP_DT <= '20230801'::DATE
         order by STORE_ID desc, APP_DT desc
        ) b
      , STORE A
 where C.OPER_ID = B.OPER_ID
   and B.STORE_ID = A.STORE_ID
   and C.OPER_ID = '10'
   ;
Here is the execution plan:

Nested Loop (actual time=0.037..2.333 rows=900 loops=1)
  Buffers: shared hit=2716
  ->  Nested Loop (actual time=0.031..0.627 rows=900 loops=1)
        Buffers: shared hit=16
        ->  Index Scan using oper_org_uk on oper_org c (actual time=0.012..0.013 rows=1 loops=1)
              Index Cond: ((oper_id)::text = '10'::text)
              Buffers: shared hit=2
        ->  Subquery Scan on b (actual time=0.016..0.540 rows=900 loops=1)
              Filter: ((b.oper_id)::text = '10'::text)
              Buffers: shared hit=14
              ->  Unique (actual time=0.015..0.454 rows=900 loops=1)
                    Buffers: shared hit=14
                    ->  Index Only Scan Backward using str_oper_og_uk2 on str_oper_org (actual time=0.014..0.277 rows=1800 loops=1)
                          Index Cond: ((oper_id = '10'::text) AND (app_dt <= '2023-08-01'::date))
                          Heap Fetches: 0
                          Buffers: shared hit=14
  ->  Index Scan using store_uk on store a (actual time=0.002..0.002 rows=1 loops=900)
        Index Cond: ((store_id)::text = (b.store_id)::text)
        Buffers: shared hit=2700
Planning Time: 0.237 ms
Execution Time: 2.400 ms


If you are not familiar with the DISTINCT ON clause, you can write the query with GROUP BY as follows:

select C.OPER_NAME, A.STORE_NM
  from OPER_ORG C
     , (select MAX(OPER_ID) as oper_id, STORE_ID, MAX(APP_DT) as APP_DT
          from STR_OPER_ORG
         where OPER_ID = '10'
           and APP_DT <= '20230801'::DATE
         group by STORE_ID
        ) b
      , STORE A
 where C.OPER_ID = B.OPER_ID
   and B.STORE_ID = A.STORE_ID
   and C.OPER_ID = '10'
   ;
Nested Loop (actual time=0.427..2.617 rows=900 loops=1)
  Buffers: shared hit=2713
  ->  Nested Loop (actual time=0.420..0.619 rows=900 loops=1)
        Buffers: shared hit=13
        ->  Index Scan using oper_org_uk on oper_org c (actual time=0.013..0.014 rows=1 loops=1)
              Index Cond: ((oper_id)::text = '10'::text)
              Buffers: shared hit=2
        ->  HashAggregate (actual time=0.406..0.540 rows=900 loops=1)
              Group Key: str_oper_org.store_id
              Filter: (max((str_oper_org.oper_id)::text) = '10'::text)
              Batches: 1  Memory Usage: 193kB
              Buffers: shared hit=11
              ->  Index Only Scan using str_oper_org_uk on str_oper_org (actual time=0.014..0.183 rows=1800 loops=1)
                    Index Cond: ((oper_id = '10'::text) AND (app_dt <= '2023-08-01'::date))
                    Heap Fetches: 0
                    Buffers: shared hit=11
  ->  Index Scan using store_uk on store a (actual time=0.002..0.002 rows=1 loops=900)
        Index Cond: ((store_id)::text = (str_oper_org.store_id)::text)
        Buffers: shared hit=2700
Planning Time: 0.197 ms
Execution Time: 2.684 ms


If we have an association table following the "time period modeling" technique, we can write a query like this:

select C.OPER_NAME, A.STORE_NM
  from OPER_ORG C
     , STR_OPER_ORG_TP B
     , STORE A
 where C.OPER_ID = '10'
   and C.OPER_ID = B.OPER_ID
   and '2023-08-01'::DATE between B.APP_BEGIN_DT and B.APP_END_DT
   and B.STORE_ID = A.STORE_ID;

And here is the execution plan:

Nested Loop (actual time=0.022..1.976 rows=900 loops=1)
  Buffers: shared hit=2715
  ->  Nested Loop (actual time=0.019..0.224 rows=900 loops=1)
        Buffers: shared hit=15
        ->  Index Scan using oper_org_uk on oper_org c (actual time=0.006..0.007 rows=1 loops=1)
              Index Cond: ((oper_id)::text = '10'::text)
              Buffers: shared hit=2
        ->  Index Only Scan using str_oper_org_tp_uk on str_oper_org_tp b (actual time=0.011..0.148 rows=900 loops=1)
              Index Cond: ((oper_id = '10'::text) AND (app_end_dt >= '2023-08-01'::date) AND (app_begin_dt <= '2023-08-01'::date))
              Heap Fetches: 0
              Buffers: shared hit=13
  ->  Index Scan using store_uk on store a (actual time=0.002..0.002 rows=1 loops=900)
        Index Cond: ((store_id)::text = (b.store_id)::text)
        Buffers: shared hit=2700
Planning:
  Buffers: shared hit=6
Planning Time: 0.185 ms
Execution Time: 2.039 ms

Note that the number of block I/Os decreased slightly and the elapsed time also dropped. When we examine the index accesses on the tables STR_OPER_ORG and STR_OPER_ORG_TP respectively, we can observe that the number of rows accessed decreased from 1800 to 900. As additional historical data accumulates in the association table, the performance of the audit modeling table will degrade, whereas the time period modeling table will maintain consistent performance.

The following table summarizes the strengths and weaknesses of both audit modeling and time period modeling.

                      ADVANTAGES                        DISADVANTAGES
audit modeling        - prevents data duplication       - requires advanced SQL skills to retrieve data
                      - takes up less storage           - poor retrieval performance

time period modeling  - constructing a query is easy    - the APP_END_DT column is a derived attribute
                      - good retrieval performance      - time periods can overlap
                                                        - high chance of table bloat
                                                        - bigger index/table size
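One of the disadvantages above, overlapping time periods, can be guarded against declaratively in PostgreSQL with an exclusion constraint over a date range. This is a sketch, assuming the btree_gist extension can be installed and that the existing rows do not already overlap:

```sql
create extension if not exists btree_gist;

alter table STR_OPER_ORG_TP
  add constraint str_oper_org_tp_no_overlap
  exclude using gist (
    STORE_ID with =,
    daterange(APP_BEGIN_DT, APP_END_DT, '[]') with &&
  );
-- Any INSERT or UPDATE that would give one store two periods covering
-- the same date now fails with a conflict error.
```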

Conclusion
Given the particular business requirements described at the beginning of this note, time period modeling is more favorable than audit modeling.

Footnote

To refine the conclusion I made here, I conducted an additional experiment. Let's suppose that we need to retrieve the organization names of all the stores.
If we possess an association table following the "audit modeling" technique, we can formulate a query as follows:


select A.STORE_ID, A.STORE_NM, C.OPER_NAME
  from OPER_ORG C
     , (select distinct on (OPER_ID, STORE_ID) OPER_ID, STORE_ID, APP_DT
          from str_oper_org
         where APP_DT <= '20230801'::DATE
         order by OPER_ID desc, STORE_ID desc, APP_DT DESC
        ) b
      , STORE A
 where C.OPER_ID = B.OPER_ID
   and B.STORE_ID = A.STORE_ID
   ;
Hash Join (actual time=18.826..80.352 rows=90000 loops=1)
  Hash Cond: ((str_oper_org.store_id)::text = (a.store_id)::text)
  Buffers: shared hit=3943
  ->  Hash Join (actual time=0.061..38.456 rows=90000 loops=1)
        Hash Cond: ((str_oper_org.oper_id)::text = (c.oper_id)::text)
        Buffers: shared hit=1040
        ->  Unique (actual time=0.021..25.142 rows=90000 loops=1)
              Buffers: shared hit=1038
              ->  Index Only Scan Backward using str_oper_og_uk2 on str_oper_org (actual time=0.020..12.649 rows=117000 loops=1)
                    Index Cond: (app_dt <= '2023-08-01'::date)
                    Heap Fetches: 0
                    Buffers: shared hit=1038
        ->  Hash (actual time=0.028..0.029 rows=100 loops=1)
              Buckets: 1024  Batches: 1  Memory Usage: 13kB
              Buffers: shared hit=2
              ->  Seq Scan on oper_org c (actual time=0.010..0.018 rows=100 loops=1)
                    Buffers: shared hit=2
  ->  Hash (actual time=18.690..18.690 rows=90000 loops=1)
        Buckets: 131072  Batches: 1  Memory Usage: 5837kB
        Buffers: shared hit=2903
        ->  Seq Scan on store a (actual time=0.005..8.637 rows=90000 loops=1)
              Buffers: shared hit=2903
Planning Time: 0.320 ms
Execution Time: 82.876 ms


If you are DISTINCT ON shy, you can write the query with GROUP BY as follows:

select A.STORE_ID, A.STORE_NM, C.OPER_NAME
  from OPER_ORG C
     , (select OPER_ID, STORE_ID, max(APP_DT) as APP_DT
          from str_oper_org
         where APP_DT <= '20230801'::DATE
        group by oper_id, store_id
        ) b
      , STORE A
 where C.OPER_ID = B.OPER_ID
   and B.STORE_ID = A.STORE_ID
   ;

Hash Join (actual time=17.629..89.303 rows=90000 loops=1)
  Hash Cond: ((str_oper_org.store_id)::text = (a.store_id)::text)
  Buffers: shared hit=3943
  ->  Hash Join (actual time=0.044..49.968 rows=90000 loops=1)
        Hash Cond: ((str_oper_org.oper_id)::text = (c.oper_id)::text)
        Buffers: shared hit=1040
        ->  GroupAggregate (actual time=0.018..36.549 rows=90000 loops=1)
              Group Key: str_oper_org.oper_id, str_oper_org.store_id
              Buffers: shared hit=1038
              ->  Index Only Scan using str_oper_og_uk2 on str_oper_org (actual time=0.013..13.874 rows=117000 loops=1)
                    Index Cond: (app_dt <= '2023-08-01'::date)
                    Heap Fetches: 0
                    Buffers: shared hit=1038
        ->  Hash (actual time=0.022..0.023 rows=100 loops=1)
              Buckets: 1024  Batches: 1  Memory Usage: 13kB
              Buffers: shared hit=2
              ->  Seq Scan on oper_org c (actual time=0.005..0.013 rows=100 loops=1)
                    Buffers: shared hit=2
  ->  Hash (actual time=17.493..17.494 rows=90000 loops=1)
        Buckets: 131072  Batches: 1  Memory Usage: 5837kB
        Buffers: shared hit=2903
        ->  Seq Scan on store a (actual time=0.005..8.183 rows=90000 loops=1)
              Buffers: shared hit=2903
Planning:
  Buffers: shared hit=1
Planning Time: 0.180 ms
Execution Time: 92.257 ms

If we have an association table following the "time period modeling" technique, we can construct a query as follows:

select C.OPER_NAME, A.STORE_NM
  from OPER_ORG C
     , STR_OPER_ORG_TP B
     , STORE A
 where 1=1
   and C.OPER_ID = B.OPER_ID
   and '2023-08-01'::DATE between B.APP_BEGIN_DT and B.APP_END_DT
   and B.STORE_ID = A.STORE_ID;
   
  Hash Join (actual time=19.798..69.635 rows=90000 loops=1)
  Hash Cond: ((b.store_id)::text = (a.store_id)::text)
  Buffers: shared hit=4054
  ->  Hash Join (actual time=0.037..26.222 rows=90000 loops=1)
        Hash Cond: ((b.oper_id)::text = (c.oper_id)::text)
        Buffers: shared hit=1151
        ->  Index Only Scan using str_oper_org_tp_uk2 on str_oper_org_tp b (actual time=0.011..12.242 rows=90000 loops=1)
              Index Cond: ((app_end_dt >= '2023-08-01'::date) AND (app_begin_dt <= '2023-08-01'::date))
              Heap Fetches: 0
              Buffers: shared hit=1149
        ->  Hash (actual time=0.021..0.022 rows=100 loops=1)
              Buckets: 1024  Batches: 1  Memory Usage: 13kB
              Buffers: shared hit=2
              ->  Seq Scan on oper_org c (actual time=0.004..0.011 rows=100 loops=1)
                    Buffers: shared hit=2
  ->  Hash (actual time=19.694..19.694 rows=90000 loops=1)
        Buckets: 131072  Batches: 1  Memory Usage: 5837kB
        Buffers: shared hit=2903
        ->  Seq Scan on store a (actual time=0.003..8.881 rows=90000 loops=1)
              Buffers: shared hit=2903
Planning:
  Buffers: shared hit=14
Planning Time: 0.273 ms
Execution Time: 72.798 ms

Note that while the number of block I/Os increased from 3943 to 4054, the elapsed time decreased from 82 ms to 72 ms. The rise in block I/O is a result of the larger index size in time period modeling.
In any case, this experiment confirms that time period modeling is superior to audit modeling for this specific business requirement.



   

postgresdba.com