PostgreSQL 은 특별합니다.
단순한 데이터베이스가 아니라, 하나의 어플리케이션 플랫폼이기 때문입니다.

공지사항

뉴스 및 기사

	총 게시물 163건, 최근 0 건

B-Tree Index Split Mechanism: Insights and Experiments

글쓴이 : 모델광 날짜 : 2023-12-29 (금) 22:52 조회 : 441

I recently came across a blog post on Oracle 18c's Scalable Sequences, which sparked me to jot down this note:
Scalable Sequences | Richard Foote's Oracle Blog (wordpress.com)
The blog post claims that Scalable Sequences can alleviate index contention issues by distributing data across different index blocks during insertion.

This led me to ponder how PostgreSQL handles B-tree index splits. If you work with an Oracle database, you are probably familiar with B-tree index split mechanics. In a stanedard block split, the contents are usually divided evenly (50-50) between the old and the new block. However, when a new entry is inserted into the right-most leaf block of a B-tree index and there is no space left in the block, a 90-10 split occurs. This means that 90% of the data remains in the original block, and 10% moves to the new leaf block. This mechanism aims to accommodate future sequential insertions without additional splits. However, this doesn't entirely solve the "hot block" issue. Heavy insertion transactions can still lead to buffer busy waits, enq:TX - index contention, and cluster wait events. That's why Oracle 18c introduce Scalable Sequences.

I was curious about PostgreSQL's approach to splitting the leaf block. Does it perform a 50-50 split, regardless of the insertion's position in the leaf page? To get the answer, I shot a question over to ChatGPT-4.

"In an Oracle database, when a data is inserted into the right-most index leaf block, the index block does the 90-10 split. Does PostgreSQL have the same mechanism when the right-most leaf page is split?"

And ChatGPT-4 responded, "PostgreSQL performs an approximately 50-50 split regardless of whether the insertion is at the right-most leaf or not. This approach is different from Oracle's 90-10 split and is more general-purpose, as it doesn't optimize specifically for sequential insertions."

After rolling up my sleeves and getting into the experiment myself, I have to say, I realized that ChatGPT-4's information was off the mark.

To explore this, I set up an experiment. I am going to compare the empty space in index leaf pages between sequentially and ramdomly populated indexes. For a more detailed inspection of index data, I'll be using the "pageinspect" extension. The pageinspect extension allows us to see the low-level page contents of indexes or tables.

Let's start by creating the pageinspect extension. You need a superuser privilege to create it.

postgres=# alter role scott with superuser;
postgres=# exit
$psql -U scott -d analdb
analdb=#create extension pageinspect;

CREATE TABLE MSG_SEND (
MSG_SEND_NO BIGINT NOT NULL,
MSG_CONTENT TEXT,
CONSTRAINT MSG_SEND_PK PRIMARY KEY (MSG_SEND_NO)
);

I have populated the table with a bunch of rows, sequentially increasing
the MSG_SEND_NO column.

INSERT INTO msg_send
SELECT i, repeat('ab',50)

FROM generate_series(1, 1000000) a(i);

Using the 'bt_multi_page_stats' function, I retrieved summary data for B-tree index pages.

SELECT blkno, live_items, avg_item_size, page_size, free_size

FROM bt_multi_page_stats('msg_send_pk', 100, 10);

blkno|live_items|avg_item_size|page_size|free_size|
-----+----------+-------------+---------+---------+
100|       367|           16|     8192|      808|
101|       367|           16|     8192|      808|
102|       367|           16|     8192|      808|
103|       367|           16|     8192|      808|
104|       367|           16|     8192|      808|
105|       367|           16|     8192|      808|
106|       367|           16|     8192|      808|
107|       367|           16|     8192|      808|
108|       367|           16|     8192|      808|

109| 367| 16| 8192| 808|

Focusing on blocks 100 to 109, we notice that the free space in each is about 10%(808/8192) of the page size. This indicates that in PostgreSQL 16.1, when a new entry is inserted into the right-most index page, a 90-10 split occurs.

I have also checked the total size of the index:

select relname, relkind, relpages, reltuples, pg_relation_size(oid)
from pg_class

where relname = 'msg_send_pk';

msg_send_pk|i | 2745 |1000000.0| 22487040|

Next, I examined the behviour with randomly inserted data. The results were quite different.

truncate table msg_send;
INSERT INTO msg_send
SELECT (mod(i,10)||lpad(i::text,8,'0'))::bigint, repeat('ab',50)
FROM generate_series(1, 1000000) a(i);

SELECT *

FROM bt_multi_page_stats('msg_send_pk', 100, 10);

blkno|live_items|avg_item_size|page_size|free_size|
-----+----------+-------------+---------+---------+
100|       204|           16|     8192|     4068|
101|       204|           16|     8192|     4068|
102|       204|           16|     8192|     4068|
103|       204|           16|     8192|     4068|
104|       204|           16|     8192|     4068|
105|       204|           16|     8192|     4068|
106|       204|           16|     8192|     4068|
107|       204|           16|     8192|     4068|
108|       367|           16|     8192|      808|
109|       204|           16|     8192|     4068|

select relname, relkind, relpages, reltuples, pg_relation_size(oid)
from pg_class

where relname = 'msg_send_pk';

msg_send_pk|i | 4723|1000000.0| 38690816|

We notice a couple of key differences:

Firstly, the total number of index pages increased significantly to 4723, up from 2745.
Secondly, most index pages contained only 204 live items, compared to 367 in the sequential case.
Finally, the majority of index pages had around 50 percent free space, which indicates that when a leaf page in a B-tree becomes full and a new entry needs to be inserted, PostgreSQL performs a 50-50 index page split. This means that the existing entries are evenly distributed betwen the old and the new index page. This approach is different from the right-most 90-10 split.

CONCLUSION

This experiment showed that PostgreSQL's B-tree index split behaviour is quite similar to that of Oracle's.

E-MAIL: root@postgresdba.com