During a database migration project from Oracle to PostgreSQL, I frequently encountered performance issues caused by differences in optimizer functionality between the two databases. Here is an insight from my migration experience, focusing on how these databases handle query plans differently.
Let's dive into a practical scenario with a table named "t1", holding records across 50 random days starting from November 1st, 2023, and see how PostgreSQL's query plans evolve over time, compared with Oracle's approach.
Below is the script I created to illustrate the point. I conducted the test on PostgreSQL 15.1:
CREATE TABLE t1 AS
SELECT '20231101'::date + round(50*random())::int AS reg_dt
FROM generate_series(1,1000000);
ANALYZE t1;
SELECT reg_dt, count(*)
FROM t1
GROUP BY reg_dt;
reg_dt |count|
----------+-----+
2023-11-01| 9895|
2023-11-02|20194|
2023-11-03|19842|
2023-11-04|20155|
.......
2023-12-16|19902|
2023-12-17|20103|
2023-12-18|20062|
2023-12-19|19755|
2023-12-20|20038|
2023-12-21|10155|
This setup scatters 1,000,000 rows over 50 days, averaging around 20,000 rows per day from November 1st to December 21st, 2023. (The first and last days receive roughly half that, since they each capture only half a rounding interval of `round(50*random())`.) Let's assume that as time goes by, about 20,000 rows are added per day. Today is the 21st of December, and we query for tuples from that day onwards:
SELECT *
FROM t1
WHERE reg_dt >= '20231221'::date;
And here is the execution plan. Note the estimated number of rows:
Gather (cost=1000.00..11718.33 rows=10300 width=4)
Workers Planned: 2
-> Parallel Seq Scan on t1 (cost=0.00..9688.33 rows=4292 width=4)
Filter: (reg_dt >= '2023-12-21'::date)
The actual count stands at 10,155, yet the optimizer predicted we would fetch 10,300 rows. You might wonder: how did it come up with 10,300?
select b.*
from pg_stats a cross join lateral
unnest(a.most_common_vals::text::text[], a.most_common_freqs::text::numeric[]) as b(mcv, mcf)
where tablename = 't1'
order by b.mcv;
mcv |mcf
----------+-----------
2023-11-01|0.010066667
2023-11-02|0.019833334
2023-11-03|0.019266667
....
....
2023-12-19|0.020466667
2023-12-20|0.021766666
2023-12-21|0.0103
The planner simply multiplied the most common value frequency for 2023-12-21 by the total number of rows:

1,000,000 × 0.0103 = 10,300
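You can reproduce this figure directly from the catalog. Below is a sketch reusing the pg_stats query from above; it assumes `reltuples` in pg_class holds the planner's total row count for t1 (which it does after ANALYZE):

```sql
-- Planner's estimate = total rows × frequency of the matching MCV.
SELECT round(c.reltuples * m.mcf) AS estimated_rows
FROM pg_class c
     JOIN pg_stats s ON s.tablename = c.relname
     CROSS JOIN LATERAL unnest(s.most_common_vals::text::text[],
                               s.most_common_freqs) AS m(mcv, mcf)
WHERE c.relname = 't1'
  AND m.mcv = '2023-12-21';
-- estimated_rows should come out at (or very near) 10300
```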
So far, so good. Let's suppose that the following day, 2023-12-22, we inserted an additional 10,000 rows.
INSERT INTO t1
SELECT '20231222'::date as reg_dt
FROM generate_series(1,10000);
And on the 22nd of December, we run the following query:
select *
from t1
where reg_dt >= '20231222'::date;
How do you suppose the optimizer will estimate the number of rows now? Can it recognize the addition of 10,000 new rows?
Let's examine the resulting execution plan:
Gather (cost=1000.00..10688.43 rows=1 width=4)
Workers Planned: 2
-> Parallel Seq Scan on t1 (cost=0.00..9688.33 rows=1 width=4)
Filter: (reg_dt >= '2023-12-22'::date)
To our disappointment, the planner estimates that the query will return just one row, which can lead to badly mis-costed execution plans. This discrepancy stems from stale statistics: the planner has no knowledge of the newly inserted date, and, as we will see, PostgreSQL and Oracle handle this out-of-range situation very differently.
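Once the cause is known, the immediate fix is easy to demonstrate: refreshing the statistics lets the planner see the new date. A quick sketch against the table created above:

```sql
-- Refresh the statistics so 2023-12-22 enters the planner's view.
ANALYZE t1;

-- Re-check the estimate; it should now be close to the 10,000 inserted rows.
EXPLAIN SELECT * FROM t1 WHERE reg_dt >= '20231222'::date;
```

The problem, of course, is that in a live system you cannot re-analyze after every batch of inserts, which is why the boundary behavior matters.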
In Oracle, when you provide a predicate that falls above the known upper bound or below the known lower bound of a column, the optimizer applies a linear decay algorithm. If the optimizer knows that the highest value is the 21st of December, then a query for the 22nd of December is estimated to retrieve around 90% of the row count of the highest value; a query for the 23rd gets 80%, and so on, decaying all the way down to one row as you move further into the future beyond the highest value the optimizer knows about. To mitigate the staleness itself, Oracle offers the Real-Time Statistics feature on Exadata: as you execute DML, the database tracks the changes, and from time to time updates the lowest and highest values for various columns. This is much lighter-weight than a full gathering of statistics.
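The decay described above can be sketched numerically. This is an illustration of the behavior as described, not Oracle's actual internal formula; it assumes a 10% decay per day beyond the known maximum of 2023-12-21, starting from the 10,300-row estimate and clamped at a floor of one row:

```sql
-- Illustration only: row estimates decaying 10% per day beyond the
-- known maximum (2023-12-21), never dropping below one row.
SELECT d::date AS query_day,
       greatest(1, round(10300 * greatest(0, 1 - 0.10 * (d::date - date '2023-12-21')))) AS est_rows
FROM generate_series(date '2023-12-22', date '2024-01-01', interval '1 day') AS d;
```

Running this shows the estimate sliding from roughly 9,270 rows on December 22nd down toward one row as the queried date moves further past the known maximum.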
On the other hand, in PostgreSQL, if the predicate falls above the known boundary, the planner simply assumes one row, a stark contrast to Oracle's approach.
Conclusion
Diverging from Oracle, PostgreSQL (as of version 15.1) significantly misestimates the number of rows for boundary conditions. You can employ the following strategies to mitigate this issue:
1. Frequent Statistics Gathering
Lower the table-level thresholds that trigger automatic statistics gathering. By default, roughly 10% of a table's rows must change (autovacuum_analyze_scale_factor = 0.1, plus autovacuum_analyze_threshold = 50 rows) before autoanalyze refreshes its statistics.
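For a table with this insert pattern, the threshold can be tightened per table rather than globally. A sketch (the 1% value is illustrative, not a recommendation):

```sql
-- Trigger autoanalyze after ~1% of rows change instead of the default 10%.
ALTER TABLE t1 SET (autovacuum_analyze_scale_factor = 0.01);

-- Or simply run a manual ANALYZE right after each daily load.
ANALYZE t1;
```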
2. Plan Stabilization with Hints
You can pin the execution plan with hints (PostgreSQL has no native hint syntax; the pg_hint_plan extension is commonly used for this). With this strategy you will always get a consistent plan. However, bear in mind that a consistent plan does not necessarily mean consistent performance: hinting trades away the planner's ability to adapt in exchange for avoiding sudden plan regressions, so performance may still degrade slowly over time.
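As a sketch of the hinting approach, assuming the pg_hint_plan extension is installed and an index t1_reg_dt_idx exists on reg_dt (neither is part of the setup script above):

```sql
-- Force an index scan regardless of the (mis)estimated row count.
/*+ IndexScan(t1 t1_reg_dt_idx) */
SELECT *
FROM t1
WHERE reg_dt >= '20231222'::date;
```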