Instance entity type performance investigation

Problem

In the Quesnelia Bugfest (QBF) environment, this query is EXTREMELY slow:

(instance_discovery_suppress == "false") AND (instance_metadata_updated_date > "2024-01-01")

In the Corsair Rancher environment (corsair-perf, which contains an older copy of the Bugfest data), it is almost instant.

Investigation

For the sake of investigation, I’m using a simplified version of the query:

select count(*)
from fs09000000_mod_fqm_manager.src_inventory_instance
where "left"(lower((jsonb -> 'metadata'::text) ->> 'updatedDate'::text), 600) > '2024-01-01'
  and "left"(lower(jsonb ->> 'discoverySuppress'::text), 600) = 'false'

This should reveal any weirdness related to indexes without being muddied by the overhead of actually retrieving a large result set.

Results

In QBF, this query took ~29.75 minutes. In corsair-perf, it took ~180 ms.

Query plans

The corsair-perf query plan:

Aggregate  (cost=20.47..20.48 rows=1 width=8) (actual time=0.080..0.080 rows=1 loops=1)
  ->  Index Scan using instance_metadata_updateddate_idx on instance  (cost=0.56..20.46 rows=4 width=0) (actual time=0.021..0.077 rows=1 loops=1)
        Index Cond: ("left"(lower(((jsonb -> 'metadata'::text) ->> 'updatedDate'::text)), 600) > '2024-01-01'::text)
        Filter: ("left"(lower((jsonb ->> 'discoverySuppress'::text)), 600) = 'false'::text)
        Rows Removed by Filter: 6
Planning Time: 0.207 ms
Execution Time: 0.107 ms

The QBF query plan:

Finalize Aggregate  (cost=1122370.61..1122370.62 rows=1 width=8) (actual time=584934.346..584938.671 rows=1 loops=1)
  ->  Gather  (cost=1122370.40..1122370.61 rows=2 width=8) (actual time=584926.688..584938.661 rows=3 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        ->  Partial Aggregate  (cost=1121370.40..1121370.41 rows=1 width=8) (actual time=584804.488..584804.489 rows=1 loops=3)
              ->  Parallel Bitmap Heap Scan on instance  (cost=43108.06..1120724.62 rows=258311 width=0) (actual time=608.260..584702.007 rows=168613 loops=3)
                    Recheck Cond: ("left"(lower(((jsonb -> 'metadata'::text) ->> 'updatedDate'::text)), 600) > '2024-01-01'::text)
                    Rows Removed by Index Recheck: 955694
                    Filter: ("left"(lower((jsonb ->> 'discoverySuppress'::text)), 600) = 'false'::text)
                    Rows Removed by Filter: 59563
                    Heap Blocks: exact=13864 lossy=122007
                    ->  Bitmap Index Scan on instance_metadata_updateddate_idx  (cost=0.00..42953.08 rows=633136 width=0) (actual time=689.052..689.052 rows=708224 loops=1)
                          Index Cond: ("left"(lower(((jsonb -> 'metadata'::text) ->> 'updatedDate'::text)), 600) > '2024-01-01'::text)
Planning Time: 0.197 ms
Execution Time: 584938.724 ms

Notably (as seen by the row counts in the query plans), QBF has significantly more records that have been updated since 2024-01-01, so the query plans are expected to be pretty different.

The Rows Removed by Index Recheck: 955694 stands out, along with the lossy=122007 below it. Some Googling reveals that this is related to the Bitmap Index Scan on the updated date: when the bitmap is too big to fit in memory, PostgreSQL makes it lossy, so some entries refer to whole heap blocks instead of specific rows. The heap scan then has to recheck every row in those blocks to find the ones that actually match the condition.

The memory allocated for this is set with work_mem. Let’s check that:

select * from pg_settings where name = 'work_mem';

This returned 4096 kB (so 4 MB), which is the PostgreSQL default.
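
For the record, pg_settings stores work_mem in kB; selecting the unit column makes that explicit:

select name, setting, unit from pg_settings where name = 'work_mem';
-- setting = 4096, unit = kB  (i.e. 4 MB)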

Experiment

Hypothesis: Increasing work_mem will remove the index recheck and improve performance.

Method (on QBF):

  1. Start a transaction

    start transaction;
  2. Increase work_mem to a large value

    set session work_mem = '100MB';
  3. Run the test query with the increased memory. Also, run explain analyze on it to see what the plan is

  4. Roll back the transaction to undo the work_mem change
    rollback;

All together:

start transaction;
set session work_mem = '100MB';
-- explain analyse
select count(*)
from fs09000000_mod_fqm_manager.src_inventory_instance
where "left"(lower((jsonb -> 'metadata'::text) ->> 'updatedDate'::text), 600) > '2024-01-01'
  and "left"(lower(jsonb ->> 'discoverySuppress'::text), 600) = 'false';
rollback;
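
(Side note: set local would also work here; it is scoped to the transaction and reverts automatically at commit or rollback, so the explicit rollback becomes just a safety net. A sketch:)

start transaction;
set local work_mem = '100MB';
-- ... run the test query here ...
rollback;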

Results

Timing: with the memory change, the query took 7 s 693 ms.

Query plan:

Finalize Aggregate  (cost=1025580.79..1025580.80 rows=1 width=8) (actual time=111542.651..111558.872 rows=1 loops=1)
  ->  Gather  (cost=1025580.57..1025580.78 rows=2 width=8) (actual time=111538.839..111558.864 rows=3 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        ->  Partial Aggregate  (cost=1024580.57..1024580.58 rows=1 width=8) (actual time=111530.206..111530.207 rows=1 loops=3)
              ->  Parallel Bitmap Heap Scan on instance  (cost=43108.06..1023934.80 rows=258311 width=0) (actual time=450.669..111468.812 rows=168613 loops=3)
                    Recheck Cond: ("left"(lower(((jsonb -> 'metadata'::text) ->> 'updatedDate'::text)), 600) > '2024-01-01'::text)
                    Filter: ("left"(lower((jsonb ->> 'discoverySuppress'::text)), 600) = 'false'::text)
                    Rows Removed by Filter: 59563
                    Heap Blocks: exact=134477
                    ->  Bitmap Index Scan on instance_metadata_updateddate_idx  (cost=0.00..42953.08 rows=633136 width=0) (actual time=326.824..326.824 rows=708224 loops=1)
                          Index Cond: ("left"(lower(((jsonb -> 'metadata'::text) ->> 'updatedDate'::text)), 600) > '2024-01-01'::text)
Planning Time: 5.281 ms
Execution Time: 111558.950 ms

Note: I actually ran the explain analyze first, so there could be some caching that interfered: the explain analyze took 111.5 seconds, while the actual query took only ~7.7 seconds. Since those are very different, I waited a while (for the cache to be invalidated by other activity on QBF), then reran the experiment without the explain analyze. With this 2nd run, the query took 2 m 16 s.

Conclusion

It seems increasing work_mem improves performance for this query/environment by more than 10x. 100MB is probably too high, so the ideal value should be determined by a DBA or someone with DB performance tuning experience.
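
For reference, whatever value is chosen wouldn't have to be a cluster-wide default; it could be scoped more narrowly. A sketch (the role name and '16MB' are placeholders, not recommendations):

-- per-role: only affects connections made as this role (role name assumed to match the schema name)
alter role fs09000000_mod_fqm_manager set work_mem = '16MB';

-- or cluster-wide, which needs a configuration reload to take effect
alter system set work_mem = '16MB';
select pg_reload_conf();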

More weirdness

Kriti noticed that this query is about 10x slower in QBF compared to corsair-perf:

SELECT count(*) FROM fs09000000_mod_inventory_storage.instance WHERE jsonb ->> 'title' ~* 'Krakow';

The query plan looks about the same for both DBs, but it runs in 1-2 minutes on corsair-perf and about 12 minutes on QBF.

This isn’t really anything like a query that FQM would produce, but the performance difference between QBF and corsair-perf indicates that something weird is going on.
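
One thing that might narrow it down (not run here) is comparing buffer-level I/O on the two environments; explain's buffers option adds shared hit/read counts that the plain plans don't show:

explain (analyze, buffers)
SELECT count(*) FROM fs09000000_mod_inventory_storage.instance WHERE jsonb ->> 'title' ~* 'Krakow';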

The only thing that really stands out to me is the number of dead tuples:

select n_live_tup, n_dead_tup, relname --, *
from pg_stat_all_tables
where relname = 'instance' and schemaname = 'fs09000000_mod_inventory_storage'

corsair-perf:

n_live_tup | n_dead_tup | relname
-----------+------------+----------
   8152387 |        382 | instance

In QBF:

n_live_tup | n_dead_tup | relname
-----------+------------+----------
   8302938 |     352878 | instance

352k dead tuples is a lot, but doesn’t seem all that crazy… And it looks like the table was autovacuumed and autoanalyzed on 5/28 (today is 6/6), so it’s not that stale. I’m not sure what’s going on.
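
For reference, the vacuum/analyze timestamps come from the same view, and a manual vacuum would be the obvious experiment if the dead tuples were the suspect (sketches only; a vacuum on a table this size would need to be coordinated):

select last_vacuum, last_autovacuum, last_analyze, last_autoanalyze
from pg_stat_all_tables
where relname = 'instance' and schemaname = 'fs09000000_mod_inventory_storage';

vacuum (verbose, analyze) fs09000000_mod_inventory_storage.instance;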
