[MODINVSTOR-513] hitcount estimation degeneration when upgrading MIS 19.1.2 (fameflower) -> MIS 19.2.0-SNAPSHOT (RMB30) Created: 02/Jun/20  Updated: 15/Jun/20  Resolved: 08/Jun/20

Status: Closed
Project: mod-inventory-storage
Components: None
Affects versions: None
Fix versions: 19.2.0

Type: Bug Priority: P2
Reporter: Jakub Skoczen Assignee: Mikhail Fokanov
Resolution: Done Votes: 0
Labels: platform-backlog
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original estimate: Not Specified

Issue links:
Blocks
is blocked by RMB-637 Run ANALYZE after index (re-)creation Closed
Gantt End to Start
has to be done before MODINVSTOR-504 Release mod-inventory-storage 19.2.0 Closed
Relates
relates to RMB-591 remove or harmonize 'optimizedSql' ex... Closed
relates to MODINVSTOR-511 Create 19.2.0-SNAPSHOT-4e7cbbb with t... Closed
relates to MODINVSTOR-514 Migrating from Fameflower to Goldenro... Closed
relates to RMB-638 Revert PgUtil.streamGet parameter reo... Closed
relates to RMB-644 implement generic sort optimization Blocked
relates to RMB-583 Move f_unaccent and concat* from publ... Closed
relates to RMB-633 Check DB_CONNECTIONRELEASEDELAY for n... Closed
Sprint: CP: sprint 90
Story Points: 2
Development Team: Core: Platform
Release: Q2 2020

 Description   

Mikhail Fokanov Oleksii Popov this ticket is created to track the work Mikhail is doing on bugfest with Sobha Duvvuri

Cate Boerema Charlotte Whitt Marc Johnson

Problem

We are seeing imprecise hit counts reported after upgrading Bugfest from the MIS used for Fameflower to the latest MIS snapshot built on RMB 30 (which includes the hit-count harmonisation for Instance records sorted by title). Most hit counts are reported as ~37000.

Analysis from Mikhail

I found out that the instance table wasn't analysed due to some error during the deployment of the new version of inventory.
Manually running:
ANALYZE fs09000000_mod_inventory_storage.instance;
fixes the problem.
VACUUM with ANALYZE was failing (possibly due to a timeout), but a plain ANALYZE is fast enough to complete.
So, as far as I can see, the root cause is this: the changes implemented in scope of RMB-583 Closed added a migration step during deployment that involves REINDEX and ANALYZE. Because these procedures timed out, the database was not completely migrated but was left in an inconsistent state, and therefore the counters are not accurate.
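A quick way to verify this diagnosis is to check whether planner statistics for the table exist at all. This is a sketch against the schema named in this ticket; both queries use standard PostgreSQL statistics views:

```sql
-- When was the table last analysed, manually or by autovacuum?
-- NULL in both columns means the planner has no fresh statistics.
SELECT relname, last_analyze, last_autoanalyze
  FROM pg_stat_user_tables
 WHERE schemaname = 'fs09000000_mod_inventory_storage'
   AND relname = 'instance';

-- Per-column statistics; zero rows here means ANALYZE has not run yet.
SELECT attname, n_distinct
  FROM pg_stats
 WHERE schemaname = 'fs09000000_mod_inventory_storage'
   AND tablename = 'instance';
```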

 Comments   
Comment by Julian Ladisch [ 02/Jun/20 ]

RMB has full logging enabled when running the upgrade SQL scripts:
https://github.com/folio-org/raml-module-builder/blob/v30.0.0/domain-models-runtime/src/main/java/org/folio/rest/persist/PostgresClient.java#L3505-L3516
RMB first uses log.info to print "trying to execute: " and the SQL query to run.
Then it tries to run the query.
On success RMB uses log.info to print "Successfully executed: " and the successful SQL query again.
On failure RMB uses log.error to print the exception message and the exception stacktrace.
I don't see any way to increase the logging further.

Comment by Julian Ladisch [ 02/Jun/20 ]

Creating a mod-inventory-storage jar with RMB 30 but with the RMB-583 Closed commit reverted is difficult and requires a lot of work because several other later commits depend on it.
My suggestion to try to reproduce the hitcount issue is this:
Start with the mod-inventory-storage 19.1.2 installation.
DROP TABLE fs09000000_mod_inventory_storage.rmb_internal_index;
This will remove the information about the existing indexes and RMB will recreate all of them on next upgrade.
Now run an upgrade from 19.1.2 to 19.1.2, i.e. an upgrade to the same version. It will recreate all indexes and may reproduce the issue.

Comment by Jakub Skoczen [ 03/Jun/20 ]

Julian Ladisch Are you proposing that we try to reproduce the issue with 19.1.2 (fameflower version) ONLY? Why do you expect that the problem is already present in that version?

Comment by Julian Ladisch [ 03/Jun/20 ]

The problem may be the result of long-running index recreations that get interrupted. If this is the case, the 19.1.2 to 19.1.2 "upgrade" will also show the problem.

Comment by Julian Ladisch [ 04/Jun/20 ]

Tenant upgrade uses the JDBC driver. It does not use the Vert.x driver and is therefore not affected by the switch to the vertx-pg-client.

Comment by Julian Ladisch [ 04/Jun/20 ]

We can reproduce the problem and it is not caused by interrupted reindexing.
PostgreSQL has an autovacuum daemon that is enabled by default and automatically runs VACUUM and ANALYZE:
https://www.postgresql.org/docs/current/routine-vacuuming.html
There are two configuration parameters (see https://www.postgresql.org/docs/current/runtime-config-autovacuum.html#GUC-AUTOVACUUM-ANALYZE-THRESHOLD ):

  • autovacuum_analyze_threshold: Specifies the minimum number of inserted, updated or deleted tuples needed to trigger an ANALYZE in any one table. The default is 50 tuples.
  • autovacuum_analyze_scale_factor: Specifies a fraction of the table size to add to autovacuum_analyze_threshold when deciding whether to trigger an ANALYZE. The default is 0.1 (10% of table size).

My guess is that recreating the index deletes the statistics but no ANALYZE is executed because there is no tuple change.
Therefore I propose to add ANALYZE to the upgrade script so that it runs after the index recreation.
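To get a feel for why autovacuum does not kick in here, the number of tuple changes needed before an automatic ANALYZE can be computed from the table's estimated row count. A sketch using the default settings quoted above:

```sql
-- analyze threshold = autovacuum_analyze_threshold
--                     + autovacuum_analyze_scale_factor * reltuples
-- With the defaults (50 and 0.1), a table with ~370000 rows needs
-- about 50 + 0.1 * 370000 = 37050 inserts/updates/deletes before
-- autovacuum analyses it on its own.
SELECT relname,
       reltuples::bigint              AS estimated_rows,
       (50 + 0.1 * reltuples)::bigint AS changes_before_autoanalyze
  FROM pg_class
 WHERE relname = 'instance';
```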

Comment by Mikhail Fokanov [ 04/Jun/20 ]

Julian Ladisch, could you create a branch with this fix in order to deploy and test it?

Comment by Julian Ladisch [ 04/Jun/20 ]

Reproducing this with the performance dataset.
First, the correct counts:

postgres=# select count(*) from instance where to_tsvector('simple', f_unaccent(jsonb->>'title')) @@ to_tsquery('simple', 'source');
  1259
postgres=# select count(*) from instance where to_tsvector('simple', f_unaccent(jsonb->>'title')) @@ to_tsquery('simple', 'master');
  1863
postgres=# select count(*) from instance where to_tsvector('simple', f_unaccent(jsonb->>'title')) @@ to_tsquery('simple', 'last');
  4206
postgres=# select count(*) from instance where to_tsvector('simple', f_unaccent(jsonb->>'title')) @@ to_tsquery('simple', 'be');
 23596

Drop and recreate the index:

postgres=# drop index instance_title_idx_ft;
DROP INDEX
postgres=# create index instance_title_idx_ft on instance using gin ((to_tsvector('simple', f_unaccent(jsonb->>'title'))));
CREATE INDEX

Now EXPLAIN returns the same estimate, 13806, for all search strings:

postgres=# explain select count(*) from instance where to_tsvector('simple', f_unaccent(jsonb->>'title')) @@ to_tsquery('simple', 'source');
 Aggregate  (cost=50122.97..50122.98 rows=1 width=8)
   ->  Bitmap Heap Scan on instance  (cost=147.00..50088.45 rows=13806 width=0)
         Recheck Cond: (to_tsvector('simple'::regconfig, (jsonb ->> 'title'::text)) @@ '''source'''::tsquery)
         ->  Bitmap Index Scan on instance_title_idx_ft  (cost=0.00..143.55 rows=13806 width=0)
               Index Cond: (to_tsvector('simple'::regconfig, (jsonb ->> 'title'::text)) @@ '''source'''::tsquery)
postgres=# explain select count(*) from instance where to_tsvector('simple', f_unaccent(jsonb->>'title')) @@ to_tsquery('simple', 'master');
 Aggregate  (cost=50122.97..50122.98 rows=1 width=8)
   ->  Bitmap Heap Scan on instance  (cost=147.00..50088.45 rows=13806 width=0)
         Recheck Cond: (to_tsvector('simple'::regconfig, (jsonb ->> 'title'::text)) @@ '''master'''::tsquery)
         ->  Bitmap Index Scan on instance_title_idx_ft  (cost=0.00..143.55 rows=13806 width=0)
               Index Cond: (to_tsvector('simple'::regconfig, (jsonb ->> 'title'::text)) @@ '''master'''::tsquery)
postgres=# explain select count(*) from instance where to_tsvector('simple', f_unaccent(jsonb->>'title')) @@ to_tsquery('simple', 'last');
 Aggregate  (cost=50122.97..50122.98 rows=1 width=8)
   ->  Bitmap Heap Scan on instance  (cost=147.00..50088.45 rows=13806 width=0)
         Recheck Cond: (to_tsvector('simple'::regconfig, (jsonb ->> 'title'::text)) @@ '''last'''::tsquery)
         ->  Bitmap Index Scan on instance_title_idx_ft  (cost=0.00..143.55 rows=13806 width=0)
               Index Cond: (to_tsvector('simple'::regconfig, (jsonb ->> 'title'::text)) @@ '''last'''::tsquery)
postgres=# explain select count(*) from instance where to_tsvector('simple', f_unaccent(jsonb->>'title')) @@ to_tsquery('simple', 'be');
 Aggregate  (cost=50122.97..50122.98 rows=1 width=8)
   ->  Bitmap Heap Scan on instance  (cost=147.00..50088.45 rows=13806 width=0)
         Recheck Cond: (to_tsvector('simple'::regconfig, (jsonb ->> 'title'::text)) @@ '''be'''::tsquery)
         ->  Bitmap Index Scan on instance_title_idx_ft  (cost=0.00..143.55 rows=13806 width=0)
               Index Cond: (to_tsvector('simple'::regconfig, (jsonb ->> 'title'::text)) @@ '''be'''::tsquery)

Run ANALYZE to collect statistics:

postgres=# analyze instance;
ANALYZE

Now we get good estimates (1288, 1288, 3220, 22541):

postgres=# explain select count(*) from instance where to_tsvector('simple', f_unaccent(jsonb->>'title')) @@ to_tsquery('simple', 'source');
 Aggregate  (cost=5287.40..5287.41 rows=1 width=8)
   ->  Bitmap Heap Scan on instance  (cost=29.98..5284.18 rows=1288 width=0)
         Recheck Cond: (to_tsvector('simple'::regconfig, (jsonb ->> 'title'::text)) @@ '''source'''::tsquery)
         ->  Bitmap Index Scan on instance_title_idx_ft  (cost=0.00..29.66 rows=1288 width=0)
               Index Cond: (to_tsvector('simple'::regconfig, (jsonb ->> 'title'::text)) @@ '''source'''::tsquery)
postgres=# explain select count(*) from instance where to_tsvector('simple', f_unaccent(jsonb->>'title')) @@ to_tsquery('simple', 'master');
 Aggregate  (cost=5287.40..5287.41 rows=1 width=8)
   ->  Bitmap Heap Scan on instance  (cost=29.98..5284.18 rows=1288 width=0)
         Recheck Cond: (to_tsvector('simple'::regconfig, (jsonb ->> 'title'::text)) @@ '''master'''::tsquery)
         ->  Bitmap Index Scan on instance_title_idx_ft  (cost=0.00..29.66 rows=1288 width=0)
               Index Cond: (to_tsvector('simple'::regconfig, (jsonb ->> 'title'::text)) @@ '''master'''::tsquery)
postgres=# explain select count(*) from instance where to_tsvector('simple', f_unaccent(jsonb->>'title')) @@ to_tsquery('simple', 'last');
 Aggregate  (cost=12824.56..12824.57 rows=1 width=8)
   ->  Bitmap Heap Scan on instance  (cost=48.96..12816.51 rows=3220 width=0)
         Recheck Cond: (to_tsvector('simple'::regconfig, (jsonb ->> 'title'::text)) @@ '''last'''::tsquery)
         ->  Bitmap Index Scan on instance_title_idx_ft  (cost=0.00..48.15 rows=3220 width=0)
               Index Cond: (to_tsvector('simple'::regconfig, (jsonb ->> 'title'::text)) @@ '''last'''::tsquery)
postgres=# explain select count(*) from instance where to_tsvector('simple', f_unaccent(jsonb->>'title')) @@ to_tsquery('simple', 'be');
 Aggregate  (cost=77481.57..77481.58 rows=1 width=8)
   ->  Bitmap Heap Scan on instance  (cost=226.69..77425.21 rows=22541 width=0)
         Recheck Cond: (to_tsvector('simple'::regconfig, (jsonb ->> 'title'::text)) @@ '''be'''::tsquery)
         ->  Bitmap Index Scan on instance_title_idx_ft  (cost=0.00..221.05 rows=22541 width=0)
               Index Cond: (to_tsvector('simple'::regconfig, (jsonb ->> 'title'::text)) @@ '''be'''::tsquery)

Comment by Julian Ladisch [ 05/Jun/20 ]

Upgrading mod-inventory-storage to RMB 30.0.2 with the fix RMB-637 Closed "Run ANALYZE after index (re-)creation" solves this issue.

Comment by Marc Johnson [ 05/Jun/20 ]

Jakub Skoczen Julian Ladisch Adam Dickmeiss Mikhail Fokanov

We've had challenges in the past (with triggers) where some environments are using database users with limited permissions.

What is the best way for us to communicate that any module using RAML Module Builder 30.0.2 will need to have a user that can analyse statistics?

Does this issue make upgrading to RAML Module Builder 30.0.2 required? As I understand it, 30.0.0 will drop and recreate the indexes, which, without re-analysis of the statistics, will result in incorrect counts?

And do we need to redo the upgrade re-testing?

Comment by Julian Ladisch [ 05/Jun/20 ]

Using a database user for upgrading that does not have ANALYZE permissions doesn't fail the upgrade; there will only be a warning:

postgres=> \c postgres diku_mod_inventory_storage
You are now connected to database "postgres" as user "diku_mod_inventory_storage".
postgres=> analyze instance;
WARNING:  skipping "instance" --- only table or database owner can analyze it
ANALYZE

The ANALYZE documentation says:

To analyze a table, one must ordinarily be the table's owner or a superuser. However, database owners are allowed to analyze all tables in their databases [...] ANALYZE will skip over any tables that the calling user does not have permission to analyze.

The Privileges documentation says:

The owner is normally the role that executed the creation statement.

FOLIO uses a bootstrap user that runs each installation and each upgrade. If they don't change this bootstrap user then the bootstrap user is still the owner of the table and therefore can run ANALYZE. Why would they change the bootstrap user?
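To confirm who owns the table (and can therefore run ANALYZE on it), something like this should work; the schema name is the one used elsewhere in this ticket:

```sql
-- The owner is normally the role that created the table,
-- i.e. the FOLIO bootstrap user.
SELECT schemaname, tablename, tableowner
  FROM pg_tables
 WHERE schemaname = 'fs09000000_mod_inventory_storage'
   AND tablename = 'instance';
```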

Comment by Julian Ladisch [ 05/Jun/20 ]

Does this issue make upgrading to RAML Module Builder 30.0.2 required? As I understand it 30.0.0 will drop and recreate the indexes, which without re-analysis of the statistics, will result in incorrect counts?

RMB 30 will drop and recreate some indexes. This also drops statistics resulting in bad estimated counts. PostgreSQL will automatically run ANALYZE to calculate new statistics after 50 changes (insert, update, delete) have been made to the underlying data.

This issue is about this time frame. People test the upgraded system without changing data.

RMB 30.0.2 runs ANALYZE at the end of the upgrade process, avoids this window of incorrect estimations, and therefore solves this issue.

Comment by Marc Johnson [ 05/Jun/20 ]

Julian Ladisch

FOLIO uses a bootstrap user that runs each installation and each upgrade. If they don't change this bootstrap user then the bootstrap user is still the owner of the table and therefore can run ANALYZE. Why would they change the bootstrap user?

Thank you. I wasn't entirely sure which user ran the upgrades. I asked the question because an upgrade script was written previously that disabled triggers; this worked successfully in the hosted environments but not in others, making me a little cautious about these things.

Comment by Marc Johnson [ 05/Jun/20 ]

Julian Ladisch Thanks

This issue is about this time frame. People test the upgraded system without changing data.

RMB 30.0.2 runs ANALYZE at the end of the upgrade process, avoids this window of incorrect estimations, and therefore solves this issue.

Then I think the question might be to Jakub Skoczen and Cate Boerema

Are there circumstances where it is acceptable for the record counts to be possibly significantly incorrect for a while after an upgrade and eventually improve?

If not, then I think that suggests that all modules need to upgrade to RAML Module Builder 30.0.2 in order to prevent this.

Julian Ladisch please correct me if I haven't followed correctly.

Comment by Sobha Duvvuri [ 05/Jun/20 ]

I had a similar question too, Julian Ladisch - this fix suggests to me that all modules should upgrade to RMB 30.0.2.
Does the volume of data impact the timing of index re-creation and engine running autovacuum in any way or just the number of inserts, updates or deletes made? Should all modules gain advantage of the added ANALYZE step?
Also, it took 3+ hours for MODINVSTOR to upgrade from 19.1.2 to 19.2.0-snapshot. Is there any way in which this could be reduced? Thanks!

Comment by Julian Ladisch [ 05/Jun/20 ]

Does the volume of data impact the timing of index re-creation and engine running autovacuum in any way or just the number of inserts, updates or deletes made?

There are two configuration parameters (see https://www.postgresql.org/docs/current/runtime-config-autovacuum.html#GUC-AUTOVACUUM-ANALYZE-THRESHOLD ):

  • autovacuum_analyze_threshold: Specifies the minimum number of inserted, updated or deleted tuples needed to trigger an ANALYZE in any one table. The default is 50 tuples.
  • autovacuum_analyze_scale_factor: Specifies a fraction of the table size to add to autovacuum_analyze_threshold when deciding whether to trigger an ANALYZE. The default is 0.1 (10% of table size).
Comment by Julian Ladisch [ 05/Jun/20 ]

Should all modules gain advantage of the added ANALYZE step?

Yes. We cannot assume that there will be enough data changes in a reasonable time.

Jakub posted in Slack #releases yesterday:

All modules using RMB MUST at minimum upgrade to version 30.0.0 but it is HIGHLY RECOMMENDED to upgrade to the latest bugfix release available.

Comment by Marc Johnson [ 08/Jun/20 ]

Julian Ladisch

Yes. We cannot assume that there will be enough data changes in a reasonable time.

I think I'm a little confused. This statement suggests that you believe waiting till the data changes occur is unacceptable. Is that a reasonable interpretation?

If so, that suggests it is required / expected that storage modules upgrade to RAML Module Builder 30.0.2 or later. However, you referred to Jakub Skoczen advice (quoted below) that it is only recommended.

All modules using RMB MUST at minimum upgrade to version 30.0.0 but it is HIGHLY RECOMMENDED to upgrade to the latest bugfix release available.

Please can either of you help me understand what the expectations are.

Comment by Marc Johnson [ 08/Jun/20 ]

Sobha Duvvuri

Also, it took 3+ hours for MODINVSTOR to upgrade from 19.1.2 to 19.2.0-snapshot. Is there any way in which this could be reduced?

I don't think it is likely this can be reduced at this point. This upgrade process drops and re-creates many large indexes and re-analyses the statistics; these are very time-consuming operations.

Mikhail Fokanov or Jakub Skoczen are best placed to confirm this.

Comment by Mikhail Fokanov [ 08/Jun/20 ]

Sobha Duvvuri, Marc Johnson this index recreation is not related to hit counts; the hit count implementation doesn't affect it. The indexes are affected by https://folio-org.atlassian.net/browse/RMB-583. So maybe Julian Ladisch can answer the question.

Comment by Julian Ladisch [ 08/Jun/20 ]

>> Should all modules gain advantage of the added ANALYZE step?
> Yes.

Should = highly recommended. There may be cases where this is not needed: for example, if the table is so small that no performance difference is noticeable, or when the upgrade changes all records, which automatically triggers the autovacuum daemon to run ANALYZE.

Comment by Julian Ladisch [ 08/Jun/20 ]

Please use a separate issue ( MODINVSTOR-514 Closed ) to discuss the long upgrading time.

Comment by Marc Johnson [ 08/Jun/20 ]

Closing this as due to be imminently released. Please re-open if necessary.
