[MODINVSTOR-513] hitcount estimation degeneration when upgrading MIS 19.1.2 (fameflower) -> MIS 19.2.0-SNAPSHOT (RMB30) Created: 02/Jun/20 Updated: 15/Jun/20 Resolved: 08/Jun/20 |
|
| Status: | Closed |
| Project: | mod-inventory-storage |
| Components: | None |
| Affects versions: | None |
| Fix versions: | 19.2.0 |
| Type: | Bug | Priority: | P2 |
| Reporter: | Jakub Skoczen | Assignee: | Mikhail Fokanov |
| Resolution: | Done | Votes: | 0 |
| Labels: | platform-backlog | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original estimate: | Not Specified | ||
| Issue links: |
|
||||||||||||||||||||||||||||||||||||||||||||||||
| Sprint: | CP: sprint 90 | ||||||||||||||||||||||||||||||||||||||||||||||||
| Story Points: | 2 | ||||||||||||||||||||||||||||||||||||||||||||||||
| Development Team: | Core: Platform | ||||||||||||||||||||||||||||||||||||||||||||||||
| Release: | Q2 2020 | ||||||||||||||||||||||||||||||||||||||||||||||||
| Description |
|
Mikhail Fokanov Oleksii Popov this ticket is created to track the work Mikhail is doing on bugfest with Sobha Duvvuri Cate Boerema Charlotte Whitt Marc Johnson Problem We are seeing imprecise hitcounts reported after upgrading Bugfest from MIS used for Fameflower to the latest MIS snapshot using RMB 30 (which included the harmonisation of hit count between Instance records sorted by title). Most hitcounts are reported as ~37000. Analyis from Mikhail I found out that the instance table wan't analysed due to some error during the deployment of the new version of inventory. Documentation:
|
| Comments |
| Comment by Julian Ladisch [ 02/Jun/20 ] |
|
RMB has full logging enabled when running the upgrade SQL scripts: |
| Comment by Julian Ladisch [ 02/Jun/20 ] |
|
Creating a mod-inventory-storage jar with RMB 30 but the
|
| Comment by Jakub Skoczen [ 03/Jun/20 ] |
|
Julian Ladisch Are you proposing that we try to reproduce the issue with 19.1.2 (fameflower version) ONLY? Why do you expect that the problem is already present in that version? |
| Comment by Julian Ladisch [ 03/Jun/20 ] |
|
The problem may be the result of long running index recreations that get interrupted. If this is the case the 19.1.2 to 19.1.2 "upgrade" will also show the problem. |
| Comment by Julian Ladisch [ 04/Jun/20 ] |
|
Tenant upgrade uses the jdbc driver. It does not use the vertx driver and is therefore not affected by the switch to the vertx-pg-client.
|
| Comment by Julian Ladisch [ 04/Jun/20 ] |
|
We can reproduce the problem and it is not caused by interrupted reindexing.
My guess is that recreating the index deletes the statistics but no ANALYZE is executed because there is no tuple change. |
| Comment by Mikhail Fokanov [ 04/Jun/20 ] |
|
Julian Ladisch, could you create a branch with this fix in order to deploy and test it? |
| Comment by Julian Ladisch [ 04/Jun/20 ] |
|
Reproducing this with the performance dataset. postgres=# select count(*) from instance where to_tsvector('simple', f_unaccent(jsonb->>'title')) @@ to_tsquery('simple', 'source'); 1259 postgres=# select count(*) from instance where to_tsvector('simple', f_unaccent(jsonb->>'title')) @@ to_tsquery('simple', 'master'); 1863 postgres=# select count(*) from instance where to_tsvector('simple', f_unaccent(jsonb->>'title')) @@ to_tsquery('simple', 'last'); 4206 postgres=# select count(*) from instance where to_tsvector('simple', f_unaccent(jsonb->>'title')) @@ to_tsquery('simple', 'be'); 23596 Drop and recreate the index: postgres=# drop index instance_title_idx_ft; DROP INDEX postgres=# create index instance_title_idx_ft on instance using gin ((to_tsvector('simple', f_unaccent(jsonb->>'title')))); CREATE INDEX Now EXPLAIN returns the same estimation 13806 for all search strings: postgres=# explain select count(*) from instance where to_tsvector('simple', f_unaccent(jsonb->>'title')) @@ to_tsquery('simple', 'source'); Aggregate (cost=50122.97..50122.98 rows=1 width=8) -> Bitmap Heap Scan on instance (cost=147.00..50088.45 rows=13806 width=0) Recheck Cond: (to_tsvector('simple'::regconfig, (jsonb ->> 'title'::text)) @@ '''source'''::tsquery) -> Bitmap Index Scan on instance_title_idx_ft (cost=0.00..143.55 rows=13806 width=0) Index Cond: (to_tsvector('simple'::regconfig, (jsonb ->> 'title'::text)) @@ '''source'''::tsquery) postgres=# explain select count(*) from instance where to_tsvector('simple', f_unaccent(jsonb->>'title')) @@ to_tsquery('simple', 'master'); Aggregate (cost=50122.97..50122.98 rows=1 width=8) -> Bitmap Heap Scan on instance (cost=147.00..50088.45 rows=13806 width=0) Recheck Cond: (to_tsvector('simple'::regconfig, (jsonb ->> 'title'::text)) @@ '''master'''::tsquery) -> Bitmap Index Scan on instance_title_idx_ft (cost=0.00..143.55 rows=13806 width=0) Index Cond: (to_tsvector('simple'::regconfig, (jsonb ->> 'title'::text)) @@ '''master'''::tsquery) postgres=# explain select count(*) from instance where to_tsvector('simple', f_unaccent(jsonb->>'title')) @@ to_tsquery('simple', 'last'); Aggregate (cost=50122.97..50122.98 rows=1 width=8) -> Bitmap Heap Scan on instance (cost=147.00..50088.45 rows=13806 width=0) Recheck Cond: (to_tsvector('simple'::regconfig, (jsonb ->> 'title'::text)) @@ '''last'''::tsquery) -> Bitmap Index Scan on instance_title_idx_ft (cost=0.00..143.55 rows=13806 width=0) Index Cond: (to_tsvector('simple'::regconfig, (jsonb ->> 'title'::text)) @@ '''last'''::tsquery) postgres=# explain select count(*) from instance where to_tsvector('simple', f_unaccent(jsonb->>'title')) @@ to_tsquery('simple', 'be'); Aggregate (cost=50122.97..50122.98 rows=1 width=8) -> Bitmap Heap Scan on instance (cost=147.00..50088.45 rows=13806 width=0) Recheck Cond: (to_tsvector('simple'::regconfig, (jsonb ->> 'title'::text)) @@ '''be'''::tsquery) -> Bitmap Index Scan on instance_title_idx_ft (cost=0.00..143.55 rows=13806 width=0) Index Cond: (to_tsvector('simple'::regconfig, (jsonb ->> 'title'::text)) @@ '''be'''::tsquery) Run ANALYZE to collect statistics: postgres=# analyze instance; ANALYZE Now we get good estimations (1288, 1288, 3220, 22541): postgres=# explain select count(*) from instance where to_tsvector('simple', f_unaccent(jsonb->>'title')) @@ to_tsquery('simple', 'source'); Aggregate (cost=5287.40..5287.41 rows=1 width=8) -> Bitmap Heap Scan on instance (cost=29.98..5284.18 rows=1288 width=0) Recheck Cond: (to_tsvector('simple'::regconfig, (jsonb ->> 'title'::text)) @@ '''source'''::tsquery) -> Bitmap Index Scan on instance_title_idx_ft (cost=0.00..29.66 rows=1288 width=0) Index Cond: (to_tsvector('simple'::regconfig, (jsonb ->> 'title'::text)) @@ '''source'''::tsquery) postgres=# explain select count(*) from instance where to_tsvector('simple', f_unaccent(jsonb->>'title')) @@ to_tsquery('simple', 'master'); Aggregate (cost=5287.40..5287.41 rows=1 width=8) -> Bitmap Heap Scan on instance (cost=29.98..5284.18 rows=1288 width=0) Recheck Cond: (to_tsvector('simple'::regconfig, (jsonb ->> 'title'::text)) @@ '''master'''::tsquery) -> Bitmap Index Scan on instance_title_idx_ft (cost=0.00..29.66 rows=1288 width=0) Index Cond: (to_tsvector('simple'::regconfig, (jsonb ->> 'title'::text)) @@ '''master'''::tsquery) postgres=# explain select count(*) from instance where to_tsvector('simple', f_unaccent(jsonb->>'title')) @@ to_tsquery('simple', 'last'); Aggregate (cost=12824.56..12824.57 rows=1 width=8) -> Bitmap Heap Scan on instance (cost=48.96..12816.51 rows=3220 width=0) Recheck Cond: (to_tsvector('simple'::regconfig, (jsonb ->> 'title'::text)) @@ '''last'''::tsquery) -> Bitmap Index Scan on instance_title_idx_ft (cost=0.00..48.15 rows=3220 width=0) Index Cond: (to_tsvector('simple'::regconfig, (jsonb ->> 'title'::text)) @@ '''last'''::tsquery) postgres=# explain select count(*) from instance where to_tsvector('simple', f_unaccent(jsonb->>'title')) @@ to_tsquery('simple', 'be'); Aggregate (cost=77481.57..77481.58 rows=1 width=8) -> Bitmap Heap Scan on instance (cost=226.69..77425.21 rows=22541 width=0) Recheck Cond: (to_tsvector('simple'::regconfig, (jsonb ->> 'title'::text)) @@ '''be'''::tsquery) -> Bitmap Index Scan on instance_title_idx_ft (cost=0.00..221.05 rows=22541 width=0) Index Cond: (to_tsvector('simple'::regconfig, (jsonb ->> 'title'::text)) @@ '''be'''::tsquery) |
| Comment by Julian Ladisch [ 05/Jun/20 ] |
|
Upgrading mod-inventory-storage to RMB 30.0.2 with the fix
|
| Comment by Marc Johnson [ 05/Jun/20 ] |
|
Jakub Skoczen Julian Ladisch Adam Dickmeiss Mikhail Fokanov We've had challenges in the past (with triggers) where some environments are using database users with limited permissions. What is the best way for us to communicate that any module using RAML Module Builder 30.0.2 will need to have a user than can analyse statistics? Does this issue make upgrading to RAML Module Builder 30.0.2 required? As I understand it 30.0.0 will drop and recreate the indexes, which without re-analysis of the statistics, will result in incorrect counts? And do we need to redo the upgrade re-testing? |
| Comment by Julian Ladisch [ 05/Jun/20 ] |
|
Using a database user for upgrading that does not have ANALYZE permissions doesn't fail the upgrade, there will be a warning only: postgres=> \c postgres diku_mod_inventory_storage You are now connected to database "postgres" as user "diku_mod_inventory_storage". postgres=> analyze instance; WARNING: skipping "instance" --- only table or database owner can analyze it ANALYZE The ANALYZE documentation says:
The Privileges documentation says:
FOLIO uses a bootstrap user that runs each installation and each upgrade. If they don't change this bootstrap user then the bootstrap user is still the owner of the table and therefore can run ANALYZE. Why would they change the bootstrap user? |
| Comment by Julian Ladisch [ 05/Jun/20 ] |
RMB 30 will drop and recreate some indexes. This also drops statistics resulting in bad estimated counts. PostgreSQL will automatically run ANALYZE to calculate new statistics after 50 changes (insert, update, delete) have been made to the underlying data. This issue is about this time frame. People test the upgraded system without changing data. RMB 30.0.2 runs ANALYZE at the end of the upgrade process and avoids this time frame with incorrect estimation and therefore solved this issue. |
| Comment by Marc Johnson [ 05/Jun/20 ] |
Thank you. I wasn't entirely sure which user ran the upgrades. I asked the question because an upgrade script was written previously that disabled triggers, this worked successfully in the hosted environments but not in others, making me a little cautions of these things. |
| Comment by Marc Johnson [ 05/Jun/20 ] |
|
Julian Ladisch Thanks
Then I think the question might be to Jakub Skoczen and Cate Boerema Are there circumstances where it is acceptable for the record counts to be possible significantly incorrect for a while after an upgrade and eventually improve? If not, then I think that suggests that all modules need to upgrade to RAML Module Builder 30.0.2 in order to prevent this. Julian Ladisch please correct me if I haven't followed correctly. |
| Comment by Sobha Duvvuri [ 05/Jun/20 ] |
|
I had similar question too Julian Ladisch - this fix seems to me that all modules should upgrade to RMB 30.0.2. |
| Comment by Julian Ladisch [ 05/Jun/20 ] |
There are two configuration parameters (see https://www.postgresql.org/docs/current/runtime-config-autovacuum.html#GUC-AUTOVACUUM-ANALYZE-THRESHOLD ):
|
| Comment by Julian Ladisch [ 05/Jun/20 ] |
Yes. We cannot assume that there will be enough data changes in a reasonable time. Jakub posted in Slack #releases yesterday:
|
| Comment by Marc Johnson [ 08/Jun/20 ] |
I think I'm a little confused. This statement suggests that you believe waiting till the data changes occur is unacceptable. Is that a reasonable interpretation? If so, that suggests it is required / expected that storage modules upgrade to RAML Module Builder 30.0.2 or later. However, you referred to Jakub Skoczen advice (quoted below) that it is only recommended.
Please can either of you help me understand what the expectations are. |
| Comment by Marc Johnson [ 08/Jun/20 ] |
I don't think it is likely this can be reduced at this point. This upgrade process is dropping and re-creating many large indexes and re-analysing the statistics, these are very time consuming operations. Jakub Skoczen Mikhail Fokanov or Jakub Skoczen are best placed to confirm this. |
| Comment by Mikhail Fokanov [ 08/Jun/20 ] |
|
Sobha Duvvuri, Marc Johnson this index recreation is not related to hit counts. Hit counts implementation doesn't affect it. Indexes are affected by https://folio-org.atlassian.net/browse/RMB-583. So, maybe Julian Ladisch can answer the question. |
| Comment by Julian Ladisch [ 08/Jun/20 ] |
|
>> Should all modules gain advantage of the added ANALYZE step? Should = highly recommended. There may be cases where this is not needed. For example if the table is so small that one cannot notice any performance difference. Or when upgrading changes all records which automatically starts the autovacuum daemon that runs ANALYZE. |
| Comment by Julian Ladisch [ 08/Jun/20 ] |
|
Please use a separate issue (
|
| Comment by Marc Johnson [ 08/Jun/20 ] |
|
Closing this as due to be imminently released. Please re-open if necessary. |