Investigation - slow running query during data import on big datasets

Description

There are slow running query detected during DI BIB create from mod search side:

WITH cte AS (SELECT id, name, name_type_id, authority_id, last_updated_date FROM fs09000000_mod_search.contributor WHERE last_updated_date > $1 ORDER BY last_updated_date ) SELECT c.id, c.name, c.name_type_id, c.authority_id, c.last_updated_date, json_agg( CASE WHEN sub.instance_count IS NULL THEN NULL ELSE json_build_object( 'count', sub.instance_count, 'typeId', sub.type_ids, 'shared', sub.shared, 'tenantId', sub.tenant_id ) END ) AS instances FROM cte c LEFT JOIN (SELECT cte.id, ins.tenant_id, ins.shared, array_agg(DISTINCT ins.type_id) FILTER (WHERE ins.type_id <> '') AS type_ids, count(DISTINCT ins.instance_id) AS instance_count FROM fs09000000_mod_search.instance_contributor ins INNER JOIN cte ON ins.contributor_id = cte.id GROUP BY cte.id, ins.tenant_id, ins.shared) sub ON c.id = sub.id GROUP BY c.id, c.name, c.name_type_id, c.authority_id, c.last_updated_date ORDER BY last_updated_date ASC

This query takes (two occurrences found) 657973.300 ms and 252287.791 ms

mod-search version: 4.0.7

PTF data set (rcp1 env. )

 

 

Data volume in mod-search schema:

  • search.instance - 4 109 321

  • search.instance_contributor - 8 327 231

Dataset Size:

Record type

Number of records

Instances

30 243 010

Holdings

30 301 825

Items

31 390 568

marc-records

49 547 275

More details in DI report https://folio-org.atlassian.net/wiki/spaces/FOLIJET/pages/729841670/Data+Import+test+report+Ramsons+non-ECS

CSP Request Details

None

CSP Rejection Details

None

Potential Workaround

None

Attachments

5
  • 05 Mar 2025, 11:00 AM
  • 05 Mar 2025, 11:00 AM
  • 05 Mar 2025, 11:00 AM
  • 05 Mar 2025, 11:00 AM
  • 05 Mar 2025, 11:00 AM

Confluence content

mentioned on

Checklist

hide

Activity

Show:

Viacheslav Kolesnyk March 10, 2025 at 3:42 PM

each instance in those files has one same contributor, and considering bugfest has 500_000 instances with the same contributor I would assume you have that number as well. Same goes for subjects. So I assume those queries took that long because of the huge dataset which is expected

Roman_Fedynyshyn March 5, 2025 at 11:00 AM

Hey here they are

Pavlo Smahin March 5, 2025 at 10:43 AM

Hi , please provide files that you used for this testing.

Done

Details

Assignee

Reporter

Priority

Story Points

Sprint

Development Team

Spitfire

Release

Not For Release

RCA Group

Data related (ex. Can be detected with large dataset only)

TestRail: Cases

Open TestRail: Cases

TestRail: Runs

Open TestRail: Runs
Created February 18, 2025 at 10:34 AM
Updated March 12, 2025 at 1:08 PM
Resolved March 12, 2025 at 1:07 PM
TestRail: Cases
TestRail: Runs

Flag notifications