Investigation - slow running query during data import on big datasets

Description

There are slow running query detected during DI BIB create from mod search side:

WITH cte AS (SELECT id,
	                    name,
	                    name_type_id,
	                    authority_id,
	                    last_updated_date
	             FROM fs09000000_mod_search.contributor
	             WHERE last_updated_date > $1
	             ORDER BY last_updated_date
	             )
	SELECT c.id,
	       c.name,
	       c.name_type_id,
	       c.authority_id,
	       c.last_updated_date,
	       json_agg(
	               CASE
	                                WHEN sub.instance_count IS NULL THEN NULL
	                                ELSE json_build_object(
	                       'count', sub.instance_count,
	                       'typeId', sub.type_ids,
	                       'shared', sub.shared,
	                       'tenantId', sub.tenant_id
	               )
	               END
	       ) AS instances
	FROM cte c
	         LEFT JOIN
	     (SELECT cte.id,
	             ins.tenant_id,
	             ins.shared,
	             array_agg(DISTINCT ins.type_id) FILTER (WHERE ins.type_id <> '') AS type_ids,
	             count(DISTINCT ins.instance_id)                                  AS instance_count
	      FROM fs09000000_mod_search.instance_contributor ins
	               INNER JOIN cte
	                          ON ins.contributor_id = cte.id
	      GROUP BY cte.id,
	               ins.tenant_id,
	               ins.shared) sub ON c.id = sub.id
	GROUP BY c.id,
	         c.name,
	         c.name_type_id,
	         c.authority_id,
	         c.last_updated_date
	      ORDER BY last_updated_date ASC

This query takes (two occurrences found) 657973.300 ms and 252287.791 ms

mod-search version: 4.0.7

PTF data set (rcp1 env. )

Data volume in mod-search schema:

search.instance - 4 109 321
search.instance_contributor - 8 327 231

Dataset Size:

Record type	Number of records
Instances	30 243 010
Holdings	30 301 825
Items	31 390 568
marc-records	49 547 275

More details in DI report https://folio-org.atlassian.net/wiki/spaces/FOLIJET/pages/729841670/Data+Import+test+report+Ramsons+non-ECS

CSP Request Details

None

CSP Rejection Details

None

Potential Workaround

None

Attachments

05 Mar 2025, 11:00 AM
05 Mar 2025, 11:00 AM
05 Mar 2025, 11:00 AM
05 Mar 2025, 11:00 AM
05 Mar 2025, 11:00 AM

Confluence content

mentioned on

Checklist

hide

Activity

Show:

Viacheslav Kolesnyk March 10, 2025 at 3:42 PM

@Roman_Fedynyshyn each instance in those files has one same contributor, and considering bugfest has 500_000 instances with the same contributor I would assume you have that number as well. Same goes for subjects. So I assume those queries took that long because of the huge dataset which is expected

Roman_Fedynyshyn March 5, 2025 at 11:00 AM

Hey @Pavlo Smahin here they are

Pavlo Smahin March 5, 2025 at 10:43 AM

Hi @Roman_Fedynyshyn , please provide files that you used for this testing.

Done

Details

Assignee

Viacheslav Kolesnyk

Reporter

Roman_Fedynyshyn

Labels

back-endepam-spitfire

Priority

Story Points

Sprint

None

Development Team

Spitfire

Release

Not For Release

RCA Group

Data related (ex. Can be detected with large dataset only)

TestRail: Cases

Open TestRail: Cases

TestRail: Runs

Open TestRail: Runs

Created February 18, 2025 at 10:34 AM

Updated March 12, 2025 at 1:08 PM

Resolved March 12, 2025 at 1:07 PM

Configure

TestRail: Cases

TestRail: Runs

Investigation - slow running query during data import on big datasets

Description

CSP Request Details

CSP Rejection Details

Potential Workaround

Attachments

Confluence content

mentioned on

Checklist

Activity

Viacheslav Kolesnyk March 10, 2025 at 3:42 PM

Roman_Fedynyshyn March 5, 2025 at 11:00 AM

Pavlo Smahin March 5, 2025 at 10:43 AM

Details

Assignee

Reporter

Labels

Priority

Story Points

Sprint

Development Team

Release

RCA Group

TestRail: Cases

TestRail: Runs

Flag notifications

Something's gone wrong