[TRILIUM] ECS-member tenant reindex

Overview

PERF-1236: [ECS] Reindex for mod-search (ecs member reindex) (In Review)

The purpose of this report is to highlight the performance results of the ECS-member tenant reindex. Within the scope of this testing, the following goals need to be achieved:

 

Summary

Baseline

  • Baseline test finished successfully in 3 hr 49 min (reruns may vary from 3.5 to 4.5 hours due to the complexity of the process).

    • Merge phase finished in 2 hr 14 min.

      • All ranges merged successfully. NOTABLE OBSERVATION: it is impossible to determine the duration of the instance merge after the fact, because once the merge finishes the instance row starts tracking the upload and overwrites the merge timestamps (see table below).

    • Upload phase finished in 1 hr 35 min.

      • Two uploads got a failed status due to timeouts: 'subject' and 'contributor'.

        • NOTABLE OBSERVATION: if one of the ranges fails to upload, the status changes to UPLOAD_FAILED and never changes back, even if the retry was successful.

  • Resource usages:

    • DB CPU spikes up to 40% during the merge phase (scaling the DB down to 2XL may be considered).

    • Only two main modules are involved in the process: mod-search and mod-inventory-storage.

      • mod-search max CPU 360% (during upload phase)

      • mod-inventory-storage max CPU 110%

Full reindex (ECS reindex branch)

  • Full reindex with the member tenant reindex branch finished successfully (tested 2 times) with comparable performance. Full duration: 3 hr 35 min.

    • Merge phase finished in 2 hr 15 min

    • Upload phase finished in 1 hr 20 min

  • For the most part, resource usage looks the same, except for CPU usage on mod-search, which grew due to the latest fixes released.

Member tenant reindex

  • Member tenant (001 tenant with 6.5M instances) reindex completed successfully

    • Total duration 1 hour 25 minutes 18 seconds

      • Staging phase (parallel)

        • ~17 minutes

      • Merge phase

        • 37 minutes

      • Upload phase

        • ~30 minutes

  • Resource usage follows the same patterns as in the full reindex.

 

 

Baseline results

 

PTF performed multiple tests before establishing the baseline with this particular version. The list of tests is below:

  1. Test with master branch, Mar 16-20: reindex was constantly failing on upload of call_numbers. The issue was caused by several call_numbers that are attached to 8M items in the DB. mod-search was crashing with OOM (out of memory exception) on upload of call numbers.

    1. From the dev team side: introduced memory management improvements.

    2. From the PTF side: cleaned abnormal call numbers.

    3. Together these actions fixed multiple OOMs, and the reindex can now finish.

  2. Test Mar 22. Reindex finished successfully in 4 hr 7 min (Merge 2 hr 45 min, Upload 1 hr 23 min). Surprisingly good results.

  3. Test Mar 23, with FSE deployed ECS-member-reindex branch on asptc env. Tested full reindex for comparison with master. Full reindex took about 2 hours longer, ~6 hours in total. The issue happened in the merge phase of instances, which this time took much longer. MSEARCH-1196: Stabilize full reindex merge phase (Closed)

  4. Tests Mar 24-27. Started testing with the previous DB snapshot, as it doesn't have the DB updates from the branch. Results are much worse: duration is usually 6-8 hours (the same issue with instances merge was observed).

  5. Tests Mar 30 - Apr 3

    1. Cleaned call numbers on the snapshot once again.

    2. Reverted once again to the previous DB (by manipulating the tables themselves). Results are the same if not worse (once, full reindex took 14 hours). The issue with instances is still there.

    3. A retest on Apr 2 showed that the issue with instances is random and once in a while may be less severe; this time full reindex finished in 4.5 hours.

  6. Tests Apr 3 - Apr 7. The dev team introduced fixes for the subject/instance merge phase (eliminating deadlocks).

    1. PTF performed multiple tests; all of them finished successfully, with a merge phase of 2-2.5 hours (issue fixed).

  7. NEXT step: there's still an issue with the upload phase, which may behave unexpectedly: MSEARCH-1197: Stabilize full reindex upload phase (Closed)

image-20260407-120902.png
  1. In the test reported here as baseline this issue didn't reproduce (supposedly the issue is fixed with the reindex-upload-improvements branch).

  2. Apr 8: retested with the member tenant reindex branch. Full reindex completed successfully (tested 2 times). Member tenant reindex failed due to a timeout.

 

Reindex status

In order to track reindex status, use the query:

SELECT entity_type, status, total_merge_ranges, processed_merge_ranges, total_upload_ranges, processed_upload_ranges, start_time_merge, end_time_merge, start_time_upload, end_time_upload FROM [tenant]_mod_search.reindex_status;
  • Merge phase finished in 2 hr 14 min

  • Upload phase finished in 1 hr 35 min.

  • Total duration 3 hr 49 min
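The durations above follow from the timestamps in the status table below: each phase runs from the earliest start to the latest end across entity types. A minimal sketch of that arithmetic, assuming the timestamps are available as strings in the layout printed in this report; the `ROWS` list, `span` and `fmt` are illustrative names, not part of mod-search:

```python
from datetime import datetime

# Timestamp layout as printed in this report (assumption; the DB column type may differ).
FMT = "%m/%d/%y %H:%M"

# Hypothetical in-memory copy of a few reindex_status rows from the baseline run.
ROWS = [
    {"entity_type": "ITEM", "start_time_merge": "4/6/26 14:23", "end_time_merge": "4/6/26 16:23"},
    {"entity_type": "HOLDINGS", "start_time_merge": "4/6/26 14:23", "end_time_merge": "4/6/26 16:37"},
    {"entity_type": "SUBJECT", "start_time_upload": "4/6/26 16:37", "end_time_upload": "4/6/26 16:43"},
    {"entity_type": "INSTANCE", "start_time_upload": "4/6/26 16:37", "end_time_upload": "4/6/26 18:12"},
]

def span(rows, start_key, end_key):
    """Return (earliest start, latest end) over the rows where those timestamps are present."""
    starts = [datetime.strptime(r[start_key], FMT) for r in rows if r.get(start_key)]
    ends = [datetime.strptime(r[end_key], FMT) for r in rows if r.get(end_key)]
    return min(starts), max(ends)

def fmt(delta):
    """Render a timedelta as 'H hr M min', dropping seconds."""
    hours, minutes = divmod(int(delta.total_seconds()) // 60, 60)
    return f"{hours} hr {minutes} min"

merge_start, merge_end = span(ROWS, "start_time_merge", "end_time_merge")
upload_start, upload_end = span(ROWS, "start_time_upload", "end_time_upload")

merge_duration = fmt(merge_end - merge_start)
upload_duration = fmt(upload_end - upload_start)
total_duration = fmt(upload_end - merge_start)
print(merge_duration, upload_duration, total_duration, sep=" | ")
```

Because the instance row overwrites its merge timestamps once upload starts, this approach can only recover the merge window from the ITEM and HOLDINGS rows.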

| Entity type | Status | Total ranges | Processed ranges | Total to upload | Processed upload | Start time merge | End time merge | Start time upload | End time upload |
|---|---|---|---|---|---|---|---|---|---|
| ITEM | MERGE_COMPLETED | 57871 | 57871 | 0 | 0 | 4/6/26 14:23 | 4/6/26 16:23 | | |
| HOLDINGS | MERGE_COMPLETED | 46275 | 46275 | 0 | 0 | 4/6/26 14:23 | 4/6/26 16:37 | | |
| SUBJECT | UPLOAD_FAILED* | 0 | 0 | 65536 | 65536 | | | 4/6/26 16:37 | 4/6/26 16:43 |
| CONTRIBUTOR | UPLOAD_FAILED* | 0 | 0 | 65536 | 65536 | | | 4/6/26 16:37 | 4/6/26 16:49 |
| CLASSIFICATION | UPLOAD_COMPLETED | 0 | 0 | 65536 | 65536 | | | 4/6/26 16:37 | 4/6/26 17:50 |
| CALL_NUMBER | UPLOAD_COMPLETED | 0 | 0 | 65536 | 65536 | | | 4/6/26 16:37 | 4/6/26 17:54 |
| INSTANCE | UPLOAD_COMPLETED | 0 | 0 | 20827 | 20827 | | | 4/6/26 16:37 | 4/6/26 18:12 |

NOTE: even though the 'subject' and 'contributor' uploads are marked as failed, the total and processed upload range counts are equal, which means that there was a retry and all ranges were uploaded successfully.

In order to verify that no ranges failed, run:
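The check described in this note can be sketched as a small filter over the status rows: a row flagged UPLOAD_FAILED whose processed count equals its total is most likely a stale status left over from a retried range. The field names follow the columns of the status query; the `rows` list and the helper names are hypothetical, and matching counts are only a heuristic that should still be confirmed against the upload_range table.

```python
def upload_fully_processed(row):
    """True when every upload range of the entity was processed."""
    total = row["total_upload_ranges"]
    return total > 0 and row["processed_upload_ranges"] == total

def stale_upload_failures(rows):
    """Entity types marked UPLOAD_FAILED although all of their ranges were processed."""
    return [
        r["entity_type"]
        for r in rows
        if r["status"] == "UPLOAD_FAILED" and upload_fully_processed(r)
    ]

# Hypothetical copy of three rows from the baseline status table.
rows = [
    {"entity_type": "SUBJECT", "status": "UPLOAD_FAILED",
     "total_upload_ranges": 65536, "processed_upload_ranges": 65536},
    {"entity_type": "CONTRIBUTOR", "status": "UPLOAD_FAILED",
     "total_upload_ranges": 65536, "processed_upload_ranges": 65536},
    {"entity_type": "CLASSIFICATION", "status": "UPLOAD_COMPLETED",
     "total_upload_ranges": 65536, "processed_upload_ranges": 65536},
]

stale = stale_upload_failures(rows)
print(stale)
```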

SELECT id, entity_type, lower, upper, created_at, finished_at, status, fail_cause FROM [tenant]_mod_search.upload_range where status !='SUCCESS';

 

image-20260407-105720.png

*The screenshot above shows no failed ranges.

DB metrics

DB CPU is stable and reached a maximum of 40% during the merge phase. Scaling the DB down to 2XL may be considered.

image-20260407-083829.png

Database load is predictable and without anomalies.

The locks visible in the screenshot happened on the reindex_status table because of constant monitoring while mod-search was updating statuses (they do not affect performance or the process itself).

No DB deadlocks or other anomalies were observed. (With previous versions, deadlocks were happening on the subject table during reindex. They were ignored because they did not break the process itself (when a deadlock happens, mod-search processes records one by one); however, they severely affected the performance of the instance merge.)

image-20260407-083456.png

 

image-20260407-084034.png

 

Services metrics

 

Mod-search

snapshot #401 (build date: Apr 5) from deadlocks-improvements branch

image-20260407-084528.png
image-20260407-084820.png

 

 

mod-inventory-storage

 

image-20260407-084551.png

 

image-20260407-084841.png

Open Search metrics

 

Nothing significant was observed on the other charts, so they are not included in the report.

 

indexing rate

image-20260409-102806.png

 

CPU data nodes

image-20260409-102909.png

 

 

 

Full reindex (ECS reindex branch)

Reindex status

Full reindex completed successfully without failed merge ranges or upload ranges.

  • Merge phase completed in 2 hr 15 min

  • Upload phase completed in 1 hr 20 min

Performance is comparable with the baseline: 3 hr 49 min in the baseline test vs 3 hr 35 min with the ECS-member-tenant reindex branch.

| Entity type | Status | Total ranges | Processed ranges | Total to upload | Processed upload | Start time merge | End time merge | Start time upload | End time upload |
|---|---|---|---|---|---|---|---|---|---|
| ITEM | MERGE_COMPLETED | 57871 | 57871 | 0 | 0 | 4/8/26 9:21 | 4/8/26 11:24 | | |
| HOLDINGS | MERGE_COMPLETED | 46275 | 46275 | 0 | 0 | 4/8/26 9:21 | 4/8/26 11:37 | | |
| SUBJECT | UPLOAD_COMPLETED | 0 | 0 | 4096 | 4096 | | | 4/8/26 11:37 | 4/8/26 12:25 |
| CLASSIFICATION | UPLOAD_COMPLETED | 0 | 0 | 4096 | 4096 | | | 4/8/26 11:37 | 4/8/26 12:27 |
| CONTRIBUTOR | UPLOAD_COMPLETED | 0 | 0 | 4096 | 4096 | | | 4/8/26 11:37 | 4/8/26 12:27 |
| CALL_NUMBER | UPLOAD_COMPLETED | 0 | 0 | 4096 | 4096 | | | 4/8/26 11:37 | 4/8/26 12:29 |
| INSTANCE | UPLOAD_COMPLETED | 0 | 0 | 20827 | 20827 | | | 4/8/26 11:38 | 4/8/26 12:52 |

 

DB metrics

 

image-20260409-110301.png

 

The locks in the screenshot below occurred due to constant monitoring of the reindex status.

image-20260409-110151.png

 

image-20260409-110413.png

 

Services metrics

image-20260409-110718.png
image-20260409-110758.png

 

Open Search metrics

indexing rate

image-20260415-082051.png

 

CPU utilization

image-20260415-082148.png

 

 

Member tenant reindex

 

Reindex status

Reindex of member tenant completed successfully. No issues found during test.

 

Total duration 1 hour 25 minutes 18 seconds

  • Staging phase (parallel)

    • ~17 minutes

  • Merge phase

    • 37 minutes

  • Upload phase

    • ~30 minutes
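As a quick sanity check, the phase figures above can be added up and compared with the reported total. This is only a sketch of the arithmetic: the staging and upload values are approximate in the report, so the remainder is indicative rather than exact, and all variable names are illustrative.

```python
# Phase durations in minutes as listed above (staging and upload are approximate).
PHASE_MINUTES = {"staging": 17, "merge": 37, "upload": 30}

# Reported total: 1 hour 25 minutes 18 seconds.
REPORTED_TOTAL_SECONDS = 1 * 3600 + 25 * 60 + 18

phase_seconds = sum(PHASE_MINUTES.values()) * 60
unaccounted_seconds = REPORTED_TOTAL_SECONDS - phase_seconds

# Share of the reported total taken by each phase, in percent.
shares = {
    name: round(100 * minutes * 60 / REPORTED_TOTAL_SECONDS, 1)
    for name, minutes in PHASE_MINUTES.items()
}

print(unaccounted_seconds, shares)
```

The small remainder is consistent with rounding of the approximate phase values and short gaps between phases.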

 

Entity type

Status

Total ranges

Processed ranges

Total to upload

Processed upload

Start time merge

End time merge

start time upload

end time upload

tenant id

start time staging

end time staging

CONTRIBUTOR

UPLOAD_COMPLETED

0

0

4096

4096

 

 

4/10/26 22:51

4/10/26 22:53

cs00000int_0001

 

 

ITEM

STAGING_COMPLETED

19927

19927

0

0

4/10/26 21:57

4/10/26 22:11

 

 

cs00000int_0001

4/10/26 22:14

4/10/26 22:51

HOLDINGS

STAGING_COMPLETED

14560

14561

0