Data Import Update multi tenant (Poppy)

Overview

  • This document contains the results of testing Data Import for MARC Bibliographic records with an update job in the Poppy release. Ticket: PERF-779

Summary

Test set run №2 (updated configuration for mod-permission)

  • Three tests of DI with update jobs with 25k records were carried out on the pcp1 cluster.
  • DI with a new configuration for mod-permission was successful. DI jobs are processed one by one in test run #3.
  • Mod-permission version changed from 6.4.0 → 6.3.2 doesn't solve the issue with NullPointerException: "mappingParameters" is null for configuration parameter for mod-permission in Mem Hard Limit=1684, Mem Soft limit=1544, Xmx=1024m. 
  • Changed configuration parameter for mod-permission in Mem Hard Limit=1684→2384, Mem Soft limit=1544→2244, Xmx=1024m→1500m allowed to run tests successfully and avoided issue with "mappingParameters" is null.
  • Memory consumption for mod-data-import was 36%, mod-inventory - 94%. No memory leaks are suspected.
  • CPU utilization for mod-data-import spiked to 72%, mod-inventory - 171%. It was observed that CPU utilization for mod-inventory has a growing trend.
  • DB CPU utilization was 97%.
  • DB connections - 500 on average and 681 during spikes connected to mod-permission spikes on the CPU graph.
  • Slow queries found -  SELECT jsonb FROM fs07000002_mod_permissions.permissions

Test set run №1

  • Three tests of DI with update jobs with 25k records were carried out on the pcp1 cluster.
  • The first and second tests were successful. The third test of DI on three tenants running concurrently failed after a third data import started. Errors attached.
  • Slow queries found for SELECT jsonb FROM fs07000001_mod_permissions.permissions

  • Issue with a spike from mod-permission that led to test #3 problems were investigated. The module stopped after a timeout. Additional analysis with a method that consumed most of the resources attached.

Recommendations and Jiras

  • Created Jira ticket to Investigate "mappingParameters" is null error: PERF-801

Test Results

Test set run#2 (updated configuration for mod-permission)

ProfileTest #TenantMARC File

DI Duration Poppy (hh:mm:ss)

Results

DI MARC Bib Update (PTF - Updates Success - 1)1fso700000125K.mrc00:17:18Completed
2fso900000025K.mrc00:23:11Completed
fso700000225K.mrc00:24:16Completed
3fso900000025K.mrc00:26:36Completed
fso700000225K.mrc00:38:01Completed
fso700000125K.mrc00:49:24
Completed


Test set run#1

ProfileTest #TenantMARC File

DI Duration Poppy (hh:mm:ss)

Results

DI MARC Bib Update (PTF - Updates Success - 1)1fso900000025K.mrc00:20:23Completed
2fso900000025K.mrc00:24:48Completed
fso700000125K.mrc00:31:53Completed
3fso900000025K.mrc00:27:10Completed
fso700000125K.mrc00:36:09with errors*
fso700000225K.mrc00:30:29with errors*

Comparison

The following table compares previous results of Poppy release with current results.

ProfileMARC FileTest run #DI DurationDI Delta, (hh:mm:ss)
Poppy (previous)Poppy
DI MARC Bib Update (PTF - Updates Success - 1)

25K.mrc

100:14:5000:17:18+ 00:02:28
2not tested00:24:16-
3not tested00:49:24-


Errors for test set #1

 Errors test #3


15:23:10 [] [fs07000001] [] [] WARN  teModifyEventHandler handle:: Error while MARC record modifying

javax.ws.rs.BadRequestException: Date when processing started is not set, expected snapshot status is PARSING_IN_PROGRESS, actual - PROCESSING_IN_PROGRESS

at org.folio.services.RecordServiceImpl.lambda$3(RecordServiceImpl.java:112) ~[ms.jar:?]

at io.vertx.core.impl.future.Composition.onSuccess(Composition.java:38) ~[ms.jar:?]

at io.vertx.core.impl.future.FutureBase.emitSuccess(FutureBase.java:60) ~[ms.jar:?]

at io.vertx.core.impl.future.FutureImpl.tryComplete(FutureImpl.java:211) ~[ms.jar:?]


15:11:11 [] [fs07000002] [90aad488-be59-4879-b63b-2f8f13b08e85] [] ERROR KafkaConsumerWrapper businessHandlerCompletionHandler:: Error while processing a record - id: 1 subscriptionPattern: SubscriptionDefinition(eventType=DI_RAW_RECORDS_CHUNK_READ, subscriptionPattern=pcp1\.Default\.\w{1,}\.DI_RAW_RECORDS_CHUNK_READ) offset: 9738

org.folio.services.exceptions.RawChunkRecordsParsingException: java.lang.NullPointerException: Cannot invoke "org.folio.processing.mapping.defaultmapper.processor.parameters.MappingParameters.getLinkingRules()" because "mappingParameters" is null

at org.folio.verticle.consumers.RawMarcChunksKafkaHandler.lambda$handle$1(RawMarcChunksKafkaHandler.java:98) ~[ms.jar:?]

at io.vertx.core.impl.future.Composition.onFailure(Composition.java:50) ~[ms.jar:?]

at io.vertx.core.impl.future.FutureBase.emitFailure(FutureBase.java:75) ~[ms.jar:?]


Caused by: java.lang.NullPointerException: Cannot invoke "org.folio.processing.mapping.defaultmapper.processor.parameters.MappingParameters.getLinkingRules()" because "mappingParameters" is null

at org.folio.services.afterprocessing.FieldModificationServiceImpl.lambda$remove9Subfields$2(FieldModificationServiceImpl.java:32) ~[ms.jar:?]

at io.vertx.core.impl.future.Mapping.onSuccess(Mapping.java:35) ~[ms.jar:?]

at io.vertx.core.impl.future.FutureBase.emitSuccess(FutureBase.java:60) ~[ms.jar:?]

at io.vertx.core.impl.future.FutureImpl.tryComplete(FutureImpl.java:211) ~[ms.jar:?]


15:11:10 [] [fs07000002] [90aad488-be59-4879-b63b-2f8f13b08e85] [] ERROR RestUtil             Response HTTP code is not equals 200, 201, 204. Response code: 504

15:11:10 [] [fs07000002] [90aad488-be59-4879-b63b-2f8f13b08e85] [] ERROR ngParametersProvider Something happened while initializing mapping parameters

javax.ws.rs.BadRequestException: HTTP 400 Bad Request

at org.folio.dataimport.util.RestUtil.validateAsyncResult(RestUtil.java:159) ~[ms.jar:?]

at org.folio.services.mappers.processor.MappingParametersProvider.lambda$getIdentifierTypes$2(MappingParametersProvider.java:243) ~[ms.jar:?]

java.lang.NullPointerException: Cannot invoke "org.folio.processing.mapping.defaultmapper.processor.parameters.MappingParameters.getLinkingRules()" because "mappingParameters" is null

java.lang.NullPointerException: Cannot invoke "org.folio.processing.mapping.defaultmapper.processor.parameters.MappingParameters.getLinkingRules()" because "mappingParameters" is null

Unable to retrieve permissions (code 400): Timeout for DB_HOST:DB_PORT=db.pcp1.folio-eis.us-east-1:5432

Timeout for DB_HOST:DB_PORT=db.pcp1.folio-eis.us-east-1:54
15:09:48 [037670/locations;130709/perms] [fs07000001] [] [mod_permissions] ERROR PermsAPI             Timeout for DB_HOST:DB_PORT=db.pcp1.folio-eis.us-east-1:5432

Caused by: java.net.SocketTimeoutException: 30,000 milliseconds timeout on connection http-outgoing-959 [ACTIVE]
pcp1/mod-search/ff1a258966fc4c69b8f88d1655267aa4

15:16:47 [] [] [] [] WARN  KafkaMessageListener Failed to index resource event [eventType: CREATE, tenantId: fs09000000, id: cbc38005b4e939fe85401fc369bdcd3049c80b8f]
org.folio.search.exception.SearchOperationException: Failed to perform elasticsearch request [index=pcp1_contributor_fs09000000, type=bulkApi, message: 30,000 milliseconds timeout on connection http-outgoing-959 [ACTIVE]]
    at org.folio.search.utils.SearchUtils.performExceptionalOperation(SearchUtils.java:97) ~[classes!/:3.0.5]
    at org.folio.search.repository.AbstractResourceRepository.executeBulkRequest(AbstractResourceRepository.java:55) ~[classes!/:3.0.5]
    at java.lang.reflect.Method.invoke(Unknown Source) ~[?:?]
    
15:25:15 [] [] [] [] WARN  KafkaMessageListener Failed to index resource event [eventType: CREATE, tenantId: fs07000002, id: 9d97f034-a06a-4bba-8895-a86ab367b87a]
feign.FeignException$Unauthorized: [401 Unauthorized] during [GET] to [http://user-tenants?tenantId=fs07000002] [UserTenantsClient#getUserTenants(String)]: [Invalid token]


pcp1/mod-source-record-manager
15:02:50 [] [fs07000001] [9eb67301-6f6e-468f-9b1a-6134dc39a684] [] ERROR RestUtil             Response HTTP code is not equals 200, 201, 204. Response code: 500

mod-permission heap dump analysis

 Heap dump analysis

Problem Suspect

46 instances of io.vertx.pgclient.impl.codec.PgCodec, loaded by jdk.internal.loader.ClassLoaders$AppClassLoader @ 0xd5740ac8 occupy 860.04 MB (88.35%) bytes.

Biggest instances:

  • io.vertx.pgclient.impl.codec.PgCodec @ 0xd6e5da00 - 40.11 MB (4.12%) bytes.
  • io.vertx.pgclient.impl.codec.PgCodec @ 0xd6da0778 - 36.84 MB (3.78%) bytes.
  • io.vertx.pgclient.impl.codec.PgCodec @ 0xd6e38210 - 32.91 MB (3.38%) bytes.
  • io.vertx.pgclient.impl.codec.PgCodec @ 0xd6f4e340 - 31.03 MB (3.19%) bytes.
  • io.vertx.pgclient.impl.codec.PgCodec @ 0xd6f4e638 - 30.90 MB (3.17%) bytes.
  • io.vertx.pgclient.impl.codec.PgCodec @ 0xd6f4e200 - 30.89 MB (3.17%) bytes.
  • io.vertx.pgclient.impl.codec.PgCodec @ 0xd6d757a0 - 29.14 MB (2.99%) bytes.
  • io.vertx.pgclient.impl.codec.PgCodec @ 0xd6f4e0c0 - 28.82 MB (2.96%) bytes.
  • io.vertx.pgclient.impl.codec.PgCodec @ 0xd6f53c78 - 28.82 MB (2.96%) bytes.
  • io.vertx.pgclient.impl.codec.PgCodec @ 0xd6f4e4f8 - 27.18 MB (2.79%) bytes.
  • io.vertx.pgclient.impl.codec.PgCodec @ 0xd6f54170 - 27.14 MB (2.79%) bytes.
  • io.vertx.pgclient.impl.codec.PgCodec @ 0xd66bcd90 - 25.69 MB (2.64%) bytes.
  • io.vertx.pgclient.impl.codec.PgCodec @ 0xd66bced0 - 25.44 MB (2.61%) bytes.
  • io.vertx.pgclient.impl.codec.PgCodec @ 0xd66bcbe0 - 25.43 MB (2.61%) bytes.
  • io.vertx.pgclient.impl.codec.PgCodec @ 0xd6da06f8 - 25.39 MB (2.61%) bytes.
  • io.vertx.pgclient.impl.codec.PgCodec @ 0xd6e38258 - 25.39 MB (2.61%) bytes.
  • io.vertx.pgclient.impl.codec.PgCodec @ 0xd66b3640 - 25.39 MB (2.61%) bytes.
  • io.vertx.pgclient.impl.codec.PgCodec @ 0xd66bcd10 - 25.39 MB (2.61%) bytes.
  • io.vertx.pgclient.impl.codec.PgCodec @ 0xd66bcd50 - 25.39 MB (2.61%) bytes.
  • io.vertx.pgclient.impl.codec.PgCodec @ 0xd66bce10 - 25.39 MB (2.61%) bytes.
  • io.vertx.pgclient.impl.codec.PgCodec @ 0xd66c3830 - 25.39 MB (2.61%) bytes.
  • io.vertx.pgclient.impl.codec.PgCodec @ 0xd6f4e840 - 25.05 MB (2.57%) bytes.
  • io.vertx.pgclient.impl.codec.PgCodec @ 0xd6f4e480 - 24.06 MB (2.47%) bytes.
  • io.vertx.pgclient.impl.codec.PgCodec @ 0xd66bce50 - 22.06 MB (2.27%) bytes.
  • io.vertx.pgclient.impl.codec.PgCodec @ 0xd66bcf50 - 21.79 MB (2.24%) bytes.
  • io.vertx.pgclient.impl.codec.PgCodec @ 0xd6f54668 - 21.35 MB (2.19%) bytes.
  • io.vertx.pgclient.impl.codec.PgCodec @ 0xd6f54b60 - 20.21 MB (2.08%) bytes.
  • io.vertx.pgclient.impl.codec.PgCodec @ 0xd6da0738 - 17.89 MB (1.84%) bytes.
  • io.vertx.pgclient.impl.codec.PgCodec @ 0xd6da0d10 - 17.89 MB (1.84%) bytes.
  • io.vertx.pgclient.impl.codec.PgCodec @ 0xd6da0678 - 14.16 MB (1.45%) bytes.
  • io.vertx.pgclient.impl.codec.PgCodec @ 0xd6da06b8 - 14.16 MB (1.45%) bytes.
  • io.vertx.pgclient.impl.codec.PgCodec @ 0xd66bcc50 - 10.93 MB (1.12%) bytes.
  • io.vertx.pgclient.impl.codec.PgCodec @ 0xd66bcc90 - 10.91 MB (1.12%) bytes.
  • io.vertx.pgclient.impl.codec.PgCodec @ 0xd66c38b0 - 10.63 MB (1.09%) bytes.

These instances are referenced from one instance of java.util.HashMap$Node[], loaded by <system class loader>, which occupies 5.02 KB (0.00%) bytes.

Keywords

  • io.vertx.pgclient.impl.codec.PgCodec
  • jdk.internal.loader.ClassLoaders$AppClassLoader @ 0xd5740ac8
  • java.util.HashMap$Node[]

Resource utilization table test set#1

 Test set #1 Resource utilization
 Resource utilization table
CPUMemory
mod-inventory-b145.45mod-inventory-b100.19
mod-di-converter-storage-b66.41mod-source-record-storage-b69.66
nginx-okapi54.57mod-source-record-manager-b62.67
mod-source-record-storage-b37.45mod-data-import-b52.66
mod-data-import-b26.83okapi-b52.38
mod-source-record-manager-b26.69mod-users-b38.56
okapi-b25.98mod-di-converter-storage-b36.63
mod-inventory-storage-b17.04mod-inventory-storage-b17.56
mod-users-b14.38nginx-okapi6.14
pub-okapi0.40pub-okapi4.58


Service CPU Utilization

Mod-inventory - 148% during DI on 3 tenants, data-import - 58%, 

Service Memory Utilization

Mod-inventory grew from 93% to 100%. Data-import - 52%.

DB CPU Utilization

The highest - 87% during DI on 3 tenants

DB Connections

DB connections for all tests were 250 in average and 345 during spike. Spike here shows when errors happened.

DB load

Top SQL-queries:

Here we observe that two queries are very slow. Should be investigated.

SELECT jsonb FROM fs07000001_mod_permissions.permissions

SELECT jsonb FROM fs07000002_mod_permissions.permissions



Resource utilization table test set#2

 Resource utilization table
CPUMemory
mod-inventory-b171mod-inventory-b93.65
mod-di-converter-storage-b80.99mod-permissions-b81.97
nginx-okapi75.33mod-source-record-storage-b75.97
mod-source-record-storage-b40.94mod-source-record-manager-b52.97
mod-source-record-manager-b36.95okapi-b40.22
okapi-b33.33mod-di-converter-storage-b36.31
mod-permissions-b30.91mod-data-import-b35.57
mod-inventory-storage-b22.84mod-inventory-storage-b17.25
mod-data-import-b1.742nginx-okapi4.799
pub-okapi0.38pub-okapi4.408

Service CPU Utilization

The highest mod-inventory - 171% during DI on 1 tenant, data-import - 135% in 3 tenants test, mod-permission spiked in 3 tenants test to 462%. 

Service Memory Utilization

Mod-inventory - 93%. Data-import - 36%.

DB CPU Utilization

The highest - 97% during all tests

DB Connections

DB connections before test started was 400, after for all tests were 500 in average and 681 during spike. Spikes are connected to spikes in CPU utilization for mod-permission.

DB load


Top SQL-queries:

All tests

3 tenants test


Appendix

Infrastructure

PTF -environment pcp1

  • 10 m6i.2xlarge EC2 instances located in US East (N. Virginia)us-east-1
  • 1 database  instances, writer

    NameMemory GIBvCPUsmax_connections

    db.r6g.xlarge

    32 GiB4 vCPUs2731
  • MSK tenant
    • 4 m5.2xlarge brokers in 2 zones
    • Apache Kafka version 2.8.0

    • EBS storage volume per broker 300 GiB

    • auto.create.topics.enable=true
    • log.retention.minutes=480
    • default.replication.factor=3
  • Kafka consolidated topics and file splitting features enabled on a non-ecs-enabled environment. 

For Test set run №2 mod-permissions parameters were changed to mod-permissions_version=6.4.0->6.3.2, Mem Hard Limit=1684->2384 , Mem Soft limit=1544->2244, Xmx=1024m->1500m

ModuleTask Def. RevisionModule VersionTask CountMem Hard LimitMem Soft limitCPU unitsXmxMetaspaceSizeMaxMetaspaceSizeR/W split enabled
pcp1-pvt
Thu Feb 01 11:17:08 UTC 2024
mod-data-import20mod-data-import:3.0.71204818442561292384512FALSE
mod-inventory-storage14mod-inventory-storage:27.0.424096369020483076384512FALSE
mod-source-record-storage17mod-source-record-storage:5.7.525600500020483500384512FALSE
mod-inventory13mod-inventory:20.1.722880259210241814384512FALSE
mod-di-converter-storage17mod-di-converter-storage:2.1.52102489612876888128FALSE
mod-source-record-manager16mod-source-record-manager:3.7.825600500020483500384512FALSE
nginx-okapi9nginx-okapi:2023.06.1421024896128000FALSE
okapi-b11okapi:5.1.23168414401024922384512FALSE
pub-okapi9pub-okapi:2023.06.142102489612876800FALSE

Methodology/Approach

DI tests scenario(DI MARC Bib Update) were started from UI.

Test set №1

  • Test 1: Manually tested 25k records files DI started on one tenant only.
  • Test 2: Manually tested 25k records files DI started on 2 tenants concurrently, step 30%.
  • Test 3: Manually tested 25k records files DI started on 3 tenants concurrently, step 30%.

Test set №2

  • Changed configuration parameter for mod-permission in Mem Hard Limit=1684→2384, Mem Soft limit=1544→2244, Xmx=1024m→1500m.
  • Test 1: Manually tested 25k records files DI started on one tenant only.
  • Test 2: Manually tested 25k records files DI started on 2 tenants concurrently, step 30%.
  • Test 3: Manually tested 25k records files DI started on 3 tenants concurrently, step 30%.

Additional links

Link to Jira ticket: https://folio-org.atlassian.net/browse/PERF-801