Data Import Update multi tenant (Poppy)

Data Import Update multi tenant (Poppy)

Overview

  • This document contains the results of testing Data Import for MARC Bibliographic records with an update job in the Poppy release. Ticket: PERF-779

Summary

Test set run №2 (updated configuration for mod-permission)

  • Three tests of DI with update jobs with 25k records were carried out on the pcp1 cluster.

  • DI with a new configuration for mod-permission was successful. DI jobs are processed one by one in test run #3.

  • Mod-permission version changed from 6.4.0 → 6.3.2 doesn't solve the issue with NullPointerException: "mappingParameters" is null for configuration parameter for mod-permission in Mem Hard Limit=1684, Mem Soft limit=1544, Xmx=1024m. 

  • Changed configuration parameter for mod-permission in Mem Hard Limit=1684→2384, Mem Soft limit=1544→2244, Xmx=1024m→1500m allowed to run tests successfully and avoided issue with "mappingParameters" is null.

  • Memory consumption for mod-data-import was 36%, mod-inventory - 94%. No memory leaks are suspected.

  • CPU utilization for mod-data-import spiked to 72%, mod-inventory - 171%. It was observed that CPU utilization for mod-inventory has a growing trend.

  • DB CPU utilization was 97%.

  • DB connections - 500 on average and 681 during spikes connected to mod-permission spikes on the CPU graph.

  • Slow queries found -  SELECT jsonb FROM fs07000002_mod_permissions.permissions

Test set run №1

  • Three tests of DI with update jobs with 25k records were carried out on the pcp1 cluster.

  • The first and second tests were successful. The third test of DI on three tenants running concurrently failed after a third data import started. Errors attached.

  • Slow queries found for SELECT jsonb FROM fs07000001_mod_permissions.permissions

  • Issue with a spike from mod-permission that led to test #3 problems were investigated. The module stopped after a timeout. Additional analysis with a method that consumed most of the resources attached.

Recommendations and Jiras

  • Created Jira ticket to Investigate "mappingParameters" is null error: PERF-801

Test Results

Test set run#2 (updated configuration for mod-permission)

Profile

Test #

Tenant

MARC File

DI Duration Poppy (hh:mm:ss)

Results

Profile

Test #

Tenant

MARC File

DI Duration Poppy (hh:mm:ss)

Results

DI MARC Bib Update (PTF - Updates Success - 1)

1

fso7000001

25K.mrc

00:17:18

Completed

2

fso9000000

25K.mrc

00:23:11

Completed

fso7000002

25K.mrc

00:24:16

Completed

3

fso9000000

25K.mrc

00:26:36

Completed

fso7000002

25K.mrc

00:38:01

Completed

fso7000001

25K.mrc

00:49:24

Completed



Test set run#1

Profile

Test #

Tenant

MARC File

DI Duration Poppy (hh:mm:ss)

Results

Profile

Test #

Tenant

MARC File

DI Duration Poppy (hh:mm:ss)

Results

DI MARC Bib Update (PTF - Updates Success - 1)

1

fso9000000

25K.mrc

00:20:23

Completed

2

fso9000000

25K.mrc

00:24:48

Completed

fso7000001

25K.mrc

00:31:53

Completed

3

fso9000000

25K.mrc

00:27:10

Completed

fso7000001

25K.mrc

00:36:09

with errors*

fso7000002

25K.mrc

00:30:29

with errors*

Comparison

The following table compares previous results of Poppy release with current results.

Profile

MARC File

Test run #

DI Duration

DI Delta, (hh:mm:ss)

Poppy (previous)

Poppy

DI MARC Bib Update (PTF - Updates Success - 1)


25K.mrc


1

00:14:50

00:17:18

+ 00:02:28

2

not tested

00:24:16

-

3

not tested

00:49:24

-



Errors for test set #1



15:23:10 [] [fs07000001] [] [] WARN  teModifyEventHandler handle:: Error while MARC record modifying

javax.ws.rs.BadRequestException: Date when processing started is not set, expected snapshot status is PARSING_IN_PROGRESS, actual - PROCESSING_IN_PROGRESS

at org.folio.services.RecordServiceImpl.lambda$3(RecordServiceImpl.java:112) ~[ms.jar:?]

at io.vertx.core.impl.future.Composition.onSuccess(Composition.java:38) ~[ms.jar:?]

at io.vertx.core.impl.future.FutureBase.emitSuccess(FutureBase.java:60) ~[ms.jar:?]

at io.vertx.core.impl.future.FutureImpl.tryComplete(FutureImpl.java:211) ~[ms.jar:?]



15:11:11 [] [fs07000002] [90aad488-be59-4879-b63b-2f8f13b08e85] [] ERROR KafkaConsumerWrapper businessHandlerCompletionHandler:: Error while processing a record - id: 1 subscriptionPattern: SubscriptionDefinition(eventType=DI_RAW_RECORDS_CHUNK_READ, subscriptionPattern=pcp1\.Default\.\w{1,}\.DI_RAW_RECORDS_CHUNK_READ) offset: 9738

org.folio.services.exceptions.RawChunkRecordsParsingException: java.lang.NullPointerException: Cannot invoke "org.folio.processing.mapping.defaultmapper.processor.parameters.MappingParameters.getLinkingRules()" because "mappingParameters" is null

at org.folio.verticle.consumers.RawMarcChunksKafkaHandler.lambda$handle$1(RawMarcChunksKafkaHandler.java:98) ~[ms.jar:?]

at io.vertx.core.impl.future.Composition.onFailure(Composition.java:50) ~[ms.jar:?]

at io.vertx.core.impl.future.FutureBase.emitFailure(FutureBase.java:75) ~[ms.jar:?]



Caused by: java.lang.NullPointerException: Cannot invoke "org.folio.processing.mapping.defaultmapper.processor.parameters.MappingParameters.getLinkingRules()" because "mappingParameters" is null

at org.folio.services.afterprocessing.FieldModificationServiceImpl.lambda$remove9Subfields$2(FieldModificationServiceImpl.java:32) ~[ms.jar:?]

at io.vertx.core.impl.future.Mapping.onSuccess(Mapping.java:35) ~[ms.jar:?]

at io.vertx.core.impl.future.FutureBase.emitSuccess(FutureBase.java:60) ~[ms.jar:?]

at io.vertx.core.impl.future.FutureImpl.tryComplete(FutureImpl.java:211) ~[ms.jar:?]



15:11:10 [] [fs07000002] [90aad488-be59-4879-b63b-2f8f13b08e85] [] ERROR RestUtil             Response HTTP code is not equals 200, 201, 204. Response code: 504

15:11:10 [] [fs07000002] [90aad488-be59-4879-b63b-2f8f13b08e85] [] ERROR ngParametersProvider Something happened while initializing mapping parameters

javax.ws.rs.BadRequestException: HTTP 400 Bad Request

at org.folio.dataimport.util.RestUtil.validateAsyncResult(RestUtil.java:159) ~[ms.jar:?]

at org.folio.services.mappers.processor.MappingParametersProvider.lambda$getIdentifierTypes$2(MappingParametersProvider.java:243) ~[ms.jar:?]

java.lang.NullPointerException: Cannot invoke "org.folio.processing.mapping.defaultmapper.processor.parameters.MappingParameters.getLinkingRules()" because "mappingParameters" is null

java.lang.NullPointerException: Cannot invoke "org.folio.processing.mapping.defaultmapper.processor.parameters.MappingParameters.getLinkingRules()" because "mappingParameters" is null

Unable to retrieve permissions (code 400): Timeout for DB_HOST:DB_PORT=db.pcp1.folio-eis.us-east-1:5432

Timeout for DB_HOST:DB_PORT=db.pcp1.folio-eis.us-east-1:54
15:09:48 [037670/locations;130709/perms] [fs07000001] [] [mod_permissions] ERROR PermsAPI             Timeout for DB_HOST:DB_PORT=db.pcp1.folio-eis.us-east-1:5432

Caused by: java.net.SocketTimeoutException: 30,000 milliseconds timeout on connection http-outgoing-959 [ACTIVE]
pcp1/mod-search/ff1a258966fc4c69b8f88d1655267aa4

15:16:47 [] [] [] [] WARN  KafkaMessageListener Failed to index resource event [eventType: CREATE, tenantId: fs09000000, id: cbc38005b4e939fe85401fc369bdcd3049c80b8f]
org.folio.search.exception.SearchOperationException: Failed to perform elasticsearch request [index=pcp1_contributor_fs09000000, type=bulkApi, message: 30,000 milliseconds timeout on connection http-outgoing-959 [ACTIVE]]
    at org.folio.search.utils.SearchUtils.performExceptionalOperation(SearchUtils.java:97) ~[classes!/:3.0.5]
    at org.folio.search.repository.AbstractResourceRepository.executeBulkRequest(AbstractResourceRepository.java:55) ~[classes!/:3.0.5]
    at java.lang.reflect.Method.invoke(Unknown Source) ~[?:?]
    
15:25:15 [] [] [] [] WARN  KafkaMessageListener Failed to index resource event [eventType: CREATE, tenantId: fs07000002, id: 9d97f034-a06a-4bba-8895-a86ab367b87a]
feign.FeignException$Unauthorized: [401 Unauthorized] during [GET] to [http://user-tenants?tenantId=fs07000002] [UserTenantsClient#getUserTenants(String)]: [Invalid token]


pcp1/mod-source-record-manager
15:02:50 [] [fs07000001] [9eb67301-6f6e-468f-9b1a-6134dc39a684] [] ERROR RestUtil             Response HTTP code is not equals 200, 201, 204. Response code: 500

mod-permission heap dump analysis

Problem Suspect

46 instances of io.vertx.pgclient.impl.codec.PgCodec, loaded by jdk.internal.loader.ClassLoaders$AppClassLoader @ 0xd5740ac8 occupy 860.04 MB (88.35%) bytes.

Biggest instances:

  • io.vertx.pgclient.impl.codec.PgCodec @ 0xd6e5da00 - 40.11 MB (4.12%) bytes.

  • io.vertx.pgclient.impl.codec.PgCodec @ 0xd6da0778 - 36.84 MB (3.78%) bytes.

  • io.vertx.pgclient.impl.codec.PgCodec @ 0xd6e38210 - 32.91 MB (3.38%) bytes.

  • io.vertx.pgclient.impl.codec.PgCodec @ 0xd6f4e340 - 31.03 MB (3.19%) bytes.

  • io.vertx.pgclient.impl.codec.PgCodec @ 0xd6f4e638 - 30.90 MB (3.17%) bytes.

  • io.vertx.pgclient.impl.codec.PgCodec @ 0xd6f4e200 - 30.89 MB (3.17%) bytes.

  • io.vertx.pgclient.impl.codec.PgCodec @ 0xd6d757a0 - 29.14 MB (2.99%) bytes.

  • io.vertx.pgclient.impl.codec.PgCodec @ 0xd6f4e0c0 - 28.82 MB (2.96%) bytes.

  • io.vertx.pgclient.impl.codec.PgCodec @ 0xd6f53c78 - 28.82 MB (2.96%) bytes.

  • io.vertx.pgclient.impl.codec.PgCodec @ 0xd6f4e4f8 - 27.18 MB (2.79%) bytes.

  • io.vertx.pgclient.impl.codec.PgCodec @ 0xd6f54170 - 27.14 MB (2.79%) bytes.

  • io.vertx.pgclient.impl.codec.PgCodec @ 0xd66bcd90 - 25.69 MB (2.64%) bytes.

  • io.vertx.pgclient.impl.codec.PgCodec @ 0xd66bced0 - 25.44 MB (2.61%) bytes.

  • io.vertx.pgclient.impl.codec.PgCodec @ 0xd66bcbe0 - 25.43 MB (2.61%) bytes.

  • io.vertx.pgclient.impl.codec.PgCodec @ 0xd6da06f8 - 25.39 MB (2.61%) bytes.

  • io.vertx.pgclient.impl.codec.PgCodec @ 0xd6e38258 - 25.39 MB (2.61%) bytes.

  • io.vertx.pgclient.impl.codec.PgCodec @ 0xd66b3640 - 25.39 MB (2.61%) bytes.

  • io.vertx.pgclient.impl.codec.PgCodec @ 0xd66bcd10 - 25.39 MB (2.61%) bytes.

  • io.vertx.pgclient.impl.codec.PgCodec @ 0xd66bcd50 - 25.39 MB (2.61%) bytes.

  • io.vertx.pgclient.impl.codec.PgCodec @ 0xd66bce10 - 25.39 MB (2.61%) bytes.

  • io.vertx.pgclient.impl.codec.PgCodec @ 0xd66c3830 - 25.39 MB (2.61%) bytes.

  • io.vertx.pgclient.impl.codec.PgCodec @ 0xd6f4e840 - 25.05 MB (2.57%) bytes.

  • io.vertx.pgclient.impl.codec.PgCodec @ 0xd6f4e480 - 24.06 MB (2.47%) bytes.

  • io.vertx.pgclient.impl.codec.PgCodec @ 0xd66bce50 - 22.06 MB (2.27%) bytes.

  • io.vertx.pgclient.impl.codec.PgCodec @ 0xd66bcf50 - 21.79 MB (2.24%) bytes.

  • io.vertx.pgclient.impl.codec.PgCodec @ 0xd6f54668 - 21.35 MB (2.19%) bytes.

  • io.vertx.pgclient.impl.codec.PgCodec @ 0xd6f54b60 - 20.21 MB (2.08%) bytes.

  • io.vertx.pgclient.impl.codec.PgCodec @ 0xd6da0738 - 17.89 MB (1.84%) bytes.

  • io.vertx.pgclient.impl.codec.PgCodec @ 0xd6da0d10 - 17.89 MB (1.84%) bytes.

  • io.vertx.pgclient.impl.codec.PgCodec @ 0xd6da0678 - 14.16 MB (1.45%) bytes.

  • io.vertx.pgclient.impl.codec.PgCodec @ 0xd6da06b8 - 14.16 MB (1.45%) bytes.

  • io.vertx.pgclient.impl.codec.PgCodec @ 0xd66bcc50 - 10.93 MB (1.12%) bytes.

  • io.vertx.pgclient.impl.codec.PgCodec @ 0xd66bcc90 - 10.91 MB (1.12%) bytes.

  • io.vertx.pgclient.impl.codec.PgCodec @ 0xd66c38b0 - 10.63 MB (1.09%) bytes.

These instances are referenced from one instance of java.util.HashMap$Node[], loaded by <system class loader>, which occupies 5.02 KB (0.00%) bytes.

Keywords

  • io.vertx.pgclient.impl.codec.PgCodec

  • jdk.internal.loader.ClassLoaders$AppClassLoader @ 0xd5740ac8

  • java.util.HashMap$Node[]

Resource utilization table test set#1



Service CPU Utilization

Mod-inventory - 148% during DI on 3 tenants, data-import - 58%, 

Service Memory Utilization

Mod-inventory grew from 93% to 100%. Data-import - 52%.

DB CPU Utilization

The highest - 87% during DI on 3 tenants

DB Connections

DB connections for all tests were 250 in average and 345 during spike. Spike here shows when errors happened.

DB load

Top SQL-queries:

Here we observe that two queries are very slow. Should be investigated.

SELECT jsonb FROM fs07000001_mod_permissions.permissions

SELECT jsonb FROM fs07000002_mod_permissions.permissions





Resource utilization table test set#2

CPU

Memory

mod-inventory-b

171

mod-inventory-b

93.65

mod-di-converter-storage-b

80.99

mod-permissions-b

81.97

nginx-okapi

75.33

mod-source-record-storage-b

75.97

mod-source-record-storage-b

40.94

mod-source-record-manager-b

52.97

mod-source-record-manager-b

36.95

okapi-b

40.22

okapi-b

33.33

mod-di-converter-storage-b

36.31

mod-permissions-b

30.91

mod-data-import-b

35.57

mod-inventory-storage-b

22.84

mod-inventory-storage-b

17.25

mod-data-import-b

1.742

nginx-okapi

4.799

pub-okapi

0.38

pub-okapi

4.408

Service CPU Utilization

The highest mod-inventory - 171% during DI on 1 tenant, data-import - 135% in 3 tenants test, mod-permission spiked in 3 tenants test to 462%. 

Service Memory Utilization

Mod-inventory - 93%. Data-import - 36%.

DB CPU Utilization

The highest - 97% during all tests

DB Connections

DB connections before test started was 400, after for all tests were 500 in average and 681 during spike. Spikes are connected to spikes in CPU utilization for mod-permission.

DB load



Top SQL-queries:

All tests

3 tenants test



Appendix

Infrastructure

PTF -environment pcp1

  • 10 m6i.2xlarge EC2 instances located in US East (N. Virginia)us-east-1

  • 1 database  instances, writer

  • MSK tenant

    • 4 m5.2xlarge brokers in 2 zones

    • Apache Kafka version 2.8.0

    • EBS storage volume per broker 300 GiB

    • auto.create.topics.enable=true

    • log.retention.minutes=480

    • default.replication.factor=3

  • Kafka consolidated topics and file splitting features enabled on a non-ecs-enabled environment. 

For Test set run №2 mod-permissions parameters were changed to mod-permissions_version=6.4.0->6.3.2, Mem Hard Limit=1684->2384 , Mem Soft limit=1544->2244, Xmx=1024m->1500m

Module

Task Def. Revision

Module Version

Task Count

Mem Hard Limit

Mem Soft limit

CPU units

Xmx

MetaspaceSize

MaxMetaspaceSize

R/W split enabled