SPIKE: Multiple tenant DI testing - import jobs are hanging (MODSOURCE-581)

Multitenant configuration (2 tenants, 1 instance for every DI module, 1 partition for every DI topic):


Module / configuration
mod-data-import

mod-data-import-2.8.0-SNAPSHOT.260

name: DB_MAXPOOLSIZE value: '5'

name: JAVA_OPTIONS -XX:MaxRAMPercentage=66.0 -Djava.util.logging.config.file=vertx-default-jul-logging.properties

mod-source-record-manager

mod-source-record-manager-3.7.0-SNAPSHOT.762

name: DB_MAXPOOLSIZE value: '15'

name: JAVA_OPTIONS -XX:MaxRAMPercentage=66.0 -Djava.util.logging.config.file=vertx-default-jul-logging.properties

name: DB_RECONNECTINTERVAL value: '1000'

name: DB_RECONNECTATTEMPTS value: '3'

mod-source-record-storage

mod-source-record-storage-5.6.3-SNAPSHOT.608594f

name: DB_MAXPOOLSIZE value: '15'

name: JAVA_OPTIONS -XX:MaxRAMPercentage=66.0 -Djava.util.logging.config.file=vertx-default-jul-logging.properties

mod-inventory

mod-inventory-20.1.0-SNAPSHOT.607

name: DB_MAXPOOLSIZE value: '5'

name: JAVA_OPTIONS -XX:MaxRAMPercentage=85.0 -Dorg.folio.metadata.inventory.storage.type=okapi

mod-inventory-storage

mod-inventory-storage-26.0.0

name: DB_MAXPOOLSIZE value: '5'

name: JAVA_OPTIONS -XX:MaxRAMPercentage=66.0

mod-di-converter-storage

mod-di-converter-storage-2.1.0-SNAPSHOT.9

name: DB_MAXPOOLSIZE value: '5'

name: JAVA_OPTIONS -XX:MaxRAMPercentage=66.0 -Djava.util.logging.config.file=vertx-default-jul-logging.properties


With the default configuration, when importing 10k records in parallel, imports were sometimes terminated with timeout exceptions:

> SRM: 2023-04-13 12:00:00.944 [vert.x-worker-thread-11] ERROR PostgresClient Opening SQLConnection failed: Timeout
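
To gauge how often this error occurs during a run, a small script can count the matching lines in a module log. This is only a sketch: the function name `count_connection_timeouts` is illustrative, and the pattern assumes the log layout matches the sample line above.

```python
import re
from collections import Counter

# Matches lines shaped like the SRM sample above, with or without a
# leading "SRM: " style prefix. The layout is an assumption based on
# that single sample line.
TIMEOUT_RE = re.compile(
    r"^(?:\w+:\s*)?(?P<minute>\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2}\.\d{3} .*"
    r"ERROR\s+PostgresClient\s+Opening SQLConnection failed: Timeout"
)


def count_connection_timeouts(lines):
    """Return a Counter mapping 'YYYY-MM-DD HH:MM' to the number of timeout errors."""
    per_minute = Counter()
    for line in lines:
        match = TIMEOUT_RE.match(line.strip())
        if match:
            per_minute[match.group("minute")] += 1
    return per_minute
```

The function accepts any iterable of lines, so an open log file can be passed in directly. Grouping by minute makes it easier to line up bursts of timeouts with the start of parallel imports.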


During the investigation on Folijet-PefrRancher and on other environments such as Bugfest and PTF, we noticed that imports usually get stuck when DI does not have enough resources.

In single-tenant mode all imports are handled sequentially (OCLC imports can be slotted into the processing of a large file import, but the imports are still handled sequentially).

In multi-tenant mode, imports from different tenants run in parallel, which increases the demand for resources.

Imports stop getting stuck and the timeout exceptions disappear when the number of database connections is increased.

One of the major bottlenecks for parallel imports is the database, because the number of DB connections in use multiplies. This means that multitenant systems need more resources.
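
As a rough illustration of that multiplication, the sketch below estimates an upper bound on open DB connections from the DB_MAXPOOLSIZE values in the configuration listed above. It assumes that each module instance keeps a separate pool of up to DB_MAXPOOLSIZE connections per tenant. That assumption and the helper name `max_db_connections` are not taken from the measurements in this spike and should be checked against the RMB version in use.

```python
# DB_MAXPOOLSIZE values taken from the module configuration listed above.
POOL_SIZES = {
    "mod-data-import": 5,
    "mod-source-record-manager": 15,
    "mod-source-record-storage": 15,
    "mod-inventory": 5,
    "mod-inventory-storage": 5,
    "mod-di-converter-storage": 5,
}


def max_db_connections(pool_sizes, tenants, instances_per_module=1):
    """Upper bound on open connections.

    ASSUMPTION: one pool of up to DB_MAXPOOLSIZE connections
    per tenant per module instance.
    """
    return sum(pool_sizes.values()) * tenants * instances_per_module


single = max_db_connections(POOL_SIZES, tenants=1)
multi = max_db_connections(POOL_SIZES, tenants=2)
print(single, multi)  # prints "50 100"
```

Under this assumption, the two-tenant setup described here can open up to twice as many connections as a single-tenant one, so the database connection limit has to be sized accordingly.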



I found that RMB and Vert.x expose metrics about how the modules operate, such as the number of connections, the number of queries, and the number of requests.

This is described in RMB-655 and RANCHER-621 (how we can use the metrics via JMX and via Prometheus in Grafana).
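
For a quick look at such metrics without a dashboard, text in the Prometheus exposition format can be filtered with a few lines of Python. This is a minimal sketch: the metric names in `SAMPLE` are hypothetical placeholders, not the names RMB or Vert.x actually export, and the parser handles only plain `name{labels} value` lines without timestamps.

```python
# HYPOTHETICAL sample in Prometheus text format; real metric names
# exported by the modules will differ.
SAMPLE = """\
# HELP example_pool_in_use hypothetical metric name
example_pool_in_use{pool_type="sql"} 5
example_pool_queue_pending{pool_type="sql"} 12
example_http_requests_total 340
"""


def parse_prometheus_text(text):
    """Parse simple exposition lines into (name, labels, value) tuples."""
    samples = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The value is taken as the last space-separated field.
        series, _, value = line.rpartition(" ")
        name, _, labels = series.partition("{")
        samples.append((name, labels.rstrip("}"), float(value)))
    return samples


def select(samples, fragment):
    """Keep only samples whose metric name contains the given fragment."""
    return [s for s in samples if fragment in s[0]]


for name, labels, value in select(parse_prometheus_text(SAMPLE), "pool"):
    print(name, labels, value)
```

Filtering on a name fragment such as "pool" is enough to watch pool usage and queue length while parallel imports are running.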

I created a task to investigate in depth how DI works, its performance, and its use of resources: MODSOURMAN-980