
Jira: MODDATAIMP-124

The data-import process currently consists of several stages. The uploaded file is split into chunks; records from each chunk are parsed, saved to storage as Source Records, mapped to Instances, and saved to Inventory; the corresponding instanceIds are then set on the Source Records. The chunk size and the number of chunks processed simultaneously are configurable (the defaults are 50 and 10, respectively).
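
The chunking and bounded-concurrency behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the actual mod-data-import code: the function names (`split_into_chunks`, `process_chunk`, `run_import`) are hypothetical, the per-record work is stubbed out, and a thread pool stands in for whatever mechanism the modules really use to limit the number of chunks in flight.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterator, List, Sequence

CHUNK_SIZE = 50   # default chunk size from the text
QUEUE_SIZE = 10   # default number of chunks processed simultaneously


def split_into_chunks(records: Sequence[str], chunk_size: int = CHUNK_SIZE) -> Iterator[List[str]]:
    """Yield consecutive slices of at most chunk_size records."""
    for start in range(0, len(records), chunk_size):
        yield list(records[start:start + chunk_size])


def process_chunk(chunk: List[str]) -> int:
    """Hypothetical stub for: parse, save Source Records, map to Instances, save to Inventory.

    Returns the number of records handled; a real implementation would call the modules.
    """
    return len(chunk)


def run_import(records: Sequence[str], chunk_size: int = CHUNK_SIZE, queue_size: int = QUEUE_SIZE) -> int:
    """Process all chunks with at most queue_size chunks in flight; return records handled."""
    with ThreadPoolExecutor(max_workers=queue_size) as pool:
        return sum(pool.map(process_chunk, split_into_chunks(records, chunk_size)))
```

With the defaults, a 30,000-record file yields 600 chunks, of which at most 10 are in progress at any time.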

...

The import is considered finished when all chunks have either been processed successfully or been marked as ERROR, the appropriate JobExecution status is set, and the file is visible in the logs section of the UI.


Files used for testing:

msplit30000.mrc

RecordsForSRS_20190322.json

msplit30000.mrc contains 30,000 raw MARC bibliographic records; RecordsForSRS_20190322.json contains 28,306 MARC records in JSON format.


...

Data-import performance was measured on https://folio-snapshot-load.aws.indexdata.com using the same files, with only the default chunk size and queue size parameters (50 and 10, respectively).

Environment characteristics: AWS t2.xlarge instance (4 CPUs, 16 GB RAM)

It consistently takes about 8 min to load each of the files, which comes to about 17 sec per 1000 records.
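
As a quick check of the per-1000 figure, the arithmetic can be reproduced as below. The 8 min total is the rounded value quoted above, so the results are approximate; `seconds_per_1000` is a helper name introduced here for illustration.

```python
def seconds_per_1000(total_minutes: float, record_count: int) -> float:
    """Convert a total load time into seconds per 1000 records."""
    return total_minutes * 60 / record_count * 1000


# Record counts are those of the two test files; 8 min is the rounded load time.
print(round(seconds_per_1000(8, 30_000), 1))   # msplit30000.mrc
print(round(seconds_per_1000(8, 28_306), 1))   # RecordsForSRS_20190322.json
```

Both values fall between 16 and 17 sec, which is consistent with the figure of about 17 sec per 1000 records.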

...

Data-import performance was also tested locally on the folio-testing-backend Vagrant box, version 5.0.0-20190619.2334 (2 CPUs, 16 GB RAM), with mod-source-record-manager and mod-data-import deployed additionally. Each module ran in a Docker container with 256 MB of JVM heap memory allocated.

Results are shown below; default values are highlighted. On average, it takes about 25 sec per 1000 raw MARC records and about 22 sec per 1000 JSON records.
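
To put the per-1000 averages in terms of whole files, they can be scaled up to the two test files. This is back-of-the-envelope arithmetic from the rounded averages above, not a measured result; `estimated_minutes` is a name introduced here for illustration.

```python
def estimated_minutes(seconds_per_1000: float, record_count: int) -> float:
    """Estimate total import time in minutes from an average per-1000 rate."""
    return seconds_per_1000 * record_count / 1000 / 60


raw_marc = estimated_minutes(25, 30_000)   # msplit30000.mrc
json_marc = estimated_minutes(22, 28_306)  # RecordsForSRS_20190322.json
print(f"raw MARC: ~{raw_marc:.1f} min, JSON: ~{json_marc:.1f} min")
```

That is roughly 12.5 min and 10.4 min locally, against about 8 min per file on the AWS environment.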

[Image: performance test results]

The results are better in the AWS environment because it has more computing power (4 CPUs versus 2 on the local box).

...

Conclusions

Before the concept of batch operations was introduced, importing 30,000 raw MARC records took about 38-40 min in the local environment. Performing Steps 4, 8 and 13 in the diagram above as batch save/update operations drastically decreased the number of HTTP requests between the modules. Previously, it took more than 200 requests to process a chunk of 50 records (Steps 4, 8, 9 and 13 in the diagram), and waiting for the responses significantly slowed down the process. With Steps 4, 8 and 13 as batch operations, the same work now requires 53 HTTP calls.

It is fair to assume that implementing batch save to mod-inventory-storage (Step 9 in the diagram) will improve data-import performance a bit more (MODINVSTOR-291). However, the main performance improvements are expected from applying an event-driven approach (UXPROD-1806).
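
The effect of batching on request counts can be estimated for the whole 30,000-record file. The sketch below uses the per-chunk figures from the text (200 as a lower bound for the earlier behaviour, 53 now). The last estimate rests on an assumption not stated in the source: that the 53 calls consist of one batch call each for Steps 4, 8 and 13 plus one call per record for Step 9, so batching Step 9 would leave about 4 calls per chunk.

```python
import math

RECORDS = 30_000
CHUNK_SIZE = 50
chunks = math.ceil(RECORDS / CHUNK_SIZE)

calls_before = chunks * 200   # "more than 200" per chunk, so this is a lower bound
calls_now = chunks * 53       # with Steps 4, 8 and 13 batched

# ASSUMPTION: 53 = 3 batch calls + 50 per-record calls for Step 9.
calls_if_step9_batched = chunks * (53 - CHUNK_SIZE + 1)

print(chunks, calls_before, calls_now, calls_if_step9_batched)
```

Under these assumptions the file needs 600 chunks, at least 120,000 requests before batching, 31,800 now, and on the order of 2,400 if Step 9 were batched as well.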
