
data-import performance



The data-import process currently consists of several stages. The uploaded file is split into chunks; the records from each chunk are parsed, saved to storage as Source Records, and mapped to Instances; the Instances are saved to Inventory; and the corresponding instanceIds are set on the Source Records. Both the chunk size and the number of chunks processed simultaneously are configurable (the defaults are 50 and 10, respectively).
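The chunking stage described above can be sketched as follows. This is a minimal illustration, not the modules' actual implementation: the function names are hypothetical, process_chunk is a stand-in for the parse, save, and map steps, and a thread pool is only one assumed way to cap the number of chunks handled at once.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, List, Sequence

CHUNK_SIZE = 50     # default number of records per chunk (from the text above)
MAX_PARALLEL = 10   # default number of chunks processed simultaneously


def split_into_chunks(records: Sequence[str], size: int = CHUNK_SIZE) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` records."""
    for start in range(0, len(records), size):
        yield list(records[start:start + size])


def process_chunk(chunk: List[str]) -> int:
    """Hypothetical stand-in for parsing, saving and mapping one chunk.

    Returns the number of records handled.
    """
    return len(chunk)


def import_records(
    records: Sequence[str],
    size: int = CHUNK_SIZE,
    parallel: int = MAX_PARALLEL,
    handler: Callable[[List[str]], int] = process_chunk,
) -> int:
    """Process every chunk, with at most `parallel` chunks in flight at once."""
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        return sum(pool.map(handler, split_into_chunks(records, size)))
```

With the default chunk size, a file of 30,000 records is split into 600 chunks.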

The actual data import starts once the file is uploaded. Right now it can only be triggered by the so-called "secret" button, which runs a default job that imports MARC bibliographic records into SRS and creates the associated Inventory instances. The button calls the POST endpoint /data-import/uploadDefinitions/{uploadDefinitionId}/processFiles.
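For illustration, a request to this endpoint could be assembled as below. Only the path comes from this page; the Okapi base URL, the tenant and token values, the X-Okapi-Tenant and X-Okapi-Token headers, and the request body are assumptions and should be checked against the module's API documentation. The sketch builds the request without sending it.

```python
import json
import urllib.request


def build_process_files_request(
    okapi_url: str,
    tenant: str,
    token: str,
    upload_definition_id: str,
    payload: dict,
) -> urllib.request.Request:
    """Build (but do not send) the POST request that starts file processing.

    `payload` is supplied by the caller; its required shape is not described
    on this page, so nothing about it is assumed here.
    """
    url = (
        f"{okapi_url.rstrip('/')}/data-import/uploadDefinitions/"
        f"{upload_definition_id}/processFiles"
    )
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Okapi-Tenant": tenant,  # assumed Okapi tenant header
            "X-Okapi-Token": token,    # assumed Okapi auth header
        },
    )
```

Passing the returned object to urllib.request.urlopen would send the request.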

The import is considered finished when all chunks have either been processed successfully or been marked as ERROR, the appropriate JobExecution status is set, and the file is visible in the logs section of the UI.
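The chunk-level part of this condition can be expressed as a small check. It is a sketch only: the status name "COMPLETED" is an assumed label for a successfully processed chunk ("ERROR" is the only status named here), and the JobExecution status and UI visibility are not modelled.

```python
from typing import Iterable

# "ERROR" comes from the text above; "COMPLETED" is an assumed name
# for a chunk that was processed successfully.
FINAL_STATUSES = {"COMPLETED", "ERROR"}


def all_chunks_final(chunk_statuses: Iterable[str]) -> bool:
    """Return True when there is at least one chunk and every chunk is final."""
    statuses = list(chunk_statuses)
    return bool(statuses) and all(status in FINAL_STATUSES for status in statuses)
```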


Performance was measured locally on the folio-testing-backend Vagrant box, version 5.0.0-20190619.2334 (16 GB of allocated memory), with mod-source-record-manager and mod-data-import deployed additionally. Each module running in a Docker container was allocated 256 MB of JVM heap memory.

Files used for testing: msplit30000.mrc, which contains 30,000 raw MARC bibliographic records, and RecordsForSRS_20190322.json, which contains 28,306 MARC records in JSON format.

Results are shown below; default values are highlighted. On average, it takes about 25 sec per 1000 raw MARC records and about 22 sec per 1000 JSON records.
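As a rough cross-check, the average rates above can be projected onto the two test files. The helper name is hypothetical, and the projection assumes the per-1000 rate stays constant over the whole file.

```python
def estimated_seconds(record_count: int, seconds_per_1000: float) -> float:
    """Project total load time from an average per-1000-records rate."""
    return record_count / 1000 * seconds_per_1000


# Raw MARC file: 30,000 records at about 25 sec per 1000 records.
marc_seconds = estimated_seconds(30_000, 25)   # about 750 sec (12.5 min)

# JSON file: 28,306 records at about 22 sec per 1000 records.
json_seconds = estimated_seconds(28_306, 22)   # about 623 sec (10.4 min)
```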


Data-import performance was also tested on https://folio-snapshot-load.aws.indexdata.com using the same files, but only with the default chunk size and queue size parameters. It consistently takes 8 min to load each of the files, which works out to about 17 sec per 1000 records.
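The per-1000-records figure follows from the total load time and the record counts of the two test files. A minimal sketch, with a hypothetical helper name and the 8 min figure treated as exactly 480 sec:

```python
def seconds_per_1000(record_count: int, total_seconds: float) -> float:
    """Convert a total load time into an average rate per 1000 records."""
    return total_seconds / record_count * 1000


LOAD_SECONDS = 8 * 60  # 8 min per file, as reported above

marc_rate = seconds_per_1000(30_000, LOAD_SECONDS)   # about 16.0 sec per 1000
json_rate = seconds_per_1000(28_306, LOAD_SECONDS)   # about 17.0 sec per 1000
```

Under these assumptions, the rounded figure of 17 sec matches the JSON file, while the raw MARC file comes out slightly lower at 16 sec.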


