Data Import Observations for Improvements
This page collects observations and issues with the current Data Import implementation in Iris (as of Hotfix 1).
Stability
- Huge update jobs often get stuck at 99%. mod-inventory has been observed to crash several times with out-of-memory (OOM) errors while such a job is running.
- Could be due to mod-inventory's memory usage? (One experiment is bounding how much its Kafka consumers pull at once; see the sketch after this list.)
- Could be due to under-provisioned Kafka brokers?
- If a module (mod-srm, mod-srs, or mod-inventory) is restarted for any reason while an import is running, the job may not finish with all records created as expected.
- Jobs get stuck intermittently and require restarting all DI modules; the cause is unknown.
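If the OOM crashes come from the volume of in-flight Kafka messages rather than from a leak, one experiment worth trying is bounding how much data each consumer pulls per poll. Below is a minimal sketch using the standard Apache Kafka consumer configs; the bootstrap server, group id, topic name, and limits are placeholders (not the actual values mod-inventory uses) and the right numbers would have to be found by load testing.

```java
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class BoundedDiConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");  // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "mod-inventory-di");     // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
            "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
            "org.apache.kafka.common.serialization.StringDeserializer");

        // Cap the records returned per poll() so one huge job cannot flood the
        // consumer with more payloads than the heap can hold at once.
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "50");
        // Cap bytes fetched per partition and per request for the same reason.
        props.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, String.valueOf(1024 * 1024));
        props.put(ConsumerConfig.FETCH_MAX_BYTES_CONFIG, String.valueOf(8 * 1024 * 1024));

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("DI_PLACEHOLDER_TOPIC")); // placeholder topic name
            // ... normal poll loop elided ...
        }
    }
}
```

Smaller polls trade throughput for a bounded heap footprint, so this would likely make jobs slower; it is a diagnostic lever for confirming the hypothesis, not a fix.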
Scalability/Performance
- Currently a single CREATE import job of 50K records runs reliably and takes around 2 hours. Multiple CREATE jobs can be run, but depending on when the second (or n-th) job starts, the first job slows down and the later jobs take a very long time to start and finish. This is caused by the overwhelming number of messages the first job has queued up in the Kafka topics.
- Consequently, running DI jobs concurrently in a multi-tenant cluster is almost impossible when each job contains tens of thousands of records.
- To date, no 100K-record CREATE import has completed successfully; either more or fewer records than anticipated were created.
- Current hardware resources cannot accommodate both DI and circulation load at the same time.
- During CREATE imports, CPU utilization is around 600% for mod-srm, 400% for mod-srs, and over 250% for mod-inventory.
- During UPDATE imports, CPU utilization is around 500% for mod-srm, 400% for mod-srs, and around 700% for mod-inventory for a long duration.
- As a result, circulation response times increase by about one third.
- The polling mechanism that reports job statuses on the DI landing page is slow.
- Polling executes a slow query on the DB side; the query needs to be improved.
- Polling may not be necessary at all if we move to a push model using WebSockets (see the sketch after this list).
- Can compression be done on the producer/consumer side or on the broker side, and which is more efficient? (A producer-side configuration sketch follows this list.)
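On the push idea: a minimal sketch of what a WebSocket push endpoint could look like, assuming Vert.x (which FOLIO backend modules already use). The endpoint path, event-bus address, and payload shape are all assumptions for illustration; the real change would be publishing a progress event wherever job progress is persisted today, instead of having the UI poll.

```java
import io.vertx.core.Vertx;
import io.vertx.core.json.JsonObject;

public class DiStatusPushServer {
    public static void main(String[] args) {
        Vertx vertx = Vertx.vertx();

        vertx.createHttpServer()
            .webSocketHandler(ws -> {
                // Hypothetical path for the DI landing page to connect to.
                if (!"/data-import/job-updates".equals(ws.path())) {
                    ws.close((short) 1008, "unknown path");
                    return;
                }
                // Forward every progress event published on the event bus to
                // this client; stop forwarding when the client disconnects.
                var consumer = vertx.eventBus().<JsonObject>consumer(
                    "di.job.progress", msg -> ws.writeTextMessage(msg.body().encode()));
                ws.closeHandler(v -> consumer.unregister());
            })
            .listen(8081);

        // Wherever job progress is updated today, also publish an event so
        // connected UIs get it pushed instead of polling the database:
        vertx.eventBus().publish("di.job.progress", new JsonObject()
            .put("jobExecutionId", "<job-execution-id>")
            .put("progress", 42));
    }
}
```

This would remove both the polling traffic and the slow status query from the hot path, at the cost of managing connection state.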
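On the compression question, per Kafka's documented behavior: compression can be applied by the producer (the compression.type producer config); the broker's compression.type defaults to "producer", meaning it stores batches exactly as the producer compressed them, while setting it to a specific codec makes the broker recompress at extra broker CPU cost; consumers always decompress and have no compression setting of their own. Producer-side is therefore usually the most efficient choice: each batch is compressed once and stays compressed over the network and on disk. A sketch, with a placeholder topic name:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class CompressedDiProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
            "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
            "org.apache.kafka.common.serialization.StringSerializer");

        // Compress whole batches on the producer. zstd needs Kafka 2.1+;
        // snappy or lz4 are safe fallbacks on older clusters.
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "zstd");
        // Larger, slightly delayed batches compress much better.
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, String.valueOf(64 * 1024));
        props.put(ProducerConfig.LINGER_MS_CONFIG, "20");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>(
                "DI_PLACEHOLDER_TOPIC", "<record-id>", "<large MARC payload>"));
        }
    }
}
```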
Functionality
- Cleaning up old Kafka topics after upgrading the DI modules to a new FOLIO release is very time-consuming and requires a lot of manual effort. A migration script that cleans up old topics would greatly simplify upgrades (see the AdminClient sketch after this list).
- Some of the job profiles are very complex (Create, Update), and there is no easy way to copy an existing Data Import job profile from one FOLIO tenant to another, or from one FOLIO release to another. We have to create job profiles manually, either from the UI or from the backend (the database, which requires a lot of effort to align metadata). It would save a lot of time and effort if an existing job profile could be exported from tenant 1 and imported into tenant 2 (see the export/import sketch after this list).
- Stuck jobs do not give end users or DevOps enough information to troubleshoot; today the only option is to dig through logs in the hope of spotting the problem. There was a JIRA issue to end such jobs with an error status and surface the failure. Additionally, other relevant errors from the modules could be collected and displayed behind a "detail" link/button for technical users to examine.
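A sketch of what the topic-cleanup script could look like using Kafka's AdminClient. The topic-name prefix is an assumption: FOLIO DI topics encode environment, namespace, and tenant in their names, so the filter must match whatever naming scheme the deployment actually uses, and the selection should be verified with a dry run before anything is deleted.

```java
import java.util.Properties;
import java.util.Set;
import java.util.stream.Collectors;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;

public class OldDiTopicCleanup {
    // Keep true until the printed topic list has been verified by a human.
    private static final boolean DRY_RUN = true;

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            Set<String> allTopics = admin.listTopics().names().get();
            // Assumed naming scheme "<env>.<namespace>.<tenant>.DI_*"; adjust
            // the prefix to select only topics left over from the old release.
            Set<String> oldDiTopics = allTopics.stream()
                .filter(name -> name.startsWith("folio-old.Default.diku.DI_"))
                .collect(Collectors.toSet());

            System.out.println("Topics selected for deletion: " + oldDiTopics);
            if (!DRY_RUN && !oldDiTopics.isEmpty()) {
                admin.deleteTopics(oldDiTopics).all().get();
            }
        }
    }
}
```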
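And a sketch of the job-profile export/import idea over plain HTTP. The /data-import-profiles/jobProfiles path and the payload handling are assumptions; the X-Okapi-* headers are standard Okapi headers. A real tool would also have to copy the match/action/mapping profiles a job profile references and fix up ids and metadata, which is the hard part this sketch deliberately skips.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class JobProfileCopy {
    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();

        // Export a job profile from tenant1 (endpoint path is an assumption).
        HttpRequest export = HttpRequest.newBuilder()
            .uri(URI.create("https://okapi-src.example.org/data-import-profiles/jobProfiles/<profile-id>"))
            .header("X-Okapi-Tenant", "tenant1")
            .header("X-Okapi-Token", "<token1>")
            .GET()
            .build();
        String profileJson = http.send(export, HttpResponse.BodyHandlers.ofString()).body();

        // Import it into tenant2. In practice the payload's id and metadata
        // fields would need adjusting, and any referenced match/action/mapping
        // profiles copied first; this shows only the transport.
        HttpRequest doImport = HttpRequest.newBuilder()
            .uri(URI.create("https://okapi-dst.example.org/data-import-profiles/jobProfiles"))
            .header("X-Okapi-Tenant", "tenant2")
            .header("X-Okapi-Token", "<token2>")
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(profileJson))
            .build();
        System.out.println(http.send(doImport, HttpResponse.BodyHandlers.ofString()).statusCode());
    }
}
```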