[MODUIMP-4] Bulk import of users needs performance improvements Created: 20/Jun/18  Updated: 19/Jan/21

Status: Open
Project: mod-user-import
Components: None
Affects versions: None
Fix versions: None

Type: Tech Debt Priority: P3
Reporter: Tod Olson Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: back-end, migration-load, qulto, sprint46, sprint47, sprint48
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original estimate: Not Specified

Attachments: Zip Archive users.zip     PNG File vagrant_top.png    
Issue links:
Blocks
blocks FOLIO-1470 Add mod-user-import to snapshot build Closed
is blocked by MODUSERS-85 Improve performance of getting users ... Closed
is blocked by MODPERMS-46 Creating empty permission set during ... Blocked
Defines
defines UXPROD-2873 Mod-user-import stabilization Open
Relates
relates to MODUSERS-3 Add bulk-loading functionality to mod... Closed
relates to UXPROD-850 Migration Tools Open
Sprint:

 Description   

We need performant bulk import of users for both migration and ongoing operations.

The current user import module submits users one-by-one, which is rather slow at a scale of, say, 90K users, especially when trying to test data loading at scale. SysOps foresees similar issues with loading data into other modules.

There are two general scenarios:

  1. At migration time, sites will need to do an initial bulk import from their existing systems, and new UUIDs will somehow need to be minted as part of the migration process.
  2. In ongoing operations, sites will do regular bulk updates from their campus identity management (IdM) infrastructure to refresh their campus users. In this case, existing UUIDs and some internal FOLIO fields will need to be preserved.

The migration scenario here seems like a special case of the bulk import to support operations.
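
The two scenarios can be sketched as one routine in which migration is simply the case where no existing record is found. This is a minimal illustration, not the module's actual logic: `merge_user`, `plan_import`, and `PRESERVED_FIELDS` are hypothetical names, and the ticket does not say which internal fields have to be preserved beyond the UUID.

```python
import uuid

# Hypothetical: the ticket only says "existing UUIDs and some internal
# FOLIO fields" must be preserved, so only "id" is listed here.
PRESERVED_FIELDS = ("id",)


def merge_user(incoming, existing=None, preserved=PRESERVED_FIELDS):
    """Return the record to store for one incoming user."""
    record = dict(incoming)
    if existing is None:
        # Migration case: no prior record, so mint a fresh UUID.
        record["id"] = str(uuid.uuid4())
        return record
    # Operations case: carry the preserved fields over from the stored record.
    for field in preserved:
        if field in existing:
            record[field] = existing[field]
    return record


def plan_import(incoming_users, existing_by_external_id):
    """Split a batch into records to create and records to update."""
    to_create, to_update = [], []
    for user in incoming_users:
        existing = existing_by_external_id.get(user["externalSystemId"])
        merged = merge_user(user, existing)
        if existing is None:
            to_create.append(merged)
        else:
            to_update.append(merged)
    return to_create, to_update
```

With an empty `existing_by_external_id` map, every record lands in `to_create`, which is the migration scenario.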

There will be a need for bulk imports across FOLIO modules for both migration and operations. Users seems like a good starting point for considering common bulk-import strategies or techniques that could inform best practices for the project as a whole.

This is related to MODUSERS-3 (Closed).



 Comments   
Comment by Cate Boerema (Inactive) [ 21/Jun/18 ]

Jakub Skoczen, István Bender, thoughts on how we can improve performance here? Is it time to add bulk loading functionality per MODUSERS-3 (Closed)? FYI Khalilah Gambrell, as well.

Comment by István Bender [ 25/Jun/18 ]

I can profile which part takes a long time, but my gut feeling is that calling the mod-users API twice per user (create/update user and adding an empty permission set) must be the bottleneck. Bulk user creation would definitely improve the performance of user import.

Comment by István Bender [ 28/Aug/18 ]

Tod Olson, could you send me a sample user import JSON file containing 90k users? I will do some profiling and performance tests.

Comment by Jon Miller [ 29/Aug/18 ]

I generated a sample file of 100,000 users and attached it to this issue. The file just contains generated data and a subset of the JSON attributes. I'm thinking this will get you going. I can create a more realistic file if needed.

Comment by István Bender [ 30/Aug/18 ]

Jon Miller, I generated a new one (using http://www.mockaroo.com/) because some attributes (patronGroup, addresses, type, dateOfBirth) were missing from your sample file. I cannot upload it here because the compressed size is 15 MB and Jira only allows uploads of up to 10 MB.

An example user in my sample JSON:

    {
        "active": true,
        "barcode": "19-971-1678",
        "externalSystemId": "8294b291-f227-4cb8-8635-762fadc8ce7f",
        "patronGroup": "graduate",
        "personal": {
            "addresses": [
                {
                    "addressLine1": "74 Fuller Point",
                    "addressTypeId": "Payment",
                    "city": "San Francisco",
                    "postalCode": "94132",
                    "primaryAddress": true,
                    "region": "California"
                }
            ],
            "dateOfBirth": "1968-04-23",
            "email": "aguppie0@vinaora.com",
            "firstName": "Aubrey",
            "lastName": "Guppie",
            "mobilePhone": "235(862)784-4412",
            "phone": "48(210)504-6148",
            "preferredContactTypeId": "mail"
        },
        "type": "patron",
        "username": "969402e4-cf2d-4255-be18-8673c58939f3"
    }

Feel free to use my schema: https://www.mockaroo.com/c238b770
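
Since the generated file was too large to attach, comparable test data can be produced locally. The sketch below emits users with the same fields as the example above; it is not the Mockaroo schema itself. The function names, the value pools, and the seeding approach are all assumptions made for illustration.

```python
import json
import random
import uuid

# Assumed value pools; only "graduate" and "Payment" appear in the sample.
PATRON_GROUPS = ["graduate", "undergrad", "faculty", "staff"]
FIRST_NAMES = ["Aubrey", "Dana", "Lee", "Morgan"]
LAST_NAMES = ["Guppie", "Smith", "Nagy", "Olson"]


def _random_uuid(rng):
    """Build a version-4 UUID from the seeded generator (reproducible)."""
    return str(uuid.UUID(int=rng.getrandbits(128), version=4))


def make_user(rng):
    """Build one synthetic user with the same fields as the sample record."""
    first = rng.choice(FIRST_NAMES)
    last = rng.choice(LAST_NAMES)
    return {
        "active": True,
        "barcode": "{:02d}-{:03d}-{:04d}".format(
            rng.randrange(100), rng.randrange(1000), rng.randrange(10000)),
        "externalSystemId": _random_uuid(rng),
        "patronGroup": rng.choice(PATRON_GROUPS),
        "personal": {
            "addresses": [{
                "addressLine1": "{} Fuller Point".format(rng.randint(1, 999)),
                "addressTypeId": "Payment",
                "city": "San Francisco",
                "postalCode": "94132",
                "primaryAddress": True,
                "region": "California",
            }],
            "dateOfBirth": "{}-{:02d}-{:02d}".format(
                rng.randint(1940, 2000), rng.randint(1, 12), rng.randint(1, 28)),
            "email": "{}.{}@example.org".format(first, last).lower(),
            "firstName": first,
            "lastName": last,
            "preferredContactTypeId": "mail",
        },
        "type": "patron",
        "username": _random_uuid(rng),
    }


def generate_users(count, seed=0):
    """Return `count` synthetic users; the same seed gives the same data."""
    rng = random.Random(seed)
    return [make_user(rng) for _ in range(count)]


if __name__ == "__main__":
    # Writes a bare list of users; wrap it as the import endpoint requires.
    with open("users.json", "w", encoding="utf-8") as handle:
        json.dump(generate_users(1000), handle, indent=4)
```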

Comment by Jon Miller [ 30/Aug/18 ]

István Bender Thanks

Comment by Cate Boerema (Inactive) [ 05/Sep/18 ]

Hi István Bender. Any test results yet?

Comment by István Bender [ 05/Sep/18 ]

Zoltan Erdos will do it in the current sprint. He will also estimate how much effort it would take to integrate mod-user-import into mod-users. I hope we will be able to answer these questions by the end of this week.

Comment by Zoltan Erdos [ 11/Sep/18 ]

Test environment.

16GB DDR4 RAM, i7-7820HQ CPU, 240GB SSD

base vagrant box : folio/testing
Modules replaced/redeployed from local source code, with embedded Postgres:
mod-users, mod-permissions, mod-user-import, mod-users-bl

Peak RAM usage for these 4 modules (run from IntelliJ IDEA) was 2.5 GB when I imported 100,000 users at once. The JSON file with the users was 92 MB.

Importing time test results:

5k insert: 1.0 min, 0.78 min, 0.85 min, 0.72 min
5k update: 0.5 min, 0.47 min, 0.46 min, 0.47 min, 0.5 min
100k insert: 24.5 min, 25 min

40% of the import time is spent in mod-users, getting users by external ID.
20% is spent on POSTing or PUTting the users into mod-users.
The remaining 40% is spent POSTing permissions info to the mod-permissions module.
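
To make these figures easier to compare, the arithmetic below converts the measured times into records per second and splits a run into the three phases using the 40/20/40 shares. It is a worked calculation on the numbers reported here, not a new measurement; the function and variable names are illustrative.

```python
# Measured import times in minutes, copied from the results above.
RUNS = {
    "5k insert": (5_000, [1.0, 0.78, 0.85, 0.72]),
    "5k update": (5_000, [0.5, 0.47, 0.46, 0.47, 0.5]),
    "100k insert": (100_000, [24.5, 25.0]),
}

# Reported share of total import time per phase.
PHASE_SHARES = {
    "GET users by external ID": 0.40,
    "POST/PUT users": 0.20,
    "POST permissions": 0.40,
}


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def records_per_second(records, minutes):
    """Throughput for one run."""
    return records / (minutes * 60.0)


def summarize(runs=RUNS):
    """Map each run label to throughput based on its mean duration."""
    return {
        label: records_per_second(count, mean(times))
        for label, (count, times) in runs.items()
    }


def phase_minutes(total_minutes, shares=PHASE_SHARES):
    """Split a total duration into per-phase minutes."""
    return {phase: total_minutes * share for phase, share in shares.items()}


if __name__ == "__main__":
    for label, rate in summarize().items():
        print("{}: {:.1f} records/s".format(label, rate))
    for phase, minutes in phase_minutes(mean(RUNS["100k insert"][1])).items():
        print("{}: {:.2f} min".format(phase, minutes))
```

On these figures, the two 100k insert runs average out to roughly 67 records per second.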

It is important to temporarily add more RAM for mod-user-import, mod-users, and mod-permissions when importing a large number of users (more than 10k).

The slowest parts of the user import are:
Querying mod-users by external ID (bulk GET with 10-30 users at once).
Inserting permissions into mod-permissions.
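
The batched lookup can be illustrated as follows: external IDs are split into chunks and each chunk becomes one query. This is a sketch under assumptions. The CQL form `externalSystemId==("a" or "b")` is one plausible way to express such a lookup and is not confirmed by this ticket as the exact query the module sends; the helper names are hypothetical.

```python
def chunks(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def quote_cql(term):
    """Quote a term for CQL, escaping backslashes and double quotes."""
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    return '"{}"'.format(escaped)


def external_id_query(ids):
    """Build one query matching any of the given external IDs (assumed form)."""
    return "externalSystemId==({})".format(
        " or ".join(quote_cql(i) for i in ids))


def lookup_queries(external_ids, batch_size=30):
    """Yield one query string per batch of external IDs."""
    for batch in chunks(list(external_ids), batch_size):
        yield external_id_query(batch)
```

At 30 IDs per request, 100,000 users need 3,334 lookup requests, so the cost of each lookup query weighs heavily on the total import time.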

Recommendation for mod-users developers:
User import would be faster if the users table had an index on the externalSystemId field.

Recommendation for mod-permissions developers:
The postPermsUsers implementation needs a review, because POSTing a permissionUser takes longer than we expected: about 10 times longer than POSTing a new user into mod-users (30-50 ms vs 3-5 ms).

Comment by István Bender [ 28/Sep/18 ]

We performed a new bulk import test in the following environment:

16GB DDR4 RAM, i7-7820HQ CPU, 240GB SSD

  • base vagrant box : folio/testing version: 5.0.0-20180925.1100
  • mod_users, version: 15.3.0
  • mod_user_import, version: 3.1.1
  • mod_permissions, version: 5.4.0

All modules are running in the Vagrant box; none of them is hosted locally.

Experiences:

  • The Vagrant box and its services start much more slowly than usual
  • User import performance is much lower than in our last measurement. Importing 1,000 users took more than 5 minutes!
  • We saw many postgres processes running at the same time in the box, consuming a huge amount of CPU (see attached image)

We didn't have time to dig deeper and identify the root cause of the poor performance, but something may have changed for the worse since our last benchmark. I cannot say that the slowdown is caused by the mod-users or mod-permissions modules. The cause may be something completely different.

Jakub Skoczen, what should we do now? What do you suggest, considering that we don't have much time left before our contract ends?

Comment by patty.wanninger [ 24/Feb/20 ]

Cate Boerema, Ian and I are reviewing this ticket - would this be superseded by MODUSERS-3?

Still looking for a benchmark of 70 records per second.

Generated at Thu Feb 08 23:12:16 UTC 2024 using Jira 1001.0.0-SNAPSHOT#100246-sha1:7a5c50119eb0633d306e14180817ddef5e80c75d.