[FOLIO-1034] Vagrant box becomes irresponsive while importing a large number of users Created: 17/Jan/18  Updated: 12/Nov/18  Resolved: 08/Mar/18

Status: Closed
Project: FOLIO
Components: None
Affects versions: None
Fix versions: None

Type: Bug Priority: P3
Reporter: Katalin Lovagné Szűcs Assignee: Katalin Lovagné Szűcs
Resolution: Done Votes: 0
Labels: sprint31, sprint32, sprint33
Remaining Estimate: Not Specified
Time Spent: 1 day, 30 minutes
Original estimate: Not Specified

Attachments: File run.sh     File user_import_test_data_with_1_users.json     File user_import_test_data_with_5000_users.json    
Issue links:
Blocks
blocks FOLIO-1024 Mod-users-Import module erroring with... Closed
is blocked by FOLIO-1112 Update FOLIO CI to make additional mo... Closed
Relates
relates to FOLIO-1037 User Import fails on repeat import of... Closed
relates to CIRC-86 Reuse HTTP client within circulation ... Closed
Sprint:

 Description   

The logs (Okapi, users module, auth module) do not contain any error messages.
The import module stops with the following error messages:
11:06:40 ERROR UserImportAPI Failed to update user with externalSystemId: b1b8af9c-1385-4e19-b77f-cecda3ca4729
11:06:40 ERROR UserImportAPI

{"endpoint":"/users/c6e24324-21a7-4bab-9fa4-3db25accbcfa","statusCode":500,"errorMessage":"Error getting expanded permissions for [\"users.collection.get\",\"addresstypes.collection.get\",\"usergroups.collection.get\",\"users.item.get\",\"users.item.post\",\"users.item.put\",\"perms.users.item.post\"] : Address not available: /10.0.2.15:9130"}

11:06:40 ERROR UserImportAPI Failed to update user with externalSystemId: 70b50626-c394-4840-ac44-aa5d239a9501
11:06:40 ERROR UserImportAPI

{"endpoint":"/users/651d756f-30d9-4049-9dc0-a34e054e52ef","statusCode":500,"errorMessage":"Unable to retrieve permissions for user with id'1ad737b0-d847-11e6-bf26-cec0c932ce01': Address not available: /10.0.2.15:9130"}

11:06:40 ERROR UserImportAPI Failed to update user with externalSystemId: 3f61dad2-cfed-4a4e-a210-8d26a76e1fdc
11:06:40 ERROR UserImportAPI

{"endpoint":"/users/75019021-5e8a-4672-a031-510b878e796d","statusCode":500,"errorMessage":"Error getting expanded permissions for [\"users.collection.get\",\"addresstypes.collection.get\",\"usergroups.collection.get\",\"users.item.get\",\"users.item.post\",\"users.item.put\",\"perms.users.item.post\"] : Address not available: /10.0.2.15:9130"}

11:06:40 ERROR UserImportAPI Failed to update user with externalSystemId: 7b08650a-1561-4351-90c8-da9bb1717286
11:06:40 ERROR UserImportAPI

{"endpoint":"/users/f0a44049-2c7e-4c9f-8357-34f78c770f3a","statusCode":500,"errorMessage":"Error getting expanded permissions for [\"users.collection.get\",\"addresstypes.collection.get\",\"usergroups.collection.get\",\"users.item.get\",\"users.item.post\",\"users.item.put\",\"perms.users.item.post\"] : Address not available: /10.0.2.15:9130"}

11:06:40 ERROR UserImportAPI Failed to update user with externalSystemId: 6c2cda17-52e9-46c0-8b7d-21ba018eb941
11:06:40 ERROR UserImportAPI

{"endpoint":"/users/1b0a83f1-84b2-4466-b99d-72231bfa7ac3","statusCode":500,"errorMessage":"Unable to retrieve permissions for user with id'1ad737b0-d847-11e6-bf26-cec0c932ce01': Address not available: /10.0.2.15:9130"}

The number of open connections on the Vagrant machine:
sudo lsof -i | wc -l = 181 after starting Vagrant machine
sudo lsof -i | wc -l = 16750 after the first request with 5000 users finished successfully.
sudo lsof -i | grep okapi | wc -l = 16604
sudo lsof -i | wc -l = 28480 when the second request fails.
sudo lsof -i | grep okapi | wc -l = 28334

The request data is:

  • $ {okapiUrl}

    /user-import

  • headers:
    • Content-Type: application/json
    • x-okapi-tenant
    • x-okapi-token
  • body: see the attached file

The user whose authentication token is used needs to have a permission for importing users: User import (user-import.all).



 Comments   
Comment by Adam Dickmeiss [ 17/Jan/18 ]

I'm using script run.sh to try this out. It takes a while. At this stage .. while posting .. I see now indication of socket leaks. I'll let it run for some time (still in progress). I am using Folio Ansible box v5.0.0-20180117.396

Comment by Adam Dickmeiss [ 17/Jan/18 ]

I tried to import 1 record (user_import_test_data_with_1_users.json). First time, I get a nice report back. 2nd time, it hangs. I need basic importing to work before getting thousands inserted/updated.. I'm using the run.sh script as attached. I'm pretty sure I'm doing something wrong here.

Comment by Adam Dickmeiss [ 17/Jan/18 ]

I recreated the Ansible box and imported 283 records OK (to ensure all : new). Then I tried with the 5000 record batch.. It got through after intensive work..

top - 16:16:50 up 8 min,  2 users,  load average: 4.50, 4.91, 2.88
Tasks: 163 total,   2 running, 161 sleeping,   0 stopped,   0 zombie
%Cpu(s): 68.5 us, 16.3 sy,  0.0 ni, 10.7 id,  0.0 wa,  0.0 hi,  4.5 si,  0.0 st
KiB Mem:   8195824 total,  7292780 used,   903044 free,    41740 buffers
KiB Swap:   473084 total,        0 used,   473084 free.   755144 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                                                                                                                                                                         
 1305 folio     20   0 4691860 1.099g  20880 S  45.9 14.1   3:35.83 java -javaagent:/opt/agent-bond/agent-bond.jar=jolokia{{host=0.0.0.0}},jmx_exporter{{9779:/opt/agent-bond/jmx_exporter_config.yml}} -cp . -jar /usr/verticles/mod-permissions-fat.jar                                                                           
 1804 folio     20   0 4698292 856752  21152 S   0.3 10.5   0:40.30 java -javaagent:/opt/agent-bond/agent-bond.jar=jolokia{{host=0.0.0.0}},jmx_exporter{{9779:/opt/agent-bond/jmx_exporter_config.yml}} -cp . -jar /usr/verticles/mod-users-fat.jar              

The socket count is 200, which is similar to what it is upon booting.. In other words: no socket leak seen.

Comment by Kurt Nordstrom [ 17/Jan/18 ]

I wonder if this is a possible case where a simple permissions cache in the authtoken module would start to make a lot of sense.

Comment by Katalin Lovagné Szűcs [ 18/Jan/18 ]

One thing to mention: while testing I ran the user-import module on my laptop in Eclipse and referenced to that instance in the discovery service. I tested it with the embedded module and the socket count seems to stop at about 200. So maybe the problem occurs only when deploying an external service.
Update: tried to import 1000 users after the successful import of 5000 users. I think something went wrong because at one time the logs just stopped and no response was returned to Postman. Also my laptop (with 8GB RAM) did not respond after the first request, I had to close every application (I gave my virtualbox a maximum of 4 GB memory).

Comment by Kurt Nordstrom [ 19/Jan/18 ]

I tried to import 1 record (user_import_test_data_with_1_users.json). First time, I get a nice report back. 2nd time, it hangs. I need basic importing to work before getting thousands inserted/updated.. I'm using the run.sh script as attached. I'm pretty sure I'm doing something wrong here.

Adam Dickmeiss I got this same behavior when I tried with my setup. Did you ever identify any cause for it?

Comment by Kurt Nordstrom [ 22/Jan/18 ]

I went ahead and made a branch of mod-authtoken that had rudimentary caching to avoid having to do repeat lookups of permissions for the same user id. With the cache turned off, my VM takes 22 minutes to load 5000 records. With the cache turned on, it completes in 7 minutes.

This suggests to me that caching is a worthy addition to the authtoken module, as permissions lookups are a bottleneck for a repetition-intensive task like import.

Sadly, it does not solve the issue with updating already imported records. I will continue to investigate that.

Comment by Adam Dickmeiss [ 22/Jan/18 ]

Sounds good. Should improve the response times by the system as a whole, then

Comment by Kurt Nordstrom [ 22/Jan/18 ]

I still haven't located the source of the users update issue. I consistently get a hung request when I attempt any re-import of existing users.

The last output that I get from user-import is the output of the request to query the existing user by externalId. Katalin Lovagné Szűcs, do you have any thoughts as to what happens differently when a user already exists as opposed to not existing at this particular juncture?

Comment by Katalin Lovagné Szűcs [ 29/Jan/18 ]

I realized that the users schema has changed in mod-users and I am using a separate one. Can I reuse the schema in mod-users but change the required fields? (I want externalSystemId to be required instead of id).

Comment by Katalin Lovagné Szűcs [ 29/Jan/18 ]

I fixed the user update problem so (after I manage to merge my PR) you can test with it. I just tried with the 5000 users and the vagrant machine has stopped itself (or rather paused). My pull request just in case: https://github.com/folio-org/mod-user-import/pull/19

Comment by Kurt Nordstrom [ 30/Jan/18 ]

I can confirm that the user update works now. Should probably update https://folio-org.atlassian.net/browse/FOLIO-1037, I think.

I am able to import/update 5000 users on my machine (running with jars) without issue. With my experimental caching for permissions activated, it takes about 7-8 minutes.

Would it be possible to get a larger file to test with? Say 20k users?

Comment by Katalin Lovagné Szűcs [ 31/Jan/18 ]

I have a larger data set with 20k users, I uploaded the JSON file to Google drive: https://drive.google.com/drive/folders/1m8nVqSTg-dMrltF7Az8Fc08Eqq7GA2EM?usp=sharing
I can generate even larger data sets if necessary.

Comment by Kurt Nordstrom [ 31/Jan/18 ]

The 20k file did manage to bork things up good and proper. Here's the stack dump:

14:51:27 ERROR utingContextImplBase Unexpected exception in route
java.lang.OutOfMemoryError: Java heap space
        at org.hibernate.validator.internal.metadata.raw.ConstrainedParameter.merge(ConstrainedParameter.java:150)
        at org.hibernate.validator.internal.metadata.raw.ConstrainedExecutable.merge(ConstrainedExecutable.java:285)
        at org.hibernate.validator.internal.metadata.aggregated.ExecutableMetaData$Builder.addToExecutablesByDeclaringType(ExecutableMetaData.java:373)
        at org.hibernate.validator.internal.metadata.aggregated.ExecutableMetaData$Builder.add(ExecutableMetaData.java:347)
        at org.hibernate.validator.internal.metadata.aggregated.ExecutableMetaData$Builder.<init>(ExecutableMetaData.java:317)
        at org.hibernate.validator.internal.metadata.aggregated.BeanMetaDataImpl$BuilderDelegate.<init>(BeanMetaDataImpl.java:615)
        at org.hibernate.validator.internal.metadata.aggregated.BeanMetaDataImpl$BeanMetaDataBuilder.addMetaDataToBuilder(BeanMetaDataImpl.java:559)
        at org.hibernate.validator.internal.metadata.aggregated.BeanMetaDataImpl$BeanMetaDataBuilder.add(BeanMetaDataImpl.java:546)
        at org.hibernate.validator.internal.metadata.BeanMetaDataManager.createBeanMetaData(BeanMetaDataManager.java:167)
        at org.hibernate.validator.internal.metadata.BeanMetaDataManager.getOrCreateBeanMetaData(BeanMetaDataManager.java:194)
        at org.hibernate.validator.internal.metadata.BeanMetaDataManager.getBeanMetaData(BeanMetaDataManager.java:142)
        at org.hibernate.validator.internal.engine.ValidatorImpl.validateCascadedConstraint(ValidatorImpl.java:783)
        at org.hibernate.validator.internal.engine.ValidatorImpl.validateCascadedConstraints(ValidatorImpl.java:664)
        at org.hibernate.validator.internal.engine.ValidatorImpl.validateInContext(ValidatorImpl.java:416)
        at org.hibernate.validator.internal.engine.ValidatorImpl.validateCascadedConstraint(ValidatorImpl.java:795)
        at org.hibernate.validator.internal.engine.ValidatorImpl.validateCascadedConstraints(ValidatorImpl.java:681)
        at org.hibernate.validator.internal.engine.ValidatorImpl.validateInContext(ValidatorImpl.java:416)
        at org.hibernate.validator.internal.engine.ValidatorImpl.validateCascadedConstraint(ValidatorImpl.java:795)
        at org.hibernate.validator.internal.engine.ValidatorImpl.validateCascadedConstraints(ValidatorImpl.java:681)
        at org.hibernate.validator.internal.engine.ValidatorImpl.validateInContext(ValidatorImpl.java:416)
        at org.hibernate.validator.internal.engine.ValidatorImpl.validate(ValidatorImpl.java:205)
        at org.folio.rest.RestVerticle.isValidRequest(RestVerticle.java:1454)
        at org.folio.rest.RestVerticle.lambda$11(RestVerticle.java:1275)
        at org.folio.rest.RestVerticle$$Lambda$86/1086721990.accept(Unknown Source)
        at java.util.Iterator.forEachRemaining(Iterator.java:116)
        at org.folio.rest.RestVerticle.parseParams(RestVerticle.java:1223)
        at org.folio.rest.RestVerticle.route(RestVerticle.java:428)
        at org.folio.rest.RestVerticle.lambda$12(RestVerticle.java:253)
        at org.folio.rest.RestVerticle$$Lambda$58/1650886178.handle(Unknown Source)
        at io.vertx.ext.web.impl.RouteImpl.handleContext(RouteImpl.java:217)
        at io.vertx.ext.web.impl.RoutingContextImplBase.iterateNext(RoutingContextImplBase.java:78)
        at io.vertx.ext.web.impl.RoutingContextImpl.next(RoutingContextImpl.java:118)
14:51:28 INFO  ProxyContext         633150/user-import RES 500 86342187us mod-user-import-2.0.1-SNAPSHOT http://localhost:9134/user-import
Comment by Kurt Nordstrom [ 31/Jan/18 ]

Without delving into the code, I wonder if the user-import module needs to be configured to only work on a chunk of records of some particular limit? Perhaps 1000 at a time? Is it trying to load all 20k into memory at once?

Comment by Katalin Lovagné Szűcs [ 31/Jan/18 ]

Yes, it is loaded to memory at once but the processing happens in smaller (10 for now) batches.

Comment by Kurt Nordstrom [ 31/Jan/18 ]

When the records are loaded to memory, what representation are they stored as? Is it just JSON? Or POJOs?

Comment by Katalin Lovagné Szűcs [ 31/Jan/18 ]

POJOs.

Comment by Kurt Nordstrom [ 31/Jan/18 ]

So it looks like whatever RMB is doing when it creates/validates the POJOs is heavy enough that we don't want to just load them all at once. Would it be difficult to restrict the number of POJOs loaded to a certain number?

Comment by István Bender [ 01/Mar/18 ]

I released a new version of mod-user-import: 3.0.0. It contains improvements since 2.0.0 and tested in Vagrant box. I haven't experienced any memory issue in this version. I also did profiling the code and no leak was experienced.
Note that there is no prebuilt vagrant box in the cloud which runs the latest docker container, I deployed it manually in my box. So please do the same or build a new Vagrant box before testing!

Generated at Thu Feb 08 23:10:20 UTC 2024 using Jira 1001.0.0-SNAPSHOT#100246-sha1:7a5c50119eb0633d306e14180817ddef5e80c75d.