[FOLIO-1034] Vagrant box becomes irresponsive while importing a large number of users Created: 17/Jan/18 Updated: 12/Nov/18 Resolved: 08/Mar/18 |
|
| Status: | Closed |
| Project: | FOLIO |
| Components: | None |
| Affects versions: | None |
| Fix versions: | None |
| Type: | Bug | Priority: | P3 |
| Reporter: | Katalin Lovagné Szűcs | Assignee: | Katalin Lovagné Szűcs |
| Resolution: | Done | Votes: | 0 |
| Labels: | sprint31, sprint32, sprint33 | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | 1 day, 30 minutes | ||
| Original estimate: | Not Specified | ||
| Attachments: |
|
||||||||||||||||||||||||
| Issue links: |
|
||||||||||||||||||||||||
| Sprint: | |||||||||||||||||||||||||
| Description |
|
The logs (Okapi, users module, auth module) do not contain any error messages. 11:06:40 ERROR UserImportAPI Failed to update user with externalSystemId: 70b50626-c394-4840-ac44-aa5d239a9501 11:06:40 ERROR UserImportAPI Failed to update user with externalSystemId: 3f61dad2-cfed-4a4e-a210-8d26a76e1fdc 11:06:40 ERROR UserImportAPI Failed to update user with externalSystemId: 7b08650a-1561-4351-90c8-da9bb1717286 11:06:40 ERROR UserImportAPI Failed to update user with externalSystemId: 6c2cda17-52e9-46c0-8b7d-21ba018eb941 The number of open connections on the Vagrant machine: The request data is:
The user whose authentication token is used needs to have a permission for importing users: User import (user-import.all). |
| Comments |
| Comment by Adam Dickmeiss [ 17/Jan/18 ] |
|
I'm using script run.sh to try this out. It takes a while. At this stage .. while posting .. I see now indication of socket leaks. I'll let it run for some time (still in progress). I am using Folio Ansible box v5.0.0-20180117.396 |
| Comment by Adam Dickmeiss [ 17/Jan/18 ] |
|
I tried to import 1 record (user_import_test_data_with_1_users.json). First time, I get a nice report back. 2nd time, it hangs. I need basic importing to work before getting thousands inserted/updated.. I'm using the run.sh script as attached. I'm pretty sure I'm doing something wrong here. |
| Comment by Adam Dickmeiss [ 17/Jan/18 ] |
|
I recreated the Ansible box and imported 283 records OK (to ensure all : new). Then I tried with the 5000 record batch.. It got through after intensive work..
top - 16:16:50 up 8 min, 2 users, load average: 4.50, 4.91, 2.88
Tasks: 163 total, 2 running, 161 sleeping, 0 stopped, 0 zombie
%Cpu(s): 68.5 us, 16.3 sy, 0.0 ni, 10.7 id, 0.0 wa, 0.0 hi, 4.5 si, 0.0 st
KiB Mem: 8195824 total, 7292780 used, 903044 free, 41740 buffers
KiB Swap: 473084 total, 0 used, 473084 free. 755144 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1305 folio 20 0 4691860 1.099g 20880 S 45.9 14.1 3:35.83 java -javaagent:/opt/agent-bond/agent-bond.jar=jolokia{{host=0.0.0.0}},jmx_exporter{{9779:/opt/agent-bond/jmx_exporter_config.yml}} -cp . -jar /usr/verticles/mod-permissions-fat.jar
1804 folio 20 0 4698292 856752 21152 S 0.3 10.5 0:40.30 java -javaagent:/opt/agent-bond/agent-bond.jar=jolokia{{host=0.0.0.0}},jmx_exporter{{9779:/opt/agent-bond/jmx_exporter_config.yml}} -cp . -jar /usr/verticles/mod-users-fat.jar
The socket count is 200, which is similar to what it is upon booting.. In other words: no socket leak seen. |
| Comment by Kurt Nordstrom [ 17/Jan/18 ] |
|
I wonder if this is a possible case where a simple permissions cache in the authtoken module would start to make a lot of sense. |
| Comment by Katalin Lovagné Szűcs [ 18/Jan/18 ] |
|
One thing to mention: while testing I ran the user-import module on my laptop in Eclipse and referenced to that instance in the discovery service. I tested it with the embedded module and the socket count seems to stop at about 200. So maybe the problem occurs only when deploying an external service. |
| Comment by Kurt Nordstrom [ 19/Jan/18 ] |
Adam Dickmeiss I got this same behavior when I tried with my setup. Did you ever identify any cause for it? |
| Comment by Kurt Nordstrom [ 22/Jan/18 ] |
|
I went ahead and made a branch of mod-authtoken that had rudimentary caching to avoid having to do repeat lookups of permissions for the same user id. With the cache turned off, my VM takes 22 minutes to load 5000 records. With the cache turned on, it completes in 7 minutes. This suggests to me that caching is a worthy addition to the authtoken module, as permissions lookups are a bottleneck for a repetition-intensive task like import. Sadly, it does not solve the issue with updating already imported records. I will continue to investigate that. |
| Comment by Adam Dickmeiss [ 22/Jan/18 ] |
|
Sounds good. Should improve the response times by the system as a whole, then |
| Comment by Kurt Nordstrom [ 22/Jan/18 ] |
|
I still haven't located the source of the users update issue. I consistently get a hung request when I attempt any re-import of existing users. The last output that I get from user-import is the output of the request to query the existing user by externalId. Katalin Lovagné Szűcs, do you have any thoughts as to what happens differently when a user already exists as opposed to not existing at this particular juncture? |
| Comment by Katalin Lovagné Szűcs [ 29/Jan/18 ] |
|
I realized that the users schema has changed in mod-users and I am using a separate one. Can I reuse the schema in mod-users but change the required fields? (I want externalSystemId to be required instead of id). |
| Comment by Katalin Lovagné Szűcs [ 29/Jan/18 ] |
|
I fixed the user update problem so (after I manage to merge my PR) you can test with it. I just tried with the 5000 users and the vagrant machine has stopped itself (or rather paused). My pull request just in case: https://github.com/folio-org/mod-user-import/pull/19 |
| Comment by Kurt Nordstrom [ 30/Jan/18 ] |
|
I can confirm that the user update works now. Should probably update https://folio-org.atlassian.net/browse/FOLIO-1037, I think. I am able to import/update 5000 users on my machine (running with jars) without issue. With my experimental caching for permissions activated, it takes about 7-8 minutes. Would it be possible to get a larger file to test with? Say 20k users? |
| Comment by Katalin Lovagné Szűcs [ 31/Jan/18 ] |
|
I have a larger data set with 20k users, I uploaded the JSON file to Google drive: https://drive.google.com/drive/folders/1m8nVqSTg-dMrltF7Az8Fc08Eqq7GA2EM?usp=sharing |
| Comment by Kurt Nordstrom [ 31/Jan/18 ] |
|
The 20k file did manage to bork things up good and proper.
14:51:27 ERROR utingContextImplBase Unexpected exception in route
java.lang.OutOfMemoryError: Java heap space
at org.hibernate.validator.internal.metadata.raw.ConstrainedParameter.merge(ConstrainedParameter.java:150)
at org.hibernate.validator.internal.metadata.raw.ConstrainedExecutable.merge(ConstrainedExecutable.java:285)
at org.hibernate.validator.internal.metadata.aggregated.ExecutableMetaData$Builder.addToExecutablesByDeclaringType(ExecutableMetaData.java:373)
at org.hibernate.validator.internal.metadata.aggregated.ExecutableMetaData$Builder.add(ExecutableMetaData.java:347)
at org.hibernate.validator.internal.metadata.aggregated.ExecutableMetaData$Builder.<init>(ExecutableMetaData.java:317)
at org.hibernate.validator.internal.metadata.aggregated.BeanMetaDataImpl$BuilderDelegate.<init>(BeanMetaDataImpl.java:615)
at org.hibernate.validator.internal.metadata.aggregated.BeanMetaDataImpl$BeanMetaDataBuilder.addMetaDataToBuilder(BeanMetaDataImpl.java:559)
at org.hibernate.validator.internal.metadata.aggregated.BeanMetaDataImpl$BeanMetaDataBuilder.add(BeanMetaDataImpl.java:546)
at org.hibernate.validator.internal.metadata.BeanMetaDataManager.createBeanMetaData(BeanMetaDataManager.java:167)
at org.hibernate.validator.internal.metadata.BeanMetaDataManager.getOrCreateBeanMetaData(BeanMetaDataManager.java:194)
at org.hibernate.validator.internal.metadata.BeanMetaDataManager.getBeanMetaData(BeanMetaDataManager.java:142)
at org.hibernate.validator.internal.engine.ValidatorImpl.validateCascadedConstraint(ValidatorImpl.java:783)
at org.hibernate.validator.internal.engine.ValidatorImpl.validateCascadedConstraints(ValidatorImpl.java:664)
at org.hibernate.validator.internal.engine.ValidatorImpl.validateInContext(ValidatorImpl.java:416)
at org.hibernate.validator.internal.engine.ValidatorImpl.validateCascadedConstraint(ValidatorImpl.java:795)
at org.hibernate.validator.internal.engine.ValidatorImpl.validateCascadedConstraints(ValidatorImpl.java:681)
at org.hibernate.validator.internal.engine.ValidatorImpl.validateInContext(ValidatorImpl.java:416)
at org.hibernate.validator.internal.engine.ValidatorImpl.validateCascadedConstraint(ValidatorImpl.java:795)
at org.hibernate.validator.internal.engine.ValidatorImpl.validateCascadedConstraints(ValidatorImpl.java:681)
at org.hibernate.validator.internal.engine.ValidatorImpl.validateInContext(ValidatorImpl.java:416)
at org.hibernate.validator.internal.engine.ValidatorImpl.validate(ValidatorImpl.java:205)
at org.folio.rest.RestVerticle.isValidRequest(RestVerticle.java:1454)
at org.folio.rest.RestVerticle.lambda$11(RestVerticle.java:1275)
at org.folio.rest.RestVerticle$$Lambda$86/1086721990.accept(Unknown Source)
at java.util.Iterator.forEachRemaining(Iterator.java:116)
at org.folio.rest.RestVerticle.parseParams(RestVerticle.java:1223)
at org.folio.rest.RestVerticle.route(RestVerticle.java:428)
at org.folio.rest.RestVerticle.lambda$12(RestVerticle.java:253)
at org.folio.rest.RestVerticle$$Lambda$58/1650886178.handle(Unknown Source)
at io.vertx.ext.web.impl.RouteImpl.handleContext(RouteImpl.java:217)
at io.vertx.ext.web.impl.RoutingContextImplBase.iterateNext(RoutingContextImplBase.java:78)
at io.vertx.ext.web.impl.RoutingContextImpl.next(RoutingContextImpl.java:118)
14:51:28 INFO ProxyContext 633150/user-import RES 500 86342187us mod-user-import-2.0.1-SNAPSHOT http://localhost:9134/user-import
|
| Comment by Kurt Nordstrom [ 31/Jan/18 ] |
|
Without delving into the code, I wonder if the user-import module needs to be configured to only work on a chunk of records of some particular limit? Perhaps 1000 at a time? Is it trying to load all 20k into memory at once? |
| Comment by Katalin Lovagné Szűcs [ 31/Jan/18 ] |
|
Yes, it is loaded to memory at once but the processing happens in smaller (10 for now) batches. |
| Comment by Kurt Nordstrom [ 31/Jan/18 ] |
|
When the records are loaded to memory, what representation are they stored as? Is it just JSON? Or POJOs? |
| Comment by Katalin Lovagné Szűcs [ 31/Jan/18 ] |
|
POJOs. |
| Comment by Kurt Nordstrom [ 31/Jan/18 ] |
|
So it looks like whatever RMB is doing when it creates/validates the POJOs is heavy enough that we don't want to just load them all at once. Would it be difficult to restrict the number of POJOs loaded to a certain number? |
| Comment by István Bender [ 01/Mar/18 ] |
|
I released a new version of mod-user-import: 3.0.0. It contains improvements since 2.0.0 and tested in Vagrant box. I haven't experienced any memory issue in this version. I also did profiling the code and no leak was experienced. |