[FOLIO-2276] Adjust memory settings source-record-storage, source-record-manager Created: 20/Sep/19  Updated: 03/Jun/20  Resolved: 27/Sep/19

Status: Closed
Project: FOLIO
Components: None
Affects versions: None
Fix versions: None

Type: Task Priority: P2
Reporter: Ian Hardy Assignee: Ian Hardy
Resolution: Done Votes: 0
Labels: ci, data-import, devops, platform-backlog
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original estimate: Not Specified

Issue links:
Relates
relates to UIDATIMP-281 Data Import breaks Closed
Sprint: CP: sprint 73, CP: sprint 72
Story Points: 2
Development Team: Core: Platform

 Description   

Kateryna Senchenko reported that source-record-manager was down on folio-snapshot-load. After restarting source-record-manager with a higher container memory limit, the same load caused an OOM kill on source-record-storage.

To remediate this, I propose increasing the container memory limit in the module descriptors for these two modules, while leaving the heap size of 256 unchanged. However, I want to be careful not to over-provision these after https://folio-org.atlassian.net/browse/FOLIO-2242 is completed. Open to other suggestions as well, David Crossley, Wayne Schneider.



 Comments   
Comment by Wayne Schneider [ 20/Sep/19 ]

Can we do some testing without the -Xmx setting in a container with a memory limit set? I'm curious to see if Java does a better job managing the burst under those conditions.

Comment by Ian Hardy [ 20/Sep/19 ]

Yes, I'll try that first.

Comment by Ian Hardy [ 20/Sep/19 ]

Built folio-snapshot-test with the -Xmx setting removed for source-record-storage and source-record-manager. In a test uploading 5 files of 500 MARC records each, the source-record-manager container still ends up getting OOMKilled.

JAVA_OPTIONS are: "JAVA_OPTIONS=-XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap"
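As a side note, -XX:+UseCGroupMemoryLimitForHeap was an experimental flag that was deprecated in JDK 10 and removed in JDK 11, where container awareness (-XX:+UseContainerSupport) is on by default. A sketch of roughly equivalent options, assuming a JDK 11+ base image (the 75.0 value is illustrative, not a recommendation):

```shell
# Sketch for JDK 11+ (assumed): the experimental cgroup flag is gone, container
# awareness is on by default, and MaxRAMPercentage bounds the heap as a share
# of the container memory limit.
JAVA_OPTIONS="-XX:MaxRAMPercentage=75.0"
```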

For some quick and dirty logging I ran:

while true
do
  docker stats --no-stream d8d2b604bc2c >> srm.txt
  sleep 1
done

and saw it hit the limit:

CONTAINER           CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
d8d2b604bc2c        364.21%             275.6MiB / 341.3MiB   80.75%              4.19MB / 585kB      0B / 0B             43
CONTAINER           CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
d8d2b604bc2c        103.85%             326.3MiB / 341.3MiB   95.60%              5.5MB / 6.2MB       0B / 0B             50
CONTAINER           CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
d8d2b604bc2c        2.24%               326.5MiB / 341.3MiB   95.65%              5.5MB / 6.2MB       0B / 0B             49
CONTAINER           CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
d8d2b604bc2c        451.66%             341.2MiB / 341.3MiB   99.95%              11.9MB / 6.65MB     0B / 0B             50
CONTAINER           CPU %               MEM USAGE / LIMIT   MEM %               NET I/O             BLOCK I/O           PIDS
d8d2b604bc2c        0.00%               0B / 0B             0.00%               0B / 0B             0B / 0B             0

Will try leaving the Java options the same (no -Xmx) and bumping up the container limit.

Comment by Ian Hardy [ 20/Sep/19 ]

Looks like I can get through uploading 10 files of 500 MARC records each with the container memory limit set to 536870912 bytes (512 MiB) and -Xmx left out. Kateryna Senchenko, is this pretty close to a "typical" data import test? How many files/records were you using for testing on Friday?

Comment by Kateryna Senchenko [ 23/Sep/19 ]

Hi Ian Hardy,
10 files of 500 MARC records each looks good enough, although we're aiming at loading 30,000 records at once. On Friday it was failing when I tried to upload 4-5 files of 15 to 1,000 records, and it looked like what mattered was how many files were uploaded, not so much the size of each file.
Also, we identified an issue in mod-data-import (MODDATAIMP-192, Closed) that could result in increased memory consumption when we load multiple files. I'd like to test it more (loading one big file vs. multiple small files) to confirm that as soon as data import is up on snapshot-load.

Comment by Ian Hardy [ 25/Sep/19 ]

I increased the memory limit on SRM and loaded 1 file of 30,000 records (this is the test that Kateryna Senchenko said failed this morning). Watching the memory usage of mod-source-record-manager, I saw it peak at about 570 MiB. The limit is currently configured at 682 MiB, which works out to around 715 MB. Let me know if this seems like a reasonable limit here.

Comment by Wayne Schneider [ 25/Sep/19 ]

Ann-Marie Breaux suggests that additional problems occur when you load multiple files (simultaneously or consecutively?), so that may also be something you want to test.

Comment by Ian Hardy [ 25/Sep/19 ]

Good point, Wayne Schneider. I did a batch of 3 files with 30,000 records each, and one file of 60,000, just to test the limits of the current config. The 3 at 30,000 worked fine (now they queue up). After loading the 60,000-record file, mod-inventory-storage crashed, but I'll consider that outside the scope of this issue. Ann-Marie Breaux, Kateryna Senchenko, does that seem like a reasonable test, and if so, shall we leave the memory settings where they are now?

Comment by Wayne Schneider [ 25/Sep/19 ]

That's great, Ian Hardy!

Ann-Marie Breaux David Crossley this is kind of an interesting documentation challenge. Is there user documentation for data import currently? If so, it would seem logical to add a section for system administrators explaining how the memory settings can be tuned to support larger record loads.

Comment by Ann-Marie Breaux (Inactive) [ 26/Sep/19 ]

Hi Wayne Schneider I don't think we've documented recommended configuration/memory for Data Import, but it's something we could do.

Oleksii Kuzminov Kateryna Senchenko What do you think about adding something here? https://folio-org.atlassian.net/wiki/display/FOLIJET/Data-import+user+guides

Comment by Oleksii Kuzminov [ 26/Sep/19 ]

Ann-Marie Breaux Yes, we will update the documentation.

Comment by Taras Spashchenko [ 26/Sep/19 ]

Ian Hardy, could you please add the following JVM parameters
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xloggc:/path/to/file/gc.log
the same way you do it for -Xmx. The log file will be created in the local file system of your container, so you can just grab it from there, or you can mount an external volume to the container beforehand. If you can share those files with us, it can help us understand the cause of this unusual memory consumption.
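One way to get gc.log out of the container is the volume mount mentioned above; a rough sketch, where the image name, tag, and paths are placeholders and not the actual deployment values (note these GC flags are the JDK 8 style; JDK 9+ replaced them with -Xlog:gc):

```shell
# Hypothetical sketch: mount a host directory so gc.log survives the container.
# Image name/tag and paths are placeholders, not the real deployment values.
docker run -d \
  -v /tmp/gclogs:/var/log/gc \
  -e JAVA_OPTIONS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
-XX:+PrintGCDateStamps -Xloggc:/var/log/gc/gc.log" \
  folioorg/mod-source-record-manager:latest
```

The GC log then lands in /tmp/gclogs/gc.log on the host, even after the container is killed.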

Comment by Ann-Marie Breaux (Inactive) [ 26/Sep/19 ]

Perfect Oleksii Kuzminov - thank you!

Comment by David Crossley [ 26/Sep/19 ]

It would be useful to document this in the Git README, and link to your other documentation.

Comment by Taras Spashchenko [ 27/Sep/19 ]

Hello all, not sure if Oleksii Kuzminov has already shared the link to this post regarding memory management for containerized Java processes. Just to make it clear:
https://merikan.com/2019/04/jvm-in-a-container/

So we can reduce the memory limit for the container and use -XX:MinRAMPercentage and -XX:MaxRAMPercentage to allocate more memory to the Java heap.
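To make the percentage flags concrete, a back-of-the-envelope sketch: the 512 MiB limit is taken from the earlier folio-snapshot-test run, and 75.0 is an assumed illustrative value, not a recommendation.

```shell
# With e.g. JAVA_OPTIONS="-XX:MaxRAMPercentage=75.0" and a 512 MiB container
# limit, the JVM caps the max heap at roughly this share of the limit:
limit_bytes=536870912                     # 512 MiB container limit (from the test above)
heap_bytes=$(( limit_bytes * 75 / 100 ))  # -XX:MaxRAMPercentage=75.0
echo "$heap_bytes"                        # 402653184 bytes (= 384 MiB)
```

(MinRAMPercentage, despite the name, only governs heap sizing on very small-memory containers; for limits in this range MaxRAMPercentage is the flag that matters.)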

Comment by Wayne Schneider [ 27/Sep/19 ]

Thanks, everyone, for all the work on this.

Taras Spashchenko and Oleksii Kuzminov – you can set the Java options and memory for the container yourselves in the module descriptor template. If you have specific recommendations, at this point I'd suggest you:

  1. Validate your assumptions by running your application in a local JVM and examining the memory usage. You can run a local instance of your application with a Vagrant box using the procedure documented here: https://github.com/folio-org/folio-ansible/blob/master/doc/index.md#running-backend-modules-on-your-host-system
  2. Build a local container for your application with the Dockerfile in your application repository. Launch it with the settings that you want to implement, and test its behavior.
Comment by Ian Hardy [ 27/Sep/19 ]

I'll close this one since the srm/srs/data import modules are no longer getting killed in the reference environment. Further changes can be made to memory settings in the launch descriptor if needed.

Comment by Ann-Marie Breaux (Inactive) [ 01/Oct/19 ]

Thanks everyone for your analysis and attention to this - seems like last week was a big one for memory work!

Generated at Thu Feb 08 23:19:29 UTC 2024 using Jira 1001.0.0-SNAPSHOT#100246-sha1:7a5c50119eb0633d306e14180817ddef5e80c75d.