[FOLIO-2334] Spike: Investigate using JVM features to manage container memory Created: 31/Oct/19  Updated: 03/Jun/20  Resolved: 18/Nov/19

Status: Closed
Project: FOLIO
Components: None
Affects versions: None
Fix versions: None

Type: Task Priority: P2
Reporter: David Crossley Assignee: David Crossley
Resolution: Done Votes: 0
Labels: devops, platform-backlog
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original estimate: Not Specified

Issue links:
Blocks
blocks MODSOURCE-80 Re-assess the container Memory alloca... Closed
blocks MODSOURMAN-223 Re-assess the container Memory alloca... Closed
blocks FOLIO-2358 Use JVM features (UseContainerSupport... Closed
blocks FOLIO-2315 Re-assess the memory allocation in de... Blocked
Relates
relates to FOLIO-2367 Remove openjdk8-jre-alpine Closed
relates to FOLIO-1729 Use container memory limits to manage... Closed
relates to FOLIO-2185 SPIKE: how to maintain resource deplo... Closed
Sprint: CP: sprint 76
Story Points: 5
Development Team: Core: Platform

 Description   

To enable the JVM to derive sensible defaults in a container environment, Java 10 introduced the "UseContainerSupport" flag, which was backported to Java 8 (8u191+). Use it in conjunction with "MaxRAMPercentage".

The following steps explore that:

1. Roll 2 new base FOLIO JVM images (folioci/openjdk8-jre-alpine and folioci/openjdk8-jre)
2. Update a Dockerfile in two test modules (which use each of those)
3. Test launching containers with legacy settings in the LaunchDescriptor
4. Test launching containers with new settings in the LaunchDescriptor
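As a sketch of steps 3 and 4, the new-style launch looks like the following (the image name, module, and memory value are illustrative assumptions, not from this spike; the command is printed rather than executed so the sketch is self-contained):

```shell
# New-style settings: let the JVM size the heap from the container limit.
# Note the value must be "66.0", not "66".
NEW_OPTS="-XX:MaxRAMPercentage=66.0 -XX:+PrintFlagsFinal"

# Illustrative launch command (module image name is an assumption):
echo docker run -d --memory=536870912 \
  -e JAVA_OPTIONS="$NEW_OPTS" \
  folioci/mod-example:latest
```

Legacy-style launches instead pass an explicit `-Xmx` in JAVA_OPTIONS, which disables the automatic heap sizing.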

Outcome:

  • Update documentation on dev.folio.org explaining what memory settings should be used in individual MDs, and how the Dockerfile should be updated to utilize the new base image, whose updated JVM supports those settings.
  • Update FOLIO-2315 Blocked and linked issues to link to this documentation


 Comments   
Comment by David Crossley [ 15/Nov/19 ]

Results of spike investigation:

Built new docker image folioci/alpine-jre-openjdk8:latest
based on alpine:latest currently "3.10.3"
and openjdk version "1.8.0_222"
The "UseContainerSupport" setting is true by default.

Verified with various modules (in a local Vagrant VM), rebuilding each module within it and launching its new local Docker container.

JAVA_OPTIONS: "-XX:MaxRAMPercentage=66.0 -XX:+PrintFlagsFinal"

(Note that the value must be "66.0" not "66".)

Verified that this sets "MaxHeapSize" to 66% of the "Memory" value specified in the module's default LaunchDescriptor, leaving the remainder of the memory allocation for other non-heap needs.
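The arithmetic can be sketched as follows (the 512 MiB Memory value is an assumption for illustration; the real JVM rounds the heap size to its allocation granularity, so the actual MaxHeapSize will differ slightly):

```shell
# Expected heap/non-heap split for MaxRAMPercentage=66.0
# (Memory value is illustrative, not from any real LaunchDescriptor).
MEMORY=536870912                    # HostConfig.Memory: 512 MiB in bytes
MAX_HEAP=$(( MEMORY * 66 / 100 ))   # ~66% goes to the heap
NON_HEAP=$(( MEMORY - MAX_HEAP ))   # remainder stays for metaspace, stacks, etc.
echo "MaxHeapSize~=${MAX_HEAP} non-heap~=${NON_HEAP}"
```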

So some modules may need to reassess their LaunchDescriptor "Memory" setting.

Also verified that existing LD settings still operate as they do now (probably better, given this newer underlying Docker image).

Some notes for the roll-out process:

Each module can adopt the new Docker image and adjust to the new LD settings. A follow-up job is then to adjust their folio-ansible group_vars settings, which override these. As shown above, the current group_vars settings will continue to operate until then.

Comment by David Crossley [ 15/Nov/19 ]

Some useful resources discovered while investigating:

https://medium.com/adorsys/usecontainersupport-to-the-rescue-e77d6cfea712
"Please note that setting -Xmx and -Xms disables the automatic heap sizing."

https://stackoverflow.com/a/55463537
Andrei Pangin
and follow the links to his other answer
https://stackoverflow.com/a/53624438
and then to his beaut presentation.

Comment by David Crossley [ 15/Nov/19 ]

Still investigating some other FOLIO modules.

Comment by David Crossley [ 18/Nov/19 ]

See the results of this spike listed above, and this ticket's modified issue Description to specify the Outcome and next steps. Please wait for FOLIO-2315 Blocked (and all linked tickets) to be updated.

Comment by David Crossley [ 29/Nov/19 ]

steve.osguthorpe asked on ERM-638 Closed to confirm that modules do benefit from the new memory settings.

Further summary to accompany the results listed above:

During testing we added "-XX:+PrintFlagsFinal" and investigated the docker logs. This shows that "MaxHeapSize" is correctly set to 66% of the container memory allocation.

After rollout of the new base docker image and MaxRAMPercentage setting, we monitor the folio-snapshot-load reference environment each day.

Every hour we assess 'docker stats' for all modules. This shows that their total memory usage remains below that level. For a longer-running system there would probably be an increase, as non-heap memory is further utilised and returned.

We also grep each module's docker logs every hour to ensure no "java.lang.OutOfMemoryError".
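The hourly log check described above might look like the following (the sample log text stands in for real `docker logs` output, which is an assumption so the sketch runs without a Docker daemon):

```shell
# Grep a module's log for OOM errors, as in the hourly check.
# LOG_SAMPLE stands in for `docker logs <module>` output (illustrative).
LOG_SAMPLE='INFO module started
java.lang.OutOfMemoryError: Java heap space'
OOM_COUNT=$(printf '%s\n' "$LOG_SAMPLE" | grep -c 'java.lang.OutOfMemoryError')
echo "OOM errors found: ${OOM_COUNT}"
```

In the real check, a non-zero count for any module would flag that its Memory allocation or MaxRAMPercentage needs revisiting.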

So we believe that this usage of UseContainerSupport and MaxRAMPercentage does provide an appropriate way to manage the memory allocation. It reserves one-third of the container memory for non-heap use. Developers can adjust the total container memory via their LaunchDescriptor to raise it if they need more, but hopefully trim it down to provide a leaner system. The 66% figure is an average estimate, and could also be adjusted for certain modules.

(When the other devops people return from the Thanksgiving break, then they might be able to expand my answers.)

Comment by David Crossley [ 29/Nov/19 ]

steve.osguthorpe also asked on ERM-638 Closed "whether the CGroup is correctly available". Could you please explain how to verify that?

Comment by John Malconian [ 02/Dec/19 ]

steve.osguthorpe is most likely referring to this error that occurs in the container log when using the old fabric8-based container:

cat: can't open '/sys/fs/cgroup/memory/memory.memsw.limit_in_bytes': No such file or directory

This specific error is generated when cgroup swap accounting is not enabled in the host kernel. This is the default for recent Debian/Ubuntu kernels (it can be enabled via a kernel parameter). The old fabric8 base image had hooks for setting this control group, because it was originally written for Redhat/CentOS, where cgroup swap accounting is enabled by default. At any rate, we are no longer using a base image based on fabric8, so this error should no longer appear. We do not set any container limitations on swap anyway - just RAM. Control group accounting for RAM is enforced.
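One way to verify cgroup availability from inside a container is to check for the cgroup v1 memory controller files (this is a sketch under the assumption of a cgroup v1 host; on a cgroup v2 host these paths will not exist):

```shell
# Check whether the memory cgroup controller (v1 paths) is visible,
# which is one way to confirm the cgroup is correctly available.
if [ -r /sys/fs/cgroup/memory/memory.limit_in_bytes ]; then
  CGROUP_STATUS="memory cgroup v1 visible"
else
  CGROUP_STATUS="memory cgroup v1 not mounted (host may use cgroup v2)"
fi
echo "$CGROUP_STATUS"
```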

Comment by steve.osguthorpe [ 03/Dec/19 ]

David Crossley - Thanks for the confirmation.
John Malconian - Thank you too for the expansion and yes it is.
I do however have another question, and it's to do with the non-heap settings. Metaspace (which used to be called PermGen) is always set to a high number (the maximum), which basically removes any upper limit. So even though you've specified a max RAM percentage, that only applies to the heap. Are there any plans/recommendations for us to incorporate that setting? Should I, as a developer, just set my MaxRAMPercentage based on the value in the Memory section of the descriptor? How do external ops teams know that at 800 MB we can only have 50% for heap, but at 3 GB you can allocate up to 90%?

Comment by Wayne Schneider [ 03/Dec/19 ]

steve.osguthorpe there are two keys in the module descriptor that can communicate that kind of information to the external operators:

launchDescriptor/dockerArgs/HostConfig/Memory: Total memory allocation for the container
launchDescriptor/env: You can set the JAVA_OPTIONS environment variable as you see fit. For example, you could use MinRAMPercentage instead of MaxRAMPercentage if you felt that was more appropriate.
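As a sketch, those two keys might look like this in a module descriptor (the Memory value and JAVA_OPTIONS string are illustrative, not a recommendation for any particular module):

```json
{
  "launchDescriptor": {
    "dockerArgs": {
      "HostConfig": { "Memory": 536870912 }
    },
    "env": [
      {
        "name": "JAVA_OPTIONS",
        "value": "-XX:MaxRAMPercentage=66.0"
      }
    ]
  }
}
```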

One rule of thumb proposed by Craig McNally, which seems sensible to me, is that you set the size for a single tenant with a standard workload, whatever that means to you, with the expectation that the operator will scale containers horizontally to meet higher demand. Like all rules of thumb, it probably won't work for every circumstance, but it seems a reasonable starting point.

Beyond that, you can of course communicate specific resource needs in the module README.

Does that address your concerns?

Comment by steve.osguthorpe [ 03/Dec/19 ]

Wayne Schneider Thanks. That seems completely reasonable to me.

Generated at Thu Feb 08 23:19:54 UTC 2024 using Jira 1001.0.0-SNAPSHOT#100246-sha1:7a5c50119eb0633d306e14180817ddef5e80c75d.