[FOLIO-2638] many platform-core and ui-users builds are dying OOM in CI. why??? Created: 08/Jun/20  Updated: 10/Jun/20  Resolved: 09/Jun/20

Status: Closed
Project: FOLIO
Components: None
Affects versions: None
Fix versions: None

Type: Bug Priority: TBD
Reporter: Zak Burke Assignee: John Malconian
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original estimate: Not Specified

Sprint: DevOps: sprint 90
Story Points: 8
Development Team: FOLIO DevOps

 Description   

We have been seeing build failures for numerous PRs in ui-users and platform-core. One option discussed on Slack is more build nodes.
Example from ui-users PR 1347. Everything below "Last few GCs" repeats until the process times out after 30-60 minutes.

$ stripes test karma --bundle --karma.singleRun --karma.browsers ChromeDocker --karma.reporters mocha junit --coverage
Starting Karma tests...

START:

<--- Last few GCs --->

[1304:0x4376ec0]    67996 ms: Scavenge 2040.2 (2050.1) -> 2039.6 (2050.1) MB, 12.7 / 0.0 ms  (average mu = 0.139, current mu = 0.003) allocation failure 
[1304:0x4376ec0]    68075 ms: Scavenge 2040.4 (2050.1) -> 2039.7 (2050.1) MB, 9.9 / 0.0 ms  (average mu = 0.139, current mu = 0.003) allocation failure 
[1304:0x4376ec0]    68162 ms: Scavenge 2040.5 (2050.1) -> 2039.8 (2050.3) MB, 10.2 / 0.0 ms  (average mu = 0.139, current mu = 0.003) allocation failure 


<--- JS stacktrace --->

==== JS stack trace =========================================

    0: ExitFrame [pc: 0x13c03d9]
    1: StubFrame [pc: 0x1347261]
Security context: 0x0ee6c78808d1 <JSObject>
    2: /* anonymous */(aka /* anonymous */) [0x1cd67e0ed9e1] [/home/jenkins/workspace/folio-org_ui-users_UIU-1273/project/node_modules/terser/dist/bundle.min.js:~1] [pc=0x9a8771d5b6d](this=0x05738c1004b1 <undefined>,0x153379fcd419 <AST_Object map = 0x32e8d1462089>,0x1b1316d46479 <On map = 0x183e21109129>)
    3: /* anonymous */ [0...

FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory

Writing Node.js report to file: report.20200605.194727.1304.0.001.json
Node.js report completed
 1: 0xa02f90 node::Abort() [/usr/bin/node]
 2: 0xa033b5 node::OnFatalError(char const*, char const*) [/usr/bin/node]
 3: 0xb76ffe v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [/usr/bin/node]
 4: 0xb77379 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [/usr/bin/node]
 5: 0xd23ad5  [/usr/bin/node]
 6: 0xd24166 v8::internal::Heap::RecomputeLimits(v8::internal::GarbageCollector) [/usr/bin/node]
 7: 0xd309e5 v8::internal::Heap::PerformGarbageCollection(v8::internal::GarbageCollector, v8::GCCallbackFlags) [/usr/bin/node]
 8: 0xd31895 v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [/usr/bin/node]
 9: 0xd3434c v8::internal::Heap::AllocateRawWithRetryOrFail(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [/usr/bin/node]
10: 0xcfaf1b v8::internal::Factory::NewFillerObject(int, bool, v8::internal::AllocationType, v8::internal::AllocationOrigin) [/usr/bin/node]
11: 0x103d85e v8::internal::Runtime_AllocateInYoungGeneration(int, unsigned long*, v8::internal::Isolate*) [/usr/bin/node]
12: 0x13c03d9  [/usr/bin/node]

Similar results for platform-core PRs.



 Comments   
Comment by Zak Burke [ 08/Jun/20 ]

John Malconian, I'm investigating whether a code change on our end, or a change in some third-party dep could be responsible, but there are no red flags so far. Is it easy to throw RAM at this? If so, that may be best to take off the pressure while folks are trying to publish releases this week, and we can continue the investigation later.

Comment by John Malconian [ 09/Jun/20 ]

Zak Burke This is pretty strange. Nothing has changed on the build nodes or build images recently that I'm aware of. Have we updated Node recently, David Crossley? I don't think so.

Currently we have five build nodes, each with 32 GB of RAM (t2.2xlarge), and only two jobs can run on a build node at one time. If I add build nodes, we'd still have the same issue unless I restricted jobs to one per node, and then build jobs would queue up. The other option would be to upgrade the instance type; the next tier would double RAM to 64 GB per node. I think that could get expensive. What do you think, Peter Murray?

Comment by David Crossley [ 09/Jun/20 ]

No recent build image changes, as far as I know.

Comment by Zak Burke [ 09/Jun/20 ]

I've been able to duplicate this platform-core build failure locally, sometimes:

export NODE_OPTIONS="--max-old-space-size=2048 $NODE_OPTIONS"; stripes build stripes.config.js --okapi https://localhost:9130 --tenant diku ./output

Unfortunately, I didn't notice when build-platform-core-snapshot first started failing++ so the diff between the current and last-good yarn.lock files has grown, but I'll see if I can get anywhere with it today.

The ui-users#master build started failing on Thursday, and we do still have those old yarn.lock files. That build doesn't have the --max-old-space-size but I will add it and see if I can replicate the failure.

++ I gotta figure out a way to monitor the UI master builds and the platform-core and platform-complete snapshot builds to get notified when they fail; I made a dashboard but I have to go look at it. I know we have Slack#folio-ci, but that's awfully noisy and I only care about certain failed builds.

Comment by Zak Burke [ 09/Jun/20 ]

The next tier would double RAM to 64GB per node.

Re-reading that: if we currently have 32 GB per node, is it possible to dedicate more of it to Node, e.g.

export NODE_OPTIONS="--max-old-space-size=4096 $NODE_OPTIONS";
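For reference, a quick way to confirm the limit actually takes effect is to ask V8 itself for its heap ceiling (a sketch; runs anywhere Node is installed):

```shell
# Verify the heap limit a Node process actually gets from NODE_OPTIONS.
# v8.getHeapStatistics().heap_size_limit reflects --max-old-space-size
# plus a small allowance for the young generation, so expect a bit over
# 4096 MB here rather than the default ~2 GB.
export NODE_OPTIONS="--max-old-space-size=4096 ${NODE_OPTIONS:-}"
node -e 'const mb = require("v8").getHeapStatistics().heap_size_limit / 1048576;
console.log(Math.round(mb));'
```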
Comment by Zak Burke [ 09/Jun/20 ]

Also, why does build-platform-core-snapshot fail if build-platform-complete-snapshot succeeds?!?

Comment by Peter Murray [ 09/Jun/20 ]

I'm up for an experiment. How hard would it be to create a higher-capacity instance and pipe some jobs through it to see if the issue goes away?

The EC2 reserved instances we have now are "Convertible", so if we need to move to a higher-capacity instance or a different instance family, we can certainly do that. (As long as we stay in the same AWS region, we just pay the difference in cost at the time of conversion.)

Comment by John Malconian [ 09/Jun/20 ]

platform-core is built with the following NODE_OPTIONS. max-old-space-size is half the amount allocated to platform-complete.

export NODE_OPTIONS="--max-old-space-size=2048 $NODE_OPTIONS"; stripes build stripes.config.js --okapi https://localhost:9130 --tenant diku ./output

I'll update that to 4096.

Additionally, NODE_OPTIONS is set for 'stripes build' but I'm not sure what, if anything, NODE_OPTIONS is set to during unit test runs. It looks like we just need to allocate more memory overall to Node.
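Since NODE_OPTIONS is an environment variable that every spawned node process reads, exporting it once before the test step covers Karma and any child workers (webpack, terser). A minimal sketch of the prepend pattern used above; the commented stripes invocation is the one from the build log:

```shell
# Prepend the heap flag rather than overwrite, so any NODE_OPTIONS the
# pipeline has already exported survive. Every node process spawned
# afterwards inherits the larger heap.
export NODE_OPTIONS="--max-old-space-size=3072 ${NODE_OPTIONS:-}"
echo "NODE_OPTIONS=$NODE_OPTIONS"
# ...then run the unit tests, e.g.:
# stripes test karma --bundle --karma.singleRun --karma.browsers ChromeDocker
```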

Comment by Zak Burke [ 09/Jun/20 ]

OK, so, we allocate more memory to Node but leave the AWS instances alone. LMK when this is in place so I can test some of the PRs that have been blocked.

John Malconian, do you think it's worth trying to figure out what may have changed in ui-users or platform-core to trigger this now, or are we content with "Node is a RAM hog; let's just feed the beast and move on to more interesting problems"?

Comment by John Malconian [ 09/Jun/20 ]

Ok. So I merged a change to the Node-based pipeline that explicitly sets NODE_OPTIONS="--max-old-space-size=3072" prior to unit tests running. Tested against a branch of ui-users and it seems to do the trick. I'm not sure why unit tests require more memory now, but apparently they do. Not sure it's worth spending a ton of time figuring out why. It seems to be the Way of the Node. I also doubled max-old-space-size for platform-core builds and that issue seems to be resolved.

I think we can close this issue out, but let's keep it open for another day so that we can verify that the issue has really been resolved.

Comment by Zak Burke [ 09/Jun/20 ]

We've been building platform-complete with 4096 MB but the others with "only" 2048 MB. Now we allocate 4096 MB to the other builds as well.

Generated at Thu Feb 08 23:22:09 UTC 2024 using Jira 1001.0.0-SNAPSHOT#100246-sha1:7a5c50119eb0633d306e14180817ddef5e80c75d.