[FOLIO-3275] run Karate tests against scratch env with OL enabled Created: 30/Aug/21  Updated: 27/Sep/21  Resolved: 27/Sep/21

Status: Closed
Project: FOLIO
Components: None
Affects versions: None
Fix versions: None

Type: Story Priority: TBD
Reporter: Hanna Hulevich Assignee: Steve Ellis
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original estimate: Not Specified

Attachments: PNG File MicrosoftTeams-image (1)-1.png     PNG File image-2021-09-18-11-07-54-237.png     PNG File image-2021-09-18-11-08-19-748.png     Text File mod-audit.log     Text File mod-circulation-storage.log    
Issue links:
Blocks
is blocked by MODPATBLK-101 Update module descriptor for optimist... Closed
Sprint: CP: sprint 122, CP: sprint 123
Story Points: 2
Development Team: Core: Platform

 Description   

Run Karate tests against scratch env with optimistic locking enabled.

Currently the FAT modules that have overlap with optimistic locking are as follows:

  • mod-audit
  • mod-data-export
  • mod-feesfines
  • mod-patron-blocks (maybe because mod-patron was the only one in scope)
  • mod-search
  • mod-data-import

This means that these modules' tests can be run separately to save time, since running the full suite of tests for FAT takes quite a bit of time.

There is a branch in FAT that has the okapi ingress for the CP team's rancher scratch added to each sub-project's karate config: https://github.com/folio-org/folio-integration-tests/tree/core-platform-team-optimistic-locking

To run the tests for each of the relevant modules listed above, run:

./runtests.sh mod-audit scratch

and so on for each module. Most of the tests in mod-feesfines are not yet implemented (they are marked @Undefined), and the same is true for mod-patron-blocks, though each still has a few implemented tests. The other modules (mod-audit, mod-data-export, mod-search) have many implemented tests.
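The per-module invocations above can be sketched as a small shell loop. The module list and the `./runtests.sh <module> <env>` interface come from this ticket; echoing the commands instead of executing them is my addition, since running them requires a checkout of the FAT branch.

```shell
# Modules with optimistic-locking overlap, per the list above.
# Sketch only: the loop echoes each command rather than executing
# runtests.sh; drop the `echo` inside a FAT checkout to actually run them.
mods="mod-audit mod-data-export mod-feesfines mod-patron-blocks mod-search mod-data-import"
for mod in $mods; do
  echo "./runtests.sh $mod scratch"
done
```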



 Comments   
Comment by Steve Ellis [ 16/Sep/21 ]

Overview

Karate tests are not passing consistently in the scratch env for modules I've been able to test.

Details

Summary of progress configuring the rancher environment and karate tests to run:

  • Have spent the day tweaking workload memory. This has helped test reliability somewhat, but it hasn't been completely effective.
  • A handful of modules were restarting a lot. Adding more memory has fixed that.
  • Okapi logs are only available on stdout. Need to redirect to a file to try to get a better handle on them.
  • Increased request timeouts in karate tests and this has solved any issues related to performance and failures due to this.
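Since Okapi's logs only go to stdout here, one way to get a persistent copy is via kubectl. This is a sketch: the deployment name (`deployment/okapi`) and namespace (`core-platform`) are assumptions, and the command is echoed rather than executed because it needs cluster access.

```shell
# Sketch: persist Okapi's stdout-only logs to a file for offline inspection.
# Deployment name and namespace are assumed, not confirmed by the ticket.
ns="core-platform"
logcmd="kubectl logs deployment/okapi -n $ns --since=2h"
# Echo instead of running, since this requires access to the cluster:
echo "$logcmd > okapi.log"
```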

Summary of FAT karate module tests that are in scope (see above) so far:

  • mod-feesfines - No problems. All tests pass.
  • mod-audit - Tests fail inconsistently (tests mostly pass, but different tests fail at different times). I thought it was performance related, but now I'm not so sure. This project's tests, however, rely heavily on programmatic pauses between requests, which to me is a code smell.
  • mod-patron-blocks - Was failing with "Incompatible version for module mod-inventory-17.1.0-coreplatform.1 interface instance-storage. Need 7.8. Have 8." Tried reinstalling the module via its Helm chart, thinking it wasn't current, but it won't start. Need to investigate the startup logs via kubectl.
  • mod-data-export - Tests can't connect to module. Need to troubleshoot.
  • mod-search - Haven't tested.
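For the "won't start" case above, the usual kubectl triage looks roughly like this. The namespace and label selector are assumptions (not confirmed by the ticket), and the commands are echoed rather than executed since they need cluster access.

```shell
# Sketch: kubectl commands for investigating a module that won't start.
# Namespace and label selector are assumed names; commands are echoed, not run.
ns="core-platform"
sel="app=mod-inventory"
for c in "kubectl get pods -n $ns -l $sel" \
         "kubectl describe pods -n $ns -l $sel" \
         "kubectl logs -n $ns -l $sel --previous"; do
  echo "$c"
done
```

The `--previous` flag is useful when a pod is crash-looping, since it shows the logs of the last terminated container rather than the freshly restarted one.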

Summary of spot checks of other modules that are out of scope for OL just to benchmark:

  • mod-notes - No problems; all tests pass.
  • acquisitions - Lots of failures. This is easily the karate project with the most tests.

The fact that spot checks of other (non-OL) modules also show failures suggests that the failures are due to a poorly functioning environment rather than OL.

Interested Parties
Hanna Hulevich, Jakub Skoczen, Julian Ladisch, Ian Hardy, Ann-Marie Breaux, Khalilah Gambrell

Comment by Steve Ellis [ 16/Sep/21 ]

If anyone would like to replicate what I'm seeing do the following:

  1. Clone and checkout my branch https://github.com/folio-org/folio-integration-tests/tree/core-platform-team-optimistic-locking
  2. Run ./runtests.sh mod-audit scratch
  3. Run ./runtests.sh mod-audit testing - to see the same tests succeed against the testing ref env
Comment by Steve Ellis [ 16/Sep/21 ]

After thinking about this for a bit I think it's likely the env is memory starved. The cluster's memory is maxed out. I can't allocate more memory to a pod without taking it away from somewhere else.

Part of the problem is the nice cluster dashboard hasn't been visible all day. Probably because there's not enough memory to create it...

These scratch envs may not have been intended to run platform-complete. We've been heaping a lot of modules onto this thing, and more modules need more resources, of course.

Comment by Hleb Surnovich [ 17/Sep/21 ]

We suspect there's a problem not only with memory but with CPU too. The charts show that some of the cluster's nodes are overloaded at particular times, which could explain the tests failing to run. Based on the metrics and other general information, it's hardly possible to increase the memory or CPU limits in this cluster without scaling the infrastructure.

Comment by Steve Ellis [ 18/Sep/21 ]

Based on Hleb Surnovich's chart I asked around about the process for getting more resources for our cluster. John Malconian was able to add three more nodes to increase the underlying compute resources available.

However, I'm not sure whether adding these nodes has had any effect on the CPU and RAM available to the cluster. Comparing what I could see before the change with what I can see after, the number of cores and the amount of RAM seem to be the same.

Screenshot before adding nodes:

Screenshot after adding nodes:

My thought is that perhaps our workloads have not been allocated to these new nodes, but I don't have visibility into the nodes due to permissions.

Another thing that came up in my conversation with John was that perhaps we shouldn't be setting resources.limits.cpu at the pod level. This may be what is driving up that high percentage (88%) of reserved CPU. Hleb's chart suggests that our CPU usage can go in excess of 100%, which seems like it could lead to unpredictable behavior.

At our institution we don't set any value for resources.limits.cpu in our FOLIO k8s deployment, which we understand tells k8s to let any pod use whatever CPU it wants. Our reserved CPU is therefore much lower.
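For illustration, a container resources block along those lines, with memory bounded but no CPU limit set, might look like the fragment below. This is a sketch: the container name and values are invented, not taken from the actual deployment.

```yaml
# Pod spec fragment: memory is bounded, but no cpu entry appears under
# limits, so the container may use whatever CPU the node can spare.
# (Container name and values are illustrative only.)
containers:
  - name: mod-audit
    resources:
      requests:
        memory: "512Mi"
      limits:
        memory: "1Gi"
```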

I've concentrated on trying to closely observe this particular karate test, which includes 3 scenarios. (mod-audit is in scope for OL.) When run against the scratch env, I've seen arbitrary combinations of the 3 tests succeed or fail – something that does not happen at all when run against a reference env. Since these tests rely on pauses, I've increased those to 30 seconds, from 5 seconds, with the idea that giving the updates more time to propagate through a resource constrained system would improve matters. It hasn't.

It could be that these tests are failing because of OL, but I don't think we can yet rule out a misconfiguration of the scratch env either.


Comment by Steve Ellis [ 19/Sep/21 ]

A few updates on this.

I am able to add ram and modify cpu quotas on pods, whereas before John's change I wasn't, so perhaps those views summarizing ram and cpu are noise, and the new nodes have opened things up.

But the karate test I'm focusing on still is failing in the same unpredictable way.

I'm uploading some logs from mod-circulation-storage and mod-audit, two modules that are involved in the test in question. There are some things of note there, mostly postgres complaining about indexes not being present. So a new theory: maybe the db isn't being configured the same in the scratch env as in the reference envs, making things super slow. Seems plausible, but it's just a guess.

At this point I think we really have two possibilities:

  • Everything is fine with the cluster, and it's just these tests that we need to worry about. Other functional tests can proceed.
  • These problems are signs of a bigger configuration problem with the cluster, like the db config being broken, or k8s cpu quotas being broken, causing requests to take ridiculous amounts of time. In this case functional testing should probably pause while we figure out how to fix the env. 

Again, I would be grateful for a few more eyes on this. Maybe something will occur to someone. Instructions for running the tests are above.

Why I still think this is a performance issue

  • From the beginning the cluster has been super slow to register modules, and unusually slow for mod-audit and mod-circulation, the two modules involved in the problematic test. Increasing the timeouts in my karate tests has helped with that.
  • At least once I've seen Vert.x exceed its timeout in a module, though I can't recall which one.
  • The problematic tests rely heavily on pauses, which implies that the requests they are waiting on are "fire and forget". But I've increased these pauses to what I would consider crazy levels (from 5 secs to 1 minute) without any effect.
  • There's no "smoking gun" unhandled exception or other error in the logs for mod-inventory or mod-inventory-storage which would imply that this has something to do with OL.


Comment by Steve Ellis [ 21/Sep/21 ]

The performance issues have been resolved by removing resources.limits.cpu and resources.requests.cpu from each pod spec in our namespace. To protect other teams' clusters, I added a namespace-wide CPU limit of 16000m to our namespace (core-platform).
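The ticket doesn't say how the namespace-wide 16000m cap was applied; one plausible reading is a Kubernetes ResourceQuota like the sketch below. The object name is invented, and since CPU quotas normally require pods to declare the corresponding resources field, a LimitRange supplying defaults is another possibility.

```yaml
# Sketch: cap total CPU limits across all pods in the core-platform
# namespace at 16 cores (16000m). Object name is illustrative only.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: core-platform-cpu-cap
  namespace: core-platform
spec:
  hard:
    limits.cpu: "16"
```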

Generated at Thu Feb 08 23:26:55 UTC 2024 using Jira 1001.0.0-SNAPSHOT#100246-sha1:7a5c50119eb0633d306e14180817ddef5e80c75d.