[FOLIO-3275] run Karate tests against scratch env with OL enabled Created: 30/Aug/21 Updated: 27/Sep/21 Resolved: 27/Sep/21 |
|
| Status: | Closed |
| Project: | FOLIO |
| Components: | None |
| Affects versions: | None |
| Fix versions: | None |
| Type: | Story | Priority: | TBD |
| Reporter: | Hanna Hulevich | Assignee: | Steve Ellis |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original estimate: | Not Specified | ||
| Attachments: |
|
| Issue links: |
|
| Sprint: | CP: sprint 122, CP: sprint 123 |
| Story Points: | 2 |
| Development Team: | Core: Platform |
| Description |
|
Run Karate tests against scratch env with optimistic locking enabled. Currently the FAT modules that have overlap with optimistic locking are as follows:
This means that these modules' tests can be run separately to save time, since running the full FAT suite takes quite a while. There is a branch in FAT that adds the okapi ingress for the CP team's rancher scratch env to each sub-project's karate config: https://github.com/folio-org/folio-integration-tests/tree/core-platform-team-optimistic-locking To run the tests for each of the relevant modules listed above, do: ./runtests.sh mod-audit scratch and so on for each module. Most of the tests in mod-feesfines are not implemented (they are marked as @Undefined); the same is true for mod-patron-blocks, though each still has a few implemented tests. The other modules (mod-audit, mod-data-export, mod-search) have many implemented tests. |
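The per-module invocations described above can be scripted. A minimal sketch (the module list is reconstructed from the module names mentioned in this ticket, and `echo` is used as a dry run so the loop can be inspected without a configured env; replace it with a real invocation of `./runtests.sh`):

```shell
# Modules this ticket names as overlapping with optimistic locking.
# Each sub-project's Karate suite is run against the "scratch" env profile.
for mod in mod-audit mod-data-export mod-search mod-feesfines mod-patron-blocks; do
  echo "./runtests.sh $mod scratch"   # dry run; drop the echo to actually run
done
```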
| Comments |
| Comment by Steve Ellis [ 16/Sep/21 ] |
|
Overview Karate tests are not passing consistently in the scratch env for modules I've been able to test. Details Summary of progress configuring the rancher environment and karate tests to run:
Summary of FAT karate module tests that are in scope (see above) so far:
Summary of spot checks of other modules that are out of scope for OL just to benchmark:
The fact that spot checks in other (non-OL) modules also fail suggests that the failures are due to a poorly functioning environment rather than OL. Interested Parties |
| Comment by Steve Ellis [ 16/Sep/21 ] |
|
If anyone would like to replicate what I'm seeing do the following:
|
| Comment by Steve Ellis [ 16/Sep/21 ] |
|
After thinking about this for a bit, I think it's likely the env is memory starved. The cluster's memory is maxed out, and I can't allocate more memory to a pod without taking it away from somewhere else. Part of the problem is that the nice cluster dashboard hasn't been visible all day, probably because there's not enough memory to create it... These scratch envs may not have been intended to run platform-complete. We've been heaping a lot of modules onto this thing, and more modules need more memory, of course. |
| Comment by Hleb Surnovich [ 17/Sep/21 ] |
|
We assume the problem is not only with memory but with CPU too. The charts show that some cluster nodes are overloaded during a particular time slot, which may explain why the tests fail to run. From the metrics and the general information, it's hardly possible to increase the memory or CPU limits in this cluster without scaling the infrastructure. |
| Comment by Steve Ellis [ 18/Sep/21 ] |
|
Based on Hleb Surnovich's chart I asked around about the process for getting more resources for our cluster. John Malconian was able to add three more nodes to increase the underlying compute resources available. However, I'm not sure whether adding these nodes has had an effect on the CPU and RAM for the cluster: comparing what I could see before the change and after, the number of cores and the amount of RAM seem to be the same. Screenshot before adding nodes: Screenshot after adding nodes: My thought is that perhaps our workloads have not been allocated to these new nodes, but I don't have visibility into the nodes due to permissions. Another thing that came up in my conversation with John was that perhaps we shouldn't be setting resources.limits.cpu at the pod level. This may be what is driving up that high percentage (88%) of reserved CPU. Hleb's chart suggests that our CPU usage can go in excess of 100%, which could lead to unpredictable behavior. At our institution we don't set any value for resources.limits.cpu in our FOLIO k8s deployment, which we believe tells k8s to let any pod use whatever CPU it wants; our reserved CPU is therefore much lower.
I've concentrated on closely observing this particular karate test, which includes 3 scenarios. (mod-audit is in scope for OL.) When run against the scratch env, I've seen arbitrary combinations of the 3 tests succeed or fail, something that does not happen at all against a reference env. Since these tests rely on pauses, I've increased those from 5 seconds to 30 seconds, on the theory that giving the updates more time to propagate through a resource-constrained system would improve matters. It hasn't. It could be that these tests are failing because of OL, but I don't think we can yet rule out that the scratch env is misconfigured.
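For reference, dropping the pod-level CPU limit while keeping a request (so the scheduler can still place the pod) would look roughly like this in a container spec. This is a sketch, not the team's actual manifest, and the request/memory values are hypothetical:

```yaml
# Sketch of a container "resources" stanza with no CPU limit.
# Memory stays capped; the CPU request guides scheduling, but the
# container may burst to whatever CPU the node has free.
resources:
  requests:
    cpu: 200m        # hypothetical value
    memory: 512Mi    # hypothetical value
  limits:
    memory: 512Mi    # note: no limits.cpu entry
```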
|
| Comment by Steve Ellis [ 19/Sep/21 ] |
|
A few updates on this. I am able to add RAM and modify CPU quotas on pods, whereas before John's change I wasn't, so perhaps those views summarizing RAM and CPU are noise and the new nodes have opened things up. But the karate test I'm focusing on is still failing in the same unpredictable way. I'm uploading some logs from mod-circulation and mod-audit, two modules that are in scope for the test in question. There are some things of note there, mostly Postgres complaining about indexes not being present. So, new theory: maybe the db isn't being configured the same way in the scratch env as in the reference envs, making things very slow. Seems plausible, but it's just a guess. At this point I think we really have two possibilities:
Again I would be grateful if I could get a few more eyes on this. Maybe something will occur to someone. Instructions for running the tests are above. Why I still think this is a performance issue
|
| Comment by Steve Ellis [ 21/Sep/21 ] |
|
The performance issues have been resolved by removing resources.limits.cpu and resources.requests.cpu from each pod spec in our namespace. To protect other teams' clusters, I added a namespace-wide limit of 16000m to our namespace (core-platform). |
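A namespace-wide cap like the 16000m mentioned above can be expressed with a Kubernetes ResourceQuota. This is a sketch; the actual object used in the core-platform namespace isn't shown in the ticket:

```yaml
# Sketch: cap total CPU limits across all pods in the namespace at 16 cores.
# Caveat: a quota on limits.cpu makes the API server reject containers that
# declare no CPU limit, so a LimitRange supplying a default limit is usually
# paired with it.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: cpu-cap
  namespace: core-platform
spec:
  hard:
    limits.cpu: "16"   # 16000m total
```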