[FOLIO-2097] Unexpected "timeout exceeded" failures in BigTest unit tests Created: 12/Jun/19 Updated: 03/Jun/20 Resolved: 09/Mar/20 |
|
| Status: | Closed |
| Project: | FOLIO |
| Components: | None |
| Affects versions: | None |
| Fix versions: | None |
| Type: | Bug | Priority: | P2 |
| Reporter: | maksym_dryha | Assignee: | Zak Burke |
| Resolution: | Done | Votes: | 0 |
| Labels: | bigtest, ui-only |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original estimate: | Not Specified |
| Attachments: | |
| Issue links: | |
| Sprint: | Core: F - Sprint 83, Core: F - Sprint 84 |
| Story Points: | 13 |
| Development Team: | Prokopovych |
| Description |
| Comments |
| Comment by Aliaksei Chumakou [ 16/Jul/19 ] |
|
At Acquisitions we see the same thing:
FAILED TESTS:
Organization details
✖ note is displayed
Chrome 71.0.3578 (Linux 0.0.0)
Error: convergent assertion was successful, but exceeded the 2000ms timeout
at loop (test/bigtest/index.js:8405:17)
|
| Comment by Oleksandr Yehorov [ 16/Jul/19 ] |
|
Same thing in ui-data-import module
FAILED TESTS:
when match profile is edited and there is no associated job profiles
✖ match profile details renders updated match profile
Chrome 71.0.3578 (Linux 0.0.0)
Error: convergent assertion was successful, but exceeded the 2000ms timeout
at loop (test/bigtest/index.js:8198:17)
|
| Comment by Anton Emelianov (Inactive) [ 16/Jul/19 ] |
|
Oleksandr Yehorov, Aliaksei Chumakou, maksym_dryha According to Zak Burke: That being said, did you observe the reported problem when the stripes-connect version was v5.2.1? Or was it with the stripes-connect version that can cause slow response times? |
| Comment by Zak Burke [ 16/Jul/19 ] |
|
For the record, I've seen these "convergence successful but it exceeded timeout" BigTest failures too. I'd love to have an easy option to set the default timeout to something more lenient, and to convert them from errors to warnings. |
| Comment by Aliaksei Chumakou [ 17/Jul/19 ] |
|
Anton Emelianov We've seen such issues before, but recently they have appeared very frequently; it might be connected with the stripes-connect fix Zak described above. |
| Comment by Aliaksei Chumakou [ 20/Nov/19 ] |
|
Anton Emelianov we started to see this again. Basically, here is a suggestion to disable the tests so that we can merge PRs and build artifacts from the master branch: https://folio-project.slack.com/archives/C58TABALV/p1574189630038800 |
| Comment by Anton Emelianov (Inactive) [ 20/Nov/19 ] |
|
Aliaksei Chumakou Every test framework has its pros and cons. I don't see how this issue can cause us to throw away BigTest. We had similar situations before, when developers complained that BigTest could not run on Windows, for example, and we were able to resolve it. We should try to resolve this failure as well, and I can ask Jeffrey Cherewaty to look into it. It will be much cheaper to fix this particular problem, and will cause much less disturbance, than swapping the test framework. Let's disable this test for now so that you can push the PR through. |
| Comment by Aliaksei Chumakou [ 20/Nov/19 ] |
|
Anton Emelianov I'm not talking about throwing out BigTest. I just proposed investigating whether we can get a more stable and productive test tool for the UI, and suggested investigating this inside one of the Acquisitions modules. |
| Comment by Anton Emelianov (Inactive) [ 20/Nov/19 ] |
|
1) BigTest is being used for the UI only. I am very clear about what you are proposing and I cannot agree to it. That being said, I will escalate this issue to Frontside. I understand that the current situation is not acceptable and needs to be resolved. |
| Comment by Aliaksei Chumakou [ 20/Nov/19 ] |
|
Anton Emelianov Thanks for escalating! Let's hope it can be resolved. However, I have minor additions to the items you listed: |
| Comment by maksym_dryha [ 20/Nov/19 ] |
|
Spitfire team is also facing problems with unreasonably failing tests. We also had to disable them all on CI |
| Comment by Ann-Marie Breaux (Inactive) [ 21/Nov/19 ] |
|
Hi all - should the title of this issue include BigTest? Is the issue all centered around BigTest, or is it broader than that? I see that Folijet had issues this past summer, but I'm not sure whether there are issues going on now with Folijet. Adding a couple dev watchers in case they want to mention anything. |
| Comment by Charles Lowell [ 26/Nov/19 ] |
|
Do we have links to the CI runs that are failing? |
| Comment by Aliaksei Chumakou [ 26/Nov/19 ] |
|
Charles Lowell https://jenkins-aws.indexdata.com/job/folio-org/job/ui-orders/job/master/575/console |
| Comment by Taras Tkachenko [ 03/Dec/19 ] |
|
Hi guys.
buildNPM {
publishModDescriptor = 'yes'
runLint = 'yes'
runSonarqube = true
runTest = 'yes'
runTestOptions = '--karma.singleRun --karma.browsers ChromeDocker --karma.reporters mocha junit --coverage --bigtest.defaultTimeout 4000'
}
Aliaksei Chumakou Zak Burke |
| Comment by Ann-Marie Breaux (Inactive) [ 03/Dec/19 ] |
|
Hi Anton Emelianov Zak Burke Charles Lowell Would it be possible to make a decision on possibly changing the severity, or some other way to resolve this ASAP? This is preventing some of Folijet's key UI work from showing on the hosted-ref-environments, and thus blocking our release tomorrow. Thank you! |
| Comment by Zak Burke [ 03/Dec/19 ] |
|
In short, it's complicated. Some food for thought:

The two-second timeout is hard-coded into BigTest (ugh), and while it is possible to override it, doing so is not a great option because the timeouts are often inconsistent: tests 1, 3, and 5 will fail on one build and succeed on the next, but then test 2 will fail. That basically leaves us with adding a manual timeout to every single test.

Additionally, the way timeouts are handled in BigTest makes it impossible for us to change the severity and convert the timeouts from errors to warnings. An exception is an exception, and BigTest does not offer a way to filter them.

On top of that, we frequently see errors like "Error: convergence exceeded the 2000ms timeout" in our before/beforeEach blocks from BigTest's assertion handler, which are distinct from the "Error: convergent assertion was successful, but exceeded the 2000ms timeout" failures in the tests themselves. These are particularly frustrating because they don't indicate an actual test timeout but rather a test-setup timeout. This stinks because it is basically a complaint that "the app took too long to load", but in real life we're not going to reload the entire app on every click. To be clear, it is important to reload the entire app before each test in order to make sure there are no side effects from one test to another; but it's not fair or realistic to expect a test's timeout to accommodate that.

Finally, we've also seen BigTest continue to time out in assertions (i.e. during test setup) even with really long timeouts: as we pushed from 2000 to 4000 to 10000 ms, we got the same "Error: convergent assertion was successful, but exceeded the 10000ms timeout" failure, which seriously made me wonder whether it was telling the truth. I don't recall seeing BigTest pause for 10 seconds on any test, but I don't have any logs at my fingertips to support this claim. |
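To make the distinction between the two error messages concrete, here is a minimal, self-contained sketch of a convergence loop. This is NOT BigTest's actual source, just an illustration of the mechanism: the assertion is retried until it stops throwing, and the clock is checked both while it is failing and after it finally succeeds.

```javascript
// Sketch of a convergence loop (illustrative, not BigTest internals).
function converge(assertion, timeout = 2000, pollInterval = 10) {
  const start = Date.now();
  return new Promise((resolve, reject) => {
    const loop = () => {
      const elapsed = Date.now() - start;
      try {
        assertion(); // throws while the condition does not yet hold
      } catch (err) {
        if (elapsed >= timeout) {
          // Never succeeded inside the window: the error the
          // before/beforeEach blocks report.
          reject(new Error(`convergence exceeded the ${timeout}ms timeout`));
        } else {
          setTimeout(loop, pollInterval); // retry
        }
        return;
      }
      if (elapsed > timeout) {
        // Succeeded, but only after the window closed, e.g. because the
        // event loop was starved and the poll ran late.
        reject(new Error(
          `convergent assertion was successful, but exceeded the ${timeout}ms timeout`
        ));
      } else {
        resolve();
      }
    };
    loop();
  });
}
```

Note that in a loop like this, raising the timeout doesn't help if the event loop itself is stalled (say, by GC pressure from a memory leak): the final poll can land after the deadline no matter how wide the window is, which matches the "pushed from 2000 to 10000 ms, same failure" observation.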
| Comment by Taras Tkachenko [ 03/Dec/19 ] |
|
Thanks Zak Burke for explaining this all. |
| Comment by Zak Burke [ 03/Dec/19 ] |
|
I don't have a good solution. Increasing the default timeout may help, but it's not a slam dunk. As noted above, longer timeouts sometimes led to the exact same outcome: a successful assertion that still exceeds the timeout window.

Maybe a short-term option is to run the tests the way we do lint: capture the output to a PR comment if they exit with an error, but remove them from the conditions for merging a PR. I believe SonarCloud functions similarly: we see the results, but a failed SonarCloud gate doesn't block a PR.

See also this very long thread about UI testing on Slack #stripes, in particular my comment at the end:
|
| Comment by Ann-Marie Breaux (Inactive) [ 04/Dec/19 ] |
|
Zak Burke Anton Emelianov Jakub Skoczen Charles Lowell How do we resolve this today, so that we can get our UI app released as required? |
| Comment by Taras Tkachenko [ 04/Dec/19 ] |
|
Hi Ann-Marie Breaux Zak Burke Anton Emelianov Jakub Skoczen Charles Lowell if there are better options, please suggest them and we will try to implement them after this release. |
| Comment by Ann-Marie Breaux (Inactive) [ 04/Dec/19 ] |
|
OK - thanks Taras Tkachenko - and just so I'm clear - turning off the tests in the Jenkinsfile only affects ui-data-import - no other apps, right? |
| Comment by Taras Tkachenko [ 04/Dec/19 ] |
|
Ann-Marie Breaux yes. |
| Comment by Ann-Marie Breaux (Inactive) [ 04/Dec/19 ] |
|
OK - good to know, Taras Tkachenko. Please let me know once the most recent UI changes are live on folio-snapshot, and I'll try to get the stories tested/closed ASAP. |
| Comment by Zak Burke [ 06/Dec/19 ] |
|
Notably, reverting from @bigtest/interactor v0.9.3 down to v0.7.2, at the suggestion of Viktor Soroka and Igor Godlevskyi in a conversation on #releases, resolved the timeout whack-a-mole problems we were having in ui-users. This can be seen in the logs from three subsequent runs of PR #1076: run-4, run-5, run-6. A few things are worthy of note here:
|
| Comment by Aliaksei Chumakou [ 09/Dec/19 ] |
|
Zak Burke FYI - I've tried this for ui-orders, no luck... https://github.com/folio-org/ui-orders/pull/695 |
| Comment by Craig McNally [ 16/Dec/19 ] |
|
Anton Emelianov Zak Burke Aliaksei Chumakou Hey guys, I was curious what the outcome was on this... Did we really end up disabling unit tests for some/all ui modules? That seems dangerous to me. I personally think it would be better to do something like what Zak proposed; send the test failures to the PR comments but allow the PR to be merged. |
| Comment by Zak Burke [ 31/Dec/19 ] |
|
We stopped running unit tests in CI for ui-inventory by updating the Jenkinsfile for that repository only. This is certainly more dangerous than running them on every PR, and we hope to turn them back on once things stabilize. Here's the rub, though: we have ~200 unit tests. Say the random failure rate is 1/1000. We run the tests three times on every PR (twice on the branch before merging, and again as part of the merge to master). That means ~600 test executions for each PR, so the chance of at least one spurious failure per PR is 1 - 0.999^600, roughly 45%. And if that random failure rate is constant, the more unit tests we add, the higher that effective failure rate goes. |
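The arithmetic above can be checked directly: with a per-test failure probability p and n test executions per PR, the chance of at least one failure is 1 - (1 - p)^n.

```javascript
// Back-of-the-envelope check of the effective failure rate per PR.
function effectiveFailureRate(perTestFailureRate, testExecutions) {
  return 1 - Math.pow(1 - perTestFailureRate, testExecutions);
}

// ~200 tests x 3 CI runs = 600 executions per PR, at p = 1/1000:
const rate = effectiveFailureRate(0.001, 600);
console.log((rate * 100).toFixed(1) + '%'); // 45.1%
```

Even at a one-in-a-thousand flake rate, nearly half of all PRs see a spurious red build, which explains why the team felt forced to disable the tests.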
| Comment by Zak Burke [ 03/Jan/20 ] |
|
More fun:
The structure of this block is:

beforeEach: set up the UI
test for "restricted"
test for "inactive"
test for "warning"

IOW, the before-each block that failed for the "inactive" test already succeeded once, because we see the "restricted" test passing. Ugh. |
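A tiny stand-in runner (hypothetical names, not mocha/BigTest source) illustrates why that before-each block can succeed for one test and then fail for the next: the hook re-runs before every test, so each test gets a fresh, independently timed setup.

```javascript
// Minimal imitation of a describe block's hook scheduling.
const hooks = [];
const tests = [];
function beforeEach(fn) { hooks.push(fn); }
function it(name, fn) { tests.push({ name, fn }); }

let setupRuns = 0;
beforeEach(() => { setupRuns += 1; /* reload the whole app here */ });
it('test for "restricted"', () => {});
it('test for "inactive"', () => {});
it('test for "warning"', () => {});

for (const t of tests) {
  hooks.forEach((hook) => hook()); // setup re-runs before EVERY test
  t.fn();
}
console.log(setupRuns); // 3
```

Three tests means three separate app reloads, and each reload is a separate chance to exceed the convergence timeout.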
| Comment by Viktor Soroka [ 07/Jan/20 ] |
|
Is there a difference between the environment setup for continuous-integration/jenkins/branch vs continuous-integration/jenkins/pr-merge when it comes to pull requests? As I see it, most of the time the tests run OK for pr-merge but fail for the branch. John Malconian, Ian Hardy, Jakub Skoczen |
| Comment by Anton Emelianov (Inactive) [ 07/Jan/20 ] |
|
Also, is there a difference in how bundles are being built for pr-merge and branch? |
| Comment by John Malconian [ 13/Jan/20 ] |
|
No environmental differences come to mind - but a branch build is just the branch, while a PR build is the branch merged with master. |
| Comment by Viktor Soroka [ 14/Jan/20 ] |
|
Is it possible to increase the memory for the containers for these jobs? I have a hunch that it can improve things. |
| Comment by Zak Burke [ 16/Jan/20 ] |
|
Memory may indeed be the limiting factor here, but perhaps because we have a leak in the way our tests run.

Side note: I had to turn off source-maps in stripes-core to get the Chrome profiler to run; otherwise it would get stuck in initialization. h/t https://github.com/electron/electron/issues/17772 |
| Comment by Zak Burke [ 16/Jan/20 ] |
|
Following Michal Kuklis’s hunch, I did some additional memory profiling of Chrome/Karma during unit test runs and found memory does indeed balloon over the course of testing from ~300MB to more than 3GB. In regular use, a Folio tab is stable between ~250MB-~500MB. That smells like pollution of the redux store; still investigating. |
| Comment by Zak Burke [ 17/Jan/20 ] |
|
The left frame in this screenshot shows a memory heap snapshot after running one of the scripts in the test suite one time; the right frame shows a snapshot after running the same test in the same window three times in a row. So, clearly there's a memory leak here. |
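The heap-snapshot comparison above is consistent with a simple pattern: something the tests mount holds references that are never released. A minimal, self-contained sketch of that pattern (all names here are illustrative, not from the actual suite):

```javascript
// Module-level state standing in for a polluted redux store: it is
// never torn down, so every repeat of the same test grows the heap.
const leakedStore = []; // never reset between runs

function runTestOnce() {
  // Each "run" mounts the app and dispatches actions whose payloads
  // accumulate in state that outlives the test.
  for (let i = 0; i < 1000; i++) {
    leakedStore.push({ type: 'FETCH_SUCCESS', payload: new Array(100).fill(i) });
  }
}

function heapUsedMB() {
  return process.memoryUsage().heapUsed / (1024 * 1024);
}

const before = heapUsedMB();
for (let run = 0; run < 3; run++) runTestOnce(); // same test, three times
const after = heapUsedMB();
console.log(`heap grew by ~${(after - before).toFixed(1)} MB across 3 runs`);
```

This is why the screenshot's "same test, three times" comparison is the telling one: with a proper teardown the three snapshots would be roughly flat, while a retained store grows linearly with the number of runs.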
| Comment by Cate Boerema (Inactive) [ 25/Feb/20 ] |
|
Zak Burke I just learned of this bug from Anton Emelianov. I had no idea it was something you were supposed to be working on. Should we put it in the current sprint? |
| Comment by Zak Burke [ 25/Feb/20 ] |
|
Cate Boerema, it's a long story. This is essentially a follow-up to problems first identified in
The first three are resolved and we are once again able to successfully merge PRs in ui-users. I haven't turned unit tests back on for ui-inventory but that would be worthwhile to attempt at this point. |
| Comment by Cate Boerema (Inactive) [ 25/Feb/20 ] |
|
Thanks, Zak Burke. Should we create a story for turning the unit tests back on in Inventory and put it in this sprint? What about
|
| Comment by Zak Burke [ 25/Feb/20 ] |
|
I filed
I haven't dug into
|
| Comment by Cate Boerema (Inactive) [ 05/Mar/20 ] |
|
Hi Zak Burke can you please put some points on this issue so we can count it towards our bugfixing quota? |
| Comment by Zak Burke [ 09/Mar/20 ] |
|
I'm closing this for now, given that it has effectively become an umbrella for a bunch of separate memory-leak tickets, and OOM errors appear to be the proximate cause of BigTest throwing a "timeout exceeded" error. |