[FOLIO-2097] Unexpected "timeout exceeded" fails in BigTest unit tests Created: 12/Jun/19  Updated: 03/Jun/20  Resolved: 09/Mar/20

Status: Closed
Project: FOLIO
Components: None
Affects versions: None
Fix versions: None

Type: Bug Priority: P2
Reporter: maksym_dryha Assignee: Zak Burke
Resolution: Done Votes: 0
Labels: bigtest, ui-only
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original estimate: Not Specified

Attachments: PNG File Screen Shot 2019-06-12 at 11.02.08 AM.png     PNG File Screen Shot 2019-12-06 at 10.08.57 AM (2).png     PNG File Screen Shot 2020-01-15 at 9.59.00 PM.png     PNG File Screen Shot 2020-01-17 at 7.36.42 AM.png     Text File pr-1076-4.txt     Text File pr-1076-5.txt     Text File pr-1076-6.txt    
Issue links:
Blocks
blocks UIDATIMP-335 Release module Closed
blocks UIOR-334 TECH-DEBT to cover related invoices b... Closed
Duplicate
is duplicated by FOLIO-2360 SPIKE: unit test results are inconsis... Closed
Relates
relates to UIIN-1005 Turn on testing in Jenkinsfile and re... Closed
relates to UIU-1567 restore unit tests in CI Closed
relates to FOLIO-2433 in dev-mode, react has a memory leak Closed
relates to STCOM-636 mousetrap, a transitive dep of <HotKe... Closed
relates to STCOM-637 Remove class objects, refs from state... Closed
relates to UIEH-813 eHoldings tests are randomly fail on CI Closed
relates to UIU-1450 memory leak in withDeclareLost or wit... Closed
relates to UIU-1566 disable unit tests in CI Closed
relates to UIU-1457 restore CommandList and HasCommand Closed
Sprint: Core: F - Sprint 83, Core: F - Sprint 84
Story Points: 13
Development Team: Prokopovych

 Description   

From time to time, a properly written test for correctly working code fails because

convergent assertion was successful, but exceeded the 2000ms timeout


Sometimes it can take up to 5-10 re-runs (with no changes to the tests or the tested code) to make the tests pass. I haven't noticed any regularity, nor any way to figure out which tests are vulnerable to this kind of failure, or when.



 Comments   
Comment by Aliaksei Chumakou [ 16/Jul/19 ]

At Acquisitions we have the same thing:
https://github.com/folio-org/ui-organizations/pull/86
https://github.com/folio-org/ui-orders/pull/519

FAILED TESTS:
  Organization details
    ✖ note is displayed
      Chrome 71.0.3578 (Linux 0.0.0)
    Error: convergent assertion was successful, but exceeded the 2000ms timeout
        at loop (test/bigtest/index.js:8405:17)
Comment by Oleksandr Yehorov [ 16/Jul/19 ]

Same thing in ui-data-import module

FAILED TESTS:
  when match profile is edited and there is no associated job profiles
    ✖ match profile details renders updated match profile
      Chrome 71.0.3578 (Linux 0.0.0)
    Error: convergent assertion was successful, but exceeded the 2000ms timeout
        at loop (test/bigtest/index.js:8198:17)
Comment by Anton Emelianov (Inactive) [ 16/Jul/19 ]

Oleksandr Yehorov, Aliaksei Chumakou, maksym_dryha

According to Zak Burke:
Possibly related: We had a stripes-connect bug that slowed down UI interaction across the board (https://folio-org.atlassian.net/browse/STCON-85). Do you have a tip-of-master build of stripes-connect in the environment where you are running these tests? The last npm-folio release, v5.2.1 was OK, but npm-folioci builds from v5.2.300080 through v5.2.300084 are broken.

That being said, did you observe the reported problem when the stripes-connect version was v5.2.1? Or was it with a stripes-connect version that can cause slow response times?

Comment by Zak Burke [ 16/Jul/19 ]

For the record, I've seen these "convergence successful but it exceeded timeout" BigTest failures too. I'd love to have an easy option to set the default timeout to something more lenient, and to convert them from errors to warnings.

Comment by Aliaksei Chumakou [ 17/Jul/19 ]

Anton Emelianov We've seen such issues before, but recently they have appeared very frequently; it might be connected with the stripes-connect fix Zak described above.

Comment by Aliaksei Chumakou [ 20/Nov/19 ]

Anton Emelianov we started to see this again. Basically, here is a suggestion to disable tests so we can merge PRs and build artifacts from the master branch - https://folio-project.slack.com/archives/C58TABALV/p1574189630038800
You can observe how frequently master-branch builds have failed: https://jenkins-aws.indexdata.com/job/folio-org/job/ui-orders/job/master/
Basically, I think we need something more stable than BigTest, and I suggest we give Jest and the React Testing Library a try - who knows how to test framework code (React) better than its creators (Facebook), right? We can do this as a PoC on some repo from Acquisitions and take a look at the results. What do you think?

Comment by Anton Emelianov (Inactive) [ 20/Nov/19 ]

Aliaksei Chumakou Every test framework has its pros and cons. I don't see how this issue should cause us to throw away BigTest. We had similar situations before, when developers complained that BigTest couldn't run on Windows, for example, and we were able to resolve it. We should try to resolve this failure as well, and I can ask Jeffrey Cherewaty to look into it. It will be much cheaper to fix this particular problem, and cause much less disturbance, than swapping out the test framework.

Let's disable this test for now so that you can push the PR through.

Comment by Aliaksei Chumakou [ 20/Nov/19 ]

Anton Emelianov I'm not talking about "throwing out" BigTest. I just proposed investigating whether we can get a more stable and productive test tool for the UI, and suggested investigating this inside one of the Acquisitions modules.
You're talking about disabling the test. But it's not one specific test; they fail randomly, as mentioned in the ticket's description. And the ticket has been around for about half a year now. I see that BigTest relies on timeouts, as did Nightmare before it. And it makes CI fail in unpredictable ways, as we can see.
My goal is to make a PoC with testing tools that have wider community support (Jest, RTL), which could allow us to write and run unit tests in a more convenient way and be assured that the code is covered and not broken. It's not necessarily a replacement for BigTest.

Comment by Anton Emelianov (Inactive) [ 20/Nov/19 ]

Aliaksei Chumakou,

1) BigTest is being used for the UI only. I am very clear about what you're proposing and I cannot agree to it.
2) My understanding was that there is a single place where one test times out. If multiple tests are failing randomly, that is a much more serious problem that needs to be addressed.
3) BigTest doesn't rely on timeouts. That was one of the reasons why BigTest was selected. If someone put a timeout into a test, that's a different issue. I would appreciate it if you could look into it and see why the timeout was implemented there.
4) CI should be predictable, but replacing the test framework is a nuclear option that we cannot afford now; resources should be spent on the other tasks in the backlog.

That being said, I will escalate this issue to Frontsite. I understand that the current situation is not acceptable and needs to be resolved.

Comment by Aliaksei Chumakou [ 20/Nov/19 ]

Anton Emelianov Thanks for escalating! Let's hope it can be resolved. However, I have minor additions to the items you listed:
2) It's not a single place; this ticket has always been about the randomly failing convergent assertion was successful, but exceeded the 2000ms timeout
3) According to the ticket's description, there is a 2000ms timeout during which BigTest waits for the test condition. If the machine running the tests is slow and opening the page takes longer than 2000ms, it will fail with the message above. Am I wrong in saying that BigTest relies on a timeout?
4) I agree; that's why I'm not asking to put any significant effort into the investigation or PoC. It's just us, Acquisitions-UI, putting in some effort in our spare time, just to make the FOLIO world better. Again, we are not talking about a replacement; I've never said that. My whole concern is to avoid tickets like this - where tests fail without any obvious reason and we have to DISABLE them on CI just to resume building artifacts on the master branch.
Thanks again for the fast responses!
Thanks again for the fast responses!

Comment by maksym_dryha [ 20/Nov/19 ]

The Spitfire team is also facing problems with tests failing for no good reason. We also had to disable them all on CI.

Comment by Ann-Marie Breaux (Inactive) [ 21/Nov/19 ]

Hi all - should the title of this issue include BigTest? Is the issue all centered around BigTest, or is it broader than that? I see that Folijet had issues this past summer, but I'm not sure whether there are issues going on now with Folijet. Adding a couple dev watchers in case they want to mention anything.

Comment by Charles Lowell [ 26/Nov/19 ]

Do we have links to the CI runs that are failing?

Comment by Aliaksei Chumakou [ 26/Nov/19 ]

Charles Lowell https://jenkins-aws.indexdata.com/job/folio-org/job/ui-orders/job/master/575/console
https://jenkins-aws.indexdata.com/job/folio-org/job/ui-orders/job/master/569/console
https://jenkins-aws.indexdata.com/job/folio-org/job/ui-orders/job/master/568/console
https://jenkins-aws.indexdata.com/job/folio-org/job/ui-orders/job/master/566/console
and below

Comment by Taras Tkachenko [ 03/Dec/19 ]

Hi guys.
I just wonder if we can change the severity of this case (timeout exceeded) from ERROR to WARNING so that valid tests can pass?
It could also be a temporary solution while the permanent one is in progress; however, it would save a lot of time for us.
Another idea is to make the default timeout configurable from npm/yarn test-running scripts or Jenkinsfile options, so it can be increased from 2000ms to whatever is needed:

buildNPM {
  publishModDescriptor = 'yes'
  runLint = 'yes'
  runSonarqube = true
  runTest = 'yes'
  runTestOptions = '--karma.singleRun --karma.browsers ChromeDocker --karma.reporters mocha junit --coverage --bigtest.defaultTimeout 4000'
}

Aliaksei Chumakou Zak Burke
What do you think?

Comment by Ann-Marie Breaux (Inactive) [ 03/Dec/19 ]

Hi Anton Emelianov Zak Burke Charles Lowell

Would it be possible to make a decision on possibly changing the severity, or some other way to resolve this ASAP? This is preventing some of Folijet's key UI work from showing on the hosted-ref-environments, and thus blocking our release tomorrow. Thank you!

Comment by Zak Burke [ 03/Dec/19 ]

In short, it's complicated. Some food for thought:

The two-second timeout is hard-coded into BigTest (ugh) and while it is possible to override it, doing so is not a great option because the timeouts are often inconsistent: tests 1, 3, and 5 will fail on one build and succeed on the next, but then test 2 will fail. That basically leaves us with adding a manual timeout to every single test.

Additionally, the way timeouts are handled in BigTest makes it impossible for us to change the severity and convert the timeouts from errors to warnings. An exception is an exception and BigTest does not offer a way to filter them.

On top of that, we frequently see errors like Error: convergence exceeded the 2000ms timeout in our before/beforeEach blocks from BigTest's assertion handler which are distinct from Error: convergent assertion was successful, but exceeded the 2000ms timeout timeouts in the tests themselves. These are particularly frustrating as they don't indicate an actual test timeout but rather a test setup timeout. This stinks because it is basically a complaint that "the app took too long to load" but the thing is, in real life, we're not going to reload the entire app on every click. To be clear, it is important to reload the entire app before each test in order to make sure there are no side effects from one test to another; but it's not fair/not realistic to expect a test to accommodate that.

Finally, we've also seen BigTest continue to time out in assertions (i.e. during test setup) even with really long timeouts. E.g., as we pushed from 2000 to 4000 to 10000 ms, we got the same Error: convergent assertion was successful, but exceeded the 10000ms timeout failure, which seriously made me wonder if it was telling the truth. I don't recall seeing BigTest pause for 10 seconds on any test, but I don't have any logs at my fingertips to support this claim.

Comment by Taras Tkachenko [ 03/Dec/19 ]

Thanks Zak Burke for explaining this all.
So is switching CI tests off in Jenkinsfile the only solution for this?

Comment by Zak Burke [ 03/Dec/19 ]

I don't have a good solution. Increasing the default timeout may help, but it's not a slam dunk. As noted above, longer timeouts sometimes led to the exact same outcome: a successful assertion that still exceeds the timeout window.

Maybe a short-term option is to run the tests the way we do lint: capture the output to a PR comment if they exit with an error, but remove them from the conditions for merging a PR. I believe SonarCloud functions similarly – we see the results, but a failed SonarCloud gate doesn't block a PR.

See also this very long thread about UI testing on Slack #stripes, in particular my comment at the end:

The default of 2000ms seems too short in some circumstances. But one thing I have noticed is that most timeouts are not in tests but in the before blocks while awaiting some selector. In that case, my strategy is to await more things to buy time. Imagine calling await Z while somebody sings the A-B-Cs. It'd time out. But if you await A; await B; await C… then you would succeed. It's definitely hacking the system a bit, but it has worked well in some cases where you can await #some-section; await #some-form; await #some-element… to buy time.
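To make the "sing the A-B-Cs" analogy concrete, here is a runnable toy model of that trick. `converge` and `ready` are hypothetical stand-ins for BigTest's convergence loop, not its real API, and the 900/1800/2700ms render times are invented to simulate a slow app boot; the point is only that each awaited convergence gets its own fresh timeout window:

```javascript
// Toy model of the "await more things" trick. `converge` and `ready` are
// hypothetical stand-ins for BigTest's convergence loop, not its real API;
// the 900/1800/2700ms render times simulate a slow app boot.
function converge(assertion, timeout = 2000) {
  const start = Date.now();
  for (;;) {
    try {
      assertion();               // passes -> converged
      return Date.now() - start; // report how long convergence took
    } catch (err) {
      if (Date.now() - start >= timeout) {
        throw new Error(`convergence exceeded the ${timeout}ms timeout`);
      }
      // real BigTest yields to the event loop here; we simply poll again
    }
  }
}

let bootStart = Date.now();
// an element "renders" once enough simulated boot time has elapsed
const ready = (selector, renderAtMs) => () => {
  if (Date.now() - bootStart < renderAtMs) {
    throw new Error(`${selector} not rendered yet`);
  }
};

// A single convergence on the last element blows the 2000ms budget,
// because it only renders 2700ms into boot:
let singleWaitFailed = false;
try {
  converge(ready('#some-element', 2700));
} catch (err) {
  singleWaitFailed = true;
}

// Restart the simulated boot and converge on each milestone in turn: every
// convergence gets its own fresh 2000ms window, so none individually times
// out even though the total wait is ~2700ms.
bootStart = Date.now();
converge(ready('#some-section', 900));
converge(ready('#some-form', 1800));
converge(ready('#some-element', 2700));
const chainedWaitsConverged = true;
console.log('single wait failed:', singleWaitFailed,
  '| chained waits converged:', chainedWaitsConverged);
```

The busy-wait and the fixed render times are simplifications; the real convergence loop is asynchronous, which is exactly why the real-world timings are so unpredictable.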

Comment by Ann-Marie Breaux (Inactive) [ 04/Dec/19 ]

Zak Burke Anton Emelianov Jakub Skoczen Charles Lowell How do we resolve this today, so that we can get our UI app released as required?

Comment by Taras Tkachenko [ 04/Dec/19 ]

Hi Ann-Marie Breaux
With all this said, I'm afraid the only option for now is to switch off the tests on CI via the Jenkinsfile in the project root.
After the release I will create a technical-debt story to try to resolve this via the "multiple setup awaits" method suggested by Zak Burke in the post earlier.
Until that story is resolved, we will keep running the tests locally before each push, attaching a screenshot of the final test summary and/or the test log in the PR comments.

Zak Burke Anton Emelianov Jakub Skoczen Charles Lowell if there are better options please suggest and we will try to implement them after this release.

Comment by Ann-Marie Breaux (Inactive) [ 04/Dec/19 ]

OK - thanks Taras Tkachenko - and just so I'm clear - turning off the tests in Jenkinsfile only affects UI-data-import - no other apps, right?

Comment by Taras Tkachenko [ 04/Dec/19 ]

Ann-Marie Breaux yes.
It affects only the tests run on Jenkins for ui-data-import, and they can be turned back on at any time once the problem is resolved.

Comment by Ann-Marie Breaux (Inactive) [ 04/Dec/19 ]

OK - good to know, Taras Tkachenko. Please let me know once the most recent UI changes are live on folio-snapshot, and I'll try to get the stories tested/closed ASAP.

Comment by Zak Burke [ 06/Dec/19 ]

Notably, reverting from @bigtest/interactor v0.9.3 down to v0.7.2, at the suggestion of Viktor Soroka and Igor Godlevskyi in a conversation on #releases, resolved the timeout whack-a-mole problems we were having in ui-users. This can be seen in logs from three subsequent runs of PR #1076: run-4, run-5, run-6.

A few things are worthy of note here:

  1. there are no commits between runs 4, 5, and 6 yet different tests fail on each run
  2. the only commit between run 3 (successful) and run 4 (failure) was to the CHANGELOG (see attached: Screen Shot 2019-12-06 at 10.08.57 AM (2).png)
Comment by Aliaksei Chumakou [ 09/Dec/19 ]

Zak Burke FYI - I've tried this for ui-orders, no luck... https://github.com/folio-org/ui-orders/pull/695

Comment by Craig McNally [ 16/Dec/19 ]

Anton Emelianov Zak Burke Aliaksei Chumakou Hey guys,

I was curious what the outcome was on this... Did we really end up disabling unit tests for some/all ui modules? That seems dangerous to me. I personally think it would be better to do something like what Zak proposed; send the test failures to the PR comments but allow the PR to be merged.

Comment by Zak Burke [ 31/Dec/19 ]

We stopped running unit tests in CI for ui-inventory by updating the Jenkinsfile for that repository only. This is certainly more dangerous than running them on every PR and we hope to turn them back on once we get things to stabilize. Here's the rub though: we have ~200 unit tests. Say the random failure rate is 1/1000. We run the tests three times on every PR (twice on the branch before merging, and again as part of the merge to master). That means we run ~600 tests for each PR, which means an effective failure rate >50%. And if that random failure rate is constant, the more unit tests we add the higher that effective failure rate goes.
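A back-of-envelope check of that arithmetic, under the assumption that each of the ~600 test executions flakes independently at the stated 1/1000 rate (the independence model and the flake rate are both assumptions taken from the comment's own numbers):

```javascript
// Back-of-envelope check of the compounding-failure argument above,
// using the numbers cited in the comment and assuming independent flakes.
const flakeRate = 1 / 1000;   // assumed random failure rate per test run
const testsPerSuite = 200;
const suiteRunsPerPR = 3;     // twice on the branch, once on merge to master
const runsPerPR = testsPerSuite * suiteRunsPerPR; // 600 test executions

// P(at least one flake somewhere in a PR) = 1 - (1 - p)^n
const prFlakeRate = 1 - Math.pow(1 - flakeRate, runsPerPR);
console.log(runsPerPR, prFlakeRate.toFixed(2)); // 600 "0.45"
```

The independent-events figure comes out around 45% per PR, in the same ballpark as the ">50%" cited; either way, roughly every other PR would hit at least one spurious failure, and the rate climbs as more tests are added.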

Comment by Zak Burke [ 03/Jan/20 ]

More fun:

User Edit: Proxy/Sponsor
...
Expire the relationship
✔ relationship status should be be restricted
✖ "before each" hook for "relationship status should be inactive"
Chrome 78.0.3904 (Linux 0.0.0) User Edit: Proxy/Sponsor "before each" hook for "relationship status should be inactive" FAILED
Error: convergent assertion was successful, but exceeded the 5000ms timeout
at loop (test/bigtest/index.js:610:17)

The structure of this block is

beforeEach setup the UI
  test for "restricted"
  test for "inactive"
  test for "warning"

IOW, that before-each block that failed for the "inactive" test already succeeded once because we see the "restricted" test passing. Ugh.
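A minimal stand-in (plain JS, not real Mocha/BigTest) for the structure above, showing why a hook that already passed for one test can still time out for the next: the hook body re-runs from scratch before every test, so each test rolls the setup-timing dice again.

```javascript
// Plain-JS sketch of per-test hook execution; names are illustrative.
const setupRuns = [];
function beforeEachHook() {
  // the full UI setup (and its timeout window) would happen here
  setupRuns.push(setupRuns.length + 1);
}

for (const name of ['restricted', 'inactive', 'warning']) {
  beforeEachHook(); // independent setup run, independent chance to time out
  // ...assertions for `name` would run here...
}
console.log(setupRuns.length); // 3 - one fresh setup per test
```

So the "inactive" failure doesn't contradict the earlier "restricted" success; it's a third independent roll of the same slow setup.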

Comment by Viktor Soroka [ 07/Jan/20 ]

Is there a difference between environment setup for continuous-integration/jenkins/branch vs continuous-integration/jenkins/pr-merge when it comes to the pull request? As I see, most of the time tests run ok for pr-merge but fail for the branch. John Malconian, Ian Hardy, Jakub Skoczen

Comment by Anton Emelianov (Inactive) [ 07/Jan/20 ]

Also, is there a difference in how bundles are being built for pr-merge and branch?

Comment by John Malconian [ 13/Jan/20 ]

No environmental differences come to mind - but a branch build is just the branch, while a PR build is the branch merged with master.

Comment by Viktor Soroka [ 14/Jan/20 ]

Is it possible to increase the memory for the containers for these jobs? I have a hunch that it can improve things.

Comment by Zak Burke [ 16/Jan/20 ]

Memory may indeed be the limiting factor here, but perhaps because we have a leak in the way our tests run.

side note: I had to turn off source-maps in stripes-core to get the Chrome profiler to run; otherwise it would get stuck in initialization. h/t https://github.com/electron/electron/issues/17772

Comment by Zak Burke [ 16/Jan/20 ]

Following Michal Kuklis’s hunch, I did some additional memory profiling of Chrome/Karma during unit test runs and found memory does indeed balloon over the course of testing from ~300MB to more than 3GB. In regular use, a Folio tab is stable between ~250MB-~500MB. That smells like pollution of the redux store; still investigating.

Comment by Zak Burke [ 17/Jan/20 ]

The left frame in this screenshot shows a memory heap snapshot after running one of the scripts in the test suite one time; the right frame shows a snapshot after running the same test in the same window three times in a row. So, clearly there's a memory leak here.

Comment by Cate Boerema (Inactive) [ 25/Feb/20 ]

Zak Burke I just learned of this bug from Anton Emelianov. I had no idea it was something you were supposed to be working on. Should we put it in the current sprint?

Comment by Zak Burke [ 25/Feb/20 ]

Cate Boerema, it's a long story. This is essentially a follow-up to problems first identified in FOLIO-2360 Closed . It's a bit of an umbrella issue that covers the problems we've had running unit tests on pull requests. We've identified several separate memory leaks that all manifest in the same way, i.e. failing unit tests:

The first three are resolved and we are once again able to successfully merge PRs in ui-users. I haven't turned unit tests back on for ui-inventory but that would be worthwhile to attempt at this point.

Comment by Cate Boerema (Inactive) [ 25/Feb/20 ]

Thanks, Zak Burke. Should we create a story for turning the unit tests back on in Inventory and put it in this sprint?

What about FOLIO-2433 Closed ?

Comment by Zak Burke [ 25/Feb/20 ]

I filed UIIN-1005 Closed to turn unit tests on in CI for UIIN and assigned it to this team but not this sprint, at least not yet. 20 tests fail when I run them locally.

I haven't dug into FOLIO-2433 Closed with John Malconian, but I will probably need some assistance from him and/or Ryan Berger to get the CI environment and our webpack configs all in sync. It would be great to work on this this quarter if you think we can get it in.

Comment by Cate Boerema (Inactive) [ 05/Mar/20 ]

Hi Zak Burke can you please put some points on this issue so we can count it towards our bugfixing quota?

Comment by Zak Burke [ 09/Mar/20 ]

I'm closing this for now given that it's effectively become an umbrella for a bunch of separate memory-leak tickets and OOM errors appear to be the proximate cause for BigTest throwing a "timeout exceeded" error.

Generated at Thu Feb 08 23:18:11 UTC 2024 using Jira 1001.0.0-SNAPSHOT#100246-sha1:7a5c50119eb0633d306e14180817ddef5e80c75d.