[FOLIO-974] proxyRequestResponse failure Created: 05/Dec/17 Updated: 12/Nov/18 Resolved: 17/Jan/18 |
|
| Status: | Closed |
| Project: | FOLIO |
| Components: | None |
| Affects versions: | None |
| Fix versions: | None |
| Type: | Bug | Priority: | P2 |
| Reporter: | Mike Taylor | Assignee: | shale99 |
| Resolution: | Done | Votes: | 0 |
| Labels: | sprint28 | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | 1 day, 1 hour | ||
| Original estimate: | Not Specified | ||
| Attachments: | |
| Issue links: | |
| Sprint: | |
| Description |
|
I am seeing this kind of thing quite a bit:

GET http://localhost:9130/codex-instances?limit=30&query=cql.allRecords%3D1%20sortby%20title
500 Internal Server Error
proxyRequestResponse failure: http://10.0.2.15:9146: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /10.0.2.15:9146 Connection refused: /10.0.2.15:9146

Here's how it manifests. First, a request to mod-codex-mock will hang for a long time – 100 seconds or so – then return with the failure as documented above. Subsequent to this, any further requests are rejected with the same diagnostic, but immediately. |
| Comments |
| Comment by Heikki Levanto [ 07/Dec/17 ] |
|
I tried to reproduce this by running a tight loop of curl requests (get one, list all, the query from above), as fast as curl could output to the screen, in four console windows in parallel, for thousands of iterations. Nothing crashed, and I did not see any 500 results. This was on a local installation, on an Okapi with just this one module loaded. Your setup is more complex, and I don't have the Ansible setup on my laptop.

Is there any way you could look in the logs and verify that port 9146 is indeed used by mod-codex-mock, and not some other module (you have at least mod-authtoken involved in your stack)?

I would like to know more about when and how this happens. Does it happen at the first request, or after some time? Can you get to the logs and see, around the first 500 error that hangs, whether there is anything suspicious, especially near the beginning of the request? If you can identify the request id, could you grep the logs for it and paste the result here? That should tell us a bit more. |
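The reproduction attempt above can be sketched as a shell loop. The URL is taken from the request in the description; the tenant header and iteration count are assumptions, and several copies would be run in parallel to approximate the four-console setup:

```shell
# Hammer the endpoint and report any non-200 response.
# The X-Okapi-Tenant value 'diku' is an assumption, not from the report.
url='http://localhost:9130/codex-instances?limit=30&query=cql.allRecords%3D1%20sortby%20title'
for i in $(seq 1 1000); do
  code=$(curl -s -o /dev/null -w '%{http_code}' -H 'X-Okapi-Tenant: diku' "$url")
  [ "$code" = "200" ] || echo "iteration $i: HTTP $code"
done
```

With no server listening, curl reports status `000` on every iteration; against a live Okapi, only failing iterations are printed.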
| Comment by Heikki Levanto [ 07/Dec/17 ] |
|
I filed
|
| Comment by Mike Taylor [ 07/Dec/17 ] |
|
Thanks for looking into this. It doesn't happen on the first request, but randomly at some later time. At present, we have a plague of 500s, not only in mod-codex-mock. They seem to appear at random, and as soon as a module emits one of them, it keeps doing so. Wayne Schneider and shale99 think it's an out-of-memory problem in the VM configuration. They're looking into it. I think you can put this on hold until they've confirmed or denied. |
| Comment by Heikki Levanto [ 07/Dec/17 ] |
|
I tried to kill a module and provoke Okapi into reporting this error, mostly to see OKAPI-476 at work. It did so, but the report came immediately, with no long timeout. That is what happens when the module is not running. I wonder what happens in the first failing request, the one that takes so long. Probably the module accepts the request and then crashes hard: so hard that it stops listening at all. |
| Comment by Heikki Levanto [ 07/Dec/17 ] |
|
Ok, I put this on hold for today. If need be, I can look at it again tomorrow - but then I will be away for a week. At least I got the excuse to improve Okapi error reporting. |
| Comment by shale99 [ 07/Dec/17 ] |
|
Mike Taylor - just to make sure I understand: once you hit the codex and at some point start getting 500s, from that point on do all modules in the VM also throw 500s, or do they keep working properly? |
| Comment by Mike Taylor [ 07/Dec/17 ] |
|
Once any given module gives a 500 once, it will keep doing so forever. Other modules may be unaffected at that time. |
| Comment by Adam Dickmeiss [ 08/Dec/17 ] |
|
Mike Taylor .. It would be helpful if you could dig out okapi.log. If you run the Ansible box locally you can do "vagrant ssh" and fetch /var/log/folio/okapi/okapi.log. |
| Comment by Mike Taylor [ 08/Dec/17 ] |
|
Sorry, it's long gone: I've got into the habit now of almost reflexively doing a vagrant destroy -f && vagrant up once I start seeing 500s. I will try to remember not to do this the next time it happens, until I've pulled out the logs first. (Wayne Schneider: it occurs to me – can Ansible/Vagrant be set up so that it runs a custom script when you destroy a VM, and could we have it copy the Okapi log somewhere well-known on the host system, so that we have it available if we want it subsequently?) |
| Comment by shale99 [ 09/Dec/17 ] |
|
Would it be possible to get /var/log/messages after the 500s start appearing for a specific module? |
| Comment by Mike Taylor [ 09/Dec/17 ] |
|
Yep, I'll try to get everything the next time this happens. – Mike. |
| Comment by Zak Burke [ 11/Dec/17 ] |
|
I've seen the same symptoms as Mike Taylor; once a 500 hits, it's all 500s thereafter. Bumping the RAM for the VM from the default to 8192 MB has mitigated the problem. Here's a snip of the log from last week, from folio/testing-backend 5.0.0-20171205.272: 20171206_var-log-folio-okapi-okapi.log |
| Comment by Zak Burke [ 11/Dec/17 ] |
|
Lending credence to the out-of-memory theory:
$ vagrant ssh
The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.
Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
Last login: Tue Dec 5 16:55:59 2017 from 10.0.2.2
-bash: fork: Cannot allocate memory
vagrant@contrib-jessie:~$ sudo su -
-bash: fork: Cannot allocate memory
vagrant@contrib-jessie:~$ logout
|
| Comment by Adam Dickmeiss [ 11/Dec/17 ] |
|
See attached okapi.log, messages.log and inventory-storage.log from Niels Erik Nielsen's VM. At Dec 11 10:20:01 the VM OOM killer goes into action. It destroys PID 2045. After this point inventory-storage-docker.log is silent and Okapi starts to return 500s for requests to it. |
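OOM-killer activity of the kind described above can be confirmed straight from the kernel log on the affected VM. A minimal sketch; the exact message wording varies by kernel version, and dmesg may need root on some systems:

```shell
# Print kernel OOM-killer events, or a note if none are recorded.
dmesg 2>/dev/null | grep -iE 'out of memory|killed process' \
  || echo "no OOM-killer events found in dmesg"
```

On the VM in question this should show the entry around Dec 11 10:20:01 naming PID 2045.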
| Comment by Adam Dickmeiss [ 11/Dec/17 ] |
|
See
|
| Comment by shale99 [ 11/Dec/17 ] |
|
nice! so this is legit from that standpoint |
| Comment by Adam Dickmeiss [ 11/Dec/17 ] |
|
I started a fresh folio/testing VM and I get:
total used free shared buffers cached
Mem: 4058872 3927480 131392 21888 12276 194164
-/+ buffers/cache: 3721040 337832
Swap: 473084 4468 468616
that's 3 percent of RAM free according to nmon.. Not a lot to eat from as the system gets used, so we should definitely beef up those VMs. |
| Comment by shale99 [ 11/Dec/17 ] |
|
Adam Dickmeiss look at the second line - buffers/cache - it indicates almost 10% free; the OS can free up the buffers/cache if needed. But I agree this is low, and digging into those 10% will potentially impact performance. |
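As a sanity check on the figures above: free(1)'s "free" column excludes reclaimable buffers and page cache, and adding them back reproduces the "-/+ buffers/cache" free column:

```shell
# Figures (in kB) taken from the free output above.
free_kb=131392
buffers_kb=12276
cached_kb=194164
# Memory actually available once the OS reclaims buffers and cache:
echo $(( free_kb + buffers_kb + cached_kb ))   # prints 337832, matching the -/+ line
```

So the box has roughly 330 MB of headroom out of 4 GB, not the 131 MB the top line suggests; still low, as noted.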
| Comment by Adam Dickmeiss [ 11/Dec/17 ] |
|
It doesn't take more than a few minutes to break things memory-wise.. This time mod-authtoken was the first one to stop - not because of the OOM killer, but because the JVM could not allocate more memory.. "docker logs" shows:

INFO: No 'user_id' field found in token
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 880096 bytes for Chunk::new
# Can not save log file, dump to screen..
#
|
| Comment by Mike Taylor [ 11/Dec/17 ] |
|
Great that you guys are getting to the bottom of this. Is there anything Okapi, or another component, could do differently to make such failures more polite or explicit? |
| Comment by Adam Dickmeiss [ 11/Dec/17 ] |
|
Having beefed up to 8 GB and used the UI - just clicking 'Users', 'Items' and 'Checkout' and pretty much nothing else - I see a clear RSS increase for at least two modules:

1229 folio 20 0 4672544 1.016g 20440 S 0.0 13.0 1:36.41 java
1825 folio 20 0 4680964 804312 20912 S 0.0 9.8 0:24.20 java

1229 is mod-permissions:latest and 1825 is mod-inventory-storage:latest. |
| Comment by shale99 [ 11/Dec/17 ] |
|
I wonder if this is just Java, which will continue to allocate memory as long as there is no memory pressure? I don't know when it will run a garbage collection. If objects are long-lived (which may be problematic) and are moving to the old generation, and GC isn't running, then maybe that would lead to the memory ballooning - although this memory may well be freeable by a GC. It would be great to add the GC-log command-line arguments as well, so we can see the memory behaviour of the modules. I honestly have not done that (which, now that I think about it, is quite unreasonable - we should know the RMB memory behaviour from a GC standpoint). Can we add the GC log args to the command line of the modules? Something like this: -XX:+HeapDumpOnOutOfMemoryError -XX:+PrintGCDetails -Xloggc:<path_2_logs>/gclogs |
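A module launch line carrying those flags might look like the following sketch. The jar name and log path are placeholders, and the flags are JDK 8 syntax (JDK 9+ replaced them with the unified -Xlog:gc option):

```shell
java -XX:+HeapDumpOnOutOfMemoryError \
     -XX:+PrintGCDetails \
     -XX:+PrintGCTimeStamps \
     -Xloggc:/var/log/folio/mod-codex-mock/gc.log \
     -jar mod-codex-mock-fat.jar
```

-XX:+HeapDumpOnOutOfMemoryError additionally writes a heap dump on OOM, which would help distinguish a genuine leak from a heap that GC could have reclaimed.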
| Comment by shale99 [ 11/Dec/17 ] |
|
Thinking about this: in the past I have had the GC logs generated by each running module fed into an ELK server, which made this info viewable via a browser, so you could see how the memory inside the JVM was behaving during different situations in the modules.. Maybe something to think about. |
| Comment by Adam Dickmeiss [ 11/Dec/17 ] |
|
After more interaction with the UI, the stats are now:

1229 folio 20 0 4672544 1.060g 20440 S 0.3 13.6 4:46.55 java
817 okapi 20 0 4584008 933812 18620 S 0.0 11.4 1:34.31 java

Okapi is now in second place. |
| Comment by shale99 [ 11/Dec/17 ] |
|
Any chance you could run jstat on the process? |
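A minimal sketch of such a jstat run, assuming a full JDK (not just a JRE) is available inside the VM; -gcutil samples the occupancy of each generation plus GC counts and times:

```shell
# Sample GC utilisation of the first running java process every 5 s,
# 12 samples; fall back to a message when nothing suitable is found.
pid=$(pgrep -x java | head -n1)
if [ -n "$pid" ]; then
  jstat -gcutil "$pid" 5000 12 2>/dev/null || echo "jstat (JDK tool) not on PATH"
else
  echo "no running java process found"
fi
```

A steadily climbing old-generation column (O) across samples, with no drop after full GCs (FGC), would support the ballooning theory discussed above.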
| Comment by Nathaniel Ryckman [ 13/Dec/17 ] |
|
Hi all, I found an interesting article that might be related: Java Inside Docker.
Best, |
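The article's point is that JVMs of that era (before JDK 8u131's experimental cgroup support) size their default heap from the host's RAM, not from the container's memory limit, so the usual workaround is a hard -Xmx alongside the Docker limit. A configuration sketch; the image name, jar path, and sizes are placeholders:

```shell
docker run -d \
  -m 512m --memory-swap 512m \
  example/mod-codex-mock \
  java -Xmx256m -jar module-fat.jar
```

Without the explicit -Xmx, a container on the 4 GB VM above could try to grow its heap toward a quarter of the host's RAM regardless of its own limit, which matches the malloc failures seen earlier in this thread.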
| Comment by shale99 [ 25/Dec/17 ] |
|
Wayne Schneider , John Malconian - I would like to close this issue, or should I move it to one of you? |
| Comment by Wayne Schneider [ 26/Dec/17 ] |
|
shale99 – I believe this issue can be closed. Relevant work is going on
|