[FOLIO-974] proxyRequestResponse failure Created: 05/Dec/17  Updated: 12/Nov/18  Resolved: 17/Jan/18

Status: Closed
Project: FOLIO
Components: None
Affects versions: None
Fix versions: None

Type: Bug Priority: P2
Reporter: Mike Taylor Assignee: shale99
Resolution: Done Votes: 0
Labels: sprint28
Remaining Estimate: Not Specified
Time Spent: 1 day, 1 hour
Original estimate: Not Specified

Attachments: Text File 20171206_var-log-folio-okapi-okapi.log     File inventory-storage-docker.log     File messages.log     File okapi.log    
Issue links:
Relates
relates to FOLIO-968 Container RAM usage is getting out of... Closed
relates to STCOR-119 Report login failure more explicitly ... Closed
relates to FOLIO-976 Use real module jar rather than modul... Closed
relates to OKAPI-477 Missing timing info for RES log entry Closed
relates to FOLIO-979 More RAM for folio-ansible, Please Closed
relates to OKAPI-478 High socket count in folio/testing Closed
relates to OKAPI-476 Improve error reporting for proxyRequ... Closed
Sprint:

 Description   

I am seeing this kind of thing quite a bit:

GET http://localhost:9130/codex-instances?limit=30&query=cql.allRecords%3D1%20sortby%20title

500 Internal Server Error
proxyRequestResponse failure: http://10.0.2.15:9146: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /10.0.2.15:9146 Connection refused: /10.0.2.15:9146

Here's how it manifests. First, a request to mod-codex-mock will hang for a long time – 100 seconds or so – then return with the failure as documented. Subsequent to this, any further requests are rejected with the same diagnostic, but immediately.



 Comments   
Comment by Heikki Levanto [ 07/Dec/17 ]

I tried to reproduce this by running a tight loop of curl requests (get one, list all, the query from above), as fast as curl could output on the screen, in four console windows in parallel, for thousands of iterations. Nothing crashed, and I did not see any 500 results. This was on a local installation, on an Okapi with just this one module loaded. Your setup is more complex.
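For reference, the kind of loop described above can be sketched like this (the URL and query are taken from the issue description; the tenant header is an assumption, since a local setup may or may not require it):

```shell
# Repro sketch: hammer the codex endpoint repeatedly and print only the
# HTTP status codes. Run several of these in parallel terminals.
# The X-Okapi-Tenant value is a guess; adjust for your installation.
for i in $(seq 1 "${ITERATIONS:-1000}"); do
  curl -s -o /dev/null -w '%{http_code}\n' \
    -H 'X-Okapi-Tenant: diku' \
    'http://localhost:9130/codex-instances?limit=30&query=cql.allRecords%3D1%20sortby%20title'
done
```

A stream of 200s that suddenly turns into 500s would pin down the moment of failure in the Okapi log.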

I don't have the Ansible setup on my laptop. Is there any way you could look in the logs and verify that port 9146 is indeed used by mod-codex-mock, and not some other module? (You have at least mod-authtoken involved in your stack.)

I would like to know more about when and how this happens. Does it happen on the first request, or only after some time? Can you get to the logs and look around the first 500 error (the one that hangs) for anything suspicious, especially near the beginning of the request? If you can identify the request-id, could you grep the logs for it and paste the result here? That should tell a bit more.

Comment by Heikki Levanto [ 07/Dec/17 ]

I filed OKAPI-476 Closed to get more info in that error message.

Comment by Mike Taylor [ 07/Dec/17 ]

Thanks for looking into this.

It doesn't happen on the first request, but randomly at some later time.

At present, we have a plague of 500s, not only in mod-codex-mock. They seem to appear at random, and as soon as a module emits one of them, it keeps doing so.

Wayne Schneider and shale99 think it's an out-of-memory problem in the VM configuration. They're looking into it. I think you can put this on hold until they've confirmed or denied.

Comment by Heikki Levanto [ 07/Dec/17 ]

I tried to kill a module and provoke Okapi into reporting this error, mostly to see OKAPI-476 work. It did, but the report came immediately, with no long timeout. That is what happens when the module is not running.

I wonder what happens in the first failing request that takes so long. Probably the module accepts the request, and then crashes hard. So hard that it stops listening at all.

Comment by Heikki Levanto [ 07/Dec/17 ]

Ok, I put this on hold for today. If need be, I can look at it again tomorrow - but then I will be away for a week.

At least I got the excuse to improve Okapi error reporting.

Comment by shale99 [ 07/Dec/17 ]

Mike Taylor - just to make sure I understand: if you hit the codex and at one point start hitting 500s, from that point on do all modules in the VM also throw 500s, or do they work properly?

Comment by Mike Taylor [ 07/Dec/17 ]

Once any given module gives a 500 once, it will keep doing so forever. Other modules may be unaffected at that time.

Comment by Adam Dickmeiss [ 08/Dec/17 ]

Mike Taylor .. It would be helpful if you could dig out the okapi.log. If you run the Ansible box locally you can do "vagrant ssh" and fetch /var/log/folio/okapi/okapi.log.

Comment by Mike Taylor [ 08/Dec/17 ]

Sorry, it's long gone: I've got into the habit now of almost reflexively doing a vagrant destroy -f && vagrant up once I start seeing 500s. I will try to remember not to do this the next time it happens, until I've pulled out the logs first.

(Wayne Schneider: it occurs to me – can Ansible/Vagrant be set up so that it runs a custom script when you destroy a VM, and could we have it copy the Okapi log somewhere well-known on the host system, so that we have it available if we want it subsequently?)
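In the meantime, a manual workaround along these lines might do (a sketch; the log path is the one mentioned in this thread, and the host-side filename is arbitrary):

```shell
# Pull the Okapi log out of the VM to a timestamped file on the host,
# so it survives a subsequent `vagrant destroy`.
snapshot_okapi_log() {
  vagrant ssh -c 'sudo cat /var/log/folio/okapi/okapi.log' \
    > "okapi-$(date +%Y%m%d-%H%M%S).log"
}

# Usage: snapshot_okapi_log && vagrant destroy -f && vagrant up
```

Getting into the habit of calling this before the reflexive destroy would preserve the evidence.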

Comment by shale99 [ 09/Dec/17 ]

Would it be possible to get /var/log/messages after the 500s start appearing for a specific module?
It is possible that the memory limit is being reached and something like the OOM (out of memory) killer is triggered by Linux, which kills the specific process (Linux issues a kill -9 in such a case), leading to the module being down - hence Okapi returns a 500, since it cannot reach the module.
Not sure I have all the details to say the above is correct, but if we can also see /var/log/messages we can see what is happening on the OS side as well.
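One quick way to check for that (a sketch; the log paths are typical for a Debian-based VM and may differ):

```shell
# Look for OOM-killer activity in the system logs; a hit such as
# "Out of memory: Kill process <pid>" names the victim process.
# `|| true` because one or both files may not exist on a given system.
grep -iE 'oom-killer|out of memory|killed process' \
  /var/log/messages /var/log/syslog 2>/dev/null || true
```

Matching the killed PID against the module's process would confirm or rule out the OOM theory.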

Comment by Mike Taylor [ 09/Dec/17 ]

Yep, I'll try to get everything the next time this happens.

– Mike.

Comment by Zak Burke [ 11/Dec/17 ]

I've seen the same symptoms as @mike; once a 500 hits, it's all 500s thereafter. Bumping the RAM for the VM from the default to 8192MB has mitigated the problem.

Here's a snip of /var/log/folio/okapi/okapi.log from folio/testing-backend 5.0.0-20171205.272, captured last week: 20171206_var-log-folio-okapi-okapi.log

Comment by Zak Burke [ 11/Dec/17 ]

Lending credence to the out-of-memory theory:

$ vagrant ssh

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
Last login: Tue Dec  5 16:55:59 2017 from 10.0.2.2
-bash: fork: Cannot allocate memory
vagrant@contrib-jessie:~$ sudo su -
-bash: fork: Cannot allocate memory
vagrant@contrib-jessie:~$ logout

Comment by Adam Dickmeiss [ 11/Dec/17 ]

See attached okapi.log, messages.log and inventory-storage.log from Niels Erik Nielsen's VM. At Dec 11 10:20:01 the VM OOM killer goes into action. It destroys PID 2045. After this point inventory-storage-docker.log is silent and Okapi starts to return 500s for requests to it.

Comment by Adam Dickmeiss [ 11/Dec/17 ]

See FOLIO-976 Closed for something to be desired when debugging this.

Comment by shale99 [ 11/Dec/17 ]

Nice! So this is legit from that standpoint.
I think we need to start integrating monitoring into this - I suggest nmon, or just making sure sar keeps history. Also, if we can run /admin/memory periodically and save the output into a file, we can see the memory behavior of the modules (not a ton of info, but a start).

Comment by Adam Dickmeiss [ 11/Dec/17 ]

I started a fresh folio/testing VM and I get:

            total       used       free     shared    buffers     cached
Mem:       4058872    3927480     131392      21888      12276     194164
-/+ buffers/cache:    3721040     337832
Swap:       473084       4468     468616

That's 3 percent of RAM free, according to nmon. Not a lot to eat from as the system gets used, so we should definitely beef up those VMs.

Comment by shale99 [ 11/Dec/17 ]

Adam Dickmeiss - look at the second line, buffers/cache: this indicates almost 10% free, since the OS can free up the buffers/cache if needed. But I agree this is low, and eating into those 10% will potentially impact performance.
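The "-/+ buffers/cache" row in the `free` output above is exactly free + buffers + cached; with the classic free(1) column layout it can be read off directly, for example:

```shell
# Sum the reclaimable figures from the Mem: line of classic `free` output;
# columns are: total used free shared buffers cached.
# (Newer versions of free report an "available" column instead.)
free | awk '/^Mem:/ { a = $4 + $6 + $7;
  printf "effectively available: %d kB (%.1f%% of total)\n", a, 100*a/$2 }'
```

On the numbers above this gives 337832 kB, about 8.3%, matching the "-/+ buffers/cache" free figure.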

Comment by Adam Dickmeiss [ 11/Dec/17 ]

It doesn't take more than a few minutes to break things memory-wise. This time, mod-authtoken was the first one to stop - not because of the OOM killer, but because the JVM could not allocate more memory. docker logs shows:

INFO: No 'user_id' field found in token
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 880096 bytes for Chunk::new
# Can not save log file, dump to screen..
#
Comment by Mike Taylor [ 11/Dec/17 ]

Great that you guys are getting to the bottom of this.

Is there anything Okapi, or another component, could do differently to make such failures more polite or explicit?

Comment by Adam Dickmeiss [ 11/Dec/17 ]

Having beefed the VM up to 8GB and used the UI (just clicking 'Users', 'Items' and 'Checkout' and pretty much nothing else), I see a clear RSS increase for at least two modules.

1229 folio     20   0 4672544 1.016g  20440 S   0.0 13.0   1:36.41 java                                                                                                                       
1825 folio     20   0 4680964 804312  20912 S   0.0  9.8   0:24.20 java             

1229 is mod-permissions:latest and 1825 is mod-inventory-storage:latest.

Comment by shale99 [ 11/Dec/17 ]

I wonder if this is just Java, which will continue to allocate memory as long as there is no memory pressure. I don't know when it will run a garbage collection; if objects are long-lived (which may be problematic) and are moving to the old generation while GC isn't running, that could lead to the memory ballooning - although this memory may be freeable by a GC. It would be great to add the GC-log command-line argument as well, so we can see the memory behavior of the modules. I honestly have not done that (which, now that I think about it, is quite unreasonable - we should know RMB's memory behavior from a GC standpoint).

Can we add the GC-logs argument to the command line of the modules?

something like this?!?!

-XX:+HeapDumpOnOutOfMemoryError -XX:+PrintGCDetails -Xloggc:<path_2_logs>/gclogs
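For completeness, a hypothetical Java 8 launch line with those flags wired in (the module jar name and log path are placeholders, not the actual deployment descriptor):

```shell
# Sketch: start a module with a heap dump on OOM and verbose GC logging.
# mod-example-fat.jar and the log directory are illustrative placeholders.
java -XX:+HeapDumpOnOutOfMemoryError \
     -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
     -Xloggc:/var/log/folio/mod-example/gclogs \
     -jar mod-example-fat.jar
```

(On Java 9+ these flags were replaced by the unified `-Xlog:gc*` option.)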

Comment by shale99 [ 11/Dec/17 ]

Thinking about this: in the past I have had the GC logs generated by each running module fed into an ELK server, which made this info viewable via a browser - you could see how the internal memory in the JVM was behaving during different situations in the modules. Maybe something to think about.

Comment by Adam Dickmeiss [ 11/Dec/17 ]

After more interaction with the UI, the stat is now:

 1229 folio     20   0 4672544 1.060g  20440 S   0.3 13.6   4:46.55 java                                                                                                                       
  817 okapi     20   0 4584008 933812  18620 S   0.0 11.4   1:34.31 java    

Okapi now in second place.

Comment by shale99 [ 11/Dec/17 ]

any chance to run jstat on the process?
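Something like this, for example (a sketch: PID 1229 comes from the `top` output above, and jstat ships with the JDK, which may not be present in a JRE-only container image):

```shell
# Sample heap/GC utilization every 5 s and flag a nearly full old generation.
# In `jstat -gcutil` output, column 4 (O) is old-gen occupancy in percent.
jstat -gcutil 1229 5000 |
  awk '$4 + 0 > 90 { print "old gen above 90%: " $4 "%" }'
```

A steadily climbing O column that never drops after a full GC would point at genuinely retained objects rather than lazy collection.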

Comment by Nathaniel Ryckman [ 13/Dec/17 ]

Hi all,

I found an interesting article that might be related, Java Inside Docker.

What if the value of the Heap could be calculated based on the container restriction automatically?

This is actually possible if you use a base Docker image provided by the Fabric8 community. The image fabric8/java-jboss-openjdk8-jdk uses a script that calculates the container restriction and uses 50% of the available memory as an upper boundary.
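The underlying idea can be sketched in a few lines of shell (a guess at the mechanism, not the Fabric8 script itself; the cgroup v1 path and the jar name are assumptions):

```shell
# Derive an -Xmx value as 50% of the container's cgroup memory limit.
half_of_limit_mb() {
  # $1 = limit in bytes; prints half of it in MiB
  echo $(( $1 / 2 / 1048576 ))
}

# In a container entrypoint one might then do (cgroup v1 path assumed):
#   limit=$(cat /sys/fs/cgroup/memory/memory.limit_in_bytes)
#   exec java -Xmx"$(half_of_limit_mb "$limit")m" -jar module-fat.jar
```

Later JVMs made this largely unnecessary via `-XX:MaxRAMPercentage` and built-in container awareness, but in the Java 8 era a wrapper script like this was the usual approach.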

Best,
Nate

Comment by shale99 [ 25/Dec/17 ]

Wayne Schneider, John Malconian - I would like to close this issue; or should I move it to one of you?

Comment by Wayne Schneider [ 26/Dec/17 ]

shale99 – I believe this issue can be closed. Relevant work is going on in FOLIO-968 Closed. Thanks!

Generated at Thu Feb 08 23:09:54 UTC 2024 using Jira 1001.0.0-SNAPSHOT#100246-sha1:7a5c50119eb0633d306e14180817ddef5e80c75d.