[FOLIO-516] Error 500's from Okapi with 5 or more concurrent requests Created: 23/Mar/17  Updated: 12/Nov/18  Resolved: 25/Apr/17

Status: Closed
Project: FOLIO
Components: None
Affects versions: None
Fix versions: None

Type: Bug Priority: P1
Reporter: Niels Erik Nielsen Assignee: Adam Dickmeiss
Resolution: Done Votes: 0
Labels: sprint11
Remaining Estimate: Not Specified
Time Spent: 4 days, 6 hours, 30 minutes
Original estimate: Not Specified

Attachments: Text File fine-logs-transcript.txt     File okapi.log.1     File okapi.log.2     Text File vagrant-docker-logs.txt     Text File vagrant-logs-new-box.txt    
Issue links:
Blocks
blocks STRIPES-324 Permissions listed in sets are blank Closed
Duplicate
is duplicated by STRIPES-270 Permissions: Getting Fetch Errors Whe... Closed
Relates
relates to OKAPI-312 hang in recursive calls for parallel ... Closed
relates to OKAPI-370 Overloaded Okapi dies Closed
relates to STRIPES-325 Add a "hasAllPerms" configuration opt... Closed
Sprint:

 Description   

For a while now I have, on and off, been seeing internal server errors from Okapi, apparently when issuing 5-6 or more concurrent requests to Okapi.

I'm running folio-backend-auth black box 0.7.2

I first saw it in the UI Scan app – though not when adding a single item, which makes the following requests (each preceded by an OPTIONS request):

GET item-storage
PUT item-storage
POST loan-storage
GET loan-storage

But when adding two or more items in one go, the issue occurs more often, though not always.

I've since seen it on the user view page too – where we now have about five concurrent GET requests: loans, users, permissions (x2), and groups.

I've begun attempts at reproducing this with cURL, but I need to add more concurrent requests to imitate the UI actions.
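
(For illustration only: a minimal Java sketch of the kind of concurrent burst the UI fires off. The Okapi URL, tenant and paths are placeholders, and a real request would also need an X-Okapi-Token header from a prior login.)

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class ConcurrentOkapiRequests {
  public static void main(String[] args) {
    HttpClient client = HttpClient.newHttpClient();
    // Roughly the five GETs the user view page makes at once (paths illustrative).
    List<String> paths = List.of("/loan-storage/loans", "/users",
        "/perms/users", "/perms/users", "/groups");
    List<CompletableFuture<Void>> futures = new ArrayList<>();
    for (String path : paths) {
      HttpRequest req = HttpRequest.newBuilder(URI.create("http://localhost:9130" + path))
          .header("X-Okapi-Tenant", "diku")
          .GET().build();
      futures.add(client.sendAsync(req, HttpResponse.BodyHandlers.ofString())
          .thenAccept(res -> System.out.println(path + " -> " + res.statusCode())));
    }
    // Firing these in one burst is what appears to trigger the 500s.
    CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
  }
}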



 Comments   
Comment by Niels Erik Nielsen [ 24/Mar/17 ]

Update: these errors may have occurred well before the first request that resulted in an error 500 – see subsequent comments on this JIRA. They appear to be related to PostgreSQL connection errors. Example of one such error:

2017-03-24 13:02:26,119 INFO  DockerModuleHandle   startContainer
2017-03-24 13:02:26,229 INFO  DockerModuleHandle   startContainer
2017-03-24 13:29:09,673 ERROR PostgreSQLConnection Error with message -> ErrorMessage(fields=Map(Line -> 2875, File -> postgres.c, SQLSTATE -> 57P01, Routine -> ProcessInterrupts, V -> FATAL, Message -> terminating connection due to administrator command, Severity -> FATAL))
2017-03-24 13:29:09,673 ERROR PostgreSQLConnection Error with message -> ErrorMessage(fields=Map(Line -> 2875, File -> postgres.c, SQLSTATE -> 57P01, Routine -> ProcessInterrupts, V -> FATAL, Message -> terminating connection due to administrator command, Severity -> FATAL))
2017-03-24 13:29:09,674 ERROR PostgreSQLConnection Error on connection
com.github.mauricio.async.db.postgresql.exceptions.GenericDatabaseException: ErrorMessage(fields=Map(Line -> 2875, File -> postgres.c, SQLSTATE -> 57P01, Routine -> ProcessInterrupts, V -> FATAL, Message -> terminating connection due to administrator command, Severity -> FATAL))
        at com.github.mauricio.async.db.postgresql.PostgreSQLConnection.onError(PostgreSQLConnection.scala:172)
        at com.github.mauricio.async.db.postgresql.codec.PostgreSQLConnectionHandler.channelRead0(PostgreSQLConnectionHandler.scala:206)
        at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:334)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:326)
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:293)
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:267)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:334)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:326)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1320)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:334)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:905)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:123)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:563)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:504)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:418)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:390)
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:742)
        at java.lang.Thread.run(Thread.java:745)
2017-03-24 13:29:09,675 ERROR PostgreSQLConnection Error on connection
Comment by shale99 [ 26/Mar/17 ]

Hi NE
is there anything else in the logs - maybe before this exception - that you can add to the issue?

Comment by Niels Erik Nielsen [ 27/Mar/17 ]

Hmm, hang on, I suspect I may have messed up the time stamps. Those errors could actually be from before the 500 errors.

Now I've installed a brand-new box (0.8.0), just to make sure the issue isn't confined to older boxes. I'm uploading the entire log, from the creation of the machine (03-26) until the 500s occur (03-27): okapi.log.1

I subsequently opened the UI page (user view) again, and this time the requests in question succeeded; no additional lines written to the log.

I will try again later to see if the issue occurs again after a while or not.

Comment by Niels Erik Nielsen [ 27/Mar/17 ]

So, for a number of page loads there were no 500 errors, then it occurred again, with just a single line added to the log file it seems:

2017-03-27 09:32:18,365 ERROR ConnectionBase       java.lang.IllegalArgumentException: invalid version format: 0

The new, entire log file: okapi.log.2

Comment by Niels Erik Nielsen [ 27/Mar/17 ]

And, it occurred again, but without any additional log lines written to okapi.log.

Comment by Niels Erik Nielsen [ 27/Mar/17 ]

And, finally, now the requests succeeded, no additional log lines.

Comment by Mike Taylor [ 27/Mar/17 ]

I am frequently, but intermittently, seeing this on my MacBook, where Stripes is running against v0.8.0 of Wayne's auth-blackbox. I get it when trying to view a user record, which results in five more or less simultaneous requests. They all fail with an HTTP 500, and no content in the response – see the attached screenshot.

This is a bit of a showstopper for me: I am basically unable to work on the Users module now. Can we at least increase the Magic Number in our Postgres configuration?

Comment by Mike Taylor [ 28/Mar/17 ]

Folks, I have pushed this to P1 as it is now blocking three of us in different ways:

  • Niels-Erik in doing multiple checkins.
  • Me in working on full-record display of users.
  • Cate in running demos (see STRIPES-270 Closed )

Can someone please get on it?

Comment by shale99 [ 28/Mar/17 ]

Currently downloading the latest black box.
Ran JMeter first to test parallel requests directly against mod-users:
10 concurrent users, 25 requests made up of POST / PUT / GET requests – some 400 errors but no 500s.
Ran again via Okapi -> mod-users – a lot of 400 errors but no 500s.
Will run against the black box shortly.

Comment by Mike Taylor [ 28/Mar/17 ]

Thanks, Shale, great to know you're on it.

Yes, this is an annoyingly random issue. I see it often (but not quite all the time) on my MacBook, but hardly at all on my Ubuntu box. I hope you can track down a reliable way to reproduce!

(I suppose it's possible that the problem is in Okapi rather than in mod-users, so it might be worth running your tests through Okapi rather than directly against mod-users. The Postgres errors that NE reported above suggest not, of course; but it could even be that putting the requests through Okapi affects the timing in a way that makes the problem more likely to manifest.)

Comment by shale99 [ 28/Mar/17 ]

Yes, I agree – I don't think this is an Okapi issue at all; the JMeter test just allows me to bombard the system occasionally to see how it reacts.

Comment by shale99 [ 28/Mar/17 ]

Hi Mike
If you have not restarted the black box, any chance you can take a look at the mod-users logs? It could really shed some light on the situation.
If that is possible, then to get to the logs you would need to run

sudo docker ps

locate mod-users in the list, get the container ID, and then run

sudo docker logs hex_container_id

Hopefully the module logged the stack traces, and then we can see what happened.

I would like to take this opportunity to suggest either forwarding the logs to a file per module under /var,
or, better yet, an ELK install that just aggregates the logs and exposes them via HTTP (no parsing of the logs, just expose them) – it should be very easy to do and would really help in debugging future issues.

Comment by Mike Taylor [ 28/Mar/17 ]

I attached vagrant-docker-logs.txt. It doesn't look very informative to me, but it might be more helpful for you.

Comment by shale99 [ 28/Mar/17 ]

yea, that log is clean (no errors)

Comment by Mike Taylor [ 28/Mar/17 ]

What log was Niels Erik Nielsen looking at earlier in this thread, where he found messages like

2017-03-24 13:29:09,673 ERROR PostgreSQLConnection Error with message -> ErrorMessage(fields=Map(Line -> 2875, File -> postgres.c, SQLSTATE -> 57P01, Routine -> ProcessInterrupts, V -> FATAL, Message -> terminating connection due to administrator command, Severity -> FATAL))

?

Comment by shale99 [ 28/Mar/17 ]

i believe the okapi.log, no?

Comment by Niels Erik Nielsen [ 28/Mar/17 ]

Yes, okapi.log.

But as I mentioned, those errors probably did not occur at the time of the 500s I was seeing, so I simply don't know if they're related.

Comment by shale99 [ 28/Mar/17 ]

I have the env set up – the Stripes sample platform running against the black box. So far it looks solid (tested hundreds of "get user" requests manually). Do we have some automation for recording UI actions so I can let that run in the background?

Comment by Mike Taylor [ 28/Mar/17 ]

I don't know what there is – it may be that Wolfram has made something like this.

Maybe ask on one of the broader Slack channels?

Comment by shale99 [ 28/Mar/17 ]

replicated, updating soon

Comment by shale99 [ 28/Mar/17 ]

I am not sure this is an RMB issue – I have gone through the logs of all modules, and the only module that throws an exception at the time the error occurs is the auth module:

logger.error("AuthZ> Unable to retrieve permissions for " + username + ": " + res.cause().getMessage());

AuthZ> Unable to retrieve permissions for diku_admin connection was closed

The code does not print out the stack trace, so this is all there is. I can continue looking at this, but it may be more productive for someone who knows the code better.

Comment by Mike Taylor [ 28/Mar/17 ]

That would be Kurt.

Comment by shale99 [ 28/Mar/17 ]

added kurt as watcher

Comment by shale99 [ 28/Mar/17 ]

Seems like this is caused when trying to get a user's permissions via HTTP from /perms/users/??
Kurt, you probably know better.

Comment by Kurt Nordstrom [ 28/Mar/17 ]

Whenever a request comes through, it gets sent to the authtoken module, which in turn sends a request to the permissions module to find out what the user's current permissions are, in order to decide whether or not to route the request through.

Is the permissions module having an issue with higher volume?

Comment by shale99 [ 28/Mar/17 ]

The permissions module seems OK – I only see 200 status codes in the log.

Comment by shale99 [ 28/Mar/17 ]

Response times of the permissions module are under 15 ms per request.

Comment by Mike Taylor [ 28/Mar/17 ]

shale99, what makes you think the /perms/users request is the guilty one? Whenever I see this, the Users module fires off five requests more or less simultaneously; then they all block for maybe 10 seconds; then they all fail more or less simultaneously with the 500.

Comment by shale99 [ 28/Mar/17 ]

That is what I saw in the code.
The logs in mod-users seem OK – I can recheck – and they seem OK in the permissions module as well. The only place I saw an error was in mod-auth.
I looked at the error in the code and followed it to an HTTP call to /perms/users.

Comment by Mike Taylor [ 28/Mar/17 ]

OK, cool.

Kurt, are you able to reproduce this?

Comment by Kurt Nordstrom [ 28/Mar/17 ]

I haven't attempted to replicate yet. One thing I'll mention is that I do tend to get a few initial failures the first time I attempt to use the auth modules, because, for whatever reason, the initial request to the permissions module tends to take longer than the authtoken module's timeout. Subsequent requests fire fine without issues.

Comment by Mike Taylor [ 28/Mar/17 ]

This is definitely not about initial requests. Until the five-requests-at-once moment, the permissions module has been behaving fine.

Comment by shale99 [ 28/Mar/17 ]

Kurt, would it be possible to add a timer to the HTTP connection (before the call and when the callback / exception is called) and print that out, so maybe we can see if this is caused by network hiccups?

Comment by Kurt Nordstrom [ 28/Mar/17 ]

Which connection? Do you mean in the authtoken module?

Comment by shale99 [ 28/Mar/17 ]

Yes, I believe this is the 500:

usePermissionsSource.getPermissionsForUser(username).setHandler((AsyncResult<JsonArray> res) -> {
  if(res.failed()) {
    logger.error("AuthZ> Unable to retrieve permissions for " + username + ": " + res.cause().getMessage());
    ctx.response()
            .setStatusCode(500)
            //.end("Unable to retrieve permissions for user");
            .end();
    return;
  }
  // ...
});
Comment by Kurt Nordstrom [ 28/Mar/17 ]

Sure, I'll add the timer.

Comment by Kurt Nordstrom [ 28/Mar/17 ]

It should report the ms taken in the log now, when the perms lookup fails.
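
(A minimal sketch of that instrumentation, building on the handler snippet quoted above – illustrative only, not the exact commit:)

long startTime = System.currentTimeMillis();
usePermissionsSource.getPermissionsForUser(username).setHandler((AsyncResult<JsonArray> res) -> {
  long elapsed = System.currentTimeMillis() - startTime;
  if (res.failed()) {
    // Log how long the lookup ran before failing, to distinguish an instant
    // connection error from a request that hit a ~10-second timeout.
    logger.error("AuthZ> Unable to retrieve permissions for " + username + ": "
        + res.cause().getMessage() + " request took " + elapsed + " ms");
    ctx.response().setStatusCode(500).end();
    return;
  }
  // ... continue as before on success
});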

Comment by Wayne Schneider [ 28/Mar/17 ]

I've released the Vagrant box folio/folio-backend-auth v0.8.1 with this version of the mod-auth modules. To update your box, cd into the directory where you have your Vagrantfile, then:

$ vagrant box update
$ vagrant destroy
$ vagrant up
Comment by shale99 [ 29/Mar/17 ]

NE, Mike,
I will upgrade and re-test. If you guys can also upgrade, that would be great, so that if you hit the 500 again we can see whether a networking issue is indeed causing the error.

Comment by Mike Taylor [ 29/Mar/17 ]

Will do.

Reading the comments leading up to this, it seems that what we're describing here is an update that only gives us improved logging – am I right? So is there any special reason to think this new auth-blackbox will actually fix my problem?

Comment by shale99 [ 29/Mar/17 ]

Correct, this won't fix the problem. From my standpoint, I would like to log this so that we have verification that this is indeed the issue and that a delay in the HTTP request is causing the 500 error. If this is the case, then how to handle it should be discussed. I think Kurt would probably know best what should be done in such a case. Should there be an automatic retry? Is there another action that can be taken?

Comment by Mike Taylor [ 29/Mar/17 ]

I downloaded the new VM and am running it now. I can confirm that it doesn't fix the problem.

I will attach the new logs. Did I get everything of interest?

Comment by shale99 [ 29/Mar/17 ]

This is the mod-users log, and it looks OK – no errors. I assume you've encountered the 500 then; can you attach the mod-auth log? You should see it when running the

sudo docker logs

command

Comment by Mike Taylor [ 29/Mar/17 ]

Read on – the log I posted includes the mod-users, mod-authtoken and mod-permissions logs.

Comment by shale99 [ 29/Mar/17 ]

Sorry, missed it – yup, this is indeed the problem:

Mar 29, 2017 7:43:18 AM mod-auth-authtoken-module
SEVERE: AuthZ> Unable to retrieve permissions for diku_admin: Connection was closed request took 10011 ms
Mar 29, 2017 7:43:18 AM mod-auth-authtoken-module
SEVERE: AuthZ> Unable to retrieve permissions for diku_admin: Connection was closed request took 10018 ms
Mar 29, 2017 7:43:18 AM mod-auth-authtoken-module
SEVERE: AuthZ> Unable to retrieve permissions for diku_admin: Connection was closed request took 10021 ms
Mar 29, 2017 7:43:18 AM mod-auth-authtoken-module
SEVERE: AuthZ> Unable to retrieve permissions for diku_admin: Connection was closed request took 10022 ms
Mar 29, 2017 7:43:18 AM mod-auth-authtoken-module
SEVERE: AuthZ> Unable to retrieve permissions for diku_admin: Connection was closed request took 10059 ms

and in the code

HttpClientOptions options = new HttpClientOptions();
options.setConnectTimeout(10);
options.setIdleTimeout(10);
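
(For context: in Vert.x 3, setConnectTimeout() is in milliseconds while setIdleTimeout() is in seconds, so the snippet above gives the authtoken client a 10-second idle timeout – which lines up with the ~10,000 ms failures in the log. A hedged sketch of loosening those values as an experiment; the numbers are illustrative, and this would at best mask the underlying problem rather than fix it:)

// Illustrative experiment only – not the eventual fix.
HttpClientOptions options = new HttpClientOptions();
options.setConnectTimeout(3000);  // connect timeout, in milliseconds
options.setIdleTimeout(60);       // idle timeout, in seconds
HttpClient client = vertx.createHttpClient(options);  // inside a verticle, where vertx is available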
Comment by shale99 [ 29/Mar/17 ]

Looking at the perms log, it seems that it is responding OK – see the log entries: HTTP_1_1 200 (response code), 730 (response size); after that are the response times, which range from 98 to 1379 milliseconds.

[Wed Mar 29 07:42:53 UTC 2017] org.folio.rest.tools.PomReader INFO module name: permissions_module, version: 2.2.0-SNAPSHOT 
[Wed Mar 29 07:42:53 UTC 2017] org.folio.rest.persist.PostgresClient INFO Using schema: diku 
[Wed Mar 29 07:42:53 UTC 2017] org.folio.rest.persist.PostgresClient INFO postgreSQLClientConfig = {"host":"10.0.2.15","database":"mod_auth","password":"...","port":5432,"username":"diku_permissions_module"} 
[Wed Mar 29 07:42:53 UTC 2017] io.vertx.ext.asyncsql.impl.PostgreSQLClientImpl INFO Creating configuration for 10.0.2.15:5432 
[Wed Mar 29 07:42:54 UTC 2017] org.folio.rest.tools.utils.LogUtil INFO org.folio.rest.RestVerticle start  invoking getPermsUsersByUsernamePermissions 
[Wed Mar 29 07:42:54 UTC 2017] org.folio.rest.tools.utils.LogUtil INFO 10.0.2.15:51264 GET /perms/users/diku_admin/permissions expanded=true HTTP_1_1 200 730 1379 tid=diku OK  
[Wed Mar 29 07:42:54 UTC 2017] org.folio.rest.tools.utils.LogUtil INFO org.folio.rest.RestVerticle start  invoking getPermsUsersByUsernamePermissions 
[Wed Mar 29 07:42:54 UTC 2017] org.folio.rest.tools.utils.LogUtil INFO 10.0.2.15:51265 GET /perms/users/diku_admin/permissions expanded=true HTTP_1_1 200 730 826 tid=diku OK  
[Wed Mar 29 07:42:54 UTC 2017] org.folio.rest.tools.utils.LogUtil INFO org.folio.rest.RestVerticle start  invoking getPermsUsersByUsernamePermissions 
[Wed Mar 29 07:42:54 UTC 2017] org.folio.rest.tools.utils.LogUtil INFO 10.0.2.15:51264 GET /perms/users/diku_admin/permissions expanded=true&full=true HTTP_1_1 200 8768 111 tid=diku OK  
[Wed Mar 29 07:42:59 UTC 2017] org.folio.rest.tools.utils.LogUtil INFO org.folio.rest.RestVerticle start  invoking getPermsUsersByUsernamePermissions 
[Wed Mar 29 07:42:59 UTC 2017] org.folio.rest.tools.utils.LogUtil INFO 10.0.2.15:51265 GET /perms/users/diku_admin/permissions expanded=true HTTP_1_1 200 730 81 tid=diku OK  
[Wed Mar 29 07:42:59 UTC 2017] org.folio.rest.tools.utils.LogUtil INFO org.folio.rest.RestVerticle start  invoking getPermsUsersByUsernamePermissions 
[Wed Mar 29 07:42:59 UTC 2017] org.folio.rest.tools.utils.LogUtil INFO 10.0.2.15:51264 GET /perms/users/diku_admin/permissions expanded=true HTTP_1_1 200 730 69 tid=diku OK  
[Wed Mar 29 07:43:00 UTC 2017] org.folio.rest.tools.utils.LogUtil INFO org.folio.rest.RestVerticle start  invoking getPermsUsersByUsernamePermissions 
[Wed Mar 29 07:43:00 UTC 2017] org.folio.rest.tools.utils.LogUtil INFO 10.0.2.15:51265 GET /perms/users/diku_admin/permissions expanded=true HTTP_1_1 200 730 41 tid=diku OK  
[Wed Mar 29 07:43:18 UTC 2017] org.folio.rest.tools.utils.LogUtil INFO org.folio.rest.RestVerticle start  invoking getPermsUsersByUsernamePermissions 
[Wed Mar 29 07:43:18 UTC 2017] org.folio.rest.tools.utils.LogUtil INFO 10.0.2.15:51264 GET /perms/users/diku_admin/permissions expanded=true HTTP_1_1 200 730 98 tid=diku OK  
[Wed Mar 29 07:43:18 UTC 2017] org.folio.rest.tools.utils.LogUtil INFO org.folio.rest.RestVerticle start  invoking getPermsUsersByUsernamePermissions 
[Wed Mar 29 07:43:18 UTC 2017] org.folio.rest.tools.utils.LogUtil INFO 10.0.2.15:51265 GET /perms/users/diku_admin/permissions expanded=true HTTP_1_1 200 730 126 tid=diku OK  
[Wed Mar 29 07:43:18 UTC 2017] org.folio.rest.tools.utils.LogUtil INFO org.folio.rest.RestVerticle start  invoking getPermsUsersByUsernamePermissions 
[Wed Mar 29 07:43:18 UTC 2017] org.folio.rest.tools.utils.LogUtil INFO 10.0.2.15:51290 GET /perms/users/diku_admin/permissions expanded=true HTTP_1_1 200 730 139 tid=diku OK  
[Wed Mar 29 07:43:18 UTC 2017] org.folio.rest.tools.utils.LogUtil INFO org.folio.rest.RestVerticle start  invoking getPermsUsersByUsernamePermissions 
[Wed Mar 29 07:43:18 UTC 2017] org.folio.rest.tools.utils.LogUtil INFO 10.0.2.15:51288 GET /perms/users/diku_admin/permissions expanded=true HTTP_1_1 200 730 199 tid=diku OK  
[Wed Mar 29 07:43:18 UTC 2017] org.folio.rest.tools.utils.LogUtil INFO org.folio.rest.RestVerticle start  invoking getPermsUsersByUsernamePermissions 
[Wed Mar 29 07:43:18 UTC 2017] org.folio.rest.tools.utils.LogUtil INFO 10.0.2.15:51289 GET /perms/users/diku_admin/permissions expanded=true HTTP_1_1 200 730 206 tid=diku OK 
Comment by shale99 [ 29/Mar/17 ]

What we need here is a request ID that is generated by Okapi and maintained in a header across all requests, so we can identify the exact call and the exact log entry for that call across modules. I have mentioned this in the past, and I think this may happen – I wouldn't mind a quick-and-dirty implementation for now; it could help us almost immediately in debugging problems.

Comment by shale99 [ 29/Mar/17 ]

Hi Kurt,
If an exception is thrown in the body handler below while parsing the JSON, so that future.complete() isn't reached – can that cause this? (Just speculating.)

HttpClientRequest req = client.getAbs(requestUrl, res -> {
  if(res.statusCode() == 200) {
    res.bodyHandler(res2 -> {
      JsonObject permissionsObject = new JsonObject(res2.toString());
      logger.debug("Got permissions: " + permissionsObject.getJsonArray("permissionNames").encodePrettily());
      future.complete(permissionsObject.getJsonArray("permissionNames"));
    });
  }
  // ...
});
Comment by shale99 [ 29/Mar/17 ]

Last comment - sorry for the spam

RMB modules have a URL endpoint to turn on debugging for any class / package. RMB (if in debug mode on RestVerticle.class) will print out the incoming request's body and, I believe, the outgoing response body, so we can also see whether the ins/outs are as we expect. This is in the /admin endpoints – and I have no clue how to run those through Okapi – so if someone knows, we can turn this on in the perms module.

Comment by Adam Dickmeiss [ 29/Mar/17 ]

Uncaught exceptions mostly show up as hangs in Vert.x... it has bitten me several times!

Comment by Heikki Levanto [ 29/Mar/17 ]

The /admin endpoints can not be invoked through Okapi directly. Either you have to bypass Okapi and call the module directly, or you have to fake a ModuleDescriptor that defines that endpoint - but only for one module in the whole installation. We have discussed this before, and it can be solved, but there are some difficulties with multiple instances and multiple modules.

As to the request ID, it would not be too hard to implement. It would require a small change in the auth module, and one in Okapi. The modules would not have to do more than they do now - pass the X-Okapi-Token to any request they make. They would receive an extra header X-Okapi-Request-Id which they could log as they please. Okapi could include that in its logs too. (I thought I had filed an issue for this, but can not find it)
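
(Sketching that proposal from a module's point of view – X-Okapi-Request-Id does not exist yet, so this is purely illustrative; ctx, client, permsUrl and logger are assumed to come from the module's handler:)

// Read the (proposed) request ID from the incoming request, include it in log lines,
// and pass it along on any outgoing request so the call can be traced across modules.
String requestId = ctx.request().getHeader("X-Okapi-Request-Id");
logger.info("reqId=" + requestId + " handling " + ctx.request().path());

HttpClientRequest outgoing = client.getAbs(permsUrl, res ->
    logger.info("reqId=" + requestId + " perms lookup returned " + res.statusCode()));
outgoing.putHeader("X-Okapi-Tenant", ctx.request().getHeader("X-Okapi-Tenant"));
outgoing.putHeader("X-Okapi-Token", ctx.request().getHeader("X-Okapi-Token"));
outgoing.putHeader("X-Okapi-Request-Id", requestId);
outgoing.end();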

Comment by shale99 [ 29/Mar/17 ]

I think the Id would really prove valuable since we're already crossing module boundaries

Comment by Mike Taylor [ 29/Mar/17 ]

Uncaught exceptions are mostly seen as hangs for Vert.x

That is awful!

The /admin endpoints can not be invoked through Okapi directly. Either you have to bypass Okapi and call the module directly, or you have to fake a ModuleDescriptor that defines that endpoint - but only for one module in the whole installation. We have discussed this before, and it can be solved, but there are some difficulties with multiple instances and multiple modules.

I think it would be very well worth coming up with a general solution to this in future, although I realise it's not going to get done right now. For example, some special-casing in Okapi where /_/admin/INSTANCEID is proxied to /admin on the module with the specified INSTANCEID.

But, folks – what are the prospects for fixing the present issue before we have mechanisms like this and the request-ID?

Comment by Kurt Nordstrom [ 29/Mar/17 ]

shale99 Thanks for the catch about the possible exception handling. I'm tightening that block up (though I don't know whether it is the culprit here or not).
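
(A sketch of what tightening that block might look like – fail the future on non-200 responses, on JSON parse errors, and on connection exceptions so the caller never hangs waiting for a result. Illustrative only, not the actual commit, reusing the same client, requestUrl and future as the snippet quoted above:)

HttpClientRequest req = client.getAbs(requestUrl, res -> {
  if (res.statusCode() != 200) {
    // Propagate an unexpected status instead of silently never completing the future.
    future.fail("Unexpected status code " + res.statusCode() + " from " + requestUrl);
    return;
  }
  res.bodyHandler(res2 -> {
    try {
      JsonObject permissionsObject = new JsonObject(res2.toString());
      future.complete(permissionsObject.getJsonArray("permissionNames"));
    } catch (Exception e) {
      // Without this, a JSON parse error would leave the future incomplete forever.
      future.fail(e);
    }
  });
});
req.exceptionHandler(e -> future.fail(e));
req.end();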

Comment by shale99 [ 29/Mar/17 ]

No problem – not sure either. I have set up my env with debug log level on a couple of packages in mod-perms and will see if I can see what is returned when the error occurs.

Comment by shale99 [ 29/Mar/17 ]

Mike, NE, Kurt,
I have been pressing for a while with no reproduction. Can you guys maybe send the following URLs to the perms module (you need to replace the port with the one that appears in the output of docker ps)? I don't know if there is enough debug info in there, but it is much more than now, so it could help.

http://localhost:9139/admin/loglevel?level=FINE&java_package=org.folio.rest.RestVerticle
http://localhost:9139/admin/loglevel?level=FINE&java_package=org.folio.rest.impl

Comment by Kurt Nordstrom [ 29/Mar/17 ]

shale99 This is tangential to the discussion, but how hard would it be to give the RMB an option to enable debug logging on the command line string, as opposed to requiring a request to be sent to the /admin endpoint?

Comment by shale99 [ 29/Mar/17 ]

I can add it – but I might be missing your angle on this. You mean to have it start up in debug mode for testing purposes?
If so, maybe Wayne can add a call to the endpoint when launching (don't know if it is too complicated)?
But it is easy enough to add the command-line option as well.

Comment by Kurt Nordstrom [ 29/Mar/17 ]

shale99 Yes, that's exactly what I want...to be able to start in debug mode with just a flag in the deployment descriptor. Yes, I can hit the endpoint manually by looking it up in Okapi, but it's kind of cumbersome. If that's something we can do, I'll open a separate issue for it so we don't derail this one.

Comment by Mike Taylor [ 29/Mar/17 ]

Clarification: those HTTP requests that shale99 mentioned are PUTs with no payload, and they only work on RMB-based modules – so in this case, mod-users and mod-permissions, but not mod-authcheck.

Comment by Mike Taylor [ 29/Mar/17 ]

Here is the result of turning up the log levels. (Next I will post the resulting logs.)

vagrant@jessie:~$ curl -w '\n' -X PUT -d '' 'http://localhost:9133/admin/loglevel?level=FINE&java_package=org.folio.rest.RestVerticle'
{
  "map" : {
    "org.folio.rest.RestVerticle" : "FINE"
  },
  "empty" : false
}
vagrant@jessie:~$ curl -w '\n' -X PUT -d '' 'http://localhost:9133/admin/loglevel?level=FINE&java_package=java_package=org.folio.rest.impl'
{
  "map" : { },
  "empty" : true
}
vagrant@jessie:~$ 

That second one doesn't look right to me.

Comment by Mike Taylor [ 29/Mar/17 ]

Oh, I see what I did wrong. Here you go:

vagrant@jessie:~$ curl -w '\n' -X PUT -d '' 'http://localhost:9133/admin/loglevel?level=FINE&java_package=org.folio.rest.impl'
{
  "map" : {
    "org.folio.rest.impl.AdminAPI" : "FINE",
    "org.folio.rest.impl.UsersAPI" : "FINE",
    "org.folio.rest.impl.ProcessUploads" : "FINE",
    "org.folio.rest.impl.TenantAPI" : "FINE",
    "org.folio.rest.impl.JobAPI" : "FINE",
    "org.folio.rest.impl.JobsRunner" : "FINE",
    "org.folio.rest.impl.UserGroupAPI" : "FINE"
  },
  "empty" : false
}
vagrant@jessie:~$ 
Comment by Mike Taylor [ 29/Mar/17 ]

That was turning the logging up for mod-users; here's the result of turning it up for mod-permissions:

vagrant@jessie:~$ curl -w '\n' -X PUT -d '' 'http://localhost:9139/admin/loglevel?level=FINE&java_package=org.folio.rest.RestVerticle'
{
  "map" : {
    "org.folio.rest.RestVerticle" : "FINE"
  },
  "empty" : false
}
vagrant@jessie:~$ curl -w '\n' -X PUT -d '' 'http://localhost:9139/admin/loglevel?level=FINE&java_package=org.folio.rest.impl'
{
  "map" : {
    "org.folio.rest.impl.PermsAPI" : "FINE",
    "org.folio.rest.impl.AdminAPI" : "FINE",
    "org.folio.rest.impl.ProcessUploads" : "FINE",
    "org.folio.rest.impl.TenantAPI" : "FINE",
    "org.folio.rest.impl.JobAPI" : "FINE",
    "org.folio.rest.impl.JobsRunner" : "FINE"
  },
  "empty" : false
}
vagrant@jessie:~$ 
Comment by Mike Taylor [ 29/Mar/17 ]

I attached a transcript of the mod-users and mod-permissions FINE logging. Hope it helps!

BTW, a big thank you, guys – I really appreciate you all putting in the hard yards on this.

Comment by shale99 [ 29/Mar/17 ]

Kurt, I have to head out; that once-a-year day has come and I need to prepare. I want to push a fix into master that will print out the returned JSON in debug mode (seems this is broken for some reason) – I will do so if you think that can help in understanding the issue.

Comment by Kurt Nordstrom [ 29/Mar/17 ]

shale99 Returned JSON from which module? I don't want to keep you from your observances.

Comment by shale99 [ 29/Mar/17 ]

Just my birthday, no biggie. The change is going into the RMB, but I can upgrade perms to use that RMB version and then you can see the outputs – relevant?

Comment by Kurt Nordstrom [ 29/Mar/17 ]

I have some other diagnostics I want to try, so I'd say you can wait until tomorrow to worry about the push.

Comment by Mike Taylor [ 04/Apr/17 ]

Folks, we really really need to get this fixed. As things stand FOLIO services are Just Plain Broken.

The most recent random manifestation of the problem is preventing Jeremy from finishing up the work on LIBAPP-83 Closed for the demo we're supposed to be doing in literally one hour from now.

Comment by shale99 [ 05/Apr/17 ]

Hi Kurt,
Are you looking into this, or should I?

Comment by Mike Taylor [ 07/Apr/17 ]

Kurt, are you looking into this? shale99 is not, he's out today.

Comment by Kurt Nordstrom [ 07/Apr/17 ]

I have started some basic investigation, but I have nothing conclusive yet. I will try to carve out some time this afternoon to investigate further.

Comment by Mike Taylor [ 14/Apr/17 ]

CAN SOMEONE PLEASE FIX THIS?

It's been at P1 for a couple of weeks. It is simply not possible to coherently do front-end development work when the back-end keeps randomly getting into a state where every request fails for no reason.

Comment by Kurt Nordstrom [ 14/Apr/17 ]

I did some investigation and I think it might be due to timeouts in the authtoken module while looking up permissions. Trying with a longer timeout threshold and also looking into ways to improve the lookup performance.

Comment by Mike Taylor [ 14/Apr/17 ]

Thanks, but I don't think it's timeouts. All this is happening very fast.

Surely stage one is just to ensure you never ever return a 5xx without some diagnostic text in the response body?

Comment by Kurt Nordstrom [ 14/Apr/17 ]

I've been avoiding passing specific information about the backend into the response body for 500's and instead writing the information out to logs when server errors occur.

Maybe a debug flag in which the information gets sent to the response would be useful in cases like this.

Comment by Mike Taylor [ 14/Apr/17 ]

Why either/or? Why not just emit the messages in the HTTP responses and log them?
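
(A sketch of that "both" option for the failure branch quoted earlier – always log the cause, and also put a short diagnostic in the 500 body, optionally gated behind a debug flag. Illustrative only; the debugResponses flag is hypothetical:)

if (res.failed()) {
  String message = "Unable to retrieve permissions for " + username + ": "
      + res.cause().getMessage();
  logger.error("AuthZ> " + message);              // always log the full cause
  ctx.response().setStatusCode(500);
  if (debugResponses) {                           // hypothetical flag, e.g. read from config
    ctx.response().end(message);                  // expose the diagnostic to the caller
  } else {
    ctx.response().end("Internal server error (see mod-authtoken log)");
  }
  return;
}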

Comment by shale99 [ 18/Apr/17 ]

Hi Kurt,
I believe the culprit is the following function: getPermsUsersByUsernamePermissions() in the permissions module. I would like to add some logging info to that function if that's OK.
There is also a recursive function, getExpandedPermissions(), called from within that function. I'm not familiar with the business logic, but maybe something is there?

Comment by Kurt Nordstrom [ 18/Apr/17 ]

Fine by me. Let me know what you turn up.

Comment by shale99 [ 19/Apr/17 ]

Hey guys,
So, apologies – I thought this was being handled, so I did not continue to look into it. I understood from Adam that that was not the case and that he is checking a few things. If that does not turn up anything, I will dive back into this.

Wayne Schneider – would it be possible to start the Java processes within Docker with debugging enabled? I think this makes sense for future cases as well. Basically, the following params should be added to the java command line; this will allow connecting a remote debugger to the Java code running in a Docker container (the port needs to be mapped):

-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=8000

(the port would be different for each of the started java modules in docker)

Comment by Adam Dickmeiss [ 19/Apr/17 ]

The 500 problem has now been reproduced using curl exclusively. I think we're closer to nailing this one down. This is not the startup problem – it really seems to happen when we fire concurrent requests at the system.

Comment by Kurt Nordstrom [ 19/Apr/17 ]

Adam Dickmeiss Can you share your curl recipe for reproducing the issue? I'd like to try to witness it locally.

Comment by Mike Taylor [ 19/Apr/17 ]

That is great news, Adam!

Comment by Heikki Levanto [ 19/Apr/17 ]

I think we reproduced the issue in a simplified form, using Okapi in dev mode and the Perl sample module. That means no databases were involved at all, no docker images, nothing but Okapi and the perl script. The problem only happens when the module makes a request to some module - in this case itself.

More info to follow, probably tomorrow

Comment by Kurt Nordstrom [ 19/Apr/17 ]

Augh! Don't leave me in suspense! At least give me a bash script or something to reproduce the issue.

Comment by Heikki Levanto [ 19/Apr/17 ]

Console 1:
java -Dloglevel=DEBUG -jar okapi-core/target/okapi-core-fat.jar dev

Console 2:
cd ~/proj/folio/folio-sample-modules/simple-perl
./example.sh

Script to run on test consoles:
while curl -f -H "X-Okapi-Tenant: testlib" http://localhost:9130/simple
do
echo -n `date`
sleep 0.01
done

Start about 10 of those consoles, and pay attention to the time stamp. At some point things stop moving. Ctrl-C on Okapi, and the scripts die nicely.

Comment by Adam Dickmeiss [ 20/Apr/17 ]

Looks like something we can reproduce with Okapi so assigning to me and Heikki.

Comment by Heikki Levanto [ 20/Apr/17 ]

With a slightly bigger script, I can start 20 clients. As long as each request causes the module to make another request, the system freezes almost immediately. If I make a request to /hello instead, the Perl module makes no further requests, and then things seem to run happily with 50 requests.

Comment by Heikki Levanto [ 20/Apr/17 ]

We have it. When Okapi makes a request to a module, it uses a global HttpClient to get a connection object. The HttpClient has a pool of connections, and if all are in use, it waits until one is free. But since the module will not finish its response until it has made its own call to Okapi, the pool has one less connection free. Repeat with many parallel requests, and there are no free connections left in the pool to serve the request coming from the module, and the whole system hangs. A classic deadlock.

The solution is to use an HttpClient for each request, so the request from the module gets its own HttpClient and its own connection pool, and will not need to wait for any other request to finish.

Expect a new Okapi release later today.
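
(A minimal sketch of the pattern Heikki describes – not the actual Okapi code; the method, names and option values are illustrative, and RoutingContext comes from vertx-web. The point is that each proxied request gets its own HttpClient, and therefore its own connection pool, instead of competing for connections in one shared pool:)

// Per-request client version: create (and close) a client per proxied request,
// so a module's call back through Okapi can never be starved by the pool that
// is already occupied by the requests waiting on it.
void proxyRequest(Vertx vertx, RoutingContext ctx, String moduleUrl) {
  HttpClientOptions options = new HttpClientOptions().setMaxPoolSize(5); // illustrative
  HttpClient perRequestClient = vertx.createHttpClient(options);
  HttpClientRequest outgoing = perRequestClient.getAbs(moduleUrl, moduleRes -> {
    ctx.response().setStatusCode(moduleRes.statusCode());
    moduleRes.bodyHandler(body -> {
      ctx.response().end(body);
      perRequestClient.close();   // release the per-request pool when done
    });
  });
  outgoing.exceptionHandler(e -> {
    ctx.response().setStatusCode(500).end();
    perRequestClient.close();
  });
  outgoing.end();
}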

Comment by Adam Dickmeiss [ 20/Apr/17 ]

Okapi 1.2.4 is out, and a bug, OKAPI-312 Closed, was filed for the issue or a subset thereof (there might be other issues!). It is of course interesting whether 1.2.4 will make the problem go away as far as UI development is concerned.

Comment by Mike Taylor [ 20/Apr/17 ]

This is the absolutely best news! Thanks for chasing it through.

Comment by Mike Taylor [ 20/Apr/17 ]

Wayne Schneider Can we please have a new folio-backend-auth that includes this? Thanks!

Comment by Niels Erik Nielsen [ 20/Apr/17 ]

Most bugs are just annoying. Intriguing bugs are rare.

Comment by Adam Dickmeiss [ 25/Apr/17 ]

Can UI people tell us if this problem seems fixed? If yes, we'll close this one.

Comment by Mike Taylor [ 25/Apr/17 ]

I have not seen a mysterious 500 since the release of folio-backend-auth v0.10.0. So I think we can close this, at least provisionally – we can always re-open if it crops up again.

Comment by Heikki Levanto [ 26/Apr/17 ]

I think I was still able to provoke a few 500s, but under much heavier load: something like 50 clients hammering the system as fast as they can, running for tens of minutes before I saw the first one. Probably not worth worrying about at this point, but I wanted to record the fact. The loop.sh script with folio-sample-modules/simple-perl is what I used for the test.

Generated at Thu Feb 08 23:06:22 UTC 2024 using Jira 1001.0.0-SNAPSHOT#100246-sha1:7a5c50119eb0633d306e14180817ddef5e80c75d.