[FOLIO-1960] Folio back-end stopped responding after high load Created: 11/Apr/19 Updated: 03/Jun/20 Resolved: 03/Jun/19 |
|
| Status: | Closed |
| Project: | FOLIO |
| Components: | None |
| Affects versions: | None |
| Fix versions: | None |
| Type: | Bug | Priority: | P3 |
| Reporter: | Piotr Kalashuk | Assignee: | John Malconian |
| Resolution: | Done | Votes: | 0 |
| Labels: | platform-backlog |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original estimate: | Not Specified |
| Attachments: | |
| Issue links: | |
| Sprint: | CP: sprint 63, Core: Platform - Sprint 61, CP: sprint 64 |
| Story Points: | 2 |
| Development Team: | Core: Platform |
| Description |
|
http://folio-testing.aws.indexdata.com/ stopped operating after the following steps (a somewhat abnormal use case):
As a result, the testing environment stopped responding on the last step. The Postman collection emulating this case: create-big-order-and-delete.postman_collection.json |
| Comments |
| Comment by Jakub Skoczen [ 11/Apr/19 ] |
|
John Malconian before we restart Okapi or rebuild the environment – can we retain the logs and attach them to this issue, to try to understand the reason for the failure? |
| Comment by John Malconian [ 11/Apr/19 ] |
|
The log was truncated and useless but for completeness:
2019-04-11 01:48:33,511 INFO jFactory$Slf4jLogger [LOCAL] [dev] [3.11] Prefer IPv4 stack is true, prefer IPv6 addresses is false
2019-04-11 01:48:33,521 INFO jFactory$Slf4jLogger [LOCAL] [dev] [3.11] Picked [10.36.1.35]:5701, using socket ServerSocket[addr=/0:0:0:0:0:0:0:0,localport=5701], bind any local is true
2019-04-11 01:48:33,533 INFO jFactory$Slf4jLogger [10.36.1.35]:5701 [dev] [3.11] Hazelcast 3.11 (20181023 - 1500bbb) starting at [10.36.1.35]:5701
2019-04-11 01:48:33,533 INFO jFactory$Slf4jLogger [10.36.1.35]:5701 [dev] [3.11] Copyright (c) 2008-2018, Hazelcast, Inc. All Rights Reserved.
2019-04-11 01:48:33,533 INFO jFactory$Slf4jLogger [10.36.1.35]:5701 [dev] [3.11] A non-empty group password is configured for the Hazelcast member. Starting with Hazelcast version 3.8.2, members with the same group name, but with different group passwords (that do not use authentication) form a cluster. The group password configuration will be removed completely in a future release.
2019-04-11 01:48:33,704 INFO jFactory$Slf4jLogger [10.36.1.35]:5701 [dev] [3.11] Backpressure is disabled
2019-04-11 01:48:34,065 INFO jFactory$Slf4jLogger [10.36.1.35]:5701 [dev] [3.11] Creating MulticastJoiner
2019-04-11 01:48:34,255 INFO jFactory$Slf4jLogger [10.36.1.35]:5701 [dev] [3.11] Starting 8 partition threads and 5 generic threads (1 dedicated for priority tasks)
2019-04-11 01:48:34,258 INFO jFactory$Slf4jLogger [10.36.1.35]:5701 [dev] [3.11] Diagnostics disabled. To enable add -Dhazelcast.diagnostics.enabled=true to the JVM arguments.
2019-04-11 01:48:34,267 INFO jFactory$Slf4jLogger [10.36.1.35]:5701 [dev] [3.11] [10.36.1.35]:5701 is STARTING
2019-04-11 01:48:37,048 INFO jFactory$Slf4jLogger [10.36.1.35]:5701 [dev] [3.11]
Members {size:1, ver:1} [
Member [10.36.1.35]:5701 - 3a369bef-fe91-4036-85ab-582125595564 this
]
2019-04-11 01:48:37,068 INFO jFactory$Slf4jLogger [10.36.1.35]:5701 [dev] [3.11] [10.36.1.35]:5701 is STARTED
2019-04-11 01:48:37,465 INFO jFactory$Slf4jLogger [10.36.1.35]:5701 [dev] [3.11] Initializing cluster partition table arrangement...
2019-04-11 01:48:39,497 INFO SQLConnectionHandler Received notice NoticeMessage(fields=Map(Line -> 209, File -> parse_utilcmd.c, SQLSTATE -> 42P07, Routine -> transformCreateStmt, V -> NOTICE, Message -> relation "env" already exists, skipping, Severity -> NOTICE))
2019-04-11 01:48:39,503 INFO SQLConnectionHandler Received notice NoticeMessage(fields=Map(Line -> 805, File -> index.c, SQLSTATE -> 42P07, Routine -> index_create, V -> NOTICE, Message -> relation "name" already exists, skipping, Severity -> NOTICE))
2019-04-11 01:48:39,510 INFO SQLConnectionHandler Received notice NoticeMessage(fields=Map(Line -> 209, File -> parse_utilcmd.c, SQLSTATE -> 42P07, Routine -> transformCreateStmt, V -> NOTICE, Message -> relation "deployments" already exists, skipping, Severity -> NOTICE))
2019-04-11 01:48:39,512 INFO SQLConnectionHandler Received notice NoticeMessage(fields=Map(Line -> 805, File -> index.c, SQLSTATE -> 42P07, Routine -> index_create, V -> NOTICE, Message -> relation "inst_id" already exists, skipping, Severity -> NOTICE))
2019-04-11 01:48:39,514 INFO SQLConnectionHandler Received notice NoticeMessage(fields=Map(Line -> 209, File -> parse_utilcmd.c, SQLSTATE -> 42P07, Routine -> transformCreateStmt, V -> NOTICE, Message -> relation "tenants" already exists, skipping, Severity -> NOTICE))
2019-04-11 01:48:39,515 INFO SQLConnectionHandler Received notice NoticeMessage(fields=Map(Line -> 805, File -> index.c, SQLSTATE -> 42P07, Routine -> index_create, V -> NOTICE, Message -> relation "tenant_id" already exists, skipping, Severity -> NOTICE))
2019-04-11 01:48:39,517 INFO SQLConnectionHandler Received notice NoticeMessage(fields=Map(Line -> 209, File -> parse_utilcmd.c, SQLSTATE -> 42P07, Routine -> transformCreateStmt, V -> NOTICE, Message -> relation "modules" already exists, skipping, Severity -> NOTICE))
2019-04-11 01:48:39,518 INFO SQLConnectionHandler Received notice NoticeMessage(fields=Map(Line -> 805, File -> index.c, SQLSTATE -> 42P07, Routine -> index_create, V -> NOTICE, Message -> relation "module_id" already exists, skipping, Severity -> NOTICE))
|
| Comment by Ian Hardy [ 11/Apr/19 ] |
|
Attaching the Okapi log and the first handful of lines from mod-orders and mod-orders-storage. It looks like Okapi stopped logging shortly after it started, and the modules started throwing errors on startup before 9:00 UTC, when Piotr ran his test. |
| Comment by Piotr Kalashuk [ 11/Apr/19 ] |
|
The test failed with a 500 response code in the middle of the run: http://folio-testing-backend01.aws.indexdata.com:9130/orders/receiving-history?limit=39607&query=purchaseOrderId=6c31289e-181e-4459-bf00-cd412063ca30 and receivingStatus==Expected fails with the following exception:
{
"errors": [
{
"message": "Generic error",
"code": "genericError",
"parameters": [],
"cause": null
}
],
"total_records": 1
}
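For reference, a minimal sketch of reproducing the failing call outside Postman (assuming Node 18+ with its built-in fetch; the diku tenant matches the Okapi logs later in this ticket, and the OKAPI_TOKEN environment variable is a placeholder, not part of the original report):

// Reproduce the failing receiving-history request against the testing backend.
const url = "http://folio-testing-backend01.aws.indexdata.com:9130/orders/receiving-history"
  + "?limit=39607&query=" + encodeURIComponent(
    "purchaseOrderId=6c31289e-181e-4459-bf00-cd412063ca30 and receivingStatus==Expected");

(async () => {
  const res = await fetch(url, {
    headers: {
      "X-Okapi-Tenant": "diku",                        // assumption: diku tenant
      "X-Okapi-Token": process.env.OKAPI_TOKEN || "",  // placeholder: obtain via /authn/login
    },
  });
  console.log(res.status);        // 500 during the failure window
  console.log(await res.text());  // the "genericError" body shown above
})();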
|
| Comment by John Malconian [ 11/Apr/19 ] |
|
Looks like the second test run duplicated the initial run's results/error. The Okapi logging issue is separate from this one.
|
| Comment by Marc Johnson [ 11/Apr/19 ] |
|
John Malconian Piotr Kalashuk Did the second run cause the entire system to become unresponsive? To me, that is very different from some specific requests responding with 500 errors. |
| Comment by Piotr Kalashuk [ 11/Apr/19 ] |
|
Rough quantity of the requests. The physical PO Line template (physLine) used by the collection has the following locations:
{
"locations": [
{
"locationId": "fcd64ce1-6995-48f0-840e-89ffa2288371",
"quantityPhysical": 1
},
{
"locationId": "f34d27c6-a8eb-461b-acd6-5dea81771e70",
"quantityPhysical": 2
}
]
}
The code which increases the number of pieces for each next PO Line is:
for (let i = 0; i < poLinesQuantity - 2; i++) {
  let newLine = utils.copyJsonObj(physLine);
  // Increase unit of quantity based on index for each PO Line
  newLine.locations.forEach(loc => loc.quantityPhysical += i);
  newLine.cost.quantityPhysical = newLine.locations.map(loc => loc.quantityPhysical).reduce((a, b) => a + b, 0);
  order.compositePoLines.push(newLine);
}
This means that the resulting quantities grow with the index of each PO Line.
So in total there are 39607 pieces. When an order is created, mod-orders sends a bunch of requests to mod-orders-storage: one request to create the order, plus separate requests to create each PO Line and each piece. So for the order I used for testing, that is one request for the order itself, one per PO Line and one per piece (a rough calculation is sketched below).
Additionally, the Create big order request from the Postman collection sends requests to inventory to verify that no items were created, i.e. 200 more requests. This means that by the time this Postman test request is fully completed, the number of requests to Okapi was around 40 000. All of these requests are sent asynchronously, i.e. almost all of them at around the same time. The Delete order and all related entities request sent by the Postman collection is even heavier.
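A back-of-the-envelope sketch of how those totals add up, assuming ~200 PO Lines (inferred from the 200 inventory checks above; that number is an assumption, not stated explicitly) and the copy loop shown earlier:

// Rough calculation of the load generated by the "Create big order" request.
const poLinesQuantity = 200;    // assumption: one inventory check per PO Line => ~200 lines
const baseQuantities = [1, 2];  // quantityPhysical of the two locations in the physLine template

let totalPieces = baseQuantities.reduce((a, b) => a + b, 0);  // the template line itself
for (let i = 0; i < poLinesQuantity - 2; i++) {
  // each copied line holds (1 + i) + (2 + i) physical pieces
  totalPieces += baseQuantities.map(q => q + i).reduce((a, b) => a + b, 0);
}

// One storage request for the order, one per PO Line, one per piece,
// plus the 200 inventory verification calls sent by the Postman test.
const createRequests = 1 + poLinesQuantity + totalPieces + 200;
console.log(totalPieces, createRequests);  // ~39 600 pieces, ~40 000 requests

These figures line up with the ~39 607 pieces and the ~40 000 create-side requests quoted in this comment.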
Once this Delete order and all related records Postman test request is fully completed, the number of requests to Okapi was around 40 200. So the total rough number of requests for the whole create-and-delete run is 80 218. |
| Comment by Piotr Kalashuk [ 11/Apr/19 ] |
|
Please note that this is an abnormal test, because I think no one is going to create an order for 40 000 resources... |
| Comment by John Malconian [ 11/Apr/19 ] |
|
I've provisioned http://folio-1960.aws.indexdata.com, which we can dedicate to troubleshooting this issue. Okapi is at version 2.26.1. Okapi logging should not be an issue.
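Once the box is up, a quick way to confirm Okapi is responsive end-to-end is the login call. A minimal sketch (Node 18+ fetch; the OKAPI_URL placeholder, tenant and credentials are assumptions, not details from this comment):

// Simple responsiveness check: POST /authn/login through Okapi.
const okapiUrl = process.env.OKAPI_URL;  // placeholder for this environment's Okapi URL
(async () => {
  const res = await fetch(`${okapiUrl}/authn/login`, {
    method: "POST",
    headers: {
      "X-Okapi-Tenant": "diku",           // assumption: diku tenant
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ username: "diku_admin", password: "admin" }),  // placeholder credentials
  });
  // A healthy system answers quickly with 201 and an x-okapi-token header;
  // during the hang described in the next comment, this call never returns.
  console.log(res.status, res.headers.get("x-okapi-token"));
})();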
|
| Comment by Piotr Kalashuk [ 11/Apr/19 ] |
|
John Malconian the test was run ~30-35 mins ago and the "delete" request hung. Now Okapi on that environment does not respond even to the login request. |
| Comment by John Malconian [ 12/Apr/19 ] |
|
I uploaded the okapi/module logs to this issue - okapi-04-11.tar.gz contains several okapi logs. I think things start to go wonky in okapi.log.4. A couple of initial observations.

1. mod-audit-filter starts to spit out a bunch of Java exceptions in the middle of that log file that look something like:

Apr 11, 2019 4:37:35 PM org.folio.audit.service.AuditFilterService
WARNING: Error to extract auditData: java.lang.NullPointerException
at org.folio.audit.service.AuditFilterService.getBodyAsJsonObject(AuditFilterService.java:277)
at org.folio.audit.service.AuditFilterService.collectAuditData(AuditFilterService.java:139)
at org.folio.audit.service.AuditFilterService.prePostHandler(AuditFilterService.java:54)
at io.vertx.ext.web.impl.RouteImpl.handleContext(RouteImpl.java:225)
at io.vertx.ext.web.impl.RoutingContextImplBase.iterateNext(RoutingContextImplBase.java:120)
at io.vertx.ext.web.impl.RoutingContextImpl.next(RoutingContextImpl.java:133)
at io.vertx.ext.web.handler.impl.BodyHandlerImpl$BHandler.doEnd(BodyHandlerImpl.java:231)
at io.vertx.ext.web.handler.impl.BodyHandlerImpl$BHandler.end(BodyHandlerImpl.java:211)
at io.vertx.ext.web.handler.impl.BodyHandlerImpl.lambda$handle$0(BodyHandlerImpl.java:74)
at io.vertx.core.http.impl.HttpServerRequestImpl.handleEnd(HttpServerRequestImpl.java:417)
at io.vertx.core.http.impl.Http1xServerConnection.handleEnd(Http1xServerConnection.java:482)
at io.vertx.core.http.impl.Http1xServerConnection.processMessage(Http1xServerConnection.java:456)
at io.vertx.core.http.impl.Http1xServerConnection.handleMessage(Http1xServerConnection.java:144)
at io.vertx.core.http.impl.HttpServerImpl$ServerHandlerWithWebSockets.handleMessage(HttpServerImpl.java:728)
at io.vertx.core.http.impl.HttpServerImpl$ServerHandlerWithWebSockets.handleMessage(HttpServerImpl.java:619)
at io.vertx.core.net.impl.VertxHandler.lambda$channelRead$1(VertxHandler.java:146)
at io.vertx.core.impl.ContextImpl.lambda$wrapTask$2(ContextImpl.java:337)
at io.vertx.core.impl.ContextImpl.executeFromIO(ContextImpl.java:195)
at io.vertx.core.net.impl.VertxHandler.channelRead(VertxHandler.java:144)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:284)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1359)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:935)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:141)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:886)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)

2. It seems as though Okapi "freezes up", but this is not the case. Okapi still receives and logs requests:

402713/authn REQ 10.36.20.124:41316 diku POST /authn/login mod-authtoken-2.2.0-SNAPSHOT.44 mod-audit-filter-0.0.6-SNAPSHOT.27 mod-login-5.2.0-SNAPSHOT.52 mod-audit-filter-0.0.6-SNAPSHOT.27

The issue (or one of them) appears to be mod-authtoken. After restarting the mod-authtoken module, Okapi operations resume normally. There is nothing of interest in the mod-authtoken logs; however, when shutting down the module, the following error is logged:

java.lang.IllegalStateException
at io.vertx.core.net.impl.VertxEventLoopGroup.next(VertxEventLoopGroup.java:37)
at io.vertx.core.net.impl.VertxEventLoopGroup.register(VertxEventLoopGroup.java:53)
at io.netty.bootstrap.ServerBootstrap$ServerBootstrapAcceptor.channelRead(ServerBootstrap.java:255)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1359)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:935)
at io.netty.channel.nio.AbstractNioMessageChannel$NioMessageUnsafe.read(AbstractNioMessageChannel.java:93)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:451)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:886)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)

I'll keep this system available for additional testing/troubleshooting. |
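For anyone continuing the troubleshooting, a minimal sketch of checking what Okapi has registered for mod-authtoken through its discovery API (Node 18+ fetch; the OKAPI_URL placeholder and unauthenticated access to Okapi's internal port are assumptions, not details from this comment):

// List the mod-authtoken instances Okapi knows about and probe their health.
const okapi = process.env.OKAPI_URL;  // placeholder for the Okapi base URL of this box
(async () => {
  const instances = await (await fetch(`${okapi}/_/discovery/modules`)).json();
  for (const inst of instances.filter(i => i.srvcId.startsWith("mod-authtoken"))) {
    console.log(inst.srvcId, inst.instId, inst.url);
    const health = await fetch(`${okapi}/_/discovery/health/${inst.srvcId}/${inst.instId}`);
    console.log("health:", health.status, await health.text());
  }
})();

Undeploying and redeploying the instance via DELETE/POST on /_/discovery/modules would be the Okapi-level equivalent of the restart described above, though on this environment the module containers may be managed outside of Okapi.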
| Comment by Jakub Skoczen [ 06/May/19 ] |
|
John Malconian Kurt Nordstrom Guys, what are the next steps for this issue? Is the issue in mod-authtoken your best guess? If so, should we create a ticket for mod-authtoken? |
| Comment by John Malconian [ 17/May/19 ] |
|
Jakub Skoczen Would it be a good idea to add a test for this to the perf-test environment? |
| Comment by John Malconian [ 24/May/19 ] |
|
This can also be re-tested at https://folio-snapshot-load.aws.indexdata.com, which is rebuilt daily. Okapi can be reached directly at https://folio-snapshot-load-okapi.aws.indexdata.com |
| Comment by Jakub Skoczen [ 03/Jun/19 ] |
|
John Malconian I am going to close this issue – please create a new one if there is something the developers should start investigating. |