[FOLIO-1960] Folio back-end stopped responding after high load Created: 11/Apr/19  Updated: 03/Jun/20  Resolved: 03/Jun/19

Status: Closed
Project: FOLIO
Components: None
Affects versions: None
Fix versions: None

Type: Bug Priority: P3
Reporter: Piotr Kalashuk Assignee: John Malconian
Resolution: Done Votes: 0
Labels: platform-backlog
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original estimate: Not Specified

Attachments: PNG File collectionRun_200PoLines.png     PNG File collectionRun_50PoLines.png     PNG File collection_vars.png     File create-big-order-and-delete.postman_collection.json     Text File mod-orders-storage.txt     Text File mod-orders.txt     File okapi-04-11.tar.gz     Text File okapi.txt    
Issue links:
Relates
relates to FOLIO-1967 Rescue folio-testing and build dedica... Closed
Sprint: CP: sprint 63, Core: Platform - Sprint 61, CP: sprint 64
Story Points: 2
Development Team: Core: Platform

 Description   

http://folio-testing.aws.indexdata.com/ stopped operating after the following steps (a somewhat abnormal use case):

  • Created a big order in "Open" status with 200 purchase order lines, which in total created 40 000+ entities (mostly piece records) in the database
  • Attempted several times to retrieve all piece records
  • Attempted to delete all entities created by the order

As a result, the testing environment stopped responding on the last step.

The Postman collection emulating this case: create-big-order-and-delete.postman_collection.json
The collection requires the folio-vagrant.postman_environment variables.
The default number of PO Lines is 10 (it can be set to 200, as in the test I ran):

The collection creates an order with the desired quantity of PO Lines. When the order is created, no records should appear in the inventory, so the collection's Tests script sends a request to inventory to verify that no items were created.



 Comments   
Comment by Jakub Skoczen [ 11/Apr/19 ]

John Malconian before we restart Okapi or rebuild the environment – can we retain the logs and attach them to this issue to try to understand the reason for the failure?

Comment by John Malconian [ 11/Apr/19 ]

The log was truncated and useless, but for completeness:

2019-04-11 01:48:33,511 INFO  jFactory$Slf4jLogger [LOCAL] [dev] [3.11] Prefer IPv4 stack is true, prefer IPv6 addresses is false
2019-04-11 01:48:33,521 INFO  jFactory$Slf4jLogger [LOCAL] [dev] [3.11] Picked [10.36.1.35]:5701, using socket ServerSocket[addr=/0:0:0:0:0:0:0:0,localport=5701], bind any local is true
2019-04-11 01:48:33,533 INFO  jFactory$Slf4jLogger [10.36.1.35]:5701 [dev] [3.11] Hazelcast 3.11 (20181023 - 1500bbb) starting at [10.36.1.35]:5701
2019-04-11 01:48:33,533 INFO  jFactory$Slf4jLogger [10.36.1.35]:5701 [dev] [3.11] Copyright (c) 2008-2018, Hazelcast, Inc. All Rights Reserved.
2019-04-11 01:48:33,533 INFO  jFactory$Slf4jLogger [10.36.1.35]:5701 [dev] [3.11] A non-empty group password is configured for the Hazelcast member. Starting with Hazelcast version 3.8.2, members with the same group name, but with different group passwords (that do not use authentication) form a cluster. The group password configuration will be removed completely in a future release.
2019-04-11 01:48:33,704 INFO  jFactory$Slf4jLogger [10.36.1.35]:5701 [dev] [3.11] Backpressure is disabled
2019-04-11 01:48:34,065 INFO  jFactory$Slf4jLogger [10.36.1.35]:5701 [dev] [3.11] Creating MulticastJoiner
2019-04-11 01:48:34,255 INFO  jFactory$Slf4jLogger [10.36.1.35]:5701 [dev] [3.11] Starting 8 partition threads and 5 generic threads (1 dedicated for priority tasks)
2019-04-11 01:48:34,258 INFO  jFactory$Slf4jLogger [10.36.1.35]:5701 [dev] [3.11] Diagnostics disabled. To enable add -Dhazelcast.diagnostics.enabled=true to the JVM arguments.
2019-04-11 01:48:34,267 INFO  jFactory$Slf4jLogger [10.36.1.35]:5701 [dev] [3.11] [10.36.1.35]:5701 is STARTING
2019-04-11 01:48:37,048 INFO  jFactory$Slf4jLogger [10.36.1.35]:5701 [dev] [3.11] 
Members {size:1, ver:1} [
    Member [10.36.1.35]:5701 - 3a369bef-fe91-4036-85ab-582125595564 this
]
2019-04-11 01:48:37,068 INFO  jFactory$Slf4jLogger [10.36.1.35]:5701 [dev] [3.11] [10.36.1.35]:5701 is STARTED
2019-04-11 01:48:37,465 INFO  jFactory$Slf4jLogger [10.36.1.35]:5701 [dev] [3.11] Initializing cluster partition table arrangement...
2019-04-11 01:48:39,497 INFO  SQLConnectionHandler Received notice NoticeMessage(fields=Map(Line -> 209, File -> parse_utilcmd.c, SQLSTATE -> 42P07, Routine -> transformCreateStmt, V -> NOTICE, Message -> relation "env" already exists, skipping, Severity -> NOTICE))
2019-04-11 01:48:39,503 INFO  SQLConnectionHandler Received notice NoticeMessage(fields=Map(Line -> 805, File -> index.c, SQLSTATE -> 42P07, Routine -> index_create, V -> NOTICE, Message -> relation "name" already exists, skipping, Severity -> NOTICE))
2019-04-11 01:48:39,510 INFO  SQLConnectionHandler Received notice NoticeMessage(fields=Map(Line -> 209, File -> parse_utilcmd.c, SQLSTATE -> 42P07, Routine -> transformCreateStmt, V -> NOTICE, Message -> relation "deployments" already exists, skipping, Severity -> NOTICE))
2019-04-11 01:48:39,512 INFO  SQLConnectionHandler Received notice NoticeMessage(fields=Map(Line -> 805, File -> index.c, SQLSTATE -> 42P07, Routine -> index_create, V -> NOTICE, Message -> relation "inst_id" already exists, skipping, Severity -> NOTICE))
2019-04-11 01:48:39,514 INFO  SQLConnectionHandler Received notice NoticeMessage(fields=Map(Line -> 209, File -> parse_utilcmd.c, SQLSTATE -> 42P07, Routine -> transformCreateStmt, V -> NOTICE, Message -> relation "tenants" already exists, skipping, Severity -> NOTICE))
2019-04-11 01:48:39,515 INFO  SQLConnectionHandler Received notice NoticeMessage(fields=Map(Line -> 805, File -> index.c, SQLSTATE -> 42P07, Routine -> index_create, V -> NOTICE, Message -> relation "tenant_id" already exists, skipping, Severity -> NOTICE))
2019-04-11 01:48:39,517 INFO  SQLConnectionHandler Received notice NoticeMessage(fields=Map(Line -> 209, File -> parse_utilcmd.c, SQLSTATE -> 42P07, Routine -> transformCreateStmt, V -> NOTICE, Message -> relation "modules" already exists, skipping, Severity -> NOTICE))
2019-04-11 01:48:39,518 INFO  SQLConnectionHandler Received notice NoticeMessage(fields=Map(Line -> 805, File -> index.c, SQLSTATE -> 42P07, Routine -> index_create, V -> NOTICE, Message -> relation "module_id" already exists, skipping, Severity -> NOTICE))
Comment by Ian Hardy [ 11/Apr/19 ]

Attaching the Okapi log and the first handful of lines from mod-orders and mod-orders-storage. It looks like Okapi stopped logging shortly after it started, and the modules started throwing errors on startup before 9 UTC, when Piotr ran his test.

okapi.txt mod-orders-storage.txt mod-orders.txt

Comment by Piotr Kalashuk [ 11/Apr/19 ]

The test failed with a 500 response code in the middle of the run:

The reason is that 39607 piece records were created for the order, and they cannot all be retrieved at once. The request

http://folio-testing-backend01.aws.indexdata.com:9130/orders/receiving-history?limit=39607&query=purchaseOrderId=6c31289e-181e-4459-bf00-cd412063ca30 and receivingStatus==Expected

fails with a generic error:

{
    "errors": [
        {
            "message": "Generic error",
            "code": "genericError",
            "parameters": [],
            "cause": null
        }
    ],
    "total_records": 1
}
Comment by John Malconian [ 11/Apr/19 ]

Looks like the second test run reproduced the initial run's results/error. The Okapi logging issue is separate from this ( OKAPI-722 Closed ). However, the lack of logging makes this issue difficult to troubleshoot.

Comment by Marc Johnson [ 11/Apr/19 ]

John Malconian Piotr Kalashuk Did the second run cause the entire system to be unresponsive? To me, that is very different from some specific requests responding with 500 errors.

Comment by Piotr Kalashuk [ 11/Apr/19 ]

A rough count of the requests:
The first 2 PO Lines have static quantities (4 and 3). Starting from the third PO Line, the quantity of expected piece records increases arithmetically. The template is

{
  "locations": [
    {
      "locationId": "fcd64ce1-6995-48f0-840e-89ffa2288371",
      "quantityPhysical": 1
    },
    {
      "locationId": "f34d27c6-a8eb-461b-acd6-5dea81771e70",
      "quantityPhysical": 2
    }
  ]
}

The code that increases the number of pieces for each subsequent PO Line is

for (let i = 0; i < poLinesQuantity - 2; i++) {
    let newLine = utils.copyJsonObj(physLine);

    // Increase unit of quantity based on index for each PO Line
    newLine.locations.forEach(loc => loc.quantityPhysical += i);
    newLine.cost.quantityPhysical = newLine.locations.map(loc => loc.quantityPhysical).reduce((a, b) => a + b, 0);
    order.compositePoLines.push(newLine);
}

The resulting quantities are:

PO Line #   Total quantity
3           1+2 = 3
4           2+3 = 5
5           3+4 = 7
6           4+5 = 9
...         ...
200         198+199 = 397

So the total is 39607.
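The total can be sanity-checked by replaying the progression from the loop above in a standalone sketch (the 4 and 3 are the static quantities of the first two PO Lines):

```javascript
// Recompute the expected piece total for a given number of PO Lines.
// PO Lines 1-2 have static quantities 4 and 3; PO Line i+3 gets
// (1+i) + (2+i) pieces from its two locations, as in the loop above.
function totalPieces(poLinesQuantity) {
  let total = 4 + 3; // the first two PO Lines
  for (let i = 0; i < poLinesQuantity - 2; i++) {
    total += (1 + i) + (2 + i); // two locations per PO Line
  }
  return total;
}

console.log(totalPieces(200)); // 39607
```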

When an order is created, mod-orders sends a bunch of requests to mod-orders-storage: one request to create the order, and separate requests to create the PO Lines and pieces. So for the order I used for testing:

Type of request   Quantity
Create order      1
Create PO Line    200
Create piece      39607
Update order      1

In addition to the above, the Create big order request from the Postman collection sends requests to inventory to verify that no items were created, i.e. 200 more requests. This means that once this Postman test request fully completes, the number of requests to Okapi is around 40 000.

All of these requests are sent asynchronously, i.e. almost all of them arrive at around the same time.

The request to delete the order and all related entities, which is sent by the Postman collection, is even heavier.

Type of request                                 Quantity
Get order                                       1
Get pieces for each PO Line                     200
Delete pieces                                   39607
Verify that still no items exist in inventory   200
Delete PO Line                                  200
Delete order                                    1

Once this Delete order and all related records Postman test request fully completes, the number of requests to Okapi is around 40 200.

So the rough total number of requests, based on the tables above, is 80 218.
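The per-table counts can be added up in a quick sketch (all numbers are taken from the two tables above):

```javascript
// Rough request totals for the create run and the delete run.
const createRun = {
  createOrder: 1,
  createPoLine: 200,
  createPiece: 39607,
  updateOrder: 1,
  inventoryCheck: 200, // verification requests sent by the Tests script
};
const deleteRun = {
  getOrder: 1,
  getPiecesPerPoLine: 200,
  deletePiece: 39607,
  inventoryCheck: 200,
  deletePoLine: 200,
  deleteOrder: 1,
};

const sum = (obj) => Object.values(obj).reduce((a, b) => a + b, 0);

console.log(sum(createRun));                  // 40009 (~40 000)
console.log(sum(deleteRun));                  // 40209 (~40 200)
console.log(sum(createRun) + sum(deleteRun)); // 80218
```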

Comment by Piotr Kalashuk [ 11/Apr/19 ]

Please note that this is an abnormal test, because I think no one is going to create an order for 40 000 resources...

Comment by John Malconian [ 11/Apr/19 ]

I've provisioned http://folio-1960.aws.indexdata.com that we can dedicate to troubleshooting this issue. Okapi is at version 2.26.1. Okapi logging should not be an issue.

  • mod-orders-storage:5.0.0-SNAPSHOT.83
  • mod-orders:5.0.0-SNAPSHOT.146
Comment by Piotr Kalashuk [ 11/Apr/19 ]

John Malconian the test was run ~30-35 minutes ago and the "delete" request hung. Now Okapi on that environment does not respond even to a login request.

Comment by John Malconian [ 12/Apr/19 ]

I uploaded the Okapi/module logs to this issue; okapi-04-11.tar.gz contains several Okapi logs. I think things start to go wonky in okapi.log.4. A couple of initial observations:

1. mod-audit-filter starts to spit out a bunch of Java exceptions in the middle of that log file that look something like:

Apr 11, 2019 4:37:35 PM org.folio.audit.service.AuditFilterService
WARNING: Error to extract auditData: 
java.lang.NullPointerException
	at org.folio.audit.service.AuditFilterService.getBodyAsJsonObject(AuditFilterService.java:277)
	at org.folio.audit.service.AuditFilterService.collectAuditData(AuditFilterService.java:139)
	at org.folio.audit.service.AuditFilterService.prePostHandler(AuditFilterService.java:54)
	at io.vertx.ext.web.impl.RouteImpl.handleContext(RouteImpl.java:225)
	at io.vertx.ext.web.impl.RoutingContextImplBase.iterateNext(RoutingContextImplBase.java:120)
	at io.vertx.ext.web.impl.RoutingContextImpl.next(RoutingContextImpl.java:133)
	at io.vertx.ext.web.handler.impl.BodyHandlerImpl$BHandler.doEnd(BodyHandlerImpl.java:231)
	at io.vertx.ext.web.handler.impl.BodyHandlerImpl$BHandler.end(BodyHandlerImpl.java:211)
	at io.vertx.ext.web.handler.impl.BodyHandlerImpl.lambda$handle$0(BodyHandlerImpl.java:74)
	at io.vertx.core.http.impl.HttpServerRequestImpl.handleEnd(HttpServerRequestImpl.java:417)
	at io.vertx.core.http.impl.Http1xServerConnection.handleEnd(Http1xServerConnection.java:482)
	at io.vertx.core.http.impl.Http1xServerConnection.processMessage(Http1xServerConnection.java:456)
	at io.vertx.core.http.impl.Http1xServerConnection.handleMessage(Http1xServerConnection.java:144)
	at io.vertx.core.http.impl.HttpServerImpl$ServerHandlerWithWebSockets.handleMessage(HttpServerImpl.java:728)
	at io.vertx.core.http.impl.HttpServerImpl$ServerHandlerWithWebSockets.handleMessage(HttpServerImpl.java:619)
	at io.vertx.core.net.impl.VertxHandler.lambda$channelRead$1(VertxHandler.java:146)
	at io.vertx.core.impl.ContextImpl.lambda$wrapTask$2(ContextImpl.java:337)
	at io.vertx.core.impl.ContextImpl.executeFromIO(ContextImpl.java:195)
	at io.vertx.core.net.impl.VertxHandler.channelRead(VertxHandler.java:144)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310)
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:284)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1359)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:935)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:141)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:886)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:748)

2. It seems as though Okapi "freezes up", but this is not the case. Okapi still receives and logs requests:

402713/authn REQ 10.36.20.124:41316 diku POST /authn/login mod-authtoken-2.2.0-SNAPSHOT.44 mod-audit-filter-0.0.6-SNAPSHOT.27 mod-login-5.2.0-SNAPSHOT.52 mod-audit-filter-0.0.6-SNAPSHOT.27

The issue (or one of them) appears to be mod-authtoken. After restarting the mod-authtoken module, Okapi operations resume normally. There is nothing of interest in the mod-authtoken logs; however, when shutting down the module, the following error is logged:

java.lang.IllegalStateException
	at io.vertx.core.net.impl.VertxEventLoopGroup.next(VertxEventLoopGroup.java:37)
	at io.vertx.core.net.impl.VertxEventLoopGroup.register(VertxEventLoopGroup.java:53)
	at io.netty.bootstrap.ServerBootstrap$ServerBootstrapAcceptor.channelRead(ServerBootstrap.java:255)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1359)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:935)
	at io.netty.channel.nio.AbstractNioMessageChannel$NioMessageUnsafe.read(AbstractNioMessageChannel.java:93)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:451)
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:886)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:748)

I'll keep this system available for additional testing/troubleshooting.

Comment by Jakub Skoczen [ 06/May/19 ]

John Malconian Kurt Nordstrom Guys, what are the next steps for this issue? Is mod-authtoken your best guess for the cause? If so, should we create a ticket for mod-authtoken?

Comment by John Malconian [ 17/May/19 ]

Jakub Skoczen Would it be a good idea to add a test for this to the perf-test environment?

Comment by John Malconian [ 24/May/19 ]

Also can be re-tested at https://folio-snapshot-load.aws.indexdata.com which is rebuilt daily. Okapi can be reached directly at https://folio-snapshot-load-okapi.aws.indexdata.com

Comment by Jakub Skoczen [ 03/Jun/19 ]

John Malconian I am going to close this issue – please create a new one if there is something the developers should start investigating.

Generated at Thu Feb 08 23:17:10 UTC 2024 using Jira 1001.0.0-SNAPSHOT#100246-sha1:7a5c50119eb0633d306e14180817ddef5e80c75d.