Kafka Thread Blocked Timeout (KCache)

Description

Overview:
Roughly every hour, mod-inventory's connections to Kafka appear to time out, and the following error is logged:
Thread Thread[vert.x-worker-thread-0,5,main] has been blocked for 651992 ms, time limit is 60000 ms
See the attached screenshot for full error stack trace.

When this happens, there's a spike in both mod-inventory's and Kafka's CPU usage; the spike lasts about 20 minutes while the module re-establishes connections and re-registers with partitions and topics. This is disruptive to performance, especially during a long operation like a data import.

Please see the attached diagrams for more info.
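For context, the 60000 ms limit in the warning matches Vert.x's default blocked-thread check for worker threads (maxWorkerExecuteTime). If the blocking work were legitimate, the checker thresholds could be raised at launch; the snippet below is only an illustrative sketch, assuming the module is started via the Vert.x Launcher (which reads system properties prefixed with vertx.options.), and the values shown are examples, not a recommended fix for this bug:

```
# Illustrative only: raise the blocked-thread warning thresholds.
# In recent Vert.x versions maxWorkerExecuteTime is in nanoseconds;
# blockedThreadCheckInterval is in milliseconds.
JAVA_OPTS="$JAVA_OPTS \
  -Dvertx.options.maxWorkerExecuteTime=120000000000 \
  -Dvertx.options.blockedThreadCheckInterval=5000"
```

Raising the threshold only silences the warning; the underlying stall in re-establishing the Kafka connections would remain.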

Steps to Reproduce:

  1. Leave the module running for several hours and observe the behavior in the logs or in the CPU graphs.

Expected Results:
No thread-blocked errors; Kafka connections remain stable.
Actual Results:
The warning above is logged roughly every hour, accompanied by CPU spikes in mod-inventory and Kafka and about 20 minutes of reconnection.

CSP Request Details

None

CSP Rejection Details

None

Potential Workaround

None

Attachments

6 attachments:
  • 19 Apr 2021, 01:37 PM
  • 19 Apr 2021, 01:32 PM
  • 19 Apr 2021, 01:31 PM
  • 08 Apr 2021, 08:58 PM
  • 08 Apr 2021, 08:57 PM
  • 08 Apr 2021, 08:52 PM


Activity


Ann-Marie BreauxSeptember 17, 2021 at 8:06 PM

Hi! My understanding is that EBSCO FSE is making some changes to the Cornell configuration, and that most of the issues stem from something going on with Inventory. I don't have all the details, but copying in case she can provide more info.

Jenn ColtSeptember 17, 2021 at 4:58 PM

We are experiencing this issue repeatedly at Cornell this week as we attempt to load a large number of MARC records. We have divided the records into files of 1000 to try to avoid taxing the system; however, this leaves us with a large number of loads to complete. We see this thread-blocked error:

io.vertx.core.VertxException: Thread blocked
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[ms.jar:?]
16:51:29 [] [] [] [] WARN ? Thread Thread[vert.x-worker-thread-15,5,main] has been blocked for 486202 ms, time limit is 60000 ms
 
We see these errors as well around the same time:
org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s) for icnl.Default.fs00001034.DI_SRS_MARC_BIB_RECORD_CREATED-0:120000 ms has passed since batch creation
 
We also see:
16:41:08.008 [vert.x-worker-thread-19] ERROR EventHandlingUtil [93477433eqId] DI_SRS_MARC_BIB_RECORD_CREATED_Producer write error for event DI_SRS_MARC_BIB_RECORD_CREATED:
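The 120000 ms in the TimeoutException corresponds to the Kafka producer's delivery.timeout.ms default: the batch sat for over two minutes without being successfully sent, which is consistent with the broker connection being unavailable while the worker thread was blocked. As a hedged sketch (the module's actual producer configuration is not shown in this ticket, and these are the stock defaults, not a recommended fix), the relevant producer settings are:

```properties
# Kafka producer settings relevant to the "Expiring 1 record(s)" error.
# 120000 ms is the delivery.timeout.ms default; raising it only delays the
# expiry, it does not fix the stalled connection.
delivery.timeout.ms=120000   # upper bound on time from send() to success/failure
request.timeout.ms=30000     # per-request wait for a broker response
max.block.ms=60000           # how long send()/metadata fetches may block the caller
```

Note that max.block.ms defaults to 60000 ms, the same as the Vert.x worker time limit, so a producer blocking on metadata during a broker outage will trip the blocked-thread checker.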
 

Ann-Marie BreauxApril 23, 2021 at 2:20 PM

Issue didn't come up with several imports after this fix, so OK to close

Martin TranApril 19, 2021 at 1:38 PM

Attached are two graphs showing Kafka topic message rates (MessagesRates-with/out-events_cache). At 4:20, the graph without events_cache shows a big dip in all messages, but that's because events_cache's messages were overwhelming the system, as seen in the messages-with-events_cache graph, where the message rate was nearly 1,000 per second.

In the zoomed-out event_cache-message-rates graph, there are spikes in the message rate before and after the data import; these occurred on their own, as described above.

Done

Details

Development Team

Folijet

Fix versions

Release

R1 2021 Bug Fix


Created April 8, 2021 at 8:51 PM
Updated September 17, 2021 at 8:06 PM
Resolved April 22, 2021 at 2:17 PM
