Kafka Thread Blocked Timeout (KCache)

Description

Overview:
Roughly every hour, mod-inventory's connections to Kafka appear to time out, and the following error is logged:
Thread Thread[vert.x-worker-thread-0,5,main] has been blocked for 651992 ms, time limit is 60000 ms
See the attached screenshot for full error stack trace.

When this happens, there's a spike in both mod-inventory's and Kafka's CPU usage; the spike lasts about 20 minutes while the module re-establishes connections and re-registers with partitions and topics. This is disruptive to performance, especially during a long operation like a data import.

Please see the attached diagrams for more info.
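For context, the 60000 ms limit in the warning matches Vert.x's default blocked-thread check for worker threads (maxWorkerExecuteTime). If the blocking work were legitimate, the checker thresholds could be raised at launch; the snippet below is only an illustrative sketch, assuming the module is started via the Vert.x Launcher (which reads system properties prefixed with vertx.options.), and the values shown are examples, not a recommended fix for this bug:

```
# Illustrative only: raise the blocked-thread warning thresholds.
# In recent Vert.x versions maxWorkerExecuteTime is in nanoseconds;
# blockedThreadCheckInterval is in milliseconds.
JAVA_OPTS="$JAVA_OPTS \
  -Dvertx.options.maxWorkerExecuteTime=120000000000 \
  -Dvertx.options.blockedThreadCheckInterval=5000"
```

Raising the threshold only silences the warning; the underlying stall in re-establishing the Kafka connections would remain.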

Steps to Reproduce:

  1. Leave the module running for several hours and observe the behavior in the logs or in the CPU graphs.

Expected Results:
No thread-blocked errors; Kafka connections remain stable.
Actual Results:
The warning above is logged roughly every hour, accompanied by CPU spikes in mod-inventory and Kafka and about 20 minutes of reconnection.

CSP Request Details

None

CSP Rejection Details

None

Potential Workaround

None

Attachments

6 attachments:
  • 19 Apr 2021, 01:37 PM
  • 19 Apr 2021, 01:32 PM
  • 19 Apr 2021, 01:31 PM
  • 08 Apr 2021, 08:58 PM
  • 08 Apr 2021, 08:57 PM
  • 08 Apr 2021, 08:52 PM


Activity


Ann-Marie BreauxSeptember 17, 2021 at 8:06 PM

Hi! My understanding is that EBSCO FSE is making some changes to the Cornell configuration, and that most of the issues stem from something going on with Inventory. I don't have all the details, but copying in case she can provide more info.

Jenn ColtSeptember 17, 2021 at 4:58 PM

We are experiencing this issue repeatedly at Cornell this week as we attempt to load a large number of MARC records. We have divided the records into files of 1000 to try to avoid taxing the system; however, this leaves us with a large number of loads to complete. We see this thread-blocked error:

io.vertx.core.VertxException: Thread blocked
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[ms.jar:?]
16:51:29 [] [] [] [] WARN ? Thread Thread[vert.x-worker-thread-15,5,main] has been blocked for 486202 ms, time limit is 60000 ms
 
We see these errors as well around the same time:
org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s) for icnl.Default.fs00001034.DI_SRS_MARC_BIB_RECORD_CREATED-0:120000 ms has passed since batch creation
 
We also see:
16:41:08.008 [vert.x-worker-thread-19] ERROR EventHandlingUtil [93477433eqId] DI_SRS_MARC_BIB_RECORD_CREATED_Producer write error for event DI_SRS_MARC_BIB_RECORD_CREATED:
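The 120000 ms in the TimeoutException corresponds to the Kafka producer's delivery.timeout.ms default: the batch sat for over two minutes without being successfully sent, which is consistent with the broker connection being unavailable while the worker thread was blocked. As a hedged sketch (the module's actual producer configuration is not shown in this ticket, and these are the stock defaults, not a recommended fix), the relevant producer settings are:

```properties
# Kafka producer settings relevant to the "Expiring 1 record(s)" error.
# 120000 ms is the delivery.timeout.ms default; raising it only delays the
# expiry, it does not fix the stalled connection.
delivery.timeout.ms=120000   # upper bound on time from send() to success/failure
request.timeout.ms=30000     # per-request wait for a broker response
max.block.ms=60000           # how long send()/metadata fetches may block the caller
```

Note that max.block.ms defaults to 60000 ms, the same as the Vert.x worker time limit, so a producer blocking on metadata during a broker outage will trip the blocked-thread checker.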
 

Ann-Marie BreauxApril 23, 2021 at 2:20 PM

Issue didn't come up with several imports after this fix, so OK to close

Martin TranApril 19, 2021 at 1:38 PM

Attached are two graphs showing Kafka topic message rates (MessagesRates-with/out-events_cache). At 4:20, the graph without events_cache shows a big dip in all messages, but that's because events_cache's messages were overwhelming the system, as seen in the messages-with-events_cache graph, where the message rate was nearly 1,000 per second.

In the zoomed-out event_cache-message-rates graph, there are spikes in the message rate before and after the data import; these occurred on their own, as described above.

Done

Details

Development Team

Folijet

Fix versions

Release

R1 2021 Bug Fix


Created April 8, 2021 at 8:51 PM
Updated September 17, 2021 at 8:06 PM
Resolved April 22, 2021 at 2:17 PM
