[FOLIO-3049] Intermittent mod-source-record-storage/mod-pubsub initialization failure Created: 01/Mar/21  Updated: 23/Apr/21  Resolved: 20/Apr/21

Status: Closed
Project: FOLIO
Components: None
Affects versions: None
Fix versions: None

Type: Bug Priority: P2
Reporter: Wayne Schneider Assignee: David Crossley
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original estimate: Not Specified

Attachments: File fsl-852-20210323.tar.gz    
Issue links:
Blocks
is blocked by MODPUBSUB-155 Intermittent module initialization fa... Closed
Sprint: DevOps Sprint 109, DevOps Sprint 110, DevOps Sprint 111, DevOps Sprint 112
Development Team: FOLIO DevOps

 Description   

In our Jenkins reference builds, we are getting intermittent failures initializing mod-source-record-storage. The build fails with the error:

Tenant operation failed for module mod-source-record-storage-4.2.0-SNAPSHOT.263: org.folio.util.pubsub.exceptions.ModuleRegistrationException: Module's subscribers were not registered in PubSub. HTTP status: 422

We are seeing this error condition both in AWS reference builds and in Vagrant box builds. Unfortunately, the error does not occur consistently.

Oleksii Kuzminov suggests:

It can be related to modules' setup order. PubSub should be ready first to be able register other modules

FWIW, in the reference builds, mod-pubsub is deployed first (because it is first in the dependency graph). Perhaps it is not always ready to process requests by the time mod-source-record-storage attempts to register.



 Comments   
Comment by David Crossley [ 18/Mar/21 ]

This intermittent behaviour has been has been happening most days on the reference environment daily builds, for just one of the set of envs. Mostly for folio-snapshot-core, but sometimes folio-snapshot-load, and once on folio-testing-core-backend.

Comment by David Crossley [ 18/Mar/21 ]

Oh, why did i utter those words. Murphy hit me.

This problem happened on folio-snapshot-core during today's normal daily builds. As usual, a re-run was okay.

Then later today i was trying to finalise some testing on an unrelated ticket, and it hit me three times in a row on folio-snapshot-test. These with mod-patron-blocks. A following run on folio-testing-test was okay. Then another run on folio-snapshot-test failed (this time with mod-source-record-storage). Give up.

Comment by Jakub Skoczen [ 18/Mar/21 ]

Oleksii Kuzminov Since these modules depend on mod-pubsub Okapi will ensure that mod-pubsub is initialized first. Is it possible that mod-pubsub returns from the 'init' call too early, before it is properly initizialized?

Comment by Wayne Schneider [ 19/Mar/21 ]

Bumping this to P2 as this is now blocking other work (aside from just being very very frustrating).

Comment by David Crossley [ 19/Mar/21 ]

Today the daily reference environment builds with both folio-snapshot-core and folio-snapshot-load (mod-patron-blocks) failed, and re-run was okay for each.

Comment by Wayne Schneider [ 23/Mar/21 ]

Unable to reproduce the problem today. There have been some changes since last week in mod-pubsub and presumably other affected modules, as well. Putting this in review (though there is no code to review); I'd be very surprised if the problem just went away, though.

Comment by David Crossley [ 24/Mar/21 ]

Ha, Murphy strikes again. I have not noticed any on the reference environments for the past four days. However, there was an out-of-band rebuild earlier today of folio-snapshot-load/852 which failed (this time with mod-patron-blocks).
Attached fsl-852-20210323.tar.gz

Comment by Wayne Schneider [ 24/Mar/21 ]

Thanks, David Crossley, now we are getting somewhere.

This appears to be the relevant part of the pubsub log from when mod-patron-blocks is initializing:

17:53:18 [] [] [] [] INFO  ingModuleServiceImpl Event type ITEM_AGED_TO_LOST does not exist, creating a temporary definition
17:53:18 [] [] [] [] INFO  ingModuleServiceImpl Event type ITEM_CLAIMED_RETURNED does not exist, creating a temporary definition
17:53:18 [] [] [] [] ERROR DbUtil               Rollback transaction
io.vertx.pgclient.PgException: { "message": "insert or update on table \"messaging_module\" violates foreign key constraint \"fk_event_type_id\"", "severity": "ERROR", "code": "23503", "detail": "Key (event_type_id)=(ITEM_CLAIMED_RETURNED) is not present in table \"event_descriptor\".", "file": "ri_triggers.c", "line": "3255", "routine": "ri_ReportViolation", "schema": "pubsub_config", "table": "messaging_module", "constraint": "fk_event_type_id" }
	at io.vertx.pgclient.impl.codec.ErrorResponse.toException(ErrorResponse.java:31) ~[mod-pubsub-server-fat.jar:?]
	at io.vertx.pgclient.impl.codec.QueryCommandBaseCodec.handleErrorResponse(QueryCommandBaseCodec.java:57) ~[mod-pubsub-server-fat.jar:?]
	at io.vertx.pgclient.impl.codec.ExtendedQueryCommandCodec.handleErrorResponse(ExtendedQueryCommandCodec.java:90) ~[mod-pubsub-server-fat.jar:?]
	at io.vertx.pgclient.impl.codec.PgDecoder.decodeError(PgDecoder.java:236) ~[mod-pubsub-server-fat.jar:?]
	at io.vertx.pgclient.impl.codec.PgDecoder.decodeMessage(PgDecoder.java:122) [mod-pubsub-server-fat.jar:?]
	at io.vertx.pgclient.impl.codec.PgDecoder.channelRead(PgDecoder.java:102) [mod-pubsub-server-fat.jar:?]
	at io.netty.channel.CombinedChannelDuplexHandler.channelRead(CombinedChannelDuplexHandler.java:251) [mod-pubsub-server-fat.jar:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [mod-pubsub-server-fat.jar:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [mod-pubsub-server-fat.jar:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) [mod-pubsub-server-fat.jar:?]
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) [mod-pubsub-server-fat.jar:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [mod-pubsub-server-fat.jar:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [mod-pubsub-server-fat.jar:?]
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) [mod-pubsub-server-fat.jar:?]
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163) [mod-pubsub-server-fat.jar:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714) [mod-pubsub-server-fat.jar:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650) [mod-pubsub-server-fat.jar:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576) [mod-pubsub-server-fat.jar:?]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) [mod-pubsub-server-fat.jar:?]
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) [mod-pubsub-server-fat.jar:?]
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [mod-pubsub-server-fat.jar:?]
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [mod-pubsub-server-fat.jar:?]
	at java.lang.Thread.run(Thread.java:834) [?:?]
17:53:18 [] [] [] [] ERROR ExceptionHelper      { "message": "insert or update on table \"messaging_module\" violates foreign key constraint \"fk_event_type_id\"", "severity": "ERROR", "code": "23503", "detail": "Key (event_type_id)=(ITEM_CLAIMED_RETURNED) is not present in table \"event_descriptor\".", "file": "ri_triggers.c", "line": "3255", "routine": "ri_ReportViolation", "schema": "pubsub_config", "table": "messaging_module", "constraint": "fk_event_type_id" }
17:53:18 [] [] [] [] INFO  LogUtil              org.folio.rest.RestVerticle start  invoking postPubsubEventTypesDeclareSubscriber
17:53:18 [] [] [] [] INFO  LogUtil              10.36.1.167:60996 POST /pubsub/event-types/declare/subscriber  HTTP_1_1 422 284 20 tid=diku Unprocessable Entity 
Comment by Wayne Schneider [ 24/Mar/21 ]

Blocking on MODPUBSUB-155 Closed until the developers can take a look.

Comment by David Crossley [ 15/Apr/21 ]

I am monitoring daily builds to ensure fixed.

Comment by David Crossley [ 19/Apr/21 ]

Not seen any occurrence yet. Will keep monitoring.

Comment by David Crossley [ 23/Apr/21 ]

Not seen this behaviour since the previous comment.

Generated at Thu Feb 08 23:25:14 UTC 2024 using Jira 1001.0.0-SNAPSHOT#100246-sha1:7a5c50119eb0633d306e14180817ddef5e80c75d.