TEST
- 1 CloudWatch CPU example
- 2 Sidecar analysis
- 3 Service logs analysis
- 4 DB analysis (performance insights)
- 5 RDS Performance Insights - Summary
- 5.1 Big picture
- 6 Key Findings
- 7 Correlation with your previous findings
- 7.1 From CPU analysis:
- 7.2 From logs:
- 7.3 From DB (this data):
- 8 What is REALLY happening (end-to-end)
- 9 Critical insight (this is the key)
- 10 Breaking down responsibility
- 11 Why vacuum is a big problem here
- 12 About that spike (~23:00)
- 13 Secondary spike (~09:00)
- 14 Final conclusion
- 15 What to check next (very practical)
- 16 One-line truth

CloudWatch CPU example

h2. CPU Utilization Summary (Active Services Only)
Analysis period: ~24 hours
Cluster: asptc-pvt
Only services with meaningful CPU activity are included. Idle/near-idle services are excluded.
h3. High-Impact Services
|| Service || Min CPU || Max CPU ||
| mod-linked-data-import-b | ~7 | ~1800 |
| mod-search-b | ~21 | ~1447 |
| mod-linked-data-b | ~5 | ~1100 |
| mod-source-record-storage-b | ~34 | ~896 |
| mod-source-record-manager-b | ~1.6 | ~945 |
h3. Mid-Tier Active Services
|| Service || Min CPU || Max CPU ||
| mod-data-export-worker-b | ~5 | ~786 |
| mod-inn-reach-b | ~25 | ~683 |
| mod-users-b | ~8 | ~653 |
| mod-scheduler-b | ~15 | ~609 |
| mod-inventory-storage-b | ~6 | ~584 |
| mod-inventory-b | ~4 | ~593 |
h3. Other Active Services
|| Service || Min CPU || Max CPU ||
| mod-circulation-b | ~13 | ~678 |
| mod-orders-storage-b | ~5 | ~662 |
| mod-quick-marc-b | ~6 | ~554 |
h2. Services with Significant CPU Spikes
Criteria:
- Large gap between min and max CPU
- Peak CPU > 100
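The criteria above can be applied mechanically to the figures in the tables. The following is a minimal Python sketch, not the method used to produce this page: the peak threshold of 100 comes from the criteria, while `MIN_GAP` is a hypothetical cutoff, since the text does not quantify a "large gap".

```python
# Approximate min/max CPU per service, copied from the tables above.
CPU_RANGES = {
    "mod-linked-data-import-b": (7, 1800),
    "mod-search-b": (21, 1447),
    "mod-linked-data-b": (5, 1100),
    "mod-source-record-storage-b": (34, 896),
    "mod-source-record-manager-b": (1.6, 945),
    "mod-data-export-worker-b": (5, 786),
    "mod-inn-reach-b": (25, 683),
    "mod-users-b": (8, 653),
    "mod-scheduler-b": (15, 609),
    "mod-inventory-storage-b": (6, 584),
    "mod-inventory-b": (4, 593),
    "mod-circulation-b": (13, 678),
    "mod-orders-storage-b": (5, 662),
    "mod-quick-marc-b": (6, 554),
}

PEAK_THRESHOLD = 100  # from the stated criteria
MIN_GAP = 500         # hypothetical: the source does not define "large gap"


def spiky_services(ranges, peak_threshold=PEAK_THRESHOLD, min_gap=MIN_GAP):
    """Return (service, gap) pairs meeting both criteria, largest gap first."""
    hits = [
        (name, high - low)
        for name, (low, high) in ranges.items()
        if high > peak_threshold and (high - low) >= min_gap
    ]
    return sorted(hits, key=lambda item: item[1], reverse=True)


for name, gap in spiky_services(CPU_RANGES):
    print(f"{name}: gap ~{gap:g}")
```

With the table figures and this assumed cutoff, every listed service passes both checks, so in this sketch it is the ranking by gap, not the threshold, that separates the largest spikes from the rest.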
h3. Major Spike Contributors
- mod-linked-data-import-b
- mod-search-b
- mod-linked-data-b
- mod-source-record-storage-b
- mod-source-record-manager-b
Four of these services (all except mod-source-record-manager-b) form a clear processing pipeline:
{code}
mod-linked-data-import
↓
mod-linked-data
↓
mod-source-record-storage
↓
mod-search
{code}
h3. Secondary Spike Contributors
- mod-data-export-worker-b
- mod-inn-reach-b
- mod-users-b
- mod-scheduler-b
- mod-inventory-b / mod-inventory-storage-b
These services are likely either:
- impacted by upstream load, or
- executing background/scheduled tasks
h2. Key Observations
- System load is burst-driven, not evenly distributed
- CPU spikes originate from data import workflows
- Downstream services (storage, search) amplify load significantly
- Peak load events trigger cluster-wide CPU spikes
h2. Summary
The system is dominated by import-driven CPU bursts, where:
- mod-linked-data-import-b is the primary driver
- mod-search-b and the storage modules act as amplifiers
- the result is a cascading load pattern across multiple services

Sidecar analysis

h1. Log Analysis: Authentication & Application Layer Behavior
Source: CloudWatch Logs Insights
Sample Size: 10,000 log entries
Scope: Sidecar / Keycloak / Vert.x execution flow
h2. Summary
The analyzed logs show a system under load with:
- High volume of authentication-related processing
- Repetitive async execution patterns
- Consistent occurrence of errors within the same execution paths
Key highlights:
- ~11% of logs are related to authentication/token handling
- 396 ERROR entries (~4%) and 238 WARN entries (~2.4%)
- Errors are not random: they occur in repeated async execution chains
- No DB or Kafka-related issues detected in this dataset
Overall, the system shows auth-driven load combined with recurring execution errors, both of which likely contribute to CPU spikes and response time degradation.
h2. Key Metrics
|| Metric || Value ||
| Total Logs | 10,000 |
| Authentication-related logs | 1,113 (~11%) |
| Errors (ERROR) | 396 (~4%) |
| Warnings (WARN) | 238 (~2.4%) |
| Timeout indicators | Not found |
| DB-related logs | Not found |
| Kafka-related logs | Not found |
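Counts like those in the table can be reproduced from exported log messages with a short script. The following is a minimal Python sketch under stated assumptions: the keyword pattern for "authentication-related" is hypothetical (the source does not say how those entries were matched), and the sample messages are synthetic placeholders, not real sidecar output.

```python
import re
from collections import Counter

# Hypothetical patterns: the source does not document the matching rules.
AUTH_PATTERN = re.compile(r"token|auth|keycloak", re.IGNORECASE)
LEVEL_PATTERN = re.compile(r"\b(ERROR|WARN)\b")


def summarize(messages):
    """Count total, ERROR, WARN, and auth-related messages."""
    counts = Counter(total=0)
    for message in messages:
        counts["total"] += 1
        level = LEVEL_PATTERN.search(message)
        if level:
            counts[level.group(1)] += 1
        if AUTH_PATTERN.search(message):
            counts["auth"] += 1
    return counts


def share(count, total):
    """Percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1) if total else 0.0


# Synthetic placeholder messages, for illustration only.
sample = [
    "INFO  Token validated",
    "ERROR Failed to handle Keycloak response",
    "WARN  Slow upstream response",
    "INFO  Request forwarded",
]
stats = summarize(sample)
for key in ("auth", "ERROR", "WARN"):
    print(f"{key}: {stats[key]} ({share(stats[key], stats['total'])}%)")
```

Applied to the reported counts, `share(1113, 10000)`, `share(396, 10000)`, and `share(238, 10000)` give the rounded percentages for the auth, ERROR, and WARN rows.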
h2. Response Time Observations
Although explicit response time values are not present in the logs, the following patterns point to degradation:
- Repeated async execution chains in the Vert.x event loop
- Errors occurring within request-processing paths
- High frequency of authentication operations
Implications:
- Increased request queuing in the event loop
- Slower request processing under load
- Elevated tail latency (p95/p99)
- Additional latency caused by retries or failed executions
h2. Key Findings
h3. 1. Authentication Layer is a Major Load Driver
- ~11% of logs contain authentication/token-related messages
- This indicates frequent interaction with Keycloak
Likely causes:
- token validation per request
- insufficient caching
- repeated introspection
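If insufficient caching is indeed a cause, one common mitigation is to cache validation results for a short time so that repeated requests with the same token do not each trigger an upstream call. The following is a minimal Python sketch of that idea, not the sidecar's implementation: `TokenCache`, `ttl_seconds`, and `fake_validate` are hypothetical names, and `validate` stands in for whatever call reaches Keycloak.

```python
import time


class TokenCache:
    """Caches token validation results for a short time (illustrative only)."""

    def __init__(self, validate, ttl_seconds=60.0, clock=time.monotonic):
        self._validate = validate  # hypothetical callable: token -> result
        self._ttl = ttl_seconds    # assumed value; keep below the token lifetime
        self._clock = clock
        self._entries = {}         # token -> (expires_at, result); unbounded here

    def check(self, token):
        now = self._clock()
        entry = self._entries.get(token)
        if entry is not None and entry[0] > now:
            return entry[1]        # cache hit: no upstream call
        result = self._validate(token)
        self._entries[token] = (now + self._ttl, result)
        return result


# Usage with a stand-in validator that records upstream calls.
upstream_calls = []


def fake_validate(token):
    upstream_calls.append(token)
    return {"active": True}


cache = TokenCache(fake_validate, ttl_seconds=30.0)
cache.check("token-a")
cache.check("token-a")
print(len(upstream_calls))  # one upstream call for two checks
```

The trade-off is that a revoked token stays accepted until its cache entry expires, so the TTL is a security decision as much as a performance one.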
h3. 2. Repeated Error Execution Pattern
The most frequent repeated log entries are stack trace lines such as:
{code}
at org.folio.sidecar.integration.keycloak.KeycloakService.lambda$handleResponse
at io.vertx.core.impl.future.FutureImpl$1.onSuccess
at io.netty.channel.nio.NioEventLoop.run
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks
{code}
These lines appear ~238 times identically, indicating:
- Errors are occurring in the same execution path
- This path is triggered repeatedly under load
- The issue is systemic, not isolated
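Repetition counts like this can be obtained by tallying identical stack frame lines. The following is a minimal Python sketch, assuming the logs are available as plain text and that frames start with `at `; the function name is hypothetical and the `ERROR request failed` lines in the sample are placeholders, while the frame lines are taken from the excerpt above.

```python
from collections import Counter


def top_repeated_frames(log_text, limit=5):
    """Return the most frequent stack frame lines with their counts."""
    frames = [
        line.strip()
        for line in log_text.splitlines()
        if line.strip().startswith("at ")
    ]
    return Counter(frames).most_common(limit)


sample = """\
ERROR request failed
    at org.folio.sidecar.integration.keycloak.KeycloakService.lambda$handleResponse
    at io.vertx.core.impl.future.FutureImpl$1.onSuccess
ERROR request failed
    at org.folio.sidecar.integration.keycloak.KeycloakService.lambda$handleResponse
    at io.vertx.core.impl.future.FutureImpl$1.onSuccess
"""

for frame, count in top_repeated_frames(sample):
    print(count, frame)
```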
h3. 3. Exact Error Characteristics
From the dataset:
- 396 ERROR-level logs are present
- Errors are associated with:
-- Keycloak response handling
-- Vert.x async processing
-- Netty event loop execution
Observed behavior:
- Errors produce repeated stack traces (same methods)
- No diversity of error types, which indicates a single dominant failure path
- Errors are embedded in async callbacks (Future/Handler chains)
Interpretation:
- The failure likely occurs during:
-- authentication response processing
-- or handling of Keycloak responses
- Errors are continuously triggered during load, with no recovery between executions
h3. 4. Error Impact on System Behavior
The detected errors contribute to system degradation in multiple ways:
- Each error introduces additional processing overhead
- Errors in the async flow may trigger:
-- retries
-- reprocessing
-- additional callbacks
This leads to:
- increased CPU utilization
- additional load on the event loop
- longer request processing times
Effectively, errors amplify load instead of failing fast.
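One way to fail fast is a circuit breaker that stops calling an upstream after repeated failures and rejects requests immediately during a cool-down. The following is a minimal Python sketch of that pattern for illustration only: `CircuitBreaker`, `CircuitOpenError`, and the threshold values are hypothetical and do not describe how the sidecar currently behaves.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected without reaching the upstream."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_after=30.0, clock=time.monotonic):
        self._threshold = failure_threshold  # assumed value
        self._reset_after = reset_after      # assumed cool-down in seconds
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                raise CircuitOpenError("upstream marked unhealthy; failing fast")
            # Cool-down elapsed: allow one trial call (half-open state).
            self._opened_at = None
            self._failures = self._threshold - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self._threshold:
                self._opened_at = self._clock()
            raise
        self._failures = 0
        return result


# Usage: after two failures the third call is rejected without running.
attempts = []


def failing_upstream():
    attempts.append(1)
    raise ValueError("simulated upstream failure")


breaker = CircuitBreaker(failure_threshold=2, reset_after=30.0)
for _ in range(3):
    try:
        breaker.call(failing_upstream)
    except (ValueError, CircuitOpenError) as exc:
        print(type(exc).__name__)
```

The design choice here is that a rejected call costs almost nothing, so a failing dependency reduces work on the caller instead of multiplying it through retries.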
h3. 5. Event Loop Saturation
Frequent execution of:
- NioEventLoop.run
- SingleThreadEventExecutor.runAllTasks
- FutureImpl callbacks
Indicates:
- Heavy load on the Vert.x event loop
- Limited thread capacity handling many async operations
Combined with errors, the event loop processes both:
- normal requests
- error handling / retries
This increases queueing and latency.
h3. 6. No Evidence of Data Layer Issues
- No JDBC / SQL logs detected
- No Kafka producer/consumer logs
Conclusion:
- No bottleneck in the DB or messaging layer is visible in this dataset
- The issue appears isolated to:
-- authentication layer
-- async processing layer
h2. Observed Execution Flow
{code}
Incoming Requests
↓
Sidecar Authentication
↓
Keycloak Response Handling
↓
Async Processing (Vert.x)
↓
ERROR occurs (same execution path)
↓
Repeated execution / retry
↓
Event Loop Saturation
↓
Increased CPU + Response Time
{code}
h2. Key Points
- Errors are repetitive and concentrated in one execution path
- Main hotspot: KeycloakService.handleResponse
- Errors occur inside the async processing chain
- Event loop is handling both normal and failed executions
- Errors amplify load rather than reducing it
- Authentication flow is tightly coupled with the error path
- No evidence of DB/Kafka involvement
h2. Conclusion
The system behavior indicates that authentication response handling in the Keycloak integration is a critical failure point under load.
Characteristics:
- Repeated errors in the same execution path
- Continuous triggering under load
- No effective mitigation (e.g., caching or throttling)
- Errors increase system load instead of isolating failures
Results:
- Increased CPU utilization
- Event loop saturation
- Amplified request processing overhead
- Degraded response times (especially p95/p99)
h2. Recommended Focus Areas
- Investigate the cause of the errors in KeycloakService.handleResponse
- Analyze the exact error messages preceding the stack traces
- Validate token caching configuration
- Reduce authentication calls per request
- Review retry behavior in the async flow
- Increase Vert.x / Quarkus thread pool capacity
- Introduce monitoring for:
-- error rate over time
-- response time (p95/p99)
- Ensure failures fail fast instead of triggering repeated processing
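The two monitoring signals named above can be prototyped offline before wiring them into dashboards. The following is a minimal Python sketch, assuming response times and log events have already been exported; the function names, the nearest-rank percentile method, and the sample latencies are illustrative choices, not values from this analysis.

```python
import math
from collections import Counter


def percentile(values, pct):
    """Nearest-rank percentile (pct in 0-100) of a non-empty sequence."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def error_rate_per_minute(events):
    """events: iterable of (epoch_seconds, level); returns {minute_start: error fraction}."""
    totals, errors = Counter(), Counter()
    for timestamp, level in events:
        minute = int(timestamp // 60) * 60
        totals[minute] += 1
        if level == "ERROR":
            errors[minute] += 1
    return {minute: errors[minute] / totals[minute] for minute in sorted(totals)}


# Made-up latency samples in milliseconds, for illustration only.
latencies_ms = [120, 95, 110, 480, 105, 99, 101, 130, 2500, 115]
print("p95:", percentile(latencies_ms, 95), "p99:", percentile(latencies_ms, 99))
print(error_rate_per_minute([(0, "INFO"), (30, "ERROR"), (61, "INFO")]))
```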
Service logs analysis

h1. Log Analysis: Service Layer (Non-Sidecar)
Source: CloudWatch Logs Insights
Sample Size: 2,074 log entries
Scope: Application service (same timeframe as sidecar logs)
h2. Summary
The analyzed service logs show a stable and clean execution profile with:
- No errors or warnings detected
- No authentication-related overhead
- No DB or Kafka issues observed
- Low repetition and no hot execution paths
In contrast to the sidecar logs, the service itself behaves normally under the same load conditions.
h2. Key Metrics
|| Metric || Value ||
| Total Logs | 2,074 |
| Errors (ERROR) | 0 |
| Warnings (WARN) | 0 |
| Authentication-related logs | 0 |
| Timeout indicators | 0 |
| DB-related logs | 0 |
| Kafka-related logs | 0 |
h2. Response Time Observations
No direct response time metrics are present in the logs, but the behavior indicates:
- No signs of request queuing
- No async saturation patterns
- No retries or error-related delays
Implication:
Response times at the service level are expected to be:
- stable
- consistent
- not degraded under load
This contrasts strongly with the sidecar behavior, where response degradation is likely.
h2. Key Findings
h3. 1. No Errors or Failures Detected
- 0 ERROR logs
- 0 WARN logs
Interpretation:
- Service executes requests successfully
- No visible failure paths
- No retries or reprocessing
This confirms high stability at the service level.
h3. 2. No Authentication Overhead
- No auth/token-related logs detected
Interpretation:
- Authentication is handled outside the service (sidecar layer)
- Service is not directly impacted by Keycloak processing
This confirms the separation of concerns:
- sidecar = auth
- service = business logic
h3. 3. No Async Saturation Patterns
Unlike the sidecar logs, there are no repeated stack traces such as:
- Vert.x event loop execution
- Netty thread processing
- Future/async callback chains
Interpretation:
- No event loop pressure observed
- No indication of thread starvation
- No async bottlenecks
h3. 4. Minimal Repetition Pattern
Most repeated log:
{code}
ConsortiumTenantExecutor Changing context from cs00000int_0004 to cs00000int
{code}
- Appears only a few times (very low frequency)
- Represents normal tenant context switching
This is not a performance concern.
h3. 5. No Evidence of Data Layer Activity
- No JDBC / SQL logs
- No Kafka logs
Conclusion:
Service logs do not indicate:
- DB bottlenecks
- messaging issues
h2. Comparison with Sidecar Logs
|| Aspect || Sidecar Logs || Service Logs ||
| Errors | Present (~4%) | None |
| Auth activity | High (~11%) | None |
| Repetition | Very high (same stack traces) | Minimal |
| Async pressure | High (event loop saturation) | None |
| Response time impact | Likely degraded | Stable |
| Bottleneck location | Auth layer | Not present |
h2. Key Points
- Service layer is stable and not a bottleneck
- No errors, retries, or failure amplification
- No async or thread pressure observed
- No authentication overhead at service level
- Behavior is consistent with healthy application logic execution
h2. Conclusion
The service itself is not contributing to performance degradation.
Instead:
- All instability, errors, and load amplification seen in these logs originate from the sidecar (authentication layer)
- Service processes requests cleanly once they reach it
- Response time degradation is introduced before requests reach the service
The root cause is external to the service logic.
h2. Final Insight (Most Important)
{panel:title=Key Takeaway}
The system bottleneck is not in the service layer, but in the authentication flow (sidecar + Keycloak), which introduces errors, retries, and event loop saturation before requests reach the service.
{panel}
DB analysis (performance insights)

RDS Performance Insights - Summary
Big picture
- DB load (AAS) peaked at ~20 (near the max vCPU line)
- Load is dominated by:
-- CPU (green) initially
-- then locks + IO + vacuum-related waits
- After the peak, the system drops to a low steady state
- Later (~09:00), a smaller secondary spike occurs
Key Findings
1. Heavy INSERT workload into mod_search tables
Top queries are almost entirely:
INSERT INTO cs00000int_mod_search.subject
INSERT INTO cs00000int_mod_search.contributor