TEST

CloudWatch CPU example

metrics


h2. CPU Utilization Summary (Active Services Only)

Analysis period: ~24 hours
Cluster: asptc-pvt

Only services with meaningful CPU activity are included. Idle/near-idle services are excluded.


h3. 🔥 High-Impact Services

|| Service || Min CPU || Max CPU ||
| mod-linked-data-import-b | ~7 | ~1800 |
| mod-search-b | ~21 | ~1447 |
| mod-linked-data-b | ~5 | ~1100 |
| mod-source-record-storage-b | ~34 | ~896 |
| mod-source-record-manager-b | ~1.6 | ~945 |


h3. ⚙️ Mid-Tier Active Services

|| Service || Min CPU || Max CPU ||
| mod-data-export-worker-b | ~5 | ~786 |
| mod-inn-reach-b | ~25 | ~683 |
| mod-users-b | ~8 | ~653 |
| mod-scheduler-b | ~15 | ~609 |
| mod-inventory-storage-b | ~6 | ~584 |
| mod-inventory-b | ~4 | ~593 |


h3. 🧩 Other Active Services

|| Service || Min CPU || Max CPU ||
| mod-circulation-b | ~13 | ~678 |
| mod-orders-storage-b | ~5 | ~662 |
| mod-quick-marc-b | ~6 | ~554 |


h2. ⚡ Services with Significant CPU Spikes

Criteria:

  • Large gap between min and max CPU

  • Peak CPU > 100
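
The two criteria can be sketched as a small filter. This is only an illustration: the values are copied from the High-Impact Services table, and the `GAP_RATIO` cutoff for a "large gap" is an assumption, since the report does not define one.

```python
# (min, max) CPU per service, copied from the High-Impact Services table.
CPU_RANGE = {
    "mod-linked-data-import-b": (7, 1800),
    "mod-search-b": (21, 1447),
    "mod-linked-data-b": (5, 1100),
    "mod-source-record-storage-b": (34, 896),
    "mod-source-record-manager-b": (1.6, 945),
}

PEAK_THRESHOLD = 100  # from the criteria: peak CPU > 100
GAP_RATIO = 10        # assumption: "large gap" means max >= 10x min


def has_significant_spike(min_cpu, max_cpu,
                          peak_threshold=PEAK_THRESHOLD, gap_ratio=GAP_RATIO):
    """Return True when both spike criteria hold for one service."""
    return max_cpu > peak_threshold and max_cpu >= gap_ratio * min_cpu


# Services that meet both criteria, highest peak first.
spiky = sorted(
    (name for name, (lo, hi) in CPU_RANGE.items() if has_significant_spike(lo, hi)),
    key=lambda name: CPU_RANGE[name][1],
    reverse=True,
)
print(spiky)
```

With a different `GAP_RATIO` the same filter can be rerun to see how sensitive the list is to the chosen cutoff.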

h3. 🚨 Major Spike Contributors

  • mod-linked-data-import-b

  • mod-search-b

  • mod-linked-data-b

  • mod-source-record-storage-b

  • mod-source-record-manager-b

These services form a clear processing pipeline:

{code}
mod-linked-data-import
↓
mod-linked-data
↓
mod-source-record-storage
↓
mod-search
{code}


h3. ⚡ Secondary Spike Contributors

  • mod-data-export-worker-b

  • mod-inn-reach-b

  • mod-users-b

  • mod-scheduler-b

  • mod-inventory-b / mod-inventory-storage-b

These are likely:

  • impacted by upstream load

  • or executing background/scheduled tasks


h2. 🧠 Key Observations

  • System load is burst-driven, not evenly distributed

  • CPU spikes originate from data import workflows

  • Downstream services (storage, search) amplify load significantly

  • Peak load events trigger cluster-wide CPU spikes


h2. 🎯 Summary

The system is dominated by import-driven CPU bursts, where:

  • mod-linked-data-import-b is the primary driver

  • mod-search-b and storage modules act as amplifiers

  • Result is a cascading load pattern across multiple services

Sidecar analysis

h1. Log Analysis – Authentication & Application Layer Behavior

Source: CloudWatch Logs Insights
Sample Size: 10,000 log entries
Scope: Sidecar / Keycloak / Vert.x execution flow


h2. 📊 Summary

The analyzed logs show a system under load with:

  • High volume of authentication-related processing

  • Repetitive async execution patterns

  • Consistent occurrence of errors within the same execution paths

Key highlights:

  • ~11% of logs are related to authentication/token handling

  • 396 ERROR entries (~4%) and 238 WARN entries (~2.4%)

  • Errors are not random: they occur in repeated async execution chains

  • No DB or Kafka-related issues detected in this dataset

Overall, the system demonstrates auth-driven load combined with recurring execution errors, which likely contribute to CPU spikes and response time degradation.


h2. ⚡ Key Metrics

|| Metric || Value ||
| Total Logs | 10,000 |
| Authentication-related logs | 1,113 (~11%) |
| Errors (ERROR) | 396 (~4%) |
| Warnings (WARN) | 238 (~2.4%) |
| Timeout indicators | Not found |
| DB-related logs | Not found |
| Kafka-related logs | Not found |
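
The percentage shares follow directly from the raw counts in the table. A minimal sketch of the arithmetic (the dictionary keys are labels chosen here for illustration):

```python
TOTAL_LOGS = 10_000

# Raw counts taken from the Key Metrics table.
counts = {"authentication": 1_113, "ERROR": 396, "WARN": 238}


def share(count, total):
    """Percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


shares = {name: share(count, TOTAL_LOGS) for name, count in counts.items()}
print(shares)
```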


h2. ⏱️ Response Time Observations

Although explicit response time values are not present in the logs, the following patterns suggest degradation:

  • Repeated async execution chains in Vert.x event loop

  • Errors occurring within request-processing paths

  • High frequency of authentication operations

Implications:

  • Increased request queuing in event loop

  • Slower request processing under load

  • Elevated tail latency (p95/p99)

  • Additional latency caused by retries or failed executions


h2. 🚨 Key Findings

h3. 1. Authentication Layer is a Major Load Driver

  • ~11% of logs contain authentication/token-related messages

  • Indicates frequent interaction with Keycloak

  • Likely causes:

    • token validation per request

    • insufficient caching

    • repeated introspection
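
To illustrate what caching would change here, the sketch below wraps a token check in a small time-based cache so that repeated requests with the same token do not each trigger a call to the identity provider. It is a conceptual example only, not the sidecar's implementation: `TokenCache`, the `introspect` callable, and the 60-second TTL are all hypothetical.

```python
import time


class TokenCache:
    """Cache token validation results for a fixed time-to-live (TTL)."""

    def __init__(self, introspect, ttl_seconds=60.0, clock=time.monotonic):
        self._introspect = introspect  # hypothetical callable: token -> result
        self._ttl = ttl_seconds        # assumption; should not exceed token lifetime
        self._clock = clock
        self._entries = {}             # token -> (expires_at, result)

    def validate(self, token):
        now = self._clock()
        hit = self._entries.get(token)
        if hit is not None and hit[0] > now:
            return hit[1]              # cache hit: no remote call
        result = self._introspect(token)
        self._entries[token] = (now + self._ttl, result)
        return result
```

With such a cache, only the first request per token (and the first after expiry) reaches the identity provider; the rest are served locally.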


h3. 2. Repeated Error Execution Pattern

The most frequent repeated log entries are stack trace lines such as:

{code}
at org.folio.sidecar.integration.keycloak.KeycloakService.lambda$handleResponse
at io.vertx.core.impl.future.FutureImpl$1.onSuccess
at io.netty.channel.nio.NioEventLoop.run
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks
{code}

These lines appear ~238 times identically, indicating:

  • Errors are occurring in the same execution path

  • This path is triggered repeatedly under load

  • The issue is systemic, not isolated
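
One way to reproduce this kind of repetition count from exported log lines is to tally identical stack-frame lines. A minimal sketch, assuming the logs are available as a list of strings and that frames start with `at ` as in the excerpt above; the `sample` input is illustrative, not real log data:

```python
from collections import Counter


def top_repeated_frames(log_lines, n=5):
    """Count identical stack-frame lines (those starting with 'at ')."""
    frames = (line.strip() for line in log_lines if line.lstrip().startswith("at "))
    return Counter(frames).most_common(n)


# Illustrative input only, not real log data.
sample = [
    "ERROR request failed",
    "    at io.netty.channel.nio.NioEventLoop.run",
    "    at io.netty.channel.nio.NioEventLoop.run",
    "    at io.vertx.core.impl.future.FutureImpl$1.onSuccess",
]
print(top_repeated_frames(sample, n=1))
```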


h3. 3. Exact Error Characteristics

From the dataset:

  • 396 ERROR-level logs are present

  • Errors are associated with:

    • Keycloak response handling

    • Vert.x async processing

    • Netty event loop execution

Observed behavior:

  • Errors produce repeated stack traces (same methods)

  • No diversity of error types → indicates a single dominant failure path

  • Errors are embedded in async callbacks (Future/Handler chains)

Interpretation:

  • Likely failure occurs during:

    • authentication response processing

    • or handling of Keycloak responses

  • Errors are continuously triggered during load → no recovery between executions


h3. 4. Error Impact on System Behavior

The detected errors contribute to system degradation in multiple ways:

  • Each error introduces additional processing overhead

  • Errors in async flow may trigger:

    • retries

    • reprocessing

    • additional callbacks

  • This leads to:

    • increased CPU utilization

    • additional load on event loop

    • longer request processing times

👉 Effectively, errors amplify load instead of failing fast


h3. 5. Event Loop Saturation

Frequent execution of:

  • NioEventLoop.run

  • SingleThreadEventExecutor.runAllTasks

  • FutureImpl callbacks

Indicates:

  • Heavy load on Vert.x event loop

  • Limited thread capacity handling many async operations

Combined with errors:

  • Event loop processes both:

    • normal requests

    • error handling / retries

👉 This increases queueing and latency


h3. 6. No Evidence of Data Layer Issues

  • No JDBC / SQL logs detected

  • No Kafka producer/consumer logs

Conclusion:

  • These logs show no sign of a bottleneck in the DB or messaging layer

  • The issue appears to be isolated to:

    • authentication layer

    • async processing layer


h2. 🔗 Observed Execution Flow

{code}
Incoming Requests
↓
Sidecar Authentication
↓
Keycloak Response Handling
↓
Async Processing (Vert.x)
↓
ERROR occurs (same execution path)
↓
Repeated execution / retry
↓
Event Loop Saturation
↓
Increased CPU + Response Time
{code}


h2. 🧠 Key Points

  • Errors are repetitive and concentrated in one execution path

  • Main hotspot: KeycloakService.handleResponse

  • Errors occur inside async processing chain

  • Event loop is handling both normal and failed executions

  • Errors amplify load rather than reducing it

  • Authentication flow is tightly coupled with error path

  • No evidence of DB/Kafka involvement


h2. 🎯 Conclusion

The system behavior indicates:

Authentication response handling in Keycloak integration is a critical failure point under load.

Characteristics:

  • Repeated errors in the same execution path

  • Continuous triggering under load

  • No effective mitigation (e.g., caching or throttling)

  • Errors increase system load instead of isolating failures

Results:

  • Increased CPU utilization

  • Event loop saturation

  • Amplified request processing overhead

  • Degraded response times (especially p95/p99)


h2. 📌 Recommended Focus Areas

  • Investigate KeycloakService.handleResponse error cause

  • Analyze exact error messages preceding stack traces

  • Validate token caching configuration

  • Reduce authentication calls per request

  • Review retry behavior in async flow

  • Increase Vert.x / Quarkus thread pool capacity

  • Introduce monitoring for:

    • error rate over time

    • response time (p95/p99)

  • Ensure failures fail fast instead of triggering repeated processing
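
For the p95/p99 monitoring item, a nearest-rank percentile is enough to get started once response times are being recorded. A minimal sketch, assuming latency samples are collected as a list of numbers (the report itself contains no such measurements):

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]
```

For example, `percentile(latencies_ms, 95)` and `percentile(latencies_ms, 99)` give the tail values to track over time, where `latencies_ms` is a hypothetical list of measured response times.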


Service logs analysis

h1. Log Analysis – Service Layer (Non-Sidecar)

Source: CloudWatch Logs Insights
Sample Size: 2,074 log entries
Scope: Application service (same timeframe as sidecar logs)


h2. 📊 Summary

The analyzed service logs show a stable and clean execution profile with:

  • No errors or warnings detected

  • No authentication-related overhead

  • No DB or Kafka issues observed

  • Low repetition and no hot execution paths

In contrast to sidecar logs, the service itself behaves normally under the same load conditions.


h2. ⚡ Key Metrics

|| Metric || Value ||
| Total Logs | 2,074 |
| Errors (ERROR) | 0 |
| Warnings (WARN) | 0 |
| Authentication-related logs | 0 |
| Timeout indicators | 0 |
| DB-related logs | 0 |
| Kafka-related logs | 0 |


h2. ⏱️ Response Time Observations

No direct response time metrics are present in logs, but behavior indicates:

  • No signs of request queuing

  • No async saturation patterns

  • No retries or error-related delays

Implication:

  • Response times at service level are expected to be:

    • stable

    • consistent

    • not degraded under load

👉 This strongly contrasts with sidecar behavior, where response degradation is likely.


h2. 🚨 Key Findings

h3. 1. No Errors or Failures Detected

  • 0 ERROR logs

  • 0 WARN logs

Interpretation:

  • Service executes requests successfully

  • No visible failure paths

  • No retries or reprocessing

👉 Confirms high stability at service level


h3. 2. No Authentication Overhead

  • No auth/token-related logs detected

Interpretation:

  • Authentication is handled outside the service (sidecar layer)

  • Service is not directly impacted by Keycloak processing

👉 Confirms separation of concerns:

  • sidecar = auth

  • service = business logic


h3. 3. No Async Saturation Patterns

Unlike sidecar logs, there are no repeated stack traces such as:

  • Vert.x event loop execution

  • Netty thread processing

  • Future/async callback chains

Interpretation:

  • No event loop pressure observed

  • No indication of thread starvation

  • No async bottlenecks


h3. 4. Minimal Repetition Pattern

Most repeated log:

{code}
ConsortiumTenantExecutor Changing context from cs00000int_0004 to cs00000int
{code}

  • Appears only a few times (very low frequency)

  • Represents normal tenant context switching

👉 Not a performance concern


h3. 5. No Evidence of Data Layer Activity

  • No JDBC / SQL logs

  • No Kafka logs

Conclusion:

  • Service logs do not indicate:

    • DB bottlenecks

    • messaging issues


h2. 🔗 Comparison with Sidecar Logs

|| Aspect || Sidecar Logs || Service Logs ||
| Errors | Present (~4%) | None |
| Auth activity | High (~11%) | None |
| Repetition | Very high (same stack traces) | Minimal |
| Async pressure | High (event loop saturation) | None |
| Response time impact | Likely degraded | Stable |
| Bottleneck location | Auth layer | Not present |


h2. 🧠 Key Points

  • Service layer is stable and not a bottleneck

  • No errors, retries, or failure amplification

  • No async or thread pressure observed

  • No authentication overhead at service level

  • Behavior is consistent with healthy application logic execution


h2. 🎯 Conclusion

The service itself is not contributing to performance degradation.

Instead:

  • In these logs, all observed instability, errors, and load amplification originate from the sidecar (authentication layer)

  • Service processes requests cleanly once they reach it

  • Response time degradation is introduced before requests reach the service

👉 Root cause is external to service logic


h2. 📌 Final Insight (Most Important)

{panel:title=Key Takeaway}
The system bottleneck is not in the service layer, but in the authentication flow (sidecar + Keycloak), which introduces errors, retries, and event loop saturation before requests reach the service.
{panel}


DB analysis (performance insights)

RDS Performance Insights: Summary

🧠 Big picture

  • DB load (AAS) peaked at ~20 (near the max vCPU line)

  • Load is dominated by:

    • CPU (green) initially

    • then locks + IO + vacuum-related waits

  • After peak → system drops to low steady state

  • Later (~09:00) → smaller secondary spike


🚨 Key Findings

1. Heavy INSERT workload into mod_search tables

Top queries are almost entirely:

  • INSERT INTO cs00000int_mod_search.subject

  • INSERT INTO cs00000int_mod_search.contributor