Intermittent Messaging Disruptions

Postmortem

October 03, 2025 at 21:24

RCA

1. Incident Summary

On October 2, 2025, the production chatbot experienced an error caused by the persistence configuration of its Redis (ElastiCache) database. The instance in use did not support persistence mechanisms, so all in-memory data was lost after a service interruption or restart. As a result, active conversations could not be recovered, and previous chats were permanently lost.

2. Impact

The lack of persistence in the Redis database caused some active chatbot conversations to become unrecoverable after the incident. This temporarily affected service continuity, as some users had to restart their conversations from the beginning.

3. Detection

The incident was initially detected through a CloudWatch alarm that flagged failures in Redis session recovery. In parallel, users reported that chatbot conversations were restarting, which confirmed the impact of the issue.
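As an illustration of the kind of signal such an alarm can rely on, the following is a minimal sketch of a sliding-window failure-rate check for session recovery. It is not the production monitoring code: the class name, the `recover_session` helper, the window size, and the threshold are all hypothetical, and a plain dict stands in for the Redis session store.

```python
from collections import deque


class RecoveryFailureMonitor:
    """Tracks recent session-recovery attempts and flags a high failure rate.

    Hypothetical helper for illustration; window and threshold are assumptions.
    """

    def __init__(self, window=100, threshold=0.2):
        # Keep only the most recent `window` outcomes (True = recovered).
        self.results = deque(maxlen=window)
        self.threshold = threshold

    def record(self, recovered):
        self.results.append(bool(recovered))

    def failure_rate(self):
        if not self.results:
            return 0.0
        return self.results.count(False) / len(self.results)

    def should_alarm(self):
        # Alarm only when the failure rate strictly exceeds the threshold.
        return self.failure_rate() > self.threshold


def recover_session(store, session_id, monitor):
    """Look up a session in a dict-like store and record the outcome."""
    session = store.get(session_id)
    monitor.record(session is not None)
    return session
```

In a real deployment the recorded outcomes would typically be published as a metric that an alarm evaluates, rather than checked in process.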

4. Response

Once the incident was confirmed, the team validated the Redis configuration and reviewed available backups. The latest automated backup prior to the event was identified, providing a recovery point to help mitigate the impact.

5. Root Cause

The root cause was that the Redis ElastiCache database was configured on a cluster type that does not support persistence mechanisms.

6. Implemented Solution

As an immediate measure, the latest available backup was identified and its data integrity was validated. The ElastiCache service configuration was then adjusted, and migration to a cluster type that supports persistence through AOF or automated snapshots is being evaluated, so that data is retained across restarts or failures.
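A check like the one below could help catch a non-persistent configuration before it reaches production. This is a minimal sketch, assuming the settings are available as a plain dict of Redis directive names to string values; the function name and the input shape are hypothetical, and on ElastiCache the equivalent settings are managed by the service rather than read this way. The `appendonly` and `save` directives are the standard Redis settings for AOF and RDB snapshots.

```python
def persistence_enabled(config):
    """Return True if the given Redis settings enable AOF or RDB snapshots.

    `config` is a hypothetical dict of directive name -> string value,
    for example {"appendonly": "yes", "save": "3600 1"}.
    """
    # AOF is on only when `appendonly` is explicitly set to "yes".
    aof_on = config.get("appendonly", "no").strip().lower() == "yes"
    # An empty or missing `save` value is treated as snapshots disabled.
    snapshots_on = config.get("save", "").strip() != ""
    return aof_on or snapshots_on
```

For example, `persistence_enabled({"appendonly": "no", "save": ""})` returns `False`, which is the situation described in the root cause above.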

Resolved

October 02, 2025 at 19:06

This incident has been resolved.

A fix has been deployed to properly close chats that were not under active handling.

We will provide the detailed RCA in this channel as part of the postmortem.

Monitoring

October 02, 2025 at 15:52

We implemented a fix and are currently monitoring the result.

We advise users to sign out and sign back in to confirm the fix.

Identified

October 02, 2025 at 15:39

A fix is being released within the next few minutes.

Investigating

October 02, 2025 at 14:47