RCA
1. Incident Summary
On October 2, 2025, the production chatbot lost conversation data because of the persistence configuration of its Redis (ElastiCache) database. The instance in use did not support persistence, so all in-memory data was lost after a service interruption or restart. As a result, active conversations could not be recovered and earlier chats were permanently lost.
2. Impact
Because the Redis database had no persistence, some active chatbot conversations could not be recovered after the incident. Service continuity was temporarily affected, as some users had to restart their conversations from the beginning.
3. Detection
The incident was initially detected through a CloudWatch alarm that flagged failures in Redis session recovery. In parallel, users reported that chatbot conversations were restarting, which confirmed the impact of the issue.
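A minimal sketch of how an alarm of this kind could be defined, assuming the application publishes a custom metric that counts failed session recoveries. The alarm name, the namespace `Chatbot/Sessions`, the metric name `SessionRecoveryFailures`, and the threshold are hypothetical; the actual alarm definition is not recorded in this report. The function only builds the keyword arguments accepted by boto3's CloudWatch `put_metric_alarm` call, so it can be reviewed without AWS credentials.

```python
def build_recovery_alarm(threshold: int = 5, period_seconds: int = 60) -> dict:
    """Build keyword arguments for CloudWatch put_metric_alarm.

    All names below (alarm, namespace, metric) are hypothetical placeholders.
    """
    if threshold < 1:
        raise ValueError("threshold must be at least 1")
    # Standard-resolution metrics use periods that are multiples of 60 seconds.
    if period_seconds <= 0 or period_seconds % 60 != 0:
        raise ValueError("period_seconds must be a positive multiple of 60")
    return {
        "AlarmName": "chatbot-session-recovery-failures",  # hypothetical
        "Namespace": "Chatbot/Sessions",                   # hypothetical
        "MetricName": "SessionRecoveryFailures",           # hypothetical
        "Statistic": "Sum",
        "Period": period_seconds,
        "EvaluationPeriods": 1,
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # Periods with no datapoints mean no failures were reported.
        "TreatMissingData": "notBreaching",
    }
```

With credentials configured, the result could be passed on as `boto3.client("cloudwatch").put_metric_alarm(**build_recovery_alarm())`.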
4. Response
Once the incident was confirmed, the team validated the Redis configuration and reviewed available backups. The latest automated backup prior to the event was identified, providing a recovery point to help mitigate the impact.
5. Root Cause
The root cause was that the Redis (ElastiCache) database was deployed on a cluster type that does not support persistence, so in-memory data could not survive a service interruption or restart.
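A configuration gap like this can be caught by a routine audit. The sketch below is one possible check, not the team's actual tooling. It assumes each input dictionary is shaped like an entry of `CacheClusters` in the response of boto3's ElastiCache `describe_cache_clusters` call, and it flags Redis clusters whose `SnapshotRetentionLimit` is 0, which means automated snapshots are disabled. The function name is hypothetical.

```python
from typing import Any


def persistence_findings(cluster: dict[str, Any]) -> list[str]:
    """Return findings for one cluster description.

    `cluster` is assumed to look like one entry of `CacheClusters`
    from describe_cache_clusters; only the fields read here are needed.
    """
    findings: list[str] = []
    if cluster.get("Engine") != "redis":
        return findings  # only Redis clusters are in scope for this check
    cluster_id = cluster.get("CacheClusterId", "<unknown>")
    # A retention limit of 0 means automated snapshots are turned off.
    if cluster.get("SnapshotRetentionLimit", 0) == 0:
        findings.append(
            f"{cluster_id}: automated snapshots are disabled "
            "(SnapshotRetentionLimit is 0)"
        )
    return findings
```

Running this over every cluster in an account gives a list of instances that would lose their data on restart, which could feed a periodic report or a deployment gate.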
6. Implemented Solution
As an immediate measure, the latest available backup was identified and its data integrity was validated. The ElastiCache service configuration was then adjusted, and migration to a cluster type that supports persistence through AOF (append-only file) or automated snapshots was evaluated, so that data is retained across restarts and failures.
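As an illustration of the automated snapshot option only (AOF is not shown), the sketch below builds the keyword arguments for boto3's ElastiCache `modify_replication_group` call. It assumes the target is a replication group on a node type that supports backups. The function name, the default retention of 7 days, and the default window are hypothetical choices, not values taken from this incident.

```python
def build_snapshot_settings(
    replication_group_id: str,
    retention_days: int = 7,
    window_utc: str = "03:00-04:00",
) -> dict:
    """Build kwargs that enable automated snapshots on a replication group.

    Defaults are illustrative assumptions, not the settings actually applied.
    """
    if not replication_group_id:
        raise ValueError("replication_group_id is required")
    # ElastiCache keeps automated snapshots for at most 35 days;
    # a value of 0 would disable them, so it is rejected here.
    if not 1 <= retention_days <= 35:
        raise ValueError("retention_days must be between 1 and 35")
    return {
        "ReplicationGroupId": replication_group_id,
        "SnapshotRetentionLimit": retention_days,
        "SnapshotWindow": window_utc,  # daily window in UTC, hh:mm-hh:mm
        "ApplyImmediately": True,
    }
```

With credentials configured, the result could be applied as `boto3.client("elasticache").modify_replication_group(**build_snapshot_settings("chatbot-sessions"))`, where `chatbot-sessions` is a placeholder identifier.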