Jelou - Intermittent Messaging Disruptions – Incident details

All systems operational

Intermittent Messaging Disruptions

Resolved
Degraded performance
Started 19 days ago
Lasted about 4 hours

Affected

Platform

Degraded performance from 2:47 PM to 3:52 PM; operational from 3:52 PM to 7:06 PM

Multiagent Panel

Degraded performance from 2:47 PM to 3:52 PM; operational from 3:52 PM to 7:06 PM

Updates
  • Postmortem

    RCA

    1. Incident Summary

    On October 2, 2025, production chatbot operations were disrupted by a problem with the persistence configuration of the Redis (ElastiCache) database. The instance in use did not support persistence, so all in-memory data was lost after a service interruption or restart. As a result, active conversations could not be recovered, and previous chats were permanently lost.

    2. Impact

    The lack of persistence in the Redis database caused some active chatbot conversations to become unrecoverable after the incident. This temporarily affected service continuity, as some users had to restart their conversations from the beginning.

    3. Detection

    The incident was initially detected through a CloudWatch alarm that flagged failures in Redis session recovery. In parallel, users reported that chatbot conversations were restarting, which confirmed the impact of the issue.
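
    A signal like the one behind that alarm can be sketched as follows. This is a minimal illustration, not the platform's actual implementation: `SessionStore`, `recover_session`, `simulate_restart`, and the `recovery_failures` counter are hypothetical names, and a plain dictionary stands in for Redis.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class SessionStore:
    """Hypothetical session cache; a dict stands in for Redis."""

    cache: Dict[str, dict] = field(default_factory=dict)
    recovery_failures: int = 0  # counter that an alarm could watch

    def save_session(self, conversation_id: str, state: dict) -> None:
        self.cache[conversation_id] = state

    def recover_session(self, conversation_id: str) -> Optional[dict]:
        """Return the saved state, or None (and count a failure) if it is gone."""
        state = self.cache.get(conversation_id)
        if state is None:
            self.recovery_failures += 1
        return state

    def simulate_restart(self) -> None:
        """Model a restart without persistence: all in-memory data is lost."""
        self.cache.clear()


store = SessionStore()
store.save_session("conv-1", {"step": "awaiting_payment"})
store.simulate_restart()
recovered = store.recover_session("conv-1")  # the conversation must start over
```

    In a setup like this, a sudden rise in the failure counter right after a restart would point at lost session data rather than at ordinary session expiry.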

    4. Response

    Once the incident was confirmed, the team validated the Redis configuration and reviewed available backups. The latest automated backup prior to the event was identified, providing a recovery point to help mitigate the impact.

    5. Root Cause

    The Redis ElastiCache database was configured on a cluster type that does not support persistence mechanisms, so in-memory data could not survive a service interruption or restart.

    6. Implemented Solution

    As an immediate measure, the latest available backup was identified and its data integrity was validated. The ElastiCache service configuration was then adjusted, and the team evaluated migration to a cluster type that supports persistence through AOF or automated snapshots, so that data is retained across restarts or failures.
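
    A check for the snapshot side of this configuration could look like the sketch below. It assumes cluster descriptions shaped like the `CacheClusters` list returned by the ElastiCache `DescribeCacheClusters` API, where a `SnapshotRetentionLimit` of 0 means automatic backups are turned off. The function name and the sample data are hypothetical and do not reflect the actual clusters involved in this incident.

```python
from typing import Iterable, List


def clusters_without_snapshots(clusters: Iterable[dict]) -> List[str]:
    """Return the IDs of clusters whose automatic snapshot retention is 0 or unset."""
    return [
        cluster["CacheClusterId"]
        for cluster in clusters
        if cluster.get("SnapshotRetentionLimit", 0) == 0
    ]


# Hand-written sample shaped like the "CacheClusters" list; values are illustrative.
sample = [
    {"CacheClusterId": "sessions-a", "Engine": "redis", "SnapshotRetentionLimit": 0},
    {"CacheClusterId": "sessions-b", "Engine": "redis", "SnapshotRetentionLimit": 7},
]
unprotected = clusters_without_snapshots(sample)
```

    Running a check of this kind on a schedule would surface a cluster with backups disabled before a restart does.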

  • Resolved

    This incident has been resolved.

    A fix has been deployed to properly close chats that were not under active handling.

    We will provide the detailed RCA in this channel as part of the postmortem.

  • Monitoring

    We implemented a fix and are currently monitoring the result.

    We advise users to sign out and sign back in to confirm the fix.

  • Identified

    A fix is being released within the next few minutes.

  • Investigating

    We are currently experiencing intermittent disruptions affecting message delivery within the platform.