🚨 Platform Service Interruption

Resolved
Major outage
Started March 16, 2026 · Lasted about 6 hours

Affected

• Website
• Channels: Major outage from 8:50 AM to 12:51 PM, Operational from 12:51 PM to 2:36 PM
  • WhatsApp: Major outage from 8:50 AM to 12:51 PM, Operational from 12:51 PM to 2:36 PM
  • Facebook Messenger: Major outage from 8:50 AM to 12:51 PM, Operational from 12:51 PM to 2:36 PM
  • Instagram Direct Messages: Major outage from 8:50 AM to 12:51 PM, Operational from 12:51 PM to 2:36 PM
  • Facebook Feed: Major outage from 8:50 AM to 12:51 PM, Operational from 12:51 PM to 2:36 PM

Updates
  • Postmortem

    RCA

    Incident Summary

    On March 16, 2026, between 03:44 and 07:45 UTC, the message processing and workflow services experienced a temporary degradation. The issue was triggered by an automatic failover event in the cache cluster. During the failover, some services briefly lost communication with the in-memory database, which temporarily increased request-processing latency.

    Impact

    During the incident window, the system experienced temporary delays in the processing of some messages. Certain users may have noticed increased response times from the service.

    The cache hit rate temporarily decreased from its normal operating level, which led to increased processing times for some requests. Service performance gradually recovered as the system re-established normal operations.
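
    As a rough, hypothetical illustration of why a lower hit rate inflates processing time (this report does not publish actual hit rates or per-request timings), average latency can be modeled as a weighted mix of cache and backend response times:

    ```python
    # Hypothetical numbers only: a small model of how average latency grows
    # as the cache hit rate falls during a failover.
    def avg_latency_ms(hit_rate: float, cache_ms: float = 1.0, backend_ms: float = 25.0) -> float:
        """Expected latency given the fraction of requests served from cache."""
        return hit_rate * cache_ms + (1.0 - hit_rate) * backend_ms

    print(avg_latency_ms(0.95))  # ~2.2 ms at a healthy hit rate
    print(avg_latency_ms(0.60))  # ~10.6 ms while the cache warms back up
    ```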

    Detection

    The incident was detected proactively through automated monitoring alerts configured in the system. These alerts flagged anomalies in the performance of the workflow service and in the behavior of the cache cluster.

    The alerts enabled the infrastructure team to begin investigating the event promptly.

    Response

    The infrastructure team analyzed cache cluster metrics and dependent services to determine the source of the degradation, and identified that an automatic failover event had occurred in the cluster as part of the high availability and protection mechanisms of the managed cache service.

    The team confirmed that the event was not related to any security incident. The system stabilized progressively as services re-established their connections and the cache returned to normal operating levels.

    Root Cause

    The incident was caused by the combination of the following factors:

    1. An automatic failover event occurred in the cache cluster at 03:44 UTC. These events are part of the high availability mechanisms of the managed cache service and may occur either as scheduled maintenance or in response to infrastructure conditions.

    2. During the failover process, platform services needed to re-establish their connections with the new active node. This resulted in a temporary reconnection period that briefly reduced cache efficiency (see the client-configuration sketch after this list).

    3. The temporary reduction in cache efficiency increased request processing times during the system stabilization period.
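
    As a minimal sketch of the reconnection behavior described in points 2 and 3, the snippet below shows retry-with-backoff client settings. It assumes a Redis-compatible managed cache and the redis-py client; this report does not name the actual technology, and the endpoint and timings here are hypothetical:

    ```python
    # Sketch only: retry/backoff settings that let a client ride out a cache
    # failover while connections move to the newly promoted node.
    from redis import Redis
    from redis.backoff import ExponentialBackoff
    from redis.exceptions import ConnectionError, TimeoutError
    from redis.retry import Retry

    client = Redis(
        host="cache.example.internal",  # hypothetical endpoint
        socket_timeout=2,               # fail fast instead of hanging on the old primary
        health_check_interval=30,       # ping idle connections so stale ones are replaced
        retry=Retry(ExponentialBackoff(cap=5, base=0.1), retries=5),
        retry_on_error=[ConnectionError, TimeoutError],
    )
    ```

    With settings along these lines, commands issued during the reconnection window are retried with exponential backoff instead of failing outright, which is the kind of transparent failover the improvements below aim for.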

    Resolution and Improvements

    In response to this incident, the engineering team is implementing the following improvements:

    1. Optimization of service connectivity configurations with the cache cluster to ensure that failover events remain fully transparent to platform operations.

    2. Implementation of proactive multi-layer monitoring capable of detecting failover events early and tracking their impact across platform services (a sketch of such a check appears at the end of this postmortem).

    3. Deployment of a centralized observability dashboard to improve incident identification and accelerate resolution times for the infrastructure team.

    4. Optimization of service auto-reconnection mechanisms to minimize recovery time during failover events.

    These measures are designed to prevent the recurrence of similar situations and ensure service continuity during infrastructure maintenance events.
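
    As an illustration of the kind of hit-rate check the multi-layer monitoring in point 2 could run, here is a minimal sketch. It assumes a Redis-compatible cache exposing INFO stats; the endpoint, polling interval, and threshold are hypothetical, not taken from this report. The watcher compares hit/miss counter deltas between polls and alerts when the recent hit rate drops:

    ```python
    # Sketch only: poll cache hit/miss counters and alert on a sudden drop
    # in the recent hit rate, an early signal of a failover in progress.
    import time

    from redis import Redis

    client = Redis(host="cache.example.internal")  # hypothetical endpoint
    HIT_RATE_FLOOR = 0.80                          # tune to the service's baseline

    def counters(r: Redis) -> tuple[int, int]:
        stats = r.info("stats")
        return stats["keyspace_hits"], stats["keyspace_misses"]

    prev_hits, prev_misses = counters(client)
    while True:
        time.sleep(60)
        hits, misses = counters(client)
        delta_hits = hits - prev_hits
        delta_misses = misses - prev_misses
        prev_hits, prev_misses = hits, misses
        total = delta_hits + delta_misses
        hit_rate = delta_hits / total if total else 1.0
        if hit_rate < HIT_RATE_FLOOR:
            # In production this would page the on-call team, not print.
            print(f"ALERT: cache hit rate {hit_rate:.2%} over the last minute")
    ```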

  • Resolved
    This incident has been resolved.
  • Monitoring
    We implemented a fix and are currently monitoring the result.
  • Investigating

    We are currently experiencing a service disruption affecting platform functionality. At the moment, chats may not respond and some platform features may be unavailable.

    Our technical team is already investigating the issue with high priority and working to restore the service as soon as possible.

    We will share another update with more information within the next 30 minutes.

    Thank you for your patience.