Jelou - Incident History

All systems operational

Incident History

May 2026

No incidents reported this month

Apr 2026

Operator Connection Status Issues
  • Postmortem

    Incident Summary
    During a production deployment related to visual and behavioral improvements of the operator profile component, the status selector (“Connected / Unavailable / Disconnected”) stopped responding correctly for some users.
    The incident lasted approximately 15 minutes.

    Root Cause
    A recent update was identified as having introduced a regression in the handling of the component’s internal state, specifically in the synchronization of the selected state after the dropdown was rendered. As a result, the selection event failed to properly update the operator’s status in certain scenarios.

    Mitigation Applied
    Once the anomalous behavior was detected, an immediate rollback of the affected version was executed to restore normal operation of the component and minimize impact.

    Post-Fix Correction
    A subsequent fix was implemented on the affected component, adjusting the update and validation logic of the selected state to prevent inconsistencies during rendering and ensure proper propagation of selector events.
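
    As an illustration of this kind of fix, here is a minimal sketch assuming a React-style component; the names (OperatorStatus, StatusSelector, onStatusChange) are hypothetical and not Jelou's actual code:

      import { useEffect, useState } from "react";

      type OperatorStatus = "connected" | "unavailable" | "disconnected";

      interface Props {
        status: OperatorStatus;                        // source of truth held by the parent/store
        onStatusChange: (next: OperatorStatus) => void;
      }

      export function StatusSelector({ status, onStatusChange }: Props) {
        const [selected, setSelected] = useState<OperatorStatus>(status);

        // Re-synchronize the internal state whenever the component re-renders
        // with a new value; the regression described above was equivalent to
        // skipping this step, leaving the selector holding a stale value.
        useEffect(() => setSelected(status), [status]);

        const handleChange = (next: OperatorStatus) => {
          setSelected(next);     // update local state immediately
          onStatusChange(next);  // propagate the selection event upstream
        };

        return (
          <select
            value={selected}
            onChange={(e) => handleChange(e.target.value as OperatorStatus)}
          >
            <option value="connected">Connected</option>
            <option value="unavailable">Unavailable</option>
            <option value="disconnected">Disconnected</option>
          </select>
        );
      }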

    Preventive Actions
    Validations are already performed on critical components prior to every deployment. In this case, the behavior did not occur during pre-deployment testing and only appeared under a specific condition in production.

    As an additional measure, we have strengthened post-deployment validations and monitoring of these components to detect similar behaviors earlier.
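
    As a sketch of what such a post-deployment validation can look like, the hypothetical synthetic probe below exercises the status-update path end to end after each deploy; the endpoint, payload, and operator id are assumptions, not Jelou's real API:

      // Hypothetical post-deploy smoke check; runs on Node 18+ (global fetch).
      async function smokeCheckStatusSelector(baseUrl: string): Promise<boolean> {
        for (const status of ["connected", "unavailable", "disconnected"]) {
          // Write the status through the same path the selector uses.
          const res = await fetch(`${baseUrl}/operators/smoke-test/status`, {
            method: "PUT",
            headers: { "content-type": "application/json" },
            body: JSON.stringify({ status }),
          });
          if (!res.ok) return false;
          // Read it back: a write that does not propagate is exactly the
          // failure mode this incident exhibited.
          const check = await fetch(`${baseUrl}/operators/smoke-test/status`);
          const body = (await check.json()) as { status?: string };
          if (body.status !== status) return false;
        }
        return true;
      }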

  • Resolved
    This incident has been resolved.
  • Monitoring

    The issue affecting the operator connection status has been identified and resolved. Connection indicators are now functioning as expected.

    Our team will continue to monitor the platform to ensure stability and prevent recurrence.

  • Investigating

    We are currently experiencing issues affecting the connection status of operators on the platform. Some users may observe incorrect or inconsistent availability indicators.

    Our team is actively investigating the root cause and working to restore normal behavior as soon as possible.

    Next Update: We will provide further information as soon as progress is made.

Delayed Delivery of WhatsApp Messages
  • Postmortem

    Incident Summary
    On April 17, between 08:00 a.m. and 12:15 p.m. (Ecuador time), a platform-level issue occurred that caused intermittent errors in the message delivery service.
    This situation directly affected message sending within the platform, resulting in occasional failures in some processes. Due to its intermittent nature, the service was not completely unavailable, but it did exhibit inconsistent behavior.

    Impact
    The incident impacted users consuming the message delivery service, causing intermittent failures in the querying and management of phone numbers associated with WABA accounts.
    The impact was classified as medium, as not all requests failed and the service was not fully unavailable.

    Detection
    The issue was identified through error reports and service monitoring, where an irregular failure rate in responses was observed.

    Response
    Once the incident was detected, the team analyzed recent platform behavior and identified a recently deployed change as the potential cause.
    The change was immediately rolled back, and a fix was deployed across all clusters, accompanied by monitoring to validate service stability.

    Root Cause
    The incident was caused by a recent platform-level change that introduced unexpected behavior in the message delivery service, resulting in intermittent errors.

    Resolution and Preventive Measures

    Applied Solution:

    • Rollback of the change that caused the incident

    • Full deployment of the fix across all clusters

    • Validation of endpoint stability

    Preventive Measures:

    • Strengthen pre-deployment testing for critical endpoints

    • Implement stricter monitoring to detect intermittent errors (see the sketch after this list)

    • Apply progressive deployments (controlled rollout)

    • Improve post-deployment validation
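
    As referenced in the monitoring item above, a minimal sketch of a sliding-window error-rate check: it flags a small but sustained failure rate that a binary up/down health check would miss. Window size and thresholds are illustrative assumptions:

      type Sample = { timestamp: number; ok: boolean };

      class ErrorRateMonitor {
        private samples: Sample[] = [];

        constructor(
          private windowMs = 5 * 60_000, // look at the last 5 minutes
          private threshold = 0.05,      // alert above a 5% failure rate
          private minSamples = 20,       // avoid alerting on tiny sample sizes
        ) {}

        // Feed every request outcome here.
        record(ok: boolean, now = Date.now()): void {
          this.samples.push({ timestamp: now, ok });
          // Drop samples that have aged out of the window.
          this.samples = this.samples.filter((s) => now - s.timestamp <= this.windowMs);
        }

        // Polled by the alerting pipeline.
        shouldAlert(): boolean {
          if (this.samples.length < this.minSamples) return false;
          const failures = this.samples.filter((s) => !s.ok).length;
          return failures / this.samples.length > this.threshold;
        }
      }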

  • Resolved
    This incident has been resolved.
  • Monitoring

    A fix has been implemented by the provider. We are currently monitoring the results.

  • Investigating

    We are currently experiencing delays in WhatsApp message delivery. Our team is investigating and working to resolve the issue.

Mar 2026

🚨 Platform Service Interruption
  • Postmortem

    RCA

    Incident Summary

    On March 16, 2026, between 03:44 and 07:45 UTC, a temporary degradation occurred in the message processing and workflow services. The issue was triggered by an automatic failover event in the cache cluster. During the failover process, some services experienced an interruption in communication with the in-memory database, which resulted in temporarily increased latency in request processing.

    Impact

    During the incident window, the system experienced temporary delays in the processing of some messages. Certain users may have noticed increased response times from the service.

    The cache hit rate temporarily decreased from its normal operating level, which led to increased processing times for some requests. Service performance gradually recovered as the system re-established normal operations.
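
    As a purely illustrative calculation of why a lower hit rate surfaces as latency (the real figures are not part of this report): with a 1 ms cache hit and a 50 ms backing-store read, a drop in hit rate from 95% to 60% raises the mean lookup time from about 3.5 ms to about 20.6 ms.

      // Illustrative numbers only: mean lookup time as a function of hit rate.
      const meanLookupMs = (hitRate: number, cacheMs = 1, originMs = 50): number =>
        hitRate * cacheMs + (1 - hitRate) * originMs;

      console.log(meanLookupMs(0.95)); // ≈ 3.45 ms at a healthy hit rate
      console.log(meanLookupMs(0.6));  // ≈ 20.6 ms during the stabilization window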

    Detection

    The incident was proactively detected through automated monitoring alerts configured in the system. These alerts identified variations in the performance of the workflow service and in the operation of the cache cluster.

    The alerts enabled the infrastructure team to begin investigating the event promptly.

    Response

    The infrastructure team analyzed cache cluster metrics and dependent services to determine the source of the degradation. It was identified that an automatic failover event had occurred in the cluster as part of the high availability and protection mechanisms of the managed cache service.

    It was confirmed that this event was not related to any security incident. The system stabilized progressively as services re-established their connections and the cache returned to normal operational levels.

    Root Cause

    The incident was caused by the combination of the following factors:

    1. An automatic failover event occurred in the cache cluster at 03:44 UTC. These events are part of the high availability mechanisms of the managed cache service and may occur either as scheduled maintenance or in response to infrastructure conditions.

    2. During the failover process, platform services were required to re-establish connections with the new active node, which temporarily impacted cache efficiency.

    3. The temporary reduction in cache efficiency increased request processing times during the system stabilization period.
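
    As a sketch of the auto-reconnection behavior involved, assuming a Redis-compatible managed cache and the ioredis Node.js client (the actual stack is not stated in this report): bounded retry backoff plus a reconnect-on-READONLY rule lets clients re-attach to the new primary promptly after a failover.

      import Redis from "ioredis";

      const cache = new Redis({
        host: "cache.internal.example", // illustrative hostname
        // Retry quickly at first, then back off, capped at 2 s per attempt.
        retryStrategy: (attempt) => Math.min(attempt * 100, 2_000),
        // After a failover the demoted node answers READONLY; force a
        // reconnect so the client resolves the new active primary.
        reconnectOnError: (err) => err.message.includes("READONLY"),
      });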

    Resolution and Improvements

    In response to this incident, the engineering team is implementing the following improvements:

    1. Optimization of service connectivity configurations with the cache cluster to ensure that failover events remain fully transparent to platform operations.

    2. Implementation of proactive multi-layer monitoring capable of detecting failover events early and tracking their impact across platform services.

    3. Deployment of a centralized observability dashboard to improve incident identification and accelerate resolution times for the infrastructure team.

    4. Optimization of service auto-reconnection mechanisms to minimize recovery time during failover events.

    These measures are designed to prevent the recurrence of similar situations and ensure service continuity during infrastructure maintenance events.

  • Resolved
    This incident has been resolved.
  • Monitoring
    We implemented a fix and are currently monitoring the result.
  • Investigating

    We are currently experiencing a service disruption affecting platform functionality. At the moment, chats may not respond and some platform features could be unavailable.

    Our technical team is already investigating the issue with high priority and working to restore the service as soon as possible.

    We will provide an update with more information within the next 30 minutes.

    Thank you for your patience.
