Delays on delivery to Push Destinations
Incident Report for Fairhair.ai
Postmortem

All times UTC.

At 09:45 and 09:55, two of our internal services responsible for managing Push delivery experienced short spikes in error rates. These were caused by slow response times from our underlying storage backend.

At 10:02, error rates increased again, this time enough for these services to hit their concurrency limit, meaning no further requests would be served until in-flight requests completed. At this point, some API endpoints and user interface pages became intermittently unavailable. The effect then cascaded as other services waited on these services and later hit their own limits.
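To illustrate how slower backend responses alone can exhaust a fixed concurrency limit, here is a minimal sketch. It is not our production code: the function name, the tick-based model, and every number in it are hypothetical. It also counts a request that finds no free slot as "turned away", whereas in the incident such requests waited instead.

```python
def simulate(limit: int, arrivals_per_tick: int, latency_ticks: int, ticks: int) -> tuple[int, int]:
    """Toy model of a service with a fixed concurrency limit.

    Every tick, `arrivals_per_tick` requests arrive. An admitted request
    holds one slot for `latency_ticks` ticks. All values are hypothetical.
    Returns (admitted, turned_away).
    """
    in_flight: list[int] = []  # remaining ticks for each admitted request
    admitted = 0
    turned_away = 0
    for _ in range(ticks):
        # Requests that have finished free their slots.
        in_flight = [remaining - 1 for remaining in in_flight if remaining > 1]
        for _ in range(arrivals_per_tick):
            if len(in_flight) < limit:
                in_flight.append(latency_ticks)
                admitted += 1
            else:
                # No free slot: the limit has been hit.
                turned_away += 1
    return admitted, turned_away


if __name__ == "__main__":
    # Same traffic and same limit; only the backend latency differs.
    print("fast backend:", simulate(limit=10, arrivals_per_tick=5, latency_ticks=1, ticks=20))
    print("slow backend:", simulate(limit=10, arrivals_per_tick=5, latency_ticks=4, ticks=20))
```

With the fast backend every request finds a free slot. With the slow backend the same traffic fills all ten slots within a few ticks, and later arrivals cannot be served until earlier requests complete.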

At 10:10, our Operations Team was alerted and began working to investigate and resolve the issue. An initial pause of the delivery system, intended to allow requests to complete, was not broad enough and still left a key service congested. A broader action to clear the request backlog across all services was taken at 10:30, which allowed data to be delivered again, and full service was restored by 10:33.

The vast majority of backlogged data was delivered by 10:45.

We are taking action to assess the concurrency limits of key services and to add extra capacity.
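One common starting point for assessing a concurrency limit is Little's Law: average requests in flight = arrival rate x average latency. The sketch below applies it. This is an illustration only and not necessarily the method we will use. The function name, the headroom factor, and the traffic figures are all hypothetical.

```python
import math


def required_concurrency(requests_per_second: float,
                         latency_seconds: float,
                         headroom: float = 1.5) -> int:
    """Estimate a concurrency limit from Little's Law (L = lambda * W).

    `headroom` is a hypothetical safety multiplier on top of the average
    number of requests in flight.
    """
    average_in_flight = requests_per_second * latency_seconds
    return math.ceil(average_in_flight * headroom)


if __name__ == "__main__":
    # Hypothetical figures: the same traffic under normal and degraded storage latency.
    print(required_concurrency(requests_per_second=100, latency_seconds=0.25))
    print(required_concurrency(requests_per_second=100, latency_seconds=2.0))
```

Because the estimate scales linearly with latency, a limit that is comfortable under normal storage response times can be exceeded many times over when the backend slows down.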

Posted Mar 12, 2019 - 15:36 UTC

Resolved
This incident has been resolved.
Posted Mar 12, 2019 - 13:29 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Mar 12, 2019 - 10:43 UTC
Investigating
We are observing increased latency on deliveries to Push Destinations. We are working to resolve this.
Posted Mar 12, 2019 - 10:21 UTC
This incident affected: STREAM for social data (Delivery).