All times UTC.
At 09:45 and 09:55, two of our internal services responsible for managing Push delivery experienced short spikes in error rates. This was caused by slow response times from our underlying storage backend.
At 10:02, error rates increased again, but this time the load was enough to hit a concurrency limit on these services, meaning no further requests could be served until in-flight ones completed. At this point, some API endpoints and user interface pages became periodically unavailable. A cascade effect followed as other services waited on these services and eventually hit their own limits as well.
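To illustrate the failure mode, here is a minimal sketch (not our actual service code; the class name and limit value are illustrative) of how a concurrency limit behaves: once every slot is held by a slow in-flight request, new requests cannot be served until an existing one completes.

```python
import threading

class ConcurrencyLimiter:
    """Caps the number of requests a service handles at once."""

    def __init__(self, limit):
        self._slots = threading.Semaphore(limit)

    def try_acquire(self):
        # Non-blocking acquire: fail fast once the limit is reached.
        return self._slots.acquire(blocking=False)

    def release(self):
        # Called when an in-flight request completes.
        self._slots.release()

limiter = ConcurrencyLimiter(limit=2)

# Two slow requests hold both slots while waiting on storage.
assert limiter.try_acquire()
assert limiter.try_acquire()

# A third request arrives: the limit is hit and it cannot be served.
assert not limiter.try_acquire()

# Only once an in-flight request completes does capacity free up.
limiter.release()
assert limiter.try_acquire()
```

When the backend is slow, requests hold their slots longer, so the limit is reached at a lower request rate; callers blocked on this service then exhaust their own limits, producing the cascade described above.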
At 10:10, our Operations Team was alerted and began investigating the issue. An initial pausing of the delivery system to allow requests to complete was not broad enough and still left a key service congested. A broader action was taken at 10:30 to ensure all services were clear of their request backlogs, which allowed data to be delivered and full service to be restored by 10:33.
The vast majority of backlogged data was delivered by 10:45.
Actions are being taken to reassess the concurrency limits of key services and to add extra capacity.