Customer lookup issue occurring again
Incident Report for SMRT Systems
Resolved
Dear SMRT Customers,

Here's an update on the issues that some of you experienced today.

Yesterday we received an alert that our prod6 Elasticsearch cluster was running low on storage (this cluster runs search and reporting for some of our customers).

In response, we updated the cluster using a “blue-green deploy”: Amazon Web Services (AWS) spins up 6 new nodes alongside the existing 6, for a total of 12 nodes online at once. All data is copied from the old nodes to the new ones, and then the old nodes are shut down.
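
For readers who want to picture the steps, here is a minimal sketch of a blue-green deploy in Python. It is an illustration only, not how AWS implements it: the function name, the node names, and the "doc" items are all made up for this example.

```python
def blue_green_deploy(blue_nodes):
    """Simulate a blue-green deploy (illustrative only).

    blue_nodes maps a node name to the set of data items it holds.
    Returns the new ("green") nodes and the peak number of nodes online.
    """
    # Step 1: start one new node for every old node.
    green_nodes = {f"green-{i}": set() for i in range(1, len(blue_nodes) + 1)}

    # While both sets are running, old and new nodes are online together.
    peak_online = len(blue_nodes) + len(green_nodes)

    # Step 2: copy all data from the old nodes to the new nodes.
    for green_name, data in zip(green_nodes, blue_nodes.values()):
        green_nodes[green_name] = set(data)

    # Step 3: the old nodes are retired; only the new set is returned.
    return green_nodes, peak_online


# Hypothetical 6-node cluster, each node holding one made-up item.
blue = {f"blue-{i}": {f"doc-{i}"} for i in range(1, 7)}
green, peak = blue_green_deploy(blue)
print(peak)        # 12 nodes online at the peak
print(len(green))  # 6 nodes remain afterwards
```

The point of doing it this way is that the old nodes keep serving until the new ones hold a full copy of the data.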

This completed without issue yesterday, but today we saw one node drop. When a node drops, AWS starts a new one, which then reads data from the rest of the cluster to restore itself.

This is fine when a single node drops, because we keep 2 copies of all data spread across the 6 nodes.

The issue today occurred because the new AWS nodes dropped multiple times without any obvious reason. There are a couple of issues with this:
1. The cluster operates on fewer nodes, which slows down search and reports and causes random errors.

2. Since every piece of data is stored twice in the cluster, if the 2 nodes holding both copies drop at the same time, that data is lost from the cluster and has to be restored from a backup or resynced.
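
To make the second point concrete, here is a small Python sketch of why one dropped node is harmless but two can be a problem. The placement scheme (each piece of data on two neighbouring nodes) and the node names are assumptions made for this example; the real cluster decides placement on its own.

```python
def place_copies(num_pieces, nodes, copies=2):
    """Assumed placement: piece i is stored on `copies` consecutive nodes."""
    return {
        piece: {nodes[(piece + k) % len(nodes)] for k in range(copies)}
        for piece in range(num_pieces)
    }


def lost_pieces(placement, dropped):
    """Return the pieces whose every copy sits on a dropped node."""
    dropped = set(dropped)
    return sorted(p for p, holders in placement.items() if holders <= dropped)


nodes = [f"node-{i}" for i in range(1, 7)]  # hypothetical 6-node cluster
placement = place_copies(6, nodes)

# One node drops: every piece still has a surviving copy.
print(lost_pieces(placement, ["node-1"]))            # []

# Two nodes that do not share a piece drop: still nothing is lost.
print(lost_pieces(placement, ["node-1", "node-4"]))  # []

# Two nodes holding both copies of piece 1 drop: that piece must be restored.
print(lost_pieces(placement, ["node-2", "node-3"]))  # [1]
```

In this sketch only the last case requires a restore from backup or a resync, which matches what happened today.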

The second issue occurred a couple of times today. Our dev teams in both Sweden and San Francisco have been restoring data from backups while staying in contact with AWS about the nodes dropping.

A restore from backup does not include the newest data from today, so we will be running re-syncs shortly. Initially, we ran the re-sync directly after restoring from our backups, but the nodes kept dropping, so we had to redo the restore.

AWS and our developers are currently upgrading our cluster further and they will both be monitoring the status of the cluster throughout the night to ensure there are no further issues tomorrow.

Rest assured that you will have no data loss and that this is our number 1 priority. Thank you very much for your patience throughout the day. We understand the pain you felt.

If you experience any problems tomorrow, or have any questions or concerns about this issue with the Elasticsearch cluster on AWS, please reach out to support.

Sincerely,
The Entire SMRT Team
Posted Jul 15, 2021 - 16:10 PDT
Monitoring
A fix has been implemented and the customer search function is now working as expected. We will continue to test the rest of the system and monitor the situation.
Posted Jul 15, 2021 - 10:48 PDT
Identified
The issue has been identified and a fix is being deployed.
Posted Jul 15, 2021 - 09:57 PDT
Update
We are continuing to investigate this issue.
Posted Jul 15, 2021 - 09:38 PDT
Investigating
Our developers are aware of the issue and are researching the cause.
Posted Jul 15, 2021 - 09:11 PDT
This incident affected: SMRT POS (POS).