Height is experiencing an outage

Incident Report for Height

Postmortem

TL;DR: Height was down from 11:56 AM EDT to 2:02 PM EDT on November 13th. The likely cause was running an unstable version of Kubernetes (1.18). This version had a bug that caused the load balancer to mistakenly treat all of our API servers as down, so all outside requests failed. We've switched our cluster back to the stable version (1.16) and are continuing to investigate, both to confirm our understanding of the cause and to identify steps to prevent outages like this.

Timeline & investigations

Starting on November 13th at 11:56 AM EDT, Height failed to load for all users and showed an error page instead.

As a first step, at 12:00 PM EDT, we reverted to an older version of Height to rule out a bug in our own services. At 12:03 PM EDT, we saw that one of our providers, Google Cloud, had posted that they were having load balancer issues, and we suspected this might be the cause.

However, our current understanding, which we are still working to confirm, is that the incident was actually caused by a bug in the version of Kubernetes we were using. Effectively, our API servers were working, but our load balancer couldn't communicate with them and considered all of them to be down. As a result, no outside requests could reach our servers.
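The failure mode described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not Height's or Google Cloud's actual implementation: the `Backend` class, the `probe_reachable` flag, and the `route` function are hypothetical names invented for illustration. It shows why requests fail when a load balancer can only see its own probe results rather than the real state of the servers.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    serving: bool          # is the API server actually able to handle requests?
    probe_reachable: bool  # can the load balancer's health probe reach it?


def healthy_backends(backends):
    """Return the backends the load balancer believes are up.

    The balancer only sees probe results, not the real server state.
    """
    return [b for b in backends if b.probe_reachable]


def route(request_id, backends):
    """Route a request to a backend the balancer considers healthy."""
    candidates = healthy_backends(backends)
    if not candidates:
        # No backend is considered healthy, so the request fails at the
        # load balancer even if every server is actually serving.
        return "503 no healthy backends"
    chosen = candidates[request_id % len(candidates)]
    return f"200 served by {chosen.name}"


# Hypothetical pool: all servers are working, but no probe can reach them.
pool = [Backend(f"api-{i}", serving=True, probe_reachable=False) for i in range(3)]
print(route(1, pool))

# Once the probes succeed again, traffic flows without any server change.
for b in pool:
    b.probe_reachable = True
print(route(1, pool))
```

In this toy model the servers never change state; only the balancer's view of them does, which is consistent with the behavior we believe we observed.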

Once we realized the outage might be tied to a Kubernetes issue, at 12:20 PM EDT we tried to create a new pool of machines in the same 1.18 cluster, but then realized we would need to create a new cluster on Kubernetes 1.16 instead. By 1:15 PM EDT, we were finishing the setup and configuration of the new cluster, reconfiguring IP addresses & certificates, and restarting the load balancer. Height was back up and fully functional by 2:02 PM EDT.
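As a rough cross-check of the timeline, the elapsed time at each step can be computed directly. This sketch assumes all times are same-day wall-clock times in a single time zone and that the outage began at 11:56 in the morning; the step labels are paraphrases of the timeline, and the helper names are hypothetical.

```python
from datetime import datetime

DAY = "2020-11-13"

# Wall-clock times taken from the timeline, written in 24-hour form.
timeline = [
    ("outage begins", "11:56"),
    ("reverted to an older version of Height", "12:00"),
    ("saw the provider's load balancer notice", "12:03"),
    ("tried a new machine pool in the 1.18 cluster", "12:20"),
    ("finishing setup of the new 1.16 cluster", "13:15"),
    ("Height fully functional", "14:02"),
]


def parse(hhmm):
    """Parse an HH:MM string as a time on the day of the incident."""
    return datetime.strptime(f"{DAY} {hhmm}", "%Y-%m-%d %H:%M")


def minutes_between(start, end):
    """Whole minutes elapsed between two HH:MM strings on the same day."""
    return int((parse(end) - parse(start)).total_seconds() // 60)


start = timeline[0][1]
for label, at in timeline[1:]:
    print(f"+{minutes_between(start, at):>3} min  {label}")

total = minutes_between(start, timeline[-1][1])
print(f"total downtime: {total // 60} h {total % 60} min")
```

Under these assumptions, most of the downtime falls between the decision to build a new cluster and the point where it was serving traffic.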

Learnings & next steps

One month ago, we switched from Kubernetes 1.16 to 1.18 in order to use a load balancer feature that was introduced in 1.18. Our primary takeaway from this experience is to stay on stable versions, as Google Cloud doesn't guarantee compatibility between services on their beta channel.
We will continue to investigate to confirm that our understanding of what caused the incident is correct, and to identify whether other options would have brought Height back online more quickly.
Once the investigation is complete, we will also hold a post-mortem to identify additional action items to prevent similar incidents and mitigate their impact in the future.
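One way the "stay on stable versions" takeaway could be enforced is a small guard that runs before a deploy and refuses to proceed on a cluster version outside an allowlist. This is a minimal sketch, not something we have confirmed as an action item: the allowlist contents and the function names are hypothetical, and the version string is assumed to be supplied by whatever tooling already reports it.

```python
# Hypothetical allowlist: (major, minor) versions the team treats as stable.
STABLE_MINOR_VERSIONS = {(1, 16)}


def parse_minor(version):
    """Turn a version string such as '1.18' or 'v1.16.13' into (major, minor)."""
    parts = version.lstrip("v").split(".")
    if len(parts) < 2:
        raise ValueError(f"unrecognized version: {version!r}")
    return int(parts[0]), int(parts[1])


def check_cluster_version(version, allowed=STABLE_MINOR_VERSIONS):
    """Return (major, minor) if allowlisted, otherwise raise RuntimeError."""
    minor = parse_minor(version)
    if minor not in allowed:
        allowed_text = ", ".join(f"{a}.{b}" for a, b in sorted(allowed))
        raise RuntimeError(
            f"cluster is on {minor[0]}.{minor[1]}, expected one of: {allowed_text}"
        )
    return minor


# Usage: the 1.16 cluster passes, the 1.18 cluster is rejected.
print(check_cluster_version("1.16"))
try:
    check_cluster_version("1.18")
except RuntimeError as err:
    print(f"blocked: {err}")
```

A check like this only catches version drift; it would not replace the investigation into the load balancer behavior itself.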

Posted Nov 13, 2020 - 22:29 UTC

Resolved

We’ve resolved the issue and Height is now back up & fully operational. We sincerely apologize for the disruption to your workday this has caused. Thanks for your patience!

Posted Nov 13, 2020 - 19:05 UTC

Investigating

We are currently investigating the issue and will post updates.

Posted Nov 13, 2020 - 16:56 UTC

This incident affected: Height.