;tldr Height was down from 11:56 AM EDT to 2:02 PM EDT on November 13th, likely because we were running an unstable version of Kubernetes (1.18). A bug in this version caused the load balancer to mistakenly think all of our API servers were down, so all outside requests failed. We've switched our cluster back to the stable version (1.16) and are continuing to investigate this incident to confirm our understanding of the cause and identify steps to prevent outages like this in the future.
Starting on November 13th at 11:56 AM EDT, Height failed to load for all users; anyone trying to use Height during this time saw an error page instead.
As a first step, at 12:00 PM EDT, we reverted to an older version of Height to rule out a bug in our own services. At 12:03 PM EDT, we saw that one of our providers, Google Cloud, had posted that they were having load balancer issues, and we initially suspected this was the cause.
However, our current understanding (which we're continuing to verify) is that this incident was actually caused by a bug in the version of Kubernetes we were using. Our API servers were working, but our load balancer couldn't communicate with them, so it considered every server to be down, which in turn meant that no outside requests could reach our servers.
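For context on that mechanism: a load balancer decides whether a backend can receive traffic by periodically probing it, and a backend whose probes fail is pulled out of rotation regardless of whether the process behind it is actually healthy. Below is a minimal, illustrative sketch of such a probe loop in Python; the health-check path, timeout, failure threshold, and addresses are placeholders, not details of our actual setup.

```python
import urllib.request
import urllib.error

# Hypothetical values for illustration; the real load balancer's health-check
# path, timeout, and failure threshold are not part of this report.
HEALTH_CHECK_PATH = "/healthz"
TIMEOUT_SECONDS = 2
FAILURE_THRESHOLD = 3


def probe(backend: str, failures: dict) -> bool:
    """Send one health-check probe and track consecutive failures.

    Returns True if the backend is still considered healthy after this probe.
    """
    url = f"http://{backend}{HEALTH_CHECK_PATH}"
    try:
        with urllib.request.urlopen(url, timeout=TIMEOUT_SECONDS) as response:
            healthy = response.status == 200
    except (urllib.error.URLError, OSError):
        # The probe never reached the backend -- from the load balancer's point
        # of view this is indistinguishable from the backend being down, even
        # if the API server process itself is running fine.
        healthy = False

    if healthy:
        failures[backend] = 0
    else:
        failures[backend] = failures.get(backend, 0) + 1

    # Once enough consecutive probes fail, the backend is removed from
    # rotation and stops receiving outside traffic.
    return failures[backend] < FAILURE_THRESHOLD


if __name__ == "__main__":
    consecutive_failures: dict = {}
    backends = ["10.0.0.1:8080", "10.0.0.2:8080"]  # placeholder addresses
    for backend in backends:
        in_rotation = probe(backend, consecutive_failures)
        print(f"{backend}: {'in rotation' if in_rotation else 'removed from rotation'}")
```

This is what made the outage confusing at first: the servers themselves never stopped running, but from the load balancer's perspective they might as well have.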
Once we realized the issue might be tied to Kubernetes, at 12:20 PM EDT we tried creating a new pool of machines within the same 1.18 cluster, but then realized we needed to create a new cluster on Kubernetes 1.16 instead. By 1:15 PM EDT, we were finishing the setup and configuration of the new cluster, reconfiguring IP addresses and certificates, and restarting the load balancer; Height was back up and fully functional by 2:02 PM EDT.
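As a general illustration of the kind of check that confirms a cutover like this worked: once the IP addresses and certificates point at the new cluster, the load balancer should present a valid certificate and pass ordinary requests through again. A minimal sketch of that verification, where the hostname is a placeholder rather than our real endpoint:

```python
import socket
import ssl
import urllib.request
import urllib.error

# Placeholder hostname; none of these values come from the incident itself.
HOSTNAME = "api.example.com"


def certificate_is_served(hostname: str, port: int = 443) -> bool:
    """Confirm the load balancer presents a valid certificate for the hostname."""
    context = ssl.create_default_context()
    try:
        with socket.create_connection((hostname, port), timeout=5) as sock:
            with context.wrap_socket(sock, server_hostname=hostname) as tls:
                # A hostname mismatch or invalid chain raises during the
                # handshake, so reaching this point means the cert checks out.
                return tls.getpeercert() is not None
    except (ssl.SSLError, OSError):
        return False


def endpoint_responds(hostname: str) -> bool:
    """Confirm an ordinary HTTPS request makes it through the load balancer."""
    try:
        with urllib.request.urlopen(f"https://{hostname}/", timeout=5) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


if __name__ == "__main__":
    ok = certificate_is_served(HOSTNAME) and endpoint_responds(HOSTNAME)
    print("Recovery check passed." if ok else "Recovery check failed.")
```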