Saturday 12th January 2019

MetaKube partial outage

Root cause:

We created an OAuth service with Dex to authenticate our web applications against. Dex writes all session information into the etcd cluster. Due to misconfigured monitoring, we created over 500,000 sessions in the etcd cluster. The database size grew from about 300MB to 700MB, which slowed down all cluster operations. Customer applications were not affected at any point during the incident. However, the MetaKube dashboard, including cluster creation, repeatedly failed to work correctly whenever one of the master apiservers ran out of memory.
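
As a rough sanity check on the figures above, the growth in database size can be divided by the number of sessions. This is only a back-of-envelope sketch: it assumes the entire growth is attributable to the sessions and that 1MB means 1,000,000 bytes, and the function name is illustrative only. Since both figures are approximate, the result is an order of magnitude, not a measurement.

```python
# Back-of-envelope estimate using the approximate figures from this report.
# Assumptions (not stated in the report): all growth comes from the sessions,
# and 1 MB = 1,000,000 bytes.
SESSIONS = 500_000
SIZE_BEFORE_MB = 300
SIZE_AFTER_MB = 700


def bytes_per_session(before_mb, after_mb, sessions):
    """Return the average database growth per session in bytes."""
    growth_bytes = (after_mb - before_mb) * 1_000_000
    return growth_bytes / sessions


print(bytes_per_session(SIZE_BEFORE_MB, SIZE_AFTER_MB, SESSIONS))  # 800.0
```

Under these assumptions each session accounts for roughly 800 bytes in etcd.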