GitHub Outage Dissections

Outages are inevitable. In this course, we dissect GitHub outages to understand the whys and the hows. Along the way, we learn about GitHub's architecture and practices and see how systems are built at scale.



1 Database Outages at GitHub

1.1 When ID column hits its limit

1.2 ALTER TABLE on a huge MySQL table

1.3 Inefficient SQL query due to an edge case

1.4 Data divergence due to Master failover

1.5 Outage due to ProxySQL upgrade

1.6 Integer Overflow in MySQL DB

2 Outages due to blind spots at GitHub

2.1 Repository creation failed due to Secret Scanning

2.2 Outage due to reversing an index

2.3 Outage due to a GraphQL lib

3 Topological Outages at GitHub

3.1 Outage in A/B experimentation service

3.2 Chaos in Zookeeper

3.3 Cascading failures due to DB outage

The course is not complete yet; more lessons are coming soon.

What you'll get

  • Structured Learning
  • In-depth Explanations
  • One-page write-up for each topic
  • Handwritten notes for each topic
  • Progress Tracking

About Instructor

Hey! I am Arpit Bhayani, a passionate CS engineer who loves to explore engineering in depth. Over the last ~9 years, I have worked at D. E. Shaw, Practo, Amazon, and Unacademy, and have built systems, services, and platforms that scaled to billions.

In 2018, I joined Unacademy as its first Technical Architect, where I designed, built, managed, and scaled services such as Search, Notification, Logging, Deployment Engine, and many more. I am currently a Director of Engineering, leading the Site Reliability and Data Engineering verticals.