DDoS Postmortem, 2016-01-13

By Matt Biilmann, January 13 2016

A follow-up on the distributed denial-of-service attacks against our infrastructure over the last few days.

On Tuesday, January 12, at 4:50pm PST, we were alerted to an incoming DDoS attack. The attack started on the east coast and expanded into a massive DDoS attack simultaneously targeting CDN nodes in New York, San Francisco, Amsterdam and Chicago. It looked like a clear escalation from Saturday's DDoS attack, which only affected our main load balancer.

The DDoS was big enough that several switches across our different data centers became unresponsive. Our monitoring system detected that and immediately started removing the affected end nodes from rotation as well as rolling out extra capacity.

We still had plenty of extra capacity, and our automated traffic director simply routed around the affected nodes. However, this DDoS brought to light an edge case in our failover rules. When a certain combination of CDN PoPs went down at the same time, our DNS servers would start returning empty DNS records for users within the relevant region, instead of rerouting them to a different region.
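To make the edge case concrete, here is a minimal sketch of a region failover rule that never returns an empty answer. It is not our actual traffic director code: the region names, the `POPS` and `FALLBACKS` tables, the documentation-range IP addresses and the `resolve` function are all hypothetical, chosen only to illustrate the idea of always falling through to another region.

```python
from typing import Dict, List, Set

# Hypothetical PoP inventory: region -> PoP addresses (documentation IP ranges).
POPS: Dict[str, List[str]] = {
    "us-east": ["192.0.2.10", "192.0.2.11"],
    "us-west": ["198.51.100.10"],
    "eu-west": ["203.0.113.10"],
}

# Hypothetical preferred fallback order for each region.
FALLBACKS: Dict[str, List[str]] = {
    "us-east": ["us-west"],
    "us-west": ["us-east"],
    "eu-west": ["us-east"],
}


def resolve(region: str, healthy: Set[str]) -> List[str]:
    """Return PoP addresses for a user in `region`, never an empty list."""
    # Try the user's own region first, then its preferred fallbacks.
    candidates = [region] + FALLBACKS.get(region, [])
    # Append every remaining region so that no combination of failed
    # PoPs can exhaust the candidate list early.
    for other in POPS:
        if other not in candidates:
            candidates.append(other)

    for candidate in candidates:
        answers = [ip for ip in POPS.get(candidate, []) if ip in healthy]
        if answers:
            return answers

    # Every PoP looks unhealthy: return the home region's PoPs anyway,
    # since an empty DNS answer guarantees an outage for the user.
    return list(POPS.get(region, [])) or [ip for ips in POPS.values() for ip in ips]
```

In this sketch, a user in `us-east` is sent to `eu-west` when both US regions are down, even though `eu-west` is not listed as a preferred fallback for `us-east`. The final fallback is a design choice: answering with a possibly unhealthy PoP is assumed to be better than answering with nothing.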

This caused sites hosted on Netlify to go unresponsive across several regions on an on-and-off basis for around 30 minutes, until we got a permanent fix in place for these failover rules. It was the first service outage we’ve experienced since we started operation.

Early Wednesday morning, at 4:07am PST, a new follow-up DDoS hit. This time our automatic rerouting had significantly improved, but we still experienced regional downtime due to “flapping” edge nodes that went in and out of rotation rapidly instead of simply being routed around until the end of the attack. Once we forced the traffic director to route around the affected edge nodes we returned to normal, and we now have safeguards against that happening again.
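One common way to guard against flapping is to make it harder for a node to re-enter rotation than to leave it. The sketch below shows that idea under stated assumptions: the `FlapGuard` class and its `required_successes` threshold are hypothetical and do not describe the safeguards we actually deployed.

```python
class FlapGuard:
    """Keep a node out of rotation until it passes several checks in a row."""

    def __init__(self, required_successes: int = 5) -> None:
        # Hypothetical threshold: consecutive passing health checks a
        # removed node needs before it may serve traffic again.
        self.required_successes = required_successes
        self.in_rotation = True
        self._streak = 0

    def observe(self, check_passed: bool) -> bool:
        """Record one health check result and return the rotation state."""
        if not check_passed:
            # A single failure removes the node and resets its progress.
            self.in_rotation = False
            self._streak = 0
        elif not self.in_rotation:
            # A removed node has to earn its way back in.
            self._streak += 1
            if self._streak >= self.required_successes:
                self.in_rotation = True
                self._streak = 0
        return self.in_rotation
```

With `required_successes=3`, a node whose checks alternate between passing and failing stays out of rotation, and it only returns after three passing checks in a row. The asymmetry is deliberate: removing a node quickly is cheap when there is spare capacity, while readmitting it too early sends users back to a node that is still under attack.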

At Netlify we’re fanatical about uptime and performance, and experiencing any service outage is something we’re absolutely adamant about avoiding. The good news is that even in the face of a global DDoS attack simultaneously targeting all of our CDN PoPs, our system had more than enough capacity to stay fast and responsive. With the improvements we’ve put in place to our failover systems as a direct result of this attack, we’re confident that future attacks of the same volume will fail to take down our platform.