Building a highly reliable service like Netflix requires hundreds of services to work together reliably. If that doesn’t happen, we’ve failed at our jobs. The Demand Engineering team’s focus is reliability, which we achieve by ensuring that capacity needs are met across the Netflix ecosystem. We do this by shaping traffic, scaling systems, and recovering from failure.
Demand Engineering helps Netflix meet its customer reliability and overall efficiency goals by ensuring that the services that run Netflix have the compute resources they need, where and when they need them. We also run the infrastructure that reactively mitigates incidents through regional evacuation, without our customers noticing.
Our team sits in the middle of the action at Netflix. To serve over 193 million members around the world, Netflix needs to have capacity available when and where it’s needed. More importantly, we need to be able to shift that capacity at a moment’s notice if there is a problem with the infrastructure. This means predicting what compute resources are needed, when they’re needed, and where they’re needed at any point in the day.
Our team creates the tools and techniques that make all of this possible, and we operate the infrastructure as well. Steering and scaling are not only for incidents: they are also powerful tools for influencing the availability and latency of Netflix during normal operations.
We have a lot of fun problems to solve, a scale that makes them challenging, and a culture that gives us the freedom to pursue what is best for our members and the business.