From the Book - First edition, April 2016.
Introduction. The production environment at Google, from the viewpoint of an SRE
Principles. Embracing risk
Monitoring distributed systems
The evolution of automation at Google
Practices. Practical alerting from time-series data
Effective troubleshooting
Postmortem culture: learning from failure
Software engineering in SRE
Load balancing at the frontend
Load balancing in the datacenter
Addressing cascading failures
Managing critical state: distributed consensus for reliability
Distributed periodic scheduling with Cron
Data processing pipelines
Data integrity: what you read is what you wrote
Reliable product launches at scale
Management. Accelerating SREs to on-call and beyond
Embedding an SRE to recover from operational overload
Communication and collaboration in SRE
The evolving SRE engagement model
Conclusions. Lessons learned from other industries