author: Tyler Davis <tyler@gluecode.net> 2025-01-02 16:59:40 +0000
committer: Tyler Davis <tyler@gluecode.net> 2025-01-02 16:59:40 +0000
commit: 0cbffb6938b6b25f118e888b48edc50f0338a50d
tree: 32aee4eeb1958e31701e4f9f947add0a5c48428d /ancillary
parent: bc8ff7cfc40c6d6a1c714f029e66112bdef146f7
post: formatting tweaks
Diffstat (limited to 'ancillary')
-rw-r--r-- ancillary/EvolutionOfSREGoogle-USENIX2024.md | 12
1 files changed, 7 insertions, 5 deletions
diff --git a/ancillary/EvolutionOfSREGoogle-USENIX2024.md b/ancillary/EvolutionOfSREGoogle-USENIX2024.md
index ab061b1..d4c3c52 100644
--- a/ancillary/EvolutionOfSREGoogle-USENIX2024.md
+++ b/ancillary/EvolutionOfSREGoogle-USENIX2024.md
@@ -4,7 +4,7 @@
Using STAMP to improve resilience in Google production systems
-> December 18, 2024
+Published: December 18, 2024
Authors: Tim Falzone, Ben Treynor Sloss \
Article shepherded by: Rik Farrow
@@ -89,9 +89,9 @@ Instead of asking "What software service failed?" we ask “What interactions be
Another incredibly important implication of an accident model is that it helps you analyze the time dimension of an accident. In a linear chain, there is a sequence of events laid out over time, but it only describes two states that the system can be in—normal operations, before the last event in the chain occurs and the system has not yet had an accident, and loss operations, after the last event in the chain occurs and the accident begins.
-![Control flow of a system without hazard states][1]
-
+![Control flow of a system without hazard states][1] \
_Control flow of a system without hazard states_
+
---
The transition from normal operations to loss operations is typically very sudden—there is almost no time to react to prevent it. This is one reason why SRE uses a combination of fast burn and slow burn SLOs for detecting problems that might be developing but aren't yet at the point of causing real harm. However, these SLOs are normally attributes of individual system components.
@@ -101,7 +101,8 @@ STAMP formalizes this concept at the system level as hazard states. "A hazard is
Hazard states are not discrete events. They do not describe anything at the individual system component-level. A hazard state is a property of the system as a whole, and the system can be in a hazard state for a long period of time before an accident occurs. That gives engineers a much larger target to aim at when trying to prevent outages. Rather than trying to eliminate any single failure that could occur anywhere in the system, we work to prevent the system from entering a hazard state. And if we do enter a hazard state, if we can detect it and take action to transition from the hazard state back to normal operations, we can prevent any accident from occurring. In some cases, the system is in a hazard state for a long time—a bug is introduced but never triggered, an alert fires but no one receives it, a server is underprovisioned but suddenly receives traffic from a popular new product feature, etc.
![Diagram showing process flow from Normal operations on left through Hazard state in center to Loss Operations on Right][2] \
-Diagram showing process flow from Normal operations on left through Hazard state in center to Loss Operations on Right
+_Diagram showing process flow from Normal operations on left through Hazard state in center to Loss Operations on Right_
+
---
## Making it concrete with a real example
@@ -145,7 +146,8 @@ Of course, major incidents are never simple events—the next problem was that d
As Leveson writes in Engineering a Safer World: "In [STAMP], understanding why an accident occurred requires determining why the control was ineffective. Preventing future accidents requires shifting from a focus on preventing failures to the broader goal of designing and implementing controls that will enforce the necessary constraints." This shift in perspective - from trying to prove the absence of problems to effectively managing known and potential hazards - is a key principle in our system safety approach.
![Control Flow of The Rightsizer quota-management system][3] \
-Control Flow of The Rightsizer quota-management system
+_Control Flow of The Rightsizer quota-management system_
+
---
## Where We Are Heading