# Systems and Self-Defense
_Published 2025-04-16_
Despite what the first Avengers movie would tell us, a system can
protect itself _from itself_ if built intentionally. To understand how,
let's start with the following:
**Tyler's Law**:
> "Any system will inevitably be used to 100% of its authorized capacity."
**Tyler's Corollary**:
> "If your authorized capacity is equal to your available capacity, your
> system will fail."
"Authorized" capacity and "available" capacity are different terms, and
they are not interchangeable.
## Denial of Service (DoS)
Authorized capacity for a computer is not generally controllable by the
end user (unless you've got `root` access), so a single process can
consume as much in resources as the computer has available (with very
few limits). If one runs a command to duplicate a movie file of "Plan 9
from Outer Space" fifty thousand times, like
```
seq 50000 | xargs -I{} cp plan9.mov plan9copy{}.mov
```
the computer will dutifully use all its resources to accomplish that
objective until the job completes or the disk is full.
Note that there is _no interactivity_ once the command is executed -- it
accepts one set of instructions, stops taking input, then runs with no
indication of progress until the job is done.
A service, on the other hand, accepts input from another source and
_persists._ As the service does work on a job, it may be coded to accept
more inputs, and reply to the requestor with already-completed work.
This poses a problem for managing resources: how many resources are to
be used for in-flight operations? Does the service have enough resources
to accept new work while processing a current job? How does the service
tell the requestor it's not ready yet?
If the system is using all of its _available_ resources to do work, then
there is no resource left to respond to a client/user, to process added
[`signals`][9] (like `kill`/`term`), or even provide telemetry to an
observer.
If a service or a computer "goes silent," how are we sure it is
functioning correctly, if at all?
## The Problem
The key point to the previous section is this:
>If we are able to give a service enough work that it uses all of its
>available resources, then we've achieved a [Denial of Service][1]
>condition.
This is bad. To allow a program to "cancel" an erroneous command or be
triggered to produce telemetry / feedback, the program must be able to
listen for [`signals`][9] from the operating system and act accordingly.
The only exception is [`SIGKILL`][2], which cannot be blocked or
handled.
The goal of the operating system / kernel is to ensure that "authorized
resources" never exceed "available resources", or *the system will
crash*. This is why `SIGKILL` is unblockable -- it's an action of last
resort by the operating system (OS) to protect itself.
But what if we have an interactive _service_? We don't want to terminate
the process if it gets stuck -- we want it to keep running. So how do we
protect it?
## Self-defense
All programs and services practice a form of self-defense known as
response-codes or error-codes. They provide signals to an operator or
requestor that vary from "I'm still here" to "Please try again later"
or "this broke something" or even "your request is broken and I won't do
it."
What does this look like in practice though?
### Service Example
Let's say I have a web server that accepts text, appends it to a file,
and returns a line number to the client.
```
(Step 1) ClientA -- "foo" --> <Service> --> [write to disk]
(Step 2) ClientA <-- "1" -- <Service>
```
Because this service has to write values sequentially, it cannot do
anything else while it is performing work for Client-A. This means that
if Client-A supplies enough work, the service is effectively
monopolized.
So let's look at what happens when we introduce Client-B:
```
(Step 1) ClientA -- "foo" --> <Service> --> [write to disk]
(Step 2) ClientB -- "bar" --> <Service> --> [BLOCKED]
(Step 3) ClientB <-- "TIMEOUT" -- <Service>
(Step 4) ClientA <-- "1" -- <Service> -- [write completes]
```
Because this service has to write values sequentially, while it is
performing work for Client-A, if Client-B wants to send "bar" to the
service, we must tell Client-B to wait or come back later.
By default, a TCP client will connect to a service, send data, and
then receive a response. The operating system queues incoming connection
requests in the listen backlog, so if the service is busy, new
connections simply wait in that queue until the application accepts
them. Your application, however, can't see this "wait" condition -- the
client just goes silent until either the kernel-held connection is
accepted, or it times out and the kernel evicts it from the queue.
This is not a great solution. What could we do instead?
### Do Nothing?
No really, what if we do nothing?
That's a pretty great situation for the developer -- zero work needed
and the kernel/OS does the multiplexing. This, however, is a _horrible_
experience for clients and service operators. Before specific solutions
like running a dedicated HTTP service or providing an HTTP stack
in-process, the answer to making a service network-available was
[inetd][10]. It was (and still is) incredibly slow and does not scale
beyond very low traffic rates.
So, by doing nothing, the service appears inconsistent, with
periodically high latency, and we have not actually fixed the problem
(see "Denial of Service" above).
### Application Layer Defense
Most network services operate on HTTP, even ones that [bind to unix
sockets][5]. Even [gRPC operates on HTTP/2][6], so I think it's safe to
say I can leverage HTTP status codes as an example of how to respond to
a client without requiring _too much_ translation to other stacks.
When an HTTP server is "busy", [rfc6585][8] suggests responding with a
[code `429`][4] which maps to "Too Many Requests."
Of note is this paragraph:
>Note that this specification does not define how the origin server
>identifies the user, nor how it counts requests. For example, an
>origin server that is limiting request rates can do so based upon
>counts of requests on a per-resource basis, across the entire server,
>or even among a set of servers. Likewise, it might identify the user
>by its authentication credentials, or a stateful cookie.
Let's refer back to our Server example. If we have Client-A handling a
request, then any *additional* requests during the time Client-A's
request is being processed would be "too many requests" for the server
to handle.
So what we should do is configure the server to do two things at once
(or, at least, two things concurrently). As pseudocode:
```
var locked bool
fn writeLineToFile(s string, f file) -> (success bool, l int) {
  locked = true
  n, err = os.Write(s, f)
  locked = false
  if err != nil {
    return false, 0  // We failed
  }
  return true, n
}
fn main() {
  f = open("filename")
  requests = http.Listen(port)
  for r in requests {  // assume each request is handled concurrently
    if locked {
      r.reply(http-429)  // Too Many Requests
    } else {
      ok, line = writeLineToFile(r.payload, f)
      if ok {
        r.reply(line)
      } else {
        r.reply(http-500)  // Error
      }
    }
  }
}
```
Let's take a second to look at this -- the main code opens a file to
persist the data and listens for requests on the HTTP port. All seems
normal until we get to the `if locked` section. If the lock is active,
the service immediately stops and replies with an [HTTP-429][4].
(For those using gRPC, you can swap "HTTP-429" out for [status
code][7] `UNAVAILABLE(14)`.)
You'll notice that we haven't done any additional code checking -- no
computation, no querying the file, just check the boolean value, then
act.
This is, computationally speaking, very cheap to do. The placement of
the check also means that we short-circuit the operation _before we
even start down the computationally-expensive path_ of accepting the
payload and doing anything with the file.
Looking further into the `writeLineToFile` function, you'll also see
that we "lock" the file for the minimum part of the operation -- the
part where we actually write to the file. We don't lock the
error-checking portion of the code and we don't lock the reply to the
client. This means that while the code is doing
things-not-writing-to-files, we can go as fast and as concurrently as we
want.
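Here is one way the pseudocode above could look as runnable Go -- a
sketch, not the post's canonical implementation. The handler name, the
in-memory buffer standing in for the disk file, and the self-test in
`main` are my own choices (`sync.Mutex.TryLock` needs Go 1.18+):
```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
	"sync"
)

var (
	mu   sync.Mutex      // guards buf; TryLock lets us refuse rather than queue
	buf  strings.Builder // stand-in for the append-only file on disk
	line int             // last line number written
)

// appendHandler mirrors the pseudocode: if a write is already in
// flight, reply HTTP-429 immediately; otherwise write and return the
// new line number.
func appendHandler(w http.ResponseWriter, r *http.Request) {
	if !mu.TryLock() { // busy: short-circuit before any expensive work
		http.Error(w, "busy", http.StatusTooManyRequests) // HTTP-429
		return
	}
	body, err := io.ReadAll(r.Body)
	if err == nil {
		line++
		fmt.Fprintf(&buf, "%s\n", body)
	}
	mu.Unlock() // hold the lock only around the actual write
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	fmt.Fprintln(w, line)
}

func main() {
	srv := httptest.NewServer(http.HandlerFunc(appendHandler))
	defer srv.Close()

	// Lock free: the write succeeds and the line number comes back.
	resp, _ := http.Post(srv.URL, "text/plain", strings.NewReader("foo"))
	fmt.Println("idle service:", resp.StatusCode) // 200

	// Simulate an in-flight write by holding the lock ourselves.
	mu.Lock()
	resp, _ = http.Post(srv.URL, "text/plain", strings.NewReader("bar"))
	mu.Unlock()
	fmt.Println("busy service:", resp.StatusCode) // 429
}
```
Note that `TryLock` fixes a race the pseudocode glosses over: checking
`locked` and then setting it are a single atomic step, so two requests
can't both see "unlocked" and proceed.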
If we increase our activity to three clients (A,B,C), our new operation
graph looks something like this:
```
(Step 1) ClientA -- "foo" --> <Service> --> [write to disk]
(Step 2) ClientB -- "bar" --> <Service> --> [BLOCKED]
(Step 3) ClientC -- "quux" --> <Service> --> [BLOCKED]
(Step 4) ClientB <-- "E: 429" -- <Service>
(Step 5) ClientC <-- "E: 429" -- <Service>
(Step 6) ClientA <-- "1" -- <Service> -- [write completes]
(Step 7) ClientC -- "quux" --> <Service> --> [write to disk]
(Step 8) ClientB -- "bar" --> <Service> --> [BLOCKED]
(Step 9) ClientB <-- "E: 429" -- <Service>
(Step 10) ClientC <-- "2" -- <Service> -- [write completes]
```
You'll notice at Step-7, ClientC tried again, faster than ClientB, and
they won the race! As we reply to clients, they can decide when to try
again, and since ClientA's request was fulfilled in Step-6 the lock was
lifted, allowing ClientC to make a write.
Regardless of who wins the race to the next write, the service can only
handle one active request at a time, and **it protects itself** from
being forced to handle additional work by telling clients to try again.
## "Authorized Capacity" != "Available Capacity"
The service described previously has an "Authorized Capacity" of "1
in-flight request." If the computer running this service were a
single-core tiny computer, it might not have enough power to serve
additional requests (even the HTTP-429 replies), but most computing
devices will have either sufficient speed (allowing for concurrent
processing) or additional cores to handle concurrent requests. As a
result, authorized capacity is less than the total available capacity of
the server.
When running an unmodified, uncontained process in a computer, the
process is "authorized" to use _nearly_ the entire set of resources
available to the OS.
In a cloud environment, a process can run inside a virtual machine (VM)
where the constraints are similar to a bare-metal computer, but (with
few exceptions) the VM is executing within even higher-capacity
bare-metal hardware. _This means the VM is authorized to use only a
slice of a larger pool of available capacity._
If running in a container or BSD jail, the constraints are set by the
container manager (sometimes with just a command-line argument!), and
the kernel then grants (and constrains) those resources to the process
inside.
If a process exceeds its authorized limits in any of these environments,
the supervisor process (OS, container manager, VM, etc) will forcibly
end the offending process ("kill"/"terminate") and reclaim its
resources. It doesn't matter whether or not **additional resources
exist** that the process could use -- the supervisory system will act.
## Conclusions
All of this leads to a few heuristics when defining the runtime
environment for an application:
- Determine the maximum resource usage for a given process, then add
excess capacity to ensure the system has more than it needs by an
appreciable margin. A reasonable heuristic is a minimum of 15-25%
headroom at lower capacities, and as little as 5% for larger
environments (to avoid wasting significant capacity).
- Stress-test / load-test your application to its "maximum" throughput
for the resources you want to use, then set your maximum in-flight to
90% of that value. This ensures your instance can always process its
maximum in-flight load within its resource limits.
Both of these will ensure your *authorized* resources (for the
application, your container, and your process) will never exceed your
*available* capacity and keep your service functioning at maximum
throughput in the face of overwhelming requests.
[1]:https://en.wikipedia.org/wiki/Denial-of-service_attack
[2]:https://www.gnu.org/software/libc/manual/html_node/Termination-Signals.html
[3]:https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Status/413
[4]:https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Status/429
[5]:https://aweirdimagination.net/2024/04/07/http-over-unix-sockets/
[6]:https://grpc.io/blog/grpc-on-http2/
[7]:https://grpc.io/docs/guides/status-codes/
[8]:https://www.rfc-editor.org/rfc/rfc6585#section-4
[9]:https://www.man7.org/linux/man-pages/man7/signal.7.html
[10]:https://en.wikipedia.org/wiki/Inetd