Diffstat (limited to 'docs/tiers.md')
-rw-r--r-- | docs/tiers.md | 64 |
1 files changed, 43 insertions, 21 deletions
diff --git a/docs/tiers.md b/docs/tiers.md
index 60cc7b3..14af482 100644
--- a/docs/tiers.md
+++ b/docs/tiers.md
@@ -2,47 +2,69 @@
 
 ## Definition
 
-Platforms and services can have different expectations depending on the technologies used, its support systems, and customer-impact.
-This document defines those expectations into four "tiers" from the most-critical (Tier 1) to the least-critical (Tier 4).
+Platforms and services can have different expectations depending on the
+technologies used, its support systems, and customer-impact. This
+document defines those expectations into four "tiers" from the
+most-critical (Tier 1) to the least-critical (Tier 4).
 
 ### Base Requirements
 
-- Teams MUST plan for both course-of-business failures and disaster-level events.
-- Teams MUST assign a tier number for each application (service, platform, or system) they support.
-- Applications MUST meet the availability and resilience targets of their tier.
+- Teams MUST plan for both course-of-business failures and
+ disaster-level events.
+- Teams MUST assign a tier number for each application (service,
+ platform, or system) they support.
+- Applications MUST meet the availability and resilience targets of
+ their tier.
 
 ### Tier 1
 
-Tier 1 applications are **core/critical systems upon which all else is built**.
-Examples include Active Directory, Kubernetes clusters (Dev, Integration, or Production), and Datacenter Firewalls.
+Tier 1 applications are **core/critical systems upon which all else is
+built**. Examples include Active Directory, Kubernetes clusters (Dev,
+Integration, or Production), and Datacenter Firewalls.
 
-Tier 1 applications MUST provide at least **[%99.95 availability](https://uptime.is/99.95)** or less than four hours of downtime n-total per year.
+Tier 1 applications MUST provide at least **[%99.95
+availability](https://uptime.is/99.95)** or less than four hours of
+downtime n-total per year.
 
 ### Tier 2
 
-Tier 2 applications are **critical and/or time-sensitive**.
-Such applications could include a customer-facing billing system, an IAM gateway, or a central code-management platform (Github).
+Tier 2 applications are **critical and/or time-sensitive**. Such
+applications could include a customer-facing billing system, an IAM
+gateway, or a central code-management platform (Github).
 
-Tier 2 applications MUST provide at least **[%99.9 availability](https://uptime.is/99.9)** or less than nine hours of downtime in-total per year.
+Tier 2 applications MUST provide at least **[%99.9
+availability](https://uptime.is/99.9)** or less than nine hours of
+downtime in-total per year.
 
 ### Tier 3
 
-Tier 3 applications are **important and not time-sensitive**.
-These include systems for end-of-month billing, internal (non-customer-impacting) metrics,
+Tier 3 applications are **important and not time-sensitive**. These
+include systems for end-of-month billing, internal
+(non-customer-impacting) metrics,
 
-Tier 3 applications MUST provide at least **[%99 availability](https://uptime.is/99)** or less than four days of downtime in-total per year.
+Tier 3 applications MUST provide at least **[%99
+availability](https://uptime.is/99)** or less than four days of downtime
+in-total per year.
 
 ### Tier 4
 
-Tier 4 applications have **low impact when delayed**.
-Tier 4 applications include everything which is not assigned to other tiers.
+Tier 4 applications have **low impact when delayed**. Tier 4
+applications include everything which is not assigned to other tiers.
 
-Tier 4 applications MUST provide at least **[%97 availability](https://uptime.is/97)** or less than ten days of downtime in-total per year.
+Tier 4 applications MUST provide at least **[%97
+availability](https://uptime.is/97)** or less than ten days of downtime
+in-total per year.
 
 ## Availability Calculations
 
-Availability is calculated based on whether a given service is responding *correctly* and is *communicating*.
+Availability is calculated based on whether a given service is
+responding *correctly* and is *communicating*.
+
+- If a service endpoint is not reachable, but is otherwise returning
+ positive metrics (or no errors), the system is not Available.
+- If a service endpoint can be connected, but only returns error
+ messages, the system is not Available.
+- If a service has a combination of unreachability and erroneous
+ responses for a duration which exceeds the Availability limit, that
+ service has broken its SLA guarantee.i
 
-- If a service endpoint is not reachable, but is otherwise returning positive metrics (or no errors), the system is not Available.
-- If a service endpoint can be connected, but only returns error messages, the system is not Available.
-- If a service has a combination of unreachability and erroneous responses for a duration which exceeds the Availability limit, that service has broken its SLA guarantee.i
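
The tier targets and the availability rules in this change can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the commit: the `Probe` record, the function names, and the 365-day year are assumptions made for the example, while the percentages are copied from the tier definitions.

```python
from dataclasses import dataclass

# Availability targets (percent) copied from the tier definitions in the diff.
TIER_TARGETS = {1: 99.95, 2: 99.9, 3: 99.0, 4: 97.0}

# Assumption: a non-leap, 365-day year.
HOURS_PER_YEAR = 365 * 24


def downtime_budget_hours(tier: int) -> float:
    """Hours of downtime per year permitted by the tier's availability target."""
    return HOURS_PER_YEAR * (100.0 - TIER_TARGETS[tier]) / 100.0


@dataclass
class Probe:
    """One health-check sample. Hypothetical shape, not defined by the document."""
    reachable: bool    # the endpoint could be connected to
    ok_response: bool  # the endpoint returned a correct (non-error) response


def is_available(probe: Probe) -> bool:
    """Available only when the service is both communicating and responding correctly."""
    return probe.reachable and probe.ok_response


def measured_availability(probes: list[Probe]) -> float:
    """Percentage of samples in which the service counted as Available."""
    if not probes:
        raise ValueError("at least one probe sample is required")
    up = sum(1 for p in probes if is_available(p))
    return 100.0 * up / len(probes)


def meets_tier(probes: list[Probe], tier: int) -> bool:
    """True when measured availability reaches the tier's target."""
    return measured_availability(probes) >= TIER_TARGETS[tier]


if __name__ == "__main__":
    for tier in sorted(TIER_TARGETS):
        print(f"Tier {tier}: {downtime_budget_hours(tier):.2f} hours/year")
```

Under the 365-day assumption, the budgets come out to roughly 4.38 hours (Tier 1), 8.76 hours (Tier 2), 87.6 hours or 3.65 days (Tier 3), and 262.8 hours or 10.95 days (Tier 4), so the rounded durations quoted in the tier text are approximations of the percentage targets. Note that unreachable samples and error-only samples both count against availability, matching the bullets under "Availability Calculations".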