Diffstat (limited to 'docs')
-rw-r--r-- | docs/alerting.md | 15
-rw-r--r-- | docs/core.md | 25
-rw-r--r-- | docs/datastores.md | 33
-rw-r--r-- | docs/git-recommendations.md | 25
-rw-r--r-- | docs/languages/golang.md | 31
-rw-r--r-- | docs/languages/java.md | 3
-rw-r--r-- | docs/languages/nodejs.md | 15
-rw-r--r-- | docs/monitoring.md | 143
-rw-r--r-- | docs/observability.md | 30
-rw-r--r-- | docs/tiers.md | 64
10 files changed, 261 insertions, 123 deletions
diff --git a/docs/alerting.md b/docs/alerting.md index 8f8573d..de8d6a9 100644 --- a/docs/alerting.md +++ b/docs/alerting.md @@ -1,12 +1,13 @@ # Alerting -Alerts are signals from [Monitors][1] to perform actions. -Alerts MUST be _meaningful_ and _actionable_. +Alerts are signals from [Monitors][1] to perform actions. Alerts MUST +be _meaningful_ and _actionable_. -**Meaningful**: Alert only on montiors which indicate a problem. -(See [Monitoring][1] subsection "Saturation".) +**Meaningful**: Alert only on montiors which indicate a problem. (See +[Monitoring][1] subsection "Saturation".) -**Actionable**: Alerts MUST always include a corresponding action to resolve or investigate the underlying condition. +**Actionable**: Alerts MUST always include a corresponding action to +resolve or investigate the underlying condition. ## Requirements @@ -14,8 +15,8 @@ Alerts MUST be _meaningful_ and _actionable_. - Alerts MUST begin with [PagerDuty][2]. - Alert notifications MAY forward to Slack. - Alert notifications MAY forward via Email. -- Alerts MUST NOT deduplicate automatically. - A human MAY aggregate alerts if unknown dependencies provide alerts during an incident. +- Alerts MUST NOT deduplicate automatically. A human MAY aggregate + alerts if unknown dependencies provide alerts during an incident. [1]: monitoring.md [2]: https://www.pagerduty.com/ diff --git a/docs/core.md b/docs/core.md index d73b122..1e16d54 100644 --- a/docs/core.md +++ b/docs/core.md @@ -1,23 +1,30 @@ # Core Rules and Guidelines for Standards -The following contain both requirements and guidelines for producing Standards documents. +The following contain both requirements and guidelines for producing +Standards documents. ## Terms -- Standards review group (*SRG*): assigned group which reviews and approves changes to the Standards. +- Standards review group (*SRG*): assigned group which reviews and + approves changes to the Standards. 
## Requirements -- The SRG MUST review outstanding change requests (PRs) every two weeks (14 days). -- PRs MUST remain open for comment no less than 5 business days (or one week). - - PRs MAY remain open for more than 2 weeks if the discussion is active and ongoing. -- The SRG MUST publish Standards no more frequently than once per week, nor less frequently than once every three months. +- The SRG MUST review outstanding change requests (PRs) every two weeks + (14 days). +- PRs MUST remain open for comment no less than 5 business days (or one + week). + - PRs MAY remain open for more than 2 weeks if the discussion is + active and ongoing. +- The SRG MUST publish Standards no more frequently than once per week, + nor less frequently than once every three months. Docuents MUST: - Follow [RFC-2119](rfc2119.txt). - Be as specific as possible (e.g., no colloquial langauge). -- Be concise and use [active voice](https://writing.wisc.edu/handbook/style/ccs_activevoice/). +- Be concise and use [active + voice](https://writing.wisc.edu/handbook/style/ccs_activevoice/). - Capitalize all words in headings. - Use American English spellings and conventions. @@ -25,4 +32,6 @@ Docuents MUST: All Standards documents: -- SHOULD follow the [Google Style Guide](https://developers.google.com/style/lists#capitalization-and-end-punctuation) for punctuation and capitalization. +- SHOULD follow the [Google Style + Guide](https://developers.google.com/style/lists#capitalization-and-end-punctuation) + for punctuation and capitalization. diff --git a/docs/datastores.md b/docs/datastores.md index ef22c72..5d1955b 100644 --- a/docs/datastores.md +++ b/docs/datastores.md @@ -2,16 +2,23 @@ ## Scope -This standard prescribes database and data storage technologies used to solve many related data-retention concerns. -The solutions recommended below are designed to encourage deep expertise in a few stable and well-understood systems, rather than maximal "fit" for each distinct use case. 
+This standard prescribes database and data storage technologies used to +solve many related data-retention concerns. The solutions recommended +below are designed to encourage deep expertise in a few stable and +well-understood systems, rather than maximal "fit" for each distinct use +case. -As such, the solutions may not be the most optimal but their performance, maintenance, optimizations, and reliability requirements are understood and supported by the engineering community. +As such, the solutions may not be the most optimal but their +performance, maintenance, optimizations, and reliability requirements +are understood and supported by the engineering community. ## Terms -- _Database_: provides long-term, durable storage for data whose loss or unavailability would mean violating an application's Availability or Business requirements. -- _Cache_: provides short-term, volatile storage which does not preserve data. - Caches are not in the scope of this standard. +- _Database_: provides long-term, durable storage for data whose loss or + unavailability would mean violating an application's Availability or + Business requirements. +- _Cache_: provides short-term, volatile storage which does not preserve + data. Caches are not in the scope of this standard. ## Capability Matrix @@ -22,21 +29,27 @@ As such, the solutions may not be the most optimal but their performance, mainte | [Document-Oriented] | X | X | X | | [Object-Based] | X | X | X | -_O_: While KV-stores cannot store relational data, some KV-focused databases provide relational-like "tagging" and other attribute aggregations. +_O_: While KV-stores cannot store relational data, some KV-focused +databases provide relational-like "tagging" and other attribute +aggregations. ## Selection Criteria ### PostgreSQL -Applications SHOULD use PostgreSQL (Aurora in Cloud environments and the latest stable release in OnPrem environments.) 
+Applications SHOULD use PostgreSQL (Aurora in Cloud environments and the +latest stable release in OnPrem environments.) -PostgreSQL across all environments supports all storage methods including: +PostgreSQL across all environments supports all storage methods +including: - Simple [Key-Value] stores (via [hstore]) - [Document-Oriented] storage and queries (via [jsonb]) - [Large-object] storage directly within the database -If the application is running in the Cloud environment and needs a total data size over [64TB (RDS)][1] or [128TB (Aurora)][2], then it MUST use another approved option. +If the application is running in the Cloud environment and needs a total +data size over [64TB (RDS)][1] or [128TB (Aurora)][2], then it MUST use +another approved option. ### DynamoDB diff --git a/docs/git-recommendations.md b/docs/git-recommendations.md index fcc19a0..d8c5d3d 100644 --- a/docs/git-recommendations.md +++ b/docs/git-recommendations.md @@ -1,25 +1,34 @@ # Git Recommendations -The following are a collection of suggestions to best use Git as a source control system. +The following are a collection of suggestions to best use Git as a +source control system. - Commits SHOULD represent a logical unit of work. - Commit frequently - Push your commits to a branch on your own fork, if possible. - Write [descriptive commit messages]. -- Keep remote repository up-to-date by committing and pushing your work regularly. -- Keep your local copies of repositories up-to-date by regularly pulling changes. - - Frequently pull from upstream to the `main` branch, and rebase your changes on top (see the [rebase workflow]). -- Coordinate with colleagues to avoid nasty merge conflicts (if you can). +- Keep remote repository up-to-date by committing and pushing your work + regularly. +- Keep your local copies of repositories up-to-date by regularly pulling + changes. + - Frequently pull from upstream to the `main` branch, and rebase your + changes on top (see the [rebase workflow]). 
+- Coordinate with colleagues to avoid nasty merge conflicts (if you + can). - Try the git [rebase workflow]. - Use branches and merge requests. - Pick a branching strategy that works for you and your team. - Consider [GitHub Flow](https://guides.github.com/introduction/flow/index.html). + Consider [GitHub + Flow](https://guides.github.com/introduction/flow/index.html). - Create a new branch for each feature or bugfix. - The `main` branch SHOULD always contain releasable code. - Protect your `main` branch; require merge requests to make changes. - Configure a group of default reviewers for your pull requests. - - For teams of 4 or more, require a minimum of 2 approvers for all merge requests. -- Avoid force operations (`-f` or `--force` option), especially on `main` branch, as this is an indication you are probably doing something wrong. + - For teams of 4 or more, require a minimum of 2 approvers for all + merge requests. +- Avoid force operations (`-f` or `--force` option), especially on + `main` branch, as this is an indication you are probably doing + something wrong. - Keep your repository neat: always delete merged branches. [descriptive commit messages]: https://cbea.ms/git-commit/#seven-rules diff --git a/docs/languages/golang.md b/docs/languages/golang.md index 6f1cf2f..1a054ec 100644 --- a/docs/languages/golang.md +++ b/docs/languages/golang.md @@ -2,16 +2,20 @@ ## Definitions -- A `Program` is a program, service, or application which is NOT a library. +- A `Program` is a program, service, or application which is NOT a + library. - A `Library` is code designed only for consumption by other programs. ## Requirements -- Builds MUST use the [Go-provided compiler][1]. - Builds MUST NOT use the [gcc-go][2] compiler or other alternatives. -- Programs MUST update and commit the `go.mod` and `go.sum` using `go mod tidy`. -- Programs MUST [vendor dependency code][4] and commit the vendored code to their repository. 
-- CI builds MUST use the [golangci-lint][5] linter as a first-stage validation step +- Builds MUST use the [Go-provided compiler][1]. Builds MUST NOT use + the [gcc-go][2] compiler or other alternatives. +- Programs MUST update and commit the `go.mod` and `go.sum` using `go + mod tidy`. +- Programs MUST [vendor dependency code][4] and commit the vendored code + to their repository. +- CI builds MUST use the [golangci-lint][5] linter as a first-stage + validation step - Programs SHOULD use the [standard project layout][3] - Programs MUST NOT use CGO unless there is no pure-Go alternative. Appropriate uses of CGO include Oracle DB drivers, GPGPU computation. @@ -27,16 +31,19 @@ ### Local Environment -- Run `go build`, `golangci-lint run`, and `go test` before pushing code for your PR. -- Fork the repository, then commit and push changes to the fork frequently. - This avoids catastrophic data loss and enables Work In Progress (WIP) sharing. +- Run `go build`, `golangci-lint run`, and `go test` before pushing code + for your PR. +- Fork the repository, then commit and push changes to the fork + frequently. This avoids catastrophic data loss and enables Work In + Progress (WIP) sharing. ### Go language - Errors MUST be handled -- Programs and Libaries SHOULD NOT use third-party libraries. - Prefer standard library packages. -- Use [gofumports](https://github.com/mvdan/gofumpt) for formatting and automatic imports +- Programs and Libaries SHOULD NOT use third-party libraries. Prefer + standard library packages. +- Use [gofumports](https://github.com/mvdan/gofumpt) for formatting and + automatic imports - Test functions MUST check both the error result and the returned data. [RFC2119]:https://www.rfc-editor.org/rfc/rfc2119.txt diff --git a/docs/languages/java.md b/docs/languages/java.md index b0119f6..d2cca4b 100644 --- a/docs/languages/java.md +++ b/docs/languages/java.md @@ -7,7 +7,8 @@ TBD ## Guidelines - Follow the [Google Style Guide][1] for formatting. 
- - Use the automatic formatting rules for IntelliJ IDEA and Eclipse [available here][2]. + - Use the automatic formatting rules for IntelliJ IDEA and Eclipse + [available here][2]. [1]: https://google.github.io/styleguide/javaguide.html [2]: https://raw.githubusercontent.com/google/styleguide/gh-pages/intellij-java-google-style.xml diff --git a/docs/languages/nodejs.md b/docs/languages/nodejs.md index 9eaede6..56507f6 100644 --- a/docs/languages/nodejs.md +++ b/docs/languages/nodejs.md @@ -2,12 +2,15 @@ ## Requirements -- Applications MUST target the latest LTS release of NodeJS (currently v16). +- Applications MUST target the latest LTS release of NodeJS (currently + v16). - Applicatons MUST use ESLint per the _Linting_ guidelines below. ### Linting -[ESLint](https://eslint.org/) is a linter used with Javascript to detect and enforce code and style guidelines. -There are several shared, public configs that provide base rules. -The config SHOULD extend the [standardjs](https://standardjs.com/) ESLint configuration. -This is different from the general language -standard to use [Google coding standards](https://github.com/google/eslint-config-google). + +[ESLint](https://eslint.org/) is a linter used with Javascript to detect +and enforce code and style guidelines. There are several shared, public +configs that provide base rules. The config SHOULD extend the +[standardjs](https://standardjs.com/) ESLint configuration. This is +different from the general language standard to use [Google coding +standards](https://github.com/google/eslint-config-google). diff --git a/docs/monitoring.md b/docs/monitoring.md index d868672..56e5008 100644 --- a/docs/monitoring.md +++ b/docs/monitoring.md @@ -1,78 +1,111 @@ -# Monitoring +# Monitoring (or Observability-v1) -This standard presents guidelines for providing operational montioring and alerting of applications and platforms for internal teams. 
-This standard does not provide guidelines for alerting external Customers or Vendors. +This standard presents guidelines for providing operational montioring +and alerting of applications and platforms for internal teams. This +standard does not provide guidelines for alerting external Customers or +Vendors. -Terminology will be introduced throughout this document. -Much of the terminology is drawn from [the Google SRE Book][2]. -If you are unfamiliar with monitoring and alerting, especially for distributed systems, you SHOULD read [this section][1] of the SRE book. +Terminology will be introduced throughout this document. Much of the +terminology is drawn from [the Google SRE Book][2]. If you are +unfamiliar with monitoring and alerting, especially for distributed +systems, you SHOULD read [this section][1] of the SRE book. ## Requirements All Applications: - MUST instrument their primary code paths with APM/tracing. -- MUST instrument and [alert][7] on the four primary signals: Latency, Traffic, Errors, and Saturation. -- MUST NOT add unique values to time-series metrics tagging (e.g., request IDs, UUIDs, other high-cardinality values). +- MUST instrument and [alert][7] on the four primary signals: Latency, + Traffic, Errors, and Saturation. +- MUST NOT add unique values to time-series metrics tagging (e.g., + request IDs, UUIDs, other high-cardinality values). - SHOULD NOT use logs to track data captured by metrics or APM. -- SHOULD consolidate logging statements to a single line (e.g., report stacktraces on a single line, multi-line payloads rendered in single-line format, etc). +- SHOULD consolidate logging statements to a single line (e.g., report + stacktraces on a single line, multi-line payloads rendered in + single-line format, etc). ## Observability Concepts -Observability is about understanding how an application performs at runtime, with real-world use cases and data. 
+Observability is about understanding how an application performs at +runtime, with real-world use cases and data. -There are two ways to gather observability information: instrumenting code which emits data to an aggregator (_Push_), or instrumenting code which exposes the information for an external system to query (_Pull_ / _Poll_). +There are two ways to gather observability information: instrumenting +code which emits data to an aggregator (_Push_), or instrumenting code +which exposes the information for an external system to query (_Pull_ / +_Poll_). -Each method has its advantages, but _Push_ allows observability mechanisms to be incorporated within the application itself. -Using Push mechanisms significantly [reduce the overall complexity][3] of the system. +Each method has its advantages, but _Push_ allows observability +mechanisms to be incorporated within the application itself. Using Push +mechanisms significantly [reduce the overall complexity][3] of the +system. ### Definitions -Certain words have different definitions from their common use in conversation. -For the purposes of monitoring and alerting, the following definitions apply: - -- **Observe**: to understand how an application behaves, with real world use cases. -- **Monitoring**: the act of collecting information used to _Observe_ an application. -- **Event**: a record of something which happened, produced by a Monitor. - - A monitoring _event_ is not the same as events used in other systems such as Databases, Cloud Providers, Apache Kafka, etc. - - Events are records, not signals, and MUST represent something that actually happened. - Such occurrences represent decisions made by the Development, SRE, and - Security teams to highlight _meaningful_ occurrences. +Certain words have different definitions from their common use in +conversation. 
For the purposes of monitoring and alerting, the +following definitions apply: + +- **Observe**: to understand how an application behaves, with real world + use cases. +- **Monitoring**: the act of collecting information used to _Observe_ an + application. +- **Event**: a record of something which happened, produced by a + Monitor. + - A monitoring _event_ is not the same as events used in other systems + such as Databases, Cloud Providers, Apache Kafka, etc. + - Events are records, not signals, and MUST represent something that + actually happened. Such occurrences represent decisions made by the + Development, SRE, and Security teams to highlight _meaningful_ + occurrences. ## Types of Monitoring Also known as the _Pillars of Observability_. -Monitoring follows three broad categories: _Time Series_ metrics, _Application Performance Monitoring_ (APM) or _Tracing_, and _Logging_. -Each of the categories has its advantages and disadvantages, but APM tends to have the best results across all three categories, and often costs the least for a volume of data or events. +Monitoring follows three broad categories: _Time Series_ metrics, +_Application Performance Monitoring_ (APM) or _Tracing_, and _Logging_. +Each of the categories has its advantages and disadvantages, but APM +tends to have the best results across all three categories, and often +costs the least for a volume of data or events. ### Time Series -Time Series metrics are points of data, sometimes aggregated, which MAY answer very simple questions like: +Time Series metrics are points of data, sometimes aggregated, which MAY +answer very simple questions like: - Is the application running? -- How many requests / transactions per [time value] is the application processing? -- Does the application need to scale (horizontally) or deploy with more resources? +- How many requests / transactions per [time value] is the application + processing? 
+- Does the application need to scale (horizontally) or deploy with more + resources? -Time Series metrics are good at observing trends, handling large volumes of data with minimal infrastructure, and for tracking (normally) slow-changing statistics such as infrastructure and node details. +Time Series metrics are good at observing trends, handling large volumes +of data with minimal infrastructure, and for tracking (normally) +slow-changing statistics such as infrastructure and node details. -Most Time Series implementations cannot associate data to events beyond key-value tagging. -Such systems are often insufficient to track intermittent issues or view details regarding particular errors. +Most Time Series implementations cannot associate data to events beyond +key-value tagging. Such systems are often insufficient to track +intermittent issues or view details regarding particular errors. ### Application Performance Monitoring (APM) -APM data are instrumentation which can operate across code functions, external requests, and across applications. -APM provides [distributed tracing][4], which can be enhanced on platforms like [NewRelic][5]. -APM can answer all the questions addressed by the Time-Series solution, and additionally: +APM data are instrumentation which can operate across code functions, +external requests, and across applications. APM provides [distributed +tracing][4], which can be enhanced on platforms like [NewRelic][5]. APM +can answer all the questions addressed by the Time-Series solution, and +additionally: - Am I operating within my SLO/SLA for clients? - How long does a specific endpoint take to respond? - How frequently is a specific function being called? - What functions are taking the most time during a request / operation? -- What were the exact contents of request and response data during a long or erroneous operation? -- What [additional attributes][6] were present during an erroneous operation? 
+- What were the exact contents of request and response data during a + long or erroneous operation? +- What [additional attributes][6] were present during an erroneous + operation? -APM SHOULD be used to track key indicators (KPIs) and other SLO-related values such as: +APM SHOULD be used to track key indicators (KPIs) and other SLO-related +values such as: - Errors and Error Rate. - Traffic / Throughput rates. @@ -81,27 +114,37 @@ APM SHOULD be used to track key indicators (KPIs) and other SLO-related values s ### Logging -Logging is an inherently flexible and robust method of generating event-related information. -Logging metrics capture event data from a specific point in time during application execution and write to a file (or stream, for collection). -Logging metrics MAY answer questions like: +Logging is an inherently flexible and robust method of generating +event-related information. Logging metrics capture event data from a +specific point in time during application execution and write to a file +(or stream, for collection). Logging metrics MAY answer questions like: - What calculated values were generated by a specific function? -- What input or payload information was provided for a specific transaction? - -Logs are often necessary for auditing purposes, which have security, legal, or regulatory requirements. -Such requirements supercede declarations here. - -- Logging SHOULD be used to capture only events which cannot be captured by APM or Metrics. - - Suggested events include process startup messages, signal-received hooks, shutdown messages, and reporting failures (such as failure to submit APM/Metrics). - - Logging MAY _temporarily_ be used to track events which are captured by APM or Metrics during incidents or when debugging. +- What input or payload information was provided for a specific + transaction? + +Logs are often necessary for auditing purposes, which have security, +legal, or regulatory requirements. 
Such requirements supercede +declarations here. + +- Logging SHOULD be used to capture only events which cannot be captured + by APM or Metrics. + - Suggested events include process startup messages, signal-received + hooks, shutdown messages, and reporting failures (such as failure to + submit APM/Metrics). + - Logging MAY _temporarily_ be used to track events which are captured + by APM or Metrics during incidents or when debugging. - Logging SHOULD be in a structured format such as JSON. -- Logging MUST NOT include sensitive data without masking/eliding said data. +- Logging MUST NOT include sensitive data without masking/eliding said + data. ## Top Metrics Also known as [the Golden Signals][1]. -Applications MUST have at least one dashboard which displays the following metrics for all services or components, as well as metrics for invoked dependencies. +Applications MUST have at least one dashboard which displays the +following metrics for all services or components, as well as metrics for +invoked dependencies. ### Latency diff --git a/docs/observability.md b/docs/observability.md new file mode 100644 index 0000000..4efa297 --- /dev/null +++ b/docs/observability.md @@ -0,0 +1,30 @@ +# Observability + +"Observability v2" is considered an evolution of the [Observability-v1 +(or Monitoring)][1] paradigm. The goal of Observability (v1 or v2) is to +provide understanding of how an application operates at runtime. + +As a common shorthand, + +### Definitions + +_Copied from the O + +Certain words have different definitions from their common use in +conversation. For the purposes of monitoring and alerting, the +following definitions apply: + +- **Observe**: to understand how an application behaves, with real world + use cases. +- **Monitoring**: the act of collecting information used to _Observe_ an + application. +- **Event**: a record of something which happened, produced by a + Monitor. 
+ - A monitoring _event_ is not the same as events used in other systems + such as Databases, Cloud Providers, Apache Kafka, etc. + - Events are records, not signals, and MUST represent something that + actually happened. Such occurrences represent decisions made by the + Development, SRE, and Security teams to highlight _meaningful_ + occurrences. + +[1]: monitoring.md diff --git a/docs/tiers.md b/docs/tiers.md index 60cc7b3..14af482 100644 --- a/docs/tiers.md +++ b/docs/tiers.md @@ -2,47 +2,69 @@ ## Definition -Platforms and services can have different expectations depending on the technologies used, its support systems, and customer-impact. -This document defines those expectations into four "tiers" from the most-critical (Tier 1) to the least-critical (Tier 4). +Platforms and services can have different expectations depending on the +technologies used, its support systems, and customer-impact. This +document defines those expectations into four "tiers" from the +most-critical (Tier 1) to the least-critical (Tier 4). ### Base Requirements -- Teams MUST plan for both course-of-business failures and disaster-level events. -- Teams MUST assign a tier number for each application (service, platform, or system) they support. -- Applications MUST meet the availability and resilience targets of their tier. +- Teams MUST plan for both course-of-business failures and + disaster-level events. +- Teams MUST assign a tier number for each application (service, + platform, or system) they support. +- Applications MUST meet the availability and resilience targets of + their tier. ### Tier 1 -Tier 1 applications are **core/critical systems upon which all else is built**. -Examples include Active Directory, Kubernetes clusters (Dev, Integration, or Production), and Datacenter Firewalls. +Tier 1 applications are **core/critical systems upon which all else is +built**. 
Examples include Active Directory, Kubernetes clusters (Dev, +Integration, or Production), and Datacenter Firewalls. -Tier 1 applications MUST provide at least **[%99.95 availability](https://uptime.is/99.95)** or less than four hours of downtime n-total per year. +Tier 1 applications MUST provide at least **[%99.95 +availability](https://uptime.is/99.95)** or less than four hours of +downtime n-total per year. ### Tier 2 -Tier 2 applications are **critical and/or time-sensitive**. -Such applications could include a customer-facing billing system, an IAM gateway, or a central code-management platform (Github). +Tier 2 applications are **critical and/or time-sensitive**. Such +applications could include a customer-facing billing system, an IAM +gateway, or a central code-management platform (Github). -Tier 2 applications MUST provide at least **[%99.9 availability](https://uptime.is/99.9)** or less than nine hours of downtime in-total per year. +Tier 2 applications MUST provide at least **[%99.9 +availability](https://uptime.is/99.9)** or less than nine hours of +downtime in-total per year. ### Tier 3 -Tier 3 applications are **important and not time-sensitive**. -These include systems for end-of-month billing, internal (non-customer-impacting) metrics, +Tier 3 applications are **important and not time-sensitive**. These +include systems for end-of-month billing, internal +(non-customer-impacting) metrics, -Tier 3 applications MUST provide at least **[%99 availability](https://uptime.is/99)** or less than four days of downtime in-total per year. +Tier 3 applications MUST provide at least **[%99 +availability](https://uptime.is/99)** or less than four days of downtime +in-total per year. ### Tier 4 -Tier 4 applications have **low impact when delayed**. -Tier 4 applications include everything which is not assigned to other tiers. +Tier 4 applications have **low impact when delayed**. Tier 4 +applications include everything which is not assigned to other tiers. 
-Tier 4 applications MUST provide at least **[%97 availability](https://uptime.is/97)** or less than ten days of downtime in-total per year. +Tier 4 applications MUST provide at least **[%97 +availability](https://uptime.is/97)** or less than ten days of downtime +in-total per year. ## Availability Calculations -Availability is calculated based on whether a given service is responding *correctly* and is *communicating*. +Availability is calculated based on whether a given service is +responding *correctly* and is *communicating*. + +- If a service endpoint is not reachable, but is otherwise returning + positive metrics (or no errors), the system is not Available. +- If a service endpoint can be connected, but only returns error + messages, the system is not Available. +- If a service has a combination of unreachability and erroneous + responses for a duration which exceeds the Availability limit, that + service has broken its SLA guarantee.i -- If a service endpoint is not reachable, but is otherwise returning positive metrics (or no errors), the system is not Available. -- If a service endpoint can be connected, but only returns error messages, the system is not Available. -- If a service has a combination of unreachability and erroneous responses for a duration which exceeds the Availability limit, that service has broken its SLA guarantee.i |
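
Review note on `docs/tiers.md`: each tier pairs an availability percentage with a yearly downtime limit stated in prose. The arithmetic behind those limits can be checked with a minimal sketch like the one below. It assumes a 365-day year, and the names `HOURS_PER_YEAR`, `TIER_TARGETS`, and `downtime_budget_hours` are hypothetical helpers introduced only for illustration; they are not part of the Standards.

```python
# Sketch only: converts the tier availability targets from docs/tiers.md
# into yearly downtime budgets. All names here are illustrative assumptions.

HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year (8760 hours)

# Availability targets (percent) as stated for Tier 1 through Tier 4.
TIER_TARGETS = {1: 99.95, 2: 99.9, 3: 99.0, 4: 97.0}


def downtime_budget_hours(availability_percent: float) -> float:
    """Return the hours per year a service may be unavailable at this target."""
    return HOURS_PER_YEAR * (100.0 - availability_percent) / 100.0


if __name__ == "__main__":
    for tier, target in TIER_TARGETS.items():
        hours = downtime_budget_hours(target)
        print(f"Tier {tier}: {target}% allows {hours:.2f} hours of downtime per year")
```

Under these assumptions the percentages allow roughly 4.38, 8.76, 87.6, and 262.8 hours per year. The prose limits for Tier 1 ("less than four hours") and Tier 4 ("less than ten days") are therefore somewhat stricter than the raw percentages, which may be worth reconciling in a follow-up change.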