author | Tyler Davis <tydavis@gmail.com> | 2021-12-14 00:39:36 +0000 |
---|---|---|
committer | Tyler Davis <tydavis@gmail.com> | 2022-01-05 16:25:43 +0000 |
commit | 04429378bbf9fb1d2858041473a91e402e181b49 (patch) | |
tree | 68245fcc0908420aab4ed3fa1b278468bbfa3e62 /docs | |
parent | afbfac45136f0b5117f57cf8a61f8772167bd672 (diff) | |
First draft of standards, sole contributor
Adds:
- Owners concept
- Monitoring, Alerting
- Stubs for APIs, Secrets...
Updates:
- Contributing guide
- Guidance for languages, deployments, core rules...
Diffstat (limited to 'docs')
-rw-r--r-- | docs/alerting.md | 21 | ||||
-rw-r--r-- | docs/apis.md | 1 | ||||
-rw-r--r-- | docs/appid.md | 76 | ||||
-rw-r--r-- | docs/continuous-deployment.md | 0 | ||||
-rw-r--r-- | docs/core.md | 28 | ||||
-rw-r--r-- | docs/datastores.md | 89 | ||||
-rw-r--r-- | docs/deployment.md | 41 | ||||
-rw-r--r-- | docs/git-recommendations.md | 26 | ||||
-rw-r--r-- | docs/language.md | 0 | ||||
-rw-r--r-- | docs/languages/golang.md | 47 | ||||
-rw-r--r-- | docs/languages/java.md | 13 | ||||
-rw-r--r-- | docs/languages/nodejs.md | 13 | ||||
-rw-r--r-- | docs/languages/python.md | 11 | ||||
-rw-r--r-- | docs/languages/rust.md | 11 | ||||
-rw-r--r-- | docs/monitoring.md | 135 | ||||
-rw-r--r-- | docs/rfc2119.txt | 98 | ||||
-rw-r--r-- | docs/secrets.md | 1 | ||||
-rw-r--r-- | docs/tiers.md | 48 |
18 files changed, 659 insertions, 0 deletions
diff --git a/docs/alerting.md b/docs/alerting.md new file mode 100644 index 0000000..8f8573d --- /dev/null +++ b/docs/alerting.md @@ -0,0 +1,21 @@ +# Alerting + +Alerts are signals from [Monitors][1] to perform actions. +Alerts MUST be _meaningful_ and _actionable_. + +**Meaningful**: Alert only on monitors which indicate a problem. +(See [Monitoring][1] subsection "Saturation".) + +**Actionable**: Alerts MUST always include a corresponding action to resolve or investigate the underlying condition. + +## Requirements + +- Alerts MUST be resolved by a human being. +- Alerts MUST begin with [PagerDuty][2]. + - Alert notifications MAY forward to Slack. + - Alert notifications MAY forward via Email. +- Alerts MUST NOT deduplicate automatically. + A human MAY aggregate alerts if unknown dependencies provide alerts during an incident. + +[1]: monitoring.md +[2]: https://www.pagerduty.com/ diff --git a/docs/apis.md b/docs/apis.md new file mode 100644 index 0000000..59b8f89 --- /dev/null +++ b/docs/apis.md @@ -0,0 +1 @@ +# APIs diff --git a/docs/appid.md b/docs/appid.md new file mode 100644 index 0000000..719059c --- /dev/null +++ b/docs/appid.md @@ -0,0 +1,76 @@ +# Application ID + +## Definition + +An Application ID (APPID) is a unique identifier for an application. +Identifiers are assigned automatically and incrementally when a request for a new APPID is made. 
+ +**Note: All Platforms and Services MUST have an assigned Application ID.** + +## Required Fields + +An APPID is represented by the following fields (with examples): + +- ID (_APP123456_) (Must be globally `Unique`) +- Engineering Name (_Custom Service or Platform Name_) +- Business Name (_Business Product Name, if different_) +- [Tier](tiers.md) (_Tier 1_) +- Slack Channel (_#support-team_) +- Supporting Team (_support@company-team_) +- Supporting Manager (_person@company.com_) +- Director (_director@company.com_) +- VP (_vp@company.com_) +- Architect (_arch@company.com_) +- Product Owner (_po@company.com_) +- Location (_SaaS_ or _OnPrem_) +- Ancestor (_APP456789_) + +### Operational Fields + +APPID supporting fields which provide operational details for the Application: + +- Descendants: *Calculated field* from all Ancestor field references. +- Dependencies: Other APPIDs on which the application depends for core functionality. + (This can generate a lightweight "service map" for incident response.) +- Code Source: Source code URL for the single application source or a root folder containing the other sources. +- PagerDuty Service(s): One (or more?) PagerDuty Escalation Policy URLs and/or Service Alert URLs. +- Kubernetes Namespace: If using Kubernetes, the namespace in which the application is deployed. +- Deployment Stage: Dev / Integration / Production +- Active: Declares an APPID as `Active` or `Retired`. + If a system is not deliberately retired and "turned off," it cannot be flagged as `Retired`. + +## Definitions + +An application can have several definitions. 
+Systems which are granted an APPID MAY be: + +- Customer-facing software +- Internal-to-company software +- Software providing non-interactive interfaces (such as an API or web service) +- Software providing platform or infrastructure services +- Software built and produced by company, acquired from a third-party or Vendor, or a Software-as-a-Service (SaaS) platform + +**Supplemental Rule:** Software which is not supported by the company MUST NOT be an application. + +### Boundaries + +Consider the following as to whether or not something needs an APPID: + +- Responsibility for the application MUST belong to a single team. + If a service or platform (Application) is maintained by more than one team, the system MUST be divided into multiple APPIDs. (See _Ancestor/Descendant Relationships_ defined below.) +- SLAs MUST be the same across a single APPID. + If an application has multiple endpoints with differing SLAs, those endpoints MUST be separated into differing services and APPIDs (except for vendor-provided products). +- A collection of microservices which accomplish one business objective MAY share a single APPID. +- The service or platform MUST have a unified codebase. + If different code repositories are used for a different environment (such as On-Premises versus SaaS offerings), a new APPID MUST be created for the divergent codebase. +- The physical or virtual hardware on which a service or platform runs MUST NOT be an APPID. + +### Ancestor/Descendant Relationships + +Some systems are best represented as a single Ancestor APPID with multiple Descendant APPIDs associated. +Examples include microservices which operate independently to provide a single service interface, or services which provide key functionality to a platform but are of a less-reliable tier. 
+ +- A Descendant APPID MUST only have one Ancestor APPID +- A Descendant APPID MUST NOT be an Ancestor to another APPID +- An Ancestor APPID MUST be of a tier lower than or equal to the highest tier +- A Descendant APPID MUST NOT be of a [more-critical tier](tiers.md) than its Ancestor APPID diff --git a/docs/continuous-deployment.md b/docs/continuous-deployment.md deleted file mode 100644 index e69de29..0000000 --- a/docs/continuous-deployment.md +++ /dev/null diff --git a/docs/core.md b/docs/core.md new file mode 100644 index 0000000..d73b122 --- /dev/null +++ b/docs/core.md @@ -0,0 +1,28 @@ +# Core Rules and Guidelines for Standards + +The following contains both requirements and guidelines for producing Standards documents. + +## Terms + +- Standards review group (*SRG*): assigned group which reviews and approves changes to the Standards. + +## Requirements + +- The SRG MUST review outstanding change requests (PRs) every two weeks (14 days). +- PRs MUST remain open for comment no less than 5 business days (or one week). + - PRs MAY remain open for more than 2 weeks if the discussion is active and ongoing. +- The SRG MUST publish Standards no more frequently than once per week, nor less frequently than once every three months. + +Documents MUST: + +- Follow [RFC-2119](rfc2119.txt). +- Be as specific as possible (e.g., no colloquial language). +- Be concise and use [active voice](https://writing.wisc.edu/handbook/style/ccs_activevoice/). +- Capitalize all words in headings. +- Use American English spellings and conventions. + +## Guidelines + +All Standards documents: + +- SHOULD follow the [Google Style Guide](https://developers.google.com/style/lists#capitalization-and-end-punctuation) for punctuation and capitalization. 
diff --git a/docs/datastores.md b/docs/datastores.md index e69de29..ef22c72 100644 --- a/docs/datastores.md +++ b/docs/datastores.md @@ -0,0 +1,89 @@ +# Data Stores + +## Scope + +This standard prescribes database and data storage technologies used to solve many related data-retention concerns. +The solutions recommended below are designed to encourage deep expertise in a few stable and well-understood systems, rather than maximal "fit" for each distinct use case. + +As such, the solutions may not be the most optimal but their performance, maintenance, optimizations, and reliability requirements are understood and supported by the engineering community. + +## Terms + +- _Database_: provides long-term, durable storage for data whose loss or unavailability would mean violating an application's Availability or Business requirements. +- _Cache_: provides short-term, volatile storage which does not preserve data. + Caches are not in the scope of this standard. + +## Capability Matrix + +| Capabilities | RDBMS | KV Store | File/Object | +|--------------|-------|----------|-------------| +| [Relational] | X | O | | +| [Key-Value] | X | X | X | +| [Document-Oriented] | X | X | X | +| [Object-Based] | X | X | X | + +_O_: While KV-stores cannot store relational data, some KV-focused databases provide relational-like "tagging" and other attribute aggregations. + +## Selection Criteria + +### PostgreSQL + +Applications SHOULD use PostgreSQL (Aurora in Cloud environments and the latest stable release in OnPrem environments.) + +PostgreSQL across all environments supports all storage methods including: + +- Simple [Key-Value] stores (via [hstore]) +- [Document-Oriented] storage and queries (via [jsonb]) +- [Large-object] storage directly within the database + +If the application is running in the Cloud environment and needs a total data size over [64TB (RDS)][1] or [128TB (Aurora)][2], then it MUST use another approved option. 
+ +### DynamoDB + +If an Application is hosted in AWS and requires many of: + +- Flexible, schemaless data model that will change in the future +- Read-heavy access model for items +- Single-millisecond response times +- Very fast caching for hot keys and values (< 10ms) +- Multi-region deployments + +and does not require any of: + +- Strongly consistent reads and writes in all situations +- Item/Row sizes over 400 KB +- A maximum data size of over 1TB +- Joins between different stored values + +then it MAY use DynamoDB. + +### S3 / File store + +If the Applications persists data with many of the following: + +- Large files (>100MB each) +- Large volumes of data (>1 million records) +- Simple identification requirements (e.g., "filename as key") +- Simple relational requirements (e.g., folders and files) + +and does not require any of the following: + +- Low-latency access (< 200ms) +- Relational join operations +- Low response times (< 1000ms) + +then it MAY use a filestore or S3. + +[Aurora (PostgreSQL)]: https://aws.amazon.com/rds/aurora/postgresql-features/ +[RDS (PostgreSQL)]: https://aws.amazon.com/rds/postgresql/ +[DynamoDB]: https://aws.amazon.com/dynamodb/ +[S3]: https://aws.amazon.com/s3/ +[Relational]:https://en.wikipedia.org/wiki/Relational_database +[Key-Value]: https://en.wikipedia.org/wiki/Key-value_database +[Document-Oriented]: https://en.wikipedia.org/wiki/Document-oriented_database +[Object-Based]: https://en.wikipedia.org/wiki/Object_storage#Cloud_storage +[hstore]: https://www.postgresql.org/docs/13/hstore.html +[jsonb]: https://www.postgresql.org/docs/current/datatype-json.html +[Large-object]: https://www.postgresql.org/docs/13/largeobjects.html +[1]: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_Storage.html +[2]: https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/CHAP_Limits.html diff --git a/docs/deployment.md b/docs/deployment.md new file mode 100644 index 0000000..37cef3b --- /dev/null +++ b/docs/deployment.md @@ 
-0,0 +1,41 @@ +# Deployment + +## Service Endpoint Requirements + +Services with HTTP endpoints MUST expose the following endpoints: + +- `/info` - provides version information, git hash, etc. +- `/ping` - implements the equivalent of a [Kubernetes liveness probe][1], returning HTTP status header [200 OK][2] if in a running state, or an HTTP status header [500 Error][3] if not. +- `/ready` - implements a Kubernetes readiness probe, returning a [200-OK][2] header if prepared to accept traffic, and a [500-Error][3] header if not. + +In addition: + +- Liveness and Readiness endpoints MUST be different endpoints with different implementations. + +## Rollout Mechanisms + +Types of rollout mechanisms include: + +- ["In-place"][6]: stops the application or server, updates the application out-of-band, then starts the application or server. + This mechanism always causes downtime for clients as the service is not available to service requests during the update. +- ["Blue/Green"][4]: creates a full replica of the existing production environment, running the newer application. + The load-balancer or other routing system points clients at the new replica (once instantiated), then removes the old systems. + This mechanism uses at least twice the total resources required to run an environment for the duration of the transition. +- ["Rolling Update"][5]: requires a multi-instance deployment. + Removes one node, then spawns at least two nodes in its place. + If the new nodes pass deployment checks, another old node is stopped, while a new node takes its place until the entire original deployment is replaced. + - ["Canary"][7]: a "permanently halted" type of rolling-update. + Canary deployments exist alongside the older deployment, handling a fraction of the total request volume. + When the canary is verified "working correctly," it is turned into a rolling update, and then the next version is deployed as a canary. 
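The endpoint requirements in `deployment.md` can be illustrated with a minimal sketch using only the Python standard library. The `ServiceState` fields, the `handle` function, and the JSON bodies are hypothetical names chosen for illustration; the standard itself only fixes the three paths and the 200/500 status codes. Liveness and readiness are deliberately computed differently, as the standard requires.

```python
import json
from dataclasses import dataclass
from http.server import BaseHTTPRequestHandler


@dataclass
class ServiceState:
    """Hypothetical in-process state; a real service would derive these values."""
    version: str = "0.0.0-example"
    git_hash: str = "unknown"
    running: bool = True              # process is up and its main loop is healthy
    dependencies_ready: bool = False  # e.g. database pool established (assumption)


def handle(path, state):
    """Return (status_code, body) for the three required endpoints."""
    if path == "/info":
        return 200, {"version": state.version, "git_hash": state.git_hash}
    if path == "/ping":
        # Liveness: only answers "is the process in a running state?"
        if state.running:
            return 200, {"status": "alive"}
        return 500, {"status": "not running"}
    if path == "/ready":
        # Readiness: a separate check, "can the service accept traffic right now?"
        if state.running and state.dependencies_ready:
            return 200, {"status": "ready"}
        return 500, {"status": "not ready"}
    return 404, {"error": "not found"}


STATE = ServiceState()


class ProbeHandler(BaseHTTPRequestHandler):
    """Thin HTTP wrapper around handle(); serves the module-level STATE."""

    def do_GET(self):
        status, body = handle(self.path, STATE)
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)
```

To try it locally, pass `ProbeHandler` to `http.server.HTTPServer(("127.0.0.1", 8080), ProbeHandler)` and call `serve_forever()`; the address and port are arbitrary. Keeping the routing in a plain function makes the liveness/readiness distinction testable without opening a socket.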
+ +## Deployment Requirements + +- Services with more than one instance MUST use a rolling-update unless otherwise approved by the Architecture Review group. + +[1]: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/ +[2]: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/200 +[3]: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/500 +[4]: https://docs.aws.amazon.com/whitepapers/latest/overview-deployment-options/bluegreen-deployments.html +[5]: https://docs.aws.amazon.com/whitepapers/latest/overview-deployment-options/rolling-deployments.html +[6]: https://docs.aws.amazon.com/whitepapers/latest/overview-deployment-options/in-place-deployments.html +[7]: https://wa.aws.amazon.com/wellarchitected/2020-07-02T19-33-23/wat.concept.canary-deployment.en.html diff --git a/docs/git-recommendations.md b/docs/git-recommendations.md new file mode 100644 index 0000000..fcc19a0 --- /dev/null +++ b/docs/git-recommendations.md @@ -0,0 +1,26 @@ +# Git Recommendations + +The following are a collection of suggestions to best use Git as a source control system. + +- Commits SHOULD represent a logical unit of work. +- Commit frequently + - Push your commits to a branch on your own fork, if possible. +- Write [descriptive commit messages]. +- Keep remote repository up-to-date by committing and pushing your work regularly. +- Keep your local copies of repositories up-to-date by regularly pulling changes. + - Frequently pull from upstream to the `main` branch, and rebase your changes on top (see the [rebase workflow]). +- Coordinate with colleagues to avoid nasty merge conflicts (if you can). + - Try the git [rebase workflow]. +- Use branches and merge requests. + - Pick a branching strategy that works for you and your team. + Consider [GitHub Flow](https://guides.github.com/introduction/flow/index.html). + - Create a new branch for each feature or bugfix. + - The `main` branch SHOULD always contain releasable code. 
+ - Protect your `main` branch; require merge requests to make changes. + - Configure a group of default reviewers for your pull requests. + - For teams of 4 or more, require a minimum of 2 approvers for all merge requests. +- Avoid force operations (`-f` or `--force` option), especially on the `main` branch, as this is an indication you are probably doing something wrong. +- Keep your repository neat: always delete merged branches. + +[descriptive commit messages]: https://cbea.ms/git-commit/#seven-rules +[rebase workflow]: https://git-rebase.io diff --git a/docs/language.md b/docs/language.md deleted file mode 100644 index e69de29..0000000 --- a/docs/language.md +++ /dev/null diff --git a/docs/languages/golang.md b/docs/languages/golang.md index e69de29..6f1cf2f 100644 --- a/docs/languages/golang.md +++ b/docs/languages/golang.md @@ -0,0 +1,47 @@ +# Go Language Standards + +## Definitions + +- A `Program` is a program, service, or application which is NOT a library. +- A `Library` is code designed only for consumption by other programs. + +## Requirements + +- Builds MUST use the [Go-provided compiler][1]. + Builds MUST NOT use the [gcc-go][2] compiler or other alternatives. +- Programs MUST update and commit the `go.mod` and `go.sum` using `go mod tidy`. +- Programs MUST [vendor dependency code][4] and commit the vendored code to their repository. +- CI builds MUST use the [golangci-lint][5] linter as a first-stage validation step +- Programs SHOULD use the [standard project layout][3] +- Programs MUST NOT use CGO unless there is no pure-Go alternative. + Appropriate uses of CGO include Oracle DB drivers, GPGPU computation. + (Prefer `CGO_ENABLED=0` as a build flag.) 
+ +## Recommended Resources + +- [Effective Go](https://golang.org/doc/effective_go.html) +- [Go Proverbs](https://go-proverbs.github.io/) +- [Go Test package](https://pkg.go.dev/testing) + +## Supplemental Advice + +### Local Environment + +- Run `go build`, `golangci-lint run`, and `go test` before pushing code for your PR. +- Fork the repository, then commit and push changes to the fork frequently. + This avoids catastrophic data loss and enables Work In Progress (WIP) sharing. + +### Go language + +- Errors MUST be handled +- Programs and Libraries SHOULD NOT use third-party libraries. + Prefer standard library packages. +- Use [gofumports](https://github.com/mvdan/gofumpt) for formatting and automatic imports +- Test functions MUST check both the error result and the returned data. + +[RFC2119]:https://www.rfc-editor.org/rfc/rfc2119.txt +[1]:https://golang.org +[2]:https://gcc.gnu.org/onlinedocs/gccgo/ +[3]:https://github.com/golang-standards/project-layout +[4]:https://golang.org/ref/mod#vendoring +[5]:https://golangci-lint.run diff --git a/docs/languages/java.md b/docs/languages/java.md index e69de29..b0119f6 100644 --- a/docs/languages/java.md +++ b/docs/languages/java.md @@ -0,0 +1,13 @@ +# Java Language Standards + +## Requirements + +TBD + +## Guidelines + +- Follow the [Google Style Guide][1] for formatting. + - Use the automatic formatting rules for IntelliJ IDEA and Eclipse [available here][2]. + +[1]: https://google.github.io/styleguide/javaguide.html +[2]: https://raw.githubusercontent.com/google/styleguide/gh-pages/intellij-java-google-style.xml diff --git a/docs/languages/nodejs.md b/docs/languages/nodejs.md index e69de29..9eaede6 100644 --- a/docs/languages/nodejs.md +++ b/docs/languages/nodejs.md @@ -0,0 +1,13 @@ +# NodeJS and JavaScript Language Standards + +## Requirements + +- Applications MUST target the latest LTS release of NodeJS (currently v16). +- Applications MUST use ESLint per the _Linting_ guidelines below. 
+ +### Linting +[ESLint](https://eslint.org/) is a linter used with JavaScript to detect and enforce code and style guidelines. +There are several shared, public configs that provide base rules. +The config SHOULD extend the [standardjs](https://standardjs.com/) ESLint configuration. +This is different from the general language +standard to use [Google coding standards](https://github.com/google/eslint-config-google). diff --git a/docs/languages/python.md b/docs/languages/python.md index e69de29..fa4edec 100644 --- a/docs/languages/python.md +++ b/docs/languages/python.md @@ -0,0 +1,11 @@ +# Python Language Standards + +## Requirements + +TBD + +## Guidelines + +- Code MUST be formatted using [PEP8][1]. + +[1]: https://www.python.org/dev/peps/pep-0008/ diff --git a/docs/languages/rust.md b/docs/languages/rust.md index e69de29..426164a 100644 --- a/docs/languages/rust.md +++ b/docs/languages/rust.md @@ -0,0 +1,11 @@ +# Rust Language Standards + +## Requirements + +- Committed code MUST be formatted using [rustfmt][1]. + +## Guidelines + +TBD + +[1]: https://github.com/rust-lang/rustfmt#running-rustfmt-from-your-editor diff --git a/docs/monitoring.md b/docs/monitoring.md new file mode 100644 index 0000000..d9a9b62 --- /dev/null +++ b/docs/monitoring.md @@ -0,0 +1,135 @@ +# Monitoring + +This standard presents guidelines for providing operational monitoring and alerting of applications and platforms for internal teams. +This standard does not provide guidelines for alerting external Customers or Vendors. + +Terminology will be introduced throughout this document. +Much of the terminology is drawn from [the Google SRE Book][2]. +If you are unfamiliar with monitoring and alerting, especially for distributed systems, you SHOULD read [this section][1] of the SRE book. + +## Requirements + +All Applications: + +- MUST instrument their primary code paths with APM/tracing. 
+- MUST instrument and [alert][7] on the four primary signals: Latency, Traffic, Errors, and Saturation. +- MUST NOT add unique values to time-series metrics tagging (e.g., request IDs, UUIDs, other high-cardinality values). +- SHOULD NOT use logs to track data captured by metrics or APM. +- SHOULD consolidate logging statements to a single line (e.g., report stacktraces on a single line, multi-line payloads rendered in single-line format, etc). + +## Observability Concepts + +Observability is about understanding how an application performs at runtime, with real-world use cases and data. + +There are two ways to gather observability information: instrumenting code which emits data to an aggregator (_Push_), or instrumenting code which exposes the information for an external system to query (_Pull_ / _Poll_). + +Each method has its advantages, but _Push_ allows observability mechanisms to be incorporated within the application itself. +Using Push mechanisms significantly [reduce the overall complexity][3] of the system. + +### Definitions + +Certain words have different definitions from their common use in conversation. +For the purposes of monitoring and alerting, the following definitions apply: + +- **Observe**: to understand how an application behaves, with real world use cases. +- **Monitoring**: the act of collecting information used to _Observe_ an application. +- **Event**: a record of something which happened, produced by a Monitor. + - A monitoring _event_ is not the same as events used in other systems such as Databases, Cloud Providers, Apache Kafka, etc. + - Events are records, not signals, and MUST represent something that actually happened. + Such occurrences represent decisions made by the + +## Types of Monitoring + +Also known as the _Pillars of Observability_. + +Monitoring follows three broad categories: _Time Series_ metrics, _Application Performance Monitoring_ (APM) or _Tracing_, and _Logging_. 
+Each of the categories has its advantages and disadvantages, but APM tends to have the best results across all three categories, and often costs the least for a volume of data or events. + +### Time Series + +Time Series metrics are points of data, sometimes aggregated, which MAY answer very simple questions like: + +- Is the application running? +- How many requests / transactions per [time value] is the application processing? +- Does the application need to scale (horizontally) or deploy with more resources? + +Time Series metrics are good at observing trends, handling large volumes of data with minimal infrastructure, and for tracking (normally) slow-changing statistics such as infrastructure and node details. + +Most Time Series implementations cannot associate data to events beyond key-value tagging. +Such systems are often insufficient to track intermittent issues or view details regarding particular errors. + +### Application Performance Monitoring (APM) + +APM data are instrumentation which can operate across code functions, external requests, and across applications. +APM provides [distributed tracing][4], which can be enhanced on platforms like [NewRelic][5]. +APM can answer all the questions addressed by the Time-Series solution, and additionally: + +- Am I operating within my SLO/SLA for clients? +- How long does a specific endpoint take to respond? +- How frequently is a specific function being called? +- What functions are taking the most time during a request / operation? +- What were the exact contents of request and response data during a long or erroneous operation? +- What [additional attributes][6] were present during an erroneous operation? + +APM SHOULD be used to track key indicators (KPIs) and other SLO-related values such as: + +- Errors and Error Rate. +- Traffic / Throughput rates. +- Latency, especially within the application. +- Interactions with dependencies, including latency and error rate. 
+ +### Logging + +Logging is an inherently flexible and robust method of generating event-related information. +Logging metrics capture event data from a specific point in time during application execution and write to a file (or stream, for collection). +Logging metrics MAY answer questions like: + +- What calculated values were generated by a specific function? +- What input or payload information was provided for a specific transaction? + +Logs are often necessary for auditing purposes, which have security, legal, or regulatory requirements. +Such requirements supersede declarations here. + +- Logging SHOULD be used to capture only events which cannot be captured by APM or Metrics. + - Suggested events include process startup messages, signal-received hooks, shutdown messages, and reporting failures (such as failure to submit APM/Metrics). + - Logging MAY _temporarily_ be used to track events which are captured by APM or Metrics during incidents or when debugging. +- Logging SHOULD be in a structured format such as JSON. +- Logging MUST NOT include sensitive data without masking/eliding said data. + +## Top Metrics + +Also known as [the Golden Signals][1]. + +Applications MUST have at least one dashboard which displays the following metrics for all services or components, as well as metrics for invoked dependencies. + +### Latency + +- APM SHOULD be used to measure latency. +- Metrics MAY be used to measure latency. +- Logging MUST NOT be used to measure latency. + +### Traffic + +- APM SHOULD be used to measure traffic. +- Metrics MAY be used to measure traffic. +- Logging MUST NOT be used to measure traffic. + +### Errors + +- APM SHOULD be used to measure errors. +- Metrics MAY be used to measure errors. +- Logging SHOULD NOT be used to measure errors. + +### Saturation + +- APM SHOULD be used to measure saturation. +- Metrics MAY be used to measure saturation. +- Logging MUST NOT be used to measure saturation. 
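The logging guidance in `monitoring.md` (structured JSON, one line per event including stack traces, sensitive values masked) can be sketched with Python's standard `logging` module. The `JsonLineFormatter` class, the `fields` attribute, and the `SENSITIVE_KEYS` set are hypothetical choices for illustration; the standard does not prescribe which keys count as sensitive or what the JSON schema looks like.

```python
import json
import logging
import traceback

# Assumption: keys treated as sensitive; a real list would come from security policy.
SENSITIVE_KEYS = {"password", "token", "authorization"}


def mask(fields):
    """Replace the values of sensitive keys so they never reach the log stream."""
    return {
        key: ("***" if key.lower() in SENSITIVE_KEYS else value)
        for key, value in fields.items()
    }


class JsonLineFormatter(logging.Formatter):
    """Render each record as a single line of JSON."""

    def format(self, record):
        entry = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # "fields" is a hypothetical attribute supplied via extra={"fields": {...}}.
        extra = getattr(record, "fields", None)
        if extra:
            entry["fields"] = mask(extra)
        if record.exc_info:
            # json.dumps escapes the newlines, so the stack trace stays on one line.
            entry["stacktrace"] = "".join(
                traceback.format_exception(*record.exc_info)
            )
        return json.dumps(entry, default=str)
```

A handler would use it via `handler.setFormatter(JsonLineFormatter())`, and callers would log with `logger.info("startup", extra={"fields": {"user": "svc", "token": "abc"}})`. Masking by key name is only one option; eliding the field entirely would also satisfy the rule as written.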
+ +[1]: https://sre.google/sre-book/monitoring-distributed-systems/ +[2]: https://sre.google/sre-book/part-II-principles/ +[3]: https://sre.google/sre-book/simplicity/ +[4]: https://opentracing.io/docs/overview/what-is-tracing/ +[5]: https://docs.newrelic.com/docs/distributed-tracing/concepts/introduction-distributed-tracing/ +[6]: https://developer.newrelic.com/collect-data/custom-attributes/ +[7]: alerting.md diff --git a/docs/rfc2119.txt b/docs/rfc2119.txt new file mode 100644 index 0000000..8c29989 --- /dev/null +++ b/docs/rfc2119.txt @@ -0,0 +1,98 @@ +Network Working Group S. Bradner +Request for Comments: 2119 Harvard University +BCP: 14 March 1997 +Category: Best Current Practice + + + Key words for use in RFCs to Indicate Requirement Levels + +Status of this Memo + + This document specifies an Internet Best Current Practices for the + Internet Community, and requests discussion and suggestions for + improvements. Distribution of this memo is unlimited. + +Abstract + + In many standards track documents several words are used to signify + the requirements in the specification. These words are often + capitalized. This document defines these words as they should be + interpreted in IETF documents. Authors who follow these guidelines + should incorporate this phrase near the beginning of their document: + + The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL + NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and + "OPTIONAL" in this document are to be interpreted as described in + RFC 2119. + + Note that the force of these words is modified by the requirement + level of the document in which they are used. + +1. MUST This word, or the terms "REQUIRED" or "SHALL", mean that the + definition is an absolute requirement of the specification. + +2. MUST NOT This phrase, or the phrase "SHALL NOT", mean that the + definition is an absolute prohibition of the specification. + +3. 
SHOULD This word, or the adjective "RECOMMENDED", mean that there + may exist valid reasons in particular circumstances to ignore a + particular item, but the full implications must be understood and + carefully weighed before choosing a different course. + +4. SHOULD NOT This phrase, or the phrase "NOT RECOMMENDED" mean that + there may exist valid reasons in particular circumstances when the + particular behavior is acceptable or even useful, but the full + implications should be understood and the case carefully weighed + before implementing any behavior described with this label. + +5. MAY This word, or the adjective "OPTIONAL", mean that an item is + truly optional. One vendor may choose to include the item because a + particular marketplace requires it or because the vendor feels that + it enhances the product while another vendor may omit the same item. + An implementation which does not include a particular option MUST be + prepared to interoperate with another implementation which does + include the option, though perhaps with reduced functionality. In the + same vein an implementation which does include a particular option + MUST be prepared to interoperate with another implementation which + does not include the option (except, of course, for the feature the + option provides.) + +6. Guidance in the use of these Imperatives + + Imperatives of the type defined in this memo must be used with care + and sparingly. In particular, they MUST only be used where it is + actually required for interoperation or to limit behavior which has + potential for causing harm (e.g., limiting retransmisssions) For + example, they must not be used to try to impose a particular method + on implementors where the method is not required for + interoperability. + +7. Security Considerations + + These terms are frequently used to specify behavior with security + implications. 
The effects on security of not implementing a MUST or + SHOULD, or doing something the specification says MUST NOT or SHOULD + NOT be done may be very subtle. Document authors should take the time + to elaborate the security implications of not following + recommendations or requirements as most implementors will not have + had the benefit of the experience and discussion that produced the + specification. + +8. Acknowledgments + + The definitions of these terms are an amalgam of definitions taken + from a number of RFCs. In addition, suggestions have been + incorporated from a number of people including Robert Ullmann, Thomas + Narten, Neal McBurnett, and Robert Elz. + +9. Author's Address + + Scott Bradner + Harvard University + 1350 Mass. Ave. + Cambridge, MA 02138 + + phone - +1 617 495 3864 + + email - sob@harvard.edu + diff --git a/docs/secrets.md b/docs/secrets.md new file mode 100644 index 0000000..3e96db4 --- /dev/null +++ b/docs/secrets.md @@ -0,0 +1 @@ +# Secrets diff --git a/docs/tiers.md b/docs/tiers.md new file mode 100644 index 0000000..60cc7b3 --- /dev/null +++ b/docs/tiers.md @@ -0,0 +1,48 @@ +# Application Tiers + +## Definition + +Platforms and services can have different expectations depending on the technologies used, its support systems, and customer-impact. +This document defines those expectations into four "tiers" from the most-critical (Tier 1) to the least-critical (Tier 4). + +### Base Requirements + +- Teams MUST plan for both course-of-business failures and disaster-level events. +- Teams MUST assign a tier number for each application (service, platform, or system) they support. +- Applications MUST meet the availability and resilience targets of their tier. + +### Tier 1 + +Tier 1 applications are **core/critical systems upon which all else is built**. +Examples include Active Directory, Kubernetes clusters (Dev, Integration, or Production), and Datacenter Firewalls. 
+ +Tier 1 applications MUST provide at least **[99.95% availability](https://uptime.is/99.95)** or less than four hours of downtime in-total per year. + +### Tier 2 + +Tier 2 applications are **critical and/or time-sensitive**. +Such applications could include a customer-facing billing system, an IAM gateway, or a central code-management platform (GitHub). + +Tier 2 applications MUST provide at least **[99.9% availability](https://uptime.is/99.9)** or less than nine hours of downtime in-total per year. + +### Tier 3 + +Tier 3 applications are **important and not time-sensitive**. +These include systems for end-of-month billing, internal (non-customer-impacting) metrics, + +Tier 3 applications MUST provide at least **[99% availability](https://uptime.is/99)** or less than four days of downtime in-total per year. + +### Tier 4 + +Tier 4 applications have **low impact when delayed**. +Tier 4 applications include everything which is not assigned to other tiers. + +Tier 4 applications MUST provide at least **[97% availability](https://uptime.is/97)** or less than ten days of downtime in-total per year. + +## Availability Calculations + +Availability is calculated based on whether a given service is responding *correctly* and is *communicating*. + +- If a service endpoint is not reachable, but is otherwise returning positive metrics (or no errors), the system is not Available. +- If a service endpoint can be connected, but only returns error messages, the system is not Available. +- If a service has a combination of unreachability and erroneous responses for a duration which exceeds the Availability limit, that service has broken its SLA guarantee.
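The availability percentages in `tiers.md` translate into a yearly downtime budget by simple arithmetic. The sketch below assumes a 365-day year (8,760 hours); the function and constant names are hypothetical. Under that assumption the four tiers work out to roughly 4.4 hours, 8.8 hours, 3.7 days, and 11 days of allowable downtime per year.

```python
HOURS_PER_YEAR = 365 * 24  # 8760; assumes a non-leap year

# Availability targets per tier, in percent, as listed in tiers.md.
TIER_AVAILABILITY = {1: 99.95, 2: 99.9, 3: 99.0, 4: 97.0}


def downtime_budget_hours(availability_percent):
    """Hours per year a service may be unavailable at the given availability."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


def within_budget(tier, unreachable_hours, erroring_hours):
    """Unreachable time and error-only time both count against the budget."""
    used = unreachable_hours + erroring_hours
    return used <= downtime_budget_hours(TIER_AVAILABILITY[tier])


for tier, pct in TIER_AVAILABILITY.items():
    hours = downtime_budget_hours(pct)
    print(f"Tier {tier}: {pct}% -> {hours:.2f} h/year ({hours / 24:.2f} days)")
```

Summing unreachable and erroring time mirrors the rule that a combination of both counts toward the limit; how those durations are measured is left to the monitoring system.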