Diffstat (limited to 'docs')
-rw-r--r--  docs/alerting.md  21
-rw-r--r--  docs/apis.md  1
-rw-r--r--  docs/appid.md  76
-rw-r--r--  docs/continuous-deployment.md  0
-rw-r--r--  docs/core.md  28
-rw-r--r--  docs/datastores.md  89
-rw-r--r--  docs/deployment.md  41
-rw-r--r--  docs/git-recommendations.md  26
-rw-r--r--  docs/language.md  0
-rw-r--r--  docs/languages/golang.md  47
-rw-r--r--  docs/languages/java.md  13
-rw-r--r--  docs/languages/nodejs.md  13
-rw-r--r--  docs/languages/python.md  11
-rw-r--r--  docs/languages/rust.md  11
-rw-r--r--  docs/monitoring.md  135
-rw-r--r--  docs/rfc2119.txt  98
-rw-r--r--  docs/secrets.md  1
-rw-r--r--  docs/tiers.md  48
18 files changed, 659 insertions, 0 deletions
diff --git a/docs/alerting.md b/docs/alerting.md
new file mode 100644
index 0000000..8f8573d
--- /dev/null
+++ b/docs/alerting.md
@@ -0,0 +1,21 @@
+# Alerting
+
+Alerts are signals from [Monitors][1] to perform actions.
+Alerts MUST be _meaningful_ and _actionable_.
+
+**Meaningful**: Alert only on monitors that indicate a problem.
+(See [Monitoring][1] subsection "Saturation".)
+
+**Actionable**: Alerts MUST always include a corresponding action to resolve or investigate the underlying condition.
+
+## Requirements
+
+- Alerts MUST be resolved by a human being.
+- Alerts MUST begin with [PagerDuty][2].
+ - Alert notifications MAY forward to Slack.
+ - Alert notifications MAY forward via Email.
+- Alerts MUST NOT deduplicate automatically.
+ A human MAY aggregate alerts if unknown dependencies provide alerts during an incident.
+
+[1]: monitoring.md
+[2]: https://www.pagerduty.com/
diff --git a/docs/apis.md b/docs/apis.md
new file mode 100644
index 0000000..59b8f89
--- /dev/null
+++ b/docs/apis.md
@@ -0,0 +1 @@
+# APIs
diff --git a/docs/appid.md b/docs/appid.md
new file mode 100644
index 0000000..719059c
--- /dev/null
+++ b/docs/appid.md
@@ -0,0 +1,76 @@
+# Application ID
+
+## Definition
+
+An Application ID (APPID) is a unique identifier that identifies an application.
+Identifiers are assigned automatically and incrementally when a new APPID is requested.
+
+**Note: All Platforms and Services MUST have an assigned Application ID.**
+
+## Required Fields
+
+An APPID is represented by the following fields (with examples):
+
+- ID (_APP123456_) (Must be globally `Unique`)
+- Engineering Name (_Custom Service or Platform Name_)
+- Business Name (_Business Product Name, if different_)
+- [Tier](tiers.md) (_Tier 1_)
+- Slack Channel (_#support-team_)
+- Supporting Team (_support@company-team_)
+- Supporting Manager (_person@company.com_)
+- Director (_director@company.com_)
+- VP (_vp@company.com_)
+- Architect (_arch@company.com_)
+- Product Owner (_po@company.com_)
+- Location (_SaaS_ or _OnPrem_)
+- Ancestor (_APP456789_)
+
+### Operational Fields
+
+APPID supporting fields which provide operational details for the Application:
+
+- Descendants: *Calculated field* from all Ancestor field references.
+- Dependencies: Other APPIDs on which the application depends for core functionality.
+ (This can generate a lightweight "service map" for incident response.)
+- Code Source: Source code URL for the single application source or a root folder containing the other sources.
+- PagerDuty Service(s): One (or more?) PagerDuty Escalation Policy URLs and/or Service Alert URLs.
+- Kubernetes Namespace: If using Kubernetes, the namespace in which the application is deployed.
+- Deployment Stage: Dev / Integration / Production
+- Active: Declares an APPID as `Active` or `Retired`.
+ If a system is not deliberately retired and "turned off," it cannot be flagged as `Retired`.
+
+## Definitions
+
+An application can have several definitions.
+Systems which are granted an APPID MAY be:
+
+- Customer-facing software
+- Internal-to-company software
+- Software providing non-interactive interfaces (such as an API or web service)
+- Software providing platform or infrastructure services
+- Software built and produced by company, acquired from a third-party or Vendor, or a Software-as-a-Service (SaaS) platform
+
+**Supplemental Rule:** Software which is not supported by the company MUST NOT be an application.
+
+### Boundaries
+
+Consider the following as to whether or not something needs an APPID:
+
+- Responsibility for the application MUST belong to a single team.
+ If a service or platform (Application) is maintained by more than one team, the system MUST be divided into multiple APPIDs. (See _Ancestor/Descendant Relationships_ defined below.)
+- SLAs MUST be the same across a single APPID.
+ If an application has multiple endpoints with differing SLAs, those endpoints MUST be separated into differing services and APPIDs (except for vendor-provided products).
+- A collection of microservices which accomplish one business objective MAY share a single APPID.
+- The service or platform MUST have a unified codebase.
+ If different code repositories are used for a different environment (such as On-Premises versus SaaS offerings), a new APPID MUST be created for the divergent codebase.
+- The physical or virtual hardware on which a service or platform runs MUST NOT be an APPID.
+
+### Ancestor/Descendant Relationships
+
+Some systems are best represented as a single Ancestor APPID with multiple Descendant APPIDs associated.
+Examples include microservices which operate independently to provide a single service interface, or services which provide key functionality to a platform but are of a less-reliable tier.
+
+- A Descendant APPID MUST only have one Ancestor APPID
+- A Descendant APPID MUST NOT be an Ancestor to another APPID
+- An Ancestor APPID MUST be of a tier lower than or equal to the highest tier
+- A Descendant APPID MUST NOT be of a [more-critical tier](tiers.md) than its Ancestor APPID
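The Ancestor/Descendant rules in docs/appid.md above lend themselves to a mechanical check. The following is a minimal sketch, not part of the standard: the `AppId` record, its field names, and the `violations` helper are hypothetical, and treating a dangling Ancestor reference as an error is an assumption. Tier numbers follow docs/tiers.md (Tier 1 is the most critical).

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class AppId:
    """Hypothetical, trimmed-down APPID record (only the fields the rules need)."""
    id: str
    tier: int                       # 1 = most critical ... 4 = least critical
    ancestor: Optional[str] = None  # a single field, so at most one Ancestor


def descendants(apps: List[AppId]) -> Dict[str, List[str]]:
    """Calculate the Descendants field from all Ancestor field references."""
    result: Dict[str, List[str]] = {}
    for app in apps:
        if app.ancestor is not None:
            result.setdefault(app.ancestor, []).append(app.id)
    return result


def violations(apps: List[AppId]) -> List[str]:
    """Return human-readable rule violations; an empty list means none found."""
    by_id = {app.id: app for app in apps}
    children = descendants(apps)
    problems: List[str] = []
    for app in apps:
        if app.ancestor is None:
            continue
        parent = by_id.get(app.ancestor)
        if parent is None:
            # Assumption: a dangling Ancestor reference is treated as an error.
            problems.append(f"{app.id}: unknown Ancestor {app.ancestor}")
            continue
        if app.id in children:
            problems.append(f"{app.id}: a Descendant MUST NOT be an Ancestor")
        if app.tier < parent.tier:
            problems.append(
                f"{app.id}: Tier {app.tier} is more critical than "
                f"Ancestor {parent.id} (Tier {parent.tier})"
            )
    return problems
```

Storing the Ancestor as a single optional field makes the "only one Ancestor" rule hold by construction, so only the remaining rules need explicit checks.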
diff --git a/docs/continuous-deployment.md b/docs/continuous-deployment.md
deleted file mode 100644
index e69de29..0000000
--- a/docs/continuous-deployment.md
+++ /dev/null
diff --git a/docs/core.md b/docs/core.md
new file mode 100644
index 0000000..d73b122
--- /dev/null
+++ b/docs/core.md
@@ -0,0 +1,28 @@
+# Core Rules and Guidelines for Standards
+
+The following contain both requirements and guidelines for producing Standards documents.
+
+## Terms
+
+- Standards review group (*SRG*): assigned group which reviews and approves changes to the Standards.
+
+## Requirements
+
+- The SRG MUST review outstanding change requests (PRs) every two weeks (14 days).
+- PRs MUST remain open for comment no less than 5 business days (or one week).
+ - PRs MAY remain open for more than 2 weeks if the discussion is active and ongoing.
+- The SRG MUST publish Standards no more frequently than once per week, nor less frequently than once every three months.
+
+Documents MUST:
+
+- Follow [RFC-2119](rfc2119.txt).
+- Be as specific as possible (e.g., no colloquial language).
+- Be concise and use [active voice](https://writing.wisc.edu/handbook/style/ccs_activevoice/).
+- Capitalize all words in headings.
+- Use American English spellings and conventions.
+
+## Guidelines
+
+All Standards documents:
+
+- SHOULD follow the [Google Style Guide](https://developers.google.com/style/lists#capitalization-and-end-punctuation) for punctuation and capitalization.
diff --git a/docs/datastores.md b/docs/datastores.md
index e69de29..ef22c72 100644
--- a/docs/datastores.md
+++ b/docs/datastores.md
@@ -0,0 +1,89 @@
+# Data Stores
+
+## Scope
+
+This standard prescribes database and data storage technologies used to solve many related data-retention concerns.
+The solutions recommended below are designed to encourage deep expertise in a few stable and well-understood systems, rather than maximal "fit" for each distinct use case.
+
+As such, the solutions may not be optimal for every use case, but their performance, maintenance, optimization, and reliability requirements are understood and supported by the engineering community.
+
+## Terms
+
+- _Database_: provides long-term, durable storage for data whose loss or unavailability would mean violating an application's Availability or Business requirements.
+- _Cache_: provides short-term, volatile storage which does not preserve data.
+ Caches are not in the scope of this standard.
+
+## Capability Matrix
+
+| Capabilities | RDBMS | KV Store | File/Object |
+|--------------|-------|----------|-------------|
+| [Relational] | X | O | |
+| [Key-Value] | X | X | X |
+| [Document-Oriented] | X | X | X |
+| [Object-Based] | X | X | X |
+
+_O_: While KV-stores cannot store relational data, some KV-focused databases provide relational-like "tagging" and other attribute aggregations.
+
+## Selection Criteria
+
+### PostgreSQL
+
+Applications SHOULD use PostgreSQL (Aurora in Cloud environments and the latest stable release in OnPrem environments).
+
+PostgreSQL across all environments supports all storage methods including:
+
+- Simple [Key-Value] stores (via [hstore])
+- [Document-Oriented] storage and queries (via [jsonb])
+- [Large-object] storage directly within the database
+
+If the application is running in the Cloud environment and needs a total data size over [64TB (RDS)][1] or [128TB (Aurora)][2], then it MUST use another approved option.
+
+### DynamoDB
+
+If an Application is hosted in AWS and requires many of:
+
+- Flexible, schemaless data model that will change in the future
+- Read-heavy access model for items
+- Single-digit millisecond response times
+- Very fast caching for hot keys and values (< 10ms)
+- Multi-region deployments
+
+and does not require any of:
+
+- Strongly consistent reads and writes in all situations
+- Item/Row sizes over 400 KB
+- A maximum data size of over 1TB
+- Joins between different stored values
+
+then it MAY use DynamoDB.
+
+### S3 / File store
+
+If the Application persists data with many of the following:
+
+- Large files (>100MB each)
+- Large volumes of data (>1 million records)
+- Simple identification requirements (e.g., "filename as key")
+- Simple relational requirements (e.g., folders and files)
+
+and does not require any of the following:
+
+- Low-latency access (< 200ms)
+- Relational join operations
+- Low response times (< 1000ms)
+
+then it MAY use a filestore or S3.
+
+[Aurora (PostgreSQL)]: https://aws.amazon.com/rds/aurora/postgresql-features/
+[RDS (PostgreSQL)]: https://aws.amazon.com/rds/postgresql/
+[DynamoDB]: https://aws.amazon.com/dynamodb/
+[S3]: https://aws.amazon.com/s3/
+[Relational]:https://en.wikipedia.org/wiki/Relational_database
+[Key-Value]: https://en.wikipedia.org/wiki/Key-value_database
+[Document-Oriented]: https://en.wikipedia.org/wiki/Document-oriented_database
+[Object-Based]: https://en.wikipedia.org/wiki/Object_storage#Cloud_storage
+[hstore]: https://www.postgresql.org/docs/13/hstore.html
+[jsonb]: https://www.postgresql.org/docs/current/datatype-json.html
+[Large-object]: https://www.postgresql.org/docs/13/largeobjects.html
+[1]: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_Storage.html
+[2]: https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/CHAP_Limits.html
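The "does not require any of" list for DynamoDB in docs/datastores.md above reads as a set of hard disqualifiers, which can be expressed directly. This is a rough sketch under stated assumptions: the `Needs` record and its field names are hypothetical, and the softer "requires many of" criteria are deliberately left out because the standard does not define how many is "many".

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Needs:
    """Hypothetical summary of an application's storage requirements."""
    strongly_consistent_always: bool = False
    max_item_kb: float = 0.0
    total_data_tb: float = 0.0
    needs_joins: bool = False


def dynamodb_disqualifiers(needs: Needs) -> List[str]:
    """List the 'does not require any of' criteria that rule DynamoDB out."""
    reasons: List[str] = []
    if needs.strongly_consistent_always:
        reasons.append("strongly consistent reads and writes in all situations")
    if needs.max_item_kb > 400:
        reasons.append("item/row sizes over 400 KB")
    if needs.total_data_tb > 1:
        reasons.append("data size over 1 TB")
    if needs.needs_joins:
        reasons.append("joins between stored values")
    return reasons
```

An empty result only means DynamoDB is not ruled out; the Application still has to match the positive criteria before it MAY use DynamoDB.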
diff --git a/docs/deployment.md b/docs/deployment.md
new file mode 100644
index 0000000..37cef3b
--- /dev/null
+++ b/docs/deployment.md
@@ -0,0 +1,41 @@
+# Deployment
+
+## Service Endpoint Requirements
+
+Services with HTTP endpoints MUST expose the following endpoints:
+
+- `/info` - provides version information, git hash, etc.
+- `/ping` - implements the equivalent of a [Kubernetes liveness probe][1], returning HTTP status header [200 OK][2] if in a running state, or an HTTP status header [500 Error][3] if not.
+- `/ready` - implements a Kubernetes readiness probe, returning a [200-OK][2] header if prepared to accept traffic, and a [500-Error][3] header if not.
+
+In addition:
+
+- Liveness and Readiness endpoints MUST be different endpoints with different implementations.
+
+## Rollout Mechanisms
+
+Types of rollout mechanisms include:
+
+- ["In-place"][6]: stops the application or server, updates the application out-of-band, then starts the application or server.
+ This mechanism always causes downtime for clients as the service is not available to service requests during the update.
+- ["Blue/Green"][4]: creates a full replica of the existing production environment, running the newer application.
+ The load-balancer or other routing system points clients at the new replica (once instantiated), then removes the old systems.
+ This mechanism uses at least twice the total resources required to run an environment for the duration of the transition.
+- ["Rolling Update"][5]: requires a multi-instance deployment.
+ Removes one node, then spawns at least two nodes in its place.
+ If the new nodes pass deployment checks, another old node is stopped, while a new node takes its place until the entire original deployment is replaced.
+ - ["Canary"][7]: a "permanently halted" type of rolling-update.
+ Canary deployments exist alongside the older deployment, handling a fraction of the total request volume.
+ When the canary is verified "working correctly," it is turned into a rolling update, and then the next version is deployed as a canary.
+
+## Deployment Requirements
+
+- Services with more than one instance MUST use a rolling-update unless otherwise approved by the Architecture Review group.
+
+[1]: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/
+[2]: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/200
+[3]: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/500
+[4]: https://docs.aws.amazon.com/whitepapers/latest/overview-deployment-options/bluegreen-deployments.html
+[5]: https://docs.aws.amazon.com/whitepapers/latest/overview-deployment-options/rolling-deployments.html
+[6]: https://docs.aws.amazon.com/whitepapers/latest/overview-deployment-options/in-place-deployments.html
+[7]: https://wa.aws.amazon.com/wellarchitected/2020-07-02T19-33-23/wat.concept.canary-deployment.en.html
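The three endpoints required by docs/deployment.md above can be sketched with the Python standard library alone. This is an illustration, not a reference implementation: the `READY` flag, the `BUILD_INFO` contents, and the `serve` helper are hypothetical names, and a real service would decide for itself what "ready" means.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical readiness flag; a real service would set it once its
# dependencies (database connections, caches, ...) are usable.
READY = threading.Event()

# Placeholder build metadata; normally injected at build time.
BUILD_INFO = {"version": "0.0.0", "git_hash": "unknown"}


class ProbeHandler(BaseHTTPRequestHandler):
    def _reply(self, status: int, body: bytes = b"") -> None:
        self.send_response(status)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self) -> None:
        if self.path == "/info":
            self._reply(200, json.dumps(BUILD_INFO).encode())
        elif self.path == "/ping":
            # Liveness: if this handler runs at all, the process is alive.
            self._reply(200)
        elif self.path == "/ready":
            # Readiness: a separate check from liveness.
            self._reply(200 if READY.is_set() else 500)
        else:
            self._reply(404)

    def log_message(self, format, *args):  # silence per-request logging
        pass


def serve(port: int = 8080) -> HTTPServer:
    """Start the probe server on a background thread and return it."""
    server = HTTPServer(("127.0.0.1", port), ProbeHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Calling `READY.set()` after startup work completes flips `/ready` from 500 to 200 while `/ping` stays independent of it, which keeps the liveness and readiness implementations distinct.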
diff --git a/docs/git-recommendations.md b/docs/git-recommendations.md
new file mode 100644
index 0000000..fcc19a0
--- /dev/null
+++ b/docs/git-recommendations.md
@@ -0,0 +1,26 @@
+# Git Recommendations
+
+The following are a collection of suggestions to best use Git as a source control system.
+
+- Commits SHOULD represent a logical unit of work.
+- Commit frequently
+ - Push your commits to a branch on your own fork, if possible.
+- Write [descriptive commit messages].
+- Keep the remote repository up-to-date by committing and pushing your work regularly.
+- Keep your local copies of repositories up-to-date by regularly pulling changes.
+ - Frequently pull from upstream to the `main` branch, and rebase your changes on top (see the [rebase workflow]).
+- Coordinate with colleagues to avoid nasty merge conflicts (if you can).
+ - Try the git [rebase workflow].
+- Use branches and merge requests.
+ - Pick a branching strategy that works for you and your team.
+ Consider [GitHub Flow](https://guides.github.com/introduction/flow/index.html).
+ - Create a new branch for each feature or bugfix.
+ - The `main` branch SHOULD always contain releasable code.
+ - Protect your `main` branch; require merge requests to make changes.
+ - Configure a group of default reviewers for your pull requests.
+ - For teams of 4 or more, require a minimum of 2 approvers for all merge requests.
+- Avoid force operations (`-f` or `--force` option), especially on the `main` branch, as this is an indication you are probably doing something wrong.
+- Keep your repository neat: always delete merged branches.
+
+[descriptive commit messages]: https://cbea.ms/git-commit/#seven-rules
+[rebase workflow]: https://git-rebase.io
diff --git a/docs/language.md b/docs/language.md
deleted file mode 100644
index e69de29..0000000
--- a/docs/language.md
+++ /dev/null
diff --git a/docs/languages/golang.md b/docs/languages/golang.md
index e69de29..6f1cf2f 100644
--- a/docs/languages/golang.md
+++ b/docs/languages/golang.md
@@ -0,0 +1,47 @@
+# Go Language Standards
+
+## Definitions
+
+- A `Program` is a program, service, or application which is NOT a library.
+- A `Library` is code designed only for consumption by other programs.
+
+## Requirements
+
+- Builds MUST use the [Go-provided compiler][1].
+ Builds MUST NOT use the [gcc-go][2] compiler or other alternatives.
+- Programs MUST update and commit the `go.mod` and `go.sum` using `go mod tidy`.
+- Programs MUST [vendor dependency code][4] and commit the vendored code to their repository.
+- CI builds MUST use the [golangci-lint][5] linter as a first-stage validation step.
+- Programs SHOULD use the [standard project layout][3].
+- Programs MUST NOT use CGO unless there is no pure-Go alternative.
+ Appropriate uses of CGO include Oracle DB drivers and GPGPU computation.
+ (Prefer `CGO_ENABLED=0` as a build flag.)
+
+## Recommended Resources
+
+- [Effective Go](https://golang.org/doc/effective_go.html)
+- [Go Proverbs](https://go-proverbs.github.io/)
+- [Go Test package](https://pkg.go.dev/testing)
+
+## Supplemental Advice
+
+### Local Environment
+
+- Run `go build`, `golangci-lint run`, and `go test` before pushing code for your PR.
+- Fork the repository, then commit and push changes to the fork frequently.
+ This avoids catastrophic data loss and enables Work In Progress (WIP) sharing.
+
+### Go language
+
+- Errors MUST be handled.
+- Programs and Libraries SHOULD NOT use third-party libraries.
+ Prefer standard library packages.
+- Use [gofumports](https://github.com/mvdan/gofumpt) for formatting and automatic imports
+- Test functions MUST check both the error result and the returned data.
+
+[RFC2119]:https://www.rfc-editor.org/rfc/rfc2119.txt
+[1]:https://golang.org
+[2]:https://gcc.gnu.org/onlinedocs/gccgo/
+[3]:https://github.com/golang-standards/project-layout
+[4]:https://golang.org/ref/mod#vendoring
+[5]:https://golangci-lint.run
diff --git a/docs/languages/java.md b/docs/languages/java.md
index e69de29..b0119f6 100644
--- a/docs/languages/java.md
+++ b/docs/languages/java.md
@@ -0,0 +1,13 @@
+# Java Language Standards
+
+## Requirements
+
+TBD
+
+## Guidelines
+
+- Follow the [Google Style Guide][1] for formatting.
+ - Use the automatic formatting rules for IntelliJ IDEA and Eclipse [available here][2].
+
+[1]: https://google.github.io/styleguide/javaguide.html
+[2]: https://raw.githubusercontent.com/google/styleguide/gh-pages/intellij-java-google-style.xml
diff --git a/docs/languages/nodejs.md b/docs/languages/nodejs.md
index e69de29..9eaede6 100644
--- a/docs/languages/nodejs.md
+++ b/docs/languages/nodejs.md
@@ -0,0 +1,13 @@
+# NodeJS and JavaScript Language Standards
+
+## Requirements
+
+- Applications MUST target the latest LTS release of NodeJS (currently v16).
+- Applications MUST use ESLint per the _Linting_ guidelines below.
+
+### Linting
+[ESLint](https://eslint.org/) is a linter used with JavaScript to detect and enforce code and style guidelines.
+There are several shared, public configs that provide base rules.
+The config SHOULD extend the [standardjs](https://standardjs.com/) ESLint configuration.
+This differs from the general language standard,
+which is to use [Google coding standards](https://github.com/google/eslint-config-google).
diff --git a/docs/languages/python.md b/docs/languages/python.md
index e69de29..fa4edec 100644
--- a/docs/languages/python.md
+++ b/docs/languages/python.md
@@ -0,0 +1,11 @@
+# Python Language Standards
+
+## Requirements
+
+TBD
+
+## Guidelines
+
+- Code MUST be formatted using [PEP8][1].
+
+[1]: https://www.python.org/dev/peps/pep-0008/
diff --git a/docs/languages/rust.md b/docs/languages/rust.md
index e69de29..426164a 100644
--- a/docs/languages/rust.md
+++ b/docs/languages/rust.md
@@ -0,0 +1,11 @@
+# Rust Language Standards
+
+## Requirements
+
+- Committed code MUST be formatted using [rustfmt][1].
+
+## Guidelines
+
+TBD
+
+[1]: https://github.com/rust-lang/rustfmt#running-rustfmt-from-your-editor
diff --git a/docs/monitoring.md b/docs/monitoring.md
new file mode 100644
index 0000000..d9a9b62
--- /dev/null
+++ b/docs/monitoring.md
@@ -0,0 +1,135 @@
+# Monitoring
+
+This standard presents guidelines for providing operational monitoring and alerting of applications and platforms for internal teams.
+This standard does not provide guidelines for alerting external Customers or Vendors.
+
+Terminology will be introduced throughout this document.
+Much of the terminology is drawn from [the Google SRE Book][2].
+If you are unfamiliar with monitoring and alerting, especially for distributed systems, you SHOULD read [this section][1] of the SRE book.
+
+## Requirements
+
+All Applications:
+
+- MUST instrument their primary code paths with APM/tracing.
+- MUST instrument and [alert][7] on the four primary signals: Latency, Traffic, Errors, and Saturation.
+- MUST NOT add unique values to time-series metrics tagging (e.g., request IDs, UUIDs, other high-cardinality values).
+- SHOULD NOT use logs to track data captured by metrics or APM.
+- SHOULD consolidate logging statements to a single line (e.g., report stacktraces on a single line, render multi-line payloads in single-line format, etc.).
+
+## Observability Concepts
+
+Observability is about understanding how an application performs at runtime, with real-world use cases and data.
+
+There are two ways to gather observability information: instrumenting code which emits data to an aggregator (_Push_), or instrumenting code which exposes the information for an external system to query (_Pull_ / _Poll_).
+
+Each method has its advantages, but _Push_ allows observability mechanisms to be incorporated within the application itself.
+Using Push mechanisms significantly [reduces the overall complexity][3] of the system.
+
+### Definitions
+
+Certain words have different definitions from their common use in conversation.
+For the purposes of monitoring and alerting, the following definitions apply:
+
+- **Observe**: to understand how an application behaves, with real world use cases.
+- **Monitoring**: the act of collecting information used to _Observe_ an application.
+- **Event**: a record of something which happened, produced by a Monitor.
+ - A monitoring _event_ is not the same as events used in other systems such as Databases, Cloud Providers, Apache Kafka, etc.
+ - Events are records, not signals, and MUST represent something that actually happened.
+ Such occurrences represent decisions made by the
+
+## Types of Monitoring
+
+Also known as the _Pillars of Observability_.
+
+Monitoring follows three broad categories: _Time Series_ metrics, _Application Performance Monitoring_ (APM) or _Tracing_, and _Logging_.
+Each of the categories has its advantages and disadvantages, but APM tends to have the best results across all three categories, and often costs the least for a volume of data or events.
+
+### Time Series
+
+Time Series metrics are points of data, sometimes aggregated, which MAY answer very simple questions like:
+
+- Is the application running?
+- How many requests / transactions per [time value] is the application processing?
+- Does the application need to scale (horizontally) or deploy with more resources?
+
+Time Series metrics are good at observing trends, handling large volumes of data with minimal infrastructure, and for tracking (normally) slow-changing statistics such as infrastructure and node details.
+
+Most Time Series implementations cannot associate data to events beyond key-value tagging.
+Such systems are often insufficient to track intermittent issues or view details regarding particular errors.
+
+### Application Performance Monitoring (APM)
+
+APM data are instrumentation which can operate across code functions, external requests, and across applications.
+APM provides [distributed tracing][4], which can be enhanced on platforms like [NewRelic][5].
+APM can answer all the questions addressed by the Time-Series solution, and additionally:
+
+- Am I operating within my SLO/SLA for clients?
+- How long does a specific endpoint take to respond?
+- How frequently is a specific function being called?
+- What functions are taking the most time during a request / operation?
+- What were the exact contents of request and response data during a long or erroneous operation?
+- What [additional attributes][6] were present during an erroneous operation?
+
+APM SHOULD be used to track key indicators (KPIs) and other SLO-related values such as:
+
+- Errors and Error Rate.
+- Traffic / Throughput rates.
+- Latency, especially within the application.
+- Interactions with dependencies, including latency and error rate.
+
+### Logging
+
+Logging is an inherently flexible and robust method of generating event-related information.
+Logging metrics capture event data from a specific point in time during application execution and write to a file (or stream, for collection).
+Logging metrics MAY answer questions like:
+
+- What calculated values were generated by a specific function?
+- What input or payload information was provided for a specific transaction?
+
+Logs are often necessary for auditing purposes, which have security, legal, or regulatory requirements.
+Such requirements supersede declarations here.
+
+- Logging SHOULD be used to capture only events which cannot be captured by APM or Metrics.
+ - Suggested events include process startup messages, signal-received hooks, shutdown messages, and reporting failures (such as failure to submit APM/Metrics).
+ - Logging MAY _temporarily_ be used to track events which are captured by APM or Metrics during incidents or when debugging.
+- Logging SHOULD be in a structured format such as JSON.
+- Logging MUST NOT include sensitive data without masking/eliding said data.
+
+## Top Metrics
+
+Also known as [the Golden Signals][1].
+
+Applications MUST have at least one dashboard which displays the following metrics for all services or components, as well as metrics for invoked dependencies.
+
+### Latency
+
+- APM SHOULD be used to measure latency.
+- Metrics MAY be used to measure latency.
+- Logging MUST NOT be used to measure latency.
+
+### Traffic
+
+- APM SHOULD be used to measure traffic.
+- Metrics MAY be used to measure traffic.
+- Logging MUST NOT be used to measure traffic.
+
+### Errors
+
+- APM SHOULD be used to measure errors.
+- Metrics MAY be used to measure errors.
+- Logging SHOULD NOT be used to measure errors.
+
+### Saturation
+
+- APM SHOULD be used to measure saturation.
+- Metrics MAY be used to measure saturation.
+- Logging MUST NOT be used to measure saturation.
+
+[1]: https://sre.google/sre-book/monitoring-distributed-systems/
+[2]: https://sre.google/sre-book/part-II-principles/
+[3]: https://sre.google/sre-book/simplicity/
+[4]: https://opentracing.io/docs/overview/what-is-tracing/
+[5]: https://docs.newrelic.com/docs/distributed-tracing/concepts/introduction-distributed-tracing/
+[6]: https://developer.newrelic.com/collect-data/custom-attributes/
+[7]: alerting.md
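The logging rules in docs/monitoring.md above (structured format such as JSON, one line per statement including stack traces, and masking of sensitive data) can be combined in a single formatter. A minimal sketch using Python's standard `logging` module follows; the `SENSITIVE_KEYS` set and the convention of passing structured data through a `fields` attribute are assumptions made for illustration.

```python
import json
import logging

# Assumption: which keys count as sensitive is application-specific.
SENSITIVE_KEYS = {"password", "token", "ssn"}


class JsonLineFormatter(logging.Formatter):
    """Render each record as one line of JSON, masking sensitive fields."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Hypothetical convention: callers pass structured data via
        # extra={"fields": {...}} on the logging call.
        for key, value in getattr(record, "fields", {}).items():
            entry[key] = "***" if key in SENSITIVE_KEYS else value
        if record.exc_info:
            # json.dumps escapes newlines, so the stack trace stays on one line.
            entry["stacktrace"] = self.formatException(record.exc_info)
        return json.dumps(entry)
```

A handler would use it via `handler.setFormatter(JsonLineFormatter())`, and callers would log with, for example, `logger.info("login", extra={"fields": {"user": "a", "token": "abc"}})`. Masking by key name is a coarse approach; it does not catch sensitive values embedded in the message text itself.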
diff --git a/docs/rfc2119.txt b/docs/rfc2119.txt
new file mode 100644
index 0000000..8c29989
--- /dev/null
+++ b/docs/rfc2119.txt
@@ -0,0 +1,98 @@
+Network Working Group S. Bradner
+Request for Comments: 2119 Harvard University
+BCP: 14 March 1997
+Category: Best Current Practice
+
+
+ Key words for use in RFCs to Indicate Requirement Levels
+
+Status of this Memo
+
+ This document specifies an Internet Best Current Practices for the
+ Internet Community, and requests discussion and suggestions for
+ improvements. Distribution of this memo is unlimited.
+
+Abstract
+
+ In many standards track documents several words are used to signify
+ the requirements in the specification. These words are often
+ capitalized. This document defines these words as they should be
+ interpreted in IETF documents. Authors who follow these guidelines
+ should incorporate this phrase near the beginning of their document:
+
+ The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
+ NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and
+ "OPTIONAL" in this document are to be interpreted as described in
+ RFC 2119.
+
+ Note that the force of these words is modified by the requirement
+ level of the document in which they are used.
+
+1. MUST This word, or the terms "REQUIRED" or "SHALL", mean that the
+ definition is an absolute requirement of the specification.
+
+2. MUST NOT This phrase, or the phrase "SHALL NOT", mean that the
+ definition is an absolute prohibition of the specification.
+
+3. SHOULD This word, or the adjective "RECOMMENDED", mean that there
+ may exist valid reasons in particular circumstances to ignore a
+ particular item, but the full implications must be understood and
+ carefully weighed before choosing a different course.
+
+4. SHOULD NOT This phrase, or the phrase "NOT RECOMMENDED" mean that
+ there may exist valid reasons in particular circumstances when the
+ particular behavior is acceptable or even useful, but the full
+ implications should be understood and the case carefully weighed
+ before implementing any behavior described with this label.
+
+5. MAY This word, or the adjective "OPTIONAL", mean that an item is
+ truly optional. One vendor may choose to include the item because a
+ particular marketplace requires it or because the vendor feels that
+ it enhances the product while another vendor may omit the same item.
+ An implementation which does not include a particular option MUST be
+ prepared to interoperate with another implementation which does
+ include the option, though perhaps with reduced functionality. In the
+ same vein an implementation which does include a particular option
+ MUST be prepared to interoperate with another implementation which
+ does not include the option (except, of course, for the feature the
+ option provides.)
+
+6. Guidance in the use of these Imperatives
+
+ Imperatives of the type defined in this memo must be used with care
+ and sparingly. In particular, they MUST only be used where it is
+ actually required for interoperation or to limit behavior which has
+ potential for causing harm (e.g., limiting retransmisssions) For
+ example, they must not be used to try to impose a particular method
+ on implementors where the method is not required for
+ interoperability.
+
+7. Security Considerations
+
+ These terms are frequently used to specify behavior with security
+ implications. The effects on security of not implementing a MUST or
+ SHOULD, or doing something the specification says MUST NOT or SHOULD
+ NOT be done may be very subtle. Document authors should take the time
+ to elaborate the security implications of not following
+ recommendations or requirements as most implementors will not have
+ had the benefit of the experience and discussion that produced the
+ specification.
+
+8. Acknowledgments
+
+ The definitions of these terms are an amalgam of definitions taken
+ from a number of RFCs. In addition, suggestions have been
+ incorporated from a number of people including Robert Ullmann, Thomas
+ Narten, Neal McBurnett, and Robert Elz.
+
+9. Author's Address
+
+ Scott Bradner
+ Harvard University
+ 1350 Mass. Ave.
+ Cambridge, MA 02138
+
+ phone - +1 617 495 3864
+
+ email - sob@harvard.edu
+
diff --git a/docs/secrets.md b/docs/secrets.md
new file mode 100644
index 0000000..3e96db4
--- /dev/null
+++ b/docs/secrets.md
@@ -0,0 +1 @@
+# Secrets
diff --git a/docs/tiers.md b/docs/tiers.md
new file mode 100644
index 0000000..60cc7b3
--- /dev/null
+++ b/docs/tiers.md
@@ -0,0 +1,48 @@
+# Application Tiers
+
+## Definition
+
+Platforms and services can have different expectations depending on the technologies used, their support systems, and customer impact.
+This document defines those expectations into four "tiers" from the most-critical (Tier 1) to the least-critical (Tier 4).
+
+### Base Requirements
+
+- Teams MUST plan for both course-of-business failures and disaster-level events.
+- Teams MUST assign a tier number for each application (service, platform, or system) they support.
+- Applications MUST meet the availability and resilience targets of their tier.
+
+### Tier 1
+
+Tier 1 applications are **core/critical systems upon which all else is built**.
+Examples include Active Directory, Kubernetes clusters (Dev, Integration, or Production), and Datacenter Firewalls.
+
+Tier 1 applications MUST provide at least **[99.95% availability](https://uptime.is/99.95)** or less than four hours of downtime in total per year.
+
+### Tier 2
+
+Tier 2 applications are **critical and/or time-sensitive**.
+Such applications could include a customer-facing billing system, an IAM gateway, or a central code-management platform (GitHub).
+
+Tier 2 applications MUST provide at least **[99.9% availability](https://uptime.is/99.9)** or less than nine hours of downtime in total per year.
+
+### Tier 3
+
+Tier 3 applications are **important and not time-sensitive**.
+These include systems for end-of-month billing and internal (non-customer-impacting) metrics.
+
+Tier 3 applications MUST provide at least **[99% availability](https://uptime.is/99)** or less than four days of downtime in total per year.
+
+### Tier 4
+
+Tier 4 applications have **low impact when delayed**.
+Tier 4 applications include everything which is not assigned to other tiers.
+
+Tier 4 applications MUST provide at least **[97% availability](https://uptime.is/97)** or less than ten days of downtime in total per year.
+
+## Availability Calculations
+
+Availability is calculated based on whether a given service is responding *correctly* and is *communicating*.
+
+- If a service endpoint is not reachable, but is otherwise returning positive metrics (or no errors), the system is not Available.
+- If a service endpoint can be connected, but only returns error messages, the system is not Available.
+- If a service has a combination of unreachability and erroneous responses for a duration which exceeds the Availability limit, that service has broken its SLA guarantee.
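The yearly downtime budgets quoted for each tier in docs/tiers.md above follow from the availability percentages by simple arithmetic. A small sketch, assuming a 365-day year (the function name is illustrative):

```python
HOURS_PER_YEAR = 365 * 24  # 8760; ignores leap years


def downtime_budget_hours(availability_percent: float) -> float:
    """Hours of downtime per year permitted by an availability target."""
    return (100.0 - availability_percent) / 100.0 * HOURS_PER_YEAR


for tier, target in [(1, 99.95), (2, 99.9), (3, 99.0), (4, 97.0)]:
    hours = downtime_budget_hours(target)
    print(f"Tier {tier}: {target}% -> {hours:.2f} h/year ({hours / 24:.2f} days)")
```

Under that assumption the budgets come to about 4.38 hours, 8.76 hours, 3.65 days, and 10.95 days for Tiers 1 through 4. The prose limits for Tier 1 ("less than four hours") and Tier 4 ("less than ten days") are therefore slightly stricter than the percentages alone would require.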