author Tyler Davis <tyler@gluecode.net> 2025-01-31 06:07:27 +0000
committer Tyler Davis <tyler@gluecode.net> 2025-01-31 06:07:27 +0000
commit 97842d43cc06cdd8e298d74ad6705af4348df49a
tree 0c3f2393c03824968ed16128135ca743f1353286 /docs/monitoring.md
parent 509fe00d541ce769c4423a072318ad47294b7763
Diffstat (limited to 'docs/monitoring.md')
-rw-r--r-- docs/monitoring.md 143
1 files changed, 93 insertions, 50 deletions
diff --git a/docs/monitoring.md b/docs/monitoring.md
index d868672..56e5008 100644
--- a/docs/monitoring.md
+++ b/docs/monitoring.md
@@ -1,78 +1,111 @@
-# Monitoring
+# Monitoring (or Observability-v1)
-This standard presents guidelines for providing operational montioring and alerting of applications and platforms for internal teams.
-This standard does not provide guidelines for alerting external Customers or Vendors.
+This standard presents guidelines for providing operational monitoring
+and alerting of applications and platforms for internal teams. This
+standard does not provide guidelines for alerting external Customers or
+Vendors.
-Terminology will be introduced throughout this document.
-Much of the terminology is drawn from [the Google SRE Book][2].
-If you are unfamiliar with monitoring and alerting, especially for distributed systems, you SHOULD read [this section][1] of the SRE book.
+Terminology will be introduced throughout this document. Much of the
+terminology is drawn from [the Google SRE Book][2]. If you are
+unfamiliar with monitoring and alerting, especially for distributed
+systems, you SHOULD read [this section][1] of the SRE book.
## Requirements
All Applications:
- MUST instrument their primary code paths with APM/tracing.
-- MUST instrument and [alert][7] on the four primary signals: Latency, Traffic, Errors, and Saturation.
-- MUST NOT add unique values to time-series metrics tagging (e.g., request IDs, UUIDs, other high-cardinality values).
+- MUST instrument and [alert][7] on the four primary signals: Latency,
+ Traffic, Errors, and Saturation.
+- MUST NOT add unique values to time-series metrics tagging (e.g.,
+ request IDs, UUIDs, other high-cardinality values).
- SHOULD NOT use logs to track data captured by metrics or APM.
-- SHOULD consolidate logging statements to a single line (e.g., report stacktraces on a single line, multi-line payloads rendered in single-line format, etc).
+- SHOULD consolidate logging statements to a single line (e.g., report
+ stacktraces on a single line, multi-line payloads rendered in
+ single-line format, etc).
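The cardinality requirement above can be illustrated with a minimal Python sketch. It assumes a hypothetical allow-list of tag keys (`ALLOWED_TAG_KEYS`) and hypothetical helper names; none of these come from this standard or from any particular metrics library.

```python
from __future__ import annotations

# Hypothetical allow-list: tag keys whose values form a small, bounded set.
ALLOWED_TAG_KEYS = {"service", "endpoint", "status_class", "region"}


def bounded_tags(raw_tags: dict[str, str]) -> dict[str, str]:
    """Drop any tag whose key is not on the allow-list (e.g. request IDs)."""
    return {key: value for key, value in raw_tags.items() if key in ALLOWED_TAG_KEYS}


def status_class(status_code: int) -> str:
    """Collapse an HTTP status code into a bounded bucket such as '5xx'."""
    return f"{status_code // 100}xx"
```

Unique identifiers such as request IDs are better attached to traces or log records, where each occurrence is stored individually rather than as a new time series.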
## Observability Concepts
-Observability is about understanding how an application performs at runtime, with real-world use cases and data.
+Observability is about understanding how an application performs at
+runtime, with real-world use cases and data.
-There are two ways to gather observability information: instrumenting code which emits data to an aggregator (_Push_), or instrumenting code which exposes the information for an external system to query (_Pull_ / _Poll_).
+There are two ways to gather observability information: instrumenting
+code which emits data to an aggregator (_Push_), or instrumenting code
+which exposes the information for an external system to query (_Pull_ /
+_Poll_).
-Each method has its advantages, but _Push_ allows observability mechanisms to be incorporated within the application itself.
-Using Push mechanisms significantly [reduce the overall complexity][3] of the system.
+Each method has its advantages, but _Push_ allows observability
+mechanisms to be incorporated within the application itself. Using Push
+mechanisms significantly [reduces the overall complexity][3] of the
+system.
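As a rough illustration of the _Push_ model, the sketch below emits a counter to an aggregator over UDP. It assumes a StatsD-compatible aggregator listening on `127.0.0.1:8125`; this standard does not name an aggregator, so the wire format, host, port, and function names are all assumptions made for the example.

```python
import socket


def format_counter(name: str, value: int = 1) -> str:
    """Render a counter in the plain StatsD line format: <name>:<value>|c."""
    return f"{name}:{value}|c"


def push_counter(name: str, value: int = 1,
                 host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget push of one counter to the (assumed) aggregator."""
    payload = format_counter(name, value).encode("ascii")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
```

With _Push_, the application decides when to emit. A _Pull_ design would instead keep the counter in memory and expose it on an endpoint for an external collector to scrape.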
### Definitions
-Certain words have different definitions from their common use in conversation.
-For the purposes of monitoring and alerting, the following definitions apply:
-
-- **Observe**: to understand how an application behaves, with real world use cases.
-- **Monitoring**: the act of collecting information used to _Observe_ an application.
-- **Event**: a record of something which happened, produced by a Monitor.
- - A monitoring _event_ is not the same as events used in other systems such as Databases, Cloud Providers, Apache Kafka, etc.
- - Events are records, not signals, and MUST represent something that actually happened.
- Such occurrences represent decisions made by the Development, SRE, and
- Security teams to highlight _meaningful_ occurrences.
+Certain words have different definitions from their common use in
+conversation. For the purposes of monitoring and alerting, the
+following definitions apply:
+
+- **Observe**: to understand how an application behaves, with real-world
+ use cases.
+- **Monitoring**: the act of collecting information used to _Observe_ an
+ application.
+- **Event**: a record of something which happened, produced by a
+ Monitor.
+ - A monitoring _event_ is not the same as events used in other systems
+ such as Databases, Cloud Providers, Apache Kafka, etc.
+ - Events are records, not signals, and MUST represent something that
+ actually happened. Such occurrences represent decisions made by the
+ Development, SRE, and Security teams to highlight _meaningful_
+ occurrences.
## Types of Monitoring
Also known as the _Pillars of Observability_.
-Monitoring follows three broad categories: _Time Series_ metrics, _Application Performance Monitoring_ (APM) or _Tracing_, and _Logging_.
-Each of the categories has its advantages and disadvantages, but APM tends to have the best results across all three categories, and often costs the least for a volume of data or events.
+Monitoring follows three broad categories: _Time Series_ metrics,
+_Application Performance Monitoring_ (APM) or _Tracing_, and _Logging_.
+Each of the categories has its advantages and disadvantages, but APM
+tends to have the best results across all three categories, and often
+costs the least for a given volume of data or events.
### Time Series
-Time Series metrics are points of data, sometimes aggregated, which MAY answer very simple questions like:
+Time Series metrics are points of data, sometimes aggregated, which MAY
+answer very simple questions like:
- Is the application running?
-- How many requests / transactions per [time value] is the application processing?
-- Does the application need to scale (horizontally) or deploy with more resources?
+- How many requests / transactions per [time value] is the application
+ processing?
+- Does the application need to scale (horizontally) or deploy with more
+ resources?
-Time Series metrics are good at observing trends, handling large volumes of data with minimal infrastructure, and for tracking (normally) slow-changing statistics such as infrastructure and node details.
+Time Series metrics are good at observing trends, handling large volumes
+of data with minimal infrastructure, and for tracking (normally)
+slow-changing statistics such as infrastructure and node details.
-Most Time Series implementations cannot associate data to events beyond key-value tagging.
-Such systems are often insufficient to track intermittent issues or view details regarding particular errors.
+Most Time Series implementations cannot associate data to events beyond
+key-value tagging. Such systems are often insufficient to track
+intermittent issues or view details regarding particular errors.
### Application Performance Monitoring (APM)
-APM data are instrumentation which can operate across code functions, external requests, and across applications.
-APM provides [distributed tracing][4], which can be enhanced on platforms like [NewRelic][5].
-APM can answer all the questions addressed by the Time-Series solution, and additionally:
+APM instrumentation can operate across code functions, external
+requests, and applications. APM provides [distributed tracing][4],
+which can be enhanced on platforms like [New Relic][5]. APM can answer
+all the questions addressed by the Time-Series solution, and
+additionally:
- Am I operating within my SLO/SLA for clients?
- How long does a specific endpoint take to respond?
- How frequently is a specific function being called?
- What functions are taking the most time during a request / operation?
-- What were the exact contents of request and response data during a long or erroneous operation?
-- What [additional attributes][6] were present during an erroneous operation?
+- What were the exact contents of request and response data during a
+ long or erroneous operation?
+- What [additional attributes][6] were present during an erroneous
+ operation?
-APM SHOULD be used to track key indicators (KPIs) and other SLO-related values such as:
+APM SHOULD be used to track key performance indicators (KPIs) and other
+SLO-related values such as:
- Errors and Error Rate.
- Traffic / Throughput rates.
@@ -81,27 +114,37 @@ APM SHOULD be used to track key indicators (KPIs) and other SLO-related values s
### Logging
-Logging is an inherently flexible and robust method of generating event-related information.
-Logging metrics capture event data from a specific point in time during application execution and write to a file (or stream, for collection).
-Logging metrics MAY answer questions like:
+Logging is an inherently flexible and robust method of generating
+event-related information. Logging metrics capture event data from a
+specific point in time during application execution and write to a file
+(or stream, for collection). Logging metrics MAY answer questions like:
- What calculated values were generated by a specific function?
-- What input or payload information was provided for a specific transaction?
-
-Logs are often necessary for auditing purposes, which have security, legal, or regulatory requirements.
-Such requirements supercede declarations here.
-
-- Logging SHOULD be used to capture only events which cannot be captured by APM or Metrics.
- - Suggested events include process startup messages, signal-received hooks, shutdown messages, and reporting failures (such as failure to submit APM/Metrics).
- - Logging MAY _temporarily_ be used to track events which are captured by APM or Metrics during incidents or when debugging.
+- What input or payload information was provided for a specific
+ transaction?
+
+Logs are often necessary for auditing purposes, which have security,
+legal, or regulatory requirements. Such requirements supersede
+declarations here.
+
+- Logging SHOULD be used to capture only events which cannot be captured
+ by APM or Metrics.
+ - Suggested events include process startup messages, signal-received
+ hooks, shutdown messages, and reporting failures (such as failure to
+ submit APM/Metrics).
+ - Logging MAY _temporarily_ be used to track events which are captured
+ by APM or Metrics during incidents or when debugging.
- Logging SHOULD be in a structured format such as JSON.
-- Logging MUST NOT include sensitive data without masking/eliding said data.
+- Logging MUST NOT include sensitive data without masking/eliding said
+ data.
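The logging guidelines above (structured format, single-line output, masked sensitive data) could be combined roughly as follows. This is a sketch using Python's standard `logging` module; the `SENSITIVE_KEYS` set, the `fields` record attribute, and the class name are hypothetical and not prescribed by this standard.

```python
import json
import logging
import traceback

# Hypothetical set of field names treated as sensitive.
SENSITIVE_KEYS = {"password", "token", "ssn"}


def mask(fields: dict) -> dict:
    """Replace the values of sensitive keys with a fixed placeholder."""
    return {k: ("***" if k in SENSITIVE_KEYS else v) for k, v in fields.items()}


class JsonLineFormatter(logging.Formatter):
    """Render each record as one line of JSON, stack trace included."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # 'fields' is a hypothetical attribute supplied via `extra=`.
            "fields": mask(getattr(record, "fields", {})),
        }
        if record.exc_info:
            # json.dumps escapes newlines, so the trace stays on one line.
            entry["stacktrace"] = "".join(
                traceback.format_exception(*record.exc_info)
            )
        return json.dumps(entry)
```

A handler would use it via `handler.setFormatter(JsonLineFormatter())`, and callers would pass structured data as `logger.info("login ok", extra={"fields": {...}})`.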
## Top Metrics
Also known as [the Golden Signals][1].
-Applications MUST have at least one dashboard which displays the following metrics for all services or components, as well as metrics for invoked dependencies.
+Applications MUST have at least one dashboard which displays the
+following metrics for all services or components, as well as metrics for
+invoked dependencies.
### Latency