reflow lines (HEAD, main)

author: Tyler Davis <tyler@gluecode.net> 2025-01-31 06:07:27 +0000
committer: Tyler Davis <tyler@gluecode.net> 2025-01-31 06:07:27 +0000
commit: 97842d43cc06cdd8e298d74ad6705af4348df49a
tree: 0c3f2393c03824968ed16128135ca743f1353286 /docs/monitoring.md
parent: 509fe00d541ce769c4423a072318ad47294b7763
1 files changed, 93 insertions, 50 deletions
diff --git a/docs/monitoring.md b/docs/monitoring.md
index d868672..56e5008 100644
--- a/docs/monitoring.md
+++ b/docs/monitoring.md
@@ -1,78 +1,111 @@
-# Monitoring
+# Monitoring (or Observability-v1)
 
-This standard presents guidelines for providing operational montioring and alerting of applications and platforms for internal teams.
-This standard does not provide guidelines for alerting external Customers or Vendors.
+This standard presents guidelines for providing operational montioring
+and alerting of applications and platforms for internal teams.  This
+standard does not provide guidelines for alerting external Customers or
+Vendors.
 
-Terminology will be introduced throughout this document.
-Much of the terminology is drawn from [the Google SRE Book][2].
-If you are unfamiliar with monitoring and alerting, especially for distributed systems, you SHOULD read [this section][1] of the SRE book.
+Terminology will be introduced throughout this document.  Much of the
+terminology is drawn from [the Google SRE Book][2].  If you are
+unfamiliar with monitoring and alerting, especially for distributed
+systems, you SHOULD read [this section][1] of the SRE book.
 
 ## Requirements
 
 All Applications:
 
 - MUST instrument their primary code paths with APM/tracing.
-- MUST instrument and [alert][7] on the four primary signals: Latency, Traffic, Errors, and Saturation.
-- MUST NOT add unique values to time-series metrics tagging (e.g., request IDs, UUIDs, other high-cardinality values).
+- MUST instrument and [alert][7] on the four primary signals: Latency,
+  Traffic, Errors, and Saturation.
+- MUST NOT add unique values to time-series metrics tagging (e.g.,
+  request IDs, UUIDs, other high-cardinality values).
 - SHOULD NOT use logs to track data captured by metrics or APM.
-- SHOULD consolidate logging statements to a single line (e.g., report stacktraces on a single line, multi-line payloads rendered in single-line format, etc).
+- SHOULD consolidate logging statements to a single line (e.g., report
+  stacktraces on a single line, multi-line payloads rendered in
+  single-line format, etc).
 
 ## Observability Concepts
 
-Observability is about understanding how an application performs at runtime, with real-world use cases and data.
+Observability is about understanding how an application performs at
+runtime, with real-world use cases and data.
 
-There are two ways to gather observability information: instrumenting code which emits data to an aggregator (_Push_), or instrumenting code which exposes the information for an external system to query (_Pull_ / _Poll_).
+There are two ways to gather observability information: instrumenting
+code which emits data to an aggregator (_Push_), or instrumenting code
+which exposes the information for an external system to query (_Pull_ /
+_Poll_).
 
-Each method has its advantages, but _Push_ allows observability mechanisms to be incorporated within the application itself.
-Using Push mechanisms significantly [reduce the overall complexity][3] of the system.
+Each method has its advantages, but _Push_ allows observability
+mechanisms to be incorporated within the application itself.  Using Push
+mechanisms significantly [reduce the overall complexity][3] of the
+system.
 
 ### Definitions
 
-Certain words have different definitions from their common use in conversation.
-For the purposes of monitoring and alerting, the following definitions apply:
-
-- **Observe**: to understand how an application behaves, with real world use cases.
-- **Monitoring**: the act of collecting information used to _Observe_ an application.
-- **Event**: a record of something which happened, produced by a Monitor.
-  - A monitoring _event_ is not the same as events used in other systems such as Databases, Cloud Providers, Apache Kafka, etc.
-  - Events are records, not signals, and MUST represent something that actually happened.
-  Such occurrences represent decisions made by the Development, SRE, and
-  Security teams to highlight _meaningful_ occurrences.
+Certain words have different definitions from their common use in
+conversation.  For the purposes of monitoring and alerting, the
+following definitions apply:
+
+- **Observe**: to understand how an application behaves, with real world
+  use cases.
+- **Monitoring**: the act of collecting information used to _Observe_ an
+  application.
+- **Event**: a record of something which happened, produced by a
+  Monitor.
+  - A monitoring _event_ is not the same as events used in other systems
+    such as Databases, Cloud Providers, Apache Kafka, etc.
+  - Events are records, not signals, and MUST represent something that
+    actually happened.  Such occurrences represent decisions made by the
+    Development, SRE, and Security teams to highlight _meaningful_
+    occurrences.
 
 ## Types of Monitoring
 
 Also known as the _Pillars of Observability_.
 
-Monitoring follows three broad categories: _Time Series_ metrics, _Application Performance Monitoring_ (APM) or _Tracing_, and _Logging_.
-Each of the categories has its advantages and disadvantages, but APM tends to have the best results across all three categories, and often costs the least for a volume of data or events.
+Monitoring follows three broad categories: _Time Series_ metrics,
+_Application Performance Monitoring_ (APM) or _Tracing_, and _Logging_.
+Each of the categories has its advantages and disadvantages, but APM
+tends to have the best results across all three categories, and often
+costs the least for a volume of data or events.
 
 ### Time Series
 
-Time Series metrics are points of data, sometimes aggregated, which MAY answer very simple questions like:
+Time Series metrics are points of data, sometimes aggregated, which MAY
+answer very simple questions like:
 
 - Is the application running?
-- How many requests / transactions per [time value] is the application processing?
-- Does the application need to scale (horizontally) or deploy with more resources?
+- How many requests / transactions per [time value] is the application
+  processing?
+- Does the application need to scale (horizontally) or deploy with more
+  resources?
 
-Time Series metrics are good at observing trends, handling large volumes of data with minimal infrastructure, and for tracking (normally) slow-changing statistics such as infrastructure and node details.
+Time Series metrics are good at observing trends, handling large volumes
+of data with minimal infrastructure, and for tracking (normally)
+slow-changing statistics such as infrastructure and node details.
 
-Most Time Series implementations cannot associate data to events beyond key-value tagging.
-Such systems are often insufficient to track intermittent issues or view details regarding particular errors.
+Most Time Series implementations cannot associate data to events beyond
+key-value tagging.  Such systems are often insufficient to track
+intermittent issues or view details regarding particular errors.
 
 ### Application Performance Monitoring (APM)
 
-APM data are instrumentation which can operate across code functions, external requests, and across applications.
-APM provides [distributed tracing][4], which can be enhanced on platforms like [NewRelic][5].
-APM can answer all the questions addressed by the Time-Series solution, and additionally:
+APM data are instrumentation which can operate across code functions,
+external requests, and across applications.  APM provides [distributed
+tracing][4], which can be enhanced on platforms like [NewRelic][5].  APM
+can answer all the questions addressed by the Time-Series solution, and
+additionally:
 
 - Am I operating within my SLO/SLA for clients?
 - How long does a specific endpoint take to respond?
 - How frequently is a specific function being called?
 - What functions are taking the most time during a request / operation?
-- What were the exact contents of request and response data during a long or erroneous operation?
-- What [additional attributes][6] were present during an erroneous operation?
+- What were the exact contents of request and response data during a
+  long or erroneous operation?
+- What [additional attributes][6] were present during an erroneous
+  operation?
 
-APM SHOULD be used to track key indicators (KPIs) and other SLO-related values such as:
+APM SHOULD be used to track key indicators (KPIs) and other SLO-related
+values such as:
 
 - Errors and Error Rate.
 - Traffic / Throughput rates.
@@ -81,27 +114,37 @@ APM SHOULD be used to track key indicators (KPIs) and other SLO-related values s
 
 ### Logging
 
-Logging is an inherently flexible and robust method of generating event-related information.
-Logging metrics capture event data from a specific point in time during application execution and write to a file (or stream, for collection).
-Logging metrics MAY answer questions like:
+Logging is an inherently flexible and robust method of generating
+event-related information.  Logging metrics capture event data from a
+specific point in time during application execution and write to a file
+(or stream, for collection).  Logging metrics MAY answer questions like:
 
 - What calculated values were generated by a specific function?
-- What input or payload information was provided for a specific transaction?
-
-Logs are often necessary for auditing purposes, which have security, legal, or regulatory requirements.
-Such requirements supercede declarations here.
-
-- Logging SHOULD be used to capture only events which cannot be captured by APM or Metrics.
-  - Suggested events include process startup messages, signal-received hooks, shutdown messages, and reporting failures (such as failure to submit APM/Metrics).
-  - Logging MAY _temporarily_ be used to track events which are captured by APM or Metrics during incidents or when debugging.
+- What input or payload information was provided for a specific
+  transaction?
+
+Logs are often necessary for auditing purposes, which have security,
+legal, or regulatory requirements.  Such requirements supercede
+declarations here.
+
+- Logging SHOULD be used to capture only events which cannot be captured
+  by APM or Metrics.
+  - Suggested events include process startup messages, signal-received
+    hooks, shutdown messages, and reporting failures (such as failure to
+    submit APM/Metrics).
+  - Logging MAY _temporarily_ be used to track events which are captured
+    by APM or Metrics during incidents or when debugging.
 - Logging SHOULD be in a structured format such as JSON.
-- Logging MUST NOT include sensitive data without masking/eliding said data.
+- Logging MUST NOT include sensitive data without masking/eliding said
+  data.
 
 ## Top Metrics
 
 Also known as [the Golden Signals][1].
 
-Applications MUST have at least one dashboard which displays the following metrics for all services or components, as well as metrics for invoked dependencies.
+Applications MUST have at least one dashboard which displays the
+following metrics for all services or components, as well as metrics for
+invoked dependencies.
 
 ### Latency