Diffstat (limited to 'docs/monitoring.md')
-rw-r--r-- | docs/monitoring.md | 143 |
1 files changed, 93 insertions, 50 deletions
diff --git a/docs/monitoring.md b/docs/monitoring.md
index d868672..56e5008 100644
--- a/docs/monitoring.md
+++ b/docs/monitoring.md
@@ -1,78 +1,111 @@
-# Monitoring
+# Monitoring (or Observability-v1)
 
-This standard presents guidelines for providing operational montioring and alerting of applications and platforms for internal teams.
-This standard does not provide guidelines for alerting external Customers or Vendors.
+This standard presents guidelines for providing operational montioring
+and alerting of applications and platforms for internal teams. This
+standard does not provide guidelines for alerting external Customers or
+Vendors.
 
-Terminology will be introduced throughout this document.
-Much of the terminology is drawn from [the Google SRE Book][2].
-If you are unfamiliar with monitoring and alerting, especially for distributed systems, you SHOULD read [this section][1] of the SRE book.
+Terminology will be introduced throughout this document. Much of the
+terminology is drawn from [the Google SRE Book][2]. If you are
+unfamiliar with monitoring and alerting, especially for distributed
+systems, you SHOULD read [this section][1] of the SRE book.
 
 ## Requirements
 
 All Applications:
 
 - MUST instrument their primary code paths with APM/tracing.
-- MUST instrument and [alert][7] on the four primary signals: Latency, Traffic, Errors, and Saturation.
-- MUST NOT add unique values to time-series metrics tagging (e.g., request IDs, UUIDs, other high-cardinality values).
+- MUST instrument and [alert][7] on the four primary signals: Latency,
+  Traffic, Errors, and Saturation.
+- MUST NOT add unique values to time-series metrics tagging (e.g.,
+  request IDs, UUIDs, other high-cardinality values).
 - SHOULD NOT use logs to track data captured by metrics or APM.
-- SHOULD consolidate logging statements to a single line (e.g., report stacktraces on a single line, multi-line payloads rendered in single-line format, etc).
+- SHOULD consolidate logging statements to a single line (e.g., report
+  stacktraces on a single line, multi-line payloads rendered in
+  single-line format, etc).
 
 ## Observability Concepts
 
-Observability is about understanding how an application performs at runtime, with real-world use cases and data.
+Observability is about understanding how an application performs at
+runtime, with real-world use cases and data.
 
-There are two ways to gather observability information: instrumenting code which emits data to an aggregator (_Push_), or instrumenting code which exposes the information for an external system to query (_Pull_ / _Poll_).
+There are two ways to gather observability information: instrumenting
+code which emits data to an aggregator (_Push_), or instrumenting code
+which exposes the information for an external system to query (_Pull_ /
+_Poll_).
 
-Each method has its advantages, but _Push_ allows observability mechanisms to be incorporated within the application itself.
-Using Push mechanisms significantly [reduce the overall complexity][3] of the system.
+Each method has its advantages, but _Push_ allows observability
+mechanisms to be incorporated within the application itself. Using Push
+mechanisms significantly [reduce the overall complexity][3] of the
+system.
 
 ### Definitions
 
-Certain words have different definitions from their common use in conversation.
-For the purposes of monitoring and alerting, the following definitions apply:
-
-- **Observe**: to understand how an application behaves, with real world use cases.
-- **Monitoring**: the act of collecting information used to _Observe_ an application.
-- **Event**: a record of something which happened, produced by a Monitor.
-  - A monitoring _event_ is not the same as events used in other systems such as Databases, Cloud Providers, Apache Kafka, etc.
-  - Events are records, not signals, and MUST represent something that actually happened.
-    Such occurrences represent decisions made by the Development, SRE, and
-    Security teams to highlight _meaningful_ occurrences.
+Certain words have different definitions from their common use in
+conversation. For the purposes of monitoring and alerting, the
+following definitions apply:
+
+- **Observe**: to understand how an application behaves, with real world
+  use cases.
+- **Monitoring**: the act of collecting information used to _Observe_ an
+  application.
+- **Event**: a record of something which happened, produced by a
+  Monitor.
+  - A monitoring _event_ is not the same as events used in other systems
+    such as Databases, Cloud Providers, Apache Kafka, etc.
+  - Events are records, not signals, and MUST represent something that
+    actually happened. Such occurrences represent decisions made by the
+    Development, SRE, and Security teams to highlight _meaningful_
+    occurrences.
 
 ## Types of Monitoring
 
 Also known as the _Pillars of Observability_.
 
-Monitoring follows three broad categories: _Time Series_ metrics, _Application Performance Monitoring_ (APM) or _Tracing_, and _Logging_.
-Each of the categories has its advantages and disadvantages, but APM tends to have the best results across all three categories, and often costs the least for a volume of data or events.
+Monitoring follows three broad categories: _Time Series_ metrics,
+_Application Performance Monitoring_ (APM) or _Tracing_, and _Logging_.
+Each of the categories has its advantages and disadvantages, but APM
+tends to have the best results across all three categories, and often
+costs the least for a volume of data or events.
 
 ### Time Series
 
-Time Series metrics are points of data, sometimes aggregated, which MAY answer very simple questions like:
+Time Series metrics are points of data, sometimes aggregated, which MAY
+answer very simple questions like:
 
 - Is the application running?
-- How many requests / transactions per [time value] is the application processing?
-- Does the application need to scale (horizontally) or deploy with more resources?
+- How many requests / transactions per [time value] is the application
+  processing?
+- Does the application need to scale (horizontally) or deploy with more
+  resources?
 
-Time Series metrics are good at observing trends, handling large volumes of data with minimal infrastructure, and for tracking (normally) slow-changing statistics such as infrastructure and node details.
+Time Series metrics are good at observing trends, handling large volumes
+of data with minimal infrastructure, and for tracking (normally)
+slow-changing statistics such as infrastructure and node details.
 
-Most Time Series implementations cannot associate data to events beyond key-value tagging.
-Such systems are often insufficient to track intermittent issues or view details regarding particular errors.
+Most Time Series implementations cannot associate data to events beyond
+key-value tagging. Such systems are often insufficient to track
+intermittent issues or view details regarding particular errors.
 
 ### Application Performance Monitoring (APM)
 
-APM data are instrumentation which can operate across code functions, external requests, and across applications.
-APM provides [distributed tracing][4], which can be enhanced on platforms like [NewRelic][5].
-APM can answer all the questions addressed by the Time-Series solution, and additionally:
+APM data are instrumentation which can operate across code functions,
+external requests, and across applications. APM provides [distributed
+tracing][4], which can be enhanced on platforms like [NewRelic][5]. APM
+can answer all the questions addressed by the Time-Series solution, and
+additionally:
 
 - Am I operating within my SLO/SLA for clients?
 - How long does a specific endpoint take to respond?
 - How frequently is a specific function being called?
 - What functions are taking the most time during a request / operation?
-- What were the exact contents of request and response data during a long or erroneous operation?
-- What [additional attributes][6] were present during an erroneous operation?
+- What were the exact contents of request and response data during a
+  long or erroneous operation?
+- What [additional attributes][6] were present during an erroneous
+  operation?
 
-APM SHOULD be used to track key indicators (KPIs) and other SLO-related values such as:
+APM SHOULD be used to track key indicators (KPIs) and other SLO-related
+values such as:
 
 - Errors and Error Rate.
 - Traffic / Throughput rates.
@@ -81,27 +114,37 @@ APM SHOULD be used to track key indicators (KPIs) and other SLO-related values s
 
 ### Logging
 
-Logging is an inherently flexible and robust method of generating event-related information.
-Logging metrics capture event data from a specific point in time during application execution and write to a file (or stream, for collection).
-Logging metrics MAY answer questions like:
+Logging is an inherently flexible and robust method of generating
+event-related information. Logging metrics capture event data from a
+specific point in time during application execution and write to a file
+(or stream, for collection). Logging metrics MAY answer questions like:
 
 - What calculated values were generated by a specific function?
-- What input or payload information was provided for a specific transaction?
-
-Logs are often necessary for auditing purposes, which have security, legal, or regulatory requirements.
-Such requirements supercede declarations here.
-
-- Logging SHOULD be used to capture only events which cannot be captured by APM or Metrics.
-  - Suggested events include process startup messages, signal-received hooks, shutdown messages, and reporting failures (such as failure to submit APM/Metrics).
-  - Logging MAY _temporarily_ be used to track events which are captured by APM or Metrics during incidents or when debugging.
+- What input or payload information was provided for a specific
+  transaction?
+
+Logs are often necessary for auditing purposes, which have security,
+legal, or regulatory requirements. Such requirements supercede
+declarations here.
+
+- Logging SHOULD be used to capture only events which cannot be captured
+  by APM or Metrics.
+  - Suggested events include process startup messages, signal-received
+    hooks, shutdown messages, and reporting failures (such as failure to
+    submit APM/Metrics).
+  - Logging MAY _temporarily_ be used to track events which are captured
+    by APM or Metrics during incidents or when debugging.
 - Logging SHOULD be in a structured format such as JSON.
-- Logging MUST NOT include sensitive data without masking/eliding said data.
+- Logging MUST NOT include sensitive data without masking/eliding said
+  data.
 
 ## Top Metrics
 
 Also known as [the Golden Signals][1].
 
-Applications MUST have at least one dashboard which displays the following metrics for all services or components, as well as metrics for invoked dependencies.
+Applications MUST have at least one dashboard which displays the
+following metrics for all services or components, as well as metrics for
+invoked dependencies.
 
 ### Latency
 