Observability For Noobs

Intro

Monitoring vs Observability

Monitoring

Monitoring requires you to know what you are looking for in advance.

Monitoring is the process of collecting data about a system and its components at regular intervals. This data is then used to determine the health of the system.

Monitoring allows you to see the health of individual components of a system, but it cannot tell you why a component is unhealthy.

Observability

Observability is the ability to ask questions of your system that you didn't know you needed to ask in advance.

"Observability is the practice of instrumenting systems to gather actionable data depicting not only when and where an issue occurred, but—more importantly—why it occurred."

Observability allows you to investigate "unknown unknowns" on demand.

Monitoring

Monitoring is the process of collecting data about a system and its components at regular intervals. This data is then used to determine the health of the system.

The goal of monitoring is to determine:

Is the service online?
Is the service functioning?
Is the service performing as expected?

Monitoring Methods

There are a number of different methods for monitoring a system. The method used depends on the target system you are monitoring.

RED Method

The RED method is ideal for monitoring request driven systems such as web applications at the services layer.

The RED method is named for the metrics it monitors:

Rate - The number of requests per second (Throughput).
Errors - Failed Requests.
Duration - The time it takes to process a request (Latency).
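
As a rough illustration of the three RED metrics, the sketch below summarises one window of requests. The `Request` record, the `red_metrics` function and the choice to count only 5xx responses as failures are assumptions made for this example, not part of any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """One handled request (hypothetical record shape for this sketch)."""
    duration_ms: float  # time taken to process the request
    status: int         # HTTP status code returned


def red_metrics(requests: list[Request], window_seconds: float) -> dict[str, float]:
    """Summarise a window of requests as Rate, Errors and Duration."""
    total = len(requests)
    # Assumption: only 5xx responses count as failed requests.
    failed = sum(1 for r in requests if r.status >= 500)
    durations = sorted(r.duration_ms for r in requests)
    if durations:
        # Nearest-rank 95th percentile, using integer arithmetic for the rank.
        rank = (95 * total + 99) // 100
        p95 = durations[rank - 1]
    else:
        p95 = 0.0
    return {
        "rate_per_second": total / window_seconds,
        "error_ratio": failed / total if total else 0.0,
        "duration_p95_ms": p95,
    }


# Four requests observed over a two second window.
window = [Request(120.0, 200), Request(80.0, 200), Request(950.0, 500), Request(100.0, 200)]
summary = red_metrics(window, window_seconds=2.0)
print(summary)
```

Duration is reported as a percentile rather than a mean here because a single slow request can hide behind a healthy-looking average.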

USE Method

The USE method is ideal for monitoring resource driven systems at the infrastructure layer such as network devices, virtual machines and databases.

The USE method is named for the metrics it monitors:

Utilization - The level of resource usage as time or percentage, e.g. % CPU or disk I/O.
Saturation - The degree to which a resource has reached its capacity.
Errors - The number of errors that occur.
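
The USE metrics can be sketched in the same way for a single resource. The `ResourceSample` shape and the use of queue length as the saturation signal are assumptions for this example; real systems expose these values in different forms.

```python
from dataclasses import dataclass


@dataclass
class ResourceSample:
    """One reading of a resource (hypothetical shape for this sketch)."""
    busy_fraction: float  # share of the interval the resource was busy, 0.0 to 1.0
    queue_length: int     # work waiting for the resource, used here as the saturation signal
    error_count: int      # errors reported during the interval


def use_metrics(samples: list[ResourceSample]) -> dict[str, float]:
    """Summarise samples as Utilization, Saturation and Errors."""
    if not samples:
        raise ValueError("at least one sample is required")
    count = len(samples)
    return {
        "utilization_pct": 100 * sum(s.busy_fraction for s in samples) / count,
        "saturation_avg_queue": sum(s.queue_length for s in samples) / count,
        "errors_total": sum(s.error_count for s in samples),
    }


cpu_samples = [
    ResourceSample(busy_fraction=0.5, queue_length=0, error_count=0),
    ResourceSample(busy_fraction=0.75, queue_length=1, error_count=0),
    ResourceSample(busy_fraction=1.0, queue_length=5, error_count=2),
]
cpu_summary = use_metrics(cpu_samples)
print(cpu_summary)
```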

Four Golden Signals

The Four Golden Signals are a set of metrics that are used to monitor the health of a service. They originally came out of Google's Site Reliability Engineering (SRE) book.

The Four Golden Signals overlap with both the RED and USE methods, so they cover the Service and Infrastructure layers: Latency, Traffic and Errors correspond to RED, while Saturation also appears in USE.

The Four Golden Signals are:

Latency - The time it takes to process a request.
Traffic (Throughput) - The number of requests per second.
Errors - The number of failed requests.
Saturation - The degree to which a resource has reached its capacity.

Core Web Vitals

Core Web Vitals are a set of metrics that are used to measure the user experience (UX) of a web page. The Core Web Vitals are:

Largest Contentful Paint (LCP) - The time it takes for the largest visible element on the page to render. (Perceived page load)
First Input Delay (FID) - The delay between the first user interaction and the page starting to respond to it. (Perceived responsiveness) Google replaced FID with Interaction to Next Paint (INP) as a Core Web Vital in March 2024.
Cumulative Layout Shift (CLS) - The amount of unexpected layout shift that occurs during the page load. (Perceived stability)
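
To show how these measurements are usually read, the sketch below rates a value as good, needs improvement or poor. The thresholds are the ones Google has published for LCP, FID and CLS; treat them as assumptions that may change, and note that `THRESHOLDS` and `rate_vital` are names made up for this example.

```python
# (upper bound for "good", upper bound for "needs improvement") per metric.
THRESHOLDS = {
    "LCP": (2.5, 4.0),      # seconds
    "FID": (100.0, 300.0),  # milliseconds
    "CLS": (0.1, 0.25),     # unitless layout shift score
}


def rate_vital(name: str, value: float) -> str:
    """Classify one measurement against the thresholds for its metric."""
    good_limit, poor_limit = THRESHOLDS[name]
    if value <= good_limit:
        return "good"
    if value <= poor_limit:
        return "needs improvement"
    return "poor"


# Hypothetical measurements for one page view.
page_view = {"LCP": 2.1, "FID": 180.0, "CLS": 0.3}
ratings = {name: rate_vital(name, value) for name, value in page_view.items()}
print(ratings)
```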

Observability

Telemetry Data Types

Telemetry data is data that is collected from a system. Observability requires the collection of a lot of telemetry data.

There are four distinct types of telemetry data, identified by the MELT acronym: Metrics (M), Events (E), Logs (L) and Traces (T).

Metrics

Metrics are numeric values that are recorded at regular intervals, e.g. the total number of requests per second.

Metrics give you an aggregate view of a data point. They do not carry detailed information about the individual occurrences behind it.

Events

Events are discrete occurrences that a system records at a specific point in time. Each event carries a timestamp and a payload of event data, e.g. a user completing a checkout or a deployment finishing.

Events are most useful for troubleshooting issues in the moment or recent past.

Events are high-fidelity so they can be expensive to store for long periods of time.
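
The difference between high-fidelity events and aggregate metrics can be sketched by collapsing a few events into a per-minute count. The `Event` shape, the event names and `count_per_minute` are hypothetical and only serve this illustration.

```python
from collections import Counter
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Event:
    """A discrete occurrence with a timestamp and a payload (hypothetical shape)."""
    name: str
    timestamp: datetime
    payload: dict = field(default_factory=dict)


def count_per_minute(events: list[Event]) -> Counter:
    """Collapse events into a metric: the number of events in each minute."""
    return Counter(e.timestamp.replace(second=0, microsecond=0) for e in events)


events = [
    Event("order_placed", datetime(2024, 1, 1, 12, 0, 5), {"order_id": "A-1001", "total": 19.99}),
    Event("order_placed", datetime(2024, 1, 1, 12, 0, 40), {"order_id": "A-1002", "total": 5.00}),
    Event("order_placed", datetime(2024, 1, 1, 12, 1, 10), {"order_id": "A-1003", "total": 42.50}),
]
orders_per_minute = count_per_minute(events)
for minute, count in sorted(orders_per_minute.items()):
    print(minute.isoformat(), count)
```

The events keep every payload, which is why they cost more to store; the resulting metric keeps only one number per minute.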

Logs

Logs are a record of events that have occurred in a system. They can be either structured (e.g. JSON) or unstructured (e.g. plain-text syslog lines).

Logs are useful to troubleshoot issues that have occurred in the past. They show events that occurred leading up to an issue.
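
To make the structured versus unstructured distinction concrete, the sketch below formats the same log record both ways using Python's standard `logging` module. The `JsonFormatter` class, the logger name and the message are made up for this example.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render a log record as one JSON object per line (structured)."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


# Build a record by hand so the example does not depend on logger configuration.
record = logging.LogRecord(
    name="checkout", level=logging.ERROR, pathname="example.py", lineno=0,
    msg="payment failed for order %s", args=("A-1001",), exc_info=None,
)

# Unstructured: a free-form text line that has to be parsed with patterns.
unstructured = logging.Formatter("%(levelname)s %(name)s: %(message)s").format(record)
# Structured: named fields that can be queried directly.
structured = JsonFormatter().format(record)

print(unstructured)
print(structured)
```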

Traces

Traces link a series of events that have occurred across different components of the system. They are used to track a flow as it moves through the system, e.g. a request from a user.

Traces are made up of a series of spans. Each span represents a unit of work done by a component of the system that the flow has passed through.
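
A minimal sketch of how spans fit together into a trace is shown below. The `Span` fields and the component names are assumptions for illustration; tracing systems typically record more, but the idea of a shared trace ID and a parent link between spans is the core of it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """One unit of work within a trace (hypothetical shape for this sketch)."""
    trace_id: str             # shared by every span in the same flow
    span_id: str              # unique to this span
    parent_id: Optional[str]  # None for the root span
    component: str
    duration_ms: float


def render_trace(spans: list[Span]) -> list[str]:
    """Return the spans as an indented tree, following parent links."""
    children: dict[Optional[str], list[Span]] = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)

    lines: list[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        for span in children.get(parent_id, []):
            lines.append(f"{'  ' * depth}{span.component} ({span.duration_ms} ms)")
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


trace = [
    Span("t1", "s1", None, "api-gateway", 120.0),
    Span("t1", "s2", "s1", "order-service", 90.0),
    Span("t1", "s3", "s2", "database", 40.0),
]
tree = render_trace(trace)
print("\n".join(tree))
```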

Alerting

Mean Time To Detection (MTTD)

Mean Time To Detection (MTTD) is the average time it takes between an issue occurring and the issue being detected.

Mean Time To Resolution (MTTR)

Mean Time To Resolution (MTTR) is the average time it takes between an issue being detected and the issue being resolved. This includes the time it takes to troubleshoot and correct the issue.
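
Both averages can be computed from incident timestamps. The sketch below follows the definitions above, so MTTR is measured from detection to resolution; the `Incident` record and the sample timestamps are made up for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Timestamps for one incident (hypothetical record shape)."""
    occurred_at: datetime
    detected_at: datetime
    resolved_at: datetime


def mean_time_to_detection(incidents: list[Incident]) -> timedelta:
    """Average gap between an issue occurring and being detected."""
    total = sum((i.detected_at - i.occurred_at for i in incidents), timedelta())
    return total / len(incidents)


def mean_time_to_resolution(incidents: list[Incident]) -> timedelta:
    """Average gap between an issue being detected and being resolved."""
    total = sum((i.resolved_at - i.detected_at for i in incidents), timedelta())
    return total / len(incidents)


incidents = [
    Incident(datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 5), datetime(2024, 1, 1, 10, 35)),
    Incident(datetime(2024, 1, 2, 14, 0), datetime(2024, 1, 2, 14, 15), datetime(2024, 1, 2, 15, 5)),
]
mttd = mean_time_to_detection(incidents)
mttr = mean_time_to_resolution(incidents)
print("MTTD:", mttd)
print("MTTR:", mttr)
```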

Instrumentation

Instrumentation is the process of adding code to a system to collect telemetry data. The type of system determines the type of instrumentation required.
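
As a small illustration of adding code to collect telemetry, the sketch below wraps a function in a decorator that records how long each call takes. `instrumented`, `CALL_DURATIONS` and `handle_request` are hypothetical names; a real system would normally send these measurements to a telemetry backend rather than keep them in memory.

```python
import time
from collections import defaultdict
from functools import wraps

# In-memory store of call durations in seconds, keyed by function name.
CALL_DURATIONS: dict[str, list[float]] = defaultdict(list)


def instrumented(func):
    """Record the duration of every call to the wrapped function."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Runs whether the call succeeded or raised.
            CALL_DURATIONS[func.__name__].append(time.perf_counter() - start)
    return wrapper


@instrumented
def handle_request(user_id: int) -> str:
    return f"hello user {user_id}"


response = handle_request(42)
print(response, len(CALL_DURATIONS["handle_request"]))
```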