Introduction
Monitoring, Logging, and Remediation is the largest domain on the DOP-C02 exam at 26%. It's also the domain where candidates lose the most points. The exam doesn't just test whether you know what CloudWatch does — it tests whether you can design monitoring architectures, write Logs Insights queries, and build automated remediation workflows.
This guide covers every monitoring and observability topic you need for DOP-C02.
CloudWatch Metrics
Standard vs. Custom Metrics
Standard metrics are automatically published by AWS services:
- EC2: CPUUtilization, NetworkIn/Out, DiskReadOps (hypervisor level only)
- ALB: RequestCount, TargetResponseTime, HTTPCode_Target_5XX_Count
- RDS: DatabaseConnections, FreeStorageSpace, ReadLatency
- SQS: ApproximateNumberOfMessagesVisible, ApproximateAgeOfOldestMessage
Custom metrics are published by your application using the PutMetricData API:
- Application-specific metrics (request latency, queue depth per instance, error rates)
- OS-level metrics via CloudWatch Agent (memory utilization, disk usage)
- Business metrics (orders processed, user sign-ups)
Exam-critical detail: EC2 standard metrics do NOT include memory utilization or disk space. These require the CloudWatch Agent publishing custom metrics. This is a common exam question.
Metric Resolution
- Standard resolution: 1-minute intervals (default for most AWS services)
- High resolution: 1-second intervals (custom metrics only, via PutMetricData with StorageResolution=1)
When to use high resolution: Real-time scaling decisions, latency-sensitive applications, short-duration spikes that would be averaged out in 1-minute intervals.
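Custom and high-resolution metrics come together in the PutMetricData request. The sketch below builds the request parameters for a hypothetical "OrderLatency" metric (the namespace, metric name, and dimension are illustrative assumptions, not AWS defaults); with boto3 the dict would be passed straight to `put_metric_data`.

```python
# Sketch: request parameters for CloudWatch PutMetricData, assuming a
# hypothetical "OrderLatency" custom metric. StorageResolution=1 marks
# it as high resolution (1-second datapoints); use 60 (or omit it) for
# standard resolution.
def build_put_metric_data_params(value_ms: float, instance_id: str) -> dict:
    return {
        "Namespace": "MyApp",  # custom namespaces must not start with "AWS/"
        "MetricData": [
            {
                "MetricName": "OrderLatency",
                "Dimensions": [
                    {"Name": "InstanceId", "Value": instance_id},
                ],
                "Value": value_ms,
                "Unit": "Milliseconds",
                "StorageResolution": 1,  # 1 = high resolution, 60 = standard
            }
        ],
    }

params = build_put_metric_data_params(42.5, "i-0abc123")
# With boto3 this would be: boto3.client("cloudwatch").put_metric_data(**params)
```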
Metric Math
Metric Math lets you create calculated metrics from existing ones without publishing new data.
Common exam patterns:
# Error rate as percentage (m1 and m2 are metric IDs in the same request)
(m1 / m2) * 100
# Anomaly detection band
ANOMALY_DETECTION_BAND(m1, 2)
# Sum across dimensions
SUM(METRICS("RequestCount"))
Exam use cases:
- Calculate error rate from error count and total request count
- Compute per-instance queue depth (queue size / instance count)
- Create normalized metrics for comparison across different-sized fleets
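In the GetMetricData API, metric math appears as an `Expression` query alongside the source metrics. The sketch below builds the query list for the error-rate use case; the metric names (`Errors`, `RequestCount`) and namespace `MyApp` are illustrative assumptions.

```python
# Sketch: GetMetricData queries that compute an error-rate percentage
# via a metric-math Expression instead of publishing a new metric.
def error_rate_queries(namespace: str, period: int = 300) -> list:
    def metric(query_id, name):
        return {
            "Id": query_id,
            "MetricStat": {
                "Metric": {"Namespace": namespace, "MetricName": name},
                "Period": period,
                "Stat": "Sum",
            },
            "ReturnData": False,  # only return the derived expression
        }
    return [
        metric("m1", "Errors"),
        metric("m2", "RequestCount"),
        {"Id": "errorRate", "Expression": "(m1 / m2) * 100",
         "Label": "Error rate (%)"},
    ]

queries = error_rate_queries("MyApp")
# boto3: cloudwatch.get_metric_data(MetricDataQueries=queries,
#                                   StartTime=..., EndTime=...)
```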
CloudWatch Alarms
Alarm States
- OK: Metric is within the defined threshold
- ALARM: Metric breached the threshold
- INSUFFICIENT_DATA: Not enough data points to evaluate
Alarm Configuration
Key parameters the exam tests:
- Period: The length of each evaluation period in seconds (e.g., 60, 300)
- Evaluation Periods: The number of most recent periods (N) examined when evaluating the alarm
- Datapoints to Alarm: How many of those N periods must be in breach to trigger the alarm (M of N)
- Statistic: Average, Sum, Minimum, Maximum, p99, etc.
- Comparison Operator: Greater than, less than, etc.
Example: “Alarm if average CPU exceeds 80% for 3 out of 5 consecutive 5-minute periods” translates to:
- Period: 300 seconds
- Evaluation Periods: 5
- Datapoints to Alarm: 3
- Statistic: Average
- Threshold: 80
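The M-of-N evaluation above can be simulated locally. This toy function (not an AWS API) mirrors how CloudWatch counts breaching datapoints in the last N periods, assuming a GreaterThanThreshold comparison:

```python
# Sketch: local simulation of CloudWatch "M out of N" alarm evaluation.
# CloudWatch looks at the last N periods and alarms when at least M of
# them breach the threshold.
def alarm_state(datapoints, threshold, datapoints_to_alarm, evaluation_periods):
    window = datapoints[-evaluation_periods:]  # last N periods
    if len(window) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    # GreaterThanThreshold comparison
    breaching = sum(1 for d in window if d > threshold)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"

# Average CPU over the last five 5-minute periods, M=3, N=5:
print(alarm_state([85, 70, 90, 82, 75], 80, 3, 5))  # → ALARM (3 breaches)
print(alarm_state([85, 70, 60, 82, 75], 80, 3, 5))  # → OK (only 2 breaches)
```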
Composite Alarms
Composite alarms combine multiple alarms using AND/OR logic.
Use when:
- You want to alert only when multiple conditions are true simultaneously
- You need to reduce alarm noise (e.g., alert only when BOTH CPU is high AND request latency is elevated)
- You want a single alarm to represent overall system health
Exam example: “Alert the on-call team only when the application error rate exceeds 5% AND the database connection count exceeds 90% of the maximum.”
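That exam example maps directly onto a composite alarm's `AlarmRule` expression. The sketch below shows the parameters for `put_composite_alarm`; the child alarm names and SNS topic ARN are hypothetical placeholders.

```python
# Sketch: parameters for CloudWatch put_composite_alarm implementing
# "error rate high AND DB connections high". Child alarm names
# (high-error-rate, db-connections-high) and the topic ARN are
# illustrative assumptions.
composite = {
    "AlarmName": "app-degraded",
    "AlarmRule": "ALARM(high-error-rate) AND ALARM(db-connections-high)",
    "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:oncall"],
}
# boto3: boto3.client("cloudwatch").put_composite_alarm(**composite)
```

The rule fires only when both child alarms are in ALARM state, which is exactly the noise-reduction behavior the exam describes.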
Alarm Actions
Alarms can trigger:
- SNS notifications — Email, SMS, Lambda functions
- Auto Scaling policies — Scale up or scale down
- EC2 actions — Stop, terminate, reboot, or recover instances
- Systems Manager actions — Run automation documents
CloudWatch Logs
Architecture
- Log Groups: Containers for log streams (e.g., /aws/lambda/my-function)
- Log Streams: Sequences of log events from a single source (e.g., one Lambda execution environment)
- Log Events: Individual log entries with timestamps
Log Retention
Log groups have configurable retention periods:
- 1 day to 10 years, or indefinite
- Default: Indefinite (logs never expire)
- Exam tip: Set retention policies to control costs. Indefinite retention on high-volume logs is expensive.
Metric Filters
Metric filters extract metric data from log events. They scan log data for specific patterns and increment a CloudWatch metric when a match is found.
Example: Create a metric filter on your application log group that counts lines containing “ERROR.” The resulting metric can drive a CloudWatch alarm.
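Conceptually, the filter scans each ingested line and increments a metric on every match. This local sketch (plain Python, not an AWS API) mirrors that behavior for the "ERROR" example:

```python
# Sketch: what the "ERROR" metric filter does conceptually — scan log
# lines for a term and emit a count that becomes a metric datapoint
# for the evaluation period.
def count_matches(log_lines, term="ERROR"):
    return sum(1 for line in log_lines if term in line)

lines = [
    "INFO request handled in 12ms",
    "ERROR: database timeout",
    "ERROR: database timeout",
    "WARN retrying",
]
print(count_matches(lines))  # → 2
```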
Filter pattern syntax the exam tests:
- "ERROR" — Match lines containing the word ERROR
- [ip, user, timestamp, request, status_code = 5*, bytes] — Match space-delimited log lines with 5xx status codes
- { $.statusCode = 500 } — Match JSON logs where statusCode is 500
Subscription Filters
Subscription filters stream log data in real-time to:
- Lambda functions — For processing or transformation
- Kinesis Data Streams — For real-time analytics
- Kinesis Data Firehose — For delivery to S3, Redshift, or OpenSearch
- Another CloudWatch Logs destination — For cross-account log aggregation
Exam pattern: Centralized logging architecture uses subscription filters to stream logs from multiple accounts to a central account’s Kinesis Data Firehose, which delivers to S3.
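When the subscription target is a Lambda function, the function receives the log batch base64-encoded and gzip-compressed under `event["awslogs"]["data"]`. The sketch below decodes it; the synthetic payload stands in for a real delivery.

```python
# Sketch: decoding the event a Lambda target receives from a
# CloudWatch Logs subscription filter.
import base64
import gzip
import json

def decode_subscription_event(event: dict) -> dict:
    raw = base64.b64decode(event["awslogs"]["data"])
    return json.loads(gzip.decompress(raw))

# Synthetic payload standing in for a real subscription delivery:
inner = {
    "logGroup": "/aws/lambda/my-function",
    "logEvents": [{"id": "1", "timestamp": 0, "message": "ERROR: boom"}],
}
fake_event = {"awslogs": {"data": base64.b64encode(
    gzip.compress(json.dumps(inner).encode())).decode()}}

decoded = decode_subscription_event(fake_event)
print(decoded["logEvents"][0]["message"])  # → ERROR: boom
```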
CloudWatch Logs Insights
Logs Insights is a query language for analyzing log data. DOP-C02 expects you to understand query syntax and common patterns.
Essential Query Patterns
Find the most recent errors:
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 20
Count errors by type:
fields @message
| filter @message like /ERROR/
| parse @message "ERROR: *" as errorType
| stats count(*) by errorType
| sort count(*) desc
Identify top 10 slowest requests:
fields @timestamp, @duration
| sort @duration desc
| limit 10
Average latency over time (for dashboards):
stats avg(@duration) as avgLatency by bin(5m)
Find requests from a specific IP:
fields @timestamp, @message
| filter @message like /192.168.1.100/
| sort @timestamp desc
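To make the `parse`/`stats` pipeline concrete, here is the "count errors by type" query reproduced locally in plain Python — each step of the toy function maps to one Insights command (this is a mental model, not how Insights executes):

```python
# Sketch: local equivalent of
#   filter @message like /ERROR/
#   | parse @message "ERROR: *" as errorType
#   | stats count(*) by errorType | sort count(*) desc
from collections import Counter

def count_errors_by_type(messages):
    counts = Counter()
    for msg in messages:
        if "ERROR: " in msg:                         # filter
            error_type = msg.split("ERROR: ", 1)[1]  # parse ... as errorType
            counts[error_type] += 1
    return counts.most_common()                      # stats + sort desc

logs = ["ERROR: timeout", "INFO ok", "ERROR: timeout", "ERROR: throttled"]
print(count_errors_by_type(logs))  # → [('timeout', 2), ('throttled', 1)]
```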
Key Commands
| Command | Purpose |
|---|---|
| fields | Select which fields to display |
| filter | Filter events by condition |
| stats | Aggregate data (count, sum, avg, min, max, pct) |
| sort | Order results |
| limit | Restrict number of results |
| parse | Extract fields from unstructured log data |
| display | Choose which fields appear in results |
| bin() | Group timestamps into intervals |
AWS X-Ray
What X-Ray Does
X-Ray provides distributed tracing for applications. It traces requests as they flow through multiple services, showing where time is spent and where errors occur.
Key Concepts
- Traces: End-to-end record of a request through your application
- Segments: Records for each service that processed the request
- Subsegments: Detailed records within a segment (database calls, HTTP calls)
- Service Map: Visual representation of your application’s architecture with latency and error data
- Annotations: Key-value pairs for filtering traces (indexed, searchable)
- Metadata: Additional data attached to segments (not indexed)
When to Use X-Ray (Exam Patterns)
- “Identify which downstream service causes latency” → X-Ray service map
- “Trace a request through microservices” → X-Ray distributed tracing
- “Find the root cause of intermittent errors” → X-Ray trace analysis with filtering
- “Understand application dependencies” → X-Ray service map
X-Ray Integration
For DOP-C02, know how X-Ray integrates with:
- ECS/Fargate: X-Ray daemon as a sidecar container
- Lambda: Built-in active tracing (enable in function configuration)
- API Gateway: Enable tracing in stage settings
- EC2: Install X-Ray daemon on instances
- Elastic Beanstalk: Enable via configuration option
Annotations vs. Metadata
Annotations:
- Key-value pairs (string, number, boolean)
- Indexed and searchable via X-Ray console and API
- Use for filtering traces (e.g., annotate with customer_id to find all traces for a specific customer)
Metadata:
- Key-value pairs with any data type (including objects and arrays)
- NOT indexed or searchable
- Use for storing additional context (e.g., full request/response payloads)
Exam tip: If a question asks about filtering or searching traces, the answer involves annotations, not metadata.
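The distinction shows up in the segment document itself: annotations and metadata live in separate top-level fields. The sketch below shows the shape with illustrative values; in practice the X-Ray SDK builds this document for you.

```python
# Sketch: the shape of an X-Ray segment document showing where
# annotations (indexed) and metadata (not indexed) live. All values
# here are illustrative assumptions.
segment = {
    "name": "checkout-service",
    "trace_id": "1-67891233-abcdef012345678912345678",
    "id": "70de5b6f19ff9a0a",
    "start_time": 1700000000.0,
    "end_time": 1700000000.25,
    "annotations": {            # indexed: usable in filter expressions
        "customer_id": "c-123",
        "payment_failed": True,
    },
    "metadata": {               # not indexed: extra context only
        "request_payload": {"items": [{"sku": "A1", "qty": 2}]},
    },
}
# A filter expression can then target the annotation, e.g.:
#   annotation.customer_id = "c-123"
```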
Amazon EventBridge
Event-Driven Automation
EventBridge is the backbone of event-driven automation on AWS. For DOP-C02, it connects monitoring with remediation.
Event Patterns
EventBridge rules match events using patterns:
{
"source": ["aws.ec2"],
"detail-type": ["EC2 Instance State-change Notification"],
"detail": {
"state": ["terminated"]
}
}
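The matching semantics are worth internalizing: every key in the pattern must exist in the event, the event's value must be one of the listed values, and nested objects recurse. This toy matcher (a simplification — real EventBridge patterns also support prefix, numeric, and anything-but matching) captures the core rule:

```python
# Sketch: a toy EventBridge pattern matcher covering only exact-value
# lists and nested objects (real patterns support more operators).
def matches(pattern: dict, event: dict) -> bool:
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):  # nested pattern → recurse
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:  # list of allowed values
            return False
    return True

pattern = {"source": ["aws.ec2"],
           "detail-type": ["EC2 Instance State-change Notification"],
           "detail": {"state": ["terminated"]}}
event = {"source": "aws.ec2",
         "detail-type": "EC2 Instance State-change Notification",
         "detail": {"state": "terminated", "instance-id": "i-0abc123"}}
print(matches(pattern, event))  # → True
```

Note that extra keys in the event (like `instance-id`) are ignored; only the keys named in the pattern are checked.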
Common DOP-C02 Event Patterns
Detect CloudTrail API calls:
{
"source": ["aws.iam"],
"detail-type": ["AWS API Call via CloudTrail"],
"detail": {
"eventName": ["CreateAccessKey"]
}
}
Detect Config compliance changes:
{
"source": ["aws.config"],
"detail-type": ["Config Rules Compliance Change"],
"detail": {
"newEvaluationResult": {
"complianceType": ["NON_COMPLIANT"]
}
}
}
Detect CodePipeline failures:
{
"source": ["aws.codepipeline"],
"detail-type": ["CodePipeline Pipeline Execution State Change"],
"detail": {
"state": ["FAILED"]
}
}
EventBridge Targets
When a rule matches, EventBridge can invoke:
- Lambda functions (most common for remediation)
- SNS topics (for notifications)
- SQS queues (for buffering)
- Step Functions (for complex workflows)
- Systems Manager Automation (for runbooks)
- CodePipeline (to trigger deployments)
- Another EventBridge bus (cross-account routing)
Automated Remediation Patterns
The exam heavily tests your ability to design remediation workflows.
Pattern 1: Alarm → SNS → Lambda → Action
CloudWatch Alarm (CPU > 90%)
→ SNS Topic
→ Lambda Function
→ Increase ASG capacity / Add instances
Pattern 2: EventBridge → Lambda → Action
EventBridge Rule (detect unauthorized SG change)
→ Lambda Function
→ Revert security group to approved state
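The Lambda in Pattern 2 typically inverts the offending API call. The sketch below turns a CloudTrail `AuthorizeSecurityGroupIngress` event into the arguments for the matching revoke call; the event shape used here is simplified and assumed, so verify it against a real CloudTrail record before relying on it.

```python
# Sketch of Pattern 2's Lambda: build revoke_security_group_ingress
# kwargs from a CloudTrail AuthorizeSecurityGroupIngress event detail.
# The nested "items" shape below is an assumption about the CloudTrail
# requestParameters format.
def build_revoke_kwargs(detail: dict) -> dict:
    params = detail["requestParameters"]
    perms = []
    for item in params["ipPermissions"]["items"]:
        perms.append({
            "IpProtocol": item["ipProtocol"],
            "FromPort": item["fromPort"],
            "ToPort": item["toPort"],
            "IpRanges": [{"CidrIp": r["cidrIp"]}
                         for r in item["ipRanges"]["items"]],
        })
    return {"GroupId": params["groupId"], "IpPermissions": perms}

detail = {"requestParameters": {
    "groupId": "sg-0abc123",
    "ipPermissions": {"items": [{
        "ipProtocol": "tcp", "fromPort": 22, "toPort": 22,
        "ipRanges": {"items": [{"cidrIp": "0.0.0.0/0"}]}}]}}}
kwargs = build_revoke_kwargs(detail)
# boto3: boto3.client("ec2").revoke_security_group_ingress(**kwargs)
```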
Pattern 3: Config Rule → SSM Automation → Action
AWS Config Rule (S3 bucket not encrypted)
→ Automatic Remediation
→ SSM Automation Document
→ Enable S3 default encryption
Pattern 4: CloudWatch Alarm → Auto Scaling → Scale
CloudWatch Alarm (SQS queue depth > threshold)
→ Auto Scaling Policy (target tracking)
→ Launch additional instances
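Pattern 4 usually works best with a derived "backlog per instance" metric rather than raw queue depth, so the target scales with fleet size. A minimal sketch of the calculation the custom metric publishes:

```python
# Sketch: the "backlog per instance" value commonly published as a
# custom metric for SQS-driven target tracking — visible queue depth
# divided by running instances.
def backlog_per_instance(queue_depth: int, running_instances: int) -> float:
    # guard against divide-by-zero when the group has scaled to zero
    return queue_depth / max(running_instances, 1)

# 1000 visible messages across 4 instances → 250 messages each; if one
# instance can absorb ~100 per interval, target tracking scales out.
print(backlog_per_instance(1000, 4))  # → 250.0
```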
Choosing the Right Pattern
| Scenario | Best Pattern |
|---|---|
| React to metric threshold | Alarm → SNS → Lambda |
| React to AWS API event | EventBridge → Lambda |
| React to compliance violation | Config → SSM Automation |
| Scale based on load | Alarm → Auto Scaling |
| Complex multi-step remediation | EventBridge → Step Functions |
Monitoring Architecture for Multi-Account
DOP-C02 frequently tests multi-account monitoring. The standard architecture:
- Each account: CloudWatch Logs with subscription filters
- Central logging account: Kinesis Data Firehose receives logs from all accounts
- S3 bucket: Long-term storage with lifecycle policies
- CloudWatch cross-account observability: Query metrics and logs across accounts from a central dashboard
- EventBridge: Cross-account event routing for centralized alerting
Exam-Ready Checklist for Domain 3
- Can explain the difference between standard and custom metrics
- Know which EC2 metrics require the CloudWatch Agent
- Can configure composite alarms with AND/OR logic
- Understand metric math for calculated metrics
- Can write CloudWatch Logs Insights queries from memory
- Know subscription filter destinations and use cases
- Understand X-Ray traces, segments, and annotations vs. metadata
- Can design EventBridge rules for common AWS events
- Know at least 4 automated remediation patterns
- Understand cross-account monitoring architecture
Validate Your Monitoring Knowledge
Domain 3 is worth 26% of your exam score. Weak performance here makes passing extremely difficult regardless of how well you do on other domains.
Sailor’s DOP-C02 mock exam bundle includes monitoring-focused questions that test CloudWatch, X-Ray, EventBridge, and remediation patterns at exam difficulty. Domain-level scoring tells you exactly whether this critical area needs more study.
Related Resources
- DOP-C02 Exam Guide 2026 — Complete exam format and all five domains
- DOP-C02 CI/CD Guide — Domain 1 deep dive (22% of exam)
- DOP-C02 Practice Questions — 20 realistic questions including monitoring scenarios
- 10-Week DOP-C02 Study Plan — Weeks 5-6 cover monitoring in depth