
AWS Monitoring and Observability for DOP-C02: CloudWatch, X-Ray, and EventBridge Guide

Master the largest DOP-C02 exam domain (26%). Deep dive into CloudWatch metrics, alarms, Logs Insights, X-Ray tracing, EventBridge automation, and remediation patterns.

By Sailor Team, April 7, 2026

Introduction

Monitoring, Logging, and Remediation is the largest domain on the DOP-C02 exam at 26%. It’s also the domain where most candidates lose the most points. The exam doesn’t just test whether you know what CloudWatch does — it tests whether you can design monitoring architectures, write Logs Insights queries, and build automated remediation workflows.

This guide covers every monitoring and observability topic you need for DOP-C02.

CloudWatch Metrics

Standard vs. Custom Metrics

Standard metrics are automatically published by AWS services:

  • EC2: CPUUtilization, NetworkIn/Out, DiskReadOps (hypervisor level only)
  • ALB: RequestCount, TargetResponseTime, HTTPCode_Target_5XX_Count
  • RDS: DatabaseConnections, FreeStorageSpace, ReadLatency
  • SQS: ApproximateNumberOfMessagesVisible, ApproximateAgeOfOldestMessage

Custom metrics are published by your application using the PutMetricData API:

  • Application-specific metrics (request latency, queue depth per instance, error rates)
  • OS-level metrics via CloudWatch Agent (memory utilization, disk usage)
  • Business metrics (orders processed, user sign-ups)

Exam-critical detail: EC2 standard metrics do NOT include memory utilization or disk space. These require the CloudWatch Agent publishing custom metrics. This is a common exam question.

Metric Resolution

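  • Standard resolution: 1-minute granularity (the finest available for AWS service metrics; EC2 basic monitoring publishes every 5 minutes unless detailed monitoring is enabled)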
  • High resolution: 1-second intervals (custom metrics only, via PutMetricData with StorageResolution=1)

When to use high resolution: Real-time scaling decisions, latency-sensitive applications, short-duration spikes that would be averaged out in 1-minute intervals.
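
As a rough illustration of the custom-metric and resolution points above, the sketch below builds a PutMetricData request body in plain Python. The namespace, metric names, instance ID, and the helper `build_metric_datum` are all made up for the example; only the field names (`MetricName`, `Dimensions`, `Value`, `Unit`, `StorageResolution`) follow the PutMetricData API.

```python
from datetime import datetime, timezone


def build_metric_datum(name, value, unit="Percent", high_resolution=False, dimensions=None):
    """Build one entry for the MetricData list of a PutMetricData request."""
    return {
        "MetricName": name,
        "Dimensions": [{"Name": k, "Value": v} for k, v in (dimensions or {}).items()],
        "Timestamp": datetime.now(timezone.utc),
        "Value": float(value),
        "Unit": unit,
        # 1 = high resolution (1-second), 60 = standard resolution (1-minute)
        "StorageResolution": 1 if high_resolution else 60,
    }


# Hypothetical namespace and metrics, for illustration only.
request = {
    "Namespace": "MyApp/Hypothetical",
    "MetricData": [
        build_metric_datum("MemoryUtilization", 71.5,
                           dimensions={"InstanceId": "i-0123456789abcdef0"}),
        build_metric_datum("RequestLatency", 0.182, unit="Seconds", high_resolution=True),
    ],
}

# With boto3 installed and credentials configured, the request could be sent as:
#   boto3.client("cloudwatch").put_metric_data(**request)
```

In practice the CloudWatch Agent publishes memory and disk metrics for you; calling the API directly is more typical for application and business metrics.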

Metric Math

Metric Math lets you create calculated metrics from existing ones without publishing new data.

Common exam patterns:

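# Error rate as percentage (m1 = error count, m2 = total request count)
m1 / m2 * 100

# Anomaly detection band (2 standard deviations wide)
ANOMALY_DETECTION_BAND(m1, 2)

# Sum of all metrics in the graph (METRICS("m") would select only IDs containing "m")
SUM(METRICS())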

Exam use cases:

  • Calculate error rate from error count and total request count
  • Compute per-instance queue depth (queue size / instance count)
  • Create normalized metrics for comparison across different-sized fleets
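
To make the error-rate use case concrete, here is a minimal sketch of how the query list for a GetMetricData call might look, alongside a local function that mirrors the arithmetic. The load balancer value and the helper names are assumptions for illustration; the local function uses `None` as a stand-in for a period where the expression yields no datapoint.

```python
def error_rate_queries(load_balancer, period=300):
    """Build MetricDataQueries: two hidden source metrics plus one visible expression."""
    def source(metric_id, metric_name):
        return {
            "Id": metric_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/ApplicationELB",
                    "MetricName": metric_name,
                    "Dimensions": [{"Name": "LoadBalancer", "Value": load_balancer}],
                },
                "Period": period,
                "Stat": "Sum",
            },
            "ReturnData": False,  # only the expression result is returned
        }

    return [
        source("m1", "HTTPCode_Target_5XX_Count"),
        source("m2", "RequestCount"),
        {"Id": "e1", "Expression": "m1 / m2 * 100", "Label": "ErrorRatePercent",
         "ReturnData": True},
    ]


def error_rate(errors, requests):
    """Local mirror of m1 / m2 * 100, applied period by period."""
    return [e / r * 100 if r else None for e, r in zip(errors, requests)]


queries = error_rate_queries("app/my-alb/0123456789abcdef")  # hypothetical ALB
rates = error_rate([1, 0, 3], [4, 50, 0])
```

No new metric is published here: the expression is evaluated at query time from the two existing series.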

CloudWatch Alarms

Alarm States

  • OK: Metric is within the defined threshold
  • ALARM: Metric breached the threshold
  • INSUFFICIENT_DATA: Not enough data points to evaluate

Alarm Configuration

Key parameters the exam tests:

  • Period: How long each evaluation period lasts (e.g., 60 seconds, 300 seconds)
  • Evaluation Periods: How many of the most recent periods are evaluated (the N in M of N)
  • Datapoints to Alarm: How many of those periods must be in breach to trigger the alarm (the M in M of N)
  • Statistic: Average, Sum, Minimum, Maximum, p99, etc.
  • Comparison Operator: Greater than, less than, etc.

Example: “Alarm if average CPU exceeds 80% for 3 out of 5 consecutive 5-minute periods” translates to:

  • Period: 300 seconds
  • Evaluation Periods: 5
  • Datapoints to Alarm: 3
  • Statistic: Average
  • Threshold: 80
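
The M-of-N logic can be sketched in a few lines of Python. This is a simplified model, not CloudWatch's actual evaluation: it ignores the missing-data treatment options and assumes a greater-than comparison. The function name and sample values are invented for the example.

```python
def evaluate_alarm(datapoints, threshold, evaluation_periods, datapoints_to_alarm):
    """Return the alarm state for the most recent evaluation window."""
    window = datapoints[-evaluation_periods:]
    if len(window) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    breaching = sum(1 for value in window if value > threshold)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"


# Five 5-minute averages of CPU utilization; three of them exceed 80.
state = evaluate_alarm([70, 85, 90, 60, 82], threshold=80,
                       evaluation_periods=5, datapoints_to_alarm=3)
```

Note that the three breaching datapoints do not have to be adjacent; they only need to fall within the last five periods.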

Composite Alarms

Composite alarms combine the states of other alarms using AND, OR, and NOT logic.

Use when:

  • You want to alert only when multiple conditions are true simultaneously
  • You need to reduce alarm noise (e.g., alert only when BOTH CPU is high AND request latency is elevated)
  • You want a single alarm to represent overall system health

Exam example: “Alert the on-call team only when the application error rate exceeds 5% AND the database connection count exceeds 90% of the maximum.”
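
A scenario like this maps to a composite alarm rule expression. The sketch below assembles the rule string and a request body for PutCompositeAlarm; the alarm names, account ID, and SNS topic ARN are hypothetical, and the two child alarms are assumed to exist already.

```python
def all_in_alarm(*alarm_names):
    """Build a rule that fires only when every named alarm is in ALARM state."""
    return " AND ".join(f'ALARM("{name}")' for name in alarm_names)


rule = all_in_alarm("app-error-rate-high", "db-connections-high")

request = {
    "AlarmName": "oncall-page",  # hypothetical composite alarm name
    "AlarmRule": rule,
    "AlarmActions": ["arn:aws:sns:us-east-1:111122223333:oncall-topic"],
}

# With boto3 installed and credentials configured:
#   boto3.client("cloudwatch").put_composite_alarm(**request)
```

Attaching the notification action only to the composite alarm, and not to the child alarms, is what reduces the paging noise.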

Alarm Actions

Alarms can trigger:

  • SNS notifications: Email, SMS, or a subscribed Lambda function
  • Auto Scaling policies: Scale out or scale in
  • EC2 actions: Stop, terminate, reboot, or recover instances
  • Systems Manager actions: Create an OpsItem in OpsCenter or an incident in Incident Manager

CloudWatch Logs

Architecture

  • Log Groups: Containers for log streams (e.g., /aws/lambda/my-function)
  • Log Streams: Sequences of log events from a single source (e.g., one Lambda execution environment)
  • Log Events: Individual log entries with timestamps

Log Retention

Log groups have configurable retention periods:

  • 1 day to 10 years, or indefinite
  • Default: Indefinite (logs never expire)
  • Exam tip: Set retention policies to control costs. Indefinite retention on high-volume logs is expensive.

Metric Filters

Metric filters extract metric data from log events. They scan log data for specific patterns and increment a CloudWatch metric when a match is found.

Example: Create a metric filter on your application log group that counts lines containing “ERROR.” The resulting metric can drive a CloudWatch alarm.

Filter pattern syntax the exam tests:

  • "ERROR" — Match lines containing the word ERROR
  • [ip, user, timestamp, request, status_code = 5*, bytes] — Match log lines with 5xx status codes
  • { $.statusCode = 500 } — Match JSON logs where statusCode is 500
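
For reference, a metric filter that counts ERROR lines might be defined with a request like the one sketched below. The log group, filter name, metric name, and namespace are placeholders; the field names follow the PutMetricFilter API.

```python
def error_count_filter(log_group, pattern='"ERROR"'):
    """Build a PutMetricFilter request that emits 1 for every matching log event."""
    return {
        "logGroupName": log_group,
        "filterName": "error-count",  # hypothetical filter name
        "filterPattern": pattern,
        "metricTransformations": [{
            "metricName": "ApplicationErrors",
            "metricNamespace": "MyApp/Hypothetical",
            "metricValue": "1",
            # Report 0 when nothing matches, so an alarm sees a continuous series.
            "defaultValue": 0.0,
        }],
    }


text_filter = error_count_filter("/my-app/production")
json_filter = error_count_filter("/my-app/production", pattern="{ $.statusCode = 500 }")

# With boto3 installed and credentials configured:
#   boto3.client("logs").put_metric_filter(**text_filter)
```

Metric filters only apply to events ingested after the filter is created; they do not backfill from existing log data.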

Subscription Filters

Subscription filters stream log data in real-time to:

  • Lambda functions: For processing or transformation
  • Kinesis Data Streams: For real-time analytics
  • Kinesis Data Firehose (now Amazon Data Firehose): For delivery to S3, Redshift, or OpenSearch
  • A CloudWatch Logs destination in another account: For cross-account log aggregation (the destination fronts a Kinesis data stream or Firehose stream)

Exam pattern: Centralized logging architecture uses subscription filters to stream logs from multiple accounts to a central account’s Kinesis Data Firehose, which delivers to S3.
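
When the subscription target is a Lambda function, the log batch arrives base64-encoded and gzip-compressed under `awslogs.data`. The sketch below decodes that payload and picks out ERROR lines; the sample payload is fabricated for a local round trip, and the filtering rule is just an example of "processing".

```python
import base64
import gzip
import json


def decode_subscription_event(event):
    """Decode the payload CloudWatch Logs delivers to a subscribed Lambda function."""
    compressed = base64.b64decode(event["awslogs"]["data"])
    return json.loads(gzip.decompress(compressed))


def handler(event, context):
    payload = decode_subscription_event(event)
    # Control messages are connectivity checks and carry no log data.
    if payload.get("messageType") != "DATA_MESSAGE":
        return []
    return [e["message"] for e in payload["logEvents"] if "ERROR" in e["message"]]


# Made-up payload, encoded the same way the service encodes it.
sample = {
    "messageType": "DATA_MESSAGE",
    "logGroup": "/my-app/production",
    "logStream": "instance-1",
    "logEvents": [
        {"id": "1", "timestamp": 1700000000000, "message": "INFO started"},
        {"id": "2", "timestamp": 1700000001000, "message": "ERROR db timeout"},
    ],
}
event = {"awslogs": {"data": base64.b64encode(
    gzip.compress(json.dumps(sample).encode("utf-8"))).decode("ascii")}}
errors = handler(event, None)
```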

CloudWatch Logs Insights

Logs Insights provides a purpose-built query language for interactively analyzing log data. DOP-C02 expects you to understand query syntax and common patterns.

Essential Query Patterns

Find the most recent errors:

fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 20

Count errors by type:

fields @message
| filter @message like /ERROR/
| parse @message "ERROR: *" as errorType
| stats count(*) as errorCount by errorType
| sort errorCount desc

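Identify the 10 slowest invocations (@duration is discovered automatically in Lambda REPORT lines):

filter @type = "REPORT"
| fields @timestamp, @requestId, @duration
| sort @duration desc
| limit 10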

Average latency over time (for dashboards):

stats avg(@duration) as avgLatency by bin(5m)
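
If the effect of `bin(5m)` is unclear, the following Python sketch mimics it on a small in-memory sample: each timestamp is rounded down to the start of its 5-minute bucket and the durations in each bucket are averaged. The function name and data are invented; this models the grouping only, not the query engine.

```python
from collections import defaultdict


def avg_by_bin(events, bin_seconds=300):
    """events: iterable of (epoch_seconds, duration_ms). Returns {bin_start: average}."""
    buckets = defaultdict(list)
    for timestamp, duration in events:
        # Round the timestamp down to the start of its bucket.
        buckets[timestamp - timestamp % bin_seconds].append(duration)
    return {start: sum(values) / len(values) for start, values in sorted(buckets.items())}


# Two events fall in the first 5-minute bucket, one in the second.
sample_events = [(0, 100.0), (120, 300.0), (310, 50.0)]
averages = avg_by_bin(sample_events)
```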

Find requests from a specific IP:

fields @timestamp, @message
| filter @message like /192\.168\.1\.100/
| sort @timestamp desc

Key Commands

Command   Purpose
fields    Select which fields to display
filter    Filter events by condition
stats     Aggregate data (count, sum, avg, min, max, pct)
sort      Order results
limit     Restrict number of results
parse     Extract fields from unstructured log data
display   Choose which fields appear in results
bin()     Group timestamps into intervals

AWS X-Ray

What X-Ray Does

X-Ray provides distributed tracing for applications. It traces requests as they flow through multiple services, showing where time is spent and where errors occur.

Key Concepts

  • Traces: End-to-end record of a request through your application
  • Segments: Records for each service that processed the request
  • Subsegments: Detailed records within a segment (database calls, HTTP calls)
  • Service Map: Visual representation of your application’s architecture with latency and error data
  • Annotations: Key-value pairs for filtering traces (indexed, searchable)
  • Metadata: Additional data attached to segments (not indexed)

When to Use X-Ray (Exam Patterns)

  • “Identify which downstream service causes latency” → X-Ray service map
  • “Trace a request through microservices” → X-Ray distributed tracing
  • “Find the root cause of intermittent errors” → X-Ray trace analysis with filtering
  • “Understand application dependencies” → X-Ray service map

X-Ray Integration

For DOP-C02, know how X-Ray integrates with:

  • ECS/Fargate: X-Ray daemon as a sidecar container
  • Lambda: Built-in active tracing (enable in function configuration)
  • API Gateway: Enable tracing in stage settings
  • EC2: Install X-Ray daemon on instances
  • Elastic Beanstalk: Enable via configuration option

Annotations vs. Metadata

Annotations:

  • Key-value pairs (string, number, boolean)
  • Indexed and searchable via X-Ray console and API
  • Use for filtering traces (e.g., annotate with customer_id to find all traces for a specific customer)

Metadata:

  • Key-value pairs with any data type (including objects and arrays)
  • NOT indexed or searchable
  • Use for storing additional context (e.g., full request/response payloads)

Exam tip: If a question asks about filtering or searching traces, the answer involves annotations, not metadata.
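
To show where each kind of data lives, here is a hand-built segment document in Python. In a real application the X-Ray SDK or an OpenTelemetry exporter would normally produce this for you; the service name, customer ID, and payload below are placeholders, and the helper functions are illustrative only.

```python
import json
import secrets
import time


def new_trace_id(now=None):
    """Trace ID layout: version 1, 8 hex digits of epoch seconds, 24 random hex digits."""
    epoch = int(now if now is not None else time.time())
    return f"1-{epoch:08x}-{secrets.token_hex(12)}"


def build_segment(name, customer_id, payload):
    start = time.time()
    return {
        "name": name,
        "id": secrets.token_hex(8),  # 16 hex digits
        "trace_id": new_trace_id(start),
        "start_time": start,
        "end_time": start + 0.05,
        # Indexed: usable in filter expressions such as annotation.customer_id = "c-42"
        "annotations": {"customer_id": customer_id},
        # Not indexed: arbitrary nested context, visible only when a trace is opened
        "metadata": {"debug": {"request": payload}},
    }


segment = build_segment("checkout", "c-42", {"items": [1, 2]})
document = json.dumps(segment)
```

Keeping annotations small and scalar, and pushing bulky payloads into metadata, matches the indexed versus non-indexed split described above.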

Amazon EventBridge

Event-Driven Automation

EventBridge is the backbone of event-driven automation on AWS. For DOP-C02, it connects monitoring with remediation.

Event Patterns

EventBridge rules match events using patterns:

{
  "source": ["aws.ec2"],
  "detail-type": ["EC2 Instance State-change Notification"],
  "detail": {
    "state": ["terminated"]
  }
}

Common DOP-C02 Event Patterns

Detect CloudTrail API calls:

{
  "source": ["aws.iam"],
  "detail-type": ["AWS API Call via CloudTrail"],
  "detail": {
    "eventName": ["CreateAccessKey"]
  }
}

Detect Config compliance changes:

{
  "source": ["aws.config"],
  "detail-type": ["Config Rules Compliance Change"],
  "detail": {
    "newEvaluationResult": {
      "complianceType": ["NON_COMPLIANT"]
    }
  }
}

Detect CodePipeline failures:

{
  "source": ["aws.codepipeline"],
  "detail-type": ["CodePipeline Pipeline Execution State Change"],
  "detail": {
    "state": ["FAILED"]
  }
}
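
The matching rule behind these patterns can be approximated in a few lines of Python, which is handy for reasoning about whether a rule will fire. This is a deliberately simplified model: it handles only exact-value lists and nested objects, not prefix, numeric, `anything-but`, or array-valued matching. The function name and sample event are invented.

```python
def matches(pattern, event):
    """Every key in the pattern must be present in the event.
    A list means "any of these exact values"; a dict recurses into a nested object."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


pipeline_failed = {
    "source": ["aws.codepipeline"],
    "detail-type": ["CodePipeline Pipeline Execution State Change"],
    "detail": {"state": ["FAILED"]},
}

sample_event = {
    "source": "aws.codepipeline",
    "detail-type": "CodePipeline Pipeline Execution State Change",
    "detail": {"pipeline": "my-pipeline", "state": "FAILED"},  # hypothetical pipeline
}
fired = matches(pipeline_failed, sample_event)
```

Fields that appear in the event but not in the pattern (such as `pipeline` here) are ignored, which is why short patterns match broadly.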

EventBridge Targets

When a rule matches, EventBridge can invoke:

  • Lambda functions (most common for remediation)
  • SNS topics (for notifications)
  • SQS queues (for buffering)
  • Step Functions (for complex workflows)
  • Systems Manager Automation (for runbooks)
  • CodePipeline (to trigger deployments)
  • Another EventBridge bus (cross-account routing)

Automated Remediation Patterns

The exam heavily tests your ability to design remediation workflows.

Pattern 1: Alarm → SNS → Lambda → Action

CloudWatch Alarm (CPU > 90%)
  → SNS Topic
    → Lambda Function
      → Increase ASG capacity / Add instances
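
A minimal sketch of the Lambda step in this pattern, assuming the alarm notification arrives through SNS as a JSON string in `Records[0].Sns.Message`. The capacity values, group name, and helper names are hypothetical; the actual scaling call is left as a comment so the logic can be read and run locally.

```python
import json


def parse_alarm(event):
    """Return (alarm name, new state) from an SNS-delivered alarm notification."""
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    return message["AlarmName"], message["NewStateValue"]


def next_capacity(current, maximum, step=1):
    """Add `step` instances without exceeding the group's maximum size."""
    return min(current + step, maximum)


def handler(event, context, current=2, maximum=6):
    # `current` and `maximum` are hypothetical defaults; a real function would
    # read them from the Auto Scaling group before deciding.
    name, state = parse_alarm(event)
    if state != "ALARM":
        return {"alarm": name, "action": "none"}
    desired = next_capacity(current, maximum)
    # With boto3, the change could be applied with:
    #   boto3.client("autoscaling").set_desired_capacity(
    #       AutoScalingGroupName="my-asg", DesiredCapacity=desired)
    return {"alarm": name, "action": "scale", "desired_capacity": desired}


sample_event = {"Records": [{"Sns": {"Message": json.dumps(
    {"AlarmName": "cpu-high", "NewStateValue": "ALARM"})}}]}
result = handler(sample_event, None)
```

Checking `NewStateValue` matters because the same topic may also receive the notification when the alarm returns to OK.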

Pattern 2: EventBridge → Lambda → Action

EventBridge Rule (detect unauthorized SG change)
  → Lambda Function
    → Revert security group to approved state

Pattern 3: Config Rule → SSM Automation → Action

AWS Config Rule (S3 bucket not encrypted)
  → Automatic Remediation
    → SSM Automation Document
      → Enable S3 default encryption

Pattern 4: CloudWatch Alarm → Auto Scaling → Scale

CloudWatch Alarm (SQS queue depth > threshold)
  → Auto Scaling Policy (step scaling; target tracking policies create and manage their own alarms)
    → Launch additional instances

Choosing the Right Pattern

Scenario                          Best Pattern
React to metric threshold         Alarm → SNS → Lambda
React to AWS API event            EventBridge → Lambda
React to compliance violation     Config → SSM Automation
Scale based on load               Alarm → Auto Scaling
Complex multi-step remediation    EventBridge → Step Functions

Monitoring Architecture for Multi-Account

DOP-C02 frequently tests multi-account monitoring. The standard architecture:

  1. Each account: CloudWatch Logs with subscription filters
  2. Central logging account: Kinesis Data Firehose receives logs from all accounts
  3. S3 bucket: Long-term storage with lifecycle policies
  4. CloudWatch cross-account observability: Query metrics and logs across accounts from a central dashboard
  5. EventBridge: Cross-account event routing for centralized alerting

Exam-Ready Checklist for Domain 3

  • Can explain the difference between standard and custom metrics
  • Know which EC2 metrics require the CloudWatch Agent
  • Can configure composite alarms with AND/OR logic
  • Understand metric math for calculated metrics
  • Can write CloudWatch Logs Insights queries from memory
  • Know subscription filter destinations and use cases
  • Understand X-Ray traces, segments, and annotations vs. metadata
  • Can design EventBridge rules for common AWS events
  • Know at least 4 automated remediation patterns
  • Understand cross-account monitoring architecture

Validate Your Monitoring Knowledge

Domain 3 is worth 26% of your exam score. Weak performance here makes passing extremely difficult regardless of how well you do on other domains.

Sailor’s DOP-C02 mock exam bundle includes monitoring-focused questions that test CloudWatch, X-Ray, EventBridge, and remediation patterns at exam difficulty. Domain-level scoring tells you exactly whether this critical area needs more study.
