Introduction
Monitoring, Logging, and Remediation is the largest domain on the DOP-C02 exam at 26%. It's also the domain where candidates lose the most points. The exam doesn't just test whether you know what CloudWatch does — it tests whether you can design monitoring architectures, write Logs Insights queries, and build automated remediation workflows.
This guide covers every monitoring and observability topic you need for DOP-C02.
CloudWatch Metrics
Standard vs. Custom Metrics
Standard metrics are automatically published by AWS services:
- EC2: CPUUtilization, NetworkIn/Out, DiskReadOps (hypervisor level only)
- ALB: RequestCount, TargetResponseTime, HTTPCode_Target_5XX_Count
- RDS: DatabaseConnections, FreeStorageSpace, ReadLatency
- SQS: ApproximateNumberOfMessagesVisible, ApproximateAgeOfOldestMessage
Custom metrics are published by your application using the PutMetricData API:
- Application-specific metrics (request latency, queue depth per instance, error rates)
- OS-level metrics via CloudWatch Agent (memory utilization, disk usage)
- Business metrics (orders processed, user sign-ups)
Exam-critical detail: EC2 standard metrics do NOT include memory utilization or disk space. These require the CloudWatch Agent publishing custom metrics. This is a common exam question.
Metric Resolution
- Standard resolution: 1-minute intervals (default for most AWS services)
- High resolution: 1-second intervals (custom metrics only, via PutMetricData with StorageResolution=1)
When to use high resolution: Real-time scaling decisions, latency-sensitive applications, short-duration spikes that would be averaged out in 1-minute intervals.
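Custom and high-resolution metrics come together in the PutMetricData request. The sketch below builds the request parameters for a hypothetical "OrderLatency" metric (the namespace, metric name, and dimension are illustrative assumptions, not AWS defaults); with boto3 the dict would be passed straight to `put_metric_data`.

```python
# Sketch: request parameters for CloudWatch PutMetricData, assuming a
# hypothetical "OrderLatency" custom metric. StorageResolution=1 marks
# it as high resolution (1-second datapoints); use 60 (or omit it) for
# standard resolution.
def build_put_metric_data_params(value_ms: float, instance_id: str) -> dict:
    return {
        "Namespace": "MyApp",  # custom namespaces must not start with "AWS/"
        "MetricData": [
            {
                "MetricName": "OrderLatency",
                "Dimensions": [
                    {"Name": "InstanceId", "Value": instance_id},
                ],
                "Value": value_ms,
                "Unit": "Milliseconds",
                "StorageResolution": 1,  # 1 = high resolution, 60 = standard
            }
        ],
    }

params = build_put_metric_data_params(42.5, "i-0abc123")
# With boto3 this would be: boto3.client("cloudwatch").put_metric_data(**params)
```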
Metric Math
Metric Math lets you create calculated metrics from existing ones without publishing new data.
Common exam patterns:
# Error rate as percentage (m1 and m2 are metric IDs in the same request)
(m1 / m2) * 100
# Anomaly detection band
ANOMALY_DETECTION_BAND(m1, 2)
# Sum across dimensions
SUM(METRICS("RequestCount"))
Exam use cases:
- Calculate error rate from error count and total request count
- Compute per-instance queue depth (queue size / instance count)
- Create normalized metrics for comparison across different-sized fleets
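In the GetMetricData API, metric math appears as an `Expression` query alongside the source metrics. The sketch below builds the query list for the error-rate use case; the metric names (`Errors`, `RequestCount`) and namespace `MyApp` are illustrative assumptions.

```python
# Sketch: GetMetricData queries that compute an error-rate percentage
# via a metric-math Expression instead of publishing a new metric.
def error_rate_queries(namespace: str, period: int = 300) -> list:
    def metric(query_id, name):
        return {
            "Id": query_id,
            "MetricStat": {
                "Metric": {"Namespace": namespace, "MetricName": name},
                "Period": period,
                "Stat": "Sum",
            },
            "ReturnData": False,  # only return the derived expression
        }
    return [
        metric("m1", "Errors"),
        metric("m2", "RequestCount"),
        {"Id": "errorRate", "Expression": "(m1 / m2) * 100",
         "Label": "Error rate (%)"},
    ]

queries = error_rate_queries("MyApp")
# boto3: cloudwatch.get_metric_data(MetricDataQueries=queries,
#                                   StartTime=..., EndTime=...)
```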
CloudWatch Alarms
Alarm States
- OK: Metric is within the defined threshold
- ALARM: Metric breached the threshold
- INSUFFICIENT_DATA: Not enough data points to evaluate
Alarm Configuration
Key parameters the exam tests:
- Period: The length of each evaluation period in seconds (e.g., 60, 300)
- Evaluation Periods: The number of most recent periods (N) examined when evaluating the alarm
- Datapoints to Alarm: How many of those N periods must be in breach to trigger the alarm (M of N)
- Statistic: Average, Sum, Minimum, Maximum, p99, etc.
- Comparison Operator: Greater than, less than, etc.
Example: “Alarm if average CPU exceeds 80% for 3 out of 5 consecutive 5-minute periods” translates to:
- Period: 300 seconds
- Evaluation Periods: 5
- Datapoints to Alarm: 3
- Statistic: Average
- Threshold: 80
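The M-of-N evaluation above can be simulated locally. This toy function (not an AWS API) mirrors how CloudWatch counts breaching datapoints in the last N periods, assuming a GreaterThanThreshold comparison:

```python
# Sketch: local simulation of CloudWatch "M out of N" alarm evaluation.
# CloudWatch looks at the last N periods and alarms when at least M of
# them breach the threshold.
def alarm_state(datapoints, threshold, datapoints_to_alarm, evaluation_periods):
    window = datapoints[-evaluation_periods:]  # last N periods
    if len(window) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    # GreaterThanThreshold comparison
    breaching = sum(1 for d in window if d > threshold)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"

# Average CPU over the last five 5-minute periods, M=3, N=5:
print(alarm_state([85, 70, 90, 82, 75], 80, 3, 5))  # → ALARM (3 breaches)
print(alarm_state([85, 70, 60, 82, 75], 80, 3, 5))  # → OK (only 2 breaches)
```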
Composite Alarms
Composite alarms combine multiple alarms using AND/OR logic.
Use when:
- You want to alert only when multiple conditions are true simultaneously
- You need to reduce alarm noise (e.g., alert only when BOTH CPU is high AND request latency is elevated)
- You want a single alarm to represent overall system health
Exam example: “Alert the on-call team only when the application error rate exceeds 5% AND the database connection count exceeds 90% of the maximum.”
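That exam example maps directly onto a composite alarm's `AlarmRule` expression. The sketch below shows the parameters for `put_composite_alarm`; the child alarm names and SNS topic ARN are hypothetical placeholders.

```python
# Sketch: parameters for CloudWatch put_composite_alarm implementing
# "error rate high AND DB connections high". Child alarm names
# (high-error-rate, db-connections-high) and the topic ARN are
# illustrative assumptions.
composite = {
    "AlarmName": "app-degraded",
    "AlarmRule": "ALARM(high-error-rate) AND ALARM(db-connections-high)",
    "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:oncall"],
}
# boto3: boto3.client("cloudwatch").put_composite_alarm(**composite)
```

The rule fires only when both child alarms are in ALARM state, which is exactly the noise-reduction behavior the exam describes.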
Alarm Actions
Alarms can trigger:
- SNS notifications — Email, SMS, Lambda functions
- Auto Scaling policies — Scale up or scale down
- EC2 actions — Stop, terminate, reboot, or recover instances
- Systems Manager actions — Run automation documents
CloudWatch Logs
Architecture
- Log Groups: Containers for log streams (e.g., /aws/lambda/my-function)
- Log Streams: Sequences of log events from a single source (e.g., one Lambda execution environment)
- Log Events: Individual log entries with timestamps
Log Retention
Log groups have configurable retention periods:
- 1 day to 10 years, or indefinite
- Default: Indefinite (logs never expire)
- Exam tip: Set retention policies to control costs. Indefinite retention on high-volume logs is expensive.
Metric Filters
Metric filters extract metric data from log events. They scan log data for specific patterns and increment a CloudWatch metric when a match is found.
Example: Create a metric filter on your application log group that counts lines containing “ERROR.” The resulting metric can drive a CloudWatch alarm.
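Conceptually, the filter scans each ingested line and increments a metric on every match. This local sketch (plain Python, not an AWS API) mirrors that behavior for the "ERROR" example:

```python
# Sketch: what the "ERROR" metric filter does conceptually — scan log
# lines for a term and emit a count that becomes a metric datapoint
# for the evaluation period.
def count_matches(log_lines, term="ERROR"):
    return sum(1 for line in log_lines if term in line)

lines = [
    "INFO request handled in 12ms",
    "ERROR: database timeout",
    "ERROR: database timeout",
    "WARN retrying",
]
print(count_matches(lines))  # → 2
```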
Filter pattern syntax the exam tests:
- "ERROR" — Match lines containing the word ERROR
- [ip, user, timestamp, request, status_code = 5*, bytes] — Match space-delimited log lines with 5xx status codes
- { $.statusCode = 500 } — Match JSON logs where statusCode is 500
Subscription Filters
Subscription filters stream log data in real-time to:
- Lambda functions — For processing or transformation
- Kinesis Data Streams — For real-time analytics
- Kinesis Data Firehose — For delivery to S3, Redshift, or OpenSearch
- Another CloudWatch Logs destination — For cross-account log aggregation
Exam pattern: Centralized logging architecture uses subscription filters to stream logs from multiple accounts to a central account’s Kinesis Data Firehose, which delivers to S3.
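When the subscription target is a Lambda function, the function receives the log batch base64-encoded and gzip-compressed under `event["awslogs"]["data"]`. The sketch below decodes it; the synthetic payload stands in for a real delivery.

```python
# Sketch: decoding the event a Lambda target receives from a
# CloudWatch Logs subscription filter.
import base64
import gzip
import json

def decode_subscription_event(event: dict) -> dict:
    raw = base64.b64decode(event["awslogs"]["data"])
    return json.loads(gzip.decompress(raw))

# Synthetic payload standing in for a real subscription delivery:
inner = {
    "logGroup": "/aws/lambda/my-function",
    "logEvents": [{"id": "1", "timestamp": 0, "message": "ERROR: boom"}],
}
fake_event = {"awslogs": {"data": base64.b64encode(
    gzip.compress(json.dumps(inner).encode())).decode()}}

decoded = decode_subscription_event(fake_event)
print(decoded["logEvents"][0]["message"])  # → ERROR: boom
```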
CloudWatch Logs Insights
Logs Insights is a query language for analyzing log data. DOP-C02 expects you to understand query syntax and common patterns.
Essential Query Patterns
Find the most recent errors:
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 20
Count errors by type:
fields @message
| filter @message like /ERROR/
| parse @message "ERROR: *" as errorType
| stats count(*) by errorType
| sort count(*) desc
Identify top 10 slowest requests:
fields @timestamp, @duration
| sort @duration desc
| limit 10
Average latency over time (for dashboards):
stats avg(@duration) as avgLatency by bin(5m)
Find requests from a specific IP:
fields @timestamp, @message
| filter @message like /192.168.1.100/
| sort @timestamp desc
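To make the `parse`/`stats` pipeline concrete, here is the "count errors by type" query reproduced locally in plain Python — each step of the toy function maps to one Insights command (this is a mental model, not how Insights executes):

```python
# Sketch: local equivalent of
#   filter @message like /ERROR/
#   | parse @message "ERROR: *" as errorType
#   | stats count(*) by errorType | sort count(*) desc
from collections import Counter

def count_errors_by_type(messages):
    counts = Counter()
    for msg in messages:
        if "ERROR: " in msg:                         # filter
            error_type = msg.split("ERROR: ", 1)[1]  # parse ... as errorType
            counts[error_type] += 1
    return counts.most_common()                      # stats + sort desc

logs = ["ERROR: timeout", "INFO ok", "ERROR: timeout", "ERROR: throttled"]
print(count_errors_by_type(logs))  # → [('timeout', 2), ('throttled', 1)]
```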
Key Commands
| Command | Purpose |
|---|---|
| fields | Select which fields to display |
| filter | Filter events by condition |
| stats | Aggregate data (count, sum, avg, min, max, pct) |
| sort | Order results |
| limit | Restrict number of results |
| parse | Extract fields from unstructured log data |
| display | Choose which fields appear in results |
| bin() | Group timestamps into intervals |
AWS X-Ray
What X-Ray Does
X-Ray provides distributed tracing for applications. It traces requests as they flow through multiple services, showing where time is spent and where errors occur.
Key Concepts
- Traces: End-to-end record of a request through your application
- Segments: Records for each service that processed the request
- Subsegments: Detailed records within a segment (database calls, HTTP calls)
- Service Map: Visual representation of your application’s architecture with latency and error data
- Annotations: Key-value pairs for filtering traces (indexed, searchable)
- Metadata: Additional data attached to segments (not indexed)
When to Use X-Ray (Exam Patterns)
- “Identify which downstream service causes latency” → X-Ray service map
- “Trace a request through microservices” → X-Ray distributed tracing
- “Find the root cause of intermittent errors” → X-Ray trace analysis with filtering
- “Understand application dependencies” → X-Ray service map
X-Ray Integration
For DOP-C02, know how X-Ray integrates with:
- ECS/Fargate: X-Ray daemon as a sidecar container
- Lambda: Built-in active tracing (enable in function configuration)
- API Gateway: Enable tracing in stage settings
- EC2: Install X-Ray daemon on instances
- Elastic Beanstalk: Enable via configuration option
Annotations vs. Metadata
Annotations:
- Key-value pairs (string, number, boolean)
- Indexed and searchable via X-Ray console and API
- Use for filtering traces (e.g., annotate with customer_id to find all traces for a specific customer)
Metadata:
- Key-value pairs with any data type (including objects and arrays)
- NOT indexed or searchable
- Use for storing additional context (e.g., full request/response payloads)
Exam tip: If a question asks about filtering or searching traces, the answer involves annotations, not metadata.
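The distinction shows up in the segment document itself: annotations and metadata live in separate top-level fields. The sketch below shows the shape with illustrative values; in practice the X-Ray SDK builds this document for you.

```python
# Sketch: the shape of an X-Ray segment document showing where
# annotations (indexed) and metadata (not indexed) live. All values
# here are illustrative assumptions.
segment = {
    "name": "checkout-service",
    "trace_id": "1-67891233-abcdef012345678912345678",
    "id": "70de5b6f19ff9a0a",
    "start_time": 1700000000.0,
    "end_time": 1700000000.25,
    "annotations": {            # indexed: usable in filter expressions
        "customer_id": "c-123",
        "payment_failed": True,
    },
    "metadata": {               # not indexed: extra context only
        "request_payload": {"items": [{"sku": "A1", "qty": 2}]},
    },
}
# A filter expression can then target the annotation, e.g.:
#   annotation.customer_id = "c-123"
```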
Amazon EventBridge
Event-Driven Automation
EventBridge is the backbone of event-driven automation on AWS. For DOP-C02, it connects monitoring with remediation.
Event Patterns
EventBridge rules match events using patterns:
{
"source": ["aws.ec2"],
"detail-type": ["EC2 Instance State-change Notification"],
"detail": {
"state": ["terminated"]
}
}
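The matching semantics are worth internalizing: every key in the pattern must exist in the event, the event's value must be one of the listed values, and nested objects recurse. This toy matcher (a simplification — real EventBridge patterns also support prefix, numeric, and anything-but matching) captures the core rule:

```python
# Sketch: a toy EventBridge pattern matcher covering only exact-value
# lists and nested objects (real patterns support more operators).
def matches(pattern: dict, event: dict) -> bool:
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):  # nested pattern → recurse
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:  # list of allowed values
            return False
    return True

pattern = {"source": ["aws.ec2"],
           "detail-type": ["EC2 Instance State-change Notification"],
           "detail": {"state": ["terminated"]}}
event = {"source": "aws.ec2",
         "detail-type": "EC2 Instance State-change Notification",
         "detail": {"state": "terminated", "instance-id": "i-0abc123"}}
print(matches(pattern, event))  # → True
```

Note that extra keys in the event (like `instance-id`) are ignored; only the keys named in the pattern are checked.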
Common DOP-C02 Event Patterns
Detect CloudTrail API calls:
{
"source": ["aws.iam"],
"detail-type": ["AWS API Call via CloudTrail"],
"detail": {
"eventName": ["CreateAccessKey"]
}
}
Detect Config compliance changes:
{
"source": ["aws.config"],
"detail-type": ["Config Rules Compliance Change"],
"detail": {
"newEvaluationResult": {
"complianceType": ["NON_COMPLIANT"]
}
}
}
Detect CodePipeline failures:
{
"source": ["aws.codepipeline"],
"detail-type": ["CodePipeline Pipeline Execution State Change"],
"detail": {
"state": ["FAILED"]
}
}
EventBridge Targets
When a rule matches, EventBridge can invoke:
- Lambda functions (most common for remediation)
- SNS topics (for notifications)
- SQS queues (for buffering)
- Step Functions (for complex workflows)
- Systems Manager Automation (for runbooks)
- CodePipeline (to trigger deployments)
- Another EventBridge bus (cross-account routing)
Automated Remediation Patterns
The exam heavily tests your ability to design remediation workflows.
Pattern 1: Alarm → SNS → Lambda → Action
CloudWatch Alarm (CPU > 90%)
→ SNS Topic
→ Lambda Function
→ Increase ASG capacity / Add instances
Pattern 2: EventBridge → Lambda → Action
EventBridge Rule (detect unauthorized SG change)
→ Lambda Function
→ Revert security group to approved state
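The Lambda in Pattern 2 typically inverts the offending API call. The sketch below turns a CloudTrail `AuthorizeSecurityGroupIngress` event into the arguments for the matching revoke call; the event shape used here is simplified and assumed, so verify it against a real CloudTrail record before relying on it.

```python
# Sketch of Pattern 2's Lambda: build revoke_security_group_ingress
# kwargs from a CloudTrail AuthorizeSecurityGroupIngress event detail.
# The nested "items" shape below is an assumption about the CloudTrail
# requestParameters format.
def build_revoke_kwargs(detail: dict) -> dict:
    params = detail["requestParameters"]
    perms = []
    for item in params["ipPermissions"]["items"]:
        perms.append({
            "IpProtocol": item["ipProtocol"],
            "FromPort": item["fromPort"],
            "ToPort": item["toPort"],
            "IpRanges": [{"CidrIp": r["cidrIp"]}
                         for r in item["ipRanges"]["items"]],
        })
    return {"GroupId": params["groupId"], "IpPermissions": perms}

detail = {"requestParameters": {
    "groupId": "sg-0abc123",
    "ipPermissions": {"items": [{
        "ipProtocol": "tcp", "fromPort": 22, "toPort": 22,
        "ipRanges": {"items": [{"cidrIp": "0.0.0.0/0"}]}}]}}}
kwargs = build_revoke_kwargs(detail)
# boto3: boto3.client("ec2").revoke_security_group_ingress(**kwargs)
```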
Pattern 3: Config Rule → SSM Automation → Action
AWS Config Rule (S3 bucket not encrypted)
→ Automatic Remediation
→ SSM Automation Document
→ Enable S3 default encryption
Pattern 4: CloudWatch Alarm → Auto Scaling → Scale
CloudWatch Alarm (SQS queue depth > threshold)
→ Auto Scaling Policy (target tracking)
→ Launch additional instances
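Pattern 4 usually works best with a derived "backlog per instance" metric rather than raw queue depth, so the target scales with fleet size. A minimal sketch of the calculation the custom metric publishes:

```python
# Sketch: the "backlog per instance" value commonly published as a
# custom metric for SQS-driven target tracking — visible queue depth
# divided by running instances.
def backlog_per_instance(queue_depth: int, running_instances: int) -> float:
    # guard against divide-by-zero when the group has scaled to zero
    return queue_depth / max(running_instances, 1)

# 1000 visible messages across 4 instances → 250 messages each; if one
# instance can absorb ~100 per interval, target tracking scales out.
print(backlog_per_instance(1000, 4))  # → 250.0
```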
Choosing the Right Pattern
| Scenario | Best Pattern |
|---|---|
| React to metric threshold | Alarm → SNS → Lambda |
| React to AWS API event | EventBridge → Lambda |
| React to compliance violation | Config → SSM Automation |
| Scale based on load | Alarm → Auto Scaling |
| Complex multi-step remediation | EventBridge → Step Functions |
Monitoring Architecture for Multi-Account
DOP-C02 frequently tests multi-account monitoring. The standard architecture:
- Each account: CloudWatch Logs with subscription filters
- Central logging account: Kinesis Data Firehose receives logs from all accounts
- S3 bucket: Long-term storage with lifecycle policies
- CloudWatch cross-account observability: Query metrics and logs across accounts from a central dashboard
- EventBridge: Cross-account event routing for centralized alerting
Exam-Ready Checklist for Domain 3
- Can explain the difference between standard and custom metrics
- Know which EC2 metrics require the CloudWatch Agent
- Can configure composite alarms with AND/OR logic
- Understand metric math for calculated metrics
- Can write CloudWatch Logs Insights queries from memory
- Know subscription filter destinations and use cases
- Understand X-Ray traces, segments, and annotations vs. metadata
- Can design EventBridge rules for common AWS events
- Know at least 4 automated remediation patterns
- Understand cross-account monitoring architecture
Validate Your Monitoring Knowledge
Domain 3 is worth 26% of your exam score. Weak performance here makes passing extremely difficult regardless of how well you do on other domains.
Sailor’s DOP-C02 mock exam bundle includes monitoring-focused questions that test CloudWatch, X-Ray, EventBridge, and remediation patterns at exam difficulty. Domain-level scoring tells you exactly whether this critical area needs more study.
Related Resources
- DOP-C02 Exam Guide 2026 — Complete exam format and all five domains
- DOP-C02 CI/CD Guide — Domain 1 deep dive (22% of exam)
- DOP-C02 Practice Questions — 20 realistic questions including monitoring scenarios
- 10-Week DOP-C02 Study Plan — Weeks 5-6 cover monitoring in depth