CloudWatch Alarms That Actually Matter for PostgreSQL

Most teams inherit a default RDS monitoring setup that alarms on CPUUtilization and not much else. It's a reasonable starting point. The problem is that CPU is a lagging indicator for most PostgreSQL failure modes. By the time CPU hits 80%, queries have often been timing out for several minutes already. The metrics that matter are the ones that give you signal before the incident is in progress, not after.

These are the five alarms we set on every RDS PostgreSQL instance we work with, and why each one matters more than CPU alone.

1. DatabaseConnections — the most important one you're probably not watching

Connection exhaustion is one of the most common causes of "database unavailable" pages in production PostgreSQL. Unlike CPU spikes, which often self-correct, connection exhaustion causes immediate application errors: every new connection attempt gets rejected with FATAL: remaining connection slots are reserved for non-replication superuser connections.

The alarm should trigger at 80% of max_connections, not when you're already at 100% and application errors are happening. The threshold varies by instance class because the RDS default for max_connections is derived from instance memory, so treat the figures below as approximate:

Instance class    RAM       Default max_connections   Alarm threshold (80%)
db.t3.medium      4 GB      ~170                      136
db.r5.large       16 GB     ~435                      348
db.r5.xlarge      32 GB     ~855                      684
db.r5.2xlarge     64 GB     ~1700                     1360
db.r5.4xlarge     128 GB    ~3400                     2720

To check your current max_connections value and set the alarm threshold correctly:

SHOW max_connections;
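
With that value in hand, the threshold is simple arithmetic. What follows is a minimal sketch, assuming max_connections came back as 855 (the db.r5.xlarge figure from the table). The alarm name rds-connections-high and the create_connections_alarm function are illustrative names, not an established convention, and your-db-id and the SNS topic ARN are placeholders.

```shell
# Assumption: the value returned by SHOW max_connections on your instance.
max_connections=855

# 80% of max_connections, using integer arithmetic.
threshold=$(( max_connections * 80 / 100 ))
echo "DatabaseConnections alarm threshold: $threshold"

# Hypothetical helper: creates the alarm when called (requires the AWS CLI).
create_connections_alarm() {
  aws cloudwatch put-metric-alarm \
    --alarm-name rds-connections-high \
    --metric-name DatabaseConnections \
    --namespace AWS/RDS \
    --dimensions Name=DBInstanceIdentifier,Value=your-db-id \
    --statistic Maximum \
    --period 60 \
    --threshold "$threshold" \
    --comparison-operator GreaterThanOrEqualToThreshold \
    --evaluation-periods 3 \
    --alarm-actions arn:aws:sns:eu-west-1:YOUR_ACCOUNT:your-alert-topic
}
```

Running create_connections_alarm afterwards creates the alarm. Using Maximum over three one-minute periods is one reasonable choice for catching short connection bursts; adjust the period and evaluation count to suit how noisy your workload is.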
If you've already hit connection exhaustion, see the connection pool exhaustion guide. The alarm tells you the problem is coming; PgBouncer fixes the root cause.

2. FreeStorageSpace — set it early, not at zero

A full RDS volume doesn't cause a graceful degradation. PostgreSQL stops accepting writes immediately when storage is exhausted. This means: no new rows, no transaction commits, no autovacuum (which also writes). The instance becomes effectively read-only and applications start failing hard.

Set the alarm at 20% of allocated storage remaining, not 5%. You need time to react — storage can fill faster than expected after a write surge, a bloated migration, or a large batch job. Twenty percent gives you hours to respond; 5% gives you minutes.

For a 500GB gp3 volume, the alarm threshold is 100GB free (i.e., 400GB used). The metric is in bytes, so:

# 100GB in bytes (100 * 1024^3) = 107374182400
# Alarm: FreeStorageSpace < 107374182400
aws cloudwatch put-metric-alarm \
  --alarm-name rds-storage-low \
  --metric-name FreeStorageSpace \
  --namespace AWS/RDS \
  --dimensions Name=DBInstanceIdentifier,Value=your-db-id \
  --statistic Average \
  --period 300 \
  --threshold 107374182400 \
  --comparison-operator LessThanThreshold \
  --evaluation-periods 2 \
  --alarm-actions arn:aws:sns:eu-west-1:YOUR_ACCOUNT:your-alert-topic
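
For other volume sizes, the byte threshold can be derived rather than looked up. This is a small sketch, assuming allocated storage is expressed in binary gigabytes (1024^3 bytes), which is how the 100GB figure above was converted; the variable names are illustrative.

```shell
# Assumption: allocated storage for the instance, in GiB.
allocated_gib=500
# Alarm when this percentage of allocated storage (or less) remains free.
alarm_percent=20

# GiB -> bytes, then take the alarm percentage (integer arithmetic).
threshold_bytes=$(( allocated_gib * 1024 * 1024 * 1024 * alarm_percent / 100 ))
echo "$threshold_bytes"
```

For a 500GB volume this prints 107374182400, the same value passed to --threshold in the command above.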

3. ReadLatency / WriteLatency — the early warning for I/O problems

Elevated I/O latency shows up before CPU spikes when the root cause is storage-related. A missing index causing a sequential scan on a large table will show up as sustained high ReadLatency for minutes before CPU climbs. A write storm from an ORM doing per-row inserts without batching will register in WriteLatency before the queue depth saturates.

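For gp3 volumes, normal read/write latency for PostgreSQL should be below 2ms for random I/O. CloudWatch reports ReadLatency and WriteLatency in seconds, so a 5ms threshold is entered as 0.005. Alarm thresholds: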

Metric            Warn threshold   Critical threshold   Notes
ReadLatency       5ms              20ms                 Above 20ms consistently means sequential scan or I/O saturation
WriteLatency      3ms              10ms                 Write latency is usually lower than read on gp3 with fsync
DiskQueueDepth    1                5                    Any sustained queue depth above 1 means IOPS are saturated
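
Because CloudWatch expects the latency thresholds in seconds, the millisecond values in the table need converting before they go into an alarm. The sketch below shows one way to do that; ms_to_seconds and create_latency_alarm are hypothetical helper names, and your-db-id and the SNS topic ARN are placeholders.

```shell
# Convert a millisecond threshold to seconds, the unit CloudWatch uses
# for ReadLatency and WriteLatency.
ms_to_seconds() {
  awk -v ms="$1" 'BEGIN { printf "%.3f\n", ms / 1000 }'
}

read_warn=$(ms_to_seconds 5)
read_crit=$(ms_to_seconds 20)
write_warn=$(ms_to_seconds 3)
write_crit=$(ms_to_seconds 10)
echo "ReadLatency  warn=$read_warn critical=$read_crit"
echo "WriteLatency warn=$write_warn critical=$write_crit"

# Hypothetical helper (requires the AWS CLI):
#   create_latency_alarm <alarm-name> <metric-name> <threshold-seconds>
create_latency_alarm() {
  aws cloudwatch put-metric-alarm \
    --alarm-name "$1" \
    --metric-name "$2" \
    --namespace AWS/RDS \
    --dimensions Name=DBInstanceIdentifier,Value=your-db-id \
    --statistic Average \
    --period 60 \
    --threshold "$3" \
    --comparison-operator GreaterThanThreshold \
    --evaluation-periods 3 \
    --alarm-actions arn:aws:sns:eu-west-1:YOUR_ACCOUNT:your-alert-topic
}
```

A call such as create_latency_alarm rds-read-latency-warn ReadLatency "$read_warn" would then create the warning alarm. Requiring three consecutive periods is an assumption intended to match the "consistently" wording in the table rather than alerting on a single slow minute.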

4. DBLoad — the best single metric if you have Performance Insights

DBLoad is published to CloudWatch when Performance Insights is enabled; it is not an Enhanced Monitoring metric. It measures the average number of active sessions at any given moment (similar to load average on Linux, but for database sessions). A DBLoad greater than the number of vCPUs on your instance means more sessions are active than the instance has cores to run them, so some of them are waiting, whether on CPU, locks, or I/O.

This is the metric that catches the scenario CPU doesn't: 200 connections, all queued on lock waits, zero CPU, application timing out. CPU shows 5% and nothing looks wrong, while DBLoad shows 180 and tells you something is very wrong.

Alarm threshold: set at 1× vCPU count for warning, 2× for critical. For a db.r5.xlarge (4 vCPUs): warn at DBLoad = 4, critical at DBLoad = 8.

Enable Performance Insights to get DBLoad. Enhanced Monitoring at 60-second granularity or finer is a useful companion for OS-level detail, at an additional cost of around £3–8/month per instance. The DBLoad metric alone justifies the setup: it's the most useful single signal available for PostgreSQL on RDS.
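
The 1× and 2× rule can be turned into alarm thresholds directly. This is a rough sketch, assuming DBLoad is already being published to CloudWatch in the AWS/RDS namespace for your instance; create_dbload_alarm is a hypothetical helper name, and your-db-id and the SNS topic ARN are placeholders.

```shell
# Assumption: vCPU count for your instance class (4 for db.r5.xlarge).
vcpus=4

warn_threshold=$(( vcpus * 1 ))
critical_threshold=$(( vcpus * 2 ))
echo "DBLoad warn=$warn_threshold critical=$critical_threshold"

# Hypothetical helper (requires the AWS CLI):
#   create_dbload_alarm <alarm-name> <threshold>
create_dbload_alarm() {
  aws cloudwatch put-metric-alarm \
    --alarm-name "$1" \
    --metric-name DBLoad \
    --namespace AWS/RDS \
    --dimensions Name=DBInstanceIdentifier,Value=your-db-id \
    --statistic Average \
    --period 60 \
    --threshold "$2" \
    --comparison-operator GreaterThanThreshold \
    --evaluation-periods 5 \
    --alarm-actions arn:aws:sns:eu-west-1:YOUR_ACCOUNT:your-alert-topic
}
```

Calling create_dbload_alarm rds-dbload-warn "$warn_threshold" and create_dbload_alarm rds-dbload-critical "$critical_threshold" would create the pair. Five one-minute periods is an assumption meant to ignore brief bursts; shorten it if you would rather be paged sooner.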

5. FreeableMemory — catches the gradual memory leak

PostgreSQL uses free OS memory for the page cache — the OS keeps recently-accessed data pages in memory so that repeat reads don't hit disk. As FreeableMemory drops, the effective cache size shrinks, read IOPS increase, and query latency creeps up. This happens gradually over weeks on instances that are sized too tight, and it's easy to miss without a trend alarm.

Alarm at 15% of total instance RAM remaining. For a db.r5.xlarge (32GB): alarm when FreeableMemory drops below ~4.8GB (5,153,960,755 bytes, taking 32GB as 32 × 1024^3 bytes). If this alarm fires regularly, you're either memory-constrained and need a larger instance, or a query is doing large hash joins / sort operations without an appropriate work_mem setting.
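
The byte threshold for other instance sizes follows the same arithmetic. A minimal sketch, assuming RAM is counted in binary gigabytes (1024^3 bytes); the variable names are illustrative.

```shell
# Assumption: nominal instance RAM in GiB (32 for db.r5.xlarge).
ram_gib=32
# Alarm when FreeableMemory falls below this percentage of total RAM.
alarm_percent=15

# GiB -> bytes, then take the alarm percentage (integer arithmetic).
threshold_bytes=$(( ram_gib * 1024 * 1024 * 1024 * alarm_percent / 100 ))
echo "$threshold_bytes"
```

The result goes into a FreeableMemory alarm with a LessThanThreshold comparison, in the same shape as the FreeStorageSpace command earlier in this article.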

To identify memory-intensive queries in PostgreSQL:

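-- Requires the pg_stat_statements extension; mean_exec_time is the
-- PostgreSQL 13+ column name (earlier versions call it mean_time).
SELECT
  left(query, 100)                    AS query,
  calls,
  rows,
  round(mean_exec_time::numeric, 2)   AS mean_ms,
  shared_blks_hit + shared_blks_read  AS total_blocks_touched,
  -- Share of block requests served from shared_buffers; NULL if none were touched.
  round(100.0 * shared_blks_hit
        / nullif(shared_blks_hit + shared_blks_read, 0), 1) AS cache_hit_pct
FROM   pg_stat_statements
ORDER BY total_blocks_touched DESC
LIMIT  10;

High total_blocks_touched combined with a low cache_hit_pct (shared_blks_hit divided by total_blocks_touched) indicates cache misses: queries whose blocks are being fetched from outside shared_buffers, and often from disk, rather than from memory. This is a sign that either the working set has grown beyond available cache, or an index is missing and full table scans are evicting hot data from the cache.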

What about CPU?

Still set a CPU alarm — just don't make it your only one. 80% is a reasonable critical threshold. The important thing is to treat a CPU alarm as "go look at what's running" rather than "page everyone." Most CPU spikes on well-configured PostgreSQL instances have a specific cause that takes 5–10 minutes to diagnose and fix. The P1 CPU runbook covers that exact investigation.

Alarm summary

Metric                 Threshold                               Why it matters
DatabaseConnections    80% of max_connections                  Predicts connection exhaustion before errors start
FreeStorageSpace       20% remaining                           Gives time to respond before writes fail hard
ReadLatency            > 5ms warn, > 20ms critical             Early indicator of seq scans and I/O saturation
DBLoad                 > vCPU count warn, > 2× vCPU critical   Catches lock waits and queue buildup CPU misses
FreeableMemory         < 15% of total RAM                      Catches slow cache pressure increase before latency impact
CPUUtilization         > 80%                                   Still useful; treat as "go investigate" not "panic"
