Production Ubuntu servers (especially LTS releases 22.04 / 24.04 / 26.04) tend to fail in a surprisingly small number of repeatable patterns. Once you recognize the symptom → subsystem → most likely root cause mapping, diagnosis time drops dramatically — often from hours to minutes.
Below are the ten most frequent failure categories seen in real environments in 2025–2026, ordered roughly by how often they appear in post-mortems and support tickets.
1. High CPU Usage but Low Application Throughput
Symptoms
- top / htop shows 90–100% CPU, but requests per second or transactions/sec are much lower than expected
- high system% or irq% time
- elevated context switches / voluntary & involuntary preemptions
Most common root causes
- Interrupt storm (bad NIC driver, MSI-X issue, high packet rate + small packets)
- spinlocks / contended mutexes in application or kernel (very common with old Java, PHP-FPM, nginx worker processes)
- High softirq time (visible in vmstat / mpstat) → networking or block-layer saturation
- CPU frequency stuck low (thermal throttling, intel_pstate bug, wrong governor)
Diagnostic steps
- mpstat -P ALL 1 → per-core view
- sar -I SUM or cat /proc/interrupts delta → which IRQ line is exploding
- perf record -a -g -- sleep 10; perf report → kernel/user flame graph
- bpftrace -e 'tracepoint:syscalls:sys_enter_* { @[probe] = count(); }' → syscall frequency
- turbostat or cpupower monitor → C-states & thermal throttling
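The /proc/interrupts delta check above can be scripted directly when sar isn't available; a minimal sketch (the snap helper name is illustrative, and it assumes the standard Linux /proc/interrupts layout with one header row):

```shell
# Sum the per-CPU interrupt counts for each IRQ line, snapshot twice,
# and print the five fastest-growing lines over a 2-second window.
snap() {
  awk 'NR > 1 { n = 0; for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++) n += $i; print $1, n }' /proc/interrupts
}
before_file=$(mktemp); after_file=$(mktemp)
snap > "$before_file"
sleep 2
snap > "$after_file"
# Both snapshots keep the same line order, so paste pairs rows safely
# (assumes no new IRQ lines are registered between the two snapshots).
paste "$before_file" "$after_file" | awk '{ printf "%8d  %s\n", $4 - $2, $1 }' | sort -rn | head -5
rm -f "$before_file" "$after_file"
```

An IRQ line whose delta dwarfs everything else points straight at the storming device; cross-reference the IRQ number with the device names in /proc/interrupts.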
2. Memory Pressure Leading to OOM Killer or Severe Latency Spikes
Symptoms
- OOM killer messages in dmesg/journal
- PSI memory full / some stalls > 100–500 ms
- application latency jumps from ms to seconds
- free -h shows very low available memory even with swap disabled
Most common root causes
- Java heap / Redis / PostgreSQL / ClickHouse eating memory without limits
- page cache growing uncontrollably (very large files being read repeatedly)
- memory leak in long-running daemon (common in custom Go/Rust services)
- too many connections → each holding buffers (PostgreSQL, nginx, rabbitmq)
Diagnostic steps
- cat /proc/pressure/memory → full/some stall times
- grep -i oom /var/log/syslog* or journalctl -k | grep -i oom
- smem -t -k -c "pid user command uss pss rss vss" → real memory usage per process
- grep VmRSS /proc/[0-9]*/status 2>/dev/null | sort -k2 -n → quick RSS ranking
- systemd-cgtop → cgroup memory consumption (especially useful in container hosts)
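When smem isn't installed, the same RSS ranking can be pulled straight from /proc with awk; a minimal sketch:

```shell
# Rank processes by resident set size using only /proc (values in kB, largest first).
# Kernel threads have no VmRSS line and are skipped automatically; processes that
# exit mid-scan are silenced by the 2>/dev/null.
rss_top=$(for s in /proc/[0-9]*/status; do
  awk '/^Name:/ { name = $2 } /^VmRSS:/ { print $2, name }' "$s"
done 2>/dev/null | sort -rn | head -10)
printf '%s\n' "$rss_top"
```

Note this reports RSS, which double-counts shared pages across processes; for fair attribution of shared memory you still want smem's PSS/USS columns.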
3. Disk I/O Saturation (High Latency, Throughput Collapse)
Symptoms
- iostat -x shows %util 95–100%, await in tens or hundreds of ms (ignore svctm; it is deprecated in modern sysstat)
- application-level timeouts / slow queries despite low CPU
- journal / application logs full of I/O timeout warnings
Most common root causes
- Writeback storm (dirty pages exceeding dirty_ratio → global flush)
- Single-threaded fsync-heavy workload (PostgreSQL WAL, MySQL innodb_flush_log_at_trx_commit=1)
- No I/O scheduler tuning on NVMe (mq-deadline / none vs kyber)
- RAID rebuild or scrub running in background
- Thin LVM snapshot overload
Diagnostic steps
- iostat -xmdz 1 → per-device %util, await, r/s, w/s
- iotop -o → which process is doing the I/O
- vmstat 1 → bi/bo columns + wa%
- perf record -e block:block_rq_issue -ag -- sleep 10; perf report → block layer trace
- echo 1 > /proc/sys/vm/drop_caches (temporary, root-only test that frees clean page cache; watch whether latency changes)
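The writeback-storm hypothesis (first root cause above) can be checked with two numbers from /proc; a minimal sketch, assuming /proc/meminfo and the vm.dirty_ratio sysctl are readable:

```shell
# How close is the system to the hard dirty-page limit that stalls writers?
# Rough approximation: the kernel computes the limit against dirtyable memory,
# not MemTotal, and vm.dirty_ratio reads as 0 when vm.dirty_bytes is in use.
dirty_kb=$(awk '/^Dirty:/ { print $2 }' /proc/meminfo)
total_kb=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
ratio=$(cat /proc/sys/vm/dirty_ratio)       # percent of memory at which writes block
limit_kb=$((total_kb * ratio / 100))
echo "dirty: ${dirty_kb} kB / approx limit: ${limit_kb} kB"
```

If dirty pages sit near the limit during latency spikes, lowering vm.dirty_background_ratio so flushing starts earlier is the usual mitigation.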
4. Network Stack Collapse (High Latency / Packet Loss / Connection Refusals)
Symptoms
- SYN backlog full → connection refused / timeout
- netstat / ss shows many connections in SYN_RECV / TIME_WAIT
- tcpdump shows retransmits / out-of-order packets
Most common root causes
- somaxconn / tcp_max_syn_backlog too small
- conntrack table overflow (nf_conntrack_count ≈ nf_conntrack_max)
- BBR vs Cubic mismatch on lossy links
- CPU-bound softirq processing → single CPU saturated by network RX
Diagnostic steps
- ss -s / nstat -az | grep -i retrans
- cat /proc/net/stat/nf_conntrack → overflow counter
- sysctl net.ipv4.tcp_mem and current usage
- sar -n DEV 1 → packets dropped, overruns
- ethtool -S eth0 | grep -i drop → hardware/firmware drops
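The conntrack-overflow check (second root cause above) reduces to comparing two counters; a minimal sketch:

```shell
# Compare live conntrack entries against the table ceiling.
# Both files are absent when the nf_conntrack module isn't loaded,
# hence the fallback to 0.
count=$(cat /proc/sys/net/netfilter/nf_conntrack_count 2>/dev/null || echo 0)
max=$(cat /proc/sys/net/netfilter/nf_conntrack_max 2>/dev/null || echo 0)
if [ "$max" -gt 0 ]; then
  echo "conntrack: ${count}/${max} ($((100 * count / max))% full)"
else
  echo "conntrack: module not loaded"
fi
```

Anything sustained above roughly 80–90% full means new flows are about to be dropped; raise net.netfilter.nf_conntrack_max or shorten conntrack timeouts.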
5. Boot / Early Startup Failures
Symptoms
- Stuck in initramfs busybox
- Kernel panic “unable to mount root fs”
- systemd timeout at “A start job is running for…”
Most common root causes
- LUKS unlock failure (timeout, wrong passphrase in automation)
- UUID / PARTUUID changed after disk clone / resize
- Missing module in initramfs after kernel upgrade
- Mount timeout on _netdev (NFS / iSCSI)
Diagnostic steps
- Append to kernel cmdline: break=premount (initramfs-tools; rd.break is the dracut equivalent) or systemd.unit=emergency.target
- journalctl -b -1 from rescue shell (requires persistent journal storage to see the previous boot)
- lsblk -f, blkid, compare with /etc/fstab
- update-initramfs -u -k all from chroot
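The lsblk/blkid-versus-fstab comparison can be scripted; a minimal sketch (check_fstab_uuids is a hypothetical helper name, and it resolves UUIDs via /dev/disk/by-uuid rather than calling blkid):

```shell
# Flag fstab UUID= entries that no longer match any attached block device,
# the classic post-clone / post-resize boot failure.
check_fstab_uuids() {
  awk '!/^[[:space:]]*#/ && $1 ~ /^UUID=/ { sub("UUID=", "", $1); print $1, $2 }' "$1" |
  while read -r uuid mnt; do
    if [ -e "/dev/disk/by-uuid/$uuid" ]; then
      echo "OK      $mnt"
    else
      echo "MISSING $mnt ($uuid)"
    fi
  done
}
if [ -r /etc/fstab ]; then check_fstab_uuids /etc/fstab; fi
```

Any MISSING line is a mount that will hang or drop the boot into emergency mode; fix the UUID in /etc/fstab (or the crypttab equivalent) before rebooting.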
Quick Reference – Symptom → First Three Commands
| Symptom | Command 1 | Command 2 | Command 3 |
|---|---|---|---|
| High CPU, low throughput | mpstat -P ALL 1 | perf record -a -g -- sleep 15 | perf report |
| Memory pressure / OOM | cat /proc/pressure/memory | smem -t -k | systemd-cgtop |
| Disk I/O latency | iostat -xmdz 1 | iotop -o | vmstat 1 |
| Network saturation | ss -s | nstat -az | sar -n DEV 1 |
| Boot stuck in initramfs | cat /proc/cmdline | lsblk -f | cryptsetup status |
| Service won’t start / timeout | systemctl status <unit> | journalctl -u <unit> -n 200 | systemctl list-dependencies <unit> |
These patterns cover ~80–85% of production incidents on Ubuntu servers. Once you develop muscle memory for the first three commands per category, most failures become pattern-matching exercises rather than mysteries. The remaining 15–20% usually involve application-specific logic bugs, third-party software regressions, or very subtle kernel/hardware interactions — but even those become much easier to bisect when the OS-layer symptoms are ruled out quickly.