Production Ubuntu servers (especially LTS releases 22.04 / 24.04 / 26.04) tend to fail in a surprisingly small number of repeatable patterns. Once you recognize the symptom → subsystem → most likely root cause mapping, diagnosis time drops dramatically — often from hours to minutes.
Below are the ten most frequent failure categories seen in real environments in 2025–2026, ordered roughly by how often they appear in post-mortems and support tickets.
1. High CPU Usage but Low Application Throughput
Symptoms
- top / htop shows 90–100% CPU, but requests per second or transactions/sec are much lower than expected
- high system% or irq% time
- elevated context switches / voluntary & involuntary preemptions
Most common root causes
- Interrupt storm (bad NIC driver, MSI-X issue, high packet rate + small packets)
- spinlocks / contended mutexes in application or kernel (very common with old Java, PHP-FPM, nginx worker processes)
- High softirq time (visible in vmstat / mpstat) → networking or block-layer saturation
- CPU frequency stuck low (thermal throttling, intel_pstate bug, wrong governor)
Diagnostic steps
- mpstat -P ALL 1 → per-core view
- sar -I SUM or cat /proc/interrupts delta → which IRQ line is exploding
- perf record -a -g -- sleep 10; perf report → kernel/user flame graph
- bpftrace -e 'tracepoint:syscalls:sys_enter_* { @[probe] = count(); }' → syscall frequency
- turbostat or cpupower monitor → C-states & thermal throttling
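The /proc/interrupts delta check above can be scripted directly when sar isn't available; a minimal sketch (the snap helper name is illustrative, and it assumes the standard Linux /proc/interrupts layout with one header row):

```shell
# Sum the per-CPU interrupt counts for each IRQ line, snapshot twice,
# and print the five fastest-growing lines over a 2-second window.
snap() {
  awk 'NR > 1 { n = 0; for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++) n += $i; print $1, n }' /proc/interrupts
}
before_file=$(mktemp); after_file=$(mktemp)
snap > "$before_file"
sleep 2
snap > "$after_file"
# Both snapshots keep the same line order, so paste pairs rows safely
# (assumes no new IRQ lines are registered between the two snapshots).
paste "$before_file" "$after_file" | awk '{ printf "%8d  %s\n", $4 - $2, $1 }' | sort -rn | head -5
rm -f "$before_file" "$after_file"
```

An IRQ line whose delta dwarfs everything else points straight at the storming device; cross-reference the IRQ number with the device names in /proc/interrupts.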
2. Memory Pressure Leading to OOM Killer or Severe Latency Spikes
Symptoms
- OOM killer messages in dmesg/journal
- PSI memory full / some stalls > 100–500 ms
- application latency jumps from ms to seconds
- free -h shows very low available memory even with swap disabled
Most common root causes
- Java heap / Redis / PostgreSQL / ClickHouse eating memory without limits
- page cache growing uncontrollably (very large files being read repeatedly)
- memory leak in long-running daemon (common in custom Go/Rust services)
- too many connections → each holding buffers (PostgreSQL, nginx, rabbitmq)
Diagnostic steps
- cat /proc/pressure/memory → full/some stall times
- grep -i oom /var/log/syslog* or journalctl -k | grep -i oom
- smem -t -k -c "pid user command uss pss rss vss" → real memory usage per process
- grep VmRSS /proc/[0-9]*/status 2>/dev/null | sort -k2 -n → quick RSS ranking
- systemd-cgtop → cgroup memory consumption (especially useful in container hosts)
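When smem isn't installed, the same RSS ranking can be pulled straight from /proc with awk; a minimal sketch:

```shell
# Rank processes by resident set size using only /proc (values in kB, largest first).
# Kernel threads have no VmRSS line and are skipped automatically; processes that
# exit mid-scan are silenced by the 2>/dev/null.
rss_top=$(for s in /proc/[0-9]*/status; do
  awk '/^Name:/ { name = $2 } /^VmRSS:/ { print $2, name }' "$s"
done 2>/dev/null | sort -rn | head -10)
printf '%s\n' "$rss_top"
```

Note this reports RSS, which double-counts shared pages across processes; for fair attribution of shared memory you still want smem's PSS/USS columns.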
3. Disk I/O Saturation (High Latency, Throughput Collapse)
Symptoms
- iostat -x shows %util 95–100%, await in tens or hundreds of ms (ignore svctm; it is deprecated in modern sysstat)
- application-level timeouts / slow queries despite low CPU
- journal / application logs full of I/O timeout warnings
Most common root causes
- Writeback storm (dirty pages exceeding dirty_ratio → global flush)
- Single-threaded fsync-heavy workload (PostgreSQL WAL, MySQL innodb_flush_log_at_trx_commit=1)
- No I/O scheduler tuning on NVMe (mq-deadline / none vs kyber)
- RAID rebuild or scrub running in background
- Thin LVM snapshot overload
Diagnostic steps
- iostat -xmdz 1 → per-device %util, await, r/s, w/s
- iotop -o → which process is doing the I/O
- vmstat 1 → bi/bo columns + wa%
- perf record -e block:block_rq_issue -ag -- sleep 10; perf report → block layer trace
- echo 1 > /proc/sys/vm/drop_caches (temporary, root-only test that frees clean page cache; watch whether latency changes)
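The writeback-storm hypothesis (first root cause above) can be checked with two numbers from /proc; a minimal sketch, assuming /proc/meminfo and the vm.dirty_ratio sysctl are readable:

```shell
# How close is the system to the hard dirty-page limit that stalls writers?
# Rough approximation: the kernel computes the limit against dirtyable memory,
# not MemTotal, and vm.dirty_ratio reads as 0 when vm.dirty_bytes is in use.
dirty_kb=$(awk '/^Dirty:/ { print $2 }' /proc/meminfo)
total_kb=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
ratio=$(cat /proc/sys/vm/dirty_ratio)       # percent of memory at which writes block
limit_kb=$((total_kb * ratio / 100))
echo "dirty: ${dirty_kb} kB / approx limit: ${limit_kb} kB"
```

If dirty pages sit near the limit during latency spikes, lowering vm.dirty_background_ratio so flushing starts earlier is the usual mitigation.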
4. Network Stack Collapse (High Latency / Packet Loss / Connection Refusals)
Symptoms
- SYN backlog full → connection refused / timeout
- netstat / ss shows many connections in SYN_RECV / TIME_WAIT
- tcpdump shows retransmits / out-of-order packets
Most common root causes
- somaxconn / tcp_max_syn_backlog too small
- conntrack table overflow (nf_conntrack_count ≈ nf_conntrack_max)
- BBR vs Cubic mismatch on lossy links
- CPU-bound softirq processing → single CPU saturated by network RX
Diagnostic steps
- ss -s / nstat -az | grep -i retrans
- cat /proc/net/stat/nf_conntrack → overflow counter
- sysctl net.ipv4.tcp_mem and current usage
- sar -n DEV 1 → packets dropped, overruns
- ethtool -S eth0 | grep -i drop → hardware/firmware drops
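The conntrack-overflow check (second root cause above) reduces to comparing two counters; a minimal sketch:

```shell
# Compare live conntrack entries against the table ceiling.
# Both files are absent when the nf_conntrack module isn't loaded,
# hence the fallback to 0.
count=$(cat /proc/sys/net/netfilter/nf_conntrack_count 2>/dev/null || echo 0)
max=$(cat /proc/sys/net/netfilter/nf_conntrack_max 2>/dev/null || echo 0)
if [ "$max" -gt 0 ]; then
  echo "conntrack: ${count}/${max} ($((100 * count / max))% full)"
else
  echo "conntrack: module not loaded"
fi
```

Anything sustained above roughly 80–90% full means new flows are about to be dropped; raise net.netfilter.nf_conntrack_max or shorten conntrack timeouts.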
5. Boot / Early Startup Failures
Symptoms
- Stuck in initramfs busybox
- Kernel panic “unable to mount root fs”
- systemd timeout at “A start job is running for…”
Most common root causes
- LUKS unlock failure (timeout, wrong passphrase in automation)
- UUID / PARTUUID changed after disk clone / resize
- Missing module in initramfs after kernel upgrade
- Mount timeout on _netdev (NFS / iSCSI)
Diagnostic steps
- Append to kernel cmdline: break=premount (initramfs-tools; rd.break is the dracut equivalent) or systemd.unit=emergency.target
- journalctl -b -1 from rescue shell (requires persistent journal storage to see the previous boot)
- lsblk -f, blkid, compare with /etc/fstab
- update-initramfs -u -k all from chroot
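The lsblk/blkid-versus-fstab comparison can be scripted; a minimal sketch (check_fstab_uuids is a hypothetical helper name, and it resolves UUIDs via /dev/disk/by-uuid rather than calling blkid):

```shell
# Flag fstab UUID= entries that no longer match any attached block device,
# the classic post-clone / post-resize boot failure.
check_fstab_uuids() {
  awk '!/^[[:space:]]*#/ && $1 ~ /^UUID=/ { sub("UUID=", "", $1); print $1, $2 }' "$1" |
  while read -r uuid mnt; do
    if [ -e "/dev/disk/by-uuid/$uuid" ]; then
      echo "OK      $mnt"
    else
      echo "MISSING $mnt ($uuid)"
    fi
  done
}
if [ -r /etc/fstab ]; then check_fstab_uuids /etc/fstab; fi
```

Any MISSING line is a mount that will hang or drop the boot into emergency mode; fix the UUID in /etc/fstab (or the crypttab equivalent) before rebooting.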
Quick Reference – Symptom → First Three Commands
| Symptom | Command 1 | Command 2 | Command 3 |
|---|---|---|---|
| High CPU, low throughput | mpstat -P ALL 1 | perf record -a -g -- sleep 15 | perf report |
| Memory pressure / OOM | cat /proc/pressure/memory | smem -t -k | systemd-cgtop |
| Disk I/O latency | iostat -xmdz 1 | iotop -o | vmstat 1 |
| Network saturation | ss -s | nstat -az | sar -n DEV 1 |
| Boot stuck in initramfs | cat /proc/cmdline | lsblk -f | cryptsetup status |
| Service won’t start / timeout | systemctl status <unit> | journalctl -u <unit> -n 200 | systemctl list-dependencies <unit> |
These patterns cover ~80–85% of production incidents on Ubuntu servers. Once you develop muscle memory for the first three commands per category, most failures become pattern-matching exercises rather than mysteries. The remaining 15–20% usually involve application-specific logic bugs, third-party software regressions, or very subtle kernel/hardware interactions — but even those become much easier to bisect when the OS-layer symptoms are ruled out quickly.