
Common Ubuntu Server Failure Scenarios and How to Diagnose Them

February 15, 2026

Production Ubuntu servers (especially LTS releases 22.04 / 24.04, and 26.04 once it ships) tend to fail in a surprisingly small number of repeatable patterns. Once you recognize the symptom → subsystem → most likely root cause mapping, diagnosis time drops dramatically, often from hours to minutes.

Below are the most frequent failure categories seen in real environments in 2025–2026, ordered roughly by how often they appear in post-mortems and support tickets.

1. High CPU Usage but Low Application Throughput

Symptoms

  • top / htop shows 90–100% CPU, but requests per second or transactions/sec are much lower than expected
  • high system% or irq% time
  • elevated context switch rates (voluntary and involuntary)

Most common root causes

  • Interrupt storm (bad NIC driver, MSI-X issue, high packet rate + small packets)
  • spinlocks / contended mutexes in application or kernel (very common with old Java, PHP-FPM, nginx worker processes)
  • softirq saturation from the networking or block layer (mpstat shows high %soft)
  • CPU frequency stuck low (thermal throttling, intel_pstate bug, wrong governor)

Diagnostic steps

  • mpstat -P ALL 1 → per-core view
  • sar -I SUM 1, or the difference between two snapshots of /proc/interrupts → which IRQ line is exploding
  • perf record -a -g -- sleep 10; perf report → kernel/user call-graph profile
  • bpftrace -e 'tracepoint:syscalls:sys_enter_* { @[probe] = count(); }' → syscall frequency
  • turbostat or cpupower monitor → C-states & thermal throttling
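
The /proc/interrupts comparison can be sketched as a small shell function. This is a minimal illustration, not a polished tool: the function name irq_delta and the snapshot file paths are hypothetical, and it assumes the usual /proc/interrupts layout (a header row with one column per CPU, then one row per IRQ).

```shell
# irq_delta BEFORE AFTER
# Hypothetical helper: prints "<increase> <irq label> <last field>" for every
# interrupt line whose total count grew between two /proc/interrupts
# snapshots, largest increase first.
irq_delta() {
    awk '
        FNR == 1 { ncpu = NF; next }            # header row: one column per CPU
        {
            total = 0                           # sum the per-CPU counters
            for (i = 2; i <= ncpu + 1 && i <= NF; i++)
                if ($i ~ /^[0-9]+$/) total += $i
        }
        NR == FNR { before[$1] = total; next }  # first file: remember totals
        {
            delta = total - before[$1]          # second file: compute growth
            if (delta > 0) print delta, $1, $NF
        }
    ' "$1" "$2" | sort -k1,1nr
}
```

Typical use would be to save cat /proc/interrupts to one file, wait a few seconds, save it again to a second file, and run irq_delta on the pair. The last field is usually the device or driver name, so a NIC queue or NVMe queue that dominates the output points at the subsystem to look at next.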

2. Memory Pressure Leading to OOM Killer or Severe Latency Spikes

Symptoms

  • OOM killer messages in dmesg/journal
  • PSI memory full / some stalls > 100–500 ms
  • application latency jumps from ms to seconds
  • free -h shows very low available memory, with no swap to absorb the spike

Most common root causes

  • Java heap / Redis / PostgreSQL / ClickHouse eating memory without limits
  • page cache growing uncontrollably (very large files being read repeatedly)
  • memory leak in long-running daemon (common in custom Go/Rust services)
  • too many connections → each holding buffers (PostgreSQL, nginx, rabbitmq)

Diagnostic steps

  • cat /proc/pressure/memory → full/some stall times
  • grep -i oom /var/log/syslog* or journalctl -k | grep -i oom
  • smem -t -k -c "pid user command uss pss rss vss" → real memory usage per process
  • grep VmRSS /proc/[1-9]*/status 2>/dev/null | sort -k2 -nr | head → quick RSS ranking (largest first)
  • systemd-cgtop → cgroup memory consumption (especially useful in container hosts)
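
As a rough sketch of reading the pressure file from a script: each line of /proc/pressure/memory starts with "some" or "full", followed by avg10=, avg60=, avg300= (percentage of wall-clock time stalled over that window) and total= (cumulative stall time in microseconds). The function name psi_avg10 and the optional file argument are hypothetical, added here so the parsing can be tried against a saved sample.

```shell
# psi_avg10 KIND [FILE]
# Hypothetical helper: prints the avg10 value for the "some" or "full" line
# of a PSI file. FILE defaults to the memory pressure file.
psi_avg10() {
    awk -v kind="$1" '
        $1 == kind {
            sub(/^avg10=/, "", $2)   # strip the key, keep the percentage
            print $2
        }
    ' "${2:-/proc/pressure/memory}"
}
```

Calling psi_avg10 full on a live host returns the share of the last 10 seconds in which all non-idle tasks were stalled on memory. Any alert threshold built on top of this is a judgment call that depends on the workload.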

3. Disk I/O Saturation (High Latency, Throughput Collapse)

Symptoms

  • iostat -x shows %util 95–100%, r_await / w_await in tens or hundreds of ms
  • application-level timeouts / slow queries despite low CPU
  • journal / application logs full of I/O timeout warnings

Most common root causes

  • Writeback storm (dirty pages exceeding dirty_ratio → global flush)
  • Single-threaded fsync-heavy workload (PostgreSQL WAL, MySQL innodb_flush_log_at_trx_commit=1)
  • No I/O scheduler tuning on NVMe (mq-deadline / none vs kyber)
  • RAID rebuild or scrub running in background
  • Thin LVM snapshot overload

Diagnostic steps

  • iostat -xmdz 1 → per-device %util, r_await / w_await, aqu-sz, r/s w/s
  • iotop -o → which process is doing the I/O
  • vmstat 1 → bi/bo columns + wa%
  • perf record -e block:block_rq_issue -ag -- sleep 10; perf report → block layer trace
  • sync; echo 1 > /proc/sys/vm/drop_caches as root (temporary test: does latency drop? Expect a short burst of extra reads while the page cache refills)
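
To check whether a writeback storm is plausible, one option is to watch how much dirty data is waiting to be flushed. The sketch below sums the Dirty and Writeback lines of /proc/meminfo; the function name dirty_writeback_kb and the optional file argument are hypothetical.

```shell
# dirty_writeback_kb [FILE]
# Hypothetical helper: prints Dirty + Writeback from a meminfo-style file,
# in kB. FILE defaults to /proc/meminfo. The exact match on "Writeback:"
# deliberately skips the separate "WritebackTmp:" line.
dirty_writeback_kb() {
    awk '
        $1 == "Dirty:" || $1 == "Writeback:" { sum += $2 }
        END { print sum + 0 }
    ' "${1:-/proc/meminfo}"
}
```

Running it once per second next to iostat shows whether latency spikes line up with large flushes. Comparing the value against the limits implied by vm.dirty_ratio and vm.dirty_background_ratio is left to the reader, since those are percentages of reclaimable memory rather than of MemTotal.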

4. Network Stack Collapse (High Latency / Packet Loss / Connection Refusals)

Symptoms

  • SYN backlog full → connection refused / timeout
  • netstat / ss shows many connections in SYN_RECV / TIME_WAIT
  • tcpdump shows retransmits / out-of-order packets

Most common root causes

  • somaxconn / tcp_max_syn_backlog too small
  • conntrack table overflow (nf_conntrack_count ≈ nf_conntrack_max)
  • BBR vs Cubic mismatch on lossy links
  • CPU-bound softirq processing → single CPU saturated by network RX

Diagnostic steps

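  • ss -s and nstat -az | grep -i retrans → socket summary and retransmit counters
  • cat /proc/net/stat/nf_conntrack → drop, early_drop and insert_failed counters (hexadecimal, one row per CPU)
  • sysctl net.ipv4.tcp_mem, compared with the TCP "mem" value in /proc/net/sockstat (both in pages)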
  • sar -n DEV 1 → packets dropped, overruns
  • ethtool -S eth0 | grep -i drop → hardware/firmware drops
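
The conntrack overflow check (nf_conntrack_count ≈ nf_conntrack_max) is simple enough to script. A minimal sketch follows, assuming the nf_conntrack module is loaded so that both files exist under /proc/sys/net/netfilter; the function name conntrack_pct and the optional file arguments are hypothetical.

```shell
# conntrack_pct [COUNT_FILE] [MAX_FILE]
# Hypothetical helper: prints conntrack table usage as an integer percentage
# (rounded down). Returns non-zero if either file is unreadable or the
# maximum is not a positive number.
conntrack_pct() {
    ct_count=$(cat "${1:-/proc/sys/net/netfilter/nf_conntrack_count}") || return 1
    ct_max=$(cat "${2:-/proc/sys/net/netfilter/nf_conntrack_max}") || return 1
    [ "$ct_max" -gt 0 ] || return 1
    echo $(( ct_count * 100 / ct_max ))
}
```

A value that sits close to 100 during an incident suggests new connections are being dropped at the conntrack layer before they ever reach the listening socket.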

5. Boot / Early Startup Failures

Symptoms

  • Stuck in initramfs busybox
  • Kernel panic “unable to mount root fs”
  • systemd timeout at “A start job is running for…”

Most common root causes

  • LUKS unlock failure (timeout, wrong passphrase in automation)
  • UUID / PARTUUID changed after disk clone / resize
  • Missing module in initramfs after kernel upgrade
  • Mount timeout on _netdev (NFS / iSCSI)

Diagnostic steps

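  • Append to kernel cmdline: break=premount (Ubuntu's initramfs-tools; rd.break is the dracut option used on RHEL-family systems) or systemd.unit=emergency.target
  • journalctl -b -1 from rescue shell (needs a persistent journal; from a live environment, read the installed system's journal with journalctl -D <mountpoint>/var/log/journal -b -1)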
  • lsblk -f, blkid, compare with /etc/fstab
  • update-initramfs -u -k all from chroot
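
The fstab comparison can be partly automated. The sketch below lists UUID= entries in an fstab file that do not appear in a list of UUIDs actually present on the machine (for example, the output of blkid -s UUID -o value saved to a file). The function name fstab_missing_uuids is hypothetical, and the sketch assumes unquoted UUID=... entries; PARTUUID= and LABEL= lines are ignored.

```shell
# fstab_missing_uuids FSTAB UUID_LIST
# Hypothetical helper: prints "<uuid> <mountpoint>" for each UUID= entry in
# FSTAB that is absent from UUID_LIST (one UUID per line).
fstab_missing_uuids() {
    awk '
        FILENAME == ARGV[1] { known[$1] = 1; next }   # load the known UUIDs
        $1 ~ /^#/ { next }                            # skip fstab comments
        $1 ~ /^UUID=/ {
            uuid = substr($1, 6)                      # text after "UUID="
            if (!(uuid in known)) print uuid, $2
        }
    ' "$2" "$1"
}
```

Any line it prints is a mount that will time out or fail at boot, which is the typical outcome after a disk clone or a re-created filesystem.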

Quick Reference – Symptom → First Three Commands

Symptom | Command 1 | Command 2 | Command 3
High CPU, low throughput | mpstat -P ALL 1 | perf record -a -g -- sleep 15 | perf report
Memory pressure / OOM | cat /proc/pressure/memory | smem -t -k | systemd-cgtop
Disk I/O latency | iostat -xmdz 1 | iotop -o | vmstat 1
Network saturation | ss -s | nstat -az | sar -n DEV 1
Boot stuck in initramfs | cat /proc/cmdline | lsblk -f | cryptsetup status <name>
Service won’t start / timeout | systemctl status <unit> | journalctl -u <unit> -n 200 | systemctl list-dependencies <unit>

These patterns cover ~80–85% of production incidents on Ubuntu servers. Once you develop muscle memory for the first three commands per category, most failures become pattern-matching exercises rather than mysteries. The remaining 15–20% usually involve application-specific logic bugs, third-party software regressions, or very subtle kernel/hardware interactions — but even those become much easier to bisect when the OS-layer symptoms are ruled out quickly.

© 2026 Server.HK | Hosting Limited, Hong Kong | Company Registration No. 77008912