Debian Server Troubleshooting Checklist

March 1, 2026

Troubleshooting a Debian server (Debian 12 “bookworm” or Debian 13 “trixie” in 2026) calls for a structured, layered approach. Start with observation (what is the symptom, when did it start, what changed?), move to global system state (resource saturation, logs), then drill down into specific subsystems (services, network, storage, kernel). The goal is to identify the contention domain (CPU scheduler, memory reclaim, I/O queues, softirq overload, or application-level blocking) rather than to guess at fixes.

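This checklist is ordered by speed-to-insight and by the failure patterns most common on production Debian servers. Run commands as root or with sudo. Install the helper tools first if they are missing: apt update && apt install htop atop sysstat netdata ncdu lsof fail2ban. Later steps also use iotop, smartctl (package smartmontools), sensors (package lm-sensors), and perf (package linux-perf).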
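Before installing anything on a struggling server, it can help to see which of the tools are already present. The following is a minimal sketch, not part of the original checklist; the function name missing_tools is hypothetical, and the list holds command names rather than package names (iostat ships in sysstat, smartctl in smartmontools, sensors in lm-sensors).

```shell
#!/bin/sh
# missing_tools CMD...
# Prints each named command that is not found on PATH, one per line.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Commands (not package names) used by this checklist.
missing_tools htop atop iostat ncdu lsof iotop smartctl sensors
```

An empty result means every listed command is available; otherwise install only the packages that provide the missing ones.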

Phase 1: Immediate Safety & Global Snapshot (1–3 minutes)

Confirm the symptom clearly:
  SSH lag? Commands hang? Web pages time out? High load but low CPU? Services down? Reboot loop?
  Note exact error messages, timestamps, and recent changes (updates, config edits, traffic spikes).
Check uptime & load:
  uptime (compare the load average with the core count from nproc; sustained load above 2–3× cores = pressure)
  High load + low CPU = tasks are waiting (I/O, memory reclaim, locks).
Resource overview:
  htop or top (in top, press 1 for the per-core view; look for processes in ‘D’ state = uninterruptible sleep, usually waiting on I/O)
  free -h and vmstat 1 5 (high si/so = swap thrashing; high wa = I/O wait)
  df -hT and df -i (full filesystem or inode exhaustion)
  iostat -xmdz 1 5 (%util near 100% plus high await = storage bottleneck)
Quick log scan:
  journalctl -b -p err..emerg (errors since boot)
  journalctl -xe (recent context)
  dmesg | tail -n 50 (kernel issues, OOM killer)
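The load-versus-cores rule from the uptime step can be sketched as a small shell helper. This is an illustration under stated assumptions: the function name classify_load is hypothetical, and the cutoff of 2× cores is simply the low end of the 2–3× range given above.

```shell
#!/bin/sh
# classify_load LOAD CORES
# Prints "<load per core> <label>". The label is "pressure" when the load
# exceeds 2x the core count (assumed cutoff), otherwise "ok".
classify_load() {
    awk -v load="$1" -v cores="$2" 'BEGIN {
        ratio = load / cores
        label = (ratio > 2) ? "pressure" : "ok"
        printf "%.2f %s\n", ratio, label
    }'
}

# On a live Linux system, feed it the 1-minute load average and the core count.
if [ -r /proc/loadavg ] && command -v nproc >/dev/null 2>&1; then
    classify_load "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
fi
```

A single sample only shows the last minute. To judge whether the load is sustained, compare it with the 5- and 15-minute values (fields 2 and 3 of /proc/loadavg).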

Phase 2: Common High-Impact Checks (Most Frequent Culprits)

Disk / Filesystem Full:
  ncdu / (interactive navigator; spot large directories fast)
  journalctl --disk-usage → vacuum if huge: journalctl --vacuum-time=2weeks
  lsof +L1 | grep deleted (deleted-but-open files still holding space)
  Fix: apt clean, truncate logs, restart the offending process.
Memory Pressure / Reclaim Stalls:
  cat /proc/pressure/memory (non-zero averages on the “some” or “full” line = real stalls)
  sysctl vm.swappiness (60 or higher on a low-RAM server → consider lowering to 10)
  Enable zram if RAM < 8–16 GB.
  High kswapd0 CPU or direct reclaim → application memory leak or misbehaving service.
High I/O Wait / Storage Latency:
  iotop (which process reads/writes most)
  atop (press ‘d’ for per-process disk activity)
  Usual suspects: journal flood, verbose logging, database WAL, Docker overlay2 growth.
  Fix: limit journal size, prune containers, tune DB checkpoints.
Network / Connectivity Issues:
  ip addr, ip route (interface up? default gateway present?)
  ping -c 4 8.8.8.8, then ping -c 4 google.com (IP works but the name fails = DNS failure)
  ss -s (huge TIME_WAIT count or socket backlog)
  sar -n DEV 1 5 (throughput) and sar -n EDEV 1 5 (drops, errors)
  Usual suspects: firewall blocks, bad DNS, MTU mismatch, high softirq.
Service / systemd Failures:
  systemctl --failed (list failed units)
  systemctl status <service> (e.g., nginx, postgresql)
  journalctl -u <service> -b -xe
  Usual suspects: dependency loops, config errors, port conflicts.
SSH / Access Problems:
  journalctl -u ssh -b | grep -i fail (brute-force attempts, key issues)
  Check: fail2ban bans (e.g., fail2ban-client status sshd)? Wrong keys? Port changed? Firewall rule?
Boot / Early Hang:
  systemd-analyze blame and systemd-analyze critical-chain (slow units)
  Remove quiet from the GRUB kernel command line → verbose boot messages
  Initramfs drops to a shell → wrong UUID, LUKS timeout, missing module.
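For the memory pressure check in this phase, the PSI file has one “some” line and one “full” line, each with avg10, avg60, avg300, and total fields. The sketch below pulls out only the avg10 value; the function name psi_avg10 is hypothetical, and the file format is assumed to be the standard kernel PSI layout.

```shell
#!/bin/sh
# psi_avg10 KIND [FILE]
# Prints the avg10 value (percent of the last 10 s spent stalled) from the
# "some" or "full" line of a PSI file. FILE defaults to the memory PSI file.
psi_avg10() {
    awk -v kind="$1" '$1 == kind {
        split($2, kv, "=")   # $2 looks like "avg10=0.00"
        print kv[2]
    }' "${2:-/proc/pressure/memory}"
}

if [ -r /proc/pressure/memory ]; then
    echo "memory some avg10: $(psi_avg10 some)%"
    echo "memory full avg10: $(psi_avg10 full)%"
fi
```

The same helper can read /proc/pressure/io or /proc/pressure/cpu when passed as the second argument, which helps separate memory stalls from I/O stalls.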

Phase 3: Deeper / Workload-Specific Checks

Web / Application Stack:
  Nginx/Apache slow logs, PHP-FPM pool exhaustion
  Bad bots → fail2ban on access logs
  Slow queries → enable the DB slow log, run EXPLAIN ANALYZE
Database:
  PostgreSQL/MySQL → connection limits, autovacuum, WAL size, query planner stats
Containers / Virtualization:
  Docker/Podman → docker system df, prune unused images/volumes
  High steal time (st column in top or vmstat) → VM overcommit
Kernel / Hardware:
  perf top (kernel hotspots)
  Thermal throttling → sensors
  Failing disk → SMART: smartctl -a /dev/sdX
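Steal time from the virtualization check can also be read directly from /proc/stat. The following is a rough sketch, assuming the standard field order of the aggregate cpu line; the function name steal_pct is hypothetical, and the 1-second sampling interval is an arbitrary choice.

```shell
#!/bin/sh
# steal_pct BEFORE AFTER
# BEFORE and AFTER are two samples of the aggregate "cpu" line from /proc/stat.
# Fields 2-9 are user, nice, system, idle, iowait, irq, softirq, steal
# (guest time is already counted inside user/nice, so it is not added again).
steal_pct() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        split(a, x, " "); split(b, y, " ")
        for (i = 2; i <= 9; i++) total += y[i] - x[i]
        steal = y[9] - x[9]
        if (total > 0) printf "%.2f\n", 100 * steal / total
        else print "0.00"
    }'
}

if [ -r /proc/stat ]; then
    before=$(head -n 1 /proc/stat)
    sleep 1
    after=$(head -n 1 /proc/stat)
    echo "steal over 1 s: $(steal_pct "$before" "$after")%"
fi
```

A value that stays well above zero across several samples suggests the hypervisor is withholding CPU time, which no tuning inside the guest will fix.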

Phase 4: Prevention & Monitoring Setup

Install Netdata: real-time dashboard with PSI, I/O, softirq visuals
Prometheus + node_exporter + Grafana (long-term trends, alerts on >85% disk, PSI stalls >100 ms/s)
Schedule weekly: apt update && apt upgrade, journal vacuum (journalctl --vacuum-time=2weeks), forced log rotation (logrotate -f /etc/logrotate.conf)
Backup configs (/etc), data, and test restores
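Until a full monitoring stack is in place, the 85% disk threshold mentioned above can be checked with a few lines of shell. This is a minimal sketch, not a replacement for proper alerting; the function name over_threshold is hypothetical, and it relies on the POSIX output format of df -P.

```shell
#!/bin/sh
# over_threshold PCT
# Reads `df -P` output on stdin and prints "<mount point> <use%>" for every
# filesystem whose usage is above PCT. Mount points containing spaces are
# not handled by this simple field split.
over_threshold() {
    awk -v limit="$1" 'NR > 1 {
        pct = $5
        sub(/%/, "", pct)
        if (pct + 0 > limit + 0) print $6, $5
    }'
}

# Report filesystems above the 85% figure used in this section.
df -P | over_threshold 85
```

No output means nothing is over the limit. Run from cron or a systemd timer, any output could be mailed or logged as a simple early warning.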

This checklist covers ~90% of real-world Debian server incidents: I/O saturation, memory pressure, log floods, bad traffic, misconfigured services. Always change one thing at a time, measure before/after (e.g., with stress-ng or production load), and document findings.
