🐧 Terminal Simulator — Module 3

Process Management & Troubleshooting

When the alert fires at 2 AM, you need to check disk space, hunt rogue processes, and fix permissions — fast.

STEP 1 / 10
df

Alert! Disk Space Critical — Identify the Full Partition

Real-World Scenario

Zabbix just fired a P1 alert: "CRITICAL: /var/log partition 100% full on prod-server-03." The application is failing to write logs and users are reporting 500 errors. You SSH into the server immediately. The first command every senior engineer runs in a disk space emergency is `df -h` — you need to see ALL partitions at a glance and identify which one is full.

Technical Breakdown

`df` (disk free) reports filesystem disk space usage. Without flags, it shows sizes in raw 1K blocks, which are hard to read at a glance. `-h` (human-readable) scales sizes to K, M, or G suffixes (powers of 1024) automatically. `-T` adds the filesystem type column (ext4, xfs, tmpfs). In production, you almost always want `df -h`. Look at the "Use%" column: as a common rule of thumb, anything above 90% is a warning and anything above 95% is critical.

-h       Human-readable sizes (KB, MB, GB) instead of raw blocks.
-T       Show filesystem type (ext4, xfs, tmpfs, etc.).
-i       Show inode usage instead of block usage.
--total  Add a grand total row at the bottom.
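
The Use% thresholds described above could also be checked from a script. The following is a minimal sketch, not part of the module itself: the function name `flag_full_filesystems` is an illustrative assumption, and it reads `df -P` output (the POSIX one-line-per-filesystem format) rather than `df -h`, because that layout is easier to parse. It assumes mount points contain no spaces.

```shell
#!/usr/bin/env bash
# Classify filesystems by Use% with the thresholds from the text:
# above 90% is a warning, above 95% is critical.
# Reads `df -P` style output on stdin so it can be fed sample data.
# flag_full_filesystems is a hypothetical helper name.
flag_full_filesystems() {
  awk 'NR > 1 {
    use = $5
    sub(/%/, "", use)            # strip the trailing percent sign
    if (use + 0 > 95)
      print "CRITICAL " $6 " " use "%"
    else if (use + 0 > 90)
      print "WARNING " $6 " " use "%"
  }'
}

# On a live host:
#   df -P | flag_full_filesystems
```

Filesystems at or below 90% produce no output, so an empty result means nothing crossed a threshold.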

Your Task

Check disk usage in human-readable format. Type: df -h

Quick Guide: Incident Response

Understanding the basics in 30 seconds

How It Works

  • df -h shows filesystem disk usage — identify which partition is full
  • du -sh * | sort -rh ranks files and directories by size to find what is eating disk space
  • ps aux | grep hunts specific processes by name
  • kill sends signals: SIGTERM (15, graceful) or SIGKILL (9, forced)
  • free -h checks RAM and swap usage — look at "available" not "free"
  • top gives real-time CPU, memory, and process monitoring
  • systemctl restart/status manages systemd services
  • journalctl -u reads service logs from the systemd journal
  • chmod controls file permissions — +x adds execute permission
  • chown changes file ownership — critical when root creates files
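
The "available" versus "free" distinction in the list above can be made concrete. On Linux, the "available" column of `free` is based on the `MemAvailable` field in `/proc/meminfo`, which estimates memory usable by new programs without swapping. The sketch below is an illustration only; `mem_available_pct` is a hypothetical helper name, and it reads meminfo-style text on stdin.

```shell
#!/usr/bin/env bash
# Print the percentage of RAM that is still available, as a whole number.
# Input: /proc/meminfo-style lines on stdin (values in kB).
# mem_available_pct is a hypothetical helper name.
mem_available_pct() {
  awk '/^MemTotal:/     { total = $2 }
       /^MemAvailable:/ { avail = $2 }
       END { if (total > 0) printf "%d\n", avail * 100 / total }'
}

# On a live Linux host:
#   mem_available_pct < /proc/meminfo
```

A host can show a very small "free" figure while this percentage is still healthy, because page cache counts against "free" but is largely reclaimable.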

Key Benefits

  • Complete incident response flow from alert to resolution
  • Efficient disk space diagnostics with df + du pipe chains
  • Safe process termination with proper signal escalation
  • Memory monitoring to prevent OOM killer situations
  • Service management with systemctl restart/reload
  • Log analysis with journalctl for post-incident verification
  • Permission and ownership fixes for CI/CD pipelines

Real-World Uses

  • Responding to Zabbix/Prometheus disk space alerts at 2 AM
  • Finding 90GB debug logs eating production disk space
  • Killing runaway cron jobs consuming 85% CPU
  • Restarting Nginx after disk-full crash
  • Verifying clean service startup with journalctl
  • Fixing deploy script permissions after git pull
  • Fixing root-owned log files blocking application writes

The Incident Response Playbook

The Complete Troubleshooting Framework

When an alert fires, senior engineers follow a systematic 10-step approach rather than panicking. This module walks through a complete production incident from the initial Zabbix alert to the final post-mortem. The framework is:

1. df -h → Identify the full partition
2. du -sh * | sort -rh → Find the space hog
3. ps aux | grep → Hunt the rogue process
4. kill, then kill -9 if needed → Terminate the process
5. free -h → Check memory/swap health
6. top → Verify system is stabilizing
7. systemctl restart → Bring crashed services back
8. journalctl -u → Verify clean startup
9. chmod +x → Fix deploy script permissions
10. chown -R → Fix file ownership issues
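
Step 2 in the list above can be sketched as a small helper. This is an illustration under assumptions: `largest_entries` is a hypothetical function name, and `sort -h` requires a sort implementation that understands human-readable sizes (GNU coreutils does). Hidden files are not matched by the `*` glob.

```shell
#!/usr/bin/env bash
# List the largest entries directly under a directory, biggest first.
# Usage: largest_entries DIR [COUNT]
# largest_entries is a hypothetical helper name.
largest_entries() {
  local dir="${1:-.}" count="${2:-5}"
  # du -sh prints one human-readable total per argument;
  # sort -rh orders those sizes from largest to smallest.
  du -sh -- "$dir"/* 2>/dev/null | sort -rh | head -n "$count"
}

# Example:
#   largest_entries /var/log 10
```

Running it repeatedly while descending into the top entry is one way to narrow down from a full partition to a single oversized file.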

🔍 Diagnostics

Identify what's wrong.

  • df -h — Disk usage per partition
  • du -sh /* — Find largest directories
  • ps aux | grep — Find processes
  • free -h — Memory and swap
  • top — Real-time system overview

⚡ Actions

Fix the problem.

  • kill -9 PID — Force kill process
  • systemctl restart — Restart service
  • chmod +x — Fix permissions
  • chown user:group — Fix ownership

✅ Verify

Confirm it's fixed.

  • journalctl -u — Service logs
  • top — CPU/MEM stabilizing
  • df -h — Space recovered
  • ls -la — Verify permissions
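
The restart-then-verify part of the flow might look like the sketch below. `restart_and_verify` is a hypothetical helper name; it relies only on `systemctl restart`, `systemctl is-active --quiet`, and `journalctl -u UNIT -n 20 --no-pager`. Restarting a unit normally requires root privileges, which this sketch does not handle.

```shell
#!/usr/bin/env bash
# Restart a systemd unit, report whether it is active, and show recent logs.
# Usage: restart_and_verify UNIT
# restart_and_verify is a hypothetical helper name.
restart_and_verify() {
  local unit="$1" status=0

  systemctl restart "$unit" || { echo "restart failed: $unit"; return 1; }

  # is-active exits 0 only when the unit is in the "active" state.
  if systemctl is-active --quiet "$unit"; then
    echo "$unit is active"
  else
    echo "$unit is NOT active"
    status=1
  fi

  # Last 20 journal lines for this unit, without opening a pager.
  journalctl -u "$unit" -n 20 --no-pager
  return "$status"
}

# Example:
#   restart_and_verify nginx
```

The journal lines are printed in both cases, since they are most useful precisely when the unit failed to come back.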

Kill Signal Cheat Sheet

Signal Escalation:
  1. kill PID — Send SIGTERM (15). Process can clean up gracefully.
  2. Wait 5 seconds. If still running...
  3. kill -9 PID — Send SIGKILL (9). Kernel terminates immediately.
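  4. Verify: ps -p PID. If no process row is printed, it is gone. (ps aux | grep PID also works, but it can match the grep command itself or unrelated lines containing the same digits.)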
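
The escalation steps above can be sketched as a shell function. This is a minimal illustration, not a hardened tool: `terminate_gracefully` is a hypothetical name, and the default grace period of 5 seconds is taken from the cheat sheet. It uses `kill -0`, which sends no signal and only checks whether the PID can be signalled.

```shell
#!/usr/bin/env bash
# Escalating termination: SIGTERM first, SIGKILL only if the process survives.
# Usage: terminate_gracefully PID [GRACE_SECONDS]
# terminate_gracefully is a hypothetical helper name.
terminate_gracefully() {
  local pid="$1" grace="${2:-5}" waited=0

  # kill -0 delivers no signal; it only tests that the PID exists.
  kill -0 "$pid" 2>/dev/null || { echo "not running"; return 0; }

  kill -TERM "$pid" 2>/dev/null          # step 1: polite request (signal 15)

  while [ "$waited" -lt "$grace" ]; do   # step 2: wait up to the grace period
    sleep 1
    waited=$((waited + 1))
    kill -0 "$pid" 2>/dev/null || { echo "terminated"; return 0; }
  done

  kill -KILL "$pid" 2>/dev/null          # step 3: forced kill (signal 9)
  sleep 1

  # step 4: verify the process is gone
  if kill -0 "$pid" 2>/dev/null; then
    echo "still running"
    return 1
  fi
  echo "killed"
}

# Example:
#   terminate_gracefully 12345 5
```

Giving the process a grace period matters because SIGKILL cannot be caught, so the process gets no chance to flush buffers or remove lock files.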

Permission vs Ownership — Know the Difference

chmod changes what actions are allowed (read, write, execute).

chown changes who the file belongs to (user and group).

Common mistake: a file has 755 permissions but is owned by root. A non-root user can read and execute it but NOT write to it, because the write bit in 755 applies only to the owner (root). Fix: chown devops:devops file (run as root or with sudo), then chmod 755 file if the mode also needs correcting.
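
To see the two settings side by side, the sketch below prints a file's owner and mode before and after a `chmod`. It is an illustration under assumptions: `show_owner_and_mode` and the scratch file `deploy.sh` are hypothetical names, and `stat -c` is the GNU syntax found on most Linux hosts.

```shell
#!/usr/bin/env bash
# Print "owner:group mode" for a file (GNU stat syntax).
# show_owner_and_mode is a hypothetical helper name.
show_owner_and_mode() {
  stat -c '%U:%G %a' -- "$1"
}

# Hypothetical walk-through in a scratch directory.
demo_dir="$(mktemp -d)"
script="$demo_dir/deploy.sh"
printf '#!/bin/sh\necho deploying\n' > "$script"

chmod 644 "$script"              # rw-r--r--: nobody may execute it
show_owner_and_mode "$script"

chmod 755 "$script"              # rwxr-xr-x: owner writes, everyone reads and executes
show_owner_and_mode "$script"

# chmod changed the mode only; the owner:group part of the output is identical.
# Changing ownership is a separate step and needs root when the file
# belongs to another user:
#   sudo chown devops:devops /path/to/file
```

The owner and group stay the same across both calls, which is the point: permission bits and ownership are fixed by different commands.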

© 2026 The Infinity. All rights reserved.