Obsidian-Vault/Work/Projects/QLS-1 Server Crash Analysis 2026-03-04.md
2026-03-07 09:54:02 +01:00

QLS-1 Server Crash Analysis - 2026-03-04

Server Specs

  • Instance: AWS EC2 (aarch64)
  • CPUs: 4
  • RAM: 15.4 GB
  • Swap: 1 GB
  • OS: Ubuntu 22.04, kernel 6.8.0-1029-aws

Summary

The server ran out of memory during a deployment and became completely unresponsive, requiring a reboot around 09:55 UTC. A deployment (unzip + node build) pushed an already critically overloaded system over the edge.

However, the real problem is that a single-site server with a few background processes should not be consuming 14 GB of RAM during normal operation. The deployment was the trigger, but not the underlying cause. Memory usage needs to be profiled and reduced.

Timeline (UTC)

| Time | Event |
| --- | --- |
| 00:00 - 09:36 | Normal operation. Memory already dangerously tight: ~14 GB used, ~700 MB free, swap fully used (1024 MB). |
| 09:37 | Memory pressure intensifies. Used = 14,758 MB, free = 291 MB. Cache shrinking. |
| 09:39 | sessionclean cron + find/sed spawn 59 processes each. Used = 15,168 MB, free = 180 MB. |
| 09:41 | Deployment starts: 233 unzip + 233 rm processes appear. Free = 119 MB. |
| 09:42 | Cache evicted (172 MB), buffers = 9 MB. 1,181 allocation stalls. Page scanning = 886,726. |
| 09:43 | node/esbuild build starts. Redis forks for persistence. D-state jumps to 121. Alloc stalls = 4.4 million. System grinding to a halt. |
| 09:44 - 09:54 | Death spiral. Free ~120 MB, cache = 0, buffers = 0. D-state peaks at 196. Alloc stalls ~20 million/min. Cron jobs pile up (15 to 45). System frozen. |
| 09:55 | Reboot. Used drops to 262 MB, free = 15,251 MB. Swap = 0. |
| 09:56+ | Services restart. Normal operation resumes. |

Key Numbers at Crash Point (09:43)

  • RAM: 15,525 / 15,768 MB used (98.5%)
  • Swap: 1024 / 1024 MB used (100%)
  • Page cache: 112 MB (down from ~2 GB)
  • Buffers: 0 MB
  • D-state processes: 121 (normally ~60)
  • Alloc stalls: 4.4 million / minute

The Actual Problem: Per-Process Memory Breakdown

The deployment was the straw that broke the camel's back. The real issue is baseline memory consumption. This server hosts a single site with a few background processes, yet it was consuming ~14 GB of 15.4 GB available during normal operation with swap completely exhausted.

The atop log contains per-process resident memory (RSS) data. Here is what was running:

Memory by process group at 09:30 UTC (normal operation, pre-crash)

| Process | Count | Threads | Total RSS | Total PSS | Swap | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| redis-server | 1 | 7 | 28,398 MB | 28,390 MB | 448 MB | Single instance using 28 GB RSS |
| php8.4 (CLI) | 27 | 27 | 12,835 MB | 10,798 MB | 534 MB | Queue workers / artisan commands |
| mysqld | 1 | 55 | 8,096 MB* | ~1,500 MB | 2,551 MB | *RSS inflated by mmap'd InnoDB files |
| php-fpm8.4 | 10 | 10 | 2,153 MB | 567 MB | 0 MB | FPM pool (PSS = 567 MB shared) |
| php (CLI) | 1 | 1 | 1,756 MB | 1,638 MB | 0 MB | Single CLI process |
| php-fpm8.3 | 15 | 15 | 131 MB | 5 MB | 489 MB | Old pool, mostly swapped out |
| php-fpm8.2 | 7 | 7 | 69 MB | 11 MB | 221 MB | Old pool, mostly swapped out |
| nginx | 5 | 5 | 205 MB | 63 MB | 6 MB | |
| supervisord | 1 | 1 | 50 MB | 44 MB | 41 MB | |
| snapd | 1 | 12 | 79 MB | 78 MB | 8 MB | |
| amazon-ssm-agent | 1 | 11 | 46 MB | 46 MB | 4 MB | |
| squid | 2 | 2 | 43 MB | 21 MB | 46 MB | |
| meilisearch | 1 | 13 | 39 MB | 33 MB | 27 MB | |
| Everything else | ~55 | - | ~300 MB | - | - | |

Note on RSS vs PSS: RSS counts shared memory and memory-mapped files in full for every process that maps them, so summing RSS across processes overcounts. PSS (Proportional Set Size) divides each shared page evenly among the processes that map it. For PHP-FPM workers, PSS is much lower than RSS because workers share the same code pages. For MySQL, RSS is heavily inflated by memory-mapped InnoDB data files that are backed by disk -- actual private memory usage is ~1.5 GB (8%), not the 8 GB RSS reported by atop.
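To make the RSS/PSS distinction concrete, here is a small worked example. The numbers are invented for illustration, not taken from the atop log:

```python
def pss_mb(private_mb: float, shared_mb: float, sharers: int) -> float:
    """Proportional Set Size: private pages are charged in full, while each
    shared page is split evenly among the processes that map it."""
    return private_mb + shared_mb / sharers

# Hypothetical FPM worker: 50 MB private, plus 1,500 MB of code/opcache
# pages shared by 10 workers. RSS reports the full 1,550 MB per worker;
# PSS charges each worker only its 1/10 share of the shared pages.
rss = 50 + 1500
pss = pss_mb(50, 1500, 10)
print(rss, pss)
```

Summing PSS across processes gives a total that actually fits in physical RAM, which is why the per-group table reports both columns.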

The three big offenders

1. Redis: 28 GB RSS (PID 781977)

This is by far the largest consumer. A single redis-server process with 28 GB RSS on a 15.4 GB server means it has a working set far larger than physical RAM. The kernel is constantly swapping redis pages in and out, contributing to I/O pressure and the high D-state count even during normal operation (~60 D-state processes at all times).

This needs immediate investigation:

  • What is stored in Redis? Is it being used as a primary data store instead of a cache?
  • What is maxmemory set to? If unset, Redis will grow without limit.
  • What eviction policy is configured? (maxmemory-policy)
  • Is save / RDB persistence enabled? Forking for persistence doubles memory briefly.
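The answers to the maxmemory questions can be read straight out of `INFO memory`. A minimal stdlib-only sketch that flags the two dangerous settings; the sample values are invented for illustration:

```python
def parse_redis_info(raw: str) -> dict:
    """Parse the key:value lines emitted by `redis-cli INFO memory`."""
    out = {}
    for line in raw.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            key, _, val = line.partition(":")
            out[key] = val
    return out

def memory_findings(info: dict) -> list:
    """Flag the configurations that let Redis outgrow the host."""
    findings = []
    if info.get("maxmemory", "0") == "0":
        findings.append("maxmemory unset: Redis grows without limit")
    if info.get("maxmemory_policy", "") == "noeviction":
        findings.append("noeviction: writes error out instead of evicting keys")
    return findings

# Invented sample mirroring the shape of `redis-cli INFO memory` output
sample = """# Memory
used_memory:30493372416
maxmemory:0
maxmemory_policy:noeviction"""
for f in memory_findings(parse_redis_info(sample)):
    print(f)
```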

2. PHP 8.4 CLI workers: 27 processes, 10.8 GB PSS

These are likely Laravel Horizon / queue workers. Individual workers range from 220 MB to 2,867 MB RSS. Two processes stand out:

  • PID 165508 (php8.4): 2,867 MB RSS at 09:30, growing to 4,584 MB by 09:41 - a clear memory leak
  • PID 165699 (php): 1,756 MB RSS at 09:30, growing to 4,569 MB by 09:43 - same pattern

These workers appear to have memory leaks. They should be configured to restart after processing N jobs (--max-jobs) or after exceeding a memory limit (--memory).
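The guard Laravel's flags implement can be sketched generically: the worker checks both thresholds after every job and exits so the supervisor respawns a fresh process. The thresholds below are the suggested values from this note, not measured ones:

```python
import resource

def should_restart(jobs_done: int, max_jobs: int = 500,
                   memory_limit_mb: int = 256) -> bool:
    """Mirror the --max-jobs / --memory guards: recycle the worker once
    either threshold is crossed, so a slow leak cannot accumulate for days."""
    # ru_maxrss is reported in KiB on Linux
    rss_mb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
    return jobs_done >= max_jobs or rss_mb >= memory_limit_mb

print(should_restart(500))                        # job limit reached
print(should_restart(10, memory_limit_mb=10**6))  # far below both limits
```

The exit must happen between jobs, never mid-job, which is exactly why the check runs in the loop rather than as a hard ulimit.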

3. MySQL: ~1.5 GB actual usage

The 8 GB RSS reported by atop is misleading -- it includes memory-mapped InnoDB data files backed by disk. Actual private memory usage is ~1.5 GB (8% of RAM as seen in top). MySQL is not a significant contributor to this crash.

Memory progression leading to crash

| Time | Redis RSS | PHP8.4 CLI RSS | MySQL RSS | Free RAM | Event |
| --- | --- | --- | --- | --- | --- |
| 09:30 | 28,398 MB | 12,835 MB | 8,096 MB | 766 MB | Normal |
| 09:38 | 28,514 MB | 14,648 MB | 8,082 MB | 291 MB | PHP workers growing |
| 09:41 | 28,573 MB | 15,279 MB | 8,077 MB | 119 MB | Deployment starts |
| 09:43 | 57,464 MB* | 12,701 MB | 8,030 MB | 129 MB | Redis forked, system dead |

*At 09:43, a second redis-server process appeared (PID 173622, 28,861 MB RSS). This is Redis forking for BGSAVE/BGREWRITEAOF (RDB/AOF persistence). The fork itself is copy-on-write, so pages are shared at first, but the kernel must immediately duplicate the parent's page tables (roughly 56 MB of page-table entries alone for a 28 GB address space), and every page the parent writes while the child saves must then be copied. On a system with ~120 MB free, that memory demand is fatal.

This is the kill shot: Redis persistence fork on an already OOM system.
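A back-of-envelope model of the fork cost, assuming 4 KiB pages and 8-byte page-table entries; the dirty fraction is a guess, since the real write rate during BGSAVE was not measured:

```python
PAGE_KB = 4    # standard page size
PTE_BYTES = 8  # size of one page-table entry

def fork_cost_mb(rss_mb: float, dirty_fraction: float) -> float:
    """Immediate page-table duplication at fork, plus copy-on-write
    duplication of every page written while the child persists the data."""
    page_tables_mb = (rss_mb * 1024 / PAGE_KB) * PTE_BYTES / 1024 / 1024
    cow_mb = rss_mb * dirty_fraction
    return page_tables_mb + cow_mb

# 28,398 MB of Redis, assuming even a modest 5% of pages written during
# the save window: well over a gigabyte of new allocations on a host
# that had ~120 MB free.
print(round(fork_cost_mb(28398, 0.05)))
```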

Immediate Actions

Priority 1: Redis (the root cause of the memory crisis)

  1. Set maxmemory to a reasonable limit (e.g., 2-4 GB). If Redis holds more data than that, it needs to be moved to a dedicated instance or the data model needs rethinking.
  2. Set maxmemory-policy allkeys-lru to enable eviction.
  3. Disable or schedule RDB persistence (save "") to prevent the fork that killed the server. If persistence is needed, use AOF with aof-use-rdb-preamble yes and schedule BGREWRITEAOF during low-traffic windows.
  4. Audit what is stored in Redis. Run redis-cli --bigkeys and redis-cli INFO memory to understand what is consuming 28 GB.
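Steps 1-3 translate to a few lines of redis.conf. The 3 GB limit is an example; size it to the real working set after the audit in step 4:

```
# redis.conf -- illustrative values, tune to the actual working set
maxmemory 3gb
maxmemory-policy allkeys-lru

# Disable RDB snapshots entirely; the BGSAVE fork is what killed the server.
save ""

# If durability is required, prefer AOF and rewrite it off-peak.
appendonly yes
aof-use-rdb-preamble yes
```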

Priority 2: PHP CLI workers (memory leaks)

  1. Add --max-jobs=500 and --memory=256 to queue worker commands to force periodic restarts.
  2. Investigate PIDs 165508 and 165699 specifically. These grew from ~2-3 GB to ~4.5 GB over 13 minutes. That rate of growth suggests either a leak or processing abnormally large jobs.
  3. Reduce worker count. 27 CLI processes is excessive if they each consume 300-500 MB. Start with 5-8 workers and measure throughput.
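Since supervisord is already running on this host, steps 1 and 3 could be expressed there. The program name, worker count, and queue command are assumptions to be matched against the real config:

```ini
[program:queue-worker]
command=php artisan queue:work --max-jobs=500 --memory=256 --sleep=3
numprocs=6
process_name=%(program_name)s_%(process_num)02d
autorestart=true
stopwaitsecs=60
```

With `autorestart=true`, a worker that exits after hitting `--max-jobs` or `--memory` is respawned fresh, which is what keeps the leak bounded.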

Priority 3: MySQL (low priority)

MySQL actual memory usage is ~1.5 GB, not the 8 GB RSS reported by atop. Not a significant contributor to this crash. No action needed unless memory remains tight after fixing Redis and PHP workers.

Priority 4: Clean up legacy PHP pools

  • php-fpm8.3 (15 workers, 489 MB swap) and php-fpm8.2 (7 workers, 221 MB swap) are almost entirely swapped out: they are not actively serving traffic, but they still pin swap space and incur I/O whenever they are occasionally touched. Remove both pools if no sites depend on those PHP versions.

Deployment Hardening

Even after fixing the baseline, deployments should not be able to crash the server:

  1. Serialize unzip/rm operations instead of 233 parallel processes.
  2. Cap node build memory with NODE_OPTIONS=--max-old-space-size=512.
  3. Add a pre-deploy memory check: abort if free memory is below a threshold.
  4. Protect critical services with OOMScoreAdjust=-900 in their systemd units.
  5. Disable Redis persistence during deployments or ensure BGSAVE cannot trigger concurrently.
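For step 3, the check only needs MemAvailable from /proc/meminfo. A minimal sketch; the 1 GB threshold is an assumed starting point, and the sample text stands in for reading the real file:

```python
def mem_available_mb(meminfo_text: str) -> float:
    """Extract MemAvailable (in MB) from /proc/meminfo contents."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1]) / 1024  # value is reported in kB
    raise RuntimeError("MemAvailable not found")

THRESHOLD_MB = 1024  # assumed abort threshold

# During the crash window the host had ~120 MB free; against real data this
# check would have refused to start the 09:41 deployment.
sample = "MemTotal: 16146432 kB\nMemAvailable: 122880 kB\n"
print(mem_available_mb(sample) < THRESHOLD_MB)
```

In the deploy script, read the real /proc/meminfo, call `mem_available_mb`, and abort before unpacking the archive if the result is below the threshold.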

Longer Term

  • Redis belongs on a separate instance (or use ElastiCache) if it genuinely needs 28 GB. Running it on the same 15 GB server as MySQL and PHP is fundamentally unsound.
  • Increasing swap from 1 GB to 2-4 GB buys time but does not fix the root cause.
  • Upgrading to a larger instance is treating the symptom. The workload should fit comfortably in 16 GB with proper limits on Redis and MySQL.
  • Set up memory monitoring and alerting (CloudWatch, Datadog, etc.) so this does not go unnoticed until it crashes.

Data Source

Analysis was performed by parsing the binary atop log file atop_20260304 with Python + atoparser. The file covers 00:00:02 - 09:58:04 UTC at ~1-minute sampling intervals (599 records). Both system-level (memory, CPU, disk) and per-process (RSS, PSS, swap, state) data were extracted.