QLS-1 Server Crash Analysis - 2026-03-04
Server Specs
- Instance: AWS EC2 (aarch64)
- CPUs: 4
- RAM: 15.4 GB
- Swap: 1 GB
- OS: Ubuntu 22.04, kernel 6.8.0-1029-aws
Summary
The server ran out of memory during a deployment and became completely unresponsive, requiring a reboot around 09:55 UTC. A deployment (unzip + node build) pushed an already critically overloaded system over the edge.
However, the real problem is that a single-site server with a few background processes should not be consuming 14 GB of RAM during normal operation. The deployment was the trigger, but not the underlying cause. Memory usage needs to be profiled and reduced.
Timeline (UTC)
| Time | Event |
|---|---|
| 00:00 - 09:36 | Normal operation. Memory already dangerously tight: ~14 GB used, ~700 MB free, swap fully used (1024 MB). |
| 09:37 | Memory pressure intensifies. Used = 14,758 MB, free = 291 MB. Cache shrinking. |
| 09:39 | sessionclean cron + find/sed spawn 59 processes each. Used = 15,168 MB, free = 180 MB. |
| 09:41 | Deployment starts: 233 unzip + 233 rm processes appear. Free = 119 MB. |
| 09:42 | Cache evicted (172 MB), buffers = 9 MB. 1,181 allocation stalls. Page scanning = 886,726. |
| 09:43 | node/esbuild build starts. Redis forks for persistence. D-state jumps to 121. Alloc stalls = 4.4 million. System grinding to a halt. |
| 09:44 - 09:54 | Death spiral. Free ~120 MB, cache = 0, buffers = 0. D-state peaks at 196. Alloc stalls ~20 million/min. Cron jobs pile up (15 to 45). System frozen. |
| 09:55 | Reboot. Used drops to 262 MB, free = 15,251 MB. Swap = 0. |
| 09:56+ | Services restart. Normal operation resumes. |
Key Numbers at Crash Point (09:43)
- RAM: 15,525 / 15,768 MB used (98.5%)
- Swap: 1024 / 1024 MB used (100%)
- Page cache: 112 MB (down from ~2 GB)
- Buffers: 0 MB
- D-state processes: 121 (normally ~60)
- Alloc stalls: 4.4 million / minute
The Actual Problem: Per-Process Memory Breakdown
The deployment was the straw that broke the camel's back. The real issue is baseline memory consumption. This server hosts a single site with a few background processes, yet it was consuming ~14 GB of 15.4 GB available during normal operation with swap completely exhausted.
The atop log contains per-process resident memory (RSS) data. Here is what was running:
Memory by process group at 09:30 UTC (normal operation, pre-crash)
| Process | Count | Threads | Total RSS | Total PSS | Swap | Notes |
|---|---|---|---|---|---|---|
| redis-server | 1 | 7 | 28,398 MB | 28,390 MB | 448 MB | Single instance using 28 GB RSS |
| php8.4 (CLI) | 27 | 27 | 12,835 MB | 10,798 MB | 534 MB | Queue workers / artisan commands |
| mysqld | 1 | 55 | 8,096 MB* | ~1,500 MB | 2,551 MB | *RSS inflated by mmap'd InnoDB files |
| php-fpm8.4 | 10 | 10 | 2,153 MB | 567 MB | 0 MB | FPM pool (PSS=567 MB shared) |
| php (CLI) | 1 | 1 | 1,756 MB | 1,638 MB | 0 MB | Single CLI process |
| php-fpm8.3 | 15 | 15 | 131 MB | 5 MB | 489 MB | Old pool, mostly swapped out |
| php-fpm8.2 | 7 | 7 | 69 MB | 11 MB | 221 MB | Old pool, mostly swapped out |
| nginx | 5 | 5 | 205 MB | 63 MB | 6 MB | |
| supervisord | 1 | 1 | 50 MB | 44 MB | 41 MB | |
| snapd | 1 | 12 | 79 MB | 78 MB | 8 MB | |
| amazon-ssm-agent | 1 | 11 | 46 MB | 46 MB | 4 MB | |
| squid | 2 | 2 | 43 MB | 21 MB | 46 MB | |
| meilisearch | 1 | 13 | 39 MB | 33 MB | 27 MB | |
| Everything else | ~55 | - | ~300 MB | - | - | |
Note on RSS vs PSS: RSS counts shared memory and memory-mapped files in every process that maps them, leading to overcounting. PSS (Proportional Set Size) divides shared pages among users. For PHP-FPM workers, PSS is much lower because workers share the same code pages. For MySQL, RSS is heavily inflated by memory-mapped InnoDB data files that are backed by disk -- actual private memory usage is ~1.5 GB (8%), not the 8 GB RSS reported by atop.
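The distinction can be verified directly on a live Linux box: since kernel 4.14, `/proc/<pid>/smaps_rollup` exposes both totals for a process. A minimal sketch:

```shell
# Compare RSS and PSS for one process (here: the current shell, $$).
# Values are in kB. PSS <= RSS, because PSS divides each shared page
# among all processes mapping it instead of charging it in full.
pid=$$
rss=$(awk '/^Rss:/ {sum += $2} END {print sum}' /proc/$pid/smaps_rollup)
pss=$(awk '/^Pss:/ {sum += $2} END {print sum}' /proc/$pid/smaps_rollup)
echo "PID $pid: RSS=${rss} kB PSS=${pss} kB"
```

Running this across all workers of an FPM pool and summing PSS gives the honest per-pool cost; summing RSS double-counts the code pages the workers share.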
The three big offenders
1. Redis: 28 GB RSS (PID 781977)
This is by far the largest consumer. A single redis-server process with 28 GB RSS on a 15.4 GB server means it has a working set far larger than physical RAM. The kernel is constantly swapping redis pages in and out, contributing to I/O pressure and the high D-state count even during normal operation (~60 D-state processes at all times).
This needs immediate investigation:
- What is stored in Redis? Is it being used as a primary data store instead of a cache?
- What is `maxmemory` set to? If unset, Redis will grow without limit.
- What eviction policy is configured (`maxmemory-policy`)?
- Is `save` / RDB persistence enabled? Forking for persistence briefly doubles memory demand.
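Assuming shell access to `redis-cli` on the box, all of these questions can be answered non-destructively. These are standard `redis-cli` invocations; the server's actual settings are unknown:

```shell
# Current footprint and fragmentation
redis-cli INFO memory | grep -E 'used_memory_human|used_memory_peak_human|mem_fragmentation_ratio'

# Is a memory cap and eviction policy configured?
redis-cli CONFIG GET maxmemory
redis-cli CONFIG GET maxmemory-policy

# Is RDB snapshotting enabled? (an empty "save" value means disabled)
redis-cli CONFIG GET save

# Sample the keyspace for the largest keys (SCAN-based, safe on a live instance)
redis-cli --bigkeys
```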
2. PHP 8.4 CLI workers: 27 processes, 10.8 GB PSS
These are likely Laravel Horizon / queue workers. Individual workers range from 220 MB to 2,867 MB RSS. Two processes stand out:
- PID 165508 (php8.4): 2,867 MB RSS at 09:30, growing to 4,584 MB by 09:41 - a clear memory leak
- PID 165699 (php): 1,756 MB RSS at 09:30, growing to 4,569 MB by 09:43 - same pattern
These workers appear to have memory leaks. They should be configured to restart after processing N jobs (--max-jobs) or after exceeding a memory limit (--memory).
3. MySQL: ~1.5 GB actual usage
The 8 GB RSS reported by atop is misleading -- it includes memory-mapped InnoDB data files backed by disk. Actual private memory usage is ~1.5 GB (8% of RAM as seen in top). MySQL is not a significant contributor to this crash.
Memory progression leading to crash
| Time | Redis RSS | PHP8.4 CLI RSS | MySQL RSS | Free RAM | Event |
|---|---|---|---|---|---|
| 09:30 | 28,398 MB | 12,835 MB | 8,096 MB | 766 MB | Normal |
| 09:38 | 28,514 MB | 14,648 MB | 8,082 MB | 291 MB | PHP workers growing |
| 09:41 | 28,573 MB | 15,279 MB | 8,077 MB | 119 MB | Deployment starts |
| 09:43 | 57,464 MB* | 12,701 MB | 8,030 MB | 129 MB | Redis forked, system dead |
*At 09:43, a second redis-server process appeared (PID 173622, 28,861 MB RSS). This is Redis forking for BGSAVE/BGREWRITEAOF (RDB/AOF persistence). On a system with no free memory, forking a 28 GB process is fatal because of copy-on-write page faults causing massive memory demand.
This is the kill shot: Redis persistence fork on an already OOM system.
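Whether persistence forks are firing, and what the last one cost, can be read from Redis itself; `latest_fork_usec` is worth watching on a box this tight:

```shell
# Background-save state and the cost of the most recent fork(2)
redis-cli INFO persistence | grep -E 'rdb_bgsave_in_progress|rdb_last_bgsave_status|latest_fork_usec|aof_rewrite_in_progress'

# fork() copies page tables, not the 28 GB of data; it is the
# copy-on-write duplication of pages dirtied during the save that
# creates the memory demand. With strict overcommit the fork can
# also fail outright:
cat /proc/sys/vm/overcommit_memory
```

Redis's own documentation recommends `vm.overcommit_memory = 1` for exactly this scenario; that keeps the fork from failing, but it does nothing about the copy-on-write demand that follows on a host with zero headroom.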
Immediate Actions
Priority 1: Redis (the root cause of the memory crisis)
- Set `maxmemory` to a reasonable limit (e.g., 2-4 GB). If Redis holds more data than that, it needs to be moved to a dedicated instance or the data model needs rethinking.
- Set `maxmemory-policy allkeys-lru` to enable eviction.
- Disable or schedule RDB persistence (`save ""`) to prevent the fork that killed the server. If persistence is needed, use AOF with `aof-use-rdb-preamble yes` and schedule `BGREWRITEAOF` during low-traffic windows.
- Audit what is stored in Redis. Run `redis-cli --bigkeys` and `redis-cli INFO memory` to understand what is consuming 28 GB.
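Collected into configuration, Priority 1 is a handful of lines. A sketch of the relevant `redis.conf` settings; the 2 GB cap is an assumption to be validated against the key audit:

```ini
# redis.conf -- sketch; limits assumed, validate against the --bigkeys audit
maxmemory 2gb
maxmemory-policy allkeys-lru

# Disable RDB snapshots (the fork that killed the server)
save ""

# If durability is required, prefer AOF and rewrite it off-peak
appendonly yes
aof-use-rdb-preamble yes
```

One caveat: if the audit shows Redis is acting as a primary data store rather than a cache, `allkeys-lru` eviction would silently drop data, which is why the audit has to come first.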
Priority 2: PHP CLI workers (memory leaks)
- Add `--max-jobs=500` and `--memory=256` to queue worker commands to force periodic restarts.
- Investigate PIDs 165508 and 165699 specifically. These grew from ~2-3 GB to ~4.5 GB over 13 minutes; that rate of growth suggests either a leak or abnormally large jobs.
- Reduce worker count. 27 CLI processes is excessive if they each consume 300-500 MB. Start with 5-8 workers and measure throughput.
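If the workers run under the supervisord instance visible in the process table, the restart limits and the reduced count fit into a single program block. A sketch assuming Laravel's `queue:work`; the program name and paths are placeholders:

```ini
; /etc/supervisor/conf.d/queue-worker.conf -- sketch, paths and names assumed
[program:queue-worker]
command=php8.4 /var/www/app/artisan queue:work --max-jobs=500 --memory=256 --sleep=3
numprocs=6
process_name=%(program_name)s_%(process_num)02d
autostart=true
autorestart=true
stopwaitsecs=60
```

`--memory` makes a worker exit cleanly once the limit is exceeded and supervisord's `autorestart` brings it back, so a leak is bounded instead of fatal.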
Priority 3: MySQL (low priority)
MySQL actual memory usage is ~1.5 GB, not the 8 GB RSS reported by atop. Not a significant contributor to this crash. No action needed unless memory remains tight after fixing Redis and PHP workers.
Priority 4: Clean up legacy PHP pools
- php-fpm8.3 (15 workers, 489 MB swap) and php-fpm8.2 (7 workers, 221 MB swap) are almost entirely swapped out, meaning they are not actively used but still consuming swap space and incurring I/O when occasionally touched. Remove them if no sites use those versions.
Deployment Hardening
Even after fixing the baseline, deployments should not be able to crash the server:
- Serialize unzip/rm operations instead of 233 parallel processes.
- Cap node build memory with `NODE_OPTIONS=--max-old-space-size=512`.
- Add a pre-deploy memory check: abort if free memory is below a threshold.
- Protect critical services with `OOMScoreAdjust=-900` in their systemd units.
- Disable Redis persistence during deployments, or ensure `BGSAVE` cannot trigger concurrently.
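The pre-deploy check is a few lines of shell. A sketch reading `MemAvailable` from `/proc/meminfo`; the 2 GB threshold is an assumption to tune per host:

```shell
#!/bin/sh
# Abort the deployment if available memory is below a threshold.
# MemAvailable is the kernel's estimate of memory usable without swapping.
threshold_kb=$((2 * 1024 * 1024))   # 2 GB -- assumed, tune for this host

avail_kb=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)

if [ "$avail_kb" -lt "$threshold_kb" ]; then
    echo "deploy aborted: only ${avail_kb} kB available (< ${threshold_kb} kB)" >&2
    exit 1
fi
echo "memory check passed: ${avail_kb} kB available"
```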
Longer Term
- Redis belongs on a separate instance (or use ElastiCache) if it genuinely needs 28 GB. Running it on the same 15 GB server as MySQL and PHP is fundamentally unsound.
- Increasing swap from 1 GB to 2-4 GB buys time but does not fix the root cause.
- Upgrading to a larger instance is treating the symptom. The workload should fit comfortably in 16 GB with proper limits on Redis and MySQL.
- Set up memory monitoring and alerting (CloudWatch, Datadog, etc.) so this does not go unnoticed until it crashes.
Data Source
Analysis performed by parsing the binary atop log file atop_20260304 using Python + atoparser. The file covers 00:00:02 - 09:58:04 UTC with ~1 minute sampling intervals (599 records). Both system-level (memory, CPU, disk) and per-process (RSS, PSS, swap, state) data was extracted.