From 33c80f4afdd9a79aed8591f6b4b8fb9be5307140 Mon Sep 17 00:00:00 2001
From: Vincent Verbruggen
Date: Fri, 13 Mar 2026 09:42:30 +0100
Subject: [PATCH] Add detailed server provisioning checklist and analysis guide for atop logs

---
 Work/Inbox/Servers.md | 69 +++++-
 .../Analysing atop Logs with Python.md | 215 ++++++++++++++++++
 2 files changed, 276 insertions(+), 8 deletions(-)
 create mode 100644 Work/Resources/Analysing atop Logs with Python.md

diff --git a/Work/Inbox/Servers.md b/Work/Inbox/Servers.md
index f9d9efb..736e57a 100644
--- a/Work/Inbox/Servers.md
+++ b/Work/Inbox/Servers.md
@@ -1,8 +1,61 @@
-Use forge to create server
-Tag the ec2 instance and the root storage
-After creation add elastic ip
-Add monitoring in forge
-Update root volume to gp3
-Install atop
-enable aws backup
-Setup forge database baclups
\ No newline at end of file
+# Server Provisioning Checklist
+
+## AWS / Forge Setup
+
+- [ ] Use Forge to create server
+- [ ] Tag the EC2 instance and the root storage
+- [ ] After creation add elastic IP
+- [ ] Add monitoring in Forge
+- [ ] Update root volume to gp3
+- [ ] Enable AWS backup
+- [ ] Setup Forge database backups
+- [ ] Set up SSH key access for team members
+
+## OS Tooling
+
+- [ ] Install atop (`apt install atop`, verify it runs via systemd and writes to `/var/log/atop/`)
+- [ ] Install htop (`apt install htop`)
+- [ ] Install gdu or ncdu (`apt install gdu` or `apt install ncdu`) for disk usage analysis
+
+## Redis Hardening
+
+- [ ] Set `maxmemory` to an appropriate limit (e.g. 2gb for a 16GB server)
+- [ ] Set `maxmemory-policy allkeys-lru`
+- [ ] Disable RDB persistence if not needed (`save ""`) to prevent fork-based OOM
+- [ ] Persist config: `redis-cli CONFIG REWRITE`
+- [ ] Verify config survives reboot: check `/etc/redis/redis.conf` directly
+
+## Laravel / Horizon / Pulse
+
+- [ ] Verify Horizon trim settings in `config/horizon.php` (recent/completed: 60 min or less)
+- [ ] If Pulse is enabled, ensure `pulse:work` is running in supervisor
+- [ ] If Pulse is not used, disable it entirely (remove provider or `PULSE_ENABLED=false`)
+- [ ] Set queue worker memory limits (`--memory=256`) and max jobs (`--max-jobs=500`)
+
+## PHP-FPM
+
+- [ ] Remove unused PHP-FPM pools/versions (only keep the version the site uses)
+- [ ] Tune `pm.max_children` based on available RAM and per-worker memory usage
+
+## Swap
+
+- [ ] Verify swap is configured (at least 2 GB for a 16GB server)
+- [ ] Check `vm.swappiness` is set appropriately (default 60 is fine for most cases)
+
+## Security
+
+- [ ] Verify UFW is enabled and only allows necessary ports (22, 80, 443)
+- [ ] Disable password-based SSH login (`PasswordAuthentication no`)
+- [ ] Verify unattended-upgrades is enabled for security patches
+
+## Deployment
+
+- [ ] Verify deployment script does not spawn hundreds of parallel processes (serialize unzip/rm)
+- [ ] Cap node build memory: `NODE_OPTIONS=--max-old-space-size=512` in deploy script
+- [ ] Test a deploy on the new server before going live
+
+## Monitoring / Alerting
+
+- [ ] Set up memory usage alerting (CloudWatch, Forge, or similar) so OOM situations are caught before they crash the server
+- [ ] Set up disk usage alerting (logs and atop files can fill disks over time)
+- [ ] Configure atop log retention (`/etc/default/atop`, default keeps 28 days)
diff --git a/Work/Resources/Analysing atop Logs with Python.md b/Work/Resources/Analysing atop Logs with Python.md
new file mode 100644
index 0000000..27956f8
--- /dev/null
+++ b/Work/Resources/Analysing atop Logs with Python.md
@@ -0,0 +1,215 @@
+# Analysing atop Binary Logs with Python
+
+## Overview
+
+atop writes binary log files (typically `/var/log/atop/atop_YYYYMMDD`) that contain per-minute snapshots of system and per-process stats. These can be read on a remote machine using Python without needing atop installed locally.
+
+## Prerequisites
+
+```bash
+python3 -m venv /tmp/atop_venv
+source /tmp/atop_venv/bin/activate
+pip install atoparser
+```
+
+The `atoparser` package provides struct definitions but we parse the binary directly for flexibility.
+
+## File Format
+
+| Component | Size (bytes) | Notes |
+|---|---|---|
+| Raw header | 480 | File-level metadata, magic `0xfeedbeef` |
+| Per-record header | 96 | Timestamp, compressed data lengths, process counts |
+| System stats (sstat) | variable | zlib-compressed system memory/CPU/disk data |
+| Process stats (pstat) | variable | zlib-compressed per-process data, each entry 840 bytes (TStat) |
+
+Records are sequential: `[raw header][rec1 header][rec1 sstat][rec1 pstat][rec2 header][rec2 sstat][rec2 pstat]...`
+
+## Record Header Layout (96 bytes)
+
+```python
+curtime = struct.unpack(' 0 else '?'
+        vmem = struct.unpack(' 0 and isproc == 1 and rmem > 0:
+            procs.append({
+                'pid': pid, 'ppid': ppid, 'name': name,
+                'nthr': nthr, 'state': state,
+                'vmem_mb': vmem * pagesize / (1024*1024),
+                'rmem_mb': rmem * pagesize / (1024*1024),
+                'pmem_mb': pmem * pagesize / (1024*1024),
+                'vswap_mb': vswap * pagesize / (1024*1024),
+            })
+    return procs
+
+# Iterate through all records
+pos = rawheadlen
+while pos + rawreclen <= len(data):
+    rec = data[pos:pos+rawreclen]
+    curtime = struct.unpack(' 1780000000:
+        # Try to find next valid record by scanning forward
+        found = False
+        for skip in range(4, 500, 4):
+            if pos + skip + 4 > len(data): break
+            ts_val = struct.unpack('8.0f}M PSS={stats['pmem']:>8.0f}M Swap={stats['vswap']:>6.0f}M")
+
+    pos = pstat_start + pcomplen
+```
+
+## Notes on RSS vs PSS
+
+- **RSS (Resident Set Size):** Physical RAM mapped into the process. Includes shared libraries and mmap'd files. Over-counts shared memory (counted in full for every process that maps it).
+- **PSS (Proportional Set Size):** Shared pages divided by the number of processes sharing them. More accurate for total memory accounting.
+- **MySQL RSS is misleading:** InnoDB mmap's its data files, inflating RSS by gigabytes. Actual private memory is much lower (check with `top` or `smem`).
+- **Redis RSS is accurate:** Redis stores data in heap (anonymous) memory, so RSS closely reflects real usage.
+- **PHP-FPM RSS over-counts:** Workers share PHP code pages. PSS shows true per-worker cost.
+
+## Gotchas
+
+1. The timestamp validation range needs adjusting per file. Use `date -d @TIMESTAMP` to check.
+2. Some records may have alignment gaps between them -- the skip-forward loop handles this.
+3. The `isproc` field at offset 64 distinguishes processes from threads. Filter by `isproc == 1` to avoid double-counting thread memory.
+4. The `name` field is truncated to 15 characters. Long process names like `php-fpm8.4` fit, but `amazon-ssm-agent` becomes `amazon-ssm-agen`.
+5. All memory values in the TStat struct are in pages (4096 bytes on most Linux systems).
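
Gotchas 1, 4 and 5 can be sketched as small helpers. This is a minimal illustration rather than part of the parsing script: the names `pages_to_mb`, `logged_name` and `timestamp_window` are hypothetical, the 4096-byte page size is an assumption about the logged server (not the machine doing the analysis), and the timestamp window assumes the date in the `atop_YYYYMMDD` file name with a day of slack on either side to absorb time zone differences.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumption: page size of the server that wrote the log, not of the analysis machine.
PAGE_SIZE = 4096
# The process name field holds at most 15 characters (gotcha 4).
NAME_LEN = 15


def pages_to_mb(pages, pagesize=PAGE_SIZE):
    """Convert a page count from a TStat memory field to MiB."""
    return pages * pagesize / (1024 * 1024)


def logged_name(full_name):
    """Return the process name as it would appear after truncation in the log."""
    return full_name[:NAME_LEN]


def timestamp_window(filename, slack_hours=24):
    """Return (lo, hi) epoch seconds bracketing the day in an atop_YYYYMMDD name.

    The day is interpreted as UTC; the slack covers other time zones.
    """
    match = re.search(r"atop_(\d{8})$", filename)
    if match is None:
        raise ValueError(f"no YYYYMMDD date in {filename!r}")
    day = datetime.strptime(match.group(1), "%Y%m%d").replace(tzinfo=timezone.utc)
    slack = timedelta(hours=slack_hours)
    lo = day - slack
    hi = day + timedelta(days=1) + slack
    return int(lo.timestamp()), int(hi.timestamp())
```

A window derived this way could stand in for hand-edited bounds in the timestamp validation, and `logged_name("amazon-ssm-agent")` gives the string to search for when filtering by process name.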