Add detailed server provisioning checklist and analysis guide for atop logs

2026-03-13 09:42:30 +01:00
parent a4fa80bf74
commit 33c80f4afd
2 changed files with 276 additions and 8 deletions
--- a/Work/Inbox/Servers.md
+++ b/Work/Inbox/Servers.md
@@ -1,8 +1,61 @@
-Use forge to create server
-Tag the ec2 instance and the root storage
-After creation add elastic ip
-Add monitoring in forge
-Update root volume to gp3
-Install atop
-enable aws backup
-Setup forge database baclups
+# Server Provisioning Checklist
+
+## AWS / Forge Setup
+
+- [ ] Use Forge to create server
+- [ ] Tag the EC2 instance and the root storage
+- [ ] After creation, attach an Elastic IP
+- [ ] Add monitoring in Forge
+- [ ] Update root volume to gp3
+- [ ] Enable AWS Backup
+- [ ] Set up Forge database backups
+- [ ] Set up SSH key access for team members
+
+## OS Tooling
+
+- [ ] Install atop (`apt install atop`, verify it runs via systemd and writes to `/var/log/atop/`)
+- [ ] Install htop (`apt install htop`)
+- [ ] Install gdu or ncdu (`apt install gdu` or `apt install ncdu`) for disk usage analysis
+
+## Redis Hardening
+
+- [ ] Set `maxmemory` to an appropriate limit (e.g. `2gb` for a 16 GB server)
+- [ ] Set `maxmemory-policy allkeys-lru`
+- [ ] Disable RDB persistence if not needed (`save ""`) to prevent fork-based OOM
+- [ ] Persist config: `redis-cli CONFIG REWRITE`
+- [ ] Verify config survives reboot: check `/etc/redis/redis.conf` directly
+
+## Laravel / Horizon / Pulse
+
+- [ ] Verify Horizon trim settings in `config/horizon.php` (`trim.recent` and `trim.completed`: 60 minutes or less)
+- [ ] If Pulse is enabled, ensure `pulse:work` is running in supervisor
+- [ ] If Pulse is not used, disable it entirely (remove provider or `PULSE_ENABLED=false`)
+- [ ] Set queue worker memory limits (`--memory=256`) and max jobs (`--max-jobs=500`)
+
+## PHP-FPM
+
+- [ ] Remove unused PHP-FPM pools/versions (only keep the version the site uses)
+- [ ] Tune `pm.max_children` based on available RAM and per-worker memory usage
+
+## Swap
+
+- [ ] Verify swap is configured (at least 2 GB for a 16 GB server)
+- [ ] Check `vm.swappiness` is set appropriately (default 60 is fine for most cases)
+
+## Security
+
+- [ ] Verify UFW is enabled and only allows necessary ports (22, 80, 443)
+- [ ] Disable password-based SSH login (`PasswordAuthentication no`)
+- [ ] Verify unattended-upgrades is enabled for security patches
+
+## Deployment
+
+- [ ] Verify deployment script does not spawn hundreds of parallel processes (serialize unzip/rm)
+- [ ] Cap Node.js build memory: `NODE_OPTIONS=--max-old-space-size=512` in deploy script
+- [ ] Test a deploy on the new server before going live
+
+## Monitoring / Alerting
+
+- [ ] Set up memory usage alerting (CloudWatch, Forge, or similar) so OOM situations are caught before they crash the server
+- [ ] Set up disk usage alerting (logs and atop files can fill disks over time)
+- [ ] Configure atop log retention (`/etc/default/atop`, default keeps 28 days)
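
The PHP-FPM item on tuning `pm.max_children` "based on available RAM and per-worker memory usage" comes down to one division. The sketch below is one rough way to do that arithmetic, not a rule from the checklist itself. The function name `estimate_max_children`, the 6 GB reserved for Redis, the database, and the OS, and the 80 MB average worker size are all hypothetical; measure real per-worker memory (for example from atop or htop) before picking a value.

```python
def estimate_max_children(total_ram_mb: int, reserved_mb: int, per_worker_mb: int) -> int:
    """Estimate pm.max_children as (RAM left for PHP-FPM) / (average worker size).

    All inputs are in megabytes. reserved_mb is memory set aside for
    everything that is not PHP-FPM (Redis, database, queue workers, OS).
    """
    if per_worker_mb <= 0:
        raise ValueError("per_worker_mb must be positive")
    available_mb = total_ram_mb - reserved_mb
    # Never return less than one worker, even if the reservation
    # already exceeds the total RAM.
    return max(1, available_mb // per_worker_mb)


if __name__ == "__main__":
    # Hypothetical figures: 16 GB server, 6 GB reserved, 80 MB per worker.
    print(estimate_max_children(16 * 1024, 6 * 1024, 80))
```

Integer division rounds down on purpose: a slightly low `pm.max_children` queues requests, while a slightly high one can push the server into swap or the OOM killer.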