Add detailed server provisioning checklist and analysis guide for atop logs

# Server Provisioning Checklist

## AWS / Forge Setup

- [ ] Use Forge to create the server
- [ ] Tag the EC2 instance and the root storage volume
- [ ] After creation, attach an Elastic IP
- [ ] Add monitoring in Forge
- [ ] Update the root volume to gp3
- [ ] Enable AWS Backup
- [ ] Set up Forge database backups
- [ ] Set up SSH key access for team members

## OS Tooling

- [ ] Install atop (`apt install atop`; verify it runs via systemd and writes to `/var/log/atop/`)
- [ ] Install htop (`apt install htop`)
- [ ] Install gdu or ncdu (`apt install gdu` or `apt install ncdu`) for disk usage analysis

## Redis Hardening

- [ ] Set `maxmemory` to an appropriate limit (e.g. 2gb for a 16GB server)
- [ ] Set `maxmemory-policy allkeys-lru`
- [ ] Disable RDB persistence if not needed (`save ""`) to prevent fork-based OOM
- [ ] Persist the config: `redis-cli CONFIG REWRITE`
- [ ] Verify the config survives reboot: check `/etc/redis/redis.conf` directly

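Taken together, the Redis items above amount to a small `redis.conf` fragment. A sketch (the 2gb limit is an example from the checklist; size it for your own server):

```conf
# /etc/redis/redis.conf -- example values, tune per server
maxmemory 2gb
maxmemory-policy allkeys-lru
# disable RDB snapshots entirely (avoids fork-based BGSAVE memory spikes)
save ""
```

Applying the same values via `redis-cli CONFIG SET` plus `CONFIG REWRITE` keeps the running instance and the file in sync.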
## Laravel / Horizon / Pulse

- [ ] Verify Horizon trim settings in `config/horizon.php` (recent/completed: 60 min or less)
- [ ] If Pulse is enabled, ensure `pulse:work` is running in supervisor
- [ ] If Pulse is not used, disable it entirely (remove provider or `PULSE_ENABLED=false`)
- [ ] Set queue worker memory limits (`--memory=256`) and max jobs (`--max-jobs=500`)

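As a sketch, a supervisor program entry carrying the worker limits above might look like this (the program name, site path, and PHP binary are hypothetical; Forge generates its own equivalents):

```ini
; /etc/supervisor/conf.d/queue-worker.conf -- hypothetical names/paths
[program:queue-worker]
command=php /home/forge/example.com/artisan queue:work --memory=256 --max-jobs=500
autostart=true
autorestart=true
stopwaitsecs=60
```

The `--memory` and `--max-jobs` flags make workers exit and restart periodically, so slow leaks never accumulate.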
## PHP-FPM

- [ ] Remove unused PHP-FPM pools/versions (only keep the version the site uses)
- [ ] Tune `pm.max_children` based on available RAM and per-worker memory usage

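A rough way to derive `pm.max_children`: divide the RAM you can spare for PHP by the average per-worker RSS. All numbers below are illustrative; measure your own workers with `ps` or `smem` first:

```python
total_ram_mb = 16384   # 16 GB server
reserved_mb = 6144     # headroom for MySQL, Redis, OS cache (illustrative)
per_worker_mb = 60     # average PHP-FPM worker RSS (measure, don't guess)

max_children = (total_ram_mb - reserved_mb) // per_worker_mb
print(max_children)  # 170
```

Round down and leave margin: a traffic spike at exactly `max_children` workers should not push the box into swap.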
## Swap

- [ ] Verify swap is configured (at least 2 GB for a 16GB server)
- [ ] Check `vm.swappiness` is set appropriately (default 60 is fine for most cases)

## Security

- [ ] Verify UFW is enabled and only allows necessary ports (22, 80, 443)
- [ ] Disable password-based SSH login (`PasswordAuthentication no`)
- [ ] Verify unattended-upgrades is enabled for security patches

## Deployment

- [ ] Verify deployment script does not spawn hundreds of parallel processes (serialize unzip/rm)
- [ ] Cap node build memory: `NODE_OPTIONS=--max-old-space-size=512` in deploy script
- [ ] Test a deploy on the new server before going live

## Monitoring / Alerting

- [ ] Set up memory usage alerting (CloudWatch, Forge, or similar) so OOM situations are caught before they crash the server
- [ ] Set up disk usage alerting (logs and atop files can fill disks over time)
- [ ] Configure atop log retention (`/etc/default/atop`, default keeps 28 days)

---

# Analysing atop Binary Logs with Python

## Overview

atop writes binary log files (typically `/var/log/atop/atop_YYYYMMDD`) that contain per-minute snapshots of system and per-process stats. These can be read on a remote machine using Python without needing atop installed locally.

## Prerequisites

```bash
python3 -m venv /tmp/atop_venv
source /tmp/atop_venv/bin/activate
pip install atoparser
```

The `atoparser` package provides struct definitions, but we parse the binary directly for flexibility.

## File Format

| Component | Size (bytes) | Notes |
|---|---|---|
| Raw header | 480 | File-level metadata, magic `0xfeedbeef` |
| Per-record header | 96 | Timestamp, compressed data lengths, process counts |
| System stats (sstat) | variable | zlib-compressed system memory/CPU/disk data |
| Process stats (pstat) | variable | zlib-compressed per-process data, each entry 840 bytes (TStat) |

Records are sequential: `[raw header][rec1 header][rec1 sstat][rec1 pstat][rec2 header][rec2 sstat][rec2 pstat]...`

## Record Header Layout (96 bytes)

```python
curtime  = struct.unpack('<I', rec[0:4])[0]    # Unix timestamp
scomplen = struct.unpack('<I', rec[16:20])[0]  # Compressed sstat length
pcomplen = struct.unpack('<I', rec[20:24])[0]  # Compressed pstat length
ndeviat  = struct.unpack('<I', rec[28:32])[0]  # Number of process deviations
nactproc = struct.unpack('<I', rec[32:36])[0]  # Number of active processes
```

## System Stats (sstat) - Memory Fields

After decompressing with `zlib.decompress()`:

```python
physmem   = struct.unpack('<q', sstat[0:8])[0]    # Total physical memory (pages)
freemem   = struct.unpack('<q', sstat[8:16])[0]   # Free memory (pages)
buffermem = struct.unpack('<q', sstat[16:24])[0]  # Buffer memory (pages)
cachemem  = struct.unpack('<q', sstat[24:32])[0]  # Cache memory (pages)
swapmem   = struct.unpack('<q', sstat[48:56])[0]  # Total swap (pages)
freeswap  = struct.unpack('<q', sstat[56:64])[0]  # Free swap (pages)
```

Convert pages to MB: `value * 4096 / (1024 * 1024)`

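A small self-contained example of the conversion, round-tripping one packed field (the 524288-page value is made up):

```python
import struct

PAGE_SIZE = 4096  # bytes per page on most Linux systems

def pages_to_mb(pages: int) -> float:
    return pages * PAGE_SIZE / (1024 * 1024)

# Simulate a freemem field as it appears in the decompressed sstat bytes
raw = struct.pack('<q', 524288)
freemem_pages = struct.unpack('<q', raw)[0]
print(pages_to_mb(freemem_pages))  # 2048.0
```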
## Process Stats (TStat) - 840 bytes per entry

Each TStat entry has these sections:

| Section | Offset | Size | Contents |
|---|---|---|---|
| gen | 0 | 384 | PID, name, state, threads |
| cpu | 384 | 88 | CPU usage |
| dsk | 472 | 72 | Disk I/O |
| mem | 544 | 128 | Memory usage |
| net | 672 | 112 | Network |
| gpu | 784 | 56 | GPU |

### gen section - key fields
```python
tgid   = struct.unpack('<i', entry[0:4])[0]    # Thread group ID
pid    = struct.unpack('<i', entry[4:8])[0]    # Process ID
ppid   = struct.unpack('<i', entry[8:12])[0]   # Parent PID
nthr   = struct.unpack('<i', entry[44:48])[0]  # Number of threads
name   = entry[48:63].split(b'\x00')[0].decode('ascii', errors='replace')  # 15 chars max
isproc = entry[64]                             # 1 = process, 0 = thread
state  = chr(entry[65])                        # S=sleeping, R=running, D=uninterruptible, Z=zombie
```

**Note:** The name field is 15 bytes at offset 48, `isproc` is at offset 64 (not 63), and `state` is at offset 65.

### mem section - key fields (all values in pages, multiply by 4096 for bytes)
```python
|
||||||
|
minflt = struct.unpack('<q', entry[544:552])[0] # Minor page faults
|
||||||
|
majflt = struct.unpack('<q', entry[552:560])[0] # Major page faults
|
||||||
|
vexec = struct.unpack('<q', entry[560:568])[0] # Executable virtual memory
|
||||||
|
vmem = struct.unpack('<q', entry[568:576])[0] # Virtual memory size
|
||||||
|
rmem = struct.unpack('<q', entry[576:584])[0] # Resident memory (RSS)
|
||||||
|
pmem = struct.unpack('<q', entry[584:592])[0] # Proportional memory (PSS)
|
||||||
|
vgrow = struct.unpack('<q', entry[592:600])[0] # Virtual memory growth
|
||||||
|
rgrow = struct.unpack('<q', entry[600:608])[0] # Resident memory growth
|
||||||
|
vdata = struct.unpack('<q', entry[608:616])[0] # Virtual data size
|
||||||
|
vstack = struct.unpack('<q', entry[616:624])[0] # Virtual stack size
|
||||||
|
vlibs = struct.unpack('<q', entry[624:632])[0] # Virtual library size
|
||||||
|
vswap = struct.unpack('<q', entry[632:640])[0] # Swap usage
|
||||||
|
```

## Complete Analysis Script

```python
import struct, time, zlib
from collections import defaultdict

filepath = '/path/to/atop_YYYYMMDD'
with open(filepath, 'rb') as f:
    data = f.read()

rawheadlen = 480
rawreclen = 96
tstatlen = 840
pagesize = 4096

def parse_processes(pstat_data):
    """Parse all process entries from decompressed pstat data."""
    procs = []
    n = len(pstat_data) // tstatlen
    for i in range(n):
        entry = pstat_data[i*tstatlen:(i+1)*tstatlen]
        pid = struct.unpack('<i', entry[4:8])[0]
        ppid = struct.unpack('<i', entry[8:12])[0]
        nthr = struct.unpack('<i', entry[44:48])[0]
        name = entry[48:63].split(b'\x00')[0].decode('ascii', errors='replace')
        isproc = entry[64]
        state = chr(entry[65]) if entry[65] > 0 else '?'
        vmem = struct.unpack('<q', entry[568:576])[0]
        rmem = struct.unpack('<q', entry[576:584])[0]
        pmem = struct.unpack('<q', entry[584:592])[0]
        vswap = struct.unpack('<q', entry[632:640])[0]

        if pid > 0 and isproc == 1 and rmem > 0:
            procs.append({
                'pid': pid, 'ppid': ppid, 'name': name,
                'nthr': nthr, 'state': state,
                'vmem_mb': vmem * pagesize / (1024*1024),
                'rmem_mb': rmem * pagesize / (1024*1024),
                'pmem_mb': pmem * pagesize / (1024*1024),
                'vswap_mb': vswap * pagesize / (1024*1024),
            })
    return procs

# Iterate through all records
pos = rawheadlen
while pos + rawreclen <= len(data):
    rec = data[pos:pos+rawreclen]
    curtime = struct.unpack('<I', rec[0:4])[0]

    # Validate timestamp (adjust range for your data)
    if curtime < 1770000000 or curtime > 1780000000:
        # Try to find the next valid record by scanning forward
        found = False
        for skip in range(4, 500, 4):
            if pos + skip + 4 > len(data):
                break
            ts_val = struct.unpack('<I', data[pos+skip:pos+skip+4])[0]
            if 1770000000 < ts_val < 1780000000:
                pos = pos + skip
                found = True
                break
        if not found:
            break
        continue

    scomplen = struct.unpack('<I', rec[16:20])[0]
    pcomplen = struct.unpack('<I', rec[20:24])[0]
    sstat_start = pos + rawreclen
    pstat_start = sstat_start + scomplen
    ts = time.gmtime(curtime)

    # === System memory ===
    try:
        sstat_data = zlib.decompress(data[sstat_start:sstat_start+scomplen])
        freemem = struct.unpack('<q', sstat_data[8:16])[0] * pagesize / (1024*1024)
        cachemem = struct.unpack('<q', sstat_data[24:32])[0] * pagesize / (1024*1024)
        freeswap = struct.unpack('<q', sstat_data[56:64])[0] * pagesize / (1024*1024)
    except (zlib.error, struct.error):
        pos = pstat_start + pcomplen
        continue

    # === Per-process memory ===
    try:
        pstat_data = zlib.decompress(data[pstat_start:pstat_start+pcomplen])
        procs = parse_processes(pstat_data)
    except zlib.error:
        procs = []

    # Filter/sort/aggregate as needed
    procs.sort(key=lambda p: p['rmem_mb'], reverse=True)

    # Group by name
    by_name = defaultdict(lambda: {'count': 0, 'rmem': 0, 'pmem': 0, 'vswap': 0})
    for p in procs:
        by_name[p['name']]['count'] += 1
        by_name[p['name']]['rmem'] += p['rmem_mb']
        by_name[p['name']]['pmem'] += p['pmem_mb']
        by_name[p['name']]['vswap'] += p['vswap_mb']

    # Print summary for this timestamp
    ts_str = f"{ts.tm_hour:02d}:{ts.tm_min:02d}:{ts.tm_sec:02d}"
    print(f"\n=== {ts_str} UTC ===")
    for name, stats in sorted(by_name.items(), key=lambda x: x[1]['rmem'], reverse=True)[:10]:
        print(f"  {name:<20s} x{stats['count']:<3d} RSS={stats['rmem']:>8.0f}M PSS={stats['pmem']:>8.0f}M Swap={stats['vswap']:>6.0f}M")

    pos = pstat_start + pcomplen
```

## Notes on RSS vs PSS

- **RSS (Resident Set Size):** Physical RAM mapped into the process. Includes shared libraries and mmap'd files. Over-counts shared memory (counted in full for every process that maps it).
- **PSS (Proportional Set Size):** Shared pages divided by the number of processes sharing them. More accurate for total memory accounting.
- **MySQL RSS is misleading:** InnoDB mmap's its data files, inflating RSS by gigabytes. Actual private memory is much lower (check with `top` or `smem`).
- **Redis RSS is accurate:** Redis stores data in heap (anonymous) memory, so RSS closely reflects real usage.
- **PHP-FPM RSS over-counts:** Workers share PHP code pages. PSS shows true per-worker cost.

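The over-counting is easy to see with toy numbers (all figures below are illustrative, not measurements):

```python
# 10 PHP-FPM workers: each has ~40 MB private memory and maps
# the same ~300 MB of shared opcode/library pages
workers, private_mb, shared_mb = 10, 40, 300

rss_total = workers * (private_mb + shared_mb)  # shared pages counted 10x
pss_total = workers * private_mb + shared_mb    # shared pages counted once
pss_each = private_mb + shared_mb / workers     # per-worker PSS

print(rss_total, pss_total, pss_each)  # 3400 700 70.0
```

Summing RSS suggests 3.4 GB of PHP-FPM memory; PSS shows the true footprint is 700 MB.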
## Gotchas

1. The timestamp validation range needs adjusting per file. Use `date -d @TIMESTAMP` to check.
2. Some records may have alignment gaps between them -- the skip-forward loop handles this.
3. The `isproc` field at offset 64 distinguishes processes from threads. Filter by `isproc == 1` to avoid double-counting thread memory.
4. The `name` field is truncated to 15 characters. Long process names like `php-fpm8.4` fit, but `amazon-ssm-agent` becomes `amazon-ssm-agen`.
5. All memory values in the TStat struct are in pages (4096 bytes on most Linux systems).