Orbit-RS Operations Runbook
Orbit-RS Operations Runbook
Quick Reference
Emergency Scripts
/usr/local/bin/orbit-emergency-response.sh- First response/usr/local/bin/orbit-data-recovery.sh- Data recovery/usr/local/bin/orbit-disk-cleanup.sh- Disk space recovery
Health Endpoints
GET /health- Basic health checkGET /metrics- Prometheus metricsGET /ready- Readiness probe
Key Commands
# Check service status
systemctl status orbit-rs
# View logs
journalctl -u orbit-rs -f
# Backend detection
orbit-cli detect-backend --data-dir=/var/lib/orbit
# Force snapshot
orbit-cli snapshot create --output=/tmp/emergency.json
# WAL replay
orbit-cli replay-wal --data-dir=/var/lib/orbit
Common Scenarios
1. Service Won’t Start
# Check logs for errors
journalctl -u orbit-rs --since "5 minutes ago"
# Check disk space
df -h /var/lib/orbit
# Validate configuration
orbit-cli validate-config --config=/etc/orbit/config.json
# Try safe mode
systemctl start orbit-rs --config=/etc/orbit/safe-mode.json
2. High Latency
# Check current performance
curl -s localhost:8080/metrics | grep latency
# Check system resources
top -p $(pgrep orbit-rs)
# Force garbage collection
orbit-cli gc-trigger
# Reduce cache if LSM
orbit-cli reduce-cache --block-cache-mb=128
3. Data Corruption
# Stop service
systemctl stop orbit-rs
# Backup corrupt data
cp -r /var/lib/orbit /backup/corrupt-$(date +%s)
# Restore from latest backup
/usr/local/bin/orbit-data-recovery.sh
# Validate recovery
orbit-cli validate --data-dir=/var/lib/orbit
Configuration Switching
Environment Variables
# Switch to COW B+ Tree
export ORBIT_PERSISTENCE_BACKEND=cow
systemctl restart orbit-rs
# Switch to LSM-Tree
export ORBIT_PERSISTENCE_BACKEND=lsm
systemctl restart orbit-rs
# Switch to RocksDB
export ORBIT_PERSISTENCE_BACKEND=rocksdb
systemctl restart orbit-rs
Performance Tuning
# Low memory configuration
cat > /tmp/low-memory.json << 'EOF'
{
"backend": "CowBTree",
"cow_config": {
"max_keys_per_node": 32,
"wal_buffer_size": 65536
}
}
EOF
# High performance configuration
cat > /tmp/high-perf.json << 'EOF'
{
"backend": "CowBTree",
"cow_config": {
"max_keys_per_node": 256,
"wal_buffer_size": 4194304
}
}
EOF
Monitoring Alerts
Critical (P0)
- Service down > 30s
- Disk usage > 95%
- Memory usage > 95%
- Error rate > 1%
Warning (P1)
- Write latency > 100ms
- Read latency > 50ms
- Disk usage > 85%
- Memory usage > 80%
Backup & Recovery
Daily Backup
# Automated via cron: 0 2 * * *
/usr/local/bin/orbit-backup.sh
# Manual backup
orbit-cli backup --output=/backup/manual-$(date +%s).tar.gz
Recovery Process
- Stop service:
systemctl stop orbit-rs - Backup corrupt data:
cp -r /var/lib/orbit /backup/corrupt/ - Run recovery:
/usr/local/bin/orbit-data-recovery.sh - Validate:
orbit-cli validate --data-dir=/var/lib/orbit - Start service:
systemctl start orbit-rs
Performance Baselines
| Backend | Write (μs) | Read (μs) | Memory (MB) |
|---|---|---|---|
| COW B+ Tree | 41 | <1 | 256 |
| LSM-Tree | 38 | 0.3 | 512 |
| RocksDB | 53 | 19 | 1024 |
Emergency Contacts
- Primary: DevOps (+1-555-0100)
- Secondary: SRE (+1-555-0101)
- Escalation: Eng Manager (+1-555-0102)