Runbook: AuditorQueueAge
Alert Details
- Alert Name: AuditorQueueAge
- Expression:
  max without (cluster, provider) (pull_audit_new_jobs_queue{environment=~".+"}) > 0
Description
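This alert fires when the maximum value of pull_audit_new_jobs_queue, aggregated across clusters and providers, is greater than 0 in any environment. Because the threshold in the expression is 0, any non-zero value fires the alert; it indicates potential delays in processing audit jobs.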
Possible Causes
- High volume of audit jobs leading to queue buildup.
- Performance issues with the audit processing service.
- Network latency affecting job processing.
- Configuration issues with the audit queue.
Troubleshooting Steps
- Check Audit Queue Length
  - Command:
    curl -s http://audit-service:port/metrics | grep pull_audit_new_jobs_queue
  - Expected Output: Current metrics for the audit queue length.
  - Example:
    $ curl -s http://audit-service:port/metrics | grep pull_audit_new_jobs_queue
    pull_audit_new_jobs_queue{environment="production"} 150
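When the endpoint exposes many series, it can help to print only those that would trip the alert. The following is a minimal sketch, not part of the audit service: queue_series_over is a hypothetical helper name, and it assumes the metrics are in the Prometheus text format without trailing timestamps, so the sample value is the last field on each line.

```shell
# Sketch only: queue_series_over is a hypothetical helper, not part of the audit service.
# Reads Prometheus text-format metrics on stdin and prints each
# pull_audit_new_jobs_queue series whose value is above the threshold
# (default 0, matching the alert expression).
# Assumes samples carry no trailing timestamp, so the value is the last field.
queue_series_over() {
  threshold="${1:-0}"
  awk -v t="$threshold" '
    /^pull_audit_new_jobs_queue[{ ]/ && ($NF + 0) > (t + 0) { print }
  '
}

# Usage (host and port are placeholders, as in the command above):
#   curl -s http://audit-service:port/metrics | queue_series_over 0
```

An empty result means no series is currently above the threshold.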
- Check Audit Service Status
  - Command:
    systemctl status audit-service
  - Expected Output: The status of the audit service. Look for “active (running)”.
  - Example:
    $ systemctl status audit-service
    ● audit-service.service - Audit Service
       Loaded: loaded (/etc/systemd/system/audit-service.service; enabled; vendor preset: enabled)
       Active: active (running) since Wed 2024-11-13 14:00:00 UTC; 19min ago
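For scripted checks, the "active (running)" test can be automated. This is a small sketch under stated assumptions: status_is_running is a hypothetical helper name, and it only inspects the Active: line of the status text it receives on stdin.

```shell
# Sketch only: status_is_running is a hypothetical helper name.
# Reads `systemctl status` output on stdin and succeeds (exit 0) only if
# the Active: line reports "active (running)".
status_is_running() {
  grep -Eq '^[[:space:]]*Active:[[:space:]]+active \(running\)'
}

# Usage:
#   systemctl status audit-service | status_is_running && echo "audit-service is running"
```

Where the status text itself is not needed, systemctl is-active --quiet audit-service gives a comparable yes/no answer through its exit code.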
- Restart Audit Service
  - Command:
    sudo systemctl restart audit-service
  - Expected Output: The service restarts without errors. systemctl prints nothing on success, so confirm the result with systemctl status audit-service.
  - Example:
    $ sudo systemctl restart audit-service
- Check Network Connectivity
  - Command:
    ping -c 4 audit-service-hostname
  - Expected Output: Successful ping responses.
  - Example:
    $ ping -c 4 audit-service-hostname
    PING audit-service-hostname (192.168.1.2) 56(84) bytes of data.
    64 bytes from audit-service-hostname: icmp_seq=1 ttl=64 time=0.123 ms
    64 bytes from audit-service-hostname: icmp_seq=2 ttl=64 time=0.124 ms
    64 bytes from audit-service-hostname: icmp_seq=3 ttl=64 time=0.125 ms
    64 bytes from audit-service-hostname: icmp_seq=4 ttl=64 time=0.126 ms
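Since network latency is one of the possible causes, it may be useful to reduce the ping replies to a single round-trip figure. The sketch below assumes reply lines contain a time=<milliseconds> field, as in the example output; ping_avg_ms is a hypothetical helper name.

```shell
# Sketch only: ping_avg_ms is a hypothetical helper name.
# Reads ping output on stdin and prints the mean of the time=<ms> values.
# Exits with status 1 and prints nothing if no reply lines are found.
ping_avg_ms() {
  awk '
    match($0, /time=[0-9.]+/) {
      # Skip the 5 characters of "time=" to get the numeric part.
      sum += substr($0, RSTART + 5, RLENGTH - 5)
      n++
    }
    END { if (n) { printf "%.3f\n", sum / n } else { exit 1 } }
  '
}

# Usage:
#   ping -c 4 audit-service-hostname | ping_avg_ms
```

A successful ping only shows that the host answers ICMP; it does not prove that the audit service port is reachable.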
- Verify Audit Queue Configuration
  - Command:
    cat /etc/audit-service/config.yml
  - Expected Output: Configuration file contents. Ensure all settings are correct.
  - Example:
    $ cat /etc/audit-service/config.yml
    queue_name: 'audit-queue'
    max_queue_age: 300
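To compare a single setting against its expected value without reading the whole file, a key lookup can be scripted. This is a rough sketch, assuming the configuration is flat "key: value" YAML as in the example; config_value is a hypothetical helper name and it does not handle nested keys or inline comments.

```shell
# Sketch only: config_value is a hypothetical helper name.
# Prints the value of a top-level key from a flat "key: value" YAML file.
#   $1 = key name, $2 = path to the file
# Surrounding single or double quotes are stripped from the value.
config_value() {
  awk -F': *' -v k="$1" '
    $1 == k { gsub(/["\047]/, "", $2); print $2; exit }
  ' "$2"
}

# Usage:
#   config_value max_queue_age /etc/audit-service/config.yml
```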
Additional Steps
- Check Logs for Errors
  - Command:
    journalctl -u audit-service --since "1 hour ago"
  - Expected Output: Recent logs for the audit service. Look for any error messages.
  - Example:
    $ journalctl -u audit-service --since "1 hour ago"
    -- Logs begin at Wed 2024-11-13 13:00:00 UTC, end at Wed 2024-11-13 14:00:00 UTC. --
    Nov 13 13:45:00 hostname audit-service[1234]: Audit job processed
    Nov 13 13:50:00 hostname audit-service[1234]: Error: Job processing delayed
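On a busy service the last hour of logs can be long, so a quick count of error lines gives a first impression before reading them in full. The sketch below assumes errors are marked by the word "error" in the message text, as in the example; count_error_lines is a hypothetical helper name.

```shell
# Sketch only: count_error_lines is a hypothetical helper name.
# Reads log text on stdin and prints how many lines contain "error"
# (case-insensitive). `|| true` keeps the exit status 0 when the count is 0,
# because grep itself exits 1 when nothing matches.
count_error_lines() {
  grep -ci 'error' || true
}

# Usage:
#   journalctl -u audit-service --since "1 hour ago" | count_error_lines
```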
- Check Underlying Infrastructure
  - Ensure the server hosting the audit service is up and running.
  - Verify there are no ongoing maintenance activities or outages.