Runbook: SingleAuditorRunning
Alert Details
- Alert Name: SingleAuditorRunning
- Expression:
max (pull_audit_new_jobs_queue) without (cluster, prometheus, prometheusK8sClusterProvider) > 0
Description
This alert fires when only a single auditor is running and the maximum value of the audit jobs queue (pull_audit_new_jobs_queue) is greater than 0, which points to potential issues with job processing.
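To see the value the alert is evaluating, the expression can be run by hand against the Prometheus HTTP API. The following is a minimal sketch, assuming Prometheus is reachable at the address in PROM_URL (the default here is a placeholder). The helper names query_queue and is_firing are illustrative and not part of any existing tooling.

```shell
# Assumption: adjust PROM_URL to your own Prometheus endpoint.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Left-hand side of the alert expression, without the "> 0" comparison,
# so the current value is returned even when the alert is not firing.
QUERY='max (pull_audit_new_jobs_queue) without (cluster, prometheus, prometheusK8sClusterProvider)'

# Run an instant query and print the raw JSON response.
query_queue() {
  curl -sS -G "$PROM_URL/api/v1/query" --data-urlencode "query=$QUERY"
}

# Succeed (exit 0) when a sample value would satisfy the alert condition.
is_firing() {
  awk -v v="$1" 'BEGIN { exit !(v > 0) }'
}

# Usage:
#   query_queue
#   is_firing 12 && echo "queue is non-empty, alert condition met"
```

Dropping the comparison from the query is a deliberate choice: with "> 0" in place, Prometheus returns an empty result whenever the queue is empty, which is harder to tell apart from a missing metric.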
Possible Causes
- Only one instance of the auditor service is running.
- High volume of audit jobs causing delays.
- Network issues affecting the auditor service.
- Configuration errors in the auditor setup.
Troubleshooting Steps
- Check Auditor Service Instances
  - Command:
    ps aux | grep auditor-service
  - Expected Output: List of running auditor service instances. Ensure more than one instance is running.
  - Example:
    $ ps aux | grep auditor-service
    root 1234 0.0 0.1 123456 1234 ? Ssl 14:00 0:00 /usr/bin/auditor-service
    root 5678 0.0 0.1 123456 1234 ? Ssl 14:00 0:00 /usr/bin/auditor-service
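The instance check can also be scripted so that the result does not depend on reading process listings by eye. This is a sketch, assuming the process command line contains auditor-service as in the example output. It swaps in pgrep for ps and grep, because pgrep never reports itself as a match. The function names and the default pattern are illustrative.

```shell
# Count processes whose full command line matches the pattern.
# The count comes from wc, so the result is "0" when nothing matches.
count_auditors() {
  pgrep -f "${1:-auditor-service}" | wc -l | tr -d ' '
}

# Turn a count into a verdict: more than one instance is expected.
check_instance_count() {
  if [ "$1" -ge 2 ]; then
    echo "ok: $1 auditor instances running"
  else
    echo "warning: only $1 auditor instance(s) running"
  fi
}

# Usage:
#   check_instance_count "$(count_auditors)"
```

Because -f matches against the full command line, a wrapper script whose own name or arguments contain the pattern would be counted too, so keep the pattern out of the script's file name.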
- Restart Auditor Service
  - Command:
    sudo systemctl restart auditor-service
  - Expected Output: The service restarts without errors.
  - Example:
    $ sudo systemctl restart auditor-service
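A restart that returns without errors does not guarantee that the service stayed up. The sketch below restarts the unit and then asks systemd whether it is active, using systemctl is-active. The SYSTEMCTL variable, the function name, and the two-second default wait are illustrative choices, not part of the service.

```shell
# Command used to talk to systemd; override it (for example SYSTEMCTL=systemctl)
# when already running as root.
SYSTEMCTL="${SYSTEMCTL:-sudo systemctl}"

# Restart a unit, wait briefly, then report whether systemd considers it active.
# $1 = unit name (default: auditor-service), $2 = seconds to wait (default: 2)
restart_and_verify() {
  unit="${1:-auditor-service}"
  $SYSTEMCTL restart "$unit" || { echo "restart of $unit failed"; return 1; }
  sleep "${2:-2}"   # give the process a moment to fail if it is going to
  if $SYSTEMCTL is-active --quiet "$unit"; then
    echo "$unit is active"
  else
    echo "$unit is not active after restart"
    return 1
  fi
}

# Usage:
#   restart_and_verify auditor-service
```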
- Check Network Connectivity
  - Command:
    ping -c 4 auditor-service-hostname
  - Expected Output: Successful ping responses.
  - Example:
    $ ping -c 4 auditor-service-hostname
    PING auditor-service-hostname (192.168.1.3) 56(84) bytes of data.
    64 bytes from auditor-service-hostname: icmp_seq=1 ttl=64 time=0.123 ms
    64 bytes from auditor-service-hostname: icmp_seq=2 ttl=64 time=0.124 ms
    64 bytes from auditor-service-hostname: icmp_seq=3 ttl=64 time=0.125 ms
    64 bytes from auditor-service-hostname: icmp_seq=4 ttl=64 time=0.126 ms
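For scripted checks, the packet-loss figure is more useful than the individual replies. The sketch below extracts the percentage from the summary line that ping prints at the end of a run. The function name is illustrative, and auditor-service-hostname is the same placeholder used in the step above.

```shell
# Read ping output on stdin and print the packet-loss percentage.
# Prints nothing if no summary line is present (for example, on a DNS failure).
packet_loss() {
  sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# Usage:
#   loss="$(ping -c 4 auditor-service-hostname | packet_loss)"
#   [ "$loss" = "0" ] || echo "packet loss: ${loss:-unknown}%"
```

Note that a clean ping only shows the host answers ICMP; it does not show that the auditor service itself is accepting connections.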
- Verify Auditor Configuration
  - Command:
    cat /etc/auditor-service/config.yml
  - Expected Output: Configuration file contents. Ensure all settings are correct.
  - Example:
    $ cat /etc/auditor-service/config.yml
    job_name: 'audit-jobs'
    max_queue_age: 300
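To review individual settings, a top-level key can be pulled out of the file instead of reading the whole dump. This is a sketch, assuming the flat "key: value" layout shown in the example. It is not a YAML parser and will not handle nested keys or comments after a value. The function name is illustrative.

```shell
# Print the value of a top-level "key: value" entry with surrounding quotes removed.
# Returns non-zero if the key is absent.
# $1 = key, $2 = file (default: /etc/auditor-service/config.yml)
config_value() {
  key="$1"
  file="${2:-/etc/auditor-service/config.yml}"
  awk -v k="$key" '
    index($0, k ":") == 1 {
      v = substr($0, length(k) + 2)
      gsub(/^[ \t]+|[ \t]+$/, "", v)     # trim surrounding whitespace
      gsub(/^["\047]|["\047]$/, "", v)   # strip one layer of quotes (\047 is a single quote)
      print v
      found = 1
      exit
    }
    END { exit !found }
  ' "$file"
}

# Usage:
#   config_value max_queue_age || echo "max_queue_age is not set"
```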
Additional Steps
- Check Logs for Errors
  - Command:
    journalctl -u auditor-service --since "1 hour ago"
  - Expected Output: Recent logs for the auditor service. Look for any error messages.
  - Example:
    $ journalctl -u auditor-service --since "1 hour ago"
    -- Logs begin at Wed 2024-11-13 13:00:00 UTC, end at Wed 2024-11-13 14:00:00 UTC. --
    Nov 13 13:45:00 hostname auditor-service[1234]: Auditor job processed
    Nov 13 13:50:00 hostname auditor-service[1234]: Error: Job processing delayed
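When the log window is large, filtering for error lines first is quicker than scanning the full output. The sketch below prints the lines containing "error" (case-insensitive), in the style of the example log line, followed by a count. The function name is illustrative, and the match is a plain text search, not a journald priority filter.

```shell
# Read log lines on stdin; print the matching lines, then a count on the last line.
error_summary() {
  matches="$(grep -i 'error' || true)"   # grep exits non-zero when nothing matches
  if [ -n "$matches" ]; then
    printf '%s\n' "$matches"
    printf 'error lines: %s\n' "$(printf '%s\n' "$matches" | wc -l | tr -d ' ')"
  else
    echo "error lines: 0"
  fi
}

# Usage:
#   journalctl -u auditor-service --since "1 hour ago" | error_summary
```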
- Check Underlying Infrastructure
  - Ensure the server hosting the auditor service is up and running.
  - Verify there are no ongoing maintenance activities or outages.