Runbook: SingleAuditorRunning

Alert Details

  • Alert Name: SingleAuditorRunning
  • Expression: max (pull_audit_new_jobs_queue) without (cluster, prometheus, prometheusK8sClusterProvider) > 0

Description

This alert fires when the maximum value of the new audit jobs queue (pull_audit_new_jobs_queue) is greater than 0. A non-empty queue indicates that only a single auditor may be running and that audit jobs are not being processed promptly.
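
To confirm that the alert condition still holds, the expression can be evaluated by hand against the Prometheus HTTP API. The following is a minimal sketch, not a definitive tool. It assumes that PROM_URL points at your Prometheus server (the localhost default is a placeholder) and that curl and jq are installed. The function names query_queue_values and is_firing are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: evaluate the alert expression by hand.
# Assumptions: PROM_URL is your Prometheus server; curl and jq are installed.
PROM_URL="${PROM_URL:-http://localhost:9090}"
QUERY='max (pull_audit_new_jobs_queue) without (cluster, prometheus, prometheusK8sClusterProvider)'

# Print the current value of each series the expression returns, one per line.
query_queue_values() {
  curl -sG "$PROM_URL/api/v1/query" --data-urlencode "query=$QUERY" |
    jq -r '.data.result[].value[1]'
}

# Succeed (exit 0) when a value is above the alert threshold of 0.
is_firing() {
  awk -v v="$1" 'BEGIN { exit !(v + 0 > 0) }'
}
```

One possible usage, printing each series whose queue is non-empty:

      $ query_queue_values | while read -r v; do is_firing "$v" && echo "queue backed up: $v"; done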

Possible Causes

  • Only one instance of the auditor service is running.
  • High volume of audit jobs causing delays.
  • Network issues affecting the auditor service.
  • Configuration errors in the auditor setup.

Troubleshooting Steps

  1. Check Auditor Service Instances

    • Command: ps aux | grep '[a]uditor-service'
    • Expected Output: List of running auditor service instances. More than one instance should be running. The [a] bracket keeps the grep process itself out of the results.
    • Example:
      $ ps aux | grep '[a]uditor-service'
      root      1234  0.0  0.1  123456  1234 ?        Ssl  14:00   0:00 /usr/bin/auditor-service
      root      5678  0.0  0.1  123456  1234 ?        Ssl  14:00   0:00 /usr/bin/auditor-service
      
  2. Restart Auditor Service

    • Command: sudo systemctl restart auditor-service
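    • Expected Output: None on success; systemctl prints a message only if the restart fails. Confirm with systemctl is-active auditor-service, which should print "active".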
    • Example:
      $ sudo systemctl restart auditor-service
      
  3. Check Network Connectivity

    • Command: ping -c 4 auditor-service-hostname (replace auditor-service-hostname with the host running the auditor)
    • Expected Output: A reply for each of the 4 packets, with 0% packet loss.
    • Example:
      $ ping -c 4 auditor-service-hostname
      PING auditor-service-hostname (192.168.1.3) 56(84) bytes of data.
      64 bytes from auditor-service-hostname: icmp_seq=1 ttl=64 time=0.123 ms
      64 bytes from auditor-service-hostname: icmp_seq=2 ttl=64 time=0.124 ms
      64 bytes from auditor-service-hostname: icmp_seq=3 ttl=64 time=0.125 ms
      64 bytes from auditor-service-hostname: icmp_seq=4 ttl=64 time=0.126 ms
      
  4. Verify Auditor Configuration

    • Command: cat /etc/auditor-service/config.yml
    • Expected Output: Configuration file contents. Confirm that each setting, such as job_name and max_queue_age in the example below, has the intended value.
    • Example:
      $ cat /etc/auditor-service/config.yml
      job_name: 'audit-jobs'
      max_queue_age: 300
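
The instance check in step 1 can be scripted so that the single-auditor case is flagged explicitly. The following is a minimal sketch, assuming the process name auditor-service from the examples above and a pgrep that supports -c (as in procps-ng). The function names count_auditors and evaluate_instance_count are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: flags the single-auditor case from a process count.
# Assumption: the process is named "auditor-service", as in the examples above.

# Count processes whose name is exactly "auditor-service".
# pgrep -c prints the match count (0 if none) and exits non-zero on no match,
# hence the "|| true".
count_auditors() {
  pgrep -c -x auditor-service || true
}

# Print a verdict for a given instance count.
# Returns 0 for two or more instances, 1 for exactly one, 2 for none.
evaluate_instance_count() {
  local count="$1"
  if [ "$count" -ge 2 ]; then
    echo "OK: $count auditor instances running"
    return 0
  elif [ "$count" -eq 1 ]; then
    echo "WARN: only one auditor instance running"
    return 1
  fi
  echo "CRIT: no auditor instances running"
  return 2
}
```

One possible usage on the auditor host:

      $ evaluate_instance_count "$(count_auditors)"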
      

Additional Steps

  • Check Logs for Errors

    • Command: journalctl -u auditor-service --since "1 hour ago"
    • Expected Output: Recent logs for the auditor service. Look for any error messages.
    • Example:
      $ journalctl -u auditor-service --since "1 hour ago"
      -- Logs begin at Wed 2024-11-13 13:00:00 UTC, end at Wed 2024-11-13 14:00:00 UTC. --
      Nov 13 13:45:00 hostname auditor-service[1234]: Auditor job processed
      Nov 13 13:50:00 hostname auditor-service[1234]: Error: Job processing delayed
      
  • Check Underlying Infrastructure

    • Ensure the server hosting the auditor service is up and running.
    • Verify there are no ongoing maintenance activities or outages.
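
The log check above can be reduced to a single number for quick triage. The following is a minimal sketch, assuming that error lines contain the word "error" in any letter case, as in the sample log output. The function name count_error_lines is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: count log lines that mention "error" (case-insensitive).
# Reads log text on stdin and prints the count; prints 0 when nothing matches.
# grep -c exits non-zero when the count is 0, hence the "|| true".
count_error_lines() {
  grep -ci 'error' || true
}
```

One possible usage, with --no-pager so that journalctl writes straight to the pipe:

      $ journalctl -u auditor-service --since "1 hour ago" --no-pager | count_error_lines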