Level 2 Support: Responsible for handling alerts and performing initial troubleshooting steps. Tasks include log analysis, diagnostics, and applying known solutions.
Level 3 Support: Experts with deep technical knowledge, called upon for complex issues. Responsible for detailed analysis and long-term solutions.
Monitoring Tools: Access to all relevant monitoring and logging tools, such as Prometheus, Grafana, and the ELK Stack.
Communication Channels: Dedicated channels for collaboration with Level 3 Support, e.g., Slack, Microsoft Teams, or a dedicated incident response system.
Notification: Receive and acknowledge the alert via the monitoring system. Ensure all relevant team members are informed.
Initial Assessment: Determine the urgency of the alert and its potential impact on the system. Document the initial observations and conclusions.
Standard Procedures: Apply standard procedures and known solutions to resolve the issue. This may involve restarting services, applying patches, or adjusting configurations.
Documentation: Record all actions taken and their results in the incident management system. Ensure all steps are traceable and reproducible.
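The Initial Assessment and Documentation steps above can be sketched in code. The following is a minimal illustration only, not a prescribed implementation: the names Level, priority, and Incident, the three-level scale, and the P1/P2/P3 thresholds are all assumptions made for this example. A real team would use the fields and priority scheme of its own incident management system.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical three-step scale used for both urgency and impact."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def priority(urgency: Level, impact: Level) -> str:
    """Map urgency and impact to a priority label (assumed thresholds)."""
    score = urgency * impact  # ranges from 1 to 9
    if score >= 6:
        return "P1"
    if score >= 3:
        return "P2"
    return "P3"


@dataclass
class Incident:
    alert_id: str
    urgency: Level
    impact: Level
    actions: list = field(default_factory=list)

    def log(self, action: str, result: str) -> None:
        """Append a timestamped entry so every step stays traceable."""
        self.actions.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "action": action,
            "result": result,
        })
```

Recording the result next to each action, with a UTC timestamp, is one way to keep the log reproducible when it is later handed to Level 3 Support.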
Communication: If the standard procedures do not resolve the issue, inform Level 3 Support that escalation is needed. Use the dedicated communication channels.
Handover: Provide all collected data, logs, and documentation to Level 3 Support. Conduct a handover meeting to ensure all relevant information is transferred.
Collaboration: Work closely with Level 3 Support to resolve the issue. Assist in data collection and analysis, and ensure all actions are documented.
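The Handover step depends on nothing being left out. As a rough sketch, a handover package could be checked for completeness before it is sent to Level 3 Support. The function name build_handover and the field names in REQUIRED_FIELDS are hypothetical and chosen only for this example.

```python
import json

# Hypothetical set of items that must be present before handover.
REQUIRED_FIELDS = ("summary", "timeline", "logs", "actions_taken")


def build_handover(incident: dict) -> str:
    """Return the handover package as JSON, or fail if anything is missing."""
    missing = [k for k in REQUIRED_FIELDS if not incident.get(k)]
    if missing:
        raise ValueError(f"handover incomplete, missing: {', '.join(missing)}")
    return json.dumps({k: incident[k] for k in REQUIRED_FIELDS}, indent=2)
```

Failing loudly on an empty or missing field is a deliberate choice here: an incomplete package is caught by Level 2 before the handover meeting rather than discovered by Level 3 afterwards.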
Documentation: Create a detailed final report covering a summary of the incident, all actions taken, their results, and recommendations for future incidents.
Lessons Learned: Identify lessons learned and areas for improvement. Discuss these with the team and implement necessary changes in processes.
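The final report has four parts: summary, actions taken, results, and recommendations. A small helper could render them in a consistent plain-text layout. This is only a sketch: the function name render_report and the section headings are assumptions, not an established template.

```python
def render_report(summary: str, actions: list[str], results: str,
                  recommendations: list[str]) -> str:
    """Render the four report sections as plain text (hypothetical layout)."""
    lines = ["Incident Summary", summary, "", "Actions Taken"]
    # Number the actions so the order of steps stays visible.
    lines += [f"{i}. {a}" for i, a in enumerate(actions, start=1)]
    lines += ["", "Results", results, "", "Recommendations"]
    lines += [f"- {r}" for r in recommendations]
    return "\n".join(lines)
```

Keeping the same section order in every report makes it easier to compare incidents when lessons learned are reviewed.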
Internal Communication: Inform all relevant internal stakeholders, including management and other affected departments, about the incident and the actions taken.
External Communication: Inform external stakeholders and customers if necessary. Prepare clear and concise communication to maintain trust and avoid misunderstandings.
Process Improvement: Update and improve support processes based on the insights gained from the incident. Implement new best practices and ensure all team members are trained.
Training: Conduct training sessions so that staff can recognize and handle similar incidents. Ensure all team members understand and can apply the new processes and best practices.