Tackling Alert Fatigue: A Journey to Better Observability

Challenges with Our Tool Stack

At Frontegg, we’re passionate about building robust authentication and user management solutions that help businesses scale securely. In our drive to innovate, we identified an opportunity to optimize a key part of our engineering process: our alert system. We realized that improving our alerting practices would play a vital role in improving observability and ensuring we consistently meet our SLAs.

Background

Frontegg employs several observability tools for logging, monitoring, and alerting:

  • Monitoring: VictoriaMetrics
  • Logging: Coralogix
  • Alerting and Incident Management: Opsgenie
  • Communication: Slack

These tools are powerful and widely adopted, but as we had configured them, they also generated significant noise across our multi-tenant, multi-region architecture.

The noise from our alerting system was caused by three factors: we had overly sensitive monitoring thresholds, we had poorly defined alerting policies, and we lacked prioritization. The combination of these aggravating factors resulted in a flood of alerts, which not only overwhelmed our team, but also diluted the significance of truly critical alerts. This in turn led to alert fatigue and reduced the overall developer experience.

Feeling the Pain Points and Identifying the Problem

Our R&D team experienced decreased productivity, slower response times, and frustration due to constant interruptions from non-critical alerts. The sheer volume of notifications made it difficult to tell genuine issues from false positives. We realized that our setup was not sustainable and was hurting the team’s morale and efficiency.

We conducted several brainstorming sessions, gathered feedback from team members, and analyzed historical alert data to determine the root cause. We determined our alerting system needed a significant overhaul to reduce noise and improve reliability. Before we could begin looking at how we were handling alerts that came in, it became apparent that we were going to need to rethink which events should trigger an alert.
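One way to picture this kind of historical analysis is the minimal Python sketch below. It is an illustration under assumptions, not the tooling described in this post: the `AlertEvent` record, the `actioned` flag, the alert names, and the cutoffs (`min_firings`, `max_action_ratio`) are all hypothetical. The idea is simply to count how often each alert fired and how often a firing led to any action, then flag alerts that fire often but are rarely acted on.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class AlertEvent:
    """One historical firing of an alert (hypothetical record shape)."""
    name: str       # name of the alert rule
    actioned: bool  # True if someone had to act in response


def removal_candidates(events, min_firings=5, max_action_ratio=0.1):
    """Return alert names that fired often but were rarely acted on.

    min_firings and max_action_ratio are illustrative cutoffs, not
    values taken from the article.
    """
    fired = defaultdict(int)
    acted = defaultdict(int)
    for event in events:
        fired[event.name] += 1
        if event.actioned:
            acted[event.name] += 1

    candidates = []
    for name, count in fired.items():
        ratio = acted[name] / count
        # Skip alerts with too few firings to judge fairly.
        if count >= min_firings and ratio <= max_action_ratio:
            candidates.append(name)
    return sorted(candidates)


if __name__ == "__main__":
    history = (
        [AlertEvent("disk_usage_warning", False)] * 10
        + [AlertEvent("login_error_rate", True)] * 5
        + [AlertEvent("login_error_rate", False)]
    )
    print(removal_candidates(history))
```

A list like this is only a starting point for discussion: an alert that rarely fires and rarely needs action may still guard against a serious failure, which is why the sketch ignores alerts with few firings.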

Best Practices and How We Implemented Them

We came up with these best practices to combat alert fatigue:

  1. Threshold Tuning and Dynamic Alerting: We adjusted alert thresholds in VictoriaMetrics to trigger notifications only for significant deviations and implemented machine learning algorithms to adapt these thresholds based on historical data.
  2. Deleting Non-Actionable Alerts: Being strict about deleting non-actionable alerts became a cornerstone of our strategy to combat alert fatigue. We thoroughly audited our alerting system, reviewing each alert’s relevance and actionability. Alerts that did not directly contribute to resolving an issue, or that consistently resulted in no action, were promptly eliminated. This process involved close collaboration between R&D and operations teams to ensure that the criteria for actionable alerts were well-defined and understood. By ruthlessly pruning non-actionable alerts, we significantly reduced noise, allowing our team to focus on critical issues and improving our overall system reliability.
  3. Prioritization and Categorization: We introduced severity levels for alerts (e.g., critical, informational) in Opsgenie and included contextual information from Coralogix to provide better insights into the potential impact and necessary actions.
  4. Automated Responses: We developed auto-remediation scripts for known issues, reducing the need for manual intervention and integrating them with Opsgenie.
  5. Regular Reviews and Feedback Loops: We instituted regular post-mortem analyses of incidents to identify alerting gaps and encouraged continuous feedback from the team to refine alert configurations. These post-mortems not only help us better understand the causes of incidents but also give us a chance to examine the whys and hows of our incident response and remediation.
  6. Training and Collaboration: We provided training on effective alerting and observability practices and fostered a collaborative environment where R&D and operations teams worked together to refine strategies.
  7. Managing Alert Configuration using Infrastructure as Code (IaC): By treating alert configurations as code, we ensured consistency, version control, and easier collaboration across teams. We migrated our metric-, log-, and trace-based alerts into a dedicated GitHub repository, making the alerting rules and policies transparent and auditable. This approach allowed us to automate deployments and updates, reducing manual errors and enabling quick rollbacks if needed. Furthermore, it facilitated peer reviews and streamlined our workflows, as any change to alerting configurations went through the same rigorous process as our application code. The adoption of GitOps for alert management enhanced our operational efficiency and reinforced a culture of shared responsibility and continuous improvement.
  8. Team-Focused Channels: Each alert was routed to a channel dedicated to the relevant team, ensuring that the right people were notified and could take prompt action. This strategy helped reduce confusion and response times, as each team was responsible for monitoring and managing alerts pertinent to its domain. The team-focused channels fostered a sense of ownership and responsibility, enabling more effective incident resolution and a deeper understanding of the system’s behavior within each team. This targeted approach streamlined our alert management process and enhanced collaboration and communication across the organization.
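To make items 1, 3, and 8 more concrete, here is a minimal Python sketch of how a history-based threshold, a severity level, and team routing can fit together. It rests on several assumptions: the threshold is a simple "mean plus k standard deviations" baseline, which is a much simpler stand-in for the machine learning approach mentioned in item 1; the team names, channel names, `critical_factor`, and all function names are hypothetical; and nothing here calls the VictoriaMetrics, Opsgenie, or Slack APIs.

```python
from dataclasses import dataclass
from statistics import mean, stdev

# Hypothetical mapping from owning team to its dedicated channel.
TEAM_CHANNELS = {
    "identity": "#alerts-identity",
    "platform": "#alerts-platform",
}
DEFAULT_CHANNEL = "#alerts-unrouted"


@dataclass
class Alert:
    metric: str
    team: str
    value: float
    threshold: float
    severity: str  # "critical" or "informational"


def dynamic_threshold(history, k=3.0):
    """Mean of recent samples plus k sample standard deviations.

    Needs at least two samples, because statistics.stdev requires them.
    """
    return mean(history) + k * stdev(history)


def evaluate(metric, team, history, value, k=3.0, critical_factor=1.5):
    """Return an Alert if value exceeds the threshold, otherwise None.

    critical_factor is an illustrative multiplier and assumes the
    metric takes positive values.
    """
    threshold = dynamic_threshold(history, k)
    if value <= threshold:
        return None  # within the normal range: stay quiet
    if value > threshold * critical_factor:
        severity = "critical"
    else:
        severity = "informational"
    return Alert(metric, team, value, threshold, severity)


def route(alert):
    """Pick the channel of the owning team, with a fallback channel."""
    return TEAM_CHANNELS.get(alert.team, DEFAULT_CHANNEL)


if __name__ == "__main__":
    recent_latency_ms = [100, 102, 98, 101, 99]
    alert = evaluate("login_latency_ms", "identity", recent_latency_ms, 200)
    if alert is not None:
        print(route(alert), alert.severity, round(alert.threshold, 1))
```

The fallback channel is a deliberate choice in this sketch: an alert with no known owner still lands somewhere visible, which makes missing ownership easy to spot and fix.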

Measuring Results

After implementing these changes, we observed several positive outcomes. Some of these were immediately visible while others took time to materialize due to the relative infrequency of certain kinds of events:

  • Reduced Alert Volume: The number of alerts dropped by 50%, significantly reducing noise.
  • Improved Response Times: With fewer false positives, critical issues were identified and resolved more quickly.
  • Enhanced Team Morale: The reduction in unnecessary interruptions allowed team members to focus on their work, improving overall productivity and job satisfaction.
  • Increased System Reliability: With a more reliable alerting system, we experienced fewer system downtimes and improved user satisfaction.
  • Improved Sense of Ownership: As part of our journey towards improved observability, we parted ways with our third-party NOC company, which had previously received all of our alerts and triaged them for us.

Future Work

Looking ahead, we plan to continue enhancing our observability and alerting practices by leveraging advanced analytics and AI to predict potential issues before they become critical. We aim to implement a more integrated incident management system to streamline responses and improve coordination within the team. Additionally, we are committed to regularly revisiting our alerting strategies and incorporating new technologies and methodologies to stay ahead of potential challenges. We expect this approach to support proactive interventions, minimal downtime, and the highest standards of service and reliability for our users.

We are excited to continue evolving our team and our practices in our ongoing effort to offer Frontegg users the best CIAM and user management experience in the market. If you found this exploration insightful, we invite you to take a look at some of our other blogs, explore our documentation, or check out our many authentication and user management capabilities.