Monitoring of EmpowerID SaaS

This article is an introductory overview of how EmpowerID monitors the availability of its SaaS environments. It does not cover every aspect of Site Reliability Engineering or Security Information and Event Management performed by EmpowerID. Instead, it describes the processes the DevOps team follows to ensure a base level of service with minimal impact on end users. The information is intended to help SaaS customers understand the monitoring EmpowerID performs and to help the in-house operations teams of non-SaaS customers achieve parity. EmpowerID's availability monitoring is divided into three areas monitored by EmpowerID DevOps: front-end services, back-end services, and the underlying infrastructure.

Front-End Monitoring

To monitor site availability, EmpowerID DevOps verifies that the main web applications load successfully. Azure Monitor checks the following URLs every two minutes per Azure region:

  • Core Login (https://<core-domain>/WebIdpForms/Login/Portal)

  • IAM Shop (https://<iamshop-domain>), if applicable

  • My Identity (https://<myid-domain>), if applicable

Every request must succeed. After three consecutive failures, a high-priority alert is raised and handled by the EmpowerID DevOps team.

In addition to active front-end monitoring, passive error-rate monitoring is optionally performed for environments with large user bases, where the EmpowerID UI is used frequently. The Azure Application Gateway provides a failed-requests metric. If the error rate exceeds 5% for more than five minutes, a high-priority alert is raised.
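The two front-end alert conditions can be illustrated with a minimal Python sketch. It is not how Azure Monitor evaluates its rules internally. The class and function names, and the shape of the sample data, are assumptions made for illustration; only the thresholds (three consecutive failures, a 5% error rate, five minutes) come from the text.

```python
from dataclasses import dataclass


@dataclass
class ConsecutiveFailureDetector:
    """Signals an alert once `threshold` probes in a row have failed."""
    threshold: int = 3  # three consecutive failures, as described in the text
    failures: int = 0   # length of the current failure streak

    def record(self, success: bool) -> bool:
        # A successful probe resets the streak; a failed probe extends it.
        self.failures = 0 if success else self.failures + 1
        return self.failures >= self.threshold


def error_rate_sustained(samples, threshold=0.05, sustain_seconds=300):
    """Return True if the error rate stayed above `threshold` for more
    than `sustain_seconds`.

    `samples` is a hypothetical list of tuples, oldest first:
    (timestamp_in_seconds, failed_requests, total_requests).
    """
    breach_start = None
    for timestamp, failed, total in samples:
        rate = failed / total if total else 0.0
        if rate > threshold:
            if breach_start is None:
                breach_start = timestamp  # the breach begins at this sample
            if timestamp - breach_start > sustain_seconds:
                return True
        else:
            breach_start = None  # the rate recovered, so the timer restarts
    return False
```

In this sketch a single successful probe clears the failure streak, and a single sample at or below 5% restarts the five-minute timer. Whether the production rules behave the same way at these boundaries is not stated in this article.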

Back-End Monitoring

EmpowerID's identity lifecycle automation is often the primary reason clients use the platform, so monitoring back-end processes is critical to ensuring that the system functions. EmpowerID stores all vital information, including process state, in a single database, which allows a simple yet effective mechanism for reporting process health. A stored procedure named Z_EmpowerID_Health checks process state against predefined criteria and returns a list of problem conditions that require attention. A complete list of these health checks and their configurations is available at https://dotnetworkflow.jira.com/wiki/spaces/EIDADV23/pages/2984964548.

To monitor these processes, EmpowerID DevOps deploys a monitoring container that invokes the health-check procedure every five minutes and submits any reported problem conditions to Azure Monitor. If a problem condition is reported in consecutive polling intervals, a medium-priority alert is raised. In this way, EmpowerID's back-end processes are continually monitored to maintain overall system health.
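The polling logic described above could look roughly like the following Python sketch. It makes several assumptions: the `HealthPoller` class and the `fetch_conditions` callable are hypothetical stand-ins for the code that invokes Z_EmpowerID_Health, each problem condition is assumed to be identifiable by a name, and `required_polls=2` is an assumed reading of "consecutive polling intervals", since the article does not state the exact count.

```python
from typing import Callable, Dict, Iterable, Set


class HealthPoller:
    """Tracks how many polls in a row each problem condition was reported."""

    def __init__(self, fetch_conditions: Callable[[], Iterable[str]],
                 required_polls: int = 2):
        # fetch_conditions: hypothetical callable that runs the health-check
        # procedure and returns the names of the reported problem conditions.
        self.fetch_conditions = fetch_conditions
        self.required_polls = required_polls  # assumed value, not from the article
        self.streaks: Dict[str, int] = {}

    def poll_once(self) -> Set[str]:
        """Run one poll and return the conditions that should raise an alert."""
        current = set(self.fetch_conditions())
        # Conditions missing from this poll are dropped, so their streak
        # starts again from one if they reappear later.
        self.streaks = {name: self.streaks.get(name, 0) + 1 for name in current}
        return {name for name, count in self.streaks.items()
                if count >= self.required_polls}
```

A scheduler inside the monitoring container would call `poll_once()` every five minutes and forward the returned conditions to the alerting system. Requiring a repeat observation filters out conditions that clear on their own between polls.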

Infrastructure Monitoring

EmpowerID SaaS is hosted on Azure and uses several Azure products, such as Azure Kubernetes Service (AKS) and Azure SQL Database. EmpowerID DevOps monitors specific metrics for each service to detect issues before they affect front-end and back-end services. Depending on the metric and threshold, a medium- or high-severity alert is generated. For example, the metrics monitored for SQL Database include:

  • Remaining free space: less than 15% raises a medium-severity alert.

  • Deadlocks: more than three within ten minutes raises a high-severity alert.

  • Average CPU utilization: more than 90% raises a medium-severity alert.
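The SQL Database thresholds above can be expressed as a small rule function. This is a sketch for illustration only: the function name and parameter names are hypothetical, and in practice the rules are evaluated as alert rules in Azure Monitor rather than in application code.

```python
from typing import List, Tuple


def evaluate_sql_metrics(free_space_pct: float,
                         deadlocks_last_10_min: int,
                         avg_cpu_pct: float) -> List[Tuple[str, str]]:
    """Return (severity, reason) pairs for every threshold that is breached."""
    alerts = []
    if free_space_pct < 15:
        alerts.append(("medium", "free space below 15%"))
    if deadlocks_last_10_min > 3:
        alerts.append(("high", "more than three deadlocks within ten minutes"))
    if avg_cpu_pct > 90:
        alerts.append(("medium", "average CPU utilization above 90%"))
    return alerts
```

The comparisons are strict, matching the wording "less than" and "over" in the text, so values exactly at a threshold (15% free space, three deadlocks, 90% CPU) do not raise an alert in this sketch.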

Alert Handling

EmpowerID uses Azure Monitor to aggregate metrics, evaluate rules, and raise alerts. Actions configured in Azure Monitor trigger alerts in Atlassian Opsgenie, which then pages EmpowerID DevOps personnel. Depending on the severity, EmpowerID handles these alerts as follows:

  • For high-severity alerts, on-call personnel are paged regardless of the time of day, and the alert is escalated if it is not acknowledged.

  • For medium-severity alerts, personnel are paged during waking hours, allowing for a timely follow-up.
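The severity-based paging policy above can be sketched in Python as follows. The article does not define "waking hours", so the 08:00 to 20:00 window is purely an assumption, as are the function and constant names. In practice, a policy like this would be configured in the paging tool rather than written as code.

```python
from datetime import time

# Assumed waking-hours window; the article does not specify the actual hours.
WAKING_START = time(8, 0)
WAKING_END = time(20, 0)


def should_page_now(severity: str, now: time) -> bool:
    """Decide whether an alert of the given severity pages someone right now."""
    if severity == "high":
        return True  # high severity pages on-call staff at any time of day
    if severity == "medium":
        # Medium severity waits until the (assumed) waking-hours window.
        return WAKING_START <= now < WAKING_END
    raise ValueError(f"unknown severity: {severity!r}")
```

For example, `should_page_now("medium", time(3, 0))` returns `False`, so under this assumed window a medium-severity alert raised at 03:00 would be held until 08:00.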