Monitoring EmpowerID SaaS

This article provides an introductory overview of how EmpowerID monitors the availability of its SaaS environments. It does not cover every aspect of the Site Reliability Engineering or Security Information and Event Management work performed by EmpowerID. Instead, it describes the processes the DevOps team follows to ensure a base level of service with minimal impact on end users. The information is intended to help SaaS customers understand the monitoring EmpowerID performs and to help the in-house operations teams of non-SaaS customers achieve parity. EmpowerID's availability monitoring can be divided into three areas: front-end services, back-end services, and the underlying infrastructure monitored by EmpowerID DevOps.

Front-End Monitoring

To monitor site availability, EmpowerID DevOps verifies that the main web applications load successfully. Azure Monitor checks the following URLs every two minutes per Azure region:

  • Core Login (https://<core-domain>/WebIdpForms/Login/Portal)

  • IAM Shop (https://<iamshop-domain>), if applicable

  • My Identity (https://<myid-domain>), if applicable

Every request must succeed. Three consecutive failures raise a High-Priority alert, which is handled by the EmpowerID DevOps team.

In addition to active front-end monitoring, passive error-rate monitoring is optionally performed for environments with large user bases where the EmpowerID UI is used frequently. The Azure Application Gateway provides a failed-requests metric; if the error rate exceeds the 5% threshold for more than five minutes, a High-Priority alert is raised.
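The two front-end rules can be sketched in a few lines of Python. This is only an illustration of the alerting logic, not the Azure Monitor configuration EmpowerID uses: the class and function names are hypothetical, and the error-rate check assumes one (failed, total) sample per minute, which the article does not specify.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple

FAILURE_THRESHOLD = 3  # consecutive failed checks before alerting (from the article)


@dataclass
class AvailabilityCheck:
    """Tracks consecutive failures for one monitored URL (hypothetical helper)."""
    url: str
    consecutive_failures: int = 0

    def record(self, succeeded: bool) -> bool:
        """Record one probe result; return True when a High-Priority alert is due."""
        if succeeded:
            self.consecutive_failures = 0  # any success resets the streak
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures >= FAILURE_THRESHOLD


def error_rate_alert(samples: Iterable[Tuple[int, int]],
                     threshold: float = 0.05,
                     sustain_minutes: int = 5) -> bool:
    """Return True if the error rate stays above the threshold for more than
    `sustain_minutes` consecutive samples.

    Assumption: each sample is a (failed_requests, total_requests) pair
    covering one minute, oldest first.
    """
    run = 0
    for failed, total in samples:
        rate = failed / total if total else 0.0
        run = run + 1 if rate > threshold else 0
        if run > sustain_minutes:
            return True
    return False


# Usage: one tracker per URL, fed with the result of each two-minute probe.
login_check = AvailabilityCheck("https://<core-domain>/WebIdpForms/Login/Portal")
```

Resetting the counter on any success means that intermittent single failures never alert; only an unbroken run of three does.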

Back-End Monitoring

EmpowerID's identity lifecycle automation functionality is often the primary reason clients use the platform, so monitoring back-end processes is critical for ensuring system functionality. EmpowerID stores all vital information, including process state information, in one database, which enables a simple yet effective mechanism for reporting process health. A stored procedure called Z_EmpowerID_Health checks process state information against predefined criteria and outputs a list of problematic conditions requiring attention. A complete listing of these health checks and their configurations is available at EmpowerID HealthCheck: SQL Procedure Z_EmpowerID_Health.

To monitor these processes, EmpowerID DevOps deploys a monitoring container that invokes the health-check procedure every five minutes and submits any reported problem conditions to Azure Monitor. A medium-priority alert is raised if a problem condition is reported in consecutive polling intervals. In this way, EmpowerID DevOps continually monitors EmpowerID's various back-end processes to maintain overall system health.
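The "consecutive polling intervals" rule can be illustrated with a small poller. This is a sketch under stated assumptions: the `HealthPoller` class and the `fetch` callable are hypothetical, each problem condition is treated as an opaque string (the actual output columns of Z_EmpowerID_Health are not described here), and "consecutive" is taken to mean two polls in a row.

```python
from typing import Callable, Iterable, Set

POLL_INTERVAL_SECONDS = 5 * 60  # the article states a five-minute interval


class HealthPoller:
    """Flags problem conditions that appear in two consecutive polls.

    `fetch` is a hypothetical callable standing in for "invoke the
    health-check procedure and return the reported conditions".
    """

    def __init__(self, fetch: Callable[[], Iterable[str]]) -> None:
        self._fetch = fetch
        self._previous: Set[str] = set()

    def poll_once(self) -> Set[str]:
        """Run one poll and return the conditions that warrant a medium-priority alert."""
        current = set(self._fetch())
        # A condition alerts only if it was also present in the previous poll.
        repeated = self._previous & current
        self._previous = current
        return repeated
```

In a real container, `poll_once` would be called on a timer every `POLL_INTERVAL_SECONDS`, with `fetch` wrapping the database call. Requiring a repeat filters out conditions that clear on their own within one interval.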

Infrastructure Monitoring

EmpowerID SaaS is hosted on Azure and uses several Azure products, such as Azure Kubernetes Service (AKS) and SQL Database (as-a-Service). EmpowerID DevOps monitors specific metrics for each service to detect issues before they affect front-end and back-end services. Depending on the metric and threshold, a medium- or high-severity alert is generated. Some of the metrics monitored for SQL Database include:

  • Remaining free space: less than 15% raises a medium-severity alert.

  • Deadlocks: more than three deadlocks within ten minutes raise a high-severity alert.

  • Average CPU utilization: over 90% raises a medium-severity alert.
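The SQL Database thresholds above can be expressed as a small rule table. This is a minimal sketch, not the Azure Monitor alert rules themselves: the metric keys (`free_space_pct`, `deadlocks_10m`, `avg_cpu_pct`) are hypothetical names chosen for illustration, and only the thresholds and severities come from the article.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Rule:
    name: str
    severity: str  # "medium" or "high"
    breached: Callable[[Dict[str, float]], bool]


# Thresholds and severities as stated in the article; metric keys are assumptions.
SQL_RULES = [
    Rule("free space below 15%", "medium", lambda m: m["free_space_pct"] < 15),
    Rule("more than 3 deadlocks in 10 minutes", "high", lambda m: m["deadlocks_10m"] > 3),
    Rule("average CPU above 90%", "medium", lambda m: m["avg_cpu_pct"] > 90),
]


def evaluate(metrics: Dict[str, float]) -> List[Tuple[str, str]]:
    """Return (severity, rule name) for every rule the metrics breach."""
    return [(r.severity, r.name) for r in SQL_RULES if r.breached(metrics)]


# Usage: one snapshot of aggregated metrics for a database.
alerts = evaluate({"free_space_pct": 12, "deadlocks_10m": 0, "avg_cpu_pct": 40})
```

Keeping the rules as data makes it straightforward to add metrics for other services without touching the evaluation logic.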

Alert Handling

EmpowerID uses Azure Monitor to aggregate metrics, evaluate rules, and raise alerts. Actions configured in Azure Monitor trigger alerts in Atlassian Opsgenie, which then pages EmpowerID DevOps personnel. EmpowerID handles these alerts according to their severity:

  • For high-severity alerts, on-call personnel are paged regardless of the time of day, and the alert is escalated if it is not acknowledged.

  • For medium-severity alerts, personnel are paged during waking hours, allowing for a timely follow-up.
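The paging policy can be summarized as a single decision function. This is an illustrative sketch only: the article does not define "waking hours", so the 08:00 to 22:00 window below is an assumption, and `should_page_now` is a hypothetical name rather than an Opsgenie API.

```python
from datetime import time

# Assumption: "waking hours" are not defined in the article; this window is illustrative.
WAKING_START = time(8, 0)
WAKING_END = time(22, 0)


def should_page_now(severity: str, local_time: time) -> bool:
    """Decide whether an alert pages personnel immediately.

    High-severity alerts always page; medium-severity alerts page only
    inside the assumed waking-hours window.
    """
    if severity == "high":
        return True
    if severity == "medium":
        return WAKING_START <= local_time < WAKING_END
    raise ValueError(f"unknown severity: {severity!r}")
```

A medium-severity alert raised outside the window would be held until the window opens; escalation of unacknowledged high-severity alerts is left to the paging tool and is not modeled here.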