Change Data Capture Engine & Kusto Query
Overview of Change Data Capture Engine
It is a challenge to accurately and scalably log data changes during complex workflow events triggered by its event engine. To address this, EmpowerID has implemented a scalable solution known as Change Data Capture (CDC). Applied selectively to specific tables and attributes, CDC creates a comprehensive changelog of all modifications, eliminating the need for manual coding for each event. The changelog is sent to Azure Log Monitor, leveraging SQL Server's CDC and Kusto Query for efficient tracking, analysis, and extraction of data changes. This ensures a more streamlined and scalable event-driven architecture for EmpowerID.
The Challenge of Capturing Data Change
EmpowerID provides a robust event engine that enables users to create and wire various flow events for events like joiners, movers, new mailbox discoveries, and security breach events without any need for coding. After defining these events, the engine evaluates them against established policies to determine the appropriate workflows to execute in response. During these workflows, the flow events may cause changes in data, and it is hard to log exactly what data has changed.
Certain events cannot be mapped through simple rule-based processes. For instance, a “move” event can occur when two out of three attributes, namely, manager, department, or job title, change within 24 hours. Writing codes for each event to record data changes is not scalable or efficient. Therefore, to make this event-driven architecture possible, we have a change detection system in place that identifies when these events occur. EmpowerID's event engine offers a better solution using the Change Data Capture engine.
Solution for Capturing Data Change
EmpowerID effectively handles the task of logging data changes by implementing a scalable solution called Change Data Capture (CDC). This technology is applied selectively to specific tables and attributes within our platform, generating a comprehensive changelog of all modifications. Instead of manually coding for each event, the database seamlessly captures data changes for designated attributes through the implementation of CDC.
The recorded changelog is subsequently transmitted to Azure Log Monitor for in-depth analysis. This solution harnesses the capabilities of SQL Server's Change Data Capture (CDC) for tracking data modifications. Additionally, it maximizes the potential of Kusto Query, renowned for its swift processing and querying abilities, particularly in handling extensive datasets for real-time or time-series data analysis. Through the use of Kusto Query, the changelog stored can be queried and extracted, enabling the initiation of events or workflows in response to specific triggers.
Overview of the Change Data Capture Engine Architecture
Let's first understand the primary components and their role in the CDC engine before diving into the architecture of how the change detection engine works.
EmpowerID Web App: The EmpowerID web application is hosted on a SaaS platform and also includes various microservices.
EmpowerID Identity Warehouse: The EmpowerID Identity Warehouse is housed in the Azure SQL Database.
Powershell Scripts: In EmpowerID Change Data Capture (CDC), Windows PowerShell commands and instructions are used to fetch JSON details, execute stored procedures, and push changes to Log Analytics.
Log Analytic Workspace: A Log Analytics workspace is a unique environment for logging data from Azure Monitor and other Azure services.
Kusto Query: Kusto Query Language (KQL) is a powerful data analysis tool with a structure similar to SQL for querying structured data.
Now, let's dive into how the CDC engine architecture works in EmpowerID.
Data Modification by EmpowerID Workflows: EmpowerID workflows or flow events can lead to Insert, Delete, or Update commands (DML statements) in the EmpowerID Identity Warehouse, which is a database hosted on Azure SQL Database.
Change Data Capture (CDC) in EmpowerID Database: Changes resulting from DML statements on a table configured for Change Data Capture (CDC) in the EmpowerID database on Azure SQL Database are stored in a dedicated table. When CDC is enabled, an additional table is created for each user-configured table, tracking changes made to the specified columns through DML statements. Operations on the user-configured table, such as deleting a user account, are recorded in this CDC history table. The table name is derived from the original table name, with "_CT" (Change Tracking) appended. SQL Server adds metadata columns to track changes, including the type (insert, update, delete). Additionally, a table for storing deleted records is created with the same name as the CDC history table, followed by "_CT." This feature is enabled by default in Azure SQL Database once CDC is activated.
PowerShell Script for Data Transformation to Log Analytics: The PowerShell script plays a crucial role in transforming and transferring data to Azure Log Analytics. Deployed in an Azure Kubernetes Cluster as a container, the master script runs an infinite WHILE loop with a 5-minute pause, triggering the Mover's Detection script every 5 minutes. It executes the "dbo.ChangeStreaming" stored procedure, generating dynamic queries based on the provided CDC table name. These queries transform default log data generated by SQL CDC into a readable format, and the PowerShell script pushes the transformed data into the log analytics workspace.
Simplifying Queries with Kusto in Log Analytics: In Log Analytics Space, complex queries can be simplified using Kusto. Query results can be utilized to trigger events within the EmpowerID Web app, such as identifying changes in attributes related to joiner, mover, and leaver (JML) events within a specified timeframe.
Downstream Actions: The change detection engine's ultimate goal is to initiate workflows, events, or operations in EmpowerID, such as invoking an Azure function, upon a successful match to a Kusto query. For example, a query may identify a "mailbox discovered" event when two out of three attributes change, triggering an EmpowerID event that initiates downstream actions. The architecture diagram and example illustrate the insertion of records into the [BusinessRequestFlowEventInbox] table, leading to the generation of business requests and corresponding notifications. However, other actions, such as running a workflow or event, are also possible.