Change Data Capture with GCP BigQuery

Nilesh Khandalkar
3 min readMay 19, 2022

Change data capture (CDC) refers to the process of identifying and capturing changes made to data in a database and then delivering those changes in real-time to a downstream process or system.

Capturing every change from transactions in a source database and moving them to the target in real-time keeps the systems in sync and provides for reliable data replication and zero-downtime cloud migrations. CDC is perfect for modern cloud architectures since it’s a highly efficient way to move data across a wide area network. And, since it’s moving data in real-time, it also supports real-time analytics and data science.

In GCP, CDC can be achieved using various ways, using Dataflow or BigQuery SQL or Stored Procedures. This article will be focused on implementing CDC with BigQuery SQL and here we discuss only CDC Type 2 which is widely used in the Data warehousing. We will see two approaches on how this can be implemented, first one with use of Analytical Functions and second with using self joins.

Let’s see the below example:

Example 1

Day1: New customer C001. Day2: The same customer C001 has a type change. Day3: C001 has location change, if we need to maintain the history and also close the previous records and keep the latest record as active, this is CDC type 2 which is shown in the Target table.

This can be achieved by BigQuery SQL using Analytical function.

Using the above SQL, this will give the desired output as shown in the Target Table, however this will also show the records which are un-changed and are resent in the subsequent day or batch, for example:

Example 2

There can be a requirement to show only the records which have a change. This can be achieved by using the table to itself, i.e. SELF JOIN for the example 1.

The above SQL uses Self Join and then using Co-related subquery joins to the same table to check the related columns for any change. The Group by clause avoids the records which are not changed.

Finally, using MERGE we can INSERT/UPDATE THE rows in the TARGET TABLE by having the USING CLAUSE and matching all columns except EFF_START_TS

Hope this helps!

--

--

Nilesh Khandalkar

Passionate about Data and Cloud, working as Data Engineering Manager at Capgemini UK. GCP Professional Data Engineering Certified Airflow Fundamentals Certified