GCP Data Catalog Simplified

Nilesh Khandalkar
4 min read · Jan 12, 2022

First, let's understand what Data Catalog is: a fully managed, scalable metadata management service from GCP. You can use it to discover data (what data lives where), to understand data (what its source is and who owns it), and to identify highly sensitive data such as PII. In other words, it manages metadata, which generally comes in two kinds: Business Metadata and Technical Metadata.

Business Metadata: Business descriptions like what is the source of the data, owner of the data or any PII columns, etc.

Technical Metadata: Schema name, table name, column names, etc. This is populated automatically and appears in the DETAILS section of the table or dataset.

Data Catalog is a serverless, central repository for catalog information. Its features include governance, integration with DLP (the Data Loss Prevention API), policy tagging, search and discovery, and MaaS (Metadata-as-a-Service).

Now let's dive in and explore how to use Data Catalog, assuming we already have a dataset [data_catalog_trial] and a table [data_catalog_trial_table] in BigQuery.

data_catalog_trial_table

We can apply tags to a dataset, a table, or individual columns. The first step is to create a tag template: open Data Catalog in the GCP console > Tag Templates > Create Tag Template.

The next step is to create fields; you can create as many as you want. Fields are the attributes whose values you fill in when you tag your tables or columns. Here we create two fields: Source (the source of the table's data) and has_pii (whether the tagged column contains PII).
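The same template can also be created from code rather than the console. The following is a minimal sketch, assuming the google-cloud-datacatalog Python client is installed and credentials are configured. The template ID, display names, project ID, and location are hypothetical placeholders, not values from the walkthrough.

```python
def template_spec():
    """Plain-Python description of the two-field template described above."""
    return {
        "template_id": "data_catalog_trial_template",  # hypothetical ID
        "display_name": "Data Catalog Trial Template",  # hypothetical name
        "fields": {
            "source": {"display_name": "Source", "type": "STRING"},
            "has_pii": {"display_name": "Has PII", "type": "BOOL"},
        },
    }


def create_template(project_id, location, spec):
    """Create the tag template in Data Catalog (makes a real API call)."""
    # Imported here so the spec can be inspected without the client library.
    from google.cloud import datacatalog_v1

    client = datacatalog_v1.DataCatalogClient()
    template = datacatalog_v1.TagTemplate()
    template.display_name = spec["display_name"]

    for field_id, field in spec["fields"].items():
        template_field = datacatalog_v1.TagTemplateField()
        template_field.display_name = field["display_name"]
        # Map "STRING" / "BOOL" onto the client's primitive type enum.
        template_field.type_.primitive_type = getattr(
            datacatalog_v1.FieldType.PrimitiveType, field["type"]
        )
        template.fields[field_id] = template_field

    return client.create_tag_template(
        parent=f"projects/{project_id}/locations/{location}",
        tag_template_id=spec["template_id"],
        tag_template=template,
    )


# Example call (placeholder project and location):
# create_template("my-project", "us-central1", template_spec())
```

Keeping the field definitions in a plain dictionary makes it easy to review or version the template before anything is created in GCP.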

This creates our tag template, which can be applied to one or many objects, in this case tables or columns. Next, go back to the Data Catalog home page and search for the object (in this example, a table) to which the template should be applied.

Select the table, in this case data_catalog_trial_table. This opens a window asking whether you want to tag the table, its columns, or both; select the appropriate option. Here, we tag the id column of the table. Fill in the fields we defined in our tag template: source and has_pii.
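For reference, attaching the tag to the id column might look like this in code. This is a sketch under assumptions: it uses the google-cloud-datacatalog Python client, the project ID and tag values are placeholders, and template_name is the full resource name of a template you have already created.

```python
def bigquery_linked_resource(project_id, dataset_id, table_id):
    """Build the resource name Data Catalog uses to look up a BigQuery table."""
    return (
        f"//bigquery.googleapis.com/projects/{project_id}"
        f"/datasets/{dataset_id}/tables/{table_id}"
    )


def tag_id_column(project_id, template_name, source, has_pii):
    """Attach a tag to the id column of data_catalog_trial_table (real API call)."""
    # Imported here so the helper above works without the client library.
    from google.cloud import datacatalog_v1

    client = datacatalog_v1.DataCatalogClient()

    # Find the catalog entry that corresponds to the BigQuery table.
    entry = client.lookup_entry(
        request={
            "linked_resource": bigquery_linked_resource(
                project_id, "data_catalog_trial", "data_catalog_trial_table"
            )
        }
    )

    tag = datacatalog_v1.Tag()
    tag.template = template_name
    tag.column = "id"  # omit this line to tag the whole table instead

    tag.fields["source"] = datacatalog_v1.TagField()
    tag.fields["source"].string_value = source
    tag.fields["has_pii"] = datacatalog_v1.TagField()
    tag.fields["has_pii"].bool_value = has_pii

    return client.create_tag(parent=entry.name, tag=tag)


# Example call (placeholder values):
# tag_id_column(
#     "my-project",
#     "projects/my-project/locations/us-central1/tagTemplates/data_catalog_trial_template",
#     source="example upstream system",
#     has_pii=True,
# )
```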

This completes the tagging for the table. Now let's go back to the Data Catalog home page and search for the object or objects we tagged. Opening the table shows its details along with the tags.

Note that you can apply any number of different tag templates to a table or its columns, and the same tag template can be applied to other tables and their columns. You can also star the objects you search for most often; starred objects appear on the Data Catalog home page, so you don't need to search for them again.

You can also search by tag template, which lists the objects the template is attached to.
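The same search can be run programmatically. Below is a minimal sketch, assuming the google-cloud-datacatalog Python client and a search query of the form tag:PROJECT_ID.TEMPLATE_ID; the project and template IDs are placeholders, and the exact query syntax is worth checking against the Data Catalog search documentation.

```python
def tag_template_query(project_id, template_id):
    """Build a search query matching entries tagged with the given template."""
    return f"tag:{project_id}.{template_id}"


def find_tagged_entries(project_id, template_id):
    """Return linked resource names of entries tagged with the template."""
    # Imported here so the query helper works without the client library.
    from google.cloud import datacatalog_v1

    client = datacatalog_v1.DataCatalogClient()

    # Restrict the search to a single project.
    scope = datacatalog_v1.SearchCatalogRequest.Scope()
    scope.include_project_ids.append(project_id)

    results = client.search_catalog(
        scope=scope, query=tag_template_query(project_id, template_id)
    )
    return [result.linked_resource for result in results]


# Example call (placeholder values):
# for resource in find_tagged_entries("my-project", "data_catalog_trial_template"):
#     print(resource)
```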

Another interesting feature of Data Catalog is that you can restrict access to BigQuery data at the column level using policy tags, which requires creating a policy tag taxonomy.
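As a rough illustration of the first step, a taxonomy with a single policy tag could be created as follows. This is only a sketch, assuming the PolicyTagManagerClient from the google-cloud-datacatalog Python client; the taxonomy and policy tag names, project ID, and location are hypothetical, and attaching the policy tag to a BigQuery column and granting access are separate steps not shown here.

```python
def taxonomy_parent(project_id, location):
    """Build the parent resource name under which a taxonomy is created."""
    return f"projects/{project_id}/locations/{location}"


def create_pii_policy_tag(project_id, location):
    """Create a taxonomy with one policy tag and return the tag's resource name."""
    # Imported here so the helper above works without the client library.
    from google.cloud import datacatalog_v1

    client = datacatalog_v1.PolicyTagManagerClient()

    taxonomy = datacatalog_v1.Taxonomy(
        display_name="data_catalog_trial_taxonomy",  # hypothetical name
        activated_policy_types=[
            datacatalog_v1.Taxonomy.PolicyType.FINE_GRAINED_ACCESS_CONTROL
        ],
    )
    taxonomy = client.create_taxonomy(
        parent=taxonomy_parent(project_id, location), taxonomy=taxonomy
    )

    policy_tag = client.create_policy_tag(
        parent=taxonomy.name,
        policy_tag=datacatalog_v1.PolicyTag(display_name="pii"),  # hypothetical
    )
    return policy_tag.name


# Example call (placeholder values):
# create_pii_policy_tag("my-project", "us")
```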

Hope this helps!


Nilesh Khandalkar

Passionate about Data and Cloud, working as Data Engineering Manager at Capgemini UK. GCP Professional Data Engineering Certified, Airflow Fundamentals Certified.