DLP — Data Loss Prevention (Identify PII) simplified

Nilesh Khandalkar
4 min read · Sep 26, 2021

Google Cloud DLP provides tools to classify, mask, tokenize, and transform sensitive elements to help you better manage the data that you collect, store, or use for business or analytics. DLP mainly does two things: it inspects data, i.e. finds sensitive elements, and de-identifies data, i.e. masks those elements. This article focuses on the first part: finding, or inspecting, sensitive data.
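If you prefer the API over the console, the same inspection can be done with the google-cloud-dlp Python client. Below is a minimal sketch that scans a plain string for email addresses; the project ID and sample text are placeholders you would replace with your own.

```python
from google.cloud import dlp_v2

project_id = "my-project"  # hypothetical project ID - replace with your own

dlp = dlp_v2.DlpServiceClient()
parent = f"projects/{project_id}/locations/global"

# Ask DLP to look for email addresses with at least POSSIBLE likelihood.
inspect_config = {
    "info_types": [{"name": "EMAIL_ADDRESS"}],
    "min_likelihood": dlp_v2.Likelihood.POSSIBLE,
    "include_quote": True,  # return the matched text itself
}
item = {"value": "Contact John at john.doe@example.com for details."}

response = dlp.inspect_content(
    request={"parent": parent, "inspect_config": inspect_config, "item": item}
)
for finding in response.result.findings:
    print(finding.quote, finding.info_type.name, finding.likelihood)
```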

Inspecting and masking data is important to protect your information from accidental and intentional threats by ensuring that sensitive information is NOT available beyond the required stakeholders. It also helps companies stay compliant with location-specific data governance laws such as GDPR.

Let's dive into how to use DLP in Google Cloud to inspect, or find, sensitive data.

Say we have the table below, which has Name, Email, Credit, and Address columns. Email and Credit are likely to be sensitive data. Let's see how DLP helps identify these columns as PII.

Search for DLP in the search bar on the GCP console. It opens the DLP page:

Here we use DLP to identify sensitive or PII data in a BigQuery table; DLP can similarly identify sensitive or PII data in files stored in a GCS bucket.

First we need to create an inspection template, so click Create > Template > Inspect and fill in the required information.

The next step is to select infoTypes, which tell DLP what kind of data you are trying to identify, such as an email or credit card number. There are built-in infoTypes to select from under Manage Infotypes; these are the kinds of sensitive data you want to scan for. Here we choose EMAIL_ADDRESS.
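As a side note, the full catalogue of built-in infoTypes can also be pulled programmatically. A small sketch with the Python client:

```python
from google.cloud import dlp_v2

# List the built-in infoType detectors that DLP can scan for.
dlp = dlp_v2.DlpServiceClient()
response = dlp.list_info_types(request={})
for info_type in response.info_types:
    print(f"{info_type.name}: {info_type.display_name}")
```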

The next step is to choose a confidence threshold, which controls how closely a match must resemble the infoType before it is reported; here we choose Possible.

Click Create; this creates the inspection template:
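These console steps map directly onto the create_inspect_template API call. A sketch with the Python client, assuming a hypothetical project ID and display name:

```python
from google.cloud import dlp_v2

project_id = "my-project"  # hypothetical project ID
dlp = dlp_v2.DlpServiceClient()

# A template bundles the infoTypes to look for with the likelihood threshold.
inspect_template = {
    "display_name": "pii-email-template",  # hypothetical name
    "inspect_config": {
        "info_types": [{"name": "EMAIL_ADDRESS"}],
        "min_likelihood": dlp_v2.Likelihood.POSSIBLE,
    },
}

response = dlp.create_inspect_template(
    request={"parent": f"projects/{project_id}", "inspect_template": inspect_template}
)
print(f"Created template: {response.name}")  # keep this name for the job
```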

Now the next step is to create a job or job trigger to run this inspection template.

Choose Inspection > Input Data > fill in the job name > specify the location of the data, either a Cloud Storage bucket or a BigQuery table. In this example it is a BigQuery table.

Then select Configure Detection > Select Template and pick the template you created in the earlier stage.

The last step is to add actions, such as saving the inspection job results to BigQuery, publishing to Pub/Sub, sending an email, or publishing to Data Catalog. Here we choose to save to BigQuery: Add Actions > Save to BigQuery.
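In code, the input table, the detection template, and the Save to BigQuery action all live in one inspect job config passed to create_dlp_job. A sketch, assuming hypothetical dataset/table names and the template created above:

```python
from google.cloud import dlp_v2

project_id = "my-project"  # hypothetical project ID
dlp = dlp_v2.DlpServiceClient()

inspect_job = {
    # Input data: the BigQuery table to scan (names are hypothetical).
    "storage_config": {
        "big_query_options": {
            "table_reference": {
                "project_id": project_id,
                "dataset_id": "demo_dataset",
                "table_id": "customers",
            }
        }
    },
    # Detection: reuse the inspection template created earlier
    # (use the full resource name returned by create_inspect_template).
    "inspect_template_name": f"projects/{project_id}/inspectTemplates/pii-email-template",
    # Action: save the findings to another BigQuery table.
    "actions": [
        {
            "save_findings": {
                "output_config": {
                    "table": {
                        "project_id": project_id,
                        "dataset_id": "demo_dataset",
                        "table_id": "dlp_findings",
                    }
                }
            }
        }
    ],
}

job = dlp.create_dlp_job(
    request={"parent": f"projects/{project_id}", "inspect_job": inspect_job}
)
print(f"Started job: {job.name}")
```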

Finally, we schedule this job to run.
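For a recurring scan, the same job definition can be wrapped in a job trigger instead of a one-off job. A compact sketch; the 24-hour recurrence and all names are assumptions:

```python
from google.cloud import dlp_v2

project_id = "my-project"  # hypothetical project ID
dlp = dlp_v2.DlpServiceClient()

job_trigger = {
    "display_name": "daily-pii-scan",  # hypothetical name
    "inspect_job": {
        "storage_config": {
            "big_query_options": {
                "table_reference": {
                    "project_id": project_id,
                    "dataset_id": "demo_dataset",  # hypothetical
                    "table_id": "customers",       # hypothetical
                }
            }
        },
        "inspect_config": {"info_types": [{"name": "EMAIL_ADDRESS"}]},
    },
    # Re-run the scan every 24 hours.
    "triggers": [
        {"schedule": {"recurrence_period_duration": {"seconds": 24 * 60 * 60}}}
    ],
    "status": dlp_v2.JobTrigger.Status.HEALTHY,
}

trigger = dlp.create_job_trigger(
    request={"parent": f"projects/{project_id}", "job_trigger": job_trigger}
)
print(f"Created trigger: {trigger.name}")
```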

Under Job Configuration > JSON File, we see all the configuration specified for this job.

Click Confirm to create and run the job.

Here we see the job ran successfully and identified 3 rows containing EMAIL_ADDRESS as sensitive data.

The same information can be found in the BigQuery table we specified in the Add Actions step.
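Once the findings land in BigQuery, they can be summarised with a simple query. A sketch with the Python BigQuery client; the output table is the hypothetical one from the job sketch above, and the columns assume the standard DLP findings schema (info_type.name, likelihood):

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project ID

# Count findings per infoType and likelihood in the DLP output table.
query = """
    SELECT info_type.name AS info_type,
           likelihood,
           COUNT(*) AS findings
    FROM `my-project.demo_dataset.dlp_findings`
    GROUP BY 1, 2
    ORDER BY findings DESC
"""
for row in client.query(query).result():
    print(row.info_type, row.likelihood, row.findings)
```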

Hope this helps. The next article will focus on how to de-identify or remove the sensitive data.


Nilesh Khandalkar

Passionate about data and cloud, working as a Data Engineering Manager at Capgemini UK. GCP Professional Data Engineer certified; Airflow Fundamentals certified.