Google Dataflow Secure Quickstart with Python

In this step-by-step guide I will share the additional steps I followed to set up Google Dataflow with the minimum permissions needed, Google KMS encryption enabled, and all the dedicated service accounts one would be expected to employ in a secure production deployment.

I will base this guide on the existing GCP Quickstart guide for Dataflow with Python. That guide is good for learning Dataflow, but in terms of security it's a no-no: one of the first steps it asks you to follow is to create a service account with the Owner role.

What is Google Dataflow?

Google Dataflow is a managed cloud service for running pipelines built with the open-source Apache Beam SDK, which lets you pipe data from any source to any destination with powerful transformations in between.
With Google Dataflow you can process data in real-time streaming or in batch mode, from any data source to any data sink, in a serverless fashion.
The pipelines themselves are written with Apache Beam in Java or Python.

Dataflow Uses Compute Engine

Google Dataflow is a serverless solution with autoscaling, but it relies on GCP Compute Engine under the hood. Google Dataflow decides how many VMs it needs based on the machine types it is allowed to use and any constraints we provide before starting a job (e.g. a maximum number of workers).
The number of workers is then adjusted regularly based on the volume of data that needs to be processed, and any VMs that are no longer needed are automatically destroyed.
With generous enough constraints, Google Dataflow can process enormous volumes of data in very little time.

Just for fun, I recommend watching this Google video for a demo of Google Dataflow during a tram trip:

Google Dataflow use-cases

One common use-case is to extract data from an on-premises database, transform it, and load it into BigQuery for data analysis.
For those familiar with machine learning, Google Dataflow can also be used to build a data pipeline that regularly delivers fresh data to TensorFlow for training a model.
There are many other potential use-cases in the financial sector, which typically has to deal with large volumes of sensitive data.

This is where data security and data residency become very important, and with Google Dataflow we can encrypt data using customer-managed keys in Google KMS and also restrict where our data is stored and where it is processed.

Setup Guide for Google Dataflow

Create your GCP Project

You need a GCP project to try Dataflow. If you don't have one, create one.
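
If you prefer the command line, here is a minimal sketch with gcloud (PROJECT_ID is whatever unique ID you pick, and the display name is just an example):

# Create the project and make it the default for the following gcloud commands
gcloud projects create PROJECT_ID --name="dataflow-secure-quickstart"
gcloud config set project PROJECT_ID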

Enable Required APIs

To get started with Google Dataflow, you will need to enable the following APIs:

  • Google Dataflow API
  • Google Compute Engine API
  • Google Cloud Logging API
  • Google Cloud Storage API
  • Google Cloud Storage JSON API
  • Google BigQuery API
  • Google Cloud Pub/Sub API
  • Google Cloud Datastore API
  • Google Cloud Resource Manager API
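
All of these can be enabled in one go with gcloud. A sketch, assuming the service names below match the APIs listed above (check them against your project before running):

# Enable the APIs needed by Dataflow and the products used in this guide
gcloud services enable \
  dataflow.googleapis.com \
  compute.googleapis.com \
  logging.googleapis.com \
  storage-component.googleapis.com \
  storage-api.googleapis.com \
  bigquery.googleapis.com \
  pubsub.googleapis.com \
  datastore.googleapis.com \
  cloudresourcemanager.googleapis.com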

Create required Service accounts

To run Dataflow we will need to create at least two service accounts. 

  1. SA for administering Google Dataflow jobs with the Dataflow Admin role:
    Name: dataflow-admin@PROJECT_ID.iam.gserviceaccount.com
    Role: roles/dataflow.admin
    By assigning this service account the roles/dataflow.admin role, it has just enough permissions to create and manage Dataflow jobs.
  2. SA for running Google Dataflow jobs with the Dataflow Worker role:
    Name: dataflow-worker@PROJECT_ID.iam.gserviceaccount.com
    Role: roles/dataflow.worker
    It is important from a security point of view that we don't use the default Compute Engine service account for anything, since it has a broad set of permissions.
    Therefore, when starting a Dataflow job, we should override the default service account and specify one with a more restricted set of permissions. The roles/dataflow.worker role grants just enough permissions to run a Dataflow job, and any extra permissions needed for other GCP products can be added to this service account. A gcloud sketch for creating both accounts follows this list.
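
Here is a minimal sketch of creating both accounts and their role bindings with gcloud (replace PROJECT_ID; the display names are just examples). If I remember correctly, the admin account also needs the Service Account User role on the worker account in order to launch jobs that run as it:

# Create the two dedicated service accounts
gcloud iam service-accounts create dataflow-admin --display-name="Dataflow admin"
gcloud iam service-accounts create dataflow-worker --display-name="Dataflow worker"

# Grant each account only the Dataflow role it needs
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="serviceAccount:dataflow-admin@PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/dataflow.admin"
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="serviceAccount:dataflow-worker@PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/dataflow.worker"

# Let the admin account "act as" the worker account when launching jobs
gcloud iam service-accounts add-iam-policy-binding \
  dataflow-worker@PROJECT_ID.iam.gserviceaccount.com \
  --member="serviceAccount:dataflow-admin@PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/iam.serviceAccountUser"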

Create Required Buckets

We will need to create GCS buckets for storing results and any temporary files created by the Dataflow workers.
It is possible to run Google Dataflow jobs with just one GCS bucket, but you might want to create a separate bucket for each type of data you will be storing (a gsutil sketch follows the list):

  1. GCS Bucket for temp files:
    Bucket Name: dataflow-data-temp
  2. GCS Bucket for staging files:
    Bucket Name: dataflow-data-staging
  3. GCS Bucket for output files: 
    Bucket Name: dataflow-data-output
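
A sketch of creating the buckets with gsutil (bucket names are globally unique, so in practice you will likely need your own prefix; DATAFLOW_REGION is the region you plan to run jobs in):

# Create one bucket per data type, in the same region as the Dataflow job
gsutil mb -l DATAFLOW_REGION gs://dataflow-data-temp
gsutil mb -l DATAFLOW_REGION gs://dataflow-data-staging
gsutil mb -l DATAFLOW_REGION gs://dataflow-data-output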

The nice thing about having more than one bucket is that we can now set a lifecycle rule appropriate for each type of data stored, so that any object older than x days is automatically deleted:

Lifecycle Rule:
Age: 3
Action: Delete
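
As a sketch, the same rule can be applied from the command line with gsutil (the file name lifecycle.json is just an example; apply it to each bucket that needs it):

# Write the lifecycle policy to a file and apply it to the temp bucket
cat > lifecycle.json <<'EOF'
{
  "rule": [
    {
      "action": {"type": "Delete"},
      "condition": {"age": 3}
    }
  ]
}
EOF
gsutil lifecycle set lifecycle.json gs://dataflow-data-temp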

For each bucket, we can also choose to encrypt it with a Customer-Managed Encryption Key using Google KMS.
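
Assuming you already have a key ring and key in Google KMS, a sketch of setting it as a bucket's default encryption key looks like this (the key path components are placeholders):

# Allow the Cloud Storage service agent to use the key, then set it as the bucket default
gsutil kms authorize -p PROJECT_ID \
  -k projects/PROJECT_ID/locations/KEY_LOCATION/keyRings/KEY_RING/cryptoKeys/KEY_NAME
gsutil kms encryption \
  -k projects/PROJECT_ID/locations/KEY_LOCATION/keyRings/KEY_RING/cryptoKeys/KEY_NAME \
  gs://dataflow-data-output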

Running the Python Wordcount Apache Beam example

Instead of repeating all the steps, I am just going to outline the steps in the tutorial that you need to change.

  1. Generate a JSON key for the service account dataflow-admin@PROJECT_ID.iam.gserviceaccount.com and use that service account to start the Dataflow job (a gcloud sketch is shown after the note below).
  2. Pass extra parameters to the Python wordcount example:
python -m apache_beam.examples.wordcount \
--region DATAFLOW_REGION \
--input gs://dataflow-samples/shakespeare/kinglear.txt \
--output gs://dataflow-data-output/results/outputs \
--runner DataflowRunner \
--project PROJECT_ID \
--temp_location gs://dataflow-data-temp/tmp \
--staging_location gs://dataflow-data-staging/staging \
--dataflow_kms_key <PATH_TO_YOUR_KMS_KEY> \
--service_account_email dataflow-worker@PROJECT_ID.iam.gserviceaccount.com

Note that we are also providing a Google KMS key (CMEK) with --dataflow_kms_key. This option ensures that any data that can be encrypted in Dataflow will be encrypted using that Google KMS key. The value is the key's full resource name, e.g. projects/PROJECT_ID/locations/KEY_LOCATION/keyRings/KEY_RING/cryptoKeys/KEY_NAME.
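
For step 1, a minimal sketch (the key file name is just an example, and the key file should be treated as a secret):

# Create a JSON key for the admin service account and point client tooling at it
gcloud iam service-accounts keys create dataflow-admin-key.json \
  --iam-account=dataflow-admin@PROJECT_ID.iam.gserviceaccount.com
export GOOGLE_APPLICATION_CREDENTIALS="$PWD/dataflow-admin-key.json"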

Conclusion

I have given you the steps that complement Google's Quickstart tutorial for Dataflow with Python and let you start from a solid foundation for a secure implementation of Google Dataflow.
Have I missed anything important? Please let me know!

Resources

https://cloud.google.com/dataflow/docs/quickstarts/create-pipeline-python