Python models

Python model execution on BigQuery

dbt uses the Dataproc service to execute Python models as Spark jobs on BigQuery data. There are two submission methods for Python models: cluster and serverless. The cluster method requires an existing, running Spark cluster; the serverless method creates a Dataproc batch job based on your configuration.
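A dbt Python model is a .py file in your models directory that defines a model(dbt, session) function; the submission method can also be set per model via dbt.config. A minimal sketch (the upstream model name stg_orders, the column names, and the tax rate are illustrative; on Dataproc, dbt.ref returns a Spark DataFrame, so in practice you would use Spark APIs rather than the pandas-style transformation shown here):

```python
import pandas as pd


def model(dbt, session):
    # Choose the submission method per model: "serverless" or "cluster".
    dbt.config(submission_method="serverless", materialized="table")

    # dbt.ref() returns the upstream relation; a pandas DataFrame is
    # assumed here for illustration only.
    orders = dbt.ref("stg_orders")

    # Illustrative transformation: add a derived column.
    orders["amount_with_tax"] = orders["amount"] * 1.27
    return orders
```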

The serverless method is easier to set up and operate, and it is usually more cost effective than the cluster method, but its startup time can be longer. The cluster method requires creating a cluster beforehand (with a configuration that enables connections to BigQuery), and you have to make sure it is running when you execute your Python model. The cluster does not stop automatically when execution ends, which increases costs.

dbt BigQuery setup

To configure your BigQuery connection profile for Python model execution, you can reference dbt's documentation here.
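A profiles.yml sketch for Python model execution might look like the following; the project, dataset, bucket, and region values are placeholders, and the keys shown (gcs_bucket, dataproc_region, submission_method, dataproc_cluster_name) are the Python-model settings described in dbt's BigQuery profile documentation:

```yaml
my_profile:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: service-account
      project: my-gcp-project            # placeholder
      dataset: my_dataset                # placeholder
      keyfile: /path/to/keyfile.json     # placeholder
      # Python model settings
      gcs_bucket: my-staging-bucket      # bucket used to stage model code
      dataproc_region: europe-west1      # region of the batch job / cluster
      submission_method: serverless      # or "cluster"
      # dataproc_cluster_name: my-cluster  # required for the cluster method
```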

Serverless configuration

GCP's documentation on the Spark batch workload properties can be found here. You can set these in your profiles.yml under dataproc_batch.runtime_config.properties.
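As a sketch, the properties are set as string key-value pairs nested under the output in profiles.yml; the property names below are standard Spark settings and the values are illustrative only (Dataproc Serverless also constrains some of them, e.g. allowed executor core counts):

```yaml
      dataproc_batch:
        runtime_config:
          properties:
            spark.executor.cores: "4"
            spark.executor.instances: "8"
            spark.driver.memory: "4g"
```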

Using python packages

When you submit a serverless Python model, a Dataproc batch job is created that uses a default container image at runtime. This default image has a number of Python packages preinstalled (e.g. PySpark, pandas, NumPy). The documentation is unclear on the full list of preinstalled packages; if you want to use a package that is not available, you can create a custom container image that contains your dependencies. Google's guide on building custom container images can be found here. The image needs to be stored in GCP's Artifact Registry, and you need to reference it in your dbt profile by setting dataproc_batch.runtime_config.container_image to the image URL.
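In profiles.yml this looks roughly as follows; the Artifact Registry path is a placeholder for your own image:

```yaml
      dataproc_batch:
        runtime_config:
          container_image: "europe-west1-docker.pkg.dev/my-project/my-repo/dbt-spark-image:latest"
```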

Cluster configuration

If you want to run your Python models on a dedicated cluster, that cluster must be able to access BigQuery and the GCS staging bucket. This document shows how to include the GCS connector and how to run a BigQuery connector script as an initialization action when creating the cluster.
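As an illustration, cluster creation with the connectors initialization action might look like the command below; the cluster name, region, image version, and connector version are placeholders you should replace with values matching your environment, and the exact flags should be checked against the linked document:

```
gcloud dataproc clusters create my-dbt-cluster \
  --region europe-west1 \
  --image-version 2.1 \
  --metadata spark-bigquery-connector-version=0.34.0 \
  --initialization-actions gs://goog-dataproc-initialization-actions-europe-west1/connectors/connectors.sh
```

Remember that the cluster keeps running (and billing) after your dbt run finishes unless you stop or delete it.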

Other useful resources

How to build dbt Python models in BigQuery?