This hands-on lab shows how to use Google Cloud Storage (GCS) as the primary input and output location for Dataproc cluster jobs. Using GCS instead of the Hadoop Distributed File System (HDFS) lets us treat clusters as ephemeral: we can delete a cluster that is no longer in use and still keep our data.
In this lab, you will create a single-node Dataproc cluster and a GCS bucket for your PySpark job output. Because storage is separate from compute, you can delete the cluster when you are done and the results remain in the bucket.
Prepare Our Environment
- First, we need to enable the Dataproc API:
gcloud services enable dataproc.googleapis.com
- Then create a Cloud Storage bucket:
gsutil mb -l us-central1 gs://$DEVSHELL_PROJECT_ID-data
- Now create the Dataproc cluster:
gcloud dataproc clusters create wordcount --region=us-central1 --zone=us-central1-f --single-node --master-machine-type=n1-standard-2
- Validate that the Dataproc cluster has been created:
In the web console, go to BigData > Dataproc > Clusters. The wordcount cluster should be listed as up and running.
- And finally, download the wordcount.py file that will be used for the PySpark job:
gsutil cp -r gs://acg-gcp-labs-resources/data-engineer/dataproc/* .
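The contents of wordcount.py are not shown in this lab. As a rough idea of what such a script might look like, here is a minimal sketch, assuming it takes an input URI and an output URI as its two arguments (which is how it is invoked in the next section). The function names split_words and run_wordcount and the tokenizing rule are hypothetical and not taken from the downloaded file.

```python
import re
import sys


def split_words(line):
    """Lowercase a line and return its words (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", line.lower())


def run_wordcount(input_uri, output_uri):
    """Count words in input_uri and write (word, count) pairs to output_uri."""
    # Imported here so the helper above can be used without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()
    # On Dataproc, gs:// paths can be read and written like HDFS paths.
    lines = spark.sparkContext.textFile(input_uri)
    counts = (
        lines.flatMap(split_words)
        .map(lambda word: (word, 1))
        .reduceByKey(lambda a, b: a + b)
    )
    counts.saveAsTextFile(output_uri)
    spark.stop()


if __name__ == "__main__":
    # Hypothetical command line: wordcount.py <input_uri> <output_uri>
    if len(sys.argv) == 3:
        run_wordcount(sys.argv[1], sys.argv[2])
    else:
        print("usage: wordcount.py <input_uri> <output_uri>", file=sys.stderr)
```

Because a script like this reads from and writes to gs:// URIs rather than HDFS, nothing it produces lives on the cluster's disks. Note that saveAsTextFile fails if the output directory already exists, so rerunning such a job needs a fresh output path.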
Submit the PySpark Job to the Dataproc Cluster
In Cloud Shell, type:
gcloud dataproc jobs submit pyspark wordcount.py --cluster=wordcount --region=us-central1 -- \
  gs://acg-gcp-labs-resources/data-engineer/dataproc/romeoandjuliet.txt \
  gs://$DEVSHELL_PROJECT_ID-data/output/
Review the PySpark Output
- In Cloud Shell, download output files from the GCS output location:
gsutil cp -r gs://$DEVSHELL_PROJECT_ID-data/output/* .
- Note: Alternatively, we could download them to our local machine via the web console.
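To get a quick look at the downloaded results, the part files can be combined locally. This is a sketch under an assumption that the lab does not confirm: that each output line is a Python tuple such as ('romeo', 3), which is what saveAsTextFile writes for an RDD of (word, count) pairs. The helper names parse_count_line and load_counts are hypothetical. If the real output uses another format, parse_count_line would need to change.

```python
import ast
import glob
from collections import Counter


def parse_count_line(line):
    """Parse one line such as "('romeo', 3)" into ('romeo', 3).

    Assumes the tuple-style text format described above.
    """
    word, count = ast.literal_eval(line.strip())
    return str(word), int(count)


def load_counts(pattern="part-*"):
    """Sum the counts found in every file matching the glob pattern."""
    totals = Counter()
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                if line.strip():  # skip blank lines
                    word, count = parse_count_line(line)
                    totals[word] += count
    return totals


if __name__ == "__main__":
    # Print the ten most frequent words from part files in the current directory.
    for word, count in load_counts().most_common(10):
        print(f"{count:6d}  {word}")
```

Run it from the directory where the output files were copied. The _SUCCESS marker file that Spark writes alongside the part files is ignored because it does not match the part-* pattern.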
Delete the Dataproc Cluster
- We don’t need our cluster any longer, so let’s delete it. In the web console, go to the top-left menu and into BigData > Dataproc.
- Select the wordcount cluster, then click DELETE, and OK to confirm. Our job output remains in Cloud Storage, so we can delete Dataproc clusters that are no longer in use to save costs while preserving our input and output data.
Wait until the cluster is deleted.