Cloud Dataproc is a managed Spark and Hadoop service that lets you take advantage of open source data tools for batch processing, querying, streaming, and machine learning. Cloud Dataproc automation helps you create clusters quickly, manage them easily, and save money by turning clusters off when you don't need them. With less time and money spent on administration, you can focus on your jobs and your data.
This lab is adapted from https://cloud.google.com/dataproc/quickstart-console
What you'll learn
- How to create a managed Cloud Dataproc cluster (with Apache Spark pre-installed).
- How to submit a Spark job.
- How to shut down your cluster.
What you'll need
Self-paced environment setup
Remember the project ID, a unique name across all Google Cloud projects (the name above has already been taken and will not work for you, sorry!). It will be referred to later in this codelab as
Next, you'll need to enable billing in the Cloud Console in order to use Google Cloud resources.
Running through this codelab shouldn't cost you more than a few dollars, but it could be more if you decide to use more resources or if you leave them running (see "cleanup" section at the end of this document).
New users of Google Cloud Platform are eligible for a $300 free trial.
Click on the menu icon in the top left of the screen.
Select APIs & Services from the drop-down menu.
Click on Enable APIs and Services.
Search for "Google Compute Engine" in the search box. Click on "Google Compute Engine API" in the results list that appears.
On the Google Compute Engine page, click Enable.
Once the API is enabled, click the arrow to go back.
Now search for "Google Cloud Dataproc API" and enable it as well.
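If you prefer the command line to the console, the same two APIs can usually be enabled with `gcloud services enable`. The sketch below only builds and prints the command; it assumes the Cloud SDK is installed, and `my-project-id` is a placeholder rather than a value from this codelab.

```python
import shlex

# Service names for the two APIs enabled in the console steps above.
REQUIRED_SERVICES = ("compute.googleapis.com", "dataproc.googleapis.com")


def enable_services_command(project_id, services=REQUIRED_SERVICES):
    """Build the argv for `gcloud services enable` without running it."""
    return ["gcloud", "services", "enable", *services, f"--project={project_id}"]


# "my-project-id" is a placeholder; substitute your own project ID.
print(shlex.join(enable_services_command("my-project-id")))
```

Paste the printed command into a terminal, or pass the list to `subprocess.run` once you have checked it.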
In the Cloud Console, click the menu icon in the top left of the screen.
Then navigate to Dataproc in the drop-down menu.
After clicking, you should see an empty cluster list if the project has no clusters.
To create a new cluster, click Create cluster.
There are many parameters you can configure when creating a new cluster. Most of the default cluster settings, which include two worker nodes, should be sufficient for this tutorial. Let's also use the following:
Learn more about zones in Regions & Zones documentation.
Machine type (Master node)
Machine type (Worker nodes)
Click on Create to create the new cluster!
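For reference, a roughly equivalent cluster can be created from the command line. This is a minimal sketch that only builds the `gcloud dataproc clusters create` command. It assumes the cluster name `gcelab` and region `us-central1` used in the job steps of this codelab, and it leaves the zone and machine types at their defaults rather than guessing at values.

```python
import shlex


def create_cluster_command(name, region, num_workers=2):
    """Build the argv for creating a Dataproc cluster without running it."""
    return [
        "gcloud", "dataproc", "clusters", "create", name,
        f"--region={region}",
        # Two workers matches the default mentioned in the text above.
        f"--num-workers={num_workers}",
    ]


# Name and region are assumptions taken from the later job-submission steps.
print(shlex.join(create_cluster_command("gcelab", "us-central1")))
```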
Select Jobs in the left nav to switch to Dataproc's jobs view.
Click Submit job.
Select us-central1 from the Region drop-down menu.
Select your new cluster gcelab from the Cluster drop-down menu.
Select Spark from the Job type drop-down menu.
Enter file:///usr/lib/spark/examples/jars/spark-examples.jar in the Jar files field.
Enter org.apache.spark.examples.SparkPi in the Main class or jar field.
Enter 1000 in the Arguments field to set the number of tasks.
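The form fields above map fairly directly onto `gcloud dataproc jobs submit spark`. The sketch below builds that command from the same values (cluster, region, jar, main class, and the single argument). It prints the command instead of running it, and the helper name is hypothetical.

```python
import shlex

SPARK_EXAMPLES_JAR = "file:///usr/lib/spark/examples/jars/spark-examples.jar"
SPARK_PI_CLASS = "org.apache.spark.examples.SparkPi"


def submit_spark_pi_command(cluster, region, tasks=1000):
    """Build the argv for submitting the SparkPi example job."""
    return [
        "gcloud", "dataproc", "jobs", "submit", "spark",
        f"--cluster={cluster}",
        f"--region={region}",
        f"--class={SPARK_PI_CLASS}",
        f"--jars={SPARK_EXAMPLES_JAR}",
        # Everything after "--" is passed to the job as its arguments.
        "--", str(tasks),
    ]


print(shlex.join(submit_spark_pi_command("gcelab", "us-central1")))
```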
Your job should appear in the Jobs list, which shows all your project's jobs with their cluster, type, and current status. The new job displays as "Running", and then "Succeeded" once it completes.
To see your completed job's output:
Click the job ID in the Jobs list.
Select Line Wrapping to avoid scrolling.
You should see that your job has successfully calculated a rough value for pi!
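To see why the value is only rough, it helps to look at the kind of calculation involved. The SparkPi example is commonly described as a Monte Carlo estimate: pick random points in a square and count how many land inside the inscribed circle. The single-process sketch below illustrates that idea under that assumption; it is not the Spark example's actual code and does not distribute the work across tasks.

```python
import random


def estimate_pi(samples, seed=0):
    """Estimate pi by sampling points uniformly in the square [-1, 1] x [-1, 1]."""
    rng = random.Random(seed)  # fixed seed so repeated runs are reproducible
    inside = 0
    for _ in range(samples):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:  # point falls inside the unit circle
            inside += 1
    # Circle area / square area = pi / 4, so scale the hit ratio by 4.
    return 4.0 * inside / samples


print(estimate_pi(100_000))
```

More samples tighten the estimate, which is why a larger number of tasks tends to give a value closer to pi.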
You can shut down a cluster on the Clusters page.
Select the checkbox next to the gcelab cluster.
Then click Delete.
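The same cleanup can be sketched for the command line. The helper below builds a `gcloud dataproc clusters delete` command for the `gcelab` cluster; the region is assumed to be `us-central1`, as in the job steps, and the command is printed, not run.

```python
import shlex


def delete_cluster_command(name, region, skip_prompt=True):
    """Build the argv for deleting a Dataproc cluster without running it."""
    cmd = ["gcloud", "dataproc", "clusters", "delete", name, f"--region={region}"]
    if skip_prompt:
        # --quiet is gcloud's global flag for skipping confirmation prompts.
        cmd.append("--quiet")
    return cmd


print(shlex.join(delete_cluster_command("gcelab", "us-central1")))
```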
You learned how to create a Dataproc cluster, submit a Spark job, and shut down your cluster!
- Dataproc Documentation: https://cloud.google.com/dataproc/overview
- Provisioning and Using a Managed Hadoop/Spark Cluster with Cloud Dataproc (Command Line) codelab
This work is licensed under a Creative Commons Attribution 3.0 Generic License, and Apache 2.0 license.