BigQuery is Google's fully managed, NoOps, low cost analytics database. With BigQuery you can query terabytes and terabytes of data without having any infrastructure to manage and don't need a database administrator. BigQuery uses familiar SQL and it can take advantage of pay-as-you-go model. BigQuery allows you to focus on analyzing data to find meaningful insights.
In this lab, we will explore the Wikipedia dataset using BigQuery.
What you'll learn
- Using BigQuery
- Load a real-world dataset into BigQuery
- Writing a query to gain insight into a large dataset
What you'll need
How will you use this tutorial?
How would rate your experience with Google Cloud Platform?
If you don't already have a Google Account (Gmail or Google Apps), you must create one.
- Sign-in to Google Cloud Platform console (console.cloud.google.com) and navigate to BigQuery. You can also open the BigQuery web UI directly by entering the following URL in your browser.
- Accept the terms of service.
- Before you can use BigQuery, you must create a project. Follow the prompts to create your new project.
Choose a project name and make note of the project ID.
The project ID is a unique name across all Google Cloud projects. It will be referred to later in this codelab as
This codelab uses BigQuery resources withing the BigQuery sandbox limits. A billing account is not required. If you later want to remove the sandbox limits, you can add a billing account by signing up for the Google Cloud Platform free trial.
We'll load the Wikipedia dataset in the next section.
First, create a new dataset in the project. A dataset is composed of multiple tables. To create a dataset, click the project name under the resources pane, then click the Create dataset button:
lab as the Dataset ID:
Click the Create dataset button to create an empty dataset.
Activate Google Cloud Shell
From the GCP Console click the Cloud Shell icon on the top right toolbar:
Then click "Start Cloud Shell":
It should only take a few moments to provision and connect to the environment:
This virtual machine is loaded with all the development tools you'll need. It offers a persistent 5GB home directory, and runs on the Google Cloud, greatly enhancing network performance and authentication. Much, if not all, of your work in this lab can be done with simply a browser or your Google Chromebook.
Once connected to Cloud Shell, you should see that you are already authenticated and that the project is already set to your PROJECT_ID.
Run the following command in Cloud Shell to confirm that you are authenticated:
gcloud auth list
Credentialed accounts: - <myaccount>@<mydomain>.com (active)
gcloud config list project
[core] project = <PROJECT_ID>
If it is not, you can set it with this command:
gcloud config set project <PROJECT_ID>
Updated property [core/project].
Load data into BigQuery
For your convenience, some of the data for April 10, 2019 from the Wikimedia pageviews dataset is available on Google Cloud Storage at
gs://cloud-samples-data/third-party/wikimedia/pageviews/pageviews-20190410-140000.gz. The data file is a GZip'ed CSV file. You can load this file directly using the
bq command line utility. As part of the load command, you'll also describe the schema of the file.
bq load \ --source_format CSV \ --field_delimiter " " \ --quote "" \ --max_bad_records 3 \ $GOOGLE_CLOUD_PROJECT:lab.pageviews_20190410_140000 \ gs://cloud-samples-data/third-party/wikimedia/pageviews/pageviews-20190410-140000.gz \ wiki,title,requests:integer,zero:integer
You used a few advanced options to load the page views file:
--source_format CSVto indicate the file should be parsed as a CSV file. This step is optional, since CSV is the default format.
--field_delimiter " "to indicate that a single space is used to delimit fields.
--quote ""to indicate that strings are unquoted.
--max_bad_records 3to ignore at most 3 errors while parsing the CSV file. This is needed because there are several lines in the file with only 2 columns.
You can learn more about the bq command line in the documentation.
In the BigQuery console, open one of the tables that you just loaded.
- Expand the project.
- Expand the dataset.
- Select the table.
You can see the table schema in the Schema view on the right. Find out how much data is in the table, by navigating to the Details view:
Open the Preview view to see a selection of rows from the table.
Click Compose new query button on the top right:
This will bring up the Query editor view:
Find the total number of Wikimedia views between 2-3pm on April 10, 2019, by writing this query:
SELECT SUM(requests) FROM `lab.pageviews_20190410_140000`
Click the Run button:
In a few seconds, the result will be listed in the bottom, and it'll also tell you how much data was processed:
This query processed 123.9 MB, even though the table is 691.4 MB. BigQuery only processes the bytes from the columns which are used in the query, so the total amount of data processed can be significantly less than the table size. With clustering and partitioning, the amount of data processed can be reduced even further.
Find Wikipedia page views
The Wikimedia dataset contains page views for all of the Wikimedia projects (including Wikipedia, Wiktionary, Wikibooks, Wikiquotes, etc). Let's narrow down the query to just English Wikipedia pages by adding a
SELECT SUM(requests), wiki FROM `lab.pageviews_20190410_140000` WHERE wiki = "en" GROUP BY wiki
Notice that, by querying an additional column,
wiki, the amount of data processed increased from 124 MB to 204 MB.
BigQuery supports many of the familiar SQL clauses, such as
ORDER BY, and a number of aggregation functions. In addition, you can also use regular expressions to query text fields! Let's try one:
SELECT title, SUM(requests) requests FROM `lab.pageviews_20190410_140000` WHERE wiki = "en" AND REGEXP_CONTAINS(title, 'Red.*t') GROUP BY title ORDER BY requests DESC
Query across multiple tables
You can select a range of tables to form the union using a wildcard table. First, create a second table to query over by loading the next hour's page views into a new table:
bq load \ --source_format CSV \ --field_delimiter " " \ --quote "" \ $GOOGLE_CLOUD_PROJECT:lab.pageviews_20190410_150000 \ gs://cloud-samples-data/third-party/wikimedia/pageviews/pageviews-20190410-150000.gz \ wiki,title,requests:integer,zero:integer
In the Query editor pane, query over both tables you loaded by querying tables with "
pageviews_2019" as a prefix:
SELECT title, SUM(requests) requests FROM `lab.pageviews_2019*` WHERE wiki = "en" AND REGEXP_CONTAINS(title, 'Red.*t') GROUP BY title ORDER BY requests DESC
You can filter the tables more selectively with the _TABLE_SUFFIX pseudo column. This query will limit to tables corresponding to April 10.
SELECT title, SUM(requests) requests FROM `lab.pageviews_2019*` WHERE _TABLE_SUFFIX BETWEEN '0410' AND '0410_9999999' AND wiki = "en" AND REGEXP_CONTAINS(title, 'Red.*t') GROUP BY title ORDER BY requests DESC
Optionally, delete the dataset you created with the
bq rm command. Use the
-r flag to remove any tables it contains.
bq rm -r lab
You've used BigQuery and SQL to query the real-world Wikipedia page views dataset. You have the power to query petabyte-scale datasets!
What we've covered
- How to use SQL-like syntax to query Wikipedia records
- Writing a query to gain insight into a large dataset