Beginner's guide to BQ – Part 2: Query CSV & JSON files

What is this article about?

In this article, we will start querying CSV and JSON files in Google Cloud Storage (GCS) and creating new tables from existing tables (an ETL process).

This article is mainly for BI developers who want to expand their capabilities to handle Big Data and who have successfully finished part one.

Creating tables

In this case, I needed to create two tables that hold YouTube data from Google Storage.

I will present two examples – one over CSV files and another over JSON files; you can find them here.

After that, we will create tables for those files, and join both tables.

The whole process is as follows:

Tables schema

Query the CSV Files

  1. Download the attached CSV files. Because the data is structured, this use case is the simpler one. The table is for the ingestion level (MRR) and should be named YouTubeVideosShorten.
  2. Create a new folder in your bucket named YouTubeVideos and put the files there.
  3. Create the table in BigQuery (you can see my configuration in the picture below).
  4. After creating your table, make sure it appears in the table list.
  5. Now query your data, for example:

SELECT
  SUBSTR(_FILE_NAME, LENGTH('gs://rainbowdash/YouTubeVideos/') + 1, 2) country,
  V.video_id,
  V.title,
  V.category_id,
  V.tags,
  V.views,
  V.likes,
  V.dislikes,
  V.comment_count,
  CAST(PARSE_DATE('%y.%d.%m', V.trending_date) AS TIMESTAMP) trending_date,
  V.publish_time publish_time,
  TIMESTAMP_DIFF(CAST(PARSE_DATE('%y.%d.%m', V.trending_date) AS TIMESTAMP),
    V.publish_time, HOUR) duration
FROM
  `big-passage-263811.tutorialFrankfurt.youtubevideosshorten` V

In this case, we use the table as an “external” table, and we calculate how long each video took to become trending after its upload. We also take the first two letters of each file name to get the country.
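
If you prefer DDL over the console wizard, the same external table can also be defined in SQL. This is only a sketch: the bucket path matches the queries above, but the column list and types are my assumption based on the fields the query selects, so adjust them to your actual files.

-- A sketch of the console configuration as DDL (column types assumed).
CREATE EXTERNAL TABLE `big-passage-263811.tutorialFrankfurt.youtubevideosshorten` (
  video_id STRING,
  trending_date STRING,  -- kept as STRING; parsed with PARSE_DATE at query time
  title STRING,
  category_id INT64,
  publish_time TIMESTAMP,
  tags STRING,
  views INT64,
  likes INT64,
  dislikes INT64,
  comment_count INT64
)
OPTIONS (
  format = 'CSV',
  uris = ['gs://rainbowdash/YouTubeVideos/*'],
  skip_leading_rows = 1
);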

Query the JSON files

Because the data is semi-structured, this use case is a little more difficult.

The table for the ingestion level (MRR) should be named YouTubeStatistics.

  1. Download the attached JSON files.
  2. Create a new folder in your bucket named YouTubeStatistics and put the files there.
  3. Create the table in BigQuery (you can see my configuration in the picture below).
  4. After creating your table, make sure you see it in the table list.
  5. Now query your data, for example:

SELECT
  SUBSTR(_FILE_NAME, LENGTH('gs://rainbowdash/YouTubeStatistics/') + 1, 2 ) country,
  json_.kind kind1,
  json_.etag etag1,
  inr,
  inr.kind,
  inr.etag,
  inr.id,
  inr.snippet.channelid,
  inr.snippet.title,
  inr.snippet.assignable
FROM
  `big-passage-263811.tutorialFrankfurt.YouTubeStatistics` json_,
  UNNEST(items) inr

** In this case, we use the table as “External”.
** Notice that the UNNEST() function flattens the JSON (which is one row of {kind, etag, list of other objects}) into one row per item, repeating the kind and etag on every row.
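
If UNNEST() is new to you, here is a minimal, self-contained sketch (the inline values are made up, so no table is needed) that shows how the parent row's fields repeat for every array element:

-- Returns 3 rows; kind and etag repeat next to each item.
SELECT
  json_.kind,
  json_.etag,
  inr
FROM (
  SELECT
    'youtube#videoCategoryListResponse' AS kind,
    'etag-123' AS etag,
    ['Film & Animation', 'Autos & Vehicles', 'Music'] AS items) json_,
  UNNEST(items) inr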

Creating a new table

  1. The new table we create will be named YouTubeCategories.
  2. We will extract the categories from the JSON files.
  3. Run the following query:
    
    SELECT
      SUBSTR(_FILE_NAME, LENGTH('gs://rainbowdash/YouTubeStatistics/') + 1, 2 ) country,
      inr.id id,
      inr.snippet.title title
    FROM
      `big-passage-263811.tutorialFrankfurt.YouTubeStatistics` json_,
      UNNEST(items) inr
    
  4. After the query runs, click “Save Results”, then “BigQuery”, and then “Save” (or use the DDL alternative shown after this list).
  5. Give this table the name “YouTubeCategories”, and then save it.
  6. Make sure you can now see this table in your table list.
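
If you prefer to skip the UI steps, you can materialize the same result in one statement with CREATE TABLE ... AS SELECT. A minimal sketch, reusing the dataset, bucket, and table names from above:

-- Create the native categories table directly from the query result.
CREATE TABLE `big-passage-263811.tutorialFrankfurt.YouTubeCategories` AS
SELECT
  SUBSTR(_FILE_NAME, LENGTH('gs://rainbowdash/YouTubeStatistics/') + 1, 2) country,
  inr.id id,
  inr.snippet.title title
FROM
  `big-passage-263811.tutorialFrankfurt.YouTubeStatistics` json_,
  UNNEST(items) inr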

Joining the Tables

  1. In this step, we will join the tables YouTubeCategories and YouTubeVideosShorten.
  2. We need to join these tables by category_id and by country.
  3. Here is a SQL query that does it:
    
    SELECT
      V.country,
      V.video_id,
      V.title,
      C.title category,
      V.tags,
      V.views,
      V.likes,
      V.dislikes,
      V.comment_count,
      CAST(PARSE_DATE('%y.%d.%m',
          V.trending_date ) AS TIMESTAMP) trending_date,
      V.publish_time publish_time,
      TIMESTAMP_DIFF(CAST(PARSE_DATE('%y.%d.%m',
            V.trending_date ) AS TIMESTAMP), V.publish_time, HOUR) duration
    FROM (
      SELECT
        SUBSTR(_FILE_NAME, LENGTH('gs://rainbowdash/YouTubeVideos/') + 1, 2 ) country,
        *
      FROM
        `big-passage-263811.tutorialFrankfurt.youtubevideosshorten`) V
    LEFT JOIN
      `big-passage-263811.tutorialFrankfurt.YouTubeCategories` C
    ON
      V.category_id = C.id
      AND V.country = C.country
    
    
  4. Now you can insert your result into another table (YouTubeFact), as sketched below.
    ** Notice that the last two tables are native tables, so the data lives in BigQuery and not in Google Storage.
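
As a sketch of step 4, you could materialize the join result as the fact table with CREATE TABLE ... AS SELECT (the YouTubeFact name comes from the text; the column list here is trimmed – in practice, use the full join query from step 3):

-- Materialize the join as a native fact table (trimmed column list).
CREATE TABLE `big-passage-263811.tutorialFrankfurt.YouTubeFact` AS
SELECT
  V.country,
  V.video_id,
  V.title,
  C.title category,
  V.views
FROM (
  SELECT
    SUBSTR(_FILE_NAME, LENGTH('gs://rainbowdash/YouTubeVideos/') + 1, 2) country,
    *
  FROM
    `big-passage-263811.tutorialFrankfurt.youtubevideosshorten`) V
LEFT JOIN
  `big-passage-263811.tutorialFrankfurt.YouTubeCategories` C
ON
  V.category_id = C.id
  AND V.country = C.country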

Bonus:

  • The process I developed here is not best practice (YouTubeVideosShorten should be transformed before joining it with YouTubeCategories – can you tell why?).
  • Try to develop an STG (staging) table for YouTubeVideosShorten before joining it to the categories (a starting-point sketch follows this list).
  • Note that the fact table schema shouldn't change.
  • Next sessions:
    • Using partitions
    • Saving data in a columnar format
    • Why do we do these two things? Because they save money.
  • More about this in our next article…
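
As a starting point for the STG exercise, here is a hedged sketch (the stg_YouTubeVideos name is my own): do the date parsing and duration calculation once, while loading a native staging table, so the join no longer repeats PARSE_DATE over the raw external files on every run.

-- Hypothetical staging (STG) table: type conversions happen once here,
-- so downstream joins read clean, native data.
CREATE TABLE `big-passage-263811.tutorialFrankfurt.stg_YouTubeVideos` AS
SELECT
  SUBSTR(_FILE_NAME, LENGTH('gs://rainbowdash/YouTubeVideos/') + 1, 2) country,
  video_id,
  title,
  category_id,
  tags,
  views,
  likes,
  dislikes,
  comment_count,
  CAST(PARSE_DATE('%y.%d.%m', trending_date) AS TIMESTAMP) trending_date,
  publish_time,
  TIMESTAMP_DIFF(CAST(PARSE_DATE('%y.%d.%m', trending_date) AS TIMESTAMP),
    publish_time, HOUR) duration
FROM
  `big-passage-263811.tutorialFrankfurt.youtubevideosshorten`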
