The world’s leading publication for data science, AI, and ML professionals.

BigQuery SQL: Evolution of the running total on a dataset with missing dates

Handling missing values, window functions and nested queries with BigQuery SQL

Photo by Todd Diemer on Unsplash
Photo by Todd Diemer on Unsplash

One recent requirement that arrived in our Data and Analytics team was in the form of the question: "What is the number of year-to-date listed articles per shop, and how did this number evolve?"

Finding the total number of listed year-to-date articles was not a complex analytical task. However, the problem started when we wanted to present how this number increased over time.

The reason for this was missing values in the source dataset. In other words, we didn’t have records for each date in the dataset that holds the entries of the newly listed articles per specific shop.

And here is where the challenge started.

First, we needed to figure out how to populate the missing dates for each shop. Following this step, the forward fill of the missing values per date and shop combination needed to be done. Finally, as the last step, the running total had to be calculated.

After spending some time researching and digging into BigQuery SQL tutorials, we were able to find a simple solution.

Now, we will share our implementation methodology with you.

In case you use Bigquery, the solution is several steps or – better say – several nested queries away. 🙂

Problem explanation: from source to target

Let’s present the problem in visual form, i.e. how the source data looks and what the expected outcome is.

For graphical presentation, we used Looker to show a sample of the source records and the target outcome on the time series charts.

Time-series presentation of the records in the source dataset and the expected outcome [Image by Author]
Time-series presentation of the records in the source dataset and the expected outcome [Image by Author]

From the first part of the image above (source dataset), it is visible that we have missing dates and corresponding numerical values per shop level in the selected date range.

Hence, we divided our solution into three steps to achieve the target outcome and compute the running total of the measure new_article_count over date article_online_since_date and per partition shop.

Bottom-up implementation methodology

First, with the following query, we were able to create the dummy input dataset:

The outcome of the query is:

With the input table ListedArticlesPerShop, we can start working on the bottom-up solution to compute a running total over time per shop.

Step #1: Fill in the missing date ranges per each partition (shop)

BigQuery Sql is offering one neat array function GENERATE_DATE_ARAY, where you can specify the following inputs [1]:

  • start_date – must be a DATE
  • end_date – must be a DATE
  • INT64_expr – a parameter which determines the increment used to generate dates; the default value for this parameter is one day
  • date_part – must be DAY, WEEK, MONTH, QUARTER or YEAR.

With the GENERATE_ARRAY function, we were able to create a table with complete date ranges per shop:

The outcome of the query is as follows:

After successfully concluding Step #1, we can now join the newly created query to the input table ListedArticlesPerShop.

Step #2: Join the table with filled date ranges to the input table with missing date ranges

This step is straightforward as the task is to:

  • join the two tables using the LEFT JOIN type, and
  • select the corresponding attributes from each table; ascending_date and shop from table_a, and new_article_count (now aliased under the name number_of_listed_articles) from table_b.

    After successfully finalizing this part of the task, we can now calculate the running total.

Step #3: Compute the running total over time and per each partition (shop)

The running total is calculated using the window function in our top query:

SUM (number_of_listed_articles) OVER (PARTITION BY shop) ORDER BY (ascending_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)

Now, let’s wrap it all together to present the top query:

And the final query outcome is as follows:

With this last step, we successfully concluded our task. 🙂

Summary

In this post, we were able to show how you can fill in missing values and calculate the evolution patterns of the specific metric over time and per specific partition, using only BigQuery SQL.

We also presented how to combine different BigQuery functions: window and array functions, to solve a complex analytical task and deliver the required data insights.

Finally, we hope you will like our tutorial and find usage for it in your use cases too. 🙂


References:

[1] BigQuery SQL documentation, accessed: June 3rd 2022, https://cloud.google.com/bigquery/docs/reference/standard-sql/array_functions#generate_date_array


Related Articles