Back to Blog

Coffee with Steve - dbt

Steve Pisani

Hi, I’m Steve with Flywheel. Over the last 2 years, I’ve used dbt daily to build data models on Redshift, Snowflake, and BigQuery.

This is the start of a series of articles on tips, best practices, and general knowledge about dbt and how to use it to build complex data models in your Cloud Data Warehouse. This series will focus on use in BigQuery but is generally applicable to your data warehouse of choice.

First, what is dbt?

dbt (data build tool) is an open-source command-line tool that uses compilable SQL select statements to allow data analysts to transform the data in their data warehouses more effectively and with the best practices of software development. It the middle layer between your raw data, in whatever form you get it in, and your transformed data in the format you need.

The "T" in ELT

There are all kinds of useful features that ship with dbt including an automatic DAG generator, automatic documentation platform, and many macros. For more on dbt, how to set it up, and what it is exactly, check out the official dbt website, or our upcoming posts.

Let's Start with the Basics

Before diving into dbt and advanced SQL tricks, let’s get our SQL basics down. One important but easily overlooked aspect of SQL is the order in which a statement actually executes.


things can get confusing

Like the PEMDAS monstrosity above, the way you write SQL can cause confusion for readers including yourself. To reduce confusion, keep the following logical processing order in mind.

  1. FROM & JOIN: What tables to get the base data
  2. WHERE: Filters on the base data
  3. GROUP BY: Aggregates the base data
  4. HAVING: Filters on the aggregate data
  5. WINDOW Functions : Row-level aggregation
  6. SELECT: Fetch data by column name and alias
  7. DISTINCT: Deduplicate the fetched data
  8. UNION: Append additional query's results
  9. ORDER BY: Sort data
  10. LIMIT & OFFSET: Discard all rows but X rows starting at Y

So what does that mean?

There are several implications to the execution order and knowing them will make you a better SQL user.

Data reductions are best done at the top of the funnel ie JOIN and WHERE


Nine times out of ten if your query is inefficient, you have an issue in your JOINs or you have not filtered as much as you should in your WHEREs. JOINs, in particular, can be dangerous because, if you are not careful, you can inadvertently have multiple records satisfy your ON clauses that will cause duplicates.

Combat this by choosing the most appropriate JOIN type, vigorously reviewing your ON statements and defining exactly what information you need to include in your WHERE filters.

For example, if you wished to analyze recent customer payment trends, limit your query to be the smallest subset of data you can use to answer your question.


👎 DON'T

select
 c.name
 , o.amount_paid
from customers as c
outer join orders as o
 on c.id = o.customer_id

👍 DO

select
 c.name
 , o.amount_paid
from customers as c
inner join orders as o
 on c.id = o.customer_id
 and c.start_date <= o.order_date
where c.start_date > '2021-01-01'

Aliased columns are not addressable in WHERE clauses

Because aliasing does not occur until step 6, you cannot use the new names in your WHERE statements. This has been the cause of many frustrations and long repeated lines in queries.

Combat this by aliasing in an upstream CTE or dbt model (more on this later).

Aliased columns are addressable in ORDER BY clauses

On the other hand, aliased columns are available in ORDER BY clauses because they are the penultimate operation to run.


A final note.

This order is generally the order in which queries will be run by your database engine however the engine’s optimizer will change this order if it detects an improvement in possible efficiency. CAST functions or WHERE clauses on columns used in joins often will be run first to save on compute resources.

There are many other optimizations that database engines use in production but, I am honestly not an expert in the deeper layers of optimization.

Look out for the next article where we’ll be diving into using dbt. Thanks for reading!

✌️ Steve

Share on social media: 

More from the Blog

Why Build a Customer Audience Platform in your Cloud?

While companies are moving customer data to the Cloud, how do you empower sales & marketing while maintaining customer data privacy?

Read Story

The Monthly Wheeler: March 2021

March feature releases on the Customer Segmentation Platform for BigQuery

Read Story

Customer Data Privacy is changing fast

Get up to date information from Flywheel Engineering on best practices in Google Cloud and Customer Data Privacy
We will never share your email address with third parties.