Quickstart for dbt Cloud and Redshift


    Introduction

    In this quickstart guide, you'll learn how to use dbt Cloud with Redshift. It will show you how to:

    • Set up a Redshift cluster.
    • Load sample data into your Redshift account.
    • Connect dbt Cloud to Redshift.
    • Take a sample query and turn it into a model in your dbt project. A model in dbt is a select statement.
    • Add tests to your models.
    • Document your models.
    • Schedule a job to run.

    Tip: Check out dbt Fundamentals for free if you're interested in learning through video courses.

    Prerequisites

    • You have a dbt Cloud account.
    • You have an AWS account with permissions to execute a CloudFormation template to create appropriate roles and a Redshift cluster.

    Create a Redshift cluster

    1. Sign in to your AWS account as a root user or an IAM user depending on your level of access.
    2. Use a CloudFormation template to quickly set up a Redshift cluster. A CloudFormation template is a configuration file that automatically spins up the necessary resources in AWS. Start a CloudFormation stack; for more template details, refer to the create-dbtworkshop-infr JSON file.
    Tip: To avoid connectivity issues with dbt Cloud, make sure to allow inbound traffic on port 5439 from dbt Cloud's IP addresses in your Redshift security groups and Network Access Control Lists (NACLs) settings.

    3. Click Next on each page until you reach the Select acknowledgement checkbox. Select I acknowledge that AWS CloudFormation might create IAM resources with custom names and click Create Stack. You should land on the stack page with a CREATE_IN_PROGRESS status.

    4. When the stack status changes to CREATE_COMPLETE, click the Outputs tab at the top to view information that you will use throughout the rest of this guide. Keep this page open in a tab so the credentials are available later.

    5. Type Redshift in the search bar at the top and click Amazon Redshift.

    6. Confirm that your new Redshift cluster is listed in Cluster overview. Select your new cluster. The cluster name should begin with dbtredshiftcluster-. Then, click Query Data. You can choose the classic query editor or v2. This guide uses v2.

    7. You might be asked to Configure account. For this sandbox environment, we recommend selecting “Configure account”.

    8. Select your cluster from the list. In the Connect to popup, fill out the credentials from the output of the stack:

      • Authentication: Use the default, which is Database user name and password (NOTE: IAM authentication is not supported in dbt Cloud).
      • Database: dbtworkshop
      • User name: dbtadmin
      • Password: Use the autogenerated RSadminpassword from the output of the stack and save it for later.

    9. Click Create connection.

    Load data

    Now we are going to load our sample data into the S3 bucket that our CloudFormation template created. S3 buckets are a simple and inexpensive way to store data outside of Redshift.

    1. The data used in this course is stored as CSVs in a public S3 bucket. You can use the following URLs to download these files. Download these to your computer to use in the following steps.

    2. Now upload the files to the S3 bucket that you created with CloudFormation. Type S3 in the search bar at the top and click S3. The bucket already contains sample data; feel free to ignore it or use it for other modeling exploration. The bucket name is prefixed with dbt-data-lake.

    3. Click the name of the S3 bucket. If you have multiple S3 buckets, this is the bucket listed under “Workshopbucket” on the Outputs page.

    4. Click Upload. Drag the three files into the UI and click the Upload button.

    5. Remember the name of the S3 bucket for later. It should look like this: s3://dbt-data-lake-xxxx. You will need it for the next section.

    6. Now go back to the Redshift query editor. Search for Redshift in the search bar, choose your cluster, and select Query data.

    7. In your query editor, run the statements below to create the schemas that will hold your raw data. You can highlight a statement and then click Run to run it individually. If you are on the Classic Query Editor, you might need to enter the statements separately in the UI. You should see these schemas listed under dbtworkshop.

      create schema if not exists jaffle_shop;
      create schema if not exists stripe;

    8. Now create the tables in your schemas using the statements below. These will appear as tables in the respective schemas.

      create table jaffle_shop.customers(
          id integer,
          first_name varchar(50),
          last_name varchar(50)
      );

      create table jaffle_shop.orders(
          id integer,
          user_id integer,
          order_date date,
          status varchar(50),
          _etl_loaded_at timestamp default current_timestamp
      );

      create table stripe.payment(
          id integer,
          orderid integer,
          paymentmethod varchar(50),
          status varchar(50),
          amount integer,
          created date,
          _batched_at timestamp default current_timestamp
      );

    9. Now copy the data from S3. This enables you to run queries in this guide for demonstrative purposes; it's not an example of how you would do this for a real project. Make sure to update the S3 location, IAM role, and region. You can find the S3 location and IAM role in your outputs from the CloudFormation stack. Find the stack by searching for CloudFormation in the search bar, then clicking Stacks in the CloudFormation tile.

      copy jaffle_shop.customers( id, first_name, last_name)
      from 's3://dbt-data-lake-xxxx/jaffle_shop_customers.csv'
      iam_role 'arn:aws:iam::XXXXXXXXXX:role/RoleName'
      region 'us-east-1'
      delimiter ','
      ignoreheader 1
      acceptinvchars;

      copy jaffle_shop.orders(id, user_id, order_date, status)
      from 's3://dbt-data-lake-xxxx/jaffle_shop_orders.csv'
      iam_role 'arn:aws:iam::XXXXXXXXXX:role/RoleName'
      region 'us-east-1'
      delimiter ','
      ignoreheader 1
      acceptinvchars;

      copy stripe.payment(id, orderid, paymentmethod, status, amount, created)
      from 's3://dbt-data-lake-xxxx/stripe_payments.csv'
      iam_role 'arn:aws:iam::XXXXXXXXXX:role/RoleName'
      region 'us-east-1'
      delimiter ','
      ignoreheader 1
      acceptinvchars;

      Ensure that you can run a select * from each of the tables with the following code snippets.

      select * from jaffle_shop.customers;
      select * from jaffle_shop.orders;
      select * from stripe.payment;
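
As an optional sanity check on the load, a sketch like the following counts the rows in each table and, if a COPY failed or a table looks empty, lists the most recent entries in Redshift's stl_load_errors system table. It assumes the provisioned cluster created earlier in this guide; the limit of 10 is an arbitrary choice.

```sql
-- Row counts per loaded table; each should be greater than zero.
select 'jaffle_shop.customers' as table_name, count(*) as row_count from jaffle_shop.customers
union all
select 'jaffle_shop.orders', count(*) from jaffle_shop.orders
union all
select 'stripe.payment', count(*) from stripe.payment;

-- Most recent COPY errors visible to your user (no rows if no load has failed).
select starttime, filename, line_number, colname, err_reason
from stl_load_errors
order by starttime desc
limit 10;
```

The err_reason column usually points at the cause, such as a wrong delimiter or a value that does not fit the column type.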

    Connect dbt Cloud to Redshift

    1. Create a new project in dbt Cloud. From Account settings (using the gear menu in the top right corner), click + New Project.

    2. Enter a project name and click Continue.

    3. For the warehouse, click Redshift then Next to set up your connection.

    4. Enter your Redshift settings. Refer to the credentials you saved from the CloudFormation template.

      • Hostname: Your entire hostname.
      • Port: 5439
      • Database: dbtworkshop

      Tip: To avoid connectivity issues with dbt Cloud, make sure to allow inbound traffic on port 5439 from dbt Cloud's IP addresses in your Redshift security groups and Network Access Control Lists (NACLs) settings.

    5. Set your development credentials. These credentials will be used by dbt Cloud to connect to Redshift. Those credentials (as provided in your CloudFormation output) will be:

      • Username: dbtadmin
      • Password: This is the autogenerated password that you used earlier in the guide.
      • Schema: dbt Cloud automatically generates a schema name for you. By convention, this is dbt_<first-initial><last-name>. This is the schema connected directly to your development environment, and it's where your models will be built when running dbt within the Cloud IDE.
    6. Click Test Connection. This verifies that dbt Cloud can access your Redshift cluster.

    7. Click Next if the test succeeded. If it failed, you might need to check your Redshift settings and credentials.

    Set up a dbt Cloud managed repository

    When you develop in dbt Cloud, you can leverage Git to version control your code.

    To connect to a repository, you can either set up a dbt Cloud-hosted managed repository or directly connect to a supported git provider. Managed repositories are a great way to trial dbt without needing to create a new repository. In the long run, it's better to connect to a supported git provider to use features like automation and continuous integration.

    To set up a managed repository:

    1. Under "Setup a repository", select Managed.
    2. Type a name for your repo, such as bbaggins-dbt-quickstart.
    3. Click Create. It will take a few seconds for your repository to be created and imported.
    4. Once you see the "Successfully imported repository" message, click Continue.

    Initialize your dbt project and start developing

    Now that you have a repository configured, you can initialize your project and start development in dbt Cloud:

    1. Click Start developing in the IDE. It might take a few minutes for your project to spin up for the first time as it establishes your git connection, clones your repo, and tests the connection to the warehouse.
    2. Above the file tree to the left, click Initialize dbt project. This builds out your folder structure with example models.
    3. Make your initial commit by clicking Commit and sync. Use the commit message initial commit and click Commit. This creates the first commit to your managed repo and allows you to open a branch where you can add new dbt code.
    4. You can now directly query data from your warehouse and execute dbt run. You can try this out now:
      • Click + Create new file, add this query to the new file, and click Save as to save the new file:
        select * from jaffle_shop.customers
      • In the command line bar at the bottom, enter dbt run and press Enter. You should see a dbt run succeeded message.

    Build your first model

    You have two options for working with files in the dbt Cloud IDE:

    • Create a new branch (recommended): Create a new branch to edit and commit your changes. Navigate to Version Control on the left sidebar and click Create branch.
    • Edit in the protected primary branch: Use this option if you prefer to edit, format, or lint files and execute dbt commands directly in your primary git branch. The dbt Cloud IDE prevents commits to the protected branch, so you will be prompted to commit your changes to a new branch.

    Name the new branch add-customers-model.

    1. Click the ... next to the models directory, then select Create file.
    2. Name the file customers.sql, then click Create.
    3. Copy the following query into the file and click Save.
    with customers as (

        select
            id as customer_id,
            first_name,
            last_name

        from jaffle_shop.customers

    ),

    orders as (

        select
            id as order_id,
            user_id as customer_id,
            order_date,
            status

        from jaffle_shop.orders

    ),

    customer_orders as (

        select
            customer_id,

            min(order_date) as first_order_date,
            max(order_date) as most_recent_order_date,
            count(order_id) as number_of_orders

        from orders

        group by 1

    ),

    final as (

        select
            customers.customer_id,
            customers.first_name,
            customers.last_name,
            customer_orders.first_order_date,
            customer_orders.most_recent_order_date,
            coalesce(customer_orders.number_of_orders, 0) as number_of_orders

        from customers

        left join customer_orders using (customer_id)

    )

    select * from final

    4. Enter dbt run in the command prompt at the bottom of the screen. You should get a successful run and see the three models.

    Later, you can connect your business intelligence (BI) tools to these views and tables so that they read cleaned-up data rather than raw data.
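
For example, a downstream tool might query the model along these lines. This is only a sketch: dbt_bbaggins is a hypothetical development schema name, so substitute the schema shown in your development credentials.

```sql
-- dbt_bbaggins is a placeholder for your own development schema.
select
    customer_id,
    first_name,
    last_name,
    number_of_orders
from dbt_bbaggins.customers
order by number_of_orders desc
limit 10;
```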

    FAQs

    How can I see the SQL that dbt is running?
    How did dbt choose which schema to build my models in?
    Do I need to create my target schema before running dbt?
    If I rerun dbt, will there be any downtime as models are rebuilt?
    What happens if the SQL in my query is bad or I get a database error?

    Change the way your model is materialized

    One of the most powerful features of dbt is that you can change the way a model is materialized in your warehouse simply by changing a configuration value. You can switch a model between a table and a view by changing a keyword rather than writing the data definition language (DDL) yourself; dbt handles the DDL behind the scenes.

    By default, everything gets created as a view. You can override that at the directory level so that everything in that directory uses a different materialization.
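
To make the difference concrete, here is a rough sketch of the kind of DDL that dbt writes on your behalf for each materialization. The schema name dbt_bbaggins is hypothetical and the statements are simplified; the exact SQL that dbt generates differs in detail. The two statements are alternatives for the same relation name, so run one or the other, not both.

```sql
-- materialized: view (simplified; dbt_bbaggins is a placeholder schema)
create view dbt_bbaggins.customers as (
    select id as customer_id, first_name, last_name
    from jaffle_shop.customers
);

-- materialized: table (same select statement, different wrapper)
create table dbt_bbaggins.customers as (
    select id as customer_id, first_name, last_name
    from jaffle_shop.customers
);
```

In both cases the model file contains only the select statement; the configuration value decides which wrapper is used.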

    1. Edit your dbt_project.yml file.

      • Update your project name to:

        dbt_project.yml
        name: 'jaffle_shop'
      • Configure jaffle_shop so everything in it will be materialized as a table; and configure example so everything in it will be materialized as a view. Update your models config block to:

        dbt_project.yml
        models:
          jaffle_shop:
            +materialized: table
            example:
              +materialized: view
      • Click Save.

    2. Enter the dbt run command. Your customers model should now be built as a table!

      Info: To do this, dbt had to first run a drop view statement (or API call on BigQuery), then a create table as statement.

    3. Edit models/customers.sql to override the dbt_project.yml for the customers model only by adding the following snippet to the top, and click Save:

      models/customers.sql
      {{
        config(
          materialized='view'
        )
      }}

      with customers as (

          select
              id as customer_id
              ...

      )

    4. Enter the dbt run command. Your model, customers, should now build as a view.

      • BigQuery users need to run dbt run --full-refresh instead of dbt run to fully apply materialization changes.
    5. Enter the dbt run --full-refresh command for this to take effect in your warehouse.
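
One way to confirm how each model ended up materialized is to look it up in the catalog. The query below is a sketch that assumes a hypothetical development schema named dbt_bbaggins; replace it with your own.

```sql
-- table_type is expected to be 'VIEW' for views and 'BASE TABLE' for tables.
select table_name, table_type
from information_schema.tables
where table_schema = 'dbt_bbaggins'  -- placeholder schema name
order by table_name;
```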

    FAQs

    What materializations are available in dbt?
    Which materialization should I use for my model?
    What model configurations exist?

    Delete the example models

    You can now delete the files that dbt created when you initialized the project:

    1. Delete the models/example/ directory.

    2. Delete the example: key from your dbt_project.yml file, and any configurations that are listed under it.

      dbt_project.yml
      # before
      models:
        jaffle_shop:
          +materialized: table
          example:
            +materialized: view

      dbt_project.yml
      # after
      models:
        jaffle_shop:
          +materialized: table
    3. Save your changes.

    FAQs

    How do I remove deleted models from my data warehouse?
    I got an "unused model configurations" error message, what does this mean?

    Build models on top of other models

    As a best practice in SQL, you should separate logic that cleans up your data from logic that transforms your data. You have already started doing this in the existing query by using common table expressions (CTEs).

    Now you can experiment by separating the logic out into separate models and using the ref function to build models on top of other models:

    Figure: The DAG we want for our dbt project
    1. Create a new SQL file, models/stg_customers.sql, with the SQL from the customers CTE in our original query.

    2. Create a second new SQL file, models/stg_orders.sql, with the SQL from the orders CTE in our original query.

      models/stg_customers.sql
      select
          id as customer_id,
          first_name,
          last_name

      from jaffle_shop.customers

      models/stg_orders.sql
      select
          id as order_id,
          user_id as customer_id,
          order_date,
          status

      from jaffle_shop.orders
    3. Edit the SQL in your models/customers.sql file as follows:

      models/customers.sql
      with customers as (

          select * from {{ ref('stg_customers') }}

      ),

      orders as (

          select * from {{ ref('stg_orders') }}

      ),

      customer_orders as (

          select
              customer_id,

              min(order_date) as first_order_date,
              max(order_date) as most_recent_order_date,
              count(order_id) as number_of_orders

          from orders

          group by 1

      ),

      final as (

          select
              customers.customer_id,
              customers.first_name,
              customers.last_name,
              customer_orders.first_order_date,
              customer_orders.most_recent_order_date,
              coalesce(customer_orders.number_of_orders, 0) as number_of_orders

          from customers

          left join customer_orders using (customer_id)

      )

      select * from final

    4. Execute dbt run.

      This time, when you ran dbt run, separate views/tables were created for stg_customers, stg_orders, and customers. dbt inferred the order in which to run these models. Because customers depends on stg_customers and stg_orders, dbt builds customers last. You do not need to explicitly define these dependencies.
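
As an illustration of what ref does, dbt replaces each ref call with the fully qualified name of the referenced relation when it compiles the model. Assuming the dbtworkshop database and a hypothetical development schema dbt_bbaggins, the first two CTEs of customers.sql would compile to roughly the following (exact quoting may differ):

```sql
with customers as (

    -- compiled from {{ ref('stg_customers') }}; dbt_bbaggins is a placeholder schema
    select * from "dbtworkshop"."dbt_bbaggins"."stg_customers"

),

orders as (

    -- compiled from {{ ref('stg_orders') }}
    select * from "dbtworkshop"."dbt_bbaggins"."stg_orders"

)

-- placeholder final select so that the sketch runs on its own
select * from customers limit 5;
```

Because the references are resolved at compile time, the same model can build in a different schema without any change to its SQL.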

    FAQs

    How do I run one model at a time?
    Do ref-able resource names need to be unique?
    As I create more models, how should I keep my project organized? What should I name my models?

    Add tests to your models

    Adding tests to a project helps validate that your models are working correctly.

    To add tests to your project:

    1. Create a new YAML file in the models directory, named models/schema.yml.

    2. Add the following contents to the file:

      models/schema.yml
      version: 2

      models:
        - name: customers
          columns:
            - name: customer_id
              tests:
                - unique
                - not_null

        - name: stg_customers
          columns:
            - name: customer_id
              tests:
                - unique
                - not_null

        - name: stg_orders
          columns:
            - name: order_id
              tests:
                - unique
                - not_null
            - name: status
              tests:
                - accepted_values:
                    values: ['placed', 'shipped', 'completed', 'return_pending', 'returned']
            - name: customer_id
              tests:
                - not_null
                - relationships:
                    to: ref('stg_customers')
                    field: customer_id

    3. Run dbt test, and confirm that all your tests passed.

    When you run dbt test, dbt iterates through your YAML files, and constructs a query for each test. Each query will return the number of records that fail the test. If this number is 0, then the test is successful.
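
To give a feel for those queries, here are hand-written approximations of the unique, not_null, accepted_values, and relationships tests on stg_orders. They are sketches, not dbt's exact compiled SQL, and dbt_bbaggins is a hypothetical development schema. Each one returns a count of failing records, so 0 means the test would pass.

```sql
-- unique: order_id values that appear more than once
select count(*) as failures
from (
    select order_id
    from dbt_bbaggins.stg_orders
    where order_id is not null
    group by order_id
    having count(*) > 1
) as duplicates;

-- not_null: rows where order_id is missing
select count(*) as failures
from dbt_bbaggins.stg_orders
where order_id is null;

-- accepted_values: rows whose status is outside the allowed list
select count(*) as failures
from dbt_bbaggins.stg_orders
where status not in ('placed', 'shipped', 'completed', 'return_pending', 'returned');

-- relationships: orders whose customer_id has no match in stg_customers
select count(*) as failures
from dbt_bbaggins.stg_orders as o
left join dbt_bbaggins.stg_customers as c
    on o.customer_id = c.customer_id
where o.customer_id is not null
  and c.customer_id is null;
```

Running a failing check without the count(*) wrapper shows the offending rows, which is often the quickest way to debug a test.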

    FAQs

    What tests are available for me to use in dbt? Can I add my own custom tests?
    How do I test one model at a time?
    One of my tests failed, how can I debug it?
    Does my test file need to be named `schema.yml`?
    Why do model and source yml files always start with `version: 2`?
    What tests should I add to my project?
    When should I run my tests?

    Document your models

    Adding documentation to your project allows you to describe your models in rich detail, and share that information with your team. Here, we're going to add some basic documentation to our project.

    1. Update your models/schema.yml file to include some descriptions, such as those below.

      models/schema.yml
      version: 2

      models:
        - name: customers
          description: One record per customer
          columns:
            - name: customer_id
              description: Primary key
              tests:
                - unique
                - not_null
            - name: first_order_date
              description: NULL when a customer has not yet placed an order.

        - name: stg_customers
          description: This model cleans up customer data
          columns:
            - name: customer_id
              description: Primary key
              tests:
                - unique
                - not_null

        - name: stg_orders
          description: This model cleans up order data
          columns:
            - name: order_id
              description: Primary key
              tests:
                - unique
                - not_null
            - name: status
              tests:
                - accepted_values:
                    values: ['placed', 'shipped', 'completed', 'return_pending', 'returned']
            - name: customer_id
              tests:
                - not_null
                - relationships:
                    to: ref('stg_customers')
                    field: customer_id

    2. Run dbt docs generate to generate the documentation for your project. dbt introspects your project and your warehouse to generate a JSON file with rich documentation about your project.

    3. Click the book icon in the Develop interface to launch documentation in a new tab.

    FAQs

    How do I write long-form explanations in my descriptions?
    How do I access documentation in dbt Explorer?

    Commit your changes

    Now that you've built your customers model, you need to commit the changes you made to the project so that the repository has your latest code.

    If you edited directly in the protected primary branch:

    1. Click the Commit and sync git button. This action prepares your changes for commit.
    2. A modal titled Commit to a new branch will appear.
    3. In the modal window, name your new branch add-customers-model. This branches off from your primary branch with your new changes.
    4. Add a commit message, such as "Add customers model, tests, docs" and commit your changes.
    5. Click Merge this branch to main to add these changes to the main branch on your repo.

    If you created a new branch before editing:

    1. Since you already branched out of the primary protected branch, go to Version Control on the left.
    2. Click Commit and sync to add a message.
    3. Add a commit message, such as "Add customers model, tests, docs."
    4. Click Merge this branch to main to add these changes to the main branch on your repo.

    Deploy dbt

    Use dbt Cloud's Scheduler to deploy your production jobs confidently and build observability into your processes. You'll learn to create a deployment environment and run a job in the following steps.

    Create a deployment environment

    1. In the upper left, select Deploy, then click Environments.
    2. Click Create Environment.
    3. In the Name field, write the name of your deployment environment. For example, "Production."
    4. In the dbt Version field, select the latest version from the dropdown.
    5. Under Deployment connection, enter the name of the dataset you want to use as the target, such as "Analytics". This will allow dbt to build and work with that dataset. For some data warehouses, the target dataset may be referred to as a "schema".
    6. Click Save.

    Create and run a job

    A job is a set of dbt commands that you want to run on a schedule, for example, dbt build.

    As the jaffle_shop business gains more customers, and those customers create more orders, you will see more records added to your source data. Because you materialized the customers model as a table, you'll need to periodically rebuild your table to ensure that the data stays up-to-date. This update will happen when you run a job.

    1. After creating your deployment environment, you should be directed to the page for a new environment. If not, select Deploy in the upper left, then click Jobs.
    2. Click Create one and provide a name, for example, "Production run", and link to the Environment you just created.
    3. Scroll down to the Execution Settings section.
    4. Under Commands, add this command as part of your job if you don't see it:
      • dbt build
    5. Select the Generate docs on run checkbox to automatically generate updated project docs each time your job runs.
    6. For this exercise, do not set a schedule for your project to run — while your organization's project should run regularly, there's no need to run this example project on a schedule. Scheduling a job is sometimes referred to as deploying a project.
    7. Select Save, then click Run now to run your job.
    8. Click the run and watch its progress under "Run history."
    9. Once the run is complete, click View Documentation to see the docs for your project.
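
Once the job has finished, a comparison like the following can show whether the production customers model is in step with the source data. It is a sketch that assumes the deployment schema is named analytics (the example target name used earlier, in lowercase); adjust it to whatever you entered.

```sql
-- analytics is a placeholder for your deployment schema.
select
    (select max(order_date) from jaffle_shop.orders) as latest_source_order_date,
    (select max(most_recent_order_date) from analytics.customers) as latest_modeled_order_date;
```

If the first date is later than the second, new orders have arrived since the last run and the job needs to run again to bring the table up to date.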

    Congratulations 🎉! You've just deployed your first dbt project!

    FAQs

    What happens if one of my runs fails?