

Microsoft Visual Studio Code (VS Code) with dbt extensions installed.dbt CLI (dbt Core) and dbt Amazon Redshift adapter installed locally.AWS IAM Role with permissions to Amazon Redshift, Amazon S3, and AWS Glue.Amazon Redshift or Amazon Redshift Serverless cluster.

Prerequisites to follow along with this post’s demonstration include: We will use dbt along with the dbt package, dbt_external_tables, to create the external tables in an AWS Glue data catalog. The external tables exist in an external data catalog, which can be AWS Glue, the data catalog that comes with Amazon Athena, or an Apache Hive metastore.ĭbt can interact with Amazon Redshift Spectrum to create external tables, refresh external table partitions, and access raw data in an Amazon S3-based data lake from the data warehouse. Amazon Redshift SpectrumĪccording to AWS, “ Redshift Spectrum allows you to efficiently query and retrieve structured and semi-structured data from files in Amazon S3 without having to load the data into Amazon Redshift tables.” Redshift Spectrum tables define the data structure for the files in Amazon S3. dbt handles turning these select statements into tables and views.” Further, “ dbt does the T in ELT (Extract, Load, Transform) processes - it doesn’t extract or load data, but it’s extremely good at transforming data that’s already loaded into your warehouse.” This post’s project, displayed in dbt Cloud Amazon RedshiftĪccording to AWS, “ Amazon Redshift uses SQL to analyze structured and semi-structured data across data warehouses, operational databases, and data lakes using AWS-designed hardware and machine learning to deliver the best price-performance at any scale.” AWS claims Amazon Redshift is the most widely used cloud data warehouse. According to dbt Labs, “ dbt enables analytics engineers to transform data in their warehouses by simply writing select statements. In the following post, we will explore the use of dbt (data build tool), developed by dbt Labs, to transform data in an AWS-based data lakehouse, built with Amazon Redshift, Redshift Spectrum, AWS Glue, and Amazon S3. Data lakehouses implement data warehouses’ data structures and management features for data lakes, which are typically more cost-effective for data storage.” dbt


Lakehouses are enabled by a new system design: implementing similar data structures and data management features to those in a data warehouse directly on top of low-cost cloud storage in open formats.” Similarly, Snowflake describes a lakehouse as “ a data solution concept that combines elements of the data warehouse with those of the data lake. According to Databricks, “ a lakehouse is a new, open architecture that combines the best elements of data lakes and data warehouses. Several factors have fostered the renewed interest and appeal of data warehouses, including the data lakehouse architecture. Nonetheless, data warehouses, specifically modern cloud data warehouses, continue to gain market share, led by Snowflake, Amazon Redshift, Google Cloud BigQuery, and Microsoft’s Azure Synapse Analytics. Learn how dbt makes it easy to transform data and materialize models in a modern cloud data lakehouse built on AWS Introductionĭata lakes have grabbed much of the analytics community’s attention in recent years, thanks to an overabundance of VC-backed analytics startups and marketing dollars.
