
Replicate changes from databases to Apache Iceberg tables with Amazon Data Firehose (in preview)


Today we’re announcing the availability, in preview, of a new capability of Amazon Data Firehose that captures changes made to databases such as PostgreSQL and MySQL and replicates the updates to Apache Iceberg tables on Amazon Simple Storage Service (Amazon S3).

Apache Iceberg is a high-performance open source table format for performing big data analytics. Apache Iceberg brings the reliability and simplicity of SQL tables to S3 data lakes and makes it possible for open source analytics engines such as Apache Spark, Apache Flink, Trino, Apache Hive, and Apache Impala to work with the same data at the same time.

This new capability provides a simple, end-to-end solution for streaming database updates without impacting the transaction performance of your database applications. You can set up a Data Firehose stream in minutes to deliver change data capture (CDC) updates from your database. Now you can easily replicate data from different databases into Iceberg tables on Amazon S3 and use up-to-date data for large-scale analytics and machine learning (ML) applications.

Enterprise customers of Amazon Web Services (AWS) typically use hundreds of databases for their transactional applications. To perform large-scale analytics and ML on the most recent data, they want to capture changes made in their databases, such as when records in a table are inserted, modified, or deleted, and deliver the updates to their data warehouse or Amazon S3 data lake in open source table formats such as Apache Iceberg.

To do so, many customers develop extract, transform, and load (ETL) jobs to periodically read from their databases. However, ETL readers impact database transaction performance, and batch jobs can add a delay of several hours before data is available for analytics. To reduce the impact on database transaction performance, customers want the ability to stream changes made in the database. Such a stream is referred to as a change data capture (CDC) stream.

I have met several customers that use open source distributed systems, such as Debezium, with connectors to popular databases, an Apache Kafka Connect cluster, and Kafka Connect Sink connectors to read the events and deliver them to the destination. The initial configuration and test of such systems involves installing and configuring multiple open source components; it can take days or weeks. After setup, engineers have to monitor and manage the clusters, and validate and apply open source updates, which adds to the operational overhead.

With this new data streaming capability, Amazon Data Firehose can capture CDC streams from databases and continuously replicate them into Apache Iceberg tables on Amazon S3. You set up a Data Firehose stream by specifying the source and destination. Data Firehose captures and continually replicates an initial data snapshot and then all subsequent changes made to the selected database tables as a data stream. To capture CDC streams, Data Firehose uses the database replication protocol, reducing the impact on database transaction performance. As the volume of database updates increases or decreases, Data Firehose automatically partitions the data and persists records until they are delivered to the destination. You don’t have to provision capacity or manage and fine-tune clusters. In addition to the data itself, as part of the initial Data Firehose stream creation, Data Firehose can automatically create Apache Iceberg tables using the same schema as the database tables and automatically evolve the target schema, such as adding new columns, based on source schema changes.

Because Data Firehose is a fully managed service, you don’t have to rely on open source components, apply software updates, or incur operational overhead.

Continuously replicating database changes to Apache Iceberg tables in Amazon S3 with Amazon Data Firehose gives you a simple, scalable, end-to-end managed solution for delivering CDC streams into your data lake or data warehouse, where you can run large-scale analytics and ML applications.

Let’s look at how to configure a new pipeline
To show you how to create a new CDC pipeline, I set up a Data Firehose stream using the AWS Management Console. As usual, I also have the choice of using the AWS Command Line Interface (AWS CLI), AWS SDKs, AWS CloudFormation, or Terraform.

For this demo, I choose a MySQL database on Amazon Relational Database Service (Amazon RDS) as the source. Data Firehose also works with self-managed databases on Amazon Elastic Compute Cloud (Amazon EC2). To establish connectivity between my virtual private cloud (VPC), where the database is deployed, and the RDS API without exposing the traffic to the internet, I create an AWS PrivateLink VPC service endpoint. You can learn how to create a VPC service endpoint for the RDS API by following the instructions in the Amazon RDS documentation.
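If you prefer to script this networking step, here is a minimal boto3 sketch of creating the interface endpoint for the RDS API. The VPC, subnet, and security group IDs are placeholders I made up for the illustration.

```python
# Minimal sketch: create an AWS PrivateLink interface endpoint for the RDS API
# in the VPC where the database runs. All IDs below are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="vpc-0123456789abcdef0",              # VPC hosting the database
    ServiceName="com.amazonaws.us-east-1.rds",  # RDS API endpoint service
    SubnetIds=["subnet-0123456789abcdef0"],
    SecurityGroupIds=["sg-0123456789abcdef0"],
    PrivateDnsEnabled=True,                     # resolve the RDS API name privately
)
print(response["VpcEndpoint"]["VpcEndpointId"])
```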

I also have an S3 bucket to host the Iceberg tables, and I have an AWS Identity and Access Management (IAM) role set up with the correct permissions. You can refer to the list of prerequisites in the Data Firehose documentation.
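For illustration, here is a hedged boto3 sketch of attaching an inline policy to that IAM role. The actions shown are only a representative subset (S3 data access plus AWS Glue Data Catalog access); the authoritative list is in the prerequisites mentioned above, and the bucket and role names are placeholders.

```python
# Illustrative subset of permissions for the Firehose role. Verify the full
# action list against the documented prerequisites before using it.
import json
import boto3

iam = boto3.client("iam")

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:AbortMultipartUpload",
                "s3:GetBucketLocation",
                "s3:GetObject",
                "s3:ListBucket",
                "s3:PutObject",
                "s3:DeleteObject",
            ],
            "Resource": [
                "arn:aws:s3:::my-iceberg-bucket",      # placeholder bucket
                "arn:aws:s3:::my-iceberg-bucket/*",
            ],
        },
        {
            "Effect": "Allow",
            "Action": [
                "glue:GetDatabase",
                "glue:GetTable",
                "glue:CreateTable",
                "glue:UpdateTable",
            ],
            "Resource": "*",
        },
    ],
}

iam.put_role_policy(
    RoleName="FirehoseIcebergRole",                # placeholder role name
    PolicyName="firehose-iceberg-access",
    PolicyDocument=json.dumps(policy),
)
```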

To get started, I open the console and navigate to the Amazon Data Firehose section. I can see the streams already created. To create a new one, I select Create Firehose stream.

Create a Firehose stream

I choose a Source and a Destination: in this example, a MySQL database and Apache Iceberg Tables. I also enter a Firehose stream name for my stream.

Create Firehose Stream – Screen 1

I enter the fully qualified DNS name of my Database endpoint and the Database VPC endpoint service name. I verify that Enable SSL is checked and, under Secret name, I select the name of the secret in AWS Secrets Manager where the database username and password are securely stored.

Create Firehose Stream – Screen 2

Next, I configure Data Firehose to capture specific data by specifying databases, tables, and columns using explicit names or regular expressions.

I need to create a watermark table. A watermark, in this context, is a marker used by Data Firehose to track the progress of incremental snapshots of database tables. It helps Data Firehose identify which parts of the table have already been captured and which parts still need to be processed. I can create the watermark table manually or have Data Firehose create it automatically. In the latter case, the database credentials passed to Data Firehose must have permissions to create a table in the source database.

Create Firehose Stream – Screen 3

Next, I configure the Region and name of the S3 bucket to use. Data Firehose can automatically create the Iceberg tables when they don’t exist yet. Similarly, it can update the Iceberg table schema when a change is detected in your database schema.

Create Firehose Stream – Screen 4

The last step is to enable Amazon CloudWatch error logging to get feedback about the stream’s progress and any errors. You can configure a short retention period on the CloudWatch log group to reduce the cost of log storage.
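For example, here is a minimal boto3 sketch of setting a seven-day retention period on the error log group; the log group name follows the usual /aws/kinesisfirehose/ pattern, which is an assumption for this illustration.

```python
# Minimal sketch: set a short retention on the Firehose error log group to
# reduce log storage costs. The log group name is an assumption.
import boto3

logs = boto3.client("logs", region_name="us-east-1")

logs.put_retention_policy(
    logGroupName="/aws/kinesisfirehose/my-cdc-stream",  # assumed log group name
    retentionInDays=7,
)
```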

After reviewing my configuration, I select Create Firehose stream.

Create Firehose Stream – Screen 5
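For reference, the same configuration can be expressed with an AWS SDK instead of the console, as mentioned at the beginning of this walkthrough. The following boto3 sketch shows roughly what that call could look like; the nested database source and Iceberg destination structures are my assumptions about the preview API, and all names and ARNs are placeholders, so check the Firehose API reference before relying on it.

```python
# Rough sketch of creating the same stream with boto3 instead of the console.
# The shape of DatabaseSourceConfiguration and IcebergDestinationConfiguration
# is an assumption about the preview API; all names and ARNs are placeholders.
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

firehose.create_delivery_stream(
    DeliveryStreamName="my-cdc-stream",
    DeliveryStreamType="DatabaseAsSource",
    DatabaseSourceConfiguration={
        "Type": "MySQL",
        "Endpoint": "my-db.cluster-xxxxxxxx.us-east-1.rds.amazonaws.com",
        "Port": 3306,
        "SSLMode": "Enabled",
        "Databases": {"Include": ["mydb"]},
        "Tables": {"Include": ["mydb.customers"]},
        "SnapshotWatermarkTable": "mydb.watermark",
        "DatabaseSourceAuthenticationConfiguration": {
            "SecretsManagerConfiguration": {
                "SecretARN": "arn:aws:secretsmanager:us-east-1:111122223333:secret:mydbsecret",
                "Enabled": True,
            }
        },
        "DatabaseSourceVPCConfiguration": {
            "VpcEndpointServiceName": "com.amazonaws.vpce.us-east-1.vpce-svc-0123456789abcdef0"
        },
    },
    IcebergDestinationConfiguration={
        "RoleARN": "arn:aws:iam::111122223333:role/FirehoseIcebergRole",
        "CatalogConfiguration": {
            "CatalogARN": "arn:aws:glue:us-east-1:111122223333:catalog"
        },
        "S3Configuration": {
            "RoleARN": "arn:aws:iam::111122223333:role/FirehoseIcebergRole",
            "BucketARN": "arn:aws:s3:::my-iceberg-bucket",
        },
    },
)
```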

Once the stream is created, it begins replicating the data. I can monitor the status of the stream and check for any errors.

Create Firehose Stream – Screen 6
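If you want to check the status from code rather than the console, a quick boto3 sketch (with a placeholder stream name) could look like this:

```python
# Check the stream status programmatically with the DescribeDeliveryStream API.
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

description = firehose.describe_delivery_stream(
    DeliveryStreamName="my-cdc-stream"  # placeholder stream name
)
print(description["DeliveryStreamDescription"]["DeliveryStreamStatus"])
```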

Now it’s time to test the stream.

I open a connection to the database and insert a new row into a table.

Firehose – MySQL
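If you prefer to script the test, here is a sketch of the same insert in Python using the third-party pymysql driver (pip install pymysql). The database and table names, mydb and customers, are assumptions for this demo.

```python
# Sketch of the test insert. Credentials come from environment variables, and
# the mydb/customers names are assumptions made for this demo.
import os
import pymysql

connection = pymysql.connect(
    host=os.environ["DB_HOST"],
    user=os.environ["DB_USER"],
    password=os.environ["DB_PASSWORD"],
    database="mydb",
)
try:
    with connection.cursor() as cursor:
        # Insert a row; Data Firehose replicates the change to the Iceberg table.
        cursor.execute(
            "INSERT INTO customers (name, email) VALUES (%s, %s)",
            ("Jane Doe", "jane@example.com"),
        )
    connection.commit()
finally:
    connection.close()
```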

I then navigate to the S3 bucket configured as the target and observe that a file has been created to store the data from the table.

View parquet files in the S3 bucket

I download the file and inspect its content with the parq command (you can install this command with pip install parquet-cli).

Contents of the Parquet file
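If you’d rather stay in Python, the same check can be done with pyarrow (pip install pyarrow), assuming the downloaded file is named data.parquet:

```python
# Inspect the downloaded Parquet file with pyarrow instead of the parq command.
import pyarrow.parquet as pq

table = pq.read_table("data.parquet")   # file downloaded from the S3 bucket
print(table.schema)                     # columns and types replicated from MySQL
print(table.to_pylist())                # rows, including the one inserted earlier
```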

Of course, downloading and inspecting Parquet files is something I only do for demos. In real life, you will use AWS Glue and Amazon Athena to manage your data catalog and to run SQL queries on your data.
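For example, once the Iceberg table is registered in the AWS Glue Data Catalog, a query can be started through Amazon Athena with boto3; the database, table, and query results location below are placeholders.

```python
# Sketch: query the replicated Iceberg table with Amazon Athena. The database,
# table, and query results location are placeholders.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

response = athena.start_query_execution(
    QueryString="SELECT * FROM customers LIMIT 10",
    QueryExecutionContext={"Database": "mydb"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-query-results/"},
)
print(response["QueryExecutionId"])
```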

Things to know
Here are a few additional things you should know.

This new feature supports self-managed PostgreSQL and MySQL databases on Amazon EC2 and the following databases on Amazon RDS:

The team will continue to add support for additional databases during the preview period and after general availability. They told me that they are already working on supporting SQL Server, Oracle and MongoDB databases.

Data Firehose uses AWS PrivateLink to connect to databases in your Amazon Virtual Private Cloud (Amazon VPC).

When setting up an Amazon Data Firehose delivery stream, you can either specify specific tables and columns or use wildcards to specify a class of tables and columns. If you use wildcards and new tables or columns that match them are added to the database after the Data Firehose stream is created, Data Firehose will automatically create those tables and columns in the destination.

Pricing and availability
The new data streaming capability is available today in all AWS Regions except the China Regions, AWS GovCloud (US) Regions, and the Asia Pacific (Malaysia) Region. We want you to evaluate this new capability and give us feedback. There are no charges for your usage at the start of the preview. At some point in the future, it will be priced based on your actual usage, for example, based on the quantity of bytes read and delivered. There are no commitments or upfront investments. Make sure to read the pricing page for the details.

Now, go configure your first continuous database replication to Apache Iceberg tables on Amazon S3, and visit http://aws.amazon.com/firehose.

— seb