ETL orchestration using the Amazon Redshift Data API and AWS Step Functions with AWS SDK integration


Extract, transform, and load (ETL) serverless orchestration architectures have become popular with many customers. These applications offer greater extensibility and simplicity, making ETL pipelines easier to maintain and simplify. A primary benefit of this architecture is that we simplify an existing ETL pipeline with AWS Step Functions and directly call the Amazon Redshift Data API from the state machine. As a result, the complexity of the ETL pipeline is reduced.

As a data engineer or an application developer, you may want to interact with Amazon Redshift to load or query data with a simple API endpoint without having to manage persistent connections. The Amazon Redshift Data API allows you to interact with Amazon Redshift without having to configure JDBC or ODBC connections. This feature allows you to orchestrate serverless data processing workflows, design event-driven web applications, and run an ETL pipeline asynchronously to ingest and process data in Amazon Redshift, using Step Functions to orchestrate the entire ETL or ELT workflow.
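To make the connectionless, asynchronous model concrete, here is a minimal sketch of submitting a statement and polling for its completion. It assumes the caller passes in an object that behaves like a boto3 `redshift-data` client; the function name `run_sql`, the polling defaults, and all cluster, database, and user values are illustrative placeholders rather than part of the solution described in this post.

```python
import time


def run_sql(client, sql, cluster_id, database, db_user,
            poll_seconds=5, max_polls=60):
    """Submit one SQL statement through the Data API and wait for it to finish.

    `client` is assumed to behave like boto3.client("redshift-data").
    """
    # execute_statement returns right away with a statement Id; the caller
    # never opens or holds a JDBC/ODBC connection.
    submitted = client.execute_statement(
        ClusterIdentifier=cluster_id,
        Database=database,
        DbUser=db_user,
        Sql=sql,
    )
    statement_id = submitted["Id"]

    # Poll describe_statement until the statement reaches a terminal status.
    for _ in range(max_polls):
        status = client.describe_statement(Id=statement_id)["Status"]
        if status == "FINISHED":
            return statement_id
        if status in ("FAILED", "ABORTED"):
            raise RuntimeError(
                f"Statement {statement_id} ended with status {status}"
            )
        time.sleep(poll_seconds)
    raise TimeoutError(f"Statement {statement_id} did not finish in time")
```

With real credentials, a call might look like `run_sql(boto3.client("redshift-data"), "CALL sp_setup_sales_data_pipeline()", "my-cluster", "dev", "awsuser")`, where the last three values are placeholders. In the rest of this post, Step Functions takes over this submit-and-poll role, so no such client code needs to be maintained.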

This post explains how to use Step Functions and the Amazon Redshift Data API to orchestrate the different steps in your ETL or ELT workflow and process data into an Amazon Redshift data warehouse.

AWS Lambda is commonly used with Step Functions because of its flexible and scalable compute benefits. An ETL workflow has multiple steps, and the complexity may vary within each step. However, there is an alternative approach: AWS SDK service integrations, a feature of Step Functions. These integrations allow you to call API actions of over 200 AWS services directly from your state machine. Compared to using Lambda, this approach is optimal for steps with relatively low complexity because you no longer have to maintain and test function code. Lambda functions also have a maximum timeout of 15 minutes; if you need to wait for longer-running processes, Step Functions standard workflows allow a maximum runtime of 1 year.

You can replace steps that include a single process with a direct integration between Step Functions and AWS SDK service integrations, without using Lambda. For example, if a step is only used to call a Lambda function that runs a SQL statement in Amazon Redshift, you can replace the Lambda function with a direct integration to the Amazon Redshift Data API’s SDK API action. You can also decouple Lambda functions that perform multiple actions into multiple steps. An implementation of this is available later in this post.
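As a rough illustration of what such a replacement step could look like, the sketch below builds a single Task state that calls the Data API's `executeStatement` action directly. The helper name `redshift_sql_task`, the `$.sql_output` result path, the next-state name, and the cluster, database, and user defaults are assumptions made for this example, not values taken from the repository.

```python
import json


def redshift_sql_task(sql, next_state, cluster_id="my-redshift-cluster",
                      database="dev", db_user="awsuser"):
    """Return a Task state that runs one SQL statement without Lambda."""
    return {
        "Type": "Task",
        # AWS SDK integration: service name plus camel-case API action.
        "Resource": "arn:aws:states:::aws-sdk:redshiftdata:executeStatement",
        # Parameter names are the Pascal-case names of the Data API request.
        "Parameters": {
            "ClusterIdentifier": cluster_id,  # placeholder value
            "Database": database,             # placeholder value
            "DbUser": db_user,                # placeholder value
            "Sql": sql,
        },
        # Keep the response (including the statement Id) for later steps.
        "ResultPath": "$.sql_output",
        "Next": next_state,
    }


# "wait_on_load" is a hypothetical name for the state that follows.
state = redshift_sql_task("CALL sp_load_dim_item()", "wait_on_load")
print(json.dumps(state, indent=2))
```

The state does the same job as a Lambda function whose only purpose is to submit a SQL statement, but there is no function code to deploy, test, or keep up to date.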

We created an example use case in the GitHub repo ETL Orchestration using Amazon Redshift Data API and AWS Step Functions that provides an AWS CloudFormation template for setup, SQL scripts, and a state machine definition. The state machine directly reads SQL scripts stored in your Amazon Simple Storage Service (Amazon S3) bucket, runs them in your Amazon Redshift cluster, and performs an ETL workflow. We don’t use Lambda in this use case.

Solution overview

In this scenario, we simplify an existing ETL pipeline that uses Lambda to call the Data API. AWS SDK service integrations with Step Functions allow you to call the Data API directly from the state machine, reducing the complexity of running the ETL pipeline.

The entire workflow performs the following steps:

  1. Set up the required database objects and generate a set of sample data to be processed.
  2. Run two dimension jobs that perform SCD1 and SCD2 dimension loads, respectively.
  3. When both jobs have run successfully, the load job for the fact table runs.
  4. The state machine performs a validation to ensure the sales data was loaded successfully.
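The ordering in these steps maps naturally onto a Parallel state followed by sequential states. The skeleton below is only a sketch of that shape: Pass states stand in for the real Data API tasks, and the state name `validate_sales_data` is a hypothetical label for the final validation step.

```python
import json


def placeholder(next_state=None):
    """Pass state standing in for a real Data API task in this sketch."""
    state = {"Type": "Pass"}
    if next_state:
        state["Next"] = next_state
    else:
        state["End"] = True
    return state


def sketch_workflow():
    """Return a skeleton state machine showing only the step ordering."""
    return {
        "StartAt": "setup_sales_data_pipeline",
        "States": {
            # Step 1: create database objects and sample data.
            "setup_sales_data_pipeline": placeholder("run_sales_data_pipeline"),
            # Step 2: both dimension loads run at the same time.
            "run_sales_data_pipeline": {
                "Type": "Parallel",
                "Branches": [
                    {
                        "StartAt": "LoadCustomerAddressTable",  # SCD1 load
                        "States": {"LoadCustomerAddressTable": placeholder()},
                    },
                    {
                        "StartAt": "LoadItemTable",  # SCD2 load
                        "States": {"LoadItemTable": placeholder()},
                    },
                ],
                # Step 3: reached only after every branch has succeeded.
                "Next": "run_load_fact_sales",
            },
            "run_load_fact_sales": placeholder("validate_sales_data"),
            # Step 4: hypothetical name for the validation step.
            "validate_sales_data": placeholder(),
        },
    }


definition = sketch_workflow()
print(json.dumps(definition, indent=2))
```

Because a Parallel state moves to its `Next` state only when all of its branches complete successfully, the fact load cannot start before both dimension loads are done.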

The following architecture diagram highlights the end-to-end solution:

We run the state machine via the Step Functions console, but you can run this solution in several ways:

You can deploy the solution with the provided CloudFormation template, which creates the following resources:

  • Database objects in the Amazon Redshift cluster:
    • Four stored procedures:
      • sp_setup_sales_data_pipeline() – Creates the tables and populates them with sample data
      • sp_load_dim_customer_address() – Runs the SCD1 process on customer_address records
      • sp_load_dim_item() – Runs the SCD2 process on item records
      • sp_load_fact_sales (p_run_date date) – Processes sales from all stores for a given day
    • Five Amazon Redshift tables:
      • customer
      • customer_address
      • date_dim
      • item
      • store_sales
  • The AWS Identity and Access Management (IAM) role StateMachineExecutionRole for Step Functions, which allows the following permissions:
    • Federate to the Amazon Redshift cluster through the getClusterCredentials permission, avoiding password credentials
    • Run queries in the Amazon Redshift cluster through Data API calls
    • List and retrieve objects from Amazon S3
  • The Step Functions state machine RedshiftETLStepFunction, which contains the steps used to run the ETL workflow of the sample sales data pipeline
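To give a feel for the three permission groups listed for StateMachineExecutionRole, here is a hedged sketch of an IAM policy document built in Python. The action names are real IAM actions, but the statement layout, the `Sid` values, the function name, and every ARN are assumptions for illustration; the policy in the actual template may be scoped differently.

```python
import json


def execution_role_policy(dbuser_arn, dbname_arn, bucket):
    """Return an illustrative policy document for the state machine role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Temporary credentials instead of a stored password.
                "Sid": "FederateToCluster",
                "Effect": "Allow",
                "Action": ["redshift:GetClusterCredentials"],
                "Resource": [dbuser_arn, dbname_arn],
            },
            {
                # Data API calls used to submit and track SQL statements.
                "Sid": "RunDataApiQueries",
                "Effect": "Allow",
                "Action": [
                    "redshift-data:ExecuteStatement",
                    "redshift-data:BatchExecuteStatement",
                    "redshift-data:DescribeStatement",
                    "redshift-data:GetStatementResult",
                ],
                "Resource": "*",
            },
            {
                # Read access to the SQL scripts stored in Amazon S3.
                "Sid": "ReadEtlScripts",
                "Effect": "Allow",
                "Action": ["s3:ListBucket", "s3:GetObject"],
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
            },
        ],
    }


# All identifiers below are placeholders.
policy = execution_role_policy(
    "arn:aws:redshift:us-east-1:111122223333:dbuser:my-cluster/awsuser",
    "arn:aws:redshift:us-east-1:111122223333:dbname:my-cluster/dev",
    "my-etl-script-bucket",
)
print(json.dumps(policy, indent=2))
```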

Prerequisites

As a prerequisite for deploying the solution, you need to set up an Amazon Redshift cluster and associate it with an IAM role. For more information, see Authorizing Amazon Redshift to access other AWS services on your behalf. If you don’t have a cluster provisioned in your AWS account, refer to Getting started with Amazon Redshift for instructions to set it up.

When the Amazon Redshift cluster is available, perform the following steps:

  1. Download and save the CloudFormation template to a local folder on your computer.
  2. Download and save the following SQL scripts to a local folder on your computer:
    1. sp_statements.sql – Contains the stored procedures, including DDL and DML operations.
    2. validate_sql_statement.sql – Contains two validation queries you can run.
  3. Upload the SQL scripts to your S3 bucket. The bucket name is the designated S3 bucket specified in the ETLScriptS3Path input parameter.
  4. On the AWS CloudFormation console, choose Create stack with new resources and upload the template file you downloaded in the previous step (etl-orchestration-with-stepfunctions-and-redshift-data-api.yaml).
  5. Enter the required parameters and choose Next.
  6. Choose Next until you get to the Review page and select the acknowledgement check box.
  7. Choose Create stack.
  8. Wait until the stack deploys successfully.

When the stack is complete, you can view the outputs, as shown in the following screenshot:

Run the ETL orchestration

After you deploy the CloudFormation template, navigate to the stack detail page. On the Resources tab, choose the link for RedshiftETLStepFunction to be redirected to the Step Functions console.

The RedshiftETLStepFunction state machine runs automatically, as outlined in the following workflow:

  1. read_sp_statement and run_sp_deploy_redshift – Perform the following actions:
    1. Retrieve sp_statements.sql from Amazon S3 to get the stored procedures.
    2. Pass the stored procedures to the batch-execute-statement API to run in the Amazon Redshift cluster.
    3. Send the identifier of the SQL statement back to the state machine.
  2. wait_on_sp_deploy_redshift – Waits for at least 5 seconds.
  3. run_sp_deploy_redshift_status_check – Invokes the Data API’s describeStatement to get the status of the API call.
  4. is_run_sp_deploy_complete – Routes the next step of the ETL workflow depending on its status:
    1. FINISHED – Stored procedures are created in your Amazon Redshift cluster.
    2. FAILED – Go to the sales_data_pipeline_failure step and fail the ETL workflow.
    3. All other statuses – Go back to the wait_on_sp_deploy_redshift step to wait for the SQL statements to finish.
  5. setup_sales_data_pipeline – Performs the following steps:
    1. Initiates the setup stored procedure that was previously created in the Amazon Redshift cluster.
    2. Sends the identifier of the SQL statement back to the state machine.
  6. wait_on_setup_sales_data_pipeline – Waits for at least 5 seconds.
  7. setup_sales_data_pipeline_status_check – Invokes the Data API’s describeStatement to get the status of the API call.
  8. is_setup_sales_data_pipeline_complete – Routes the next step of the ETL workflow depending on its status:
    1. FINISHED – Created two dimension tables (customer_address and item) and one fact table (sales).
    2. FAILED – Go to the sales_data_pipeline_failure step and fail the ETL workflow.
    3. All other statuses – Go back to the wait_on_setup_sales_data_pipeline step to wait for the SQL statements to finish.
  9. run_sales_data_pipeline – LoadItemTable and LoadCustomerAddressTable are two parallel workflows that Step Functions runs at the same time. The workflows run the stored procedures that were previously created. The stored procedures load the data into the item and customer_address tables. All other steps in the parallel sessions follow the same concept as described previously. When both parallel workflows are complete, run_load_fact_sales runs.
  10. run_load_fact_sales – Inserts data into the store_sales table that was created in the initial stored procedure.
  11. Validation – When all the ETL steps are complete, the state machine reads a second SQL file from Amazon S3 (validate_sql_statement.sql) and runs the two SQL statements using the batch_execute_statement method.
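The wait, status check, and routing steps recur throughout the workflow above, so they can be sketched once as a reusable trio of states. The helper below is an illustration under stated assumptions: the function name `status_poll_states` and the `$.sql_output` path where the statement Id and status are kept are choices made for this example, and the actual definition in the repository may differ.

```python
import json


def status_poll_states(prefix, on_success,
                       on_failure="sales_data_pipeline_failure",
                       wait_seconds=5):
    """Return Wait, Task, and Choice states that poll a submitted statement."""
    wait = f"wait_on_{prefix}"
    check = f"{prefix}_status_check"
    choice = f"is_{prefix}_complete"
    return {
        # Pause before asking the Data API for the statement status.
        wait: {"Type": "Wait", "Seconds": wait_seconds, "Next": check},
        # Look up the statement by the Id returned when it was submitted.
        check: {
            "Type": "Task",
            "Resource": "arn:aws:states:::aws-sdk:redshiftdata:describeStatement",
            "Parameters": {"Id.$": "$.sql_output.Id"},
            "ResultPath": "$.sql_output",
            "Next": choice,
        },
        # Route on the reported status; anything else loops back to the wait.
        choice: {
            "Type": "Choice",
            "Choices": [
                {
                    "Variable": "$.sql_output.Status",
                    "StringEquals": "FINISHED",
                    "Next": on_success,
                },
                {
                    "Variable": "$.sql_output.Status",
                    "StringEquals": "FAILED",
                    "Next": on_failure,
                },
            ],
            "Default": wait,
        },
    }


states = status_poll_states("setup_sales_data_pipeline", "run_sales_data_pipeline")
print(json.dumps(states, indent=2))
```

Called with the prefix `setup_sales_data_pipeline`, the helper produces state names that match steps 6 to 8 of the list. Because the `describeStatement` response also contains the statement Id, writing it back to the same path keeps the Id available for the next loop iteration.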

The implementation of the ETL workflow is idempotent. If it fails, you can retry the job without any cleanup. For example, it recreates the stg_store_sales table each time, then deletes the data for the particular refresh date from the target table store_sales each time.

The following diagram illustrates the state machine workflow:

In this example, we use the task state resource arn:aws:states:::aws-sdk:redshiftdata:[apiAction] to call the corresponding Data API action. The following table summarizes the Data API actions and their corresponding AWS SDK integration API actions.

To use AWS SDK integrations, you specify the service name and API call, and, optionally, a service integration pattern. The AWS SDK action is always camel case, and parameter names are Pascal case. For example, you can use the Step Functions action batchExecuteStatement to run multiple SQL statements in a batch as part of a single transaction on the Data API. The SQL statements can be SELECT, DML, DDL, COPY, and UNLOAD.
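The naming convention is easiest to see in a concrete pair of states. The sketch below reads a SQL script from Amazon S3 and hands it to `batchExecuteStatement`. The helper name, the `$.script` and `$.sql_output` paths, and the use of `States.Array` to wrap the whole file body as one entry of `Sqls` are assumptions made for this illustration; whether a given script can be submitted in that form depends on its contents.

```python
import json


def script_runner_states(bucket, key, cluster_id, database, db_user, next_state):
    """Return two Task states: read a script from S3, then run it as a batch."""
    return {
        "read_sp_statement": {
            "Type": "Task",
            # Service "s3", camel-case action "getObject".
            "Resource": "arn:aws:states:::aws-sdk:s3:getObject",
            # Pascal-case parameter names, as in the S3 API.
            "Parameters": {"Bucket": bucket, "Key": key},
            "ResultPath": "$.script",
            "Next": "run_sp_deploy_redshift",
        },
        "run_sp_deploy_redshift": {
            "Type": "Task",
            # Service "redshiftdata", camel-case action "batchExecuteStatement".
            "Resource": "arn:aws:states:::aws-sdk:redshiftdata:batchExecuteStatement",
            "Parameters": {
                "ClusterIdentifier": cluster_id,
                "Database": database,
                "DbUser": db_user,
                # The ".$" suffix tells Step Functions to evaluate the value;
                # States.Array wraps the file body in the list that Sqls expects.
                "Sqls.$": "States.Array($.script.Body)",
            },
            # The response carries the statement Id used for status checks.
            "ResultPath": "$.sql_output",
            "Next": next_state,
        },
    }


# Bucket, key prefix, cluster, database, and user are placeholders.
states = script_runner_states(
    "my-etl-script-bucket", "scripts/sp_statements.sql",
    "my-redshift-cluster", "dev", "awsuser",
    next_state="wait_on_sp_deploy_redshift",
)
print(json.dumps(states, indent=2))
```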

Validate the ETL orchestration

The entire ETL workflow takes approximately 1 minute to run. The following screenshot shows that the ETL workflow completed successfully.

When the entire sales data pipeline is complete, you can go through the entire execution event history, as shown in the following screenshot.

Schedule the ETL orchestration

After you validate the sales data pipeline, you may opt to run the data pipeline on a daily schedule. You can accomplish this with Amazon EventBridge.

  1. On the EventBridge console, create a rule to run the RedshiftETLStepFunction state machine daily.
  2. To invoke the RedshiftETLStepFunction state machine on a schedule, choose Schedule and define the appropriate frequency needed to run the sales data pipeline.
  3. Specify the target state machine as RedshiftETLStepFunction and choose Create.
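For readers who prefer to script the console steps above, the following sketch prepares the request parameters for the EventBridge `put_rule` and `put_targets` calls. The function name, rule name, target Id, and both ARNs are placeholders, and the role passed as `RoleArn` is assumed to allow `states:StartExecution` on the state machine.

```python
import json


def daily_schedule_requests(state_machine_arn, invoke_role_arn,
                            rule_name="redshift-etl-daily"):
    """Return keyword arguments for EventBridge put_rule and put_targets."""
    put_rule = {
        "Name": rule_name,
        # Fire once a day; a cron(...) expression would also work here.
        "ScheduleExpression": "rate(1 day)",
        "State": "ENABLED",
    }
    put_targets = {
        "Rule": rule_name,
        "Targets": [
            {
                "Id": "RedshiftETLStepFunction",
                "Arn": state_machine_arn,
                # Role that EventBridge assumes to start the execution.
                "RoleArn": invoke_role_arn,
            }
        ],
    }
    return put_rule, put_targets


# Placeholder ARNs for illustration only.
rule, targets = daily_schedule_requests(
    "arn:aws:states:us-east-1:111122223333:stateMachine:RedshiftETLStepFunction",
    "arn:aws:iam::111122223333:role/eventbridge-invoke-stepfunctions",
)
print(json.dumps({"put_rule": rule, "put_targets": targets}, indent=2))
```

With boto3 installed and credentials configured, the two dictionaries could be passed as `events.put_rule(**rule)` and `events.put_targets(**targets)`, where `events = boto3.client("events")`.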

You can confirm the schedule on the rule details page.

Clean up

Clean up the resources created by the CloudFormation template to avoid unnecessary cost to your AWS account. You can delete the CloudFormation stack by selecting the stack on the AWS CloudFormation console and choosing Delete. This action deletes all the resources it provisioned. If you manually updated a template-provisioned resource, you may see some issues during cleanup; you need to clean these up independently.

Limitations

The Data API and Step Functions AWS SDK integration offers a robust mechanism to build highly distributed ETL applications with minimal developer overhead. Consider the following limitations when using the Data API and Step Functions:

Conclusion

In this post, we demonstrated how to build an ETL orchestration using the Amazon Redshift Data API and Step Functions with AWS SDK integration.

To learn more about the Data API, see Using the Amazon Redshift Data API to interact with Amazon Redshift clusters and Using the Amazon Redshift Data API.


About the Authors

Jason Pedreza is an Analytics Specialist Solutions Architect at AWS with over 13 years of data warehousing experience. Prior to AWS, he built data warehouse solutions at Amazon.com. He specializes in Amazon Redshift and helps customers build scalable analytic solutions.

Bipin Pandey is a Data Architect at AWS. He loves to build data lake and analytics platforms for his customers. He’s passionate about automating and simplifying customer problems with the use of cloud solutions.

David Zhang is an AWS Solutions Architect who helps customers design robust, scalable, and data-driven solutions across multiple industries. With a background in software development, David is an active leader and contributor to AWS open-source initiatives. He’s passionate about solving real-world business problems and continuously strives to work from the customer’s perspective. Feel free to connect with him on LinkedIn.
