Define error handling for Amazon Redshift Spectrum data

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. Amazon Redshift Spectrum allows you to query open format data directly from the Amazon Simple Storage Service (Amazon S3) data lake without having to load the data into Amazon Redshift tables. With Redshift Spectrum, you can query open file formats such as Apache Parquet, ORC, JSON, Avro, and CSV. This feature of Amazon Redshift enables a modern data architecture that allows you to query all your data to obtain more complete insights.

Amazon Redshift has a standard way of handling data errors in Redshift Spectrum. Data file fields containing any special character are set to null. Character fields longer than the defined table column length get truncated by Redshift Spectrum, whereas numeric fields display the maximum number that can fit in the column. With the newly added user-defined data error handling feature in Amazon Redshift Spectrum, you can now customize data validation and error handling.

This feature provides you with specific methods for handling each of the scenarios for invalid characters, surplus characters, and numeric overflow while processing data using Redshift Spectrum. In addition, the errors are captured and visible in the newly created dictionary view SVL_SPECTRUM_SCAN_ERROR. You can even cancel the query when a defined threshold of errors has been reached.

Prerequisites

To demonstrate Redshift Spectrum user-defined data handling, we build an external table over a data file with soccer league information and use it to show different data errors and the options the new feature offers for dealing with them. To follow along, you need the resources that the Clean up section removes: an Amazon Redshift cluster, an S3 bucket containing the data file, and an AWS Glue Data Catalog database with an external schema.

Solution overview

We use the data file for soccer leagues to define an external table that demonstrates different data errors and the handling techniques the new feature offers for them. The following screenshot shows an example of the data file.

Note the following in the example:

  • The club name can occasionally be longer than 15 characters
  • The league name can occasionally be longer than 20 characters
  • The club name Barcelôna includes an invalid character
  • The column nspi includes values that are larger than the SMALLINT range

Then we create an external table (see the following code) to demonstrate the new user-defined handling:

  • We define club_name and league_name shorter than they should be to demonstrate handling of surplus characters
  • We define the column league_nspi as SMALLINT to demonstrate handling of numeric overflow
  • We use the new table property data_cleansing_enabled to enable custom data handling
CREATE EXTERNAL TABLE schema_spectrum_uddh.soccer_league
(
  league_rank smallint,
  prev_rank   smallint,
  club_name   varchar(15),
  league_name varchar(20),
  league_off  decimal(6,2),
  league_def  decimal(6,2),
  league_spi  decimal(6,2),
  league_nspi smallint
)
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\n\l'
stored as textfile
LOCATION 's3://uddh-soccer/league/'
table properties ('skip.header.line.count'='1','data_cleansing_enabled'='true');
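
To see the effect of each setting in the sections that follow, it helps to rerun the same query after every change to the table properties. The query below is a minimal sketch that uses only the table and columns defined above; the ORDER BY column and the LIMIT value are arbitrary choices for illustration and are not part of the original walkthrough.

```sql
-- Inspect the columns that exercise each kind of data error.
-- Rerun this after each change to the table properties.
SELECT league_rank,
       club_name,    -- varchar(15): invalid and surplus character handling
       league_name,  -- varchar(20): surplus character handling
       league_nspi   -- smallint: numeric overflow handling
FROM schema_spectrum_uddh.soccer_league
ORDER BY league_rank
LIMIT 10;            -- arbitrary sample size, adjust as needed
```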

Invalid character data handling

With the introduction of the new table and column property invalid_char_handling, you can now choose how you deal with invalid characters in your data. The supported values are as follows:

  • DISABLED – Feature is disabled (no handling).
  • SET_TO_NULL – Replaces the value with null.
  • DROP_ROW – Drops the whole row.
  • FAIL – Fails the query when an invalid UTF-8 value is detected.
  • REPLACE – Replaces the invalid character with a replacement. With this option, you can use the newly introduced table property replacement_char.

The table property can apply to the whole table or to a single column. Additionally, you can define the table property at create time or later by altering the table.

When you disable user-defined handling, Redshift Spectrum by default sets the value to null (similar to SET_TO_NULL):

alter table schema_spectrum_uddh.soccer_league
set table properties ('invalid_char_handling'='DISABLED');

When you change the handling setting to DROP_ROW, Redshift Spectrum drops the row that has an invalid character:

alter table schema_spectrum_uddh.soccer_league
set table properties ('invalid_char_handling'='DROP_ROW');

When you change the handling setting to FAIL, Redshift Spectrum fails and returns an error:

alter table schema_spectrum_uddh.soccer_league
set table properties ('invalid_char_handling'='FAIL');

When you change the handling setting to REPLACE and choose a replacement character, Redshift Spectrum replaces the invalid character with the chosen replacement character:

alter table schema_spectrum_uddh.soccer_league
set table properties ('invalid_char_handling'='REPLACE','replacement_char'='?');
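
One way to confirm the REPLACE setting is to look for the replacement character in the affected column. This is a sketch that assumes the replacement character is '?' as set above and that genuine club names never contain a question mark.

```sql
-- Assumption: replacement_char is '?' and no real club name contains '?'.
-- In a LIKE pattern only % and _ are wildcards, so '?' matches literally.
SELECT league_rank, club_name
FROM schema_spectrum_uddh.soccer_league
WHERE club_name LIKE '%?%';
```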

Surplus character data handling

As mentioned earlier, we defined the columns club_name and league_name shorter than the actual contents of the corresponding fields in the data file.

With the introduction of the new table property surplus_char_handling, you can choose from multiple options:

  • DISABLED – Feature is disabled (no handling)
  • TRUNCATE – Truncates the value to the column size
  • SET_TO_NULL – Replaces the value with null
  • DROP_ROW – Drops the whole row
  • FAIL – Fails the query when a value is too large for the column

When you disable the user-defined handling, Redshift Spectrum defaults to truncating the surplus characters (similar to TRUNCATE):

alter table schema_spectrum_uddh.soccer_league
set table properties ('surplus_char_handling' = 'DISABLED');

When you change the handling setting to SET_TO_NULL, Redshift Spectrum sets to NULL the column value of any field that is longer than the defined length:

alter table schema_spectrum_uddh.soccer_league
set table properties ('surplus_char_handling' = 'SET_TO_NULL');

When you change the handling setting to DROP_ROW, Redshift Spectrum drops any row that contains a field longer than the defined length:

alter table schema_spectrum_uddh.soccer_league
set table properties ('surplus_char_handling' = 'DROP_ROW');

When you change the handling setting to FAIL, Redshift Spectrum fails and returns an error:

alter table schema_spectrum_uddh.soccer_league
set table properties ('surplus_char_handling' = 'FAIL');

We need to disable the user-defined data handling for this data error before demonstrating the next type of error:

alter table schema_spectrum_uddh.soccer_league
set table properties ('surplus_char_handling' = 'DISABLED');
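
With handling set back to DISABLED, values are truncated silently, so it can be useful to list the values that sit exactly at the declared column length. The query below is only a heuristic sketch: a name that is legitimately 15 or 20 characters long matches too, and LEN counts characters, so the check assumes single-byte data where character and byte lengths coincide.

```sql
-- Heuristic: a value at exactly the declared length may have been truncated.
-- Names that are genuinely 15 or 20 characters long will also appear.
SELECT league_rank, club_name, league_name
FROM schema_spectrum_uddh.soccer_league
WHERE LEN(club_name) = 15
   OR LEN(league_name) = 20;
```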

Numeric overflow data handling

For this demonstration, we intentionally defined league_nspi as SMALLINT (with a range of -32,768 to +32,767) to show the available options for data handling.

With the introduction of the new table property numeric_overflow_handling, you can choose from multiple options:

  • DISABLED – Feature is disabled (no handling)
  • SET_TO_NULL – Replaces the value with null
  • DROP_ROW – Replaces each value in the row with NULL
  • FAIL – Fails the query when a value is too large for the column

When we look at the source data, we can observe that the top five countries have more points than the SMALLINT field can handle.

When you disable the user-defined handling, Redshift Spectrum defaults to the maximum number the numeric data type can handle; in our case, SMALLINT can hold values up to 32767:

alter table schema_spectrum_uddh.soccer_league
set table properties ('numeric_overflow_handling' = 'DISABLED');

When you choose SET_TO_NULL, Redshift Spectrum sets to null the column with numeric overflow:

alter table schema_spectrum_uddh.soccer_league
set table properties ('numeric_overflow_handling' = 'SET_TO_NULL');

When you choose DROP_ROW, Redshift Spectrum drops the row containing the column with numeric overflow:

alter table schema_spectrum_uddh.soccer_league
set table properties ('numeric_overflow_handling' = 'DROP_ROW');

When you choose FAIL, Redshift Spectrum fails and returns an error:

alter table schema_spectrum_uddh.soccer_league
set table properties ('numeric_overflow_handling' = 'FAIL');

We need to disable the user-defined data handling for this data error before moving on to the next demonstration:

alter table schema_spectrum_uddh.soccer_league
set table properties ('numeric_overflow_handling' = 'DISABLED');
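
Under the default behavior, overflowing values are reported as the type maximum, so a quick way to spot them is to filter on that maximum. This is a minimal sketch that assumes no club genuinely has a league_nspi of exactly 32767.

```sql
-- Rows whose league_nspi sits at the SMALLINT maximum are likely overflow
-- cases when numeric_overflow_handling is DISABLED.
SELECT league_rank, club_name, league_nspi
FROM schema_spectrum_uddh.soccer_league
WHERE league_nspi = 32767;
```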

Stop queries at MAXERROR threshold

You can also choose to stop the query if it reaches a certain threshold of errors by using the newly introduced parameter spectrum_query_maxerror:

Set spectrum_query_maxerror to 7;

The following screenshot shows that the query ran successfully.

However, if you decrease this threshold to a lower number, the query fails because it reached the preset threshold:

Set spectrum_query_maxerror to 6;
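
Put together, a session that caps errors for a single query might look like the following sketch. The threshold of 6 comes from the example above. SHOW and RESET are the general Amazon Redshift commands for inspecting and restoring a session parameter; using them with spectrum_query_maxerror is an assumption here and is not something the original walkthrough demonstrates.

```sql
-- Cap the number of data errors tolerated by queries in this session.
SET spectrum_query_maxerror TO 6;

-- Confirm the value in effect (assumed to work as for other parameters).
SHOW spectrum_query_maxerror;

-- This query is canceled if the error count reaches the threshold.
SELECT club_name, league_name, league_nspi
FROM schema_spectrum_uddh.soccer_league;

-- Restore the default so later queries are not affected.
RESET spectrum_query_maxerror;
```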

Error logging

With the introduction of the new user-defined data handling feature, we also introduced the new view svl_spectrum_scan_error, which allows you to view a useful sample set of the error logs. The view contains the query, file, row, column, error code, handling action that was applied, as well as the original value and the modified (resulting) value. See the following code:

SELECT *
FROM svl_spectrum_scan_error
where location = 's3://uddh-soccer/league/spi_global_rankings.csv';
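
To summarize the log rather than read it row by row, you can group the entries by column and action. The column names used below (query, colname, action, error_code) follow the fields listed above, but they are assumptions about the view's schema, so confirm them against the view in your cluster. pg_last_query_id() is the Amazon Redshift function that returns the ID of the most recently completed query in the current session.

```sql
-- Count logged errors per column and handling action for the previous query.
-- Column names are assumed; verify with SELECT * on the view first.
SELECT colname,
       action,
       error_code,
       count(*) AS error_count
FROM svl_spectrum_scan_error
WHERE query = pg_last_query_id()   -- run right after the Spectrum query
GROUP BY colname, action, error_code
ORDER BY error_count DESC;
```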

Clean up

To avoid incurring future charges, complete the following steps:

  1. Delete the Amazon Redshift cluster created for this demonstration. If you were using an existing cluster, drop the created external table and external schema.
  2. Delete the S3 bucket.
  3. Delete the AWS Glue Data Catalog database.
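
If you keep an existing cluster, the external objects from step 1 can be dropped with SQL. This is a minimal sketch using the schema and table names from this post; it assumes nothing else in the external schema needs to be preserved.

```sql
-- Drop the external table first, then the external schema that holds it.
DROP TABLE schema_spectrum_uddh.soccer_league;

-- Dropping the schema alone leaves the AWS Glue Data Catalog database in
-- place (step 3 removes it); adding DROP EXTERNAL DATABASE to this statement
-- would remove the database as well.
DROP SCHEMA schema_spectrum_uddh;
```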

Conclusion

In this post, we demonstrated Redshift Spectrum’s newly added feature of user-defined data error handling and showed how it provides the flexibility to take a user-defined approach to data exceptions when processing external files. We also demonstrated how the logging enhancements provide transparency on the errors encountered in external data processing without needing to write additional custom code.

We look forward to hearing from you about your experience. If you have questions or suggestions, please leave a comment.

About the Authors

Ahmed Shehata is a Data Warehouse Specialist Solutions Architect with Amazon Web Services, based out of Toronto.

Milind Oke is a Data Warehouse Specialist Solutions Architect based out of New York. He has been building data warehouse solutions for over 15 years and specializes in Amazon Redshift. He is focused on helping customers design and build enterprise-scale well-architected analytics and decision support platforms.
