Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. Amazon Redshift Spectrum allows you to query open format data directly from the Amazon Simple Storage Service (Amazon S3) data lake without having to load the data into Amazon Redshift tables. With Redshift Spectrum, you can query open file formats such as Apache Parquet, ORC, JSON, Avro, and CSV. This feature of Amazon Redshift enables a modern data architecture that allows you to query all your data to obtain more complete insights.
Amazon Redshift has a standard way of handling data errors in Redshift Spectrum. Data file fields containing any special character are set to null. Character fields longer than the defined table column length get truncated by Redshift Spectrum, whereas numeric fields display the maximum number that can fit in the column. With this newly added user-defined data error handling feature in Amazon Redshift Spectrum, you can now customize data validation and error handling.
This feature provides you with specific methods for handling each of the scenarios for invalid characters, surplus characters, and numeric overflow while processing data using Redshift Spectrum. Also, the errors are captured and visible in the newly created dictionary view SVL_SPECTRUM_SCAN_ERROR. You can even cancel the query when a defined threshold of errors has been reached.
Prerequisites
To demonstrate Redshift Spectrum user-defined data handling, we build an external table over a data file with soccer league information and use that to show different data errors and how the new feature offers different options in dealing with those data errors. We need the following prerequisites:
Solution overview
We use the data file for soccer leagues to define an external table to demonstrate different data errors and the different handling techniques offered by the new feature to deal with those errors. The following screenshot shows an example of the data file.
Notice the next within the instance:
- The membership title can sometimes be longer than 15 characters
- The league title will be sometimes be longer than 20 characters
- The membership title Barcelôna consists of an invalid character
- The column
nspi
consists of values which are larger than SMALLINT vary
Then we create an external table (see the following code) to demonstrate the new user-defined handling:
- We define the club_name and league_name columns shorter than they should be to demonstrate handling of surplus characters
- We define the column league_nspi as SMALLINT to demonstrate handling of numeric overflow
- We use the new table property data_cleansing_enabled to enable custom data handling
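A table definition along the following lines would match the three points above. This is only a sketch, not the original statement from this post: the schema name spectrum_schema, the table name soccer_league, the S3 location, the league_rank column, and the assumption that the file is a comma-delimited text file with a header row and exactly these fields are all placeholders chosen for illustration.

```sql
-- Hypothetical names: spectrum_schema, soccer_league, and the S3 path are placeholders.
-- Assumes a comma-delimited text file whose fields appear in this order.
CREATE EXTERNAL TABLE spectrum_schema.soccer_league (
    league_rank  SMALLINT,      -- assumed column, not named in the text
    club_name    VARCHAR(15),   -- deliberately short: some club names exceed 15 characters
    league_name  VARCHAR(20),   -- deliberately short: some league names exceed 20 characters
    league_nspi  SMALLINT       -- deliberately narrow: some values exceed 32767
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://your-bucket/league/'
TABLE PROPERTIES (
    'skip.header.line.count'='1',     -- assumes the file has a header row
    'data_cleansing_enabled'='true'   -- turns on user-defined data handling
);
```

With data_cleansing_enabled set to true, the individual handling properties described in the following sections take effect for this table.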
Invalid character data handling
With the introduction of the new table and column property invalid_char_handling, you can now choose how you deal with invalid characters in your data. The supported values are as follows:
- DISABLED – Feature is disabled (no handling).
- SET_TO_NULL – Replaces the value with null.
- DROP_ROW – Drops the whole row.
- FAIL – Fails the query when an invalid UTF-8 value is detected.
- REPLACE – Replaces the invalid character with a replacement. With this option, you can use the newly introduced table property replacement_char.
The table property can work over the whole table or just at the column level. Additionally, you can define the table property at create time or later by altering the table.
When you disable user-defined handling, Redshift Spectrum by default sets the value to null (similar to SET_TO_NULL):
When you change the setting of the handling to DROP_ROW, Redshift Spectrum simply drops the row that has an invalid character:
When you change the setting of the handling to FAIL, Redshift Spectrum fails and returns an error:
When you change the setting of the handling to REPLACE and choose a replacement character, Redshift Spectrum replaces the invalid character with the chosen replacement character:
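The settings described in this section could be switched with statements like the following. Treat this as a sketch under assumptions: the table name spectrum_schema.soccer_league is a placeholder, and the replacement character ? is an arbitrary choice. The property names and values are the ones listed earlier in this section.

```sql
-- Placeholder table name; run the same SELECT after each ALTER to compare results.

-- Drop any row that contains an invalid character.
ALTER TABLE spectrum_schema.soccer_league
SET TABLE PROPERTIES ('invalid_char_handling'='DROP_ROW');

-- Fail the query when an invalid UTF-8 value is detected.
ALTER TABLE spectrum_schema.soccer_league
SET TABLE PROPERTIES ('invalid_char_handling'='FAIL');

-- Replace each invalid character with the chosen replacement character.
ALTER TABLE spectrum_schema.soccer_league
SET TABLE PROPERTIES ('invalid_char_handling'='REPLACE', 'replacement_char'='?');

-- Query used to observe the effect of each setting.
SELECT club_name, league_name
FROM spectrum_schema.soccer_league;
```

Because the property lives on the table, each change applies to every subsequent query against it until the property is altered again.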
Surplus character data handling
As mentioned earlier, we defined the columns club_name and league_name shorter than the actual contents of the corresponding fields in the data file.
With the introduction of the new table property surplus_char_handling, you can choose from multiple options:
- DISABLED – Feature is disabled (no handling)
- TRUNCATE – Truncates the value to the column size
- SET_TO_NULL – Replaces the value with null
- DROP_ROW – Drops the whole row
- FAIL – Fails the query when a value is too large for the column
When you disable the user-defined handling, Redshift Spectrum defaults to truncating the surplus characters (similar to TRUNCATE):
When you change the setting of the handling to SET_TO_NULL, Redshift Spectrum simply sets to NULL the column value of any field that is longer than the defined length:
When you change the setting of the handling to DROP_ROW, Redshift Spectrum drops any row containing a field that is longer than the defined length:
When you change the setting of the handling to FAIL, Redshift Spectrum fails and returns an error:
We need to disable the user-defined data handling for this data error before demonstrating the next type of error:
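As a rough illustration of the steps in this section, including the final reset, the property could be changed as follows. The table name is a placeholder, not the name used in the original demonstration.

```sql
-- Placeholder table name.

-- Null out any value that is longer than its column definition.
ALTER TABLE spectrum_schema.soccer_league
SET TABLE PROPERTIES ('surplus_char_handling'='SET_TO_NULL');

-- Rows whose club_name or league_name was too long now show NULL in that column.
SELECT club_name, league_name
FROM spectrum_schema.soccer_league
WHERE club_name IS NULL OR league_name IS NULL;

-- Turn the handling off again before moving on to the next type of error.
ALTER TABLE spectrum_schema.soccer_league
SET TABLE PROPERTIES ('surplus_char_handling'='DISABLED');
```

Resetting to DISABLED keeps surplus-character errors from affecting the results of the numeric overflow demonstration that follows.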
Numeric overflow data handling
For this demonstration, we intentionally defined league_nspi as SMALLINT (with a range from -32,768 to +32,767) to show the available options for data handling.
With the introduction of the new table property numeric_overflow_handling, you can choose from multiple options:
- DISABLED – Feature is disabled (no handling)
- SET_TO_NULL – Replaces the value with null
- DROP_ROW – Replaces each value in the row with NULL
- FAIL – Fails the query when a value is too large for the column
When we look at the source data, we can observe that the top five countries have more points than the SMALLINT field can handle.
When you disable the user-defined handling, Redshift Spectrum defaults to the maximum number the numeric data type can handle; in our case, SMALLINT can handle up to 32,767:
When you choose SET_TO_NULL, Redshift Spectrum sets to null the column with numeric overflow:
When you choose DROP_ROW, Redshift Spectrum drops the row containing the column with numeric overflow:
When you choose FAIL, Redshift Spectrum fails and returns an error:
We need to disable the user-defined data handling for this data error before demonstrating the next type of error:
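The numeric overflow settings follow the same pattern. The following sketch again uses a placeholder table name and shows one setting plus the reset; the other values from the list above can be substituted in the same way.

```sql
-- Placeholder table name.

-- Null out league_nspi values that do not fit in SMALLINT (-32768 to 32767).
ALTER TABLE spectrum_schema.soccer_league
SET TABLE PROPERTIES ('numeric_overflow_handling'='SET_TO_NULL');

-- Overflowing values appear as NULL instead of being capped at 32767.
SELECT club_name, league_nspi
FROM spectrum_schema.soccer_league;

-- Turn the handling off again before moving on.
ALTER TABLE spectrum_schema.soccer_league
SET TABLE PROPERTIES ('numeric_overflow_handling'='DISABLED');
```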
Stop queries at MAXERROR threshold
You can also choose to stop the query if it reaches a certain threshold in errors by using the newly introduced parameter spectrum_query_maxerror:
The following screenshot shows that the query ran successfully.
However, if you decrease this threshold to a lower number, the query fails because it reached the preset threshold:
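A session-level sketch of this behavior might look like the following. The threshold values 7 and 6 and the table name are assumptions for illustration; the values that separate success from failure depend on how many errors your own data file contains.

```sql
-- Placeholder table name; threshold values are illustrative only.

-- Tolerate a limited number of data errors in this session.
SET spectrum_query_maxerror TO 7;

-- Succeeds as long as the scan stays under the threshold.
SELECT club_name, league_name, league_nspi
FROM spectrum_schema.soccer_league;

-- Lower the threshold; the same query is canceled if it reaches the new limit.
SET spectrum_query_maxerror TO 6;

SELECT club_name, league_name, league_nspi
FROM spectrum_schema.soccer_league;
```

Because the parameter is set with SET, it applies to the current session rather than to the table, so it can be tuned per workload without altering the table properties.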
Error logging
With the introduction of the new user-defined data handling feature, we also introduced the new view svl_spectrum_scan_error, which allows you to view a useful sample set of the logs of errors. The table contains the query, file, row, column, error code, handling action that was applied, as well as the original value and the modified (resulting) value. See the following code:
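A minimal way to inspect the view is sketched here. It selects every column rather than naming them, and assumes only that the view exposes a query column, as described above; the limit of 20 rows is arbitrary.

```sql
-- Show the most recent logged scan errors, newest query first.
SELECT *
FROM svl_spectrum_scan_error
ORDER BY query DESC
LIMIT 20;
```

Comparing the original and modified values in the output is a quick way to confirm that the handling action you configured is the one that was applied.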
Clean up
To avoid incurring future charges, complete the following steps:
- Delete the Amazon Redshift cluster created for this demonstration. If you were using an existing cluster, drop the created external table and external schema.
- Delete the S3 bucket.
- Delete the AWS Glue Data Catalog database.
Conclusion
In this post, we demonstrated Redshift Spectrum's newly added feature of user-defined data error handling and showed how this feature provides the flexibility to take a user-defined approach to deal with data exceptions in processing external files. We also demonstrated how the logging enhancements provide transparency on the errors encountered in external data processing without needing to write additional custom code.
We look forward to hearing from you about your experience. If you have questions or suggestions, please leave a comment.
About the Authors
Ahmed Shehata is a Data Warehouse Specialist Solutions Architect with Amazon Web Services, based out of Toronto.
Milind Oke is a Data Warehouse Specialist Solutions Architect based out of New York. He has been building data warehouse solutions for over 15 years and specializes in Amazon Redshift. He is focused on helping customers design and build enterprise-scale well-architected analytics and decision support platforms.