Wednesday, February 15, 2023

Accelerate your data warehouse migration to Amazon Redshift – Part 6


This is the sixth in a series of posts. We're excited to share dozens of new features to automate your schema conversion; preserve your investment in existing scripts, reports, and applications; accelerate query performance; and potentially simplify your migrations from legacy data warehouses to Amazon Redshift.

Amazon Redshift is the cloud data warehouse of choice for tens of thousands of customers who use it to analyze exabytes of data to gain business insights. With Amazon Redshift, you can query data across your data warehouse, operational data stores, and data lake using standard SQL. You can also integrate other AWS services such as Amazon EMR, Amazon Athena, Amazon SageMaker, AWS Glue, AWS Lake Formation, and Amazon Kinesis to use all of the analytic capabilities in the AWS Cloud.

Migrating a data warehouse can be a complex undertaking. Your legacy workload might rely on proprietary features that aren't directly supported by a modern data warehouse like Amazon Redshift. For example, some data warehouses enforce primary key constraints, making a tradeoff with DML performance. Amazon Redshift lets you define a primary key but uses the constraint for query optimization purposes only. If you use Amazon Redshift, or are migrating to Amazon Redshift, you may need a mechanism to check that primary key constraints aren't being violated by extract, transform, and load (ETL) processes.

In this post, we describe two design patterns that you can use to accomplish this efficiently. We also show you how to use the AWS Schema Conversion Tool (AWS SCT) to automatically apply the design patterns to your SQL code.

We start by defining the semantics to address. Then we describe the design patterns and analyze their performance. We conclude by showing you how AWS SCT can automatically convert your code to enforce primary keys.

Primary keys

A primary key (PK) is a set of attributes such that no two rows can have the same value in the PK. For example, the following Teradata table has a two-attribute primary key (emp_id, div_id). Presumably, employee IDs are unique only within divisions.

CREATE TABLE testschema.emp ( 
  emp_id INTEGER NOT NULL
, name VARCHAR(12) NOT NULL
, div_id INTEGER NOT NULL
, job_title VARCHAR(12)
, salary DECIMAL(8,2)
, birthdate DATE NOT NULL
, CONSTRAINT pk_emp_id PRIMARY KEY (emp_id, div_id)
);

Most databases require that a primary key satisfy two criteria:

  • Uniqueness – The PK values are unique over all rows in the table
  • Not NULL – The PK attributes don't accept NULL values

In this post, we focus on how to support the first of these, uniqueness. We describe two design patterns that you can use to develop SQL applications that respect primary keys in Amazon Redshift. Our focus is on INSERT-SELECT statements: customers have told us that INSERT-SELECT operations comprise over 50% of the DML workload against tables with unique constraints. We briefly provide some guidance for other DML statements later in the post.

INSERT-SELECT

In the rest of this post, we dive deep into design patterns for INSERT-SELECT statements. We're concerned with statements of the following form:

INSERT INTO <target table> SELECT * FROM <staging table>

The schema of the staging table is identical to the target table on a column-by-column basis.

A duplicate PK value can be introduced in two scenarios:

  • The staging table contains duplicates, meaning there are two or more rows in the staging data with the same PK value
  • There is a row x in the staging table and a row y in the target table that share the same PK value

Note that these situations are independent. It can be the case that the staging table contains duplicates, that the staging table and target table share a duplicate, or both.

It's imperative that the staging table doesn't contain duplicate PK values. To ensure this, you can apply deduplication logic to the staging table when it's loaded, as sketched below. Alternatively, if your upstream source can guarantee that duplicates have been eliminated before delivery, you can skip this step.
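If you need to perform the deduplication yourself, a window function is one common approach. The following is a minimal sketch, assuming it's acceptable to keep an arbitrary surviving row per duplicate PK value; the stg table and its pk_col and payload columns match the definitions in the next section.

-- Keep exactly one row per pk_col value; which row survives is arbitrary
CREATE TEMP TABLE stg_dedup AS
SELECT pk_col, payload
FROM (
  SELECT
    pk_col
  , payload
  , ROW_NUMBER() OVER (PARTITION BY pk_col ORDER BY pk_col) AS rn
  FROM stg
) t
WHERE rn = 1;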

Join

The first design pattern simply joins the staging and target tables. If any rows are returned, then the staging and target tables share a primary key value.

Suppose we have staging and target tables defined as follows:

CREATE TABLE stg ( 
  pk_col INTEGER 
, payload VARCHAR(100) 
, PRIMARY KEY (pk_col)
); 

CREATE TABLE tgt ( 
  pk_col INTEGER 
, payload VARCHAR(100) 
, PRIMARY KEY (pk_col)
);

We can use the following query to detect any duplicate primary key values:

SELECT count(1) 
FROM stg, tgt 
WHERE tgt.pk_col = stg.pk_col;

If the primary key has multiple columns, then the WHERE condition can be extended:

SELECT count(1)
FROM stg, tgt
WHERE
    tgt.pk_col1 = stg.pk_col1
AND tgt.pk_col2 = stg.pk_col2
AND …
;

There is one complication with this design pattern. If you allow NULL values in the primary key column, then you must add special code to handle the NULL-to-NULL matching:

SELECT count(1)
FROM stg, tgt
WHERE
   (tgt.pk_col = stg.pk_col) 
OR (tgt.pk_col IS NULL AND stg.pk_col IS NULL)
;

This is the main drawback of this design pattern: the code can be ugly and unintuitive. Furthermore, if you have a multicolumn primary key, the code becomes even more complicated.
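For example, a two-column primary key that allows NULLs in both columns requires a predicate like the following. This is a sketch; the columns pk_col1 and pk_col2 are hypothetical and not part of the single-column tables defined above.

SELECT count(1)
FROM stg, tgt
WHERE
    (tgt.pk_col1 = stg.pk_col1
     OR (tgt.pk_col1 IS NULL AND stg.pk_col1 IS NULL))
AND (tgt.pk_col2 = stg.pk_col2
     OR (tgt.pk_col2 IS NULL AND stg.pk_col2 IS NULL))
;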

INTERSECT

The second design pattern that we describe uses the Amazon Redshift INTERSECT operation. INTERSECT is a set-based operation that determines whether two queries have any rows in common. You can refer to UNION, INTERSECT, and EXCEPT in the Amazon Redshift documentation for more information.

We can determine whether the staging and target tables have duplicate PK values using the following query:

SELECT COUNT(1)
FROM (
  SELECT pk_col FROM stg
  INTERSECT
  SELECT pk_col FROM tgt
) a
;

If the primary key is composed of more than one column, you can simply modify the subqueries to include the additional columns:

SELECT COUNT(1)
FROM (
  SELECT pk_col1, pk_col2, …, pk_coln FROM stg
  INTERSECT
  SELECT pk_col1, pk_col2, …, pk_coln FROM tgt
) a
;

The main advantage of this pattern is its simplicity. The code is easier to understand and validate than the join design pattern. INTERSECT handles the NULL-to-NULL matching implicitly, so you don't have to write any special code for NULL values.
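To show how the check might fit into an ETL flow, the following sketch wraps the INTERSECT query in a Redshift stored procedure that loads the target only when no duplicates are found. The procedure name and error message are our own; the stg and tgt tables are as defined above.

CREATE OR REPLACE PROCEDURE load_tgt_if_unique()
AS $$
DECLARE
  dup_count INT;
BEGIN
  -- Count PK values that appear in both the staging and target tables
  SELECT COUNT(1) INTO dup_count
  FROM (
    SELECT pk_col FROM stg
    INTERSECT
    SELECT pk_col FROM tgt
  ) a;

  IF dup_count > 0 THEN
    RAISE EXCEPTION 'Found % duplicate primary key value(s)', dup_count;
  END IF;

  INSERT INTO tgt SELECT * FROM stg;
END;
$$ LANGUAGE plpgsql;

You would then run the load with CALL load_tgt_if_unique();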

Performance

We tested both design patterns using an Amazon Redshift cluster consisting of 12 ra3.4xlarge nodes. Each node contained 12 CPUs and 96 GB of memory.

We created the staging and target tables with the same distribution and sort keys to minimize data redistribution at query time.

We generated the test data artificially using a custom program. The target dataset contained 1 billion rows of data. We ran 10 trials of both algorithms using staging datasets that ranged from 20–200 million rows, in 20-million-row increments.

In the following graph, the join design pattern is shown as a blue line, and the intersect design pattern is shown as an orange line.

You can observe that the performance of both algorithms is excellent. Each is able to detect duplicates in less than 1 second for all trials. The join algorithm outperforms the intersect algorithm, but both perform very well.

So, which algorithm should you choose? If you're developing a new application on Amazon Redshift, the intersect algorithm is probably the best choice. The inherent NULL matching logic and simple, intuitive code make it the better option for new applications.

Conversely, if you need to squeeze every bit of performance out of your application, the join algorithm is your best bet. In that case, you'll have to trade complexity, and perhaps extra effort in code review, for the additional performance.

Automation

If you're migrating an existing application to Amazon Redshift, you can use AWS SCT to automatically convert your SQL code.

Let's see how this works. Suppose you have the following Teradata table. We use it as the target table in an INSERT-SELECT operation.

CREATE MULTISET TABLE testschema.test_pk_tgt (
  pk_col INTEGER NOT NULL
, payload VARCHAR(100) NOT NULL
, PRIMARY KEY (pk_col)
);

The staging table is identical to the target table, with the same columns and data types.

Next, we create a procedure to load the target table from the staging table. The procedure contains a single INSERT-SELECT statement:

REPLACE PROCEDURE testschema.insert_select()
BEGIN
INSERT INTO testschema.test_pk_tgt (pk_col, payload)
SELECT pk_col, payload FROM testschema.test_pk_stg;
END;

Now we use AWS SCT to convert the Teradata stored procedure to Amazon Redshift. First, open Settings, Conversion settings, and confirm that you've selected the option Automate Primary key / Unique constraint. If you don't select this option, AWS SCT won't add the PK check to the converted code.

Next, choose the stored procedure in the source database tree, open the context menu (right-click), and choose Convert schema.

AWS SCT converts the stored procedure (and the embedded INSERT-SELECT) using the join rewrite pattern. Because AWS SCT handles the conversion for you, it uses the join rewrite pattern to take advantage of its performance edge.
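To give a sense of the result, the converted procedure conceptually resembles the following sketch. This is our own illustration of the join rewrite pattern applied to this procedure, not the literal code that AWS SCT emits.

CREATE OR REPLACE PROCEDURE testschema.insert_select()
AS $$
DECLARE
  dup_count INT;
BEGIN
  -- Join-based check ahead of the original INSERT-SELECT (illustrative only)
  SELECT COUNT(1) INTO dup_count
  FROM testschema.test_pk_stg stg, testschema.test_pk_tgt tgt
  WHERE tgt.pk_col = stg.pk_col;

  IF dup_count > 0 THEN
    RAISE EXCEPTION 'Primary key constraint violation on test_pk_tgt';
  END IF;

  INSERT INTO testschema.test_pk_tgt (pk_col, payload)
  SELECT pk_col, payload FROM testschema.test_pk_stg;
END;
$$ LANGUAGE plpgsql;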

And that's it, it's that simple. If you're migrating from Oracle or Teradata, you can use AWS SCT to convert your INSERT-SELECT statements now. We'll be adding support for more data warehouse engines soon.

In this post, we focused on INSERT-SELECT statements, but we're also happy to report that AWS SCT can enforce primary key constraints for INSERT-VALUE and UPDATE statements. AWS SCT injects the appropriate SELECT statement into your code to determine whether the INSERT-VALUE or UPDATE would create duplicate primary key values. Download the latest version of AWS SCT and give it a try!
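For example, before an UPDATE that changes a primary key value, a pre-check of the following form answers whether the new value already exists in the target. This is our own illustration, not the exact SELECT that AWS SCT injects.

-- Pre-check for: UPDATE testschema.test_pk_tgt SET pk_col = 100 WHERE pk_col = 200;
-- A count greater than 0 means the UPDATE would create a duplicate PK value.
SELECT COUNT(1)
FROM testschema.test_pk_tgt
WHERE pk_col = 100;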

Conclusion

In this post, we showed you how to enforce primary keys in Amazon Redshift. If you're implementing a new application in Amazon Redshift, you can use the design patterns in this post to enforce the constraints as part of your ETL flow.

Also, if you're migrating from an Oracle or Teradata database, you can use AWS SCT to automatically convert your SQL to Amazon Redshift. AWS SCT will inject additional code into your SQL flow to enforce your unique key constraints, and thereby insulate your application code from any related changes.

We're happy to share these updates to help you with your data warehouse migration projects. In the meantime, you can learn more about Amazon Redshift and AWS SCT. Happy migrating!


About the authors

Michael Soo is a Principal Database Engineer with the AWS Database Migration Service team. He builds products and services that help customers migrate their database workloads to the AWS cloud.

Illia Kravtsov is a Database Developer with the AWS Project Delta Migration team. He has 10+ years of experience in data warehouse development with Teradata and other MPP databases.
