
Export JSON data to Amazon S3 using Amazon Redshift UNLOAD


Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL. Amazon Redshift offers up to three times better price performance than any other cloud data warehouse. Tens of thousands of customers use Amazon Redshift to process exabytes of data per day and power analytics workloads such as high-performance business intelligence (BI) reporting, dashboarding applications, data exploration, and real-time analytics.

As the amount of data generated by IoT devices, social media, and cloud applications continues to grow, organizations want to analyze this data easily and cost-effectively, with minimal time to insight. A large portion of this data is available in semi-structured format and needs additional extract, transform, and load (ETL) processes to make it accessible or to integrate it with structured data for analysis. Amazon Redshift powers the modern data architecture, which enables you to query data across your data warehouse, data lake, and operational databases to gain faster and deeper insights not otherwise possible. With a modern data architecture, you can store data in semi-structured format in your Amazon Simple Storage Service (Amazon S3) data lake and integrate it with structured data in Amazon Redshift. This makes the data available to other analytics and machine learning applications rather than locking it in a silo.

In this post, we discuss the UNLOAD feature in Amazon Redshift and how to export data from an Amazon Redshift cluster to JSON files in an Amazon S3 data lake.

JSON support features in Amazon Redshift

Amazon Redshift features such as COPY, UNLOAD, and Amazon Redshift Spectrum enable you to move and query data between your data warehouse and data lake.

With the UNLOAD command, you can export a query result set in text, JSON, or Apache Parquet file format to Amazon S3. The UNLOAD command is also recommended when you need to retrieve large result sets from your data warehouse. Because UNLOAD processes and exports data in parallel from Amazon Redshift's compute nodes to Amazon S3, it reduces network overhead and therefore the time required to read a large number of rows. When using the JSON option with UNLOAD, Amazon Redshift unloads to a JSON file with each line containing a JSON object, representing a full record in the query result. In the JSON file, Amazon Redshift types are unloaded as their closest JSON representation. For example, Boolean values are unloaded as true or false, NULL values are unloaded as null, and timestamp values are unloaded as strings. If a default JSON representation doesn't suit a particular use case, you can modify it by casting to the desired type in the SELECT query of the UNLOAD statement.
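As a minimal sketch of that casting approach (the orders table, its columns, and the bucket placeholder below are assumed for illustration and don't appear elsewhere in this post), a timestamp can be converted to a formatted string in the SELECT query of the UNLOAD statement:

-- Hypothetical sketch: cast a timestamp column to a string representation before unloading as JSON
unload ('select o_orderkey,
    to_char(o_ordertimestamp, ''YYYY-MM-DD HH24:MI:SS'') as o_order_ts_str
    from orders')
to 's3://<<Your Amazon S3 Bucket>>/json-cast-example/'
FORMAT JSON
iam_role '<<AWS IAM role attached to your Amazon Redshift cluster>>';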

Additionally, to create a valid JSON object, the name of each column in the query result must be unique. If the column names in the query result aren't unique, the JSON UNLOAD process fails. To avoid this, we recommend using proper column aliases so that each column in the query result remains unique while being unloaded. We illustrate this behavior later in this post.

With the Amazon Redshift SUPER data type, you can store data in JSON format in native Amazon Redshift tables. This way, you can process the data without any network overhead and use Amazon Redshift schema properties to optimally store and query semi-structured data locally. In addition to achieving low latency, you can also use the SUPER data type when your query requires strong consistency, predictable query performance, complex query support, and ease of use with evolving schemas and schemaless data. Amazon Redshift supports writing nested JSON when the query result contains SUPER columns.
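For example, a SUPER column can be navigated with PartiQL-style dot and bracket notation directly in SQL. The following sketch assumes a hypothetical events table with a SUPER column named payload; it is not one of the tables used later in this post:

-- Hypothetical sketch: navigate a SUPER column and unnest an array stored inside it
SELECT e.payload.device_id,
       r.reading_value
FROM events e, e.payload.readings r
WHERE e.payload.device_id IS NOT NULL;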

Updating and maintaining data with constantly evolving schemas can be challenging and adds extra ETL steps to the analytics pipeline. The JSON file format provides support for schema definition, is lightweight, and is widely used as a data transfer mechanism by different services, tools, and technologies.

Amazon OpenSearch Service (successor to Amazon Elasticsearch Service) is a distributed, open-source search and analytics suite used for a broad set of use cases like real-time application monitoring, log analytics, and website search. It uses JSON as the supported file format for data ingestion. The ability to unload data natively in JSON format from Amazon Redshift into the Amazon S3 data lake reduces complexity and additional data processing steps if that data needs to be ingested into Amazon OpenSearch Service for further analysis.

This is one example of how seamless data movement can help you build an integrated data platform with a data lake on Amazon S3, a data warehouse on Amazon Redshift, and search and log analytics using Amazon OpenSearch Service or any other JSON-oriented downstream analytics solution. For more information about the Lake House approach, see Build a Lake House Architecture on AWS.

Examples of Amazon Redshift JSON UNLOAD

In this post, we show you the following scenarios:

  • Example 1 – Unload customer data in JSON format into Amazon S3, partitioning output files into partition folders, following the Apache Hive convention, with customer birth month as the partition key. We make a few changes to the columns in the SELECT statement of the UNLOAD command:
    • Convert the c_preferred_cust_flag column from character to Boolean
    • Remove leading and trailing spaces from the c_first_name, c_last_name, and c_email_address columns using the Amazon Redshift built-in function btrim
  • Example 2 – Unload line item data (with a SUPER column) in JSON format into Amazon S3 with data not partitioned
  • Example 3 – Unload line item data (with a SUPER column) in JSON format into Amazon S3, partitioning output files into partition folders, following the Apache Hive convention, with customer key as the partition key

For the first example, we used the customer table and data from the TPC-DS dataset. For the examples involving a table with a SUPER column, we used the customer_orders_lineitem table and data from the following tutorial.

Example 1: Export customer data

For this example, we used the customer table and data from the TPC-DS dataset. We created the database schema and customer table, and copied data into it. See the following code:

-- Created a new schema
create schema json_unload_demo; 

-- Created and populated the customer table in the new schema

create table json_unload_demo.customer
(
  c_customer_sk int4 not null ,                 
  c_customer_id char(16) not null ,             
  c_current_cdemo_sk int4 ,   
  c_current_hdemo_sk int4 ,   
  c_current_addr_sk int4 ,    
  c_first_shipto_date_sk int4 ,                 
  c_first_sales_date_sk int4 ,
  c_salutation char(10) ,     
  c_first_name char(20) ,     
  c_last_name char(30) ,      
  c_preferred_cust_flag char(1) ,               
  c_birth_day int4 ,          
  c_birth_month int4 ,        
  c_birth_year int4 ,         
  c_birth_country varchar(20) ,                 
  c_login char(13) ,          
  c_email_address char(50) ,  
  c_last_review_date_sk int4 ,
  primary key (c_customer_sk)
) distkey(c_customer_sk);

copy json_unload_demo.customer from 's3://redshift-downloads/TPC-DS/2.13/3TB/customer/' 
iam_role '<<AWS IAM role attached to your Amazon Redshift cluster>>' 
gzip delimiter '|' EMPTYASNULL;

You can create a default AWS Identity and Access Management (IAM) role for your Amazon Redshift cluster to copy from and unload to your Amazon S3 location. For more information, see Use the default IAM role in Amazon Redshift to simplify accessing other AWS services.
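If you attach a default IAM role to your cluster, the COPY and UNLOAD statements in this post can reference it with the default keyword instead of an explicit role ARN. For example, the earlier COPY could be written as follows (a sketch, assuming the default role has access to the source bucket):

-- Sketch: reference the cluster's default IAM role instead of a role ARN
copy json_unload_demo.customer from 's3://redshift-downloads/TPC-DS/2.13/3TB/customer/'
iam_role default
gzip delimiter '|' EMPTYASNULL;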

In this example, we unload customer data for all customers with birth year 1992 in JSON format into Amazon S3, partitioned by customer birth month. We make the following changes in the UNLOAD statement:

  • Convert the c_preferred_cust_flag column from character to Boolean
  • Remove leading and trailing spaces from the c_first_name, c_last_name, and c_email_address columns using the btrim function
  • Set the maximum size of exported files in Amazon S3 to 64 MB

See the following code:

unload ('SELECT c_customer_sk,
    c_customer_id ,
    c_current_cdemo_sk ,
    c_current_hdemo_sk ,
    c_current_addr_sk ,
    c_first_shipto_date_sk ,
    c_first_sales_date_sk ,
    c_salutation ,
    btrim(c_first_name),
    btrim(c_last_name),
    c_birth_day ,
    c_birth_month ,
    c_birth_year ,
    c_birth_country ,
    c_last_review_date_sk,
    DECODE(c_preferred_cust_flag, ''Y'', TRUE, ''N'', FALSE)::boolean as c_preferred_cust_flag_bool,
    c_login, 
    btrim(c_email_address) 
    from customer where c_birth_year = 1992;')
to 's3://<<Your Amazon S3 Bucket>>/non-partitioned/non-super/customer/' 
FORMAT JSON 
partition by (c_birth_month) include
iam_role '<<AWS IAM role attached to your Amazon Redshift cluster>>'
MAXFILESIZE 64 MB;

When we ran the UNLOAD command, we encountered an error because the columns where the btrim function was applied all defaulted to the same output column name, btrim (the default behavior in Amazon Redshift when the same function is applied to multiple columns selected together). To avoid this error, we need to use a unique column alias for each column where the btrim function is used.

To demonstrate, we selected the c_first_name, c_last_name, and c_email_address columns with the btrim function applied, along with c_preferred_cust_flag converted from character to Boolean.

We ran the following query in Amazon Redshift Query Editor v2:

SELECT btrim(c_first_name) ,
    btrim(c_last_name),
    btrim(c_email_address) , 
    DECODE(c_preferred_cust_flag, 'Y', TRUE, 'N', FALSE)::boolean c_preferred_cust_flag_bool  
    from customer where c_birth_year = 1992 limit 10; 

All three columns that used the btrim function are named btrim in the output result instead of their respective column names.

An error occurred in UNLOAD because we didn't use column aliases.

We added column aliases in the following code:

unload ('SELECT c_customer_sk,
    c_customer_id ,
    c_current_cdemo_sk ,
    c_current_hdemo_sk ,
    c_current_addr_sk ,
    c_first_shipto_date_sk ,
    c_first_sales_date_sk ,
    c_salutation ,
    btrim(c_first_name) as c_first_name,
    btrim(c_last_name) as c_last_name,
    c_birth_day ,
    c_birth_month ,
    c_birth_year ,
    c_birth_country ,
    c_last_review_date_sk,
    DECODE(c_preferred_cust_flag, ''Y'', TRUE, ''N'', FALSE)::boolean as c_preferred_cust_flag_bool,
    c_login, 
    btrim(c_email_address) as c_email_addr_trimmed 
    from customer where c_birth_year = 1992;')
to 's3://<<Your Amazon S3 Bucket>>/non-partitioned/non-super/customer/' 
FORMAT JSON 
partition by (c_birth_month) include
iam_role '<<AWS IAM role attached to your Amazon Redshift cluster>>'
MAXFILESIZE 64 MB;

After we added the column aliases, the UNLOAD command completed successfully and the files were exported to the desired location in Amazon S3.

The following screenshot shows that data from the Amazon Redshift customer table is unloaded into Amazon S3 in JSON format, with output files partitioned into partition folders following the Apache Hive convention, using customer birth month as the partition key.

A query with Amazon S3 Select shows a snippet of the data in the JSON file that was unloaded to Amazon S3.

The column aliases c_first_name, c_last_name, and c_email_addr_trimmed were written into the JSON file as specified in the SELECT query. Boolean values were stored in c_preferred_cust_flag_bool as well.

Examples 2 and 3: Using the SUPER column

For the next two examples, we used the customer_orders_lineitem table and data. We created the customer_orders_lineitem table and copied data into it with the following code:

-- Created a new table with a SUPER column

CREATE TABLE JSON_unload_demo.customer_orders_lineitem
(c_custkey bigint
,c_name varchar
,c_address varchar
,c_nationkey smallint
,c_phone varchar
,c_acctbal decimal(12,2)
,c_mktsegment varchar
,c_comment varchar
,c_orders super
);

-- Loaded data into the new table
COPY json_unload_demo.customer_orders_lineitem 
FROM 's3://redshift-downloads/semistructured/tpch-nested/data/json/customer_orders_lineitem'
IAM_ROLE '<<AWS IAM role attached to your Amazon Redshift cluster>>'
FORMAT JSON 'auto';

Next, we ran a few queries to explore the customer_orders_lineitem table's data:

select * from json_unload_demo.customer_orders_lineitem;

select c_orders from json_unload_demo.customer_orders_lineitem;

SELECT attr as attribute_name, val as object_value FROM json_unload_demo.customer_orders_lineitem c, c.c_orders o, UNPIVOT o AS val AT attr;

Example 2: Without partitions

In this example, we unloaded all the rows of the customer_orders_lineitem table in JSON format into Amazon S3 without any partitions:

unload ('select * from json_unload_demo.customer_orders_lineitem;')
to 's3://<<Your Amazon S3 Bucket>>/non-partitioned/super/customer-order-lineitem/'
FORMAT JSON
iam_role '<<AWS IAM role attached to your Amazon Redshift cluster>>';

After we run the UNLOAD command, the data is available in the desired Amazon S3 location. The following screenshot shows that data is unloaded from the Amazon Redshift customer_orders_lineitem table into Amazon S3 in JSON format without any partitions.

A query with Amazon S3 Select shows a snippet of the data in the JSON file that was unloaded to Amazon S3.

Example 3: With partitions

In this example, we unloaded all the rows of the customer_orders_lineitem table in JSON format into Amazon S3, partitioning output files into partition folders following the Apache Hive convention, with customer key as the partition key:

unload ('select * from json_unload_demo.customer_orders_lineitem;')
to 's3://<<Your Amazon S3 Bucket>>/partitioned/super/customer-order-lineitem-1/'
FORMAT JSON
partition by (c_custkey) include
iam_role '<<AWS IAM role attached to your Amazon Redshift cluster>>';

After we run the UNLOAD command, the data is available in the desired Amazon S3 location. The following screenshot shows that data is unloaded from the Amazon Redshift customer_orders_lineitem table into Amazon S3 in JSON format, with output files partitioned into partition folders following the Apache Hive convention and customer key as the partition key.

A query with Amazon S3 Select shows a snippet of the data in the JSON file that was unloaded.

Conclusion

In this post, we showed how you can use the Amazon Redshift UNLOAD command to unload the result of a query to one or more JSON files in your Amazon S3 location. We also showed how you can partition the data using your choice of partition key while you unload it. You can use this feature to export data as JSON files to Amazon S3 from your Amazon Redshift cluster or your Amazon Redshift Serverless endpoint to make your data processing simpler and build an integrated data analytics platform.


About the Authors

Dipankar Kushari is a Senior Analytics Solutions Architect with AWS.

Sayali Jojan is a Senior Analytics Solutions Architect with AWS. She has 7 years of experience working with customers to design and build solutions on the AWS Cloud, with a focus on data and analytics.

Cody Cunningham is a Software Development Engineer with AWS, working on data ingestion for Amazon Redshift.
