
Use the AWS Glue connector to read and write Apache Iceberg tables with ACID transactions and perform time travel


Today, many customers have built data lakes as the core of their data analytics systems. In a typical data lake use case, many concurrent queries run to retrieve consistent snapshots of business insights by aggregating query results. A large volume of data constantly arrives from different data sources into the data lakes. There is also a common demand to reflect the changes occurring in the data sources into the data lakes. This means that not only inserts but also updates and deletes need to be replicated into the data lakes.

Apache Iceberg provides ACID transaction capabilities for your data lakes, which allow concurrent queries to add or delete records in isolation from any existing queries, with read consistency. Iceberg is an open table format designed for large analytic workloads on huge datasets. You can perform ACID transactions against your data lakes by using simple SQL expressions. It also enables time travel, rollback, hidden partitioning, and schema evolution changes, such as adding, dropping, renaming, updating, and reordering columns.

AWS Glue is one of the key components for building data lakes. It extracts data from multiple sources and ingests your data into your data lake built on Amazon Simple Storage Service (Amazon S3) using both batch and streaming jobs. To broaden the accessibility of your AWS Glue extract, transform, and load (ETL) jobs to Iceberg, AWS Glue provides an Apache Iceberg connector. The connector allows you to build Iceberg tables in your data lakes and run Iceberg operations such as ACID transactions, time travel, and rollbacks from your AWS Glue ETL jobs.

In this post, we give an overview of how to set up the Iceberg connector for AWS Glue and configure the relevant resources to use Iceberg with AWS Glue jobs. We also demonstrate how to run typical Iceberg operations on AWS Glue interactive sessions with an example use case.

Apache Iceberg connector for AWS Glue

With the Apache Iceberg connector for AWS Glue, you can take advantage of the following Iceberg capabilities:

  • Basic operations on Iceberg tables – This includes creating Iceberg tables in the AWS Glue Data Catalog and inserting, updating, and deleting records with ACID transactions in the Iceberg tables
  • Inserting and updating records – You can run UPSERT (update and insert) queries against your Iceberg table
  • Time travel on Iceberg tables – You can read a specific version of an Iceberg table from the table snapshots that Iceberg manages
  • Rollback of table versions – You can revert an Iceberg table back to a specific version of the table

Iceberg offers more useful capabilities such as hidden partitioning; schema evolution with add, drop, update, and rename support; automatic data compaction; and more. For more details about Iceberg, refer to the Apache Iceberg documentation.

Next, we demonstrate how the Apache Iceberg connector for AWS Glue works for each Iceberg capability based on an example use case.

Overview of example customer scenario

Let's assume that an ecommerce company sells products on their online platform. Customers can buy products and write reviews for each product. Customers can add, update, or delete their reviews at any time. The customer reviews are an important source for analyzing customer sentiment and business trends.

In this scenario, we have the following teams in our organization:

  • Data engineering team – Responsible for building and managing data platforms.
  • Data analyst team – Responsible for analyzing customer reviews and creating business reports. This team queries the reviews daily, creates a business intelligence (BI) report, and shares it with the sales team.
  • Customer support team – Responsible for replying to customer inquiries. This team queries the reviews when they get inquiries about the reviews.

Our solution has the following requirements:

  • Query scalability is important because the website is huge.
  • Individual customer reviews can be added, updated, and deleted.
  • The data analyst team needs to use both notebooks and ad hoc queries for their analysis.
  • The customer support team sometimes needs to view the history of the customer reviews.
  • Customer reviews can always be added, updated, and deleted, even while one of the teams is querying the reviews for analysis. This means that the result of a query isn't affected by uncommitted customer review write operations.
  • Any changes in customer reviews made by the organization's various teams need to be reflected in BI reports and query results.

In this post, we build a data lake of customer review data on top of Amazon S3. To meet these requirements, we introduce Apache Iceberg to enable adding, updating, and deleting records; ACID transactions; and time travel queries. We also use an AWS Glue Studio notebook to integrate and query the data at scale. First, we set up the connector so we can create an AWS Glue connection for Iceberg.

Set up the Apache Iceberg connector and create the Iceberg connection

We first set up the Apache Iceberg connector for AWS Glue so we can use Apache Iceberg with AWS Glue jobs. Specifically, in this section, we set up the Apache Iceberg connector for AWS Glue and create an AWS Glue job with the connector. Complete the following steps:

  1. Navigate to the Apache Iceberg connector for AWS Glue page in AWS Marketplace.
  2. Choose Continue to Subscribe.

  3. Review the information under Terms and Conditions, and choose Accept Terms to continue.

  4. When the subscription is complete, choose Continue to Configuration.

  5. For Fulfillment option, choose Glue 3.0. (1.0 and 2.0 are also available options.)
  6. For Software version, choose the latest software version.

As of this writing, 0.12.0-2 is the latest version of the Apache Iceberg connector for AWS Glue.

  7. Choose Continue to Launch.

  8. Choose Usage instructions.
  9. Choose Activate the Glue connector from AWS Glue Studio.

You’re redirected to AWS Glue Studio.

  10. For Name, enter a name for your connection (for example, iceberg-connection).

  11. Choose Create connection and activate connector.

A message appears that the connection was successfully added, and the connection is now visible on the AWS Glue Studio console.

Configure resources and permissions

We use a provided AWS CloudFormation template to set up the Iceberg configuration for AWS Glue. AWS CloudFormation creates the following resources:

  • An S3 bucket to store an Iceberg configuration file and actual data
  • An AWS Lambda function to generate an Iceberg configuration file based on parameters provided by a user for the CloudFormation template, and to clean up the resources created through this post
  • AWS Identity and Access Management (IAM) roles and policies with necessary permissions
  • An AWS Glue database in the Data Catalog to register Iceberg tables

To deploy the CloudFormation template, complete the following steps:

  1. Choose Launch Stack:

Launch Button

  2. For DynamoDBTableName, enter a name for an Amazon DynamoDB table that is created automatically when AWS Glue creates an Iceberg table.

This table is used by an AWS Glue job to obtain a commit lock and avoid concurrent modification of records in Iceberg tables. For more details about commit locking, refer to DynamoDB for Commit Locking. Note that you shouldn't specify the name of an existing table.

  3. For IcebergDatabaseName, enter a name for the AWS Glue database that is created in the Data Catalog and used for registering Iceberg tables.
  4. Choose Next.

  5. Select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
  6. Choose Create stack.

Start an AWS Glue Studio notebook to use Apache Iceberg

After you launch the CloudFormation stack, you create an AWS Glue Studio notebook to perform the Iceberg operations. Complete the following steps:

  1. Download the Jupyter notebook file.
  2. On the AWS Glue console, choose Jobs in the navigation pane.
  3. Under Create job, select Jupyter Notebook.

  4. Select Upload and edit an existing notebook and upload iceberg-with-glue.ipynb.

  5. Choose Create.
  6. For Job name, enter a name.
  7. For IAM Role, choose IcebergConnectorGlueJobRole, which was created via the CloudFormation template.
  8. Choose Start notebook job.

The process takes a few minutes to complete, after which you can see an AWS Glue Studio notebook view.

  9. Choose Save to save the notebook.

Set up the Iceberg configuration

To set up the Iceberg configuration, complete the following steps:

  1. Run the following cells with multiple options (magics). Note that you set your connection name for the %connections magic in the cell.

For more information, refer to Configuring AWS Glue Interactive Sessions for Jupyter and AWS Glue Studio notebooks.
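
The following is a minimal sketch of what those magic cells might look like. The specific values here (the connection name iceberg-connection, worker count, and worker type) are illustrative assumptions, not fixed requirements; set them to match your own connection and capacity needs.

%session_id_prefix iceberg-with-glue
%glue_version 3.0
%idle_timeout 60
%connections iceberg-connection
%number_of_workers 5
%worker_type G.1X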

A message Session <session-id> has been created appears when your AWS Glue Studio notebook is ready.

In the last cell of this section, you load your Iceberg configuration, which you specified when launching the CloudFormation stack. The Iceberg configuration includes a warehouse path for Iceberg actual data, a DynamoDB table name for commit locking, a database name for your Iceberg tables, and more.

To load the configuration, set the S3 bucket name that was created via the CloudFormation stack.

  1. On the AWS CloudFormation console, choose Stacks in the navigation pane.
  2. Choose the stack you created.
  3. On the Outputs tab, copy the S3 bucket name.

  4. Set the S3 bucket name as the S3_BUCKET parameter in your notebook.

  5. Run the cell to load the Iceberg configuration that you set, as in the sketch that follows.
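
As a rough sketch, that cell could load the configuration from the bucket as follows. The object key config/user_config.json and the variable names are purely illustrative assumptions; use the locations defined in the provided notebook.

import json
import boto3

# Bucket name copied from the CloudFormation Outputs tab
S3_BUCKET = 'your-iceberg-bucket-from-cloudformation-outputs'

# Hypothetical key for the generated Iceberg configuration file
s3 = boto3.resource('s3')
config_object = s3.Object(S3_BUCKET, 'config/user_config.json')
user_config = json.loads(config_object.get()['Body'].read().decode('utf-8'))
print(user_config)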

Initialize the job with the Iceberg configurations

In this section, we continue running cells to initiate a SparkSession.

  1. Set an Iceberg warehouse path and a DynamoDB table name for Iceberg commit locking from the user_config parameter.
  2. Initialize a SparkSession by setting the Iceberg configurations.
  3. With the SparkSession object, create SparkContext and GlueContext objects.

The following screenshot shows the relevant section in the notebook.

We provide the details of each parameter that you configure for the SparkSession in the appendix of this post.

For this post, we demonstrate setting the Spark configuration for Iceberg. You can also set the configuration as AWS Glue job parameters. For more information, refer to the Usage Information section on the Iceberg connector product page.
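
The following is a condensed sketch of that initialization, built from the configuration keys listed in the appendix of this post. The catalog name glue_catalog and the user_config keys are placeholders, not the exact names used in the provided notebook.

from pyspark.sql import SparkSession
from awsglue.context import GlueContext

CATALOG = 'glue_catalog'                        # placeholder catalog name
WAREHOUSE_PATH = user_config['warehouse_path']  # assumed key: S3 path for Iceberg data and metadata
DYNAMODB_TABLE = user_config['dynamodb_table']  # assumed key: DynamoDB table for commit locking

spark = SparkSession.builder \
    .config(f'spark.sql.catalog.{CATALOG}', 'org.apache.iceberg.spark.SparkCatalog') \
    .config(f'spark.sql.catalog.{CATALOG}.warehouse', WAREHOUSE_PATH) \
    .config(f'spark.sql.catalog.{CATALOG}.catalog-impl', 'org.apache.iceberg.aws.glue.GlueCatalog') \
    .config(f'spark.sql.catalog.{CATALOG}.io-impl', 'org.apache.iceberg.aws.s3.S3FileIO') \
    .config(f'spark.sql.catalog.{CATALOG}.lock-impl', 'org.apache.iceberg.aws.glue.DynamoLockManager') \
    .config(f'spark.sql.catalog.{CATALOG}.lock.table', DYNAMODB_TABLE) \
    .config('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions') \
    .config('spark.sql.session.timeZone', 'UTC') \
    .getOrCreate()

# Create SparkContext and GlueContext objects from the SparkSession
sc = spark.sparkContext
glueContext = GlueContext(sc)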

Use case walkthrough

To walk through our use case, we use two tables: acr_iceberg and acr_iceberg_report. The table acr_iceberg contains the customer review data. The table acr_iceberg_report contains BI analysis results based on the customer review data. All changes to acr_iceberg also affect acr_iceberg_report. The table acr_iceberg_report needs to be updated daily, right before sharing business reports with stakeholders.

To demonstrate this use case, we walk through the following typical steps:

  1. A data engineering team registers the acr_iceberg and acr_iceberg_report tables in the Glue Data Catalog.
  2. Customers (ecommerce users) add reviews to products in the Industrial_Supplies category. These reviews are added to the Iceberg table.
  3. A customer requests to update their review. We simulate updating the customer review in the acr_iceberg table.
  4. We reflect the customer's updated review in acr_iceberg into acr_iceberg_report.
  5. We revert the customer's updated review in the customer review table acr_iceberg, and reflect the reversion in acr_iceberg_report.

1. Create Iceberg tables of customer reviews and BI reports

In this step, the data engineering team creates the acr_iceberg Iceberg table for customer review data (based on the Amazon Customer Reviews Dataset), and the team creates the acr_iceberg_report Iceberg table for BI reports.

Create the acr_iceberg table for customer reviews

The following code first extracts the Amazon customer reviews, which are stored in a public S3 bucket. Then it creates an Iceberg table of the customer reviews and loads these reviews into your specified S3 bucket (created via the CloudFormation stack). Note that the script loads partial datasets to avoid taking a lot of time to load the data.

# Loading the dataset and creating an Iceberg table. This will take about 3-5 minutes.
spark.read \
    .option('basePath', INPUT_BASE_PATH) \
    .parquet(*INPUT_CATEGORIES) \
    .writeTo(f'{CATALOG}.{DATABASE}.{TABLE}') \
    .tableProperty('format-version', '2') \
    .create()

Regarding the tableProperty parameter, we specify format version 2 to make the table version compatible with Amazon Athena. For more information about Athena support for Iceberg tables, refer to Considerations and limitations. To learn more about the difference between Iceberg table format versions 1 and 2, refer to Appendix E: Format version changes.

Let's run the following cells. Running the second cell takes around 3–5 minutes.

After you run the cells, the acr_iceberg table is available in your specified database in the Glue Data Catalog.

You can also see the actual data and metadata of the Iceberg table in the S3 bucket that was created through the CloudFormation stack. Iceberg creates the table and writes actual data and related metadata that includes the table schema, table version information, and so on. See the following objects in your S3 bucket:

$ aws s3 ls 's3://your-bucket/data/' --recursive
YYYY-MM-dd hh:mm:ss   83616660 data/iceberg_blog_default.db/acr_iceberg/data/00000-44-c2983230-c43a-4f4a-9b89-1f7c13e59645-00001.parquet
YYYY-MM-dd hh:mm:ss   83247771 
...
YYYY-MM-dd hh:mm:ss       5134 data/iceberg_blog_default.db/acr_iceberg/metadata/00000-bc5d3ea2-280f-4e28-a71f-4c2b749ed637.metadata.json
YYYY-MM-dd hh:mm:ss     116950 data/iceberg_blog_default.db/acr_iceberg/metadata/411308cd-1f4d-4535-9444-f6b56a56697f-m0.avro
YYYY-MM-dd hh:mm:ss       3821 data/iceberg_blog_default.db/acr_iceberg/metadata/snap-6122957686233868728-1-411308cd-1f4d-4535-9444-f6b56a56697f.avro

The job tries to create the DynamoDB table that you specified in the CloudFormation stack (in the following screenshot, its name is myGlueLockTable) if it doesn't already exist. As we discussed earlier, the DynamoDB table is used for commit locking for Iceberg tables.

Create the acr_iceberg_report Iceberg table for BI reports

The data engineering team also creates the acr_iceberg_report table for BI reports in the Glue Data Catalog. This table initially has the following records.

comment_count | avg_star         | product_category
1240          | 4.20729367860598 | Camera
95            | 4.80167540490342 | Industrial_Supplies
663           | 3.80123467540571 | PC

To create the table, run the following cell.
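
A sketch of what that cell might contain, assuming the report table is created from a small DataFrame holding the initial records above (the column types and the write call are assumptions; the exact cell in the provided notebook may differ):

# Create the acr_iceberg_report Iceberg table from the initial BI records
initial_report = spark.createDataFrame(
    [
        (1240, 4.20729367860598, 'Camera'),
        (95, 4.80167540490342, 'Industrial_Supplies'),
        (663, 3.80123467540571, 'PC'),
    ],
    ['comment_count', 'avg_star', 'product_category'],
)

initial_report.writeTo(f'{CATALOG}.{DATABASE}.acr_iceberg_report') \
    .tableProperty('format-version', '2') \
    .create()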

The two Iceberg tables have now been created. Let's check the acr_iceberg table records by running a query.

Determine the average star rating for each product category by querying the Iceberg table

You can see the Iceberg table records by using a SELECT statement. In this section, we query the acr_iceberg table to simulate viewing current BI report data by running an ad hoc query.

Run the following cell in the notebook to get the aggregated number of customer comments and the mean star rating for each product_category.
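
The query in that cell is roughly the following aggregation (a sketch; the catalog, database, and table variables follow the ones used when the table was created):

# Aggregate the number of reviews and the mean star rating per product category
spark.sql(f'''
    SELECT product_category,
           COUNT(*) AS comment_count,
           AVG(star_rating) AS avg_star
    FROM {CATALOG}.{DATABASE}.{TABLE}
    GROUP BY product_category
    ORDER BY product_category
''').show(truncate=False)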

The cell output shows the following results.

Another way to query Iceberg tables is by using Amazon Athena (when you use Athena with Iceberg tables, you need to set up the Iceberg environment) or Amazon EMR.

2. Add customer reviews to the Iceberg table

In this section, customers add comments for some products in the Industrial Supplies product category, and we add these comments to the acr_iceberg table. To demonstrate this scenario, we create a Spark DataFrame based on the following new customer reviews and then add them to the table with an INSERT statement.

marketplace | customer_id | review_id | product_id | product_parent | product_title | star_rating | helpful_votes | total_votes | vine | verified_purchase | review_headline | review_body | review_date | year | product_category
US | 12345689 | ISB35E4556F144 | I00EDBY7X8 | 989172340 | plastic containers | 5 | 0 | 0 | N | Y | 5 Stars | Great product! | 2022-02-01 | 2022 | Industrial_Supplies
US | 78901234 | IS4392CD4C3C4 | I00D7JFOPC | 952000001 | battery tester | 3 | 0 | 0 | N | Y | good one, but it broke some days later | nope | 2022-02-01 | 2022 | Industrial_Supplies
US | 12345123 | IS97B103F8B24C | I002LHA74O | 818426953 | spray bottle | 2 | 1 | 1 | N | N | Two Stars | the bottle isn't as big as pictured. | 2022-02-01 | 2022 | Industrial_Supplies
US | 23000093 | ISAB4268D46F3X | I00ARPLCGY | 562945918 | 3d printer | 5 | 3 | 3 | N | Y | Super great | very useful | 2022-02-01 | 2022 | Industrial_Supplies
US | 89874312 | ISAB4268137V2Y | I80ARDQCY | 564669018 | circuit board | 4 | 0 | 0 | Y | Y | Great, but a little bit expensive | you should buy this, but watch the price | 2022-02-01 | 2022 | Industrial_Supplies

Run the following cells in the notebook to insert the customer comments into the Iceberg table. The process takes about 1 minute.
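
Those cells roughly follow this pattern: build a DataFrame from the new reviews (two of the five rows are shown here for brevity), register it as a temporary view, and append it with an INSERT INTO statement. The column types here are simplified assumptions and may differ from the actual table schema.

# Build a DataFrame of the new Industrial_Supplies reviews and append it to acr_iceberg
columns = ['marketplace', 'customer_id', 'review_id', 'product_id', 'product_parent',
           'product_title', 'star_rating', 'helpful_votes', 'total_votes', 'vine',
           'verified_purchase', 'review_headline', 'review_body', 'review_date', 'year',
           'product_category']
rows = [
    ('US', '12345689', 'ISB35E4556F144', 'I00EDBY7X8', '989172340', 'plastic containers',
     5, 0, 0, 'N', 'Y', '5 Stars', 'Great product!', '2022-02-01', 2022, 'Industrial_Supplies'),
    ('US', '78901234', 'IS4392CD4C3C4', 'I00D7JFOPC', '952000001', 'battery tester',
     3, 0, 0, 'N', 'Y', 'good one, but it broke some days later', 'nope', '2022-02-01', 2022, 'Industrial_Supplies'),
]
spark.createDataFrame(rows, columns).createOrReplaceTempView('new_reviews')

# Append the new reviews to the Iceberg table
spark.sql(f'INSERT INTO {CATALOG}.{DATABASE}.{TABLE} SELECT * FROM new_reviews')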

Run the next cell to see the addition to the Industrial_Supplies product category, which now shows 5 under comment_count.

3. Update a customer review in the Iceberg table

In the previous section, we added new customer reviews to the acr_iceberg Iceberg table. In this section, a customer requests an update of their review. Specifically, customer 78901234 requests the following update of the review with ID IS4392CD4C3C4:

  • Change star_rating from 3 to 5
  • Update the review_headline from good one, but it broke some days later to excellent

We update the customer review by using an UPDATE query, running the following cell.
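
The cell runs an UPDATE statement along these lines (a sketch based on the requested changes above):

# Update the star rating and headline for review IS4392CD4C3C4
spark.sql(f'''
    UPDATE {CATALOG}.{DATABASE}.{TABLE}
    SET star_rating = 5,
        review_headline = 'excellent'
    WHERE review_id = 'IS4392CD4C3C4'
''')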

We can review the updated record by running the next cell as follows.

Also, when you run this cell for the reporting table, you can see the updated avg_star column value for the Industrial_Supplies product category. Specifically, the avg_star value has been updated from 3.8 to 4.2 as a result of the star_rating changing from 3 to 5:

4. Reflect changes in the customer reviews table in the BI report table with a MERGE INTO query

In this section, we reflect the changes in the acr_iceberg table into the BI report table acr_iceberg_report. To do so, we run a MERGE INTO query and combine the two tables based on the condition of the product_category column in each table. This query works as follows:

  • When the product_category value matches in both tables, the query combines the records by summing the column values
  • When there is no matching product_category, the query simply inserts a new record

This MERGE INTO operation is also referred to as an UPSERT (update and insert).

Run the following cell to reflect the update of customer reviews in the acr_iceberg table into the acr_iceberg_report BI table.
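
The MERGE INTO query in that cell looks roughly like the following sketch: it aggregates acr_iceberg per product category, sums comment_count and averages avg_star for matching categories, and inserts rows for new categories, as described above.

# Upsert the aggregated review statistics into the BI report table
spark.sql(f'''
    MERGE INTO {CATALOG}.{DATABASE}.acr_iceberg_report AS report
    USING (
        SELECT product_category,
               COUNT(*) AS comment_count,
               AVG(star_rating) AS avg_star
        FROM {CATALOG}.{DATABASE}.acr_iceberg
        GROUP BY product_category
    ) AS agg
    ON report.product_category = agg.product_category
    WHEN MATCHED THEN UPDATE SET
        report.comment_count = report.comment_count + agg.comment_count,
        report.avg_star = (report.avg_star + agg.avg_star) / 2
    WHEN NOT MATCHED THEN INSERT *
''')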

After the MERGE INTO query is complete, you can see the updated acr_iceberg_report table by running the following cell.

The MERGE INTO query performed the following changes:

  • In the Camera, Industrial_Supplies, and PC product categories, each comment_count is the sum of the initial value in the acr_iceberg_report table and the aggregated table value. For example, in the Industrial_Supplies product category row, the comment_count of 100 is calculated as 95 (in the initial version of acr_iceberg_report) + 5 (in the aggregated report table).
  • In addition to comment_count, the avg_star in the Camera, Industrial_Supplies, and PC product category rows is computed by averaging the avg_star values in acr_iceberg_report and in the aggregated table.
  • In the other product categories, each comment_count and avg_star is the same as the value in the aggregated table, which means that each value in the aggregated table was inserted into the acr_iceberg_report table.

5. Roll back the Iceberg tables and reflect changes in the BI report table

In this section, the customer who requested the update of their review now requests to revert the updated review.

Iceberg stores table versions as operations are performed on Iceberg tables. We can see the information for each table version by inspecting the tables, and we can also time travel or roll back tables to an old table version.

To complete the customer request to revert the updated review, we need to revert the table version of acr_iceberg to the earlier version, when we first added the reviews. Additionally, we need to update the acr_iceberg_report table to reflect the rollback of the acr_iceberg table version. Specifically, we need to perform the following three steps to complete these operations:

  1. Check the history of table changes of acr_iceberg and acr_iceberg_report to get each table snapshot.
  2. Roll back acr_iceberg to the version when we first inserted records, and also roll back the acr_iceberg_report table to its initial version to reflect the reverted customer review.
  3. Merge the acr_iceberg table with the acr_iceberg_report table again.

Get the metadata of each table

As a first step, we check the table versions by inspecting the tables. Run the following cells.
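
Those cells query Iceberg's metadata tables, roughly as follows (the history and snapshots metadata tables are part of Iceberg's Spark integration):

# Inspect the version history and snapshots of both Iceberg tables
spark.sql(f'SELECT * FROM {CATALOG}.{DATABASE}.acr_iceberg.history').show(truncate=False)
spark.sql(f'SELECT snapshot_id, committed_at, operation FROM {CATALOG}.{DATABASE}.acr_iceberg.snapshots').show(truncate=False)

spark.sql(f'SELECT * FROM {CATALOG}.{DATABASE}.acr_iceberg_report.history').show(truncate=False)
spark.sql(f'SELECT snapshot_id, committed_at, operation FROM {CATALOG}.{DATABASE}.acr_iceberg_report.snapshots').show(truncate=False)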

Now you can see the following table versions in acr_iceberg and acr_iceberg_report:

  • acr_iceberg has three versions:
    • The oldest one is the initial version of this table, which shows the append operation
    • The second-oldest one is the record insertion, which shows the append operation
    • The latest one is the update, which shows the overwrite operation
  • acr_iceberg_report has two versions:
    • The oldest one is the initial version of this table, which shows the append operation
    • The other one is from the MERGE INTO query in the previous section, which shows the overwrite operation

As shown in the following screenshot, we roll back to the acr_iceberg table version where we inserted records, based on the customer's revert request. We also roll back the acr_iceberg_report table to its initial version to discard the MERGE INTO operation from the previous section.

Roll back the acr_iceberg and acr_iceberg_report tables

Based on your snapshot IDs, you can roll back each table version:

  • For acr_iceberg, use the second-oldest snapshot_id (in this example, 5440744662350048750) and replace <Type snapshot_id in acr_iceberg table> in the following cell with this snapshot_id.
  • For the acr_iceberg_report table, use the initial snapshot_id (in this example, 7958428388396549892) and replace <Type snapshot_id in acr_iceberg_report table> in the following cell with this snapshot_id.

After you specify the snapshot_id for each rollback query, run the following cells.
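
Each rollback cell calls Iceberg's rollback_to_snapshot procedure, roughly as in the following sketch (replace the snapshot IDs with the ones from your own history output):

# Roll back acr_iceberg to the snapshot taken right after the record insertion
spark.sql(f'''
    CALL {CATALOG}.system.rollback_to_snapshot('{DATABASE}.acr_iceberg', 5440744662350048750)
''')

# Roll back acr_iceberg_report to its initial snapshot
spark.sql(f'''
    CALL {CATALOG}.system.rollback_to_snapshot('{DATABASE}.acr_iceberg_report', 7958428388396549892)
''')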

When this step is complete, you can see the previous and current snapshot IDs of each table.

Each Iceberg table has now been reverted to the specified version.

Reflect changes in acr_iceberg into acr_iceberg_report again

We reflect the acr_iceberg table reversion into the current acr_iceberg_report table. To complete this, run the following cell.

After you rerun the MERGE INTO query, run the following cell to see the new table records. When we compare the table records, we observe that the avg_star value in Industrial_Supplies is lower than the previous table's avg_star value.

You were able to reflect a customer's request to revert their updated review in the BI report table. Specifically, you can see the updated avg_star record in the Industrial_Supplies product category.

Clean up

To clean up all resources that you created, delete the CloudFormation stack.

Conclusion

In this post, we walked through using the Apache Iceberg connector with AWS Glue ETL jobs. We created an Iceberg table built on Amazon S3 and ran queries such as reading the Iceberg table data, inserting a record, merging two tables, and time travel.

The operations we demonstrated on the Iceberg table in this post aren't all of the operations Iceberg supports. Refer to the Apache Iceberg documentation for information about more operations.

Appendix: Spark configurations for using Apache Iceberg on AWS Glue

As we mentioned earlier, the notebook sets up a Spark configuration to integrate Iceberg with AWS Glue. The following table shows what each parameter defines.

Spark configuration key | Value | Description
spark.sql.catalog.{CATALOG} | org.apache.iceberg.spark.SparkCatalog | Specifies a Spark catalog interface that communicates with Iceberg tables.
spark.sql.catalog.{CATALOG}.warehouse | {WAREHOUSE_PATH} | A warehouse path for jobs to write Iceberg metadata and actual data.
spark.sql.catalog.{CATALOG}.catalog-impl | org.apache.iceberg.aws.glue.GlueCatalog | The implementation of the Spark catalog class to communicate between Iceberg tables and the AWS Glue Data Catalog.
spark.sql.catalog.{CATALOG}.io-impl | org.apache.iceberg.aws.s3.S3FileIO | Used for Iceberg to communicate with Amazon S3.
spark.sql.catalog.{CATALOG}.lock-impl | org.apache.iceberg.aws.glue.DynamoLockManager | Used for Iceberg to manage table locks.
spark.sql.catalog.{CATALOG}.lock.table | {DYNAMODB_TABLE} | A DynamoDB table name to store table locks.
spark.sql.extensions | org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions | The implementation that enables Spark to run Iceberg-specific SQL commands.
spark.sql.session.timeZone | UTC | Sets the time zone of the Spark environment to UTC for Iceberg time travel queries. The epoch time is in the UTC time zone.

About the Author

Tomohiro Tanaka is a Cloud Support Engineer at Amazon Web Services. He builds Glue connectors such as the Apache Iceberg connector and the TPC-DS connector. He's passionate about helping customers build data lakes using ETL workloads. In his free time, he also enjoys coffee breaks with his colleagues and making coffee at home.
