
Build a data sharing workflow with AWS Lake Formation for your data mesh


A key benefit of a data mesh architecture is allowing different lines of business (LOBs) and organizational units to operate independently and offer their data as a product. This model not only allows organizations to scale, but also gives the end-to-end ownership of maintaining the product to the data producers that are the domain experts of their data. This ownership entails maintaining the data pipelines, debugging the ETL scripts, fixing data quality issues, and keeping the catalog entries up to date as the dataset evolves over time.

On the consumer side, teams can search the central catalog for relevant data products and request access. Access to the data is done via the data sharing feature in AWS Lake Formation. As the number of data products grows and potentially more sensitive information is stored in an organization's data lake, it's important that the process and mechanism to request and grant access to specific data products are implemented in a scalable and secure manner.

This post describes how to build a workflow engine that automates the data sharing process while including a separate approval mechanism for data products that are tagged as sensitive (for example, containing PII data). Both the workflow and approval mechanism are customizable and should be adapted to adhere to your company's internal processes. In addition, we include an optional workflow UI to demonstrate how to integrate with the workflow engine. The UI is just one example of how the interaction works. In a typical large enterprise, you can also use ticketing systems to automatically trigger both the workflow and the approval process.

Solution overview

A typical data mesh architecture for analytics in AWS contains one central account that collates all the different data products from multiple producer accounts. Consumers can search the available data products in a single location. Sharing data products to consumers doesn't actually make a separate copy, but instead just creates a pointer to the catalog item. This means any updates that producers make to their products are automatically reflected in the central account as well as in all the consumer accounts.

Building on top of this foundation, the solution contains several components, as depicted in the following diagram:

The central account includes the following components:

  • AWS Glue – Used for Data Catalog purposes.
  • AWS Lake Formation – Used to secure access to the data as well as provide the data sharing capabilities that enable the data mesh architecture.
  • AWS Step Functions – The actual workflow is defined as a state machine. You can customize this to adhere to your organization's approval requirements.
  • AWS Amplify – The workflow UI uses the Amplify framework to secure access. It also uses Amplify to host the React-based application. On the backend, the Amplify framework creates two Amazon Cognito components to support the security requirements:
    • User pools – Provide a user directory functionality.
    • Identity pools – Provide federated sign-in capabilities using Amazon Cognito user pools as the location of the user details. The identity pools vend temporary credentials so the workflow UI can access AWS Glue and Step Functions APIs.
  • AWS Lambda – Contains the application logic orchestrated by the Step Functions state machine. It also provides the necessary application logic when a producer approves or denies a request for access.
  • Amazon API Gateway – Provides the API for producers to accept and deny requests.

The producer account contains the following components:

The consumer account contains the following components:

  • AWS Glue – Used for Data Catalog purposes.
  • AWS Lake Formation – After the data has been made available, consumers can grant access to their own users via Lake Formation.
  • AWS Resource Access Manager (AWS RAM) – If the grantee account is in the same organization as the grantor account, the shared resource is available immediately to the grantee. If the grantee account is not in the same organization, AWS RAM sends an invitation to the grantee account to accept or reject the resource grant. For more details about Lake Formation cross-account access, see Cross-Account Access: How It Works.

The solution is split into multiple steps:

  1. Deploy the central account backend, including the workflow engine and its associated components.
  2. Deploy the backend for the producer accounts. You can repeat this step multiple times depending on the number of producer accounts that you're onboarding into the workflow engine.
  3. Deploy the optional workflow UI in the central account to interact with the central account backend.

Workflow overview

The following diagram illustrates the workflow. In this particular example, the state machine checks if the table or database (depending on what's being shared) has the pii_flag parameter and if it's set to TRUE. If both conditions are valid, it sends an approval request to the producer's SNS topic. Otherwise, it automatically shares the product to the requesting consumer.

This workflow is the core of the solution, and can be customized to fit your organization's approval process. In addition, you can add custom parameters to databases, tables, or even columns to attach additional metadata to support the workflow logic.
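For example, the branching decision at the heart of the state machine can be sketched as a small predicate. This is a minimal illustration only; the function name and the exact wiring into the state machine are assumptions, not taken from the repository:

```python
def requires_approval(parameters: dict) -> bool:
    """Return True when a catalog item is tagged as containing PII.

    `parameters` is the Parameters map from the Glue database or table
    entry; the pii_flag convention matches the example workflow, and the
    comparison is case-insensitive so TRUE/true both match.
    """
    return parameters.get("pii_flag", "").lower() == "true"


# The state machine branches on this result: send an approval request
# to the producer's SNS topic, or auto-share to the consumer.
print(requires_approval({"pii_flag": "TRUE", "data_owner": "111122223333"}))  # True
print(requires_approval({"data_owner": "111122223333"}))                      # False
```

Because the check is driven entirely by catalog parameters, adding new sensitivity levels later only requires new parameters and a new branch, not a redeployment of the data products themselves.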

Prerequisites

The following are the deployment requirements:

You can clone the workflow UI and AWS CDK scripts from the GitHub repository.

Deploy the central account backend

To deploy the backend for the central account, go to the root of the project after cloning the GitHub repository and enter the following code:

yarn deploy-central --profile <PROFILE_OF_CENTRAL_ACCOUNT>

This deploys the following:

  • IAM roles used by the Lambda functions and the Step Functions state machine
  • Lambda functions
  • The Step Functions state machine (the workflow itself)
  • An API Gateway

When the deployment is complete, it generates a JSON file in the src/cfn-output.json location. This file is used by the UI deployment script to generate a scoped-down IAM policy and by the workflow UI application to locate the state machine that was created by the AWS CDK script.

The actual AWS CDK scripts for the central account deployment are in infra/central/. This also includes the Lambda functions (in the infra/central/functions/ folder) that are used by both the state machine and the API Gateway.

Lake Formation permissions

The following table contains the minimum required permissions that the central account data lake administrator needs to grant to the respective IAM roles for the backend to have access to the AWS Glue Data Catalog.

Role | Permission | Grantable
WorkflowLambdaTableDetails | Database: DESCRIBE; Tables: DESCRIBE | N/A
WorkflowLambdaShareCatalog | |

Workflow catalog parameters

The workflow uses the following catalog parameters to provide its functionality.

Catalog Type | Parameter Name | Description
Database | data_owner | (Required) The account ID of the producer account that owns the data product.
Database | data_owner_name | A readable friendly name that identifies the producer in the UI.
Database | pii_flag | A flag (true/false) that determines whether the data product requires approval (based on the example workflow).
Column | pii_flag | A flag (true/false) that determines whether the data product requires approval (based on the example workflow). This is only applicable if requesting table-level access.

You can use UpdateDatabase and UpdateTable to add parameters at database and column-level granularity, respectively. Alternatively, you can use the AWS Glue CLI to add the relevant parameters.
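If you script this with boto3 instead of the CLI, note that UpdateDatabase takes a complete DatabaseInput, so it's safest to merge the new parameters into whatever is already on the database rather than overwrite the Parameters map. A minimal sketch, with an illustrative database name and parameter values:

```python
import json


def build_database_input(existing: dict, new_params: dict) -> dict:
    """Build the DatabaseInput for glue.update_database, preserving any
    parameters already present on the database entry."""
    merged = dict(existing.get("Parameters", {}))
    merged.update(new_params)
    return {"Name": existing["Name"], "Parameters": merged}


# `database` is shaped like the Database object from `aws glue get-database`.
database = {"Name": "sales_db", "Parameters": {"classification": "parquet"}}
database_input = build_database_input(
    database,
    {
        "data_owner": "111122223333",
        "data_owner_name": "Sales LOB",
        "pii_flag": "true",
    },
)
print(json.dumps(database_input, indent=2))
# With boto3 this would then be applied via:
#   glue.update_database(Name=database["Name"], DatabaseInput=database_input)
```

The merge step matters because any existing parameters (such as classification) would otherwise be dropped by the update.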

Use the AWS CLI to run the following command to check the current parameters in your database:

aws glue get-database --name <DATABASE_NAME> --profile <PROFILE_OF_CENTRAL_ACCOUNT>

You get the following response:

{
  "Database": {
    "Name": "<DATABASE_NAME>",
    "CreateTime": "<CREATION_TIME>",
    "CreateTableDefaultPermissions": [],
    "CatalogId": "<CATALOG_ID>"
  }
}

To update the database with the parameters indicated in the preceding table, we first create the input JSON file, which contains the parameters that we want to update the database with. For example, see the following code:

{
  "Name": "<DATABASE_NAME>",
  "Parameters": {
    "data_owner": "<AWS_ACCOUNT_ID_OF_OWNER>",
    "data_owner_name": "<AWS_ACCOUNT_NAME_OF_OWNER>",
    "pii_flag": "true"
  }
}

Run the following command to update the Data Catalog:

aws glue update-database --name <DATABASE_NAME> --database-input file://<FILE_NAME>.json --profile <PROFILE_OF_CENTRAL_ACCOUNT>

Deploy the producer account backend

To deploy the backend for your producer accounts, go to the root of the project and run the following command:

yarn deploy-producer --profile <PROFILE_OF_PRODUCER_ACCOUNT> --parameters centralMeshAccountId=<central_account_account_id>

This deploys the following:

  • An SNS topic where approval requests get published.
  • The ProducerWorkflowRole IAM role with a trust relationship to the central account. This role allows publishing to the previously created SNS topic.

You can run this deployment script multiple times, each time pointing to a different producer account that you want to participate in the workflow.

To receive notification emails, subscribe your email to the SNS topic that the deployment script created. For example, our topic is called DataLakeSharingApproval. To get the full ARN, you can either go to the Amazon Simple Notification Service console or run the following command to list all the topics and get the ARN for DataLakeSharingApproval:

aws sns list-topics --profile <PROFILE_OF_PRODUCER_ACCOUNT>

After you have the ARN, you can subscribe your email by running the following command:

aws sns subscribe --topic-arn <TOPIC_ARN> --protocol email --notification-endpoint <EMAIL_ADDRESS> --profile <PROFILE_OF_PRODUCER_ACCOUNT>

You then receive a confirmation email at the email address that you subscribed. Choose Confirm subscription to receive notifications from this SNS topic.
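If you are automating this rather than eyeballing the list-topics output, the ARN lookup can be scripted. This sketch only assumes the Topics/TopicArn response shape shared by the CLI and boto3; the helper name is hypothetical:

```python
from typing import Optional


def find_topic_arn(topics: list, topic_name: str) -> Optional[str]:
    """Return the ARN whose final component matches topic_name, given the
    Topics list from `aws sns list-topics` or boto3's list_topics()."""
    for topic in topics:
        # SNS topic ARNs end in the topic name: arn:aws:sns:region:account:name
        if topic["TopicArn"].split(":")[-1] == topic_name:
            return topic["TopicArn"]
    return None


topics = [
    {"TopicArn": "arn:aws:sns:us-east-1:111122223333:OtherTopic"},
    {"TopicArn": "arn:aws:sns:us-east-1:111122223333:DataLakeSharingApproval"},
]
print(find_topic_arn(topics, "DataLakeSharingApproval"))
# arn:aws:sns:us-east-1:111122223333:DataLakeSharingApproval
```

Matching on the last ARN component avoids false positives from topics whose names merely contain the search string.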

Deploy the workflow UI

The workflow UI is designed to be deployed in the central account where the central data catalog is located.

To start the deployment, enter the following command:

This deploys the following:

  • An Amazon Cognito user pool and identity pool
  • A React-based application to interact with the catalog and request data access

The deployment command prompts you for the following information:

  • Project information – Use the default values.
  • AWS authentication – Use your profile for the central account. Amplify uses this profile to deploy the backend resources.
  • UI authentication – Use the default configuration and your username. Choose No, I'm done when asked to configure advanced settings.
  • UI hosting – Use hosting with the Amplify console and choose manual deployment.

The script gives a summary of what is deployed. Entering Y triggers the resources to be deployed in the backend. The prompt looks similar to the following screenshot:

When the deployment is complete, the remaining prompt is for the initial user information such as user name and email. A temporary password is automatically generated and sent to the email provided. The user is required to change the password after the first login.

The deployment script grants IAM permissions to the user via an inline policy attached to the Amazon Cognito authenticated IAM role:

{
   "Version": "2012-10-17",
   "Statement": [
      {
         "Effect": "Allow",
         "Action": [
            "glue:GetDatabase",
            "glue:GetTables",
            "glue:GetDatabases",
            "glue:GetTable"
         ],
         "Resource": "*"
      },
      {
         "Effect": "Allow",
         "Action": [
            "states:ListExecutions",
            "states:StartExecution"
         ],
         "Resource": [
            "arn:aws:states:<REGION>:<AWS_ACCOUNT_ID>:stateMachine:<STATE_MACHINE_NAME>"
         ]
      },
      {
         "Effect": "Allow",
         "Action": [
            "states:DescribeExecution"
         ],
         "Resource": [
            "arn:aws:states:<REGION>:<AWS_ACCOUNT_ID>:execution:<STATE_MACHINE_NAME>:*"
         ]
      }
   ]
}
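To see how the placeholders resolve, the policy can be rendered for a concrete state machine with ordinary string formatting. This is an illustrative sketch (the helper name and the example Region, account ID, and state machine name are all made up, and the deployment script derives the real values from src/cfn-output.json):

```python
import json


def scoped_workflow_policy(region: str, account_id: str, state_machine: str) -> dict:
    """Render the scoped-down inline policy for one state machine.

    ListExecutions/StartExecution target the state machine ARN, while
    DescribeExecution targets the execution ARNs spawned from it.
    """
    sm_arn = f"arn:aws:states:{region}:{account_id}:stateMachine:{state_machine}"
    exec_arns = f"arn:aws:states:{region}:{account_id}:execution:{state_machine}:*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "glue:GetDatabase",
                    "glue:GetTables",
                    "glue:GetDatabases",
                    "glue:GetTable",
                ],
                "Resource": "*",
            },
            {
                "Effect": "Allow",
                "Action": ["states:ListExecutions", "states:StartExecution"],
                "Resource": [sm_arn],
            },
            {
                "Effect": "Allow",
                "Action": ["states:DescribeExecution"],
                "Resource": [exec_arns],
            },
        ],
    }


print(json.dumps(
    scoped_workflow_policy("us-east-1", "111122223333", "DataLakeApprovalWorkflow"),
    indent=2,
))
```

Note how the Step Functions statements are split: state machine ARNs and execution ARNs are different resource types, so a single Resource entry could not cover both.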

The last remaining step is to grant Lake Formation permissions (DESCRIBE for both databases and tables) to the authenticated IAM role associated with the Amazon Cognito identity pool. You can find the IAM role by running the following command:

cat amplify/team-provider-info.json

The IAM role name is in the AuthRoleName property under the awscloudformation key. After you grant the required permissions, you can use the URL provided in your browser to open the workflow UI.
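If you'd rather extract the role name programmatically than read the file by eye, a small sketch works; it assumes the usual Amplify layout of team-provider-info.json (environment name, then awscloudformation, then AuthRoleName), and the "dev" environment and role name below are hypothetical:

```python
def auth_role_name(team_provider_info: dict, env: str = "dev") -> str:
    """Pull the Cognito authenticated role name out of a parsed
    amplify/team-provider-info.json structure."""
    return team_provider_info[env]["awscloudformation"]["AuthRoleName"]


# Sample shaped like `cat amplify/team-provider-info.json`:
sample = {
    "dev": {
        "awscloudformation": {
            "AuthRoleName": "amplify-workflowui-dev-123456-authRole",
            "Region": "us-east-1",
        }
    }
}
print(auth_role_name(sample))
# amplify-workflowui-dev-123456-authRole
```

You can then pass the role name to the Lake Formation grant (console or `aws lakeformation grant-permissions`) as the principal receiving DESCRIBE.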

Your temporary password is emailed to you so you can complete the initial login, after which you're asked to change your password.

The first page after logging in is the list of databases that users can access.

Choose Request Access to see the database details and the list of tables.

Choose Request Per Table Access to see more details at the table level.

Going back to the previous page, we request database-level access by entering the consumer account ID that receives the share request.

Because this database has been tagged with a pii_flag, the workflow needs to send an approval request to the product owner. To receive this approval request email, the product owner's email needs to be subscribed to the DataLakeSharingApproval SNS topic in the producer account. The details should look similar to the following screenshot:

The email looks similar to the following screenshot:

The product owner chooses the Approve link to trigger the Step Functions state machine to continue running and share the catalog item to the consumer account.

For this example, the consumer account is not part of an organization, so the admin of the consumer account has to go to AWS RAM and accept the invitation.

After the resource share is accepted, the shared database appears in the consumer account's catalog.

Clean up

If you no longer want to use this solution, use the provided cleanup scripts to remove the deployed resources.

Producer account

To remove the deployed resources in the producer accounts, run the following command for each producer account that you deployed in:

yarn clean-producer --profile <PROFILE_OF_PRODUCER_ACCOUNT>

Central account

Run the following command to remove the workflow backend in the central account:

yarn clean-central --profile <PROFILE_OF_CENTRAL_ACCOUNT>

Workflow UI

The cleanup script for the workflow UI relies on an Amplify CLI command to initiate the teardown of the deployed resources. In addition, you can use a custom script to remove the inline policy in the authenticated IAM role used by Amazon Cognito so that Amplify can fully clean up all the deployed resources. Run the following command to trigger the cleanup:

This command doesn't require the profile parameter because it uses the existing Amplify configuration to infer where the resources are deployed and which profile was used.

Conclusion

This post demonstrated how to build a workflow engine to automate an organization's approval process to gain access to data products with varying degrees of sensitivity. Using a workflow engine enables data sharing in a self-service manner while codifying your organization's internal processes to be able to safely scale as more data products and teams get onboarded.

The provided workflow UI demonstrated one possible integration scenario. Other possible integration scenarios include integration with your organization's ticketing system to trigger the workflow as well as receive and respond to approval requests, or integration with business chat applications to further shorten the approval cycle.

Finally, a high degree of customization is possible with the demonstrated approach. Organizations have complete control over the workflow, how data product sensitivity levels are defined, what gets auto-approved and what needs further approvals, the hierarchy of approvals (such as a single approver or multiple approvers), and how the approvals get delivered and acted upon. You can take advantage of this flexibility to automate your company's processes to help them safely accelerate toward being a data-driven organization.


About the Author

Jan Michael Go Tan is a Principal Solutions Architect for Amazon Web Services. He helps customers design scalable and innovative solutions with the AWS Cloud.
