AWS Glue Using PySpark


Introduction

What is AWS Glue?

1. AWS Glue is a fully managed extract, transform, and load (ETL) service. It makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it swiftly and reliably between various data stores.

2. It comprises components such as a central metadata repository known as the AWS Glue Data Catalog, an ETL engine that automatically generates Python or Scala code, and a flexible scheduler that handles dependency resolution, job monitoring, and retries.

3. AWS Glue is serverless, which means that there’s no infrastructure to set up or manage.

When to use AWS Glue?

1. When you want to automate your ETL processes.

2. When you want to easily integrate with other data sources and targets such as Amazon Kinesis, Amazon Redshift, Amazon S3, etc.

3. When you want your pipeline to be cost-effective; it can be cheaper because you only pay for the resources you consume. If your ETL jobs occasionally need more computing power but generally consume fewer resources, you don't pay for idle capacity between runs.

S3 Bucket

1. Create an S3 bucket:

Go to the AWS console and search for S3.

Create a new S3 bucket; the bucket name must be globally unique. (We created a bucket named glue-job-customscript.)

After creating the bucket, create two directories inside it:

1. Input

2. Output
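If you prefer to script this setup, the same bucket and folders can be created with boto3. Below is a minimal sketch; the region is an assumption, and note that S3 folders are really just zero-byte placeholder keys ending in "/":

import boto3

s3 = boto3.client("s3", region_name="us-east-1")  # assumed region; use your own

# Bucket names must be globally unique across all of AWS.
s3.create_bucket(Bucket="glue-job-customscript")

# S3 has no real directories; empty keys ending in "/" act as folders.
s3.put_object(Bucket="glue-job-customscript", Key="Input/")
s3.put_object(Bucket="glue-job-customscript", Key="Output/")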

Note:

Please change the bucket policy to grant read and write access to the user.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ListObjectsInBucket",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::glue-job-customscript",
        "arn:aws:s3:::glue-job-customscript/*"
      ]
    }
  ]
}
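The same policy can also be applied without the console using boto3's put_bucket_policy; a short sketch:

import json
import boto3

s3 = boto3.client("s3")

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "ListObjectsInBucket",
        "Effect": "Allow",
        "Principal": "*",
        "Action": "s3:*",
        "Resource": [
            "arn:aws:s3:::glue-job-customscript",
            "arn:aws:s3:::glue-job-customscript/*",
        ],
    }],
}

# The policy document must be passed as a JSON string.
s3.put_bucket_policy(Bucket="glue-job-customscript", Policy=json.dumps(policy))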

Crawlers

A Crawler reads data at the source location and creates tables in the Data Catalog. A table is the metadata definition that represents your data.

The Crawler creates the metadata that allows Glue and services such as Athena to view the information stored in the S3 bucket as a database with tables.

2. Create a Crawler

1. Give a name to the crawler, then click Next.

2. In Crawler source type, we do not need to change anything. Just click Next.

3. In Data store, choose S3 and set the include path to the bucket's Input folder (s3://glue-job-customscript/Input/), then click Next.

4. In Add another data source, select No and click Next.

5. In IAM role, choose Create an IAM role and provide a name for it (we created a role named Glue-Script).

6. In Schedule, choose the appropriate option and click Next (we chose Run on demand).

7. In Output, select an existing database if there is any; otherwise create a new database (by clicking Add database) and click Next.

After running the crawler, you will see the database you created. Click it, and you will see the table the crawler generated from the data in the Input folder.
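If you would rather define the crawler in code, boto3 exposes the same operations. Here is a sketch assuming the role created above; the crawler and database names (customscript-crawler, glue-db) are illustrative placeholders, not fixed values:

import boto3

glue = boto3.client("glue")

# "Glue-Script" is the role from step 5; the crawler and database names
# below are placeholders for whatever you chose in the console.
glue.create_crawler(
    Name="customscript-crawler",
    Role="Glue-Script",
    DatabaseName="glue-db",
    Targets={"S3Targets": [{"Path": "s3://glue-job-customscript/Input/"}]},
)

glue.start_crawler(Name="customscript-crawler")

# Once the run finishes, the discovered tables appear in the Data Catalog.
tables = glue.get_tables(DatabaseName="glue-db")
print([t["Name"] for t in tables["TableList"]])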

Job

An AWS Glue job encapsulates a script that connects to your source data, processes it, and then writes it out to your data target.

3. Create a Job
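A Glue job's script is plain PySpark plus the awsglue libraries. Below is a minimal sketch of such a script: it reads the table the crawler created, applies a sample transformation, and writes Parquet to the Output folder. The database and table names (glue-db, input) are illustrative placeholders:

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import DropNullFields
from awsglue.utils import getResolvedOptions

# Glue passes the job name (and any custom arguments) on the command line.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the table the crawler created; "glue-db" and "input" are placeholders.
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="glue-db", table_name="input"
)

# Any Spark/Glue transformation can go here; dropping null fields is a sample.
dyf = DropNullFields.apply(frame=dyf)

# Write the result to the Output folder as Parquet.
glueContext.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://glue-job-customscript/Output/"},
    format="parquet",
)

job.commit()

When creating the job in the Glue console, you can either let Glue auto-generate a starting script or author your own and paste in a script like the one above.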
