Create AWS S3 Bucket and Read Parquet File from Fabric Notebook
6K views
Jan 29, 2024
In this video, I covered how to get started with AWS S3: creating an S3 bucket, an access key and a secret key, and reading data in a Fabric Notebook and writing it to AWS S3.
0:00
Hello everyone and compliments of the season
0:12
In this video, I'm going to show you how to read a Lakehouse parquet file into Amazon
0:19
S3 Web Service using the Fabric Notebook. This is going to be a comprehensive end-to-end project because I'm going to walk you through
0:28
what is Amazon S3, how you can create buckets, create access keys, secret keys and of course
0:35
how you can write the data from the Fabric Notebook into the S3 service and of course how
0:41
you can query the data in the S3 query window. So let's get started
0:48
Now what is Amazon S3? Now the S3 simply means Simple Storage Service and it is a scalable object storage service
0:56
that is offered as part of the Amazon Web Services and it allows you to store and retrieve
1:01
any amount of data at any time from any location on the web
1:07
The S3 is designed to be a highly durable, available and scalable service to store data
1:13
or perform data backups, archive your data, distribute content and of course it serves as a data lake
1:20
Using the Amazon Web Service, we're going to create what are called buckets. Now buckets are basically containers for storing any kind of objects, like PDF,
1:28
parquet, CSV, PNG or whatever file you have. Enough of talking, let's get started
1:36
I am currently at console.aws.amazon.com and of course we can click on the services
1:44
to see all the services in the AWS or I can even search the S3 in the search bar and then
1:51
click on this S3 service. For the first time, we're going to see this "Amazon S3: store and retrieve any amount of
1:58
data from anywhere" landing page, because I do not have any buckets created
2:04
So I'm going to click on this Create bucket, which can store different kinds of objects, and
2:09
of course we can specify the AWS region in the general configuration
2:15
I'm going to scroll down to, there we go, eu-north-1, and of course we can choose the bucket
2:24
type, we can choose the general purpose or directory. I'm going to go with the general purpose which is the recommended and I'm going to scroll down
2:32
Now for the bucket name, you can see this kind of example, we can't use uppercase as
2:37
a bucket name. It's not going to allow that. So I'm going to type in salesdata1234 just to make it more unique and of course you can
2:48
even select a bucket in S3 if we have any existing bucket but I do not have any bucket
2:53
so I'm not going to choose this and let's scroll down. For the object ownership, now we can specify the access control lists, okay so ACLs are
3:01
going to stay disabled, and let's scroll down. Now for the block public access settings for this bucket, we're going to block all the
3:09
public access just for now. For the bucket versioning, I'm going to click on Enable, and this allows me to recover any objects
3:17
that have been deleted in a bucket, so we can see versioning is a means of keeping multiple
3:21
variants of an object in the same bucket. So enabled and of course we can optionally specify tags and for the default encryption
3:28
we want to use the default server-side encryption with Amazon S3 managed keys and let's go down
3:35
and of course for the advanced settings, we don't need to do anything here for now, let's
3:40
just go down and click on create bucket, okay I'm going to go up, okay I can see that bucket
3:48
with the same name already exists. Now I do not have any bucket with the same name but I don't know why
3:55
I'm just going to type in 12 and then salesdata12, oh okay I'm still seeing the same
4:00
error, so let's use 007, let me click out and this is accepted, salesdata007, and then
4:08
let's click on create bucket. Successfully created bucket salesdata007, and that is
4:15
super amazing. So we can see the name of the bucket, salesdata007, and we can see the AWS region, eu-north-1,
4:22
covering Stockholm and some other places. So for the access, we can see bucket and objects not public and that is fine for now
4:29
I can click on the bucket name and then we can see different kinds of tabs, like
4:35
Objects, Properties, Permissions, Metrics, Management and the Access Points
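For readers who prefer code, here is a minimal boto3 sketch of roughly the same console steps, assuming the bucket name salesdata007 and the eu-north-1 region; it is an illustration, not what is run in the video:

```python
import boto3

# Create a bucket in eu-north-1 and enable versioning on it,
# mirroring the console settings chosen above.
s3_client = boto3.client('s3', region_name='eu-north-1')

s3_client.create_bucket(
    Bucket='salesdata007',
    CreateBucketConfiguration={'LocationConstraint': 'eu-north-1'},
)
s3_client.put_bucket_versioning(
    Bucket='salesdata007',
    VersioningConfiguration={'Status': 'Enabled'},
)
```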
4:42
Now what I'm going to do is to go ahead and create what's called access key and secret
4:47
key in the Identity and Access Management service. So I'm going to come here, I'm going to type in IAM and then I'm going to click on this
4:56
Now in the Identity and Access Management (IAM) service, we can see the access management
5:02
and we can see different kinds of, the dashboard rather, the IAM dashboard, we can see the
5:07
security recommendations and so on. Now I want to click on users, I want to create a new user and click on create user
5:16
Now I'm going to call it my name, Abiola David, so this is going to be the username
5:21
Now I'm not going to provide user access to the AWS management console
5:26
So I'm going to go ahead and click on next. And then for the permissions options, under the set permissions, now we can add to a group
5:34
we can copy permissions from an existing user to this particular new user
5:38
We can even attach individual direct policies. So I'm going to choose this attach policies directly and then I'm going to apply these policies
5:48
I'm going to click on this AdministratorAccess-Amplify policy. And I think let me just grant, okay, device setup
5:54
I think that is fine for this user. I'm going to scroll down, then click on next
5:59
And then we can see the review and create. So this is the username, the console password type, none, require password reset, none
6:10
And of course we can see the permission summary. So scroll down and click on create user
6:17
And there we go, Abiola David user created, with a few errors. That is fine
6:21
There's no problem. Now I'm going to click on the user we just created
6:25
And then in the user, we're going to create what's called the access key
6:29
So I'm going to click on this create access key. And of course we can specify the access key
6:35
We can choose the use case: Command Line Interface (CLI), local code, application
6:39
running on an AWS compute service, third-party service. We can even click on or choose application running outside AWS
6:47
I'm going to choose this third party service. We plan to use this access key to enable access for a third party application or service that
6:53
monitors or manages your AWS resources. So I'm going to scroll down and I'm going to click on this
7:00
I understand above recommendation. I want to proceed to create an access key
7:04
So click on next. And then for the description tag, let's just call it, you know, access, go ahead and create
7:13
access key. Okay. So you can see this is the only time that the secret access key can be viewed or downloaded
7:21
So what I'm going to do is, I could click on this to show the secret key, but I'm not going to click on it anyway; and this is the access key
7:28
So what I'm going to do is click on this download.csv. So I'm going to download to my personal PC
7:32
There we go, AbiolaDavid_accessKeys.csv. That's fine. And go ahead and click done. Okay
7:39
So you can see the access key one created, and of course it is now active
7:44
What I'm going to do finally for now, I'm going to click on sales data to investigate
7:49
Now you can see in this case, we do not have any object
7:52
So let's head over to the Fabric Notebook. Now basically in this Fabric Notebook workspace, we can see we have this AWS S3 Lakehouse
8:04
I'm going to click on it. And of course we can see we have this file uploaded sales data.csv
8:09
I can click on it to investigate. So we can see the sales data
8:14
And of course, I'm going to click on this file. So what I did was just click on these three ellipses, and of course just load to the existing
8:23
table, this particular sales data. Okay, that's fine. I'm going to open a notebook
8:28
So click on new notebook. Now we need to install two libraries
8:38
The first one is the Boto3, which allows us to interact with AWS services and also S3FS
8:45
to interact specifically with the S3 service in AWS. So let's do pip install Boto3 and control enter to run the cell
8:57
While that is doing its job, I'm going to click to add a new cell
9:01
Now I'm going to do another pip install. This time I'm going to install the S3FS
9:10
And let's wait for this to finish its job. Okay, there we go
9:15
So it has been installed. And I'm going to come back to the first pip install
9:19
Okay, so you can see the command executed in a few seconds. So I'm going to scroll down
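For reference, the two install cells look roughly like this; Fabric notebooks accept the standard pip magic, though the exact invocation in the video may differ:

```python
# Run each install in its own notebook cell.
%pip install boto3   # AWS SDK for Python
%pip install s3fs    # filesystem-style access to S3
```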
9:25
First we need to import the Boto3 and of course we want to import pandas as pd
9:30
So import boto3. Okay, and then we'll import pandas as pd
9:40
So let's run this cell and let's see. There we go. So command executed in 312 milliseconds and that is amazing
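The import cell, as described, is simply:

```python
import boto3          # AWS SDK: client and resource interfaces
import pandas as pd   # DataFrames and parquet I/O
```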
9:49
So let's click on a new cell. Now we'll initialize an instance of Boto3, and we'll use the resource
9:55
So we'll specify the service name, the region name, the AWS access key ID, AWS secret access key
10:02
So I'm going to just give it the variable name, aws_s3
10:09
Okay, and then we'll initialize the Boto3.resource and then open the brackets
10:16
Now we'll specify the name of the service. So let's call this one service underscore name
10:24
And that's going to be S3, right? Amazon S3 and then put in a comma
10:28
And then we'll provide the region, region underscore name. And that has to be, let me just double check, that should be eu-north-1
10:39
So EU, hyphen, north, hyphen, 1, and then put in a comma
10:46
Now we need to provide the AWS underscore access, access key ID
10:54
So, and there we go. So we can see the AWS access key ID
10:59
Now specify the AWS underscore secret underscore access key. So AWS underscore secret underscore access key
11:10
And that has to be inside single quotes. Okay, so there we go
11:17
Now there we go. So we can see the service name, the region, the access key ID
11:23
and of course the secret access key. So let's go ahead and control enter to run the cell
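Assembled, that cell looks something like this; the variable name and the two key values are placeholders, with the real keys coming from the downloaded CSV:

```python
# Initialize the S3 resource. Never hard-code real keys in shared
# notebooks; the two key values below are placeholders.
aws_s3 = boto3.resource(
    service_name='s3',
    region_name='eu-north-1',
    aws_access_key_id='YOUR_ACCESS_KEY_ID',
    aws_secret_access_key='YOUR_SECRET_ACCESS_KEY',
)
```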
11:29
There we go. So command executed successfully. Now, the next thing we need to do is to read the parquet file
11:36
with pd.read underscore parquet. So I'm going to come to the sales data, okay
11:41
And click on this horizontal ellipses. And I want to copy the path
11:46
So copy that and let's add a new cell. Now I'm just going to do df equals to pd.read underscore parquet
11:54
And of course, inside double quotes, I'm going to control V the copied path
12:00
and then click enter. Now we need to go ahead and store the data frame into a file with the parquet extension
12:07
So I'm going to do df.to underscore parquet. And then inside open and close brackets, now single quotes
12:15
I'm going to call this one sales data dot parquet. Okay, so let's check it out
12:21
So df.to underscore parquet. Okay, this is fine. Control enter. And let's see
12:29
There we go. So command executed in just two seconds and 490 milliseconds
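Those two cells, roughly; the table path is a placeholder for whatever the copy-path option gives you, and the local file name is approximated from the video:

```python
# Read the Lakehouse table into a DataFrame. Paste the path copied
# from the table's ellipsis menu in place of this placeholder.
df = pd.read_parquet("<copied-lakehouse-table-path>")

# Write the DataFrame back out as a local parquet file.
df.to_parquet('sales_data.parquet')
```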
12:35
Now I'm going to scroll down. So the final thing we need to do is to go ahead and upload this particular
12:41
sales data dot parquet into AWS. So I'm going to call the aws_s3 that we defined
12:49
And if we want to access the bucket, so inside open and close brackets
12:54
I'm going to specify the name of the bucket, which is salesdata007
13:00
So salesdata007. Okay. And then we'll use the dot upload underscore file
13:11
And of course, we'll actually specify the file name. So Filename, that must be equal to this sales data dot parquet
13:20
Let me just copy it. And inside single quotes, control V. And of course, we want to specify the key
13:28
So this is Key. And that has to be equal to, inside single quotes
13:33
Let's just call it sales data. So let's go through it again
13:38
So first we specify the bucket. And then we specify the name of the bucket
13:43
And then we use the dot upload file. And of course, we specify the file name, which is sales data dot parquet
13:50
And then the key, which is sales data. So this is going to be what's going to be seen as the key in the S3
13:56
So let's go ahead and control enter to run the cell. There we go
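The upload cell, then, is approximately the following, reusing the hypothetical names from the earlier sketches:

```python
# Upload the local parquet file into the bucket. Key is the object
# name that will appear in the S3 console.
aws_s3.Bucket('salesdata007').upload_file(
    Filename='sales_data.parquet',
    Key='sales_data',
)
```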
14:02
So command executed in 879 milliseconds by Abiola David. And then we can go to the S3
14:10
Now, I'm just going to go ahead and refresh the page. Now, this is the moment of truth
14:16
I'm going to click on the sales data 007 bucket. And there we go
14:21
Sales data, last modified December 21st, 2023 at 18:38. And that's exactly the time we're at
14:29
And that is super cool. And of course, you can see the size of 81.7 kilobytes
14:35
And this is super amazing. And of course, you can see the storage class is standard
14:39
I'm going to click on the object, the sales data. And of course, you can see the properties, the permissions, the versions
14:46
So you can see the object overview, owner, AWS region, the last modified date, size
14:52
We can see the S3 URI and, of course, the object URL
14:56
And, of course, you can see the key. So this is exactly what we specified, the sales data
15:00
If you recall, we gave it this particular key. So you can see the key here
15:04
This is super amazing. And finally, I'm going to click on this object actions
15:09
And I'm going to click on this query with S3 select. So for the input settings, now we can see the path and, of course, the SAS
15:17
Now I'm going to choose, because this is actually a parquet file, I'm going to choose Apache Parquet
15:22
And for the compression, this is not supported. That's fine. I'm going to scroll down
15:28
And then I can select the output settings. So for the output, you can choose the CSV or JSON
15:33
I can even choose the CSV delimiter. So that is fine. Just go ahead and scroll down
15:39
Now this is going to be SELECT * FROM s3object. And this is s as an alias
15:46
This is going to limit it to the first 10 records. I'm going to scroll down
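As shown on screen, the default query is:

```sql
-- s is the alias for the selected object; return the first 10 rows.
SELECT * FROM s3object s LIMIT 10
```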
15:52
I can see in the results query there's nothing to display. I'm going to click on run SQL query
15:59
Let's see. Successfully returned 10 records in 2045 milliseconds. I'm going to scroll down
16:06
I can see the raw data. I'm going to choose the formatted. And there we go
16:10
So we can see the data. So this is super amazing. So this is the end-to-end project on how to create S3 buckets
16:19
how we can create an access key and secret key, and how we can read data from the Fabric Notebook
16:25
And of course, how we can query the data in the S3 query window
16:29
I trust you enjoyed this video. If you do, like, share, comment, and see you in the new year
16:36
Thank you and bye for now. Cheers
#Cloud Storage
#File Sharing & Hosting
#Web Services