Data Transformation Using Azure Data Factory by Manoj Mittal || Azure Virtual Conference
Nov 16, 2023
Conference Website: https://www.2020twenty.net/azure C# Corner - Global Community for Software and Data Developers https://www.c-sharpcorner.com
0:00
So, hey everyone, and thanks for joining me this evening, wherever you are in the world
0:07
So, hey, good evening, good night, and good morning. I'm here to talk about Azure Data Factory
0:14
It's quite a popular topic within Azure. In the world of data, in the world of big data, we always talk about the data
0:23
And data exists in multiple forms: it may be organized in a relational format
0:30
or in a non-relational format, or come from many other systems
0:36
However, on its own, raw data doesn't have proper meaning, doesn't have
0:42
any context, and doesn't offer meaningful insights to the analyst, the data scientist, or
0:47
the business decision maker. So once big data is coming from multiple sources, we always require
0:54
a service that can orchestrate and operationalize the process of refining these enormous stores of raw data
1:04
into actionable business insights. That's why Azure Data Factory, which I'm going to talk about,
1:12
plays a very vital role: it's a cloud service built for complex hybrid scenarios
1:19
This is a little bit about me. I'm a C# Corner MVP holding a couple of certifications
1:26
These are my blog site, YouTube channel, Twitter, LinkedIn, and at the end, my C# Corner profile
1:32
I'm quite active on C# Corner. You can follow me there
1:37
and comment on the articles if you need any help. I'm always available for that
1:41
I'm working at Mindtree Limited in Bangalore as a senior technical architect
1:46
Thanks for that. So now, back to the topic: Azure Data Factory
1:51
So what is Azure Data Factory, basically? Data Factory, as you can see on the screen,
1:56
is a cloud-based data integration service used to compose data storage, data movement,
2:04
and data processing services into automated data pipelines. So when we talk about what Data Factory actually is:
2:12
Data Factory is a cloud-based PaaS, a platform-as-a-service solution. We don't need to worry about the servers,
2:20
about scalability, or about patching; it's serverless. Whatever components we are using,
2:26
and however much we are consuming, that's what we pay for. Second, we say it's a data integration service:
2:32
it definitely allows us to integrate multiple sources and provides a common platform
2:38
where we can bring multiple sources of data together in a single place
2:42
And then orchestration and automation: it has 80-plus connectors, and pipelines can be automated
2:48
whenever required. Then data transformation: once we get the data,
2:53
we have the ability to transform it into insights using Databricks, Azure Functions, and other data transformation tools
3:02
So when we talk about Azure Data Factory: in the past, you'll have seen the SSIS packages
3:08
that people were using in on-premises systems; Azure Data Factory is the cloud-based service
3:14
So here the question comes: what is ETL? When we talk about SSIS,
3:19
the terminology ETL and ELT comes up. What are both, and what's the difference between them?
3:26
ETL is extract, transform, and load. First we extract, then transform,
3:33
and then load the data. Or ELT, where we say extract, load, and transform,
3:38
as you see in this section. So with ETL, we first transform the data using some formulas,
3:44
filters, and lookups before loading the data. Or, with ELT, we load the data
3:50
into the data warehouse first, and then we have the ability to change the query on the fly
3:56
and read and transform the data as per our needs
3:59
That is ELT. Which one we use, ETL or ELT, depends on the business scenario
4:06
So here's a very high-level comparison of SSIS and ADF. At a very high level, a 30,000-foot view:
4:12
SSIS is the on-premises solution which was used earlier
4:19
Now that we're talking about the cloud, we have ADF. It has much higher scalability
4:24
It has cloud and SaaS connectors. It has event-based triggers
4:28
We can run SSIS packages with ADF also. If we have connectivity
4:34
with the on-premises environment, we can run them through Azure Data Factory
4:39
Now we'll talk about Data Factory considerations. There are basically two versions
4:47
of Data Factory: V2 and V1. V1 was the first; the current V2
4:52
is quite an improved version compared to V1. And then we have the build options
4:57
There are a lot of options for authoring and configuring from code
5:02
in this section: we have PowerShell, .NET, Python, REST, the web UI, and ARM templates
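(Not in the original talk: since Python is one of the authoring options just mentioned, here is a minimal sketch of creating a Data Factory management client with the azure-mgmt-datafactory package; the subscription ID and resource group name are placeholders, not values from the demo.)

# Sketch only: assumes `pip install azure-identity azure-mgmt-datafactory`
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

subscription_id = "<your-subscription-id>"            # placeholder
credential = DefaultAzureCredential()                 # picks up az login / environment credentials
adf_client = DataFactoryManagementClient(credential, subscription_id)

# Quick sanity check: list the factories in a resource group
for factory in adf_client.factories.list_by_resource_group("AVC04"):
    print(factory.name, factory.location)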
5:09
It's highly integrated with DevOps and with Key Vault, where we store our credentials
5:14
in a secure manner. Monitoring: we can monitor all our pipelines and our integration runtimes,
5:20
and we can automate those pieces. No data storage: this is a key factor.
5:25
Azure Data Factory doesn't store anything within itself. It just processes the data and moves it from one location
5:32
to another required location where we need the actual processing. It has its own security standards:
5:38
HTTPS or TLS wherever possible. So let's talk about the main Data Factory elements here
5:46
I'll just draw a simple diagram to make it easy to understand
5:51
If you see here, this is just a toy, and this toy has two components
5:56
we need to integrate: this is the buzzer and this is the battery section
6:00
To assemble it, the jobs run one by one
6:05
First, we fit the battery into the toy, and then in the next job
6:09
we fit the buzzer into the toy as well. Each job is done by one particular person,
6:15
which is the activity here, right? So this is called an activity
6:19
So in short, what is a pipeline? This terminology is very important,
6:25
because in a couple of minutes, when I go to the demo, it's very useful to understand what a pipeline is
6:31
and what an activity is. A combination of activities is called a pipeline here,
6:36
and then there's the integration runtime: the environment where the jobs run is called the integration runtime,
6:42
and the trigger defines when my job is going to run. Okay, and here are the linked services
6:47
In the next slide I will talk more about
6:52
what a linked service is and why it's required. Basically, once my job runs,
6:57
it's going to connect with some service. If I'm taking the data from blob storage,
7:00
I need to connect to the blob via a connection string. If I need to push the data into SQL,
7:05
I need to connect with SQL via a connection string. Those connections
7:09
are the linked services. And the data we are working on, the description of what
7:14
we need to consume and produce, is called a dataset here. The entire process that comes together here,
7:19
the datasets, pipelines, activities, and linked services, is called Azure Data Factory
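(An aside, not from the talk: Data Factory stores each of these elements as a JSON definition. The simplified, hypothetical shapes below, written as Python dicts, are only meant to make the relationships concrete: a dataset references a linked service, and a pipeline's activities reference datasets.)

# Hypothetical, trimmed-down shapes; not complete ADF definitions
linked_service = {
    "name": "BlobFlightDataLinkedService",
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {"connectionString": "<storage-connection-string>"},
    },
}

dataset = {
    "name": "BlobFlightData",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {"referenceName": "BlobFlightDataLinkedService",
                              "type": "LinkedServiceReference"},
    },
}

pipeline = {
    "name": "BlobToSQL",
    "properties": {
        "activities": [  # a pipeline is a logical group of activities
            {"name": "CopyFlightData", "type": "Copy",
             "inputs":  [{"referenceName": "BlobFlightData", "type": "DatasetReference"}],
             "outputs": [{"referenceName": "SQLFlightData", "type": "DatasetReference"}]},
        ],
    },
}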
7:26
So we'll go through all these things one more time to get a clearer picture. We have a pipeline, which is a logical group of activities; those activities consume datasets and then produce results
7:41
Those datasets describe where my data exists, across multiple
7:48
stores, which may be blob storage or a bucket where it is kept, and they connect
7:53
to those services through what are called linked services, and my activities run on top of
7:57
those linked service configurations. So, let's take a real-time example. What are the Data Factory elements and their relationships?
8:07
Okay, so here we're going to talk about Azure Data Factory. First, whenever we define
8:12
the structure in Azure Data Factory, a pipeline comes into the picture
8:17
In the pipeline, we need to define the integration runtime. Under the integration runtime,
8:23
I will just take two components here
8:27
I will store my data in the backend, data which is coming from multiple sources
8:32
as other systems dump it into blob storage. Then I need to move the data
8:38
from the blob storage to the SQL server. That is the job of Azure Data Factory
8:42
We could write code to do all of this activity, but code is not feasible when I'm doing
8:47
this on a very large scale; it's error-prone. So this is, we can say, cloud-based
8:52
ETL as a service. So the blob is going to connect through its linked service
8:58
Similarly, SQL is also connected with its own linked service. And once it's connected to the linked service,
9:04
that becomes part of the dataset. Similarly, SQL has a linked service
9:09
and is part of a dataset. So we have one section where my data exists
9:13
and another section where the data needs to go, and then the activity comes in between
9:18
So that is called the activity. There are multiple activity types available,
9:22
depending on what we need to do with the data. In this scenario, we need to take the data and load it into SQL Server
9:28
using the copy activity. So let's take a scenario before going to the demo
9:33
For example, if you talk about the aerospace industry, the
9:39
industry has a lot of data. And the airport authorities
9:44
or the airline authorities want to process that data: what time a flight is scheduled,
9:49
what's the status of the flight, how many passengers are there, what's the origin,
9:53
what's the destination. Each airline has that type of data in
9:57
its own storage, and the airlines or the airport authority want to process it
10:02
In that scenario, close to real time, it is very helpful to take the data
10:06
from the multiple sources, say blob storage or S3 buckets; a lot of stores are there, and we can just consolidate it
10:12
into one data warehouse, do the processing, connect it to Power BI, and generate the reports
10:19
So that's all for this part. I'll quickly jump to the demo and we'll see
10:24
Yep, so I log in to portal.azure.com. Here I will create the entire
10:32
structure from scratch. First I will create a storage account. I click
10:40
on storage account, then click on create. Here I will create a new resource group; AzureVirtualConference
10:55
04 is my resource group. I give the name AzureVirtualConference04BlobStorage. Yep
11:09
So here it is. I am okay with the other configurations;
11:14
I will just take the defaults for the other settings, and click on create
11:18
So let it create the storage account where I'm going to dump a CSV file
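(Not shown in the talk: the same resource group and storage account could also be created from Python with the management SDKs. A rough sketch, assuming azure-mgmt-resource and azure-mgmt-storage are installed; the region and names are placeholders.)

# Sketch: create the resource group and the storage account programmatically
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient
from azure.mgmt.storage import StorageManagementClient

credential = DefaultAzureCredential()
subscription_id = "<your-subscription-id>"

resource_client = ResourceManagementClient(credential, subscription_id)
resource_client.resource_groups.create_or_update("AVC04", {"location": "eastus"})

storage_client = StorageManagementClient(credential, subscription_id)
poller = storage_client.storage_accounts.begin_create(
    "AVC04",
    "avc04blobstorage",   # storage account names must be lowercase and globally unique
    {"location": "eastus", "kind": "StorageV2", "sku": {"name": "Standard_LRS"}},
)
print(poller.result().provisioning_state)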
11:22
Next, I'm going to create the Azure Data Factory here. So here is Data Factory;
11:30
I'll select Data Factory and click on create. I'll select the same resource group
11:42
which I used here, and I will give the name AVC04AzureDataFactory, then go to the next step. GitHub I will just not connect for this demo purpose; I will go straight ahead
12:00
I will go next, I am okay with the other configuration, review, and just click on create
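(Again only as an aside: the equivalent Data Factory creation with the Python SDK would look roughly like this, reusing the placeholder resource group and a hypothetical factory name.)

# Sketch: create the Data Factory from Python instead of the portal
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import Factory

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<your-subscription-id>")
df = adf_client.factories.create_or_update(
    "AVC04", "avc04azuredatafactory", Factory(location="eastus"))
print(df.provisioning_state)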
12:07
So the Data Factory object is also created. Data Factory is the intermediate layer. Now I need to create a SQL server as well. I will go back to the portal home section, create a resource, and choose SQL Database
12:27
Click on create. Here I will select my resource group, where everything should go
12:39
I will keep the name AVC04AzureSQLDatabase. I will put in the name of the new server
12:53
here, avc04azuresqldatabaseserver. Yep, I'll set up the admin login here, something like saadmin,
13:08
and I'll create a password for it. So the password is set; click on OK
13:23
So the server gets created. Now for compute and storage, I don't need that much storage;
13:29
I will just reduce it. For this demo, I need only the minimum
13:34
So I click and look for the Standard tier. I'm okay with this cost
13:38
For this SQL database, the cost comes to almost 991 INR;
13:43
I'm okay with that. Apply. And I'll just go to the next step
13:48
Just check: okay, public endpoint. For production purposes, we should not use the public endpoint;
13:53
we should define only the IP addresses which are required, but to save time here
13:58
I'm just using the public endpoint. Review and create. Yep
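(Not part of the demo steps: a rough sketch of provisioning the same logical SQL server and a Standard-tier database with the azure-mgmt-sql package. The login, password, and names are placeholders, and as noted, a production server should be restricted by firewall rules rather than left open on the public endpoint.)

# Sketch: create the Azure SQL server and a Standard-tier database
from azure.identity import DefaultAzureCredential
from azure.mgmt.sql import SqlManagementClient

sql_client = SqlManagementClient(DefaultAzureCredential(), "<your-subscription-id>")

sql_client.servers.begin_create_or_update(
    "AVC04", "avc04azuresqldatabaseserver",
    {"location": "eastus",
     "administrator_login": "saadmin",
     "administrator_login_password": "<strong-password>"},
).result()

sql_client.databases.begin_create_or_update(
    "AVC04", "avc04azuresqldatabaseserver", "avc04azuresqldatabase",
    {"location": "eastus", "sku": {"name": "S0", "tier": "Standard"}},
).result()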
14:04
So what did I do? I just created a blob storage, I created an Azure Data Factory,
14:10
and I created a SQL server. In the blob storage I created, I will dump the file;
14:15
I'm just assuming multiple sources are dumping files into that blob storage
14:19
I will take the file, process it in Azure Data Factory, extract the data, and move it to the SQL server. So that is the whole theme. So let it create
14:27
In the meantime I will go to my resources and just check the blob storage. I will go to the
14:34
home, then my resource group. So let it come. AVC04 is the resource group which I created. So the blob
14:48
storage has already come up, the data factory too, and the SQL server is in the provisioning stage. I will go to the
14:53
blob storage. In the blob storage I will create one container and just give the
15:01
container the name input, where I manually place the file for now. I will just click create. Yep, it's created. So for the demo purpose I already have a file which has some data
15:17
I will take this file and upload it here. So just click on upload
15:24
Select the file. Yep, select the flight data. Upload
15:35
So this file gets uploaded; this is my raw data, which would be coming to me from multiple sources like that
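(For completeness, not from the talk: the container creation and CSV upload can also be scripted with the azure-storage-blob package; the connection string and file name are placeholders.)

# Sketch: create the "input" container and upload the flight data CSV
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<storage-connection-string>")
container = service.create_container("input")          # the container used in the demo

with open("FlightData.csv", "rb") as data:              # local file name is a placeholder
    container.upload_blob(name="FlightData.csv", data=data, overwrite=True)

print([blob.name for blob in container.list_blobs()])   # confirm the upload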
15:41
How can we check what content exists here? I'll just click on edit and look at the preview
15:53
Yeah, so this is the data. We have the IDs, the flight names, the from and to,
15:57
the airport codes, the countries, and all such data
16:03
We have almost 44 records, so that's pretty good; I got the data here. Now let's move
16:10
to the SQL server and create the table structure. Let's move to
16:19
the resource group. Yep, let it come; it's still provisioning. Has the server come up?
16:33
It's taking some time to come up. Okay. Let me refresh one more time
16:40
The SQL server should come up here. Yep, just waiting for it to load. Yes
16:53
So the SQL database has come up here. I select it
16:58
Here we can go to the query editor and see what structures we have in the database
17:05
It just asks for the credentials, so I'll use SQL server authentication;
17:10
I will use this login and password, and click OK. Yep, so I can see what structures I have
17:20
As of now, nothing is there. So I'll create one blank table as per the schema
17:25
that I have. Let me go to my script; I have that script ready,
17:30
the query I have already prepared. I will just take that query,
17:34
go back, and run it. So now my blank table will be there
17:42
in the SQL server. All right. Yep, no data will be there; it's just a blank table we have
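(The actual CREATE TABLE script isn't shown on screen; here is a hypothetical version of the blank table, with column names guessed from the mapping discussed later in the demo, executed over pyodbc. Server, database, and credentials are placeholders.)

# Sketch: create a blank flight-data table; this schema is a guess, not the speaker's script
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=tcp:avc04azuresqldatabaseserver.database.windows.net,1433;"
    "Database=avc04azuresqldatabase;Uid=saadmin;Pwd=<strong-password>;Encrypt=yes;")

conn.execute("""
CREATE TABLE dbo.FlightData (
    Id              INT,
    FlightName      NVARCHAR(100),
    FromAirportCode NVARCHAR(10),
    FromCountry     NVARCHAR(50),
    ToAirportCode   NVARCHAR(10),
    ToCountry       NVARCHAR(50),
    Status          NVARCHAR(50)
)""")
conn.commit()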
17:50
Now we move to the Azure Data Factory, and our actual work will start there
17:55
I'll go to the resource group, then Azure Data Factory. So, once I click into the Azure Data Factory,
18:04
this is the interface where we land. And here we have options like Author & Monitor
18:09
Once I click Author & Monitor, it will go to adf.azure.com
18:15
The Azure Data Factory authoring interface will come up here. And here we can define our own pipelines,
18:21
activities, and datasets. So let it load. We'll configure the blob and the SQL connections
18:27
and the integration between both here. Yep, so once you land on this page, you'll see this page documents all the components very well,
18:43
from Microsoft: what Azure Data Factory is and why we'd use all the different activities.
18:48
Just go through it; this documentation is very useful. And they have laid out the entire process itself: we can create a pipeline, we can create
18:55
a data flow, we can create a pipeline from an existing template, and we can store
18:59
the process we are going to create as a template. We can use activities like Copy Data
19:04
We can configure SSIS integration from on-premises as well as what we are doing
19:09
in Azure Data Factory in the cloud, and we can add a GitHub repository also
19:13
So I will just click here and go to the actual section
19:17
where we can define things. So these are my factory resources. To start with, we need to define a pipeline
19:23
For the pipeline, I right-click and choose new pipeline
19:27
So here it comes. I'll just give it a name. We need to move the data from the blob to SQL,
19:32
so I will just give it the name BlobToSQL here, just a name
19:38
And for the time being, I will just place one activity, a Wait; it will just wait,
19:42
so the pipeline doesn't throw an error while I'm doing any other setup. So I've got a placeholder activity,
19:47
a dummy Wait for now. Now let's move to the datasets and create a dataset
19:53
which will connect with our blob storage. So I'll just select Azure Blob Storage here, click continue
20:02
Then, since I'm talking to blob storage, what type of content do we have in blob storage?
20:06
I have a CSV file. So we can define what type of content we are getting
20:10
from the multiple sources, and that format we can select here
20:14
So I have a CSV file; I selected the delimited text format. I'll click on continue
20:19
So I can keep the name BlobFlightData. Then I can select the linked service;
20:28
I click on new. For the linked service, there are multiple options to create a linked service;
20:33
it's just a connection string which we are creating here. So I'll give a name, Blob
20:38
FlightDataLinkedService. That's just for the name's sake; okay, it should follow a proper naming convention
20:47
so I can understand it later. Yep, I'll just keep the authentication as account key
20:51
Basically, for production purposes, if we are going to use this, we need to use managed
20:55
identity, which is the recommended one, but for this demo I use the account key. And I will select my subscription
21:01
and then I select my storage account, which is the 04 blob storage which I created, and I just test the connection
21:11
If everything goes smoothly it will show success; otherwise we need to correct the error
21:17
It succeeds, so I click on create. So let it create the connection
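(A rough SDK equivalent of this step, not shown in the talk: registering the blob linked service with the account-key connection string, reusing the adf_client from the earlier sketch. For production, as mentioned, managed identity is the recommended authentication.)

# Sketch: create the blob storage linked service via the SDK
from azure.mgmt.datafactory.models import (
    LinkedServiceResource, AzureBlobStorageLinkedService, SecureString)

blob_ls = LinkedServiceResource(properties=AzureBlobStorageLinkedService(
    connection_string=SecureString(value="<storage-connection-string>")))  # account-key auth

# adf_client: the DataFactoryManagementClient created earlier
adf_client.linked_services.create_or_update(
    "AVC04", "avc04azuredatafactory", "BlobFlightDataLinkedService", blob_ls)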
21:25
So once it's created, we've created a linked service. Now we need to give the file path where our file exists
21:32
Here we are just hardcoding this particular file name, but we can parameterize it
21:37
and take it as a parameter, and change it in an actual project
21:43
The first row that is coming in, we can select as a header. Click OK
21:47
These settings we can change at a later point in time also
21:53
We have those configuration settings here too. So this is the one dataset we created,
21:58
which is connected with the blob storage
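(Likewise a sketch, not from the talk, of how that CSV dataset could be defined through the SDK: a delimited-text dataset pointing at the input container, with the first row treated as a header.)

# Sketch: the delimited-text dataset over the uploaded CSV
from azure.mgmt.datafactory.models import (
    DatasetResource, DelimitedTextDataset, AzureBlobStorageLocation, LinkedServiceReference)

blob_ds = DatasetResource(properties=DelimitedTextDataset(
    linked_service_name=LinkedServiceReference(
        type="LinkedServiceReference", reference_name="BlobFlightDataLinkedService"),
    location=AzureBlobStorageLocation(container="input", file_name="FlightData.csv"),
    column_delimiter=",",
    first_row_as_header=True))

adf_client.datasets.create_or_update(
    "AVC04", "avc04azuredatafactory", "BlobFlightData", blob_ds)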
22:04
Now we will connect with SQL in a similar fashion. For SQL, I'll just go to the Azure tab and select Azure SQL Database
22:11
Here I'm creating the SQLFlightData dataset, and I create a linked service here as well, which is going to connect with SQL. So I will
22:21
give it the name SQLFlightDataLinkedService
22:32
For the connection, I will select my subscription. Let the other things be as they are:
22:37
the server name is the 04 server we have, the database name is the 04 database, and for authentication I will use the SQL login I created; saadmin is my user name, plus my password
22:52
Okay, we just test this one. So this dataset is going to connect with the SQL database
23:01
It is successfully connected. Click on create. So now we have two datasets and one pipeline, as per the
23:09
diagram we talked about earlier. We have this section in place, we have this section in place;
23:14
now what's left over is that we need to create an activity which is going to connect both
23:18
Okay, so coming back here to the table name: the blank table we have created, we need to select
23:26
as the table name. I clicked on OK. So now this section is good to go
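(And the SQL side as a sketch: the Azure SQL linked service using the SQL login, plus a dataset pointing at the blank table created earlier, again reusing the earlier adf_client. The connection string, table name, and resource names are demo-style placeholders.)

# Sketch: Azure SQL linked service and table dataset
from azure.mgmt.datafactory.models import (
    LinkedServiceResource, AzureSqlDatabaseLinkedService, SecureString,
    DatasetResource, AzureSqlTableDataset, LinkedServiceReference)

sql_conn = ("Server=tcp:avc04azuresqldatabaseserver.database.windows.net,1433;"
            "Database=avc04azuresqldatabase;User ID=saadmin;Password=<strong-password>;")

sql_ls = LinkedServiceResource(properties=AzureSqlDatabaseLinkedService(
    connection_string=SecureString(value=sql_conn)))
adf_client.linked_services.create_or_update(
    "AVC04", "avc04azuredatafactory", "SQLFlightDataLinkedService", sql_ls)

sql_ds = DatasetResource(properties=AzureSqlTableDataset(
    linked_service_name=LinkedServiceReference(
        type="LinkedServiceReference", reference_name="SQLFlightDataLinkedService"),
    table_name="dbo.FlightData"))                       # the blank table created earlier
adf_client.datasets.create_or_update(
    "AVC04", "avc04azuredatafactory", "SQLFlightData", sql_ds)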
23:35
Now we'll go back to the pipeline, and in the pipeline I will select the activity which is Copy Data. I'll just drag and
23:43
drop it here, and the Wait activity is no longer required, so I will just right-click and delete it. Yep, so
23:50
here, once I select this one, it just comes up with the settings and configurations we need to do:
23:57
what is my source and what is my destination. So I go to the source; the source is my blob storage,
24:03
where my data exists. Okay. And the sink, which is my destination
24:09
where the data is going to go, which is SQL. So let it come
24:14
So here we have a lot of options. If we need to use stored procedures,
24:19
we can; and we can define a pre-copy script for when any processing is happening from the blob to SQL
24:24
So if I need to do any pre-processing, we can do it here with scripting
24:28
Now going to the mapping. What this section says is: whatever columns we have,
24:33
they may not be the same in the blob storage as in SQL. So we need to do the mapping between the columns
24:38
As you can see on my screen, my columns are also not the same. So here a few columns are the same and are mapped,
24:44
but some of the columns are not mapped. Here I have a destination column
24:48
but the source doesn't map. So I will select those particular source columns which I feel are appropriate
24:53
"From" is this one. So here is the from airport code; here I have the from country code
25:02
So similarly for the other columns, for the destination it's the "To" fields: the to airport code and the to country. Yep. And then finally the status: here I have FlightStatus, but there I don't have a column with the name FlightStatus, so I will select Status. So I'm good to go. There is a warning; I will just leave it as it is for now. Yep
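(A sketch of the finished pipeline as the SDK would express it, not the speaker's exact configuration: one Copy activity from the CSV source to the SQL sink, with an illustrative subset of the column mappings. The pre-copy script is left commented out here because it only gets added later in the demo.)

# Sketch: the BlobToSQL pipeline with a single Copy activity
from azure.mgmt.datafactory.models import (
    PipelineResource, CopyActivity, DatasetReference,
    DelimitedTextSource, AzureSqlSink, TabularTranslator)

copy = CopyActivity(
    name="CopyFlightData",
    inputs=[DatasetReference(type="DatasetReference", reference_name="BlobFlightData")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="SQLFlightData")],
    source=DelimitedTextSource(),
    sink=AzureSqlSink(),
    # sink=AzureSqlSink(pre_copy_script="DELETE FROM dbo.FlightData"),  # added later in the demo
    translator=TabularTranslator(mappings=[              # illustrative subset of the mapping
        {"source": {"name": "Id"},           "sink": {"name": "Id"}},
        {"source": {"name": "FlightStatus"}, "sink": {"name": "Status"}},
    ]))

pipeline = PipelineResource(activities=[copy])
adf_client.pipelines.create_or_update(
    "AVC04", "avc04azuredatafactory", "BlobToSQL", pipeline)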
25:27
So this section is done. So what have we done? We connected with the blob storage,
25:32
we connected with SQL, and we set up the mapping. So let's validate this section
25:38
Yep, so our pipeline has been validated and no errors were found. So we are good to go
25:42
Before publishing all of this, it's better to debug the entire process
25:47
we have defined and check whether it works fine or not, or whether any error exists
25:52
So while we are debugging, our pipeline is running. So it's in progress
26:04
So let's wait for a couple of seconds and it will show the status, whether it succeeded or not
26:11
Yep, it succeeded, we got the message. It succeeded when we did the debugging
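(Programmatically, a similar run and status check could be done like this, reusing the earlier client; a sketch that polls once instead of in a loop.)

# Sketch: trigger a run of the pipeline and check its status
import time

run = adf_client.pipelines.create_run(
    "AVC04", "avc04azuredatafactory", "BlobToSQL", parameters={})

time.sleep(30)                                           # give the copy a moment to finish
status = adf_client.pipeline_runs.get("AVC04", "avc04azuredatafactory", run.run_id)
print(status.status)                                     # Queued, InProgress, Succeeded, or Failed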
26:16
So let's go to the Azure SQL query editor and just check, before publishing, whether we really got the data
26:21
or not. I'll just select the query and click on run
26:27
Yep, so it's that easy. So I didn't write a single line of code;
26:32
using Azure Data Factory, a very vital tool, I got all the data into my SQL database
26:37
If you want to see what data came here, we can just check how many records we were having
26:43
in the blob storage and what came here. Yep, all 44 records came here
26:50
So now we are good to go and we can publish all this activity
26:56
Apart from that, with Azure Data Factory
26:59
we can do a lot of other things. Like, once we did the Publish All, we can define notifications
27:05
for when my job succeeds or fails; accordingly, the target audience will get notified,
27:12
and we can monitor when our job has been published and it triggers as per the schedule
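(Not demonstrated on screen, but for reference: a schedule trigger could be attached to the published pipeline roughly like this; the daily recurrence and the trigger name are assumptions, not values from the talk.)

# Sketch: a daily schedule trigger for the published pipeline
from datetime import datetime, timedelta
from azure.mgmt.datafactory.models import (
    TriggerResource, ScheduleTrigger, ScheduleTriggerRecurrence,
    TriggerPipelineReference, PipelineReference)

recurrence = ScheduleTriggerRecurrence(
    frequency="Day", interval=1,
    start_time=datetime.utcnow() + timedelta(minutes=5), time_zone="UTC")

trigger = TriggerResource(properties=ScheduleTrigger(
    recurrence=recurrence,
    pipelines=[TriggerPipelineReference(
        pipeline_reference=PipelineReference(
            type="PipelineReference", reference_name="BlobToSQL"))]))

adf_client.triggers.create_or_update(
    "AVC04", "avc04azuredatafactory", "DailyBlobToSQL", trigger)
adf_client.triggers.begin_start("AVC04", "avc04azuredatafactory", "DailyBlobToSQL").result()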
27:18
So we can monitor all the jobs and we can see what status is coming back,
27:22
which job is running, how long each job took,
27:25
which trigger ran, and which integration runtime was used. So all the activities will be listed down here
27:31
It's quite a rich dashboard we can get here. So as of now, it's blank because I just created everything
27:36
and didn't run it via the published pipeline, right? So that means now, as I publish, I will go back again
27:44
and let it run one more time. Yep, let me run it
27:56
So it's running again. The agenda for running it again, why I'm showing this:
28:04
I didn't write any of the pre-copy or pre-validation scripts when I ran it
28:09
What is it doing? It's just taking the data from the blob and copying the data into SQL again, and data
28:13
will keep getting added. Earlier we had 44 records
28:18
Once my job finishes, it will show here. So it's still in queue
28:23
Once it's finished, as it succeeds, if I run it again, data will keep getting added here
28:29
So we can write scripts to validate the data. So I'll write one simple script
28:33
I just want to delete the data, whatever exists, and it will load the data from scratch
28:48
So it's quite simple. We'll just select our copy activity. And here in the sink section, not in the mapping,
28:58
yep, we can write the pre-copy script here. So it just says delete from this particular table,
29:03
our flight data table. I'll come up here and just run it
29:11
We'll just validate one more time and run it again
29:18
So let it run. Now we'll see what data and how many records we will get this time
29:32
It's queued again. It will only take a couple more seconds
29:32
Yep, it succeeds. I will go back to my query and I will check how many records we have
29:39
Hopefully we should have 44. Yep, here we go. So we have 44 records
29:43
Now, whatever records were there earlier get cleared first. So we can do the processing
29:47
before loading the data into SQL. So this is more about how we can transform the data
29:52
before doing the loading here. Yep. So that's all from my side for this session
29:57
So as a recap, what we did: we just took the data from an external source,
30:02
that is, blob storage, did the processing, and loaded the data into Azure SQL
30:07
via Azure Data Factory, an ETL-as-a-service process. Thank you for joining me