Data Migration using Azure Data Factory by Abhishek Mishra
Nov 1, 2023
0:00
Let me introduce myself. This is my LinkedIn URL; you can get in touch with me there
0:18
I have 17-plus years of experience working as a cloud architect
0:24
Currently I am leading the Azure practice for a healthcare organization. I have built the practice up to around 2,000 folks
0:35
out of which around 200 to 300 SMEs report directly to me
0:42
I'm an author; I have published three books with BPB Publications and two with Apress on Azure
0:50
Even this use case is there in my book, Hands-on Azure Data Platform; you'll find it there
0:57
I'm doing my PhD as a research scholar, on digital transformation, and I'm an alumnus of Utkal University, Bhubaneswar, and the Indian School
1:07
of Business, Hyderabad. So let's get started. So this is the use case
1:15
This is the agenda: I'll start with the use case
1:19
So I'll explain what the use case was, and what the need for doing this was
1:23
Because if there isn't a need, you won't be building a solution, right? You
1:29
won't be doing an implementation. So I will explain the use case
1:34
What was it? What did the customer need? What was the need for it
1:38
And what was the design that we proposed, or my team proposed
1:42
And how did we implement it? I'll show you. I'll open the Azure portal
1:46
And I will implement that part. A gist of it; it's a pretty big story
1:52
I'll implement a snapshot of it and show it to you. OK
1:57
So this is the use case. So like the customer, it was a big US healthcare giant
2:06
like they're the providers, so I'm not going to call out the name
2:10
And they've done a lot of implementations and their entire platform, data platform was based on Azure SQL
2:18
And you know, nowadays business growth often comes through the merger
2:25
and acquisition (M&A) route; we try to merge with or acquire new companies
2:30
So this company, what it did, it acquired another company, which is a big data lake player
2:39
So they had a big data lake platform. A lot of analytics was running on that
2:44
And the use case was how we can move the entire platform
2:50
the existing platform, the Azure SQL database, to this data lake platform
2:56
How can we seamlessly move it? We can't shut it down overnight
3:01
So what was the solution? The ask was that both systems would be there
3:05
They would exist side by side. One system would serve the customers and the other system would serve the existing old customers
3:17
And gradually the movement would happen, the migration would happen. All the data in the Azure SQL database, or the data warehouse, would move to Azure Data Lake
3:27
because the new system was on Azure Data Lake. So that was a use case that was put forward by the customer
3:35
And this is the design we put forward. But before I get into the design, I have a few questions
3:44
so like I spoke about Azure SQL database and Azure Data Lake
3:50
How many of you know Azure SQL database? What it is? Cool
3:56
And Azure data lake? Okay. Okay, cool. We have some audience who know it
4:03
Azure SQL database is a relational database. Okay? So you understand relational database, right
4:10
All of you? We can have relationships defined, foreign keys, primary keys, all those things
4:17
But when it comes to data lake, it's a mess, right? So you can bring in anything, you can dump anything in that
4:25
And later we do an ELT or an ETL based on our requirement: extract the data, load it, transform it, or extract, transform, and load. Whatever operations we do. That is what the analytics world talks about, that ETL or ELT. Okay
4:45
So is it sort of unstructured data? Yes, correct. It's very much unstructured data. It can be any data. It can be,
4:55
like maybe some random JSON files, some images, anything, any data you can think of
5:04
like maybe some IoT sensors generating some wild data. So anything can live in the data lake
5:15
And Azure has its own flavor of a data lake,
5:21
that is Azure Data Lake Storage Gen2, and it's based on the Azure storage account
5:29
So I'll show you how to create a data lake in Azure in some time. The solution that we proposed was that we would move all the tables, all the data in the tables, each of the tables
5:40
There would be thousands of tables, right? Table one, table two, table three. We would convert each of the tables into JSON files
5:47
and dump those JSON files in the data lake. So that was the solution that we targeted
5:55
and we built it. In the flowchart you can see, the first step is that I need to get all the table names, right
6:02
What are the tables that are there in the existing system, in the existing database
6:07
I'll iterate through each of the tables, convert each of them into JSON data
6:12
and then put it in the data lake. So that was a strategy that we thought of
6:16
But again, let me highlight this: the solution here is just a gist of what we did
6:22
In reality, it was a big $30 million deal for us, and there were a lot of other players involved; a lot of vendors
6:30
were also working with us, and it was a big story. It's just a gist of it, the flowchart
6:36
that you see. So whatever solution design it is here, we will try to implement that thing
6:42
now in some time. For that, we will use the Azure Data Factory. How many of you have worked on
6:47
Azure Data Factory before? Very good. Cool. Don't worry if you don't know Azure Data Factory;
6:57
there, you are actually going to build some kind
7:01
of pipelines and orchestration that will help you move data from one place to another
7:06
or perform some ETL operations or ELT operations. Some kind of operations on data;
7:12
you can play around with the data, you can transform the data, you can shift the data from one place to another
7:18
So all these activities you can do using Azure Data Factory, and it's pretty rich
7:23
And now it is integrating with Synapse; that's a long story, let me not bring it in here
7:29
So what we are going to do is use Azure Data Factory to look up all the tables in the database
7:36
Then we will iterate through each of those tables, pick each table one by one, copy the data
7:43
and ingest it into Azure Data Lake. So that's what I'm going to implement in the next 15 to 20 minutes
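For orientation, the pipeline being described has roughly this shape when written as Data Factory pipeline JSON. This is only an outline sketch: the activity names are placeholders of my own, and the activity bodies are filled in as the demo proceeds below.

```json
{
  "name": "MigrateTablesToDataLake",
  "properties": {
    "activities": [
      { "name": "LookupTableList", "type": "Lookup" },
      {
        "name": "ForEachTable",
        "type": "ForEach",
        "dependsOn": [
          { "activity": "LookupTableList", "dependencyConditions": [ "Succeeded" ] }
        ],
        "typeProperties": {
          "items": { "value": "@activity('LookupTableList').output.value", "type": "Expression" },
          "activities": [ { "name": "CopyTableToCsv", "type": "Copy" } ]
        }
      }
    ]
  }
}
```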
7:53
Not many slides. But again, I'm at risk, my demo should work. Okay, so what I'm going to do is I'll
8:08
create a SQL database with some preloaded tables, the AdventureWorks sample, the default
8:16
one that we get; I'll restore that while creating it. Then I will create a data lake,
8:22
then I will move all the tables from the SQL database to the data lake
8:28
So let me create the SQL database. From the marketplace, I'll click on Azure SQL
8:49
I'll just create a single database; I'm not diving into what these terms are
8:58
I'll give the name of the database. I need to create a SQL server
9:07
Then authentication. I'll use SQL authentication. It's taking a while
9:43
For the workload I'll choose Development. What it will do is provision a serverless database for me
9:53
I'm not getting into what a serverless database is right now. And I'll keep locally redundant backup storage because I'm just demoing it
10:04
It's a development workload. Now most importantly, I'll enable public endpoint. I'll allow other Azure services to connect to it
10:15
I'll add the current client IP so that I'll be able to play around with the data
10:23
And yeah, that's it. Next security. Here, like, I'm going to use a sample
10:34
You can see here the AdventureWorksLT sample. This database will get restored by default
10:41
Okay? So I'll get a set of tables that I can actually port
10:46
I don't have my customer tables right away with me. Once the project is gone, the data is gone with them
10:52
So I'll use this. I'll click on Review + create. Am I making sense, guys? Are you all with me
11:03
All are with me? Okay, okay. Okay
11:16
While the database is getting created, what I'm going to do is
11:20
I will create the Azure Data Lake, where I am going to put the table data, like convert those
11:31
tables, the data in the tables, to JSON, one file per table. Not in JSON, actually, we'll do CSV;
11:38
sorry, we'll put it in CSV and dump it in the Azure Data Lake. So let me do that
11:44
The network is too slow, actually. Too slow, actually
11:54
Storage account
12:14
You guys know what a storage account is in Azure? It stores something
12:22
It stores either some table data or it can store some blobs or it can store some files. Okay
12:34
So let me create a storage account. The storage account is so powerful; maybe after the session you should go back and see what the storage account is
12:45
If you are planning to work in Azure, that is a fundamental. That is a fundamental thing that you should know
12:50
The storage account is so powerful that the entire Azure data lake runs on top of this
12:57
storage account. It's so robust. Earlier we had HDFS. How many of you understand the HDFS file system
13:05
You get it, right? So it used to be that HDFS thing
13:11
But now there's no HDFS in Gen 2, the new generation of storage account
13:17
It's all a storage account, on top of which we build the hierarchical namespace and enable the data lake
13:27
We'll do that. I'll name it, let's say, 2010. Okay, hopefully it's there
13:52
Then I'll make it LRS, locally redundant. I'm not getting into all these details here
13:59
Just remember, I don't want data redundancy. It's for a demo, so I'm doing this
14:05
On to the Advanced tab. Here, while creating a storage account, I can enable a hierarchical namespace. Okay
14:16
I have enabled the hierarchical namespace. The next thing I need is public access
14:22
And the rest of the things I don't need. So I'll just verify whether I enabled hierarchical namespace, yes
14:30
And in the Basics tab, I've given all the necessary stuff. So now I will do Review + create
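As a side note, if you script the same storage account instead of clicking through the portal, the hierarchical-namespace toggle corresponds to the isHnsEnabled property in the resource JSON. A minimal sketch, with the account name as a placeholder:

```json
{
  "type": "Microsoft.Storage/storageAccounts",
  "apiVersion": "2022-09-01",
  "name": "demodatalakeaccount",
  "location": "eastus",
  "sku": { "name": "Standard_LRS" },
  "kind": "StorageV2",
  "properties": {
    "isHnsEnabled": true
  }
}
```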
14:42
Now let me create it. OK. OK. While it's getting created, our SQL database is ready
14:50
Let me get into the SQL database. The query editor is just like SSMS; you use SSMS, right,
15:01
to connect to your databases, SQL databases. The same way we can use this
15:15
Okay. And you can see a bunch of tables out here, right
15:24
Correct. We can see a bunch of tables that came by default because I chose to have the sample database
15:30
while creating the Azure SQL database. Okay. Now the storage account is done
15:40
Let me get into this. Or rather, my data lake is done. Inside my data lake I'll just create a container; I'll call it
15:51
"destination". This is where your tables will come. Okay, all these tables,
15:58
whatever you see here, will get converted into CSV format and will get
16:03
dumped out here in this container. Okay, so let me bring in the data factory now, which
16:11
can actually move the data from this database to this container. I've already created the
16:29
data factory, to save some time and to give you some more time to have lunch, so
16:34
that we can wind up early. I see a lot of gloomy faces, okay, but that's fine, it's
16:40
lunchtime, I get it. So I'll just directly open the data factory, but again, it's simply a few clicks
16:50
You just need to search for the data factory in the marketplace, give a name for the data factory, do some clicks, and create it
17:01
Okay. Now, once my data factory is created,
17:09
I need to go to the data factory studio, where I can actually orchestrate the pipeline that will move the data, these tables, from here to the data lake, into this destination container
17:22
So let me do that. Okay. And if you are planning to work on Azure,
17:34
these are a few of the basic services that you should know. I know you need to know web apps, functions, all those things, but you should also have some glimpse of the data side
17:48
Okay. What is the first thing I need to do? I need to connect to both the storage account as well as the SQL database
17:57
Like from your code, how do you connect to a database? You write a connection string, right
18:02
Using a connection string, you connect to your database. In the same way, I'll create the linked services that will connect to the database as well as to the storage account
18:13
So let me create a linked service. Go to Azure. Let me first create it for Azure SQL
18:23
Continue. I'll name it DBCon, for database connection. Okay
18:34
I need to select my server, my database, and my password
18:51
Let me test the connection whether it's working or not. Now I will create it
19:04
Okay. Okay. Let me create another linked service
19:17
Now I am able to connect to the database. Now let me connect to the data lake
19:24
Okay. A Generation 2 data lake, that's what we created. So I'll name it maybe
19:34
DLCon, okay, data lake connection. I need to select the storage account and do a test connection. Okay
19:51
It's cool. Let me create it. Any queries, guys? Till now. Okay
20:04
when something is dangerous. But that's fine. Okay. So now I have created the linked services
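For reference, those two linked services boil down to JSON definitions along these lines. This is a minimal sketch: the connection string and URL are placeholders, and in practice you would pull secrets from Key Vault or use a managed identity.

```json
{
  "name": "DBCon",
  "properties": {
    "type": "AzureSqlDatabase",
    "typeProperties": {
      "connectionString": "Server=tcp:<server>.database.windows.net,1433;Database=<database>;User ID=<user>;Password=<password>;"
    }
  }
}
```

```json
{
  "name": "DLCon",
  "properties": {
    "type": "AzureBlobFS",
    "typeProperties": {
      "url": "https://<storageaccount>.dfs.core.windows.net",
      "accountKey": "<account-key>"
    }
  }
}
```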
20:11
What do I need to do next? I need to connect to the database, pull the table data, and dump it into the data lake
20:20
But again, in the entire process, I need to keep the data somewhere, right
20:26
I need a dataset to hold the table data. And I need a dataset
20:33
to which I can transfer this table data and put it in the data lake
20:39
So let me create that. For this you go to the Author tab
20:45
I'll create the dataset. First let me create it for the SQL database. Okay
21:01
Maybe I will call it DSDB. Okay. And this is the linked service I'm going to use
21:15
Okay. I'll click OK. Then I need to pass the parameters. Why
21:21
Because I need to iterate through all the tables that are there in the database
21:26
So I need to hold those table names somewhere. Where can I do this
21:33
So I will create two parameters: table name and table schema
21:47
Okay. Now click on Edit. Then I need to say that this is the table name; normally we access it as schema name dot table name. Okay, so my
22:06
database dataset is done. Now let me create a dataset for the data lake. I need to
22:23
dump the data in CSV format, right? So I need this DelimitedText type
22:32
Continue. I'll call it DSDL, the dataset for the data lake. The first row will be the header
22:43
And the linked service is, okay, the DLCon one. Okay. Then I will click OK
22:51
Okay, one more thing I need to do. I need to put my data in this destination, right
22:56
here, in this destination folder, so I need to select that, click OK, then
23:03
similarly I need to add a few more parameters here, to generate the file name
23:11
I need to specify what the file name should be, right? So I need to put it
23:17
somewhere, so I'll put it in this FileName parameter. Okay, and what next
23:26
In the connection I need to say... I missed it
23:37
Okay, destination: inside the destination container, it will keep it under this file name. Okay
23:46
All good till now. And the first row would be the header
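For reference, the two datasets configured above correspond to JSON definitions roughly like these. This is a sketch: the parameter names and the "destination" container follow the demo, and the rest is assumed.

```json
{
  "name": "DSDB",
  "properties": {
    "type": "AzureSqlTable",
    "linkedServiceName": { "referenceName": "DBCon", "type": "LinkedServiceReference" },
    "parameters": {
      "TableSchema": { "type": "string" },
      "TableName": { "type": "string" }
    },
    "typeProperties": {
      "schema": { "value": "@dataset().TableSchema", "type": "Expression" },
      "table": { "value": "@dataset().TableName", "type": "Expression" }
    }
  }
}
```

```json
{
  "name": "DSDL",
  "properties": {
    "type": "DelimitedText",
    "linkedServiceName": { "referenceName": "DLCon", "type": "LinkedServiceReference" },
    "parameters": {
      "FileName": { "type": "string" }
    },
    "typeProperties": {
      "location": {
        "type": "AzureBlobFSLocation",
        "fileSystem": "destination",
        "fileName": { "value": "@dataset().FileName", "type": "Expression" }
      },
      "columnDelimiter": ",",
      "firstRowAsHeader": true
    }
  }
}
```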
23:52
So, now my datasets are ready, my connections are ready. Now I am going to author my pipeline. I'll create
24:00
that pipeline that is going to transfer data from the database to the data lake
24:06
so let me do that. Okay, so if you go back to my deck, to the design, what is the
24:22
first thing? In the data factory you need a Lookup. This Lookup will connect to Azure SQL and get the list of all the tables that are there,
24:30
and pass it on to the Copy Data activity, and the Copy Data inside the ForEach loop will start putting in the data. Okay
24:38
So let me configure the Lookup. Okay. Now, very importantly, uncheck this "First row only" option
24:51
Okay, because we want the entire data from the table, not just the first row; I mean to say all the records from the query. Okay, now, the table name here I'll just put as N/A, because I'm going to populate it inside my Copy Data activity, so I'm just putting it as N/A. And here I'm going to write my query. I already have my query here; I'll just copy-paste it
25:19
I'll just show you what this query does. So what it will do is it will return me all the tables actually
25:34
So if you see here, you can see all the tables out here in that database
25:39
It's giving me a comprehensive list of all the tables that are available
25:46
Okay. So let me use this query here so that I can fetch the list of the tables in the database
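The exact query isn't reproduced in the recording, but a lookup over INFORMATION_SCHEMA is the usual way to get that list, and the Lookup activity then looks roughly like this. A sketch: the N/A dataset parameters mirror what was typed in the demo, and the activity name is a placeholder used throughout these sketches.

```json
{
  "name": "LookupTableList",
  "type": "Lookup",
  "typeProperties": {
    "source": {
      "type": "AzureSqlSource",
      "sqlReaderQuery": "SELECT TABLE_SCHEMA, TABLE_NAME FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_TYPE = 'BASE TABLE'"
    },
    "dataset": {
      "referenceName": "DSDB",
      "type": "DatasetReference",
      "parameters": { "TableSchema": "NA", "TableName": "NA" }
    },
    "firstRowOnly": false
  }
}
```

With "firstRowOnly" set to false, the Lookup output contains a "value" array with one row per table, each carrying TABLE_SCHEMA and TABLE_NAME.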
25:54
What next? I need to have a ForEach loop. So I'll add this ForEach activity
26:08
And I'll connect them. Once the Lookup completes, I'll pass on the data
26:17
to the ForEach. Here also I need to do something. What? I need to get the items, whatever
26:25
is getting passed from the Lookup. Okay, I need to get that data so that I can iterate across it. Okay. So now this Lookup control, what it will do is get the output, the
26:55
table names, and it will send them to the Copy Data activity
27:01
And inside the ForEach loop, we need to have a Copy Data activity. Okay
27:14
So the source is DSDB, and the current item dot TABLE_NAME
27:32
Why do I use this? Because, you see, TABLE_NAME and TABLE
27:38
_SCHEMA, this is what is getting returned from the database. Okay, here I add dynamic content for the
27:51
table schema. Okay, so it's done. Now let's take care of the sink. In the sink we need to have this
28:09
data lake dataset. Okay, and the file name I also need to pass; I'll use a concat expression
28:21
Okay. I'll name the files based on the table name
28:32
And a .csv extension. Okay. Okay. Now everything is ready
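Wiring this together, the ForEach with the Copy Data inside it comes out roughly like the following JSON. A sketch using the dataset names from the demo; the TABLE_SCHEMA and TABLE_NAME properties come from the lookup query above, and the activity names are placeholders.

```json
{
  "name": "ForEachTable",
  "type": "ForEach",
  "dependsOn": [
    { "activity": "LookupTableList", "dependencyConditions": [ "Succeeded" ] }
  ],
  "typeProperties": {
    "items": { "value": "@activity('LookupTableList').output.value", "type": "Expression" },
    "activities": [
      {
        "name": "CopyTableToCsv",
        "type": "Copy",
        "inputs": [
          {
            "referenceName": "DSDB",
            "type": "DatasetReference",
            "parameters": {
              "TableSchema": "@item().TABLE_SCHEMA",
              "TableName": "@item().TABLE_NAME"
            }
          }
        ],
        "outputs": [
          {
            "referenceName": "DSDL",
            "type": "DatasetReference",
            "parameters": {
              "FileName": "@concat(item().TABLE_NAME, '.csv')"
            }
          }
        ],
        "typeProperties": {
          "source": { "type": "AzureSqlSource" },
          "sink": { "type": "DelimitedTextSink" }
        }
      }
    ]
  }
}
```

The net effect is one CSV per table in the destination container, named after the table, which is exactly what shows up at the end of the demo.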
28:45
I just need to publish it. Okay, I'll publish. Do the publish
28:53
Yeah, I need another five, seven minutes. Just, it's already, it's all done
29:04
Okay. So once that is done, the next is I will trigger it
29:11
What we did for the customer was we set a trigger every fortnight
29:17
so that whenever there was data coming into the existing tables, we kept moving it across until the cutover
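A fortnightly run like that maps to a schedule trigger along these lines. A sketch: the trigger name, pipeline name, and start time are placeholders.

```json
{
  "name": "FortnightlyMigrationTrigger",
  "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Week",
        "interval": 2,
        "startTime": "2023-11-01T00:00:00Z",
        "timeZone": "UTC"
      }
    },
    "pipelines": [
      {
        "pipelineReference": {
          "referenceName": "MigrateTablesToDataLake",
          "type": "PipelineReference"
        }
      }
    ]
  }
}
```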
29:27
When the cutover happened, we moved entirely to the new platform and did away with
29:30
the existing old platform. Okay, now let me trigger it. Fingers crossed, it should run; if there are
29:44
some errors, I don't have much time to fix them. The challenge is, let me be open with you:
29:57
I hardly code these days. Okay, I got some error. Just one minute, I will just fix it
30:03
It expects its parameter to be an array or a string
30:18
Just one minute, I'll just fix it quickly
31:58
Dot value. This should work
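For context, that error is the usual ForEach symptom: the Items field was pointing at the whole Lookup output object instead of the array inside it. Assuming the lookup activity is named LookupTableList as in the sketches above, the fix amounts to the .value suffix on the expression:

```json
"items": {
  "value": "@activity('LookupTableList').output.value",
  "type": "Expression"
}
```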
32:08
OK, fingers, OK, fingers crossed
32:25
OK, fingers crossed. Like, I code, but basically I design; I work as a principal architect, so I have a big team of
33:05
2,000 folks reporting to me, so they do all this hard work. So I was very scared about
33:14
whether it would work, but good that it worked. Okay, now let us go to the destination and just see
33:23
See here, we have got all the table data in this destination folder. I'll just show you one CSV and we'll call it a day. See, the entire
33:54
data dump for all the tables; we could successfully migrate it from the SQL database to our data lake
34:03
So I rest my session here. So any queries, feel free to ask
34:11
or else you can just take a screenshot of this and feel free to connect with me on LinkedIn
34:18
Any queries or anything? Definitely I'll be able to help you out
34:23
No, I'll show you. I'll show you. I showed you the CSV of one table
34:32
But actually what has happened is, wait, wait. I'll show you. Inside this now we iterated and dumped all the tables here
34:42
The names you see, right: Address, BuildVersion, ErrorLog, ProductCategory
34:47
All these are table names. And as for the data, the entire database actually, I have converted to CSV, the entire
34:57
database, and the one that I downloaded, that was a single one that I showed you, that
35:04
CSV file. Okay, okay, cool. So I'm done. Feel free to connect whenever needed