Data Replication in Azure SQL and Beyond || Women Data Summit 2021
12K views
Nov 9, 2023
This session will cover a range of data replication technologies in Azure SQL Databases, Managed Instances, and SQL Server, with a focus on Change Data Capture and SQL Data Sync. Some of the topics discuss will be: functionality, use cases, technical details, benefits and disadvantages for each data replication product covered. Conference Website: https://globaltechconferences.com/event/women-data-summit-2021/ C# Corner - Community of Software and Data Developers: https://www.c-sharpcorner.com C# Live - Dev Streaming Destination: https://csharp.live #SQL #Azure #Data #WomenDataSummit
View Video Transcript
0:00
So I'm Mara, I'm a PM at Microsoft in the Azure SQL department, and I'm very excited to be telling you more about data application options in Azure SQL test services
0:12
So let's get started. So today we are going to discuss a bit about data application at a high level, why it is important, what are some common scenarios and solutions that we've seen clients use for their application needs
0:29
And then we are going to dive deep into change data capture
0:33
and we're going to go see a demo for change data capture. Then we move to SQL Data Sync and the demo there as well
0:40
And lastly, we'll cover change tracking. These are some of the data change recording and data application solutions that Microsoft offers
0:51
But there are many other first-party and other third-party-grade solutions for applications
0:58
So this is just a small part of what's out there. So as I was telling you, they will focus mostly on PES solutions
1:09
However, many of the technologies that will cover are also available on SQL server on Prem, on Azure SQL managed instance
1:18
However, we'll mostly focus around the Azure SQLDB component. So, for instance, the CDC, the change data capture demo and the SQL data
1:28
SYND demos will be focused on Azure SQL databases. So this is the deployment type of choice for today's presentation
1:39
So first of all, let's talk a bit about why is data replication relevant
1:43
At a very high level, you might have a business and you might have inventories and you might
1:50
be operating at a global level and you might want to synchronize your inventories to have a consistent version of your inventories
1:57
have to synchronize databases across the globe or you might be running different businesses
2:05
in different regions of the work. So this is the most common scenario that we see clients use data application for
2:12
And to go into a bit more specifics, there are some specific scenarios behind the data application needs
2:20
So first of all, you might decide to synchronize distributed applications and workloads and
2:25
globally distributed data. So here when it comes to distributed applications, you might want to synchronize data across different workloads
2:34
So you might have one production database, another database that has an ytics workload
2:39
and you might want to synchronize those. Some common solutions for these need would be SQL data
2:46
synced, CDC, change tracking, which will cover all in today's presentation. Then another scenario might be that you want
2:55
ensure disaster recovery and business continuity. And this is very helpful if you have, for instance
3:02
data center outage, if there is a regional disaster, a large-scale outage
3:08
And in this case, you might decide to use active geo-application, geo-indundant backups
3:15
Third of all, you might want to scale out to get only workloads. So in this case, you basically would want to offload
3:22
to be done only workloads instead of running them on the grid-on-read replica and this would mainly be for performance purposes
3:31
And the common solution in this case would be active geo-replication or the read-scale-out feature that we're offering in Azure SQL
3:40
And lastly, you might want to keep databases synchronized when migrating. For instance, you go from SQL server to an Azure SQL database, and you might decide to use DMS online migrations to sync during the migration
3:55
process and SQL data sync to ensure that your databases are still in sync after migration
4:03
has happened. Yes. And these are some specific scenarios and a list of solution that is in no way 100% comprehensive
4:13
As I was mentioning, there are many other solutions. These are some very popular ones that we've seen customers use for such scenarios. Great
4:24
Now we're going to go a bit into change data capture, which is available in SQL server
4:31
Azure SQL managed instance, and actually, very recently we announced CDC in Azure SQL databases in public preview
4:40
So at a high level, CDC allows you to record DML changes
4:45
so delete, update, and inserts that happen in your table. So how does this work
4:52
You first want to enable your database, and the tables within the database that you want to track for DML changes
5:00
And once you enable the tables, there is some change tables that are automatically created for each enabled source table
5:08
So once you do a DML change on one of the tracked source tables, those changes going to the transaction law
5:15
And once they get into the transaction log, CDC has two jobs
5:20
So it has a capture drop, which takes the DML changes from the transaction law
5:24
and add them to the change tables. And there is also a cleanup job
5:29
which is the change tables based on the retention policy. And lastly, you have query functions that consume the data changes from the CDC tables
5:39
And you might use this to send the change data to order downstream subscribers
5:44
You might decide to run ytics on this change data. We'll discuss some common scenarios very soon
5:52
So you might have noticed that we discuss about the capture and the cleanup jobs which are central to CDC
5:59
However, in Azure SQL MI and SQL server, these are SQL server agent jobs
6:06
But very recently, as I was telling you, we released CDC in Azure SQL databases
6:12
And in Azure SQL databases, these two jobs are replaced by a scheduler
6:16
So basically, this scheduler automatically runs a scan and cleanup, which is very convenient for customers
6:24
So you can just let it and assume it's run on the background. However, if you want, you can still manually execute
6:30
these store procedures that you can see here to run scan and clean up on demand
6:36
So that's still an option. And functionally, apart from this big difference
6:41
everything is pretty much the same for CDC in Azure SQL databases
6:46
managed instance, and SQL server. Something else to mention is that if you want
6:52
to use change data capture, in Azure SQL databases, you must use databases higher than the standard three pricing tier
7:01
So if you have a basic database in Azure SQL, CDC won't work on it
7:08
So there are some key CDC use cases, as I was telling you
7:12
So you might decide to use CDC in order to track changes
7:16
for audit purposes or perform ytics on change data. You might also decide to send the changes
7:24
that come to the change tables to order downstream subscribers and to run all kinds of ETL operations
7:30
from the OLTP system to the data lake or the data warehouse
7:35
And for instance, you might decide to use Azure Data Factory to push your changes from the change tables to other subscribers And lastly you might want to do event programming to provide like instantaneous responses to data changes
7:50
And very useful in this case, it might be if you have a dynamic product pricing application
7:57
So you want the prices of your products automatically react to changes in your inventory, in the demand
8:05
in other market trends. So there are a lot of cool use cases behind changes
8:10
data capture. So kind of to bring everything together before we go into a demo, here are some key CDC concepts
8:20
that you should know if you are exploring this solution. So first of all, the capture and cleanup
8:26
So the capture, as we talk, it scans the transaction log and it is created when the first
8:33
table is enabled for CDC. Remember, I told you that you first enabled the database and then all the tables you
8:40
want to track. And then similarly the cleanup which which cleans the change tables based on a
8:47
retention policy. Both of this you have to ensure that SQL server agent is running if you are
8:53
using CDC on SQL server or on Azure SQL managed instance. Then you have the change tables
9:00
very easily we talked about them every source table that is enabled for CDC has an
9:07
associated change table. Then you can use use these two VMVs to monitor change data capture across all
9:15
deployment types, they are quite common. And then you have some table valued functions which
9:23
allow you to consume changes getting into the change tables. So by default, you can get all data changes, but if you want to access the net
9:31
changes when you enable CDC, we'll see in the demo, you'll have to mention that as one of
9:36
the parameters, so you'll have to set that to one. And lastly, log sequence number
9:41
So the committee lesson kind of identifies those changes committed within the same transaction and objects those transactions
9:50
Because remember, CDC is using the transaction log to get data changes
9:57
And in order to enable CDC at the database level, you have to run this third procedure
10:03
which leads to the CDC schema being created. And these five CDC tables, which are stored
10:10
on your system tables within your same database. So before using CDC, you should make sure that you have enough space in your database
10:18
And if needed, it is the number of Vigors, move to a higher pricing tier
10:23
And then once you enable on your database, you enable it on your tables through these procedures
10:33
And we'll see that very soon in this demo. So for this, for the purposes of this demo, I'm going to
10:40
to focus on how CDC works in Azure SQL databases because as you mentioned, PES is going to be the focus for today's session
10:51
So let's see. So I have a test database and a customer's table, which right now only has one customer
11:00
so I'm enabling CDC at a database level. This is a general purpose database, so higher than S3
11:08
CDC should work all fine. So as you can see, once I enable CDC at the database level
11:18
you'll see that I have five system tables created within the same database
11:24
So all the CDC artifacts are stored in the same customer database
11:28
And here I'm going to show you quickly what you have in each of these tables that just got created
11:35
change tables here. You can see all those change, all those table enables
11:38
Any DDL changes you can see in this table, the index column, and lastly the log sequence number time mapping
11:47
These are all mostly empty right now because we haven't done almost anything
11:52
So let's see, customer table, very simple. It only has one record right now
11:58
So now let's move on and enable CDC on this table. So you mentioned the source schema, the name of the table
12:07
the name of the table, the role which controls, so you can control access to the change data
12:14
And here, if you enable support net changes, you can use that function to see net changes
12:20
And you can select which columns to track, which is quite cool. However, I'm not going to do that right now
12:26
So as you can see, once I enable that the table level, I get the change table being created
12:32
an associated change table for my customer source table. And another table is created the CDC jobs table
12:40
And here you can see on the left on the system table section
12:45
And if we look in the CDC jobs, you will see the default parameters for capture and cleanup
12:51
And you can change this to configure retention or the threshold. So now let's insert something in our source table
13:00
And automatically the scheduler runs scan very quickly. So right now you can see these changes already
13:06
in the change table. However, if you want, and if you've gotten used to this based on SQL server, Azure SQL
13:14
MI, you can still run, scan on demand manually. So here you can see this still complete successfully and it's run on the background
13:24
So now I want to change the retention in minutes for the cleanup job
13:30
So I set it up to 230. And if I go to the CDC drops, you can see the retention for the cleanup drop has been changed to 30
13:45
And this is with regards to cleaning the change tables. So now very simply I want to disable CDC
13:52
So I'm like looking at all the system tables. And here you can see that I have multiple
13:57
So let's say I forgot which one has been enabled for CDC
14:01
so I can go ahead and find the necessary column. Here you can see is tracked by CDC
14:13
and you can see which of your tables are tracked, and you can disable the table for change data capture
14:19
and lastly, the database. So once you do that, you can also directly disable the database
14:26
and once you do that, all the CDC artifacts are gone as expected
14:37
Yes. And here I'm just showing you how the artifacts have disappeared from the system tables
14:46
Great. So let's move onwards. So let's learn a bit about SQL Datasync, which is available on SQL server and Azure SQL databases
15:00
So sync allows you to synchronize data across Azure SQL, SQL server on-prem or on your DMs, and the whole sync concept revolves around a sync group
15:16
So a single contains one hub database and one or more member databases that can either be on for instance or on Azure SQL DB because it not supported sync is not supported is not supported on Azure SQL managed instances of now And as you can see in our diagram on the right
15:38
all the changes go through the hub database, which has to be on Azure SQL database
15:43
So the members do not synchronize among them directly. All the changes have to go through the cloud
15:49
through the hub DB. So there are some properties of a single
15:55
which you can see here. So first of all, you have the sync schema
15:59
which basically specifies what are the tables that you want to be synchronized within your database
16:05
Then you have the sync direction. So here you can select whether you want one-way synchronization
16:10
or bidirectional. So this bidirectional functionality is very appreciated by customers, because basically you can move changes from hub to member
16:21
member to hub and both. Then you can set up the sync frequency. So you can run sync manually on demand or you can also set it in so that
16:31
it's such that is run automatically. If you want sync to be run automatically, you have to set up
16:37
the frequency for it and you can put it as low as a few seconds or as large as like every 30 days
16:44
do a sync. And lastly, you have a conflict in solving policy. So if you have a, if you have a
16:52
conflict on the data changes, you have to set who gets priority
16:56
And here you can set hub wins or member wins. So these are the two options that you have
17:04
And overall, sync works based on triggers. So insert, update and delete triggers
17:09
Unlike CDC, which we just discussed, which works based on transaction log changes
17:17
taking the changes from the transaction log directly. So you have a sync agent
17:22
So in case you have a database that is on-prems or member database, the agent ensures data security
17:32
So basically, the sync service communicates with the agent by through some encrypted connection and a unique token, which is called the agent key
17:42
So basically, the SQL server databases authenticate the agent using the connection string and agent key
17:49
and this is very, very secure and direct communication with the databases behind the firewall is prevented
17:59
And lastly, another security measure that we've recently taken is the private leak for sync
18:05
which is in public preview right now. So basically, this allows you to establish a secure connection
18:12
between the sync service and the member and the sync service and each of the hub databases during the
18:19
synchronization and we'll see this into practice very soon in our demo one small
18:26
mention for private link if you want to use private link all your databases in
18:31
the sync group have to be in an Azure SQL databases so overall as you can see
18:40
here there are three main use cases that we've seen for SQL data sync the
18:45
first one being hybrid data synchronizations So basically, if you want to synchronize data across SQL server and Azure SQL databases
18:54
this might appeal to customers looking to move to the cloud and put some of your applications in Azure
19:03
Then distributed applications, so synchronizing different workloads across different databases. And lastly, globally distributed applications, which will see more about in our demo
19:15
in which we'll synchronize databases across the globe. Great. So in this demo, we are going to synchronize
19:25
to Azure SQL databases across US and Asia. So we have a
19:31
food's business and it has inventory both in the US and Asia and we want to
19:36
make sure that the inventory is in sync across these two these two regions
19:40
right? So let's see. I'm using for the purposes of this demo, I'm
19:45
using SQL Data Sync in the Azure portal, but you can use it in PowerShell using Rest APIs
19:55
So there are more options. So as you can see, I have a very simple application of fruits
20:03
and I have fruits inventory in the US, and these are some of the items in my inventory
20:10
and I have another database in Asia, which also has fruit inventory
20:15
And as you can see, the Mellon, not all of the US show in Asia
20:22
So I want to make sure that these two databases have bidirectional sync
20:26
So I go into the sync to the other databases and I create a sync group
20:34
US to Asia sync, let's say. So I have to create a new sync metadata database in the same region as the hub
20:45
So the sync demo US will be our hub database. And I'll set up a lower workload is quite small
20:55
And I leave the default collation. So I want to set up automatic sync
21:01
Let's say every two second I want the sync to be triggered
21:05
And the conflict resolution is that the hub wins if I have a data conflict
21:10
And here, you can see the private link. I want to use private link
21:14
to use private link, so they directed me to the documentation to learn more
21:19
So I'm creating the Sing Group, but before this gets created, I have to manually approve the
21:25
private endpoint connection, so the private link. So I go to the private link. This is the
21:30
private link between the Sync Service and the Hub database. So I have to manually approve it
21:37
before the Sing Group can get created. Okay. Great. So now let's add some sync members. In object to do that, I have to log in the hub database
21:50
So with the credentials at the server level, and then I add a member Azure database
22:02
And I select the server on which the member is, I have an Asia Test server
22:10
The sync directions, once again, I want bi-direction. sync and the server details, the authentication
22:18
And once again, I use private link to connect the sync service to the member database as well
22:24
to ensure secure connection. And once again, I have to approve the private endpoint manually
22:31
So I'll show you another way of doing that. So we're going to go directly to the Asia server
22:38
And I'm going to approve the connection from the private endpoint connection from there
22:44
This is just another way to approve privately. And once I do that
22:57
I've added a member. Click OK and it should work fine. And lastly, I configure the sync group
23:08
So for each of the databases, I can select the tables that I want to sync As you can see I refreshing the schema And it detects that I only have one table with three columns
23:23
I need a primary key. As you can see. And once again, I go on the member database, refresh schema
23:34
and I select which table from the member database I want to connect with
23:39
There are some limitations around using sync with table names and data types
23:47
So you must check those before you set it all up. And now you can see the sync group is created
23:55
The databases, you have some logs here to monitor the progress. So the thing should have worked
24:00
So now I'm going back to the databases to actually make sure that they are in sync
24:06
So the sync is working on the background based on the frequency that I set
24:12
which is of triggering it every two seconds. And now I go on the Asia databases
24:19
I'm opening both databases to ensure that I have same data in both
24:29
And you can see that there are also some SQL Data Sync artifacts
24:33
which similar to CDC are stored on the same. user database so you must make sure that you have enough space before you get started with sync
24:44
So let's select and as you can see that missing inventory item that we only had in the US shows up in Asia already
24:53
And now since we set up by directional sync, let's now I'm just showing you that both are the same
25:00
So since we set it up by directionally, let's add some items to Asia as well and see
25:05
and make sure that it gets shown in US as well. So I'm inserting something more
25:15
I made a small mistake so this is not going to work
25:21
So I'm adding this to the member database in Asia
25:32
Now I'm going to go to, as you can say, I'm adding this to the member database in Asia. As you can say, I'm going back to the US
25:39
So it's still not there. So sync gets triggered every two second
25:43
but the actual thing might take a bit longer. So I have to wait a bit and now
25:48
now I went to Asia back. So now if I run in the US, it shows up
25:55
So don't assume that just because you set up the frequency every two seconds, you'll see data already there every two second
26:02
It triggers it every two seconds. it triggers it every two seconds and then the scene takes a bit of time
26:07
So this was our demo on syncing data across different databases around the world
26:17
And lastly, we have change tracking, which works in SQL server and Azure SQL databases
26:24
So the key thing about change tracking is that it records that rows in a tablework changed
26:31
but it does not capture the actual data that was changed. So let's say an application in which you just want to know
26:39
if data was changed and you use change tracking, but if you want to know which data has changed
26:44
you want more details around it, you might just change data capture instead, right
26:51
So some key change tracking concepts are the following. So you have the change tracking table, the functions
27:00
auto cleanup, and change tracking current version. So basically each table that is enabled for change tracking
27:08
has an internal on-disc table, which is used by the change tracking functions
27:13
to determine the change versions and the rows that have changed in a particular version
27:21
And then you have the auto cleanup, which scans the user databases
27:25
to kind of identify change tracking enabled databases and based on a retention
27:31
which is similar to what we've seen in change data capture. So the retention period set at the setting of the database
27:39
each internal on this table is first of any expired records. And then you, as you can see, you also have some query functions that supplied details of the changes
27:51
in an easily consumed relational format, and we'll see some of those very soon
27:57
So you can see that it's quite similar to change data capture, but it's
28:01
It has a different performance, less of an overhead since it only tells you whether data has been changed
28:10
but not what the change data is and what was changed. And you can see an example here on the right hand side
28:21
You only have the, it basically tracks the column that changed based on the primary key
28:26
And here you can see what type of DML change was made
28:30
So like update, delete or insert. So given that CDC and change tracking are quite similar
28:39
I wanted to do a small comparison. So as I was telling you
28:44
you see which ones have been changed in the case of change tracking
28:49
but you also get historical data, historical change data in the case of the CDC solution
28:57
So because change tracking is only looking at the primary he is of the course that work changed
29:03
This leads to less overhead. Well, CDC has obviously more overhead and also the change tables are stored on the user's database
29:14
and all that. So change tracking is enabled at the database and the table level
29:21
Each enabled source table has a corresponding on disk table, which is quite similar to CDC
29:26
However, the change table is stored on the database. And lastly, what they have in common is that they can be enabled on the same database and no special considerations are required
29:39
So you can have them running on the same database in parallel
29:44
And lastly, we're not going to do a demo on change tracking, but here I'm showing you how to enable it at the database level to this DDL
29:54
ultra ddL, you enable it at the level of the tables that you want
29:59
So similar to you have some parameters. You can do track columns updated on if you want that to be on
30:09
It's by default off. So you have to set it to on if you want that
30:13
And this is how you can determine which databases and tables have changed tracking enabled, similar to kind of what we did in the CGC demo
30:23
And this is how you get the change tracking current version, the latest version
30:29
Great. So I talked for a long time and just a quick recap of what we did today. So we discussed about why data application is relevant. Then we went into some common scenarios and solutions for data application and then we went a bit into depth for change data capture, SQL data sync and change tracking. So this is pretty much it for today. Thank you
30:59
Thank you