Learn DynamoDB Data Modeling
23K views
Oct 30, 2023
Join us in this informative talk as we explore data modeling in DynamoDB with real-world examples. Discover how to design data models that enable efficient access patterns and optimize filtering capabilities. We will cover essential techniques for structuring data to support diverse query requirements and establishing relationships between entities. Learn how to effectively filter data using partition and sort keys, enabling retrieval of specific items and ranges. We will also dive into Global Secondary Indexes (GSIs), sorting with timestamps, and more. Software Architecture Conference 2023 (https://softwarearchitecture.live/) #SoftwareArchitectureConf23 #CSharpTV #csharpcorner
0:00
So as you know, my name is Vikar
0:02
I'm a senior staff engineer at SoFi. And previously, I was a database engineer working on S3 for close to eight years
0:10
You all know S3 is a storage system, but S3 also needs to deal with storing metadata, so you can index your data better
0:21
So that is where I think my experience of building NoSQL systems comes into play here
0:26
Yeah, but that's something about me, and for the last two years I've been working at
0:33
SoFi as a staff engineer. SoFi is a financial institution, a bank in the US, and we build software using
0:40
cloud technologies, mostly AWS. All right, just before getting to the talk: a lot of my content here is from The DynamoDB
0:50
Book, including the images, because I really love this book and I think everybody should read this book
0:55
The author, Alex DeBrie, is a popular speaker as well
1:02
So I would like to credit the images that I've borrowed from his book
1:07
All right, so this talk is about DynamoDB, but some of the concepts and some of the things that we're talking about here apply to any NoSQL system
1:18
whether it's Cassandra or any other NoSQL system. But let's assume that we are going to talk more about
1:25
DynamoDB and less about Cassandra, but let's get started. So why DynamoDB, right
1:31
The first thing is it's serverless. That means you don't have to maintain any servers
1:36
DynamoDB provides you an API that you can talk to, to get and put your objects or
1:41
rows and attributes, and you don't have to manage any of your servers. So that would be one of the advantages of using DynamoDB: it's
1:49
serverless and it can scale elastically up and down. And the second one is no connection limits
1:55
I'm sure you have dealt with things like connection pools and JDBC, and then you have multiple hosts hitting a single database
2:04
and you had to deal with connection problems. So the fun part with DynamoDB is that there are no connection limits
2:09
Because of the way it runs at scale on multiple partitions, you can spin up any number of hosts, whether it's on the cloud
2:16
whether it's Lambda on AWS, or in Azure or Google Cloud, and then keep hitting it from multiple hosts without any connection limits
2:25
Scale is one of the biggest advantages, so you don't have to worry about scaling your servers up
2:32
or scaling your servers down; everything is taken care of automatically within the cloud by DynamoDB
2:38
And performance. So one of the things that DynamoDB provides you is consistent performance
2:43
So think about the image that you have here, right? So if you have your MySQL or Postgres database
2:49
you start with a database holding a very small amount of data when you build your application
2:54
let's say one gigabyte, and it's super fast. It works really well in your pre-prod environments
3:00
and production environments. But as you grow, you can see the performance keeps going down
3:07
And the curve keeps going from fast to sluggish to painful. So at that point, all you need to do is probably
3:15
hire experts to manage your indexes, and maybe do some re-architecting and stuff like that
3:20
to come back to a normal state, but that's a lot of effort. Right, and you need to have expertise in this;
3:27
you see a lot of database administrators doing that kind of work
3:31
So with DynamoDB, you get this consistent performance. It's always fast. You do it once on one gigabyte
3:38
and you can scale it until one terabyte or even 10 terabytes, 20 terabytes, even petabytes
3:43
You can just keep getting the same performance, a consistent performance. And consistency: consistency in the sense
3:51
of performance, but also consistency in the NoSQL-database sense
3:55
So you see with a lot of databases, a lot of NoSQL databases
4:02
because they have to manage replication, you don't get consistent responses back, in the sense that
4:11
most of them are eventually consistent. Like, once you write your data and then read it, you might not get the latest data
4:18
you might get stale data, right? So, DynamoDB gets you read-after-write consistency
4:23
that means once you write, you can get it immediately if you call the master replica. It also
4:30
supports both eventual consistency and strong consistency, but let's talk about that later. But
4:35
consistency is something DynamoDB provides. And predictable billing: because of the way things are
4:42
in the cloud, you can just do capacity planning or request planning and decide what my
4:48
monthly billing is going to look like based on my load. So it's very predictable in that sense
4:53
And zero downtime. Yes, DynamoDB has had a few outages here and there, but most of the time, it's zero downtime
4:59
So remember how many times you had to deal with your database admins to say, you know what
5:04
we're doing an update and there will be downtime, and you'll have to share that downtime with your customers
5:09
You send an email to customers or have some banners on your home pages saying we are down for maintenance
5:15
With DynamoDB, you don't have to do that because it's always up
5:19
Okay, let's go to the next slide. So before we go into detail, one of the guiding principles of DynamoDB is it won't let you do things that won't scale
5:32
So with data modeling, you won't do things that you cannot scale
5:37
Unlike a relational database, there are no joins, no arbitrary filters, you can't do sums, no aggregations
5:43
So how do you do data modeling then? So let's get to that
5:48
Okay, before going deeper, let's go over some NoSQL concepts. You have your primary key, just like relational
5:57
You have a primary key, and a primary key consists of a partition key and a sort key
6:01
which is a composite primary key. You can either have just a partition key without a sort key, or a partition key
6:06
with a sort key. So in this case, look at the image
6:12
You can see Tom Hanks as the partition key. And some of his movies, like Cast Away and Toy Story, for
6:18
example, are your sort keys. But you can get away with not having a sort key altogether as well
6:23
Or you can have a composite primary key, just like the image
6:27
And the attributes, we have something about attributes here. So you have roles, years, and genres
6:35
So those are the attribute fields. And unlike relational, these are called items
6:41
So you call them rows, but these are items in DynamoDB. And one caveat is about how you can query
6:48
You can query DynamoDB with a partition key and a sort key, with a partition key alone, or with a partition key and a partial sort key, but not by an attribute
6:59
There are indexes, just like relational; DynamoDB supports global secondary indexes
7:03
We're going to talk about them later, but without setting up a GSI, or global secondary index, you cannot query by an attribute.
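For example, here is a minimal sketch of key-based access using boto3, the AWS SDK for Python; the Movies table and the Actor/Movie key names are hypothetical, matching the image:

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Movies")  # hypothetical table

# Fetch one item by its full composite primary key (partition key + sort key).
item = table.get_item(Key={"Actor": "Tom Hanks", "Movie": "Cast Away"}).get("Item")

# Fetch a whole partition: every movie under the "Tom Hanks" partition key.
movies = table.query(KeyConditionExpression=Key("Actor").eq("Tom Hanks"))["Items"]

# Querying by a plain attribute (e.g. Genre) is NOT possible
# without a GSI or a full table scan.
```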
7:11
All right, so how does DynamoDB work? I'm sure you know how a relational database works;
7:18
let's say it's running on a single host. But unlike that, DynamoDB works on partitions
7:23
and partitions live on different hosts. So let's say one partition is associated with one host
7:30
So let's say there are nine hosts in this case, one for each partition to live on
7:34
And let's say, we spoke about partition keys and sort keys, so let's say the partition key is a customer ID and the sort key is an order ID. So when the DynamoDB web server receives this request at its request router
7:48
the request router looks at the customer ID, does a hash, and figures out
7:53
I need to go to partition 1, where this item belongs, and do a put or a get
7:59
So that's how DynamoDB internally works: there's a hash, and based on the
8:04
hash you go to a single partition and do a fetch or a put. But partition one is also
8:11
replicated onto multiple other hosts, so that way you get durability and also
8:18
eventual consistency, and you can hit any replica to get your data. So let's look at some
8:26
first-order things for your data modeling. Generally what you do is you draw an entity
8:33
relationship diagram, correct? In any data modeling scenario you have to do the same: you build
8:38
your relationship diagram and then write out your access patterns. So the access patterns can
8:44
be, hey, look up employee details by employee ID, or query employee details by employee name. So write
8:49
them down in Excel or Google Sheets or whatever, and then think about what should be your
8:55
partition key. That partition key would help you shard your data. And think about
9:00
what can be your sort key; the sort key can be used for sorting your data, so you can preserve ordering, for example. And also, what can be your
9:09
indexes. In this case, let's say, from your previous
9:16
example, you want to have year as one of your indexes. So decide on your
9:20
global secondary indexes as well. So these are some of the things that you have to think about
9:24
when you're thinking about data modeling. These are the first-order things, and these are
9:28
very similar to what we do in relational as well, except instead of just a primary key, you have a
9:32
partition key and sort key here. But the second order is the most important part, because
9:38
these are some things that are enforced by DynamoDB. So some of the things you need to
9:43
know are: there are conditional expressions that you can write in DynamoDB, put-if, so you
9:50
can do things like, hey, put this object only if this condition is true. So you can write conditional queries.
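A minimal sketch of such a conditional write with boto3; the Orders table and key names are hypothetical:

```python
import boto3
from boto3.dynamodb.conditions import Attr
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("Orders")  # hypothetical table

try:
    # Put this item only if no item with the same primary key exists yet.
    table.put_item(
        Item={"PK": "CUSTOMER#123", "SK": "ORDER#456", "Status": "NEW"},
        ConditionExpression=Attr("PK").not_exists(),
    )
except ClientError as err:
    if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
        print("Item already exists; the write was rejected")
```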
9:56
And DynamoDB also supports transactions, just like a relational database. You can
10:02
also think about transactions in the sense that if you have three
10:09
queries to execute, you can wrap them up in a transaction, and DynamoDB will make sure these
10:14
are atomic, so that way either all of them execute or none of them execute.
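A sketch of a transaction using the low-level client, which takes typed attribute values; the table and attribute names are hypothetical:

```python
import boto3

client = boto3.client("dynamodb")

# Either both operations succeed, or neither is applied.
client.transact_write_items(
    TransactItems=[
        {"Put": {
            "TableName": "Orders",  # hypothetical
            "Item": {"PK": {"S": "CUSTOMER#123"}, "SK": {"S": "ORDER#789"}},
        }},
        {"Update": {
            "TableName": "Customers",  # hypothetical
            "Key": {"PK": {"S": "CUSTOMER#123"}},
            "UpdateExpression": "SET OrderCount = OrderCount + :one",
            "ExpressionAttributeValues": {":one": {"N": "1"}},
        }},
    ]
)
```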
10:20
And you can also think about complex attributes. So we know we have things like binary data, etc.
10:28
in your databases; along with that, you also have complex attributes. That means there are things like maps and also lists that you
10:38
can support. If you look at the image, the mailing address, for example, I think is a map. And
10:43
this is the kind of complex attribute that you can write. But you cannot create GSIs
10:49
on complex attributes, because it's hard; but your GSIs can be on your first name and last name in this image, for example.
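As a sketch, an item with map and list attributes might look like this; the Customers table and field names are hypothetical:

```python
import boto3

table = boto3.resource("dynamodb").Table("Customers")  # hypothetical table

table.put_item(
    Item={
        "PK": "CUSTOMER#123",
        "FirstName": "Jane",   # scalar attributes like these can back a GSI
        "LastName": "Doe",
        "MailingAddress": {    # a map -- cannot be used as a GSI key
            "Street": "1 Main St",
            "City": "Seattle",
            "Zip": "98101",
        },
        "Orders": ["ORDER#456", "ORDER#789"],  # a list attribute
    }
)
```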
10:54
DynamoDB also has some limits. So any query or scan operation that you're doing
11:02
against DynamoDB is a maximum of one MB in size. So if you query something, you cannot get back
11:08
more than one MB; you need to think about pagination and stuff like that, keeping your previous
11:13
marker and then querying again, so you can read or scan your full database
11:18
if you have to. And in DynamoDB you have partitions, as we talked about, and there are
11:24
limits on partitions as well. The current limits are around 1,000 write capacity units and 3,000
11:30
read capacity units per partition. These should be very high numbers for most
11:37
applications. But again, if you are thinking you need more, then probably you need to go towards
11:44
on-demand mode or think about what mistakes you made in your data modeling because generally
11:49
this should be sufficient: 1,000 writes and 3,000 reads per second should be enough
11:55
in most of the use cases. And the maximum row size is 400 KB. So whether you
12:00
use complex attributes or normal attributes, you cannot have an item larger than 400 KB
12:04
So that is something that you have to think about when you're modeling your data. And the maximum partition size is 10 GB
12:11
So imagine, as I said, we have a partition key, and the partition key maps to one host
12:18
So if you model your data in such a way that all your data is under a single
12:23
partition key, then you're doing something wrong, because the maximum partition size is
12:27
10 GB. So you need to find some high-cardinality attributes for your partition key so you can
12:32
divide it better and you don't hit this limit of 10 GB as you grow
12:40
Okay, so we already spoke a bit about the primary key, but let's go into more detail. So the
12:45
primary key is a unique key that combines both the partition key and the sort key. So it's always
12:52
unique. For example, if you have, like, Amazon and the date here, which is 1824
12:57
If you write again, you're going to overwrite it and it's not going to be a new entry
13:02
you're just going to overwrite. So it's always unique, in the sense that the primary key is always unique
13:07
And you cannot update a primary key, in the sense that you can only overwrite it, and every new
13:13
write of a partition key and sort key is a new row, or item, in DynamoDB
13:22
And partition keys are for grouping. As I said, the partition key is used in the hash
13:27
So you can hit the right host or partition in DynamoDB. So it's used for grouping your data together
13:34
And sort keys are for ordering. So if you see on the side, the sort keys are ordered lexicographically
13:45
And the partition key should be selected with high cardinality. That way it's always unique and you always have multiple partitions and not a single partition
13:53
As I said, if you have multiple partitions, you get more throughput out of your partitions
13:57
than being restricted to the 1,000 and 3,000 TPS that DynamoDB imposes per partition. So among the strategies for developing your partition key: high cardinality
14:08
You also need to think about composite attributes. So if you look at the image there, the invoice number, let's say, is 1-1-2-1-2
14:17
these are composite attributes. There's the hash character (#) that we are using in this case to create more high-cardinality
14:27
partition keys so you can employ that strategy in your use case. And also you can add some predictable randomness
14:33
So let's say you have that invoice number, but you also want to create more partition keys
14:40
Then you can add some kind of randomness, between, let's say, 0 and N:
14:44
hash your data, let's say one of your attributes, and decide where it
14:48
should go in your partition key. So it's not easy for search, but if you want, you can employ that strategy based on
14:55
the use case, again. I wouldn't recommend it, but if your use case requires it and you don't have an option
15:02
then you can do something like predictable randomness, so you go ahead and hit multiple partitions and not just a single partition.
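A minimal sketch of that write-sharding idea, with hypothetical key names; note that reads then have to fan out across all N suffixes and merge the results:

```python
import hashlib

SHARD_COUNT = 10  # N, chosen up front

def shard_for(order_id: str) -> int:
    # Deterministic "predictable randomness": hash an attribute into 0..N-1,
    # so the same order always maps to the same suffix.
    return int(hashlib.sha256(order_id.encode()).hexdigest(), 16) % SHARD_COUNT

pk = f"INVOICE#11212#{shard_for('ORDER#789')}"  # e.g. INVOICE#11212#4
```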
15:12
And some of the best practices for sort keys: just like the partition key, you can also have composite sort keys, where you use hashes, like CLIENT1#TXID1, CLIENT1#TXID2
15:26
So you can do composite sort keys as well. And the nice part about sort keys is that
15:31
if you want to query, you can give your partition key and then say the sort key starts with something in your query
15:37
so that way you can list everything from your partition. And sort keys can be used for
15:43
versioning. So let's say you have some use case where you want to update the data
15:48
You have a lot of updates coming in for one of the use cases, but at the same time, you want to preserve the history
15:53
Then you can use the sort key as your versioning strategy. That means every time you get a new version
15:59
you update it by one: two, three, four, five... nine, ten. That way you can keep that history using your sort key
16:08
And you can also employ things like sorting with timestamps. Like, hey, every time an update
16:13
comes in, or data comes in, you can create a UUID, sorry, a timestamp, an epoch
16:19
timestamp or an ISO 8601 date, and then add it to your data, so you can use that for sorting
16:27
And you can also use UUIDs, but with UUIDs what happens is you lose the order. So for that
16:34
and the previous one, where you can have timestamps that are colliding if you're
16:38
dealing with high scale, then you have some collisions, there's something called ULID that you can also
16:43
try out, so that you get the best of both, which is the best of UUIDs and also sorting with
16:49
your timestamps. So the advantage of ULID is it preserves your order. The way ULID works is: the first
16:56
48 bits are milliseconds. So the 48 bits make sure the IDs are ordered. And the next 80
17:03
bits make sure they are random. So that way you still get randomness and don't have collisions
17:10
with your timestamps, and you are safe; it's also lexicographically sortable, as I said.
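A self-contained sketch of a ULID-style ID, 48 bits of millisecond timestamp plus 80 random bits, encoded as 26 Crockford base32 characters; in practice you would likely reach for a ULID library instead:

```python
import os
import time

CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

def new_ulid() -> str:
    # 48-bit millisecond timestamp (sortable) + 80 random bits (collision-safe).
    value = (int(time.time() * 1000) << 80) | int.from_bytes(os.urandom(10), "big")
    # 26 characters x 5 bits = 130 bits; the top 2 bits are always zero.
    return "".join(CROCKFORD[(value >> shift) & 0x1F]
                   for shift in range(125, -1, -5))

# IDs generated later sort lexicographically after earlier ones,
# which is exactly what a DynamoDB sort key needs.
```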
17:14
Okay, let's get to queries. So you can write queries on your database
17:25
I mean, on your DynamoDB database. And for that, you need a partition
17:29
key. Let's say in this case, the partition key is device123. And if you query by
17:35
device123, you're going to get all the readings, as I said, reading from the beginning, and the
17:40
maximum you can get is 1 MB, and you have your last token, and based on the token you can
17:44
query again and fetch the full data for the device123 partition key. And as I also
17:51
said, you can include a sort key when you're querying, or not include it. You can also
17:57
do something like partial sort keys. For example, you can query: hey, my partition
18:01
key is device123 and my sort key starts with reading#2020-03-14, let's say. So it's going to
18:10
give you all the readings that start with reading#2020-03-14, for example. So you can do a
18:17
full partition key query, or add a partial sort key, and get a list of all your readings
18:22
in this case. And you can also add some filters. Filters are small things, like, hey, I want
18:30
temperature greater than 32.1, for example. So you could add those filters to your queries
18:35
and DynamoDB does that filtering, but the query still reads up to one MB of your data.
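A sketch of such a query, combining a partial sort key with a filter; the Readings table and attribute names are hypothetical (note boto3 wants Decimal, not float, for numbers):

```python
import boto3
from decimal import Decimal
from boto3.dynamodb.conditions import Key, Attr

table = boto3.resource("dynamodb").Table("Readings")  # hypothetical table

resp = table.query(
    # Full partition key plus a partial ("begins with") sort key.
    KeyConditionExpression=Key("DeviceId").eq("device123")
    & Key("SK").begins_with("reading#2020-03-14"),
    # The filter runs AFTER the read: the 1 MB limit applies to the
    # data scanned, not to what survives the filter.
    FilterExpression=Attr("Temperature").gt(Decimal("32.1")),
)
items = resp["Items"]
```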
18:40
And I really like this feature, which is ScanIndexForward. So a lot of times when you design
18:45
you think, oh, maybe I always have to go from top to bottom, but DynamoDB also supports something
18:51
called ScanIndexForward. You can set that field to false and DynamoDB can return results in reverse
18:57
order as well. In this case, if you see the image, you can get the readings
19:02
in reverse as well. So that is something DynamoDB offers, which is very nice. And pagination
19:08
as I said: because DynamoDB has these limits of one megabyte per query, the
19:13
DynamoDB SDK supports pagination out of the box. So you can get your last
19:21
token and then query again, and fetch those readings, in this case for device123.
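A minimal pagination loop; LastEvaluatedKey and ExclusiveStartKey are the real SDK fields, while the table and key names are hypothetical:

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Readings")  # hypothetical table

items, start_key = [], None
while True:
    kwargs = {
        "KeyConditionExpression": Key("DeviceId").eq("device123"),
        "ScanIndexForward": False,  # newest readings first
    }
    if start_key:
        kwargs["ExclusiveStartKey"] = start_key  # resume after the last page
    resp = table.query(**kwargs)
    items.extend(resp["Items"])
    start_key = resp.get("LastEvaluatedKey")  # absent on the final page
    if not start_key:
        break
```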
19:28
And you can also do consistent reads. DynamoDB, as I said, can support read-after-
19:34
write. By default, it's eventually consistent. So you have to
19:38
set it when you're writing the query; you have to say, I want consistent reads. When you say
19:43
consistent reads, it's going to hit the master node and get you the data, not the slave
19:47
nodes, so that way you always have that read-after-write guarantee.
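In boto3 that is a single flag on the read call; the names are again hypothetical:

```python
import boto3

table = boto3.resource("dynamodb").Table("Readings")  # hypothetical table

# Strongly consistent read: served by the leader, at roughly twice the
# read-capacity cost of the default eventually consistent read.
resp = table.get_item(
    Key={"DeviceId": "device123", "SK": "reading#2020-03-14T10:00:00"},
    ConsistentRead=True,
)
item = resp.get("Item")
```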
19:55
And favor queries over scans. Queries need partition keys, so having the partition key makes it easy to fetch your data
20:01
scanning is bad because you'll have to go through a lot of data and if you have
20:05
terabytes of data, it's going to take forever. So always prefer querying your data over
20:12
scanning. Okay, so let's get to an example of one-to-many relationships. So a lot of times
20:20
you have to data model one-to-many relationships, like a customer has orders, for example
20:26
That is one example. Or an actor works in movies, or has a bunch of movies under his
20:33
belt. So those are one-to-many relationships. So how do I fetch the information on the customer if I have an order, right
20:40
So some of the strategies that you can employ are denormalization, which is exactly the opposite of the normalization that we do in relational
20:48
And also strategies like pre-joining your data, so that the items are co-located together in DynamoDB
20:54
So let's go through some examples and see how that works. So if you see, sorry, I think the image has cut off
21:05
the title, but what I'm trying to say here is, let's say you have an access pattern
21:12
where you want to do one-to-many relationships. And let's say a company has a few payment methods, and the payment methods in this case
21:23
are a credit card and a bank account. So what you can do is create a complex attribute in DynamoDB
21:30
And so if you see here, Berkshire is a partition key, and Berkshire has two payment
21:35
methods, which are a credit card and a bank account. Same thing with Facebook: Facebook has a
21:41
credit card and a bank account as well. So one organization has multiple payment types, so
21:47
you can employ something like complex attributes, like the map that we described, and
21:52
denormalize your data like this. One of the few caveats with this is that you cannot
21:58
have indexes, as I said. If you don't need an access pattern like, hey, get me all credit card numbers
22:03
and I don't think you have it, or get me all bank accounts, then you can probably employ this approach
22:09
So if you have a reasonable use case, you could employ this denormalization plus complex attributes
22:15
as one of your ideas. But at the same time, we also have constraints, like 400 kilobytes on a row in DynamoDB
22:23
correct? So this strategy cannot be applied to, let's say, things like a customer has orders
22:29
If the customer is placing hundreds of orders every day, this won't work, because you're going to exhaust your row
22:39
size, or item size. So look at other strategies for that. Another strategy is
22:50
to denormalize your data, but also to duplicate your data. For example, in this case you have Stephen King as one of your authors, and you have a few attributes. Let's say the author's birth date is something that you want to get every time, or
23:04
you have a book name and you want to fetch the author's birth date based on it. An author's birth date generally doesn't change. Once you have it, you have it and you don't
23:12
really change it. So what you can do is duplicate the data, and it's not
23:18
like you'll have to deal with any data-integrity issues, because the data doesn't change
23:24
So this is something that can be applied, data duplication plus denormalization, if you have
23:31
no data-integrity issues. And the other strategy is pre-joining your data
23:37
When you pre-join your data, you have a composite primary key plus queries
23:43
So as I said, remember we can co-locate the data. For example, you can data model such that you have Berkshire Hathaway as a partition key
23:52
and you have a Berkshire organization item followed by user items in the same partition
23:58
So what you can do here is co-locate the data. When you co-locate the data and you write a query for
24:06
give me the user items, for example, of Berkshire, you can do a query on Berkshire and you're
24:14
going to get the organization item and also the user items. And based on the organization item, you can get information about your organization
24:21
So in this case, organization to user items is one-to-many
24:27
So whenever you fetch the user items, you can also get the organization item directly
24:31
So you can employ that as one of your strategies for one-to-many relationships.
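A minimal sketch of that pre-joined item collection, using hypothetical ORG#/USER# key conventions:

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("AppTable")  # hypothetical table

# The organization item and its user items share one partition key.
with table.batch_writer() as batch:
    batch.put_item(Item={"PK": "ORG#BERKSHIRE", "SK": "ORG#BERKSHIRE",
                         "SubscriptionLevel": "Enterprise"})
    batch.put_item(Item={"PK": "ORG#BERKSHIRE", "SK": "USER#alice"})
    batch.put_item(Item={"PK": "ORG#BERKSHIRE", "SK": "USER#bob"})

# One query returns the org item *and* all of its users -- a pre-joined read.
resp = table.query(KeyConditionExpression=Key("PK").eq("ORG#BERKSHIRE"))
org_item, *user_items = resp["Items"]  # "ORG#" sorts before "USER#"
```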
24:38
And the next concept is filtering. A lot of times you'll have to filter your data, right
24:45
So you can filter, first of all, by partition key; that's a very basic thing
24:49
But you can also apply some advanced filtering strategies. Again, filtering by partition key and sort key is basic, because those are the
24:56
primary filtering mechanisms. But you can also do something like a sparse index
25:00
The sparse index is an interesting concept. So for example, if you look at the diagram here
25:08
let's say you want to create a GSI, which is a global secondary index
25:16
on one of your items. So what really happens? Maybe this is a better example
25:22
So for example, in this case, you create a GSI PK and SK on the book name
25:29
And, or really this is not a good example. Let me see a bit
25:33
Yeah, I think this is a better example, maybe. So let's say Berkshire has users
25:39
and one of them is an admin; like, every organization has an admin user
25:44
And you want to fetch all the admins, for example. So what you can do is set a GSI partition key 1 and a GSI sort key 1 on that item
25:57
So that way, when you want to fetch all your admins from one place, or all the
26:03
admins, you can fetch them with this. So this is a sparse index, because if you have thousands of people, thousands of employees, at
26:12
Berkshire, let's say there is only one admin, and every company has its own
26:16
admin in this case. So by writing a GSI PK and SK only on those items, you're creating a sparse index
26:23
So that way... maybe I have to take a step back
26:27
So when you create a GSI, DynamoDB is actually creating a new table for you
26:32
So in this case, there's another table that is created with a partition key as the company
26:36
name and the sort key as admin. So you can query: give me the admin for this company, for example
26:45
So you can get the sparse index, which holds the admins, directly. So you can filter by type, as I said, on the admin type, or query by a single attribute in the index.
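A minimal sparse-index sketch; only items that carry the GSI key attributes show up in the index, and all names here are hypothetical:

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("AppTable")  # hypothetical table

# Only the admin item gets the GSI attributes; regular users are not
# written into the index at all -- that is what makes it sparse.
table.put_item(Item={"PK": "ORG#BERKSHIRE", "SK": "USER#carol",
                     "GSI1PK": "ADMIN", "GSI1SK": "ORG#BERKSHIRE"})
table.put_item(Item={"PK": "ORG#BERKSHIRE", "SK": "USER#dave"})  # not indexed

# Fetch every admin across all organizations with one query on the index.
admins = table.query(
    IndexName="GSI1",  # hypothetical index name
    KeyConditionExpression=Key("GSI1PK").eq("ADMIN"),
)["Items"]
```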
26:54
And another filtering strategy that you can employ: you don't always
27:01
have to rely on server-side filtering; you can do client-side filtering
27:05
That means you can fetch one MB of your data from DynamoDB, maybe filter it yourself, and keep the data for other things
27:13
maybe you're doing things like storing your data and then using it for your next page
27:19
So you can also apply client-side filtering. I think one strategy that we did not go deep into is the global secondary index
27:26
I was hinting at it all over the slides, but let's go into this as well
27:31
So in DynamoDB, there are two kinds of indexes that DynamoDB supports
27:38
One is the local secondary index; the other is the global secondary index. I wouldn't pay too much attention to the local index, because a local index is only for one partition
27:48
whereas a global secondary index is for all your partitions. So I would focus on the global secondary index here
27:56
The way the global secondary index works is, when you create your table, you can say
28:01
I want a secondary index, or a global secondary index. Let's say you define it on your organizations; in this case the partition key is the organization, and the sort key value is Berkshire Hathaway
28:12
So DynamoDB underneath creates a new table for it. So every time you make an update to your table
28:18
DynamoDB also makes an update to your index table. But one caveat with a GSI, a global secondary index
28:25
is that you cannot get consistent reads on your index table
28:31
So make sure you work around that, because what you get is eventually consistent indexes and not strong consistency
28:38
So that is something that you have to work on. There's a limit on GSIs today
28:42
I believe it's 20, so you can have 20 GSIs. So make sure you are thinking about that limit as well and do not
28:49
create an index for every attribute. And one more thing is, every time
28:59
because DynamoDB creates these tables, the cost also increases, because every time you create an index
29:06
you're writing to a new table. So what you can do is, you don't have to project the full row information;
29:12
you can say, I only want, for example, the organization and the company name as my partition key and
29:20
sort key, and I only want, let's say, the subscription level; I really don't care about the organization name
29:24
or organization address, for example. So you can also project just that data into your GSI, and save some cost.
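A table-definition sketch showing a GSI that projects only one extra attribute instead of the whole item; every name here is hypothetical:

```python
import boto3

boto3.client("dynamodb").create_table(
    TableName="AppTable",  # hypothetical
    AttributeDefinitions=[
        {"AttributeName": "PK", "AttributeType": "S"},
        {"AttributeName": "SK", "AttributeType": "S"},
        {"AttributeName": "GSI1PK", "AttributeType": "S"},
        {"AttributeName": "GSI1SK", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "PK", "KeyType": "HASH"},
        {"AttributeName": "SK", "KeyType": "RANGE"},
    ],
    GlobalSecondaryIndexes=[{
        "IndexName": "GSI1",
        "KeySchema": [
            {"AttributeName": "GSI1PK", "KeyType": "HASH"},
            {"AttributeName": "GSI1SK", "KeyType": "RANGE"},
        ],
        # Project only SubscriptionLevel into the index rather than ALL
        # attributes: a smaller index means cheaper writes and storage.
        "Projection": {"ProjectionType": "INCLUDE",
                       "NonKeyAttributes": ["SubscriptionLevel"]},
    }],
    BillingMode="PAY_PER_REQUEST",
)
```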
29:32
Yeah, and we spoke about sparse indexes
29:36
Sparse indexes are tables, or indexes, that hold a smaller amount of data, so you can
29:44
write your filters with them. I think we also went through this, which is the sparse index where you have an admin
29:55
So, I mean, maybe this is a good example. So how does the other table look,
30:00
how does the GSI table look? So when you write to your primary table, the GSI
30:06
is another table, so this is how it will look: you have your partition key, sort key
30:11
and the full attributes in this case. But again, you can project just the data you need when you're defining your
30:18
models. All right, I think we covered most of it. Thanks for
30:27
joining. Let's go for Q&A
#Cloud Storage
#Data Management
#Programming
#Software
#Web Services