0:01
hello everyone welcome to another
0:03
episode of the Cloud Show today we're
0:05
going to talk about distributed systems
0:08
and we're going to talk about it in the
0:09
terms of figuring out the most
0:13
sort of underrated and hard things about
0:16
building distributed systems and for
0:19
that the star of the show is the
0:21
distributed systems expert Leila
0:34
hello hello he would have called you
0:38
there the distributed systems expert
0:39
yeah i'm like "Oh my god he's overdoing
0:41
it again." No no that's true it's true
0:45
thanks hey it's good to see you welcome
0:47
back to the show this is your second
0:49
appearance yes thank you for having me
0:52
again it's good to see you absolutely
0:53
you didn't have enough of the show last
0:55
time that's good apparently you keep
0:57
convincing me so here I am here you are
1:01
all right that's great and and um I know
1:04
that you know uh distributed systems is
1:06
a is a big deal uh for a lot of people
1:09
um and and they're trying to build
1:12
something really great and it's hard um
1:14
there are some problems and some
1:16
challenges to tackle in this space so
1:21
like who are you why why do you know
1:23
anything about distributed systems if
1:26
yeah great so my name's Leila uh I work
1:29
for a company called Particular Software
1:31
where we build NServiceBus and I'm
1:33
basically uh mostly focused and
1:35
specialized in message-based systems
1:37
event-driven systems if you will and
1:39
I've been building these systems for
1:41
over a decade now so there's one or two
1:44
things that I picked up along the way i
1:46
guess just a little bit just a little
1:49
bit and but you're based out of you're
1:51
from Belgium right yes absolutely
1:54
waffles and fries and ice cream and
1:57
lovely things all good things all right
1:59
now actually I was I was there just
2:01
recently right for the for the
2:03
conference it's a really good conference
2:04
down there at the um um Techorama yeah
2:08
yeah yeah that was absolutely one of my
2:10
favorites in fact we had already
2:12
scheduled this uh this session at that
2:15
time so we didn't schedule it then we
2:17
had we had we had this on the books for
2:19
a couple of months right yes you're
2:21
you're a busy person and you you're like
2:22
"Can I schedule you in oh June
2:27
sorry but I'm here i'm here that's all."
2:30
Now you're here and it's all
2:32
to our benefit so let's dive into it
2:35
let's go into distributed systems some
2:38
underrated hard things about building
2:41
distributed systems i mean it's all the
2:43
floor is yours take it away what are we
2:44
talking about thank you well one of the
2:47
things that um I like to bring up is
2:49
that we spend so much time uh especially
2:52
in event-driven systems talking about
2:54
oh should this be a command or should
2:56
this be an event and what are our
2:58
service boundaries and which service is
3:00
going to communicate with what and no
3:01
that service shouldn't call that service we
3:04
don't want that type of coupling should
3:05
we do orchestration or choreography and
3:08
it's like so many discussions and I mean
3:12
they're valuable discussions right um
3:14
they're important discussions They shape
3:16
what the system is going to look like but
3:18
there's this one thing that is like an
3:20
oversight over and over and over again
3:23
and it's basically how are we going to
3:26
version this system like you know the
3:28
system is going to evolve over time
3:30
right it's not going to be perfect from
3:33
day one business requirements change our
3:36
understanding of the domain changes and
3:38
then things are going to have to change
3:40
so how do we do that in a way that
3:42
doesn't break the rest of the system oh
3:45
that's interesting so so the the the
3:47
over focus is on the shape of the thing
3:50
uh but they're not understanding or
3:52
maybe not putting enough emphasis on the
3:55
fact that there is an evolution here you
3:57
will you will have revisions of this
3:59
thing yes so I think it's really
4:02
important to think about it that way
4:04
from the start like how are we going to
4:06
prepare ourselves for that type of So
4:08
there's a version you know 01 or version
4:10
one like you have to have a version from
4:12
the beginning right not as an add-on
4:15
later yes so to start some of those
4:18
things also tie into how do you design
4:20
the system because I think we have to
4:22
design for evolvability as well and one
4:25
of the things that I um like to
4:27
emphasize uh whenever I can is to say
4:30
it's important to start thinking about
4:32
public facing events and private facing
4:34
events from the beginning because when
4:37
you have private events
4:39
then basically the subscribers are
4:41
within the boundaries of that service
4:43
right yes they're not going to be
4:44
outside of that hence the word private
4:48
but the thing is that when you have
4:50
private events they're more easily
4:53
changed in the sense that you control
4:55
the subscribers right you can just
4:57
change them because you know who's going
4:59
to use them and you can Yeah yeah yeah
5:03
makes sense exactly but with public
5:05
events well then as a publisher the
5:08
whole idea of pub/sub is that you're
5:09
unaware of your consumers that you don't
5:11
care who is subscribed and why they are
5:13
subscribed and what they're doing with
5:17
One of the biggest mistakes is putting a
5:20
lot of information into that payload and
5:22
then you're like "Ah that's probably not
5:24
a good idea oh these teams are doing
5:26
what with that data and
5:29
and then it's kind of like really tricky
5:31
to take something away because that
5:33
breaks your consumers and it's
5:35
especially tricky in these large
5:36
distributed systems where you don't even
5:39
know who the consumers are right that's
5:42
true so you kind of have to think about
5:44
these things up front so you don't don't
5:46
just like expose everything and now
5:48
they're taking all and you're like what
5:51
did are you taking all you weren't
5:53
supposed to take all why are you doing
5:54
this exactly so one of the things that I
5:58
always like to compare this to is
6:00
whenever you think about public facing
6:02
events think about it's like it's like
6:05
posting a picture on social media like
6:08
are you going to regret it later like
6:10
how much do you want to give away
6:13
basically fair enough i may have had a
6:16
couple of couple or two of those in
6:18
incidents myself maybe
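The public versus private distinction discussed here can be sketched in a few lines. This is only an illustration: the event names, the fields, and the `to_public` helper are all hypothetical and are not taken from the conversation or from any particular library.

```python
from dataclasses import dataclass, asdict


# Hypothetical private event: its subscribers live inside the service
# boundary, so it can carry internal detail and change freely.
@dataclass(frozen=True)
class OrderPlacedInternal:
    order_id: str
    customer_id: str
    customer_email: str
    fraud_score: float
    warehouse_hint: str


# Hypothetical public event: the contract for consumers we may not know,
# so it exposes only what we are willing to support long-term.
@dataclass(frozen=True)
class OrderPlaced:
    order_id: str
    customer_id: str


def to_public(event: OrderPlacedInternal) -> OrderPlaced:
    """Copy only the fields that belong to the public contract."""
    return OrderPlaced(order_id=event.order_id, customer_id=event.customer_id)


internal = OrderPlacedInternal("o-1", "c-42", "someone@example.com", 0.07, "AMS-2")
public_payload = asdict(to_public(internal))
```

Keeping the mapping explicit means a new internal field never leaks into the public payload by accident; someone has to decide to add it.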
6:21
yeah it's it's a good way to get people
6:23
to think about it but of course that's
6:25
just the first step is right just how do
6:27
you design it um I think a good way to
6:30
also think about it is uh you know
6:32
something for the audience to dig a
6:33
little bit deeper into summary events for
6:36
example you could have multiple private
6:38
events but then your consumers don't
6:40
necessarily care about super granular
6:43
level of events those may be very
6:45
helpful and meaningful inside your
6:47
service boundary but it might be helpful
6:49
to basically summarize them into a more
6:52
overarching event so let's say that
6:54
stuff is added to your shopping basket
6:56
all the time and you could be publishing
6:58
item added item removed item increased
7:01
whatever it is right so um the thing is
7:05
that most of your consumers don't care
7:07
about that and at checkout you just want
7:09
to know like okay what's in the basket
7:11
now i don't necessarily want to have all
7:13
of that information all of that traffic
7:16
and all of that complexity of
7:17
understanding what happened before and
7:19
all of that and ordering issues and
7:21
things like that so then it can be
7:23
useful to have that service publish a
7:25
summary event of this is the shopping
7:27
basket at checkout right and then that
7:29
would give you like the view that you
7:32
want from a consumer perspective oh man
7:35
so that's one way to go about it yeah
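A rough sketch of the shopping basket example, assuming hypothetical `ItemAdded` and `ItemRemoved` private events and a hypothetical public `BasketCheckedOut` summary event. None of these names come from a real system.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ItemAdded:  # private, fine-grained event
    sku: str
    quantity: int


@dataclass(frozen=True)
class ItemRemoved:  # private, fine-grained event
    sku: str
    quantity: int


@dataclass(frozen=True)
class BasketCheckedOut:  # public summary event
    basket_id: str
    items: dict  # sku -> final quantity


def summarize(basket_id, private_events):
    """Fold the private event history into one view of the basket."""
    quantities = Counter()
    for event in private_events:
        if isinstance(event, ItemAdded):
            quantities[event.sku] += event.quantity
        elif isinstance(event, ItemRemoved):
            quantities[event.sku] -= event.quantity
    # Items whose quantity dropped to zero are not in the basket anymore.
    items = {sku: qty for sku, qty in quantities.items() if qty > 0}
    return BasketCheckedOut(basket_id, items)


history = [
    ItemAdded("apple", 2),
    ItemAdded("pear", 1),
    ItemRemoved("pear", 1),
    ItemAdded("apple", 1),
]
summary = summarize("b-1", history)
```

Consumers of the summary never see the four private events or their ordering; they only receive the final basket contents.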
7:37
but then of course that's not all right
7:39
things are going to change whether we
7:40
want it or not like I always say change
7:42
is inevitable right um so then we have
7:46
to prepare ourselves in in terms of how
7:49
are we going to version these events and
7:52
like a lot of attention is paid to that
7:54
in terms of APIs but I don't see that
7:56
same attention being spent on on our
7:59
events but events are are the API right
8:03
in terms of of event driven systems so
8:06
how do you prepare yourselves for that
8:08
and and the first thing to do is
8:10
basically to have a schema to have a
8:13
schema Now for that data how is that
8:15
going to look like how is that
8:16
structured that can also already provide
8:19
like additional insights into how your
8:22
consumers are even supposed to treat
8:24
that data consume that data like a good
8:26
example is temperature right okay you
8:29
have a field in there that's temperature
8:31
okay what does that mean like is it
8:33
Celsius or is it Fahrenheit or is it
8:36
Kelvin and is it the temperature of
8:38
light right like it it could be anything
8:41
if it's science then it could be Kelvin
8:44
right and and a schema allows you to add
8:47
that information to add that sort of
8:48
what we call metadata so that it's
8:50
easier on the consumer side to
8:52
understand how that data should be
8:54
interpreted but what's especially
8:56
helpful about schemas is that they
9:01
ah and then then we can start to talk
9:04
about okay how is this schema going to
9:07
evolve and what type of compatibility
9:09
strategies do we want to have for this
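To make the temperature example concrete, here is a minimal sketch of a schema that carries the unit as metadata. The dictionary layout and the `validate` helper are assumptions made for illustration; a real system would more likely use an established schema format and a schema registry.

```python
# Hypothetical schema: every field states its type, whether it is required,
# and (for temperature) the unit consumers should assume.
TEMPERATURE_READING_SCHEMA = {
    "name": "TemperatureReading",
    "version": 1,
    "fields": {
        "sensor_id": {"type": str, "required": True,
                      "doc": "Identifier of the reporting sensor"},
        "temperature": {"type": float, "required": True, "unit": "celsius",
                        "doc": "Air temperature measured at the sensor"},
        "note": {"type": str, "required": False,
                 "doc": "Optional free-text remark"},
    },
}


def validate(message, schema):
    """Return a list of problems; an empty list means the message conforms."""
    errors = []
    for name, spec in schema["fields"].items():
        if name not in message:
            if spec["required"]:
                errors.append(f"missing required field: {name}")
            continue
        if not isinstance(message[name], spec["type"]):
            errors.append(f"wrong type for field: {name}")
    return errors


ok = validate({"sensor_id": "s-1", "temperature": 21.5}, TEMPERATURE_READING_SCHEMA)
bad = validate({"sensor_id": "s-1"}, TEMPERATURE_READING_SCHEMA)
```

With the unit written into the schema, a consumer does not have to guess whether 21.5 means Celsius, Fahrenheit, or Kelvin.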
9:14
so um there are a couple of different
9:17
strategies you could have forward
9:18
compatibility backward compatibility
9:21
full compatibility and you can even make
9:23
that transitive which can apply to any of the
9:26
previous back uh compatibility
9:28
strategies forward backward and full and
9:30
transitive basically means that um for
9:33
example let's say that you want to use
9:35
forward compatibility which would make
9:37
sense probably if you're using Azure
9:40
Service Bus for example as a broker then
9:42
it allows you to delete optional fields
9:44
and to add fields whether they're
9:46
required or not and consumers can still
9:49
use the previous version of the schema
9:52
to basically receive messages that were
9:55
constructed with a newer version of the
9:56
schema i see yep yep right so when you
10:00
use backward compatibility it's kind of
10:02
reversed and you can delete fields and
10:04
add optional fields and then consumers
10:06
can use the next version of the schema
10:08
to still be able to process messages
10:10
that were constructed with the previous
10:12
version right now when you make that
10:14
transitive it doesn't only apply to the
10:17
previous or the next version but to all
10:19
of the versions of that schema that you
10:21
have so it becomes a lot more
10:24
um and the thing is that that basically
10:26
allows you to communicate to your
10:29
consumers how can this change like how
10:32
can I expect this thing to evolve and
10:35
it allows consumers to make
10:38
better decisions on how to write their
10:40
code and how to uh basically prepare
10:43
themselves better and gives you a lot
10:45
more flexibility from a producer
10:46
perspective because as long as you
10:49
fulfill the compatibility promise that
10:52
you set out to basically adhere to
10:55
like you can keep upgrading to new
10:57
versions of the schema without breaking
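The compatibility rules described above can be expressed as a small check. This is a simplified sketch, not how any particular schema registry implements it: a schema is modeled as a plain dictionary mapping field name to `True` (required) or `False` (optional), and all function names are made up for the example.

```python
def forward_compatible(old, new):
    """Consumers on `old` can read messages written with `new`.

    Everything the old schema requires must still be required in the new one;
    optional fields may be deleted and any field may be added.
    """
    return all(new.get(name) is True for name, required in old.items() if required)


def backward_compatible(old, new):
    """Consumers on `new` can read messages written with `old`.

    Everything the new schema requires must already be required in the old
    one; fields may be deleted and only optional fields may be added.
    """
    return all(old.get(name) is True for name, required in new.items() if required)


def full_compatible(old, new):
    return forward_compatible(old, new) and backward_compatible(old, new)


def is_compatible(check, history, candidate, transitive=False):
    """Check `candidate` against the latest version, or all of them if transitive."""
    previous = history if transitive else history[-1:]
    return all(check(old, candidate) for old in previous)


v1 = {"order_id": True, "coupon": False}
v2 = {"order_id": True, "currency": True}   # drops optional coupon, adds required currency
v3 = {"order_id": True, "currency": True, "note": False}  # adds optional note

v1_to_v2_forward = forward_compatible(v1, v2)      # True
v1_to_v2_backward = backward_compatible(v1, v2)    # False: currency is new and required
v3_backward = is_compatible(backward_compatible, [v1, v2], v3)                  # True
v3_backward_transitive = is_compatible(backward_compatible, [v1, v2], v3, True)  # False
```

The last two lines show what the transitive variants add: `v3` is backward compatible with its immediate predecessor `v2`, but not with `v1`, because messages written with `v1` never contained `currency`.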
11:02
now I also want to caveat that right
11:05
because if you add something to a schema
11:08
well your consumers will still have to
11:10
upgrade to use that thing makes sense
11:12
right yes but at least it won't break
11:17
they're not going to fail because oh
11:19
something new was added and they don't
11:21
know what to do with that right and the
11:23
reverse could also be true if you delete
11:25
an opt let's say an optional field well
11:28
if your consumers were aware that that
11:30
thing was optional which is indicated by
11:32
the schema then they would have been
11:34
prepared to deal with that being removed
11:36
because it will just not be there and
11:39
they can deal with that yeah yeah yeah
11:41
yeah it does make sense because I can
11:43
see how the client side or the
11:45
integrator or whomever like the the
11:48
who's calling um can can have code
11:52
dependent on the exact nature of the of
11:54
the API or the the schema from before
11:57
right uh if they've done it right then
12:00
they can they can survive if you have
12:01
declared how you are how you are revving
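On the consumer side, "doing it right" could look like the following handler, which ignores fields it does not know and copes with an optional field being absent. The message shape and the `handle_order_placed` function are hypothetical.

```python
import json


def handle_order_placed(raw: str) -> str:
    """Tolerant consumer: relies only on what the schema guarantees."""
    message = json.loads(raw)
    order_id = message["order_id"]    # required by the schema, safe to index
    coupon = message.get("coupon")    # optional, so it may be missing
    # Any other field in the message is simply ignored, so a producer
    # adding fields does not break this handler.
    if coupon is None:
        return f"order {order_id} without coupon"
    return f"order {order_id} with coupon {coupon}"


with_new_field = handle_order_placed('{"order_id": "o-1", "currency": "EUR"}')
with_coupon = handle_order_placed('{"order_id": "o-2", "coupon": "SPRING"}')
```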
12:06
exactly so by setting up that what I
12:10
always tend to say is that a
12:11
compatibility um strategy that you set
12:14
up it's basically a policy decision
12:17
right it's it's communicating how is the
12:20
system going to evolve also from the
12:22
producer side but what can our consumers
12:25
also expect from that and and and and
12:28
this is definitely one of the choices
12:30
that need to be made earlier when
12:32
building the system it's not like this
12:34
afterthought oh yes and now we need to
12:37
No this is something that needs to be
12:39
discussed up front you need to agree on
12:41
the strategy uh how we're going to apply
12:44
this how do we see everything evolve so
12:47
that you can make as few breaking
12:49
changes as possible basically yeah yeah
12:52
so one place where I encounter a lot of
12:56
um a lot of API schemas and a lot of
12:58
versions is or API versions is in um in
13:03
the Azure management uh endpoint uh
13:06
there's you know there's hundreds of
13:08
services right and they all have they
13:10
all have a version and then you can
13:14
query you can ask okay which versions
13:16
are you supporting still if you will
13:19
right and and sometimes there has been
13:21
there has been some warnings come up as
13:24
oh this is going to this is going to be
13:25
changed by so and so date this is not
13:28
going to be compatible anymore
13:30
that sort of thing so how do you how do
13:33
you go about like can can even is that
13:35
even a thing you can do how do you go
13:37
about removing an old version like okay
13:40
we can't support this old version anymore
13:42
what are we going to do that's a really
13:44
good question so that's for me like the
13:47
next step right so first we need to try
13:49
to set up a system that allows you to
13:51
make some evolution that is not breaking
13:54
okay um and then the next step would be
13:56
okay like really there's nothing we can
13:58
do about this we really need to break
14:00
this you have a breaking change well
14:03
then it then it kind of becomes a
14:05
completely new thing because the schema
14:07
has changed so basically the the meaning
14:10
of that event has now changed and now
14:12
needs to be re-evaluated by your
14:15
consumers and the best strategy to deal
14:17
with that is to basically first notify
14:20
your consumers like you know this this
14:22
is not going to be supported but to also
14:25
not go and delete it immediately to
14:27
basically work with a deprecation uh
14:30
warning like we do with public
14:32
APIs right um so basically to say okay
14:36
this will be supported for the next
14:38
three months and throughout that sort of
14:41
deprecation period what is important
14:43
from a producer perspective is let's say
14:46
that you have an old order placed event
14:48
and a new one right is that throughout
14:51
the deprecation period you continue to
14:54
publish both both of those events of
14:57
course and that that allows your
14:59
consumers to gradually start basically
15:02
upgrading to that new order placed event
15:05
in in in favor of the old one but even
15:08
that is quite complicated right because
15:10
the thing is that especially in an
15:11
event-driven system well let's say that
15:14
you're using Azure Service Bus okay so
15:17
you're subscribed to this topic and now
15:19
there's a completely new topic because
15:21
it's a completely new event it has
15:23
semantically changed therefore it goes
15:25
to a different topic at least if you
15:28
chose to use a topic per event type of
15:30
topology of course assuming that for a
15:32
moment now the problem is that there's
15:35
in-flight messages right how do you not
15:38
lose anything when you're trying to do
15:40
that yeah exactly and that that requires
15:43
a really sort of step-by-step type of
15:45
upgrade strategy so the whole idea is
15:48
that during this dual publishing period
15:52
when when the producer is publishing
15:54
both of the versions of the event you
15:56
basically as a subscriber have to go and
16:00
subscribe to that new type of an event
16:03
and as close to that as possible
16:05
unsubscribe from the old one yes now
16:09
unsubscribing does not mean that your
16:11
in-flight messages are lost like in Azure
16:13
Service Bus when you subscribe to a topic
16:15
you get a virtual queue or subqueue whatever
16:17
you want to call it right so your
16:19
messages are still there so you need to
16:21
keep the handling code in place so that
16:24
you can consume all of those in-flight
16:26
messages and you don't lose them
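The subscriber-side steps can be walked through with a toy in-memory model. The `Topic` class below is only a stand-in written for this sketch; it is not the Azure Service Bus SDK, and it simply assumes the behavior described above, where messages already delivered to the subscriber's queue stay there after unsubscribing.

```python
from collections import deque


class Topic:
    """Toy in-memory topic, one queue per subscriber (not a real broker client)."""

    def __init__(self):
        self._queues = {}

    def subscribe(self, subscriber):
        queue = deque()
        self._queues[subscriber] = queue
        return queue

    def unsubscribe(self, subscriber):
        # Stops routing new messages; the caller still holds the queue
        # with whatever was already delivered (the in-flight messages).
        self._queues.pop(subscriber)

    def publish(self, message):
        for queue in self._queues.values():
            queue.append(message)


def drain(queue, handler):
    results = []
    while queue:
        results.append(handler(queue.popleft()))
    return results


def handle_v1(message):  # old handling code, kept until the old queue is empty
    return f"v1:{message['order_id']}"


def handle_v2(message):
    return f"v2:{message['order_id']}"


old_topic, new_topic = Topic(), Topic()
old_queue = old_topic.subscribe("billing")
old_topic.publish({"order_id": 1})            # in flight before the switch

# Step 1 and 2: subscribe to the new event, then unsubscribe from the old one.
new_queue = new_topic.subscribe("billing")
old_topic.unsubscribe("billing")

# The producer is still dual publishing during the deprecation period.
old_topic.publish({"order_id": 2})
new_topic.publish({"order_id": 2, "currency": "EUR"})

# Step 3: keep both handlers around and drain what is left.
processed = drain(old_queue, handle_v1) + drain(new_queue, handle_v2)
```

Order 1 is handled by the old code path and order 2 only by the new one, so nothing is lost and nothing is processed twice in this simplified run. A real migration cannot make the subscribe and unsubscribe steps atomic, so handlers would typically also need to tolerate duplicates.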
16:31
right and so you then you'll have to
16:33
check is there any is there any more
16:36
data in this for me at all or can I stop
16:39
using it and then on the other side if
16:41
you're if you're the publisher you you
16:43
you probably want to check I have this
16:46
break here uh how many are still using
16:48
the old one and then how many are now
16:51
using the new one is do we see a trend
16:53
of that's the tricky part that's the
16:55
tricky part is as a producer you can't
16:57
really do that because like we mentioned
16:59
in the beginning sometimes you don't
17:01
know who your consumers are and that's
17:03
why purely from the producer perspective
17:06
I think we need to be a little bit
17:08
flexible with the deprecation period
17:10
like don't make it a week don't make it
17:15
but maybe like three months five months
17:17
depending on how quickly your system can
17:20
evolve right but what I also
17:23
always say from a producer perspective
17:25
is that if you have a way that your
17:28
subscribers can get notified of this
17:31
change right and you have set a
17:34
deprecation time well for you it's not
17:38
your responsibility to then go and check
17:40
the consumers right this is one of the
17:41
agreements that you can make within the
17:43
system is like okay when the deprecation
17:46
time has basically elapsed and
17:49
we're done we've given it six months
17:51
we're done then you can basically stop
17:53
publishing that old version of the event
17:56
and at that point assume that all of
17:58
your consumers have upgraded if they're
18:00
good citizens of the system landscape
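From the producer's point of view, the deprecation policy boils down to a date check. The sketch below assumes a hypothetical cut-off date, hypothetical topic names, and a made-up `events_to_publish` helper; actually sending the events to a broker is left out.

```python
from datetime import date

# Hypothetical end of the announced deprecation period.
DEPRECATION_ENDS = date(2025, 6, 30)


def events_to_publish(order_id, today, deprecation_ends=DEPRECATION_ENDS):
    """Return (topic, payload) pairs the producer should publish today."""
    new_event = ("order-placed-v2", {"order_id": order_id, "currency": "EUR"})
    if today <= deprecation_ends:
        # Dual publishing: consumers that have not upgraded still get the old event.
        old_event = ("order-placed", {"order_id": order_id})
        return [old_event, new_event]
    # Period elapsed: the producer stops publishing the old version.
    return [new_event]


during = events_to_publish("o-1", date(2025, 5, 1))
after = events_to_publish("o-1", date(2025, 7, 1))
```

The producer never inspects who is subscribed; the agreed date is the only input, which matches the idea that consumers are responsible for upgrading in time.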
18:05
yeah this is hard of course well this is
18:08
all brilliant and I guess we could be
18:10
talking about this for a long time still
18:12
but we don't have any more time to talk
18:14
about this so we have to stop doing that
18:16
which is a shame but it's it's been a
18:19
real pleasure having you on the show to
18:21
talk about these very complicated
18:22
matters of of versioning and and
18:25
handling distributed systems thank you
18:27
for being on the show thank you for
18:29
having me good to chat with you yeah
18:32
totally my pleasure and audience see you
18:34
next time on the Cloud Show bye-bye