Hashing Explained in 5 minutes
1 views
May 21, 2025
Hashing Explained in 5 minutes
View Video Transcript
0:00
in its simplest form hashing is the
0:02
process of converting an arbitrary piece
0:04
of data into fixed size values typically
0:07
an integer value the data can be a
0:09
username IP address or anything that can
0:11
be hashed or transformed into an integer
0:14
value to understand hashing further
0:16
let's take a simple example of load
0:18
balancer which is used to distribute
0:20
network traffic across multiple servers
0:22
here we have three clients C1 C2 and C3
0:25
there are three servers the clients can
0:27
talk to but instead of making them talk
0:29
directly we'll have a load balancer in
0:32
between so that client requests are
0:33
routed and evenly distributed across
0:35
servers when a load balancer redirects
0:38
or routes a request from a client to the
0:40
server it needs to have a server
0:42
selection strategy that is which request
0:44
should be served by which server one way
0:47
of server selection could be randomly
0:49
selecting the servers another approach
0:51
could be around robin wherein the load
0:53
balancer goes through each server in a
0:55
specific order maybe starting from top
0:57
server A then going to B and then C and
1:00
then going back to A like a loop however
1:03
depending upon the nature of the traffic
1:05
these strategies can be problematic
1:08
let's say each of the requests going
1:09
from the client to server takes a long
1:11
time to process by the server that is
1:13
they are computationally very expensive
1:15
request one way to deal with the request
1:18
is caching that is each server will have
1:21
some inmemory cache so instead of
1:23
computing the request every time the
1:25
server checks if the request has been
1:27
cached if it hasn't been then it
1:30
actually performs the operation
1:32
otherwise it skips the operation and
1:33
immediately returns the cache response
1:35
which it already computed for a similar
1:37
request previously
1:40
as you can imagine if your load balancer
1:42
is simply selecting the servers randomly
1:44
or following round robin there can be
1:46
lot of cash misses and unnecessary
1:50
computation let's take an example let's
1:52
say client one sends a request the load
1:55
balancer uses a roundrobin approach and
1:57
routes a request to server A the server
2:00
checks the cache and finds it empty and
2:01
performs a long operation stores the
2:04
response in cache and returns the
2:05
response to client one now if client one
2:08
sent the same request again with
2:09
roundroin approach the load balancer
2:12
routes the request to server B and since
2:14
server B never saw the request earlier
2:16
so it doesn't have the response in its
2:18
cache so basically it has to perform the
2:20
same operation even though server A
2:23
already performed the operation and has
2:25
it in it cache which is essentially a
2:28
server A cache mess by the load balancer
2:31
that is cache built by server A was not
2:33
utilized by the load balancer
2:36
we need to have a better server
2:37
selection strategy in which the same
2:39
request can be sent to the same server
2:42
every time utilizing the cache and
2:44
avoiding unnecessary
2:46
recomputations this is one of the places
2:48
where hashing can be useful we can hash
2:51
each of the client request coming to the
2:53
load balancer and based on the hash
2:55
value we can route the request according
2:58
to the position of the server so for the
3:00
sake of simplicity let's say we have a
3:02
hashing function given to us which takes
3:04
the name of each client and transforms
3:06
it to an integer value from the
3:08
definition of our hash function we know
3:10
it takes an arbitrary piece of data and
3:13
transforms into a fixed integer value so
3:16
say when we pass C1 through the hashing
3:18
function it is transformed to 11 C2 is
3:22
transformed to 12 and C3 to 13 a
3:25
simplest form of hashing strategy is to
3:27
use the modulo operator which is in fact
3:30
the percentage sign on all the hashes
3:33
and divide it by the number of servers
3:34
we have so if we do 11 modulo 3 the
3:39
result will be two that is the remainder
3:42
when you divide 11 by 3 for 12 it is
3:45
zero for 13 it's one so all requests
3:48
from C1 will now go to this server
3:51
request from C2 will go to this server
3:53
and request from C3 will go to this
3:55
server maximizing our cache hits and
3:58
minimizing unnecessary recompetition now
4:00
it's important to have hashing function
4:02
that evenly distributes requests from
4:05
each client to server like in our case
4:08
you don't want requests to be bombarded
4:10
to a single server creating a server
4:12
hotspot typically engineers don't write
4:14
their own hash function from scratch
4:16
they use industry gate hashing function
4:18
such as MD5 SSH1 and decret and so on
4:22
you can be assured that these hashing
4:24
functions provide even distribution by
4:26
default now if you're working on a large
4:29
scale distributed system there are high
4:31
chances that you would end up either
4:33
adding or removing servers depending
4:35
upon network traffic or demand so if we
4:38
add another server four and do a mod now
4:42
server 3 will never be called and we
4:44
lose even distribution of all the
4:45
request a brute force approach to avoid
4:48
this is to recomputee all the hashes and
4:50
then apply the mod but that can be
4:53
extremely expensive we need a better
4:55
solution check out my video on
4:57
consistent hashing which solve this
4:58
[Music]
#Networking