0:00
So we're going to talk about statistics in SQL Server, and this is one of my favorite topics
0:08
So I'm going to zip through some of my beginning slides here because I want to get to talking
0:13
about the real content. This is all the information about me, and really the most important thing on here is
0:20
my email address, erin at sqlskills.com. If you have a question that doesn't get answered today, feel free to send me an email
0:28
I'm happy to talk about statistics at any point. The SQL skills team, I have our names up here on the screen
0:36
because sometimes I talk about other members of the team and I want you to know that they're real people
0:40
So Paul and Kimberly are the owners. Tim, Jonathan, and myself are consultants in the team
0:45
We offer training at conferences like this one. We also do our own training, immersion events
0:51
that are gonna be online this year. And we do provide consulting as well
0:56
So if you're interested in learning more about SQLskills, please hit our website. I've got multiple links up here, but sqlskills.com is where you want to go
1:04
And the scripts and the slides for today, they are going to be available on GitHub. And I'm just
1:12
going to bring this up for a second. And actually what I should really do is take this and throw this
1:19
into the chat somewhere, but I don't know how to do that. So hopefully if you just search GitHub
1:24
and my name, Erin Stellato, you'll see this SQL Server statistics repository and you can grab
1:30
everything from today there. Please enter any questions you have in the comments section
1:37
I'm not going to read the abstract. I'm hoping that you all have an idea of what we're going to
1:42
talk about here with regard to statistics and their importance when we talk about them as it
1:50
relates to SQL server. So we're going to do a review, and we're going to look at how we find
1:56
statistics in SQL server. And then we'll talk about what happens when they go bad and how that
2:01
can affect performance. And we will also then talk about how we update them, because that is really
2:07
one of the most critical things. So the reason that we care so much about them, the reason that
2:13
I have done, I don't know, well over 100, if not hundreds of, sessions on statistics
2:19
is because they are absolutely essential when we talk about performance. And a statistic in SQL Server is information about the distribution of data in a column
2:32
or a set of columns in a table or a view. So if you, let's say you have a column, which is customer ID, and you want to understand
2:43
not just how many different customer IDs we have, but customer IDs
2:49
as it relates to the number of orders that we have in the orders table. Our statistics are how many
2:54
customer IDs do we have for all these different orders, right? What does that look like? And it's
3:01
important to have these stats, to have this information available for the optimizer, because
3:07
that determines what it's going to do, what kind of plan it's going to generate. And it uses these
3:14
estimates from stats to figure out how many rows are going to be returned, which is also called
3:20
cardinality. So you may hear the terms cardinality estimator, right, which is basically, we're going
3:26
to use statistics to figure out how much data is going to come back. So pictorially or visually
3:32
if you want to think about how we look at this, you submit a query to SQL Server
3:38
and it then goes to the query optimizer, which uses statistics to say
3:44
what is the object that we're going to access? Are we accessing a table
3:49
Are we accessing an index? Are we doing a seek or scan of that index? How are we joining
3:56
to other tables? How much memory do we need? So there's a lot of information that basically comes
4:04
from or a lot of decisions that are made based upon these statistics. And then from all of that
4:12
information, the optimizer puts together a plan, which then gets executed. So one of the most
4:17
important things is to understand that stats are going to influence the performance of your queries
4:23
Now, when we talk about stats, there's a lot of options that are relevant here
4:29
So I want to take a moment to highlight them. The first is auto create statistics. And this option is enabled by default
4:36
Whenever you create a new user database, I recommend you leave it enabled
4:42
Auto update statistics is also enabled by default when you create a new database
4:48
And I recommend you also leave that enabled. There's a third option here, which is auto update statistics asynchronously
4:57
This, one of my good friends, Glenn Berry, definitely recommends that you update this
5:03
or that you enable this option. It is not enabled by default
5:07
And there's something I want you to understand about that before you decide if you're going
5:11
to or not. So I'm going to come back to that later on. And then there's also this option of auto create incremental statistics, which is going
5:18
to be relevant for you if you are using partitioning. If you're not using partitioning, don't worry about it at all.
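Just for reference, here's a hedged sketch of how you can check and change these options yourself; the database name is mine, not from the demo:

SELECT name, is_auto_create_stats_on, is_auto_update_stats_on,
       is_auto_update_stats_async_on, is_auto_create_stats_incremental_on
FROM sys.databases
WHERE name = N'WideWorldImporters';

ALTER DATABASE WideWorldImporters SET AUTO_UPDATE_STATISTICS_ASYNC ON;  -- off by default, as noted above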
5:24
Now, these stats, how are they created? There are multiple ways that they can come
5:31
into existence within SQL Server. The first is automatically. Every time you create an index
5:37
every time you create a filtered index, every time you create an indexed view, you will have
5:43
a statistic created. So if you look at this syntax here, create non-clustered index on my sales
5:50
orders table on the order date column. So if I create this index, what happens is when I then
5:59
look at sys.stats, I also have a statistic. And notice, what's really important to notice here
6:06
is that the name of the statistic is the same as the name of the index. So the index was named
6:12
NCI_Orders_OrderDate. The statistic here has the same name, but they are two
6:21
different entities in SQL Server. That's really important to understand. Now, we also get statistics
6:29
created on individual columns within SQL Server. So if you query a column or you join on it, you get
6:37
a statistic, assuming there's a little asterisk there, assuming that it doesn't already exist
6:43
because it's the leading column in an index. So for example, let me come back here for one second
6:50
In this example here, we created an index on order date. So if I have another query that queries on
6:56
order date or joins on it, it's not going to create another statistic. There's already one
7:01
here. But in the case where I have a query against that orders table, where I say
7:06
expected delivery date equals a value. If there's not already a column statistic there, SQL Server will go
7:15
ahead and create that statistic, that column level statistic for me. Now you'll notice the naming here is way different, right
7:23
The column is named ExpectedDeliveryDate, and we have a statistic name that starts with _WA_Sys_ and then a bunch of numbers and letters. That's because it was auto-created and it's column level
7:35
So that's a naming convention that gets followed. So I'll make a note here to say: if you ever manually create statistics, don't
7:44
follow that naming convention. Use your own, right? Because this is how you know it's an automatically created column level statistic
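As a hedged illustration of spotting those auto-created column statistics, here's the kind of query you could run; the table name is just my assumption from the demo database:

SELECT s.name AS stat_name, c.name AS column_name, s.auto_created
FROM sys.stats AS s
JOIN sys.stats_columns AS sc
    ON s.object_id = sc.object_id AND s.stats_id = sc.stats_id
JOIN sys.columns AS c
    ON sc.object_id = c.object_id AND sc.column_id = c.column_id
WHERE s.object_id = OBJECT_ID(N'Sales.Orders')
  AND s.auto_created = 1;   -- these are the _WA_Sys_... statistics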
7:51
So this is the case where this column is in the predicate. The same thing happens when I join on a column. So here we're joining to Orders based on the backorder order ID. And here
8:03
I've joined on that column. I don't have a statistic for it. And so SQL server says, I'm going to
8:08
create a statistic here on this back order ID column. So again, automatically created for
8:15
indexes, filtered indexes, indexed views; automatically created on columns when we query those or we join on those. Now we can also create stats manually. So these can be created on
8:30
one column, on an individual column. They can be created on multiple columns. And you can also
8:36
create filtered statistics for a column. And there's a question here from Paul that says
8:42
should we ever create multi-column stats without using an index? That's a big "it depends," Paul
8:49
I would really want to talk a little bit more about the query that you're trying to troubleshoot or improve performance for before I just said yes
8:57
Because, yes, there are times when you would want to create a multi-column statistic without the index
9:02
And there are other times where I would say, I really want you to create the index here
9:06
So that's a little bit bigger conversation than what's in this slide
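For reference, here's a hedged sketch of the manual options I just mentioned; the statistic names and column choices are mine, not from the demo scripts:

CREATE STATISTICS Stat_Orders_ExpectedDeliveryDate
    ON Sales.Orders (ExpectedDeliveryDate);                    -- single column

CREATE STATISTICS Stat_Orders_CustomerID_SalespersonPersonID
    ON Sales.Orders (CustomerID, SalespersonPersonID);         -- multi-column

CREATE STATISTICS Stat_Orders_CustomerID_Backordered
    ON Sales.Orders (CustomerID)
    WHERE BackorderOrderID IS NOT NULL;                        -- filtered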
9:11
Now, how do we look at stats once they've been created, whether they're created automatically or whether we've created them manually
9:19
The easiest thing to do is use sys.stats. And that's how we can view all stats for a table
9:26
The STATS_DATE function tells us when a statistic was updated, which can be really important when we're troubleshooting
9:35
And then my favorite thing, honestly, if I ever have to have a favorite DBCC command, it's this one
9:40
It's DBCC SHOW_STATISTICS. And this gives us output in three sets
9:45
It gives us a header, it gives us a density vector, and then it gives us a histogram. And if you can understand this, and we're going to take a look at
9:53
it, then it goes a long way to understanding sometimes what the optimizer is doing. We also
9:59
have sys.dm_db_stats_properties, and we have sys.dm_db_stats_histogram. These are DMVs that were added in
10:07
more recent versions. dm_db_stats_properties was added in 2012; dm_db_stats_histogram, I think that was added in 2016
10:14
We don't have a DMV specifically for the density vector, but these are easier to use programmatically
10:19
compared to show statistics. I admit I grew up on show stats
10:23
That's what I tend to use. So I'm a very visual person
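Here's a rough sketch of those tools side by side; the object and statistic names are assumed, not taken from the demo scripts:

SELECT name, stats_id, auto_created
FROM sys.stats
WHERE object_id = OBJECT_ID(N'Sales.Orders');                               -- all stats on the table

SELECT STATS_DATE(OBJECT_ID(N'Sales.Orders'), 1) AS last_updated;           -- when a given statistic was last updated

DBCC SHOW_STATISTICS ('Sales.Orders', 'FK_Sales_Orders_CustomerID');        -- header, density vector, histogram

SELECT * FROM sys.dm_db_stats_properties(OBJECT_ID(N'Sales.Orders'), 1);    -- programmatic header info
SELECT * FROM sys.dm_db_stats_histogram(OBJECT_ID(N'Sales.Orders'), 1);     -- programmatic histogram (2016 SP1+)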
10:27
I like to look at what's going on in SQL Server. So we are going to switch over to a demo here so that we can take a look at how we find
10:36
statistics in SQL Server. So I'm using a copy of the Wide World Importers database that I've named slightly differently
10:45
And first, I want to show you two things. One is sp_helpindex, and the other is the SQLskills
10:53
version of that. So sp_helpindex is available by default in SQL Server. The SQLskills version
10:59
was written by Kimberly, and it just gives you a little bit more information. So I start with this
11:05
because I want you to see what indexes exist, understanding that we're going to have then statistics
11:12
on each of these columns. And if we look down here, we just get a little bit more information
11:19
in addition to what we see above. Notably, we get included columns
11:24
and the filter definition, which can be really helpful when you're doing index strategies
11:30
Now we can also, as I said, use sys.stats, which this gives us all of the statistics that exist
11:37
So you'll see that we have five stats that basically mirror the indexes that we have
11:44
And then we have three column level statistics that exist here in this orders table
11:51
And we're really gonna focus on this index, which is on customer ID
11:56
And if we look at the statistics for this, just specifically for this statistic
12:02
we are going to get this three-part output that I mentioned. We have the header up here
12:08
And the thing about the header, and I'm going to zoom in here so that we can see this a little bit closer
12:12
is that we see when the statistic was last updated, pretty recently
12:17
we see how many rows were in the table at the time the statistic was updated
12:24
So this doesn't update continuously, right? That row count is valid as of when stats were updated
12:32
This tells us how many rows were sampled. Basically, how many rows did it go check to figure out what the distribution looked like
12:40
And we've got steps and density here. And then in the middle, we have what's called our density vector
12:46
So this helps us understand how unique the customer ID is. And then at the bottom here, we have what we call the histogram
12:55
And each of these entries under range high key, each of these are different customer IDs
13:02
And then this EQ rows says, okay, for those customer IDs, how many rows equal that
13:08
So for customer ID one, we have 5,346 rows. For customer ID 42, we have 5,295 rows
13:20
And you'll notice there's some gaps in here. And there's only 200 steps in the histogram plus another for a null value
13:26
But if we have more than 200 customer IDs, which we do, we can't have a step in the histogram for every one
13:33
So we end up with some gaps. And to get estimates as to how many rows are in between that, we use this average range rows
13:42
So you'll notice that we have an entry for 42. We have an entry for 45
13:47
So if we queried for 43 or for 44, it's going to use this average range rows value to estimate how many rows exist for that customer ID
13:58
So I'm going to show you this in action. And just for reference, I have the sys.dm_db_stats_properties and sys.dm_db_stats_histogram DMVs there
14:07
So you can go ahead and take a look at those if you're interested. But this is our query, and we're going to enable our execution plan
14:17
and we'll go ahead and run this, and this runs pretty quickly, right? But what I want to pay
14:25
attention to is how we figure out that cardinality estimate that SQL Server has. Right, where does that
14:32
come from? And if we look here in this pop-up for this index seek here, which is where we're
14:39
going against that non-clustered index. And you'll see that it estimates the number of rows
14:47
we're going to return is 5,295. And that matches actually how many rows it returned
14:54
So that's a really good estimate, right? It was 5,295. And if we double check here back in the statistic for 42, sure enough, we see that that estimate is correct
15:07
So this is a case where what SQL Server thought was actually correct
15:12
So our statistics are accurate. They're up to date. Now, as I said, not every value can fit into the histogram
15:20
So what happens if we query for a customer ID of 52, which is not in the histogram
15:28
And so we'll look here at this plan and we'll come in and we'll see that in this case, for the value of 52, it estimates 5,404 rows
15:39
But the actual was 5,437. So it's pretty close, right? It's not too far off
15:45
I wouldn't think that that's astronomically different. And again, if we check back here and we look at our histogram, we see, okay, there's no 52 in there, but that estimate was 5,404, right
16:04
And that's because, let me just get this all straight, 52 falls between the value of 50 and 55
16:12
And so it says, okay, on average, the number of rows for each value between 50 and 55 is 5,404
16:22
So that's where it gets that estimate. This is all well and good when we're doing literal values, right
16:31
But what happens when I use a variable? So here I'm going to declare customer ID, and then I'm going to set it to a value of 42
16:40
which we know is in the histogram, and then run this query. And now when we look at our plan
16:47
and we look at our estimate, this is a little bit different than what we saw before, right
16:53
Now we have this value of 5,845.25. So I don't remember seeing any kind of decimal numbers in
17:03
our histogram. And here's what happens. The value for customer ID isn't known when it does
17:09
optimization. So it uses the density vector. Remember, I mentioned the density vector briefly
17:15
and I said that tells how unique it is. So here's the density vector in the middle of the DBCC show
17:21
statistics output. And you'll see this value is 0.001. So again, this is an estimate or this is a
17:29
calculation that it does to say on average, how many rows return for a customer, any given customer
17:34
ID. And you can do the math. So we can take that density vector and we can multiply it times the
17:42
number of rows in the table. And when we do that, we can see that there's that 5,845.25, right? That
17:49
estimate that it comes up with. So that's great. Okay. If I have an equality, but what happens if
17:57
I come into a less than scenario, right? So give me the information from this table
18:02
from the orders table, where my customer ID is less than some variable value that I've set
18:10
And when this one comes back, this returns tons of rows. You see, it changes the plan actually
18:15
And in this case, right, estimated number of rows, we're in like 1.1 million here
18:21
Right, this is way higher than the 5,000 that we were in before. And so the estimate in this case
18:21
is that it's taking the number of rows in the table times 30 percent. So it just estimates that it's going to
18:29
have 30% of the rows, 1.1 million, returned. So now you can see that it uses the histogram to get
18:47
information about how many rows it's going to return. And all we've done so far is just query
18:54
with literal values. And by a literal value, well, we did a little variables here, but with a literal
19:00
value, right, where we've said customer ID equals 52, customer ID equals 42. Now, maybe some of you
19:08
have a lot of ad hoc queries that have literal values like this, but I'm going to guess that
19:13
some of you also have stored procedures. And Alvaro says, where is that 30% coming from
19:20
That's just part of the heuristics that SQL Server uses. So you don't get to mess with that; you don't get to change it. That's the way that it is.
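As a rough sketch of the arithmetic behind both of those unknown-value estimates, with my approximate numbers rather than the demo's exact ones:

DECLARE @TableRows BIGINT = 3670000;    -- approximate row count in the Orders table (assumed)
DECLARE @Density   FLOAT  = 0.0015929;  -- "All density" for CustomerID from the density vector (assumed)
SELECT
    EqualityEstimate   = @Density * @TableRows,   -- unknown @CustomerID = value  -> roughly 5,845 rows
    InequalityEstimate = 0.30 * @TableRows;       -- unknown @CustomerID < value  -> 30% guess, roughly 1.1 million rows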
19:24
So I'm going to guess that some of you have
19:29
stored procedures. And we're going to create one here. Very simple. We're using that same query
19:35
that we were just using, but we're going to put that into a procedure. So customer ID is what
19:41
we're passing in, and we're going to run this exact same query as before for a given customer ID.
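A sketch of roughly what that procedure looks like; the names here are my guesses at the demo objects, not the actual script:

CREATE OR ALTER PROCEDURE Sales.usp_GetOrdersByCustomerID
    @CustomerID INT
AS
BEGIN
    SELECT OrderID, OrderDate
    FROM Sales.Orders
    WHERE CustomerID = @CustomerID;
END;
GO

EXEC Sales.usp_GetOrdersByCustomerID @CustomerID = 42;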
19:48
So let's go ahead and create that stored procedure. And then we're going to set statistics IO and time
19:55
on. And I'm going to run this stored procedure with our value of 42. So we know this returns
20:01
about 5,000 rows, right? 5,295. And if I look at my messages here, I can see that we have about
20:09
16,000 reads. And if I look at my plan, what's important here to note now is we're going to
20:15
talk a little bit about the shape of the plan, what the plan looks like because of what the
20:19
optimizer decided to do based on statistics. It decided to do an index seek. It said, I'm going to
20:25
use that non-clustered index that we looked at before, that FK sales customer ID. I'm going to
20:33
go seek that index and I'm going to pull the customer IDs that have a value of 42. But it also
20:40
needs order date and order date isn't in this index. And so it has to go do this key lookup over
20:46
here. And where we only have a few thousand rows, that's not so bad, right? It's not great. You can
20:53
see I have a missing index recommendation. We're going to ignore that for now. But a key lookup
20:58
for a few thousand rows isn't awful. Now let's see what happens when we do this for a value of
21:03
823. Now you'll notice here down in the bottom right, this query returned 22,000 rows. So we
21:12
just quadrupled the amount of data that we're returning. And it executed pretty quickly
21:19
But notice now that we're up to 67,000 reads. We were at 16,000. Now we're up to 67,000. We've
21:26
more than tripled here, right? And if we look at our plan, you can see things start to look a
21:35
little strange, right? The estimated number of rows is 5,295, but the actual number of rows is way higher,
21:43
22,000. So what's important to remember here is that when we execute a stored procedure
21:49
the initial value that it uses to compile that is used to determine the plan. And then that plan
21:55
gets cached, right? In the plan cache. So that's
22:01
going to be reused regardless of what value I pass in. So if I run this same thing with the value of
22:09
422, this is going to take, it still runs pretty fast. This is, you know, 120,000 rows are coming
22:18
back. But if we look now at the messages, this is 366,000 reads that it's doing instead of 16,000 or 67,000. Right, now we're up in the 300,000 range, and it still runs really fast, but this is not efficient. We still have the same plan because
22:37
that's the plan that's cached, and it's going to continue to use that until something changes
22:44
And in our case, we're going to do something horrible. And as a reminder, right? I'm
22:50
just going to throw this in here just as a friendly reminder, right? Not for production
22:57
We're going to do a DBCC FREEPROCCACHE, which basically means get rid of all of the plans that
23:04
are in the cache. So we do that and we're going to run this procedure again, which now means that it's
23:10
going to go through and it's going to create the plan based upon the estimates that it gets for the
23:16
customer ID of 422. So it returns 100,000 rows. And if we look at the plan, you'll see that this
23:24
is a different plan, right? Rather than that nested loop that had the index seek with the key lookup
23:31
now we get a clustered index scan. And I don't know if it's Jan or Jan, so I apologize if I don't
23:38
have that correct. But yes, this is a parameter sniffing issue or parameter sensitivity. You'll
23:44
hear all of that, all of those terms, parameter sniffing, parameter sensitivity, PSP, right? So
23:50
now with this value of 422, if we look at our estimates here, right, we get 119,000 rows
23:58
And it actually, that's what it returns, right? But it does a clustered index scan. And you might
24:05
think, well, hang on, aren't seeks great? Isn't that what we want? Well, if we look at the amount
24:10
of IO that it did, right? 35,000 versus 366,000. So this is a case where customer ID is a value
24:21
that I would say has a skewed distribution. Now, what do I mean by a skewed distribution?
24:28
Let me show you, let me go back to the slides and let me show you a little bit what I mean
24:32
about distribution, right? If you think about the histogram, you see in the middle here, this 422
24:39
right? And there's a bunch of other values across the bottom, and we have kind of this nice flat
24:45
bit, and then we have these little spikes in there. So that is my skew, right? If I had a nice flat line
24:53
all the way across for this, then that would tell me that my data is pretty evenly distributed.
24:59
I wouldn't have a parameter-sensitive issue. But if I look at this a slightly different way, if I
25:05
order this data by the count, 422 is all the way over there on the left side of the screen.
25:10
You can see that I have a few big values and then it really evens out. So this becomes a problem
25:17
in a parameterized scenario, like with the stored procedure, because the plan that gets into cache
25:24
for most of these values is pretty good, but every so often it might not be so great
25:30
Okay, now how do we manage statistics? Well, you can let SQL Server do it,
25:40
or you can set up scheduled jobs. And these are important; this is what I recommend. I'm going to
25:48
talk about what SQL Server does, but I'm going to step onto my soapbox for a second and tell you
25:53
that I really want you to manually manage statistics by creating jobs that run on a regular basis
26:01
If any of you are familiar with Ola Hallengren's maintenance scripts, Ola has a method to do this
26:08
I've written some blog posts that I have links to at the end that talk about how you can just manage stats
26:14
with his scripts, not even worrying about index reorgs and rebuilds, but stats alone,
26:19
which I think are absolutely critical. So I could spend an hour talking just about that
26:27
But what I want you to understand is that automatic updates are going to occur even if you are manually managing stats
26:36
So you're going to make a best attempt with managing stats manually, but automatic updates are still going to occur if you have that automatic update option enabled, which you do by default
26:49
Now, again, this can be asynchronous, which means that when SQL Server detects that it
26:55
needs to update stats, rather than doing it right away, it will wait until whatever query
27:00
you were going to run finishes, and then it will update them. So the benefit is that whatever query you were going to run doesn't wait for stats to
27:09
be updated before it executes, it just goes ahead and executes, and then stats get updated
27:13
later on. The drawback here is that maybe that plan that it's going to use isn't ideal
27:20
And I want it to get stats updated and then get a better plan
27:26
So there's pros and cons to both options. Again, in your environment, you may want to try one
27:33
And if you don't notice any problems, if things seem great, then that's probably the option
27:37
for you. But by default, asynchronous updates are off. Updating stats manually, again, is part of the regular maintenance that you should be doing, and there are a couple of
27:47
options that are really useful when we talk about updating stats. One of them is NORECOMPUTE,
27:52
which means, if the auto update kicks in, don't update stats for a particular statistic.
28:00
This isn't something that you enable across the board; this is something that you
28:04
apply as a calculated strike to a specific statistic. And then there's also an option of
28:10
PERSIST_SAMPLE_PERCENT. By default, when an automatic update occurs, it uses a default sample,
28:17
which is about 10% or less. And if I want a statistic to always be updated, let's say with
28:25
a full scan, because I have a skewed data distribution, I can choose to persist the
28:30
sample percent for that so that no matter when it updates, it always updates with a full scan
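Hedged examples of those two options; the table and statistic names are assumed, and PERSIST_SAMPLE_PERCENT is only available in newer builds of SQL Server 2016 and 2017:

UPDATE STATISTICS Sales.Orders FK_Sales_Orders_CustomerID
    WITH FULLSCAN, PERSIST_SAMPLE_PERCENT = ON;   -- keep using a full scan on future automatic updates

UPDATE STATISTICS Sales.Orders SomeStatisticName
    WITH NORECOMPUTE;                             -- exclude this one statistic (placeholder name) from automatic updates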
28:37
So these automatic updates are going to occur if I have auto update enabled when a certain number
28:44
of modifications occur. And this is based on a threshold. And that threshold depends on what
28:51
version and compat mode you're running. So the old threshold would be when 20% plus 500 rows
28:59
had changed, 20% of the data plus 500 rows. The new threshold is the square root of a thousand
29:05
times the number of rows. So the new threshold is the default behavior in 2016 and higher if you
29:13
have a compat mode of 130 or higher. In earlier versions, all the way back to 2008 R2 Service Pack
29:22
one, you can get that threshold with trace flag 2371. And what this looks like in terms of numbers
29:30
is that it makes the threshold a lot smaller. Because when we talk about a table with, let's
29:37
say, 100 million rows, 20% of that is 20 million rows. You could be waiting a long time for stats
29:44
to update. So for as much as I want you to manage stats, there are places that don't. And so for
29:51
some of those environments, those stats were getting horrifically out of date
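As a rough sketch of that difference for a 100-million-row table, just the arithmetic rather than anything from the demo:

DECLARE @rows BIGINT = 100000000;
SELECT
    OldThreshold = (@rows * 0.20) + 500,     -- 20% + 500 rows = 20,000,500 modifications
    NewThreshold = SQRT(1000.0 * @rows);     -- SQRT(1000 * rows), roughly 316,228 modifications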
29:55
And so this lower threshold is designed to get those stats updated
29:59
on a more regular basis. So I want to show what happens when we don't update those
30:05
and also what happens with our plans. So I'm going to, I've got another
30:13
I'm going to, this is my out of date stats script. So I'm going to mess up statistics on purpose
30:20
And I did a great thing here. I had this demo where I changed my connection parameter
30:27
and it keeps going back here. So let's get this sorted out here. And while this runs
30:34
this only takes a minute or two to run. It looks like, let's see, there was another question. Are
30:40
auto-created stats updated when you rebuild your indexes? It's a great question. When you rebuild
30:48
an index, if I rebuild all the indexes for a table, the statistics for those indexes get updated
30:55
but the column-level statistics do not, whether those column-level stats were automatically or manually created. By default,
31:07
if you use Ola's script, he has an option to update all statistics
31:12
which means your index stats and your column-level stats. But if you just did an ALTER INDEX REBUILD, it doesn't hit any column stats on a table
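To illustrate that answer with a hedged example, with the table name assumed:

ALTER INDEX ALL ON Sales.Orders REBUILD;                 -- refreshes the statistics behind each index (full-scan quality)
UPDATE STATISTICS Sales.Orders WITH FULLSCAN, COLUMNS;   -- the column-level statistics need their own update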
31:20
Great question. All right. So I've prepped this and we're going to go back to my Wide World Importers B database
31:30
And we're going to just double check our statistics for this orders table
31:35
So we just restored this database. Times become really important here. 1:54, 3.6 million rows in the table
31:43
This sample is 171,000 rows. So this is a default sample, 10% or less, right
31:50
Not a full scan. And I'm going to turn on statistics IO and we're going to run this query a little bit more
31:58
complicated than we were looking at before with a few different customer ID values
32:03
So we'll go ahead and run these again, that 422 is in there, which we know is skewed, right
32:09
So let's go ahead and run each of these and we'll take a look at our statistics output here
32:15
So we can see that we're getting 92,000 reads. We're getting 277,000 reads
32:22
We're getting 366,000 reads. And our durations are going up a little bit here too, right
32:29
1.7 seconds, 1.7, 3.4. So what's going on here? Let's grab this. I have a note to myself that I want to grab that info
32:39
So I have to listen to my own directions, right, that I give myself
32:43
because I wanted it for a reason. and again, let's just double check here what we have for stats, right? Okay, so we're at 3.6
32:53
million rows, 171,000 sampled. Now, we're going to, actually, what I want to do here, I want to
32:59
check one other thing. If 422, 823, 42, let's look at what we have here. 42, right, somewhere around
33:09
the 5,000 range. 422, right? This is 8,000. So this is different than what we had in the table before
33:20
Okay. Now we're going to mess with the data. We are going to add a lot of data to customer ID
33:28
422. And let me tell you, I've turned off auto update statistics. Oh, I hate it when my other
33:39
demos creep in and mess up this one. So I turned off auto update statistics for this. Oh, it changed
33:48
my context. Love it. And I'm going to change a significant amount of data in this table
33:56
We're going to set all of these to a customer ID of 422. So now we've got over half a million rows in this table
34:04
And I'm going to run this, this time with the live query stats. We kick this off, and you can see that it's pushing this through three nested loops. And it's going to take several seconds for these half a million rows,
34:20
1.3 million rows in the end to come through. I love looking at this
34:25
I'll run this again. So we can see, I'm going to zoom in here, right
34:30
It's estimating just a few thousand rows are going to come out here
34:35
And it ends up being hundreds of thousands. So my estimates are way off
34:41
This plan is really inefficient for the value of 422. So let's turn on auto update statistics
34:56
And make sure that we're in the context of that database. And I'm going to turn off my live query stats because at this point
35:03
I'm going to make these things run a little bit faster. Now we're going to run this for customer ID 422
35:10
This still takes a little bit of time because we're bringing back over a million rows
35:14
right out of a 3 million row table. It's not necessarily that much faster
35:18
but this is a change in the plan. Before we had three nested loops going on
35:23
and now we've got just this one hash match. Now we have a clustered index scan here
35:30
And our estimate in terms of how much data is gonna be returned is way above that 2600 number
35:37
that we saw before. So let's look at statistics. Haley, I wanna know why you're holding your breath
35:46
Here if we look at statistics and I look at a value of 422, right
35:51
This has been updated. If I look up here, right? We were at 1:50-something earlier, now we're at 1:59
35:58
So when I turned on auto update stats, because I had modified so much data
36:03
when I ran that query, it said, oh, okay, I see that stats have changed
36:10
and that I need to go, or I see that data has changed and I need to go and update stats
36:14
And after it updated stats, then it generates a new plan based on that value of 422
36:21
So I want to understand how stats affect performance now here at my procedure level
36:32
So we're going to create our stored procedure, which is just like the query we've been using
36:36
Right now, we're just putting it into a procedure and we're going to run this 10 times
36:43
And we'll check out the plan. This plan goes back to the nested loop plan, right
36:48
So this again is a parameter sensitive query with 422. We ended up getting a hash match with something lower than that
36:55
In this case, where it's estimating about 5,800 rows, we're getting those nested loops
37:01
Okay. So two different plans occurring here. We've run that 10 times
37:07
And if we look at the plan cache, we can see here that this has executed 10 times
37:14
We've got one plan and the query plan is what we were just looking at here with those nested loops across
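A sketch of the kind of plan cache check being described here; this is my query, using the procedure name from my earlier sketch rather than the actual demo script:

SELECT qs.execution_count, qs.plan_generation_num, st.text, qp.query_plan
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) AS qp
WHERE st.text LIKE N'%usp_GetOrdersByCustomerID%';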
37:23
Excellent. Now, and I have auto update enabled, right? Now let's just take a look here at modifications, right
37:32
So this is a pretty slick view. I like this a lot because this tells me how much churn there is, how much data changes in my table
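That churn view is the kind of thing you can get from sys.dm_db_stats_properties; a hedged sketch, with the table name assumed:

SELECT s.name, sp.last_updated, sp.rows, sp.rows_sampled, sp.modification_counter
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE s.object_id = OBJECT_ID(N'Sales.Orders');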
37:40
And right here, I don't have any modifications because I just updated stats, right
37:45
And we haven't done anything. So let's go ahead and update a few 60-some thousand rows so that we cross that threshold
37:54
of when an automatic update is going to kick in. And let's verify that we changed that data
38:01
We did, right? We changed about 70,000 rows. So that exceeds the 60,000 threshold with that new stats update threshold
38:12
And then we're going to run this again. And if we look at the plan, this plan looks exactly the same, right? It has the same shape that we had before. We've got these nested loops going on in here. So how do I know here if anything actually changed? Well, first of
38:31
all, let's see if stats updated. They did. Remember, they last updated at 1:59, and now
38:37
it's 2:02, and the rows sampled has changed. We did have a full scan before, and now we're
38:43
down to just a default 171,000. How do we know that this is recompiled? This is a nice little
38:51
trick here. When we look in the plan cache for this particular plan, you can see this plan
38:55
generation num bumped up to two and our execution count reset. So even though it's the same plan
39:02
it recompiled it. So it drops the execution count. It gets rid of the info and the plan
39:07
generation num increases by a value of one. Last scenario, we're going to recompile this procedure
39:16
on purpose, and we're going to run it with a value of 414 for my customer ID. And let's see what plan
39:24
we get here. We get that nested loop plan. And if we check this out, right, we can see that it's
39:29
estimating 4,939 rows. So when we're down in that flat part, down on that 5,000 row range, we're
39:37
getting this nested loop scenario. We're going to go ahead and modify some more data again
39:43
because data changes in your system, right? Throughout the day, things change. Let's confirm
39:49
that we've exceeded our thresholds for change. We have for a couple stats
39:59
The new threshold is 85,000. We've exceeded it in some cases. So let's see what happens
40:05
when we run with a different parameter. Ah, we switched plans, right
40:14
So enough data had changed that when I went and ran this again
40:18
it went and updated stats and we got a different plan in here
40:23
And this happens, okay, great. We can run it a whole bunch of times
40:30
and every single time, it's running this with these values that end up returning like 5,000 rows, but it ends up doing this clustered index scan and this hash match, which is inefficient for this
40:43
Again, a parameter-sensitive type of scenario. And if we look here in the cache, right, we can see that this has recompiled
40:52
And what happens here is when a plan changes for whatever reason, because a bunch of data came into the system and stats automatically updated, one of the things that folks tend to do is they tend to look at stats, which is great
41:10
But there's two things. There's three things, really, that you've got to pay attention to
41:14
You need to look to see when they were last updated. You need to understand what the sample rate was
41:20
And you also need to understand if there have been modifications. So at this point, we have a plan in cache that's a clustered index scan, which we don't love
41:31
And if we don't understand that there was a recompile, we might look at this, we might say, oh, you know what
41:38
I think we need to update stats here. I think that's what we need to do
41:43
And that's often a sledgehammer approach. And people sometimes think that it's updating stats
41:50
which has fixed the issue. And in some cases it is, but in some cases what happens is that updating stats causes a recompile
41:57
And the recompile means that the next time I go and execute this with this value of 414
42:02
I'm going to get a new plan. I'm going to get a better plan and things seem to be good
42:08
But in this case, I didn't get a better plan. And this is when people are really scratching their heads and they're going, what happened here?
42:17
What, what's wrong? Here's the thing: if you update statistics and no data has changed,
42:28
right, we had row modifications of zero, then stats are going to get updated. And they've been updated here, right? We can see that we updated with a full scan, and sure enough, that's what it did. But even though it updated stats,
42:42
because no data had changed, it didn't recompile the plan. And this is an interesting thing
42:50
to understand about stats. And something that I want you to think about
42:53
when you're troubleshooting and looking at plans with poor performance and thinking that maybe I have a stats problem
43:00
Maybe you do. but I don't want you to assume that just because you update stats
43:06
you're always going to get a different plan. You want to look to see when stats were last updated
43:11
You want to understand what the sample rate was. And you also want to understand if you've had row modifications occur
43:19
So this is something that catches a lot of people out. And I have a note in here about it
43:25
So in SQL Server 2012 and higher, which I'm hoping a lot of you are on, right,
43:29
when stats are updated, if no data has changed, the plans aren't going to be invalidated,
43:36
so you won't get recompiles. Prior to that, the behavior was a little bit different
43:42
If you updated stats and you had auto-update enabled, the plans were invalidated
43:48
But if you had auto-update off for some reason, then the plans weren't invalidated
43:54
So understand that the estimates that you see in the plans are not always correct
43:59
I created a contrived scenario where the first case is true. Stats hadn't been updated recently
44:07
And updating stats did address our issue. Sometimes your estimates are off because of that skew
44:13
the uneven distribution of data, or because the sample rate is too low
44:18
and it doesn't have a good capture of what the distribution looks like. And then you have a ton of other scenarios
44:24
that can cause estimates to not be correct. but stats and looking at what SQL server thinks is in the table and what
44:32
actually is, is typically where you want to start. So you need to understand how to look at that histogram
44:38
like we showed in the beginning. So we looked at how to do that. We looked at what happens when stats don't go so well
44:45
when they're not being updated. And we talked about how to update stats. I've got a lot of references here that I'm going to kind of slowly go through
44:52
while I'm looking at your questions. So the first question is,
44:57
would the adaptive join feature mitigate what I am showing? So yes, right
45:03
That's designed to help a parameter sensitive problem, assuming that I'm going between a nested loop
45:09
and a hash join, which I was in that example. So in the adaptive join scenario
45:14
which was introduced in 2017, right? You've got to have the appropriate compat mode there
45:19
for that feature to be in play. 2017 was columnstore; 2019 supported it with rowstore. Yes, it would flip between that based upon the number of rows coming
45:29
in. Victor said, did I enable auto stats after the updates? I enabled it somewhere in the middle
45:37
of that demo, but initially I had it turned off so that you could see that if I'm not updating
45:43
stats and I'm changing lots of data, the optimizer does not have great information
45:49
And then Rohit says, does table partitioning switching affect statistics? That's a really great question
45:57
So the challenge with statistics and partitions is that by default, unless you've got incremental on
46:04
you have one statistic, one histogram for the entire partitioned table. When you enable incremental stats, you do get stats at the partition level
46:14
the optimizer doesn't actually use them, but it does incorporate that into the overall histogram
46:21
so that you don't have to go and update stats across the whole table. So to answer your question
46:27
when I partition switch, when I switch in, I want to update stats for that partition
46:33
and then those will get folded into the statistics for the whole table
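For reference, a hedged sketch of the incremental pieces just described; the database, table, statistic name, and partition number are all assumed, and the table has to actually be partitioned for this to apply:

ALTER DATABASE WideWorldImporters
    SET AUTO_CREATE_STATISTICS ON (INCREMENTAL = ON);          -- create new stats as incremental where possible

UPDATE STATISTICS Sales.OrdersPartitioned (SomeStatisticName)
    WITH RESAMPLE ON PARTITIONS (3);                           -- refresh just the switched-in partition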