Unlocking Performance Improvements in .NET || Code Quality & Performance Virtual Conference
Nov 9, 2023
Performance is at the heart of .NET, with an incredible amount of energy invested in every release towards making the stack faster and more scalable. In this talk, Stephen Toub will walk through example changes that have improved performance in the .NET stack over time, highlighting how such changes benefited apps and services running on .NET, and how those changes can serve as a blueprint for further improvements in your own code bases.
Conference Website: https://globaltechconferences.com/event/code-quality-performance-virtual-conference-2021/
C# Corner - Community of Software and Data Developers: https://www.c-sharpcorner.com
C# Live - Dev Streaming Destination: https://csharp.live
#Codequality #Performance #.net #VirtualConference
0:00
Thank you for having me
0:03
My name is Stephen Toub. I'm a partner software engineer on the .NET team at Microsoft
0:08
I spend most of my time focused on the libraries that make up .NET
0:14
but also up and down the stack with other people, really thinking hard about the performance and how we
0:19
evolve it moving forward in the future. .NET has been going through something of a performance renaissance
0:28
You know, it's been around for two decades now, and for a large portion of its existence, its youth, performance was always a consideration
0:39
It was always important, especially at the lowest levels of the stack. But other things often trumped performance
0:44
For example, usability would often be really paramount. Obviously, security is always and continues to be first and foremost, but usability would end up taking precedence over performance to the point where we might not expose something because we were concerned about someone maybe not using it correctly
1:03
Or we might make something not quite as fast as it could be in order to make it a little bit easier to use
1:09
And with .NET Core over the last, say, five, six, seven years, we've really revisited that and made sure that performance is right up there with security and safety as being the most important things
1:24
Now, usability absolutely is critical. But when it comes down to deciding, do we make this slightly less usable in exchange for something faster?
1:33
That's a conversation that we have now, whereas previously it generally always went in favor of usability
1:39
Performance does end up being key at all levels now of everything we do in the stack
1:47
.NET is open source, we get dozens of PRs every day and every single one of them
1:53
whether it came from folks that work on the .NET team, whether it came from folks elsewhere in Microsoft
1:57
whether it came from folks in the broader .NET community, every single one of them is viewed through a performance lens
2:02
Even if it's a bug fix, it's very likely that one of us will comment on it saying
2:06
hey, I'm wondering about this aspect of perf. Can you share some benchmark numbers?
2:10
Because we view everything through the lens of performance. On top of that, we're also spending a huge amount of
2:17
our time building features that are very much end-to-end engineered for performance
2:23
not just thinking about performance as part of some other feature, but actually building features explicitly focused on performance
2:30
such that a developer can take advantage of them and make their app scream. One of my favorite things about all of this
2:37
and we'll talk about this a little bit more when it comes to open source, but the .NET community really gets to vote with their feet
2:43
Meaning, it used to be that if you, working on your application
2:48
hit some sort of performance bottleneck in some .NET class, and you wanted to make it faster
2:54
you had to write up your issue, somehow get it to the relevant folks at Microsoft
3:00
have it find its way to a program manager thinking about the next version of .NET
3:04
have that be prioritized and triaged, and eventually find its way to a developer
3:08
who might be able to work on it for the next release in two years or whatever it may be
3:13
But with .NET Core, you can clone the repo, find the area you want to improve, make the change, submit a PR
3:21
and it could very well be in a nightly build the next day. Which also means that all of us are spending a good portion
3:26
of our time actually collaborating and reviewing a lot of these PRs that are coming through
3:31
So a good portion of my time is actually spent thinking about Perf, not in terms of my own changes
3:36
but in terms of other people's changes. So it's a really exciting time. I've described a lot of this over the years
3:44
in some blog posts. If you haven't read them, there are links here
3:48
I'm not sure if I can distribute this deck after the fact. I'm happy to. So you can click them, but you can also
3:52
just search for these titles in my name online and they'll pop up in Google or Bing or your favorite search engine
3:58
Around .NET Core 2.0, I was realizing, looking back over all of the perf improvements
4:02
that had gone into the various .NET Core releases, and realizing we hadn't really talked about it much
4:07
So I took a stab at summarizing some of the key changes that had gotten into the release
4:12
and it got a lot of good feedback. So for .NET Core 2.1
4:16
which, while being a quote-unquote dot release or a minor release, was actually a pretty substantial major release
4:22
I spent a lot of time going through and documenting a lot of the performance improvements that had gone in
4:26
And along the way, I try and also sort of educate the reasons that the changes were made
4:30
and why they helped and what the trade-offs were and so on
4:35
then again for .NET Core 3.0, again for .NET 5. And sometime in the next few weeks
4:39
I'll probably sit down and start working on a .NET 6 post since we're currently working on one of the last previews
4:46
for .NET 6, which is due out in the November timeframe. So after this talk, if you haven't read them
4:53
I humbly encourage you to go read them. I think they're pretty extensive
4:58
They're fairly long and hopefully educational in one way, shape, or form
5:02
All of these changes, the ones in the blog post, some of the ones we'll talk about today
5:08
they really exist for two reasons. One of which is we really want you to be able to take
5:14
your .NET application, let's say it's on .NET Core 2.1, and change your .csproj to update to .NET 5
5:21
and have everything just get faster. That's a key goal. You don't have to rewrite a line of code
5:26
We want you to be able to either recompile or just redeploy and have everything get better
5:33
On top of that, we also want you to be able to do a little bit of work and take advantage of new functionality
5:38
that is again, explicitly focused on performance. You can upgrade, get benefits
5:44
write some code, get benefits. Those same APIs, we also then try to use wherever we
5:50
can within the rest of the stack, such that those existing APIs benefit from the new ones
5:57
And so a lot of the changes we're going to be talking about today fit mostly into the former category, as do most of the things I've talked about in the blog posts
6:09
But this kind of is the scope of the changes that we work on when it comes to performance
6:13
Now, you know, various changes we'll look at are about this or that change we made to
6:20
the code. But I would argue that the most impactful change, bar none, that we made for .NET Core
6:28
versus .NET Framework was distribution vehicle, which might seem a little strange
6:33
You don't normally think of how you deploy bits as impacting the runtime throughput or
6:38
memory consumption of those bits, but it's been absolutely critical for one key reason
6:43
.NET Framework is distributed as part of Windows. That means that however many billions of machines out there are running Windows
6:52
those billions of machines are running .NET Framework, and there is one per box
6:56
So whatever version of .NET Framework is on the box, every application on that machine is using that framework
7:04
you know, separate from VMs and the like. If we make a change to .NET Framework as part of a patch
7:11
as part of a new version, and we push that out, every machine that gets it automatically
7:16
picks up that new behavior. Every app is now running against it
7:21
And at that scale, no matter how safe you think the change you made is, you are going to break someone, guaranteed. To the extent that we have made changes in the past that were 100% functionally correct, all they did was make something
7:37
significantly faster. And that still, quote-unquote, broke people because, for example, in this one
7:43
case I'm thinking of, they had a latent race condition in their code, had never manifested
7:48
because the area of code that would have caused the race condition simply wasn't fast enough
7:51
We made a change, we improved performance, we pushed it out, and all of a sudden we're getting calls about how Microsoft broke this customer's application simply because we had made something faster and the race conditions started triggering in their application
8:07
But there are also cases where we simply have made changes and there have been subtle corner-case niche bugs and they have, again, broken people
8:17
In fact, there's a fairly famous case from around 2008, where in the late March, I think early April timeframe, we pushed out an update to .NET Framework and broke printing in TurboTax
8:30
And if you're in the United States, you can probably guess that the one application you don't want to break during tax season is TurboTax
8:38
You'd think we'd learn from our lesson, but no, in 2012, we did it again
8:43
And to this day, we basically try and avoid any patches other than critical security patches to .NET Framework during U.S. tax season
8:51
So we're extremely careful now about what kinds of changes we make to .NET Framework
8:57
And we've basically announced that while we are going to continue supporting .NET Framework and shipping security updates and patches for a very, very, very, very, very long time, it's also not where we're investing
9:08
We're investing in .NET Core. It's where all the new functionality is going
9:12
It's where all the new perf improvements are going. That's where all of our primary energy and thought is going
9:18
The reason I say distribution vehicle matters here is because largely .NET Core is side by side
9:24
On the same machine, you can have .NET Core 2.1, .NET Core 3.1, .NET 5, .NET 6 Preview
9:32
whatever we just released, Preview 5. You can have all those on your machine at the same time, and different applications
9:37
can be targeting different versions. If there's a breaking change that went into a version
9:46
you don't have to upgrade to it until you have adapted your code to handle it
9:49
On top of all that, you can actually ship your own private copy of .NET Core with your application
9:55
which means you're totally walled off from anything that might have been done on the machine
10:00
and you get to choose exactly which bits you use. In fact, there are even benefits to doing that
10:04
If you do want to deploy in that manner, you can then trim or tree-shake your application such that anything from that framework that's not
10:13
being used just evaporates. You get it down pretty small. For example, with the new WebAssembly
10:19
support we have for Blazor Wasm, with .NET 6, I think the number is the default template app for
10:26
Blazor Wasm, is something like 1.5 or 2 megabytes for the entire application and framework
10:33
everything that kind of heads down to the browser. So lots of different mechanisms. And all of this
10:39
means that our risk tolerance is way higher. So breaking changes, still a very big deal. We still
10:46
think about them very thoughtfully, but we do make them. We also have a much higher tolerance for
10:53
churning the code base because even if something creeps in, there are mitigations that end users
10:58
have. They can choose to wait until we fix an issue. They can choose when they roll forward
11:02
when they move to a new version. So all of this means we do large, massive rewrites
11:08
We do lots and lots and lots of small tweaks, all with the safety of knowing that our distribution vehicle
11:15
enables this in a much better fashion. So distribution vehicle has been key for performance
11:21
The other thing that I alluded to earlier that's been absolutely key for performance is open source
11:26
There are only so many of us at Microsoft that actually work on .NET itself
11:31
But we want the broad community at large to benefit from the broad community at large
11:37
So if you look at, for example, the blog post that I wrote, this was around July of last
11:42
year on performance improvements that had gone into .NET 5 since .NET Core 3.1
11:49
The post at that point covered around 250 PRs focused on performance
11:54
That was a minor subset of all the PRs that had gone into the release
11:59
It wasn't even all the ones that were focused on perf. Between July and when we actually shipped in November
12:04
I think there were another 30 to 50 perf-focused PRs that actually made it in
12:08
But the key thing here is more than a fifth of them were actually from folks outside of Microsoft
12:14
Then there was another percentage from folks outside of the .NET team and obviously a large portion
12:17
from the .NET team itself. If you look at what I'll be writing about for a .NET 6 post
12:24
it already has way more than that, which is really exciting. Now, sometimes these performance improvements are things that impact everyone
12:32
For example, there was a change, I think it was in .NET Core 3.0
12:36
that went in from one of our prolific external contributors, that helps with how much time we spend doing some zeroing of stack frames
12:45
in support of making sure the GC is able to do its job correctly. This is something that impacts every single method that you call
12:52
He made it faster by employing vectorization to do the zeroing faster
12:57
Sometimes they're still significant but way more niche. So for example, an external contributor found a lot of time being spent
13:05
doing color conversions using the drawing Color class and was able to submit a perf improvement
13:09
for some of those conversion methods. And then there are other things where sometimes they're entirely experimental
13:14
For example, the Bing team wanted to play around with using operating system large pages as part of their application
13:22
And when they added experimental support to the GC, we reviewed it, we got it merged
13:26
and now it's an option for everyone to use. Still experimental, but anyone can throw the flag and try it out
13:31
We love all of this. If you haven't contributed to .NET runtime
13:35
or ASP.NET Core, or Windows Forms, and you're interested, I'd highly encourage you to
13:40
We love the contributions, they benefit everyone. Those are two big areas of changes
13:48
But obviously we take many changes, and so the question is how do we decide what we're going to take
13:55
Well, one way we do that is just focusing on data, microbenchmarks
14:00
So we've standardized basically on BenchmarkDotNet, which we use for measuring microbenchmarks on pretty much everything
14:07
And anytime you submit one of these PRs and we have a question about performance, we'll ask you for some benchmarks that demonstrate what it is you're trying to improve or maybe have you tested this to see if maybe that's regressing
14:18
We also have a full regression suite of BenchmarkDotNet-based benchmarks that we run on a continuous cycle
14:25
If you haven't used it, it's super easy: dotnet add package BenchmarkDotNet goes out to NuGet, downloads the bits
14:32
updates your project file, and then you can write an application like this
14:37
This is my entire program. This line of code here is basically telling BenchmarkDotNet to
14:42
find all of the benchmarks that exist in my application, of which I only have one right now
14:49
That one benchmark is just taking a field storing an Int32 and calling ToString on it
14:53
So this is the benchmark to see the performance we have for formatting integers into strings. I can run this using however complex a command line I want, but right now I'm saying I want to run this in a release build
15:07
I want to target the lowest surface area that I care about
15:12
In this case, that's .NET Framework 4.8. I want to run all the benchmarks and I want to run it on
15:16
.NET Framework 4.8, .NET Core 2.1, .NET Core 3.1, and .NET 5
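The program he describes might look something like this (the slide code isn't reproduced in the transcript, so the class and member names here are illustrative):

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class Program
{
    // Finds and runs all [Benchmark] methods in this assembly
    public static void Main(string[] args) =>
        BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);
}

public class IntFormatting
{
    private int _value = 12345;

    [Benchmark]
    public string IntToString() => _value.ToString();
}
```

A command line along the lines of `dotnet run -c Release -f net48 --runtimes net48 netcoreapp2.1 netcoreapp3.1 net5.0` would then run the benchmark against all four target frameworks.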
15:20
BenchmarkDotNet is going to proceed to run this benchmark an appropriate number of times on all these platforms
15:26
doing all the statistics it needs to do to separate the noise from the signal
15:31
and end up spitting out lots of interesting information, including results like this
15:36
These are the results I got on my machine. And you can see in this particular benchmark
15:39
On .NET Framework 4.8, it was taking us about 48 nanoseconds to format this 12345 value into a string and allocating about 40 bytes
15:49
A few years back when we released .NET Core 2.1, that cost was cut in half from 48 nanoseconds to 22 nanoseconds
15:57
still allocating the same amount. When we moved to .NET Core 3.1, we got a tad faster
16:01
but we also shaved 20 percent of the cost of the memory off the operation
16:07
Then for .NET 5, the performance doubled again, the time halving to 11 nanoseconds
16:13
We can really see the progression, lots and lots of changes in each release
16:17
something as simple as int32 formatting, improving every single release of .NET Core
16:24
and that continues. Other things we measure, we don't just look at micro benchmarks
16:29
we also look at say industry benchmarks. Tech Empower is a key one that we focus on
16:35
This is a set of web service focused benchmarks. We pay attention to different ones every release
16:44
For a long time, we were focused on the plain text benchmark, which is just how fast can you send
16:52
and receive the smallest possible response you can think of to the server
16:56
And you can see that ASP.NET, over I don't know how many
17:00
of the previous rounds, has been in the top positions, tied for basically number one with any number
17:05
of other frameworks. But for example, in the last release, we picked two specific benchmarks
17:10
For .NET 5, we really wanted to improve our standing on the JSON serialization benchmark
17:15
which, while it does involve JSON serialization, is really another request response
17:20
And then also the fortunes benchmark, which is about a request coming into a web server
17:24
that then contacts a backend database as part of the requested response
17:29
And you can see for the fortune benchmarks, for example, we had two entries with two different database providers
17:36
sitting behind it, or sorry, two different transport layers sitting behind it
17:40
But they were both approximately the same, about 280,000 requests per second
17:44
This is on Linux. And you can see this in the rankings, this put us at about 24 or 26
17:50
depending on which transport we were using. When we moved to .NET 5
17:55
you can see that 280,000 requests per second in the official numbers went up to 400,000 requests per second
18:01
and jumped ASP.NET to be in the top 10. So these are the kinds of things we're paying attention to
18:05
And there are many, many changes where a PR will come up and either we'll ask the contributor
18:10
if they had access to our backend test systems to try it out with Tech Empower, or we'll take it ourselves
18:15
and try it out with the various Tech Empower benchmarks to see if and how a particular change may positively
18:22
or negatively impact the various benchmarks that we're looking at. Obviously, we care about performance on benchmarks
18:30
But at the end of the day, there's some reason we're trying to improve
18:34
the performance of something. Sometimes, we want to improve the user experience
18:38
We want requests to make it to the user faster so that they show up faster
18:42
We want to be able to allow more users at the same time so that people aren't timing out or whatever
18:47
We want the user interface to be snappy. But it can also just be, at the end of the day, saving dollars
18:54
If I can do more with less, if I can get things done faster and use my machine for less time, I can use that machine for something else
19:01
And this can translate into real dollar savings. So one of the things I focus on is various internal and external systems where I'm able to sort of see, waving my hands, how many cores are being used for some duration of time in the execution of particular functions
19:19
When that number reaches something like, say, 1,000 or 10,000 or 20,000, I start to really pay attention to, wow, this is really an area that we should go and invest
19:28
Because at the end of the day, that's going to save everyone lots of money. And to David's previous comments about the environment and CO2, it's going to keep a lot of emissions out of the atmosphere
19:39
So, for example, in the last release, in .NET 5, regex was something that was popping, that was really showing up as being used by lots of folks, Microsoft and non-Microsoft, in Azure, out of Azure, as a real consumer of cycles, of CPU cycles
19:56
And so it's an area we invested. So I grabbed this regex off the net
20:00
This is a regex for parsing emails. And a simple example here
20:06
just looking for someone@example.org. Again, this is with the BenchmarkDotNet syntax
20:10
So just paste that into the previous example I had and see if this particular address matches with this regex
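In BenchmarkDotNet terms, that benchmark is roughly the following (the actual regex he grabbed off the net isn't shown in the transcript, so the pattern here is a simplified stand-in):

```csharp
using System.Text.RegularExpressions;
using BenchmarkDotNet.Attributes;

public class EmailRegex
{
    // Simplified stand-in for an email-parsing regex grabbed off the net
    private static readonly Regex s_email =
        new Regex(@"^[^@\s]+@[^@\s]+\.[^@\s]+$", RegexOptions.Compiled);

    [Benchmark]
    public bool IsMatch() => s_email.IsMatch("someone@example.org");
}
```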
20:18
And we can run it again on .NET Framework 4.8 and then .NET Core 3.1 and .NET 5
20:22
and see the kinds of improvements that went into the release. So for example, on .NET Framework 4.8
20:28
We were taking about 470 nanoseconds. Then on .NET Core 3.1, between the few years in between there, we managed to shave off about 20% of the costs
20:41
But then for .NET 5, we were able to cut that by another third or so, getting it down to 148 nanoseconds for this particular operation
20:49
That is a huge savings when it comes to CPU cycles and the money we're spending on these resources
20:58
If you've got an app and you're paying per core in Azure and you can do it with
21:02
one fourth the cores, that's going to make you pretty happy and it's going to make your budget pretty happy
21:08
Another example in the 1000 cores club from recent releases is compression
21:14
Compression shows up in all manner of workloads. Here, a little benchmark
21:19
it's grabbing I think I grabbed the complete works of Mark Twain from
21:23
Gutenberg as a text file. I'm compressing it as a setup step
21:28
I'm compressing it into a stream. Then the benchmark is just repeatedly
21:32
decompressing it and seeing how fast we can decompress that. Ran this on .NET Framework 4.8 and ran it on .NET 5
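A sketch of that benchmark, assuming GZip via System.IO.Compression (the file name is illustrative):

```csharp
using System.IO;
using System.IO.Compression;
using BenchmarkDotNet.Attributes;

public class Decompression
{
    private byte[] _compressed;

    [GlobalSetup]
    public void Setup()
    {
        // e.g. the complete works of Mark Twain saved from Project Gutenberg
        byte[] raw = File.ReadAllBytes("twain.txt");
        var ms = new MemoryStream();
        using (var gz = new GZipStream(ms, CompressionLevel.Optimal, leaveOpen: true))
        {
            gz.Write(raw, 0, raw.Length);
        }
        _compressed = ms.ToArray();
    }

    [Benchmark]
    public void Decompress()
    {
        using var gz = new GZipStream(new MemoryStream(_compressed), CompressionMode.Decompress);
        var buffer = new byte[8192];
        while (gz.Read(buffer, 0, buffer.Length) != 0) { } // drain the stream
    }
}
```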
21:40
You can see we get about 20 percent faster. We also dramatically reduced the amount of allocation
21:47
required going from 350K down to basically nothing. But what's really interesting about this is
21:54
I mentioned we're very careful about what the changes we make to .NET Framework in general
22:01
This one was so impactful. The improvements that we made to compression were
22:08
about three or four x over previous releases. This was so impactful and showed up in so many workloads
22:13
that this .NET Framework 4.8 number actually includes having backported this particular set of
22:18
performance improvements back to .NET Framework just because it was of that order of magnitude
22:23
This is one of, I don't know, I can probably count on one, maybe two hands the number of performance-focused changes that have gone back from Core to Framework, but this was one of them. So for .NET 5, this is 20% on top of that 3x gain
22:36
that was already reported back. So these are, you know, we think of compression and regex
22:41
as being sort of meaty examples. There are also things we're spending
22:46
just a whole lot of time simply because the sheer number of times you invoke them
22:51
So something like Dictionary, right, shows up everywhere. And you expect something like TryGetValue to be so fast that who cares about
22:59
how much it costs, it's going to be so fast, it doesn't really matter. But you do that millions
23:03
and millions and billions and billions of times, any performance improvements there actually add up
23:09
And so it's an area where every release, we and again, I mean, the broader we because a lot of
23:13
contributions here have actually come from external sources, external to Microsoft, we have
23:18
every release driven down the cost of operations on dictionary. So for example, .NET Framework 4.8
23:27
you can see this particular example is basically doing 1,000 lookups, 1,000 OrdinalIgnoreCase lookups on strings in the dictionary
23:36
summing them and returning that value. So that was taking about 60 microseconds on .NET Framework 4.8
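That benchmark likely has a shape along these lines (keys and values are made up for illustration):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using BenchmarkDotNet.Attributes;

public class DictionaryLookups
{
    private readonly string[] _keys =
        Enumerable.Range(0, 1000).Select(i => "Key" + i).ToArray();
    private readonly Dictionary<string, int> _dict;

    public DictionaryLookups()
    {
        // Case-insensitive string keys, as in the OrdinalIgnoreCase example
        _dict = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
        for (int i = 0; i < _keys.Length; i++) _dict[_keys[i]] = i;
    }

    [Benchmark]
    public int Sum1000Lookups()
    {
        int sum = 0;
        foreach (string key in _keys)
        {
            if (_dict.TryGetValue(key, out int value)) sum += value;
        }
        return sum;
    }
}
```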
23:42
We halved that for .NET Core 3.1, getting it down to about 30 microseconds
23:46
and then shaved off another 30% on top of that for .NET 5
23:50
getting it down to about 20 microseconds. So we make changes to things that are big
23:55
and we make changes to things that are small but used a lot. Just one more example, since we talk about the web
24:03
and UTF-8 is very prevalent on the web, something like UTF-8 encoding ends up being used a lot
24:10
and is part of this, what I term the thousand cores club. And so this benchmark here
24:15
But all I'm doing is turning a string into an array of bytes, then going back to
24:18
a string. And we can see on .NET Framework 4.8, this was taking about 190 nanoseconds on my machine
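The UTF-8 round trip he describes is essentially (the input string here is illustrative):

```csharp
using System.Text;
using BenchmarkDotNet.Attributes;

public class Utf8RoundTrip
{
    private readonly string _text = "Performance is at the heart of .NET.";

    [Benchmark]
    public string RoundTrip()
    {
        byte[] bytes = Encoding.UTF8.GetBytes(_text); // string -> UTF-8 bytes
        return Encoding.UTF8.GetString(bytes);        // and back to a string
    }
}
```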
24:25
and we more than doubled the throughput down to about 80 nanoseconds for .NET 5
24:29
And all this stuff continues to just get faster for .NET 6
24:34
So there are many ways that we have achieved these improvements. And again, we, I mean the broader we, I'm not just talking about folks at Microsoft
24:43
all the contributions from folks at Microsoft and in the broader .NET community
24:48
Sometimes it's via complete rewrites. And these are the kinds of things that we would always shy away from in .NET Framework
24:55
for fear of any kind of compatibility issue. But for example, ConcurrentQueue
24:59
ConcurrentQueue first shipped in .NET Framework 4.0. And it was a good implementation, but it wasn't a great implementation
25:06
And so for .NET Core 2.1, we completely rewrote it, focusing on the kinds of costs
25:12
the kinds of use cases, the kinds of scenarios we saw concurrent queue being used in
25:17
with a particular focus, for large scale, on reducing synchronization costs and, most importantly, reducing memory usage
25:26
We got it to the point where it was so fast, we were able to throw away the ThreadPool's queue
25:32
and replace it literally just with ConcurrentQueue. Now in System.Threading.ThreadPool, its global queue is a ConcurrentQueue<T>
25:41
And we can see why. If I have a little benchmark here
25:46
I've got a concurrent queue of ints. I spin up two tasks. They both wait for each other
25:51
so they start at the same time. One of them then enqueues a million integers
25:55
and the other one dequeues, spinning until it can dequeue all one million of those integers
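A sketch of that producer/consumer benchmark, using a Barrier so both tasks start at the same time (the exact synchronization on his slide may differ):

```csharp
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;
using BenchmarkDotNet.Attributes;

public class ProducerConsumer
{
    private const int Items = 1_000_000;

    [Benchmark]
    public void EnqueueDequeue()
    {
        var queue = new ConcurrentQueue<int>();
        using var barrier = new Barrier(2); // both tasks wait for each other

        Task producer = Task.Run(() =>
        {
            barrier.SignalAndWait();
            for (int i = 0; i < Items; i++) queue.Enqueue(i);
        });

        Task consumer = Task.Run(() =>
        {
            barrier.SignalAndWait();
            for (int i = 0; i < Items; i++)
            {
                while (!queue.TryDequeue(out _)) { } // spin until an item arrives
            }
        });

        Task.WaitAll(producer, consumer);
    }
}
```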
26:02
We can see on .NET Framework 4.8, this particular benchmark on my machine was taking about 45 milliseconds
26:07
and was allocating about eight and a half megabytes of memory. And on .NET 5, we dropped that by 4x down to 11 milliseconds
26:17
and basically got it to allocate nothing. So from eight and a half megabytes to a few hundred bytes
26:24
for this particular benchmark, which is pretty great. Sometimes it's not about doing a complete rewrite
26:31
Sometimes it's being thoughtful about the algorithmic complexity that's being used in these various operators
26:37
in various operations. So for example, in LINQ, we can see in this benchmark
26:43
I've got some data in the array and I'm sorting it. And then I'm going to skip the first 10
26:50
and I'm going to grab the first thing after the first 10
26:55
And previously, this operation would have done a full sort, an order n log n sort, and then skip through the first 10
27:04
and then return that element. But we can actually pass data from the OrderBy to the Skip to the First
27:11
which is actually where the execution happens. And the First operation can say, interesting, you're doing an OrderBy
27:17
but you only care about one value. So instead of doing, say, a full sort
27:23
I can just do a partial sort to zero in on, to only sort, the pieces of the data that I need to
27:29
to find the one value that's critical, changing the algorithmic complexity and the amount of work being done
27:36
And you can see that then drops me in this particular example from 3.7 seconds to about 200 milliseconds
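The query being described has this shape (the data and sizes here are illustrative); on newer releases, LINQ can satisfy it with a partial selection instead of fully sorting the array:

```csharp
using System;
using System.Linq;

var rng = new Random(42);
int[] data = Enumerable.Range(0, 1_000_000).Select(_ => rng.Next()).ToArray();

// Sort, skip the first 10, grab the next element: only one value is
// actually needed, so a full O(n log n) sort can be avoided.
int eleventhSmallest = data.OrderBy(x => x).Skip(10).First();
Console.WriteLine(eleventhSmallest);
```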
27:45
And that difference would grow dramatically the more data I threw at it to the point where I probably couldn't even put it on the screen
27:50
because my operation wouldn't have finished in time. So sometimes it's complete rewrite, sometimes it's algorithmic complexity
27:57
And sometimes it's totally changing the way we write our code. So for example, in .NET Core 2.1, we introduced Span
28:08
And this has been a complete game changer for the code that we write throughout the .NET stack
28:14
permeating all levels of the stack from the runtime all the way through up ASP.NET Core
28:20
and to applications on top of the runtime and core libraries, for developers that use it and
28:27
care about performance. What is span? Well, I'm using a little bit of made-up C# syntax here
28:33
because today in C#, you can't actually express what span is
28:38
You probably will be able to in a future release of C#. But essentially, span is a tuple of a ref and an int
28:45
So it's basically a pointer, just a managed pointer, like if you were passing a ref to a function
28:50
except this one happens to be a part of a struct, and a length
28:55
Basically, we're referring to some contiguous piece of memory, where it starts and then a length
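The made-up syntax he alludes to is roughly this (at the time of the talk, ref fields weren't expressible in C#; C# 11 later added them):

```csharp
// Conceptually, a span is just a managed pointer plus a length:
public readonly ref struct Span<T>
{
    private readonly ref T _reference; // start of some contiguous memory
    private readonly int _length;      // number of elements from that start
}
```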
29:01
I can use a span to represent a string. I can use a span to represent an array
29:06
I can use a span to represent some data on the stack. I can use a span to represent some native memory
29:12
I can represent all these different things with one type so that I can unify all of my logic
29:20
around all of these data types. And that type is fully safe, just as array is
29:25
So it's fully bounds checked and all that. And then I can do really interesting operations on this
29:31
So the conversion to span is zero-allocation, and I can index just like I can with an array
29:36
I can slice it in an allocation free manner since all I'm doing is moving the pointer
29:40
and changing the length. I can also reinterpret this. So with an array, if I had an array of integers
29:45
and I wanted an array of bytes, I would have to allocate a new array of bytes
29:50
copy over the relevant data and then use that. With spans, I can just reinterpret
29:54
effectively a reinterpret cast like you had in C++, reinterpreting that span of integers as a span of bytes
29:59
and then continue to operate on it. And then throughout the framework, there are hundreds upon hundreds upon hundreds of methods now
30:08
that operate on spans in addition to strings and arrays and so on
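The reinterpretation he mentions is available through MemoryMarshal.Cast; for example:

```csharp
using System;
using System.Runtime.InteropServices;

int[] ints = { 1, 2, 3, 4 };

// Reinterpret the span of ints as a span of bytes: no copy, no allocation
Span<byte> bytes = MemoryMarshal.Cast<int, byte>(ints.AsSpan());
Console.WriteLine(bytes.Length); // 4 ints * 4 bytes = 16
```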
30:13
All of this gives us performance and safety in a way that we're able to leverage
30:23
throughout pretty much everything that we now do. On top of that, there are now optimizations
30:27
more and more optimizations in the JIT focused on the spans so that certain patterns that developers write
30:34
are now hyper-optimized, whereas previously with arrays they may not have been
30:37
And also optimizations showing up in things like the C# compiler itself, for generating
30:42
more efficient code when you use spans, because of certain constraints the compiler can trust based
30:48
on it. So let's just take one simple example here. We've got a function here where I'm taking in a quoted string
31:00
and I want to say hello to that person's name, in this case, Steven, by stripping off the quotes
31:05
So I have to do a substring operation, and I can concatenate hello and the substring
31:11
I run this on .NET Framework 4.8 and .NET 5. And you can see it's nice. We're maybe 15% faster on .NET 5 already
31:17
but we can do better than that. So if I wanted to use spans, I could, for example, first stack allocate a span
31:25
And you notice I don't have to use unsafe here, because the C# compiler trusts that if I'm storing a stackalloc directly into a span, that is a safe operation
31:34
I'm not directly accessing a pointer. I can copy the literal hello into the span
31:40
I can slice that string as a span, the name string, to slice off the quotes, copy that to the output
31:47
and then I can finally return the relevant portion of that span as a string
31:51
Now I run this on .NET 5, and we see that 22 nanoseconds drops to 18 nanoseconds
31:58
but I've also cut my allocation in half. But now we can say, all right, well, that's nice
32:02
but that's a lot of code, some complication there. Couldn't we wrap that up into helpful functions that I could reuse
32:08
And so we end up exposing things like an overload of String.Concat that accepts spans
32:14
And now I can write almost the exact same code, shorter even than the original code, just using spans and slicing them
32:21
And now not only do I allocate half the amount, but I'm more than double the throughput that I had in the original code
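The three variants described above look roughly like this. The method names and the exact buffer arithmetic are mine; the quoted timings come from benchmark runs, not from this sketch.

```csharp
using System;

// Variant 1: Substring allocates an intermediate string before the concat.
static string HelloSubstring(string quotedName) =>
    "Hello, " + quotedName.Substring(1, quotedName.Length - 2);

// Variant 2: build the result in stack memory; only the final string allocates.
static string HelloStackalloc(string quotedName)
{
    Span<char> buffer = stackalloc char[7 + quotedName.Length - 2];
    "Hello, ".AsSpan().CopyTo(buffer);
    quotedName.AsSpan(1, quotedName.Length - 2).CopyTo(buffer.Slice(7));
    return buffer.ToString();
}

// Variant 3: the span-accepting String.Concat overload; no substring allocated.
static string HelloConcat(string quotedName) =>
    string.Concat("Hello, ", quotedName.AsSpan(1, quotedName.Length - 2));

Console.WriteLine(HelloSubstring("\"Stephen\""));
Console.WriteLine(HelloStackalloc("\"Stephen\""));
Console.WriteLine(HelloConcat("\"Stephen\""));
```

All three print "Hello, Stephen"; the difference is in how many intermediate strings each one allocates along the way.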
32:28
There are also other significant benefits to span. One of the things that we've focused on as part of these pervasive coding changes is we've actually focused a lot on moving code that was in C and C++ into C Sharp
32:45
And this might be counterintuitive. Most people think, oh, C and C++, they're way faster than C Sharp
32:51
But there are many cases where that's simply not true and many reasons for that
32:56
We can look at an example of this. In .NET 5, we added a whole bunch of sorting routines for spans
33:04
In fact, we took all of the code that was in the runtime for sorting arrays
33:09
It was written in C and C++. We deleted it. We instead wrote it in C Sharp, almost entirely safe C Sharp
33:17
with just a few unsafe pieces in very constrained regions that we could hyper-focus on reviewing
33:24
did it in terms of spans, and then we actually have Array.Sort on top of that
33:30
Array.Sort is actually using the span-based sorting. We can see the impact of this
33:34
I've got a benchmark here that just fills an array with some reverse sorted numbers
33:39
then sorts it, and we can run it on .NET Framework 4.8
33:42
This particular operation was taking 95 nanoseconds. There were some minor improvements that had gone to
33:47
.NET Core 3.1 that shaved about 10 percent off of that. But then you can see for .NET 5
33:53
where we moved this into C# and wrote it with spans, it's twice as fast as what it was in .NET Framework 4.8
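The benchmark shape is roughly the following; `MemoryExtensions.Sort`, added in .NET 5, is the span-based API that `Array.Sort` now sits on top of (the array size here is my own choice).

```csharp
using System;

// Fill an array with reverse-sorted numbers, then sort it via a span.
int[] values = new int[256];
for (int i = 0; i < values.Length; i++)
    values[i] = values.Length - i;

// MemoryExtensions.Sort (new in .NET 5) sorts the span in place;
// Array.Sort(values) now routes through the same span-based implementation.
values.AsSpan().Sort();

Console.WriteLine(values[0]);  // 1
Console.WriteLine(values[^1]); // 256
```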
34:01
There are lots of great benefits to spans and also to this rewriting that we're doing from C++ to C-sharp
34:09
One of them is this pervasive use of spans. Because the logic was in C-sharp
34:15
it was a lot easier for us to then use spans with it. This then allowed us to not only use it internally, but expose it out for other code to take advantage of. But there are other advantages. When that code was all in C and C++, it was all sort of unsafe
34:30
It was all pointer manipulation. Whereas when we moved it to C Sharp, as I mentioned, we were able to keep it almost entirely in bounds checked C Sharp code
34:39
and only make the one or two critical tight inner loops in a few places, I think two places
34:45
in the entire implementation ended up being, quote-unquote, unsafe. Another benefit of this is, you know, a lot of the contributors
34:54
that come to the .NET runtime repo prefer working in C# rather than in C++
35:00
They can experiment faster given the syntax and the niceties of the language
35:04
and the libraries and whatnot. And so we end up finding that more people are interested in experimenting
35:09
and contributing more when code is in C Sharp. And so we end up seeing many more PRs to further improve performance and experiment with things than we did when they were in C or C++
35:19
So it's not that C# in this case ends up being faster. It's that more time was invested in finding ways to optimize the code
35:27
At the same time, because it is safe, we're happy to more readily accept those kinds of changes,
35:34
because we don't have to worry about the same kind of potential security vulnerabilities
35:39
or buffer overruns and the like that we would have had to scrutinize the native code for
35:44
And therefore, we can spend less time doing a better job reviewing the changes
35:48
and having a higher level of faith in them. But there's actually another benefit to this moving stuff out
35:56
And that has to do with garbage collection. So for any of you who do web services and you're focused on things like
36:06
I'm running my server GC, but every once in a while, it pauses for a massive period of time
36:12
and my latency shoots up. And so my 95th or 99th percentile latency is much higher than I would
36:18
like it to be. That's because of GC pause time. When the GC needs to do a lot of its work
36:21
it has to pause the world, make sure all the threads are in a stable position, then it can do its work, and then it can let the threads resume
36:30
But there are certain operations that you could do. Well, let me take a step back
36:34
Every time we transition from C#, like from CoreLib, into the runtime, there are transition
36:41
costs. Like when you make a P/Invoke, there's some overhead associated with the P/Invoke. When we transition into the runtime, there's some overhead associated with that. And for certain
36:48
really fast operations, we try and eschew as much of that overhead as possible by cutting some
36:54
corners. And sometimes cutting those corners means that the GC isn't able to do as good a job
37:02
of interrupting those threads as we would like. We can see an example of that here
37:06
I've got a little application that is spinning up a background thread
37:11
and all it's doing is sitting in a tight loop sorting an array, which is always going to be sorted
37:17
It's just very fast, try and sort it, already sorted, try and sort it
37:20
already sorted, bouncing in and out of managed code between C# and the runtime when the sorting code was native
37:26
Then I've got my main thread, which, 10 times over, is forcing a garbage collection and then sleeping for about 15 milliseconds
37:36
forcing a garbage collection, sleeping for 15 milliseconds. So because I'm doing this 10 times, you'd expect it to sleep for 15, 16,
37:43
17 milliseconds each, so 150, 160, 170 milliseconds in total. And then the GC is going to take some time
37:50
So let's round it up and guesstimate that this is going to take about 200 milliseconds to run
37:56
But I run it on .NET Core 3.1 and I get similar results for .NET Framework 4.8
38:01
And we see it's nowhere close to that. It's taking three seconds, four seconds, five seconds
38:06
to do this operation. And the reason for that is because of those fast transitions
38:10
where we cut some corners, the GC is having a really hard time
38:14
interrupting that background thread that's calling into the runtime. And as such, it's having to pause the world
38:22
and that main thread while it keeps trying to interrupt that background thread
38:27
When we moved all this into managed code, the GC was able to do a much better job at managing this
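A reconstruction of that demo might look like this. The array size and the exact loop details are my guesses at the shape described; the point is the gap between the naive ~200 ms estimate and the multi-second observed time on the older runtimes.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Background thread: a tight loop re-sorting an already-sorted array,
// so it is constantly transitioning into the sorting implementation.
int[] data = new int[1_000];
for (int i = 0; i < data.Length; i++) data[i] = i;

var background = new Thread(() =>
{
    while (true) Array.Sort(data); // always already sorted, so each call is quick
}) { IsBackground = true };
background.Start();

// Main thread: force a GC, then sleep ~15 ms, ten times over.
var sw = Stopwatch.StartNew();
for (int i = 0; i < 10; i++)
{
    GC.Collect();
    Thread.Sleep(15);
}
sw.Stop();

// Naive estimate: ~150 ms of sleeping plus some GC time, so roughly 200 ms.
// If the GC can't cleanly interrupt the background thread, this balloons to seconds.
Console.WriteLine($"{sw.ElapsedMilliseconds} ms");
```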
38:33
Now when we run it on .NET 5, you can see that the 3, 4, 5 seconds has dropped to something much closer to the 200 milliseconds that we initially sort of waved our hands at and predicted. So span is one very pervasive change
38:47
we see throughout. Another has to do with async. So async/await and tasks in C# have permeated everything everyone does
38:58
And as a result in .NET Core 2.1, we completely rewrote the infrastructure
39:02
behind async await. You can see the results of this in a simple benchmark
39:06
Here I've got a benchmark that, 1,000 times, is yielding. It's basically awaiting something that
39:13
completes immediately after the await. It's always just forcing it to post back
39:17
Just for fun, I've added an AsyncLocal, since that shows up in a lot of these
39:22
When I run this on .NET Framework 4.8, you can see this was taking about two milliseconds
39:27
When I run it on .NET Core 2.1, you can see that more than cut in half to about 930 microseconds
39:35
By .NET 5, we got that down to about 800 microseconds. But the big thing you can see here is in memory allocation
39:42
the jump between .NET Framework 4.8 and .NET Core 2.1, we went from over 600k to about 100k allocated for this operation
39:51
which is great. You can see why here if I do a memory trace
39:58
This is using the allocation profiler in Visual Studio. You can see all the objects that were being
40:03
allocated in .NET Framework 4.8, and then in .NET 5, we just have this one allocation per iteration
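The benchmark being measured is essentially this shape (a sketch; `Task.Yield` gives the "always yields, never completes synchronously" behavior described, and the `AsyncLocal<int>` flows across every await):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// 1,000 awaits of something that always yields, with an AsyncLocal<int>
// flowing across every one of those awaits.
var asyncLocal = new AsyncLocal<int>();
asyncLocal.Value = 42;

async Task<int> YieldLoop()
{
    for (int i = 0; i < 1_000; i++)
        await Task.Yield(); // never completes synchronously: always posts back
    return asyncLocal.Value; // the caller's value flowed across all 1,000 awaits
}

int observed = await YieldLoop();
Console.WriteLine(observed); // 42
```

On .NET Core 2.1 and later, the rewritten async infrastructure means this whole loop produces a single allocation for the returned task, instead of one or more per await.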
40:09
This is very nice, but still, that one allocation is for the task
40:14
that's being returned from that yield once method. What if we could do something about that
40:21
In, I believe, .NET Core 2.0, we introduced ValueTask, ValueTask<T>
40:28
ValueTask<T> was nothing more than a discriminated union between a T and a Task<T>
40:33
which meant that if you returned this from one of your async methods and the async method completed
40:38
synchronously, which is actually very common. In fact, probably 95% of all async methods end up
40:45
completing synchronously. The first time you call, say, reading from a buffered stream
40:50
it's going to have to go out and actually do some I/O. But the next time, it's got data buffered, and
40:55
so that async read actually completes synchronously. For that case, we can return the T in the ValueTask
41:01
rather than a Task<T> and avoid that allocation. But for asynchronously
41:11
completed operations, this didn't help. We actually, in .NET Core 2.1,
41:17
introduced the ability to help by adding a new interface. And so not only could a ValueTask wrap a T or a Task<T>, it could also wrap an IValueTaskSource<T>, which then allowed an implementation to plug in its own sort of backing object behind a ValueTask
41:35
And in doing so, enabled it to pool it or reuse the same object over and over
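The synchronous-completion fast path can be sketched with a toy buffered reader. The reader itself is my invention; what matters is that the hot path wraps the `T` directly in a `ValueTask<int>` instead of allocating a `Task<int>`.

```csharp
using System;
using System.Threading.Tasks;

int[]? buffer = null;
int pos = 0;

// Fast path: data is already buffered, so wrap the int directly in the
// ValueTask<int>; no Task<int> is allocated. Slow path: fall back to a real task.
ValueTask<int> ReadAsync()
{
    if (buffer is not null && pos < buffer.Length)
        return new ValueTask<int>(buffer[pos++]); // synchronous completion
    return new ValueTask<int>(FillAndReadAsync()); // the rare "actually do I/O" case
}

async Task<int> FillAndReadAsync()
{
    await Task.Delay(1);           // stand-in for real I/O
    buffer = new[] { 10, 20, 30 }; // data is now buffered
    pos = 1;
    return buffer[0];
}

int first = await ReadAsync();  // asynchronous: fills the buffer, returns 10
int second = await ReadAsync(); // synchronous: returns 20 without allocating
Console.WriteLine($"{first}, {second}");
```

Pooled `IValueTaskSource<T>` implementations go one step further, making even the asynchronous path allocation-free by reusing the same backing object.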
41:40
And we can see this as an example using sockets. So sockets as of .NET Core 2.1 take advantage of this because it's extremely common to do
41:49
you know, very, very tight hot-loop receive and send async operations on sockets
41:56
And we want those to be as allocation free as possible. So I run this benchmark
42:02
which is just doing 1,000 receives. There's no data available, so it's going to yield;
42:07
then I send some data, and then I await the receive, because the receive is now able
42:12
to be satisfied by the sent data. And we can see for this particular benchmark of 1,000 receives and sends
42:18
on .NET Framework 4.8, this was taking 35 milliseconds and allocating over 300K
42:23
And on .NET Core 2.1, you'll notice the allocation column has a dash
42:28
There is zero allocation associated with this benchmark now because once the socket is created
42:33
and the object that is being used to back these value tasks has been created
42:37
we know we never again have to allocate anything for synchronously or asynchronously completing send or receive operations
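A minimal reconstruction of that ping-pong (here with 100 iterations rather than 1,000, over a loopback socket pair of my own construction): the `Memory<byte>`-based `SendAsync`/`ReceiveAsync` overloads return `ValueTask<int>`s backed by a reusable source inside the socket, so the steady state doesn't allocate.

```csharp
using System;
using System.Net;
using System.Net.Sockets;
using System.Threading.Tasks;

// A connected loopback socket pair standing in for the benchmark.
using var listener = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
listener.Bind(new IPEndPoint(IPAddress.Loopback, 0));
listener.Listen(1);

using var client = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
await client.ConnectAsync(listener.LocalEndPoint!);
using var server = await listener.AcceptAsync();

byte[] payload = { 1, 2, 3, 4 };
byte[] received = new byte[payload.Length];
int total = 0;

for (int i = 0; i < 100; i++) // the talk's benchmark did 1,000
{
    // Post the receive first: no data is available yet, so it yields...
    ValueTask<int> pending = server.ReceiveAsync(received.AsMemory(), SocketFlags.None);

    // ...then the send completes it. Both calls return ValueTask<int>.
    await client.SendAsync(payload.AsMemory(), SocketFlags.None);

    int got = await pending;
    while (got < payload.Length) // drain any partial reads (TCP is a byte stream)
        got += await server.ReceiveAsync(received.AsMemory(got), SocketFlags.None);
    total += got;
}

Console.WriteLine(total); // 400 bytes moved
```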
42:47
And you can see over time in the networking stack, we've also made further improvements. So in .NET Core 3.1, throughput got better
42:53
And again in .NET 5, things got even better. So async has been a big area where we've made pervasive changes across the stack. And you can see many more places taking advantage of these IValueTaskSource implementations
43:06
In fact, for .NET 6, one of the likely things we have landing is both framework support and compiler support for being able to use a pooled
43:17
IValueTaskSource implementation behind your own async ValueTask methods automatically,
43:24
just via tagging them with an attribute. So I'm excited. Hopefully that will land for .NET 6
43:31
One other thing that's been pervasive has been ArrayPool. I don't have a benchmark for this
43:36
but just as an example, this is very similar to the code that we have in Stream.CopyToAsync
43:42
It used to be that we would allocate a brand new array for every time you did copy to async
43:47
Now we rent one from the shared array pool, which we can optimize to our heart's content
43:53
We rent the buffer, we use it, we return it. And in general, this drastically reduces the amount of allocation while doing better around
44:02
cache coherency and not thrashing caches and the like that we might otherwise get
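A simplified sketch of that rent/use/return pattern (not the real `CopyToAsync` implementation; treat the 81920-byte buffer size as illustrative):

```csharp
using System;
using System.Buffers;
using System.IO;
using System.Threading.Tasks;

// Rent a pooled buffer instead of allocating a fresh array for every copy.
static async Task CopyAsync(Stream source, Stream destination)
{
    byte[] buffer = ArrayPool<byte>.Shared.Rent(81920); // Rent may hand back a larger array
    try
    {
        int read;
        while ((read = await source.ReadAsync(buffer.AsMemory())) != 0)
            await destination.WriteAsync(buffer.AsMemory(0, read));
    }
    finally
    {
        ArrayPool<byte>.Shared.Return(buffer); // back to the pool for the next caller
    }
}

var src = new MemoryStream(new byte[] { 1, 2, 3, 4, 5 });
var dst = new MemoryStream();
await CopyAsync(src, dst);
Console.WriteLine(dst.Length); // 5
```

The try/finally matters: the buffer goes back to the pool even if the copy throws, so the pool keeps its buffers in circulation.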
44:09
One more example of pervasive changes we've made, and that has to do with vectorization and hardware intrinsics
44:17
So for years now, .NET has had vector types for doing multiple operations at the same time
44:24
taking advantage of functionality in instruction-set extensions like SSE, SSE2, AVX, AVX2, and so on
44:33
But we've also in recent years started adding direct access, individual access to these instructions, yielding literally thousands of new methods in .NET Core 3.0, 5.0, and 6.0
44:45
And these operations are now being used pervasively throughout the lower levels of the libraries, in span, string, encoding, JsonSerializer, even things like WebSocket are getting in on the game
44:58
We can see an example of why this matters. I've got a little bit of Shakespeare here, and I'm just searching, via String.Contains, for
45:06
a particular word in this text doing an ordinal search. When I run this on .NET Framework 4.8, you can see this was taking about 40 nanoseconds
45:14
This was not vectorized. In .NET Core 2.1, we took advantage of the vector types and vectorized this, and we went
45:20
from about 41 nanoseconds to about 10 nanoseconds, so about 4x faster
45:25
And then starting in 3.1 and 5.0, we started taking advantage of the direct intrinsics
45:31
We got a lot of the way there with the general vector support and then taking advantage of the direct intrinsics
45:36
to eke out even more. So we went from 10 nanoseconds to 7
45:40
and then, for .NET 5, to 6 nanoseconds. And so every release, we're really pushing the boundaries
45:45
And more and more functionality ends up using this sort of base level of performance-focused API
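As a flavor of the portable vector API (this sum is my own example, not one from the talk): `Vector<int>` processes a hardware-width batch of elements per operation, and the JIT lowers it to SSE/AVX instructions where available.

```csharp
using System;
using System.Numerics;

// Sum ints a vector's worth at a time, with a scalar loop for the remainder.
static int VectorSum(ReadOnlySpan<int> values)
{
    var acc = Vector<int>.Zero;
    int i = 0;
    for (; i <= values.Length - Vector<int>.Count; i += Vector<int>.Count)
        acc += new Vector<int>(values.Slice(i)); // e.g. 8 ints per add with AVX2
    int sum = Vector.Dot(acc, Vector<int>.One);  // horizontal add of the lanes
    for (; i < values.Length; i++) sum += values[i]; // scalar tail
    return sum;
}

int[] data = new int[100];
for (int i = 0; i < data.Length; i++) data[i] = i + 1;
Console.WriteLine(VectorSum(data)); // 5050
```

The hardware intrinsics classes (`Sse2`, `Avx2`, and friends) expose the individual instructions directly when even more control is needed than `Vector<T>` provides.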
45:54
So that was a whirlwind tour. I am basically out of time
45:59
If I could ask two things of everyone here, it is first, if you're still on .NET Framework
46:08
please start thinking about moving to .NET Core. It's where all of our energies are going
46:12
It's where all of the performance improvements are going. It's where all the new functionality is going
46:15
It's where the world is moving. So if you haven't gotten there, totally understandable
46:19
but please be thinking about it. If you are already on .NET Core, please stay current
46:24
because as you can see, every release just gets better and better and better
46:28
And staying current also then means that you have a better chance of contributing issues that you find
46:35
You can come help us fix, and you can take advantage of things as soon as they're released
46:40
And then my second request is: get involved. If there are things you'd like to see be better, file issues on dotnet/runtime or dotnet/aspnetcore or dotnet/winforms, whichever it may be
46:49
Or if you'd like to actually get your feet wet and actually try and make some PRs, we would love it
46:56
Come find things to improve, whether it be something as simple as comments or something as complicated as vectorizing some new operation, taking advantage of hardware intrinsics
47:06
Whatever it is, we would love to have you. And with that, I will say thanks