Code Quality and Infrastructure as Code by Bill Penberthy
Nov 6, 2023
Building software is more than just building an application that works. Good software focuses not only on functionality but also on user experience, security, and performance. In this non-coding talk, I will share some tips and advice on building better software.
0:00
Hey, everybody. As David mentioned, my name's Bill Penberthy, and we're here to kind of talk about code quality and infrastructure as code, or IAC
0:08
So I'm currently a consultant at Unify Consulting, which is headquartered here in Seattle
0:14
My specialty is in systems and software architecture, with an emphasis on operations in the cloud
0:21
I've written a couple of .NET books, and I'm two chapters and three weeks away from completing my third, which is called Pro
0:28
.NET on AWS, for Apress. I've spent, oh, 20 years in consulting plus an additional
0:36
10 plus in industry. And I've worked in literally every aspect of the software development
0:42
lifecycle. You know, I like to claim, and I don't know that "like" is the right word for it
0:48
But my claim to fame is that I've probably made every single possible software developer
0:53
mistake out there. And I have the gray hairs to prove it. So hopefully the things that we'll go over
0:58
today may help you avoid some of those problems yourself. So this is our agenda. A real quick
1:05
description of infrastructure as code so that we all get on the same page as to what I mean here
1:10
We'll then talk about how code quality and infrastructure go hand in hand. We'll then go into some of
1:16
the differences around code quality in an application versus code quality in infrastructure as code
1:21
And then we'll talk a little bit more about what you could do, or what you should do, about those
1:25
differences. So what is infrastructure as code? Well, you know, I've got Microsoft's definition up
1:33
on the slide, so I'm not going to read that. So what I'll simply say is that IAC is basically defining
1:39
your infrastructure, such as your compute resources, whether it's a virtual machine or a container
1:44
or a serverless function, your networking configuration, your firewall rules, system rights
1:52
anything and everything that you use to provide the infrastructure on which your applications run. All of that, in IAC, you define in code: infrastructure as code. The code is then
2:02
checked into source control, and whenever you need an environment, you rerun that code. So this solves
2:07
real-world problems. Environmental drift is one. IAC offers idempotence, where a deployment always starts
2:17
the infrastructure in the same configuration every time. And since this is checked into source control
2:22
and idempotent, you can then roll your environment back to a previous version
2:28
You can check out previous versions and do some work against that. These are all solving real-world problems
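To make the idea of idempotence concrete, here is a minimal Python sketch. It is not how Pulumi, Terraform, or CloudFormation work internally; the `plan` and `apply` functions, the resource names, and the sizes are all hypothetical, chosen only to show that applying the same declared state twice leaves the environment unchanged the second time.

```python
def plan(desired: dict, current: dict) -> list:
    """Work out which actions move `current` to match `desired`."""
    actions = []
    for name, config in sorted(desired.items()):
        if name not in current:
            actions.append(("create", name))
        elif current[name] != config:
            actions.append(("update", name))
    for name in sorted(current):
        if name not in desired:
            actions.append(("delete", name))
    return actions


def apply(desired: dict, current: dict) -> dict:
    """Carry out the plan and return the resulting environment."""
    result = {name: dict(config) for name, config in current.items()}
    for action, name in plan(desired, current):
        if action == "delete":
            del result[name]
        else:  # "create" and "update" both end at the declared config
            result[name] = dict(desired[name])
    return result


# Hypothetical declared state, and an environment that has drifted from it.
desired = {
    "web": {"type": "vm", "size": "t3.micro"},
    "firewall": {"type": "rule", "port": 443},
}
drifted = {
    "web": {"type": "vm", "size": "t3.large"},
    "scratch": {"type": "vm", "size": "t3.micro"},
}

first = apply(desired, drifted)
second = apply(desired, first)
print(plan(desired, drifted))  # work is needed the first time
print(plan(desired, first))    # nothing left to do the second time
```

The caller only states what the environment should be; the sketch's `plan` step stands in for the tool that works out how to get there.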
2:34
IAC also offers real benefits beyond how it helps solve those problems
2:39
You know, one of the coolest things about the cloud is how you can spin instances
2:45
up and down really easily whenever you need to. If something's going on and all of a sudden you need a new testing instance of an application
2:52
maybe to test an integration or to give a partner a sandbox to play in, something like that
2:59
Well, now it's really easy to do that because you have your infrastructure defined in code
3:02
You know, in the old days, way back when, typically you would have, what I'd call, a playbook,
3:10
a written set of steps that you would take to create and configure your environment
3:15
And a new environment took time. And since nobody ever really had time, it would be a big
3:22
deal and it would get a lot of pushback to create a new environment. Well, in the cloud with
3:28
IAC, it's now not that big of a deal. And lastly, and it kind of doesn't fit in here, but I think
3:34
it's important to start bringing this up now, is that IAC should be used with a declarative definition
3:41
This means that the code that you create and manage describes what the environment is rather than
3:48
how it's built. IAC is, you know, typically you look at it as it's kind of based on a product like
3:55
Pulumi or Terraform or AWS CloudFormation. And those products are kind of responsible for filling in the
4:02
how based upon the declarations that you provide in code. In other words, your job isn't to do it
4:08
Your job is to define what it is. And we just briefly mentioned environmental drift. Let's go into this
4:15
a little bit more because this is, in many ways, this may have been my biggest waking nightmare
4:20
for the last few decades. So, as you may know, environmental drift is when there are discrepancies between environments
4:28
This happens when you have a VM or a system on which you deploy without redoing the environment
4:35
Very, very typical over the last few years, especially when deploying to Windows VMs, you know
4:41
so I could guarantee that like two years ago when you first created those
4:45
environments, they were exactly the same. Now they aren't. Test will be
4:49
different because it's had things tested on it. Go figure. Whether it's an OS
4:53
patch or an application update that ended up getting rolled back, at some point
4:57
those environments are going to evolve differently. Which means there'll potentially be
5:03
some different behaviors that are on those environments. They could be performance-based
5:08
say you bumped up production resources to 16 gig while your dev or test are at 8 gig
5:15
it could also be something more significant like different versions of installed frameworks
5:22
especially when you don't take those frameworks with you when you deploy. And that's what IAC prevents: this drift
5:28
It means that you'll be using the same environment, just like you use the same bits, hopefully
5:33
when you test the deployment and then when you push to prod. So what you're evaluating in test and what you're running in production are the same
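As a rough illustration of what drift looks like when you go hunting for it, the following Python sketch compares two environment descriptions setting by setting. The `find_drift` function and every value in the two dictionaries are hypothetical; in practice these facts would come from whatever inventory or monitoring data you have.

```python
def find_drift(reference: dict, actual: dict) -> dict:
    """Return {setting: (reference_value, actual_value)} for every mismatch."""
    drift = {}
    for key in sorted(set(reference) | set(actual)):
        reference_value = reference.get(key, "<missing>")
        actual_value = actual.get(key, "<missing>")
        if reference_value != actual_value:
            drift[key] = (reference_value, actual_value)
    return drift


# Hypothetical snapshots of two environments that started out identical.
test_env = {"memory_gb": 8, "dotnet_framework": "4.7.2", "os_patch": "2023-09"}
prod_env = {"memory_gb": 16, "dotnet_framework": "4.8", "os_patch": "2023-09"}

for setting, (in_test, in_prod) in find_drift(test_env, prod_env).items():
    print(f"{setting}: test={in_test} prod={in_prod}")
```

When both environments are rebuilt from the same definition on every deployment, a comparison like this should come back empty.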
5:42
The advent of containers has done a lot around helping manage this, and that's great. However, those of us, as I mentioned earlier, in the .NET world
5:51
and especially at enterprise companies, still have a lot of applications running on .NET
5:56
Framework 4.x, which means most likely running on a Windows VM, and more than likely that
6:02
system hasn't been updated for a long time. And you will have drift. So why IAC matters
6:11
You know, we've kind of talked about all these points, but I wanted to reiterate them. IAC matters
6:15
Using IAC makes your environments more predictable, and that has a lot of value in and of itself
6:21
So here's an example of IAC using Pulumi and C# to define an Ubuntu instance on a t3.micro instance type
6:31
So what this is is we've defined the operating system and some other ancillary kind of software that we want installed on it
6:38
and we've determined with the instance type what size of processor and memory availability
6:44
and network connections do we want to have available. So we're basing this on a well-known machine image here, the Ubuntu slash images
6:56
You can see that in the first part of the GetAmiFilterArgs. But what this means is that if you choose the proper AMI, or Amazon Machine Image, for this one,
7:11
you can get Ubuntu with .NET Core already installed on it
7:17
which means that before you use this EC2 instance, before you do anything, it's already defined and it's ready for use
7:23
So by using this, all of a sudden now, any work that you have to do after starting your environment is already done for you
7:31
And so that eliminates a lot of those problems that we used to have with the playbooks
7:35
around having to manually manage some of those workflows. So here's a different approach where the instance is defined in JSON
7:46
This is an example for the template definition for creating an AWS EC2 instance
7:53
So you can also create these definitions in YAML, but personally YAML drives me crazy, so I didn't include those. Filling out these values and running it in AWS CloudFormation will create the instance. Now, if you look at these, you'll see what you're doing is describing what the configuration
8:13
is going to be, what the hardware item that you're going to be deploying is. You're not telling
8:20
the system how to do it. You're instead just defining this is what I need to run my infrastructure
8:25
Now, an interesting thing about this is that the code that we looked at on the last slide
8:30
Typically what you'll see with that or with like the AWS Cloud Development Kit is that you're able to create code in the language that you're familiar with
8:39
And what it ends up doing is it outputs these definitions. So you're able to use those kinds of C# constructs that you're used to
8:46
But you're going to end up building these kinds of JSON templates that the cloud providers themselves are going to understand how to work with
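The idea of writing ordinary code that emits a JSON template can be sketched in a few lines of Python. This is not the CDK or Pulumi API; the `ec2_instance_template` function, the `AppServer` logical name, and the placeholder AMI ID are assumptions made for illustration. The template keys themselves (`AWSTemplateFormatVersion`, `Resources`, `Type`, `Properties`, `ImageId`, `InstanceType`) follow the CloudFormation template format.

```python
import json


def ec2_instance_template(image_id: str, instance_type: str = "t3.micro") -> dict:
    """Build a CloudFormation-style template describing one EC2 instance."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {
            "AppServer": {  # hypothetical logical name for the instance
                "Type": "AWS::EC2::Instance",
                "Properties": {
                    "ImageId": image_id,
                    "InstanceType": instance_type,
                },
            }
        },
    }


# "ami-PLACEHOLDER" is not a real image ID; substitute one from your own account.
template = ec2_instance_template("ami-PLACEHOLDER")
print(json.dumps(template, indent=2))
```

The function is normal code that can be reviewed, reused, and unit tested, while its output is still a plain declaration of what should exist.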
8:56
So that's a huge value of that. So think about that. We'll talk about this a little bit more when we get into it
9:02
But think about it as your code quality and approaching how you're going to be building this infrastructure, how they are starting to interrelate
9:09
And that's kind of what we'll go over now with, you know, IAC matters
9:14
And the rest of this conference kind of talked about code quality and how that matters
9:18
So let's kind of mash them together. So first of all, let's go over what code quality looks like because
9:26
these are the points that I'm going to kind of come back to as I go deeper into IAC and how that works
9:32
You know, we won't worry about a definition or anything like that. Let's just talk about the characteristics of quality code
9:38
And when we touch on these, think about them in the context of your infrastructure and not just your application
9:45
Does what it should. Okay. That's something I really want out of my infrastructure
9:49
I don't know about you. But I really want it to just do what it should. And next is easy to understand
9:54
That gets a little bit more complex when you're starting to think about infrastructure as code
9:59
because let's be real. Not everybody is used to IAC. And there's a lot of developers, especially in the older generation
10:10
.NET world, who don't have much experience in systems and infrastructure at all
10:15
So everything we could do to make the next guy that comes along be able to perform something meaningful with the infrastructure code becomes important
10:23
and that's where quality comes out. So the code should be well documented
10:28
This seems to go hand in hand with easy to understand, but consider what that means for IAC
10:33
First of all, the rate of change in code in IAC tends to be much lower
10:41
than it is in an active application. It's pretty unusual that you'd be changing your environment every release, for example
10:48
You could. I mean, definitely you could, but generally you'd be running the same environmental code
10:53
for a longer period than the applicable application code. I mean, you'll still include it in every release
11:01
You just won't have put any change sets against it. So obviously, once an application goes into maintenance mode
11:06
it's completely possible that the only changes that you'll do is maybe like a quarterly OS security update
11:11
But that's generally not the case. But because it's updated the least frequently out of all of those parts of your system
11:20
documentation becomes even more important because, you know, I don't know about you, but if I go back and look at code that I wrote three months ago, I still have to spend that minute or two scratching my head going, what was I thinking
11:36
And the last point here is that quality IAC code can be tested. This is huge, because it's a primary differentiator between IAC authored by a systems guy and IAC authored by a developer
11:48
We'll go into this in more detail later, but think about your infrastructure in a way that means you take common coding concepts like separation of concerns and code reuse, and all of the other coding techniques that you use to ensure that you're writing quality code in your application, and apply those to your infrastructure as well
12:10
So what does quality code affect? Well, first thing is reliability, the probability that a system will run without failure over a specified period of operation
12:18
Okay, that sounds pretty good. I think we'd want that in our infrastructure as well
12:24
Maintainability, you know, how easily the software can be maintained, size, consistency
12:31
structure, and complexity of the code base are all part of that. Useful documentation plays into
12:38
that. Testability measures how well the software supports testing efforts, how well you can
12:48
control, observe, isolate, and automate testing. Portability measures, you know, how usable
12:56
the same software is in different environments. And reusability, whether existing assets such as
13:02
code, can be used again. So when we go into the different solution points, these are what we're
13:07
going to compare them back to, these five points right here. So we've talked about IAC. We've talked
13:15
about code quality and the importance of quality code on successful IAC, let's go point out some more
13:20
of the differences between normal code or code in an application and code used in infrastructure
13:26
as code, and how that could affect things differently as you think about code quality
13:35
So there's four major points that we're going to talk about. The first is the footprint of effect
13:40
The second is the failure vectors, or the different ways in which a failure can occur. The third
13:45
is the complexity of testing. And the fourth is the unusualness of the problem set
13:52
I put that in quotes because I wasn't sure it's a real word, but I really like it, so I'm keeping it anyway
13:59
The footprint of effect, how much stuff can be impacted by poor quality code
14:06
The first point is that an environmental failure or misconfiguration or simply poor quality can cause an outage in the application
14:12
and that may be the only cause of the outage. But of course, since it's the application, everyone will blame the application first
14:21
unless, of course, you know, you're one of the app developers, in which case you know it's the environment, but nobody will believe you
14:27
But there's a lot of problems that a poorly configured environment can cause
14:32
Think about it. If the database driver that you may need isn't installed or a route isn't open in the firewall
14:39
or access rights aren't given correctly, the application won't work, but it will take some research to figure out why
14:46
Is the code wrong or is it something somewhere else that's making the problems
14:51
And even beyond the application, a lot of things that a modern system relies on, such as tracing or alerting, logging, all could be broken because something went wrong with the environment
15:01
My favorite was once where the environment was so broken that the alerts we used to tell us if something was wrong, well, they wouldn't fire
15:10
it was so broken that our systems couldn't report that it was broken
15:14
And so we didn't expect it to be broken, and we just kind of skipped that whole part of the troubleshooting process
15:20
And all of this shows how these dependencies trickle down. You know, if logging doesn't work, for example
15:28
you may not notice it until you have to go look for a specific problem. So now you have two problems
15:33
One, the app is hiccuping and you're not logging it. And why
15:37
Because your infrastructure code is of low quality. So let's visualize what I was just talking about
15:43
The blue box is your application. Everything else around it is what you manage and impact when using IAC to manage your infrastructure
15:51
Thus, an IAC deployment can affect everything inside and including the white box, which is your network boundary
15:57
So not only is the problem, or potential problem, the application, it's everything that the application communicates with. So all of these different points become your failure vectors, or the most common directions in which errors can occur
16:16
Poorly designed IAC could affect networking and connectivity. Imagine something like a firewall rule misconfigured or a routing table entry missing
16:26
or anything like that that talks about network and communications within the network
16:31
Monitoring, as we mentioned above. But, you know, that's another direction of failure, so you have to ensure that your code quality extends to the code that's managing the
16:40
monitoring configuration and setup. And lastly, the actual processing of the infrastructure
16:47
creation. Much of this is out of your hands, but there are definitely some quality decisions
16:53
that you can control that have a significant impact on how everything is put together and
16:58
processed by your IAC tool of choice. And here's another visualization that kind of shows the extent
17:04
of IAC slightly differently because it can also impact your ability to connect with other systems
17:09
or even managing the rights that you assign your own resources. And because of the failure vectors, managing the testing is very difficult
17:24
Think back to when we went over the characteristics of quality code and how one of those main characteristics
17:29
was testability. Well, the complexity of this testing effort makes it even more critical
17:34
that the testability of the code is maximized. Sure, more than likely the release schedule for your infrastructure changes will be different
17:42
than your application changes, but you need to make sure that you have enough time to manage
17:45
these testing needs. And we'll go into these in a little bit more detail in a bit
17:52
And lastly, is the unusualness of the code in IAC. Well, why is it unusual
18:00
Well, first of all, it describes infrastructure. Many developers do not have the most complete grasp on the intricacies of the entire cloud system
18:09
A VPC, for example, or virtual private cloud, is generally pretty well understood
18:15
But knowing that you need to add an internet gateway to that VPC before you can access the internet
18:20
requires a deeper understanding of the peculiarities of whichever cloud provider you're using than most developers really wish to have
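Provider-specific knowledge like the internet gateway requirement can be captured once as a check instead of being rediscovered by each developer. The Python sketch below scans a CloudFormation-style template for VPCs that have no internet gateway attached. The function name and the example templates are hypothetical, the check assumes the attachment refers to the VPC with `Ref`, and it only makes sense for VPCs that are actually meant to reach the internet.

```python
def vpcs_without_internet_gateway(template: dict) -> list:
    """List VPC logical names that have no internet gateway attachment."""
    resources = template.get("Resources", {})
    vpcs = {
        name for name, resource in resources.items()
        if resource.get("Type") == "AWS::EC2::VPC"
    }
    attached = set()
    for resource in resources.values():
        if resource.get("Type") != "AWS::EC2::VPCGatewayAttachment":
            continue
        properties = resource.get("Properties", {})
        vpc_reference = properties.get("VpcId", {})
        # Only attachments that name an internet gateway count here.
        if "InternetGatewayId" in properties and isinstance(vpc_reference, dict):
            attached.add(vpc_reference.get("Ref"))
    return sorted(vpcs - attached)


# Hypothetical templates: one forgets the gateway, one attaches it.
template_missing_gateway = {
    "Resources": {
        "AppVpc": {"Type": "AWS::EC2::VPC", "Properties": {"CidrBlock": "10.0.0.0/16"}},
    }
}
template_with_gateway = {
    "Resources": {
        "AppVpc": {"Type": "AWS::EC2::VPC", "Properties": {"CidrBlock": "10.0.0.0/16"}},
        "AppGateway": {"Type": "AWS::EC2::InternetGateway"},
        "AppGatewayAttachment": {
            "Type": "AWS::EC2::VPCGatewayAttachment",
            "Properties": {
                "VpcId": {"Ref": "AppVpc"},
                "InternetGatewayId": {"Ref": "AppGateway"},
            },
        },
    }
}

print(vpcs_without_internet_gateway(template_missing_gateway))
print(vpcs_without_internet_gateway(template_with_gateway))
```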
18:29
Another unusual... I'm sorry, I went too fast. There we go
18:40
Another unusual difference is that the IAC describes what the output is rather than how to build it
18:46
This is a strength of IAC because that makes the outcome more predictable
18:50
But it's the opposite of normal developer behavior. Developers tend to focus on how it gets done because they're generally the ones that are responsible for doing that
19:00
And lastly, it should be treated at the same level as an application. What do I mean by that? Well, many times you kind of see infrastructure considered to be
19:11
part of an application, generally because of how tightly bound they are. But just like those
19:20
visualizations that we went over a little bit ago, it puts the infrastructure in an inferior
19:29
position. And what that means is that people then start looking at the infrastructure as being subordinate to the application, when really you should
19:40
be thinking of them as being at the same level or partners in it rather than one of them owning
19:46
the other one. Even though, to be honest, the sole purpose of the infrastructure is to run the
19:52
application, we need to break that mental model about the infrastructure being subordinate
19:57
to the application. Now, the arguments against it being different. So I've heard the argument that the infrastructure should be considered to be similar to a framework or an SDK
20:11
They're similar in that there will be changes and updates to the framework, but there are very likely to be a lot fewer of those changes
20:20
And also any quality issues in the framework could result in incorrect behavior in the application, just like it would when looking at the infrastructure
20:27
So that argument says there's no real difference between being a framework module and being IAC
20:34
And there are definitely some similarities. But we talked about those vectors for failure earlier
20:40
And that's where I think IAC needs to be treated differently than a framework or other application inclusive items
20:46
But once again, you know, if you're going to look at it as subordinate, then it would make sense to kind of consider them even more
20:52
but the fact that the infrastructure can have many other things that can cause problems other than the application makes me think that it should be treated separately
21:04
Another one that talks about complexity and how there are other systems that are complex as well, you know, this too is true, but I think it kind of misses the point
21:13
The examples that were given speak specifically about systems. So a group of applications working together to fulfill a set of requirements
21:22
perhaps being complicated, is still really application bound. Whether,
21:28
you know, it's one application with a million lines of code or a hundred separate
21:32
applications with 10,000 lines of code, the failure vectors are different between the applications
21:37
and between the infrastructure. And lastly, code is code is code, so in that regard there
21:43
should be no differences between IAC and application code. And in many ways, I actually feel that
21:50
this is the most compelling of these arguments. However, there are some things that you'll need to do differently when you look at IAC as opposed
21:57
to regular code. And lastly, there are some additional considerations that are mostly burdensome on the
22:04
infrastructure team. The first of these is pre-established company standards, generally around IT
22:12
governance or security or things like that. You know, the standards talk about when and how and why IT infrastructure changes can
22:19
happen and how that change needs to happen, who has to approve it and so on. You know, you can't
22:24
just throw those standards out when you go to IAC or exempt your infrastructure from them
22:30
Imagine what would happen if, say, some big audit came through and you have systems that aren't
22:35
following your company's published IT standard. Well, you'd probably fail that audit
22:40
and that could be a very expensive proposition. The next consideration is industry compliance
22:46
There are additional controls put on the infrastructure if you're in a PCI compliant environment or GDPR, any of the other kind of industry-led standards and rules
22:58
Both of these are considerations that put additional burden on using IAC, and they help, you know, I think, explain why code quality needs to be important and how those decisions need to be made
23:11
So what do you have to do to ensure quality in IAC
23:18
Well, we talked a little bit about audits and governance, but these are definitely factors that you have to take into account
23:25
when building infrastructure. While not a quality code construct in and of itself
23:32
audits and governance, they provide the structure and methodologies that are going to be enforced through infrastructure as code
23:40
You know, these structures need to be managed in a more rigid way than just assuming they're one of the set of requirements
23:47
that you're going to use to evaluate your testing success. Instead, reviews and audits should really be set up before the infrastructure work can be
23:56
performed, such as deploying to a new environment. These reviews are all around compliance and whether or not the infrastructure meets
24:04
the various needs and requirements that they've got. And the key to this is that whenever the infrastructure has changed, no matter how small the change, you must confirm that you're still in compliance. And that's more than just a QA test. You need to be much more formal about it, so, a governance audit
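Part of confirming compliance on every change can be automated, with the formal review sitting on top of it. The Python sketch below runs two made-up company rules against a CloudFormation-style template. The rule sets, the `compliance_findings` function, and the example template are all assumptions for illustration; real standards would come from your own governance documents.

```python
ALLOWED_INSTANCE_TYPES = {"t3.micro", "t3.small"}  # hypothetical company standard
REQUIRED_TAGS = {"Owner", "CostCenter"}            # hypothetical company standard


def compliance_findings(template: dict) -> list:
    """Return one human-readable finding per rule violation."""
    findings = []
    for name, resource in sorted(template.get("Resources", {}).items()):
        if resource.get("Type") != "AWS::EC2::Instance":
            continue
        properties = resource.get("Properties", {})
        instance_type = properties.get("InstanceType")
        if instance_type not in ALLOWED_INSTANCE_TYPES:
            findings.append(f"{name}: instance type {instance_type!r} is not approved")
        tag_keys = {tag.get("Key") for tag in properties.get("Tags", [])}
        for missing in sorted(REQUIRED_TAGS - tag_keys):
            findings.append(f"{name}: missing required tag {missing!r}")
    return findings


# Hypothetical change: someone resized the instance and dropped a tag.
changed_template = {
    "Resources": {
        "AppServer": {
            "Type": "AWS::EC2::Instance",
            "Properties": {
                "ImageId": "ami-PLACEHOLDER",
                "InstanceType": "t3.2xlarge",
                "Tags": [{"Key": "Owner", "Value": "platform-team"}],
            },
        }
    }
}

for finding in compliance_findings(changed_template):
    print(finding)
```

An empty list of findings would be a precondition for the audit, not a replacement for it.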
24:23
These help support the reliability of your infrastructure. And next is a golden construct, because once you have governance, this is the best way to ensure compliance and audit management
24:35
This is your definition of each part of the infrastructure. If you're using virtual machines
24:40
then you should have the golden construct for VMs that you want to use. You should keep the number of these as small as possible
24:46
For example, don't have a golden construct for Ubuntu and a different construct for Debian
24:51
unless you really, really, really have to support both operating systems. Instead, standardize whenever possible and use variables to manage those real differences
25:01
If you think back to the IAC code that I showed earlier, this one
25:05
think about what makes sense to support different variable-driven settings. For example, I think that we'd want to easily configure the AMI value because those could be different from one another
25:19
And in many ways, that could easily impact the application. And we certainly know that we want to do that with managing the instance type or where we define the size of the machine that's to be started
25:30
That's a logical place to get started because it's easy to see why you may want to use a different size of, you know, hard drive and memory based upon the workload that you're looking to work with
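One way to picture a golden construct with variable-driven settings is a single definition whose only adjustable parts are the AMI and the instance type. The Python sketch below is a hypothetical illustration, not a Pulumi or CDK construct; the `GoldenVm` class, its defaults, and the placeholder AMI IDs are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenVm:
    """The one approved VM definition; only the fields below vary per environment."""
    ami_id: str
    instance_type: str = "t3.micro"

    # Standardized for everyone. It has no type annotation, so it is a plain
    # class attribute rather than a dataclass field callers can set.
    OPERATING_SYSTEM = "ubuntu"

    def to_resource(self) -> dict:
        """Render the construct as a CloudFormation-style resource."""
        return {
            "Type": "AWS::EC2::Instance",
            "Properties": {
                "ImageId": self.ami_id,
                "InstanceType": self.instance_type,
            },
        }


# Placeholder AMI IDs; the environments differ only through the variables.
test_vm = GoldenVm(ami_id="ami-PLACEHOLDER-TEST")
prod_vm = GoldenVm(ami_id="ami-PLACEHOLDER-PROD", instance_type="t3.large")

print(test_vm.to_resource())
print(prod_vm.to_resource())
```

Keeping the operating system out of the constructor is a deliberate choice in this sketch: supporting a second OS would mean creating a second golden construct on purpose, not passing a different argument by accident.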
25:42
So your golden construct is really a series of constructs. Each one of your infrastructure components needs a golden construct
25:50
Thus, your infrastructure is really a bunch of gold kind of stitched together as a unified
25:56
whole, kind of like how a chain necklace is made up of golden links put together
26:01
Since your golden construct is literally defined as the definition of that construct that
26:05
could be used anywhere, then the reusability expectations for code quality are satisfied
26:10
These golden constructs also obviously help drive reliability as well. So automated testing, as I hinted about earlier, is massively important and massively complicated
26:24
You can obviously use automated application testing to get some assurance, but then you have to ensure that you have a way to manage the testing of all the rest of the different aspects
26:35
Some of this you could do through post-deployment integration testing. You know, I saw one approach where they ran automated UI-based testing after every deployment as a smoke test
26:49
And that test series was a set of test paths that were really carefully calculated to hit every service, database, and external data feed that the system relies on
27:03
thus making sure that any potential infrastructure or coding changes didn't impact the ability of the application to function
27:10
It generally added 15 to 20 minutes to every release. So at a release cycle of every two weeks, the cadence and the extra time was pretty minimal
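The smoke test described above was UI-based, but the core idea, one check per dependency run after every deployment, can be sketched without a UI. In the Python sketch below, `run_smoke_test` and both check functions are hypothetical stand-ins; real checks would open a database connection, call a service, or read from the external feed.

```python
def run_smoke_test(checks: dict) -> dict:
    """Run every check and return {dependency_name: "ok" or a failure message}."""
    results = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = "ok"
        except Exception as error:  # report the failure and keep checking the rest
            results[name] = f"failed: {error}"
    return results


def check_database():
    """Placeholder: a real check would run a trivial query."""


def check_partner_feed():
    """Placeholder that simulates a firewall route missing after deployment."""
    raise ConnectionError("route to partner feed is not open")


results = run_smoke_test({
    "database": check_database,
    "partner_feed": check_partner_feed,
})
for dependency, outcome in results.items():
    print(f"{dependency}: {outcome}")
```

Because every dependency gets its own entry, a failure points at the piece of infrastructure involved, which helps answer the question of whether the code or the environment is at fault.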
27:21
The other area of concern is the alarms, performance, observability, those kinds of things
27:26
There's no real standard way to do this because it really depends heavily on how you manage your operations
27:30
but there are some ways that I've seen this put together quite successfully
27:35
I've seen, for example, reports set up in Splunk or whatever log monitoring platform that you may use
27:40
that evaluates all the incoming data post release and analyzes it against previous releases
27:45
and then calls out any differences. Some of these differences are expected because of code changes, new features, and things like that
27:52
But others come as a real surprise, and those are the ones that you look for. Imagine back to my example I gave earlier where our system was so broken that
28:00
we couldn't even tell that it was broken. This would have helped us figure that out much sooner
28:08
These don't replace your traditional anomaly scanning, but instead are special dashboards that look for expected input that's missing
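Looking for expected input that has gone missing can be as simple as comparing event counts per source between releases. The Python sketch below is not a Splunk query or API; the `silent_sources` function, the source names, and the counts are hypothetical, and in practice the counts would be pulled from your log platform.

```python
def silent_sources(previous_counts: dict, current_counts: dict, min_expected: int = 1) -> list:
    """Sources that reported during the previous release but are silent now."""
    return sorted(
        source
        for source, previous in previous_counts.items()
        if previous >= min_expected and current_counts.get(source, 0) < min_expected
    )


# Hypothetical event counts for the same time window after two releases.
previous_release = {"app_logs": 12000, "alerts": 40, "traces": 9000}
current_release = {"app_logs": 11800, "traces": 0}

print(silent_sources(previous_release, current_release))
```

A source that disappears entirely raises no error of its own, which is why a check for absence has to be written separately from ordinary anomaly alerts.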
28:17
And lastly, is the code quality tools. We've just seen several talks about those
28:22
You know, these tools use various approaches to scan your source code
28:27
and look for code smells, or areas that affect things like maintainability, based on coding standards such as
28:38
test method names that need to follow a standard convention or naming conventions that are
28:44
followed for constants, methods, and properties, those kinds of rules. Boy, I'm drawing some blanks on the rest of them now
28:51
My favorite line about this is one SonarQube's got: the coding standards rules promote civility
28:58
I really enjoy that. And if you saw Abhishek's talk earlier, you'll realize that you've seen this capability in more detail
29:05
I think he went over Sonar and Terraform. And we just saw some things with EditorConfig
29:12
But these are all powerful tools to help you statically analyze your code
29:15
for potential issues around quality. Typically, you most commonly see these being configured or run during a build
29:22
say when your unit tests are run. But I've also seen it set up as a gating event where you can
29:28
not complete your code check-in because of concerns raised by the tool
29:32
Generally, these tools are best suited for application code, but they are getting smarter and smarter around code smells and IAC code
29:41
So this will be interesting to watch over the next few years as it fully matures
29:47
I expect that, you know, probably in the next year or two, you know
29:53
open source tools like AWS's Cloud Development Kit will be pretty powerful
29:58
about helping us understand and diagnose the same kinds of problems that you're seeing with things like SonarQube or Checkmarx
30:09
But unfortunately, we're not able to fulfill every aspect of code quality when you consider IAC
30:16
One of the examples that I mentioned earlier was that portability is a good goal of code quality
30:22
Unfortunately, at this time, that's not possible. Even when using a third-party tool such as Pulumi,
30:29
they're unable to build out an abstract model that can run across multiple providers easily and safely
30:35
However, they're getting better and better at this every year. So there is still the opportunity that at some point
30:41
you'll be able to write infrastructure code that can be deployed against many different targets. Unfortunately, I don't think we're there yet
30:48
Now, in summary, there are four different approaches that help assure quality code
30:54
when working with IAC. Audits and governance, you're most likely going to have to perform these anyway
31:02
So rather than looking at them as a chore, take advantage of the structures that they offer
31:07
and use them to support quality code. The next is the golden construct
31:11
that one representation of your infrastructure. You should only make updates to your golden construct
31:17
when you're 105% confident that they can do the job. Because this is your definition of your infrastructure
31:24
Remember, I call it a golden construct, but think of it as a combination of little gold
31:31
things all kind of combined together with each infrastructure component adding to the strength of
31:37
the overall construct. Next is automated testing and evaluation. This includes regular
31:44
application testing as well as integration testing that stresses every part of the system
31:49
This should help you determine things like system rights or installed software problems. As part of
31:54
this, you also need to come up with ways to ensure that other areas relying on your system are validated,
32:00
like alerts, logging, all those other things that you rely on to be able to ensure that your
32:06
operational systems are up and working correctly, well, you need to make sure that those still
32:12
work after an evaluation or after a change as well. And lastly, take advantage of code quality
32:17
tools. They're getting better and better at assessing IAC code
#Development Tools