Hello, world
The Fargate/Terraform tutorial I wish I had
In my last way-too-long, way-too-technical, seriously-nobody-cares technical post, I wrote about serverless functions. The main benefit of serverless functions, I wrote, is that you can deploy code to production without having to worry about keeping a server online, secure, and up-to-date. But a secondary benefit of serverless functions is also their main trade-off: they’re just functions. Computer scientists might call them pure functions, because the outputs of serverless functions are usually entirely dependent on their inputs and nothing else. You could also call them stateless, because they don’t retain any artifacts or side effects from any one invocation. (The runtimes from AWS and Google fudge this somewhat, but let’s pretend.) This trade-off makes the code simpler to understand and to debug.
In many cases, as it was for Louvre, the trade of state for simplicity is well worth it. But other times, it’s worth it to have a more stateful system. An API might want to store database connections for reusability, or maintain in-memory caches for speed, or simply maintain a counter for the purpose of rate-limiting. And that’s where AWS Fargate comes in.
Fargate is sort of the best of both worlds. Like its predecessor, it’s a way of launching containers on AWS while maintaining visibility on the container after it launches. But unlike its predecessor, the EC2 launch type, Fargate doesn’t require you to pre-allocate and maintain an instance on which to run your container. With Fargate, you simply get to define your container and launch it.
Or at least that’s the promise. AWS, however, is complicated, and launching a Fargate service using the console is no mean feat. You have to use at least five different AWS services, in a specific order, and that’s not including any databases or other integrations you might want to use. The console never really tells you where to start. It never tells you where to go next. Sometimes the information the console gives you is just plain wrong. If you mess up, you might be able to fix it. But if not, you might have to start from scratch.
That’s why infrastructure configuration languages like Terraform are so appealing. You simply define your infrastructure once, in code. Then you run a program which uses that configuration to build your infrastructure. If you mess up, or want to try something new, you can simply blow it all up and rest assured that it’s just as easy to recreate it. Best of all, all infrastructure changes can now be peer-reviewed and committed to version control, a requirement in highly-regulated environments and a plus everywhere else.
But the promise of Terraform is a little too good to be true, and that’s because Terraform has to play by the rules of your cloud provider. Terraform will build whatever infrastructure you tell it to, but you still have to know what you want. With AWS, and newer services like Fargate in particular, this isn’t always clear. So while you’ll see a lot of Terraform in this tutorial, this is really a tutorial on how to set up a Fargate service.
Here’s the goal: we’re going to try to spin up a Fargate service, using Terraform and as minimal a configuration as we can get away with. I’ll show the code step by step below, but at the end of this article I’ll provide a link to a Github repository with all of the Terraform necessary to start a Fargate service, with a few minimal changes.
(If you’ve read this far and find yourself wanting to run from the room screaming, thanks for sticking with me for this long! I’ll write about something more interesting next time, I promise.)
The setup
Let’s start with the app. The core building blocks of Fargate services are Docker containers, and the whole point of Docker, or containerization in general, is that the host operating system no longer has to care about what sort of app is in the container (and vice versa!). So as a demo, I wrote a quick Go app (natch), but it could easily be a Node app, a Rails app or even just a webserver serving static files (don’t actually do this last one – there are better ways to solve that particular problem).
Here’s our application:
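A minimal sketch of such an app might look like this; the endpoint paths, the port (8080), and the use of the free sunrise-sunset.org API are assumptions of this sketch rather than guarantees about the exact code:

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
)

// healthCheck gives the load balancer something cheap to poll.
func healthCheck(w http.ResponseWriter, r *http.Request) {
	fmt.Fprintln(w, "ok")
}

// sun proxies a request to the sunrise-sunset.org API and streams
// the JSON response back to the caller.
func sun(w http.ResponseWriter, r *http.Request) {
	lat := r.URL.Query().Get("lat")
	lng := r.URL.Query().Get("lng")
	date := r.URL.Query().Get("date")

	url := fmt.Sprintf("https://api.sunrise-sunset.org/json?lat=%s&lng=%s&date=%s", lat, lng, date)
	resp, err := http.Get(url)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadGateway)
		return
	}
	defer resp.Body.Close()

	w.Header().Set("Content-Type", "application/json")
	io.Copy(w, resp.Body)
}

func main() {
	http.HandleFunc("/", healthCheck)
	http.HandleFunc("/sun", sun)

	log.Println("listening on :8080")
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```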
Hopefully, it’s pretty straightforward. We’re creating an HTTP server and exposing two endpoints on it. One of them is just a health check; the other queries an API for the sunrise and sunset times for a particular location on a particular day. The whole app is less than 100 lines of code, but it’s doing two things that we’d want a typical API to do: listen for requests and act as a gateway to make requests to an upstream service.
Next, in order to deploy it on Fargate, we need to define a Docker container for – or Dockerize – our app. Here’s the Dockerfile which makes that happen:
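A sketch of what such a two-stage Dockerfile might look like; the Go and Alpine image versions are assumptions:

```dockerfile
FROM golang:1.17-alpine AS build
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /sun-api .

# Copy the compiled binary into a much smaller, locked-down runtime image.
FROM alpine:3.14
COPY --from=build /sun-api /sun-api
USER nobody
EXPOSE 8080
ENTRYPOINT ["/sun-api"]
```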
We’re using two stages in this Dockerfile. The first stage, which starts on line 1, builds the application. The second stage, starting on line 7, copies the built application into a slimmer and less permissive environment.
To make things a little easier, I took the liberty of building the Docker image and pushing it to a public Docker repository on Github. This saves us the couple of steps required to create a private repository and push an image to it. The service we’re about to create can just use the image from the public repository.
I promise this isn’t some half-witted attempt to get you to install malicious code in a Docker container in your AWS account. But if you’d prefer to be cautious, you can create the image yourself and upload it to any Docker repository you control. (For simplicity, make sure it’s public for now.) Here’s an example of how to do it on Docker Hub:
- Create a public repository called sun-api on Docker Hub.
- Make sure you’re logged into your Docker account on the CLI by running docker login.
- Grab the two files above and put them in a directory together (not your $GOPATH). Run go mod init sun-api (any module name will do); this should create a go.mod file (and a go.sum file if the app pulls in any dependencies).
- Run docker build -t <your_docker_username>/sun-api:latest . (the trailing dot is part of the command).
- Run docker push <your_docker_username>/sun-api:latest
Keep an eye out for when we use this image name later on in the tutorial and replace my image’s URL in the image path with <your_docker_username>/sun-api:latest.
(By the way, creating a private ECR repo to push Docker images to and making your service pull from that repo isn’t hard. The part that’s a bit of a pain is actually pushing your image from your machine to the ECR repo. So I’m opting to skip it. But I’ll give you the Terraform for creating the ECR repo as well, if you want to do that in the future.)
Now that our app is ready to deploy, let’s start writing some Terraform. Add these lines to a file named config.tf in your directory:
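A minimal sketch of that config; the region and the 3.x provider version are assumptions, and the bucket name matches the example discussed below:

```hcl
terraform {
  required_version = "~> 1.0.5"

  # Store Terraform's state file in S3 rather than on your laptop.
  backend "s3" {
    bucket  = "terraform"
    key     = "terraform.tfstate"
    region  = "us-east-1"
    profile = "tfuser"
  }

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 3.0"
    }
  }
}

provider "aws" {
  region  = "us-east-1"
  profile = "tfuser"
}
```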
Both of the blocks in this file contain a line that says profile = "tfuser". This tells Terraform how to authenticate with your AWS account. You’ll need to set this up manually: under IAM in the AWS Console, select Users in the left-hand nav, then find the Add User button. The username should be tfuser, and make sure the checkbox labeled “programmatic access” is checked. On the next screen, make sure to add the AdministratorAccess policy.
After creating your user, you should see a screen with an Access Key ID and Secret Access Key. Copy those values into a file named ~/.aws/credentials, with the following format:
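Something like this, with your own keys substituted in (the section name has to match the tfuser profile we told Terraform about):

```ini
[tfuser]
aws_access_key_id = <your_access_key_id>
aws_secret_access_key = <your_secret_access_key>
```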
Terraform reads every file ending in .tf in the same directory as part of the same workspace, so we can split up our code into meaningful files. We combined most of our config into one file, but if things ever get more complicated, we can split out this config into a provider.tf, backend.tf and versions.tf, for example.
Our backend block under terraform is telling Terraform we’re going to put the state file in an S3 bucket called terraform, in a file named terraform.tfstate. You’ll probably need to change the bucket name to something more unique; since S3 bucket names are globally unique, there’s a really good chance someone is already using the name terraform. Create a bucket with the name you picked in the S3 console (the default, totally private settings are what you want), then set that as your bucket name in your backend block.
Next, from the command line and in the same directory as your config.tf file, run terraform init. You should see a success message that looks something like this:
|
|
If that’s what you see, great! If not, make sure your tfuser user has the appropriate AWS permissions and verify that Terraform is installed correctly on your machine (if you run terraform version, you should see something along the lines of Terraform v1.0.5).
You may have also noticed that Terraform created a file called .terraform.lock.hcl in your directory. If you’re using version control, this file is like a package-lock.json or go.sum and is safe to commit.
The service
Our end goal is to create a Fargate ECS service. So let’s start by creating that and see where we get. From the Terraform documentation, it seems like we want to create an aws_ecs_service. Let’s add an aws_ecs_service resource block, with the required fields filled out as well as we can. (Paste this into a new file called ecs.tf.)
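A first pass might be as simple as this (the resource label and service name are placeholders of my choosing):

```hcl
resource "aws_ecs_service" "sun_api" {
  name            = "sun-api"
  task_definition = ""
}
```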
It turns out there are only two absolutely required fields, so our first iteration is pretty simple. We’re supplying the name of the service, which is arbitrary. We don’t know what the task_definition is yet, so we’ll just use an empty string for now. This is obviously not going to be our final solution, but let’s run a plan and see where we are.
|
|
This is just a plan, meaning we haven’t actually made any changes to our AWS environment yet. But it’s always a good idea to inspect the plan output to make sure Terraform is doing what we expect it to do. In this case, the only thing that seems off is the launch_type. Terraform is saying it will be “known after apply,” which means it’ll use whatever AWS defaults to. We want to ensure it’s FARGATE, so let’s add that line:
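With that line added, the sketch from above becomes:

```hcl
resource "aws_ecs_service" "sun_api" {
  name            = "sun-api"
  task_definition = ""
  launch_type     = "FARGATE"
}
```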
And here’s the resulting output:
|
|
This seems too easy, but let’s run an apply anyway, just to see what happens:
|
|
As we suspected, that config wasn’t all we needed. (We haven’t even specified the image yet!) But it’s a good example of the difference between running terraform plan and terraform apply. terraform plan validates your config to make sure the syntax is valid, that any variables being referenced are defined, and that the required fields are populated. Even though a plan might be valid, however, Terraform doesn’t have much of an idea what AWS will say when it tries to execute the plan.
In this case, the aws_ecs_service documentation specifies that task_definition should be: “The family and revision (family:revision) or full ARN of the task definition that you want to run in your service.” It’s a good reminder that while Terraform helps us define our infrastructure, it doesn’t guarantee that the infrastructure we define will even run, much less meet best practices.
The good news is this: we know what to fix! Now that we’ve gone through one iteration of the code/plan/apply troubleshooting cycle, I’ll move a little faster. Let’s add these blocks to the ecs.tf file:
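Here’s a sketch of what that might look like. The image path is a placeholder for wherever you pushed your image, the port matches the app sketch above, and the CPU and memory values are the smallest sizes Fargate supports:

```hcl
resource "aws_ecs_task_definition" "sun_api" {
  family = "sun-api"

  # Container definitions are plain JSON under the hood; jsonencode lets us
  # write them as HCL and still reference other resources later.
  container_definitions = jsonencode([
    {
      name      = "sun-api"
      image     = "<your_docker_username>/sun-api:latest"
      essential = true
      portMappings = [
        {
          containerPort = 8080
          hostPort      = 8080
        }
      ]
    }
  ])

  cpu                      = 256
  memory                   = 512
  network_mode             = "awsvpc"
  requires_compatibilities = ["FARGATE"]
}
```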
Next, update the task_definition field in our aws_ecs_service block:
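Assuming the task definition sketched above, we can reference its ARN directly:

```hcl
  task_definition = aws_ecs_task_definition.sun_api.arn
```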
This is our first example of using a variable to populate another field, and it’s one of Terraform’s most powerful and appealing features. Instead of having to hardcode that ARN into our config, we can simply say: “that task definition I just created, whatever its ARN is, use it here.” If we ever destroy and recreate that task definition, and it gets a new ARN, this config will still work perfectly.
Running terraform apply should give us our first partial success:
|
|
A couple things actually got created! Now that we’re in this for real, if you need to tear down everything, you can run terraform destroy. Like apply, it’ll give you a plan output that specifies what it intends to destroy, so make sure you inspect that closely. But Terraform will only touch resources it knows about, so it should only affect resources you’ve created here.
We’re still stuck on that task definition and it’s about to get weird, because it’s time to add some permissions. We need to create a role for the task to use while it’s running, but we also have to explicitly allow our ECS task to assume that role. AWS provides a policy we can use for execution, but we’ll have to attach it to a role we create. Add these lines to ecs.tf:
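A sketch of those four blocks, using the AWS-managed AmazonECSTaskExecutionRolePolicy (the role name itself is arbitrary):

```hcl
# Block 1: the role our ECS task will use for execution.
resource "aws_iam_role" "sun_api_task_execution_role" {
  name               = "sun-api-task-execution-role"
  assume_role_policy = data.aws_iam_policy_document.ecs_task_assume_role.json
}

# Block 2: a trust policy allowing ECS tasks to assume that role.
data "aws_iam_policy_document" "ecs_task_assume_role" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["ecs-tasks.amazonaws.com"]
    }
  }
}

# Block 3: the AWS-managed policy for ECS task execution.
data "aws_iam_policy" "ecs_task_execution_role" {
  arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}

# Block 4: attach that managed policy to our role.
resource "aws_iam_role_policy_attachment" "ecs_task_execution_role" {
  role       = aws_iam_role.sun_api_task_execution_role.name
  policy_arn = data.aws_iam_policy.ecs_task_execution_role.arn
}
```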
Here, we’re creating a role that AWS will use to run our app. First, we attach a policy that allows the role to be assumed by ECS tasks (blocks 1 and 2). Then we grab the AWS-defined default policy for ECS task execution and attach it (blocks 3 and 4).
Now we can add this line to our aws_ecs_task_definition resource:
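Assuming the role from the sketch above:

```hcl
  execution_role_arn = aws_iam_role.sun_api_task_execution_role.arn
```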
If we run terraform apply now, it seems to try for a long time to create the service before finally failing.
|
|
That output suggests that we need a cluster in which to put our service, so let’s create it:
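Something like this should do it (the cluster name is arbitrary):

```hcl
resource "aws_ecs_cluster" "app" {
  name = "app"
}
```

And then we point the service at the new cluster, inside the aws_ecs_service block:

```hcl
  cluster = aws_ecs_cluster.app.id
```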
After adding those lines, our next terraform plan run should tell us that we’re closing in. We just need to create the cluster and the service. But when we try to apply that, we get the following:
|
|
The good news is we’re still making progress. The bad news is we’re about to talk about networking.
Networking
We set our task definition’s network_mode to be awsvpc because that’s what AWS requires for Fargate tasks. Unfortunately, that comes with some other hidden dependencies. Namely, Fargate tasks need to be in a VPC.
Creating the VPC by itself is fairly simple, but it also requires you to define subnets, route tables, NAT gateways and more. So I’ll save you the pain I went through trying to get this stuff working properly, and just give you the config. Open a new file called network.tf and copy these lines into it.
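Here’s a condensed sketch of what network.tf might contain. The CIDR ranges, availability zones and port numbers are assumptions, and this sketch cuts a corner by using a single NAT gateway; a more resilient setup would put one in each availability zone:

```hcl
resource "aws_vpc" "app_vpc" {
  cidr_block = "10.0.0.0/16"
}

# Two public and two private subnets, spread across two availability zones.
resource "aws_subnet" "public" {
  count                   = 2
  vpc_id                  = aws_vpc.app_vpc.id
  cidr_block              = cidrsubnet(aws_vpc.app_vpc.cidr_block, 8, count.index)
  availability_zone       = element(["us-east-1a", "us-east-1b"], count.index)
  map_public_ip_on_launch = true
}

resource "aws_subnet" "private" {
  count             = 2
  vpc_id            = aws_vpc.app_vpc.id
  cidr_block        = cidrsubnet(aws_vpc.app_vpc.cidr_block, 8, count.index + 2)
  availability_zone = element(["us-east-1a", "us-east-1b"], count.index)
}

# Public subnets route to the internet gateway; private subnets route out
# through a NAT gateway, so they can reach the internet but not vice versa.
resource "aws_internet_gateway" "igw" {
  vpc_id = aws_vpc.app_vpc.id
}

resource "aws_eip" "nat" {
  vpc = true
}

resource "aws_nat_gateway" "nat" {
  subnet_id     = aws_subnet.public[0].id
  allocation_id = aws_eip.nat.id
}

resource "aws_route_table" "public" {
  vpc_id = aws_vpc.app_vpc.id
  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.igw.id
  }
}

resource "aws_route_table" "private" {
  vpc_id = aws_vpc.app_vpc.id
  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.nat.id
  }
}

resource "aws_route_table_association" "public" {
  count          = 2
  subnet_id      = aws_subnet.public[count.index].id
  route_table_id = aws_route_table.public.id
}

resource "aws_route_table_association" "private" {
  count          = 2
  subnet_id      = aws_subnet.private[count.index].id
  route_table_id = aws_route_table.private.id
}

# The internet can reach the ALB; the ALB can reach the service; the service can reach the internet.
resource "aws_security_group" "lb" {
  vpc_id = aws_vpc.app_vpc.id
  ingress {
    from_port   = 80
    to_port     = 443 # covers both the HTTP and HTTPS listeners
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_security_group" "service" {
  vpc_id = aws_vpc.app_vpc.id
  ingress {
    from_port       = 8080
    to_port         = 8080
    protocol        = "tcp"
    security_groups = [aws_security_group.lb.id]
  }
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

output "vpc_id" { value = aws_vpc.app_vpc.id }
output "public_subnet_ids" { value = aws_subnet.public[*].id }
output "private_subnet_ids" { value = aws_subnet.private[*].id }
```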
From a high level, here’s what’s going on. First, we create a VPC. The VPC lets us ensure our services are isolated from the rest of AWS and the world, which is definitely a good thing. But VPCs don’t come with any built-in configuration, so we have to do that ourselves. To our VPC, we add two sets of public and private subnets in two availability zones. This is a best practice that happens to be an AWS requirement: even if one of the availability zones goes down, we should still be okay. Next, we define a route table for the public and private subnets and associate them accordingly: our public subnets will be exposed to the Internet via the Internet gateway directly, but we’ll put our private subnets behind a NAT gateway so that they can talk to the Internet but the Internet can’t get in. Finally, we’ll create some security groups so the Internet can reach our ALB, our ALB can reach our service, and our service can reach the Internet.
Note that if we were using the console to do these operations, we’d get a couple security groups by default. Terraform removes these, however, so we have to recreate them explicitly. Also, check out those output blocks, which will tell us the VPC and subnet IDs on the command line when they’re created.
After adding that file, we can run terraform apply to create our VPC and various networking pieces. Everything should create successfully, but we’ll still see this error:
|
|
Back in ecs.tf, add this block to your aws_ecs_service block:
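Assuming the subnets and security groups from the networking sketch:

```hcl
  network_configuration {
    assign_public_ip = false
    security_groups  = [aws_security_group.service.id]
    subnets          = aws_subnet.private[*].id
  }
```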
With any luck, the service should now create successfully when we run terraform apply! We’re not done yet, but this calls for a celebration.
Load balancer
We’re getting really close now. Our service is created and our task is configured; all we need now is a way to let incoming traffic in. We need a load balancer (or ALB, for Application Load Balancer). Let’s add these lines to our ecs.tf file:
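A sketch of the load balancer, its target group, an HTTP listener and that output, reusing names from the earlier sketches (the health check path assumes the app answers on /):

```hcl
resource "aws_lb" "sun_api" {
  name               = "sun-api-lb"
  load_balancer_type = "application"
  subnets            = aws_subnet.public[*].id
  security_groups    = [aws_security_group.lb.id]
}

resource "aws_lb_target_group" "sun_api" {
  name        = "sun-api"
  port        = 8080
  protocol    = "HTTP"
  target_type = "ip" # required for Fargate tasks
  vpc_id      = aws_vpc.app_vpc.id

  health_check {
    path = "/"
  }
}

resource "aws_lb_listener" "sun_api_http" {
  load_balancer_arn = aws_lb.sun_api.arn
  port              = 80
  protocol          = "HTTP"

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.sun_api.arn
  }
}

output "load_balancer_url" {
  value = aws_lb.sun_api.dns_name
}
```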
This last output block is important because it will tell us what URL we’ll use to reach the service without us having to go into the AWS console to figure it out.
Next, add this block to your aws_ecs_service block:
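Assuming the target group from the sketch above, plus the container name and port from the task definition:

```hcl
  load_balancer {
    target_group_arn = aws_lb_target_group.sun_api.arn
    container_name   = "sun-api"
    container_port   = 8080
  }
```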
One more change. By default, the ECS service we created won’t start any containers. We need to tell it how many containers we want.
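One task is plenty for a demo; add this to the aws_ecs_service block:

```hcl
  desired_count = 1
```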
Finally, run terraform apply one more time. (The ALB may take a bit to spin up.)
|
|
Finally, finally, finally, copy and paste that URL into your browser. If all has gone well, you should see the service respond!
If you’re tired of reading, feel free to skip to the end; the hard part is over. But if you’re in the mood to tackle just a couple more changes, we can really put a bow on this API.
Cleanup
You may have noticed that the load balancer is listening on HTTP, not HTTPS. In most cases, we’ll want APIs to be served over HTTPS, so let’s try and correct that using a certificate issued by AWS. You’ll need a domain (or a subdomain) with DNS that you control.
Add these lines to your ecs.tf file, substituting in your domain name in the first block (fully qualified, but without the https:// prefix):
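A sketch, with a placeholder domain and the HTTPS listener left commented out until the certificate has been validated:

```hcl
resource "aws_acm_certificate" "sun_api" {
  domain_name       = "sun-api.example.com" # substitute your own domain
  validation_method = "DNS"
}

# Uncomment once the certificate has been validated with your DNS provider.
# resource "aws_lb_listener" "sun_api_https" {
#   load_balancer_arn = aws_lb.sun_api.arn
#   port              = 443
#   protocol          = "HTTPS"
#   certificate_arn   = aws_acm_certificate.sun_api.arn
#
#   default_action {
#     type             = "forward"
#     target_group_arn = aws_lb_target_group.sun_api.arn
#   }
# }
```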
Next, find your sun_api_http listener and change the default action to this:
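A redirect action along these lines sends HTTP traffic over to HTTPS:

```hcl
  default_action {
    type = "redirect"

    redirect {
      port        = "443"
      protocol    = "HTTPS"
      status_code = "HTTP_301"
    }
  }
```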
Running terraform apply here should create your certificate and update your existing HTTP listener in place so that it redirects to HTTPS. But before you can turn on the HTTPS listener, you’ll need to validate the domain you chose with your DNS provider. The output of the apply should give you all the information you need:
|
|
That block tells me I should create a CNAME record that looks like this:
|
|
While you’re there, go ahead and create a second CNAME record that points your domain at your load balancer URL. For me, that’d be:
|
|
Your DNS provider should have instructions on how to create CNAME records, like this page from Cloudflare.
Once the validation CNAME record is created, you can uncomment the HTTPS listener block and run terraform apply once more. If this seems to try for a while before timing out, the DNS for the validation record may not have propagated yet, which means AWS hasn’t been able to validate your domain. Give it a few minutes and then try again. (You can also monitor the status of your certificate in ACM in the console.)
Whenever the listener gets created successfully, you should be able to hit the API using https://<your-domain> rather than the load balancer URL.
One last thing. Say we’re ready to start writing our own proprietary code and we want to switch our service to pull from a private ECR repository. This is actually pretty straightforward. Go ahead and add these lines to ecs.tf:
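The repository itself is about as minimal as Terraform resources get:

```hcl
resource "aws_ecr_repository" "sun_api" {
  name = "sun-api"
}
```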
Next, change the image field in your task definition JSON to reference the ECR repo:
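Assuming the container_definitions are built with jsonencode as in the earlier sketch, the image line can reference the repository URL directly:

```hcl
      image = "${aws_ecr_repository.sun_api.repository_url}:latest"
```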
You can run terraform apply to make these changes, but remember that your service won’t work properly until you actually push an image to your new ECR repo. Follow these instructions provided by AWS to authenticate your Docker CLI with your new ECR repo.
Conclusion
That wasn’t so bad, was it?
…
Okay, maybe it was a little rough. But we accomplished quite a bit. Not only did we spin up a Fargate service on HTTPS from scratch, but we did it using Terraform. That means rather than wasting time haphazardly clicking random buttons in the AWS console, we have an exact blueprint for how we spun this service up. And even better, we can instantly clone this Fargate service and create a second one that functions in the same – or similar – way. We might even decide that this is the way we want to create all Fargate services in the future and turn this into a module. That way, all we’ll have to do to spin up a new service is invoke the module with the parameters we define, abstracting away all of the boilerplate AWS stuff we now know we need. But that can wait until next time.
Thanks very much for reading! As promised, here’s a link to the repository with everything we’ve done today. If you are or are aspiring to be a technical person, I hope this was useful. Please let me know how I can improve this post in the comments or by emailing me at feedback@section411.com.
And if you’re not technical and you made it this far, I really appreciate you reading. I’ll be back with some baseball, a movie review or a personal story next time.
Thanks to Sara Sawczuk for reading a draft of this post. When she hits it big as an editor for real writers, I hope she gives me a family discount. Thanks also to Jordan Castillo Chavez for reviewing the more technical parts of this post.
This post and its accompanying repository were updated in September 2021 to use Terraform 1.0.5 and clean up some weird resource names, and then again in December 2021 to improve the networking setup.