Here at Honeybees, we use AWS for almost everything web development related. AWS is great; it provides a wide array of services that, among other things, make coding and DevOps tasks easier, and ECS is no exception to that rule. As such, it should come as no surprise that we chose ECS to deploy and manage many of our web applications.
During our now more than year-long stint using the service, it has served us very well, but it has not been all sunshine and rainbows. Deployment times left a lot to be desired. Even services with a small number of task instances would take close to 10 minutes on average, sometimes more, to deploy. Throughout development — especially during the early phases — we would therefore get bottlenecked whenever a developer accidentally pushed changes with show-stopping bugs to the development environment, as rolling back to a previous task definition meant that ECS would have to perform the deployment process all over again. Waiting for the developer to fix the bug and then deploying the fix would take even longer, so that was not really any better.
The average web application doesn’t take very long to set up and run. You download the project from somewhere, install its dependencies and build the source code if necessary, and then launch it. So what on earth is causing these long deployment times, anyway?
For one, the EC2 instances/Fargate tasks that the ECS cluster manages need to re-download the desired Docker image from the ECR repository. Images vary in size; in our case they are around 50 MB but can easily be bigger, and naturally, the bigger the image, the longer the download takes. After finishing the download, the task instances need to get the Docker image up and running, which takes additional time.
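If you want to gauge how much of your own deployment time goes into image pulls, a quick first step is simply checking how large your images are in ECR. Below is a minimal boto3 sketch that does just that; the repository name is a placeholder, not one of our actual repositories.

```python
# Minimal sketch: list image sizes in an ECR repository with boto3.
# The repository name passed at the bottom is a placeholder.
import boto3

ecr = boto3.client("ecr")

def print_image_sizes(repository_name: str) -> None:
    """Print the size of every image in the given ECR repository."""
    paginator = ecr.get_paginator("describe_images")
    for page in paginator.paginate(repositoryName=repository_name):
        for detail in page["imageDetails"]:
            tags = detail.get("imageTags", ["<untagged>"])
            size_mb = detail["imageSizeInBytes"] / (1024 * 1024)
            print(f"{', '.join(tags)}: {size_mb:.1f} MB")

if __name__ == "__main__":
    print_image_sizes("my-web-app")  # placeholder repository name
```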
Whenever a new task instance gets initialized, it needs to pass a set of health checks before it is marked as healthy and the deployment can complete. During this period, ECS keeps running the old task instances alongside the newly created ones until the new ones have passed their health checks, in order to prevent downtime. How quickly the new task instances get labeled as healthy, and how quickly they replace the old ones, depends on how strictly you have configured your health checks as well as how many free resources you have available on the ECS cluster in question.
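As a rough, back-of-the-envelope illustration of how health-check strictness translates into waiting time (the numbers below are made-up placeholders, not our actual settings):

```python
# Back-of-the-envelope estimate of how long a new task waits to be marked
# healthy. All numbers are illustrative placeholders, not our real settings.
interval_seconds = 30        # time between health checks
healthy_threshold = 5        # consecutive successes required
deregistration_delay = 60    # how long the old task drains before it stops

time_until_healthy = interval_seconds * healthy_threshold
total_replacement_time = time_until_healthy + deregistration_delay

print(f"New task marked healthy after ~{time_until_healthy}s")      # ~150s
print(f"Old task fully replaced after ~{total_replacement_time}s")  # ~210s
```

With stricter checks or a busier cluster, those couple of minutes stack up quickly across a rolling deployment.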
With this information on hand, we found ourselves asking: If all we want is to push a change consisting of a few lines of code to the development environment and/or update a project dependency, is it really necessary to go through all the steps outlined above?
In that moment, a realization dawned upon us: why not engineer our own solution that allows us to upload whatever changes we want applied onto the already existing task instances in the cluster, instead of having ECS replace them? After all, we know that updating and launching your average web app isn’t very time-consuming. It’s the deployment process that’s taking time and, well…
Service Hotreloader — a way to deploy without actually deploying
Following that realization, we set out to explore the idea, and Service Hotreloader is the result of that excursion. Below is a high-level overview of how it is set up:
At its core, it’s not much more than a generic wrapper containing logic that imitates what a developer would do to get an application up and running. That is, the App handler (sketched in code after this list):
Downloads the project from a version control host of your choosing, such as GitHub or Bitbucket.
Installs the project dependencies and builds the source.
Launches the application.
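Concretely, a stripped-down App handler could look something like the sketch below. The repository URL, npm commands, and working directory are placeholders standing in for whatever version control host and build tooling a given project uses; the real implementation handles errors, configuration, and more.

```python
# Simplified sketch of an "App handler": clone, install, build, launch.
# The repository URL, working directory, and npm commands are placeholders.
import shutil
import signal
import subprocess

class AppHandler:
    def __init__(self, repo_url: str, workdir: str = "/srv/app"):
        self.repo_url = repo_url
        self.workdir = workdir
        self.process = None  # handle to the running application

    def download(self) -> None:
        # Fresh shallow clone of the project; a real implementation might
        # pull into an existing checkout instead.
        shutil.rmtree(self.workdir, ignore_errors=True)
        subprocess.run(
            ["git", "clone", "--depth", "1", self.repo_url, self.workdir],
            check=True,
        )

    def build(self) -> None:
        # Install dependencies and build the source.
        subprocess.run(["npm", "ci"], cwd=self.workdir, check=True)
        subprocess.run(["npm", "run", "build"], cwd=self.workdir, check=True)

    def launch(self) -> None:
        # Stop the previous instance (if any) and start the new build.
        if self.process is not None:
            self.process.send_signal(signal.SIGTERM)
            self.process.wait()
        self.process = subprocess.Popen(["npm", "start"], cwd=self.workdir)

    def reload(self) -> None:
        # What the listener calls whenever the project has changed.
        self.download()
        self.build()
        self.launch()
```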
The Hotreloader also establishes a connection with a Pub/Sub endpoint upon starting. Whenever the listener receives a notification that the project has changed, it will tell the App handler to reload itself by repeating the above steps. This is what effectively allows us to bypass the ECS deployment process.
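We won’t pin the listener down to a specific Pub/Sub backend here; purely as an illustration, and reusing the App handler sketch from above, the loop might look something like this with Redis Pub/Sub (the host and channel name are placeholders):

```python
# Illustrative listener loop. Redis Pub/Sub is used here purely as an
# example backend; the host and channel name are placeholders.
import redis

def listen_for_changes(handler: "AppHandler", channel: str = "project-updates") -> None:
    client = redis.Redis(host="localhost", port=6379)
    pubsub = client.pubsub()
    pubsub.subscribe(channel)

    # Start the application once, then reload whenever an update arrives.
    handler.reload()
    for message in pubsub.listen():
        if message["type"] == "message":
            handler.reload()
```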
To better illustrate how the Hotreloader would be used inside an ECS cluster: each service keeps running its usual set of task instances, but every task instance runs the generic Hotreloader, which in turn downloads, builds, and launches the actual application.
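As a rough, hypothetical sketch (the image name, port, repository URL, and memory size below are all placeholders rather than our actual configuration), registering such a task definition with boto3 might look like this:

```python
# Hypothetical sketch of a task definition that runs the Hotreloader image
# instead of a per-application image. All names, sizes, and environment
# variables are placeholders.
import boto3

ecs = boto3.client("ecs")

ecs.register_task_definition(
    family="my-web-app",
    containerDefinitions=[
        {
            "name": "hotreloader",
            # One shared Hotreloader image, reused across services.
            "image": "123456789012.dkr.ecr.eu-west-1.amazonaws.com/service-hotreloader:latest",
            "memory": 512,
            "essential": True,
            "portMappings": [{"containerPort": 3000}],
            "environment": [
                # Tell the Hotreloader which project to run and which
                # Pub/Sub channel to listen on.
                {"name": "REPO_URL", "value": "git@github.com:example/my-web-app.git"},
                {"name": "UPDATE_CHANNEL", "value": "project-updates"},
            ],
        }
    ],
)
```

The same Hotreloader image can then back every service, with only the environment variables changing per project.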
So, how much time did we save by going down this route? As mentioned early on in the article, even a service with a low number of task instances used to take close to 10 minutes to deploy. With the new hot reloading system in place, it now takes at most 30 seconds, and if there are no additional dependencies to update or download, it takes less than 10 seconds. Since the majority of our deployments do not contain dependency updates, our “deployment” times are therefore nowadays mostly in the sub-10-second range — even on services with a large number of task instances like the ones we have in the production environment. Needless to say, we are very pleased with the results and found our endeavors to be well worth the trouble.
We hope this article proved insightful and gives you some ideas on how to reduce your own deployment times if you find yourself facing the same problems we did!
In order to keep things as concise as possible, we chose to gloss over a lot of the technical aspects, such as potential drawbacks, how the Pub/Sub endpoint works, how the Hotreloader deals with an incoming update notification while it is in the middle of reloading its project, and how it all fits into the rest of our CI/CD pipeline. If you’d like to know more about any of the above or anything else, let us know in the comment section. If there’s enough interest, we’d be happy to write a follow-up article!