How SuperAwesome scales containers for kidtechJanuary 31, 2019
Containers are a critical part of SuperAwesome’s infrastructure, allowing us to grow very rapidly without losing focus on our goals.
Sometimes, some really simple changes can allow you to save you money, build simpler infrastructure, ensure a better developer experience, and reduce maintenance.
What we do
Every day, 170,000 children go online for the first time. Unfortunately, the internet was built by adults for adults, without taking the specific needs of children into account. As a result, many of the platforms they use are collecting vast amounts of their personal data (which is now illegal) and haven’t been designed for child engagement (in terms of design or moderation).
Our mission at SuperAwesome is to make the internet safer for kids. To do this we built a technology platform that helps all players in the digital ecosystem create kid-safe digital engagement (apps and sites but also more specific functionality like video, safe ad monetization, authentication and community functionality).
At a technology level, we want to maximise our time building technology that keeps kids safe online. We also do our best to keep our platform simple. We go for tools and technologies that are maintainable, that allow us to get things off the ground quickly, and can be improved in an iterative way.
From development to production
Our journey with containers started four years ago. At the time, to get their development environment up and running, developers had to clone a few repositories, build some virtual machines, and run them on their laptops. This was resource intensive and unreliable. It took new developers on our team a few days to even get set up. Then, that environment would use almost all the resources on their laptops.
To solve this problem, we introduced containers into our development process. We took all these projects and dockerized them. We got rid of all the virtual machines but one (docker for Mac did not exist at the time), installed docker and docker-compose on it, and ran our projects in containers.
At the time, our production environments were managed by an external company. We had one consultant working full time on it. One day, he went on holiday and that external company sent us a replacement. The new guy made a manual change in production, brought our whole system down, and then couldn’t fix it. The team and I spent the whole day and night recovering from it. After that, we took our infrastructure back in-house and used containers in production.
To start, we improved what we had on EC2 in the same way as our development environment. Over the following months, the team created more and more small services. This increased the number of EC2 instances that we were running, and our infrastructure costs. To keep them under control, we decided to switch to something more adapted to containers. Kubernetes was in its infancy so we opted for AWS Elastic Container Service (ECS). It did not have many features but we know that it would be stable and reliable.
During the following two years, Kubernetes changed quite a lot. It got more features, became more stable, and grew its community. At some point, the benefits of switching were higher than the cost and the time it would take us. We created two Kubernetes clusters and a script to transform our ECS tasks into helm charts, and we switched to Kubernetes.
Switching to EKS
Over these four years, SuperAwesome changed. We went from running a handful of EC2 instances in one region to hundreds in different regions and AWS accounts. From a few hundred of HTTP requests per minute to tens of thousands per second. From a few million analytics events per day to more than a billion. And from two Kubernetes clusters to more than twenty…
Growth brought complexity. To keep things simple and manageable, we switched to AWS’s Elastic Container Service for Kubernetes (EKS). This allowed us to keep benefit from all the features of Kubernetes without the hassle of managing the clusters ourselves.
What should we manage?
To run a Kubernetes cluster, you will need to run masters (the control plane) and nodes (the data plane). If you want a highly available cluster, you will want to run at least three masters in different availability zones. On these masters, you will need to run a few services. For example, etcd, which is a distributed key value store used by Kubernetes to store its data. You will need to set it up, secure it, back it up, and think about disaster recovery. As your cluster grows, you will need to think if you want to scale your masters up (add more CPU and memory) or out (start more of them).
While these are extremely interesting engineering challenges, solving them does not bring us closer to our goal of a safer internet for kids. Every hour spent on this is an hour that we are not spending making kidtech.
We still want to manage the nodes ourselves. We want to run some extra security tools and services and all of the nodes. Also, we want to have the ability to take advantage of spot instances. Today, 90% to 95% of our production workloads run on spot instances on Kubernetes.
We were already using a number of AWS services. Wherever possible, we set them up using infrastructure as code with Terraform. EKS has a terraform integration so it was easy to integrate with our existing automation.
Interestingly, using EKS is also cheaper. When we ran our own Kubernetes clusters, we would use 3 on-demand m4.large instances (to avoid t2/t3 instances that have CPU credits), 100GB of disk space, and we would monitor them using Datadog (15$/master). This would cost us around $291 per cluster per month ($0.100/hour * 720 hours + $0.10/GB * 100GB + 15$). With EKS, we pay a flat fee of $144 ($0.20/hour * 720 hours).
Security and networking
When dealing with containers, two of the recurring discussion points are security and networking. It would take a long time to go indepth on these topics, but here are two things that became easier to manage by switching to EKS.
When accessing some services provided by AWS, you will need to go through IAM. You will authenticate with it. Then, the AWS API will check whether you have the right to perform the actions that you are trying to perform (authorization).
With Kubernetes, you will do something similar. First, you will need to authenticate with the Kubernetes API. Then, using something like RBAC, Kubernetes will check that you are able to read/create/modify Kubernetes objects.
With EKS, you can authenticate to Kubernetes using the aws-iam-authenticator. In our case, this drastically reduced the number of kubeconfig files that we had to manage.
When running containers, you will have something similar to the setup below. Your EC2 instances will get their IP addresses from one range, and the containers from another. In our case, this adds some management overhead as we run clusters in multiple AWS regions and sometimes we need to have pod-to-pod communication between clusters.
With the AWS CNI, pods are given an IP address on the VPC. And, with a small amount of configuration, you can have that pod-to-pod communication.
Once again, these things can be done without EKS but that work would be undifferentiated heavy lifting. The aws-iam-authenticator can be used with clusters created with other tools like kops. Pod-to-pod communication across clusters can be done with other CNIs (or dealing with things like hostPort). For us, that extra work would have a big cost of opportunity.
If you run any kind of public API in your containers, you will need to somehow make them accessible to the outside world. On Kubernetes, you create ingresses and run a reverse proxy (= ingress controller in this context). If, like us, you have a very dynamic environment, you would also need to create a network level load balancer.
Your traffic would flow like in the diagram below. The connections go to the load balancer that forward them to one of the reverse proxies where you terminate the SSL connections. Then, the requests are sent to the correct pods. If you have a highly available cluster, your traffic may well cross availability zone boundaries twice and you will be charged for it twice (on AWS, traffic inside one availability zone is free).
There are different ways to solve this. Most solutions will force you to add some extra configuration or run some extra code to avoid having an unbalanced cluster.
To solve this for SuperAwesome, we simply got rid of our reverse proxy and switched to the application load balancer. As we are using EKS, it is aware of the ingresses defined on Kubernetes and it can forward the traffic directly to the right pods.
We started with a very manual and unmaintainable setup. Slowly, we switched to containers, started deploying our code to more regions and environments, automated (almost) everything, and made the system more cost effective.
Now, more than 90% of our workloads run on spot instances (including some stateful services), which saves us a lot of money. By switching to EKS, we reduced the fixed cost of our Kubernetes clusters by 47% and reduced the time spent on maintenance by 80%. This also let us to take advantage of the Amazon VPC CNI Plugin and the Application Load Balancer to simplify our networking and reduce our data transfer costs.
All of this allowed us to stop doing undifferentiated work and focus on what really matters to us, making the internet safer for kids. To learn more about why it is so important, have a look at the PwC Kids Digital Media Report 2019.
This was originally published on AWS Startups Blog.
Do you have any questions? Would you have done things differently? If so, please don’t hesitate to get in touch!