Kubernetes Cluster Autoscaling: Karpenter vs Cluster Autoscaler

In this blog, we will explore how Karpenter works as compared to Cluster Autoscaler.

⌛ tl;dr: This blog covers the fundamental differences between Cluster Autoscaler and Karpenter, followed by a demo that scales up the same real-world cluster with each autoscaler for a sample workload with 100 replicas.

AWS Announcement at re:Invent 2021 📣

AWS has released the GA version of Karpenter at re:Invent 2021. 🎉 🎁

Karpenter is an open-source, flexible, high-performance Kubernetes Cluster Autoscaler built with AWS.

This means it's officially ready for production workloads, as per AWS. However, it had been publicly discussed for almost a year before GA; some of the conference talks can be found on Karpenter's GitHub repo.

The official blog announcement promises support for K8s clusters running in any environment:

📋 Karpenter is an open-source project licensed under the Apache License 2.0. It is designed to work with any Kubernetes cluster running in any environment, including all major cloud providers and on-premises environments.

💡 Currently, it only supports AWS Cloud, but do keep an eye on the Karpenter project roadmap if you use other underlying cloud providers or on-premises datacenters.

Kubernetes Autoscaling Capabilities

Alright! Before we get started comparing Cluster Autoscaler with Karpenter, let's quickly go over the autoscaling capabilities Kubernetes offers.

📋 Kubernetes enables autoscaling at the node level as well as at the pod level. These two are different but fundamentally connected layers of Kubernetes architecture.

  • Pod-based autoscaling: Horizontal Pod Autoscaler (HPA) and Vertical Pod Autoscaler (VPA)
  • Node-based autoscaling: Cluster Autoscaler and Karpenter
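
💡 Just for context on the pod-level side (which this post won't cover further), a Horizontal Pod Autoscaler can be created with a one-liner; the Deployment name web is a placeholder and the metrics server needs to be installed in the cluster:

kubectl autoscale deployment web --cpu-percent=50 --min=2 --max=10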

📕 NOTE: This post will only explore the aforementioned Kubernetes Cluster Autoscalers.

Kubernetes Cluster Autoscalers: What has changed with Karpenter?

Cluster Autoscaler is a Kubernetes tool that increases or decreases the size of a Kubernetes cluster (by adding or removing nodes), based on the presence of pending pods and node utilization metrics.

It automatically adjusts the size of the Kubernetes cluster when one of the following conditions is true:

  • there are pods that failed to run in the cluster due to insufficient resources.
  • there are nodes in the cluster that have been underutilized for an extended period of time and their pods can be placed on other existing nodes.

Karpenter automatically provisions new nodes in response to unschedulable pods. Karpenter does this by observing events within the Kubernetes cluster, and then sending commands to the underlying cloud provider.

Karpenter works by:

  • Watching for pods that the Kubernetes scheduler has marked as unschedulable
  • Evaluating scheduling constraints (resource requests, node selectors, affinities, tolerations, and topology spread constraints) requested by the pods
  • Provisioning nodes that meet the requirements of the pods
  • Scheduling the pods to run on the new nodes
  • Removing the nodes when the nodes are no longer needed
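
The pods Karpenter acts on are the ones the scheduler could not place. You can list those pending pods yourself at any time; this is just a plain kubectl query, independent of Karpenter:

kubectl get pods --all-namespaces --field-selector=status.phase=Pending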

Karpenter has two control loops that maximize the availability and efficiency of your cluster.

  • Allocator - Fast-acting controller ensuring that pods are scheduled as quickly as possible
  • Reallocator - Slow-acting controller that replaces nodes as pod capacity shifts over time.

Architecture

Cluster Autoscaler watches for pods that fail to schedule and for nodes that are underutilized. It then simulates the addition or removal of nodes before applying the change to your cluster.

The AWS Cloud Provider implementation within Cluster Autoscaler controls the .DesiredReplicas field of your EC2 Auto Scaling Groups. The Kubernetes Cluster Autoscaler automatically adjusts the number of nodes in your cluster when pods fail to launch due to a lack of resources, or when nodes are underutilized and their pods can be rescheduled onto other nodes. The Cluster Autoscaler is typically installed as a Deployment in your cluster. It uses leader election to ensure high availability, but scaling is done by only one replica at a time.
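
If you want to watch Cluster Autoscaler's effect on the ASG directly, you can query the desired capacity with the AWS CLI; the ASG name below is a placeholder for your node group's Auto Scaling Group:

aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names <your-node-group-asg> \
  --query "AutoScalingGroups[].{Name:AutoScalingGroupName,Desired:DesiredCapacity,Min:MinSize,Max:MaxSize}" \
  --output table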

cluster_autoscaler.png

Karpenter works in tandem with the Kubernetes scheduler by observing incoming pods over the lifetime of the cluster. It launches or terminates nodes to maximize application availability and cluster utilization. When there is enough capacity in the cluster, the Kubernetes scheduler will place incoming pods as usual.

When pods are launched that cannot be scheduled using the existing capacity of the cluster, Karpenter bypasses the Kubernetes scheduler and works directly with your provider’s compute service, (for example, Amazon EC2), to launch the minimal compute resources needed to fit those pods and binds the pods to the nodes provisioned. As pods are removed or rescheduled to other nodes, Karpenter looks for opportunities to terminate under-utilized nodes.
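
Nodes launched by Karpenter carry a provisioner label, so you can list them separately from your node-group nodes. The label key below matches Karpenter v0.5.x and may change in later releases:

kubectl get nodes -l karpenter.sh/provisioner-name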

karpenter.png

Detailed Kubernetes Autoscaling guidelines from AWS can be found here.

Karpenter claims to offer the following improvements

  • Designed to handle the full flexibility of the cloud: Karpenter has the ability to efficiently address the full range of instance types available through AWS. Cluster Autoscaler was not originally built with the flexibility to handle hundreds of instance types, zones, and purchase options.

  • Group-less node provisioning: Karpenter manages each instance directly, without use of additional orchestration mechanisms like node groups. This enables it to retry in milliseconds instead of minutes when capacity is unavailable. It also allows Karpenter to leverage diverse instance types, availability zones, and purchase options without the creation of hundreds of node groups.

  • Scheduling enforcement: Cluster Autoscaler doesn’t bind pods to the nodes it creates. Instead, it relies on the kube-scheduler to make the same scheduling decision after the node has come online. A node that Karpenter launches has its pods bound immediately. The kubelet doesn’t have to wait for the scheduler or for the node to become ready. It can start preparing the container runtime immediately, including pre-pulling the image. This can shave seconds off of node startup latency.

Let's get down to business 🔨

📋 Installing Karpenter and testing autoscaling on an existing EKS Cluster (with Cluster Autoscaler).

In this section, we will create the required IAM/K8s resources for Karpenter using Terraform and Helm. Once we are all set up then we will give it a go and check how quickly Karpenter scales up your cluster.

Install these tools before proceeding: the AWS CLI, kubectl, Terraform, and Helm.

We are assuming that the underlying VPC and network resources are already created along with the EKS cluster and we are adding Karpenter resources on top of that.

💡 Login to the AWS CLI with a user that has sufficient privileges to access the EKS cluster and create required k8s objects for this demo.

📕 NOTE: Please make sure that subnets wherein you want to scale out your node capacity, are tagged with "kubernetes.io/cluster/${var.cluster_name}" = "owned".

If you are using Terraform, then it's as simple as adding these subnet tags in your VPC module:

module "vpc" {
  source = "terraform-aws-modules/vpc/aws"
  ...
  private_subnet_tags = {
    "kubernetes.io/cluster/${var.cluster_name}" = "owned"
  }
  ...
}
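
Once the tags are applied, you can double-check which subnets Karpenter will be able to discover; replace the cluster name placeholder with your own:

aws ec2 describe-subnets \
  --filters "Name=tag-key,Values=kubernetes.io/cluster/<your-cluster-name>" \
  --query "Subnets[].SubnetId" --output text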

Configure the KarpenterNode IAM Role

The EKS module creates an IAM role for worker nodes. We'll reuse that role for Karpenter (so we don't have to reconfigure the aws-auth ConfigMap), but we need to attach one more policy and create an instance profile. Create a karpenter.tf in your Terraform repo and add the snippet below:

data "aws_iam_policy" "ssm_managed_instance" {
  arn = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
}

resource "aws_iam_role_policy_attachment" "karpenter_ssm_policy" {
  role       = module.eks.worker_iam_role_name
  policy_arn = data.aws_iam_policy.ssm_managed_instance.arn
}

resource "aws_iam_instance_profile" "karpenter" {
  name = "KarpenterNodeInstanceProfile-${var.cluster_name}"
  role = module.eks.worker_iam_role_name
}

Now, Karpenter can use this instance profile to launch new EC2 instances and those instances will be able to connect to your cluster.
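
You can verify the instance profile after terraform apply; again, substitute your cluster name:

aws iam get-instance-profile \
  --instance-profile-name KarpenterNodeInstanceProfile-<your-cluster-name> \
  --query "InstanceProfile.Roles[].RoleName" --output text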

Configure the KarpenterController IAM Role

module "iam_assumable_role_karpenter" {
  source                        = "terraform-aws-modules/iam/aws//modules/iam-assumable-role-with-oidc"
  version                       = "4.7.0"
  create_role                   = true
  role_name                     = "karpenter-controller-${var.cluster_name}"
  provider_url                  = module.eks.cluster_oidc_issuer_url
  oidc_fully_qualified_subjects = ["system:serviceaccount:karpenter:karpenter"]
}

resource "aws_iam_role_policy" "karpenter_contoller" {
  name = "karpenter-policy-${var.cluster_name}"
  role = module.iam_assumable_role_karpenter.iam_role_name

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = [
          "ec2:CreateLaunchTemplate",
          "ec2:CreateFleet",
          "ec2:RunInstances",
          "ec2:CreateTags",
          "iam:PassRole",
          "ec2:TerminateInstances",
          "ec2:DescribeLaunchTemplates",
          "ec2:DescribeInstances",
          "ec2:DescribeSecurityGroups",
          "ec2:DescribeSubnets",
          "ec2:DescribeInstanceTypes",
          "ec2:DescribeInstanceTypeOfferings",
          "ec2:DescribeAvailabilityZones",
          "ssm:GetParameter"
        ]
        Effect   = "Allow"
        Resource = "*"
      },
    ]
  })
}

Karpenter requires permissions like launching instances, which means it needs an IAM role that grants it access. The config above creates an AWS IAM role, attaches a policy, and authorizes the ServiceAccount to assume the role using IRSA. We will create the ServiceAccount and connect it to this role during the Helm chart install.
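
The Helm install in the next step needs the ARN of this role (referenced as ${IAM_ROLE_ARN} below). You can export it from a Terraform output, or look it up with the AWS CLI, for example:

IAM_ROLE_ARN=$(aws iam get-role \
  --role-name karpenter-controller-<your-cluster-name> \
  --query "Role.Arn" --output text)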

Install Karpenter Helm Chart

Use Helm to deploy Karpenter to the cluster. We are going to use a Helm release to do the deployment, passing in the cluster details and the IAM role Karpenter needs to assume.

~ helm repo add karpenter https://charts.karpenter.sh
"karpenter" has been added to your repositories
~ helm repo update
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "karpenter" chart repository
...Successfully got an update from the "kubernetes-dashboard" chart repository
Update Complete. ⎈Happy Helming!⎈
~ helm upgrade --install karpenter karpenter/karpenter --namespace karpenter \
--create-namespace --set serviceAccount.create=true --version 0.5.3 \
--set serviceAccount.annotations.eks\\.amazonaws\\.com/role-arn=${IAM_ROLE_ARN} \
--set controller.clusterName=${CLUSTER_NAME} \
--set controller.clusterEndpoint=$(aws eks describe-cluster --name ${CLUSTER_NAME} --region ${REGION} --profile ${AWS_PROFILE} --query "cluster.endpoint" --output json) --wait
Release "karpenter" does not exist. Installing it now.
NAME: karpenter
LAST DEPLOYED: Tue Dec 28 04:02:53 2021
NAMESPACE: karpenter
STATUS: deployed
REVISION: 1
TEST SUITE: None

This should create the following resources in karpenter namespace:

~ kubectl get all
NAME                                        READY   STATUS    RESTARTS   AGE
pod/karpenter-controller-64754574df-gqn86   1/1     Running   0          29s
pod/karpenter-webhook-7b88b965bc-jcvhg      1/1     Running   0          29s

NAME                        TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
service/karpenter-metrics   ClusterIP   10.x.x.x   <none>        8080/TCP   29s
service/karpenter-webhook   ClusterIP   10.x.x.x    <none>        443/TCP    29s

NAME                                   READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/karpenter-controller   1/1     1            1           30s
deployment.apps/karpenter-webhook      1/1     1            1           30s

NAME                                              DESIRED   CURRENT   READY   AGE
replicaset.apps/karpenter-controller-64754574df   1         1         1       30s
replicaset.apps/karpenter-webhook-7b88b965bc      1         1         1       30s

Also, it should create a Service Account:

~ kubectl describe sa karpenter -n karpenter
Name:                karpenter
Namespace:           karpenter
Labels:              app.kubernetes.io/managed-by=Helm
Annotations:         eks.amazonaws.com/role-arn: <Obfuscated_IAM_Role_ARN>
                     meta.helm.sh/release-name: karpenter
                     meta.helm.sh/release-namespace: karpenter
Image pull secrets:  image-pull-secret
Mountable secrets:   karpenter-token-dwwrs
Tokens:              karpenter-token-dwwrs
Events:              <none>

Configure a Karpenter Provisioner

A single Karpenter provisioner is capable of handling many different pod shapes. Karpenter makes scheduling and provisioning decisions based on pod attributes such as labels and affinity. In other words, Karpenter eliminates the need to manage many different node groups.

Create a default provisioner using the command below. This provisioner configures instances to connect to your cluster’s endpoint and discovers resources like subnets and security groups using the cluster’s name.

The ttlSecondsAfterEmpty value configures Karpenter to terminate empty nodes. This behavior can be disabled by leaving the value undefined.

Review the provisioner CRD for more information. For example, ttlSecondsUntilExpired configures Karpenter to terminate nodes when a maximum age is reached.
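
For instance, adding a value like the one below to the provisioner spec would recycle nodes after roughly 30 days; it is not used in this demo:

  ttlSecondsUntilExpired: 2592000 # ~30 days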

📕 NOTE: This provisioner will create capacity as long as the sum of all created capacity is less than the specified limit.

cat <<EOF | kubectl apply -f -
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot"]
  limits:
    resources:
      cpu: 1000
  provider:
    instanceProfile: KarpenterNodeInstanceProfile-${CLUSTER_NAME}
  ttlSecondsAfterEmpty: 30
EOF

Alright, so now we have a Karpenter provisioner that supports Spot capacity type.

In a real-world scenario, you might have to manage a variety of Karpenter provisioners that can support your workloads. Please go through the Provisioner API documentation to know more about the configuration options.
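
To illustrate, a second, more constrained provisioner could look something like the sketch below. The name, taint, instance types, and CPU limit are made up for illustration and are not part of this demo:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: on-demand-compute # hypothetical provisioner for latency-sensitive workloads
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["on-demand"]
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["c5.2xlarge", "c5.4xlarge"]
  taints:
    - key: dedicated
      value: compute
      effect: NoSchedule
  limits:
    resources:
      cpu: 200
  provider:
    instanceProfile: KarpenterNodeInstanceProfile-${CLUSTER_NAME}
  ttlSecondsAfterEmpty: 30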

Let's see how the Cluster scales

So, now, all we need to do is deploy a sample app and see how it scales via Cluster Autoscaler and Karpenter respectively.

kubectl create deployment inflate --image=public.ecr.aws/eks-distro/kubernetes/pause:3.2

Also, let's set some resource requests for the vanilla inflate deployment:

kubectl set resources deployment inflate --requests=cpu=100m,memory=256Mi

Alright, please scale down your Karpenter controller deployment to 0 before we test autoscaling with Cluster Autoscaler.

kubectl scale deployment karpenter-controller -n karpenter --replicas=0

And, finally, go ahead and scale the inflate deployment

🔥Please note this may incur cost to your AWS cloud bill, so be careful. 🔥

I'm scaling it to 100 replicas for a few minutes to see how Cluster Autoscaler processes it:

kubectl scale deployment inflate -n default --replicas 100

The Cluster Autoscaler logs show:

I1228 13:09:23.880213       1 static_autoscaler.go:194] Starting main loop
I1228 13:09:23.880988       1 filter_out_schedulable.go:66] Filtering out schedulables
E1228 13:09:23.881082       1 utils.go:60] pod.Status.StartTime is nil for pod inflate-866ccdf4c8-tdrmg. Should not reach here.
.....

Cluster Autoscaler works with node groups, i.e. the ASGs, to scale out or in based on the dynamic workloads. To get efficient autoscaling with Cluster Autoscaler, there are quite a few considerations that need to be reviewed and applied according to your requirements. More details here.

This demo was done on a test cluster that has an ASG with identical instance specifications, using the instance type c5.large.

ca_workload_status.png

The inflate deployment kept waiting for Cluster Autoscaler to provide capacity for the remaining 82 pods. I scaled the deployment down after 10 minutes. It took Cluster Autoscaler 6 minutes to trigger a scale-down event.

ASG Activity shows these two events as the latest:

Status: Successful | Description: Terminating EC2 instance: i-xxx
Cause: At 2021-12-28T13:39:36Z instance i-xxx was taken out of service in response to a user request, shrinking the capacity from 3 to 2.
Start: 2021 December 28, 02:39:36 PM +01:00 | End: 2021 December 28, 02:40:38 PM +01:00

Status: Successful | Description: Launching a new EC2 instance: i-xxx
Cause: At 2021-12-28T13:23:29Z a user request explicitly set group desired capacity changing the desired capacity from 2 to 3. At 2021-12-28T13:23:35Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 2 to 3.
Start: 2021 December 28, 02:23:38 PM +01:00 | End: 2021 December 28, 02:23:55 PM +01:00

Now, please scale down your Cluster Autoscaler and scale up the Karpenter Controller and let’s monitor the logs:
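
The deployment names below assume the common cluster-autoscaler install in the kube-system namespace and the Helm release from earlier; adjust them if yours differ (and add -c <container> to the logs command if the controller pod runs more than one container):

kubectl scale deployment cluster-autoscaler -n kube-system --replicas=0
kubectl scale deployment karpenter-controller -n karpenter --replicas=1
kubectl logs -f deployment/karpenter-controller -n karpenter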

2021-12-28T14:23:58.816Z    INFO    controller.provisioning    Batched 89 pods in 4.587455594s    {"commit": "5047f3c", "provisioner": "default"}
2021-12-28T14:23:58.916Z    INFO    controller.provisioning    Computed packing of 1 node(s) for 89 pod(s) with instance type option(s) [m5zn.3xlarge c3.4xlarge c4.4xlarge c5ad.4xlarge c5a.4xlarge c5.4xlarge c5d.4xlarge c5n.4xlarge m5ad.4xlarge m5n.4xlarge m5.4xlarge m5a.4xlarge m6i.4xlarge m5d.4xlarge m5dn.4xlarge m4.4xlarge r3.4xlarge r4.4xlarge r5b.4xlarge r5d.4xlarge]    {"commit": "5047f3c", "provisioner": "default"}
2021-12-28T14:24:01.013Z    INFO    controller.provisioning    Launched instance: i-xxx, hostname: ip-10-x-x-x.eu-x-1.compute.internal, type: c5.4xlarge, zone: eu-x-1c, capacityType: spot    {"commit": "5047f3c", "provisioner": "default"}
2021-12-28T14:24:01.222Z    INFO    controller.provisioning    Bound 89 pod(s) to node ip-10-x-x-x.eu-x-1.compute.internal    {"commit": "5047f3c", "provisioner": "default"}
2021-12-28T14:24:01.222Z    INFO    controller.provisioning    Waiting for unschedulable pods    {"commit": "5047f3c", "provisioner": "default"}

So, basically, Karpenter detects that there are unschedulable pods in the cluster, computes the packing of those pods onto candidate instance types, and provisions the best-suited Spot instance from the available options.

As you can see, it took 3 minutes for the inflate deployment with 100 replicas to be fully deployed.

karpenter_workload_status.png

Finally, let’s wrap it up by scaling down the deployment to 0 replicas.
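
The scale-down simply mirrors the earlier scale-up command:

kubectl scale deployment inflate -n default --replicas 0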

2021-12-28T14:31:12.364Z    INFO    controller.node    Added TTL to empty node    {"commit": "5047f3c", "node": "ip-10-x-x-x.eu-x-1.compute.internal"}
2021-12-28T14:31:42.391Z    INFO    controller.node    Triggering termination after 30s for empty node    {"commit": "5047f3c", "node": "ip-10-x-x-x.eu-x-1.compute.internal"}
2021-12-28T14:31:42.418Z    INFO    controller.termination    Cordoned node    {"commit": "5047f3c", "node": "ip-10-x-x-x.eu-x-1.compute.internal"}
2021-12-28T14:31:42.620Z    INFO    controller.termination    Deleted node    {"commit": "5047f3c", "node": "ip-10-x-x-x.eu-x-1.compute.internal"}

It took 30 seconds for Karpenter to terminate the node once there was no pod scheduled on it.

Conclusion

The Karpenter project is exciting and breaks away from the old-school Cluster Autoscaler way of doing things. It is VERY efficient and fast, but also not as battle-tested as Cluster Autoscaler.

📋 Please note that the above experiment with Cluster Autoscaler could be greatly improved if we set up the right instance types and autoscaling policies. However, that's the whole point of this comparison: with Karpenter, you don't have to get all of these configurations right beforehand. You might end up with a lot of provisioners for your different workloads, but that's something I want to cover in the next blog post in this series.

If you are running Kubernetes then I’m sure you will enjoy spinning it up and playing around with it.

Good luck!

Also, do let me know if you have any questions or if there's something you would have done differently.

Next Steps

  • Karpenter is an open-source project with a roadmap open to all. I'm definitely going to watch that space.
  • Something else that I will be keeping an eye on would be Karpenter issues that people might face.
  • And, as an extension to this blog, we will be writing about our experience of migrating a real-world cluster from Cluster Autoscaler to Karpenter.