AWS EKS Subnet IP Address Exhaustion

EDWIN PHILIP
Nov 21, 2022

I am Edwin Philip, a Senior DevOps Engineer at HackerRank.

At HackerRank, almost all of our applications run as containers in Kubernetes pods, and we run all of these workloads on AWS EKS.

Problem

In AWS EKS, workload scaling usually depends on the available subnet IP addresses: the IPs still remaining in a subnet that can be used when a pod is created and assigned to a worker node. Most often the problem is not that every subnet is exhausted, but that an EC2 instance gets created in a subnet with only three or fewer IPs left. When an EC2 instance is launched in such a subnet, it still joins the EKS cluster, and pods are scheduled based on the instance's allocatable resources. Since the instance in the exhausted subnet has enough CPU and memory, pods get scheduled onto it anyway.

Once scheduled, these pods get stuck in ContainerCreating with “failed to assign an IP address to container”, and while debugging you will eventually find that the subnet has run out of IP addresses.
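To confirm the cause, you can inspect the pod events and the subnet's remaining capacity. A minimal sketch, using a hypothetical pod name and subnet ID:

# Look for the "failed to assign an IP address" event on the stuck pod
kubectl describe pod <podname> | grep -i "failed to assign an IP"

# Check how many IPs are still free in the suspect subnet (subnet-0abc123 is hypothetical)
aws ec2 describe-subnets --subnet-ids subnet-0abc123 \
  --query 'Subnets[*].AvailableIpAddressCount' --output text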

Solution 1

As a temporary solution, you can drain the worker node using the following kubectl command:

kubectl drain <nodename> --delete-local-data --ignore-daemonsets

Replace <nodename> with the name of your worker node.
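To find which worker nodes actually sit in the exhausted subnet (so you know what to drain), you can filter EC2 instances by subnet. A sketch with a hypothetical subnet ID:

# List running instances in the exhausted subnet (subnet-0abc123 is hypothetical)
aws ec2 describe-instances \
  --filters "Name=subnet-id,Values=subnet-0abc123" "Name=instance-state-name,Values=running" \
  --query 'Reservations[*].Instances[*].PrivateDnsName' --output text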

Solution 2

Even though Solution 1 works, new pods created during scaling or during a rolling deployment update will be scheduled back onto the same node and will again get stuck in the ContainerCreating state.

As an extension of Solution 1, you can drain and then taint the node by running the following command:

kubectl taint node <nodename> NoIpAddressAvailable=true:NoSchedule

The above adds the taint NoIpAddressAvailable=true:NoSchedule to the worker node, which stops new pods from being scheduled onto that worker node (EC2 instance).
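Once IP addresses free up in the subnet, the taint can be removed with kubectl's trailing-dash syntax:

kubectl taint node <nodename> NoIpAddressAvailable=true:NoSchedule-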

Solution 3

Both of the above solutions need someone to watch for IP address exhaustion in the subnets and run the commands manually.

Instead of doing this manually, we will create a DaemonSet with tolerations. The DaemonSet pod queries the EC2 metadata endpoint for the node's subnet and then looks up that subnet's available IP addresses. When the available IP count drops below a threshold, the DaemonSet pod taints the node before any other pod gets scheduled on the worker node.

Note that the available IP count reported for the subnet covers both primary and secondary IPs that can still be allocated in that subnet.

When the worker node is tainted and no pods are scheduled, Cluster Autoscaler scales down this EC2 worker node automatically.

We call this little node controller nodekicker.

The Dockerfile for nodekicker:

# Amazon Linux base image so the AWS CLI can be installed with yum
FROM amazonlinux
RUN yum install -y awscli
# The tainting logic lives in node_kicker.sh
COPY node_kicker.sh /
RUN chmod +x node_kicker.sh
ENTRYPOINT ["/node_kicker.sh"]

The node_kicker.sh shell script:

#!/bin/bash

# Download a static kubectl binary so the pod can taint and label its own node
curl -s -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
chmod +x kubectl

# Derive the Kubernetes node name from the EC2 instance metadata
#NODE_NAME=$(curl --silent http://169.254.169.254/latest/meta-data/hostname)
NODE_NAME=$(curl --silent http://169.254.169.254/latest/meta-data/local-hostname | cut -d '.' -f1).ec2.internal

# Find the primary ENI's MAC address, then the subnet and VPC it belongs to
INTERFACE=$(curl --silent http://169.254.169.254/latest/meta-data/network/interfaces/macs/ | head -n1 | cut -d/ -f1)
SUBNET_ID=$(curl --silent http://169.254.169.254/latest/meta-data/network/interfaces/macs/${INTERFACE}/subnet-id)
VPC_ID=$(curl --silent http://169.254.169.254/latest/meta-data/network/interfaces/macs/${INTERFACE}/vpc-id)

# How many IPs (primary + secondary) are still free in the node's subnet
AvailableIpAddressCount=$(aws ec2 describe-subnets --subnet-ids $SUBNET_ID --query 'Subnets[*].AvailableIpAddressCount' --region us-east-1 --output text)
echo "SubnetID: $SUBNET_ID"
echo "AvailableIpAddressCount : $AvailableIpAddressCount"

# Taint and label the node if the subnet is close to exhaustion
if [ $AvailableIpAddressCount -lt 20 ];
then
  ./kubectl taint node $NODE_NAME AvailableIpAddress=false:NoSchedule
  ./kubectl label node $NODE_NAME AvailableIpAddress=false
  echo "$NODE_NAME: Subnet does not have enough IPs available, tainting node"
fi

# Give the VPC CNI time to attach secondary IPs, then count the private IPs
# assigned to this instance across its ENIs
sleep 60
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
SECONDARY_IPS=$(aws ec2 describe-instances --region us-east-1 --instance-ids $INSTANCE_ID --query 'Reservations[*].Instances[*].NetworkInterfaces[*].PrivateIpAddresses[*].{PrivateIpAddress:PrivateIpAddress}' --output text | wc -l)

echo "$NODE_NAME: SECONDARY_IPS on Node : $SECONDARY_IPS"

# Too few secondary IPs means the CNI could not allocate them: taint the node
if [ $SECONDARY_IPS -lt 6 ];
then
  ./kubectl taint node $NODE_NAME SecondaryIpAddress=false:NoSchedule
  ./kubectl label node $NODE_NAME SecondaryIpAddress=false
else
  echo "$NODE_NAME: Enough secondary IPs available, not tainting"
fi

# Re-check after a couple of minutes and remove the taint if IPs became available
sleep 120

INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
SECONDARY_IPS=$(aws ec2 describe-instances --region us-east-1 --instance-ids $INSTANCE_ID --query 'Reservations[*].Instances[*].NetworkInterfaces[*].PrivateIpAddresses[*].{PrivateIpAddress:PrivateIpAddress}' --output text | wc -l)
echo "$NODE_NAME: SECONDARY_IPS on Node : $SECONDARY_IPS"

if [ $SECONDARY_IPS -gt 6 ];
then
  ./kubectl taint node $NODE_NAME SecondaryIpAddress=false:NoSchedule-
  ./kubectl label node $NODE_NAME SecondaryIpAddress-
  echo "$NODE_NAME: Enough secondary IPs available, untainting"
fi

# Keep the container (and therefore the DaemonSet pod) running
while true; do sleep 30; done

Docker commands to build and push the nodekicker:latest image:

docker build -t nodekicker:latest .
docker push nodekicker:latest
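If the image is hosted in Amazon ECR, for example, the push needs a registry login and a fully qualified tag. A sketch with a hypothetical account ID, region, and repository:

# 111122223333 and the us-east-1 repository are hypothetical placeholders
aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS --password-stdin 111122223333.dkr.ecr.us-east-1.amazonaws.com
docker tag nodekicker:latest 111122223333.dkr.ecr.us-east-1.amazonaws.com/nodekicker:latest
docker push 111122223333.dkr.ecr.us-east-1.amazonaws.com/nodekicker:latest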

The following K8s manifest creates the DaemonSet, a ClusterRoleBinding, and a ServiceAccount:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nodekicker
  namespace: kube-system
  labels:
    k8s-app: nodekicker
spec:
  selector:
    matchLabels:
      k8s-app: nodekicker
  template:
    metadata:
      labels:
        k8s-app: nodekicker
    spec:
      containers:
        - name: nodekicker
          image: nodekicker:latest
          imagePullPolicy: Always
      restartPolicy: Always
      terminationGracePeriodSeconds: 10
      dnsPolicy: ClusterFirst
      serviceAccountName: nodekicker
      serviceAccount: nodekicker
      hostNetwork: true
      securityContext: {}
      tolerations:
        - operator: Exists
      priorityClassName: system-node-critical
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 10%
  revisionHistoryLimit: 10
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: nodekicker
subjects:
  - kind: ServiceAccount
    name: nodekicker
    namespace: kube-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: 'system:node'
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: nodekicker
  namespace: kube-system
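Assuming the manifest above is saved as nodekicker.yaml (a hypothetical filename), you can apply it and check that the pods are running and that nodes pick up the taint:

kubectl apply -f nodekicker.yaml
kubectl get daemonset nodekicker -n kube-system
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints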

Though this solution works, one issue is that the DaemonSet pod running on every node itself takes up an IP address. Some improvements can also be made to the script, such as checking the available IP count continuously in a while loop instead of only once at startup.

Permanent Solution

Though Solution 3 works, it has some disadvantages, such as the DaemonSet pod taking up an IP address on every node. The permanent solution is to add a secondary CIDR to the EKS VPC so pods get their IPs from a much larger range.
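A minimal sketch of that approach, assuming a hypothetical VPC ID and the 100.64.0.0/16 range for pods:

# vpc-0abc123 and the CIDR ranges are hypothetical; attach a secondary CIDR to the EKS VPC
aws ec2 associate-vpc-cidr-block --vpc-id vpc-0abc123 --cidr-block 100.64.0.0/16

# Carve larger pod subnets out of the secondary CIDR, one per availability zone
aws ec2 create-subnet --vpc-id vpc-0abc123 --availability-zone us-east-1a --cidr-block 100.64.0.0/19
aws ec2 create-subnet --vpc-id vpc-0abc123 --availability-zone us-east-1b --cidr-block 100.64.32.0/19

With the secondary CIDR in place, the VPC CNI's custom networking (per-AZ ENIConfig objects plus the AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG setting on the aws-node DaemonSet) lets pods draw their IPs from these new subnets instead of the node subnets.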

Please comment if you have faced this issue and share how you solved it.
