AWS EKS Subnet IP Address Exhaustion
I am Edwin Philip, Senior DevOps Engineer at HackerRank.
At HackerRank, almost all of our applications are containerized and run as Kubernetes pods. We run all our workloads on AWS EKS.
Problem
In AWS EKS, workload scaling usually depends on the number of available subnet IP addresses. The available subnet IPs are the remaining IPs that can be used when a pod is created and assigned to a worker node. Most often the problem is not that the subnets are completely exhausted, but that an EC2 instance gets created in a subnet that has only 3 or fewer IPs left. When an EC2 instance is created in such a subnet, it still joins the EKS cluster as a node. Pods are scheduled based on the node's allocatable resources, and the instance in the exhausted subnet has enough CPU and memory, so pods get scheduled onto it.
Once scheduled, these pods get stuck in ContainerCreating with “failed to assign an IP address to container”, and when you debug it you will end up finding that the subnet ran out of IP addresses.
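To make the numbers concrete, here is a small sketch of the arithmetic behind the problem. It assumes the default VPC CNI behaviour (one VPC IP per pod) and uses two hypothetical helper names, subnet_usable_ips and max_pods_for_instance, that are not part of any AWS tool. AWS reserves 5 addresses in every subnet, and the usual max-pods formula for a node is ENIs * (IPs per ENI - 1) + 2.

```shell
# Sketch only: helper names are hypothetical, not part of any AWS tooling.

# Usable IPv4 addresses in a subnet with the given prefix length.
# AWS reserves 5 addresses per subnet (the first four and the last one).
subnet_usable_ips() {
  prefix=${1:-}
  case "$prefix" in
    ''|*[!0-9]*) echo "prefix must be a number" >&2; return 1 ;;
  esac
  # AWS allows IPv4 subnets between /16 and /28
  if [ "$prefix" -lt 16 ] || [ "$prefix" -gt 28 ]; then
    echo "prefix must be between 16 and 28" >&2
    return 1
  fi
  echo $(( (1 << (32 - prefix)) - 5 ))
}

# Maximum pods on a node with the default VPC CNI settings:
# ENIs * (IPv4 addresses per ENI - 1) + 2
max_pods_for_instance() {
  enis=$1
  ips_per_eni=$2
  echo $(( enis * (ips_per_eni - 1) + 2 ))
}
```

For example, subnet_usable_ips 24 prints 251, and max_pods_for_instance 3 10 (the ENI limits of an m5.large) prints 29. A node that lands in a subnet with 3 free IPs can therefore accept far more pods by CPU and memory than it can ever get addresses for.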
Solution 1
As a temporary solution, you can drain the worker node using the following kubectl command
kubectl drain <nodename> --delete-local-data --ignore-daemonsets
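Replace <nodename> with the name of your worker node. On kubectl 1.20 and later, --delete-local-data is deprecated in favor of --delete-emptydir-data.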
Solution 2
Even though Solution 1 works, new pods created during scaling or during a rolling deployment update can be scheduled back onto the same node, where they will again be stuck in the ContainerCreating state.
As an extension of Solution 1, you can drain and taint the node by running the following command
kubectl taint node <nodename> NoIpAddressAvailable=true:NoSchedule
The above adds the taint NoIpAddressAvailable=true:NoSchedule to the worker node. This stops new pods from being scheduled on this worker node (EC2 instance).
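If you run the two commands often, they can be wrapped in one small function. This is only a sketch: quarantine_node is a hypothetical name, and the KUBECTL variable is an assumption added so the function can be dry-run with KUBECTL=echo. It taints first, so nothing new lands on the node while it drains, and it uses --delete-emptydir-data, the newer spelling of --delete-local-data.

```shell
# Sketch only: quarantine_node and KUBECTL are hypothetical names.
# Set KUBECTL=echo to print the commands instead of running them.
KUBECTL="${KUBECTL:-kubectl}"

quarantine_node() {
  node=${1:-}
  if [ -z "$node" ]; then
    echo "usage: quarantine_node <nodename>" >&2
    return 1
  fi
  # Taint first so no new pods are scheduled while the drain is in progress
  $KUBECTL taint node "$node" NoIpAddressAvailable=true:NoSchedule --overwrite &&
    $KUBECTL drain "$node" --ignore-daemonsets --delete-emptydir-data
}
```

To undo it later, the same taint can be removed by appending a minus sign to it: kubectl taint node <nodename> NoIpAddressAvailable=true:NoSchedule-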
Solution 3
Both of the solutions above need someone to watch for IP address exhaustion in the subnets and run the commands.
Instead of doing this manually, we will create a DaemonSet with tolerations. The DaemonSet pod queries the EC2 metadata endpoint for the node's subnet and then looks up the available IP address count of that subnet. When the count is below a threshold, the DaemonSet pod taints the node before any pod gets scheduled on it.
Checking the available IP addresses also means taking the total of the available primary and secondary IPs in the subnet.
When the worker node is tainted and no pods are scheduled on it, Cluster Autoscaler scales down this EC2 worker node automatically.
We call this little node controller nodekicker.
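The threshold check at the heart of nodekicker can be sketched as a tiny function. This is an illustration, not the script used below: decide_action is a hypothetical name, and the default threshold of 20 is taken from the script. It also treats an empty or non-numeric count (for example when the AWS call fails) as unknown instead of comparing it.

```shell
# Sketch only: decide_action is a hypothetical helper name.
# Prints "taint", "ok" or "unknown" for a given available IP count.
decide_action() {
  available=${1:-}
  threshold=${2:-20}
  case "$available" in
    # Empty or non-numeric input, e.g. the AWS API call failed
    ''|*[!0-9]*) echo "unknown"; return 0 ;;
  esac
  if [ "$available" -lt "$threshold" ]; then
    echo "taint"
  else
    echo "ok"
  fi
}
```

Keeping the decision separate from the kubectl calls makes it easy to try out thresholds without touching a real node.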
The Dockerfile for the nodekicker
FROM amazonlinux
RUN yum install -y awscli
COPY node_kicker.sh /
RUN chmod +x node_kicker.sh
ENTRYPOINT ["/node_kicker.sh"]
The node_kicker.sh shell script
#!/bin/bash
# Download the latest stable kubectl into the working directory
curl -s -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
chmod +x kubectl
# Node name as registered in the cluster. The .ec2.internal suffix matches the us-east-1 private DNS format.
#NODE_NAME=$(curl --silent http://169.254.169.254/latest/meta-data/hostname)
NODE_NAME=$(curl --silent http://169.254.169.254/latest/meta-data/local-hostname|cut -d '.' -f1).ec2.internal
# Subnet and VPC of the node's first network interface
INTERFACE=$(curl --silent http://169.254.169.254/latest/meta-data/network/interfaces/macs/ | head -n1)
SUBNET_ID=$(curl --silent http://169.254.169.254/latest/meta-data/network/interfaces/macs/${INTERFACE}/subnet-id)
VPC_ID=$(curl --silent http://169.254.169.254/latest/meta-data/network/interfaces/macs/${INTERFACE}/vpc-id)
# Free IP addresses left in the node's subnet (the region is hard-coded to us-east-1)
AvailableIpAddressCount=$(aws ec2 describe-subnets --subnet-ids "$SUBNET_ID" --query 'Subnets[*].AvailableIpAddressCount' --region us-east-1 --output text)
echo "SubnetID: $SUBNET_ID"
echo "AvailableIpAddressCount : $AvailableIpAddressCount"
# Taint and label the node when the subnet has fewer than 20 free IPs
if [ "$AvailableIpAddressCount" -lt 20 ];
then
  ./kubectl taint node "$NODE_NAME" AvailableIpAddress=false:NoSchedule
  ./kubectl label node "$NODE_NAME" AvailableIpAddress=false
  echo "$NODE_NAME: Subnet does not have enough IPs available, tainting node"
fi
# Wait a minute, then count the private IPs assigned to this instance's network interfaces
sleep 60
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
SECONDARY_IPS=$(aws ec2 describe-instances --region us-east-1 --instance-ids "$INSTANCE_ID" --query 'Reservations[*].Instances[*].NetworkInterfaces[*].PrivateIpAddresses[*].{PrivateIpAddress:PrivateIpAddress}' --output text | wc -l)
echo "$NODE_NAME: SECONDARY_IPS on Node : $SECONDARY_IPS"
# Taint and label the node when it holds fewer than 6 IPs
if [ "$SECONDARY_IPS" -lt 6 ];
then
  ./kubectl taint node "$NODE_NAME" SecondaryIpAddress=false:NoSchedule
  ./kubectl label node "$NODE_NAME" SecondaryIpAddress=false
else
  echo "$NODE_NAME: Enough secondary IPs available, not tainting"
fi
# Re-check two minutes later and remove the taint if the node now holds enough IPs
sleep 120
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
SECONDARY_IPS=$(aws ec2 describe-instances --region us-east-1 --instance-ids "$INSTANCE_ID" --query 'Reservations[*].Instances[*].NetworkInterfaces[*].PrivateIpAddresses[*].{PrivateIpAddress:PrivateIpAddress}' --output text | wc -l)
echo "$NODE_NAME: SECONDARY_IPS on Node : $SECONDARY_IPS"
# -ge (not -gt) so that a count of exactly 6 is handled the same way as in the check above
if [ "$SECONDARY_IPS" -ge 6 ];
then
  ./kubectl taint node "$NODE_NAME" SecondaryIpAddress=false:NoSchedule-
  ./kubectl label node "$NODE_NAME" SecondaryIpAddress-
  echo "$NODE_NAME: Enough secondary IPs available, untainting"
fi
# Keep the container alive so the DaemonSet pod is not restarted
while true; do sleep 30; done
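One thing to watch out for: the script calls the metadata endpoint without a session token (IMDSv1 style). On instances where IMDSv2 is required, those calls are rejected. A minimal sketch of a token-based helper is below. imds_get and IMDS_BASE are hypothetical names and the 300 second token TTL is an arbitrary choice; the token endpoint and the two headers are the standard IMDSv2 ones.

```shell
# Sketch only: imds_get and IMDS_BASE are hypothetical names.
IMDS_BASE="http://169.254.169.254/latest"

# Fetch a metadata path (e.g. instance-id) using an IMDSv2 session token.
imds_get() {
  path=${1:-}
  # Request a short-lived session token first
  token=$(curl --silent --fail -X PUT "$IMDS_BASE/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 300") || return 1
  # Then pass the token along with the actual metadata request
  curl --silent --fail -H "X-aws-ec2-metadata-token: $token" \
    "$IMDS_BASE/meta-data/$path"
}
```

With this in place, a line such as INSTANCE_ID=$(imds_get instance-id) could replace the plain curl calls in the script.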
Docker commands to build and push the nodekicker:latest Docker image
docker build -t nodekicker:latest .
docker push nodekicker:latest
The following K8s manifest creates a DaemonSet, ClusterRoleBinding and ServiceAccount
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nodekicker
  namespace: kube-system
  labels:
    k8s-app: nodekicker
spec:
  selector:
    matchLabels:
      k8s-app: nodekicker
  template:
    metadata:
      labels:
        k8s-app: nodekicker
    spec:
      containers:
        - name: nodekicker
          image: >-
            nodekicker:latest
          imagePullPolicy: Always
      restartPolicy: Always
      terminationGracePeriodSeconds: 10
      dnsPolicy: ClusterFirst
      serviceAccountName: nodekicker
      serviceAccount: nodekicker
      hostNetwork: true
      securityContext: {}
      tolerations:
        - operator: Exists
      priorityClassName: system-node-critical
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 10%
  revisionHistoryLimit: 10
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: nodekicker
subjects:
  - kind: ServiceAccount
    name: nodekicker
    namespace: kube-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: 'system:node'
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: nodekicker
  namespace: kube-system
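Though this solution works, one concern is that a DaemonSet pod running on every node would itself take an IP address (with hostNetwork: true, as in the manifest above, the pod shares the node's primary IP rather than consuming a secondary one). Some improvements can be made to the script, such as checking the available IP addresses in a while loop instead of only once at startup.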
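The while-loop improvement could look roughly like the sketch below. Everything here is an assumption for illustration: watch_subnet is a hypothetical name, and get_available_ips, taint_node and untaint_node are placeholder hooks that you would replace with the aws ec2 describe-subnets and kubectl taint calls from the script. The loop remembers whether it has already tainted the node, so it only calls kubectl when the state changes.

```shell
# Sketch only: all function names are hypothetical placeholders.
# Replace these three hooks with the real aws / kubectl calls.
get_available_ips() { echo 100; }
taint_node() { echo "taint"; }
untaint_node() { echo "untaint"; }

# watch_subnet <threshold> <interval-seconds> <max-checks>
# max-checks of 0 means loop forever.
watch_subnet() {
  threshold=${1:-20}
  interval=${2:-60}
  max_checks=${3:-0}
  checks=0
  state="ok"
  while [ "$max_checks" -eq 0 ] || [ "$checks" -lt "$max_checks" ]; do
    available=$(get_available_ips)
    case "$available" in
      # Skip this round if the count is empty or not a number
      ''|*[!0-9]*) available="" ;;
    esac
    if [ -n "$available" ]; then
      if [ "$available" -lt "$threshold" ] && [ "$state" = "ok" ]; then
        taint_node && state="tainted"
      elif [ "$available" -ge "$threshold" ] && [ "$state" = "tainted" ]; then
        untaint_node && state="ok"
      fi
    fi
    checks=$((checks + 1))
    sleep "$interval"
  done
}
```

Calling watch_subnet 20 60 at the end of the script would replace the final sleep loop and keep re-checking the subnet every minute.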
Permanent Solution
Though Solution 3 works, it has some disadvantages, such as running an extra DaemonSet pod on every node. The permanent solution is to add a secondary CIDR block to the EKS VPC.
Please comment with your ideas if you have faced this issue, and share how you solved it.
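As a rough sketch of the first step, a secondary CIDR block can be associated with the VPC using the AWS CLI. add_secondary_cidr and the AWS variable are hypothetical names added for illustration (set AWS=echo for a dry run), and the VPC ID and the 100.64.0.0/16 range in the usage line are placeholders, not values from our setup.

```shell
# Sketch only: add_secondary_cidr and AWS are hypothetical names.
# Set AWS=echo to print the command instead of running it.
AWS="${AWS:-aws}"

add_secondary_cidr() {
  vpc_id=${1:-}
  cidr=${2:-}
  if [ -z "$vpc_id" ] || [ -z "$cidr" ]; then
    echo "usage: add_secondary_cidr <vpc-id> <cidr>" >&2
    return 1
  fi
  # Associates an additional IPv4 CIDR block with an existing VPC
  $AWS ec2 associate-vpc-cidr-block --vpc-id "$vpc_id" --cidr-block "$cidr"
}
```

Usage would be something like add_secondary_cidr vpc-0123456789abcdef0 100.64.0.0/16. Associating the block is only the start: pods can use the new range only after subnets are created in it and the VPC CNI is configured to place pod ENIs in those subnets (custom networking).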