Health Check failing with Network Load Balancer on EKS

Hello!

I'm having a hard time understanding what's going wrong here. I'm running Traefik (2.2) on Kubernetes (1.17) in Amazon EKS, configured via the CRD, and I'm trying to put a Network Load Balancer in front of it. The pod is failing health checks in the Target Group, and it seems the Kubernetes Service's HealthCheck port is returning a 503. Here is what I see when I describe the Traefik K8s Service:

$ kubectl describe svc/traefik -n traefik
Name:                     traefik
Namespace:                traefik
Labels:                   app=traefik
Annotations:
                          service.beta.kubernetes.io/aws-load-balancer-additional-resource-tags: Name=traefik-staging,Provisoner=kubernetes
                          service.beta.kubernetes.io/aws-load-balancer-type: nlb
Selector:                 app=traefik
Type:                     LoadBalancer
Port:                     admin  8888/TCP
TargetPort:               8888/TCP
NodePort:                 admin  32457/TCP
Endpoints:                172.17.113.150:8888,172.18.32.210:8888
...
External Traffic Policy:  Local
HealthCheck NodePort:     31769

The HealthCheck is automatically configured to use port 31769, which is also the port the Target Group is using. When I curl port 31769 on the K8s node from inside my network, it returns:

< HTTP/1.1 503 Service Unavailable
< Content-Type: application/json
< X-Content-Type-Options: nosniff
< Date: Thu, 30 Jul 2020 04:54:45 GMT
< Content-Length: 88
<
{
	"service": {
		"namespace": "traefik",
		"name": "traefik"
	},
	"localEndpoints": 0
}

So it seems the health check is returning a 503 and thus failing the Load Balancer health checks. Why is it failing with a 503? I can see from my K8s Service that there are indeed endpoints. Thank you in advance for the help!

I figured out the issue, and it had nothing to do with Traefik. I was mistaken in thinking that the 503 came from a health check in Traefik, when it actually came from a health check served by kube-proxy. For anyone interested in the solution:

Ultimately, this was because my EC2 instance's hostname was different from the node name registered in Kubernetes. With that mismatch, kube-proxy did not recognize the Traefik pods as local endpoints, which is why the health check reported "localEndpoints": 0. So I had to set kube-proxy's hostname to the same value as the node's name in the control plane.
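
To illustrate the mismatch: as far as I understand it, kube-proxy treats an endpoint as local only when the nodeName recorded on that endpoint equals kube-proxy's own hostname (or the value of --hostname-override). A trimmed sketch of the Endpoints object for the Service might look like the following. The IP and port come from the output above, but the node name is a hypothetical placeholder, not my real one.

```yaml
# Trimmed, illustrative Endpoints object for the traefik Service.
apiVersion: v1
kind: Endpoints
metadata:
  name: traefik
  namespace: traefik
subsets:
- addresses:
  - ip: 172.17.113.150
    # Hypothetical node name. kube-proxy counts this endpoint as local
    # only if this value matches the hostname kube-proxy is running with.
    nodeName: node-a.example.internal
  ports:
  - name: admin
    port: 8888
    protocol: TCP
```

If kube-proxy on that node runs with a different hostname (for example the EC2 instance's own hostname), no address matches, the local endpoint count stays at 0, and the HealthCheck NodePort answers with a 503.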

I set this by patching the kube-proxy DaemonSet, first adding a NODE_NAME environment variable to the container:

env:
- name: NODE_NAME
  valueFrom:
    fieldRef:
      apiVersion: v1
      fieldPath: spec.nodeName

Then you need to reference that NODE_NAME variable in your --hostname-override flag.

- command:
  - kube-proxy
  - --hostname-override=$(NODE_NAME)
  - --v=2
  - --config=/var/lib/kube-proxy-config/config
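
Put together, the two fragments above could be applied as a single strategic merge patch along these lines. This is only a sketch: it assumes the DaemonSet and its container are both named kube-proxy and live in the kube-system namespace, and the file name kube-proxy-patch.yaml is hypothetical. Check your own DaemonSet before applying anything.

```yaml
# kube-proxy-patch.yaml (hypothetical file name)
# Assumes a container named "kube-proxy"; containers are matched by name.
spec:
  template:
    spec:
      containers:
      - name: kube-proxy
        # The command list is replaced as a whole, so keep every flag
        # your existing DaemonSet already passes.
        command:
        - kube-proxy
        - --hostname-override=$(NODE_NAME)
        - --v=2
        - --config=/var/lib/kube-proxy-config/config
        env:
        # Exposes the Kubernetes node name to the container.
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
```

Under those assumptions it could be applied with something like: kubectl -n kube-system patch daemonset kube-proxy --patch "$(cat kube-proxy-patch.yaml)". The kube-proxy pods need to be recreated with the new spec before the health check changes.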

Now the health check is passing and Traefik is working flawlessly with a Network Load Balancer in EKS.
