Traefik v1 gets out of sync on Kubernetes

I am experiencing Traefik v1 getting out of sync on Kubernetes, roughly once a week. When this happens, Traefik returns 'Service Unavailable' for some ingresses. Fortunately I am running two replicas of Traefik, and right now only one of the pods is in this state.

I have the following ingress:

$ kubectl -n monitoring get ingress prometheus -o yaml
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  annotations:
    traefik.ingress.kubernetes.io/frontend-entry-points: http
  creationTimestamp: "2020-01-01T19:58:29Z"
  generation: 1
  name: prometheus
  namespace: monitoring
  resourceVersion: "94371309"
  selfLink: /apis/extensions/v1beta1/namespaces/monitoring/ingresses/prometheus
  uid: 265b224a-0ac7-450c-9cf0-a34cd010b1ec
spec:
  rules:
  - host: prometheus.k8s.lan
    http:
      paths:
      - backend:
          serviceName: prometheus
          servicePort: http
        path: /
status:
  loadBalancer: {}
$ kubectl -n monitoring get endpoints prometheus -o yaml
apiVersion: v1
kind: Endpoints
metadata:
  annotations:
    endpoints.kubernetes.io/last-change-trigger-time: "2020-03-22T15:21:11Z"
  creationTimestamp: "2020-01-01T19:57:08Z"
  labels:
    service.kubernetes.io/headless: ""
  name: prometheus
  namespace: monitoring
  resourceVersion: "133667974"
  selfLink: /api/v1/namespaces/monitoring/endpoints/prometheus
  uid: 9ccf0e33-a1ac-447c-ae93-6d2a7162d75f
subsets:
- addresses:
  - ip: 10.112.10.2
    nodeName: k8s-node12
    targetRef:
      kind: Pod
      name: prometheus-5654c5c5df-rcwx6
      namespace: monitoring
      resourceVersion: "133667972"
      uid: 6977a701-b639-4b6c-b1c6-d893a87e1315
  ports:
  - name: http
    port: 9090
    protocol: TCP

And I have two Traefik pods, with these endpoints:

$ kubectl -n ingress-traefik get pod -o wide
NAME                                         READY   STATUS    RESTARTS   AGE    IP             NODE         NOMINATED NODE   READINESS GATES
traefik-ingress-controller-6656b6b56-5mrcj   1/1     Running   0          139m   10.112.6.216   k8s-node06   <none>           <none>
traefik-ingress-controller-6656b6b56-8s2tn   1/1     Running   0          142m   10.112.13.40   k8s-node05   <none>           <none>
$ kubectl -n ingress-traefik get endpoints traefik-ingress-service -o yaml
apiVersion: v1
kind: Endpoints
metadata:
  annotations:
    endpoints.kubernetes.io/last-change-trigger-time: "2020-03-22T13:25:55Z"
  creationTimestamp: "2020-01-24T19:22:56Z"
  name: traefik-ingress-service
  namespace: ingress-traefik
  resourceVersion: "133626548"
  selfLink: /api/v1/namespaces/ingress-traefik/endpoints/traefik-ingress-service
  uid: ca7441cf-6f33-446a-b328-5617e07ac26e
subsets:
- addresses:
  - ip: 10.112.13.40
    nodeName: k8s-node05
    targetRef:
      kind: Pod
      name: traefik-ingress-controller-6656b6b56-8s2tn
      namespace: ingress-traefik
      resourceVersion: "133625315"
      uid: 77fb2877-9591-40d3-834f-1e61a74cd894
  - ip: 10.112.6.216
    nodeName: k8s-node06
    targetRef:
      kind: Pod
      name: traefik-ingress-controller-6656b6b56-5mrcj
      namespace: ingress-traefik
      resourceVersion: "133626546"
      uid: b6b93c7d-ee5d-496c-8ee6-ede116dfd4c1
  ports:
  - name: https
    port: 8443
    protocol: TCP
  - name: http
    port: 8000
    protocol: TCP

Now, if I try to access the ingress through each Traefik pod, I get different results:

# curl -H 'Host: prometheus.k8s.lan' http://10.112.6.216:8000/
<a href="/graph">Found</a>.

# curl -H 'Host: prometheus.k8s.lan' http://10.112.13.40:8000/
Service Unavailable#

Simply deleting the ingress-traefik pod with IP 10.112.13.40 usually resolves the problem.
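
For completeness, the workaround is nothing more than deleting the stale pod and letting the Deployment recreate it (pod name taken from the listing above):

# delete the stale Traefik pod; the Deployment schedules a replacement
$ kubectl -n ingress-traefik delete pod traefik-ingress-controller-6656b6b56-8s2tn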

$ kubectl -n ingress-traefik get pod -o wide
NAME                                         READY   STATUS    RESTARTS   AGE    IP             NODE         NOMINATED NODE   READINESS GATES
traefik-ingress-controller-6656b6b56-5mrcj   1/1     Running   0          145m   10.112.6.216   k8s-node06   <none>           <none>
traefik-ingress-controller-6656b6b56-bd9pq   1/1     Running   0          14s    10.112.4.238   k8s-node11   <none>           <none>

Both the old pod and the new one now serve the ingress correctly:

# curl -H 'Host: prometheus.k8s.lan' http://10.112.6.216:8000/
<a href="/graph">Found</a>.

# curl -H 'Host: prometheus.k8s.lan' http://10.112.4.238:8000/
<a href="/graph">Found</a>.

My Traefik deployment is:

$ kubectl -n ingress-traefik get deploy traefik-ingress-controller -o yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "50"
  creationTimestamp: "2019-09-23T20:32:27Z"
  generation: 59
  labels:
    k8s-app: traefik-ingress-lb
  name: traefik-ingress-controller
  namespace: ingress-traefik
  resourceVersion: "133626552"
  selfLink: /apis/apps/v1/namespaces/ingress-traefik/deployments/traefik-ingress-controller
  uid: 0154be87-f7ac-4e3f-affe-f4132422a216
spec:
  progressDeadlineSeconds: 2147483647
  replicas: 2
  revisionHistoryLimit: 2147483647
  selector:
    matchLabels:
      k8s-app: traefik-ingress-lb
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        k8s-app: traefik-ingress-lb
        name: traefik-ingress-lb
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: k8s-app
                operator: In
                values:
                - traefik-ingress-lb
            topologyKey: kubernetes.io/hostname
      containers:
      - args:
        - --api
        - --kubernetes
        - --entrypoints=Name:http Address::8000
        - --entrypoints=Name:https Address::8443 TLS
        - --forwardingtimeouts.dialtimeout=5s
        - --kubernetes.namespaces=xx,yy,zz
        env:
        - name: GOGC
          value: "50"
        image: traefik:v1.7.21
        imagePullPolicy: IfNotPresent
        name: traefik-ingress-lb
        ports:
        - containerPort: 8000
          name: http
          protocol: TCP
        - containerPort: 8443
          name: https
          protocol: TCP
        - containerPort: 8080
          name: admin
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: 8080
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          requests:
            cpu: 250m
            memory: 32Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      nodeSelector:
        ingress-controller: traefik
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        runAsGroup: 8888
        runAsUser: 8888
      serviceAccount: traefik-ingress-controller
      serviceAccountName: traefik-ingress-controller
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoExecute
        key: node.kubernetes.io/unreachable
        operator: Exists
        tolerationSeconds: 60

I am running this on bare Raspberry Pi boards. They are not as fast as a VM on Intel hardware, but I would still expect this to work. Could Traefik be missing some events from the Kubernetes API while it is busy processing another one?
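
One thing I could check next time is the live configuration of each pod. As far as I know, Traefik 1.x exposes its current frontends and backends on the admin port (8080 here) under /api/providers, so using the pod IPs from the example above, something like the following should show whether the stale pod has simply lost the servers behind the backend. I have not verified the exact path on v1.7.21, so treat this as a sketch (it also assumes bash and jq):

# dump and diff the live configuration of the healthy and the stale pod
$ curl -s http://10.112.6.216:8080/api/providers > good.json
$ curl -s http://10.112.13.40:8080/api/providers > bad.json
$ diff <(jq -S . good.json) <(jq -S . bad.json)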

Today it happened again. Logs from Traefik:

time="2020-03-30T14:01:43Z" level=info msg="Server configuration reloaded on :8443"
time="2020-03-30T14:01:43Z" level=info msg="Server configuration reloaded on :8080"
W0330 14:12:02.564081       1 reflector.go:341] github.com/containous/traefik/vendor/k8s.io/client-go/informers/factory.go:86: watch of *v1.Endpoints ended with: too old resource version: 137552844 (137554684)
W0330 14:14:57.422138       1 reflector.go:341] github.com/containous/traefik/vendor/k8s.io/client-go/informers/factory.go:86: watch of *v1.Endpoints ended with: too old resource version: 137552898 (137555698)
W0330 14:15:22.447411       1 reflector.go:341] github.com/containous/traefik/vendor/k8s.io/client-go/informers/factory.go:86: watch of *v1.Endpoints ended with: too old resource version: 137552915 (137555868)
W0330 14:42:11.543770       1 reflector.go:341] github.com/containous/traefik/vendor/k8s.io/client-go/informers/factory.go:86: watch of *v1.Endpoints ended with: too old resource version: 137564635 (137565293)
W0330 14:49:10.457297       1 reflector.go:341] github.com/containous/traefik/vendor/k8s.io/client-go/informers/factory.go:86: watch of *v1.Endpoints ended with: too old resource version: 137564904 (137567714)
W0330 15:59:45.786539       1 reflector.go:341] github.com/containous/traefik/vendor/k8s.io/client-go/informers/factory.go:86: watch of *v1.Endpoints ended with: too old resource version: 137591379 (137592265)
W0330 16:10:13.708571       1 reflector.go:341] github.com/containous/traefik/vendor/k8s.io/client-go/informers/factory.go:86: watch of *v1.Endpoints ended with: too old resource version: 137595033 (137595933)
W0330 16:11:32.571162       1 reflector.go:341] github.com/containous/traefik/vendor/k8s.io/client-go/informers/factory.go:86: watch of *v1.Endpoints ended with: too old resource version: 137596040 (137596403)
W0330 16:13:47.592215       1 reflector.go:341] github.com/containous/traefik/vendor/k8s.io/client-go/informers/factory.go:86: watch of *v1.Endpoints ended with: too old resource version: 137596205 (137597186)
W0330 16:39:53.669012       1 reflector.go:341] github.com/containous/traefik/vendor/k8s.io/client-go/informers/factory.go:86: watch of *v1.Endpoints ended with: too old resource version: 137605510 (137606239)
W0330 16:46:02.600148       1 reflector.go:341] github.com/containous/traefik/vendor/k8s.io/client-go/informers/factory.go:86: watch of *v1.Endpoints ended with: too old resource version: 137607956 (137608388)
W0330 17:55:35.909355       1 reflector.go:341] github.com/containous/traefik/vendor/k8s.io/client-go/informers/factory.go:86: watch of *v1.Endpoints ended with: too old resource version: 137632455 (137632531)
W0330 18:06:09.831617       1 reflector.go:341] github.com/containous/traefik/vendor/k8s.io/client-go/informers/factory.go:86: watch of *v1.Endpoints ended with: too old resource version: 137636095 (137636207)
W0330 18:12:22.710594       1 reflector.go:341] github.com/containous/traefik/vendor/k8s.io/client-go/informers/factory.go:86: watch of *v1.Endpoints ended with: too old resource version: 137637326 (137638384)
W0330 18:14:43.685921       1 reflector.go:341] github.com/containous/traefik/vendor/k8s.io/client-go/informers/factory.go:86: watch of *v1.Endpoints ended with: too old resource version: 137636545 (137639197)
W0330 18:39:22.805371       1 reflector.go:341] github.com/containous/traefik/vendor/k8s.io/client-go/informers/factory.go:86: watch of *v1.Endpoints ended with: too old resource version: 137646410 (137647747)
W0330 18:46:25.737819       1 reflector.go:341] github.com/containous/traefik/vendor/k8s.io/client-go/informers/factory.go:86: watch of *v1.Endpoints ended with: too old resource version: 137648557 (137650194)

As can be seen, at some point it starts receiving 'too old resource version' warning messages.
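
If it helps with debugging, the resource version the warnings refer to could be compared against the current one on the endpoints object; just a thought, I have not correlated them yet:

# show the current resourceVersion of the endpoints object the watch complains about
$ kubectl -n monitoring get endpoints prometheus -o jsonpath='{.metadata.resourceVersion}'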