Traefik 1.7.9 "Endpoints not available for default/myAwesomeService" results in 503 and 504

Hello all,

I am dealing with a networking issue in GKE where every now and then my synthetic tests against a /healthcheck endpoint of my service return a 503 or a 504 (similar to Traefik 1.7.12 ingress requests ending with an HTTP 504 gateway timeout). Following the suggestions of @daniel.tomcej, I turned on debug logging in Traefik and waited for the issue to happen again.

To add some context, Traefik is installed via Helm (chart version traefik-1.63.1, Traefik version 1.7.9) and is the main ingress controller for my Kubernetes cluster on GKE.

These are the logs that I believe are related to my issue (correlated by timestamp and deployment id):

    {
        "msg": "Endpoints not available for default/myAwesomeService",
        "kube_labels_heritage": "Tiller",
        "log": {
            "msg": "Endpoints not available for default/myAwesomeService",
            "level": "warning",
            "time": "2019-07-09T09:18:54Z"
        },
        "level": "warning",
        "lifted_annotations": {
            "kubernetes_io/limit-ranger": "LimitRanger plugin set: cpu request for container traefik-live-1",
            "checksum/config": "0f81f7b30bdd55207aba4ab87c9eac4ec5b6acea03a16e46027b4f88419b16ac"
        },
        "kube_labels_release": "traefik-live-1",
        "lifted_pod_name": "traefik-live-1-6f555d46c5-ktctm",
        "lifted_container_name": "traefik-live-1",
        "lifted_docker_id": "b0fb4c928464a218dc0eafef299a1911f5b0258fa036f446cb9cf06c10f68437",
        "@timestamp": "2019-07-09T09:18:54.063Z",
        "lifted_pod_id": "89493634-a221-11e9-b4e1-42010a1e000a",
        "stream": "stdout",
        "lifted_namespace_name": "default",
        "kube_labels_pod-template-hash": "6f555d46c5",
        "time": "2019-07-09T09:18:54Z",
        "kube_labels_chart": "traefik-1.63.1",
        "kube_labels_app": "traefik"
    }
    {
        "msg": "Endpoints not available for default/myAwesomeService",
        "kube_labels_heritage": "Tiller",
        "log": {
            "msg": "Endpoints not available for default/myAwesomeService",
            "level": "warning",
            "time": "2019-07-09T09:18:54Z"
        },
        "level": "warning",
        "lifted_annotations": {
            "kubernetes_io/limit-ranger": "LimitRanger plugin set: cpu request for container traefik-live-1",
            "checksum/config": "0f81f7b30bdd55207aba4ab87c9eac4ec5b6acea03a16e46027b4f88419b16ac"
        },
        "kube_labels_release": "traefik-live-1",
        "@timestamp": "2019-07-09T09:18:54.066Z",
        "lifted_pod_id": "8f36d3cd-a221-11e9-b4e1-42010a1e000a",
        "stream": "stdout",
        "lifted_namespace_name": "default",
        "kube_labels_pod-template-hash": "6f555d46c5",
        "time": "2019-07-09T09:18:54Z",
        "kube_labels_chart": "traefik-1.63.1",
        "kube_labels_app": "traefik"
    }

In Kubernetes I can see the following warning on my deployment, but I am not sure why it is happening:

    ➜  ~ kubectl get events --namespace=default --field-selector reason=FailedToUpdateEndpoint
        LAST SEEN   TYPE      REASON                   KIND        MESSAGE
        39m         Warning   FailedToUpdateEndpoint   Endpoints   Failed to update endpoint default/myAwesomeService: Operation cannot be fulfilled on endpoints "myAwesomeService": the object has been modified; please apply your changes to the latest version and try again

So far it doesn't look like the GKE load balancer is failing or that the Traefik pods are causing the issue, so I suspect it is a problem with my deployments. I am not sure how to investigate this further, however, so I would appreciate any help!

Hello @dperezmavro,

There are a few questions I have:

  1. Endpoints not available for default/myAwesomeService means that no pods are ready: they have either not yet passed, or have failed, their readiness checks. The reason should be visible on the pods themselves, so I would look at the pod logs and events for those potentially failed healthchecks.
  2. Are you healthchecking through a service, i.e. are you testing service.namespace/healthcheck? If so, that check is not very meaningful: the service will always pass as long as there are endpoints that pass (see point #1). Readiness and liveness checks are done at the pod level and are not handled by services.
  3. Just to confirm: when you get the errors, Traefik returns a 5xx error because your app pods are not responding?

Could you provide your deployment spec (kubectl get deployment myapp -o yaml) to show us the configuration, and the output of kubectl get events for that deployment too?
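
For reference, something along these lines should surface that information (assuming, from your logs, that the deployment and service are both named myAwesomeService in the default namespace; adjust the label selector to whatever your pods actually carry):

    # Does the service currently have any ready endpoints?
    kubectl get endpoints myAwesomeService --namespace=default

    # Probe failures show up as events on the pods
    kubectl describe pods -l app.kubernetes.io/name=myAwesomeService --namespace=default

    # The deployment spec and recent events on it
    kubectl get deployment myAwesomeService -o yaml
    kubectl get events --namespace=default --field-selector involvedObject.name=myAwesomeService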

Hey there @daniel.tomcej ! Thank you for your fast reply!

So, I have been gathering information and here's what I have:

  1. When you say the pod's logs, do you mean at the service level (i.e. whatever custom messages I log) or is this a Kubernetes log? The custom logs I print for each healthcheck always show up as successes.
  2. Sorry, I should have clarified that earlier. By synthetic tests I mean that an external provider (Runscope in this case) GETs https://myAwesomeService.mydomain.com/healthcheck, which returns some information and an HTTP 200 to show it's alive. Occasionally this will time out and return a 504 or 503 (no other 5xx errors have been observed); see the snippet after this list for a way to reproduce the check by hand.
  3. This is what I am trying to find out! So far I have not worked out exactly what the problem is.
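
(For completeness, a small curl loop like this approximates the synthetic test by hand, assuming the hostname above:)

    # Poll the public healthcheck and print one status code per request
    while true; do
      curl -s -o /dev/null -w "%{http_code}\n" https://myAwesomeService.mydomain.com/healthcheck
      sleep 5
    done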

I could not get the events from that deployment (I think they must have expired), but here's the deployment information:

    apiVersion: extensions/v1beta1
    kind: Deployment
    metadata:
      annotations:
        deployment.kubernetes.io/revision: "20"
      creationTimestamp: 2019-06-20T11:29:51Z
      generation: 20
      labels:
        app.kubernetes.io/instance: myAwesomeService
        app.kubernetes.io/managed-by: Tiller
        app.kubernetes.io/name: myAwesomeService
        build-number: 0.0.52
        environment: myEnv
        helm.sh/chart: myAwesomeService-0.0.52
        release-name: myAwesomeService
        service-name: myAwesomeService
      name: myAwesomeService
      namespace: default
      resourceVersion: "11732515"
      selfLink: /apis/extensions/v1beta1/namespaces/default/deployments/myAwesomeService
      uid: b845cac3-934e-11e9-a1d2-42010a1e0007
    spec:
      progressDeadlineSeconds: 600
      replicas: 2
      revisionHistoryLimit: 1
      selector:
        matchLabels:
          app.kubernetes.io/instance: myAwesomeService
          app.kubernetes.io/name: myAwesomeService
      strategy:
        rollingUpdate:
          maxSurge: 25%
          maxUnavailable: 25%
        type: RollingUpdate
      template:
        metadata:
          annotations:
            consul.hashicorp.com/connect-inject: "true"
            consul.hashicorp.com/connect-service: myAwesomeService
            consul.hashicorp.com/connect-service-port: http
          creationTimestamp: null
          labels:
            app.kubernetes.io/instance: myAwesomeService
            app.kubernetes.io/name: myAwesomeService
            build-number: 0.0.52
            environment: myEnv
            release-name: myAwesomeService
            service-name: myAwesomeService
        spec:
          containers:
          - env:
            - name: TEST
              value: "false"
            image: eu.gcr.io/myProjectId/myAwesomeService:52
            imagePullPolicy: IfNotPresent
            livenessProbe:
              failureThreshold: 3
              httpGet:
                path: /healthcheck
                port: 3000
                scheme: HTTP
              initialDelaySeconds: 10
              periodSeconds: 10
              successThreshold: 1
              timeoutSeconds: 2
            name: myAwesomeService
            ports:
            - containerPort: 3000
              name: http
              protocol: TCP
            readinessProbe:
              failureThreshold: 3
              httpGet:
                path: /healthcheck
                port: 3000
                scheme: HTTP
              periodSeconds: 10
              successThreshold: 1
              timeoutSeconds: 1
            resources:
              limits:
                cpu: "1"
                memory: 1800Mi
              requests:
                cpu: 100m
                memory: 100Mi
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
          dnsPolicy: ClusterFirst
          restartPolicy: Always
          schedulerName: default-scheduler
          securityContext: {}
          terminationGracePeriodSeconds: 30
    status:
      availableReplicas: 2
      conditions:
      - lastTransitionTime: 2019-06-20T11:29:51Z
        lastUpdateTime: 2019-07-09T19:18:29Z
        message: ReplicaSet "myAwesomeService-7c8c664669" has successfully progressed.
        reason: NewReplicaSetAvailable
        status: "True"
        type: Progressing
      - lastTransitionTime: 2019-07-10T09:24:01Z
        lastUpdateTime: 2019-07-10T09:24:01Z
        message: Deployment has minimum availability.
        reason: MinimumReplicasAvailable
        status: "True"
        type: Available
      observedGeneration: 20
      readyReplicas: 2
      replicas: 2
      updatedReplicas: 2

I will post more information as these events happen (I can't exactly recreate them, they just occur every so often), but thank you very much for your help!

Hello @dperezmavro,

  1. When I say the pod logs, I mean the Kubernetes logs (what kubectl logs shows), as containerized applications usually log to stdout.

However, your deployment may shed some light on what is going on:

    livenessProbe:
      httpGet:
        path: /healthcheck
        port: 3000
        scheme: HTTP
    readinessProbe:
      httpGet:
        path: /healthcheck
        port: 3000
        scheme: HTTP

It is not a great idea to set your readiness check to be the same as the liveness check. Liveness failures result in the container/pod being stopped and restarted, whereas readiness checks only tell Kubernetes whether the pod is ready to receive traffic. With this configuration, as soon as a readiness check fails, the pod also gets completely restarted.

I would reconfigure your liveness probes to be different from your readiness probes:
https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/
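
A minimal sketch of what that could look like, assuming your app can expose a cheaper "process is alive" endpoint (the /live path here is illustrative, not something from your spec):

    livenessProbe:
      httpGet:
        path: /live         # hypothetical lightweight endpoint: restart only if the process is wedged
        port: 3000
        scheme: HTTP
      initialDelaySeconds: 10
      periodSeconds: 10
      failureThreshold: 3
      timeoutSeconds: 2
    readinessProbe:
      httpGet:
        path: /healthcheck  # your existing check: take the pod out of rotation while it fails
        port: 3000
        scheme: HTTP
      periodSeconds: 10
      failureThreshold: 3
      timeoutSeconds: 2

That way a failing check only removes the pod from the service's endpoints instead of also restarting it.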

Hello @daniel.tomcej !

So: there are zero errors about a failing healthcheck in my logs, which makes me think the request never even reaches the pod for some reason (also, the target pods themselves are not restarted at all).

I have now reconfigured the liveness probe to be this:

    livenessProbe:
      httpGet:
        path: "/healthcheck"
        port: {{ .Values.service.containerPort }}
      failureThreshold: 4
      initialDelaySeconds: 5
      periodSeconds: 5
      timeoutSeconds: 1

and I have removed the readiness probe just for testing purposes. Thank you for your help, I will keep this thread updated with any info I have!

@daniel.tomcej so, in order to debug this, I have turned on Datadog metrics for Traefik and added a retry count of 3. I have also increased the time between healthchecks, mirroring the readiness/liveness probes in the Traefik chart.
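
For reference, in Traefik 1.7's static configuration (traefik.toml) those two settings look roughly like this; this is a sketch, and the Datadog agent address and push interval below are assumptions:

    # Retry a failed request against another backend before returning an error
    [retry]
    attempts = 3

    # Push request metrics to a local Datadog agent
    [metrics]
      [metrics.datadog]
      address = "localhost:8125"  # assumed agent address
      pushInterval = "10s"        # assumed interval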