Traefik not getting SSL certificates for some domains

I've previously asked this question on SO, so far without luck.

I've got Traefik/Docker Swarm/Let's Encrypt/Consul set up, and it's been working fine. It successfully obtained certificates for admin.domain.tld, registry.domain.tld and matomo.domain.tld, but others such as domain.tld and staging.domain.tld aren't getting any certificates (the browser warns of a self-signed certificate because Traefik serves its default certificate). On some attempts (after deleting the Consul data and starting over from scratch), a different set of domains works or fails...

My Traefik configuration (that's being uploaded to Consul):

debug = false
logLevel = "DEBUG"

insecureSkipVerify = true

defaultEntryPoints = ["https", "http"]

[entryPoints]
    [entryPoints.ping]
    address = ":8082"
    [entryPoints.http]
    address = ":80"
        [entryPoints.http.redirect]
        entryPoint = "https"
    [entryPoints.https]
    address = ":443"
    [entryPoints.https.tls]

[traefikLog]
    filePath = '/var/log/traefik/traefik.log'
    format = 'json'
[accessLog]
    filePath = '/var/log/traefik/access.log'
    format = 'json'
    [accessLog.fields]
        defaultMode = 'keep'
        [accessLog.fields.headers]
            defaultMode = 'keep'
            [accessLog.fields.headers.names]
                "Authorization" = "drop"

[retry]

[api]
entryPoint = "traefik"
dashboard = true
debug = false

[ping]
entryPoint = "ping"

[metrics]
    [metrics.influxdb]
    address = "http://influxdb:8086"
    protocol = "http"
    pushinterval = "10s"
    database = "metrics"

[docker]
endpoint = "unix:///var/run/docker.sock"
domain = "domain.tld"
watch = true
exposedByDefault = false
network = "net_web"
swarmMode = true

[acme]
email = "jan@maildomain.tld"
storage = "traefik/acme/account"
entryPoint = "https"
onHostRule = true
acmeLogging = true
[acme.httpChallenge]
entryPoint = "http"

The full log of startup and trying to load one page that's not working is available here (it's been passed through grep -i -A3 -B3 acme). The last two lines are probably the issue:

{"level":"debug","msg":"http: TLS handshake error from 10.255.0.2:60638: remote error: tls: unknown certificate","time":"2019-07-11T17:19:01Z"}
{"level":"debug","msg":"http: TLS handshake error from 10.255.0.2:60637: remote error: tls: unknown certificate","time":"2019-07-11T17:19:01Z"}

While trying to fix this I have also seen various other errors:

{"level":"error","msg":"Error getting ACME certificates [matomo.domain.tld] : cannot obtain certificates: acme: Error -\u003e One or more domains had a problem:\n[matomo.domain.tld] acme: error: 400 :: urn:ietf:paramsacme:error:connection :: Fetching http://matomo.domain.tld/.well-known/acme-challenge/WJZOZ9UC1aJl9ishmL2ACKFbKoGOe_xQoSbD34v8mSk: Timeout after connect (your server may be slow or overloaded), url: \n","time":"2019-07-09T16:27:43Z"}

I hope someone is able to shed some light on this. I'm growing ever closer to sticking nginx in front of Traefik to handle HTTPS termination, but that sounds like a really bad hack.

To rule out the Docker part, I tried specifying my domains using [[acme.domains]], which resulted in a bunch of these errors on startup:

{"level":"error","msg":"Error getting ACME certificate for domain [\"domain.xyz\" \"www.domain.xyz\" \"staging.domain.xyz\" \"pagefle.domain.xyz\" \"gris.domain.xyz\" \"gefleteknologerna.domain.xyz\" \"staging.pagefle.domain.xyz\" \"staging.gris.domain.xyz\" \"staging.gefleteknologerna.domain.xyz\" \"admin.domain.xyz\" \"matomo.domain.xyz\" \"portainer.admin.domain.xyz\" \"registry.domain.xyz\"]: cannot obtain certificates: acme: Error -\u003e One or more domains had a problem:\n[domain.xyz] acme: error: 400 :: urn:ietf:params:acme:error:connection :: Fetching http://domain.xyz/.well-known/acme-challenge/NynlPWanf5_76iKQTIzQbA2GBpK182oaNmxfSk2x1qw: Connection refused, url: \n[gefleteknologerna.domain.xyz] acme: error: 400 :: urn:ietf:params:acme:error:connection :: Fetching http://gefleteknologerna.domain.xyz/.well-known/acme-challenge/GG8JdOyzwKCXRZvE99UzKcrMjFKtEy7XdWVNlmMV1zQ: Connection refused, url: \n[gris.domain.xyz] acme: error: 400 :: urn:ietf:params:acme:error:connection :: Fetching http://gris.domain.xyz/.well-known/acme-challenge/3FSpxOYL8U67oKOF5Mvdej9w-DsSD0f-b_h72MFuQro: Connection refused, url: \n[staging.domain.xyz] acme: error: 400 :: urn:ietf:params:acme:error:connection :: Fetching http://staging.domain.xyz/.well-known/acme-challenge/4qyFTx8xJYfvzystUc37U3Zk0VSHC1vjjp4BaCYSfbw: Connection refused, url: \n[staging.gefleteknologerna.domain.xyz] acme: error: 400 :: urn:ietf:params:acme:error:connection :: Fetching http://staging.gefleteknologerna.domain.xyz/.well-known/acme-challenge/Oy0AS9i-Hce-iWY93KRrAC9nJDzocuj6ooFa8naJhro: Connection refused, url: \n","time":"2019-07-12T18:41:45Z"}

Which is kinda weird, since I have no issues reaching those domains from my browser (just with an invalid certificate, and not the ACME path of course).
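A browser follows the redirect to HTTPS, so it does not exercise the same path as the Let's Encrypt validator, which fetches http://<domain>/.well-known/acme-challenge/<token> on port 80. A minimal sketch of such a probe, assuming Python 3 is available on a machine outside the infrastructure (the function name and the "probe" token are made up for illustration):

```python
import http.client


def probe_challenge_path(host, port=80, token="probe", timeout=5.0):
    """Request the ACME HTTP-01 path over plain HTTP without following redirects.

    Returns (status, Location header) when an HTTP response arrives, or
    (None, error text) when the connection fails (refused, timeout, DNS error).
    """
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", "/.well-known/acme-challenge/" + token)
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    except (OSError, http.client.HTTPException) as exc:
        return None, f"{type(exc).__name__}: {exc}"
    finally:
        conn.close()
```

Calling probe_challenge_path("staging.domain.xyz") from outside should return some HTTP status (whatever the server answers for an unknown token). A (None, "ConnectionRefusedError: ...") result would match the "Connection refused" that Let's Encrypt reports.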

Hi @02JanDal,

  • First of all, do you see the same behaviour when not using Consul for storing certificates? Be aware that the Consul "Let's Encrypt certificate" storage is not considered production ready (it has always been considered a beta feature), because it suffers from a lot of issues (check the labels on the GH issue tracker). We built TraefikEE for this core reason. So removing Consul from the equation could help to track down what is going wrong here.
  • If the issue does not appear when Consul isn't used for certificates, could you provide a reproduction case please? With the Consul configuration missing from the traefik.toml, and without the CloudFlare config (or at least the domains to test the setup from outside) or details of how Traefik and Consul are run, it's hard to analyse the problem and help you.
  • If the issue remains, then it's something else: can you try the tlsChallenge as well? Given the error message returned by Let's Encrypt, it looks like the issue is not DNS related but TCP related (Connection refused), as if the Let's Encrypt agent was not able to establish a TCP connection to the IP the DNS name resolves to. I wonder if port TCP/80 is available for the failing domains: switching to tlsChallenge would ensure everything stays on port TCP/443.

Hi!

An attempt without Consul and with tlsChallenge sadly gives exactly the same results: https://pastebin.com/ZwLcQkFF (400, Connection refused).

For reproduction (same error even without Docker: https://pastebin.com/Pzt4WLn2):

debug = true
logLevel = "DEBUG"

insecureSkipVerify = true

defaultEntryPoints = ["https", "http"]

[entryPoints]
    [entryPoints.ping]
    address = ":8082"
    [entryPoints.http]
    address = ":80"
        [entryPoints.http.redirect]
        entryPoint = "https"
    [entryPoints.https]
    address = ":443"
    [entryPoints.https.tls]

[traefikLog]
    filePath = '/var/log/traefik/traefik.log'
    format = 'json'
[accessLog]
    filePath = '/var/log/traefik/access.log'
    format = 'json'
    [accessLog.fields]
        defaultMode = 'keep'
        [accessLog.fields.headers]
            defaultMode = 'keep'
            [accessLog.fields.headers.names]
                "Authorization" = "drop"

[retry]

[api]
entryPoint = "traefik"
dashboard = true
debug = false

[ping]
entryPoint = "ping"

[acme]
email = "jan@dalheimer.de"
storage = "/acme.json"
entryPoint = "https"
onHostRule = true
acmeLogging = true
[acme.tlsChallenge]
[[acme.domains]]
main = "domain.xyz"
sans = [
	"www.domain.xyz",
	"staging.domain.xyz",
	"pagefle.domain.xyz",
	"gris.domain.xyz",
	"gefleteknologerna.domain.xyz",
	"staging.pagefle.domain.xyz",
	"staging.gris.domain.xyz",
	"staging.gefleteknologerna.domain.xyz",
	"admin.domain.xyz",
	"matomo.domain.xyz",
	"portainer.admin.domain.xyz",
	"registry.domain.xyz"
]
[[acme.domains]]
main = "staging.otherdomain.se"

Docker Swarm configuration:

version: '3.7'
services:
  traefik:
    image: traefik:alpine
    networks:
      - public
      - backbone
      - web
    ports:
      - 80:80
      - 443:443
      - 8080:8080
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - /home/docker/test/acme.json:/acme.json
      - /home/docker/test/traefik.toml:/etc/traefik/traefik.toml
    deploy:
      mode: replicated
      replicas: 1
      placement:
        constraints:
          - node.role == manager
      update_config:
        parallelism: 1
        delay: 10s
      labels:
        traefik.enable: 'true'
        traefik.port: 8080
        traefik.frontend.rule: 'Host: admin.domain.xyz; PathPrefixStrip: /traefik'
    healthcheck:
      test: 'printf "GET /ping HTTP/1.1\r\nHost: 127.0.0.1\r\nAccept: */*\r\n\r\n" | nc localhost 8082'

networks:
  web:
    driver: overlay
    internal: true
  public:
    driver: overlay
  backbone:
    driver: overlay
    internal: true

Instructions: Place both files as traefik.toml and stack.yaml in the same directory, then touch acme.json and chmod 600 it. Start with docker stack deploy -c stack.yaml test, wait until the container shows up in docker ps, then run docker exec {container id} tail /var/log/traefik/traefik.log to view the log.

Ok, I'm starting an env to reproduce. The HTTP/400 "Connection refused" from Let's Encrypt really looks like a network issue (as per all the Let's Encrypt community posts on this topic, such as https://community.letsencrypt.org/t/error-400-connection-refused/89267/6 ).

Could you check, from a machine outside your infrastructure and on another internet provider, that you can curl over both IPv4 and IPv6 the IPs listed in the A and AAAA records of the faulty domains?

Is there any firewall that could refuse external connections?
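The outside reachability check can also be scripted. The following is only a sketch, assuming Python 3 on the external machine; probe_tcp is an illustrative name, not part of any tool mentioned in this thread. It resolves the A and AAAA records and attempts a plain TCP connect to each address, which is the step that fails when Let's Encrypt reports "Connection refused".

```python
import socket


def probe_tcp(host, port, timeout=5.0):
    """Try a TCP connect to every address that host resolves to.

    Returns a list of (family label, address, result) tuples, where result
    is "open" or a short description of the error.
    """
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        return [("dns", host, f"resolution failed: {exc}")]

    results = []
    seen = set()
    for family, socktype, proto, _canonname, sockaddr in infos:
        if sockaddr in seen:  # skip duplicate entries for the same address
            continue
        seen.add(sockaddr)
        label = "IPv6" if family == socket.AF_INET6 else "IPv4"
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
            result = "open"
        except OSError as exc:
            result = f"{type(exc).__name__}: {exc}"
        results.append((label, sockaddr[0], result))
    return results
```

Running probe_tcp("staging.domain.xyz", 80) and probe_tcp("staging.domain.xyz", 443) for a failing and a working domain makes it easy to compare them address by address. A refused or timed-out IPv6 entry next to an open IPv4 one would point at a stale AAAA record.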

Based on some post I found early on (that I can't find right now) I tried removing the AAAA record, so everything's been going through IPv4. Firewall has likewise been disabled since my initial troubleshooting attempts (that's the reason I've removed the real domain name in all logs etc.).

My Cloudflare configuration:


(and so on for all subdomains)

otherdomain.se is on a different DNS provider but with a similar configuration.

Also, what still confuses me in the test case is that some subdomains seemingly succeed (or at least are not included in the error). Could it be some form of rate limit? If so, which one?

Rate limits for Let's Encrypt are well documented.

And when you hit a rate limit, the error returned explicitly states which rate limit applies.

There is no guesswork involved for rate limiting!
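To confirm this from the Traefik log itself, the error type can be pulled out of each line. ACME errors carry a URN of the form urn:ietf:params:acme:error:<type>. The errors in this thread are of type connection, while a rate limit would show up as rateLimited (the type name defined in RFC 8555). A small sketch, assuming the JSON log format shown earlier in the thread (the helper name is made up):

```python
import json
import re

# Matches the type suffix of an ACME error URN, e.g. "connection".
ACME_ERROR = re.compile(r"urn:ietf:params:acme:error:(\w+)")


def acme_error_types(log_line):
    """Return the set of ACME error types named in one Traefik log line."""
    try:
        # Traefik's JSON log format keeps the text in the "msg" field.
        msg = json.loads(log_line).get("msg", "")
    except (ValueError, AttributeError):
        # Not JSON (or not a JSON object): search the raw line instead.
        msg = log_line
    return set(ACME_ERROR.findall(str(msg)))
```

Applied to every line of traefik.log, this yields {"connection"} for the errors quoted above. If "rateLimited" never appears, rate limiting can be ruled out.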

I guess I can rule that out then. Good because one less possible issue, bad since it was the only explanation I could come up with...

@dduportal Have you had any luck reproducing?

Hi @02JanDal, alas no, sorry. I tried with AWS Route53 and Digital Ocean DNS, with 4 domains I own and 4 subdomains per domain. I was unable to reproduce, except when I block Let's Encrypt's access (either with a security group rule in AWS, a firewall rule in ifw or a bad port mapping).

What is the result of the command openssl s_client -connect <failing-domain>:443 from a machine inside your platform? And from an external machine (for example through a 4G access point, or from another Internet provider)?
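The same check can be run across many domains with a short script. This is a sketch under assumptions: the function names are hypothetical, and the byte-substring match on the DER certificate is a shortcut that stands in for proper X.509 parsing. It connects with SNI set to the domain, skips verification on purpose, and reports whether the answer is the "TRAEFIK DEFAULT CERT" placeholder.

```python
import socket
import ssl

DEFAULT_CN = b"TRAEFIK DEFAULT CERT"


def is_traefik_default(der_cert):
    """True if the DER bytes contain Traefik's default-certificate name.

    Shortcut: the name is stored as plain ASCII inside the DER encoding,
    so a substring test is enough for this diagnostic.
    """
    return DEFAULT_CN in der_cert


def fetch_der_cert(host, port=443, timeout=5.0):
    """Return the server's leaf certificate as DER bytes, unverified."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False       # must be disabled before CERT_NONE
    ctx.verify_mode = ssl.CERT_NONE  # we only want to look at the cert
    with socket.create_connection((host, port), timeout=timeout) as raw:
        # server_hostname sends SNI, which the server uses to pick a cert.
        with ctx.wrap_socket(raw, server_hostname=host) as tls:
            return tls.getpeercert(binary_form=True)
```

For example, is_traefik_default(fetch_der_cert("staging.domain.xyz")) returns True while no Let's Encrypt certificate has been issued for that name. Older openssl builds may not send SNI unless -servername is passed to s_client, in which case the default certificate could be returned regardless. This sketch always sends SNI.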

Got pretty much exactly the same result both from my own local computer and the server itself.

CONNECTED(00000005)
depth=0 CN = TRAEFIK DEFAULT CERT
verify error:num=20:unable to get local issuer certificate
verify return:1
depth=0 CN = TRAEFIK DEFAULT CERT
verify error:num=21:unable to verify the first certificate
verify return:1
---
Certificate chain
 0 s:/CN=TRAEFIK DEFAULT CERT
   i:/CN=TRAEFIK DEFAULT CERT
---
Server certificate
-----BEGIN CERTIFICATE-----
MIIDRTCCAi2gAwIBAgIPBDjiwdJnkJpTTMm9qRDtMA0GCSqGSIb3DQEBCwUAMB8x
HTAbBgNVBAMTFFRSQUVGSUsgREVGQVVMVCBDRVJUMB4XDTE5MDcxNjE2NTEyM1oX
DTIwMDcxNTE2NTEyM1owHzEdMBsGA1UEAxMUVFJBRUZJSyBERUZBVUxUIENFUlQw
ggEiMA0GCSqGSIb3DQEBAQUAA4IBDwAwggEKAoIBAQC21fGv2lZREwKWDr/RUkPu
mINdCBKkrGV3wTyAZKrTI4jQbeHnyRWHMbXS1puI5DPzXSjmaqRfF7oFE6PJTTlC
ohK4+h3M7HB3AGNuZcTg3mAA3lvSU24GCDyVI/hyXNvE++KLU154Q3rV1AGZONCl
RIHBr1dzmWkG2iZ4sL9HTrYkvORdjjuNCCT1E2fkZByzYuUz5JAssNg20Etojett
AGHQXoW0UaWdqh/FBz3XHXNJXCR4KqSEp9BzUbM3F85SvrAlV8VF6Ls5pbfx3ZJh
Tbj0b5AzHiwJq984o9u8mhjyWlsv4EGPRIiFfDOF1JDGrBw015wYUOMBma5KLm3T
AgMBAAGjfjB8MA4GA1UdDwEB/wQEAwIFIDAMBgNVHRMBAf8EAjAAMFwGA1UdEQRV
MFOCUTZhNTRkZDJiMjhlYTA3MmJhN2UyZTBmMDM3YTBkNDE5LjA3OGU5NTg1ZTE0
YzBhMWVkOGQzYjFiZWFlOGZlYjZmLnRyYWVmaWsuZGVmYXVsdDANBgkqhkiG9w0B
AQsFAAOCAQEAdhMeaYlWrwI4E3Ufd/FmVOvcz8C+ccs/k4RGs5n9UMvhOjFWax9E
ZKx+r2brvNhkSl8j9TBNe3M7OaoWRU8UI0Gry/eUhuCXjsltJPsL8HZIae/LG12/
jYlEqLYd7ojzzEyRvDFaVaRn9+kh2OFhgt8zOFZiN0L8BGm91KF2ZR+bWYucoRq6
H96myievTLyIwn6+r3Giqw5l7IHbQ+keqgxsCxseWvpgUsxOFSBCZI2AvZajdphm
EVWkB5N97WE6TGRAPZPjgaq3lzrCLJiPlLy5A5ksXp0RwhWZLTB9VffMcKmFSYDx
DoOQ6Ac1gdxIVDeqldf1fneU240vl9Q80w==
-----END CERTIFICATE-----
subject=/CN=TRAEFIK DEFAULT CERT
issuer=/CN=TRAEFIK DEFAULT CERT
---
No client certificate CA names sent
Server Temp Key: ECDH, X25519, 253 bits
---
SSL handshake has read 1414 bytes and written 293 bytes
---
New, TLSv1/SSLv3, Cipher is ECDHE-RSA-AES128-GCM-SHA256
Server public key is 2048 bit
Secure Renegotiation IS supported
Compression: NONE
Expansion: NONE
No ALPN negotiated
SSL-Session:
    Protocol  : TLSv1.2
    Cipher    : ECDHE-RSA-AES128-GCM-SHA256
    Session-ID: 7CC416EF8E9DA069909A26D8FC291B9FB08175B422ABF6939CE91624519E0A20
    Session-ID-ctx: 
    Master-Key: 6650A9E449D1ED2E561AC6F0FC5FAC34B7C648D4DF324846E176EA687ACE28231B3A49E05ECAA3D6C690C16197D052ED
    TLS session ticket:
    0000 - fc ee 6e e0 a0 15 bc 3a-eb 44 7a 1e a5 7a 86 66   ..n....:.Dz..z.f
    0010 - 3f c4 81 96 bb 8a 83 63-a2 a7 a2 c9 ab 89 4f 06   ?......c......O.
    0020 - b8 54 f1 f4 b4 b5 79 07-62 24 44 5d da 3f 76 0d   .T....y.b$D].?v.
    0030 - ab 9a 0d ad c8 1a 1e 3f-41 a1 21 1b 46 aa e5 cf   .......?A.!.F...
    0040 - 64 d8 34 2e 91 9a 97 16-44 74 d5 9f 92 50 79 0b   d.4.....Dt...Py.
    0050 - 68 89 b9 ce 8e 95 f7 4c-dc 89 99 d1 7e dd f6 9c   h......L....~...
    0060 - 14 ef f5 bc 9b 62 79 ca-cd 48 93 36 da 74 11 d0   .....by..H.6.t..
    0070 - 7f 25 49 3f a1 ec 93 4e-                          .%I?...N

    Start Time: 1563985377
    Timeout   : 7200 (sec)
    Verify return code: 21 (unable to verify the first certificate)
---

Since some domains succeed, could it be that Traefik is trying to do the challenges before Docker and Traefik are set up and the ports are ready?