Load balancing with wrr doesn't honor server weights after a server recovers from failing health check

A link to my comment on the GitHub issue board: https://github.com/containous/traefik/issues/2986#issuecomment-506820573

I have two servers configured for my backend, one with a very large weight (performs better), and another with weight of 1 (slower, used as a backup if main server fails). If the server with the high weight fails a health check and then recovers, the wrr load balancer treats both servers like they have a weight of 1, until I manually change the value of the weight in EITHER server. Once one of the server’s weights is changed, the server’s load balancing returns to normal.

I believe I found the cause of the problem in Traefik’s source code. In the file pkg/healthcheck/healthcheck.go, when the health check detects that a failed server is up again it runs the following code:

if err := checkHealth(disableURL, backend); err == nil {
	log.Warnf("Health check up: Returning to server list. Backend: %q URL: %q", backend.name, disableURL.String())
	if err = backend.LB.UpsertServer(disableURL, roundrobin.Weight(1)); err != nil {
		log.Error(err)
	}
	// FIXME serverUpMetricValue = 1
} else {

I think the issue specifically lies with if err = backend.LB.UpsertServer(disableURL, roundrobin.Weight(1)); err != nil { because instead of passing in the previously assigned weight from the server, it is passing in a hard-coded roundrobin.Weight(1). I’ve been trying to dive deeper into traefik’s source to see if there’s a way to obtain the server’s previous weight in the function that contains the above code, but I’ve had no luck so far since I’m very unfamiliar with traefik’s code.

My question for anyone with good understanding of traefik’s source, is my assumption above correct? If so, how are server weights stored internally and how can it be obtained within the checkBackend method?

Hello,

the code inside the branch master is related to the v2 of Traefik.
In the v2 we have hard-coded weights because we will rewrite the load-balancing system.

Then, sorry, but your assumption is wrong.

Hello,

Thank you for responding. Could you please expand on why my assumption is wrong? The (mostly) same code appears in the v1.7 branch as well, and as much as I’d like to upgrade to v2, I cannot at the moment because v2 does not support Consul KV as a provider yet.

The code pasted from v1.7 branch:

if err := checkHealth(url, backend); err == nil {
	log.Warnf("Health check up: Returning to server list. Backend: %q URL: %q", backend.name, url.String())
	backend.LB.UpsertServer(url, roundrobin.Weight(1))
	serverUpMetricValue = 1
} else {

You’re right, the bug is in the 1.7 too. I change labels to confirm the bug.

And thanks for your PR.