Debugging notes: an etcd clientv3 lease keepalive bug causing constant lease renewals

Background:

- Test environment: 20,000 hosts, each registering a key bound to a lease; the key TTL is set to 10s and keepalive is enabled.
- The cluster has 7 member nodes; members communicate with each other over HTTPS, and clients talk to the cluster over HTTPS as well.
- The product was waiting to ship, we kept hitting cluster performance problems, and after a long time without finding the cause I felt like I was falling apart.
- Until the performance problem was solved, no more production servers could be deployed; management was dissatisfied, and I was sure I would be fired, or would have to quit...
- We filed an issue on etcd's GitHub, but upstream had no time to look into it (understandably: there are far too many issues, and most turn out to be user error).

The registration code looked like this:
var err error
ctx, cancel := context.WithTimeout(context.Background(), ETCD_TRANSPORT_TIMEOUT)
leaseResp, err := this.GetEtcdClient().Grant(ctx, 10) // lease TTL set to 10 seconds
cancel()
if err != nil {
	return err
}

ctx, cancel = context.WithTimeout(context.Background(), ETCD_TRANSPORT_TIMEOUT)
kvc := etcdclient.NewKV(this.kapi)
var txnResp *etcdclient.TxnResponse
txnResp, err = kvc.Txn(ctx).
	If(etcdclient.Compare(etcdclient.CreateRevision(keyName), "=", 0)).
	Then(etcdclient.OpPut(keyName, value.GetRegisterInfo(),etcdclient.WithLease(etcdclient.LeaseID(leaseResp.ID)))).
	Commit()
cancel()

if err != nil {
	return err
}

if !txnResp.Succeeded {
	return fmt.Errorf("failed to register self: %s, the key already exists", keyName)
}

// KeepAlive needs a long-lived context; wrapping it in a short timeout and cancelling would stop the renewals
_, err = this.GetEtcdClient().KeepAlive(context.TODO(), leaseResp.ID)
if err != nil {
	return err
}

Symptoms:

- Keys carrying a lease ID disappeared for no apparent reason.
- On the leader node, the TIME-WAIT count was extremely high most of the time.
- The leader's CLOSE-WAIT count was sometimes very high as well.
- The leader's CPU usage stayed around 80-90%.
- The leader's memory usage kept climbing, which looked like a memory leak.

The monitoring graphs:

Starting with the etcd process on the leader:

Let's start with the suspected memory leak. etcd already exposes pprof endpoints, and the Go toolchain includes a tool that reads http pprof profiles directly, but it does not support HTTPS, so the profile has to be downloaded locally first and then read from disk:

curl --cacert /xxx/xxx/etcd/config/ssl/ca.pem --cert /xxx/xxx/etcd/config/ssl/client.pem --key /xxx/xxx/etcd/config/ssl/client-key.pem   -L https://leader_ip:2379/debug/pprof/profile > leader_ip_cpu.txt

The result:

cfc4n@cnxct:~/Downloads$ go tool pprof leader_ip_cpu.txt
File: etcd
Type: cpu
Time: Jul 3, 2018 at 4:13pm (CST)
Duration: 2.64mins, Total samples = 34.54mins (1306.43%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top 20
Showing nodes accounting for 23.98mins, 69.42% of 34.54mins total
Dropped 1702 nodes (cum <= 0.17mins)
Showing top 20 nodes out of 211
      flat  flat%   sum%        cum   cum%
  3.62mins 10.48% 10.48%   3.62mins 10.48%  runtime.futex /usr/local/go/src/runtime/sys_linux_amd64.s
  2.99mins  8.66% 19.13%   2.99mins  8.66%  math/big.divWVW /usr/local/go/src/math/big/arith_amd64.s
  2.34mins  6.78% 25.91%   2.34mins  6.78%  math/big.bitLen /usr/local/go/src/math/big/arith_amd64.s
  1.90mins  5.51% 31.42%  10.44mins 30.22%  math/big.nat.montgomery /usr/local/go/src/math/big/nat.go
  1.67mins  4.82% 36.24%   1.67mins  4.82%  runtime.memmove /usr/local/go/src/runtime/memmove_amd64.s
  1.46mins  4.22% 40.47%   3.57mins 10.33%  runtime.scanobject /usr/local/go/src/runtime/mgcmark.go
  1.36mins  3.93% 44.40%   1.44mins  4.17%  syscall.Syscall /usr/local/go/src/syscall/asm_linux_amd64.s
  1.30mins  3.78% 48.17%   1.30mins  3.78%  runtime.heapBitsForObject /usr/local/go/src/runtime/mbitmap.go
  1.04mins  3.01% 51.19%   1.04mins  3.02%  runtime.greyobject /usr/local/go/src/runtime/mgcmark.go
  0.94mins  2.71% 53.90%   0.94mins  2.71%  math/big.divWVW
  0.80mins  2.33% 56.23%   0.83mins  2.41%  runtime.siftdownTimer /usr/local/go/src/runtime/time.go
  0.80mins  2.32% 58.55%   3.17mins  9.17%  math/big.nat.divLarge /usr/local/go/src/math/big/nat.go
  0.78mins  2.27% 60.82%   0.78mins  2.27%  math/big.addMulVVW /usr/local/go/src/math/big/arith_amd64.s
  0.57mins  1.66% 62.48%   0.57mins  1.66%  runtime._ExternalCode /usr/local/go/src/runtime/proc.go
  0.57mins  1.64% 64.12%   0.57mins  1.64%  math/big.(*byteReader).Read <autogenerated>
  0.42mins  1.22% 65.33%   3.22mins  9.32%  runtime.mallocgc /usr/local/go/src/runtime/malloc.go
  0.40mins  1.17% 66.50%   0.40mins  1.17%  math/big.mulAddVWW /usr/local/go/src/math/big/arith_amd64.s
  0.34mins  0.99% 67.49%   0.34mins  0.99%  vendor/golang_org/x/crypto/curve25519.ladderstep /usr/local/go/src/vendor/golang_org/x/crypto/curve25519/ladderstep_amd64.s
  0.34mins  0.99% 68.49%   0.34mins  0.99%  math/big.bitLen
  0.32mins  0.93% 69.42%   0.41mins  1.18%  sync.(*Pool).pin /usr/local/go/src/sync/pool.go

A lot of CPU is being burned in math/big functions, yet our business logic involves no mathematical computation and never imports the math packages. The only plausible source is the big-integer arithmetic behind RSA verification of TLS certificates on the HTTPS connections. I exported the pprof data as an SVG graph to confirm: the TLS certificate computation was indeed eating the CPU.

But shouldn't an HTTPS connection be established once and then reused? Why so much handshake computation? Could the huge TIME-WAIT count be related? Back to TIME-WAIT: at this point the leader was xxxx-etcdserver-xxx02, and its TIME_WAIT count was extremely high:

[user@xxxx-etcdserver-xxx02 etcd]$ ss |grep 2380|awk '{print $5}'|awk -F ":" '{print $1}'|sort |uniq -c|sort -rn
  53882 192.168.198.234
  53667 192.168.227.202
  53539 192.168.255.210
  53197 192.168.58.204
  53135 192.168.137.210
  51966 192.168.128.205

All of the leader's TIME-WAIT connections came from the member nodes, tens of thousands per member on average, and a query a few seconds later still showed 50,000+. Meanwhile each member held 1,000+ ESTABLISHED connections to the leader's port 2380, and was indeed creating connections at a high rate. Why were the members creating connections so frequently? Time to capture some TCP traffic with tcpdump:

Looking at the captured packets, what do we find?

120140 2018-06-22 20:37:30.219 10.72.198.234 74 10.20.127.136 TCP 12902 → etcd-server(2380) [SYN] Seq=0 Win=14600 Len=0 MSS=1460 SACK_PERM=1 TSval=3008487334 TSecr=0 WS=128
120141 2018-06-22 20:37:30.219 10.20.127.136 74 10.72.198.234 TCP etcd-server(2380) → 12902 [SYN, ACK] Seq=0 Ack=1 Win=14480 Len=0 MSS=1460 SACK_PERM=1 TSval=2796204056 TSecr=3008487334 WS=128
120362 2018-06-22 20:37:30.242 10.72.198.234 66 10.20.127.136 TCP 12902 → etcd-server(2380) [ACK] Seq=1 Ack=1 Win=14720 Len=0 TSval=3008487358 TSecr=2796204056
120368 2018-06-22 20:37:30.242 10.72.198.234 204 10.20.127.136 TCP 12902 → etcd-server(2380) [PSH, ACK] Seq=1 Ack=1 Win=14720 Len=138 TSval=3008487358 TSecr=2796204056
120369 2018-06-22 20:37:30.242 10.20.127.136 66 10.72.198.234 TCP etcd-server(2380) → 12902 [ACK] Seq=1 Ack=139 Win=15616 Len=0 TSval=2796204080 TSecr=3008487358
174076 2018-06-22 20:37:40.242 10.72.198.234 66 10.20.127.136 TCP 12902 → etcd-server(2380) [FIN, ACK] Seq=139 Ack=1 Win=14720 Len=0 TSval=3008497358 TSecr=2796204080
174377 2018-06-22 20:37:40.282 10.20.127.136 66 10.72.198.234 TCP etcd-server(2380) → 12902 [ACK] Seq=1 Ack=140 Win=15616 Len=0 TSval=2796214120 TSecr=3008497358

Can you spot the problem in these packets? Take five minutes with the capture before reading on...

In frame 120140, the member at 10.72.198.234 initiates a TCP connection to port 2380 on the leader at 10.20.127.136. 120141 is the leader's SYN-ACK, and 120362 is the member's ACK completing the handshake; all straightforward. 120368 is a data packet from the member to the leader, and 120369 is the leader's ACK for it; also clear. But what is 174076? It is a FIN from the member, closing the TCP connection. Why? Why would the member close the connection? I spot-checked several more exchanges, picking from the 1,000+ SYNs listed under Wireshark's "Expert Information"...

In every case it is the member that sends the FIN to the leader. Why? And is that all there is to see? Did you notice the gap between the leader's ACK of the member's data packet and the member's FIN? It is always 10 seconds. Why 10 seconds?

The leader ACKed, but never sent its own FIN,ACK back to the member. Why not? And if it never does, what state is the leader's side of this TCP five-tuple left in? Right: CLOSE-WAIT. Could this be the source of the leader's huge CLOSE-WAIT count? Hold that suspicion and read on...

As noted above, the member sent the leader a PSH data packet. What was in it, and why did the leader never answer?

The payload is 138 bytes and looks opaque at first. But the first three bytes should ring a bell: 0x16, 0x03, 0x01. If you know HTTPS, this is a TLS record header: 0x16 (22 decimal) is the record type "handshake" (here a ClientHello), and the following 0x03, 0x01 is the TLS record-layer version. Since the members and the leader do talk HTTPS, this is clearly a TLS ClientHello. Combined with the 10-second gap observed earlier, it means the TLS handshake between member and leader never completed: the leader never sent its ServerHello. The packet is short and yields no other useful information.
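The byte check described above can be sketched in a few lines of Go (a standalone illustration; isTLSClientHello is a hypothetical helper, not part of any capture tool):

```go
package main

import "fmt"

// isTLSClientHello reports whether the first bytes of a TCP payload look
// like the start of a TLS handshake record: content type 0x16 (22,
// "handshake") followed by the 0x03,0x01 record-layer version.
func isTLSClientHello(payload []byte) bool {
	return len(payload) >= 3 &&
		payload[0] == 0x16 && // record type: handshake
		payload[1] == 0x03 && payload[2] == 0x01 // record-layer version 3.1
}

func main() {
	// First bytes of the captured 138-byte PSH payload: a 133-byte
	// (0x0085) handshake record after the 5-byte record header.
	pkt := []byte{0x16, 0x03, 0x01, 0x00, 0x85}
	fmt.Println(isTLSClientHello(pkt)) // true
}
```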

Dropping HTTPS to lighten the CPU load:

Back to the CPU profile: the TLS certificate computation drives the high CPU usage, but the flood of new TCP connections is what triggers all that TLS work in the first place, and it is also behind the memory growth and message backlog. Still, we tried disabling TLS certificate verification between members and leader to relieve the leader's CPU. It made little difference; CPU usage stayed high. That was expected, since TLS was never the root cause. The new pprof output:

cfc4n@cnxct:~/Downloads$ go tool pprof 111.111.111.111-0703_cpu.txt
File: etcd
Type: cpu
Time: Jul 3, 2018 at 4:55pm (CST)
Duration: 30.01s, Total samples = 6.16mins (1231.36%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top 20
Showing nodes accounting for 4.29mins, 69.69% of 6.16mins total
Dropped 928 nodes (cum <= 0.03mins)
Showing top 20 nodes out of 181
      flat  flat%   sum%        cum   cum%
  2.13mins 34.62% 34.62%   2.13mins 34.62%  runtime.futex /usr/local/go/src/runtime/sys_linux_amd64.s
  0.59mins  9.51% 44.13%   0.62mins 10.06%  syscall.Syscall /usr/local/go/src/syscall/asm_linux_amd64.s
  0.33mins  5.42% 49.55%   0.33mins  5.42%  runtime._ExternalCode /usr/local/go/src/runtime/proc.go
  0.25mins  4.07% 53.62%   1.23mins 20.00%  runtime.lock /usr/local/go/src/runtime/lock_futex.go
  0.13mins  2.18% 55.80%   0.13mins  2.18%  runtime.procyield /usr/local/go/src/runtime/asm_amd64.s
  0.11mins  1.80% 57.60%   0.23mins  3.72%  runtime.selectgoImpl /usr/local/go/src/runtime/select.go
  0.10mins  1.57% 59.17%   0.10mins  1.57%  runtime.osyield /usr/local/go/src/runtime/sys_linux_amd64.s
  0.09mins  1.39% 60.56%   1.47mins 23.85%  runtime.unlock /usr/local/go/src/runtime/lock_futex.go
  0.07mins  1.12% 61.69%   0.33mins  5.30%  runtime.mallocgc /usr/local/go/src/runtime/malloc.go
  0.07mins  1.06% 62.74%   0.07mins  1.06%  runtime.heapBitsForObject /usr/local/go/src/runtime/mbitmap.go
  0.06mins  1.01% 63.76%   0.15mins  2.36%  runtime.scanobject /usr/local/go/src/runtime/mgcmark.go
  0.05mins  0.77% 64.53%   0.05mins  0.77%  runtime.siftupTimer /usr/local/go/src/runtime/time.go
  0.04mins  0.71% 65.24%   0.04mins  0.71%  runtime.greyobject /usr/local/go/src/runtime/mgcmark.go
  0.04mins  0.71% 65.95%   0.07mins  1.14%  runtime.deferreturn /usr/local/go/src/runtime/panic.go
  0.04mins   0.7% 66.65%   1.35mins 21.86%  runtime.deltimer /usr/local/go/src/runtime/time.go
  0.04mins  0.67% 67.32%   0.12mins  1.92%  runtime.pcvalue /usr/local/go/src/runtime/symtab.go
  0.04mins  0.61% 67.93%   0.04mins  0.64%  runtime.siftdownTimer /usr/local/go/src/runtime/time.go
  0.04mins   0.6% 68.53%   0.23mins  3.74%  runtime.gentraceback /usr/local/go/src/runtime/traceback.go
  0.04mins  0.59% 69.11%   0.04mins  0.59%  runtime.casgstatus /usr/local/go/src/runtime/proc.go
  0.04mins  0.57% 69.69%   0.04mins  0.57%  runtime.memmove /usr/local/go/src/runtime/memmove_amd64.s

The top CPU consumers are now runtime.futex, the syscall.Syscall system-call path, and runtime functions; heapBitsForObject, scanobject and greyobject are Go garbage-collector functions, which means GC itself is already consuming a large share of the CPU.

Cross-checking with the memory pprof:

cfc4n@cnxct:~/Desktop$ go tool pprof 10.3.5.52.txt
File: etcd
Type: inuse_space
Time: Jul 3, 2018 at 4:57pm (CST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top 20
Showing nodes accounting for 10665.34MB, 94.72% of 11259.98MB total
Dropped 368 nodes (cum <= 56.30MB)
Showing top 20 nodes out of 57
      flat  flat%   sum%        cum   cum%
 7156.71MB 63.56% 63.56%  7156.71MB 63.56%  github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc/transport.(*recvBuffer).put /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc/transport/transport.go
 2258.28MB 20.06% 83.61%  9417.49MB 83.64%  github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc/transport.(*serverHandlerTransport).HandleStreams.func4 /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc/transport/handler_server.go
  663.28MB  5.89% 89.51%   663.28MB  5.89%  net/http.http2putRequestBodyBuf /usr/local/go/src/net/http/h2_bundle.go
  277.15MB  2.46% 91.97%   277.15MB  2.46%  runtime.makechan /usr/local/go/src/runtime/chan.go
   66.11MB  0.59% 92.55%    66.11MB  0.59%  crypto/tls.(*block).reserve /usr/local/go/src/crypto/tls/conn.go
   65.24MB  0.58% 93.13%    68.74MB  0.61%  net/http.(*Transport).dialConn /usr/local/go/src/net/http/transport.go
   56.52MB   0.5% 93.64%    56.52MB   0.5%  runtime.malg /usr/local/go/src/runtime/proc.go
   54.52MB  0.48% 94.12%    85.91MB  0.76%  runtime.mapassign /usr/local/go/src/runtime/hashmap.go
   23.53MB  0.21% 94.33%   113.68MB  1.01%  crypto/tls.(*Conn).readHandshake /usr/local/go/src/crypto/tls/conn.go
      15MB  0.13% 94.46%    74.58MB  0.66%  runtime.systemstack /usr/local/go/src/runtime/asm_amd64.s
      12MB  0.11% 94.57%      795MB  7.06%  net/http.(*http2Server).ServeConn /usr/local/go/src/net/http/h2_bundle.go
      11MB 0.098% 94.67%    65.23MB  0.58%  net/http.(*http2Framer).readMetaFrame /usr/local/go/src/net/http/h2_bundle.go
       2MB 0.018% 94.68%   990.71MB  8.80%  net/http.(*conn).serve /usr/local/go/src/net/http/server.go
       2MB 0.018% 94.70%    77.73MB  0.69%  net/http.(*http2serverConn).readFrames /usr/local/go/src/net/http/h2_bundle.go
       1MB 0.0089% 94.71%   721.45MB  6.41%  net/http.(*http2serverConn).newWriterAndRequest /usr/local/go/src/net/http/h2_bundle.go
       1MB 0.0089% 94.72%   755.99MB  6.71%  net/http.(*http2serverConn).serve /usr/local/go/src/net/http/h2_bundle.go
         0     0% 94.72%    64.32MB  0.57%  [etcd]
         0     0% 94.72%   190.21MB  1.69%  crypto/tls.(*Conn).Handshake /usr/local/go/src/crypto/tls/conn.go
         0     0% 94.72%    64.60MB  0.57%  crypto/tls.(*Conn).readRecord /usr/local/go/src/crypto/tls/conn.go
         0     0% 94.72%   190.21MB  1.69%  crypto/tls.(*Conn).serverHandshake /usr/local/go/src/crypto/tls/handshake_server.go

As shown above, about 64% of the in-use memory is held by (*recvBuffer).put:

// recvBuffer is an unbounded channel of recvMsg structs.
// Note recvBuffer differs from controlBuffer only in that recvBuffer
// holds a channel of only recvMsg structs instead of objects implementing "item" interface.
// recvBuffer is written to much more often than
// controlBuffer and using strict recvMsg structs helps avoid allocation in "recvBuffer.put"
type recvBuffer struct {
	c       chan recvMsg
	mu      sync.Mutex
	backlog []recvMsg
}

func newRecvBuffer() *recvBuffer {
	b := &recvBuffer{
		c: make(chan recvMsg, 1),
	}
	return b
}

func (b *recvBuffer) put(r recvMsg) {
	b.mu.Lock()
	if len(b.backlog) == 0 {
		select {
		case b.c <- r:
			b.mu.Unlock()
			return
		default:
		}
	}
	b.backlog = append(b.backlog, r)
	b.mu.Unlock()
}

The put method appends each received recvMsg to the b.backlog slice. Both (*recvBuffer).put and (*recvBuffer).load read and write b.backlog, so every read and write takes the mutex, burning CPU, hurting throughput, and raising system load. Worse, if (*recvBuffer).load cannot keep up with (*recvBuffer).put, the slice grows without bound, producing what looks like a memory leak.
So the question becomes: why is (*recvBuffer).put being called so heavily? Why is it writing so frequently?
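To make the failure mode concrete, here is a stripped-down sketch of the same put logic (miniBuffer is a stand-in for gRPC's recvBuffer, with int in place of recvMsg) showing how the backlog grows without bound when nothing drains the channel:

```go
package main

import (
	"fmt"
	"sync"
)

// miniBuffer mimics gRPC's recvBuffer: put delivers to a single-slot
// channel when possible and otherwise appends to a mutex-guarded backlog.
type miniBuffer struct {
	c       chan int
	mu      sync.Mutex
	backlog []int
}

func (b *miniBuffer) put(v int) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if len(b.backlog) == 0 {
		select {
		case b.c <- v: // fast path: hand directly to the reader
			return
		default:
		}
	}
	b.backlog = append(b.backlog, v) // slow path: queue it
}

func (b *miniBuffer) backlogLen() int {
	b.mu.Lock()
	defer b.mu.Unlock()
	return len(b.backlog)
}

func main() {
	b := &miniBuffer{c: make(chan int, 1)}
	// A producer that is never drained: everything after the first
	// message piles up in backlog, which is exactly the "memory leak"
	// shape seen in the heap profile.
	for i := 0; i < 1000; i++ {
		b.put(i)
	}
	fmt.Println(b.backlogLen()) // 999: one message in the channel, the rest queued
}
```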

Narrowing down the suspected memory leak:

grpc/server.go: func (s *Server) serveStreams(st transport.ServerTransport)
grpc/transport/handler_server.go, line 284: func (ht *serverHandlerTransport) HandleStreams, specifically around line 334:

	s.trReader = &transportReader{
		reader:        &recvBufferReader{ctx: s.ctx, recv: s.buf},
		windowHandler: func(int) {},
	}

	// readerDone is closed when the Body.Read-ing goroutine exits.
	readerDone := make(chan struct{})
	go func() {
		defer close(readerDone)

		// TODO: minimize garbage, optimize recvBuffer code/ownership
		const readSize = 8196
		for buf := make([]byte, readSize); ; {
			n, err := req.Body.Read(buf)
			if n > 0 {
				s.buf.put(recvMsg{data: buf[:n:n]})
				buf = buf[n:]
			}
			if err != nil {
				s.buf.put(recvMsg{err: mapRecvMsgError(err)})
				return
			}
			if len(buf) == 0 {
				buf = make([]byte, readSize)
			}
		}
	}()

which then reaches (*recvBuffer).put. In other words, the message backlog in the slice builds up because a large number of HTTP requests are arriving. The next step is to find out who is sending those HTTP requests.

Reading the etcd source:

Given the huge number of TCP connections in TIME-WAIT between the members and the leader, the investigation shifted to the member nodes.

There are many places in the etcd source where a member establishes TCP/HTTP connections, and I had no better way to localize the bug at first, so I worked through the source step by step:

- etcdmain/main.go, line 39: startEtcdOrProxyV2()
- etcdmain/etcd.go, line 103: startEtcd()
- etcdmain/etcd.go, line 186: embed.StartEtcd(cfg)
- embed/etcd.go, line 107: startPeerListeners(), which creates a Listener via rafthttp.NewListener for member-to-member communication, relying on the github.com/coreos/etcd/pkg/transport package
- embed/etcd.go, line 110: startClientListeners(), which creates a Listener via net.Listen for client-to-member communication; keepalive uses github.com/coreos/etcd/pkg/transport
- line 157: etcdserver.NewServer(), which reads local config and the database directory and sets up the mvcc multi-version functions, the lease renewal functions, and so on
- line 166: e.servePeers(), which starts Serve on the listener to accept requests
- line 317: etcdhttp.NewPeerHandler(e.Server), which installs the HTTP handlers listed below

As shown above, the following handlers are installed for communication between etcd nodes:

- "/" → http.NotFound
- "/raft", "/raft/" → raftHandler
- "/members" → mh
- "/leases/internal", "/leases" → raftHandler
- "/version" → versionHandler

So the bulk of inter-member traffic must be going through one of these handlers, unless I missed something in the source. Given what this product does, what members mostly need to synchronize in volume is data, so should the focus go on raftHandler? My head hurts...

raftHandler comes from s.RaftHandler(), whose value is the return value of *EtcdServer.r.transport.Handler().
*EtcdServer.r is assigned at line 414 of etcdserver/server.go:

r: *newRaftNode(
			raftNodeConfig{
				isIDRemoved: func(id uint64) bool { return cl.IsIDRemoved(types.ID(id)) },
				Node:        n,
				heartbeat:   heartbeat,
				raftStorage: s,
				storage:     NewStorage(w, ss),
			},
		),

At line 511 of etcdserver/server.go, *EtcdServer.r.transport = tr assigns a *rafthttp.Transport, whose Handler looks like this:

func (t *Transport) Handler() http.Handler {
	pipelineHandler := newPipelineHandler(t, t.Raft, t.ClusterID)
	streamHandler := newStreamHandler(t, t, t.Raft, t.ID, t.ClusterID)
	snapHandler := newSnapshotHandler(t, t.Raft, t.Snapshotter, t.ClusterID)
	mux := http.NewServeMux()
	mux.Handle(RaftPrefix, pipelineHandler)
	mux.Handle(RaftStreamPrefix+"/", streamHandler)
	mux.Handle(RaftSnapshotPrefix, snapHandler)
	mux.Handle(ProbingPrefix, probing.NewHandler())
	return mux
}

Several more handlers, all raft-related; the code is heavy going, and the author's intent is hard to follow.

Printing the http dial stack:

Finding the code that initiates these TCP connections by reading alone looked hopeless, so I changed approach: right where TCP connections are dialed, check whether addr is the etcd leader's address and, if so, print the call stack with debug.PrintStack() from runtime/debug. I also printed the post-connect remote.addr for cross-checking.

I rebuilt the etcd server with this patch, deployed it to one member node, and load-tested that node to reproduce the TIME-WAIT flood. Once it reappeared, the logs contained the running stack:

  0. net.(*Dialer).Dial() /usr/local/Cellar/go/1.9.1/libexec/src/net/dial.go:323
  1. github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/pkg/transport.(*rwTimeoutDialer).Dial() /Users/cfc4n/gopath/src/github.com/coreos/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/pkg/transport/timeout_dialer.go:29
  2. github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/pkg/transport.(*rwTimeoutDialer).Dial-fm() /Users/cfc4n/gopath/src/github.com/coreos/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/pkg/transport/timeout_transport.go:51
  3. net/http.(*Transport).dial() /usr/local/Cellar/go/1.9.1/libexec/src/net/http/transport.go:887
  4. net/http.(*Transport).dialConn() /usr/local/Cellar/go/1.9.1/libexec/src/net/http/transport.go:1060
  5. net/http.(*Transport).getConn.func4() /usr/local/Cellar/go/1.9.1/libexec/src/net/http/transport.go:943
  6. runtime.goexit() /usr/local/Cellar/go/1.9.1/libexec/src/runtime/asm_amd64.s:2337

During the load test I also sampled the member's TCP connections a few times with netstat, to cross-check the log output.
In the logs, every new TCP connection came with this stack plus the remote IP and port, and the five-tuples of the connections to the leader's port 2380 matched the remote.addr values printed above.
In other words, this is the function creating the flood of member-to-leader TCP connections. So with the dialing function and its call stack in hand, isn't the problem as good as found?

Unfortunately it is not that simple: the stack above only reaches coreos/etcd/pkg/transport.(*rwTimeoutDialer).Dial and never shows the higher-level callers. Why? Note frame 5: the dial runs on a fresh goroutine spawned by net/http's getConn, so the logical caller lives on a different goroutine's stack:

// src/net/http/transport.go, lines 886-887; note the t.Dial branch below
func (t *Transport) dial(ctx context.Context, network, addr string) (net.Conn, error) {
	if t.DialContext != nil {
		return t.DialContext(ctx, network, addr)
	}
	if t.Dial != nil {
		c, err := t.Dial(network, addr)
		if c == nil && err == nil {
			err = errors.New("net/http: Transport.Dial hook returned (nil, nil)")
		}
		return c, err
	}
	return zeroDialer.DialContext(ctx, network, addr)
}

Finding why the etcd member closes the TCP connection after 10s:

In the standard library's net/http/transport.go, (*Transport).dial calls the custom Dial method if the Transport defines one. That custom Dial is defined in github.com/coreos/etcd/pkg/transport/transport.go:

func NewTransport(info TLSInfo, dialtimeoutd time.Duration) (*http.Transport, error) {
	cfg, err := info.ClientConfig()
	if err != nil {
		return nil, err
	}

	t := &http.Transport{
		Proxy: http.ProxyFromEnvironment,
		Dial: (&net.Dialer{
			Timeout: dialtimeoutd,
			// value taken from http.DefaultTransport
			KeepAlive: 30 * time.Second,
		}).Dial,
		// value taken from http.DefaultTransport
		TLSHandshakeTimeout: 10 * time.Second,
		TLSClientConfig:     cfg,
	}

	dialer := (&net.Dialer{
		Timeout:   dialtimeoutd,
		KeepAlive: 30 * time.Second,
	})
	dial := func(net, addr string) (net.Conn, error) {
		return dialer.Dial("unix", addr)
	}

	tu := &http.Transport{
		Proxy:               http.ProxyFromEnvironment,
		Dial:                dial,
		TLSHandshakeTimeout: 10 * time.Second, // highlighted: the 10s here explains the 10s TLS timeout seen at the start
		TLSClientConfig:     cfg,
	}
	ut := &unixTransport{tu}

	t.RegisterProtocol("unix", ut)
	t.RegisterProtocol("unixs", ut)

	return t, nil
}

In dial := func(net, addr string) (net.Conn, error) { … }, a custom Dial method is defined and assigned to http.Transport.Dial, with the TLS timeout set via TLSHandshakeTimeout: 10 * time.Second. That is why, in the earlier captures, the member closed the connection after 10 seconds: the TLS handshake exceeded the configured timeout.
The only caller of the NewTransport function I could find is NewTimeoutTransport:

func NewTimeoutTransport(info TLSInfo, dialtimeoutd, rdtimeoutd, wtimeoutd time.Duration) (*http.Transport, error) {
	tr, err := NewTransport(info, dialtimeoutd)
	if err != nil {
		return nil, err
	}

	if rdtimeoutd != 0 || wtimeoutd != 0 {
		// the timed out connection will timeout soon after it is idle.
		// it should not be put back to http transport as an idle connection for future usage.
		tr.MaxIdleConnsPerHost = -1
	} else {
		// allow more idle connections between peers to avoid unnecessary port allocation.
		tr.MaxIdleConnsPerHost = 1024
	}

	tr.Dial = (&rwTimeoutDialer{
		Dialer: net.Dialer{
			Timeout:   dialtimeoutd,
			KeepAlive: 30 * time.Second,
		},
		rdtimeoutd: rdtimeoutd,
		wtimeoutd:  wtimeoutd,
	}).Dial
	return tr, nil
}

So finding the callers of NewTimeoutTransport should pinpoint the code issuing all those TCP connections... but the earlier stack dumps never showed those frames, so another approach was needed.

Note that NewTimeoutTransport also sets tr.MaxIdleConnsPerHost = 1024, the maximum number of idle HTTP keep-alive connections Go's http client may hold per host. etcd raises it to 1024 so that idle member-to-leader connections can be kept around for reuse. (An aside: some say HTTP is the best RPC; I disagree, if only because of connection reuse. With HTTP/1.1 keep-alive, a connection can only be reused after the previous response completes, serializing request/response pairs; HTTP/2 does not wait for the previous request and offers true multiplexing, see Streams and Multiplexing. If you have ever implemented request/response matching for RPC over raw TCP, the difference between HTTP/1.1 and HTTP/2 connection reuse will be immediately clear.)
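The effect of MaxIdleConnsPerHost can be demonstrated with a counting dialer (a standalone sketch; countDials is an illustrative helper, and -1 is the value NewTimeoutTransport uses when read/write timeouts are set, which disables the idle pool entirely):

```go
package main

import (
	"fmt"
	"io"
	"net"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// countDials issues n sequential requests through a transport and returns
// how many TCP connections were actually dialed. With a negative
// MaxIdleConnsPerHost no idle connection is ever kept, so every request
// pays for a fresh dial (and, over TLS, a fresh handshake).
func countDials(maxIdlePerHost, n int) int32 {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		io.WriteString(w, "ok")
	}))
	defer srv.Close()

	var dials int32
	tr := &http.Transport{
		MaxIdleConnsPerHost: maxIdlePerHost,
		Dial: func(network, addr string) (net.Conn, error) {
			atomic.AddInt32(&dials, 1) // count every real TCP dial
			return net.Dial(network, addr)
		},
	}
	client := &http.Client{Transport: tr}
	for i := 0; i < n; i++ {
		resp, err := client.Get(srv.URL)
		if err != nil {
			panic(err)
		}
		io.Copy(io.Discard, resp.Body) // drain so the conn can be reused
		resp.Body.Close()
	}
	return atomic.LoadInt32(&dials)
}

func main() {
	fmt.Println("dials with an idle pool:", countDials(2, 5))  // connection reused
	fmt.Println("dials without a pool:  ", countDials(-1, 5)) // one dial per request
}
```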

Examining the traffic between etcd members and the leader:

Back on topic: I kept trying to pin this down by reading code, but the etcd source is complex and I got stuck at the raft-protocol data exchange between nodes. Time for a different route: since we had already removed TLS between members to relieve the CPU, we could now capture plaintext traffic on port 2380 on a member and see exactly what was being sent.

The capture shows the member POSTing to /leases on the leader at a very high rate. Searching for that URI leads to the code issuing the HTTP request, at line 252 of etcdserver/v3_server.go:

func (s *EtcdServer) LeaseRenew(ctx context.Context, id lease.LeaseID) (int64, error) {
	ttl, err := s.lessor.Renew(id)
	if err == nil { // already requested to primary lessor(leader)
		return ttl, nil
	}
	if err != lease.ErrNotPrimary {
		return -1, err
	}

	cctx, cancel := context.WithTimeout(ctx, s.Cfg.ReqTimeout())
	defer cancel()

	// renewals don't go through raft; forward to leader manually
	for cctx.Err() == nil && err != nil {
		leader, lerr := s.waitLeader(cctx)
		if lerr != nil {
			return -1, lerr
		}
		for _, url := range leader.PeerURLs {
			lurl := url + leasehttp.LeasePrefix
			ttl, err = leasehttp.RenewHTTP(cctx, id, lurl, s.peerRt)
			if err == nil || err == lease.ErrLeaseNotFound {
				return ttl, err
			}
		}
	}
	return -1, ErrTimeout
}

This code calls leasehttp.RenewHTTP against leader.PeerURLs, i.e. it forwards the renewal to the leader node. Following into leasehttp.RenewHTTP:

func RenewHTTP(ctx context.Context, id lease.LeaseID, url string, rt http.RoundTripper) (int64, error) {
	// will post lreq protobuf to leader
	lreq, err := (&pb.LeaseKeepAliveRequest{ID: int64(id)}).Marshal()
	if err != nil {
		return -1, err
	}

	cc := &http.Client{Transport: rt}
	req, err := http.NewRequest("POST", url, bytes.NewReader(lreq))
	if err != nil {
		return -1, err
	}
	req.Header.Set("Content-Type", "application/protobuf")
	req.Cancel = ctx.Done()

	resp, err := cc.Do(req)
	if err != nil {
		return -1, err
	}
	// ... rest of RenewHTTP (response decoding) omitted
}

As you can see, the request goes out through the standard library's http.Client.Do, with http.Client.Transport set to the rt http.RoundTripper passed in.

Locating the call chain behind the 10s TLS timeout:

Going back up one level in the code:

- The s.peerRt argument comes from *EtcdServer.peerRt, assigned peerRt: prt at line 429 of etcdserver/server.go.
- The prt variable is assigned at line 288 of the same file: prt, err := rafthttp.NewRoundTripper(cfg.PeerTLSInfo, cfg.peerDialTimeout()).
- rafthttp.NewRoundTripper calls rafthttp.NewTimeoutTransport.
- And rafthttp.NewTimeoutTransport calls rafthttp.NewTransport, the very function the earlier stack dumps showed creating the flood of TCP connections, and the place where the 10s TLS timeout lives.

Now everything connects: the call chain creating the flood of TCP connections is confirmed, so the direction is right. What remains is to follow the thread and find out why so many requests are issued.
Back to the callers of LeaseRenew: func (s *EtcdServer) LeaseRenew is called from around line 88 of github.com/coreos/etcd/etcdserver/api/v3rpc/lease.go:

func (ls *LeaseServer) leaseKeepAlive(stream pb.Lease_LeaseKeepAliveServer) error {
	for {
		req, err := stream.Recv()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			if isClientCtxErr(stream.Context().Err(), err) {
				plog.Debugf("failed to receive lease keepalive request from gRPC stream (%q)", err.Error())
			} else {
				plog.Warningf("failed to receive lease keepalive request from gRPC stream (%q)", err.Error())
			}
			return err
		}

		// Create header before we sent out the renew request.
		// This can make sure that the revision is strictly smaller or equal to
		// when the keepalive happened at the local server (when the local server is the leader)
		// or remote leader.
		// Without this, a lease might be revoked at rev 3 but client can see the keepalive succeeded
		// at rev 4.
		resp := &pb.LeaseKeepAliveResponse{ID: req.ID, Header: &pb.ResponseHeader{}}
		ls.hdr.fill(resp.Header)

		ttl, err := ls.le.LeaseRenew(stream.Context(), lease.LeaseID(req.ID))
		if err == lease.ErrLeaseNotFound {
			err = nil
			ttl = 0
		}

		if err != nil {
			return togRPCError(err)
		}

		resp.TTL = ttl
		err = stream.Send(resp)
		if err != nil {
			if isClientCtxErr(stream.Context().Err(), err) {
				plog.Debugf("failed to send lease keepalive response to gRPC stream (%q)", err.Error())
			} else {
				plog.Warningf("failed to send lease keepalive response to gRPC stream (%q)", err.Error())
			}
			return err
		}
	}
}
- At line 71, func (ls *LeaseServer) LeaseKeepAlive calls func (ls *LeaseServer) leaseKeepAlive().
- At line 3417 of etcdserver/etcdserverpb/rpc.pb.go, func _Lease_LeaseKeepAlive_Handler(srv interface{}, stream grpc.ServerStream) calls func (ls *LeaseServer) LeaseKeepAlive.
- At line 3480 of rpc.pb.go, _Lease_LeaseKeepAlive_Handler is set as the handler for StreamName: "LeaseKeepAlive", and is registered with the listener at startup via (*grpc.Server).RegisterService.
- leaseKeepAlive is an event loop: whatever it receives from stream.Recv(), it forwards to the leader node.
- stream.Recv() lives in google.golang.org/grpc/stream.go; this is not etcd's own code but another open-source project, meaning yet more code to read and digest.

After much reading, it turns out the messages received here all come from clients. The variable naming actually hints at this: identifiers on the client-to-member path contain stream, while the member-to-leader (peer-to-peer) path uses peer. Also worth noting: clients talk to etcd nodes over HTTP/2 with grpc+protobuf message encoding, while etcd members talk to each other over HTTP/1.1, with protobuf encoding as well.

When etcd renews a lease:

So the flood of lease renewal requests originates from the clients. In our case the host lease TTL is set to 10s; could something be wrong there? Who sends the client's lease renewal, the client itself? How often? One second before expiry?

In the etcd client library, the renewal code is at line 474 of clientv3/lease.go:

	// send update to all channels
	nextKeepAlive := time.Now().Add((time.Duration(karesp.TTL) * time.Second) / 3.0)
	ka.deadline = time.Now().Add(time.Duration(karesp.TTL) * time.Second)
	for _, ch := range ka.chs {
		select {
		case ch <- karesp:
			ka.nextKeepAlive = nextKeepAlive
		default:
		}
	}

The nextKeepAlive variable is set to one third of the lease TTL and marks when the next renewal should fire; sendKeepAliveLoop then loops every 500ms, checking whether a renewal is due. In our scenario, a 10s TTL means a renewal roughly every 3s. With 30,000 clients that is about 10,000 requests per second, while our earlier load tests (and the official benchmarks) showed the cluster sustaining 40,000+ qps, so it should have coped; that remained a puzzle. Still, the analysis so far said: many renewals cause many TCP connections, many TCP connections cause heavy object allocation on the leader, and Go's GC then burns large amounts of CPU. Worth optimizing regardless, so we raised the TTL to 600s (10 minutes), meaning a renewal every 200 seconds.

Raising the lease TTL to reduce cluster load:

We rolled the adjusted version out to 6,000+ hosts in the test environment and watched the effect. Sure enough, memory dropped noticeably after the rollout, and the CPU metrics improved too, but an hour later the metrics climbed right back up...

The problem persists:

Why? Was the hour of lower load just an artifact of the rollout itself? Did the lease change help at all? I had tested the change before shipping it, test report and all ("agent etcd lease TTL adjustment"): renewals were happening every 200s. So what was going on? To confirm, I captured packets on one of the members again.

The capture indeed showed a flood of lease packets. So: what is the interval between renewals for each lease ID? Is it the expected 200 seconds? Are these packets all for the same lease ID? How many distinct lease IDs are there?

Analyzing the tcpdump pcapng file:

From the earlier source reading we know that messages between etcd nodes are protobuf-encoded, so we can copy the HTTP body bytes out and decode them with the matching protobuf struct. But between 2018-07-05 16:38:31.015981 and 2018-07-05 16:43:20.420401, a span of about five minutes, there were 1.03 million packets; tallying them by hand was impossible. So, with the pcap package, I wrote a program to walk the entire tcpdump pcapng capture, filter for the application/protobuf content type, extract the HTTP body, parse it as a pb.LeaseKeepAliveRequest, and count the total and unique lease IDs.


The result: 77,550 lease IDs in total, 1,382 unique lease IDs.

These numbers mean the same lease IDs were sending renewal requests over and over. Another quick program pulled those lease IDs out and queried the cluster for their remaining TTL, expecting values in the 400-600 second range; the answer was always 599 seconds. Taking one lease ID and fetching its TTL once per second: the TTL was 599s every single time, with grantTTL 600, meaning the etcd members were renewing these leases nonstop. Since we had established that member-side renewals are driven by client keepalive requests, the focus moved to the client.

Sampling one client and capturing its traffic confirmed it: the client was sending /etcdserverpb.Lease/LeaseKeepAlive requests every second. Very strange, and completely unlike my earlier tests, where a renewal went out every 200s. Why had it become so frequent?

Back to the etcd clientv3 code:

With such a strange symptom, the only way forward was the source again, focused on the /etcdserverpb.Lease/LeaseKeepAlive path. The loop that decides whether a lease needs renewal starts in sendKeepAliveLoop:

// sendKeepAliveLoop sends keep alive requests for the lifetime of the given stream.
func (l *lessor) sendKeepAliveLoop(stream pb.Lease_LeaseKeepAliveClient) {
	for {
		var tosend []LeaseID

		now := time.Now()
		l.mu.Lock()
		for id, ka := range l.keepAlives {
			if ka.nextKeepAlive.Before(now) {
				tosend = append(tosend, id)
			}
		}
		l.mu.Unlock()

		for _, id := range tosend {
			r := &pb.LeaseKeepAliveRequest{ID: int64(id)}
			if err := stream.Send(r); err != nil {
				// TODO do something with this error?
				return
			}
		}

		select {
		case <-time.After(500 * time.Millisecond):
		case <-stream.Context().Done():
			return
		case <-l.donec:
			return
		case <-l.stopCtx.Done():
			return
		}
	}
}

As the code shows, the loop wakes every 500ms, walks the l.keepAlives map, and for every LeaseID whose nextKeepAlive has passed, the client sends a LeaseKeepAliveRequest. nextKeepAlive is assigned in the recvKeepAlive function of clientv3/lease.go, at line 473:

// send update to all channels
nextKeepAlive := time.Now().Add((time.Duration(karesp.TTL) * time.Second) / 3.0)
ka.deadline = time.Now().Add(time.Duration(karesp.TTL) * time.Second)
for _, ch := range ka.chs {
	select {
	case ch <- karesp:
		ka.nextKeepAlive = nextKeepAlive
	default:
	}
}

Yes, we have looked at this code before, but back then only at the TTL arithmetic. Look closely at ka.nextKeepAlive = nextKeepAlive: the next renewal time is updated only inside the successful case ch <- karesp branch of the select. If the channel ch is already full, the default branch runs instead and ka.nextKeepAlive keeps its old value, a time that has already passed, so every 500ms iteration of sendKeepAliveLoop considers the lease due and sends yet another renewal. That matched our agent exactly: it called KeepAlive but never read from the returned response channel, so once the channel's buffer filled up, each client renewed its lease on every 500ms tick, forever, and multiplied across all the hosts, that is precisely the flood of /leases renewals the leader was drowning in.
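The mechanism can be isolated in a few lines (a sketch with int standing in for *LeaseKeepAliveResponse, and a buffer of 1 in place of clientv3's actual channel size): once the consumer stops draining the channel, every subsequent response takes the default branch and the schedule never advances:

```go
package main

import (
	"fmt"
	"time"
)

// keepAlive mirrors the relevant fields of clientv3's keepAlive struct.
type keepAlive struct {
	chs           []chan int
	nextKeepAlive time.Time
}

// recvResponse mimics the update step in recvKeepAlive: nextKeepAlive is
// advanced only inside the successful channel-send case. If the consumer
// never drains ch, the default branch is taken and nextKeepAlive stays in
// the past, so the 500ms loop resends a keepalive on every single tick.
func (ka *keepAlive) recvResponse(ttl int64) (updated bool) {
	next := time.Now().Add(time.Duration(ttl) * time.Second / 3)
	for _, ch := range ka.chs {
		select {
		case ch <- int(ttl):
			ka.nextKeepAlive = next
			updated = true
		default: // channel full: response dropped, schedule NOT advanced
		}
	}
	return updated
}

func main() {
	ka := &keepAlive{chs: []chan int{make(chan int, 1)}}

	fmt.Println("first response updates schedule:", ka.recvResponse(600)) // true
	// Nobody reads from the channel, so every later response is dropped
	// and the next-keepalive time is never pushed forward again.
	for i := 0; i < 3; i++ {
		fmt.Println("later response updates schedule:", ka.recvResponse(600)) // false
	}
}
```

The fix on our side was simply to consume the channel returned by KeepAlive so the buffer never fills.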

Source:

Author: CFC4N
Link: https://www.cnxct.com/etcd-lease-keepalive-debug-note/