
The kubectl-debug plugin for debugging Kubernetes pods



At the end of last year, a kubectl plugin for debugging pods in Kubernetes clusters, kubectl-debug, was announced on Reddit. The idea immediately looked interesting and useful to our engineers, so we decided to take a look at its implementation and are happy to share the results with Habr readers.

Why is it even needed?


Right now there is a serious inconvenience in debugging anything inside pods. The main goal when building a container image is to minimize it, i.e. to make it as small as possible and to keep as little "extra" inside as possible. But when it comes to problems with the software running in those containers, or to debugging its communication with other services in the cluster or outside of it, that minimalism plays a cruel joke on us: there is simply nothing in the container to troubleshoot with. Utilities such as netstat, ip, ping, curl, wget, etc. are usually not available.

And it often ends with the engineer hastily installing the required software right into the running container just to "see the light" and pin down the problem. It is exactly for such cases that the kubectl-debug plugin looked like a very useful tool, because it spares us that pain.

With it, you can launch a container with all the necessary tools on board in the context of the problem pod and examine all of its processes "from the side" while being inside it. If you have ever had to troubleshoot anything in Kubernetes, that sounds attractive, doesn't it?

What is this plugin?


In general terms, the solution's architecture is a combination of a kubectl plugin and an agent deployed via a DaemonSet controller. The plugin serves commands starting with kubectl debug … and talks to the agents on the cluster nodes. The agent, in turn, runs on the host network, and the host's docker.sock is mounted into the agent's pod for full access to the containers on that node.
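
As a rough illustration of this layout, the agent's DaemonSet boils down to something like the sketch below. The image tag and port here are illustrative; the actual manifest is the agent_daemonset.yml provided in the project's README:

  apiVersion: apps/v1
  kind: DaemonSet
  metadata:
    name: debug-agent
    namespace: default
  spec:
    selector:
      matchLabels:
        app: debug-agent
    template:
      metadata:
        labels:
          app: debug-agent
      spec:
        hostNetwork: true               # the agent shares the node's network namespace
        containers:
        - name: debug-agent
          image: aylei/debug-agent:latest   # illustrative tag, take the one from the README
          securityContext:
            privileged: true
          ports:
          - containerPort: 10027        # illustrative port the plugin talks to
            hostPort: 10027
          volumeMounts:
          - name: docker
            mountPath: /var/run/docker.sock   # host Docker socket for full access to containers
        volumes:
        - name: docker
          hostPath:
            path: /var/run/docker.sock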

Accordingly, when you ask to launch a debug container in a given pod, the plugin determines the hostIP of the node the pod runs on and sends a request to the agent running on that node to start the debug container in the namespaces of the target pod.

A more detailed description of these steps is available in the project documentation.

What is required for it to work?


The author of kubectl-debug claims compatibility with Kubernetes client/cluster versions 1.12.0+, but I had K8s 1.10.8 at hand, and everything worked on it without visible problems... with one caveat: for the kubectl debug command to work in exactly that form, kubectl itself must be version 1.12+. Otherwise all the commands are the same, but can only be invoked via kubectl-debug …
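
In practice this only changes how the command is spelled (the pod name and namespace below are placeholders):

  # kubectl 1.12+ discovers the plugin, so the subcommand form works:
  kubectl debug --namespace my-namespace my-pod

  # with an older kubectl, call the plugin binary directly:
  kubectl-debug --namespace my-namespace my-pod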

When deploying the DaemonSet template described in the README, do not forget about the taints you use on your nodes: without the corresponding tolerations, the agent's pods will not be scheduled there and, as a result, you will not be able to attach a debugger to the pods living on those nodes.
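
For example, if some nodes carry a dedicated taint, the agent's pod template needs a matching toleration. The taint key and effect below are hypothetical, substitute the ones actually used in your cluster:

  spec:
    tolerations:
    - key: "node-role.example.com/monitoring"   # hypothetical taint applied to the nodes
      operator: "Exists"
      effect: "NoSchedule"
    # ...the rest of the agent pod spec stays as in the README template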

The debugger's help output is quite complete and seems to describe all of its current launch and configuration options. In general, the utility pleases with a large number of startup directives: you can pass certificates, specify the kubectl context, point it at a separate kubectl config or at the cluster's API server address, and more.
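
For instance, a call against a separate kubeconfig and context could look roughly like this. The flag names mirror the standard kubectl connection options and the paths are made up; check kubectl-debug --help for the exact set supported by your version:

  kubectl-debug --kubeconfig ~/.kube/staging.conf \
                --context staging-cluster \
                --namespace kube-prometheus \
                prometheus-main-0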

Working with the debugger


Installation, up to the point where "everything works", comes down to two steps:

  1. execute kubectl apply -f agent_daemonset.yml ;
  2. install the plugin itself (a rough sketch of this step is shown below) - in general, everything is as described here .
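
A minimal sketch of the second step on a Linux workstation, assuming a prebuilt binary is published on the project's GitHub releases page; the version number and archive name here are illustrative, see the README for the current ones:

  # illustrative release version and archive name
  export RELEASE=0.1.1
  curl -LO https://github.com/aylei/kubectl-debug/releases/download/v${RELEASE}/kubectl-debug_${RELEASE}_linux_amd64.tar.gz
  tar -zxvf kubectl-debug_${RELEASE}_linux_amd64.tar.gz kubectl-debug
  # put the binary somewhere on PATH so kubectl can discover it as a plugin
  sudo mv kubectl-debug /usr/local/bin/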

How do we use it? Suppose we have the following problem: metrics from one of the services in the cluster are not being collected, and we want to check whether there are any network problems between Prometheus and the target service. As you might guess, the Prometheus image lacks the required tools.

Let's try to attach to the container with Prometheus (if there are several containers in the pod, you will need to specify which one to attach to, otherwise the debugger picks the first one by default):

  kubectl-debug --namespace kube-prometheus prometheus-main-0
  Defaulting container name to prometheus.
  pulling image nicolaka/netshoot:latest...
  latest: Pulling from nicolaka/netshoot
  4fe2ade4980c: Already exists
  ad6ddc9cd13b: Pull complete
  cc720038bf2b: Pull complete
  ff17a2bb9965: Pull complete
  6fe9f5dade08: Pull complete
  d11fc7653a2e: Pull complete
  4bd8b4917a85: Pull complete
  2bd767dcee18: Pull complete
  Digest: sha256:897c19b0b79192ee5de9d7fb40d186aae3c42b6e284e71b93d0b8f1c472c54d3
  Status: Downloaded newer image for nicolaka/netshoot:latest
  starting debug container...
  container created, open tty...
  [1] → root @ /

Earlier we found out that the problem service lives at the address 10.244.1.214 and listens on port 8080. Of course, we could check availability from the hosts, but for reliable debugging these operations should be reproduced in identical (or as close as possible) conditions. Therefore, checking from the pod/container with Prometheus is the best option. Let's start with something simple:

  [1] → ping 10.244.1.214
  PING 10.244.1.214 (10.244.1.214) 56(84) bytes of data.
  64 bytes from 10.244.1.214: icmp_seq=1 ttl=64 time=0.056 ms
  64 bytes from 10.244.1.214: icmp_seq=2 ttl=64 time=0.061 ms
  64 bytes from 10.244.1.214: icmp_seq=3 ttl=64 time=0.047 ms
  64 bytes from 10.244.1.214: icmp_seq=4 ttl=64 time=0.049 ms
  ^C
  --- 10.244.1.214 ping statistics ---
  4 packets transmitted, 4 received, 0% packet loss, time 61ms
  rtt min/avg/max/mdev = 0.047/0.053/0.061/0.007 ms

All is well. Maybe the port is unavailable?

  [1] → curl -I 10.244.1.214:8080
  HTTP/1.1 200 OK
  Date: Sat, 12 Jan 2019 14:01:29 GMT
  Content-Length: 143
  Content-Type: text/html; charset=utf-8

No problems there either. Then let's check whether the actual communication between Prometheus and the endpoint with metrics takes place:

  [2] → tcpdump host 10.244.1.214
  tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
  listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
  14:04:19.234101 IP prometheus-main-0.prometheus-operated.kube-prometheus.svc.cluster.local.36278 > 10.244.1.214.8080: Flags [P.], seq 4181259750:4181259995, ack 2078193552, win 1444, options [nop,nop,TS val 3350532304 ecr 1334757657], length 245: HTTP: GET /metrics HTTP/1.1
  14:04:19.234158 IP 10.244.1.214.8080 > prometheus-main-0.prometheus-operated.kube-prometheus.svc.cluster.local.36278: Flags [.], ack 245, win 1452, options [nop,nop,TS val 1334787600 ecr 3350532304], length 0
  14:04:19.290904 IP 10.244.1.214.8080 > prometheus-main-0.prometheus-operated.kube-prometheus.svc.cluster.local.36278: Flags [P.], seq 1:636, ack 245, win 1452, options [nop,nop,TS val 1334787657 ecr 3350532304], length 635: HTTP: HTTP/1.1 200 OK
  14:04:19.290923 IP prometheus-main-0.prometheus-operated.kube-prometheus.svc.cluster.local.36278 > 10.244.1.214.8080: Flags [.], ack 636, win 1444, options [nop,nop,TS val 3350532361 ecr 1334787657], length 0
  ^C
  4 packets captured
  4 packets received by filter
  0 packets dropped by kernel

Requests go out and responses come back. Based on these operations we can conclude that there are no problems at the level of network interaction, which (most likely) means we need to look at the application side. We attach to the exporter's container (also with the debugger in question, of course, because exporter images are always extremely minimalistic) and... discover, to our surprise, a problem in the service's configuration: for example, someone forgot to point the exporter at the correct address of the target application. Case solved!
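
Attaching to a specific container of the exporter's pod looks the same as before. The pod and container names below are made up, and the container is picked explicitly so the debugger does not default to the first one; check kubectl-debug --help to confirm how your version exposes this option:

  kubectl-debug --namespace monitoring my-exporter-pod-0 --container exporter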

Of course, other ways of debugging are possible in the situation described here, but we will leave them outside the scope of this article. The bottom line is that kubectl-debug offers plenty of room for use: after all, you can run absolutely any image as the debug container, and, if you wish, even build your own specific one (with the necessary set of tools).
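
For example, instead of the default nicolaka/netshoot image (the one pulled in the session above) you can point the debugger at your own image. The image name below is made up, and the --image flag is the way the debugger's help output exposed this choice in the version we tried:

  kubectl-debug --namespace kube-prometheus prometheus-main-0 \
                --image registry.example.com/ops/debug-tools:latest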

What other use cases immediately come to mind?


In general, it is obvious that there are many more situations in which such a tool can be useful. Engineers who run into them in their daily work will be able to appreciate the utility's potential for "live" debugging.

Conclusions


Kubectl-debug is a useful and promising tool. Of course, there are Kubernetes clusters and applications for which it does not make much sense, but more often than not it will provide invaluable help in debugging, especially when it comes to a production environment and the need to quickly find the cause of a problem.

The first experience of using it revealed an acute need to connect to a pod/container that does not start up fully (for example, one "stuck" in CrashLoopBackOff), precisely in order to check on the fly why the application fails to launch. On this occasion I created a corresponding issue in the project repository, to which the developer responded positively and promised an implementation in the near future. Very pleased with the fast and adequate feedback. So we will look forward to new features of the utility and its further development!


Source: https://habr.com/ru/post/436112/