[...] tc and netfilter. Simple network setups can be reproduced using network namespaces and veth devices, while more complex setups require connecting virtual machines with a software bridge and using standard network tools, such as iptables or tc, to simulate the faulty behavior. If there is a problem with ICMP replies generated when the SSH server is down, running iptables -A INPUT -p tcp --dport 22 -j REJECT --reject-with icmp-host-unreachable in the correct namespace can help solve the problem.

Neither iptables
nor nftables could manipulate TCP flags, while tc could, but it can only overwrite the flags, which breaks new connections and TCP in general.

[...] iptables, conntrack and tc, but I decided that this was a great job for eBPF.

[...] tc
to a qdisc on ingress or egress.

    SEC("prog")
    int xdp_main(struct xdp_md *ctx)
    {
        void *data_end = (void *)(uintptr_t)ctx->data_end;
        void *data = (void *)(uintptr_t)ctx->data;
        struct ethhdr *eth = data;
        struct iphdr *iph = (struct iphdr *)(eth + 1);
        struct icmphdr *icmph = (struct icmphdr *)(iph + 1);

        /* sanity check needed by the eBPF verifier */
        if ((void *)(icmph + 1) > data_end)
            return XDP_PASS;

        /* pass through anything that is not a pong packet */
        if (eth->h_proto != ntohs(ETH_P_IP) ||
            iph->protocol != IPPROTO_ICMP ||
            icmph->type != ICMP_ECHOREPLY)
            return XDP_PASS;

        if (iph->ttl) {
            /* save the old TTL to recalculate the checksum */
            uint16_t *ttlproto = (uint16_t *)&iph->ttl;
            uint16_t old_ttlproto = *ttlproto;

            /* set the TTL to a pseudorandom number 1 <= x <= TTL */
            iph->ttl = bpf_get_prandom_u32() % iph->ttl + 1;

            /* recalculate the checksum; otherwise, the IP stack will drop it */
            csum_replace2(&iph->check, old_ttlproto, *ttlproto);
        }

        return XDP_PASS;
    }

    char _license[] SEC("license") = "GPL";
The code above, stripped of include directives, helpers, and other optional code, is an XDP program that changes the TTL of received ICMP echo replies, namely pongs, to a random number. The main function receives an xdp_md structure, which contains two pointers, to the beginning and the end of the packet.

    $ clang -O2 -target bpf -c xdp_manglepong.c -o xdp_manglepong.o

    $ readelf -h xdp_manglepong.o
    ELF Header:
      Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
      Class:                             ELF64
      Data:                              2's complement, little endian
      Version:                           1 (current)
      OS/ABI:                            UNIX - System V
      ABI Version:                       0
      Type:                              REL (Relocatable file)
      Machine:                           Linux BPF  <--- HERE
      [...]
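
The Machine field that readelf prints comes from the e_machine member of the ELF header. As a rough userspace sketch of the same check (the helper name is_bpf_object is hypothetical, and the code assumes a 64-bit object whose byte order matches the host), the header could be inspected like this:

```c
#include <assert.h>
#include <elf.h>
#include <stddef.h>
#include <string.h>

/* EM_BPF is 247 in the ELF machine registry; older headers may lack it. */
#ifndef EM_BPF
#define EM_BPF 247
#endif

/*
 * Hypothetical helper: returns 1 if buf holds the start of a 64-bit ELF
 * file whose e_machine is EM_BPF, 0 otherwise.
 * Assumption: the file uses the same byte order as the host.
 */
static int is_bpf_object(const unsigned char *buf, size_t len)
{
    Elf64_Ehdr hdr;

    assert(buf != NULL);
    if (len < sizeof(hdr))
        return 0;

    /* copy instead of casting to avoid unaligned access */
    memcpy(&hdr, buf, sizeof(hdr));

    /* check the "\177ELF" magic and the 64-bit class first */
    if (memcmp(hdr.e_ident, ELFMAG, SELFMAG) != 0)
        return 0;
    if (hdr.e_ident[EI_CLASS] != ELFCLASS64)
        return 0;

    return hdr.e_machine == EM_BPF;
}
```

Feeding it the first 64 bytes of a file such as xdp_manglepong.o should give 1, and 0 for an object built for the host CPU.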

[...] ip from the iproute2 package with the following syntax:

    # ip -force link set dev wlan0 xdp object xdp_manglepong.o verbose

    $ ping -c10 192.168.85.1
    PING 192.168.85.1 (192.168.85.1) 56(84) bytes of data.
    64 bytes from 192.168.85.1: icmp_seq=1 ttl=41 time=0.929 ms
    64 bytes from 192.168.85.1: icmp_seq=2 ttl=7 time=0.954 ms
    64 bytes from 192.168.85.1: icmp_seq=3 ttl=17 time=0.944 ms
    64 bytes from 192.168.85.1: icmp_seq=4 ttl=64 time=0.948 ms
    64 bytes from 192.168.85.1: icmp_seq=5 ttl=9 time=0.803 ms
    64 bytes from 192.168.85.1: icmp_seq=6 ttl=22 time=0.780 ms
    64 bytes from 192.168.85.1: icmp_seq=7 ttl=32 time=0.847 ms
    64 bytes from 192.168.85.1: icmp_seq=8 ttl=50 time=0.750 ms
    64 bytes from 192.168.85.1: icmp_seq=9 ttl=24 time=0.744 ms
    64 bytes from 192.168.85.1: icmp_seq=10 ttl=42 time=0.791 ms

    --- 192.168.85.1 ping statistics ---
    10 packets transmitted, 10 received, 0% packet loss, time 125ms
    rtt min/avg/max/mdev = 0.744/0.849/0.954/0.082 ms
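
The XDP program patches the IP checksum after changing the TTL instead of recomputing it over the whole header. The csum_replace2 helper itself is not shown in the article, so the following is only a userspace sketch of the general technique (the incremental update from RFC 1624), with hypothetical function names and a 20-byte IPv4 header without options assumed:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit accumulator into a 16-bit ones' complement sum. */
static uint16_t fold(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/* Full Internet checksum over len bytes (len must be even in this sketch). */
static uint16_t csum_full(const uint8_t *p, size_t len)
{
    uint32_t sum = 0;

    assert(len % 2 == 0);
    for (size_t i = 0; i < len; i += 2)
        sum += (uint32_t)(p[i] << 8 | p[i + 1]);
    return (uint16_t)~fold(sum);
}

/* RFC 1624, eqn. 3: HC' = ~(~HC + ~m + m') for one changed 16-bit word. */
static uint16_t csum_update16(uint16_t check, uint16_t old_word,
                              uint16_t new_word)
{
    uint32_t sum = (uint16_t)~check;

    sum += (uint16_t)~old_word;
    sum += new_word;
    return (uint16_t)~fold(sum);
}

/*
 * Rewrite the TTL (byte 8) of an IPv4 header and patch the checksum
 * (bytes 10-11). TTL and protocol share one 16-bit word, which is why
 * the XDP program saves and compares that whole word.
 */
static void set_ttl(uint8_t *iph, uint8_t ttl)
{
    uint16_t old_word = (uint16_t)(iph[8] << 8 | iph[9]);
    uint16_t check = (uint16_t)(iph[10] << 8 | iph[11]);
    uint16_t new_word;

    iph[8] = ttl;
    new_word = (uint16_t)(iph[8] << 8 | iph[9]);

    check = csum_update16(check, old_word, new_word);
    iph[10] = (uint8_t)(check >> 8);
    iph[11] = (uint8_t)(check & 0xff);
}
```

A header with a valid checksum sums to zero under csum_full, and it should still do so after set_ttl, which is the property the receiving IP stack verifies.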

Neither iptables nor tc could do this. Writing code for this scenario is a snap: set up two virtual machines connected by an OVS bridge, and simply attach an eBPF program to one of the VM's virtual devices.

[...] rx path of the receiving virtual machine will not have any effect on the switch.

[...] tc and attached to the VM's egress path, because tc can load and attach eBPF programs to a qdisc. To mark packets leaving the host, the eBPF program must be attached to the egress qdisc.

[...] XDP and tc
API: the default section names differ, the type of the main function's argument differs, and the return values differ. But that is not a problem. Below is a fragment of a program that marks TCP packets when it is attached as a tc action:

    #define RATIO 10

    SEC("action")
    int bpf_main(struct __sk_buff *skb)
    {
        void *data = (void *)(uintptr_t)skb->data;
        void *data_end = (void *)(uintptr_t)skb->data_end;
        struct ethhdr *eth = data;
        struct iphdr *iph = (struct iphdr *)(eth + 1);
        struct tcphdr *tcphdr = (struct tcphdr *)(iph + 1);

        /* sanity check needed by the eBPF verifier */
        if ((void *)(tcphdr + 1) > data_end)
            return TC_ACT_OK;

        /* skip non-TCP packets */
        if (eth->h_proto != __constant_htons(ETH_P_IP) ||
            iph->protocol != IPPROTO_TCP)
            return TC_ACT_OK;

        /* incompatible flags, or PSH already set */
        if (tcphdr->syn || tcphdr->fin || tcphdr->rst || tcphdr->psh)
            return TC_ACT_OK;

        if (bpf_get_prandom_u32() % RATIO == 0)
            tcphdr->psh = 1;

        return TC_ACT_OK;
    }

    char _license[] SEC("license") = "GPL";
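
The fragment reaches the flags through the bit-fields of struct tcphdr. For readers who want to try the decision logic outside the kernel, here is a minimal userspace sketch of the same steps on the raw flags byte. The function name maybe_set_psh and the FLAG_* macros are hypothetical, and the rnd parameter stands in for bpf_get_prandom_u32():

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TCP_FLAGS_OFF 13    /* offset of the flags byte in the TCP header */
#define FLAG_FIN 0x01
#define FLAG_SYN 0x02
#define FLAG_RST 0x04
#define FLAG_PSH 0x08

#define RATIO 10

/*
 * tcp points to a TCP header, len is the number of valid bytes behind it.
 * Returns 1 if PSH was set by this call, 0 if the segment was left alone.
 */
static int maybe_set_psh(uint8_t *tcp, size_t len, uint32_t rnd)
{
    uint8_t flags;

    assert(tcp != NULL);

    /* bounds check, the userspace analogue of the data_end test */
    if (len < 20)
        return 0;

    flags = tcp[TCP_FLAGS_OFF];

    /* incompatible flags, or PSH already set */
    if (flags & (FLAG_SYN | FLAG_FIN | FLAG_RST | FLAG_PSH))
        return 0;

    /* mark roughly one segment out of RATIO */
    if (rnd % RATIO != 0)
        return 0;

    tcp[TCP_FLAGS_OFF] = flags | FLAG_PSH;
    return 1;
}
```

Passing a pure ACK segment (flags byte 0x10) with an rnd value divisible by RATIO turns the flags byte into 0x18, while SYN, FIN, and RST segments are never modified.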

    $ clang -O2 -target bpf -c tcp_psh.c -o tcp_psh.o

    # tc qdisc add dev eth0 clsact
    # tc filter add dev eth0 egress matchall action bpf object-file tcp_psh.o

tcpdump confirms that the new eBPF code is working: about 1 in every 10 TCP packets has the PSH flag set. Only 20 lines of C code were needed to selectively mark TCP packets leaving a virtual machine and reproduce an error seen in the field, all without recompiling or even restarting anything. This greatly simplified verifying the Open vSwitch fix, which could not have been achieved with other tools.

Source: https://habr.com/ru/post/436528/