Conclusion: When network interrupt handling happens to land on a core that also handles softirqs, that core is overloaded. It cannot keep up with the work and drops packets. Because irqbalance is enabled, smp_affinity can change from boot to boot, so network interrupt handling may or may not land on the same core each time.
Solutions:
1. Check rps_cpus to see which CPUs handle softirqs.
cat /sys/class/net/<NIC>/queues/rx-0/rps_cpus
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00111111
The value is hexadecimal. Converted to binary, the last group is 100010001000100010001, so softirqs go to CPUs 0, 4, 8, 12, 16 and 20.
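The conversion can be checked with a few lines of Python. This is a minimal sketch: the helper name mask_to_cpus is made up for illustration, and it assumes the mask is a comma-separated list of zero-padded 32-bit hexadecimal groups with the lowest-numbered CPUs in the rightmost group, as in the output shown.

```python
def mask_to_cpus(mask):
    """Return the CPU numbers whose bits are set in a hex CPU mask.

    Accepts the comma-separated form printed by rps_cpus
    (for example "00000000,00111111") as well as a plain hex string.
    """
    # Dropping the commas joins the zero-padded 32-bit groups into one number.
    value = int(mask.replace(",", "").strip(), 16)
    cpus = []
    bit = 0
    while value:
        if value & 1:
            cpus.append(bit)
        value >>= 1
        bit += 1
    return cpus


rps_mask = "00000000,00000000,00000000,00000000,00000000,00000000,00000000,00111111"
print(mask_to_cpus(rps_mask))  # [0, 4, 8, 12, 16, 20]
```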
2. Monitor which CPUs the interrupts go to. As we are using InfiniBand, we grep for the Mellanox driver (mlx4).
cat /proc/interrupts
150: 3 69695238 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 IR-PCI-MSI-edge eth-mlx4-19:00.0-1
173: 3822 413 0 0 0 1576152 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 IR-PCI-MSI-edge mlx4-19:00.0-(async)
173: mlx4_core, handles low-level functions and hardware interrupts.
150: handles QP (queue pair) interrupts and runs the soft-interrupt threads.
We noticed that smp_affinity for IRQ 150 changes after a reboot.
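To see at a glance which CPU is servicing each IRQ, the rows of /proc/interrupts can be parsed. The sketch below is illustrative only: parse_irq_line and busiest_cpu are hypothetical helper names, and the sample rows are the two rows shown above with the trailing zero columns shortened.

```python
def parse_irq_line(line):
    """Split one /proc/interrupts row into (irq, per_cpu_counts, description)."""
    irq, rest = line.split(":", 1)
    tokens = rest.split()
    counts = []
    # The per-CPU counters come first; the first non-numeric token starts
    # the description (interrupt type and device name).
    while tokens and tokens[0].isdigit():
        counts.append(int(tokens.pop(0)))
    return irq.strip(), counts, " ".join(tokens)


def busiest_cpu(counts):
    """Index of the CPU that has serviced the most interrupts."""
    return max(range(len(counts)), key=counts.__getitem__)


# Rows copied from the output above, with trailing zero columns shortened.
sample = [
    "150: 3 69695238 0 0 0 2 0 0 IR-PCI-MSI-edge eth-mlx4-19:00.0-1",
    "173: 3822 413 0 0 0 1576152 0 0 IR-PCI-MSI-edge mlx4-19:00.0-(async)",
]
for row in sample:
    irq, counts, desc = parse_irq_line(row)
    print(f"IRQ {irq} ({desc}): mostly CPU {busiest_cpu(counts)}")
```

Running the same check before and after a reboot is one way to confirm whether the affinity of an IRQ has moved.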
3. To keep it fixed, we set the following in /etc/sysconfig/irqbalance.conf:
IRQ_AFFINITY_MASK=FFFFD1
This mask defines which CPUs to skip (1 means skip). Converted from hexadecimal to binary, it is
111111111111111111010001. It pins interrupt handling on socket 0 but skips CPUs 0 and 4: only bits 1, 2, 3 and 5 are clear, so only those CPUs receive interrupts.
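The mask arithmetic can be verified with a short Python sketch. The function name split_skip_mask is hypothetical, and the CPU count of 24 is an assumption taken from the 24-bit width of the mask.

```python
def split_skip_mask(mask_hex, ncpus):
    """Return (allowed, skipped) CPU lists for a 'skip these CPUs' hex mask."""
    value = int(mask_hex, 16)
    allowed, skipped = [], []
    for cpu in range(ncpus):
        # A set bit means interrupts should not be placed on this CPU.
        if (value >> cpu) & 1:
            skipped.append(cpu)
        else:
            allowed.append(cpu)
    return allowed, skipped


# ncpus=24 is an assumption taken from the 24-bit width of the mask.
allowed, skipped = split_skip_mask("FFFFD1", ncpus=24)
print(format(0xFFFFD1, "024b"))  # 111111111111111111010001
print("allowed:", allowed)       # allowed: [1, 2, 3, 5]
print("skipped:", skipped)
```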
Deep dive: interrupts, IRQs, softirqs and rps_cpus
What happens when the NIC receives a PDU (Protocol Data Unit)?
Reference: http://perso.hexabyte.tn/ichihi/projects/hpc_optimisations.pdf
When the NIC gets a PDU, it copies the PDU into kernel buffers using DMA (Direct Memory Access) and notifies the kernel of the arrival by raising a hard interrupt. The device driver (part of the kernel) handles the hard interrupt. Hard interrupt handlers perform minimal work and schedule the rest to be handled asynchronously by a softirq. Hard interrupt handlers cannot be preempted. Softirqs are processed as regular kernel code by special kernel threads. The kernel drops packets if it cannot pick them up from the NIC quickly enough.
IRQBalance
Reference: http://irqbalance.org/documentation.html
IRQBalance distributes interrupts among the CPUs. It runs in either performance mode or power save mode, depending on the load.
"As a first step, the interrupts are assigned to packages. In power save mode, all interrupts are assigned to the first package (to allow the other packages to stay in their respective sleep states longer), while in performance mode, irqbalance tries to spread the interrupts over the available packages as much as possible such that the cumulative amount of work that each package will do is made equal as much as possible. In addition, in performance mode, irqbalance will always distribute interrupts of the same class to different packages as long as there are not more interrupts in the class than there are packages. This is done to prevent one networking (or storage) interrupt from interfering with another interrupt in the same class via the sharing of resources that goes with being assigned to the same package. To make things even more complex, irqbalance also takes into account which package the interrupt was already assigned to previously and will try to keep the interrupt assigned to the same package (since there might be valuable data in the cache there already for the device). It also takes into account which packages are directly connected to the hardware in a NUMA system (since using a core with such a direct connection is going to be faster than using a core which has an indirect connection)."
"For the Networking interrupt class, it is essential that the interrupt goes to one and one core only. The implementation of the Linux TCP/IP stack will then use this property to get some major efficiencies in its operation. In addition, if an interrupt source of another class is very high rate, irqbalance will also assign this to a specific core in order to maximize the efficiency of the level 1 cache of this core. This assigning to the cores uses the same algorithm as the assignment to cache-domains."
In my case, hardware/firmware interrupts are handled by the CPU in the smp_affinity of IRQ 173, and they generate software interrupts. irqbalance checks the node (NUMA architecture, CPU, scheduler) and picks a CPU to send the software interrupts to (the smp_affinity of IRQ 150). That CPU pulls the interrupts from the hardware and classifies them, and then needs to hand them to the receiving processor. This is where RPS (Receive Packet Steering) comes in: it generates a hash value and spreads the software interrupts evenly across the CPUs in rps_cpus.
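The hash-and-spread step can be sketched in Python. This is a toy model, not the kernel's code: pick_cpu is a hypothetical name, CRC32 stands in for the kernel's own hash function, and the addresses and ports are made up. It only illustrates that packets of one flow always land on the same CPU while different flows are spread over the CPUs listed in rps_cpus.

```python
import zlib


def pick_cpu(src_ip, dst_ip, src_port, dst_port, cpus):
    """Pick a target CPU for a flow from the list of RPS-enabled CPUs."""
    # Stand-in hash: CRC32 over the flow's addresses and ports. The kernel
    # uses its own hash function; only the idea is the same.
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    flow_hash = zlib.crc32(key) & 0xFFFFFFFF
    # Scale the 32-bit hash into an index in [0, len(cpus)).
    return cpus[(flow_hash * len(cpus)) >> 32]


rps_cpus = [0, 4, 8, 12, 16, 20]  # CPUs from the rps_cpus mask shown earlier
# Made-up flows: one client address, 1000 different source ports.
flows = [("10.0.0.1", "10.0.0.2", 40000 + i, 80) for i in range(1000)]
per_cpu = {cpu: 0 for cpu in rps_cpus}
for flow in flows:
    per_cpu[pick_cpu(*flow, rps_cpus)] += 1
print(per_cpu)
```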
RPS (Receive Packet Steering)
Reference: http://lwn.net/Articles/362339/
RPS distributes the incoming data across the CPUs.
"Some network interfaces can help with the distribution of incoming packets; they have multiple receive queues and multiple interrupt lines. Others, though, are equipped with a single queue, meaning that the driver for that hardware must deal with all incoming packets in a single, serialized stream. Parallelizing such a stream requires some intelligence on the part of the host operating system."
"Tom's patch provides that intelligence by hooking into the receive path - netif_rx() and netif_receive_skb() - right when the driver passes a packet into the networking subsystem. At that point, it creates a hash from the relevant protocol data (IP addresses and port numbers, in particular) and uses it to pick a CPU; the packet is then enqueued for the target CPU's attention. By default, any CPU on the system is fair game for network processing, but the list of target CPUs for any given interface can be configured explicitly by the administrator if need be."
/proc/interrupts still shows that only the CPU defined by /proc/irq/<irq>/smp_affinity is used for NIC interrupt handling, even with RPS enabled. If you want to find out whether RPS is working, you have to look at /proc/softirqs instead (e.g. with watch -n1 cat /proc/softirqs).
Tools:
to see all the interrupts: sar -I SUM -P ALL
to see packets by interface: netstat -i
to see which thread is assigned to which cpu: ps -p <pid> -L -o pid,tid,psr,pcpu
to see which CPUs a process is allowed to run on: taskset [-c] -p <pid>
to see cpu utilization per core: mpstat -P ALL 1
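The /proc/softirqs check mentioned earlier can also be scripted instead of watched by eye. The sketch below compares the NET_RX row of two snapshots and reports which CPUs did receive-side softirq work in between. The helper names and the two 4-CPU snapshots are made up for illustration; on a real system, read /proc/softirqs twice, a second or so apart.

```python
def net_rx_counts(softirqs_text):
    """Return the per-CPU NET_RX counters from /proc/softirqs content."""
    for line in softirqs_text.splitlines():
        name, sep, rest = line.partition(":")
        if sep and name.strip() == "NET_RX":
            return [int(tok) for tok in rest.split()]
    raise ValueError("no NET_RX row found")


def active_cpus(before, after):
    """CPUs whose NET_RX counter grew between two snapshots."""
    return [cpu for cpu, (b, a) in enumerate(zip(before, after)) if a > b]


# Two made-up snapshots for a 4-CPU machine.
snap1 = """\
                    CPU0       CPU1       CPU2       CPU3
      NET_TX:         10          2          0          1
      NET_RX:       5000        120         30        900
"""
snap2 = """\
                    CPU0       CPU1       CPU2       CPU3
      NET_TX:         10          2          0          1
      NET_RX:       5400        120         30       1300
"""
print(active_cpus(net_rx_counts(snap1), net_rx_counts(snap2)))  # [0, 3]
```

If RPS is working, the list should contain the CPUs set in rps_cpus rather than only the CPU in the IRQ's smp_affinity.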