Comment 13 for bug 1772675

Heitor Alves de Siqueira (halves) wrote :

Given that we could be changing reset behavior that might be expected from the firmware, I wrote a quick set of kprobes to force the firmware to raise MDD events and test out the patched kernel from the PPA.

I tried to force faulty TX descriptors according to "Table 7-138. Tx Descriptor Validity Checks" in the XL710 Datasheet, under section "7.6.2.2.1 Interrupt on Misbehavior of VM (Malicious Driver Detection)". This document is publicly available at Intel's Technical Library site for this NIC.

The test setup is as follows:
- Create 2 VFs on primary NIC
- Passthrough VF 1 to a Bionic VM
- Start iperf3 client on VM, going through i40evf interface
- Start another iperf3 client on host, going through i40e interface

The iperf3 servers in my testing were running on a separate host, so I only had clients using the i40e NIC. This was primarily to verify what the networking and connectivity impact would be if we ran into any MDDs.

After both iperf3 clients were running, I loaded the kprobe module matching the specific TX check I wanted to validate. Raising MDDs on the VF turned out to be pretty trivial, and most of the i40e probes also work on i40evf. MDDs on the PF were a bit trickier to get, but I had good results with corrupting the final TX descriptor's cmd_type_offset_bsz field. As this happens right before the driver notifies the NIC about the new data, it should force the firmware to raise the MDD event, as opposed to us "manually" triggering it from the driver. This has the benefit of keeping things consistent from the firmware's point of view, as in the end it is the one responsible for detecting those events and notifying the kernel about them.
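To make the descriptor corruption a bit more concrete, here is a minimal user-space sketch of what "corrupting cmd_type_offset_bsz" amounts to at the bit level. It is not the attached kprobe module, which runs in the kernel. The shift values are the ones I recall from the upstream i40e headers and should be treated as assumptions, the helper names are made up for illustration, and the zero buffer size is only a placeholder for whichever invalid value a given check in Table 7-138 calls for.

```python
# Assumed layout of the 64-bit cmd_type_offset_bsz qword in a TX data
# descriptor (recalled from the upstream i40e headers; verify before use).
DTYPE_SHIFT = 0
CMD_SHIFT = 4
OFFSET_SHIFT = 16
BUF_SZ_SHIFT = 34
L2TAG1_SHIFT = 48

CMD_MASK = 0x3FF << CMD_SHIFT
BUF_SZ_MASK = 0x3FFF << BUF_SZ_SHIFT
U64_MASK = (1 << 64) - 1


def build_qw1(cmd, offset, size, tag):
    """Pack a data descriptor qword (DTYPE 0) from its fields. Hypothetical helper."""
    return ((cmd << CMD_SHIFT)
            | (offset << OFFSET_SHIFT)
            | (size << BUF_SZ_SHIFT)
            | (tag << L2TAG1_SHIFT)) & U64_MASK


def buf_size(qw1):
    """Extract the buffer size field."""
    return (qw1 & BUF_SZ_MASK) >> BUF_SZ_SHIFT


def cmd_bits(qw1):
    """Extract the command field."""
    return (qw1 & CMD_MASK) >> CMD_SHIFT


def corrupt_buf_size(qw1, bad_size=0):
    """Replace only the buffer size field, leaving every other bit untouched."""
    cleared = qw1 & ~BUF_SZ_MASK & U64_MASK
    return cleared | ((bad_size << BUF_SZ_SHIFT) & BUF_SZ_MASK)


# Example: a 1500-byte buffer with two command bits set (0x3 is illustrative).
good = build_qw1(cmd=0x3, offset=0, size=1500, tag=0)
bad = corrupt_buf_size(good)
```

Touching a single field like this keeps the rest of the descriptor valid, so the firmware's validity checks are the only thing that should object to it.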

The primary point of this test was to verify whether we could leave the NIC in an inconsistent state by avoiding or delaying the PF reset. The results were promising, and should hopefully give some more data on the value of the upstream patch.

When raising MDDs on the VF, the firmware correctly slaps the appropriate queues and schedules any resets as required. This is the same behavior as before. With the test kernel, however, we don't issue any resets to the PF, so the iperf3 tests continue running uninterrupted as desired.

When raising MDDs on the PF, we don't issue any resets anymore and, depending on which probe was used, connectivity will stop momentarily. The netdev watchdog kicks in shortly afterwards and issues a PF reset as appropriate, after which network connectivity resumes. This confirms that, even with the upstream patch, any hung queues that aren't reset immediately will recover afterwards, as the queue watchdogs take care of them. This is consistent with the upstream behavior, and the kernel logs look similar to the one below:

[ 573.279608] NETDEV WATCHDOG: ens1f1 (i40e): transmit queue 1 timed out
[ 573.279652] WARNING: CPU: 14 PID: 0 at /build/linux-lqvoqZ/linux-4.15.0/net/sched/sch_generic.c:323 dev_watchdog+0x221/0x230
[ 573.279659] Modules linked in: vhost_net vhost tap vfio_pci vfio_virqfd vfio_iommu_type1 vfio i40evf xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp ebtable_filter devlink ebtables nls_iso8859_1 intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp ipmi_ssif coretemp intel_cstate intel_rapl_perf lpc_ich hpilo ipmi_si ipmi_devintf ipmi_msghandler shpchp ioatdma acpi_power_meter mac_hid sch_fq_codel kvm_intel kvm irqbypass iptable_filter ip6table_filter ip6_tables br_netfilter bridge stp llc arp_tables ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov
[ 573.279726] async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear ses enclosure crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc mgag200 i2c_algo_bit ttm aesni_intel aes_x86_64 crypto_simd glue_helper cryptd drm_kms_helper syscopyarea ixgbe sysfillrect sysimgblt fb_sys_fops dca i40e drm tg3 ptp nvme hpsa pps_core nvme_core mdio scsi_transport_sas wmi [last unloaded: probe_tx_desc]
[ 573.279756] CPU: 14 PID: 0 Comm: swapper/14 Tainted: G OE 4.15.0-137-generic #141+TEST298651v20210225b1-Ubuntu
[ 573.279757] Hardware name: HP ProLiant DL360 Gen9, BIOS P89 05/06/2015
[ 573.279763] RIP: 0010:dev_watchdog+0x221/0x230
[ 573.279764] RSP: 0018:ffff8f28bf183e58 EFLAGS: 00010286
[ 573.279766] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 000000000000083f
[ 573.279767] RDX: 0000000000000000 RSI: 00000000000000f6 RDI: 000000000000083f
[ 573.279769] RBP: ffff8f28bf183e88 R08: 0000000000000694 R09: 0000000000000004
[ 573.279770] R10: ffff8f28bf183ee0 R11: 0000000000000001 R12: 0000000000000040
[ 573.279772] R13: ffff8f2827c69000 R14: ffff8f2827c69478 R15: ffff8f2827fa4f40
[ 573.279774] FS: 0000000000000000(0000) GS:ffff8f28bf180000(0000) knlGS:0000000000000000
[ 573.279775] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 573.279777] CR2: 000055d7e4f1b0e8 CR3: 0000000e9160a006 CR4: 00000000001626e0
[ 573.279778] Call Trace:
[ 573.279781] <IRQ>
[ 573.279790] ? dev_deactivate_queue.constprop.33+0x60/0x60
[ 573.279795] call_timer_fn+0x30/0x130
[ 573.279799] run_timer_softirq+0x3f3/0x430
[ 573.279805] ? ktime_get+0x43/0xb0
[ 573.279813] ? lapic_next_deadline+0x26/0x30
[ 573.279820] __do_softirq+0xe4/0x2d4
[ 573.279827] irq_exit+0xc5/0xd0
[ 573.279831] smp_apic_timer_interrupt+0x79/0x140
[ 573.279835] apic_timer_interrupt+0x90/0xa0
[ 573.279838] </IRQ>
[ 573.279847] RIP: 0010:mwait_idle+0x9f/0x190
[ 573.279849] RSP: 0018:ffff9d4d86347e90 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff11
[ 573.279852] RAX: 0000000000000000 RBX: 000000000000000e RCX: 0000000000000000
[ 573.279854] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 573.279857] RBP: ffff9d4d86347ea8 R08: 0000008566d3c328 R09: ffff8f282a5a4e00
[ 573.279859] R10: 0000000000000000 R11: 00000152a21daba2 R12: 000000000000000e
[ 573.279861] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 573.279870] arch_cpu_idle+0x15/0x20
[ 573.279874] default_idle_call+0x23/0x30
[ 573.279878] do_idle+0x172/0x1f0
[ 573.279882] cpu_startup_entry+0x73/0x80
[ 573.279885] start_secondary+0x1ab/0x200
[ 573.279890] secondary_startup_64+0xa5/0xb0
[ 573.279892] Code: 36 00 49 63 4e e8 eb 92 4c 89 ef c6 05 08 1d d7 00 01 e8 f3 1d fd ff 89 d9 48 89 c2 4c 89 ee 48 c7 c7 60 ed 39 85 e8 3f e2 7e ff <0f> 0b eb c0 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55
[ 573.279922] ---[ end trace bc176e8d4716bac2 ]---
[ 573.279942] i40e 0000:08:00.1 ens1f1: tx_timeout: VSI_seid: 391, Q 1, NTC: 0x0, HWB: 0x0, NTU: 0x1, TAIL: 0x1, INT: 0x1
[ 573.279955] i40e 0000:08:00.1 ens1f1: tx_timeout recovery level 1, hung_queue 1
[ 573.282420] i40e 0000:08:00.1: VSI seid 391 Tx ring 0 disable timeout
[ 573.338312] i40e 0000:08:00.1: VSI seid 393 Tx ring 64 disable timeout
[ 579.167611] i40e 0000:08:00.1 ens1f1: tx_timeout: VSI_seid: 391, Q 10, NTC: 0x0, HWB: 0x0, NTU: 0x1, TAIL: 0x1, INT: 0x1
[ 579.167650] i40e 0000:08:00.1 ens1f1: tx_timeout recovery level 2, hung_queue 10
[ 579.168257] i40e 0000:08:00.1: VSI seid 391 Tx ring 0 disable timeout
[ 579.169227] i40evf 0000:08:02.1: PF reset warning received
[ 579.169231] i40evf 0000:08:02.1: Scheduling reset task
[ 579.224464] i40e 0000:08:00.1: VSI seid 393 Tx ring 64 disable timeout
[ 579.279847] i40e 0000:08:00.0: VSI seid 390 Tx ring 0 disable timeout
[ 579.335352] i40e 0000:08:00.0: VSI seid 392 Tx ring 64 disable timeout
[ 582.377042] i40e 0000:08:00.0: DCBX offload is not supported or is disabled for this PF.
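For anyone repeating the test, the watchdog escalation in a log like the one above can be pulled out with a few lines of Python. This is only a sketch: the regular expression is written against the "tx_timeout recovery level" lines exactly as they appear in this log, and the function name is made up for illustration, so other kernel versions may need adjustments.

```python
import re

# Matches lines such as:
# [ 573.279955] i40e 0000:08:00.1 ens1f1: tx_timeout recovery level 1, hung_queue 1
TX_TIMEOUT_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+i40e\s+(?P<pci>\S+)\s+(?P<ifname>\S+):\s+"
    r"tx_timeout recovery level (?P<level>\d+), hung_queue (?P<queue>\d+)"
)


def recovery_events(log_text):
    """Return (timestamp, interface, recovery level, hung queue) tuples."""
    events = []
    for line in log_text.splitlines():
        m = TX_TIMEOUT_RE.search(line)
        if m:
            events.append((float(m["ts"]), m["ifname"],
                           int(m["level"]), int(m["queue"])))
    return events


# Two lines copied from the log above, plus one unrelated line.
SAMPLE = """\
[ 573.279955] i40e 0000:08:00.1 ens1f1: tx_timeout recovery level 1, hung_queue 1
[ 573.282420] i40e 0000:08:00.1: VSI seid 391 Tx ring 0 disable timeout
[ 579.167650] i40e 0000:08:00.1 ens1f1: tx_timeout recovery level 2, hung_queue 10
"""

for ts, ifname, level, queue in recovery_events(SAMPLE):
    print(f"{ts:.6f} {ifname}: recovery level {level}, queue {queue}")
```

Fed the output of dmesg, this makes it easy to see how far the recovery level climbed on each interface during a run.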

My first test run was to validate the patches on a Bionic host; I'll move on to testing Xenial next. I've attached the kprobe module source, in case anyone wants to try breaking i40e as well. The offsets in the current version have been calculated for kernel 4.15.0-136 and are the same for test kernel 4.15.0-137.