[dpdk-dev] Occasional instability in RSS Hashes/Queues from X540 NIC

Matt Laswell laswell at infinite.io
Thu May 4 15:04:08 CEST 2017


Hey Folks,

I'm seeing some strange behavior with regard to the RSS hash values in my
application and was hoping somebody might have some pointers on where to
look.  In my application, I'm using RSS to divide work among multiple
cores, each of which services a single RX queue.  When dealing with a
single long-lived TCP connection, I occasionally see packets going to the
wrong core.  That is, almost all of the packets in the connection go to
core 5 in this case, but every once in a while one goes to core 0 instead.
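
For context, the port configuration looks roughly like this (a minimal
sketch of the setup described above, not our exact code):

#include <rte_ethdev.h>

/* Minimal sketch of the RSS port configuration described above. */
static const struct rte_eth_conf port_conf = {
	.rxmode = {
		.mq_mode = ETH_MQ_RX_RSS,  /* spread flows across RX queues */
	},
	.rx_adv_conf = {
		.rss_conf = {
			.rss_key = NULL,          /* key is set separately; see below */
			.rss_hf  = ETH_RSS_IPV4,  /* hash on the IPv4 header only */
		},
	},
};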

Upon further investigation, I find that two problems are occurring.  The
first is that problem packets have the RSS hash value in the mbuf
incorrectly set to zero.  They are therefore put in queue zero, where they
are read by core zero.  Other packets from the same connection, immediately
before and after the packet in question, have the correct hash value and
therefore go to a different core.  The second problem is that we sometimes
see packets in which the RSS hash in the mbuf appears correct, but the
packet is nonetheless put into queue zero.  As with the first, this results
in the wrong core getting the packet.  Either problem confuses the per-core
state tracking we're doing.
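
For what it's worth, this is roughly how we spot the mismatches in the RX
loop (a sketch; it assumes the default round-robin redirection table,
which rte_eth_dev_rss_reta_query() can confirm on a given setup):

#include <stdio.h>
#include <rte_mbuf.h>

/* Sketch: flag packets whose mbuf RSS hash does not map to the queue
 * they arrived on.  Assumes the X540's 128-entry RETA is programmed
 * round-robin across nb_queues, which is the driver default. */
static void
check_rx_burst(uint8_t port, uint16_t queue_id, uint16_t nb_queues,
	       struct rte_mbuf **pkts, uint16_t nb_rx)
{
	uint16_t i;

	for (i = 0; i < nb_rx; i++) {
		struct rte_mbuf *m = pkts[i];

		if (!(m->ol_flags & PKT_RX_RSS_HASH)) {
			printf("port %u queue %u: packet with no RSS hash\n",
			       port, queue_id);
			continue;
		}

		/* the low 7 bits of the hash index the 128-entry RETA */
		uint16_t expected = (m->hash.rss & 127) % nb_queues;

		if (expected != queue_id)
			printf("port %u: hash 0x%08x should be queue %u, "
			       "arrived on %u\n",
			       port, m->hash.rss, expected, queue_id);
	}
}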

A few details:

   - Using an Intel X540-AT2 NIC and the igb_uio driver
   - DPDK 16.04
   - A particular packet in our workflow always encounters this problem.
   - Retransmissions of the packet in question also encounter the problem
   - The packet is IPv4, with a header length of 20 bytes (so no options)
   and no fragmentation.
   - The only differences I can see in the IP header between packets that
   get the right hash value and those that get the wrong one are in the IP ID,
   total length, and checksum fields.
   - Using ETH_RSS_IPV4
   - The packet is TCP with about 100 bytes of payload - it's not a jumbo
   or a runt
   - We fill the RSS key with repetitions of 0x6d5a to get symmetric
   hashing of both sides of the connection (see the sketch after this list)
   - We only configure RSS information at boot; things like the key or
   header fields are not being changed dynamically
   - Traffic load is light when the problem occurs
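
Concretely, the key setup looks like this (sketch of the symmetric-key
trick mentioned in the list above):

#include <rte_ethdev.h>

/* 0x6d5a repeated across the 40-byte ixgbe key gives a symmetric
 * Toeplitz hash: (src, dst) and (dst, src) map to the same value. */
static uint8_t sym_rss_key[40] = {
	0x6d, 0x5a, 0x6d, 0x5a, 0x6d, 0x5a, 0x6d, 0x5a,
	0x6d, 0x5a, 0x6d, 0x5a, 0x6d, 0x5a, 0x6d, 0x5a,
	0x6d, 0x5a, 0x6d, 0x5a, 0x6d, 0x5a, 0x6d, 0x5a,
	0x6d, 0x5a, 0x6d, 0x5a, 0x6d, 0x5a, 0x6d, 0x5a,
	0x6d, 0x5a, 0x6d, 0x5a, 0x6d, 0x5a, 0x6d, 0x5a,
};

static struct rte_eth_rss_conf rss_conf = {
	.rss_key     = sym_rss_key,
	.rss_key_len = sizeof(sym_rss_key),
	.rss_hf      = ETH_RSS_IPV4,
};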

Is anybody aware of an erratum, either in the NIC or in the PMD's
configuration of it, that might explain something like this?  Failing
that, if you ran into this sort of behavior, how would you approach
finding the cause?  Every failure mode I can think of would tend to affect
all of the packets in the connection consistently, even if incorrectly.
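
One check that seems worth doing: recompute the Toeplitz hash in software
with rte_thash.h and compare it against what the NIC put in the mbuf,
roughly like this (assumes the L3-only tuple that goes with ETH_RSS_IPV4):

#include <rte_ip.h>
#include <rte_thash.h>

/* Sketch: recompute the Toeplitz hash in software for an IPv4 packet so
 * it can be compared with the value the NIC wrote into the mbuf. */
static uint32_t
soft_rss_ipv4(const struct ipv4_hdr *ip4, const uint8_t *rss_key)
{
	union rte_thash_tuple tuple;

	/* converts src/dst addresses to host byte order */
	rte_thash_load_v4_addrs(ip4, &tuple);

	return rte_softrss((uint32_t *)&tuple, RTE_THASH_V4_L3_LEN, rss_key);
}

If the software hash matches the mbuf on good packets but disagrees on the
bad ones, that would point at the NIC/PMD RX path rather than at the hash
inputs themselves.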

Thanks in advance for any ideas.

--
Matt Laswell
laswell at infinite.io

