[dpdk-users] Low TX performance on Mellanox ConnectX-3 NIC

Jesper Wramberg jesper.wramberg at gmail.com
Mon Nov 2 14:57:12 CET 2015


Hi again,

Sorry I missed your first email. Wow, I can't believe I missed that. I read
the output from raw_ethernet_bw as Mbit/s :-( That's kind of embarrassing.
You are right. My calculations are wrong. Sorry for bothering you with my
bad math. For what it's worth, I have spent quite some time wondering what
was wrong.
I still have some way to go though, since my original problems started in a
much larger, more complicated setup. But I'm glad this basic Tx/Rx setup
works as expected.
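
For the record, redoing the math with the raw_ethernet_bw numbers quoted below:

    3.323974 Mpps * 1480 B/packet * 8 bit/B ~ 39.4 Gbit/s

so the ~4691 MB/sec average the tool reports is in bytes, i.e. close to line
rate if these are 40GbE ports, and nowhere near the ~4.7 Gbit/s I had read it as.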

Thank you, best regards
Jesper


2015-11-02 13:57 GMT+01:00 Jesper Wramberg <jesper.wramberg at gmail.com>:

> Hey,
>
> As a follow-up, I tried moving the interrupts around, without any change in
> the achieved speed.
> Lastly, after some iperf testing with 10 threads, it seems it is impossible
> to achieve more than 10G of bandwidth.
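>
> For reference, by "changing interrupts around" above I mean manually pinning
> the NIC interrupts via the /proc interface, e.g. something like the following
> (the IRQ number is illustrative; the 400 hex mask corresponds to CPU 10):
>
>   echo 400 > /proc/irq/123/smp_affinity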
>
> However, I did get an interesting output from "perf top -p <pid>" while
> running the raw_ethernet_bw tool:
>
>   37.22%  libpthread-2.17.so  [.] pthread_spin_lock
>   10.00%  libmlx4-rdmav2.so   [.] 0x000000000000b05a
>    1.20%  libmlx4-rdmav2.so   [.] 0x000000000000b3ec
>    1.07%  libmlx4-rdmav2.so   [.] 0x000000000000b06c
>    1.07%  libmlx4-rdmav2.so   [.] 0x000000000000afc0
>    1.06%  raw_ethernet_bw     [.] 0x000000000001484f
>    1.06%  raw_ethernet_bw     [.] 0x0000000000014869
>    1.06%  raw_ethernet_bw     [.] 0x00000000000142ec
>    1.05%  libmlx4-rdmav2.so   [.] 0x000000000000b41c
>    1.05%  libmlx4-rdmav2.so   [.] 0x000000000000aff6
>    1.05%  raw_ethernet_bw     [.] 0x0000000000014f09
>    1.03%  libmlx4-rdmav2.so   [.] 0x0000000000005a60
>    1.03%  libmlx4-rdmav2.so   [.] 0x000000000000be51
>    1.03%  libpthread-2.17.so  [.] pthread_spin_unlock
>    1.01%  libmlx4-rdmav2.so   [.] 0x000000000000afdc
>    1.00%  libmlx4-rdmav2.so   [.] 0x000000000000b042
>    1.00%  raw_ethernet_bw     [.] 0x0000000000014314
>    0.98%  libmlx4-rdmav2.so   [.] 0x000000000000bf38
>    0.97%  libmlx4-rdmav2.so   [.] 0x000000000000b3d2
>    0.97%  raw_ethernet_bw     [.] 0x00000000000142a4
>    0.96%  raw_ethernet_bw     [.] 0x0000000000014282
>    0.96%  libmlx4-rdmav2.so   [.] 0x000000000000b415
>    0.96%  raw_ethernet_bw     [.] 0x000000000001425e
>
> I wonder if the tool is supposed to spend so much time in
> pthread_spin_lock..
>
> Best regards,
> Jesper
>
> 2015-11-02 11:59 GMT+01:00 Jesper Wramberg <jesper.wramberg at gmail.com>:
>
>> Hi again,
>>
>> Thank you for your input. I have now switched to using the
>> raw_ethernet_bw tool as transmitter and testpmd as receiver. An
>> immediate result is that the raw_ethernet_bw tool achieves very
>> similar TX performance to my DPDK transmitter.
>>
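>> On the receive side I run testpmd in rxonly mode, roughly like this (the
>> core mask and PCI address are placeholders for my actual values):
>>
>>   ./testpmd -c 0xc00 -n 4 -w 0000:04:00.0 -- -i
>>   testpmd> set fwd rxonly
>>   testpmd> start
>>   testpmd> show port stats all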
>>
>> (note that both CPU 10 and mlx4_0 are on the same NUMA node, as intended)
>> taskset -c 10 raw_ethernet_bw --client -d mlx4_0 -i 2 -l 3 --duration 20
>> -s 1480 --dest_mac F4:52:14:7A:59:80
>>
>> ---------------------------------------------------------------------------------------
>> Post List requested - CQ moderation will be the size of the post list
>>
>> ---------------------------------------------------------------------------------------
>>                     Send Post List BW Test
>>  Dual-port       : OFF          Device         : mlx4_0
>>  Number of qps   : 1            Transport type : IB
>>  Connection type : RawEth               Using SRQ      : OFF
>>  TX depth        : 128
>>  Post List       : 3
>>  CQ Moderation   : 3
>>  Mtu             : 1518[B]
>>  Link type       : Ethernet
>>  Gid index       : 0
>>  Max inline data : 0[B]
>>  rdma_cm QPs     : OFF
>>  Data ex. method : Ethernet
>>
>> ---------------------------------------------------------------------------------------
>> **raw ethernet header****************************************
>>
>> --------------------------------------------------------------
>> | Dest MAC         | Src MAC          | Packet Type          |
>> |------------------------------------------------------------|
>> | F4:52:14:7A:59:80| E6:1D:2D:11:FF:41|DEFAULT               |
>> |------------------------------------------------------------|
>>
>>
>> ---------------------------------------------------------------------------------------
>>  #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]    MsgRate[Mpps]
>>  1480       33242748       0.00               4691.58               3.323974
>>
>> ---------------------------------------------------------------------------------------
>>
>>
>> Running it with the 64-byte packets Olga specified gives me the following
>> result:
>>
>>
>> ---------------------------------------------------------------------------------------
>>  #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]    MsgRate[Mpps]
>>  64         166585650      0.00               1016.67               16.657163
>>
>> ---------------------------------------------------------------------------------------
>>
>>
>> The results are the same with and without flow control. I have followed
>> the Mellanox DPDK QSG and done everything in the performance section
>> (except the things regarding interrupts).
>>
>> So to answer Olga's questions :-)
>>
>> 1: Unfortunately I can't. If I try, the FW update tool complains because the
>> cards came with a Dell configuration (PSID: DEL0A70000023).
>>
>> 2: In my final setup I need jumbo frames, but just for the sake of testing
>> I tried changing CONFIG_RTE_LIBRTE_MLX4_SGE_WR_N to 1 in the DPDK config.
>> This did not really change anything, in either my initial setup or the one
>> described above.
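>>
>> (Concretely, the line I toggled is the following, which in my DPDK tree sits
>> in config/common_linuxapp:
>>
>>   CONFIG_RTE_LIBRTE_MLX4_SGE_WR_N=1
>>
>> and then I rebuilt DPDK.)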
>>
>> 3: In the final setup, I plan to share the NICs between multiple
>> independent processes. For this reason, I wanted to use SR-IOV and
>> whitelist a single VF for each process. Anyway, for the tests above I have
>> used the PFs for simplicity.
>> (Side note: I discovered that multiple DPDK instances can use the same
>> PCI address which might eliminate the need for SR-IOV. I wonder how that
>> works :-))
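>>
>> The per-process setup I have in mind would look roughly like this (the VF
>> PCI addresses, core masks and application names are made up for
>> illustration):
>>
>>   ./app_a -c 0x400 -n 4 -w 0000:04:00.1 --file-prefix=p1 -- ...
>>   ./app_b -c 0x800 -n 4 -w 0000:04:00.2 --file-prefix=p2 -- ...
>>
>> i.e. each process whitelists its own VF and uses a separate hugepage file
>> prefix so the EAL instances don't collide.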
>>
>> So, to conclude: isn't the raw_ethernet_bw tool supposed to achieve a higher
>> output BW with 1480-byte packets?
>>
>> I have a sysinfo dump made with the Mellanox sysinfo-snapshot.py script. I
>> can mail it to anyone who has the time to look further into it.
>>
>> Thank you for your help, best regards
>> Jesper
>>
>> 2015-11-01 11:05 GMT+01:00 Olga Shern <olgas at mellanox.com>:
>>
>>> Hi Jesper,
>>>
>>> Several suggestions:
>>> 1.      Any chance you can install the latest FW from the Mellanox web site,
>>> or the one that is included in the OFED 3.1 version you have downloaded? The
>>> latest version is 2.35.5100.
>>> 2.      Please configure SGE_NUM=1 in the DPDK config file in case you
>>> don't need jumbo frames. This will improve performance.
>>> 3.      It is not clear from your description whether you are running DPDK
>>> in a VM. Are you using SR-IOV?
>>> 4.      I suggest you first run the testpmd application. The traffic
>>> generator can be the raw_ethernet_bw application that comes with MLNX_OFED;
>>> it can generate L2, IPv4 and TCP/UDP packets.
>>>         For example:  taskset -c 10 raw_ethernet_bw --client -d mlx4_0
>>> -i 1 -l 3 --duration 10 -s 64 --dest_mac F4:52:14:7A:59:80 &
>>>         This will send L2 packets via mlx4_0 NIC port 1, packet size =
>>> 64, for 10 sec, batch = 3 (-l).
>>>         You can then see the performance in the testpmd counters.
>>>
>>> Please check the Mellanox community posts; I think they can help you.
>>> https://community.mellanox.com/docs/DOC-1502
>>>
>>> We also have performance suggestions in our QSG:
>>>
>>> http://www.mellanox.com/related-docs/prod_software/MLNX_DPDK_Quick_Start_Guide_v2%201_1%201.pdf
>>>
>>> Best Regards,
>>> Olga
>>>
>>>
>>> Subject: [dpdk-users] Low TX performance on Mellanox ConnectX-3 NIC
>>> Date: Saturday, 31 October 2015, 09:54:04
>>> From: Jesper Wramberg <jesper.wramberg at gmail.com>
>>> To: users at dpdk.org
>>>
>>> Hi all,
>>>
>>>
>>>
>>> I am experiencing some performance issues in a somewhat custom setup
>>> with two Mellanox ConnectX-3 NICs. I realize these issues might be due to
>>> the setup, but I was hoping someone might be able to pinpoint some possible
>>> problems/bottlenecks.
>>>
>>>
>>>
>>>
>>> The server:
>>>
>>> I have a Dell PowerEdge R630 with two Mellanox ConnectX-3 NICs (one on
>>> each socket). I have a minimal CentOS 7.1.1503 installed with kernel
>>> 3.10.0-229. Note that this kernel is rebuilt with most things disabled to
>>> minimize size, etc. It has InfiniBand enabled, however, and mlx4_core as a
>>> module (since nothing works otherwise). Finally, I have connected the two
>>> NICs
>>> from port 2 to port 2.
>>>
>>>
>>>
>>> The firmware:
>>>
>>> I have installed the latest firmware for the NICs from Dell, which is
>>> 2.34.5060.
>>>
>>>
>>>
>>> The drivers, modules, etc.:
>>>
>>> I have downloaded the Mellanox OFED package 3.1 for CentOS 7.1 and used
>>> its rebuild feature to build it against the custom kernel. I have installed
>>> it using the --basic option since I just want libibverbs, libmlx4, kernel
>>> modules and openibd service stuff. The mlx4_core.conf is set for ethernet
>>> on all ports. Moreover, it is configured for flow steering mode -7 and a
>>> few VFs. I can restart the openibd service successfully and everything
>>> seems to be working. ibdev2netdev reports the NICs and their VFs, etc. The
>>> only problem I have encountered at this stage is that the links don't
>>> always seem to come up unless I unplug and re-plug the cables.
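>>>
>>> For reference, the relevant line of my mlx4_core.conf looks roughly like the
>>> following (the VF count is illustrative):
>>>
>>>   options mlx4_core port_type_array=2,2 log_num_mgm_entry_size=-7 num_vfs=4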
>>>
>>>
>>>
>>> DPDK setup:
>>>
>>> I have built DPDK with the mlx4 PMD using the .h/.a files from the OFED
>>> package. I built it using the default values for everything. Running the
>>> simple hello world example, I can see that everything is initialized
>>> correctly, etc.
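>>>
>>> Roughly, the build boils down to the following (standard DPDK build system;
>>> the target string is just the usual x86_64 Linux one):
>>>
>>>   sed -i 's/CONFIG_RTE_LIBRTE_MLX4_PMD=n/CONFIG_RTE_LIBRTE_MLX4_PMD=y/' config/common_linuxapp
>>>   make install T=x86_64-native-linuxapp-gcc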
>>>
>>>
>>>
>>> Test setup:
>>>
>>> To test the performance of the NICs I have the following setup. Two
>>> processes, P1 and P2, running on NIC A. Two other processes, P3 and P4,
>>> running on NIC B. All processes use virtual functions on their respective
>>> NICs. Depending on the test, the processes can either transmit or receive
>>> data. To transmit, I use a simple DPDK program which generates 32000
>>> packets and transmits them over and over until it has sent 640 million
>>> packets. Similarly, I use a simple DPDK program to receive, which is
>>> basically the Layer 2 forwarding example without re-transmission.
>>>
>>>
>>>
>>> First test:
>>>
>>> In my first test, P1 transmits data to P3 while the other processes are
>>> idle.
>>>
>>> Packet size: 1480 byte packets
>>>
>>> Flow control: On/Off, doesn’t matter, I get the same result.
>>>
>>> Result: P3 receives all packets, but it takes 192.52 seconds ~ 3.32 Mpps ~
>>> 4.9 Gbit/s.
>>>
>>>
>>>
>>> Second test:
>>>
>>> In my second test, I attempt to increase the amount of data transmitted
>>> over NIC A. As such, P1 transmits data to P3 while P2 transmits data to P4.
>>>
>>> Packet size: 1480 byte packets
>>>
>>> Flow control: On/Off, doesn’t matter, I get the same result.
>>>
>>> Results: P3 and P4 receive all packets, but it takes 364.40 seconds ~
>>> 1.75 Mpps ~ 2.6 Gbit/s for a single process to get its data transmitted.
>>>
>>>
>>>
>>>
>>>
>>> Does anyone have any idea what I am doing wrong here? In the second test,
>>> I would expect P1 to transmit at the same speed as in the first test. It
>>> seems that there is a bottleneck somewhere, however. I have left most
>>> things at their default values but have also tried tweaking queue sizes,
>>> number of queues, interrupts, etc., with no luck.
>>>
>>>
>>>
>>>
>>>
>>> Best Regards,
>>>
>>> Jesper
>>>
>>
>>
>

