[dpdk-users] Issue with more Cores assigned: Cannot mmap device resource file

Kai Zhang kay21s at gmail.com
Wed Mar 15 18:02:36 CET 2017


Hi, David

I got your point now ;-)

I don't know why, but my program works correctly now... even without
setting the base-virtaddr. I will try your method when the error happens
again.

Thanks for your detailed explanation, I really appreciate it.

Regards,
Kai

On Wed, Mar 15, 2017 at 11:48 PM, David Coen <d.coen at resi.it> wrote:

> Hi Kai,
>
> I'm sure that it's not necessary to use --base-virtaddr option on the
> secondary process.
>
> Referring to the addresses in your last post: to fully try my method,
> you should start your real primary application with
>
>  --base-virtaddr=0x7ffef5000000
>
> which is the smallest address I can see in your post (see "Region 5" below).
>
> I hope this could help you,
>
> David
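The choice of base address described above can be sketched in Python. This is only a minimal sketch, assuming the region lines have exactly the format printed by the dummy primary application in this thread; the helper names lowest_region_start and base_virtaddr_option are hypothetical and not part of DPDK.

```python
import re

# Region lines as printed by the dummy primary application in this thread.
REGION_LOG = """\
Region 0: virtual address [0x7fffdc200000, 0x7ffff5a00000], physical address 0x59c00000, len 427819008
Region 1: virtual address [0x7fffdbe00000, 0x7fffdc000000], physical address 0x7b600000, len 2097152
Region 2: virtual address [0x7fffdba00000, 0x7fffdbc00000], physical address 0xf25800000, len 2097152
Region 3: virtual address [0x7ffef5800000, 0x7fffdb800000], physical address 0xf25c00000, len 3858759680
Region 4: virtual address [0x7ffef5400000, 0x7ffef5600000], physical address 0x100f000000, len 2097152
Region 5: virtual address [0x7ffef5000000, 0x7ffef5200000], physical address 0x1024000000, len 2097152
"""

# Captures the start and end of each "[start, end]" virtual address range.
REGION_RE = re.compile(r"virtual address \[(0x[0-9a-fA-F]+), (0x[0-9a-fA-F]+)\]")


def lowest_region_start(log_text):
    """Return the smallest region start address found in the log text."""
    starts = [int(match.group(1), 16) for match in REGION_RE.finditer(log_text)]
    if not starts:
        raise ValueError("no region lines found")
    return min(starts)


def base_virtaddr_option(log_text):
    """Format the EAL option for the lowest region start address."""
    return "--base-virtaddr=0x%x" % lowest_region_start(log_text)


if __name__ == "__main__":
    print(base_virtaddr_option(REGION_LOG))
```

Note that the first region listed (Region 0) is not the one with the lowest address here, so taking the first printed line and taking the minimum give different values.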
> ------------------------------------------------------------------------------------------
> From: Kai Zhang [mailto:kay21s at gmail.com]
> Sent: Wednesday, March 15, 2017 05:14
> To: Wiles, Keith
> Cc: David Coen; Van Haaren, Harry
> Subject: Re: [dpdk-users] Issue with more Cores assigned: Cannot mmap
> device resource file
>
> I have also tried to use the same option --base-virtaddr=0x7fffdc200000 on
> the secondary process. But it does not help.
>
> Thank you, Keith. I think I can try to figure it out first, if the
> internals are not too complicated ...
>
> Regards,
> Kai
>
>
> On Wed, Mar 15, 2017 at 11:28 AM, Wiles, Keith <keith.wiles at intel.com>
> wrote:
>
> > On Mar 15, 2017, at 10:56 AM, Kai Zhang <kay21s at gmail.com> wrote:
> >
> > Hi David,
> >
> > I find that your method does not work for me :-(
> >
> > The dummy primary application shows the following regions:
> > Region 0: virtual address [0x7fffdc200000, 0x7ffff5a00000], physical address 0x59c00000, len 427819008
> > Region 1: virtual address [0x7fffdbe00000, 0x7fffdc000000], physical address 0x7b600000, len 2097152
> > Region 2: virtual address [0x7fffdba00000, 0x7fffdbc00000], physical address 0xf25800000, len 2097152
> > Region 3: virtual address [0x7ffef5800000, 0x7fffdb800000], physical address 0xf25c00000, len 3858759680
> > Region 4: virtual address [0x7ffef5400000, 0x7ffef5600000], physical address 0x100f000000, len 2097152
> > Region 5: virtual address [0x7ffef5000000, 0x7ffef5200000], physical address 0x1024000000, len 2097152
> >
> > I set the real primary application with --base-virtaddr=0x7fffdc200000
> >
> > The error in the secondary process is:
> > EAL: Cannot mmap device resource file /sys/bus/pci/devices/0000:02:00.0/resource0 to address: 0x7ffff2bfd000
>
> This one seems like a hardware issue: the PCI device cannot be set to the
> correct address. The path above is the device path to the resource0 entry of
> the PCI device, and the system is having a problem mapping the address. Does
> the secondary process need the same option to set the base address?
>
> Sorry, I am not much help here, as I am not able to focus on the problem
> more because I am off site at a week-long meeting.
>
> >
> > It seems that they are not accessing the same region.
> >
> > Regards,
> > Kai
> >
> > On Wed, Mar 15, 2017 at 12:47 AM, David Coen <d.coen at resi.it> wrote:
> > Hi Kai, I agree with you.
> >
> > I have much the same issue: a primary application and a secondary one,
> > sometimes running with more than 4 cores.
> >
> > I'm using DPDK 16.11 on RedHat 6.7.
> >
> > So far I have solved it this way:
> >
> > - Disabling ASLR by adding these two lines to "/etc/sysctl.conf":
> >
> >                 # Disable Address Space Layout Randomization (ASLR) (needed by DPDK)
> >                 kernel.randomize_va_space = 0
> >
> > - Getting the virtual address of the first memory segment (the one with the
> >   minimum address value) returned by the function
> >   "rte_eal_get_physmem_layout()", called from a "dummy" primary application
> >   used only to get this address.
> >
> > - Passing the above virtual address to the "real" primary application
> >   using the "--base-virtaddr=" DPDK command-line option. When the secondary
> >   app starts, it all goes well with the specified base address.
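Whether ASLR is actually disabled on a machine can be checked by reading the procfs file behind the kernel.randomize_va_space sysctl. The following is a minimal Python sketch, assuming a Linux system with /proc mounted; the helper names describe_aslr and aslr_is_disabled are hypothetical.

```python
from pathlib import Path

# procfs view of the kernel.randomize_va_space sysctl (Linux only).
ASLR_SYSCTL = Path("/proc/sys/kernel/randomize_va_space")

# Meaning of the documented values of randomize_va_space.
ASLR_MODES = {
    0: "disabled",
    1: "conservative randomization",
    2: "full randomization",
}


def describe_aslr(raw_value):
    """Translate the raw file content (for example "2\\n") into a description."""
    mode = int(raw_value.strip())
    return ASLR_MODES.get(mode, "unknown mode %d" % mode)


def aslr_is_disabled(raw_value):
    """Return True when the value corresponds to kernel.randomize_va_space = 0."""
    return int(raw_value.strip()) == 0


if __name__ == "__main__":
    if ASLR_SYSCTL.exists():
        raw = ASLR_SYSCTL.read_text()
        print("kernel.randomize_va_space = %s (%s)" % (raw.strip(), describe_aslr(raw)))
    else:
        print("cannot read %s on this system" % ASLR_SYSCTL)
```

A line added to /etc/sysctl.conf is only picked up when the settings are reloaded or at the next boot, so checking the live value before starting the primary process avoids testing against a stale configuration.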
> >
> >
> >
> > I've tested this solution on different servers and it's always OK.
> >
> > I think that there is some kind of limitation in the DPDK primary/secondary
> > initialization process that could be improved.
> >
> >
> >
> > Regards,
> >
> > David
> >
> >
> >
> > -----Original Message-----
> > From: Kai Zhang [mailto:kay21s at gmail.com]
> > Sent: Monday, March 13, 2017 11:59
> > To: Van Haaren, Harry
> > Cc: Wiles, Keith; users at dpdk.org
> > Subject: Re: [dpdk-users] Issue with more Cores assigned: Cannot mmap
> > device resource file
> >
> >
> >
> > Thank you for your info, Harry.
> >
> >
> >
> > Even if ASLR is the root cause, I don't think DPDK should expect users to
> > disable it in order to use the primary/secondary model. Is it possible for
> > the DPDK team to check this issue and fix the bug?
> >
> >
> >
> > Regards,
> >
> > Kai
> >
> >
> >
> > On Mon, Mar 13, 2017 at 5:58 PM, Van Haaren, Harry <harry.van.haaren at intel.com> wrote:
> >
> > > > > From: users [mailto:users-bounces at dpdk.org] On Behalf Of Kai Zhang
> > > > > Subject: Re: [dpdk-users] Issue with more Cores assigned: Cannot mmap device resource file
> > > > >
> > > > > Yes, my application is somewhat special and should run with the
> > > > > primary/secondary mode. I will search for the way to turn off the
> > > > > random page mapping and try it.
> > > >
> > > > You're searching for ASLR, or Address Space Layout Randomization.
> > > >
> > > > Some useful links regarding ASLR, DPDK and Linux:
> > > > http://dpdk.readthedocs.io/en/v16.04/prog_guide/multi_proc_support.html#multi-process-limitations
> > > > http://askubuntu.com/questions/318315/how-can-i-temporarily-disable-aslr-address-space-layout-randomization
> > > > http://dpdk.org/ml/archives/dev/2015-June/019364.html
> > > >
> > > > Please note that ASLR is a security feature of the OS; think twice
> > > > before disabling it.
> > > >
> > > > Hope that helps, -Harry
> > > >
> > > > > Thanks for your help :)
> > > > >
> > > > > Regards,
> > > > > Kai
> > > > >
> > > > > On Mon, Mar 13, 2017 at 3:24 AM, Wiles, Keith <keith.wiles at intel.com> wrote:
> > > > >
> > > > > > > > On Mar 12, 2017, at 6:39 PM, Kai Zhang <kay21s at gmail.com> wrote:
> > > > > > >
> > > > > > > Your application may be attaching to the same port for each core.
> > > > > > > Normally this means that each core could be allocating memory and the
> > > > > > > 4th core just goes over the amount of memory you have reserved.
> > > > > > >
> > > > > > > I don't think so, because the error is in rte_eal_init(), which is
> > > > > > > executed in the first line of the main() function. At that time, the
> > > > > > > other threads are not even launched.
> > > > > > >
> > > > > > > Is it possible to consider this as a bug in DPDK?
> > > > > >
> > > > > > One more thing: I run Pktgen as two processes all of the time. The
> > > > > > big difference is that I do not run in primary and secondary modes. I
> > > > > > run two different instances of pktgen at the same time without
> > > > > > seeing this type of problem. If the failure is associated with the
> > > > > > primary/secondary application model, then it could be a bug in that
> > > > > > code, as a lot of syncing up between the two processes needs to be
> > > > > > done because of memory/device sharing. One problem with P/S
> > > > > > applications is that memory needs to be mapped at the same address in
> > > > > > both processes, and Linux has random memory mapping built in for
> > > > > > security reasons. I forget the name of the mode in Linux to turn off
> > > > > > the random page mapping, and Google is not working for me ATM.
> > > > > >
> > > > > > Does your application require running as a primary/secondary
> > > > > > application?
> > > > > >
> > > > > > > Regards,
> > > > > > > Kai
> > > > > > >
> > > > > > > >
> > > > > > > > EAL: Cannot mmap device resource file /sys/bus/pci/devices/0000:02:00.0/resource0 to address: 0x7fff65bfc000
> > > > > > > > EAL: Error - exiting with code: 1
> > > > > > > >   Cause: Requested device 0000:02:00.0 cannot be used
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > Kai
> > > > > > > >
> >
> > > > > > > > On Sun, Mar 12, 2017 at 11:21 AM, Kai Zhang <kay21s at gmail.com> wrote:
> > > > > > > >
> > > > > > > > Command line:
> > > > > > > > primary:   sudo ./primary -l 0,1,2,3 -n 4 --proc-type=primary
> > > > > > > > secondary: sudo ./secondary -l 4,5,6,7,8 -n 4 --proc-type=secondary
> > > > > > > >
> > > > > > > > The configurations are as follows:
> > > > > > > > A) 1 x Intel E5-2650 v4, 12 cores [UMA], XL710 40GbE, bind 02:00.0, 2048 x 4k huge page
> > > > > > > > 02:00.0 Ethernet controller: Intel Corporation Ethernet Controller XL710 for 40GbE QSFP+ (rev 02)   [<<- Only bind this one]
> > > > > > > > 02:00.1 Ethernet controller: Intel Corporation Ethernet Controller XL710 for 40GbE QSFP+ (rev 02)
> > > > > > > > 05:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network Connection (rev 03)
> > > > > > > > 06:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network Connection (rev 03)
> > > > > > > >         Socket 0
> > > > > > > >         --------
> > > > > > > > Core 0  [0, 12]
> > > > > > > > Core 1  [1, 13]
> > > > > > > > Core 2  [2, 14]
> > > > > > > > Core 3  [3, 15]
> > > > > > > > Core 4  [4, 16]
> > > > > > > > Core 5  [5, 17]
> > > > > > > > Core 8  [6, 18]
> > > > > > > > Core 9  [7, 19]
> > > > > > > > Core 10 [8, 20]
> > > > > > > > Core 11 [9, 21]
> > > > > > > > Core 12 [10, 22]
> > > > > > > > Core 13 [11, 23]
> > > > > > > >
> > > > > > > > B) 2 x Intel E5-2640 v4, 10 cores [NUMA], No Port Bind, 2048 x 4k huge page
> > > > > > > > 05:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network Connection (rev 03)
> > > > > > > > 06:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network Connection (rev 03)
> > > > > > > >         Socket 0        Socket 1
> > > > > > > >         --------        --------
> > > > > > > > Core 0  [0, 20]         [10, 30]
> > > > > > > > Core 1  [1, 21]         [11, 31]
> > > > > > > > Core 2  [2, 22]         [12, 32]
> > > > > > > > Core 3  [3, 23]         [13, 33]
> > > > > > > > Core 4  [4, 24]         [14, 34]
> > > > > > > > Core 8  [5, 25]         [15, 35]
> > > > > > > > Core 9  [6, 26]         [16, 36]
> > > > > > > > Core 10 [7, 27]         [17, 37]
> > > > > > > > Core 11 [8, 28]         [18, 38]
> > > > > > > > Core 12 [9, 29]         [19, 39]
> > > > > > > >
> > > > > > > > Ah, as machine B does not have a 40GbE NIC, I did not bind any NIC and
> > > > > > > > ran my program with locally generated packets. But I am using other
> > > > > > > > DPDK features, such as memory sharing and message passing. Maybe
> > > > > > > > that is the reason it works correctly? I can only access machine B
> > > > > > > > remotely, so I am unable to install a NIC on it. I have another PC,
> > > > > > > > used as a client, that has only four cores, so it also cannot be used
> > > > > > > > for verification...
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > Kai
> > > > > > > >
> > > > > > > > On Sun, Mar 12, 2017 at 2:59 AM, Wiles, Keith <keith.wiles at intel.com> wrote:
> > > > > > > >
> > > > > > > > > On Mar 11, 2017, at 9:45 AM, Kai Zhang <kay21s at gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > Hi Keith,
> > > > > > > > >
> > > > > > > > > Thank you for your reply.
> > > > > > > > >
> > > > > > > > > I have tested my program on two machines:
> > > > > > > > > A) 1 x Intel E5-2650 v4, 12 cores [UMA]
> > > > > > > > > B) 2 x Intel E5-2640 v4, 10 cores [NUMA]
> > > > > > > > >
> > > > > > > > > I am very sure that the primary process uses different cores from
> > > > > > > > > the secondary process. The strange thing is that my program works
> > > > > > > > > correctly on machine B. But on machine A, the above issue happens
> > > > > > > > > with more than 4 cores assigned to the secondary process.
> > > > > > > > >
> > > > > > > > > I have tried to assign cores 1-5 to the secondary process and also
> > > > > > > > > tried other core assignment policies, but the error still happens in
> > > > > > > > > rte_eal_init() with more than 4 cores.
> > > > > > > >
> > > > > > > > It would be nice to see both command lines. I am not sure I can help
> > > > > > > > more; all I can do is suggest some ideas to look at.
> > > > > > > >
> > > > > > > > Does machine B have the same number and type of NICs? Use
> > > > > > > > 'lspci | grep Ethernet' to get a list of all Ethernet devices on both
> > > > > > > > machines.
> > > > > > > >
> > > > > > > > What is the number of hugepages you have allocated on both machines?
> > > > > > > >
> > > > > > > > Also look at the cpu_layout.py script to see why adding the 5th core
> > > > > > > > would be different on the two machines, and try to make them the same.
> > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > > Kai
> > > > > > > > >
> >
> > > > > > > > > On Sat, Mar 11, 2017 at 10:52 PM, Wiles, Keith <keith.wiles at intel.com> wrote:
> > > > > > > > >
> > > > > > > > > > On Mar 10, 2017, at 9:35 PM, Kai Zhang <kay21s at gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > Hi, there
> > > > > > > > > >
> > > > > > > > > > I am using DPDK-16.11 on an XL710 40GbE NIC. OS: CentOS 7.3.1611 with
> > > > > > > > > > Linux kernel version 3.8.0-30.
> > > > > > > > > >
> > > > > > > > > > I have a master process and a secondary process. When I run the
> > > > > > > > > > secondary process with less than or equal to 4 cores, it works
> > > > > > > > > > correctly. Such as:
> > > > > > > > > > sudo ./program -l 4,5,6,7 -n 4 --proc-type=secondary
> > > > > > > > > > sudo ./program -c 0x0f -n 4 --proc-type=secondary
> > > > > > > > > >
> > > > > > > > > > However, there is an error in rte_eal_init() if I assign more than 4
> > > > > > > > > > cores:
> > > > > > > > > > sudo ./program -l 0,1,2,3,4 -n 4 --proc-type=secondary
> > > > > > > > > > sudo ./program -c 0x1f -n 4 --proc-type=secondary
> > > > > > > > > >
> > > > > > > > > > EAL: Cannot mmap device resource file /sys/bus/pci/devices/0000:02:00.0/resource0 to address: 0x7fff65bfc000
> > > > > > > > > > EAL: Error - exiting with code: 1
> > > > > > > > > >  Cause: Requested device 0000:02:00.0 cannot be used
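The relation between the -l core lists and the -c hexadecimal core masks used in the commands above can be sketched in Python. This is only an illustration, assuming plain comma-separated core numbers and simple a-b ranges; the helper names corelist_to_mask and mask_to_cores are hypothetical and are not DPDK functions.

```python
def corelist_to_mask(corelist):
    """Convert a core list such as "0,1,2,3,4" or "1-5" into a bit mask."""
    mask = 0
    for part in corelist.split(","):
        if "-" in part:
            first, last = (int(value) for value in part.split("-"))
            cores = range(first, last + 1)
        else:
            cores = [int(part)]
        for core in cores:
            mask |= 1 << core  # bit N set means core N is selected
    return mask


def mask_to_cores(mask):
    """Convert a bit mask such as 0x1f back into a sorted list of core ids."""
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


if __name__ == "__main__":
    for corelist in ("4,5,6,7", "0,1,2,3,4", "1-5"):
        print("-l %s  ->  -c %s" % (corelist, hex(corelist_to_mask(corelist))))
    for mask in (0x0F, 0x1F):
        print("-c %s  ->  cores %s" % (hex(mask), mask_to_cores(mask)))
```

By this arithmetic, -l 0,1,2,3,4 and -c 0x1f select the same five cores, while -l 4,5,6,7 corresponds to 0xf0 and -c 0x0f selects cores 0-3, so the two four-core commands quoted above name different cores.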
> >
> > > > > > > > >
> > > > > > > > > I assume you have at least 8 cores. Have you tried -l 1-5 on the
> > > > > > > > > secondary process?
> > > > > > > > >
> > > > > > > > > You did not show the primary process command line, but if you use
> > > > > > > > > 1-5 then you can only give the primary process -l 6-7, or two cores.
> > > > > > > > > It is always reasonable to leave core zero for Linux to use.
> > > > > > > > >
> > > > > > > > > Also, it could be that you ran out of memory or of the hugepages you
> > > > > > > > > allocated to the system.
> > > > > > > > >
> > > > > > > > > > Does anyone know why this happens?
> > > > > > > > > >
> > > > > > > > > > Thanks a lot,
> > > > > > > > > > Kai Zhang
> > > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > > Keith
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > Keith
> > > > > > >
> > > > > > > Regards,
> > > > > > > Keith
> > > > > >
> > > > > > Regards,
> > > > > > Keith
> >
> Regards,
> Keith
>
>
>

