[dpdk-dev,v4] eal: Set numa node value for systems which do not support it.

Message ID: 1494467793-19887-1-git-send-email-nic@opencloud.tech
State: Accepted, archived

Checks

Context               Check    Description
ci/checkpatch         success  coding style OK
ci/Intel-compilation  success  Compilation OK

Commit Message

nickcooper-zhangtonghao May 11, 2017, 1:56 a.m. UTC
The NUMA node information for PCI devices provided through
sysfs is invalid for AMD Opteron(TM) Processor 62xx and 63xx
on Red Hat Enterprise Linux 6, and for VMs on some hypervisors.
Check that the value read from sysfs is valid, and default to
node 0 when it is invalid or not present.

Signed-off-by: Tonghao Zhang <nic@opencloud.tech>
---
 lib/librte_eal/linuxapp/eal/eal_pci.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)
  

Comments

Sergio Gonzalez Monroy June 22, 2017, 3:15 p.m. UTC | #1
Just fyi, the summary line should be lowercase apart from acronyms (DPDK 
guidelines).

On 11/05/2017 02:56, Tonghao Zhang wrote:
> The NUMA node information for PCI devices provided through
> sysfs is invalid for AMD Opteron(TM) Processor 62xx and 63xx
> on Red Hat Enterprise Linux 6, and VMs on some hypervisors.
> It is good to see more checking for valid values.
>
> Signed-off-by: Tonghao Zhang <nic@opencloud.tech>
> ---

IMHO the message could be slightly improved by adding some of the
replies that you made to your v3,
e.g. a typical wrong numa_node value in VMs:

$ cat /sys/devices/pci0000:00/0000:00:18.6/numa_node
-1

>   lib/librte_eal/linuxapp/eal/eal_pci.c | 18 +++++++++---------
>   1 file changed, 9 insertions(+), 9 deletions(-)
>
> diff --git a/lib/librte_eal/linuxapp/eal/eal_pci.c b/lib/librte_eal/linuxapp/eal/eal_pci.c
> index 595622b..c817b4c 100644
> --- a/lib/librte_eal/linuxapp/eal/eal_pci.c
> +++ b/lib/librte_eal/linuxapp/eal/eal_pci.c
> @@ -310,18 +310,18 @@
>   			dev->max_vfs = (uint16_t)tmp;
>   	}
>   
> -	/* get numa node */
> +	/* get numa node, default to 0 if not present */
>   	snprintf(filename, sizeof(filename), "%s/numa_node",
>   		 dirname);
> -	if (access(filename, R_OK) != 0) {
> -		/* if no NUMA support, set default to 0 */
> -		dev->device.numa_node = 0;
> -	} else {
> -		if (eal_parse_sysfs_value(filename, &tmp) < 0) {
> -			free(dev);
> -			return -1;
> -		}
> +
> +	if (eal_parse_sysfs_value(filename, &tmp) == 0 &&
> +		tmp < RTE_MAX_NUMA_NODES)
>   		dev->device.numa_node = tmp;
> +	else {
> +		RTE_LOG(WARNING, EAL,
> +			"numa_node is invalid or not present. "
> +			"Set it 0 as default\n");
> +		dev->device.numa_node = 0;
>   	}
>   
>   	rte_pci_device_name(addr, dev->name, sizeof(dev->name));

The code changes look fine, so I leave it to Thomas regarding the commit 
message :)

Acked-by: Sergio Gonzalez Monroy <sergio.gonzalez.monroy@intel.com>
  
Thomas Monjalon June 23, 2017, 1:02 p.m. UTC | #2
22/06/2017 17:15, Sergio Gonzalez Monroy:
[...]
> The code changes look fine, so I leave it to Thomas regarding the commit 
> message :)
> 
> Acked-by: Sergio Gonzalez Monroy <sergio.gonzalez.monroy@intel.com>

Applied, thanks
  
Sergio Gonzalez Monroy June 26, 2017, 9:14 a.m. UTC | #3
On 23/06/2017 14:02, Thomas Monjalon wrote:
> Applied, thanks

It looks like some systems have quite a few devices that report -1 as
their numa_node value, causing lots of warning messages to be printed.
Quick fixes that come to mind would be:
1) Change log level to DEBUG
2) Add static var to only print the message once.

I also think that the message itself should show at least the BDF, so
that we know which devices are reporting bad numa_node values.

Thoughts?

Sergio
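
Option 2 together with the BDF suggestion could be sketched roughly as below. This is a minimal illustration only, not DPDK code: `struct sketch_pci_addr` and `sketch_warn_bad_numa_once` are hypothetical names, the message wording is invented, and plain `fprintf` stands in for the real logging macro.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical, simplified PCI address; not the DPDK structure. */
struct sketch_pci_addr {
	unsigned int domain;
	unsigned int bus;
	unsigned int devid;
	unsigned int function;
};

/*
 * Hypothetical helper: report an invalid numa_node only once per process.
 * Returns true if a message was written to `out`, false if it was
 * suppressed because an earlier call already warned.
 */
static bool
sketch_warn_bad_numa_once(FILE *out, const struct sketch_pci_addr *addr)
{
	static bool warned;	/* zero-initialised, persists across calls */

	assert(out != NULL && addr != NULL);

	if (warned)
		return false;
	warned = true;

	/* Print the BDF as domain:bus:device.function. */
	fprintf(out, "EAL: %04x:%02x:%02x.%x: numa_node is invalid or not "
		"present, using 0 (further warnings suppressed)\n",
		addr->domain, addr->bus, addr->devid, addr->function);
	return true;
}
```

With this shape only the first offending device is ever named, and the flag is not protected against concurrent callers.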
  
Thomas Monjalon June 26, 2017, 9:39 a.m. UTC | #4
26/06/2017 11:14, Sergio Gonzalez Monroy:
> It looks like some systems have quite a few devices that report -1 as 
> numa_node value causing lots of warning messages being printed.
> Quick fixes that come to mind would be:
> 1) Change log level to DEBUG

As it is important for performance, it should not be just for DEBUG.

> 2) Add static var to only print the message once.

Yes, good idea.

> I also think that the message itself should show at least the BDF to at 
> least know which devices are reporting bad numa_node values.

With the static variable, we will have only the first device BDF.
Is it relevant?
  
Sergio Gonzalez Monroy June 26, 2017, 12:50 p.m. UTC | #5
On 26/06/2017 10:39, Thomas Monjalon wrote:
>> It looks like some systems have quite a few devices that report -1 as
>> numa_node value causing lots of warning messages being printed.
>> Quick fixes that come to mind would be:
>> 1) Change log level to DEBUG
> As it is important for performance, it should not be just for DEBUG.
>
>> 2) Add static var to only print the message once.
> Yes good idea.
>
>> I also think that the message itself should show at least the BDF to at
>> least know which devices are reporting bad numa_node values.
> With the static variable, we will have only the first device BDF.
> Is it relevant?
>

I think it is relevant if it affects a device used by DPDK, but we don't
know that when doing the full pci_scan.

At least on x86 platforms we usually see many PCI devices without numa_node:
ls /sys/bus/pci/devices | xargs -n 1 -I {} head -v "/sys/bus/pci/devices/{}/numa_node"

A single warning is not going to mean much if all platforms have PCI
devices without a proper numa_node, right?

A cleaner solution might be to leave -1 if we failed to parse
numa_node, then in rte_pci_probe_one_driver, after checking whether the
device is blacklisted, check if socket_id is -1 and show a warning message
while defaulting to 0?

I would be inclined to:
a) leave it as it is with DEBUG log level, also showing PCI BDF (very 
noisy in debug mode).
b) show the warning and default to 0 in rte_pci_probe_one_driver, 
showing only relevant devices.

Sergio
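
Proposal b) could look roughly like the sketch below. It is an assumption-laden illustration rather than the real probe path: `struct sketch_pci_device`, its fields and `sketch_fixup_numa_at_probe` are hypothetical, and the message text is made up. The scan step is assumed to have left -1 in place for devices whose numa_node could not be parsed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical, simplified device record; not the DPDK structure. */
struct sketch_pci_device {
	const char *name;	/* BDF string, e.g. "0000:00:18.6" */
	int numa_node;		/* -1 means "unknown after scan" */
	bool blacklisted;	/* device is skipped by the probe */
};

/*
 * Hypothetical probe-time step: blacklisted devices are left untouched
 * and produce no output; a device that is about to be used gets a
 * warning and node 0 if its node is still unknown.
 * Returns true if a warning was emitted.
 */
static bool
sketch_fixup_numa_at_probe(FILE *out, struct sketch_pci_device *dev)
{
	assert(out != NULL && dev != NULL);

	if (dev->blacklisted)
		return false;
	if (dev->numa_node >= 0)
		return false;

	fprintf(out, "EAL: %s: invalid NUMA node, defaulting to 0\n",
		dev->name);
	dev->numa_node = 0;
	return true;
}
```

Because the check runs only for devices that reach the probe, unused devices with a bad numa_node stay silent.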
  
Thomas Monjalon June 26, 2017, 2:36 p.m. UTC | #6
26/06/2017 14:50, Sergio Gonzalez Monroy:
> On 26/06/2017 10:39, Thomas Monjalon wrote:
> >> It looks like some systems have quite a few devices that report -1 as
> >> numa_node value causing lots of warning messages being printed.
> >> Quick fixes that come to mind would be:
> >> 1) Change log level to DEBUG
> > As it is important for performance, it should not be just for DEBUG.
> >
> >> 2) Add static var to only print the message once.
> > Yes good idea.
> >
> >> I also think that the message itself should show at least the BDF to at
> >> least know which devices are reporting bad numa_node values.
> > With the static variable, we will have only the first device BDF.
> > Is it relevant?
> >
> 
> I think it is relevant if it affects a device used by DPDK, but we don't 
> know that when doing full pci_scan.
> 
> At least on x86 platforms we usually see many PCI devices without numa_node:
> ls /sys/bus/pci/devices | xargs -n 1 -I {} head -v 
> "/sys/bus/pci/devices/{}/numa_node"
> 
> A single warning is not going to mean much if all platforms have PCI 
> devices without proper numa_node, right?
> 
> A more cleaner solution might be to leave -1 if we failed to parse 
> numa_node, then on rte_pci_probe_one_driver after checking if it is 
> blacklisted check if socket_id is -1 and show warning message defaulting 
> to 0?
> 
> I would be inclined to:
> a) leave it as it is with DEBUG log level, also showing PCI BDF (very 
> noisy in debug mode).
> b) show the warning and default to 0 in rte_pci_probe_one_driver, 
> showing only relevant devices.

Looks like a good proposal, Sergio!

Thanks
  

Patch

diff --git a/lib/librte_eal/linuxapp/eal/eal_pci.c b/lib/librte_eal/linuxapp/eal/eal_pci.c
index 595622b..c817b4c 100644
--- a/lib/librte_eal/linuxapp/eal/eal_pci.c
+++ b/lib/librte_eal/linuxapp/eal/eal_pci.c
@@ -310,18 +310,18 @@ 
 			dev->max_vfs = (uint16_t)tmp;
 	}
 
-	/* get numa node */
+	/* get numa node, default to 0 if not present */
 	snprintf(filename, sizeof(filename), "%s/numa_node",
 		 dirname);
-	if (access(filename, R_OK) != 0) {
-		/* if no NUMA support, set default to 0 */
-		dev->device.numa_node = 0;
-	} else {
-		if (eal_parse_sysfs_value(filename, &tmp) < 0) {
-			free(dev);
-			return -1;
-		}
+
+	if (eal_parse_sysfs_value(filename, &tmp) == 0 &&
+		tmp < RTE_MAX_NUMA_NODES)
 		dev->device.numa_node = tmp;
+	else {
+		RTE_LOG(WARNING, EAL,
+			"numa_node is invalid or not present. "
+			"Set it 0 as default\n");
+		dev->device.numa_node = 0;
 	}
 
 	rte_pci_device_name(addr, dev->name, sizeof(dev->name));
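
The range check in the patch also catches the "-1" case shown in the thread, on the assumption that the sysfs text is parsed into an unsigned long (as `tmp` appears to be): a leading minus sign makes `strtoul` return a huge wrapped value, which fails the upper-bound test. A standalone sketch of that behaviour follows; `sketch_numa_node_from_text` and `SKETCH_MAX_NUMA_NODES` are hypothetical stand-ins, and the limit of 8 is an arbitrary illustrative value.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for RTE_MAX_NUMA_NODES; 8 is illustrative only. */
#define SKETCH_MAX_NUMA_NODES 8UL

/*
 * Hypothetical helper: turn the text read from a numa_node file into a
 * usable node id, falling back to 0 when the text is empty, malformed
 * or out of range.
 */
static int
sketch_numa_node_from_text(const char *text)
{
	char *end = NULL;
	unsigned long val;

	assert(text != NULL);

	errno = 0;
	val = strtoul(text, &end, 0);

	/* Reject conversion errors, empty input and trailing garbage. */
	if (errno != 0 || end == text || (*end != '\n' && *end != '\0'))
		return 0;

	/* "-1" wraps to ULONG_MAX, so it is rejected by this range check. */
	if (val >= SKETCH_MAX_NUMA_NODES)
		return 0;

	return (int)val;
}
```

In this sketch, unlike the patch, no warning is logged when the fallback is taken; only the value selection is shown.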