[dpdk-dev] [PATCH v2 1/6] eal: introduce oops handling API

Jerin Jacob jerinjacobk at gmail.com
Wed Aug 18 11:37:25 CEST 2021


On Tue, Aug 17, 2021 at 9:22 PM Stephen Hemminger
<stephen at networkplumber.org> wrote:
>
> On Tue, 17 Aug 2021 20:57:50 +0530
> Jerin Jacob <jerinjacobk at gmail.com> wrote:
>
> > On Tue, Aug 17, 2021 at 8:39 PM Stephen Hemminger
> > <stephen at networkplumber.org> wrote:
> > >
> > > On Tue, 17 Aug 2021 13:08:46 +0530
> > > Jerin Jacob <jerinjacobk at gmail.com> wrote:
> > >
> > > > On Tue, Aug 17, 2021 at 9:23 AM Stephen Hemminger
> > > > <stephen at networkplumber.org> wrote:
> > > > >
> > > > > On Tue, 17 Aug 2021 08:57:18 +0530
> > > > > <jerinj at marvell.com> wrote:
> > > > >
> > > > > > From: Jerin Jacob <jerinj at marvell.com>
> > > > > >
> > > > > > > Introduce an oops handling API with the following specification
> > > > > > > and enable a stub implementation for Linux and FreeBSD.
> > > > > > >
> > > > > > > On rte_eal_init() invocation, the EAL library installs the
> > > > > > > oops handler for the essential signals.
> > > > > > > The rte_oops_signals_enabled() API provides the list
> > > > > > > of signals for which the EAL installed the handler.
> > > > >
> > > > > This is a big change, and many applications already handle these
> > > > > signals themselves. Therefore adding this needs to be opt-in
> > > > > and not enabled by default.
> > > >
> > > > > To avoid requiring every application to explicitly register this
> > > > > signal handler, and to co-exist with application-specific
> > > > > signal handlers, the following design has been chosen.
> > > > > (It is mentioned in the commit log;
> > > > > I describe it here for more clarity.)
> > > >
> > > > > Case 1:
> > > > > a) The application installs its signal handler prior to rte_eal_init().
> > > > > b) The implementation stores the application-specific signal handler and
> > > > > replaces it with the EAL oops handler.
> > > > > c) When the application/DPDK gets a segfault, the default EAL oops
> > > > > handler gets invoked.
> > > > > d) It dumps the EAL-specific message, then calls the
> > > > > application-specific signal handler
> > > > > installed in step a) by the application. This avoids breaking any contract
> > > > > with the application,
> > > > > i.e. the behavior is the same as the current EAL.
> > > > > That is the reason for not using SA_RESETHAND (which would call SIG_DFL after
> > > > > the EAL oops handler instead of the
> > > > > application-specific handler).
> > > >
> > > > Case 2:
> > > > > a) The application installs its signal handler after rte_eal_init().
> > > > > b) The EAL handler gets replaced with the application handler; the application can then call
> > > > > rte_oops_decode() to decode.
> > > >
> > > > > To cater to the above use cases, rte_oops_signals_enabled() and
> > > > > rte_oops_decode()
> > > > > are provided.
> > > >
> > > > Here we are not breaking any contract with the application.
> > > > Do you have concerns about this design?
> > >
> > > In our application as a service it is important not to do any backtrace
> > > in production. We rely on other infrastructure to process coredumps.
> >
> > Other infrastructure will still work. For example, if we are using the standard
> > Linux coredump infrastructure, then in the current implementation:
> > - The EAL handler dumps the DPDK oops, like the kernel does, on stderr.
> > - The implementation calls SIG_DFL in the EAL oops handler.
> > - The above step creates the coredump or hands over to whatever other
> > infrastructure you are using for coredumps.
> >
> > >
> > > This should be enabled by a command line argument.
> >
> > If we allow other coredump infrastructure to work as-is, why is an
> > enable/disable option required from EAL?
>
> Adding the DPDK oops handler introduces extra steps, which make all
> faults be identified as coming from the oops code.

Since we are using SA_ONSTACK, the original segfault info is not
lost.
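
For reference, the Case 1 flow quoted above (save the application's
handler, dump, then call it) could look roughly like the sketch below.
It is an illustration only, not the code from this series: all function
and variable names are made up, and it saves a single signal's previous
action rather than a per-signal table.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical names throughout; none of these come from the series. */
static struct sigaction app_action;       /* action the application had installed */
static volatile sig_atomic_t oops_calls;  /* times the oops-style handler ran */
static volatile sig_atomic_t app_calls;   /* times the application handler ran */

/* Stand-in for a handler the application installed before rte_eal_init(). */
static void app_handler(int sig, siginfo_t *info, void *uc)
{
	(void)sig; (void)info; (void)uc;
	app_calls++;
}

/* Oops-style handler: dump first, then chain to the saved application handler. */
static void oops_handler(int sig, siginfo_t *info, void *uc)
{
	static const char msg[] = "oops: signal caught\n";

	oops_calls++;
	/* write() is async-signal-safe, unlike printf(). */
	if (write(STDERR_FILENO, msg, sizeof(msg) - 1) < 0) {
		/* nothing useful to do on failure inside a handler */
	}

	if (app_action.sa_flags & SA_SIGINFO)
		app_action.sa_sigaction(sig, info, uc);
	else if (app_action.sa_handler != SIG_DFL &&
		 app_action.sa_handler != SIG_IGN)
		app_action.sa_handler(sig);
}

/* What the application would do before EAL init. */
static int install_app_handler(int sig)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = app_handler;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	return sigaction(sig, &sa, NULL);
}

/* What an EAL-like library would do at init: replace, but remember the old action. */
static int install_oops_handler(int sig)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = oops_handler;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	/* The third argument receives the previously installed action. */
	return sigaction(sig, &sa, &app_action);
}
```

Calling install_app_handler(SIGUSR1), then install_oops_handler(SIGUSR1),
then raise(SIGUSR1) runs both handlers once, oops handler first. SIGUSR1
is used here only so the sketch can be tried without crashing the process.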

I verified that the original fault information is preserved, using the
steps below.

0) Enable coredump infrastructure in Linux, e.g. using coredumpctl
1) Apply this series
2) Apply the following patch to create a segfault from the library.
This tests that a segfault caught by EAL is forwarded to the default Linux
signal handler.

[main]dell[dpdk.org] $ git diff
diff --git a/lib/eal/linux/eal.c b/lib/eal/linux/eal.c
index 3438a96b75..b935c32c98 100644
--- a/lib/eal/linux/eal.c
+++ b/lib/eal/linux/eal.c
@@ -1338,6 +1338,8 @@ rte_eal_init(int argc, char **argv)

        eal_mcfg_complete();

+       /* Generate a segfault */
+       *(volatile int *)0x05 = 0;
        return fctret;

 }
3) Build
meson --buildtype debug build
ninja -C build

4) Run
$ ./build/app/test/dpdk-test --no-huge  -c 0x2

Please find the oops dump [1] and the gdb core dump backtrace [2] below.
The gdb core dump backtrace preserves the original segfault cause and trace.
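
As an illustration of the "dump, then hand over to SIG_DFL" idea (again
not the code from this series, and the names are made up), a handler can
restore the default disposition and re-raise the signal, so that the
process still terminates with the original signal and the kernel's normal
core dump path runs. Whether the series re-raises or simply returns to the
faulting instruction is not shown here; this is just one common way to do
it. The helper runs the handler in a child process so the result can be
observed from the parent.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical handler: print a short note, then hand the signal to SIG_DFL. */
static void dump_then_default(int sig)
{
	static const char msg[] = "oops: dumping, then forwarding to SIG_DFL\n";

	/* write(), signal() and raise() are all async-signal-safe. */
	if (write(STDERR_FILENO, msg, sizeof(msg) - 1) < 0) {
		/* nothing useful to do on failure inside a handler */
	}
	signal(sig, SIG_DFL);
	/* The signal is blocked while we are in its handler, so this stays
	 * pending and is delivered with the default action once we return. */
	raise(sig);
}

/*
 * Hypothetical helper: install the handler in a child, trigger the signal
 * there, and return the signal that terminated the child (or -1).
 */
static int run_child_and_get_term_signal(int sig)
{
	int status = 0;
	pid_t pid = fork();

	if (pid < 0)
		return -1;
	if (pid == 0) {
		struct sigaction sa;

		memset(&sa, 0, sizeof(sa));
		sa.sa_handler = dump_then_default;
		sigemptyset(&sa.sa_mask);
		sigaction(sig, &sa, NULL);
		raise(sig);
		_exit(0); /* only reached if forwarding did not terminate us */
	}
	if (waitpid(pid, &status, 0) < 0)
		return -1;
	return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```

With this sketch, run_child_and_get_term_signal(SIGSEGV) reports SIGSEGV
as the terminating signal, i.e. the parent (or any coredump
infrastructure) still sees a SIGSEGV death rather than a clean exit from
the handler.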

Any other concerns?


[1]
[main]dell[dpdk.org] $ ./build/app/test/dpdk-test --no-huge  -c 0x2
EAL: Detected 56 lcore(s)
EAL: Detected 2 NUMA nodes
EAL: Static memory layout is selected, amount of reserved memory can be adjusted with -m or --socket-mem
EAL: Detected static linkage of DPDK
EAL: Multi-process socket /run/user/1000/dpdk/rte/mp_socket
EAL: Selected IOVA mode 'VA'
EAL: WARNING: Main core has no memory on local socket!
Signal info:
------------
PID:           2666512
Signal number: 11
Fault address: 0x5

Backtrace:
----------
[  0x5582acd1e08a]: rte_eal_init()+0xe18
[  0x5582ac086f4e]: main()+0x298
[  0x7f0facf1fb25]: __libc_start_main()+0xd5
[  0x5582ac079c9e]: _start()+0x2e

Arch info:
----------
R8 : 0x0000000000000002  R9 : 0x00007ffe9273c590
R10: 0x0000000000000000  R11: 0x0000000000000246
R12: 0x00005582bc3ce7a0  R13: 0x00000000000000ca
R14: 0x0000000000000000  R15: 0x0000000000000000
RAX: 0x0000000000000005  RBX: 0x00005582bc3c75c8
RCX: 0x00007ffe9273c530  RDX: 0x0000000000000000
RBP: 0x00007ffe9273c820  RSP: 0x00007ffe9273c690
RSI: 0x0000000000000008  RDI: 0x00000000000000ca
RIP: 0x00005582acd1e08a  EFL: 0x0000000000010246


[2]

Core was generated by `./build/app/test/dpdk-test --no-huge -c 0x2'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  rte_eal_init (argc=4, argv=0x7ffe9273cec8) at ../lib/eal/linux/eal.c:1342
1342            *(volatile int *)0x05 = 0;
[Current thread is 1 (Thread 0x7f0faca83c00 (LWP 2666512))]
(gdb) bt
#0  rte_eal_init (argc=4, argv=0x7ffe9273cec8) at ../lib/eal/linux/eal.c:1342
#1  0x00005582ac086f4e in main (argc=4, argv=0x7ffe9273cec8) at ../app/test/test.c:146




>

