[dpdk-dev] [PATCH v6 4/8] eal/linux: add per rx queue interrupt handling based on VFIO

Liang, Cunming cunming.liang at intel.com
Fri Feb 27 13:22:08 CET 2015


Hi,

From: David Marchand [mailto:david.marchand at 6wind.com]
Sent: Friday, February 27, 2015 6:34 PM
To: Liang, Cunming
Cc: dev at dpdk.org; Stephen Hemminger; Thomas Monjalon; Zhou, Danny
Subject: Re: [PATCH v6 4/8] eal/linux: add per rx queue interrupt handling based on VFIO

I am not really comfortable with this api.

This is just creating something on top of the standard epoll api with limitations.
In the end, we could just use an external lib that does this already.
[Liang, Cunming] Not really, I think. We try to protect the data inside ‘rte_intr_handle’, it doesn’t expect user to understand the things defined inside ‘rte_intr_handle’.
It’s better typedef ‘rte_intr_handle’ as a raw integer ID, having a function to get it from a ethdev. Then all the interrupt api is around it.
It provides the common pci NIC devices rxtx interrupt processing approach. For the limitations, we can fix it step by step.

So ok, this will work for your limited use case, but this will not be really useful for anything else.
Not sure it has its place in eal, this is more an example to me.
[Liang, Cunming] ‘limited use case’ do you means only for rxtx ? It don’t expect to provide a generic event mechanism (like libev/libevent does), but a simple way to allow PMD work with DMA interrupt. It mainly abstract for rx interrupt purpose. I appreciate if you could help to list more useful cases.


On Fri, Feb 27, 2015 at 5:56 AM, Cunming Liang <cunming.liang at intel.com<mailto:cunming.liang at intel.com>> wrote:
This patch does below:
 - Create multiple VFIO eventfd for rx queues.
 - Handle per rx queue interrupt.
 - Eliminate unnecessary suspended DPDK polling thread wakeup mechanism
   for rx interrupt by allowing polling thread epoll_wait rx queue
   interrupt notification.

Signed-off-by: Danny Zhou <danny.zhou at intel.com<mailto:danny.zhou at intel.com>>
Signed-off-by: Cunming Liang <cunming.liang at intel.com<mailto:cunming.liang at intel.com>>
---
v6 changes
 - split rte_intr_wait_rx_pkt into two function, wait and set.
 - rewrite rte_intr_rx_wait/rte_intr_rx_set to remove queue visibility on eal.
 - rte_intr_rx_wait to support multiplexing.
 - allow epfd as input to support flexible event fd combination.


 lib/librte_eal/linuxapp/eal/eal_interrupts.c    | 224 +++++++++++++++++++-----
 lib/librte_eal/linuxapp/eal/eal_pci_vfio.c      |  23 ++-
 lib/librte_eal/linuxapp/eal/rte_eal_version.map |   2 +
 3 files changed, 201 insertions(+), 48 deletions(-)

diff --git a/lib/librte_eal/linuxapp/eal/eal_interrupts.c b/lib/librte_eal/linuxapp/eal/eal_interrupts.c
index 8c5b834..f90c2b4 100644
--- a/lib/librte_eal/linuxapp/eal/eal_interrupts.c
+++ b/lib/librte_eal/linuxapp/eal/eal_interrupts.c

[snip]


+static void
+eal_intr_process_rxtx_interrupts(struct rte_intr_handle *intr_handle,
+                                struct epoll_event *events,
+                                uint32_t *vec, int nfds)
+{
+       int i, bytes_read;
+       union rte_intr_read_buffer buf;
+       int fd;
+
+       for (i = 0; i < nfds; i++) {
+               /* set the length to be read for different handle type */
+               switch (intr_handle->type) {
+               case RTE_INTR_HANDLE_UIO:
+                       bytes_read = sizeof(buf.uio_intr_count);
+                       break;
+               case RTE_INTR_HANDLE_ALARM:
+                       bytes_read = sizeof(buf.timerfd_num);
+                       break;
+#ifdef VFIO_PRESENT
+               case RTE_INTR_HANDLE_VFIO_MSIX:
+               case RTE_INTR_HANDLE_VFIO_MSI:
+               case RTE_INTR_HANDLE_VFIO_LEGACY:
+                       bytes_read = sizeof(buf.vfio_intr_count);
+                       break;
+#endif
+               default:
+                       bytes_read = 1;
+                       break;
+               }
+
+               /**
+               * read out to clear the ready-to-be-read flag
+               * for epoll_wait.
+               */
+               vec[i] = events[i].data.u32;
+               assert(vec[i] < VFIO_MAX_RXTX_INTR_ID);
+
+               fd = intr_handle->efds[vec[i]];
+               bytes_read = read(fd, &buf, bytes_read);
+               if (bytes_read < 0)
+                       RTE_LOG(ERR, EAL, "Error reading from file "
+                               "descriptor %d: %s\n", fd, strerror(errno));
+               else if (bytes_read == 0)
+                       RTE_LOG(ERR, EAL, "Read nothing from file "
+                               "descriptor %d\n", fd);
+       }
+}

Why unconditionnally read ?
You are absorbing events from the application if the application gave you an external epfd and populated it with its own fds.
[Liang, Cunming] The vector number was checked. If an external epfd populated some event carry fd rather than a data.u32 but the value inside the valid range, it considers as a valid vector number. No matter the read success or not, it always notify the event. Do you have any suggestion used here to check the condition ?

+
+static int init_tls_epfd(void)
+{
+       int pfd = epoll_create(1);
+       if (pfd < 0) {
+               RTE_LOG(ERR, EAL,
+                       "Cannot create epoll instance\n");
+               return -1;
+       }
+       return pfd;
+}
+
+int
+rte_intr_rx_wait(struct rte_intr_handle *intr_handle, int epfd,
+                uint32_t *vec, uint16_t num)
+{

In the end, this "rx" does not mean anything to eal.
[Liang, Cunming] That’s a good point. I tried to remove ‘rx’ and use a generic word here.
‘rte_intr_wait’ looks like too generic, ‘rte_intr_epfd_wait’ looks not abstract with bsd.
As the function only serves for rxtx vector, so using the rx prefix. Which name do you prefer ?


+#define MAX_EVENTS      8
+       struct epoll_event events[MAX_EVENTS];
+       int ret, nfds = 0;
+
+       if (!intr_handle || !vec) {
+               RTE_LOG(ERR, EAL, "invalid input parameter\n");
+               return -1;
+       }
+
+       if (intr_handle->type != RTE_INTR_HANDLE_VFIO_MSIX) {
+               RTE_LOG(ERR, EAL, "intr type should be VFIO_MSIX\n");
+               return -1;
+       }
+
+       if (epfd == RTE_EPOLL_FD_ANY) {
+               /* using per thread epoll fd */
+               if (unlikely(RTE_PER_LCORE(_epfd) == -1))
+                       RTE_PER_LCORE(_epfd) = init_tls_epfd();
+               epfd = RTE_PER_LCORE(_epfd);
+       }

Rather than testing every time, this should be set by the caller, i.e. epfd is always valid.
If application does not want to create a epfd, then it calls  rte_intr_rx_wait with RTE_EPOLL_FD_ANY (this name is not well chosen) that is a macro wrapped to RTE_PER_LCORE(_epfd).
[Liang, Cunming] It sounds good to me. As we don’t expect to expose *rte_per_lcore__epfd* as an public symbol, so will define rte_epfd() instread.
Within rte_epfd(), if RTE_PER_LCORE(_epfd) not assigned, then init_tls_epfd() once.

init_tls_epfd() should be called only once at init time.
No need to check every time.
[Liang, Cunming] As it probably not need per thread epfd at all. So I prefer to create it when it real needed as above I mentioned.

+
+       do {
+               ret = epoll_wait(epfd, events,
+                                RTE_MIN(num, MAX_EVENTS),
+                                EAL_INTR_EPOLL_WAIT_FOREVER);
+               if (unlikely(ret < 0)) {
+                       /* epoll_wait fail */
+                       RTE_LOG(ERR, EAL, "epoll_wait returns with fail\n");
+                       return -1;
+               } else if (ret > 0) {
+                       /* epoll_wait has at least one fd ready to read */
+                       eal_intr_process_rxtx_interrupts(intr_handle, events,
+                                                        vec, ret);
+                       num -= ret;
+                       vec += ret;
+                       nfds += ret;
+               } else if (nfds > 0)
+                       break;
+       } while (num > 0);
+
+       return nfds;
+}

You are blocking unless all fds have been set, so you are serialising all events.
[Liang, Cunming] I’m not sure fully got your point. If any event arrives, it gets back. Do you means if no fds added in, it’s always blocking.
You expect to have a timeout return ?

+
+int
+rte_intr_rx_set(struct rte_intr_handle *intr_handle, int epfd,
+               int op, uint32_t vec)
+{
+       struct epoll_event ev;
+
+       if (!intr_handle || vec >= VFIO_MAX_RXTX_INTR_ID) {
+               RTE_LOG(ERR, EAL, "invalid input parameter\n");
+               return -1;
+       }
+
+       if (intr_handle->type != RTE_INTR_HANDLE_VFIO_MSIX) {
+               RTE_LOG(ERR, EAL, "intr type should be VFIO_MSIX\n");
+               return -1;
+       }
+
+       switch (op) {
+       case RTE_INTR_EVENT_ADD:
+               op = EPOLL_CTL_ADD;
+               break;
+       case RTE_INTR_EVENT_DEL:
+               op = EPOLL_CTL_DEL;
+               break;
+       default:
+               RTE_LOG(ERR, EAL, "event op type mismatch\n");
+               return -1;
+       }
+
+       if (epfd == RTE_EPOLL_FD_ANY) {
+               /* using per thread epoll fd */
+               if (RTE_PER_LCORE(_epfd) == -1)
+                       RTE_PER_LCORE(_epfd) = init_tls_epfd();
+               epfd = RTE_PER_LCORE(_epfd);
+       }
+
+       ev.data.u32 = vec;
+       ev.events = EPOLLIN | EPOLLPRI;
+       if (epoll_ctl(epfd, op, intr_handle->efds[vec], &ev) < 0) {
+               RTE_LOG(ERR, EAL, "Error op %d fd %d epoll_ctl, %s\n",
+                       op, intr_handle->efds[vec], strerror(errno));
+               return -1;
+       }
+
+       return 0;
+}


diff --git a/lib/librte_eal/linuxapp/eal/eal_pci_vfio.c b/lib/librte_eal/linuxapp/eal/eal_pci_vfio.c
index ee9660f..d90d23c 100644
--- a/lib/librte_eal/linuxapp/eal/eal_pci_vfio.c
+++ b/lib/librte_eal/linuxapp/eal/eal_pci_vfio.c
@@ -38,6 +38,7 @@
 #include <sys/socket.h>
 #include <sys/ioctl.h>
 #include <sys/mman.h>
+#include <sys/epoll.h>

 #include <rte_log.h>
 #include <rte_pci.h>
@@ -274,16 +275,18 @@ pci_vfio_setup_interrupts(struct rte_pci_device *dev, int vfio_dev_fd)
                ret = ioctl(vfio_dev_fd, VFIO_DEVICE_GET_IRQ_INFO, &irq);
                if (ret < 0) {
                        RTE_LOG(ERR, EAL, "  cannot get IRQ info, "
-                                       "error %i (%s)\n", errno, strerror(errno));
+                               "error %i (%s)\n", errno, strerror(errno));
                        return -1;
                }

Garbage, this has nothing to do with the patch.
[Liang, Cunming] It’s for line number exceed 80 margin complain.


                /* if this vector cannot be used with eventfd, fail if we explicitly
                 * specified interrupt type, otherwise continue */
                if ((irq.flags & VFIO_IRQ_INFO_EVENTFD) == 0) {
-                       if (internal_config.vfio_intr_mode != RTE_INTR_MODE_NONE) {
+                       if (internal_config.vfio_intr_mode !=
+                           RTE_INTR_MODE_NONE) {
                                RTE_LOG(ERR, EAL,
-                                               "  interrupt vector does not support eventfd!\n");
+                                       "  interrupt vector "
+                                       "does not support eventfd!\n");
                                return -1;
                        } else
                                continue;

Idem.
[Liang, Cunming] The same.


@@ -293,17 +296,27 @@ pci_vfio_setup_interrupts(struct rte_pci_device *dev, int vfio_dev_fd)
                fd = eventfd(0, 0);
                if (fd < 0) {
                        RTE_LOG(ERR, EAL, "  cannot set up eventfd, "
-                                       "error %i (%s)\n", errno, strerror(errno));
+                               "error %i (%s)\n", errno, strerror(errno));

Idem.
[Liang, Cunming] The same.


                        return -1;
                }

                dev->intr_handle.fd = fd;
                dev->intr_handle.vfio_dev_fd = vfio_dev_fd;
-

Idem.
[Liang, Cunming] Accept.

                switch (i) {
                case VFIO_PCI_MSIX_IRQ_INDEX:
                        internal_config.vfio_intr_mode = RTE_INTR_MODE_MSIX;
                        dev->intr_handle.type = RTE_INTR_HANDLE_VFIO_MSIX;
+                       for (i = 0; i < VFIO_MAX_RXTX_INTR_ID; i++) {
+                               fd = eventfd(0, 0);
+                               if (fd < 0) {
+                                       RTE_LOG(ERR, EAL,
+                                               "cannot setup eventfd,"
+                                               "error %i (%s)\n",
+                                               errno, strerror(errno));
+                                       return -1;
+                               }
+                               dev->intr_handle.efds[i] = fd;
+                       }
                        break;
                case VFIO_PCI_MSI_IRQ_INDEX:
                        internal_config.vfio_intr_mode = RTE_INTR_MODE_MSI;
diff --git a/lib/librte_eal/linuxapp/eal/rte_eal_version.map b/lib/librte_eal/linuxapp/eal/rte_eal_version.map
index 5f1857d..892a452 100644
--- a/lib/librte_eal/linuxapp/eal/rte_eal_version.map
+++ b/lib/librte_eal/linuxapp/eal/rte_eal_version.map
@@ -64,6 +64,8 @@ DPDK_2.0 {
        rte_intr_callback_unregister;
        rte_intr_disable;
        rte_intr_enable;
+       rte_intr_rx_set;
+       rte_intr_rx_wait;
        rte_log;
        rte_log_add_in_history;
        rte_log_cur_msg_loglevel;
--
1.8.1.4




--
David Marchand



More information about the dev mailing list