[dpdk-dev] [PATCH v2 6/7] rte_sched: eliminate floating point in calculating byte clock

Stephen Hemminger stephen at networkplumber.org
Tue Feb 17 17:05:00 CET 2015

Previous message: [dpdk-dev] [PATCH v2 6/7] rte_sched: eliminate floating point in calculating byte clock
Next message: [dpdk-dev] [PATCH v2 7/7] rte_sched: rearrange data structures
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Mon, 16 Feb 2015 22:44:31 +0000
"Dumitrescu, Cristian" <cristian.dumitrescu at intel.com> wrote:

> Hi Stephen,
> 
> Sorry, NACK.
> 
> 1. Overflow issue
> As you declare cycles_per_byte as uint32_t, for a CPU frequency of 2-3 GHz, the line of code below results in overflow:
> 	port->cycles_per_byte = (rte_get_tsc_hz() << RTE_SCHED_TIME_SHIFT) / params->rate;
> Therefore, there is most likely a significant accuracy loss, which might result in more packets allowed to go out than it should.

The tsc shifted is still 64 bits.
and rate is 32 bits bytes/sec.

I chose scale such that
if clock = 3 Ghz
then min rate = 715 bytes/sec =  5722 bits/sec

> 2. Integer division has a higher cost than floating point division
> My understanding is we are considering a performance improvement by replacing the double precision floating point division in:
> 	double bytes_diff = ((double) cycles_diff) / port->cycles_per_byte;
> with an integer division:
> 	uint64_t bytes_diff = (cycles_diff << RTE_SCHED_TIME_SHIFT) / port->cycles_per_byte;
> I don't think this is going to have the claimed benefit, as acording to "Intel 64 and IA-32 Architectures Optimization  Reference Manual" (Appendix C), the latency of the integer division instruction is significantly bigger than the latency of integer division:
> 	Instruction FDIV double precision: latency = 38-40 cycles
> 	Instruction IDIV: latency = 56 - 80 cycles

I observed that performance when from 5Gbit/sec to 10Gbit/sec.
Mostly because the floating point engages more instruction units and does not
pipeline. Cycle count is not everything.  This was on Ivy Bridge processor.


> 3. Alternative
> I hear though your suggestion about replacing the floating point division with a more performant construction. One suggestion would be to replace it with an integer multiplication followed by a shift right, probably by using a uint64_t bytes_per_cycle_scaled_up (the inverse of cycles_per_bytes). I need to prototype this code myself. Would you be OK to look into providing an alternative implementation?
>

I looked into multiplative integer method, and will do it in future. But it has
more scaling issues since it would require that the values both be 32 bits.

Previous message: [dpdk-dev] [PATCH v2 6/7] rte_sched: eliminate floating point in calculating byte clock
Next message: [dpdk-dev] [PATCH v2 7/7] rte_sched: rearrange data structures
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

More information about the dev mailing list