I am using Intel X520 and X540 dual-port NICs attached to a Dell PowerEdge server. All NIC ports can run at 10 Gbps, so the total is 40 Gbps. The system has 2 sockets, each with a Xeon E5-2640 v3 CPU (Haswell microarchitecture).
I am facing several problems that could be resolved with PCIe and DMA benchmarking, but I couldn't find a proper way to do it. I am unable to achieve 40 Gbps throughput even with DPDK-based drivers and libraries (with 64-byte packets). I need to run the experiments with 64-byte packets and cannot change the packet size.
I am generating packets with DPDK-pktgen and counting events with Intel PCM (./pcm-pcie.x). However, the counting only goes one way: I can count how many events occur, but I cannot tell the maximum number of each event the system can support. The results from pcm-pcie.x:
    Skt  PCIeRdCur   RFO      CRd     DRd    ItoM   PRd   WiL
     0   73 M        3222 K   784 K   63 M   52 M   0     2791 K
My NICs are connected to socket 0, which is why I am not including the socket 1 results.
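For reference, this is roughly how I invoke the two tools (the core/port mapping and sampling interval below are only examples from my setup, not my exact command lines):

    # DPDK-pktgen: EAL options first, then pktgen's promiscuous mode and core-to-port mapping
    ./pktgen -l 0-5 -n 4 -- -P -m "[2:3].0, [4:5].1"

    # Intel PCM PCIe counters, sampled every second
    ./pcm-pcie.x 1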
Is there any way to benchmark the PCIe bus and the DMA engine? And is there any way to get precise latency in the I/O subsystem (at each level) for packet processing? (I can't use rdtsc() to measure hardware-level latencies.)
You didn't mention whether your CPU cores are running at 100% utilisation. If they are running at maximum capacity and you're not getting line-rate 40 Gbps, then the problem is possibly software related.
Have a look at SystemTap; you can use it to debug and record the run time, latency and jitter (create a histogram) of kernel events and functions. There is a great example in this blog post: https://blog.cloudflare.com/revenge-listening-sockets/
This isn't exactly what you asked for, but you can use it to help narrow down the bottleneck in your testing. You can use SystemTap to monitor kernel function call counts, execution time (latency) and jitter, and perf under Linux is also very helpful for monitoring system performance (context switches, branch misses etc., see here and here), so together these will help you narrow down a bottleneck in your software.
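As a minimal sketch (the probed function name is only an example; with a DPDK poll-mode driver the kernel ixgbe path is largely bypassed, so probe whatever function your investigation points you at), a per-call latency histogram with SystemTap looks roughly like this:

    stap -e '
    global start, lat
    probe kernel.function("ixgbe_xmit_frame").call   { start[tid()] = gettimeofday_ns() }
    probe kernel.function("ixgbe_xmit_frame").return {
        if (start[tid()]) { lat <<< gettimeofday_ns() - start[tid()]; delete start[tid()] }
    }
    probe end { print(@hist_log(lat)) }
    '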
This may lead you to a function which is interacting directly with hardware like this: http://elixir.free-electrons.com/linux/latest/source/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c#L8000
Or these tools may lead you to a software function that is causing the latency, perhaps because it has a high cache miss rate, for example.
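For example (the event names are generic perf events and the process name is a placeholder; adjust them for your binary and CPU):

    # System-wide counters for 10 seconds: cache misses, branch misses, context switches
    perf stat -a -e cycles,instructions,cache-misses,branch-misses,context-switches sleep 10

    # Sample where a specific process spends its time, then inspect the profile
    perf record -g -p $(pidof pktgen) -- sleep 10
    perf report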
Edit:
You also didn't mention your OS version, kernel version, NIC driver and firmware versions, etc. In my experience it is very important for good DPDK performance that you are using the latest NIC firmware and drivers and a recent kernel build.
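For example, you can check what you're currently running with the following (the interface name is a placeholder):

    uname -r               # kernel version
    cat /etc/os-release    # OS / distribution version
    ethtool -i eth0        # driver name, driver version and NIC firmware version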