Architecture

We have developed the Lancero SGDMA high-bandwidth and Vulcano SGDMA low-latency FPGA IP cores and drivers. Lancero focuses on bulk transfers; Vulcano focuses on low latency for typically small transfers. Our next-generation IP core combines the features of both and will broaden OS driver support for CPU and GPU P2P.

  • Zero-copy SGDMA between FPGA on-chip memory and buffers allocated in CPU user-space virtual memory. The CPU is not involved in data copies, only in transfer setup.
  • Autonomous cyclic SGDMA: the CPU sets up the transfer only once, as sketched after this list.
  • Ultra-low-latency ‘Vulcano’ SGDMA variant optimized for the lowest possible latencies over PCIe.
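To illustrate the autonomous cyclic SGDMA concept, a minimal sketch in C of a cyclic scatter-gather descriptor ring follows. The descriptor layout and field names are illustrative assumptions, not the actual Lancero or Vulcano descriptor format.

    #include <stdint.h>
    #include <stddef.h>

    /* Illustrative SGDMA descriptor; the real core's layout differs. */
    struct sg_desc {
        uint64_t bus_addr;  /* DMA bus address of one scattered buffer */
        uint32_t length;    /* transfer length in bytes */
        uint32_t control;   /* e.g. an interrupt-on-completion flag */
        uint64_t next;      /* bus address of the next descriptor */
    };

    /*
     * Link n descriptors into a ring: the last one points back to the
     * first, so the DMA engine cycles autonomously and the CPU only
     * programs the ring once.
     */
    static void make_cyclic(struct sg_desc *ring,
                            const uint64_t *desc_bus_addr, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            ring[i].next = desc_bus_addr[(i + 1) % n];
    }

Once the ring is linked, the engine wraps from the last descriptor back to the first without further CPU involvement.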

Standards compliant

  • For software developers, the complexities are hidden under a standard API; see the sketch after this list.
  • Linux kernel drivers for FPGA SGDMA I/O, POSIX compliant.
  • Windows kernel drivers for FPGA SGDMA I/O, OS standard compliant.
  • QNX Neutrino real-time automotive device driver. QNX development partner.
  • On-chip FPGA bus standard compliant.
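Because the driver API is POSIX compliant, a zero-copy SGDMA transfer looks like ordinary file I/O to the application. A minimal sketch, assuming a hypothetical /dev/lancero0 character device node; the actual node name depends on the driver.

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        /* Hypothetical device node name. */
        int fd = open("/dev/lancero0", O_RDWR);
        if (fd < 0) { perror("open"); return 1; }

        /* Page-aligned user buffer; the driver maps it for SGDMA, so
         * the DMA engine accesses it directly and the CPU copies
         * nothing. */
        void *buf;
        if (posix_memalign(&buf, 4096, 1 << 20)) return 1;

        /* A standard pwrite() moves 1 MiB to FPGA offset 0. */
        if (pwrite(fd, buf, 1 << 20, 0) < 0)
            perror("pwrite");

        free(buf);
        close(fd);
        return 0;
    }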

Features with use cases

  • Fastest possible (partial) reconfiguration of the FPGA via our SGDMA solution to quickly adapt to different workloads, such as AI and machine learning.
  • Autonomous cyclic SGDMA realizes use cases such as an FPGA-based, Windows WaveRT-compliant solution with 256 audio channels that complements GPU video.
  • PCIe P2P SGDMA to/from other PCIe devices (GPU/FPGA), such as for video streaming between FPGA and GPU, or video capture through FPGA multi-camera interfaces.
  • Linux kernel driver DMA-BUF support to provide device-to-device SGDMA. One commercial use case is GPU-to-FPGA offload of framebuffers at 16K+ resolutions.
  • Linux kernel driver io_uring support for the highest asynchronous I/O IOPS between FPGA and CPU and/or GPU available in 2022; see the sketch after this list.
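As a sketch of how an application might drive high-IOPS asynchronous I/O against the FPGA device with io_uring (via liburing); the device node, queue depth, and offsets are assumptions for illustration.

    /* Build with: gcc -O2 sketch.c -luring */
    #include <fcntl.h>
    #include <liburing.h>
    #include <stdio.h>
    #include <unistd.h>

    #define QUEUE_DEPTH 64
    #define BLOCK_SIZE  4096

    int main(void)
    {
        struct io_uring ring;
        int fd = open("/dev/lancero0", O_RDONLY); /* hypothetical node */
        if (fd < 0) { perror("open"); return 1; }

        io_uring_queue_init(QUEUE_DEPTH, &ring, 0);

        static char bufs[QUEUE_DEPTH][BLOCK_SIZE];

        /* Queue many small reads at once; the queue depth hides
         * per-I/O latency and raises IOPS. */
        for (int i = 0; i < QUEUE_DEPTH; i++) {
            struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
            io_uring_prep_read(sqe, fd, bufs[i], BLOCK_SIZE,
                               (off_t)i * BLOCK_SIZE);
        }
        io_uring_submit(&ring);

        /* Reap the completions. */
        for (int i = 0; i < QUEUE_DEPTH; i++) {
            struct io_uring_cqe *cqe;
            io_uring_wait_cqe(&ring, &cqe);
            if (cqe->res < 0)
                fprintf(stderr, "read failed: %d\n", cqe->res);
            io_uring_cqe_seen(&ring, cqe);
        }

        io_uring_queue_exit(&ring);
        close(fd);
        return 0;
    }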

Optimizations at microarchitecture

Deep knowledge and use of (micro)architectural optimizations, such as:
  • Outbound buffer coalescing to reduce PCIe TLP overheads.
  • Interrupt coalescing and hold-off to reduce OS overheads (illustrated after this list).
  • Multi-queuing to hide latencies.
  • CPU directed I/O to reduce cache pollution.
  • PCIe device side bus master polling to minimize start-up latencies.
  • CPU-to-FPGA burst transfers to reduce register writes to SGDMA and other control registers.
  • We are working on optimized solutions for CXL 1.0+, which is based on PCIe Gen 5.0.
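To illustrate interrupt coalescing and hold-off (the policy below is a simplified assumption, not the actual core logic): the engine raises an interrupt only after a batch of completions has accumulated or a hold-off period has expired, so the OS services many transfers per interrupt.

    #include <stdbool.h>
    #include <stdint.h>

    /* Illustrative coalescing policy, not the actual core logic. */
    struct irq_coalesce {
        uint32_t pending;    /* completions since the last interrupt */
        uint32_t batch;      /* fire after this many completions... */
        uint64_t first_ns;   /* timestamp of first pending completion */
        uint64_t holdoff_ns; /* ...or after this much time */
    };

    /* Called per DMA completion; returns true when an interrupt
     * should be raised. */
    static bool on_completion(struct irq_coalesce *c, uint64_t now_ns)
    {
        if (c->pending++ == 0)
            c->first_ns = now_ns;

        if (c->pending >= c->batch ||
            now_ns - c->first_ns >= c->holdoff_ns) {
            c->pending = 0;  /* one interrupt covers the whole batch */
            return true;
        }
        return false;        /* hold off; more may coalesce */
    }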

Benchmarks

  • On a consumer-grade CPU with PCIe Gen3 x8: market-leading 1.4 µs unidirectional latency and 3 µs loopback latency. A commercial use case is the highest-bandwidth RSA encryption offload PCIe solution (2019). A sketch of measuring loopback latency follows this list.
  • These latencies are 10× (XDMA) to 300× (XDMA+XRT) better than the existing solutions Xilinx and others use for compute/accelerate functions.
  • On recent server CPUs with Xilinx as well as Intel FPGAs at PCIe Gen4 x16, our solution will enable < 1 µs latency at 1,000,000 IOPS in early 2022.
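A minimal sketch of how loopback latency can be measured from user space, assuming a hypothetical /dev/vulcano0 device node configured for loopback; a real benchmark would also pin CPU affinity and report percentiles rather than only the mean.

    #include <fcntl.h>
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    #define ITERATIONS 100000

    int main(void)
    {
        int fd = open("/dev/vulcano0", O_RDWR); /* hypothetical node */
        if (fd < 0) { perror("open"); return 1; }

        char buf[64] = {0};
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < ITERATIONS; i++) {
            /* One round trip per iteration: write a small message
             * out and read it back through the FPGA loopback. */
            if (pwrite(fd, buf, sizeof buf, 0) != sizeof buf) break;
            if (pread(fd, buf, sizeof buf, 0) != sizeof buf) break;
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9
                  + (t1.tv_nsec - t0.tv_nsec);
        printf("mean loopback latency: %.0f ns\n", ns / ITERATIONS);
        close(fd);
        return 0;
    }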