The adoption of GPUs in LabVIEW is driven by the ever-increasing data volumes generated by our DAQ systems.
No matter what field you are in, you have likely already felt the struggle of keeping up with the performance delivered by these devices:

  • VSTs like the PXIe-5840 deliver 1 GHz of bandwidth, equating to 5 GB/s over the PXI backplane.
  • NI frame grabbers deliver 3 GB/s of image data for each frame grabber in the system.
  • Custom FPGA designs used for High Speed Serial communication, RF signal generation and many other applications can push PXI systems to their absolute limit.

Handling this data in a modern PXI system becomes more difficult when you consider that the maximum speed of a PXI Gen3 x8 slot is 8 GB/s. This leaves very little room for error and inefficiency.
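As a rough sanity check on these numbers (assuming the PXIe-5840 streams complex I/Q data at 1.25 GS/s with 16-bit I and 16-bit Q, i.e. 4 bytes per sample — these exact figures are an assumption, not stated above), the arithmetic works out as:

```python
# Back-of-the-envelope bandwidth arithmetic (assumed sampling figures).
sample_rate = 1.25e9                 # complex samples per second (assumed)
bytes_per_sample = 4                 # 16-bit I + 16-bit Q
stream_rate = sample_rate * bytes_per_sample
print(stream_rate / 1e9)             # 5.0 GB/s over the backplane

slot_bandwidth = 8e9                 # PXI Gen3 x8 slot maximum
headroom = slot_bandwidth - stream_rate
print(headroom / 1e9)                # 3.0 GB/s left for everything else
```

A single VST stream already consumes more than half the slot's bandwidth, which is why every extra copy hurts.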

Conventional GPU IO

GPU computing has existed within the LabVIEW ecosystem for several years. To get data from our data acquisition device to our signal analysis GPU we use the same set of operations, no matter the toolkit:

We acquire the data within LabVIEW, condition it into a 1D array and send it to the GPU.

These operations, however, mean that the data is copied several times between subsystems. Even looking at the simplified diagram we can see that data is not simply passed on to the GPU.

These copy actions have several detrimental effects on our system:

  1. The RAM memory controller needs to do much more work, putting strain on the limited resources we have in our system.
  2. The LabVIEW memory manager needs to manage much more data for short periods of time.

If we consider each subsystem it becomes even clearer: massive amounts of data mean more strain on every part of our system:

Our hardware, drivers, code, ... all work tirelessly to keep up with the data but simply can't.
Within a high-end PXI system we can reach speeds of 1 to 3 GB/s through this method while our CPU sits at 50 to 70% utilisation.
We have no room for extra calculations or processing because all our resources are being used just to keep up with this modest volume of data.
This results in more system lag, stuttering and undefined behaviour.

Zero Copy GPU IO

If we could bypass the LabVIEW memory manager and reduce the number of memory copies, preferably to none, we would see a drastic increase in system throughput. In comes zero copy.

Zero copy is an existing technique within LabVIEW through which DMA regions can be linked to send data between FPGAs, VSTs and SSDs. It is extremely powerful due to its compatibility with Windows and its low system overhead.
The Zero Copy API for G2CPU builds upon this existing LabVIEW API to link the NI DMA manager with the CUDA DMA manager by letting them reference the same memory region.

By not bringing the data itself into the environment we are no longer sending the data through the drivers and programming environment.
Rather, the programming environment, and by extension the CPU cores, take on a supervisory role: managing the data by letting the DMA hardware know when the data is available and where to find it.

The benefit is that we can stream at up to the full practical speed of the PXI system, which gives us 6 to 7 GB/s of transfer speed.
With a PXIe-5840 VST generating 5 GB/s we would have plenty of headroom to deal with the stutters commonly found within a Windows system.
Not only that, it even becomes possible to process multiple high-speed devices at the same time.

G2CPU Zero Copy API

The G2CPU zero copy API has been designed to make this highly technical feat a breeze.

Within LabVIEW there are two paths you can follow to share data between a GPU and a high-speed data device.
We can use the NI hardware DMA data handling or the G2CPU DMA data handling.

NI Hardware data handling

Select NI device categories support direct access to the NI DMA manager.
These include:

  1. NI RFSA driver to access NI VST devices like the PXIe-5840.
  2. NI FPGA driver, which provides access to frame grabbers, High Speed Serial and FlexRIO.
  3. NI IMAQ driver, which grants access to the computer vision toolkit's memory for efficient image processing.

In order to achieve Zero Copy we need to tell the NI DMA manager and NVIDIA CUDA driver to link with each other.
This can simply be achieved with the "G2CPU Register DMA" function linked to the DMA manager.
By registering the existing memory chunk within G2CPU, G2CPU can set up a connection for you between the NI DMA manager and NVIDIA CUDA.
Doing so reserves special memory on the GPU as well to facilitate high throughput speeds to and from the GPU.
The memory manager handles these special regions for you, so you can manage the life-cycle of each Zero Copy stream individually.

This simplified block diagram shows the basic order of operations.

  1. Create a manager via "G2CPU Create DMA Manager" for the Zero Copy operations. Through this you can manage the life-cycle of the Zero Copy pipeline. You can create a new DMA manager for each pipeline between devices.
  2. Start our handling loop. Keep acquiring the data and sending it to the GPU for as long as your application requires.
  3. Acquire data using the NI Device driver which delivers a data value reference.
  4. Check whether the driver DMA region specified is known to NVIDIA CUDA via "G2CPU Register DMA".
    If the driver DMA region was previously unknown, it will be linked to the NVIDIA DMA region and GPU memory will be allocated for fast transfers.
    If the driver DMA region has already been seen, the function will do nothing.
    The function will tell you whether the DMA memory was newly registered. This is particularly useful in high-bandwidth applications where the allocation time for new memory takes longer than the interval at which new data arrives.
    Under normal operation this means that the first pass through the DMA region will drop all data (true case in the diagram) and all passes afterwards will send data to the GPU (false case in the diagram).
  5. Through "G2CPU DMA Upload" we can efficiently send data to the GPU.
  6. Once we're done we call "Delete Data Value Reference" to give the DMA region back to the NI driver.
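The register-once, reuse-afterwards behaviour of steps 4 and 5 can be sketched in Python (all names here are hypothetical stand-ins for the G2CPU VIs, stubbed out so only the control flow is shown):

```python
# Model of "G2CPU Register DMA": a DMA region is linked to CUDA memory
# once; every later pass reuses the existing link.
registered_regions = set()

def register_dma(region_id):
    """Return True if the region was newly registered (first pass)."""
    if region_id in registered_regions:
        return False
    registered_regions.add(region_id)   # link NI DMA region <-> CUDA memory
    return True

def handle_block(region_id):
    """Steps 4-5: drop data while a region is being registered,
    upload it on every later pass."""
    if register_dma(region_id):
        return "dropped"     # true case in the diagram (first pass)
    return "uploaded"        # false case: "G2CPU DMA Upload" runs here

# The driver typically cycles through a small pool of DMA regions:
results = [handle_block(pass_no % 2) for pass_no in range(6)]
print(results)
# ['dropped', 'dropped', 'uploaded', 'uploaded', 'uploaded', 'uploaded']
```

Each region pays its registration cost exactly once; after the pool has been seen, every pass goes straight to the upload.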

The downside of this topology is that we only perform one data operation at a time: either we are sending data from the VST to RAM or sending data from RAM to the GPU.
We can, however, use the asynchronous nature of "G2CPU DMA Upload" to send data from VST to RAM, from RAM to the GPU, and do GPU processing all at once, maximizing system throughput.

In this example, which ships with G2CPU v1.6.0, we can see that a simple producer-consumer architecture allows us to do all transfers and analysis in parallel.
When running this example we can see the remarkable throughput with an extremely large FFT (4M point) calculation. Something which was previously impossible on PXI systems.
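The producer-consumer idea behind that example can be sketched in a few lines of Python (the acquire and process stages are hypothetical stand-ins for the VST acquisition and the upload-plus-FFT step):

```python
import queue
import threading

# Minimal producer-consumer sketch: acquisition and processing overlap.
blocks = queue.Queue(maxsize=4)     # bounded queue ~ a small DMA region pool
results = []

def producer(n_blocks):
    for i in range(n_blocks):
        blocks.put(i)               # acquire: VST -> RAM (DMA region ready)
    blocks.put(None)                # sentinel: acquisition finished

def consumer():
    while True:
        block = blocks.get()
        if block is None:
            break
        results.append(block * 2)   # stand-in for upload + GPU FFT

t = threading.Thread(target=consumer)
t.start()
producer(8)
t.join()
print(results)    # each block is processed while the next is being acquired
```

The bounded queue plays the same role as the DMA region pool: the producer naturally stalls when the consumer falls behind, instead of overwriting data in flight.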

Disk Data handling

While the NI Device Drivers provide their own DMA references, the TDMS library in LabVIEW requires you to provide a reference to DMA memory.

For this, G2CPU has added "G2CPU Allocate DMA". This function allows you to create CUDA DMA memory which is directly accessible in LabVIEW and can be accepted by the LabVIEW TDMS Library.

Please consider the following example:

Here we create a new G2CPU DMA Manager in which we will allocate CUDA DMA memory through the "G2CPU Allocate DMA" function.
By passing this memory to the "TDMS Advanced Asynchronous Read (Data Ref)" it can be filled with data from the file we specified.
We provide the memory as an EDVR to the "G2CPU DMA Upload" function. Once the TDMS read action has completed, the data will be uploaded to the GPU, after which the calculation is performed.
Once the data is uploaded we provide the memory back to the TDMS Read loop to be filled once again.

Through this we can continuously stream, at the speed of our SSD to the GPU.

We have verified the performance up to 7GB/s on select hard disk configurations.

Please take care not to release this memory until you are done with your read or write action.
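The hand-back pattern described above amounts to double buffering: one buffer is filled from disk while the other is uploaded. A minimal sketch, with hypothetical stand-ins for the TDMS read and the GPU upload:

```python
# Double-buffered disk-to-GPU sketch. Two DMA buffers alternate between
# being filled by the (stubbed) TDMS read and being uploaded to the GPU,
# so neither side has to wait for the other to release memory.
def tdms_read(remaining_blocks, buffer_index):
    """Stand-in for "TDMS Advanced Asynchronous Read (Data Ref)"."""
    return remaining_blocks.pop(0) if remaining_blocks else None

def gpu_upload(block):
    """Stand-in for "G2CPU DMA Upload" plus the GPU calculation."""
    return block.upper()

file_blocks = ["blk0", "blk1", "blk2", "blk3"]
processed = []
buffer_index = 0
while True:
    block = tdms_read(file_blocks, buffer_index)  # fill buffer from file
    if block is None:
        break
    processed.append(gpu_upload(block))           # upload, then hand back
    buffer_index ^= 1                             # swap to the other buffer
print(processed)   # ['BLK0', 'BLK1', 'BLK2', 'BLK3']
```

Only after a buffer's upload completes is it handed back to the read loop, which is exactly the life-cycle rule stated above.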

Tips and tricks

NVIDIA CUDA GPUs for PXI

When the need arises to get the most out of your PXI system consider our trusted partner RADX Technologies.
RADX provides a wide range of high-performance, COTS NVIDIA GPU, RAID and Removable Data Storage, Network and I/O solutions for PXIe Systems, and have been doing so for over 10 years.

All PXIe performance claims were made on RADX GPUs in NI PXIe-1092 or PXIe-1095 chassis with NI PXIe-8880 or PXIe-8881 embedded controllers.

RADX hardware enables in-chassis PXIe solutions that eliminate the need for cumbersome external RAID or GPU servers, enabling your test systems to meet extreme performance requirements for signal processing, analysis, AI, record and playback, and other demanding applications.

Maximizing PCIe data transfers

When transferring data between devices it is important to consider the optimal transfer block sizes for maximum throughput.
This depends on the hardware within your system, but in our experience a size between 32 MB and 256 MB provides the highest throughput.

To maximize throughput even further, a block size that is a power of 2 is an absolute must.

Some quick values:

32 MB = 33 554 432 bytes
64 MB = 67 108 864 bytes
128 MB = 134 217 728 bytes
256 MB = 268 435 456 bytes
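These values can be generated (and verified as powers of 2) in a couple of lines:

```python
# Power-of-2 transfer sizes in the 32 MB to 256 MB sweet spot.
MB = 1024 * 1024
for mb in (32, 64, 128, 256):
    size = mb * MB
    assert size & (size - 1) == 0        # power-of-2 check
    print(f"{mb} MB = {size:,} bytes")
```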

DMA Manager Cohesion

An important concept to consider is DMA manager cohesion.
This means that successive calls to "G2CPU Register DMA" must reference the same memory locations.
To ensure this, you will need to allocate a DMA region size which is a multiple of the read size.
Not following this rule will cause the DMA manager to allocate GPU memory for each region permutation, often causing an out-of-memory error.

It is strongly recommended to allocate DMA regions with power-of-2 sizes in your VST, FPGA or other driver during configuration.
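The cohesion rule boils down to one divisibility check, sketched here (the function name is illustrative, not part of the G2CPU API):

```python
# Sanity check for DMA manager cohesion: the allocated region size must be
# an exact multiple of the read size, otherwise reads land at shifting
# offsets and "G2CPU Register DMA" keeps allocating fresh GPU memory
# for every new region permutation.
def is_cohesive(region_size, read_size):
    return region_size % read_size == 0

MB = 1024 * 1024
print(is_cohesive(region_size=256 * MB, read_size=64 * MB))   # True: 4 reads per region
print(is_cohesive(region_size=100 * MB, read_size=64 * MB))   # False: reads straddle regions
```

Running such a check once at configuration time is far cheaper than discovering the out-of-memory error mid-acquisition.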