Latest Tweets

@iProov has developed a world class authentication solution using biometric factors. We bring our use-case and aggr…
Face morphing is a critical part of our product. Because it's massively parallel, it's an ideal use case for…
Our latest article from @CKotselidis explaining #TornadoVM and dynamic reconfiguration across #CPUs, #GPUs, and…
Follow E2Data on Twitter

Showcasing the Potential of GPU Acceleration in Data Analytics

by Clemens Lutz, DFKI

Today, businesses and scientists rely on insights into their data to provide new services and to gain knowledge. Data analytics are the tools that practitioners use to acquire insights from their data. In the E2Data project, we are working to provide automated frameworks that use modern hardware, such as GPUs and FPGAs, to achieve a short data-to-knowledge time for large data sets. In this post, we showcase the performance potential of GPU acceleration on the basis of k-means, a well-known data analytics algorithm.

Keira Knightley Segmented image using k-means

Original image of Keira Knightley
© Martín Killing CC-BY-SA 4.0

Segmented image using k-means

Figure 1: Segmenting an image into light and dark regions.

The k-means algorithm solves the clustering problem: similar data points are identified and grouped together into "clusters". This helps us humans to make sense of millions or billions of data points. As a toy example, we can use k-means to identify similar regions in an image, such as the light and dark regions in Figure 1. However, there are many real-world applications that use k-means. For example, scientists have recently used k-means to analyze aerosol particles in their investigations on climate change. Speeding up k-means enables new insights by exploiting larger data sets in higher quality.

Harnessing the computing power of GPUs involves solving several challenges:

  1. The algorithm must be programmed in a programming language that supports GPU execution.
  2. The program must be tuned to the specific device, because GPUs are sensitive to tuning parameters.
  3. Data must be copied from disk or main-memory to the GPU. This step is typically slow compared to the high processing power of GPUs. Thus, ideally, data is copied only once and then cached on the GPU.
  4. Some algorithms can be modified to take advantage of inherent hardware properties of GPUs. Although solving these challenges is non-trivial, they are necessary to achieve peak performance.

In the case of k-means, we have improved the k-means algorithm for GPUs (i.e., challenge 4 above). Previous works execute parts of the algorithm on the CPU. The underlying reason for this design decision is that k-means uses the processor cache inefficiently on GPUs. In contrast, by devising a new, cache-efficient k-means algorithm for GPUs, we are able to perform all algorithm steps on the GPU. In addition, we solve challenges 1-3 manually.

We compare the runtimes of our work with previous approaches in an experiment. In this experiment, we use a 2 GB data set with 4 feature dimensions. We configure k-means to group these data into 4 clusters. We run the experiment on a quad-core 3.4 GHz Intel i7-6700K CPU with hyper-threading and an Nvidia GTX 1080 GPU. We assume that all data are cached in the memory of the respective processor.

strategy comparison blog
Figure 2: The runtime of k-Means on a GPU compared to previous GPU and CPU approaches. A highly-optimized GPU algorithm can perform more than an order of magnitude faster than an optimized CPU version of the same algorithm.

In Figure 2, we show the result of our experiment. We achieve a 13× speed-up on the GPU compared to a highly-optimized CPU implementation. Compared to the previous GPU approach, we achieve a speed-up of 20×. Furthermore, we compare our implementations to Armadillo and Rodinia. Armadillo is a linear algebra library for C++ that uses OpenMP for multi-threading, whereas Rodinia is a GPU benchmark suite written in CUDA. Although our baseline implementations use the same k-means algorithms as these libraries, even our respective CPU and GPU baselines show significant runtime improvements. We attribute these improvements to parameter tuning for our specific hardware setup.

We observe that GPUs enable fast runtimes for k-means. However, we take away that to achieve these fast runtimes, we must invest significant programming and algorithm design efforts. Undertaking these steps typically requires deep technical knowledge of hardware architectures as well as the algorithm under consideration. With the Tornado and Apache Flink frameworks, software engineers without hardware expertise will be able to use GPUs and other hardware accelerators to achieve performance gains in their own applications.

We invite you to read our journal article for more details on our k-means algorithm for GPUs. You can find the published version in Datenbanken Spektrum. We also provide the full source code of our new k-means approach online under the open source MPL-2.0 license for you to reproduce our results.

This project has received funding from the European Union's Horizon H2020 research and innovation programme under grant agreement No 780245.