by Christos Kotselidis, The University of Manchester
Heterogeneous hardware accelerators, such as GPUs and FPGAs, are becoming pervasive as a means of improving the performance and energy efficiency of our applications.
Unlike traditional CPU-only execution, programming such devices poses several technical challenges, including, among others, programmability, memory management, and Just-in-Time (JIT) compilation.
In the context of E2Data, we propose a new paradigm that will facilitate the transition of traditional Big Data frameworks to the heterogeneous era by allowing the seamless integration and execution of hardware accelerators.
At the backbone of the proposed E2Data framework is TornadoVM, which allows us to dynamically compile arbitrary Java code to target GPUs and FPGAs. Besides enabling such heterogeneous execution [2], TornadoVM has recently introduced the notion of Dynamic Reconfiguration in version 0.2 [1].
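As a rough sketch of what this looks like from the developer's perspective, the snippet below expresses a simple vector addition as a TornadoVM task. The class, method, and array names are our own, and the TaskSchedule/@Parallel API shown follows the style of the TornadoVM examples, so details may vary slightly between versions.

import uk.ac.manchester.tornado.api.TaskSchedule;
import uk.ac.manchester.tornado.api.annotations.Parallel;

public class VectorAdd {

    // Plain Java kernel; the @Parallel annotation marks the loop that
    // TornadoVM may JIT-compile for a GPU or FPGA.
    public static void add(float[] a, float[] b, float[] c) {
        for (@Parallel int i = 0; i < c.length; i++) {
            c[i] = a[i] + b[i];
        }
    }

    public static void main(String[] args) {
        float[] a = new float[1024];
        float[] b = new float[1024];
        float[] c = new float[1024];

        // Group the kernel into a task schedule that TornadoVM can
        // compile and run on an available accelerator.
        new TaskSchedule("s0")
            .streamIn(a, b)
            .task("t0", VectorAdd::add, a, b, c)
            .streamOut(c)
            .execute();
    }
}

The annotated loop remains regular Java; TornadoVM's JIT compiler decides at run time how to map it onto the selected accelerator.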
What is dynamic reconfiguration?
Imagine a system with a CPU, an integrated GPU (iGPU), a discrete GPU, and an FPGA.
Ideally, we would like to exploit the strength of each device and combine their execution capabilities in order to improve the total execution time of our applications.
To achieve that, we would need a way to try different placements of our application's tasks on these devices until the fastest combination is discovered.
For example, if we examine a common benchmark such as the Discrete Fourier Transform (DFT), we can see from the graph below that, depending on the input size of the problem, different accelerators perform best.
Figure 1: TornadoVM speedup over sequential Java on a range of different input sizes.
As shown in Figure 1, for small data sets multi-core CPU execution outperforms both GPU and FPGA execution due to the overhead of transferring data from the host to the device. On the contrary, when the data sizes become large, both the GPU and the FPGA outperform the CPU.
Dynamic reconfiguration in TornadoVM
TornadoVM adds the capability to dynamically reconfigure our application at runtime until the system discovers the “best” possible combination of tasks and the devices they execute on.
This is achieved by the addition of a novel virtualization layer, with TornadoVM-specific bytecodes, that orchestrates the execution of code on hardware accelerators.
Since TornadoVM operates in collaboration with a standard host JVM, such as OpenJDK, the new bytecodes that comprise its heterogeneous virtualization layer reside inside the host JVM.
This results in a “VM in a VM” where the host JVM handles CPU execution while TornadoVM handles hardware acceleration.
What is the benefit of dynamic reconfiguration?
Through the TornadoVM API for dynamic reconfiguration, developers can define the metric through which their code will be assessed.
For example, by defining the policies below,
taskSchedule.execute(Policy.PERFORMANCE);
taskSchedule.execute(Policy.LATENCY);
taskSchedule.execute(Policy.END_2_END);
we instruct TornadoVM to assess the execution of our code on heterogeneous devices by: a) kernel performance, b) latency, and c) end-to-end performance, including data transfer times.
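As a minimal sketch of how a policy fits into a task schedule (reusing the hypothetical VectorAdd kernel from the earlier snippet, and assuming Policy is imported from the TornadoVM API package):

TaskSchedule ts = new TaskSchedule("s0")
    .streamIn(a, b)                        // copy the input arrays to the device
    .task("t0", VectorAdd::add, a, b, c)   // the Java kernel to accelerate
    .streamOut(c);                         // copy the result back to the host

// Let TornadoVM evaluate the task on the available devices and keep
// the code-to-device mapping that wins under the chosen policy.
ts.execute(Policy.PERFORMANCE);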
TornadoVM will execute the application on all devices and decide on the final mapping between code and devices based on the policy defined by the user (Figure 2).
Figure 2: TornadoVM Dynamic Reconfiguration.
All of the above takes place automatically and transparently to users, who simply observe the performance improvement of their applications as all available hardware resources are utilized.
Figure 3: TornadoVM dynamic reconfiguration on NBody computation.
Figure 3 shows the performance improvements of TornadoVM’s dynamic reconfiguration on NBody computation. As shown, we assess two performance policies (End2End and Peak Performance) across three devices (CPU, GPU, and FPGA).
We compare the relative speedups over sequential Java code achieved by TornadoVM and by each of the three static configurations of the Tornado compiler.
The solid blue line shows that TornadoVM always chooses the highest-performing device among the available ones, and even falls back to single-threaded CPU execution if that is the fastest choice (e.g., under the End2End policy up to a data size of 2^11).
In contrast, without dynamic reconfiguration we would need to resort to Tornado's static scheduling: a task that we would have to perform offline, hardcoding the predicted best-performing configurations inside TornadoVM.
With dynamic reconfiguration, developers can simply run their applications, leaving TornadoVM to automatically find the highest-performing device for each data size.
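To make the contrast concrete, the sketch below puts the two modes side by side; the schedule and task names (s0, t0), the device index, and the -Ds0.t0.device property follow the conventions used in Tornado's documentation, but are shown here purely as an illustrative assumption.

// Static scheduling: the device is chosen offline and fixed for every run,
// e.g. by launching with a property such as -Ds0.t0.device=0:1
// (driver 0, device 1), regardless of the input size.
ts.execute();

// Dynamic reconfiguration: TornadoVM profiles the task across the available
// devices and keeps the best mapping for the chosen policy and data size.
ts.execute(Policy.PERFORMANCE);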
To conclude, our experimental evaluation, detailed in our latest VEE'19 paper [3], demonstrates that dynamic reconfiguration can achieve up to 7.7x performance improvements compared to static scheduling.
Further examples on how to use TornadoVM and dynamic reconfiguration can be found in [4, 5].
Resources
[1] TornadoVM: https://github.com/beehive-lab/TornadoVM
[2] Tornado compiler paper
[3] TornadoVM paper
[4] NBodyDynamic
[5] DynamicReconfiguration Example