In recent years, popular, resource-intensive tasks in the Machine Learning, Artificial Intelligence and Data Analytics domains leverage the power of specialized hardware such as GPUs, FPGAs, TPUs and ASICs. This can be achieved using various Big Data frameworks that support the use of hardware-accelerated kernels. Moreover, there exist cluster management tools that can allocate such tasks to the underlying hardware, as well as cloud offerings that provide heterogeneous infrastructures. Although the enabling technologies do exist, user expertise is still required to make the right choices on both the amount and the type of resources needed to achieve a given high-level policy. Indeed, offloading parts of the code to specialized hardware is not always the wisest choice for any Big Data task: factors such as data transfers between RAM and GPUs, different algorithmic characteristics (one-pass vs. multi-pass), dataset size, etc., heavily affect performance, energy efficiency and cost, rendering the scheduling decision a difficult process even for the most experienced analyst.
We showcase this difficulty by experimenting with kernels from the popular Rodinia benchmark [1], which we executed over a multicore CPU system, a high-end GPU and a low-end GPU. Different kernels yield qualitatively different end-to-end performance results. As shown in the figure, streamcluster, a clustering algorithm whose execution time is dominated by processing, gains a significant speedup when executed over GPUs. On the other hand, hybridsort, a sorting algorithm that spends a significant portion of its running time copying data from the host's memory to the device (HtoD) and vice versa (DtoH), is a better fit for the multicore CPU system.
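To make the trade-off concrete, here is a minimal Python sketch (with made-up, purely illustrative timings, not numbers from our experiments) of how the end-to-end time of a kernel decomposes into the HtoD copy, the compute itself and the DtoH copy, and why a copy-dominated kernel can end up faster on the CPU despite a much faster GPU kernel time:

```python
# Minimal sketch with hypothetical timings: the end-to-end time of a kernel on
# a GPU is the sum of the host-to-device copy, the kernel itself and the
# device-to-host copy, so a copy-dominated kernel can lose to the CPU even if
# its GPU kernel time is lower.

from dataclasses import dataclass

@dataclass
class DeviceEstimate:
    compute_s: float      # estimated compute/kernel time on the device
    htod_s: float = 0.0   # host-to-device copy time (zero for CPU execution)
    dtoh_s: float = 0.0   # device-to-host copy time (zero for CPU execution)

    @property
    def end_to_end_s(self) -> float:
        return self.htod_s + self.compute_s + self.dtoh_s

def best_device(estimates: dict) -> str:
    # Pick the device with the lowest end-to-end time for this kernel.
    return min(estimates, key=lambda dev: estimates[dev].end_to_end_s)

# Hypothetical profiles in the spirit of the two Rodinia kernels:
compute_bound = {   # streamcluster-like: processing dominates
    "cpu": DeviceEstimate(compute_s=40.0),
    "gpu": DeviceEstimate(compute_s=5.0, htod_s=1.0, dtoh_s=1.0),
}
transfer_bound = {  # hybridsort-like: HtoD/DtoH copies dominate
    "cpu": DeviceEstimate(compute_s=6.0),
    "gpu": DeviceEstimate(compute_s=2.0, htod_s=4.0, dtoh_s=3.0),
}

print(best_device(compute_bound))   # -> gpu
print(best_device(transfer_bound))  # -> cpu
```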
We argue that manually tuning an application can be a costly and time-consuming process. Thus, our work in E2Data consists of designing and implementing HAIER, a Heterogeneous-Aware, Intelligent Resource Scheduler that can automatically identify the optimal mapping between tasks and devices in order to optimize any user-defined policy. Given the code of an application and a set of available resources, HAIER should understand which tasks best fit the execution model of each hardware platform and schedule them according to a global optimization function.
We envision our scheduler working as follows: the user submits code developed using her favorite Big Data framework, along with the desired optimization policy (e.g., minimize execution time, minimize power consumption, or both). The code is passed to the Big Data execution engine, where tasks are internally organized in a graph structure. Having information about the code itself, the optimization objective, the structure of the task graph and monitoring data, HAIER can produce the optimal execution plan, which allocates each task to the most beneficial set of hardware resources among the available ones.
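As a rough illustration of this plan-generation step, the Python sketch below assigns each task of a (trivially small) task graph to the device that minimizes a policy-weighted combination of estimated time and energy. The task names, the per-device (time, energy) estimates and the greedy per-task choice are simplifying assumptions made for the example; they merely stand in for HAIER's actual models and global optimization.

```python
# Sketch of plan generation over a task graph. All names and numbers below are
# hypothetical; they stand in for the estimates kept in HAIER's model library.

from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    deps: list = field(default_factory=list)  # names of upstream tasks

# Hypothetical per-(task, device) estimates: (execution time in s, energy in J).
ESTIMATES = {
    ("map_features", "cpu"):   (12.0, 300.0),
    ("map_features", "gpu"):   (3.0, 450.0),
    ("cluster_points", "cpu"): (40.0, 900.0),
    ("cluster_points", "gpu"): (6.0, 700.0),
}

def make_plan(tasks, policy=(1.0, 0.0)):
    """Map each task to the device minimizing a policy-weighted objective.

    policy = (w_time, w_energy); e.g. (1, 0) minimizes execution time only.
    Tasks are assumed to arrive in topological order of the task graph.
    """
    w_time, w_energy = policy
    plan = {}
    for task in tasks:
        candidates = {dev: est for (t, dev), est in ESTIMATES.items() if t == task.name}
        plan[task.name] = min(
            candidates,
            key=lambda dev: w_time * candidates[dev][0] + w_energy * candidates[dev][1],
        )
    return plan

workflow = [Task("map_features"), Task("cluster_points", deps=["map_features"])]
print(make_plan(workflow))                       # pure time policy: both tasks go to the GPU
print(make_plan(workflow, policy=(0.2, 0.02)))   # time/energy blend: map_features moves to the CPU
```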
The basis of this workflow optimization process lies in the utilization of detailed models of the cost and performance characteristics of Big Data tasks over various underlying hardware, be it CPUs, GPUs or FPGAs. The models are stored and updated in a model library. Whenever a new workflow executes atop HAIER, these models are used to intelligently assign and orchestrate workflow parts over the available hardware according to the user's optimization policy. Once the optimal execution plan is available, the Big Data framework enforces it through a cluster management framework that can handle heterogeneous resources (e.g., YARN, Mesos). During runtime, the workflow execution is monitored for failures and/or performance degradation. In such cases, HAIER dynamically adapts to the current conditions by creating a new execution plan for the remaining tasks.
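A bare-bones skeleton of that adaptation loop could look like the following; the monitoring hooks, the 1.5x slowdown threshold and the replan/update_model callbacks are placeholders for whatever the execution engine's monitoring layer and HAIER's model library actually provide.

```python
# Skeleton of the runtime adaptation loop. The callbacks and the threshold are
# assumptions, to be wired to the engine's monitoring layer and to the model
# library; they are not HAIER's actual interfaces.

SLOWDOWN_THRESHOLD = 1.5  # hypothetical: re-plan if a task runs 50% slower than predicted

def run_with_adaptation(plan, remaining, predict, observe, replan, update_model):
    """Execute `remaining` tasks under `plan`, re-planning on degradation.

    plan:      {task: device} mapping produced by the initial optimization
    remaining: ordered list of tasks still to execute
    predict(task, device)  -> estimated running time from the model library
    observe(task, device)  -> run the task and return its measured time
    replan(tasks)          -> new {task: device} mapping for the given tasks
    update_model(task, device, measured) -> refine the stored model
    """
    while remaining:
        task = remaining.pop(0)
        device = plan[task]
        expected = predict(task, device)
        measured = observe(task, device)       # execution is monitored at runtime
        update_model(task, device, measured)   # keep the model library up to date
        if measured > SLOWDOWN_THRESHOLD * expected and remaining:
            plan.update(replan(remaining))     # new plan for the remaining tasks only
    return plan
```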
More details on our vision for the E2Data scheduler can be found in our CloudCom workshop paper.