E2Data in a nutshell

Imagine a Big Data application with the following characteristics: (i) it has to process large amounts of complex streaming data, (ii) the application logic that processes the incoming data must execute and complete within a strict time limit (say, three seconds), and (iii) there is a limited budget for infrastructure resources. In today's world, the data would be streamed from the local network or edge devices to cloud infrastructure rented by the customer to perform the processing. The Big Data software stack, in an application- and hardware-agnostic manner, splits the execution stream into multiple tasks and sends them for processing on the nodes the customer has paid for. If the outcome does not meet the strict three-second business requirement, the customer has three options: 1) scale up (by upgrading processors at the node level), 2) scale out (by adding nodes to their clusters), or 3) manually implement code optimizations specific to the underlying hardware.

However, the customer does not have the financial capability to pursue any of these options. Ideally, they would like to meet their business requirements without stretching their hardware budget. The natural question that arises is the following:

"How can we improve execution times while using fewer hardware resources?"

To address these scalability concerns, both end users and cloud infrastructure vendors (such as Google, Microsoft, Amazon, and Alibaba) are investing in heterogeneous hardware resources able to utilize a diverse selection of architectures, such as CPUs, GPUs, FPGAs, and MICs, aiming to further increase performance while minimizing climbing operational costs. Furthermore, despite current investments in heterogeneous resources, large companies such as Google also develop in-house ASICs, with the Tensor Processing Unit (TPU), which accelerates TensorFlow workloads, being the prime example.


E2Data will provide a new Big Data software paradigm for achieving maximum resource utilization on heterogeneous cloud deployments without affecting current Big Data programming norms (i.e., no code changes in the original source). The proposed solution takes a cross-layer approach by allowing vertical communication between the four key layers of Big Data deployments (application, Big Data software, scheduler/cloud provider, and execution runtime), which will allow the E2Data-enabled stack to address the following question:

"How can the user determine, for each particular business scenario, the highest-performing and cheapest hardware configuration?"
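To make this question concrete, the decision an intelligent scheduler faces can be sketched as picking the cheapest hardware configuration whose estimated runtime still meets the application's deadline. The sketch below is purely illustrative: the configuration names, runtime estimates, and prices are hypothetical, not E2Data's actual cost model.

```python
# Hypothetical sketch: select the cheapest hardware configuration that still
# meets a strict deadline. All names and numbers are illustrative only.

configs = [
    {"name": "4x CPU",        "est_runtime_s": 5.1, "cost_per_hour": 0.40},
    {"name": "2x CPU + GPU",  "est_runtime_s": 1.9, "cost_per_hour": 0.90},
    {"name": "2x CPU + FPGA", "est_runtime_s": 2.6, "cost_per_hour": 0.70},
]

def pick_config(configs, deadline_s):
    """Return the cheapest configuration whose estimated runtime meets the deadline,
    or None if no configuration is fast enough."""
    feasible = [c for c in configs if c["est_runtime_s"] <= deadline_s]
    if not feasible:
        return None
    return min(feasible, key=lambda c: c["cost_per_hour"])

best = pick_config(configs, deadline_s=3.0)
print(best["name"])  # the CPU-only option misses the deadline, so a cheaper accelerator wins
```

In practice the runtime estimates themselves are the hard part; E2Data's cross-layer design exists precisely so the scheduler can obtain such estimates from the execution runtime rather than relying on static tables like the one above.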

The E2Data consortium brings together two distinct groups of cutting-edge EU Big Data practitioners to achieve its ambitious goals. On the one hand, there are four Big Data users in specific markets with strict requirements in terms of performance and infrastructure costs:

  1. EXUS in the health sector,
  2. Neurocom in fintech,
  3. SparkWorks/CTI in green building infrastructure, and
  4. iProov in security and biometric recognition.

On the other hand, four Big Data technology providers will implement the E2Data solution by extending cutting-edge European technologies:

  1. DFKI, the creators of Apache Flink (the number one European competitor of Apache Spark), will provide solutions in the core of the Big Data stack,
  2. ICCS will deliver a novel Big Data scheduler (i.e. the component that assigns hardware resources to tasks during execution) capable of intelligent resource selection,
  3. UNIMAN with expertise in heterogeneous computing will work at the system level and enable dynamic code compilation and execution on diverse heterogeneous hardware resources, and
  4. Kaleao will showcase that its high-performing, low-power, cloud architecture can strengthen EU's Big Data hardware capabilities with E2Data's proposed technologies.