Real-time processing and analysis of IoT data streams: The Green Buildings Use Case in E2Data

The Spark Works Internet of Things platform is designed to enable easy and fast implementation of applications that use an IoT infrastructure, such as the scenario envisioned in the Green Buildings use case for E2Data. It scales with the number of users, the number of connected devices and the volume of data processed. In addition, the platform accommodates real-time processing of information collected from mobile sensors and smartphones and offers fast analytics. It offers real-time processing and analysis of unlimited IoT data streams with minimal delay and processing costs.

The “Green Buildings” use case is implemented over data received from an IoT infrastructure deployed and maintained by CTI. The infrastructure spans 25 educational buildings in Europe and has been continuously updated throughout the duration of the project. The current version of the Green Buildings deployment produces a huge amount of data on a daily basis. This data currently comes from approximately 1400 sensing hardware endpoints; in a region-wide deployment, that figure would grow to tens of thousands of sensing endpoints. In this context, the deployed IoT infrastructure generates, handles, transfers and stores a tremendous amount of data, which cannot be processed efficiently using current platforms and techniques. Spark Works uses E2Data to rapidly process the continuously accumulating data and to tackle the issues raised by thousands of sensors generating data that needs to be processed in real time.

 IoT nodes

Such sensing endpoints provide data regarding the energy consumption of the respective buildings, along with a set of environmental parameters with respect to indoor conditions in specific rooms and areas within these buildings. In certain cases, there are additional data available related to outdoor conditions, such as the ones provided by weather stations, e.g., temperature, humidity, precipitation, etc. Such data produced by the sensing hardware endpoints are augmented by aggregates calculated by the Spark Works platform that aim to provide additional data and statistics for these buildings, e.g., average sensing values for energy consumption or temperature.

The Spark Works platform receives events from the sensors in this infrastructure and executes aggregate operations on them. Sensors produce events, either periodically or asynchronously, that are sent to the Sparks Analytics Engine. These events are usually (value, timestamp) pairs. All data received is collected and forwarded to a queue. From there, it is processed in real time by the Sparks Processing Engine cluster, and the computed analytics summaries are stored in a NoSQL database. Each Sparks Engine processing job can be easily modified to accommodate new aggregation operations. The engine consists of tasks, each responsible for a specific type of sensor. The chain of aggregators, called process blocks in the Sparks Analytics Engine, aggregates data over specific time intervals. Events reaching the Analytics Engine message broker are processed consecutively, one time window after another, to calculate the aggregation results.
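The time-window aggregation described above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the platform's actual code: the `Event` class, the `sensor_id` field, the use of tumbling (non-overlapping) windows and the 300-second window length are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    sensor_id: str   # hypothetical identifier, not named in the text
    value: float
    timestamp: int   # seconds since the epoch (assumed unit)


def window_start(timestamp: int, window_seconds: int) -> int:
    """Return the start of the tumbling window that contains timestamp."""
    return timestamp - (timestamp % window_seconds)


def aggregate_by_window(events, window_seconds=300):
    """Group values per (sensor, window) and compute summary statistics."""
    buckets = defaultdict(list)
    for event in events:
        key = (event.sensor_id, window_start(event.timestamp, window_seconds))
        buckets[key].append(event.value)
    return {
        key: {
            "count": len(values),
            "sum": sum(values),
            "min": min(values),
            "max": max(values),
            "avg": sum(values) / len(values),
        }
        for key, values in buckets.items()
    }


if __name__ == "__main__":
    sample = [
        Event("temp-1", 21.0, 0),
        Event("temp-1", 23.0, 120),
        Event("temp-1", 25.0, 310),
    ]
    for key, summary in sorted(aggregate_by_window(sample).items()):
        print(key, summary)
```

In a streaming engine the buckets would be emitted and discarded as each window closes rather than held in one dictionary; the sketch only shows how events map to windows and summaries.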

The Spark Works platform kernels that are accelerated by the E2Data platform are the following:

  • Compute sum: It computes the sum of a batch of sensor measurements retrieved from the platform’s message broker.
  • Compute max: It computes the maximum value from a batch of sensor measurements retrieved from the platform’s message broker.
  • Compute min: It computes the minimum value from a batch of sensor measurements retrieved from the platform’s message broker.
  • Compute average: It computes the average value of a batch of sensor measurements retrieved from the platform’s message broker.
  • Outlier detection: It detects the outliers in a batch of sensor measurements retrieved from the platform’s message broker. The kernel computes an upper and a lower bound, and every value falling outside those bounds is marked as an outlier. Hence, an outlier can be detected and potentially removed from the batch of measurements.
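As a rough illustration of what these kernels compute, the sketch below implements them over a plain Python list. The function names are hypothetical. The text does not say how the outlier bounds are derived, so the rule used here (mean plus or minus `k` population standard deviations, with `k = 2.0`) is an assumption chosen only to make the example concrete.

```python
import statistics


def compute_sum(batch):
    return sum(batch)


def compute_max(batch):
    return max(batch)


def compute_min(batch):
    return min(batch)


def compute_average(batch):
    return sum(batch) / len(batch)


def outlier_bounds(batch, k=2.0):
    """Assumed rule: bounds at mean -/+ k population standard deviations."""
    mean = compute_average(batch)
    std = statistics.pstdev(batch)
    return mean - k * std, mean + k * std


def detect_outliers(batch, k=2.0):
    """Split a batch into (inliers, outliers) using the bounds above."""
    lower, upper = outlier_bounds(batch, k)
    inliers = [v for v in batch if lower <= v <= upper]
    outliers = [v for v in batch if v < lower or v > upper]
    return inliers, outliers


if __name__ == "__main__":
    # Illustrative temperature readings with one implausible spike.
    readings = [20.0, 21.0, 19.0, 20.0, 22.0, 18.0, 20.0, 95.0]
    kept, dropped = detect_outliers(readings)
    print("kept:", kept)
    print("dropped:", dropped)
```

Each function is a reduction or an element-wise filter over the batch, which is the kind of data-parallel work that lends itself to hardware acceleration.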

To use the E2Data Big Data framework, i.e., Apache Flink, as the big data processing platform in the Spark Works IoT engine, we reimplemented the analytics operations and functions of the Spark Works analytics engine and migrated them from Apache Storm to Apache Flink.

For a preliminary performance evaluation of the Apache Flink DataSet version of the “Green Buildings” Analytics UC, we used a synthetic input dataset simulating the processing of 7,200,000 data entries for a time period of 60 minutes. In detail, the evaluated dataset contains input data from 2,000 sensors generating data with 1-minute granularity per sensor for a 1-hour time period. The performance evaluation of the “Green Buildings” UC was carried out in a combined scale-out/scale-up way on both the x86 and Aarch64 E2Data testbeds. The number of TaskManagers/Workers is increased to measure the effect of scaling out each Flink cluster, while at the same time the number of task slots per TaskManager/Worker is increased to investigate the impact of scaling up each individual TaskManager/Worker of the cluster.
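The combined scale-out/scale-up sweep can be pictured as a grid of cluster configurations, where the total number of task slots is the number of TaskManagers multiplied by the slots per TaskManager (the latter is set in Flink through `taskmanager.numberOfTaskSlots`). The sketch below only enumerates such a grid. The worker and slot counts are placeholder values, not the configurations used in the evaluation, and `slot_grid` is a hypothetical helper.

```python
from itertools import product


def slot_grid(worker_counts, slots_per_worker):
    """Enumerate cluster configurations and their total task slots."""
    return [
        {
            "task_managers": workers,
            "slots_per_task_manager": slots,
            "total_slots": workers * slots,
        }
        for workers, slots in product(worker_counts, slots_per_worker)
    ]


if __name__ == "__main__":
    # Placeholder values: scale out over workers, scale up over slots.
    for config in slot_grid(worker_counts=[1, 2, 4], slots_per_worker=[1, 2, 4]):
        print(config)
```

Different configurations can reach the same total slot count (for example 2 x 4 and 4 x 2), which is what makes it possible to compare scaling out against scaling up at equal parallelism.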

The following two figures present the execution time in seconds for the “Green Buildings” UC Flink implementation on the Aarch64 (KMAX) and x86 testbeds respectively.

 Performance evaluation in ARM cluster

 Performance evaluation in x86 cluster

The first, and anticipated, result is that increasing the total number of task slots on the clusters always improves the runtime of the evaluated application. Moreover, scaling out the cluster provides higher gains than scaling up the task slots on each node. The runtime improvement tends to be sublinear, but this does not hold in every case on both clusters, because of factors other than the available task slots, such as the Flink initialization time and the network communication needed for inter-worker synchronization. Finally, the x86 cluster performs significantly better than the Aarch64 cluster, owing to the different type and level of computational resources of the two testbeds.
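One way to check whether a runtime improvement is sublinear is to compare the measured speedup with the increase in task slots. The sketch below shows that calculation. The helper names and the 5% tolerance are assumptions, and the timings in the usage example are invented placeholders, not measurements from the evaluation.

```python
def speedup(baseline_seconds, seconds):
    """How many times faster a run is than the baseline run."""
    return baseline_seconds / seconds


def efficiency(baseline_slots, baseline_seconds, slots, seconds):
    """Speedup divided by the factor by which the slot count grew."""
    return speedup(baseline_seconds, seconds) / (slots / baseline_slots)


def classify(eff, tolerance=0.05):
    """Label scaling as sublinear, linear or superlinear (assumed tolerance)."""
    if eff > 1.0 + tolerance:
        return "superlinear"
    if eff < 1.0 - tolerance:
        return "sublinear"
    return "linear"


if __name__ == "__main__":
    # Invented placeholder timings: (total slots, runtime in seconds).
    runs = [(1, 100.0), (2, 50.0), (4, 40.0)]
    base_slots, base_seconds = runs[0]
    for slots, seconds in runs[1:]:
        eff = efficiency(base_slots, base_seconds, slots, seconds)
        print(slots, "slots:", classify(eff))
```

Fixed costs such as framework initialization do not shrink as slots are added, so they pull the efficiency below 1 as the parallel part of the job gets shorter.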

Overall, the performance boost in data processing provided by the E2Data heterogeneous platform will enable the Spark Works Analytics Engine to provide more accurate and valuable analytics by increasing the rate at which IoT devices send data to the platform.

This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 780245.

E2Data is part of the Heterogeneity Alliance