Flexible and Efficient Acceleration of Natural Language Processing in E2Data

Processing unstructured data (text) is a powerful tool for extracting knowledge from articles and social media messages. It is applied in several business domains (such as tourism, marketing, press, postal services, banking and finance) to support various types of operations in which sentiment analysis and opinion mining play a significant role. Complex natural language processing (NLP) algorithms aim to identify syntax patterns, correlate phrases and words with lexical and semantic resources, and score or annotate expressions and text entities. Tight time constraints make it hard for such algorithms to meet their business goals. Because these language processing algorithms lie on the critical path of the knowledge extraction process, acceleration is considered a way to improve performance.

Experimenting on heterogeneous computing resources, and measuring and migrating the algorithmic solutions of a software product to FPGAs/GPUs, requires an investment of time and effort in code development and in rewriting code in different programming languages. Moreover, scalability and elasticity must be assured beforehand. Hence, solutions must be solid and trustworthy to support productisation; alternatively, programming languages and tools with heterogeneity-aware capabilities can provide an easy way to follow such migration paths. E2Data's achievements are aligned with industry goals: performance gains from heterogeneous hardware and development flexibility based on run-time intelligence. The experiments and measurements performed showcase performance improvements and scalability.

The Language Processing application use case focused on processing large volumes of social media messages to perform semantic information extraction, sentiment analysis, summarization, interpretation and organization of content. This analysis works by extracting phrases with specific syntactic forms from each tweet. The most important NLP engine functions that were accelerated are:

  • WDK (Word Distance Kernel): uses the Levenshtein distance to rank lexicographically the similarity of an input (reference) word to the words of a dictionary. The kernel outputs the distance of each dictionary word from the input word.
  • K-Means clustering, which uses the Euclidean distance between the documents of the dataset. The algorithm takes as input a C-Trie representation of the dataset and the parameter K. The C-Trie holds the index from words to the documents in which they appear, as well as their frequencies.
  • Hierarchical clustering, which uses cosine similarity between documents. The input is the same C-Trie that is used in K-Means clustering. The ranking is now based on the angles between documents.
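The three distance measures named in the list above can be sketched in plain Java as follows. This is a minimal illustration, not the E2Data kernels: the class and method names are hypothetical, and documents are assumed to be dense term-frequency vectors rather than the C-Trie representation the kernels actually consume.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative distance measures; hypothetical names, not the E2Data kernels. */
final class DistanceSketches {

    /** Levenshtein edit distance, keeping only two rows of the DP table. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // cost of building b's prefix from the empty string
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int substitution = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                curr[j] = Math.min(substitution, Math.min(insertion, deletion));
            }
            int[] swap = prev; // the row just computed becomes the previous row
            prev = curr;
            curr = swap;
        }
        return prev[b.length()];
    }

    /** WDK-style output: one distance per dictionary word, in dictionary order. */
    static int[] distancesToDictionary(String reference, List<String> dictionary) {
        int[] out = new int[dictionary.size()];
        for (int k = 0; k < out.length; k++) {
            out[k] = levenshtein(reference, dictionary.get(k));
        }
        return out;
    }

    /** Euclidean distance between two equal-length document vectors. */
    static double euclideanDistance(double[] x, double[] y) {
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            double diff = x[i] - y[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** Cosine of the angle between two non-zero document vectors. */
    static double cosineSimilarity(double[] x, double[] y) {
        double dot = 0.0, normX = 0.0, normY = 0.0;
        for (int i = 0; i < x.length; i++) {
            dot += x[i] * y[i];
            normX += x[i] * x[i];
            normY += y[i] * y[i];
        }
        return dot / (Math.sqrt(normX) * Math.sqrt(normY));
    }

    public static void main(String[] args) {
        List<String> dictionary = List.of("kitten", "sitting", "mitten");
        System.out.println(Arrays.toString(distancesToDictionary("kitten", dictionary))); // [0, 3, 1]
        double[] docA = {3, 4};
        double[] docB = {6, 8};
        System.out.println(euclideanDistance(docA, docB));
        System.out.println(cosineSimilarity(docA, docB));
    }
}
```

The two example documents point in the same direction but differ in length, so they are far apart under the Euclidean distance and maximally similar under cosine similarity; this is the practical difference between the two clustering measures.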

NLP code engines work on dictionaries. A dictionary can be a set of words or phrases in text or compiled binary format. The compiled format can contain indices for quick recognition of a word, as well as information related to the word.

All kernels share a common configuration option: run the algorithm on the JVM without acceleration, run it with E2Data acceleration (TornadoVM), or run it on both the JVM and E2Data and compare the results, the execution times and the acceleration rate.

The experiment plan followed two steps. The first step used the current version of the NLP kernels and aimed to showcase the GPU speedup of E2Data acceleration on NLP kernel execution, by running the lexicographical ranking kernel with two different dictionaries (person names, spelling lexicon) and two clustering algorithms (Hierarchical, K-Means) on a corpus of tweets. The second step measured performance behavior and scalability after the integration of the kernels with Flink, on a local cluster setup and on the ICCS cluster, for the lexicographical ranking kernel and the K-Means clustering algorithm.
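The three run modes described above (JVM only, accelerated only, both with comparison) might be organised along the following lines. This is a hedged sketch: all names are hypothetical, the workload is a toy kernel, and the "accelerated" path is a Java parallel stream standing in for TornadoVM, whose API is deliberately not reproduced here.

```java
import java.util.Arrays;
import java.util.function.Supplier;
import java.util.stream.IntStream;

/** Hypothetical harness for the JVM / accelerated / compare run modes. */
final class RunModeSketch {

    enum Mode { JVM_ONLY, ACCELERATED_ONLY, COMPARE }

    /** Outcome of a COMPARE run: result equality and baseline/accelerated time ratio. */
    static final class Report {
        final boolean resultsMatch;
        final double speedup;

        Report(boolean resultsMatch, double speedup) {
            this.resultsMatch = resultsMatch;
            this.speedup = speedup;
        }
    }

    /** Baseline: plain sequential loop on the JVM (toy workload). */
    static long[] jvmKernel(int[] input) {
        long[] out = new long[input.length];
        for (int i = 0; i < input.length; i++) {
            out[i] = (long) input[i] * input[i] + 1;
        }
        return out;
    }

    /** Stand-in for the accelerated path: a parallel stream, NOT TornadoVM. */
    static long[] acceleratedStandIn(int[] input) {
        return IntStream.range(0, input.length)
                .parallel()
                .mapToLong(i -> (long) input[i] * input[i] + 1)
                .toArray();
    }

    /** Average wall-clock time of a kernel over several runs, in nanoseconds. */
    static double averageNanos(Supplier<long[]> kernel, int runs) {
        long total = 0;
        for (int r = 0; r < runs; r++) {
            long start = System.nanoTime();
            kernel.get();
            total += System.nanoTime() - start;
        }
        return (double) total / runs;
    }

    /** Runs both paths, checks that they agree, and reports the time ratio. */
    static Report compare(int[] input, int runs) {
        boolean same = Arrays.equals(jvmKernel(input), acceleratedStandIn(input));
        double jvm = averageNanos(() -> jvmKernel(input), runs);
        double accelerated = averageNanos(() -> acceleratedStandIn(input), runs);
        return new Report(same, jvm / accelerated);
    }

    public static void main(String[] args) {
        Mode mode = args.length > 0 ? Mode.valueOf(args[0]) : Mode.COMPARE;
        int[] input = IntStream.range(0, 1_000_000).toArray();
        switch (mode) {
            case JVM_ONLY:
                System.out.println("JVM avg ns: " + averageNanos(() -> jvmKernel(input), 10));
                break;
            case ACCELERATED_ONLY:
                System.out.println("Accelerated avg ns: " + averageNanos(() -> acceleratedStandIn(input), 10));
                break;
            default:
                Report report = compare(input, 10);
                System.out.println("results match: " + report.resultsMatch + ", speedup: " + report.speedup);
        }
    }
}
```

Averaging over repeated runs only smooths noise; a careful measurement on the JVM would also need warm-up iterations so that JIT compilation does not distort the first timings.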

The E2Data platform was stressed with demanding and realistic datasets; we used various open datasets, to which we applied the kernel algorithms in both their Java and Tornado versions.

DATASETS utilized across experimental runs

  • 160.000 words from the English vocabulary excluding proper nouns, acronyms, compound words and phrases, but including archaic words and significant variant spellings

  • 360.000 single words excluding proper nouns, acronyms, compound words and phrases, but including archaic words and significant variant spellings

  • 530.000 tweet messages

  • 515.000 hotel reviews

  • 150.000 wine reviews

  • 3.000.000 Russian troll tweets

  • 500.000 location addresses

All runs were repeated many times in order to validate the results and collect average metrics across all executions. A variety of open and realistic datasets was used across the tests.

Use of Tornado on a local GPU setup. The new code kernels were evaluated on a setup with 13 GTX970 GPUs with 5 GB of memory. The performance gains obtained by running the kernels under TornadoVM, relative to the JVM version, are summarized in the table below.


| Algorithm | Vocabulary size and type (number of words) | Tornado speedup |
|---|---|---|
| Levenshtein Distance Algorithm | 160.000 person names | |
| Levenshtein Distance Algorithm | 160.000 person names | |
| Levenshtein Distance Algorithm | 360.000 English words | |
| Levenshtein Distance Algorithm | 360.000 English words | |
| Hierarchical clustering | 115.000 tweet messages (13 words per message on average, 39k distinct words) about sports (FIFA 2019 cup) | |
| K-Means clustering | 115.000 tweet messages (13 words per message on average, 39k distinct words) about sports (FIFA 2019 cup) | |



Use of the integrated Flink version (no Tornado acceleration in this phase) on a local cluster setup. The evaluation continued with the integrated E2Data-Flink version on a cluster of 3 machines, running 1 Job Manager and 8 to 10 Task Managers (16-32 GB of memory). The parallelized versions exhibited:

  • 5,9x speedup for the Levenshtein algorithm on 10 processors
  • 1,6x to 2,4x speedup for K-Means clustering on 8 processors. Better performance was observed for a lower number of centroids (100 instead of 500).

Results from the integrated Flink version on the ICCS and KMAX clusters. The experiments continued with similar runs on these clusters.


Levenshtein speedup reached up to:

  • 5,7x on the ICCS cluster with 1 task/node on 8 nodes
  • 12x on the ICCS cluster with 8 tasks/node on 8 nodes
  • 9x on the KMAX cluster with 1 task/node on 16 nodes
  • 22,8x on the KMAX cluster with 4 tasks/node on 16 nodes

The algorithm was run with a dictionary size of 160.000 words and an input size of 10.000 words for all runs.
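One way to read the Levenshtein figures above is as parallel efficiency, that is, speedup divided by the total number of tasks (nodes × tasks per node). The sketch below does that arithmetic for the four reported configurations. It assumes, which the text does not state, that each speedup is measured against a single-task run and that every task counts as one worker; the class and method names are illustrative.

```java
import java.util.Locale;

/** Hypothetical helper: parallel efficiency = speedup / (nodes * tasksPerNode). */
final class EfficiencySketch {

    static double efficiency(double speedup, int nodes, int tasksPerNode) {
        return speedup / (nodes * tasksPerNode);
    }

    static void report(String label, double speedup, int nodes, int tasksPerNode) {
        System.out.println(String.format(Locale.ROOT,
                "%-26s speedup %5.1fx  tasks %3d  efficiency %.3f",
                label, speedup, nodes * tasksPerNode, efficiency(speedup, nodes, tasksPerNode)));
    }

    public static void main(String[] args) {
        // Speedups and configurations as reported in the text for the Levenshtein kernel.
        report("ICCS, 8 nodes x 1 task", 5.7, 8, 1);
        report("ICCS, 8 nodes x 8 tasks", 12.0, 8, 8);
        report("KMAX, 16 nodes x 1 task", 9.0, 16, 1);
        report("KMAX, 16 nodes x 4 tasks", 22.8, 16, 4);
    }
}
```

Under these assumptions, going from 1 to 8 tasks per node on ICCS roughly doubles the speedup (5,7x to 12x), while the efficiency per task drops from about 0,71 to about 0,19.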

For K-Means clustering, increasing the number of tasks per node was more effective than increasing the number of nodes, reaching a maximum gain of:

  • 5x for 4 nodes with 8 tasks per node, on ICCS cluster
  • 22,8x for 16 nodes with 4 tasks per node, on KMAX cluster

The algorithm was run for 500 centroids (clusters) and 20 iterations, on a 12.000-word vocabulary (number of dimensions) and 11.500 documents (number of points).
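To make the roles of centroids, iterations, dimensions and points concrete, here is a minimal K-Means sketch (Lloyd's algorithm with Euclidean distance). It is an illustration under stated assumptions, not the E2Data kernel: points are dense arrays rather than a C-Trie, the initial centroids are supplied by the caller, and all names are hypothetical.

```java
import java.util.Arrays;

/** Minimal Lloyd-style K-Means on dense vectors; hypothetical names. */
final class KMeansSketch {

    static double squaredDistance(double[] x, double[] y) {
        double sum = 0.0;
        for (int d = 0; d < x.length; d++) {
            double diff = x[d] - y[d];
            sum += diff * diff;
        }
        return sum;
    }

    /**
     * Runs a fixed number of iterations and returns the cluster index of each point.
     * The centroids array is updated in place.
     */
    static int[] cluster(double[][] points, double[][] centroids, int iterations) {
        int k = centroids.length;
        int dims = points[0].length;
        int[] assignment = new int[points.length];
        for (int it = 0; it < iterations; it++) {
            // Assignment step: attach every point to its nearest centroid.
            for (int p = 0; p < points.length; p++) {
                double best = Double.POSITIVE_INFINITY;
                for (int c = 0; c < k; c++) {
                    double dist = squaredDistance(points[p], centroids[c]);
                    if (dist < best) {
                        best = dist;
                        assignment[p] = c;
                    }
                }
            }
            // Update step: move each centroid to the mean of its points.
            double[][] sums = new double[k][dims];
            int[] counts = new int[k];
            for (int p = 0; p < points.length; p++) {
                counts[assignment[p]]++;
                for (int d = 0; d < dims; d++) {
                    sums[assignment[p]][d] += points[p][d];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its previous centroid
                }
                for (int d = 0; d < dims; d++) {
                    centroids[c][d] = sums[c][d] / counts[c];
                }
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {10, 10}, {0, 1}, {10, 11}};
        double[][] centroids = {{0, 0}, {10, 10}};
        int[] assignment = cluster(points, centroids, 20);
        System.out.println(Arrays.toString(assignment)); // [0, 1, 0, 1]
        System.out.println(Arrays.deepToString(centroids));
    }
}
```

With dense vectors, the assignment step alone would evaluate 11.500 × 500 × 12.000 ≈ 6,9 × 10^10 coordinate differences per iteration for the configuration above; a sparse index such as the C-Trie presumably avoids most of that work, though the text does not quantify it.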

This project has received funding from the European Union's Horizon H2020 research and innovation programme under grant agreement No 780245.

E2Data is part of the Heterogeneity Alliance