Java GPU-accelerated Viola Jones Face Detection with TornadoVM

By Gary Frost (iProov)

In this blog post, we will demonstrate how iProov achieved an approximately 23x performance improvement by using TornadoVM to accelerate a Java-based Viola Jones face/feature detector. iProov's business domain is identity management, so the ability to accurately identify an individual in an image or in a stream of images aligns directly with its goals.

The technique described in this post is strictly face detection ("That looks like a face!") rather than face recognition ("That is Joe Smith's face!"). However, the ability to quickly - and computationally inexpensively - identify areas of an image which contain faces can serve as a useful first stage in a pipeline involving more aggressive - and computationally intensive - techniques, such as machine learning.

Viola Jones is an established and fairly simple algorithm for face or general feature detection. There are existing Java implementations such as jviolajones, but these tend to be slower than implementations which layer on top of JNI and OpenCV. We wanted to stick with Java but did not want to sacrifice performance, so we chose to leverage TornadoVM.

iproov blog post 1

TornadoVM is a Java framework developed at The University of Manchester which enables hardware acceleration of Java programs on GPUs and FPGAs. For the appropriate workload, TornadoVM can yield impressive performance improvements compared to the same code executing on a Java Virtual Machine on conventional CPUs.

Java acceleration via GPU is not new (OpenCL and CUDA wrappers have existed for over a decade). However, adopting these wrappers required the Java developer to fully understand the concepts and details of the GPU's multi-layered memory model and to successfully map those concepts back to the Java world. Later frameworks - starting with AMD's Aparapi, then Rootbeer, Oracle/AMD's collaborative Sumatra and NVidia/IBM's collaborative PowerPC+CUDA offering - successfully hid much of the gory GPU detail whilst still allowing the developer to reap performance rewards, but were generally confined to single kernel dispatches.

TornadoVM extends the capabilities of these later frameworks. It allows graphs of tasks to be constructed and the data interdependencies between them to be described in Java, thus allowing the developer to implement whole algorithms from data parallel tasks. What also sets TornadoVM apart from its competitors in this space is its adoption of Oracle's Graal (a Java JIT compilation suite implemented in Java) for code generation. Rather than attempting to decode/map Java bytecode to OpenCL or CUDA using a custom transpiler, TornadoVM reuses technology already in place for existing polyglot execution and targets the intermediate format of the accelerator (OpenCL, CUDA PTX) in the same way that other Graal runtimes target the ISA of the virtual machine's host (X64, X86, ARM64).

The TornadoVM team have successfully leveraged the front-end compiler optimization phases that come with Graal (inlining, loop unrolling, dead code elimination) and developed their own custom backend (written in Java) for each supported acceleration target. Technically, this approach was first demonstrated in Sumatra (2013), which also employed a very early form of Graal to target HSA platforms, but this was years before Graal was included as part of the OpenJDK. Whilst Sumatra was able to demonstrate some impressive performance, the project stalled around 2014.

In this post, rather than demonstrating the potential of TornadoVM using 'toy' NBody, Mandelbrot and other embarrassingly parallel workloads, we wanted to show how a more complex application using multiple compute stages could take advantage of TornadoVM's task graph model. We chose to implement the Viola Jones (Haar Cascade) Feature Detection algorithm using a cascade trained to detect faces.

The Viola Jones Feature Detection algorithm

To detect faces (or any other feature) using Viola Jones, we either need to create a trained model (using AdaBoost) or use one of the models available online. Thankfully, the OpenCV project makes a number of open source cascades available for us to use. Each model or 'Haar Cascade' is the result of training for a specific feature such as face, hand, digits, etc.

In our case, we load our 'face detection' cascade from an xml file called: haarcascade_frontalface_alt2.xml

We will explain the details of each stage as we walk through the process. At runtime, we divide the work into three stages.

  1. Convert an image from RGB to grayscale/monochrome.

  2. Create a Summed Area Table (Integral Image) from our grayscale image

  3. Iterate over rectangular regions of our Integral Image at various scales and 'apply' the Haar Cascade.

The first two steps merely serve to simplify/minimize the computation required in stage 3.

RGB to GrayScale

Converting an image from RGB to grayscale is an "embarrassingly parallel" process: each RGB pixel value maps to a single grayscale pixel value. Creating a gray pixel value by taking 29% red + 60% green + 11% blue from our RGB pixels is a common approach.

int gray = (int)(.29 * red + .60 * green + .11 * blue);

This is a trivial task for a GPU or other SIMD/SIMT computer architecture.
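
As a concrete sketch of how this stage can be written for TornadoVM (the class, method and parameter names here are ours, for illustration), a kernel might look like the following, assuming the image arrives as packed 0xRRGGBB ints:

import uk.ac.manchester.tornado.api.annotations.Parallel;

public class GrayScaleKernel {
    // Map each packed 0xRRGGBB pixel to a single grayscale value; TornadoVM
    // can turn the @Parallel loop into one GPU work-item per pixel.
    public static void rgbToGray(int[] rgb, int[] gray, int width, int height) {
        for (@Parallel int i = 0; i < width * height; i++) {
            int pixel = rgb[i];
            int red   = (pixel >> 16) & 0xFF;
            int green = (pixel >> 8) & 0xFF;
            int blue  = pixel & 0xFF;
            gray[i] = (int) (.29f * red + .60f * green + .11f * blue);
        }
    }
}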

GrayScale to Integral Image

In order to minimize the computation in the later cascade stage, we take advantage of the properties of a Summed Area Table, or Integral Image. This 'table' is constructed from the grayscale image and is essentially a 2D prefix scan/sum. In a 1D prefix sum, each element (at index i) is the sum of all values from the 0th element to the i'th element of the original data. With an integral image, the value at location (x,y) in our table contains the sum of all the pixels above and to the left of (x,y).

For a small image (see the left data set below) where black pixels are '1' and white '0', the Integral Image values are shown on the right.

iproov blog post 2

The primary value of an Integral Image is that we can determine the sum of the pixels for any rectangular subregion of the original image with simple addition and subtraction of the values at the four corners of the region of interest in the table.

iproov blog post 3
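
In code, that four-corner lookup might be expressed as follows (a sketch: we assume the integral image is stored row-major with an extra leading row and column of zeros, so regions touching the image edge need no special-casing):

// Sum of the pixels inside the w x h rectangle whose top-left corner is at
// (x, y), using a zero-padded integral image ii with the given row stride.
static int rectSum(int[] ii, int stride, int x, int y, int w, int h) {
    int a = ii[y * stride + x];                // above-left of the region
    int b = ii[y * stride + (x + w)];          // above-right
    int c = ii[(y + h) * stride + x];          // below-left
    int d = ii[(y + h) * stride + (x + w)];    // below-right
    return d - b - c + a;                      // four lookups, whatever the size
}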

The cascade traversal later needs to compare a lot of subregions, so this table becomes invaluable. Sadly, the value of each cell depends on the values of all cells above and to its left, so creating an integral image is far from "embarrassingly parallel". However, since an Integral Image is essentially a 2D prefix scan, it can be computed by applying a prefix sum/scan to each row and then to each column of the result. This gives us a recipe for extracting parallelism: we perform a prefix scan of all rows in parallel, then a scan of all resulting columns in parallel.

iproov blog post 4

This 'integral image' serves as a lookup table for cascade traversal and allows us to quickly determine the image density (proportion of light to dark) of any scaled rectangular region of interest in the original image.
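
Below is a sequential Java sketch of that two-pass formulation (method names are ours). Under TornadoVM, each outer loop is a candidate for the @Parallel annotation, so all rows, and then all columns, can be scanned concurrently. For simplicity this builds the unpadded inclusive table; the zero-padded variant assumed by rectSum above simply adds a leading row and column of zeros.

// Pass 1: prefix-sum each row independently - the rows can run in parallel.
static void rowScan(int[] gray, int[] integral, int width, int height) {
    for (int y = 0; y < height; y++) {         // @Parallel candidate
        int sum = 0;
        for (int x = 0; x < width; x++) {
            sum += gray[y * width + x];
            integral[y * width + x] = sum;
        }
    }
}

// Pass 2: prefix-sum each column of the row-scanned table - columns in parallel.
static void colScan(int[] integral, int width, int height) {
    for (int x = 0; x < width; x++) {          // @Parallel candidate
        int sum = 0;
        for (int y = 0; y < height; y++) {
            sum += integral[y * width + x];
            integral[y * width + x] = sum;
        }
    }
}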

Processing the Haar Cascade

The Haar Cascade describes operations to be performed on a fixed grid size (say 24 x 24) and is composed of a sequence of decision trees. Each tree is composed of nodes representing 'Haar Features', along with left and right node pointers and some 'threshold' metadata which determines whether to traverse left or right. A 'Haar Feature' represents the relative size, position and light/dark ratio between subrectangles within a bounding rectangular area. Essentially, the sum of the pixels which lie within the white rectangles of the Haar Feature is subtracted from the sum of the pixels in the grey rectangles.

An example of a Haar Feature relative to its bounding box is shown below.

iproov blog post 5

Each Haar Feature also includes a 'weighting' value, indicating how much this feature contributes to a successful detection in this tree.

Other examples of Haar Features:

iproov blog post 6

During training (via an algorithm called AdaBoost), a series of curated, prescaled faces (in our 24x24 grid) is analyzed, and common patterns of light/dark are used to construct the eventual cascade. In its simplest form, one can imagine attempting to match a cartoon face such as this:

iproov blog post 7

and concluding that a node comprising these two Haar Features offers a good first approximation.

iproov blog post 8

By arranging features in a tree and determining at each node, after testing against a threshold, whether to walk left or right, one can see how the cascade allows us to fail fast when deciding whether a rectangular region looks like a face or not.

iproov blog post 9

Of course, real faces vary a lot more than our idealized 'smiley face', so we can expect a real cascade to form a list of substantial trees. The cascade used for face detection in this implementation contained 20 trees, 2094 Haar features and 4535 rectangles.

To aid performance, the cascade is assembled with the most 'discriminating' tree first. In a sense, the deeper the algorithm traverses into the cascade, the more likely it is that the area under scrutiny is a face. If it traverses all the way to a leaf node of the final tree, it has determined that it is 'looking at' a face. So, after the RGB -> grayscale conversion and the creation of the Integral Image, the cascade stage is essentially a set of nested loops wrapping an irregular tree traversal.

To illustrate the whole algorithm, we assume that the tree traversal for a scaled region at a given x, y location in the image is encapsulated in a function:

isAFace(x,y,scale,integralImage)
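
The real traversal is data-driven by the cascade, but a simplified Java sketch of the control flow inside such a function might look like this, reusing the rectSum helper sketched earlier. The carrier types are illustrative, and each tree is reduced to a flat list of single-split features rather than a full decision tree:

// Illustrative carrier types: the real cascade is richer than this.
static final class Rect { int x, y, w, h; float weight; }
static final class Feature { Rect[] rects; float threshold, leftVal, rightVal; }
static final class Tree { Feature[] features; float stageThreshold; }

// Walk every tree of the cascade for the window whose top-left corner is at
// (x, y); fail fast as soon as a tree's accumulated evidence misses its
// stage threshold.
static boolean isAFace(int x, int y, float scale, int[] ii, int stride,
                       Tree[] cascade) {
    int window = (int) (24 * scale);           // scaled 24x24 detection window
    float invArea = 1.0f / (window * window);  // normalize sums by window area
    for (Tree tree : cascade) {
        float stageSum = 0;
        for (Feature f : tree.features) {
            float featureSum = 0;
            for (Rect r : f.rects) {           // weighted light/dark rectangles
                featureSum += r.weight * rectSum(ii, stride,
                        x + (int) (r.x * scale), y + (int) (r.y * scale),
                        (int) (r.w * scale), (int) (r.h * scale));
            }
            stageSum += (featureSum * invArea < f.threshold) ? f.leftVal
                                                             : f.rightVal;
        }
        if (stageSum < tree.stageThreshold) {
            return false;                      // rejected early: not a face
        }
    }
    return true;                               // survived every tree
}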

With this, we have enough to code a pure sequential version.

faceCount = 0

for (float scale ..) {
    for (int y ..) {
        for (int x ..) {
            if (isAFace(x, y, scale, integralImage)) {
                results[faceCount++] = {x, y, scale}
            }
        }
    }
}

We can see that the loop body is invoked within three nested loops, once for each unique triple of x, y and scale. At one extreme, we could launch a new thread for each (x, y, scale) triple and collect all faces in parallel. Our only real challenge is the race condition on the shared faceCount when multiple threads find faces concurrently.

Thankfully Java has AtomicInteger.

AtomicInteger faceCount = new AtomicInteger(0);

for (@Parallel float scale ..) {
    for (@Parallel int y ..) {
        for (@Parallel int x ..) {
            if (isAFace(x, y, scale, integralImage)) {
                results[faceCount.getAndIncrement()] = {x, y, scale}
            }
        }
    }
}

The Cascade Data Structure

For our pure Java implementation, we can easily imagine the cascade as a List<Tree>, with each Tree containing a map of Features and each Feature comprising a List<Rectangle>.

Given the nature of Java allocations, these tree objects (and their subnodes) all live somewhere on the heap and, as Java programmers, we generally don't care where. Sadly, this is the main challenge with many algorithms we wish to move to a GPU or other data parallel accelerator: we need a data parallel version of the cascade data structure.

Data Parallel Cascade Data Structure

For GPU acceleration, we essentially need to flatten our heap data - the object tree form of the Haar Cascade - into one or more contiguous memory regions that we can pass to the GPU, which also allows TornadoVM to benefit from locality. This process is similar to serializing Java objects into an ObjectStream, or writing them to a Protocol Buffer in preparation for transmission over a network.

We walk our data structure and write all data to contiguous memory, mapping inter-object pointers to relative references within that memory. To perform this operation, we built a custom serialization step which flattens the whole tree into parallel float and int arrays. We call the resulting pair of arrays a DataParallelCascade. These two arrays can then be passed as data to the GPU along with the Integral Images.
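
A minimal sketch of the idea (the field layout and names here are illustrative, not the actual iProov serializer): each node appends a fixed-size record to growing int and float lists during a depth-first walk, and child object references become int offsets.

import java.util.ArrayList;
import java.util.List;

// Flatten a tree of cascade nodes into two parallel primitive collections,
// replacing object references with int offsets. Each node contributes two
// ints and one float, so a node whose record starts at int offset 2k keeps
// its float payload at float offset k.
class CascadeFlattener {
    static class Node { Node left, right; float threshold; }

    final List<Integer> ints = new ArrayList<>();
    final List<Float> floats = new ArrayList<>();

    // Depth-first walk; returns the int-array offset of this node's record.
    int flatten(Node node) {
        int myOffset = ints.size();
        ints.add(-1);                   // slot 0: left child offset (-1 = none)
        ints.add(-1);                   // slot 1: right child offset (-1 = none)
        floats.add(node.threshold);     // numeric payload, in the same DFS order
        if (node.left != null)  { ints.set(myOffset,     flatten(node.left));  }
        if (node.right != null) { ints.set(myOffset + 1, flatten(node.right)); }
        return myOffset;
    }
}

Once the walk completes, the two lists are copied into plain int[] and float[] arrays ready to be handed to the device.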

Essentially, the generated TornadoVM kernel becomes a state machine which interprets the data in these flattened parallel arrays and uses it to traverse the tree. We now have a form of Viola Jones which can run on an OpenCL device.
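
To give a feel for how the stages compose, here is a hedged sketch using the TaskSchedule API from the TornadoVM releases current at the time of writing (the task names, kernel classes and buffers are ours, for illustration):

import uk.ac.manchester.tornado.api.TaskSchedule;

// Chain the stages into a single task graph: TornadoVM manages the device
// buffers and the ordering implied by the data dependencies between tasks.
TaskSchedule schedule = new TaskSchedule("violaJones")
        .streamIn(rgb, cascadeInts, cascadeFloats)
        .task("gray",    GrayScaleKernel::rgbToGray, rgb, gray, width, height)
        .task("rowScan", IntegralImage::rowScan, gray, integral, width, height)
        .task("colScan", IntegralImage::colScan, integral, width, height)
        .task("detect",  Detector::detect, integral, cascadeInts, cascadeFloats, results)
        .streamOut(results);

schedule.execute();   // compiles each task for the device on first use, then reuses it

On the first call to execute(), TornadoVM generates and caches the accelerator code, which is why the timings below discard the first run.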

Performance

In the table below we compare the performance of a sequential Java implementation, a multi-threaded Java implementation and two TornadoVM implementations: one running on a CPU OpenCL runtime (from Intel) and one on a GPU OpenCL runtime (from AMD/Radeon). We tested using four images of increasing size: 600x408, 844x612, 1578x976 and 2560x1738 pixels. It should be noted that we ran each detection 11 times in total. We discarded the time for the first run in all cases (for Java this skips the interpreter/warm-up phase, for TornadoVM it skips the code generation), then took the average time over the remaining 10 runs.

Workstation Configuration

  • Ubuntu 20.04.1 LTS
  • Z840 Workstation (2x10 core/20 thread) Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz
  • AMD/Radeon (Ellesmere) Radeon RX 580 GPU (36 Compute units) ROCM v 3.9.1

iproov blog post table 1

We can see that the TornadoVM result for the largest image (298ms vs 6825ms) shows a 22.9x performance improvement over sequential Java and (298ms vs 1018ms) a 3.4x performance improvement over multithreaded Java.

The following table shows the normalized results, with the single-threaded Java result representing 1x and the other columns expressed as performance improvements over this sequential Java reference.

iproov blog post table 2

Conclusions

  • TornadoVM can yield impressive performance results for algorithms from which we can extract parallelism.
  • Extracting parallelism from a given problem has always been the challenge for non-trivial applications. Once the developer has found a way to do so, the next challenge is representing the algorithm in their language of choice.
  • Java is ubiquitous, and is consistently cited as being among the top two or three programming languages on the planet. The TornadoVM team has found a way to allow Java to be used to describe data parallel algorithms and to exploit available compute resources to accelerate performance.
  • If you have an algorithm with data parallel stages, but you prefer to keep coding in Java rather than learning new languages, TornadoVM is a very powerful tool.

We should note that the test platform had a fairly low-power GPU (only 36 compute units) but 20 CPU cores. This allowed the CPU-based OpenCL runtime to compete very well against the GPU-based OpenCL runtime. So, TornadoVM may well be beneficial even if you don't have an accelerator on your system.

Links/References

Viola Jones

  • P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001, Kauai, HI, USA, 2001, pp. I-I, doi: 10.1109/CVPR.2001.990517.

https://ieeexplore.ieee.org/document/990517

https://www.cs.cmu.edu/~efros/courses/LBMV07/Papers/viola-cvpr-01.pdf

  • Vinay Gangadhar, Sharmila Shridhar, Ram Sai Manoj Bamdhamravur, "An Efficient GPGPU Implementation of Viola-Jones Classifier Based Face Detection Algorithm"

http://pages.cs.wisc.edu/~vinay/pubs/FD_GPGPU.pdf

  • Breaking Down Facial Recognition: The Viola-Jones Algorithm, Rohan Gupta

https://towardsdatascience.com/the-intuition-behind-facial-detection-the-viola-jones-algorithm-29d9106b6999

  • Wikipedia: AdaBoost

https://en.wikipedia.org/wiki/AdaBoost

  • OpenCV "Haar Cascades"

https://github.com/opencv/opencv/tree/master/data/haarcascades

haarcascade_frontalface_alt2.xml

  • Jason Brownlee "Boosting and AdaBoost for machine learning"

https://machinelearningmastery.com/boosting-and-adaboost-for-machine-learning/

  • jviolajones (Java Implementation of Viola Jones)

https://github.com/tc/jviolajones (fork)

https://code.google.com/archive/p/jviolajones (original)

Integral Image/Prefix Sums

  • Blelloch, G., "Prefix sums and their applications" (1990).

http://www.cs.cmu.edu/~guyb/papers/Ble93.pdf

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.47.6430&rep=rep1&type=pdf

  • Wikipedia: "Summed Area Table/Integral Image"

https://en.wikipedia.org/wiki/Summed-area_table

  • B. Bilgic, B. K. P. Horn and I. Masaki, "Efficient integral image computation on the GPU," 2010 IEEE Intelligent Vehicles Symposium, San Diego, CA, 2010, pp. 528-533, doi: 10.1109/IVS.2010.5548142.

https://www.nmr.mgh.harvard.edu/~berkin/Bilgic_2010_Integral_Image.pdf

https://ieeexplore.ieee.org/abstract/document/5548142

Java and GPU

  • J. Docampo, S. Ramos, G. L. Taboada, R. R. Expósito, J. Touriño and R. Doallo, "Evaluation of Java for General Purpose GPU Computing," 2013 27th International Conference on Advanced Information Networking and Applications Workshops, Barcelona, 2013, pp. 1398-1404, doi: 10.1109/WAINA.2013.234.

https://ieeexplore.ieee.org/document/6550591

https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.696.9804&rep=rep1&type=pdf

  • Aparapi

https://code.google.com/archive/p/aparapi

https://github.com/aparapi/aparapi (GitHub port of the original AMD Google Code project)

https://aparapi.com

https://github.com/Syncleus/aparapi (Syncleus port)

https://www.oreilly.com/library/view/java-and-jvm/9781449345075/oreillyvideos1273351.html

https://www.infoq.com/news/2010/09/aparapi-java-and-gpus

TornadoVM

https://www.tornadovm.org/

https://github.com/beehive-lab/TornadoVM

https://www.infoq.com/news/2020/03/TornadoVM-QCon-London/

https://www.infoq.com/articles/tornadovm-java-gpu-fpga

This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 780245.

E2Data is part of the Heterogeneity Alliance