xTLD Pipeline: Detection component

This blog post describes the detection component of xTLD, a technology that was awarded in the NVIDIA Inception Program contest. xTLD is still in development.
2017-01-11, Zdenek Kalal, TLD Vision

Submitted video (examples)

Object detection is a key technology for computer vision, the subset of AI that deals with visual perception. While current research focuses on detecting a large number of object classes, industrial problems often require detecting only a single object class, but with significantly higher robustness and accuracy.

To answer this demand, we spent the past year developing an innovative technology codenamed xTLD. Like other modern detection methods, xTLD uses deep neural networks for learning and inference, but instead of spreading its power across many classes, it concentrates that power on a single class. This decision allows us to increase detection robustness and speed, and also to use the remaining power for additional visual intelligence. In particular, we simultaneously estimate the 3D object pose, which is a challenging task on its own.

xTLD is general and can be adapted to a large number of objects that are worth detecting in practice. Nevertheless, we start with the human head and refer to this instance as HeadTLD. In the near future, we plan to apply xTLD to other classes (e.g. cars, pedestrians), depending on real demand.

Current features

real-time detection and 360° pose estimation in HD video
minimum bounding box size of 20x20 pixels (the head covers only a 10x10 pixel area)
pose alignment error 15°
detects and aligns multiple targets simultaneously

This image shows the accuracy of alignment. The face is detected and aligned independently of its rotation, and even when it covers just a 20x20 pixel area.
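With a 360° pose range, an angular error has to be measured with wraparound, otherwise a prediction of 350° against a true angle of 10° would count as 340° off instead of 20°. The post does not say how the 15° alignment error is measured, so the following is only a minimal sketch under the assumption of a single rotation angle in degrees. The function names and the use of 15° as a default tolerance are hypothetical and chosen for illustration.

```cpp
#include <cassert>
#include <cmath>

// Smallest absolute difference between two angles given in degrees,
// accounting for wraparound at 360 (350 vs. 10 is 20 apart, not 340).
double angularErrorDeg(double predictedDeg, double trueDeg) {
    double diff = std::fmod(predictedDeg - trueDeg, 360.0);  // in (-360, 360)
    if (diff < 0.0) {
        diff += 360.0;                                       // now in [0, 360)
    }
    return diff > 180.0 ? 360.0 - diff : diff;               // now in [0, 180]
}

// Hypothetical helper: does a predicted angle fall within a tolerance?
// The 15 degree default mirrors the figure quoted in the feature list;
// treating it as a per-sample threshold is an assumption of this sketch.
bool withinTolerance(double predictedDeg, double trueDeg,
                     double toleranceDeg = 15.0) {
    assert(toleranceDeg >= 0.0);
    return angularErrorDeg(predictedDeg, trueDeg) <= toleranceDeg;
}
```

For a full 3D pose, the same idea would apply to the rotation angle between the predicted and the true orientation rather than to a single axis.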

Training data

combines computer graphics with real video data
requires an approximate 3D model of the object

Technology

Our technology is built on top of relatively low-level libraries to ensure maximum flexibility of the system.

neural network implemented from scratch in C++, CUDA
convolution accelerated using cuDNN (observed 2x speedup over cuBLAS)
real-time rendering with OpenGL, GLFW, GL3W
camera and video access using OpenCV
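For readers unfamiliar with the operation that cuDNN accelerates, here is a naive CPU reference of a single-channel 2D convolution (computed as cross-correlation, as is usual in neural network libraries). It is an illustration only and not the xTLD implementation, which runs on the GPU. The function name, the row-major layout and the "valid" (no padding) border handling are assumptions made for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Naive "valid" 2D cross-correlation on a single-channel, row-major image.
// Output size is (height - kHeight + 1) x (width - kWidth + 1).
std::vector<float> conv2dValid(const std::vector<float>& image,
                               std::size_t height, std::size_t width,
                               const std::vector<float>& kernel,
                               std::size_t kHeight, std::size_t kWidth) {
    assert(image.size() == height * width);
    assert(kernel.size() == kHeight * kWidth);
    assert(kHeight >= 1 && kWidth >= 1);
    assert(kHeight <= height && kWidth <= width);

    const std::size_t outH = height - kHeight + 1;
    const std::size_t outW = width - kWidth + 1;
    std::vector<float> out(outH * outW, 0.0f);

    for (std::size_t y = 0; y < outH; ++y) {
        for (std::size_t x = 0; x < outW; ++x) {
            float sum = 0.0f;
            // Slide the kernel over the window whose top-left corner is (x, y).
            for (std::size_t ky = 0; ky < kHeight; ++ky) {
                for (std::size_t kx = 0; kx < kWidth; ++kx) {
                    sum += image[(y + ky) * width + (x + kx)]
                         * kernel[ky * kWidth + kx];
                }
            }
            out[y * outW + x] = sum;
        }
    }
    return out;
}
```

The four nested loops make the cost grow with both the image area and the kernel area, which is why this step is the natural target for GPU acceleration.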

Inference

inference requires CUDA and works best on a fast GPU such as the Titan X
degrades gracefully on low-end GPUs

Example detections

Target applications

autonomous driving – predict intent to cross the road at a larger distance
augmented reality – augmentation of multiple faces at a larger distance
initialization – for high-detail 3D capture or facial recognition software
privacy protection – innovative real-time 3D video anonymization that cannot be reversed by a neural network
deeper analysis of crowd behavior
monitoring of attention of students