xTLD Pipeline: Detection component
This blog post describes the detection component of the xTLD technology, which received an award in the NVIDIA Inception Program contest. xTLD is still in development.
2017-01-11, Zdenek Kalal, TLD Vision
Object detection is the key technology for computer vision, a subset of AI that deals with visual perception. While current research focuses on detecting a large number of object classes, industrial problems often require detecting only a single object class, but with significantly higher robustness and accuracy.
To answer this demand, we spent the past year developing an innovative technology codenamed xTLD. Like other modern detection methods, xTLD uses deep neural networks for learning and inference, but instead of spreading its power across many classes, it concentrates it on a single class. This decision lets us increase detection robustness and speed, and also use the remaining power for additional visual intelligence. In particular, we simultaneously estimate the 3D object pose, which is a challenging task in its own right.
xTLD is general and can be adapted to a large number of objects that are worth detecting in practice. Nevertheless, we start with the human head and refer to this instance as HeadTLD. In the near future, we plan to apply xTLD to other classes (e.g. cars, pedestrians), depending on real demand.
Current features
- real-time detection and 360° pose estimation in HD video
- minimum bounding box size of 20x20 pixels (the head itself covers only a 10x10 pixel area)
- pose alignment error 15°
- detects and aligns multiple targets simultaneously
This image shows the accuracy of alignment. The face is detected and aligned independently of its rotation, even when it covers just a 20x20 pixel area.
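The post does not state how the 15° alignment error is measured. One common choice, assumed here purely for illustration, is the geodesic angle between the predicted and the ground-truth rotation, theta = acos((trace(A^T B) - 1) / 2). The `Mat3` type and both function names below are hypothetical and not part of xTLD.

```cpp
#include <algorithm>
#include <cmath>

// 3x3 rotation matrix, row-major (hypothetical type for illustration).
struct Mat3 { double m[3][3]; };

const double kPi = std::acos(-1.0);

// Rotation by `deg` degrees about the vertical (y) axis, i.e. a head yaw.
Mat3 yawRotation(double deg) {
    const double a = deg * kPi / 180.0;
    const double c = std::cos(a), s = std::sin(a);
    return {{{c, 0.0, s}, {0.0, 1.0, 0.0}, {-s, 0.0, c}}};
}

// Geodesic angle in degrees between two rotations:
// theta = acos((trace(A^T * B) - 1) / 2).
double poseErrorDeg(const Mat3& a, const Mat3& b) {
    // trace(A^T * B) equals the sum of element-wise products of A and B.
    double trace = 0.0;
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            trace += a.m[i][j] * b.m[i][j];
    // Clamp so that rounding cannot push the cosine outside [-1, 1].
    const double cosTheta = std::max(-1.0, std::min(1.0, (trace - 1.0) / 2.0));
    return std::acos(cosTheta) * 180.0 / kPi;
}
```

For example, `poseErrorDeg(yawRotation(30.0), yawRotation(45.0))` evaluates to approximately 15, since the two poses differ by a 15° yaw.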
Training data
- combines computer graphics with real video data
- requires an approximate 3D model of the object
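The post does not describe how the synthetic part of the training data is generated. As a minimal sketch, assuming the 3D model is rendered under randomly chosen orientations to cover the full 360° pose range, the orientations could be drawn uniformly using Shoemake's method for random unit quaternions. The `Quat` type and `sampleUniformRotation` are hypothetical names, not part of xTLD.

```cpp
#include <cmath>
#include <random>

// Unit quaternion (x, y, z, w) describing a 3D rotation (hypothetical type).
struct Quat { double x, y, z, w; };

// Draws a rotation uniformly at random from all 3D rotations
// using Shoemake's method (three uniform numbers in [0, 1)).
template <class Rng>
Quat sampleUniformRotation(Rng& rng) {
    const double kTwoPi = 2.0 * std::acos(-1.0);
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    const double u1 = uni(rng);
    const double u2 = uni(rng);
    const double u3 = uni(rng);
    const double a = std::sqrt(1.0 - u1);
    const double b = std::sqrt(u1);
    return {a * std::sin(kTwoPi * u2), a * std::cos(kTwoPi * u2),
            b * std::sin(kTwoPi * u3), b * std::cos(kTwoPi * u3)};
}
```

Because each rendered frame is produced from a known orientation, the sampled quaternion can serve directly as the ground-truth pose label for that frame. With a seeded generator such as `std::mt19937 rng(42);`, the same sequence of poses is reproduced on every run.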
Technology
Our technology builds on relatively low-level libraries to ensure maximal flexibility of the system.
- neural network implemented from scratch in C++, CUDA
- convolution accelerated using cuDNN (observed 2x speedup over cuBLAS)
- real-time rendering with OpenGL, GLFW, GL3W
- camera and video access using OpenCV
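To make the convolution item above concrete, here is a plain CPU reference for the operation that a library such as cuDNN accelerates on the GPU: a single-channel "valid" 2D convolution in the cross-correlation form commonly used in neural networks. This is only an illustrative sketch, not the xTLD implementation; the `Image` type and `conv2dValid` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major single-channel image (hypothetical helper type).
struct Image {
    std::size_t rows, cols;
    std::vector<float> data;
    Image(std::size_t r, std::size_t c, float fill = 0.0f)
        : rows(r), cols(c), data(r * c, fill) {}
    float& at(std::size_t r, std::size_t c) { return data[r * cols + c]; }
    float at(std::size_t r, std::size_t c) const { return data[r * cols + c]; }
};

// "Valid" 2D cross-correlation: no padding, stride 1.
// Output size is (in.rows - k.rows + 1) x (in.cols - k.cols + 1).
Image conv2dValid(const Image& in, const Image& k) {
    assert(k.rows >= 1 && k.cols >= 1);
    assert(k.rows <= in.rows && k.cols <= in.cols);
    Image out(in.rows - k.rows + 1, in.cols - k.cols + 1);
    for (std::size_t r = 0; r < out.rows; ++r) {
        for (std::size_t c = 0; c < out.cols; ++c) {
            float acc = 0.0f;
            // Slide the kernel over the input window anchored at (r, c).
            for (std::size_t i = 0; i < k.rows; ++i)
                for (std::size_t j = 0; j < k.cols; ++j)
                    acc += in.at(r + i, c + j) * k.at(i, j);
            out.at(r, c) = acc;
        }
    }
    return out;
}
```

A reference like this is mainly useful for checking a GPU implementation against known values on small inputs; the four nested loops are exactly the work that GPU libraries parallelize.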
Inference
- inference requires CUDA and works best on a fast GPU such as the Titan X
- degrades gracefully on low-end GPUs
Example detections
Target applications
- autonomous driving – predict intent to cross the road at a larger distance
- augmented reality – augmentation of multiple faces at a larger distance
- initialization – for high-detail 3D capture or facial recognition software
- privacy protection – innovative real-time 3D video anonymization that cannot be reversed by a neural network
- deeper analysis of crowd behavior
- monitoring of student attention