Sunday, August 28, 2016

Ranked in the top 5% of the Kaggle Distracted Driver Competition

The Kaggle State Farm Distracted Driver Detection competition has just ended, and I ranked within the top 5% (64th out of 1450 participating teams; the winner's prize was $65,000). My approach is mainly based on deep learning (I trained 20 very deep models) but still applies computer vision strategies to reduce neural network distraction. A brief description of the system is in the image below:
References:
[1] Rajen Bhatt, Abhinav Dhall. "Skin Segmentation Dataset." UCI Machine Learning Repository.
[2] X. Zhu, D. Ramanan. "Face detection, pose estimation and landmark localization in the wild." Computer Vision and Pattern Recognition (CVPR), Providence, Rhode Island, June 2012.
[3] Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).

  • The competition was very challenging. We did not do any costly annotation, nor did we use the test data in any form of learning (even semi-supervised) or annotation.
    We treated images independently, while some participants learned from the test data and took advantage of the fact that the test images were originally sampled from recorded videos. They reconstructed the test videos in some form, so their image classification made use of temporal context.
    Also, some participants annotated their training data themselves and some crowdsourced the annotation, which either takes too much work or costs money.
    Our system is more general than such systems, even though they score better than ours on the leaderboard; they are indirectly overfitting the competition test data.
  • The average loss metric used for the competition leaderboard ranking doesn't directly reflect system accuracy. I believe the classification accuracies of all top 100 systems are higher than 99%, but the loss metric reflects how confident you were in each classification, which is harder. A single misclassification made with high confidence is enough to give a very bad average loss.
  • Face detection for this problem is hard; the well-known Haar cascades surely fail. The example in the image above is easy, but normally the driver's face is seen from the side.
  • I tried many strategies for upper-body human pose estimation (Calvin) and scene/driver segmentation (this and this) but didn't achieve good results.
  • I used data augmentation because the training data size is not large.
  • The camera is not calibrated and changes orientation. The system should be intelligent enough to handle this.
  • I know many interesting details and results (such as the following image, which visualizes the area that most affected the network's decision) are missing here, but I will be happy to answer any questions about the details.
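
As an illustration of the loss-metric point above, here is a minimal sketch of how one confident mistake can dominate an average loss. It assumes the leaderboard metric is the standard multi-class logarithmic loss with clipped probabilities; the 10-class setup, the clipping value `eps`, the helper names `mean_log_loss` and `confident_row`, and all the numbers are illustrative assumptions rather than details taken from the competition or from my system.

```python
import math

NUM_CLASSES = 10   # assumption: number of driver-behaviour classes
NUM_IMAGES = 1000  # synthetic test-set size, for illustration only


def mean_log_loss(true_labels, predicted_probs, eps=1e-15):
    """Average of -log(probability assigned to the true class).

    Probabilities are clipped to [eps, 1 - eps] so that log(0) never occurs
    (the clipping value is an assumption).
    """
    total = 0.0
    for label, probs in zip(true_labels, predicted_probs):
        p = min(max(probs[label], eps), 1.0 - eps)
        total -= math.log(p)
    return total / len(true_labels)


def confident_row(predicted_class, confidence):
    """Put `confidence` on one class and spread the remainder evenly."""
    rest = (1.0 - confidence) / (NUM_CLASSES - 1)
    row = [rest] * NUM_CLASSES
    row[predicted_class] = confidence
    return row


# Every image truly belongs to class 0 in this toy example.
labels = [0] * NUM_IMAGES

# Case A: all images classified correctly with 99% confidence.
all_correct = [confident_row(0, 0.99) for _ in range(NUM_IMAGES)]

# Case B: same as A, except one image is assigned to the wrong class
# with very high confidence (accuracy is still 99.9%).
one_wrong = list(all_correct)
one_wrong[0] = confident_row(1, 0.999999)

loss_all_correct = mean_log_loss(labels, all_correct)
loss_one_wrong = mean_log_loss(labels, one_wrong)

print(f"all correct: {loss_all_correct:.4f}")
print(f"one wrong:   {loss_one_wrong:.4f}")
```

In this toy setup the single overconfident error more than doubles the average loss even though 999 of 1000 images are still classified correctly, which is why lowering confidence on uncertain predictions can matter more for the ranking than a small gain in accuracy.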