by Simon Hadfield
Abstract:
The aim of this thesis is to develop estimation and encoding techniques for 3D information, which are applicable in a range of vision tasks. Particular emphasis is given to the task of natural action recognition. This "in the wild" recognition favours algorithms with broad generalisation capabilities, as no constraints are placed on either the actor or the setting. This leads to huge intra-class variability, including changes in lighting, actor appearance, viewpoint and action style. Algorithms which perform well under these circumstances are generally well suited for real-world deployment, in applications such as surveillance, video indexing and assisted living. The issues of generalisation may be mitigated to a significant extent by utilising 3D information, which provides invariance to most appearance-based variation. In addition, 3D information can remove projective distortions and the effect of camera orientation, and provides cues for occlusion. The exploitation of these properties has become feasible in recent years, due to both the emergence of affordable 3D sensors, such as the Microsoft Kinect™, and the ongoing growth of 3D broadcast footage (including 3D TV channels and 3D Blu-ray). To evaluate the impact of this 3D information, and to provide a benchmark to aid future development, a large multi-view action dataset is compiled, covering 14 different action classes and comprising over an hour of high definition video. This data is obtained from 3D broadcast footage, which provides a broader range of variations than may feasibly be produced during staged capture in the lab. A large number of existing action recognition techniques are then implemented, and extensions formulated to allow the encoding of 3D structural information. This leads to significantly improved generalisation over standard spatiotemporal techniques. As an additional information source, the estimation of 3D motion fields is also developed.
Motion estimation in 3D is also referred to as "scene flow", to contrast with its image plane counterpart, "optical flow". However, previous work on scene flow estimation has been unsuitable for real applications, due to the computational complexity of the approaches proposed in the literature. Previous state-of-the-art techniques generally require several hours to estimate the motion of a single frame, rendering their use with datasets of reasonable size intractable. This, in turn, has led to the field of scene flow estimation often being viewed as an item of academic interest only. In this thesis, a new Monte Carlo based approach to motion estimation is proposed, which is not only several orders of magnitude faster (and amenable to a parallelised GPU implementation), but also provides improved accuracy by avoiding over-smoothing artefacts. The value of this particle-based approach is further augmented by a re-examination of the underlying assumptions in motion estimation theory. A deeper understanding of the behaviour of such systems is developed for both the optical flow and scene flow estimation scenarios. In particular, existing formulations are demonstrated to either require accurate initialisation or favour small motions (explaining the popularity of multi-scale estimation schemes). The most enlightening analysis, however, explores the idea that functions which accurately represent real data are not, by default, suitable for the detection of estimation errors. This leads to the proposal of a more robust, non-linear estimation scheme based on machine learning. This "Intelligent Transfer Function" is incorporated into existing motion estimation schemes (both single and multi-view), along with support for probabilistic occlusion handling and multi-hypothesis motion smoothing techniques. Finally, this fast and accurate approach to 3D motion estimation is exploited within the original task of natural action recognition.
A range of schemes is explored for effectively encoding this rich information, based on variations of the hugely popular HOG and HOF descriptors. By utilising the actor's undistorted motion field, action recognition rates are significantly improved over both the standard spatiotemporal approaches and their 3D structural extensions. This serves to demonstrate that, due to the new, more tractable formulation, in conjunction with the growth of 3D data, scene flow estimation may be a valuable tool for computer vision in the future.
Reference:
The estimation and use of 3D information, for natural human action recognition (Simon Hadfield), PhD thesis, University of Surrey, 2013. (Viva Presentation Slides)
Bibtex Entry:
@PhdThesis{Hadfield2013Estimation,
Title = {The estimation and use of {3D} information, for natural human action recognition},
Author = {Simon Hadfield},
School = {University of Surrey},
Year = {2013},
Month = {May},
Abstract = {The aim of this thesis is to develop estimation and encoding techniques for 3D information, which are applicable in a range of vision tasks. Particular emphasis is given to the task of natural action recognition. This ``in the wild'' recognition favours algorithms with broad generalisation capabilities, as no constraints are placed on either the actor or the setting. This leads to huge intra-class variability, including changes in lighting, actor appearance, viewpoint and action style. Algorithms which perform well under these circumstances are generally well suited for real-world deployment, in applications such as surveillance, video indexing and assisted living. The issues of generalisation may be mitigated to a significant extent by utilising 3D information, which provides invariance to most appearance-based variation. In addition, 3D information can remove projective distortions and the effect of camera orientation, and provides cues for occlusion. The exploitation of these properties has become feasible in recent years, due to both the emergence of affordable 3D sensors, such as the Microsoft Kinect{\texttrademark}, and the ongoing growth of 3D broadcast footage (including 3D TV channels and 3D Blu-ray). To evaluate the impact of this 3D information, and to provide a benchmark to aid future development, a large multi-view action dataset is compiled, covering 14 different action classes and comprising over an hour of high definition video. This data is obtained from 3D broadcast footage, which provides a broader range of variations than may feasibly be produced during staged capture in the lab. A large number of existing action recognition techniques are then implemented, and extensions formulated to allow the encoding of 3D structural information. This leads to significantly improved generalisation over standard spatiotemporal techniques. As an additional information source, the estimation of 3D motion fields is also developed.
Motion estimation in 3D is also referred to as ``scene flow'', to contrast with its image plane counterpart, ``optical flow''. However, previous work on scene flow estimation has been unsuitable for real applications, due to the computational complexity of the approaches proposed in the literature. Previous state-of-the-art techniques generally require several hours to estimate the motion of a single frame, rendering their use with datasets of reasonable size intractable. This, in turn, has led to the field of scene flow estimation often being viewed as an item of academic interest only. In this thesis, a new Monte Carlo based approach to motion estimation is proposed, which is not only several orders of magnitude faster (and amenable to a parallelised GPU implementation), but also provides improved accuracy by avoiding over-smoothing artefacts. The value of this particle-based approach is further augmented by a re-examination of the underlying assumptions in motion estimation theory. A deeper understanding of the behaviour of such systems is developed for both the optical flow and scene flow estimation scenarios. In particular, existing formulations are demonstrated to either require accurate initialisation or favour small motions (explaining the popularity of multi-scale estimation schemes). The most enlightening analysis, however, explores the idea that functions which accurately represent real data are not, by default, suitable for the detection of estimation errors. This leads to the proposal of a more robust, non-linear estimation scheme based on machine learning. This ``Intelligent Transfer Function'' is incorporated into existing motion estimation schemes (both single and multi-view), along with support for probabilistic occlusion handling and multi-hypothesis motion smoothing techniques. Finally, this fast and accurate approach to 3D motion estimation is exploited within the original task of natural action recognition.
A range of schemes is explored for effectively encoding this rich information, based on variations of the hugely popular HOG and HOF descriptors. By utilising the actor's undistorted motion field, action recognition rates are significantly improved over both the standard spatiotemporal approaches and their 3D structural extensions. This serves to demonstrate that, due to the new, more tractable formulation, in conjunction with the growth of 3D data, scene flow estimation may be a valuable tool for computer vision in the future.},
Comment = {<a href="slides/thesis_hadfield.zip">Viva Presentation Slides</a>},
File = {Hadfield2013Estimation.pdf:Hadfield2013Estimation.pdf:PDF},
Gsid = {14121923313490978618,11128339752378594507},
Keywords = {Scene Flow, 3D Motion, Motion Estimation, Unregularized, Action Recognition, Interest Point, Saliency, Hollywood, Particle, 3D, Hand Tracking, Sign Language, Tracking, Scene Particles, Occlusion Estimation, Probabilistic Occlusion, Occlusion, Bilateral Filter, Histogram of Scene-flow, HOS, Motion Features, 3D Tracking, Motion Segmentation},
Timestamp = {2013.05.09},
Url = {http://personalpages.surrey.ac.uk/s.hadfield/thesis_hadfield.pdf}
}