Traditional clustering algorithms are widely used to build bag-of-words (BOW) models that aggregate spatio-temporal feature points extracted from a video for human activity recognition. Their performance is restricted by computational complexity, which limits the number of feature points that can be used. In contrast, deep clustering yields good clustering performance without limiting the number of feature points. This work therefore proposes dual stacked autoencoders features embedded clustering (DSAFEC) and a BOW construction method based on DSAFEC (B-DSAFEC) to reduce the computational complexity and remove the restriction on feature point selection. DSAFEC first transforms the feature points extracted from a video into a learned feature space, and then predicts the cluster assignment probabilities of the feature points, which are used to build BOWs for human activity recognition. Soft clustering is used: each feature point is assigned to the multiple clusters with the largest probabilities, rather than to a single cluster as in hard clustering. Experimental results on three benchmark human activity datasets show that B-DSAFEC outperforms five reference methods based on either traditional or deep clustering.
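The soft-assignment step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that a matrix of cluster assignment probabilities (one row per feature point) is already available from some clustering model, and that each point contributes its probability mass to its `top_k` most probable clusters. The function name `soft_bow_histogram`, the parameter `top_k`, and the choice to accumulate probabilities (rather than counts) and normalize the histogram are all assumptions made for illustration.

```python
import numpy as np


def soft_bow_histogram(probs, top_k=2):
    """Build a normalized BOW histogram from soft cluster assignments.

    probs : array of shape (n_points, n_clusters); each row holds the
            cluster assignment probabilities of one feature point.
    top_k : hypothetical parameter; number of most probable clusters
            each point is assigned to (top_k=1 gives hard clustering).
    """
    probs = np.asarray(probs, dtype=float)
    n_points, n_clusters = probs.shape
    if not 1 <= top_k <= n_clusters:
        raise ValueError("top_k must be between 1 and n_clusters")

    # Column indices of the top_k largest probabilities in each row
    # (order within the top_k set does not matter here).
    top_idx = np.argpartition(probs, -top_k, axis=1)[:, -top_k:]

    # Assumption: each point adds its probability mass to the selected bins.
    rows = np.arange(n_points)[:, None]
    hist = np.zeros(n_clusters)
    np.add.at(hist, top_idx, probs[rows, top_idx])

    # Normalize so videos with different numbers of points are comparable.
    total = hist.sum()
    return hist / total if total > 0 else hist


# Toy example: three feature points, three clusters (made-up values).
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.5, 0.4, 0.1],
])
soft_hist = soft_bow_histogram(probs, top_k=2)
hard_hist = soft_bow_histogram(probs, top_k=1)
```

With `top_k=1` the sketch reduces to hard clustering, where each point affects a single bin; larger values spread each point over several bins, which is the distinction the abstract draws between soft and hard clustering.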
|Journal|IEEE Transactions on Circuits and Systems for Video Technology|
|Early online date|5 Feb 2021|
|Publication status|E-pub ahead of print - 5 Feb 2021|
- Bag-of-words (BOW)
- deep clustering
- human activity recognition