SAIVT-QUT@TRECVid 2012 : interactive surveillance event detection
Xu, Jingxin, Deman, Simon, Sridharan, Sridha, & Fookes, Clinton B. (2012) SAIVT-QUT@TRECVid 2012 : interactive surveillance event detection. In Over, Paul (Ed.) TRECVid 2012 Proceeding, NIST, Gaithersburg, USA.
In this paper, we propose an approach which attempts to solve the problem of surveillance event detection, assuming that we know the definition of the events. To facilitate the discussion, we first define two concepts. The event of interest refers to the event that the user requests the system to detect; and the background activities are any other events in the video corpus. This is an unsolved problem due to many factors as listed below:
1) Occlusions and clustering: The surveillance scenes which are of significant interest at locations such as airports, railway stations, shopping centers are often crowded, where occlusions and clustering of people are frequently encountered. This significantly affects the feature extraction step, and for instance, trajectories generated by object tracking algorithms are usually not robust under such a situation.
2) The requirement for real time detection: The system should process the video fast enough in both of the feature extraction and the detection step to facilitate real time operation.
3) Massive size of the training data set: Suppose there is an event that lasts for 1 minute in a video with a frame rate of 25fps, the number of frames for this events is 60X25 = 1500. If we want to have a training data set with many positive instances of the event, the video is likely to be very large in size (i.e. hundreds of thousands of frames or more). How to handle such a large data set is a problem frequently encountered in this application.
4) Difficulty in separating the event of interest from background activities: The events of interest often co-exist with a set of background activities. Temporal groundtruth typically very ambiguous, as it does not distinguish the event of interest from a wide range of co-existing background activities. However, it is not practical to annotate the locations of the events in large amounts of video data. This problem becomes more serious in the detection of multi-agent interactions, since the location of these events can often not be constrained to within a bounding box.
5) Challenges in determining the temporal boundaries of the events: An event can occur at any arbitrary time with an arbitrary duration. The temporal segmentation of events is difficult and ambiguous, and also affected by other factors such as occlusions.
Impact and interest:
Citation counts are sourced monthly from and citation databases.
These databases contain citations from different subsets of available publications and different time periods and thus the citation count from each is usually different. Some works are not in either database and no count is displayed. Scopus includes citations from articles published in 1996 onwards, and Web of Science® generally from 1980 onwards.
Citations counts from theindexing service can be viewed at the linked Google Scholar™ search.
Full-text downloads displays the total number of times this work’s files (e.g., a PDF) have been downloaded from QUT ePrints as well as the number of downloads in the previous 365 days. The count includes downloads for all files if a work has more than one.
|Item Type:||Conference Paper|
|Subjects:||Australian and New Zealand Standard Research Classification > INFORMATION AND COMPUTING SCIENCES (080000) > ARTIFICIAL INTELLIGENCE AND IMAGE PROCESSING (080100)|
|Divisions:||Current > Schools > School of Electrical Engineering & Computer Science
Current > QUT Faculties and Divisions > Science & Engineering Faculty
|Copyright Owner:||Copyright 2012 (please consult the authors).|
|Deposited On:||07 Mar 2013 05:49|
|Last Modified:||08 Feb 2014 12:16|
Repository Staff Only: item control page