Fast content-based file type identification

Ahmed, Irfan, Lhee, Kyung-Suk, Shin, Hyun-Jung, & Hong, Man-Pyo (2011) Fast content-based file type identification. In Shenoi, Sujeet & Peterson, Bert (Eds.) 7th Annual IFIP WG 11.9 International Conference on Digital Forensics, January 30 - February 2, 2011, Orlando, Florida.

Digital forensic examiners often need to identify the type of a file or file fragment based only on the content of the file. Content-based file type identification schemes typically use a byte frequency distribution with statistical machine learning to classify file types. Most algorithms analyze the entire file content to obtain the byte frequency distribution, a technique that is inefficient and time consuming. This paper proposes two techniques for reducing the classification time. The first technique selects a subset of features based on the frequency of occurrence. The second speeds classification by sampling several blocks from the file. Experimental results demonstrate that up to a fifteen-fold reduction in file size analysis time can be achieved with limited impact on accuracy.
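The two ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the block size, the number of blocks, the random block-selection strategy, and the "highest total frequency" selection rule are all assumptions made for illustration, and the classifier itself is omitted.

```python
import random
from collections import Counter


def sample_blocks(data: bytes, block_size: int, num_blocks: int, seed: int = 0) -> bytes:
    """Return num_blocks randomly chosen, block-aligned blocks of data, concatenated.

    Assumption: blocks are picked uniformly at random without replacement.
    If the input has no more than num_blocks blocks, it is returned unchanged.
    """
    total_blocks = (len(data) + block_size - 1) // block_size  # ceiling division
    if total_blocks <= num_blocks:
        return data
    rng = random.Random(seed)
    picks = sorted(rng.sample(range(total_blocks), num_blocks))
    return b"".join(data[i * block_size:(i + 1) * block_size] for i in picks)


def byte_frequency(data: bytes) -> list[float]:
    """Normalized byte frequency distribution: 256 relative frequencies."""
    counts = Counter(data)  # iterating over bytes yields integers 0..255
    n = len(data) or 1      # avoid division by zero on empty input
    return [counts.get(b, 0) / n for b in range(256)]


def select_frequent_bytes(training_bfds: list[list[float]], k: int) -> list[int]:
    """Pick the k byte values with the highest total frequency over a training set.

    Assumption: "frequency of occurrence" is read here as summed relative
    frequency; ties are broken by the smaller byte value.
    """
    totals = [sum(column) for column in zip(*training_bfds)]
    ranked = sorted(range(256), key=lambda b: (-totals[b], b))
    return sorted(ranked[:k])


def feature_vector(data: bytes, features: list[int],
                   block_size: int = 512, num_blocks: int = 8) -> list[float]:
    """Reduced feature vector: frequencies of the selected bytes in sampled blocks."""
    bfd = byte_frequency(sample_blocks(data, block_size, num_blocks))
    return [bfd[b] for b in features]
```

With this layout, the cost of building a feature vector depends on `block_size * num_blocks` rather than on the file length, and the classifier sees `k` features instead of 256. Those are the two sources of speed-up the abstract describes.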

Impact and interest:

6 citations in Scopus

Citation counts are sourced monthly from Scopus and Web of Science® citation databases.

These databases contain citations from different subsets of available publications and different time periods and thus the citation count from each is usually different. Some works are not in either database and no count is displayed. Scopus includes citations from articles published in 1996 onwards, and Web of Science® generally from 1980 onwards.

Citation counts from the Google Scholar™ indexing service can be viewed through a Google Scholar™ search.

ID Code: 41535
Item Type: Conference Paper
Refereed: Yes
Keywords: File type identification, File content classification, Byte frequency
ISBN: 9783642242113
Subjects: Australian and New Zealand Standard Research Classification > INFORMATION AND COMPUTING SCIENCES (080000) > COMPUTER SOFTWARE (080300) > Computer System Security (080303)
Divisions: Past > Institutes > Information Security Institute
Copyright Owner: Copyright 2011 Springer
Copyright Statement:

This is the author-version of the work.

Conference proceedings, published by Springer Verlag, will be available via SpringerLink.

Deposited On: 28 Aug 2011 22:26
Last Modified: 27 Jan 2012 03:24
