Consensus σ70 promoter prediction using hadoop

Hogan, James M., Kelly, Wayne A., & Newell, Felicity S. (2013) Consensus σ70 promoter prediction using hadoop. In Proceedings of the 2013 IEEE 9th International Conference on e-Science, IEEE, 22 - 25 October 2013, pp. 35-44.

MapReduce frameworks such as Hadoop are well suited to handling large sets of data which can be processed separately and independently, with canonical applications in information retrieval and sales record analysis. Rapid advances in sequencing technology have ensured an explosion in the availability of genomic data, with a consequent rise in the importance of large-scale comparative genomics, often involving operations and data relationships which deviate from the classical MapReduce structure. This work examines the application of Hadoop to patterns of this nature, using as our focus a well-established workflow for identifying promoters (binding sites for regulatory proteins) across multiple gene regions and organisms, coupled with the unifying step of assembling these results into a consensus sequence. Our approach demonstrates the utility of Hadoop for problems of this nature, showing how the tyranny of the "dominant decomposition" can be at least partially overcome. It also demonstrates how load balance and the granularity of parallelism can be optimized by pre-processing that splits and reorganizes input files, allowing a wide range of related problems to be brought under the same computational umbrella.
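
The consensus step described in the abstract can be pictured as a small map/reduce computation. The sketch below is illustrative only and is not the authors' implementation: it runs in plain Python rather than on Hadoop, it assumes the candidate promoter sites are already aligned and of equal length, and the function names (`map_site`, `reduce_column`, `consensus`) and the sample sites are hypothetical.

```python
from collections import Counter, defaultdict


def map_site(site):
    """Map step: emit one (position, base) pair per column of an aligned site."""
    for pos, base in enumerate(site.upper()):
        yield pos, base


def reduce_column(bases):
    """Reduce step: return the most frequent base observed in one column."""
    return Counter(bases).most_common(1)[0][0]


def consensus(sites):
    """Group mapped pairs by position (the shuffle), then reduce each column."""
    columns = defaultdict(list)
    for site in sites:
        for pos, base in map_site(site):
            columns[pos].append(base)
    return "".join(reduce_column(columns[pos]) for pos in sorted(columns))


if __name__ == "__main__":
    # Hypothetical aligned candidate sites, for illustration only.
    sites = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]
    print(consensus(sites))
```

Each site can be mapped independently, which is the part that parallelizes cleanly. The per-column reduction is the unifying step that needs results from every site, which is where a workflow of this kind departs from a purely independent decomposition.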

Impact and interest:

0 citations in Scopus

ID Code: 68694
Item Type: Conference Paper
Refereed: Yes
Keywords: Biology computing, Data handling, Genomics, Parallel programming, Proteins, Public domain software
DOI: 10.1109/eScience.2013.42
ISBN: 9780768550831
Divisions: Current > Schools > School of Electrical Engineering & Computer Science
Current > QUT Faculties and Divisions > Science & Engineering Faculty
Copyright Owner: Copyright 2013 by The Institute of Electrical and Electronics Engineers, Inc.
Deposited On: 17 Mar 2014 23:28
Last Modified: 19 Mar 2014 03:15
