Protein-DNA Interactions

The completion of the first human genome has ushered in the next era of proteomics, in which a map of every protein interaction will be carefully constructed from both experimental and computational data. One fundamental set of interactions comprises transcription factors and their targets. We propose a three-step computational protocol to predict all such interactions: 1) identify potential DNA-binding proteins; 2) identify potential binding sites on the protein; 3) identify potential binding sites on the DNA by sliding a protein structure along the DNA. Here, I focus on the first two steps.
Protein Structure Annotation
Transcription regulation is a fundamental biological process, and extensive efforts have been made to investigate its mechanisms through both biological experiments and computational modeling based on physicochemical principles. The resulting data are subsequently used to construct regulatory networks in order to investigate the underlying gene expression in the cell.
Transferring function from a protein of known function to a structurally similar protein of unknown function succeeds only two-thirds of the time. Combining structural features with physicochemical property features can therefore better discriminate proteins that bind DNA from those that do not. Given training examples of both classes, a machine learning method can build a model that combines such features to accurately predict the function of a specific protein [2,6].
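
As a minimal sketch of how a model might combine a structural feature with a physicochemical one, the following trains a plain logistic regression by gradient descent. It is not the method of [2,6]: the two feature names (structural_similarity, positive_surface_fraction), the toy values, and the choice of logistic regression are all assumptions made purely for illustration.

```python
import math

# Synthetic toy data, invented for illustration only (not real measurements).
# Each row: [structural_similarity, positive_surface_fraction] (hypothetical features).
TOY_FEATURES = [
    [0.8, 0.7], [0.9, 0.6], [0.7, 0.8],   # labeled DNA-binding
    [0.2, 0.3], [0.1, 0.4], [0.3, 0.2],   # labeled non-binding
]
TOY_LABELS = [1, 1, 1, 0, 0, 0]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(samples, labels, lr=0.5, epochs=5000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n = len(samples)
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(samples, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = p - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(weights, bias, x):
    """Return 1 (predicted DNA-binding) or 0 (predicted non-binding)."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if sigmoid(z) >= 0.5 else 0


weights, bias = train_logistic(TOY_FEATURES, TOY_LABELS)
print(predict(weights, bias, [0.85, 0.75]))
```

In practice the feature vector would hold many more structural and physicochemical descriptors, and performance would be assessed on held-out proteins rather than on the training set.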
Protein Sequence Annotation
Much of the available protein data exists in the form of sequence rather than structure. Furthermore, many structural elucidation studies provide initial low-resolution structures that are not suitable for structure annotation machine learning studies. Since sequence (almost) completely determines structure, it also determines function. To this end, features can be derived from the arrangement of physicochemical properties of a sequence and used to build a machine learning model to accurately predict whether a protein binds DNA. Such a method can be applied to proteome-wide analysis [8].
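
To make the idea of features derived from the arrangement of physicochemical properties concrete, here is a small sketch that maps each residue to a coarse property class and then records both the class composition and the frequency of each ordered pair of adjacent classes. The four-class grouping and the feature names are assumptions chosen for illustration; the features used in [8] may differ.

```python
from itertools import product

# Assumed coarse grouping of the 20 standard amino acids (one of many possible schemes).
GROUPS = {
    "positive": "KR",
    "negative": "DE",
    "polar": "STNQHCYG",
    "hydrophobic": "AVLIMFWP",
}
RESIDUE_TO_GROUP = {aa: name for name, aas in GROUPS.items() for aa in aas}


def sequence_features(seq):
    """Return composition and adjacent-pair frequencies of property classes."""
    # Non-standard letters (e.g. X) are skipped.
    classes = [RESIDUE_TO_GROUP[aa] for aa in seq.upper() if aa in RESIDUE_TO_GROUP]
    if not classes:
        raise ValueError("sequence contains no standard amino acids")

    features = {}
    # Composition: fraction of residues in each class.
    for name in GROUPS:
        features["comp_" + name] = classes.count(name) / len(classes)

    # Arrangement: how often class a is immediately followed by class b.
    pair_counts = {pair: 0 for pair in product(GROUPS, repeat=2)}
    for a, b in zip(classes, classes[1:]):
        pair_counts[(a, b)] += 1
    n_pairs = max(len(classes) - 1, 1)
    for (a, b), count in pair_counts.items():
        features["pair_" + a + "_" + b] = count / n_pairs
    return features


example = sequence_features("KKDA")
print(example["comp_positive"], example["pair_positive_negative"])
```

The resulting fixed-length vector (4 composition values plus 16 pair frequencies) can be computed for any sequence, which is what makes this kind of representation usable at proteome scale.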
Protein DNA-binding Site Identification
The DNA-binding sites on proteins are constrained by the structural and physical properties of DNA. These constraints may be used to predict the sites on a protein that bind DNA, with the predictive features derived from the structure-based and sequence-based properties of unbound protein structures. Note that a similar approach has been taken to predict protein-protein interfaces [3].
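
One physical property of DNA that constrains binding sites is the negative charge of its phosphate backbone, which favors positively charged regions of the protein. The sketch below is a deliberately simple, sequence-only heuristic built on that fact: it scores each residue by the fraction of lysine and arginine in a window around it. The window size, the threshold, and the function names are assumptions for illustration; a realistic predictor would also use structure-based features such as surface exposure.

```python
def positive_charge_profile(seq, half_width=2):
    """Score each position by the fraction of K/R residues in a centered window."""
    seq = seq.upper()
    scores = []
    for i in range(len(seq)):
        # The window is truncated at the sequence ends.
        window = seq[max(0, i - half_width): i + half_width + 1]
        scores.append(sum(aa in "KR" for aa in window) / len(window))
    return scores


def candidate_sites(seq, half_width=2, threshold=0.5):
    """Return 0-based positions whose window score meets the (assumed) threshold."""
    profile = positive_charge_profile(seq, half_width)
    return [i for i, score in enumerate(profile) if score >= threshold]


print(candidate_sites("AAAAKRKRKAAAA"))
```

For the made-up sequence above, the flagged positions are the five residues of the central lysine/arginine stretch.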