Greedy-proximal A* and Hybrid Spectral/Subspace Clustering for Molecular Dynamics Simulations

No Thumbnail Available
Syzonenko, Ivan
Journal Title
Journal ISSN
Volume Title
Middle Tennessee State University
Protein folding plays a crucial role in human biochemistry. Proteins are the building blocks for most of our tissues, help to transfer signals through the bloodstream and other fluids, and help to cure diseases. On the opposite side, pathogens and viruses also consist of proteins, which make our understanding of protein function a top priority to save and prolong human life. Even small changes in folding patterns may lead to serious diseases like Alzheimer’s or Parkinson’s where proteins are folded either too quickly or too slowly. The protein folding problem has been studied in the field of molecular biophysics for many years (Maximova et al., 2016; Scheraga et al., 2007), however many questions are still unanswered. Mainly they are: "what is the final conformation (3-dimensional shape or structure) given the primary sequence of amino acids? ", "how does the conformation change over time?", and "how does a protein’s secondary and tertiary conformation affect its functions?". Molecular dynamics (MD) is one of the tools used to understand how proteins fold into native conformations (Chen et al., 2008). It uses computational techniques to calculate the interactions of molecules. While it captures sequences of conformations that lead over time to the folded state, limitations in simulation timescales remain problematic (Klepeis et al., 2009). One of the limitations is the notion of "energy wells" (Liu et al., 2012) - conformations with low potential energy which reduce the probability to form other conformations and finally fold (reach the global minimum of the potential energy) within a computationally feasible timescale. The complete set of energies for all possible conformations is called the energy landscape (Liu et al., 2012; Wales, 2003). Although many approaches have been suggested to speed-up the simulation process using rapid changes in temperature or pressure, we propose a rational approach derived from path finding algorithms to explore the supposed shortest-path folding pathway. Such an algorithm should not only reduce the computational time needed to obtain the folded conformation without adding artificial energy bias, but also make it possible to generate trajectories which only contain motions needed for the folding transition. We introduce several new protein structure comparison metrics based on the contact map distance to help mitigate the challenges faced by "standard" metrics. We test our approach on proteins which represents the two main types of secondary structure: a) the Trp-Cage Miniprotein Construct TC5b (1L2Y) (Neidigh et al., 2002) which is a short, fast-folding protein that represents alpha-helical secondary structure formed because of a locked triptophan in the middle, b) the immunoglobulin binding domain of streptococcal protein G (1GB1) (Gronenborn et al., 1991), containing an alpha-helix and several beta-sheets and c) the chicken villin subdomain HP-35, N68H protein (1YRF) (Chiu et al., 2005) - one of the fastest folding proteins which forms three alpha-helices. We compare our algorithm to Replica-Exchange Molecular Dynamics (REMD) and Steered Molecular Dynamics (SMD) methods which represent the main algorithms used for folding proteins with MD. Another common application of MD simulations is for future experimental validation and energy landscape exploration for studying metastable conformations and the transi- tions between them (Phillips, 2012; Bowman and Pande, 2010). While the problem of capturing metastable states may often be successfully resolved within the timescale of the simulation, finding those states is often performed with automated techniques such as clustering (Bhowmik and Ramanathan, 2018; Sittel and Stock, 2016). Although there are many clustering algorithms available, not all of them can be successfully applied to high-dimensional data such as MD simulations (Steinbach et al., 2004). In particu- lar, recent work from the clustering literature (Sakuraba et al., 2010) shows that many high-dimensional data sets explore a mixture of independent subspaces and previous clustering studies of MD data have ignored such effects. In this study we explore the application of subspace clustering techniques to MD simulation data and compare their performance with traditional Spectral clustering (SPC) algorithms (Ng et al., 2002) and demonstrate when and why such approaches may be superior to traditional techniques.
Biochemistry, Computer science, Biophysics