Research

The Pharmaceutical Bioinformatics lab works on several projects dealing with the discovery of new drugs, the analysis of the mechanisms of known compounds, and the biosynthesis of natural compounds. Selected projects are described below.

Automated genome annotation of Streptomyces spp.

With the increasing number of genome sequences that have become available over the last few years, there is a high demand for genome annotation. Genome annotation is the process of attaching biological information to sequences of identified elements on the genome. Structural annotation includes the identification of the gene structure and coding regions, open reading frame prediction, and the localization of regulatory motifs. Functional annotation assigns biochemical and biological functions to genes. Automatic annotation tools perform these steps computationally. We use Galaxy as a framework for interactive large-scale genome analysis, which allows the integration of user-defined tools. In our current project we are working in cooperation with Pharmaceutical Biology & Biotechnology and Pharmaceutical & Medicinal Chemistry on the genome annotation of industrially relevant Streptomyces spp. These bacteria are able to produce a variety of bioactive compounds, e.g. antibiotics. Performing complete genome alignments will allow for a comparison of identified genes, products, and metabolites. The results will be used for constructing kinetic models to direct metabolic engineering.
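As an illustration of the open reading frame prediction step mentioned above, the following is a minimal sketch and not the pipeline used in the project. It scans only the forward strand, accepts only ATG as a start codon (Streptomyces genes also use alternative starts such as GTG), and the function name `find_orfs` and the parameter `min_codons` are assumptions made for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(sequence, min_codons=3):
    """Return (start, end, frame) tuples for ORFs on the forward strand.

    start is the index of the ATG, end is the index just past the stop
    codon, and frame is 0, 1 or 2. Simplified: ATG starts only.
    """
    seq = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # open a candidate ORF at the first ATG
            elif start is not None and codon in STOP_CODONS:
                # keep the ORF if it has enough codons before the stop
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, frame))
                start = None
    return orfs


if __name__ == "__main__":
    print(find_orfs("CCATGAAACCCGGGTAGTT"))
```

A real annotation workflow would also scan the reverse complement and use a gene model trained for high-GC genomes; this sketch only shows the basic idea of reading codons in each frame.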

Cheminformatics: Development of tools for the design of large chemical libraries and de novo compounds

We have implemented several tools in the workflow management system Galaxy and created a cheminformatics platform that includes physico-chemical filters, chemical converters, clustering methods, similarity- and substructure-based searches, fragmentation techniques, etc. Using this platform we have generated, for example, very large datasets of molecules and de novo compounds that we use further in our virtual screening projects.
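A similarity-based search of the kind listed above can be sketched with the Tanimoto coefficient on binary fingerprints. This is a generic illustration, not the Galaxy tool itself: the fingerprints are represented as hypothetical sets of "on" bit positions, and the names `tanimoto`, `similarity_search` and the example molecules are assumptions.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    if not union:
        return 0.0  # two empty fingerprints: define similarity as 0
    return len(a & b) / len(union)


def similarity_search(query_fp, library, threshold=0.5):
    """Return (name, score) pairs with score >= threshold, best first."""
    scored = ((name, tanimoto(query_fp, fp)) for name, fp in library.items())
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical fingerprints; real ones would come from a toolkit.
    library = {
        "mol_a": {1, 2, 3, 4, 5},
        "mol_b": {1, 2, 9},
        "mol_c": {7, 8},
    }
    print(similarity_search({1, 2, 3, 4}, library))
```

In practice the fingerprints would be computed by a cheminformatics toolkit, and the threshold depends on the fingerprint type.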

CoRS – an Automated Compound Research System

Scientific findings in international journals exist, in most cases, in the form of digital texts. In general, these texts are unstructured. Context-based information often has to be extracted manually, because computational applications lack the ability to process semantic data (i.e. data with related and linked meaning) automatically. Only experts in their field are able to interpret and organise the retrieved data. However, extracting this information and structuring the resulting knowledge is important for identifying potential new drugs.

Thus, a Compound Research System (CoRS) will be developed to extract information automatically from different articles and to display how the findings are related. CoRS is funded by the German Research Foundation (DFG, Lis45).

Several tools exist for extracting information about proteins and genes from the literature, but algorithms able to find chemical structures in free text are scarce (Hettne et al., 2009). Frequently, information about a compound's interaction partners can be deduced from similar molecules by using structural descriptors. However, determining a compound's structure may be challenging when the compound identifier is unspecific. Proteins that potentially interact with a given compound can be identified by searching for co-occurrences in all PubMed abstracts. These issues have been addressed in the preliminary work Compounds In Literature (CIL, Grüning et al., 2011).
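The co-occurrence idea described above can be sketched as follows. This is a simplified illustration and not the CIL implementation: it uses plain case-insensitive substring matching, and the function name `count_cooccurrences` as well as the example sentences and term lists are assumptions made for this example.

```python
from collections import Counter
from itertools import product


def count_cooccurrences(abstracts, compounds, proteins):
    """Count in how many abstracts each (compound, protein) pair co-occurs."""
    counts = Counter()
    for text in abstracts:
        lowered = text.lower()
        found_compounds = [c for c in compounds if c.lower() in lowered]
        found_proteins = [p for p in proteins if p.lower() in lowered]
        # every compound/protein pair found in this abstract counts once
        counts.update(product(found_compounds, found_proteins))
    return counts


if __name__ == "__main__":
    # Made-up example sentences standing in for PubMed abstracts.
    abstracts = [
        "Aspirin irreversibly inhibits COX-1 in platelets.",
        "COX-1 and COX-2 are both targeted by aspirin.",
        "Caffeine is an antagonist of the adenosine receptor.",
    ]
    compounds = ["aspirin", "caffeine"]
    proteins = ["COX-1", "COX-2", "adenosine receptor"]
    for pair, n in count_cooccurrences(abstracts, compounds, proteins).most_common():
        print(pair, n)
```

Substring matching is only a stand-in here; resolving synonyms and unspecific compound identifiers is the hard part that a real system has to handle.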

The complementary step in analysing co-occurrences in the literature was the discovery of compounds that interact with a particular protein. The recently published system Protein-Literature Investigation for Interacting Compounds (prolific) also supports the identification of similar proteins by sequence comparison (Senger et al., 2012).

A refined search engine will reveal not only the interacting compounds and proteins, but also the type of interaction, such as induction, activation, or inhibition.

Thus, the structure of compound-protein relationships will be investigated and modelled with a specialized, controlled vocabulary or an ontology, combined with different techniques of natural language processing. This encompasses carefully selected, adapted, and newly developed text and data mining methods, including machine learning with clustering, decision trees, support vector machines, and Bayesian classifiers. These algorithms are able to reveal and exploit the semantics of words and phrases describing compound-protein relationships.
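To make the idea of a Bayesian classifier for relationship types concrete, here is a minimal multinomial naive Bayes sketch with add-one smoothing. It is an illustration under stated assumptions, not the CoRS method: the training sentences, the labels, and the function names `train_naive_bayes` and `classify` are all made up for this example, and tokenisation is a plain whitespace split.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """Train on (sentence, label) pairs; returns the counts needed to classify."""
    word_counts = defaultdict(Counter)  # label -> word frequencies
    label_counts = Counter()            # label -> number of sentences
    vocabulary = set()
    for sentence, label in examples:
        tokens = sentence.lower().split()
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocabulary.update(tokens)
    return word_counts, label_counts, vocabulary


def classify(sentence, model):
    """Return the label with the highest log-posterior for the sentence."""
    word_counts, label_counts, vocabulary = model
    total_examples = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_examples)
        total_words = sum(word_counts[label].values())
        for token in sentence.lower().split():
            if token not in vocabulary:
                continue  # ignore words never seen in training
            # add-one (Laplace) smoothing
            score += math.log((word_counts[label][token] + 1)
                              / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    training = [
        ("aspirin inhibits cyclooxygenase", "inhibition"),
        ("the compound blocks and inhibits the kinase", "inhibition"),
        ("forskolin activates adenylate cyclase", "activation"),
        ("the ligand activates and stimulates the receptor", "activation"),
    ]
    model = train_naive_bayes(training)
    print(classify("the drug inhibits the enzyme", model))
```

A production system would need far more training data, proper tokenisation, and entity recognition before the classification step.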

The application of these approaches will be extended from abstracts to full articles. Processing the large data volumes and keeping CoRS up to date will require distributed computing.

About 975,000 PubMed abstracts were published in 2011. CoRS aims to capture this rapidly growing body of knowledge efficiently by gathering, classifying, and visualising information that would otherwise remain undiscovered. Furthermore, CoRS helps to generate hypotheses by connecting findings for deductive reasoning.

The most recent research related to CoRS resulted in StreptomeDB, a resource for natural compounds isolated from Streptomyces species (Lucas et al., 2012). This database was generated through automatic text mining of thousands of articles from PubMed, followed by manual curation.

The next milestone in this project includes the implementation of several new features in CIL and prolific as well as a CoRS portal combining these resources.

CIL webserver, prolific, StreptomeDB

Compounds In Literature (CIL): screening for compounds and relatives in PubMed. Grüning BA, Senger C, Erxleben A, Flemming S, Günther S. Bioinformatics 27:1341-2. Epub 2011 Mar 16.

Mining and Evaluation of Molecular Relationships in Literature. Senger C, Grüning BA, Erxleben A, Döring K, Patel H, Flemming S, Merfort I, Günther S. Bioinformatics 28:709-14. Epub 2012 Jan 13.

StreptomeDB: a resource for natural compounds isolated from Streptomyces species. Lucas X, Senger C, Erxleben A, Grüning BA, Döring K, Mosh J, Flemming S, Günther S. Nucleic Acids Res. 2012 Nov 28. [Epub ahead of print]

DNA Methylation

Methylation of cytosines within CpG dinucleotides is a common epigenetic DNA modification and may arrest cells in a pathogenic state in complex disorders such as cancer or rheumatoid arthritis. CpGs occur mainly in clusters called CpG islands (CPIs), which are present in the promoter regions of nearly 70% of human genes. The Illumina HumanMethylation450 BeadChip platform provides genome-wide coverage of 485,577 CpGs. Analysis of these CpGs reveals a correlation between changes in DNA methylation and gene expression, although not all sites have the same impact.

The methylation state of a single CpG or a whole CPI may influence the expression of the corresponding gene through the binding of methyl-CpG-binding domain (MBD) proteins and other methylation-dependent proteins. To identify CpGs that influence gene expression, as well as common methylation patterns, we used several approaches, e.g. network analysis and machine learning techniques.
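One simple way to look for CpGs whose methylation tracks gene expression is to correlate per-sample methylation levels with expression values. The sketch below is an illustration only and not the analysis used in the project: it assumes the common beta-value definition M / (M + U + offset) with an offset of 100, uses Pearson correlation, and the CpG identifiers, the numbers, and all function names are made up for this example.

```python
from math import sqrt


def beta_value(methylated, unmethylated, offset=100):
    """Methylation level in [0, 1) from the two probe intensities."""
    return methylated / (methylated + unmethylated + offset)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / sqrt(var_x * var_y)


def rank_cpgs(betas_by_cpg, expression):
    """Sort CpGs by absolute correlation with expression, strongest first."""
    scored = [(cpg, pearson(betas, expression))
              for cpg, betas in betas_by_cpg.items()]
    return sorted(scored, key=lambda item: abs(item[1]), reverse=True)


if __name__ == "__main__":
    # Hypothetical beta values for three samples and one gene's expression.
    betas_by_cpg = {"cg_example_a": [0.1, 0.5, 0.9],
                    "cg_example_b": [0.5, 0.4, 0.5]}
    expression = [9.0, 5.0, 1.0]
    for cpg, r in rank_cpgs(betas_by_cpg, expression):
        print(cpg, round(r, 3))
```

With 485,577 CpGs, a real analysis would additionally need multiple-testing correction and many more samples than this toy example.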

Drug discovery: in silico screening for the identification of novel small molecules interfering with protein-protein interactions

Motivation
Disease-relevant intracellular protein-protein interactions occurring at defined cellular sites have great potential as drug targets. They allow for highly specific pharmacological interference with defined cellular functions. Drugs targeting such interactions are likely to act with fewer side effects than conventional medication that influences whole-cell functions.

Methods
We use virtual screening techniques to identify small molecules capable of interfering with protein-protein interactions. After automated docking, several methods are available for the rational selection of candidate molecules. Candidates selected in this way are subsequently tested experimentally to determine their in vitro activity. We focus on targets from the field of epigenetics.

Prediction of drug-metabolizing enzymes by support vector machines

Machine learning approaches
A variety of methods exist to classify and predict biological properties of chemical compounds, e.g. principal component analysis, partial least squares, artificial neural networks, evolutionary algorithms, and support vector machines (SVMs). SVMs are models for linear and non-linear classification and regression. They find the hyperplane that separates the samples of two classes in a training set with the maximum margin. If the samples are not linearly separable, they are mapped to a high-dimensional "feature" space in which a hyperplane can separate the classes linearly. To form a set of meaningful descriptors for classification, different characteristics of the chemical compounds, such as size, shape, surface, and ring counts, have to be computed.
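The maximum-margin idea can be illustrated with a small linear soft-margin SVM trained by sub-gradient descent on the hinge loss. This is a didactic sketch covering only the linear case (no kernel mapping), not the models used in the project; the two-dimensional "descriptor" vectors, the hyperparameters, and the function names are assumptions made for this example.

```python
def train_linear_svm(samples, labels, lam=0.01, learning_rate=0.1, epochs=200):
    """Minimise lam/2 * |w|^2 + hinge loss by per-sample sub-gradient steps.

    samples: list of equally long numeric tuples; labels: +1 or -1.
    Returns the weight vector w and the bias b of the hyperplane w.x + b = 0.
    """
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # sample inside the margin: regularise and pull towards it
                w = [wi - learning_rate * (lam * wi - y * xi)
                     for wi, xi in zip(w, x)]
                b += learning_rate * y
            else:
                # sample correctly classified with margin: only regularise
                w = [wi - learning_rate * lam * wi for wi in w]
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1


if __name__ == "__main__":
    # Hypothetical two-descriptor compounds of two classes.
    samples = [(2, 2), (3, 3), (2, 3), (-2, -2), (-3, -3), (-2, -3)]
    labels = [1, 1, 1, -1, -1, -1]
    w, b = train_linear_svm(samples, labels)
    print(predict(w, b, (4, 4)), predict(w, b, (-4, -4)))
```

For data that are not linearly separable, the same optimisation is carried out in a kernel-induced feature space, which established SVM libraries handle.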

Application
Cytochromes P450 (CYPs) account for ~75% of all drug-metabolizing enzymes. Predicting the affinity between enzymes and drugs is important for the prevention of drug-drug interactions, which are frequently caused by multiple medications in elderly or intensive care unit patients. Databases like SuperCYP provide classified, comprehensive datasets on CYPs, metabolized drugs, and enzyme or drug structures. Using these data to calculate descriptor sets enables the creation of training and test sets as well as prediction models for compounds whose metabolism is not yet known.

Systems biology modelling of Streptomycetes

Streptomycetes are important bacteria for producing natural drugs such as daptomycin and tacrolimus. Unfortunately, they produce these valuable compounds only in low yields. To increase the production of the desired products, this project applies a systems biology approach. The wealth of available genetic and biological information, together with systems metabolic engineering tools, can help to engineer an overproducer of important secondary metabolites of the genus Streptomyces.

Flux balance analysis (FBA) is a mathematical approach for studying biochemical networks and will be applied in particular to genome-scale metabolic network reconstructions. These network reconstructions contain all known metabolic reactions of an organism and the genes that encode each enzyme. The resulting model can then be used to calculate the flow rates (fluxes) of all reactions for an optimal production of secondary metabolites.
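For orientation, FBA is usually written as a linear programme. The formulation below is the generic textbook form, not a description of the specific model built in this project; the symbols are assumptions introduced here: S is the m × n stoichiometric matrix (m metabolites, n reactions), v the vector of fluxes, c the objective weights (for example, 1 for the reaction producing the target metabolite and 0 elsewhere), and v^min, v^max the lower and upper flux bounds.

```latex
\begin{aligned}
\max_{v \in \mathbb{R}^{n}} \quad & c^{\top} v \\
\text{subject to} \quad & S\,v = 0, \\
& v_j^{\min} \le v_j \le v_j^{\max}, \qquad j = 1, \dots, n
\end{aligned}
```

The constraint S v = 0 expresses the steady-state assumption that no internal metabolite accumulates, and the bounds encode reaction reversibility and uptake limits.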

The project involves multiple steps. Among these, the genome annotation of, e.g., Streptomyces sp. Tü6071 has already been accomplished by the research group Pharmaceutical Bioinformatics (Erxleben et al., 2011). A further step is gap-filling, which involves inserting reactions from a reference database, reversing the direction of reactions, and adding reactions in order to complete the metabolic reaction network.

The goal of this project is to obtain a validated metabolic model that can guide the rational metabolic engineering of an overproducer of important antibiotics.
