Platform for the reconstruction of biological integrated systems: an application to bacterial transport systems

Action Incitative Bioinformatique

Gwennaele Fichant

Participants

Gwennaele Fichant
Yves Quentin

Laboratoire d’Informatique Fondamentale de Marseille

Cécile Capponi

Action HELIX INRIA-Rhône Alpes

Michel Page
Danielle Ziébelin

with the contribution of
Alain Viari and Anne Morgat

Institut de Mathématiques de Luminy

Alain Guénoche


Objectives of the project

Many biological functions rely upon several protein partners interacting effectively in permanent or sporadic supramolecular assemblies, called integrated systems. The aim of the project was to develop automatic strategies to identify, assemble and classify such systems in completely sequenced bacterial genomes and to represent the relationships among the partners. The work was first validated on a given integrated system - the ABC transporters - before being generalized.

Results

We have implemented a general automated strategy for the identification and assembly of integrated biological systems. It was first applied to ABC transporters. The difficulty in identifying, reconstructing and classifying these systems lies not in the number of partners, which is five, but in the fact that some of them show very little sequence conservation and in the large number of transporters encoded by a genome. Indeed, their genes form one of the largest families of paralogous genes.

Our strategy is developed in a framework based on Perl scripts and a database management system (ACeDB). It can be pictured as follows.

Image1

To reconstruct the integrated systems, different steps were identified and implemented in separate modules. The first two do not interact with the database. The files created after data extraction are used by the "Partner identification" module. All other modules interact with the database, and the direction of each arrow indicates whether the module feeds data into the database or reads data needed for the next steps of the analysis. The partner identification module is central and integrates different bioinformatic methods.

The diagram below illustrates the strategy used to identify ABC domains in a new proteome.

identification

It is decomposed into two steps: a learning step (upper part) and a test step (lower part). The learning step computes the parameters of the methods: the profiles for PSI-BLAST, and the motifs and the hidden Markov model (HMM) for Meta-MEME. The profiles are built from alignments provided by ClustalW. The motifs are discovered by MEME and either used directly by MAST for prediction or modeled as an HMM with the HMM tools of the Meta-MEME suite. Predictions are obtained by applying each method independently to the proteome to be annotated. For similarity searches with the BLASTP2 program, each protein of the proteome is used as a query against a domain database derived from ABCdb. The methods yield a list of ABC partner candidates, which are submitted to a BLASTP2 search against the complete proteomes of the bacteria from the learning set. This procedure, called back-blast, removes the false positives, that is, the queries whose best hits are sequences unrelated to ABC proteins. The validated partners are assembled into transporters and entered in the database. A comparison of the performance of the different methods can be found in Quentin et al. 2002.

The strategies developed must use rules (method layout) and parameters updated with the incoming data (dataflow control). The analysis mechanism stores the data obtained, then reuses them to reevaluate the parameters of the methods and thereby launch more accurate analyses of new data sets. We thus face a recursive process whose data flow is quite hard to control, notably because it involves jointly managing the types of relationships among data and the way the methods are laid out. Also, even if the individual proteins involved in an integrated system can easily be stored in a database, representing the complex relationships among them requires more sophisticated data-processing and semantic tools.
Therefore, to comprehensively identify, assemble, classify, and store integrated systems, a knowledge representation system appears more suitable than a database.
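
The back-blast filter described above can be illustrated with a minimal Python sketch. It is not the Perl implementation used in the project: the function names, the protein identifiers and the idea of pooling the candidates of the different methods by a simple union are assumptions made for the example, and the best hits are supplied as a ready-made table rather than computed by BLASTP2.

```python
from typing import Dict, Optional, Set


def merge_candidates(per_method: Dict[str, Set[str]]) -> Set[str]:
    """Pool the candidates predicted by each method (assumption: simple union)."""
    return set().union(*per_method.values())


def back_blast_filter(
    candidates: Set[str],
    best_hit: Dict[str, Optional[str]],
    known_abc: Set[str],
) -> Set[str]:
    """Keep the candidates whose best hit in the learning-set proteomes
    is a protein already known to be an ABC partner."""
    validated = set()
    for protein in candidates:
        hit = best_hit.get(protein)  # None when the search returned no hit
        if hit is not None and hit in known_abc:
            validated.add(protein)
    return validated


# Toy data: all identifiers are hypothetical.
per_method = {
    "psiblast": {"p1", "p2"},
    "mast": {"p2", "p3"},
    "hmm": {"p4"},
}
best_hit = {"p1": "ref_abc1", "p2": "ref_abc2", "p3": "ref_kinase"}
known_abc = {"ref_abc1", "ref_abc2"}

candidates = merge_candidates(per_method)
validated = back_blast_filter(candidates, best_hit, known_abc)
print(sorted(validated))  # ['p1', 'p2']
```

In this toy run, p3 is rejected because its best hit is unrelated to ABC proteins, and p4 because it has no hit at all.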

AROM, an object-based knowledge representation system (OBKRS), fits this objective in several ways:

image2

We designed ABCkb to encompass both the description of ABC transporters and the classification of i) the proteins involved in these systems according to their domains and ii) the systems as import or export systems, depending on the subfamily they belong to. ABCkb therefore includes two distinct but linked parts that can be treated independently. A first version of this knowledge base has been published (Capponi et al., 2001). The current version of ABCkb covers 52 complete prokaryotic genomes and contains 13556 objects and 21344 tuples; its schema consists of 10 top classes, some of which are specialized into more specific classes. The top classes are linked through 12 top-associations, which are specialized in turn along with the class specializations.

The internal classification mechanism of AROM is an important feature for handling biological information. Indeed, knowledge grows fast in biological research, and frequent updates require revising the data and the knowledge base schema as well. Yet schema evolution is a major problem in database technology, specifically ensuring schema consistency and relocating the objects. Even if schema evolution is not fully managed in AROM, the classification mechanism facilitates object relocation.

AROM offers an original classification scheme that concerns not only objects (also called instances of a class) but also tuples (instances of an association). Both classes and associations provide necessary conditions for membership by way of typed variables and roles. Although reliable, AROM's classification algorithm was not sufficient. Indeed, the classification of objects in a knowledge representation system should actually be a cycle:

AROM's first classification algorithm performed only the first two steps of this cycle, since it did not automatically carry on with potentially related object and tuple classification tasks. Considering the whole schema of a knowledge base, this means that the possible consequences of an object classification were not exploited: a newly classified object o did not lead to the classification of the objects linked to o through one or more tuples. Recursively applying such automatic propagation of classification results along associations requires a formal study of many specific cases, e.g. the nature of the dependencies among objects. We developed such a propagation algorithm and applied it to the knowledge stored in ABCkb. Using roles as paths among objects, the algorithm propagates the classification of a single object to the other objects of the knowledge base that are linked to it (Chabalier et al., 2002).
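
The idea of propagating a classification along associations can be illustrated with a small worklist sketch in Python. It is not the algorithm of Chabalier et al. (2002) and it does not use the AROM API: the class names, the `refine` rules and the representation of tuples as a plain neighbour table are assumptions made for the example. The sketch also assumes that `refine` only ever moves an object to a more specific class, which is what guarantees termination here.

```python
from collections import deque
from typing import Callable, Dict, List, Optional

Classes = Dict[str, str]            # object -> current class
Neighbours = Dict[str, List[str]]   # object -> objects linked to it by tuples


def propagate(
    start: str,
    classes: Classes,
    neighbours: Neighbours,
    refine: Callable[[str, Classes, Neighbours], Optional[str]],
) -> List[str]:
    """Starting from a newly classified object, try to reclassify the objects
    linked to it, and continue from every object whose class changed."""
    changed = []
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for other in neighbours.get(current, []):
            new_class = refine(other, classes, neighbours)
            if new_class is not None and new_class != classes[other]:
                classes[other] = new_class
                changed.append(other)
                queue.append(other)  # its own neighbours may now be refined
    return changed


def refine(obj: str, classes: Classes, neighbours: Neighbours) -> Optional[str]:
    """Hypothetical rules: return a more specific class, or None if none applies."""
    linked = {classes[n] for n in neighbours.get(obj, [])}
    if classes[obj] == "System" and "NBD_import_subfamily" in linked:
        return "ImportSystem"
    if classes[obj] == "Protein" and "ImportSystem" in linked:
        return "ImportPartner"
    return None


classes = {"nbd1": "NBD_import_subfamily", "sys1": "System", "msd1": "Protein"}
neighbours = {"nbd1": ["sys1"], "sys1": ["nbd1", "msd1"], "msd1": ["sys1"]}

print(propagate("nbd1", classes, neighbours, refine))  # ['sys1', 'msd1']
```

In this toy knowledge base, classifying the protein nbd1 first refines the system it belongs to, which in turn refines the other protein attached to that system.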

Querying the knowledge base can be done either with an algebraic modeling language or by writing Java programs. To help users build queries, a Web interface based on servlets and JSP has been developed.

Conclusion

The automated strategy developed has allowed the analysis of 96 prokaryotic genomes, including 16 archaea. A new version of ABCdb is currently being released (http://www.lcb.cnrs-mrs.fr/~quentin/), as we decided to keep the database under ACeDB for Web distribution. However, we might soon use the WebAROM module, recently developed at INRIA, which is a knowledge base server for the Web.

This strategy is now being generalized to other integrated systems such as the Tat secretion pathway and the two-component systems. The latter raise the same identification and reconstruction problems as the ABC transporters, i.e., weak sequence conservation of some domains and membership of one of the largest gene families. In addition, the two-component systems and ABC transporters are, in many cases, functionally linked (Joseph et al., 2002). Therefore, the knowledge base ABCkb has evolved to integrate the data concerning both systems, giving rise to ISYMOD (Integrated SYstem MODeling).

However, to obtain a fully functional knowledge acquisition system, the strategies used to identify and reconstruct the integrated systems must be declaratively modeled as task organizations. A task is an encapsulated problem whose input and output are objects or tuples of the knowledge base, and which is associated with several solving methods. The AROM tasks module gives the user a specialization relation over tasks. In ISYMOD, a task may correspond to a biological problem (e.g. the comparison of two proteins) and is then associated with several bioinformatic solving methods (for example, Blast). A strategy comprises a specified task layout and facilities for selecting the appropriate method for solving a task depending on the context of execution. Such an environment, closely coupled with the declarative knowledge, makes it possible to integrate all knowledge, whether factual or methodological, within the same knowledge base.
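
The notion of a task associated with several solving methods, one of which is selected according to the execution context, can be sketched in Python as follows. This does not reflect the AROM tasks module: the `Task` and `Method` classes, the context key `same_length` and the two toy comparison functions are assumptions made for the example, and the toy functions merely stand in for real tools such as Blast.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class Method:
    name: str
    applies: Callable[[Dict[str, Any]], bool]  # is the method usable in this context?
    run: Callable[[str, str], float]


@dataclass
class Task:
    name: str
    methods: List[Method] = field(default_factory=list)

    def solve(self, context: Dict[str, Any], a: str, b: str) -> Tuple[str, float]:
        """Run the first method whose applicability test accepts the context."""
        for method in self.methods:
            if method.applies(context):
                return method.name, method.run(a, b)
        raise LookupError(f"no method of task {self.name!r} fits the context")


def identity_fraction(a: str, b: str) -> float:
    """Fraction of identical positions; meaningful for equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def shared_kmer_fraction(a: str, b: str, k: int = 2) -> float:
    """Jaccard index of the k-mer sets of the two sequences."""
    ka = {a[i:i + k] for i in range(len(a) - k + 1)}
    kb = {b[i:i + k] for i in range(len(b) - k + 1)}
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0


compare = Task(
    "compare_two_proteins",
    [
        Method("identity", lambda ctx: ctx["same_length"], identity_fraction),
        Method("shared_kmers", lambda ctx: True, shared_kmer_fraction),
    ],
)

print(compare.solve({"same_length": True}, "ACDE", "ACDF"))  # ('identity', 0.75)
```

Ordering the methods from the most specific to the most general keeps the selection rule simple; a fuller design would also express the specialization relation over tasks mentioned above.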

Publications supported by the program

Refereed proceedings

Communications and posters