* Ordonner des grandes phylogenies*** (***stage M2*)

*Ordonner des grandes phylogenies*

*stage M2*)

*Encadrants :** ***Fabio Pardi****, ****Céline Scornavacca, Vincent Berry**

*Contexte et description du stage :*

Phylogenetics aims at clarifying the evolutionary relationships that exist among different organisms, represented by phylogenetic trees or phylogenies. A phylogeny consists of nodes connected by branches (see the figure for an example). Terminal nodes are called* leaves* and represent organisms for which data has been collected. Internal nodes represent hypothetical ancestors since they cannot be directly observed. In rooted phylogenetic trees, each internal node represents the most recent common ancestor of its descendants and the only node with no ancestors is called the *root* of the tree.

Leaves can have several kinds of information associated to them. For example in phylogeography, each leaf may be labelled with the country where the organism was sampled (like in the figure above, where colors represent geographical origins). Similarly, in medicine, when studying the spreading of epidemics, leaves may be labelled with the individuals that carried the pathogen under study. When the data has been collected at different moments in time (e.g. when studying viruses like Influenza) labels can even be temporal and thus have different degrees of similarity between them (obviously 2012 is more related to 2010 than to 1984).

Automatic tools for ordering the leaves of a phylogenetic tree so that the leaves that share the same (or similar) labels are as ‘contiguous’ as possible in the drawing of the tree are highly desirable for biologists. Indeed, this can help them to discern underlying biological structures or increase the visual clarity of information such as the common phylogenetic origin of a morphological trait, or the spread of an infection within a restricted group of people, etc.

Biologists often have phylogenies with thousands/hundreds of thousands of leaves and cannot handle the drawing of the tree by hand by rotating each pair of subtrees around their root node. The subject of this stage is to propose fast algorithms to choose the best possible organization of the leaves of the tree, in order to obtain the best graphical display of the phylogenetic tree, given a labeling of the leaves. More formally, given a rooted tree *T* with *n* leaves, one can define a* linear ordering consistent with T as* any of the *2^(n-1) *permutations of the leaves of *T* generated by ‘ﬂipping’ internal nodes in *T* [1]. Once an objective function *f* is defined, where *f(π)* measures the degree of clustering of similar labels in a linear ordering *π*, the goal is to find a linear ordering consistent with *T* that maximises .

A *O(n^4)* algorithm that maximizes the sum of the similarities between each pair of consecutive leaves has already been proposed [1, implemented in 2]. The student will start by exploring the literature in search of works similar to this one, and by evaluating the strengths and weaknesses of the objective functions chosen by different authors, including the hardness of optimizing them. He/she may also propose novel objective functions. The ultimate goal is the design and implementation of an algorithm for this task that is faster than the existing ones.

*Mots clés ***:**** **algorithms, phylogenomics, tree drawing, programming

*Collaborations ***:**** **the developed method will be integrated to Dendroscope 3 [3] and Phylotype [4].

*Bibliographie **:*

[1] Fast optimal leaf ordering for hierarchical clustering. Bar-Joseph Z, Gifford DK, Jaakkola TS. Bioinformatics. Vol. 17 Suppl 1 (2001), pages S22-S29.

[2] PermutMatrix: a graphical environment to arrange gene expression profiles in optimal linear order. Caraux G and Pinloche S. Vol. 21 no. 7 (2005), pages 1280–1281.

[3] Dendroscope 3 : An Interactive Tool for Rooted Phylogenetic Trees and Networks. Huson DH et Scornavacca C. Systematic Biology. Vol. 61 no. 6 (2012), pages 1061–1067.

[4] Searching for virus phylotypes. Chevenet F, Jung M , Peeters M, de Oliveira T and Gascuel O. (2013), doi:10.1093/bioinformatics/btt010.