Category Archives: Uncategorized

Internship at University of Montpellier (semester one 2016)

 Scaling up phylogenetic networks to genome-size data  


Supervisors : Céline ScornavaccaVincent Berry

Context: Reconstructing the evolutionary histories of organisms is of great importance for biologists aiming at clarifying the evolutionary relationships of organisms. This is a crucial step towards tackling major issues, such as characterizing the mosaic structure in plant genomes. Computer science has an important role to play here as evolutionary relationships are displayed by trees and graphs, which can be inferred from biological data by various algorithms, once a model has been chosen. Even though selected models are mathematically simple, most combinatorial optimization problems following them are NP-hard. The research process also involves going back and forth between a theoretical study and empirical results on real data. This allows to gain new insights in the proposed models and algorithms and ultimately to bring new knowledge to biologists.

Work to be done: Evolutionary relationships are traditionally displayed in tree-shaped diagrams called phylogenetic trees, where leaves are labelled by currently existing species. However, trees cannot represent evolutionary events such as recombination and hybridization, in which two lineages combine to form a new lineage. Single-rooted direct acyclic graphs (DAGs), also called here “phylogenetic networks”, can be used to display such complex evolutionary scenarios. The Genome Harvest project aims at the production of such networks for several plants of interest (rice, banana, citrus, tomato, …).

The main work of the internship will consist in investigating large matrices that store SNPs for numerous crop cultivars. SNPs are stored according to the order with which they appear along the genome. In such a matrix, each SNP partitions the cultivars, hence defines a set of at least two clusters. A method is sought to identify largest consecutive SNPs sets that correspond to compatible clusters, that is clusters that can be combined into a single tree. The matrix will then be decomposed into several such trees, whose incompatibilities give in turn information on the hybridization events that lead from a few founder plants to the observation of all current cultivars. The corresponding evolutionary scenario will be exposed by building phylogenetic networks from the trees inferred by analyzing SNPs matrices. The rice is the first target plant for which experimental data is available. This data was generated by the 3000 rice genomes project that involved the research community in Montpellier.
The student will have a computer science and mathematical background and should have an interest in bridging algorithmic theory to applied problems.

Keywords: Algorithms, Models, Trees, Directed acyclic graphs, computational complexity, combinatorics, polynomial-time algorithms, dynamic programming, phylogenetics, reticulate evolution, approximation algorithms.


[1] D.H. Huson, R. Rupp and C. Scornavacca. Phylogenetic Networks. Cambridge University Press, 2011.

[2] C. Scornavacca and D.H. Huson. A survey of combinatorial methods for phylogenetic networks. Genome Biology and Evolution 3(2011), pp. 23-35.

[3] Li et al. The 3,000 rice genomes project: new opportunities and challenges for future rice research. GigaScience 3(2014), no. 1, pp. 1-3.


Sujet stage M2, année 2013-2014

 Ordonner des grandes phylogenies (stage M2)


Encadrants : Fabio Pardi, Céline Scornavacca, Vincent Berry

Contexte et description du stage :

Phylogenetics aims at clarifying the evolutionary relationships that exist among different organisms, represented by phylogenetic trees or phylogenies. A phylogeny consists of nodes connected by branches (see the figure for an example). Terminal nodes are called leaves and represent organisms for which data has been collected. Internal nodes represent hypothetical ancestors since they cannot be directly observed. In rooted phylogenetic trees, each internal node represents the most recent common ancestor of its descendants and the only node with no ancestors is called the root of the tree.

Leaves can have several kinds of information associated to them. For example in phylogeography, each leaf may be labelled with the country where the organism was sampled (like in the figure above, where colors represent geographical origins). Similarly, in medicine, when studying the spreading of epidemics, leaves may be labelled with the individuals that carried the pathogen under study. When the data has been collected at different moments in time (e.g. when studying viruses like Influenza) labels can even be temporal and thus have different degrees of similarity between them (obviously 2012 is more related to 2010 than to 1984).

Automatic tools for ordering the leaves of a phylogenetic tree so that the leaves that share the same (or similar) labels are as ‘contiguous’ as possible in the drawing of the tree are highly desirable for biologists. Indeed, this can help them to discern underlying biological structures or increase the visual clarity of information such as the common phylogenetic origin of a morphological trait, or the spread of an infection within a restricted group of people, etc.

Biologists often have phylogenies with thousands/hundreds of thousands of leaves and cannot handle the drawing of the tree by hand by rotating each pair of subtrees around their root node. The subject of this stage is to propose fast algorithms to choose the best possible organization of the leaves of the tree, in order to obtain the best graphical display of the phylogenetic tree, given a labeling of the leaves. More formally, given a rooted tree T  with n leaves, one can define a linear ordering consistent with T as any of the 2^(n-1) permutations of the leaves of T generated by ‘flipping’ internal nodes in T [1]. Once an objective function f is defined, where f(π) measures the degree of clustering of similar labels in a linear ordering π, the goal is to find a linear ordering consistent with T  that  maximises .

A O(n^4) algorithm that maximizes the sum of the similarities between each pair of consecutive leaves has already been proposed [1, implemented in 2]. The student will start by exploring the literature in search of works similar to this one, and by evaluating the strengths and weaknesses of the objective functions chosen by different authors, including the hardness of optimizing them. He/she may also propose novel objective functions. The ultimate goal is the design and implementation of an algorithm for this task that is faster than the existing ones.

Mots clés : algorithms, phylogenomics, tree drawing, programming

Collaborations : the developed method will be integrated to Dendroscope 3 [3] and Phylotype [4].

Bibliographie :

[1] Fast optimal leaf ordering for hierarchical clustering. Bar-Joseph Z, Gifford DK, Jaakkola TS. Bioinformatics. Vol. 17 Suppl 1 (2001), pages S22-S29.

[2] PermutMatrix: a graphical environment to arrange gene expression profiles in optimal linear order. Caraux G and Pinloche S. Vol. 21 no. 7 (2005), pages 1280–1281.

[3] Dendroscope 3 : An Interactive Tool for Rooted Phylogenetic Trees and Networks. Huson DH et Scornavacca C. Systematic Biology. Vol. 61 no. 6 (2012), pages 1061–1067.

[4] Searching for virus phylotypes. Chevenet F, Jung M , Peeters M, de Oliveira T and Gascuel O. (2013), doi:10.1093/bioinformatics/btt010.

Position for a software engineer in Bioinformatics AVAILABLE

The position is not available anymore.

A position for a software engineer in bioinformatics will be available at the Institut des Sciences de l’Evolution de Montpellier (ISEM, UMR-5556,, in collaboration with SupAgro (, in the framework of the Ancestrome project, recently funded by the ANR agency on the Bioinformatics Investissement d’avenir call (


Ancestrome aims at studying genome organization and functioning of extant living organisms to build integrative models to reconstruct evolutionary
intermediates together with the evolutionary processes that have generated them.


The candidate will support researchers involved the Ancestrome project:
• developing software and web services – in conjunction with the team MBB platform – to distribute new methods conceived ​as part of the Ancestrome project.
• using prototyping and method validation (via tests on real or simulated data) to test these new methodologies.

Duration: 2 years
Startin date: 01/10/2013 (flexible)
Salary: depending on seniority and according to the civil service salary structure

Skills: Algorithmic (on graph and text), object-oriented programming (C++ and / or Java), Linux. A background in evolution and genomics would be appreciated.

Applications including a detailed curriculum vitae, a cover letter, and contact information of at least one reference should be sent electronically
to Celine Scornavacca ( and Vincent Ranwez ( before 1 September 2013.

Back to Montpellier