Internship at University of Montpellier (semester one 2016)

 Scaling up phylogenetic networks to genome-size data  

 

Supervisors : Céline ScornavaccaVincent Berry

Context: Reconstructing the evolutionary histories of organisms is of great importance for biologists aiming at clarifying the evolutionary relationships of organisms. This is a crucial step towards tackling major issues, such as characterizing the mosaic structure in plant genomes. Computer science has an important role to play here as evolutionary relationships are displayed by trees and graphs, which can be inferred from biological data by various algorithms, once a model has been chosen. Even though selected models are mathematically simple, most combinatorial optimization problems following them are NP-hard. The research process also involves going back and forth between a theoretical study and empirical results on real data. This allows to gain new insights in the proposed models and algorithms and ultimately to bring new knowledge to biologists.

Work to be done: Evolutionary relationships are traditionally displayed in tree-shaped diagrams called phylogenetic trees, where leaves are labelled by currently existing species. However, trees cannot represent evolutionary events such as recombination and hybridization, in which two lineages combine to form a new lineage. Single-rooted direct acyclic graphs (DAGs), also called here “phylogenetic networks”, can be used to display such complex evolutionary scenarios. The Genome Harvest project aims at the production of such networks for several plants of interest (rice, banana, citrus, tomato, …).

The main work of the internship will consist in investigating large matrices that store SNPs for numerous crop cultivars. SNPs are stored according to the order with which they appear along the genome. In such a matrix, each SNP partitions the cultivars, hence defines a set of at least two clusters. A method is sought to identify largest consecutive SNPs sets that correspond to compatible clusters, that is clusters that can be combined into a single tree. The matrix will then be decomposed into several such trees, whose incompatibilities give in turn information on the hybridization events that lead from a few founder plants to the observation of all current cultivars. The corresponding evolutionary scenario will be exposed by building phylogenetic networks from the trees inferred by analyzing SNPs matrices. The rice is the first target plant for which experimental data is available. This data was generated by the 3000 rice genomes project that involved the research community in Montpellier.
The student will have a computer science and mathematical background and should have an interest in bridging algorithmic theory to applied problems.

Keywords: Algorithms, Models, Trees, Directed acyclic graphs, computational complexity, combinatorics, polynomial-time algorithms, dynamic programming, phylogenetics, reticulate evolution, approximation algorithms.

Bibliography:

[1] D.H. Huson, R. Rupp and C. Scornavacca. Phylogenetic Networks. Cambridge University Press, 2011.

[2] C. Scornavacca and D.H. Huson. A survey of combinatorial methods for phylogenetic networks. Genome Biology and Evolution 3(2011), pp. 23-35.

[3] Li et al. The 3,000 rice genomes project: new opportunities and challenges for future rice research. GigaScience 3(2014), no. 1, pp. 1-3.


Sujet stage M2, année 2013-2014

 Ordonner des grandes phylogenies (stage M2)

 ex

Encadrants : Fabio Pardi, Céline Scornavacca, Vincent Berry

Contexte et description du stage :

Phylogenetics aims at clarifying the evolutionary relationships that exist among different organisms, represented by phylogenetic trees or phylogenies. A phylogeny consists of nodes connected by branches (see the figure for an example). Terminal nodes are called leaves and represent organisms for which data has been collected. Internal nodes represent hypothetical ancestors since they cannot be directly observed. In rooted phylogenetic trees, each internal node represents the most recent common ancestor of its descendants and the only node with no ancestors is called the root of the tree.

Leaves can have several kinds of information associated to them. For example in phylogeography, each leaf may be labelled with the country where the organism was sampled (like in the figure above, where colors represent geographical origins). Similarly, in medicine, when studying the spreading of epidemics, leaves may be labelled with the individuals that carried the pathogen under study. When the data has been collected at different moments in time (e.g. when studying viruses like Influenza) labels can even be temporal and thus have different degrees of similarity between them (obviously 2012 is more related to 2010 than to 1984).

Automatic tools for ordering the leaves of a phylogenetic tree so that the leaves that share the same (or similar) labels are as ‘contiguous’ as possible in the drawing of the tree are highly desirable for biologists. Indeed, this can help them to discern underlying biological structures or increase the visual clarity of information such as the common phylogenetic origin of a morphological trait, or the spread of an infection within a restricted group of people, etc.

Biologists often have phylogenies with thousands/hundreds of thousands of leaves and cannot handle the drawing of the tree by hand by rotating each pair of subtrees around their root node. The subject of this stage is to propose fast algorithms to choose the best possible organization of the leaves of the tree, in order to obtain the best graphical display of the phylogenetic tree, given a labeling of the leaves. More formally, given a rooted tree T  with n leaves, one can define a linear ordering consistent with T as any of the 2^(n-1) permutations of the leaves of T generated by ‘flipping’ internal nodes in T [1]. Once an objective function f is defined, where f(π) measures the degree of clustering of similar labels in a linear ordering π, the goal is to find a linear ordering consistent with T  that  maximises .

A O(n^4) algorithm that maximizes the sum of the similarities between each pair of consecutive leaves has already been proposed [1, implemented in 2]. The student will start by exploring the literature in search of works similar to this one, and by evaluating the strengths and weaknesses of the objective functions chosen by different authors, including the hardness of optimizing them. He/she may also propose novel objective functions. The ultimate goal is the design and implementation of an algorithm for this task that is faster than the existing ones.

Mots clés : algorithms, phylogenomics, tree drawing, programming

Collaborations : the developed method will be integrated to Dendroscope 3 [3] and Phylotype [4].

Bibliographie :

[1] Fast optimal leaf ordering for hierarchical clustering. Bar-Joseph Z, Gifford DK, Jaakkola TS. Bioinformatics. Vol. 17 Suppl 1 (2001), pages S22-S29.

[2] PermutMatrix: a graphical environment to arrange gene expression profiles in optimal linear order. Caraux G and Pinloche S. Vol. 21 no. 7 (2005), pages 1280–1281.

[3] Dendroscope 3 : An Interactive Tool for Rooted Phylogenetic Trees and Networks. Huson DH et Scornavacca C. Systematic Biology. Vol. 61 no. 6 (2012), pages 1061–1067.

[4] Searching for virus phylotypes. Chevenet F, Jung M , Peeters M, de Oliveira T and Gascuel O. (2013), doi:10.1093/bioinformatics/btt010.


Position for a software engineer in Bioinformatics AVAILABLE

The position is not available anymore.

A position for a software engineer in bioinformatics will be available at the Institut des Sciences de l’Evolution de Montpellier (ISEM, UMR-5556, http://www.isem.univ-montp2.fr), in collaboration with SupAgro (http://www.supagro.fr), in the framework of the Ancestrome project, recently funded by the ANR agency on the Bioinformatics Investissement d’avenir call (http://lbbe-dmz.univ-lyon1.fr/ancestrome/spip_ancestrome/).

Background:

Ancestrome aims at studying genome organization and functioning of extant living organisms to build integrative models to reconstruct evolutionary
intermediates together with the evolutionary processes that have generated them.

Missions:

The candidate will support researchers involved the Ancestrome project:
• developing software and web services – in conjunction with the team MBB platform – to distribute new methods conceived ​as part of the Ancestrome project.
• using prototyping and method validation (via tests on real or simulated data) to test these new methodologies.

Duration: 2 years
Startin date: 01/10/2013 (flexible)
Salary: depending on seniority and according to the civil service salary structure

Skills: Algorithmic (on graph and text), object-oriented programming (C++ and / or Java), Linux. A background in evolution and genomics would be appreciated.

Applications including a detailed curriculum vitae, a cover letter, and contact information of at least one reference should be sent electronically
to Celine Scornavacca (celine.scornavacca@univ-montp2.fr) and Vincent Ranwez (vincent.ranwez@supagro.inra.fr) before 1 September 2013.


poste de MdC, à l’ISEM (peut être)!

Un poste de MdC va prochainement être ouvert au concours par l’Université de Montpellier II.

Identification du poste : Développement de méthodes, modèles, algorithmes et logiciels pour l’analyse de données en biologie évolutive ou écologie. MC 67/27

Section CNU : 67/27

Mots clefs : Bioinformatique, Modélisation, Biostatistiques, Bases de données

Profil Enseignement

Systèmes d’information environnementaux et gestion de base de données Il comportera une partie d’enseignements généralistes dans le cadre de la licence Informatique, et a également pour vocation de combler les besoins en enseignements dans des formations inter-disciplinaires pérennes et émergentes, telles que la spécialité Géomatique du master Informatique, la spécialité Bioinformatique, Connaissance, Données du master STIC pour la Santé, et la spécialité Spatialisation et Technologies en Ecologie du master STIC pour l’écologie et l’environnement. Les enseignements concernés tournent autour des concepts et approches ingénierie relatives aux systèmes d’information environnementaux (pris au sens large : écologie, biologie, santé, information géographique, etc..), ce qui inclut les aspects gestion de base de données spatiales, bases de données semi-structurées et nosql , les aspects temporels, ainsi que le champ base de connaissances (ontologies et génie ontologique). Toute autre compétence en enseignement pouvant s’intégrer dan s les formations pré-citées sera également appréciée.

Département d’enseignement ou équipe pédagogique : Département d’Informatique de la Faculté des Sciences

Nom du Directeur département : Michelle Joab et Bruno Durand

Email directeur département : dirinfofds@univ-montp2.fr

Profil Recherche

Développement de méthodes, modèles, algorithmes et logiciels pour l’analyse de données en biologie évolutive ou écologie.

Laboratoires d’accueil potentiels : AMAP, CBAE, CEFE, EME, ISEM, MIVEGEC

Les laboratoires d’accueil potentiels couvrent un vaste champ de recherche en écologie et évolution. Les recherches dans ce domaine font appel à des compétences en bioinformatique, qu’il s’agisse de l’analyse des données de génomique haut débit ou de l’intégration de données spatiales et temporelles (écologiques, démographiques, biogéographiques, génétiques, climatiques). L’évolution simultanée des thématiques de recherches, des systèmes d’acquisition et d’agrégation de données (réseaux collaboratifs), ainsi que des demandes sociétales vont conférer à ces aspects une importance croissante dans les programmes de recherche. Les recherches de la personne recrutée pourront ainsi porter sur des thématiques de bases de données, de modélisation, d’inférence statistique et d’aide à la décision, dans les domaines suivants : • l’acquisition de données spatialisées pour l’observation des changements environnementaux, intégrant possiblement le temps long, la définition et l’application de traitements sur ces observations, et l’intégration des produits dérivés dans des systèmes d’information et d’interprétation ; • l’algorithmique pour l’analyse de données biologiques haut-débit : assemblage, identification de variants, phylogénie, datation moléculaire ; • l’algorithmique pour les statistiques inférentielles (par exemple, méthodes de simulation, parallélisation) portant sur les processus écologiques et évolutifs ; • l’intégration des processus démographiques et génétiques dans les modèles biogéographiques ou écosystémiques. Les recherches seront conduites sur ces thématiques et devront renforcer les compétences en bioinformatique du laboratoire d’accueil et les collaborations avec les laboratoires d’informatique et de modélisation de Montpellier.


Post-Doctoral position in Combinatorics for Bioinformatics AVAILABLE

The position is not available anymore.

Post-Doctoral position in Combinatorics for Bioinformatics (18 months)

A postdoctoral fellowship will be available at the Institut des Sciences de l’Evolution de Montpellier (ISEM, UMR-5556, http://www.isem.univ-montp2.fr) in the framework of the Ancestrome project, recently funded by the ANR agency on the Bioinformatics Investissement d’avenir call (http://lbbe-dmz.univ-lyon1.fr/ancestrome/spip_ancestrome/).

Ancestrome aims at studying genome organization and functioning of extant living organisms to build integrative models to reconstruct evolutionary intermediates together with the evolutionary processes that have generated them.

A critical step in the development of methods to reconstruct the evolution of genomes is the modeling of the processes that relate gene phylogenies to species phylogenies. We have developed methods for integrating events of gene duplication, gene loss and lateral gene transfer in the simultaneous reconstruction of species and gene histories. These methods yield much better estimates of the gene content of ancestral genome, and hence open the door to accurate reconstructions of ancestral species characteristics. This includes cellular and genome organizations, or metabolic capabilities as deduced from genome content, but also virtually any internal or external factors that have left a trace in the genomic record, such as the environment to which the species has adapted, its resources, parasites or demography.

The aim of this position is to define, implement and exploit measures of reconciliation similarities. Reconciliations are defined as mapping of gene tree nodes into the species tree nodes permitting to infer past events in the species’ history such as gene duplications, losses and transfers [1]. Defining similarities between two such mappings will allow to study the diversity of equally parsimonious reconciliations on the same pair species tree-gene tree, to cluster them and to eventually summarize them through a consensus reconciliation. Measures to compare two reconciliations for the same species tree but for different gene trees on the same gene family and even on different gene families should also be proposed. This latter measure will be used to search for gene families, which have undergone concomitant gene losses, duplications and transfers, suggesting a physical or functional link between those genes.

Candidates should have a solid background in Combinatorics. A background in Probability and Statistics would be appreciated.

Applications including a detailed curriculum vitae, a cover letter, and contact information of at least three references should be sent electronically to Celine Scornavacca (celine.scornavacca@univ-montp2.fr) and Vincent Ranwez (vincent.ranwez@supagro.inra.fr) before 1 October 2013.

Starting date: 1 November 2013.


Back to Montpellier