Scaling up phylogenetic networks to genome-size data
Context: Reconstructing the evolutionary histories of organisms is of great importance for biologists aiming at clarifying the evolutionary relationships of organisms. This is a crucial step towards tackling major issues, such as characterizing the mosaic structure in plant genomes. Computer science has an important role to play here as evolutionary relationships are displayed by trees and graphs, which can be inferred from biological data by various algorithms, once a model has been chosen. Even though selected models are mathematically simple, most combinatorial optimization problems following them are NP-hard. The research process also involves going back and forth between a theoretical study and empirical results on real data. This allows to gain new insights in the proposed models and algorithms and ultimately to bring new knowledge to biologists.
Work to be done: Evolutionary relationships are traditionally displayed in tree-shaped diagrams called phylogenetic trees, where leaves are labelled by currently existing species. However, trees cannot represent evolutionary events such as recombination and hybridization, in which two lineages combine to form a new lineage. Single-rooted direct acyclic graphs (DAGs), also called here “phylogenetic networks”, can be used to display such complex evolutionary scenarios. The Genome Harvest project aims at the production of such networks for several plants of interest (rice, banana, citrus, tomato, …).
The main work of the internship will consist in investigating large matrices that store SNPs for numerous crop cultivars. SNPs are stored according to the order with which they appear along the genome. In such a matrix, each SNP partitions the cultivars, hence defines a set of at least two clusters. A method is sought to identify largest consecutive SNPs sets that correspond to compatible clusters, that is clusters that can be combined into a single tree. The matrix will then be decomposed into several such trees, whose incompatibilities give in turn information on the hybridization events that lead from a few founder plants to the observation of all current cultivars. The corresponding evolutionary scenario will be exposed by building phylogenetic networks from the trees inferred by analyzing SNPs matrices. The rice is the first target plant for which experimental data is available. This data was generated by the 3000 rice genomes project that involved the research community in Montpellier.
The student will have a computer science and mathematical background and should have an interest in bridging algorithmic theory to applied problems.
Keywords: Algorithms, Models, Trees, Directed acyclic graphs, computational complexity, combinatorics, polynomial-time algorithms, dynamic programming, phylogenetics, reticulate evolution, approximation algorithms.
 D.H. Huson, R. Rupp and C. Scornavacca. Phylogenetic Networks. Cambridge University Press, 2011.
 C. Scornavacca and D.H. Huson. A survey of combinatorial methods for phylogenetic networks. Genome Biology and Evolution 3(2011), pp. 23-35.
 Li et al. The 3,000 rice genomes project: new opportunities and challenges for future rice research. GigaScience 3(2014), no. 1, pp. 1-3.