**Titre**: “Dynamique évolutive du pan-génome et rôle des variants structuraux dans l’adaptation locale au sein du complexe d’espèces des chênes blancs européens.”
**Title**: “Pan-genome evolutionary dynamics and role of structural variants in local adaptation within the European white oak species complex.”
Structural variation (SV), i.e. pieces of DNA being deleted, inserted, or rearranged among chromosomes, strongly influences gene content and genome structure within and between closely related species. SVs may have strong impacts on phenotypes of ecological and agronomic interest but are poorly captured when genotypes are called against a single reference genome. Consequently, the evolutionary dynamics of SVs in natural populations (mutation rate, role of selection in shaping genetic variation) and their role in local adaptation remain poorly documented compared to those of single nucleotide polymorphisms (SNPs), especially in forest tree species. Hence, in order to capture SVs and their impact on phenotypic variation, comparative and evolutionary genomics are now turning to pan-genome-based analyses of genetic variation.
White oaks are species of ecological, economic and cultural importance in Europe, accounting for a third of the metropolitan French forests (i.e. 5M ha). These populations display signs of stress or even decline related to biotic and abiotic stresses, and their adaptive potential in the context of climate change is uncertain. The genome of the pedunculate oak was sequenced recently, providing the blueprint to discover genes underlying adaptive variation in these keystone species. However, given the complicated speciation history of white oaks (with a recent secondary contact between species), it is highly likely that identifying functional adaptive variation will require the use of a pan-genome of the species complex.
Using the white oak species complex as a model system, this PhD project will focus on understanding the evolutionary dynamics of genomes in perennial species and will construct a pan-genome to investigate the role of SVs in key adaptive traits such as drought resistance and bud burst.
The PhD candidate will be in charge of the bioinformatics, comparative and population genomics analyses of the data already available at UMR Biogeco for the white oak species (high quality genome assemblies, long and short read sequencing, pool-seq…).
pan-genome, structural variant mutation rate, population genomics, local adaptation, white oak.
**Supervision, training follow-up and research progress monitoring of the PhD student**
The PhD student will be supervised by Christophe Plomion (HDR UMR BIOGECO INRAE – Université de Bordeaux) and Ludovic Duvaux (main supervisor, UMR BIOGECO INRAE – Université de Bordeaux).
Given the unprecedented pandemic situation, special attention will be given to monitoring the PhD student's activities. Since the data are already available and most of the work can be performed remotely, provided there is strong interaction with both thesis supervisors, no delay is anticipated even if the pandemic situation worsens. Nevertheless, on-site work will be encouraged whenever possible to alleviate a possible feeling of isolation. A formal meeting will be organized at least once a week in order to identify critical steps and organize the PhD work as well as to ensure the student's welfare. At the same time, the student will be encouraged to request informal meetings (using video conference) anytime he/she feels the need. In addition to the weekly meetings, the supervisors will contact the student on a regular basis using emails and video conferences. Close collaborative interactions are also already arranged with colleagues of the UMR BFP who have very similar projects and data for the apricot tree (V. Decroocq and QT Bui).
Once a year, the work progress, research directions and organization will be assessed by a PhD committee made up of experts in the field. The role of the committee is to suggest relevant adjustments of the research directions or the organization of the work if necessary. The committee will be provided with an annual report before each meeting, and the meetings will be organized around a presentation of research results and progress by the student.
In addition to compulsory PhD training modules, the student will be encouraged and funded to follow online and on-site training, workshops and summer schools.
Genome evolution ; Pan-genome construction and diversity ; Contribution of structural genetic variation in oak tree adaptation ; genotype – phenotype association.
Evolutionary, comparative and population genomics ; Bioinformatics.
Objective O1: Characterization of structural variants in European white oak populations. This objective includes two tasks:
Task O1.1: Constructing a white oak species complex pan-genome with a particular focus on Quercus robur and Q. petraea. To do so, we will use state-of-the-art methods such as those used in the human pangenome project.
Task O1.2: Genotyping SVs in white oak populations by analysing long-read sequencing data, comparing de novo genome assemblies and/or mapping short reads to the pan-genome graph.
Objective O2: Assessment of the evolutionary dynamics of SV and its role in local adaptation. This objective includes three tasks:
Task O2.1: Estimating the mutation rate of the different categories of structural variants (SVs) using high-quality data from two parent-offspring trios or quartets (long-read sequencing data, de novo genome assemblies).
Task O2.2: Estimating the contribution of SVs to fitness and genetic load in oak populations and species.
Task O2.3: Estimating the contribution of SV to adaptive phenotypic variation in populations sampled along a latitudinal transect.
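As an illustration of the direct estimation described in Task O2.1, the following minimal sketch flags SVs that are present in an offspring but absent from both parents and converts the count into a rate per haploid genome per generation. It is only a sketch under stated assumptions: the `SV` record, the function names and the 50 bp breakpoint tolerance are hypothetical choices made for this example, and a real analysis would add validation of each candidate and a correction for callable genome size.

```python
from typing import List, NamedTuple


class SV(NamedTuple):
    """Hypothetical minimal SV record (not tied to any caller's output format)."""
    chrom: str
    start: int
    end: int
    svtype: str  # e.g. "DEL", "INS", "INV"


def same_sv(a: SV, b: SV, tol: int = 50) -> bool:
    """Treat two calls as the same SV if type and chromosome match and
    both breakpoints agree within `tol` bp (tolerance is an assumption)."""
    return (
        a.chrom == b.chrom
        and a.svtype == b.svtype
        and abs(a.start - b.start) <= tol
        and abs(a.end - b.end) <= tol
    )


def de_novo_candidates(child: List[SV], mother: List[SV], father: List[SV],
                       tol: int = 50) -> List[SV]:
    """Return child SVs that match no call in either parent."""
    parents = list(mother) + list(father)
    return [sv for sv in child if not any(same_sv(sv, p, tol) for p in parents)]


def rate_per_haploid_genome(n_de_novo: int, n_offspring: int) -> float:
    """Each offspring carries two transmitted haploid genomes, hence the factor 2."""
    return n_de_novo / (2 * n_offspring)


# Toy data: one inherited deletion (shifted by 10 bp), one inherited insertion,
# and one inversion seen only in the child.
mother = [SV("chr1", 1000, 1500, "DEL")]
father = [SV("chr2", 300, 300, "INS")]
child = [
    SV("chr1", 1010, 1505, "DEL"),
    SV("chr2", 300, 300, "INS"),
    SV("chr3", 5000, 9000, "INV"),
]
candidates = de_novo_candidates(child, mother, father)
```

With these toy data, only the chr3 inversion is retained as a candidate. Counting candidates per SV category across all offspring of the trios or quartets and passing the totals to `rate_per_haploid_genome` would give one rate per category.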
Structural variants (SVs) include duplications, indels, inversions, translocations, transposable elements (TEs), copy number variation (CNV) and gene presence-absence variation (PAV). SVs are the largest component of genetic polymorphism in terms of number of bases [1], owing to their size and their high mutation rates, estimated to be 10 to 1,000 times higher than those of SNPs [2,3]. In plants, the dispensable genome (i.e. sequences present only in some individuals, as opposed to the core genome found in all individuals of a species) may represent between 25% and 65% of the genome size [4]. Importantly, SVs may have strong phenotypic effects and have been shown to underlie important adaptive or agronomic traits in crops [5-7]. As evidence accumulates that SVs account for a large portion of genetic variation, comparative and evolutionary genomics are slowly shifting from reference-genome to pan-genome-based analyses. Unfortunately, due to their cost and complexity, pan-genomic analyses are still mainly developed for crops or model species [4,8,9]. The lack of consideration of SVs in woody plants is particularly striking, since only poplars [10,11] and grapevine [12] have been considered so far, and only with basic methodologies. Thus, relatively little is known about pan-genome diversity in the wild, especially in trees.
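The distinction between core and dispensable genome can be made concrete with a small sketch that partitions genes from a presence-absence matrix. This is only an illustration: the gene and individual names are invented, the function name is hypothetical, and the fraction computed here is based on gene counts, whereas the percentages quoted above refer to genome size.

```python
from typing import Dict, List, Tuple


def partition_pan_genome(pav: Dict[str, List[bool]]) -> Tuple[List[str], List[str]]:
    """Split genes into core (present in every individual) and dispensable
    (present in at least one individual but not in all).

    `pav` maps a gene name to one presence flag per individual; every list is
    assumed to be non-empty and to follow the same individual order.
    Genes absent from all individuals are ignored.
    """
    core = sorted(g for g, flags in pav.items() if all(flags))
    dispensable = sorted(
        g for g, flags in pav.items() if any(flags) and not all(flags)
    )
    return core, dispensable


# Toy presence-absence matrix for three hypothetical individuals.
toy_pav = {
    "geneA": [True, True, True],     # core
    "geneB": [True, False, True],    # dispensable
    "geneC": [False, False, True],   # dispensable
    "geneD": [True, True, True],     # core
}
core, dispensable = partition_pan_genome(toy_pav)
dispensable_fraction = len(dispensable) / (len(core) + len(dispensable))
```

In this toy example half of the genes are dispensable. With real data, the same partition would be applied to sequence segments or graph nodes rather than gene names if a size-based fraction is wanted.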
Despite the lack of pan-genomes in wild species, more and more evidence accumulates showing the evolutionary significance of SVs in natural populations, as illustrated by the special issue of Molecular Ecology entitled “The role of genomic structural variants in adaptation and diversification” (see also refs 13 & 14). Most SVs are expected to be selectively neutral or deleterious, and demographic history (i.e. neutral evolution) is likely a major driver of their evolution [3,14]. This expectation derives from the fact that initially low population frequencies favor elimination by drift and that purifying selection appears to be widespread for SVs with phenotypic consequences [8,12,13]. However, most of the literature available on SVs focuses on inversions despite their relative rarity [8]. Multiple other classes of SVs exist (PAV, CNV, TE) and may have different evolutionary dynamics [14].
Furthermore, the mutation rates of SVs are expected to be higher than those of SNPs and to be an important source of genetic variation, potentially fuelling evolutionary change and adaptation. However, SV mutation rates have rarely been estimated. In trees, such estimates have been provided for only two species, poplar [10,11,15] and the conifer Picea abies [16].
Overall, new sequencing technology has promoted the discovery that SVs are prevalent in genomes; however, the evolutionary dynamics of SVs within and among tree populations remain poorly understood, in particular the relative contributions of mutation, drift, purifying selection and positive selection.
In conclusion, this PhD project aims at constructing a high-quality pan-genome of the oak species complex in order to (i) characterize structural variation in European white oak populations and (ii) assess its evolutionary dynamics and its role in local adaptation.
Objective O1: Characterization of structural variation in European white oak populations.
Most data required to build the pan-genome and to call SVs is already available:
Five gold-standard long-read genome assemblies will be used for the pan-genome construction:
60-130X ONT for a quartet of Q. robur individuals: two parents and two of their offspring, already assembled at Genoscope.
PacBio HiFi data for 1 individual of Q. petraea, currently being assembled in collaboration with CNRGV.
Furthermore, two proposals have been submitted in order to obtain PacBio HiFi data for 6 to 36 more genomes.
30X Illumina sequencing for 18 individuals including 3 ancient genomes (12 Q. petraea and 6 Q. robur)
10X Illumina sequencing currently in progress for ~250 Q. petraea individuals along the latitudinal transect.
We will test three state-of-the-art methods to detect SVs in genomic data: (i) long-read analysis using Sniffles [17] and/or SVIM [18], (ii) comparison of de novo assemblies using SVIM-asm [19], (iii) mapping of short reads to the pan-genome graph and SV genotyping using Cactus and Giraffe [20,21]. In parallel, TEs will be annotated using REPET [22] and genes as in ref. 12. Once the pan-genome and a first catalog of SVs are obtained, we will genotype population datasets such as pool-sequencing data, including populations of Q. pubescens and Q. pyrenaica (PoPoolationTE2 [23]), and correlate SVs with expression data [9]. Although the pan-genome will mainly be constructed for Q. robur and Q. petraea, previous studies [24,25] have shown that frequent mutations are shared within the species complex.
Objective O2: Evolutionary dynamics of SV and its role in local adaptation
First, the gold-standard genomes will be improved using trio binning [26-28], and mutation rates will be estimated by counting and checking new variants in trios (given the high mutation rates of SVs, a pair of trios will be enough). Second, drift and the fitness distribution will be estimated from the SV site frequency spectrum in different species and populations [8], using pre-existing data. SV differentiation between species and populations will be estimated by computing FST on SVs for the 42 pool-seq datasets already available. Finally, we will investigate correlations between SV frequencies and ecological parameters along environmental gradients (temperature, rainfall, soil composition) or among discrete phenotypic classes (early versus late bud flushing).
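The differentiation step can be sketched as follows for a single biallelic SV (presence versus absence) whose allele frequency has been estimated in two pools. This is a minimal illustration of the classical heterozygosity-based definition FST = (HT - HS) / HT with equal pool weights, not the estimator that will necessarily be used in the project: it ignores the pool-size and read-depth corrections that dedicated pool-seq tools apply, and the function name is hypothetical.

```python
import math


def fst_two_pools(p1: float, p2: float) -> float:
    """Heterozygosity-based FST for one biallelic SV between two pools.

    p1 and p2 are the frequencies of the SV allele in each pool.
    Pools are weighted equally; no correction for pool size or coverage
    is applied (simplifying assumption of this sketch).
    """
    # Mean expected heterozygosity within pools.
    h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    # Expected heterozygosity of the combined population.
    p_bar = (p1 + p2) / 2
    h_t = 2 * p_bar * (1 - p_bar)
    if h_t == 0:
        # SV fixed or absent in both pools: FST is undefined.
        return math.nan
    return (h_t - h_s) / h_t


fst_fixed_difference = fst_two_pools(0.0, 1.0)   # complete differentiation
fst_identical = fst_two_pools(0.3, 0.3)          # no differentiation
fst_intermediate = fst_two_pools(0.2, 0.8)
```

Applied to every SV in the catalog and every pair of pools, such per-locus values could then be summarized genome-wide or scanned for outliers along the environmental gradients mentioned above.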
Three main results are expected:
1. A high-quality pan-genome of the European white oak species complex. By itself, a pan-genome is a very valuable resource for biologists. Indeed, pan-genomes remain poorly documented, especially for wild species such as trees. Thus, the oak pan-genome will be an invaluable resource for the oak community in general and the E4E team in particular, especially researchers investigating genotype-phenotype-environment relationships.
2. An estimation of SV mutation rates per variant category, directly derived from trio analyses. Direct estimates of this kind are essential in order to understand SV evolutionary dynamics.
3. New insights into the evolutionary dynamics of SVs in the wild and their role in phenotypic variation and adaptation.
**Scientific, material (specific safety conditions) and financial conditions of the research project**
As stated in the project, most of the data needed to complete the main objectives are already available. In parallel, two ANR projects have been submitted to expand the PacBio HiFi data and obtain a more representative pan-genome. As required, small datasets may be gathered during the PhD in order to cover or deepen some specific aspects of the project (e.g. optical mapping for genome assembly, effect of SVs on gene expression, additional phenotypes…). It is important to note that additional oak resources (Illumina-seq, RNA-seq, metabolic phenotypes…) are continuously being generated by other members of the E4E team (Benjamin Brachi, Grégoire Le Provost) and may be included in the project if deemed relevant.
The candidate will benefit from strong IT support from INRAE (for storage and computing capacity) as well as an up-to-date data management plan developed within the Institute. The candidate will have access to several High Performance Computing clusters required for the analyses, including the MCIA and CBIB centers and, if required, other centers such as the CEA TGCC, the Genotoul and the GenOuest clusters.
He/she will be integrated in the E4E team of the research unit, where strong interactions among team members are the rule. He/she will also be encouraged to have strong interactions with our lab informatics team during the project.
The candidate will benefit from financial support to attend national and international conferences as well as summer schools to develop his/her own network. The hosting team is involved in several EU funded projects (e.g. B4EST, FORGENIUS, Cost Action GBIKE…) and networks (e.g. Evoltree) providing additional opportunities to integrate a larger research community working on forest tree adaptation.
Locally, strong interactions already exist with Véronique Decroocq & Quynh-Trang Bui (BFP, INRAE Bordeaux). They work on the genomic bases of adaptation and domestication in apricot trees and share the same interest in SVs and tree pan-genome dynamics. Thus, a long-term collaboration is ongoing. The candidate will also collaborate with national experts in genome assembly and annotation (Jean-Marc Aury from Génoscope, William Marande from CNRGV, INRAE URGI staff members), structural variation analyses (Olivier Panaud, from Perpignan University), comparative genomics (Jérôme Salse, from INRAE Clermont-Ferrand) and population genetics (within BIOGECO and BFP).
**Goals for promoting the PhD student's research work: dissemination, publication and confidentiality, intellectual property rights**
The work of the PhD candidate will be published in open access scientific journals. Three papers, in line with the three objectives, are foreseen in a time frame of three years. No delays are anticipated even if the pandemic situation worsens, since the data are available and most of the work can be performed remotely, provided there is strong interaction with both thesis supervisors.
**Profile and skills sought**
A candidate with skills in evolutionary ecology/population genomics and a strong will to deepen his/her bioinformatics skills, or alternatively a candidate with skills in bioinformatics/computer science and a strong interest in evolutionary processes.
1. Escaramis, G., Docampo, E. & Rabionet, R. A decade of structural variants: description, history and methods to detect structural variation. Brief. Funct. Genomics 1–10 (2015). doi:10.1093/bfgp/elv014
2. Katju, V. & Bergthorsson, U. Copy-number changes in evolution: rates, fitness effects and adaptive significance. Front. Genet. 4, (2013).
3. Schrider, D. R. et al. Gene copy-number polymorphism caused by retrotransposition in humans. PLoS Genet. 9, e1003242 (2013).
4. Gao, L. et al. The tomato pan-genome uncovers new genes and a rare allele regulating fruit flavor. Nat. Genet. 51, 1044–1051 (2019).
5. Cook, D. E. et al. Copy number variation of multiple genes at Rhg1 mediates nematode resistance in soybean. Science 338, 1206–9 (2012).
6. Zhu, J. et al. Copy number and haplotype variation at the VRN-A1 and central FR-A2 loci are associated with frost tolerance in hexaploid wheat. Theor. Appl. Genet. 127, 1183–1197 (2014).
7. Taylor, C. M. et al. INDEL variation in the regulatory region of the major flowering time gene LanFTc1 is associated with vernalization response and flowering time in narrow-leafed lupin ( Lupinus angustifolius L.). Plant. Cell Environ. 42, 174–187 (2019).
8. Kou, Y. et al. Evolutionary genomics of structural variation in Asian rice (Oryza sativa) domestication. Mol. Biol. Evol. 37, 3507–3524 (2020).
9. Gordon, S. P. et al. Extensive gene content variation in the Brachypodium distachyon pan-genome correlates with population structure. Nat. Commun. 8, 2184 (2017).
10. Pinosio, S. et al. Characterization of the Poplar Pan-Genome by Genome-Wide Identification of Structural Variation. Mol. Biol. Evol. 33, 2706–2719 (2016).
11. Zhang, B. et al. The poplar pangenome provides insights into the evolutionary history of the genus. Commun. Biol. 2, (2019).
12. Zhou, Y. et al. The population genetics of structural variants in grapevine domestication. Nat. Plants 5, 965–979 (2019).
13. Duvaux, L. et al. Dynamics of Copy Number Variation in Host Races of the Pea Aphid. Mol. Biol. Evol. 32, 63–80 (2015).
14. Wellenreuther, M. & Bernatchez, L. Eco-Evolutionary Genomics of Chromosomal Inversions. Trends Ecol. Evol. 33, 427–440 (2018).
15. Prunier, J. et al. Gene copy number variations involved in balsam poplar (Populus balsamifera L.) adaptive variations. Mol. Ecol. (2018). doi:10.1111/mec.14836
16. Prunier, J., Caron, S. & MacKay, J. CNVs into the wild: Screening the genomes of conifer trees (Picea spp.) reveals fewer gene copy number variations in hybrids and links to adaptation. BMC Genomics 18, 1–12 (2017).
17. Sedlazeck, F. J. et al. Accurate detection of complex structural variations using single-molecule sequencing. Nat. Methods 15, 461–468 (2018).
18. Heller, D. & Vingron, M. SVIM: structural variant identification using mapped long reads. Bioinformatics 35, 2907–2915 (2019).
19. Heller, D. & Vingron, M. SVIM-asm: structural variant detection from haploid and diploid genome assemblies. Bioinformatics 36, 5519–5521 (2020).
20. Paten, B. et al. Cactus: Algorithms for genome multiple sequence alignment. Genome Res. 21, 1512–1528 (2011).
21. Sirén, J. et al. Pangenomics enables genotyping of known structural variants in 5202 diverse genomes. Science 374, 1461 (2021).
22. Flutre, T., Duprat, E., Feuillet, C. & Quesneville, H. Considering Transposable Element Diversification in De Novo Annotation Approaches. PLoS One 6, e16526 (2011).
23. Kofler, R., Gómez-Sánchez, D. & Schlötterer, C. PoPoolationTE2: Comparative Population Genomics of Transposable Elements Using Pool-Seq. Mol. Biol. Evol. 33, 2759–2764 (2016).
24. Leroy, T. et al. Extensive recent secondary contacts between four European white oak species. New Phytol. 214, 865–878 (2017).
25. Leroy, T. et al. Massive postglacial gene flow between European white oaks uncovered genes underlying species barriers. New Phytol. 33, nph.16039 (2019).
26. Koren, S. et al. De novo assembly of haplotype-resolved genomes with trio binning. Nat. Biotechnol. 36, 1174–1182 (2018).
27. Yen, E. C. et al. A haplotype-resolved, de novo genome assembly for the wood tiger moth (Arctia plantaginis) through trio binning. Gigascience 9, (2020).