Master internship: Development of workflows for pangenome graph functional annotation at large scale

Prokaryotes (i.e. bacteria and archaea) are a remarkably diverse and ubiquitous group of living organisms. Their impact on the biosphere is immense: they influence human and animal health, soil and ocean biogeochemistry, and much more. Large-scale exploration of microbial genomes has helped uncover the molecular mechanisms underlying this diversity, particularly the role of Mobile Genetic Elements (MGEs).

In recent years, with the explosion of sequencing projects, several bioinformatics approaches have been developed based on the pangenome concept, offering solutions for efficiently managing and exploiting large quantities of data [1]. Pangenomics examines genetic variability across all available genomes of a given group, usually a species, rather than relying on a single reference genome or on pairwise comparisons. In terms of gene content, a distinction is made between the core genome, i.e. the genes present in all individuals, and the accessory (or variable) genome, i.e. the genes present in only some genomes, which are therefore likely to explain phenotypic particularities. The development of pangenomic methods is thus a response to the challenge of massive data in biology, helping to understand the evolution of microorganisms in relation to epidemiological or environmental data.
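As a minimal illustration of the core/accessory distinction described above, the following Python sketch splits gene families using the strict "present in all genomes" definition. The function name, the toy genomes and the family identifiers are hypothetical, and this is not the statistical partitioning performed by PPanGGOLiN.

```python
from collections import Counter


def split_core_accessory(genomes):
    """Split gene families into core and accessory sets.

    genomes: dict mapping a genome name to the set of gene family
    identifiers it contains (hypothetical toy input format).
    """
    n_genomes = len(genomes)
    # Count in how many genomes each gene family occurs.
    counts = Counter(fam for families in genomes.values() for fam in set(families))
    # Strict definition: core families occur in every genome.
    core = {fam for fam, count in counts.items() if count == n_genomes}
    accessory = set(counts) - core
    return core, accessory


# Toy data, invented for illustration only.
toy_genomes = {
    "genome_A": {"fam1", "fam2", "fam3"},
    "genome_B": {"fam1", "fam2", "fam4"},
    "genome_C": {"fam1", "fam2", "fam3", "fam5"},
}

core, accessory = split_core_accessory(toy_genomes)
print(sorted(core))
print(sorted(accessory))
```

With real data, a strict threshold is sensitive to incomplete assemblies, which is one reason softer or statistical definitions are often preferred.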

For several years now, the LABGeM team has been working on a model that represents genomic data as a pangenome graph at the gene family level, enabling the compression of information from thousands of genomes while preserving the chromosomal organization of genes. The PPanGGOLiN software suite [2] (awarded an Open Science Research Prize by the French Ministry of Research in 2023; >220 citations since 2020) has been developed to reconstruct and analyze pangenome graphs. It includes methods such as the identification of regions of genomic plasticity (panRGP method) [3] and their finer description as conserved modules (panModule method) [4], which have proved useful for identifying genomic islands and their MGEs. LABGeM is also developing PanGBank, a database of pangenomes reconstructed from public genomes in the GenBank and RefSeq databases using the GTDB classification. It currently gathers pangenomes for >4300 prokaryotic species.
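To give an idea of what a pangenome graph at the gene family level looks like, here is a simplified Python sketch: nodes are gene families, and an edge links two families that are neighbors in at least one genome, weighted by the number of genomes supporting that adjacency. It assumes one linear contig per genome and ignores strand and circularity; the function name and toy data are hypothetical, and this is not the PPanGGOLiN implementation.

```python
from collections import defaultdict


def build_family_graph(genomes):
    """Build a toy pangenome graph from ordered gene family lists.

    genomes: dict mapping a genome name to the list of gene family
    identifiers in chromosomal order (assumption: one linear contig).
    Returns a dict mapping an undirected edge (frozenset of two
    families) to the number of genomes in which they are adjacent.
    """
    supporting = defaultdict(set)
    for name, order in genomes.items():
        # Walk along the chromosome and record each pair of neighbors.
        for left, right in zip(order, order[1:]):
            if left != right:  # skip tandem copies of the same family
                supporting[frozenset((left, right))].add(name)
    return {edge: len(names) for edge, names in supporting.items()}


# Toy data, invented for illustration only.
toy_orders = {
    "genome_A": ["fam1", "fam2", "fam3"],
    "genome_B": ["fam1", "fam2", "fam4"],
    "genome_C": ["fam1", "fam2", "fam3", "fam5"],
}

graph = build_family_graph(toy_orders)
for edge, weight in sorted(graph.items(), key=lambda item: sorted(item[0])):
    print(sorted(edge), weight)
```

Three genomes are summarized here by four weighted edges, which illustrates how the graph compresses shared gene order while keeping track of the variable neighborhoods.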

This internship topic is part of the ANR PanGAIMiX project, which aims to leverage large language models to enhance large-scale computational microbiology using pangenome graphs. This project has been developed in collaboration with Christophe Ambroise (LaMME laboratory, University of Evry) and Guillaume Gautreau (MaIAGE, INRAE), who provide the statistics and artificial intelligence expertise.

Within work package 3 of ANR PanGAIMiX, the internship project aims to build a dataset of pangenome graphs annotated at different functional levels, which will serve as a basis for model training and validation.

The main objectives of this work will be to:

construct workflows to annotate pangenome graphs using diverse bioinformatics resources (KEGG, Pfam, InterPro…)
apply the workflows to annotate pangenomes extracted from the PanGBank resource
use functional annotations to design a genomic tokenizer

Candidate profile

Master in Bioinformatics
Good programming skills in Python
Good knowledge of microbial genomics
Knowledge of machine learning

Duration: 6 months starting in January-February 2026, funded by CEA (1400€/month). Possibility of continuing with a PhD.

This internship will take place at Genoscope Evry in the Bioinformatics Analysis Laboratory for Genomics and Metabolism (LABGeM).

To apply, please send your resume and cover letter to the following addresses: acalteau@genoscope.cns.fr, vallenet@genoscope.cns.fr, guillaume.gautreau@inrae.fr

References

[1] Computational Pan-Genomics Consortium. Computational pan-genomics: status, promises and challenges. Brief Bioinform. 2016. doi:10.1093/bib/bbw089
[2] Gautreau G, et al. PPanGGOLiN: Depicting microbial diversity via a partitioned pangenome graph. PLoS Comput Biol. 2020;16: e1007732. doi:10.1371/journal.pcbi.1007732
[3] Bazin A, et al. panRGP: a pangenome-based method to predict genomic islands and explore their diversity. Bioinformatics. 2020;36: i651–i658. doi:10.1093/bioinformatics/btaa792
[4] Bazin A, et al. panModule: detecting conserved modules in the variable regions of a pangenome graph. bioRxiv. 2021. p. 2021.12.06.471380. doi:10.1101/2021.12.06.471380
