Prokaryotes (bacteria and archaea) are diverse, ubiquitous organisms with vast impacts on health, soil, and ocean ecosystems. Large-scale genome sequencing and pangenomics have revealed their molecular diversity, especially the role of mobile genetic elements (MGEs). Pangenomics analyzes genetic variability across all genomes of a group, distinguishing core genes (shared by all genomes) from accessory genes (variable, linked to phenotypic traits). These methods address the challenge of big data in biology [1], advancing our understanding of microbial evolution in epidemiological and environmental contexts.
For years, the LABGeM team has developed a pangenome graph model at the gene family level, compressing data from thousands of genomes while preserving gene order. The PPanGGOLiN software suite [2] reconstructs and analyzes pangenome graphs at the species level. It includes methods for identifying regions of genomic plasticity, including MGEs and genomic islands (panRGP method) [3], and for describing them in detail as conserved modules (panModule method) [4]. LABGeM is also developing PanGBank, a database of pangenomes reconstructed from public genomes in the GenBank and RefSeq databases using the GTDB classification. It currently gathers pangenomes for over 4300 prokaryotic species.
However, a significant proportion of bacterial species remain underrepresented in genomic databases (e.g. 64% of the ~110,000 species in GTDB [5] are represented by only a single genome). This sparse coverage limits our ability to conduct evolutionary analyses at the species level. To bridge this gap, genus-level comparative analyses offer valuable insights into how closely related species adapt to a wide range of environmental conditions.
As part of the ANR PanGAIMiX project (2025-2029), which aims to leverage artificial intelligence to enhance large-scale computational microbiology using pangenome graphs, we are recruiting an engineer to join our team and contribute to the development of PPanGGOLiN to enable analyses at the genus level.

Your mission:
You will help enhance the PPanGGOLiN model and methods to enable genus-level pangenome analyses, while preserving species-level detail. The main objectives of this work will be to:
- adapt the PPanGGOLiN data model and gene families to different taxonomic levels, including species and genus;
- improve the NEM (Neighborhood Expectation-Maximization) partitioning method implemented in PPanGGOLiN;
- update the panRGP and panModule methods for the new data model, with parameter adjustments to account for the increased genomic diversity at the genus level.
Candidate profile:
- Master's degree in Bioinformatics or Computer Science
- Expertise in bioinformatics algorithms (graphs) and software development
- Strong programming skills in Python. Experience in C development will also be appreciated.
- Knowledge of microbial genomics and statistics will be appreciated
Duration: 2-year contract starting in April 2026.
Remuneration: According to qualifications and experience (CEA salary scale)
Location: Genoscope, Evry, in the Bioinformatics Analysis Laboratory for Genomics and Metabolism (LABGeM).
To apply, please send your resume and cover letter to the following addresses: acalteau@genoscope.cns.fr, vallenet@genoscope.cns.fr
References
[1] Computational Pan-Genomics Consortium. Computational pan-genomics: status, promises and challenges. Brief Bioinform. 2016. doi:10.1093/bib/bbw089
[2] Gautreau G, et al. PPanGGOLiN: Depicting microbial diversity via a partitioned pangenome graph. PLoS Comput Biol. 2020;16: e1007732. doi:10.1371/journal.pcbi.1007732
[3] Bazin A, et al. panRGP: a pangenome-based method to predict genomic islands and explore their diversity. Bioinformatics. 2020;36: i651–i658. doi:10.1093/bioinformatics/btaa792
[4] Bazin A, et al. panModule: detecting conserved modules in the variable regions of a pangenome graph. bioRxiv. 2021. p. 2021.12.06.471380. doi:10.1101/2021.12.06.471380
[5] Parks DH, et al. GTDB: an ongoing census of bacterial and archaeal diversity through a phylogenetically consistent, rank normalized and complete genome-based taxonomy. Nucleic Acids Res. 2022;50: D785–D794. doi:10.1093/nar/gkab776
