Gene duplication, in which a DNA region that typically contains a gene is copied within the same genome, is a major force in evolution. The new copy, a paralog, may accumulate mutations and either be lost or acquire a new function. Because bacteria have short generation times and large effective population sizes, it is widely accepted that gene duplications that do not confer a selective advantage are rapidly selected against. Thus, individuals harboring duplications are almost undetectable in experimentally cultured bacterial populations. This difficulty is compounded by the limitations of short-read sequencing: read lengths are significantly shorter than the average gene length, which hinders the detection of extant duplicated sequences. We address these problems using long reads from natural microbial populations, which capture full genes together with their genomic context and thus enable us to identify genes that recur in different contexts. By supplementing this with tetranucleotide frequency analysis and epigenetic modification patterns (attainable through long-read sequencing), we can sort the long reads into species and strains, distinguish gene duplication events from lateral gene transfers, and measure the rate of duplication per species and environment.
The immediate outcome of our method is the ability to identify transient gene duplications in microbial long-read sequencing data.
We expect this method to substantially advance our understanding of gene duplications and their fate in natural microbial populations.
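As an illustration of the tetranucleotide frequency analysis mentioned above, the following minimal Python sketch computes a 256-dimensional tetranucleotide profile for a read and compares two profiles. It is not the pipeline itself: the function names, the choice to skip windows containing ambiguous bases, the absence of strand collapsing, and the use of Euclidean distance as the similarity measure are all assumptions made for the example.

```python
from itertools import product
import math

# All 256 tetranucleotides in a fixed order, so profiles are comparable.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]
INDEX = {t: i for i, t in enumerate(TETRAMERS)}


def tetranucleotide_frequencies(seq):
    """Return a 256-element list of relative tetranucleotide frequencies.

    Assumption: windows containing characters other than A, C, G, T
    (for example N) are skipped rather than imputed.
    """
    seq = seq.upper()
    counts = [0] * len(TETRAMERS)
    for i in range(len(seq) - 3):
        idx = INDEX.get(seq[i:i + 4])  # None for ambiguous windows
        if idx is not None:
            counts[idx] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]


def euclidean_distance(u, v):
    """Distance between two profiles (hypothetical choice of metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


if __name__ == "__main__":
    read_a = "ACGTACGTTTGACCA"
    read_b = "ACGTACGTTTGACCG"
    d = euclidean_distance(
        tetranucleotide_frequencies(read_a),
        tetranucleotide_frequencies(read_b),
    )
    print(f"profile distance: {d:.4f}")
```

Reads whose profiles lie close together are more likely to originate from the same taxon, so a distance of this kind could feed a clustering step that groups reads before duplications are called. In practice such profiles are only reliable for sequences of several kilobases, which is one reason long reads suit this analysis.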