Mobile genetic elements, including viruses and plasmids, are part of microbial communities and play a vital role in disease and in antibiotic resistance. Accurately classifying sequence contigs as phages, plasmids, or bacterial chromosomes in mixed metagenomic assemblies is critical for further unravelling their functions. Moreover, microeukaryotes are present and sometimes abundant in microbial communities, but are usually ignored or misclassified as prokaryotes in existing studies.
Here we developed 4CAC, an algorithm to classify contigs as phages, plasmids, prokaryotes, and eukaryotes from mixed metagenomic assemblies. The algorithm first trains an XGBoost model to classify sequence contigs into the four classes based on their k-mer compositions. Since phages, plasmids, and microeukaryotes usually have low prevalence in metagenomic assemblies, 4CAC sets high score thresholds for assigning contigs to these minor classes, which yields high precision. Afterwards, 4CAC reclassifies short contigs and contigs classified with low confidence by taking advantage of the classification of their neighbors in the assembly graph. Evaluation on simulated metagenomes and real human gut microbiome samples showed that 4CAC outperformed existing classifiers in both precision and recall, especially for the minor classes. 4CAC achieved an F1 score above 0.9 on simulated metagenomes. In classifying short-read assemblies, 4CAC increased the F1 score by as much as 35 percentage points compared to existing methods.
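The three stages described above (k-mer composition features, high score thresholds for the minor classes, and refinement along the assembly graph) can be illustrated with a minimal Python sketch. It is not the 4CAC implementation. The threshold values, the `min_length` cutoff, and the function names are hypothetical, the per-class scores are assumed to come from some trained four-class model (XGBoost in 4CAC, omitted here), and the unanimous-neighbor rule is a deliberate simplification of graph-based reclassification.

```python
from collections import Counter
from itertools import product

# Hypothetical thresholds: the minor classes need a high score to be accepted.
THRESHOLDS = {"phage": 0.95, "plasmid": 0.95, "eukaryote": 0.95, "prokaryote": 0.5}


def kmer_frequencies(seq, k=3):
    """Return normalized k-mer frequencies over ACGT, in a fixed k-mer order."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)  # k-mers with other letters are ignored
    return [counts[m] / total if total else 0.0 for m in kmers]


def initial_label(scores, length, min_length=1000):
    """Label a contig from per-class scores; None means short or low confidence."""
    if length < min_length:  # hypothetical cutoff for "short" contigs
        return None
    best = max(scores, key=scores.get)
    return best if scores[best] >= THRESHOLDS[best] else None


def refine_with_graph(labels, edges):
    """Give each unlabeled contig the label shared by all its labeled neighbors.

    Simplified rule (an assumption): a contig is relabeled only when its
    labeled neighbors agree. Passes repeat until no label changes, so labels
    can spread along paths of unlabeled contigs.
    """
    neighbors = {contig: set() for contig in labels}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    labels = dict(labels)  # do not mutate the caller's dictionary
    changed = True
    while changed:
        changed = False
        for contig in sorted(labels):  # sorted for a deterministic order
            if labels[contig] is not None:
                continue
            seen = {labels[n] for n in neighbors[contig] if labels[n] is not None}
            if len(seen) == 1:
                labels[contig] = seen.pop()
                changed = True
    return labels


# Toy input: (class scores, contig length). The scores are made up.
contigs = {
    "c1": ({"phage": 0.97, "plasmid": 0.01, "prokaryote": 0.01, "eukaryote": 0.01}, 5000),
    "c2": ({"phage": 0.60, "plasmid": 0.10, "prokaryote": 0.25, "eukaryote": 0.05}, 4000),
    "c3": ({"phage": 0.40, "plasmid": 0.30, "prokaryote": 0.20, "eukaryote": 0.10}, 300),
    "c4": ({"phage": 0.02, "plasmid": 0.03, "prokaryote": 0.90, "eukaryote": 0.05}, 20000),
    "c5": ({"phage": 0.30, "plasmid": 0.30, "prokaryote": 0.30, "eukaryote": 0.10}, 200),
}
edges = [("c1", "c2"), ("c2", "c3"), ("c1", "c5"), ("c4", "c5")]

initial = {name: initial_label(scores, length) for name, (scores, length) in contigs.items()}
refined = refine_with_graph(initial, edges)
print(initial)
print(refined)
```

In this toy graph only `c1` and `c4` pass the thresholds initially. After refinement, `c2` and `c3` inherit the phage label from `c1`, while `c5` stays unclassified because its two labeled neighbors disagree. Leaving ambiguous contigs unlabeled rather than guessing is in keeping with the emphasis on precision for the minor classes.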