Background & Summary

Marine picocyanobacteria in the genera Prochlorococcus and Synechococcus are estimated to account for around 25% of oceanic net primary productivity1. Prochlorococcus is the smallest (0.5–0.7 μm in diameter) and most abundant photosynthetic organism on Earth, with an estimated global population of approximately 1027 cells2. Notably, it possesses the smallest genome of any free-living phototroph, with some strains containing genomes as small as 1.65 Mbp comprising roughly 1,700 genes3,4. This compact genome reflects its adaptation to oligotrophic, stable environments in the open ocean. However, the genetic diversity within Prochlorococcus populations is extensive, significantly contributing to their ecological resilience5. In contrast, Synechococcus has a broader biogeographical distribution, spanning tropical to subpolar regions and extending from coastal waters to the open ocean6. The expansive range of this genus is accompanied by a larger genome, which generally ranges from 2.2 to 2.86 Mbp and contains approximately 2,358 to 3,129 genes7,8. Its enhanced phenotypic plasticity and regulatory versatility enable it to acclimate to diverse and fluctuating environmental conditions9. However, despite numerous studies highlighting the ecological significance and genomic characteristics of Prochlorococcus and Synechococcus10,11,12,13, current genomic datasets still lack comprehensive representation of under-sampled ocean regions, associated bacteria and viruses14,15,16. These limitations constrain our understanding of the functions and adaptations of these species within global marine ecosystems.

Prochlorococcus and Synechococcus are ecologically significant in marine ecosystems through their symbiotic relationships with heterotrophic bacteria, which help to maintain population stability and ecosystem balance17. Additionally, both Prochlorococcus and Synechococcus are influenced by viral predation, which affects their population dynamics and genetic diversity18. Viral lysis releases organic matter and nutrients back into the environment, contributing to nutrient recycling19. Additionally, it facilitates horizontal gene transfer, enhancing the genetic diversity and adaptability of these picocyanobacteria20. The dynamic interactions among Picocyanobacteria, heterotrophic bacteria, and viruses highlight the complexity of marine microbial ecosystems and their significant role in global biogeochemical cycles. The currently available genomic datasets of associated bacteria and viruses are insufficient in terms of both quantity and representativeness21,22, which limits our ability to understand the importance of their interactions within ecosystems.

The limitations of second-generation sequencing technologies have contributed to the insufficient quality of current genomic datasets. The use of conventional sequencing methods often results in fragmented assemblies with lower completeness, leading to underestimates of microbial diversity and genomic variability23,24. Additionally, short-read sequencing often fails to resolve complex genomic regions, resulting in incomplete or biased representations of these genomes and their communities25. To address these limitations, we combined second- and third-generation sequencing to generate genome assemblies of Prochlorococcus, Synechococcus, and associated bacteria and viruses, achieving comprehensive genomic characterisation with fine resolution, rare variant detection, and deeper insights into microbial interactions.

To construct the dataset, we obtained 105 picocyanobacterial enrichment cultures from a total of 27 sampling stations across the Indian Ocean (2022), the South China Sea (2014, 2021), and the western Pacific Ocean (2022) (Fig. 1). All samples were sequenced using second-generation sequencing. Of these, 81 samples were further processed using a hybrid assembly approach that combined second-generation Illumina sequencing and third-generation Qitan nanopore sequencing. Metagenomic binning and refinement identified 55 Prochlorococcus genomes, including 23 high-light clade II (HLII) strains, one HL HLI strain, and 31 low-light clade I (LLI) strains. Additionally, 50 Synechococcus genomes were identified, consisting of 26 clade 5.1B strains, 16 clade 5.1 A strains, and eight clade 5.2 strains, including major subclades such as 5.1B clade I (n = 14), 5.1 A clade II (n = 8), 5.1B clade CRD1 (n = 5), 5.1B clade XVI (n = 3), and 5.1 A clade XV (n = 2). All these genomes exhibited high completeness (>98%) and low contamination (<2%). Of these, 42 genomes were assembled into single contigs (Fig. 2).

Fig. 1
figure 1

Sampling sites and methods of bioinformatic analysis used for this study. (a) Map of South East Asia, with sampling sites in the South China Sea, the Indian Ocean, and the western Pacific Ocean indicated by red dots (2014 sampling sites), green dots (2022 sampling sites), and blue dots (2021 sampling sites). (b) Study workflow for processing metagenome sequences.

Fig. 2
figure 2

Phylogenomic tree of 105 picocyanobacteria MAGs. One hundred and twenty universally conserved single-copy marker genes were used to build maximum-likelihood phylogenomic tree with 1000 bootstrap replicates. Detailed MAG taxonomy assignment and completeness and contamination information are provided in Table S1. In addition, 93 cyanobacterial reference sequences were included in the analysis, as detailed in Supplementary Table S4.

A total of 1,457 medium- and high-quality non-cyanobacteria metagenome-assembled genomes (MAGs) (completeness ≥50% and contamination ≤10%) were recovered from which 308 non-redundant MAGs were identified using dRep with a 95% average nucleotide identity (ANI) threshold for dereplication. These 308 medium/high-quality non-cyanobacteria MAGs covered 18 bacterial phyla, including major groups such as Pseudomonadota (n = 142), Bacteroidota (n = 89), Bacillota (n = 25), Actinomycetota (n = 10), and Myxococcota (n = 6) (Fig. 3). From these metagenomic assembly data, 2,113 non-redundant viral operational taxonomic units (vOTUs) were derived from 7,632 qualified viral contigs using a 95% ANI threshold over 85% of the length of the shorter contig. Among these vOTUs, 176 were identified as complete, 470 as high-quality, and 1,201 as medium-quality. Only 11.36% of the vOTUs were classified based on the RefSeqVirus database and Prokaryotic Viral RefSeq (v201) databases, leaving the majority unclassified. A total of 28 viral families were identified, including Mimiviridae (n = 74), Peduovirinae (n = 32), Duneviridae (n = 20), Winoviridae (n = 15), and Casjensviridae (n = 12), as summarised in Table 1.

Fig. 3
figure 3

Phylogenomic tree of 308 bacterial MAGs. One hundred and twenty universally conserved single-copy marker genes were used to build maximum-likelihood phylogenomic tree with 1000 bootstrap replicates. Detailed MAG taxonomy assignment and completeness and contamination information are provided in Table S2.

Table 1 Family-level annotation of viral operational taxonomic units (vOTUs) in samples.

Given the critical ecological roles of Prochlorococcus and Synechococcus in marine ecosystems, our dataset that includes these picocyanobacteria genomes as well as those of their associated bacteria and viruses fills significant gaps in current genomic resources. All of the primary reads, MAGs and vContigs have been deposited in the National Center for Biotechnology Information (NCBI) BioProject database and the figshare website. This dataset not only provides a comprehensive foundation for understanding the complex interactions among marine picocyanobacteria, their associated bacteria, and viruses, but also serves as a valuable resource for further studies on microbial ecology, evolution, and biogeochemistry under varying environmental and anthropogenic conditions.

Methods

Sample collection and preparation

Samples were collected from 27 different stations at depths of 25–150 m in the South China Sea (2014, 2021), the Indian Ocean (2022), and the western Pacific Ocean (2022) (Fig. 1). The isolation procedure followed the method described by Yan et al.13. In summary, seawater samples were obtained using a Niskin bottle and subjected to gravity filtration through double polycarbonate filters with a pore size of 0.6 μm (Millipore, Billerica, MA, USA), following the protocol described by Chisholm et al.26. The filtrate was then enriched by adding a Pro2 medium nutrient stock solution27 and incubated onboard for an initial enrichment period of 4 to 8 weeks. Flow cytometry was used to count fluorescent cells and primarily confirm the successful enrichment of picocyanobacteria populations. The enrichment cultures were maintained at a temperature of 22 °C under continuous light with an intensity of 10–20 μmol photons m−2 s−1.

DNA isolation, library preparation, and sequencing

Second-Generation Sequencing Method: Genomic DNA was extracted from 5-mL aliquots of the laboratory cultures using a TIANamp Bacteria DNA Kit (Tiangen, Beijing, China) following the protocol of Yan et al.13 followed by centrifugation at 12,000 × g for 30 minutes. The extracted DNA (1 µg) was fragmented using a Covaris ME220 Focused-ultrasonicator (Covaris, Woburn, MA, USA). The DNA library was constructed using the NEBNext® Ultra™ DNA Library Prep Kit for Illumina® (NEB, Ipswich, MA, USA) according to the manufacturer’s instructions. Paired-end sequencing was performed using the Illumina NovaSeq 6000 platform with a read length of 150 bp. Ten nanograms of the prepared library DNA was used for sequencing. All library preparation and sequencing procedures were performed at the Shanghai Majorbio Bio-pharm Technology Co., Ltd. (Shanghai, China).

Third-Generation Sequencing Method: The extracted DNA was processed using the QitanTech DNA Library prep kit QDL-E V1.1 (QitanTech, Chengdu, China), following the manufacturer’s protocol. Library quantification was performed with a Qubit fluorometer using the dsDNA HS Assay Kit, and size distribution was assessed with an Agilent 4200 Bioanalyzer (Agilent Technologies, Palo Alto, CA, USA). Sequencing was carried out on the QNome-3841 nanopore sequencing platform (QitanTech) using the QitanTech DNA Sequencing Kit QDS V1.1.

Metagenomic assembly, gene annotation, and quality control

All metagenomic assemblies were performed on the KBase platform (https://kbase.us), which provided an integrated environment for data processing and analysis28. The Narratives for the assembly processes were as follows: for the standalone second-generation sequencing assembly, refer to Kbase Narrative (https://narrative.kbase.us/narrative/175840); for the third-generation and second-generation hybrid assembly, refer to Kbase Narrative (https://narrative.kbase.us/narrative/186419).

For second-generation sequencing data, raw sequencing reads were initially quality assessed using FastQC (version 0.12.1)29 to identify potential quality issues, including low-quality bases, GC content anomalies, and adapter contamination. Reads were then trimmed and quality-filtered using Trimmomatic (version 0.36)30 with the following parameters: SLIDINGWINDOW:4:15, LEADING:3, TRAILING:3, CROP:140, HEADCROP:5, MINLEN:36. For third-generation sequencing data, quality control was performed using NanoPlot (version 1.32.0) to assess read quality, length distribution, and potential biases. Reads were subsequently filtered to remove low-quality reads and short sequences using NanoFilt (version 2.8.0)31 with a minimum quality score of 7 and a minimum read length of 500 bases.

Clean reads from the second-generation sequencing were assembled using metaSPAdes (k-mer sizes: 33, 55, 77, 99, and 127)(v3.15.3)32, with a minimum contig length set to 2000 bases. Clean reads from both second- and third-generation sequencing technologies were assembled into contigs using HybridSPAdes (k-mer sizes: 21, 33, and 55) (v3.15.3)33, with a minimum contig length set to 2000 bases. Contigs longer than 2 kb were selected for metagenomic binning. Multiple binning tools were used to recover metagenome-assembled genomes (MAGs). Specifically, MetaBAT2 (v1.7)34, MaxBin2 (v2.2.4)35, and CONCOCT (v1.1.0)36 were used for binning, each with the minimum contig length parameter set to 2000 bases. The resulting bins were then consolidated using DAS Tool (v1.1.2)37 to generate a refined set of high-quality bins. The completeness and contamination of MAGs were estimated by running CheckM (v1.0.18)38.

The MAGs meeting the criteria of ≥50% completeness and ≤10% contamination were subsequently clustered using dRep (v3.4.251)39 at the 95% ANI threshold (-sa 0.95 -comp 50 -con 10), resulting in a total of 308 MAGs of associated bacteria. The taxonomy of each MAG was assigned using GTDB-Tk (v2.4.0)40 based on the Genome Taxonomy Database (GTDB v220). In addition, MAGs were functionally annotated using RASTtk (v1.073)41.

Viral contig identification, dereplication, and taxonomic classification

After metagenomic assembly, contigs with lengths ≥10 kb were selected for viral identification. Putative viral contigs were identified using VirSorter2 (v2.2.1)42 and VIBRANT (v1.2.1)43, both with default settings. Contigs with a VirSorter2 score ≥0.5 and all viral contigs detected by VIBRANT were retained. The viral contigs identified by these two pipelines were then consolidated for downstream analysis.

Subsequently, the viral contigs were subjected to quality assessment and trimming using CheckV (v0.7.0)44. Contigs containing host genes but lacking viral genes were excluded according to the standard operating procedures of VirSorter2. After this quality check, the remaining contigs were clustered using CheckV scripts (https://bitbucket.org/berkeleylab/checkv/src/master/) to generate a non-redundant set of species-level vOTUs. Clustering was based on a 95% ANI threshold over 85% of the length of the shorter contig, following established protocols45.

To assign taxonomy to the obtained vOTUs, two pipelines were employed: a protein-sharing network approach using vConTACT2 (v0.9.19)46 and a homology-based search using BLASTp (v2.15.0)47 alignment against viral sequences in RefSeq at the NCBI. Initially, open reading frames were predicted using Prodigal (v2.6.3)48 with the ‘-p meta’ parameter, and the resulting protein sequences were subsequently used for taxonomic analysis. For the protein-sharing network approach, vConTACT2 was run on the KBase platform with default parameters. In parallel, the same set of viral proteins used in vConTACT2 were aligned against the RefSeq viral protein database using BLASTp. A vOTU was assigned to a specific viral family if more than 50% of its proteins matched proteins in that family with a bit-score of ≥50.

Phylogenomic tree reconstruction

One hundred and twenty conserved single-copy genes were extracted from the MAGs using GTDB-Tk (v2.4.0). MUSCLE (v5.1)49 was used to align the marker gene sequences extracted from MAGs, and then trimAl (v1.4.1)50 was used to prune the alignments. Phylogenomic trees were constructed using IQ-TREE (v2.3.4)51 with the model (-st AA -m LG + PMSF + G -B 1000–bnni). The confidence of the maximum-likelihood tree was estimated using 1000 bootstrap replicates. In addition, 93 cyanobacterial reference sequences were included in the analysis, as detailed in Supplementary Table S4.

Data Records

Raw reads generated in this study have been deposited in the NCBI Sequence Read Archive under accession number SRP53972652. The metagenome-assembled genomes (MAGs) of Prochlorococcus, Synechococcus, and associated bacteria have been deposited in the NCBI BioProject database under accession number PRJNA117545453. Detailed accession numbers for these MAGs are provided in Supplementary Table S1. The cyanobacterial genomes, Non-cyanobacterial MAGs and vContigs have been deposited on the figshare website (https://figshare.com/projects/Genomes_of_Prochlorococcus_Synechococcus_bacteria_and_viruses_recovered_from_marine_picocyanobacteria_cultures_based_on_Illumina_and_Qitan_nanopore_sequencing/234704)54,55,56.

Technical Validation

All raw data processing steps, including software and parameters used in this study, were described in the Methods section. The quality of clean reads was assessed using FastQC v0.12.1. The qualities of the MAGs and vContigs were assessed using CheckM v1.0.18 and CheckV v0.7.0, respectively.