Background & Summary

The Cyprinidae, the largest family of freshwater fishes, has a remarkable global distribution, with particularly high diversity in Asia1,2. This family is characterized by exceptional adaptability, allowing rapid radiation and colonization of diverse habitats, particularly in East Asia3,4,5. The ecological success of cyprinids in this region can be attributed to their high reproductive capacity, broad ecological tolerance and efficient dispersal mechanisms6,7. These traits have allowed cyprinids to thrive in a wide range of environmental conditions, from mountain streams to lowland rivers, often becoming dominant components of freshwater ecosystems8,9. Understanding the mechanisms that drive their distribution and adaptability not only provides insights into their evolutionary success, but also informs conservation strategies to maintain biodiversity and ecological balance in rapidly changing environments.

Gold barb (Barbodes semifasciolatus), a member of the family Cyprinidae, exhibits strong tolerance to extremely acidic environments, making it an excellent model for studying the adaptive mechanisms of cyprinids in response to harsh conditions10. In this study, we assembled two haplotype resolved chromosome-level genomes (776 and 779 Mb) for the gold barb using PacBio’s advanced highly accurate long-read sequencing and chromosome conformation capture technology, achieving a contig N50 of 23.07 Mb and 26.46 Mb. The continuity and accuracy of the genome was confirmed by conserved core gene analysis (BUSCO scores) and synteny assessment, establishing it as one of the highest quality cyprinid genomes assembled to date. Overall, this high quality genome will be a great resource for future research into the adaptive evolution of cyprinids.

Methods

Sample collection

We obtained a gold barb (Barbodes semifasciolatus) from the highly acidic waters of the Guangzhou Conghua Nature Reserve for Tanichthys albonubes in China. The fish were anaesthetized with MS-222 as in our previous study11, and samples of muscle, liver, blood, gills, brain and kidney tissues were rapidly collected. These tissues were immediately frozen in liquid nitrogen and stored at −80 °C. Muscle tissue was used for PacBio’s advanced highly accurate long-read sequencing (HiFi), liver tissue for chromosome conformation capture (Hi-C) technology and a pooled transcriptome of all collected tissues was prepared for RNA-seq sequencing.

Genomic long-read sequencing (HiFi)

We followed PacBio’s standard protocol (Pacific Biosciences, California, USA) to generate genomic data using the PacBio Sequel II platform. The sequencing yielded 1,895,794 clean reads with a total of 34.76 Gb of genomic data, achieving an average read length of 18.34 kb. To ensure high-quality data, reads were filtered for adapter sequences and low-quality bases. The filtering criteria were set to remove reads with a length shorter than 0.5 kb and those with a quality score lower than 0.80.

Iso-Seq Library Construction and Sequencing

Iso-Seq libraries were prepared using the PacBio SMRTbell prep kit 3.0, following the manufacturer’s guidelines. This process resulted in the generation of 49,885,551 long reads totaling 117.61 Gb, with an average read length of 2.35 kb. The libraries were loaded onto the PacBio Sequel II platform, and sequencing was carried out using the SMRT cells. Post-sequencing, long reads were filtered using the PacBio Circular Consensus Sequence (CCS) algorithm to ensure only high-quality consensus sequences were retained.

RNA Extraction and cDNA Library Construction

Total RNA was extracted from pooled tissue samples using the TRIzol reagent (TIANGEN, Cat # DP424, China) following the manufacturer’s protocol. The extracted RNA was quantified and assessed for integrity using the Agilent 2100 Bioanalyzer. Only high-quality RNA samples (RIN > 7.0) were used for further processing. Reverse transcription was performed to synthesize cDNA from the RNA samples, following standard protocols with random hexamers and SuperScript III Reverse Transcriptase (Thermo Fisher Scientific, USA). The cDNA libraries were then constructed using the Illumina TruSeq Stranded mRNA Library Prep Kit (Illumina, USA). The constructed cDNA libraries were sequenced using the Illumina NovaSeq 6000 platform (Illumina, USA). Paired-end sequencing was performed with a read length of 150 bp. During sequencing, the Illumina software automatically filtered out reads of low quality, ensuring that only high-quality reads were retained for downstream analysis. After sequencing, raw reads were processed using the fastp12 tool (v0.20.0) to remove adapter sequences, low-quality bases (with Phred score < 20), and short reads. The resulting dataset consisted of 58,426,944 high-quality short reads, corresponding to 8.7 Gb of sequencing data.

Hi-C Library Preparation and Sequencing

Genomic DNA was extracted from muscle and cross-linked with biotinylated nucleotides using formaldehyde. The DNA was then digested with the restriction enzyme DpnII, followed by ligation of the DNA ends to form chimeric fragments. The biotinylated ligation products were captured on streptavidin beads and the libraries were amplified by PCR. Then Hi-C libraries were sequenced on the Illumina NovaSeq 6000 platform using paired-end sequencing (150 bp). Library concentration and fragment size were verified prior to sequencing. Raw sequencing data were processed to remove low quality reads, adapter sequences using fastp12 tool (v0.20.0), followed by alignment to the reference genome using the BWA-MEM v0.7.1213.

Chromosome level genome assembly of gold barb

By integrating HiFi reads and Hi-C reads, we assembled the gold barb genome using Hifiasm v0.16.114 with default parameters. This process produced two haplotype-resolved assemblies (hap1 and hap2) (Table 1). The resulting haplotype-resolved assemblies were 776 Mb and 779 Mb in size, containing 143 and 100 contigs respectively, with N50 of 23.07 Mb and 26.46 Mb (Table 1). The genome size is very comparable to that of its closely related cyprinid species (Puntigrus tetrazona). We then aligned the Hi-C reads to the two haplotype-resolved assemblies using BWA v0.7.1213. Chromosome-level assemblies were then constructed for both haplotypes using Haphic15, resulting in 25 high quality chromosomes for each haplotype. Next, manual curation of potential assembly errors was performed using JuiceBox v2.20.0016, with contigs lacking obvious interaction relationships treated as single scaffolds. The haplotype-resolved assemblies had anchoring rates of 99.6% and 99.7%, respectively. Each chromosome consisted of between 1 and 6 contigs, with four chromosomes being gap-free (Fig. 1A,B). We further employed quarTeT17 to predict the telomeric and centromeric regions of these 25 chromosomes. The centromeric regions were subsequently validated by their intra-chromosomal interaction patterns. For example, in the 18,754,673–19,992,527 bp region on chromosome 13, we found that this region is largely devoid of coding genes, enriched with tandem repeat sequences (TR), and flanked by transposable elements (TE). Moreover, Hi-C data revealed minimal chromatin interactions in this region, suggesting its potential as a centromeric region (Fig. 1C,D). The BUSCO (v5.5.0) score based on actinopterygii_odb10 is 98.1%, comprising 96.2% Complete and single-copy and 1.9% Complete and duplicated. Finally, we performed a synteny analysis of the genomes of zebrafish and gold barb using JCVI v0.9.1318, which revealed a strong syntenic relationship between the two species (Fig. 1E).

Table 1 Statistics of the genome assembly.
Fig. 1: Genome assembly of haplotype 2 for the gold barb (B. semifasciolatus).
figure 1

(A) Hi-C linkage density heat map, where the x and y axes represent genomic positions. Red dots indicate regions of high linkage density, suggesting that these regions are more likely to belong to the same chromosome. Blue boxes indicate potential chromosomes, while green boxes indicate contigs. (B) Distribution of contigs among the 25 chromosomes. Chromosomes with free gaps are shown in orange. (C) Assembly of the 25 chromosomes showing the distribution of gaps, gene density and putative telomeric and centromeric regions. Blue triangular regions indicate putative telomeric regions, orange boxes represent the locations of assembly gaps, and concave regions mark potential centromeric regions. (D) Detailed view of chromosome 13 showing the distribution of genes, repetitive sequences, transposable elements and the Hi-C linkage density heat map. Abbreviations: TE, transposable elements; TR, tandem repeats. (E) Synteny between the chromosomes of the gold barb and the 25 chromosomes of zebrafish. Each line represents a syntenic block, highlighting homologous regions between the two species. The 25 zebrafish chromosomes are shown on one axis, while the corresponding gold barb chromosomes are displayed on the other axis.

Annotation of repetitive sequences

We used both homologous and de novo-based approaches to annotate the repetitive sequences in the gold barb genome. For homologous methods, we used RepeatProteinMask19 and RepeatMasker20 to align transposable elements (TEs) at the protein and DNA levels, respectively. Tandem repeats were then annotated using TRF21 with the following parameters: trf 2 7 7 80 10 50 2000 -d -h. For the species-specific repeat library, we used RepeatModeler (version 1.73) to identify consistent classification sequences, which were then used to run RepeatMasker. Finally, we merged all annotated repeat sequences using bedtools22 (v2.25.0). In total, repeat elements account for 31.22% of the genome (Table 2 and Fig. 2).

Table 2 Statistics of repeat elements genome assembly.
Fig. 2: Circos plots show the distribution of genomic components in the gold barb.
figure 2

Genomic features are shown in 1,000,000 bp windows. From the outermost circle to the innermost circle, the following features are shown: gold barb chromosomes, gene frequency across the genome, density of DNA transposable elements, density of long interspersed nuclear elements (LINEs), density of long terminal repeats (LTRs), density of short interspersed nuclear elements (SINEs), density of other transposable elements and density of unknown genomic elements. The innermost links represent the collinearity relationships between different haplotypes.

Annotation of protein-coding genes

For protein-coding gene prediction, we used a combination of three approaches as in previous studies23,24,25: ab initio prediction, homology-based prediction and transcriptome-based prediction. Transcriptome data were generated using two sequencing methods: Paired-end RNA-seq and full-length transcriptome sequencing. TransDecoder26 v5.5.0 was used to predict proteins from paired-end RNA-seq transcripts, while IsoQuant27 v3.6.2 was used to predict proteins from full-length transcripts. For homology-based prediction, genomic data from related species were downloaded from NCBI, including Carassius gibelio (GCF_023724105.1), Cyprinus carpio (GCF_018340385.1), Carassius auratus (GCF_003368295.1), Labeo rohita (GCF_022985175. 1), Carassius carassius (GCF_963082965.1), Onychostoma macrolepis (GCF_012432095.1), Puntigrus tetrazona (GCF_018831695. 1), Sinocyclocheilus graham (GCF_001515645.1), Sinocyclocheilus rhinocerous (GCF_001515625.1) and Sinocyclocheilus anshuiensis (GCF_001515605.1). These genomes were used for comparative gene structure alignment. Finally, EVidenceModeler28 v1.1.1 was used to integrate the results of the three prediction approaches into the final gene set, resulting in 26,057 protein-coding genes and 2,087 pseudogenes.

Data Records

Hi-C data, full length Full-length transcriptome data and transcriptome data were deposited in the National Center for Biotechnology Information (NCBI) SRA database29 under the accession numbers SRR31825773, SRR31825774 and SRR31825775, respectively. HiFi data were deposited in the NCBI SRA database29 under the accession numbers SRR31825776 and SRR31825777. The assembly genome data of the two haplotypes were deposited at GenBank under accession JBMAGD00000000030 and JBMAGE00000000031. Genome annotations were deposited in the Figshare database32.

Technical Validation

The completeness of the genome assemblies was assessed using BUSCO v5.5.033. The Hap2 results showed a BUSCO completeness of 98.1%, including 96.2% single-copy genes, 1.9% duplicated genes, 0.7% fragmented genes, and 1.2% missing genes. The analysis involved mapping the PacBio and Illumina sequencing reads to Hap2 using Minimap2 v2.2834, BWA v0.7.1213 and SAMtools v1.2.035. The mapping rates for HiFi, Hi-C, and RNA-HiFi data were 100%, 99.91%, and 99.84%, respectively.