Background & Summary

Biological invasions represent a widespread global issue, posing challenges for conserving biodiversity, ecological stability, and environmental resources, as well as causing significant social challenges and economic losses1,2. The proliferation of alien plants, notably driven by the rapid expansion of international trade, tourism, and transportation, constitutes a pivotal element in the broader scope of biological invasion, holding significance in social economics, ecology, and evolutionary studies.

Ageratina adenophora (Spreng.) R.King & H.Rob., commonly known as crofton weed, is a perennial, evergreen, semi-shrub in the Asteraceae family3, indigenous to Central America. It was firstly introduced to Europe as an ornamental plant in the 19th century and later spread to Australia and Asia4,5. Nowadays, it emerged as highly destructive alien invasive weed globally, widely problematic in the United States, Australia, Indian Ocean islands, and Pacific Ocean islands as well as Southern Asia and Eastern Asia. In China, it ranks among the top 10 most destructive invasive species6,7. Crofton weed exhibits various biological characteristics contributing to its success as an invasive species. Firstly, A. adenophora contains many active substances exerting strong allelopathic effects on other plant species and poisoning animals, enabling it to dominate ecological niche. For example, flavonoids, serve as a protective mechanism against biotic and abiotic stressors, including herbivores, pathogens, ultraviolet radiation, and high temperatures. Furthermore, A. adenophora has a high reproduction coefficient, producing ~10,000 small, widely dispersed seeds per plant. Its exceptional adaptation to harsh conditions and efficient soil nutrient absorption, coupled with a robust rhizome system, facilitate rapid invasion, colonization, and expansion8. Therefore, a comprehensive understanding of the genetic basis underlying these biological characteristics is promising for not only efficiently controlling A. adenophora, but also providing valuable genes for crop improvement through genetic engineering.

Although previous studies have explored the biological characteristics of crofton weed, its genetic resource and genomic data are limited, hindering in-depth investigations into its invasive mechanism. Here, we employed a diverse array of sequencing technologies and assembly strategies (Fig. 1) to successfully construct a chromosome-level reference genome of A. adenophora with genome size ~3.82 Gb and a scaffold N50 of 70.8 Mb. In summary, our study provides the valuable genomic resources for further exploring the invasion mechanisms and control strategies of Crofton weed.

Fig. 1
figure 1

Flowchart of genome assembly and quality control for the A. adenophora. Various sequencing technologies, including Illumina, BAC, PacBio, BioNano, and Hi-C, was utilized to achieve accurate genome assembly. Multiple quality control methods were employed for evaluation.

Methods

Plant materials and sequencing

The crofton weed samples in this study were obtained from Tengchong County (N 25°529″ 204″, E 98°45″220″) in Yunnan Province, China. Emphasis was on young leaves, providing high-quality genomic DNA for subsequent genome sequencing and Hi-C analysis. Genomic DNA from A. adenophora was extracted using the modified CTAB method9. DNA concentration and quality were assessed using the NanoDrop 2000 (Thermo Fisher, USA) and Qubit fluorometer (Thermo Fisher, USA). Subsequently, libraries with 270 bp insert fragments were constructed following standard Illumina procedures, and sequenced on the Illumina platform (Illumina, USA) with a PE150 strategy, generating a total of 92.83 Gb (~34×) clean Illumina reads. PacBio DNA sequencing libraries with 30 kb insert size were constructed following PacBio’s recommended protocol (Pacific Biosciences, CA, USA) and sequenced on the PacBio Sequel platform, resulting in 206 Gb (~54×) of raw data with an average length of 13,666 bp. Simultaneously, 28.81 Gb of PacBio HiFi reads (average length 10,505 bp) were generated for genome polishing analysis. The BAC library was constructed and sequenced by Nanjing Hong-Yuan Biotechnology Company Limited (Nanjing, China). Hi-C libraries were constructed following previous protocol10, and sequenced with a 2 × 150 bp read length on the Illumina platform, resulting in a total of 398.39 Gb (~104×) of high-quality Hi-C clean data. The BioNano Genomics Irys system (BioNano Genomics, USA) was employed to generate optical maps and about 797.39 Gb (~209×) of high-quality optical molecular data (length >100 kb, label signal-to-noise ratio of 3.0, average molecule intensity <0.6) were obtained.

The method by Yang et al.11 was utilized for total RNA extraction and cDNA synthesis from A. Adenophora roots, leaves, and flower. mRNA sequencing libraries, constructed on the Illumina NovaSeq platform with 150 bp paired-end sequencing technology, underwent three biological replicate experiments per sample. For full-length transcriptome sequencing, root, leave, and flower samples were prepared following the PacBio Iso-seq experimental workflow and sequenced on the PacBio Sequel platform.

De novo assembly of a phased triploid genome

Fluorescence in situ hybridization (FISH) confirmed 51 chromosomes in A. adenophora (Fig. 2a). Smudgeplot12 analysis indicated an “AAB” genome structure for A. adenophora (Fig. 2b). The genome size estimation was conducted using flow cytometry, with rice serving as the standard reference genome. The measured fluorescence intensity ratio between A. adenophora and rice was approximately 5.1, which was used to estimate the genome size of A. adenophora at 3.96 gigabases (Gb) (Fig. 2c). We delineated the intricacies of the A. adenophora genome by employing multiple sequencing technologies and a meticulous assembly strategy (Fig. 1). Initially, PacBio Sequel long reads were used for contig-level assembly by Canu v1.913 (the parameter set as genomeSize = 3.96 g), and polished by Pilon14 (v1.2229)(parameters:–min depth 10–changes–fix bases) and Arrow v7.01 (Pacific BioScience). The assembled contig size is 3.46 Gb with an N50 of 1.52 Mb. Hybrid genome assembly was performed by utilizing BioNano Solve v3.0.13 (https://bionanogenomics.com/support/software-downloads/) with the parameters “-B 2 -N 2”, and over 93.35% of contig sequences were assembled into super scaffolds, achieving scaffold N50 of 27.33 Mb. Subsequently, we use LACHESIS15 (version 20171221) software to make chromosomal assembly with optimized parameters CLUSTER_MIN_RE_SITES = 100, CLUSTER NONINFORMATIVE RATIO = 1.5, CLUSTER_MAX_LINK_DENSITY = 2, ORDER MIN N RES IN SHREDS = 60, ORDER MIN N RES IN TRUNK = 60. Analysis of the anchored chromosome interaction heatmap revealed a distinct pattern, with one haplotype exhibiting a stronger interaction signal than the other two. Accordingly, we isolated the haplotype with the stronger interaction signals from the three chimeric haplotypes by Juicebox Assembly Tools (JBAT v1.1)16 software, resulting in a non-redundant haplotype-resolved assembly comprising 17 chromosomes, designated as haplotype B. Utilizing haplotype B as the reference, we applied the ALLHIC17 method to distinguish the remaining haplotypes, A1 and A2. Consequently, we obtained a haplotype-resolved, chromosome-level genome assembly for A. adenophora, with a total size of ~3.82 Gb and a scaffold N50 of 70.8 Mb (Fig. 2d and Table 1). Notably, this assembly represents 96.46% of the genome size estimated through flow cytometry measurement (Fig. 2c). A significant portion of scaffolds, totaling 3.55 Gb (92.98%), successfully anchored to 51 pseudo-chromosomes. These pseudochromosomes form 17 homologous groups, each comprising three allelic chromosomes (A1: 1.16 G, A2: 1.17 G, and B: 1.22 G) (Table 2).

Fig. 2
figure 2

Overview of A. adenophora genome. (a) The A. adenophora karyotype. DAPI-stained metaphase chromosomes are shown in the left two parts and demonstrated 51 chromosomes in A. adenophora. Three homologous chromosomes signal was detected by the hybridization sites of 18S (red) and the 5S (green) rDNA probes and shown in the right two parts. Bar 10 μm. (b) Smudgeplots analysis based on density of k-mer for A. adenophora. (c) Genome size estimation by utilizing flow cytometry. The rice with genome size 385.7 Mb was used as the standard sample. The fluorescence intensity fold between A. adenophora and rice was ~5.1, and the genome size of A. adenophora was estimated to be 3.96 Gb. (d) Chromosomal features of A. adenophora genomes: a chromosome ideograms. b Transposable element in each chromosome. c Gene density content in each chromosome. d 5 mC DNA methylation levels obtained from the Pacbio HiFi data. e SNP density. f InDel density. g gene expression levels. (e) BUSCO scores of the three haplotypes of A. adenophora.

Table 1 Genome assembly and annotation statistics for A. adenophora.
Table 2 Statistics of chromosomes length for A. denophora.

Benchmarking Universal Single-Copy Orthologs (BUSCO)18 analysis revealed that ~97.71% of BUSCO genes were completed in our assemblies (Fig. 2e). The annotation of LTRs revealed an LTR Assembly Index19 (LAI) score of 18.53 for the A. adenophora genome (Table 1). The package Merqury20 (v1.3) was used to assess the quality and completeness of the genome using short sequencing reads of A. adenophora and the result indicated a base accuracy of the genome was over 99.97% (QV > 35.43), and k-mer completeness estimated at 97.30% (Table 1). Collectively, these results indicate a high quality of the A. adenophora genome assembly.

Structural variations among allelic chromosomes

Structural variation analysis between haplotypes was conducted using MUMMER v4.021 software. The “nucmer” command facilitated genome alignment, with parameters set as “–maxmatch -c 500 -b 500 -l 100”. Subsequently, the delta-filter command was applied to refine alignment results, using parameters “-i 90 -l 1000 -m” to ensure a minimum 1 Kb matching and at least 90% similarity. The SyRI (v1.6)22 software was then employed to identify structural variations among three haplotypes, with parameters set as “–allow-offset 100 –unic 2000”. To ensure result accuracy, only DUP types with a minimum 50% overlap with SYN were retained. All variations were required to have a minimum length of 30 Kb. Finally, a total of 1,352 structural variations >30 kb in length between haplotypes were detected. Notably, 46 extra-large inversions >1 Mb were detected, comprised ~57.80% of the cumulative structural variation length. The most of largest inversion between haplotype was localized on chromosome 14, spanning ~40 Mb and constituting 57.88% of its length (Fig. 3a). Hi-C and optical mapping method were confirmed the accuracy of the inversion (Fig. 3b and c).

Fig. 3
figure 3

Characterization of genomic structural variations between the A. adenophora haplotypes. (a) Overview of syntenic blocks across the three haplotypes. Gray lines represent the syntenic regions, while orange lines represent the inversions. (b) Identification of large inversions in Chr14. The upper two heatmaps show a chromatin 300-kb interaction matrix, including mapping Hi-C data of A1_14 against the haplotype B genome (A1_map_B), mapping Hi-C data of B_14 against the haplotype A1 genome (B_map_A1). The middle panel shows the syntenic blocks across A1, B and A2 in chromosome 14. The largest inversion regions are shown by orange lines. The lower two heatmaps for A2_14 show a chromatin interaction heatmap with a similar mapping strategy as the upper track. (c) Consistent alignment of the BioNano contigs with the PacBio assembly demonstrates the inversion accuracy in the PacBio assembly. The red arrows indicate inversion breakpoints, while the black lines shown collinearity between the PacBio assembly and BioNano maps.

Transposable elements annotation

To build a comprehensive repeat sequence library for A. adenophora, RepeatModeler23 (v2.02), LTR-FINDER24 (v1.05), MITE-hunter(20100819)25, and PILER-DF26 (v1.0) were used with default parameters. This sequence library was merged with the Repbase27 database, and sequences were classified into different categories using PASTEClassifier.py28. Finally, RepeatMasker v.4.1.129 was used to mask the genome with the finalized repeat library.). Finally, a total of 3.16 Gb (76.44%) repetitive sequences were identified in the A. adenophora genome (Table 3). Among these, 881.04 Mb (76.00%), 893.83 Mb (76.34%), and 982.94 Mb (80.55%) of TEs were in the haplotype A1, A2, and B, respectively. We conducted a comparative analysis among several closely related Asteraceae family species, revealing a notably elevated LTR ratio in A. adenophora (Fig. 4a). To elucidate LTR expansion, we identified 55,232 full-length LTR-RTs (A1:17,721, A2:17,646 and B:19,865) in A. adenophora using LTRretriever30 (v2.9.8), with 61.71% in the Gypsy subfamily and 16.89% in the Copia subfamily.

Table 3 Repeats elements statistics in genome of A. denophora.
Fig. 4
figure 4

LTR retrotransposon accumulation and insertion analysis. (a) Comparison of repetitive sequences contents in Asterids. (b) The insertion time distribution of intact LTRs in different Asteraceae species. Mya indicates million years ago.

Furthermore, we calculated the LTR insertion time. Flanking sequences on both sides of LTRs were aligned using MAFFT31 (v7.205) (parameters:–local pair–max iterate 1000). Subsequently, the Kimura model in EMBOSS32 (v6.6.0) was employed to calculate the distance (K). The formula for LTR insertion time is T = K / (2 × r), with the molecular clock rate (r) set at 7 × 109. The results showed a recent burst expansion [<0.5 million years ago (Mya)] shared among all three haplotypes, consistent with H. annuus and M. micrantha, but later than L. sativa (~1.9 Mya) and C. cardunculus (~2.4 Mya) (Fig. 4b).

Gene annotation

We integrated homology-based, de novo, and transcriptome prediction methods gor gene annotation. GenScan33 (v1.0), Augustus34 (v2.4), GlimmerHMM35 (v3.0.4), GeneID36 (v1.4), and SNAP37 were utilized for ab initio prediction. Homology-based prediction employed GeMoMa38 (v1.4.2) with genome information from Oryza sativa L. ssp. japonica, Arabidopsis thaliana, M. micrantha, and H. annuus. Additionally, the clean RNA-seq reads from different tissues were mapped to the genome using HISAT39 (v2.1.0), and the alignments were then input to Trinity40 (v. 2.2.0) by running genome-guided mode to make de novo transcriptome assembly into unigenes. The unigenes plus with the PacBio full-length cDNA were then aligned to the genome using BLAT41 (v35), and then followed by PASA v2.0.242 for transcriptome prediction. EVidenceModeler v1.1.143 integrated homologous-based, transcriptome, and ab initio predictions, producing a unified gene model updated by PASA. The maximum intron length was set to 20 kilobases (kb) for all software tools mentioned above. Finally, we identified 123,134 protein-coding genes in the A. adenophora genome (Table 1). For gene function annotation, DIAMOND v0.9.2844 was used to align the predicted protein sequences against NCBI non-redundant protein (NR), eggNOG45, Swissprot and TrEMBL46 databases with a cutoff value of 1e-5. HMMER47 (v3.1b2) was used for search Pfam48 database for protein domain annotation, and Gene Ontology49 (GO) term annotation was obtained from InterProScan50 (v4.3). Additionally, Kyoto Encyclopedia of Genes and Genomes51 (KEGG) pathway annotation used the KEGG Automatic Annotation Server52 (KAAS). Approximately 99.03% (121,934) of gene models were functionally annotated in Swissprot, TrEMBL, NR, KEGG, GO, eggNOG, or Pfam databases (Table 4).

Table 4 Function annotation of predicted gene model in A. denophora.

Data Records

All sequencing raw data used in this study53 and the Whole Genome Shotgun (WGS) assembly54 have been submitted to the National Center for Biotechnology Information (NCBI) via BioProject ID PRJNA1096832. The genomic Illumina sequencing data were deposited in the Sequence Read Archive at NCBI SRR28607150 and SRR28607151. PacBio CLR DNA sequencing data is available under the NCBI Sequence Read Archive accession number SRR28607117 and PacBio HiFi DNA sequencing data are under SRR28607109 and SRR28607110. PacBio Iso-Seq data for all tissues (flowers, root and leaf) are available under the NCBI Sequence Read Archive accession numbers SRR28607137, SRR28607138 and SRR28607140-SRR28607143. The RNA-seq data (flowers, root and leaf) are available under the NCBI Sequence Read Archive accession number SRR28607106-SRR28607108 and SRR28607144-SRR28607149. The short read sequences for Hi-C sequencing have been deposited in the SRA accessions SRR28607128 and SRR28607139. The final genome assembly, structural variations, transposable elements, gene structure and function annotation were deposited in the Figshare55 database.

Technical Validation

We evaluated the continuity of the genome, and the results indicated that the contig N50 value reached 1.22 Mb (Table 1). This is a significant improvement compared to the previously reported triploid banana genome, which had a contig N50 of 1.08 Mb56. Benchmarking Universal Single-Copy Orthologs (BUSCO) analysis revealed that ~97.71% of BUSCO genes were completed in our assemblies (Table 1and Fig. 2e), comparable to percentage reported in cultivated hexaploidy of C. morifolium57. Long Terminal Repeats (LTRs), crucial for assessing assembly quality in repetitive sequences and intergenic regions. The annotation of LTRs revealed an LTR Assembly Index (LAI) score of 18.53 for the A. adenophora genome (Table 1), meeting standard expected for a reference genome58. Comparison with whole genome sequencing short reads of A. adenophora indicated a base accuracy of the A. adenophora genome was over 99.97% (QV > 35.43), and k-mer completeness estimated at 97.30% (Table 1).

The chromosome interaction heatmap highlighted the grouping of 51 pseudochromosomes into 17 homoeologous clusters, each cluster comprising three allelic chromosomes (Fig. 5). To further affirm the phased genome’s accuracy, we utilized 971 Bacterial Artificial Chromosome (BAC) sequences (N50 of 114.26 kb), calculating a switch error of 4.20% between haplotypes (Fig. 6). The most of largest inversion between haplotype B and A was localized on chromosome 14, spanning ~40 Mb and constituting 57.88% of its length (Fig. 3a). Discrete chromatin interaction signals around breakpoints were observed through inter-haplotypes mapping of Hi-C data (Fig. 3b), and the inversions were also confirmed by BioNano optical maps (Fig. 3c).

Fig. 5
figure 5

Hi-C interaction heatmap of 17 homoeologous chromosome groups in A. adenophora. The Hi-C data was aligned to the A. adenophora genome. The heatmaps for haplotype A1, A2, and B of each group are shown at a resolution of 300 Kb. The dark red dots indicate a high probability of interaction, while the light-yellow dots indicate a low probability of interaction.

Fig. 6
figure 6

Consistency plot of BAC sequences with each chromosome of the A. adenophora genome. The BAC sequences ID were drawn in Y axis.