Keywords
genome assembly, invasive species, fruit fly, tephritidae, pest
This article is included in the Agriculture, Food and Nutrition gateway.
Here, we present novel high-quality genome assemblies for five invasive tephritid species of agricultural concern: Ceratitis capitata, C. quilicii, C. rosa, Zeugodacus cucurbitae and Bactrocera zonata (read depths between 65 and 78x). Three assemblies (C. capitata, C. quilicii and Z. cucurbitae) were scaffolded with chromosome conformation data and annotated using RNAseq reads. For three species (B. zonata, C. quilicii and C. rosa) this is the first reference genome available; for the other two (C. capitata and Z. cucurbitae) we provide improved annotated genomes. Together, the new references provide an important resource to advance research on genetic techniques for population control, develop rapid species identification methods, and support eco-evolutionary studies.
A significant number of phytophagous insects within the dipteran family Tephritidae (the “true” fruit flies) are considered serious pests of fruits and vegetables worldwide (White & Elson-Harris 1992). Globalization has led to a surge in intercontinental trade and movement, and has increased the number of incursions of harmful non-native fruit fly species (Bragard et al. 2020). Many countries have put costly and elaborate phytosanitary measures in place to prevent entry and establishment of harmful fruit fly species (Bragard et al. 2020; Papadopoulos et al. 2023a, 2023b). Making resources available that give researchers better tools for studying fruit fly pests is therefore becoming increasingly important. Agricultural areas with a suitable climate for fruit fly pests are rapidly increasing around the globe (Sultana et al. 2020), changing the distribution patterns of fruit fly pests (Ni et al. 2011). This has led to more fruit fly incursions and to first detections of new fruit fly species in several countries in recent years, e.g. B. dorsalis in France, Italy and Belgium, and B. zonata in France (EPPO alert list, https://www.eppo.int/ACTIVITIES/plant_quarantine/alert_list).
Here, we present high-quality reference genome assemblies for five tephritids of agricultural importance (Figure 1a): Ceratitis capitata (Wiedemann), C. quilicii (De Meyer, Mwatawala & Virgilio), C. rosa (Karsch), Zeugodacus cucurbitae (Coquillett) and Bactrocera zonata. For three of the five species (C. quilicii, C. rosa, B. zonata), no genome assembly was previously available in public databases, so the assemblies presented here could provide a major step forward in accumulating knowledge on those species. Genome assemblies are a valuable resource for both fundamental and applied research and can facilitate the development of new and sustainable pest management methods. The highly contiguous and complete genomes presented here will make it easier for researchers to find specific genes of interest and to investigate changes in genomic architecture. The new assemblies will enable researchers to tackle questions regarding climate adaptation, host and range expansion and niche shifts (Papanicolaou et al. 2016).
Figure 1. (a) Photographs of the five fruit fly pest species from a dorsal and lateral view © RMCA (Royal Museum for Central Africa). (b) Hi-C (Dovetail™ Omni-C™) contact map for three tephritid species showing which reads are in close proximity to each other, revealing the linear representation of the scaffolds/chromosomes within the genome. (c) Phylogenetic tree of the three annotated tephritid fruit flies and five other Diptera species. (d) BUSCO completeness results for each of the assembled tephritid genomes.
PacBio CCS reads covered the genome between 65 and 78 times, assuming a genome size of 0.5 Gb (Table 1), for the five fruit fly species shown in Figure 1a. A BUSCO search against the Diptera database, run on the duplicate-purged PacBio assemblies, returned genome completeness between 94.6% (B. zonata) and 98.8% (C. capitata) for the five novel assemblies (Figure 1d). Total assembly lengths ranged from 410 Mb (Z. cucurbitae) to 889 Mb (C. quilicii), with L50 values ranging from three (B. zonata) to 63 (C. quilicii) (Table 1). BlobToolKit results for identifying contaminants are shown in Figures S1-S5 (see extended data, accessible at https://zenodo.org/records/14186560). Physical pairing between chromatin regions is shown in Figure 1b for C. capitata, C. quilicii and Z. cucurbitae.
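For readers less familiar with the contiguity metric used here, L50 is the smallest number of sequences whose summed length reaches half of the total assembly length, and N50 is the length of the last sequence counted. The following is a minimal Python sketch of that definition; the function name and the example lengths are illustrative assumptions and are not taken from the assemblies or from Table 1.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of contig or scaffold lengths in bp."""
    if not lengths:
        raise ValueError("no sequence lengths given")
    ordered = sorted(lengths, reverse=True)  # longest sequences first
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            # 'length' is the N50, 'count' is the L50
            return length, count


# Hypothetical lengths, for illustration only (not real assembly data).
example_lengths = [100, 80, 60, 40, 20]
n50, l50 = n50_l50(example_lengths)
print(f"N50 = {n50} bp, L50 = {l50}")
```

A lower L50 means that fewer, longer sequences make up half of the assembly, which is why the value of three for B. zonata indicates a more contiguous assembly than the value of 63 for C. quilicii.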
The annotated genomes comprise 32,449; 38,590 and 31,422 genes in total for C. capitata, C. quilicii and Z. cucurbitae respectively, with a total coding region length (bp) of 39,037,294; 46,768,995 and 41,286,253. The average gene length (bp) is 1,203.04; 1,211.95 and 1,313.93 for C. capitata, C. quilicii and Z. cucurbitae respectively. The most recent C. capitata assembly available on NCBI (GCA_905071925.1, published in November 2020) contains 14,054 genes, and this novel assembly thus substantially extends the annotation of the C. capitata genome. The same can be observed in Z. cucurbitae, where the most recent NCBI reference assembly (GCF_028554725.1) comprises only 17,225 genes. In the Ceratitis assemblies, however, a substantial proportion of BUSCOs are duplicated, which suggests the presence of redundant sequences resulting from partial misassemblies. We therefore recommend caution when comparing the Ceratitis assemblies with other assemblies.
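The reported average gene lengths appear to correspond to the total coding region length divided by the number of genes. The short Python check below reproduces that arithmetic from the figures quoted in the text; the variable and function names are illustrative assumptions.

```python
# (number of genes, total coding region length in bp), copied from the text.
annotations = {
    "C. capitata": (32_449, 39_037_294),
    "C. quilicii": (38_590, 46_768_995),
    "Z. cucurbitae": (31_422, 41_286_253),
}


def mean_gene_length(n_genes, coding_bp):
    """Mean coding length per gene in bp."""
    return coding_bp / n_genes


for species, (n_genes, coding_bp) in annotations.items():
    print(f"{species}: {mean_gene_length(n_genes, coding_bp):.2f} bp")
```

Rounded to two decimals, the three ratios match the averages given above (1,203.04; 1,211.95 and 1,313.93 bp).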
A total of 19,480 gene orthogroups could be found using OrthoFinder and a total of 32,051; 37,950 and 31,009 genes could be attributed to an orthogroup for C. capitata, C. quilicii and Z. cucurbitae respectively. Using these orthogroups as evidence we estimated that the Tephritidae-Drosophilidae split took place around 120 MYA (Figure 1c), which is in line with the estimations of Russo et al. (2013) who constructed a drosophilid time tree with two tephritid species as outgroup (C. capitata and B. oleae) and estimated the split at around 110 MYA.
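To put the orthogroup counts in proportion, the share of annotated genes assigned to an orthogroup can be derived from the gene totals reported for each annotation. This Python sketch uses only numbers quoted in the text; the names are illustrative assumptions.

```python
# (genes assigned to an orthogroup, total annotated genes), copied from the text.
orthogroup_counts = {
    "C. capitata": (32_051, 32_449),
    "C. quilicii": (37_950, 38_590),
    "Z. cucurbitae": (31_009, 31_422),
}


def assigned_fraction(assigned, total):
    """Fraction of annotated genes placed in any orthogroup."""
    return assigned / total


for species, (assigned, total) in orthogroup_counts.items():
    print(f"{species}: {100 * assigned_fraction(assigned, total):.1f}% assigned")
```

For all three species, more than 98% of the annotated genes fall into an orthogroup.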
We believe that our contribution will substantially impact tephritid genome research and provides new opportunities for comparative genomics with a focus on characterizing genes related to invasiveness.
An inbred lab colony of each of the following tephritid species was established in an artificial setting and larvae were collected for subsequent sequencing: Ceratitis capitata, C. quilicii, C. rosa, Zeugodacus cucurbitae and Bactrocera zonata. Inbred specimens of C. quilicii, C. capitata and C. rosa were produced at Citrus Research International in Mbombela and were originally sourced from wild flies collected in Ermelo (-26.516021, 29.996168), Burgershall (-25.112083, 31.087778) and Mbombela (-25.452258, 30.970778), Mpumalanga Province, South Africa, respectively, in 2020 (C. rosa) and 2021 (C. capitata and C. quilicii). Species identity was confirmed by Marc De Meyer (C. quilicii) and Aruna Manrakhan (C. capitata and C. rosa). Inbred lines of Z. cucurbitae and B. zonata had already been maintained at the facilities of CIRAD, Réunion, for more than 150 generations and could thus be used for our purposes. Pupae of all species supplied for sequencing originate from a parent × F1 backcross to increase homozygosity. The sequencing and assembly process consisted of three consecutive steps: generation of PacBio CCS reads and primary assembly with Hifiasm; generation of Hi-C reads (specifically, Dovetail™ Omni-C™) coupled with secondary assembly using HiRise; and lastly, generation of an RNAseq library for ab initio genome annotation. Only the assemblies of C. capitata, C. quilicii and Z. cucurbitae went through the HiRise scaffolding and annotation steps.
De novo PacBio assembly and filtering
A de novo assembly was constructed using ±38.8 Gb of PacBio CCS reads, resulting in a coverage of around 70x of the tephritid genome (Table 1). The obtained PacBio reads were used as input to Hifiasm v0.15.4-r347 (Cheng et al. 2021) with default parameters. BLAST results of the Hifiasm output assembly against the nucleotide BLAST database (https://blast.ncbi.nlm.nih.gov/) were used as input for blobtools v1.1.1 (Laetsch and Blaxter 2017), and scaffolds identified as possible contamination were removed from the assembly. Finally, purge_dups v1.2.5 (Guan et al. 2020) was used to purge haplotigs and contig overlaps. The final assembly was checked for completeness with BUSCO using the diptera_odb10 dataset (Manni et al. 2021).
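As a rough check of the stated coverage, sequencing depth can be estimated as the total number of read bases divided by the genome size. The Python sketch below assumes the genome size of 0.5 Gb given in the results and the approximate read yield of 38.8 Gb quoted above; per-species yields are in Table 1 and are not reproduced here.

```python
def sequencing_depth(total_read_bases, genome_size):
    """Estimated average depth: read bases divided by genome size (same units)."""
    if genome_size <= 0:
        raise ValueError("genome size must be positive")
    return total_read_bases / genome_size


READ_YIELD_BP = 38.8e9   # approximate CCS yield quoted in the text
GENOME_SIZE_BP = 0.5e9   # genome size assumed in the text

depth = sequencing_depth(READ_YIELD_BP, GENOME_SIZE_BP)
print(f"Estimated depth: {depth:.1f}x")
```

Under these assumptions the estimate is 77.6x, at the upper end of the 65 to 78x range reported across the five species.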
Chromosome conformation capture and HiRise scaffolding
To construct a Dovetail™ Omni-C™ library, chromatin was fixed in place with formaldehyde in the nucleus and then extracted. Fixed chromatin was digested with DNase I, chromatin ends were repaired and ligated to a biotinylated bridge adapter, followed by proximity ligation of adapter-containing ends. After proximity ligation, crosslinks were reversed and the DNA was purified. Purified DNA was treated to remove biotin that was not internal to ligated fragments. Sequencing libraries were generated using NEBNext Ultra enzymes and Illumina-compatible adapters. Biotin-containing fragments were isolated using streptavidin beads before PCR enrichment of each library. The library was sequenced on an Illumina HiSeqX platform to produce approximately 30x sequence coverage.
The input de novo assembly and Dovetail™ Omni-C™ library reads (MQ > 50) were used as input data for HiRise, a software pipeline designed specifically for using proximity ligation data to scaffold genome assemblies (Putnam et al. 2016). Dovetail™ Omni-C™ library sequences were aligned to the draft input assembly using bwa (https://github.com/lh3/bwa). The separations of Dovetail™ Omni-C™ read pairs mapped within draft scaffolds were analyzed by HiRise to produce a likelihood model for genomic distance between read pairs, and the model was used to identify and break putative misjoins, to score prospective joins, and make joins above a threshold.
Firstly, repeat families in the three tephritid genome assemblies (C. capitata, C. quilicii and Z. cucurbitae) were identified de novo and classified using the software package RepeatModeler2 (Flynn et al. 2020; the original version of RepeatModeler is freely available at https://github.com/Dfam-consortium/RepeatModeler/blob/master/RepeatModeler). The custom repeat library obtained from RepeatModeler2 was used to discover, identify and mask the repeats in the assembly using RepeatMasker (version 4.1.0, available at https://github.com/rmhubley/RepeatMasker). Secondly, coding sequences from Bactrocera dorsalis, Ceratitis capitata and Drosophila melanogaster available on GenBank were used to train the ab initio model in AUGUSTUS (version 2.5.5) by performing six rounds of optimization. Likewise, the same coding sequences were used to train an independent ab initio gene model using SNAP (Korf 2004). Furthermore, RNAseq reads were mapped onto the genome using the STAR aligner software (Dobin et al. 2013). MAKER (Campbell et al. 2014), SNAP and AUGUSTUS (with intron-exon boundary hints provided from RNAseq) were then used to predict genes in the repeat-masked reference genome. To help guide the prediction process, SwissProt peptide sequences from the UniProt database (https://www.uniprot.org/) were downloaded and used in conjunction with the protein sequences from the aforementioned species to generate peptide evidence in the MAKER pipeline (Campbell et al. 2014). Only genes that were predicted by both SNAP and AUGUSTUS were retained in the final gene sets. To help assess the quality of the gene prediction, AED scores were generated for each of the predicted genes as part of the MAKER pipeline. Genes were further characterised for their putative function by performing a BLAST (Ye et al. 2006) search of the peptide sequences against the UniProt database.
tRNAs were predicted using the software tRNAscan-SE (Lowe & Chan 2016, available at: https://lowelab.ucsc.edu/tRNAscan-SE/).
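The consensus filter described above (retaining only genes predicted by both SNAP and AUGUSTUS) can be illustrated with a small Python sketch. The text does not state how agreement between the two predictors was determined, so the criterion used here (overlap of at least one base pair on the same scaffold and strand) is an assumption, as are all names and coordinates. It is not the MAKER implementation.

```python
from typing import NamedTuple


class GeneCall(NamedTuple):
    scaffold: str
    start: int   # 1-based, inclusive
    end: int     # inclusive
    strand: str


def overlaps(a: GeneCall, b: GeneCall) -> bool:
    """True if two calls share scaffold and strand and overlap by >= 1 bp."""
    return (
        a.scaffold == b.scaffold
        and a.strand == b.strand
        and a.start <= b.end
        and b.start <= a.end
    )


def supported_by_both(augustus_calls, snap_calls):
    """Keep AUGUSTUS calls that overlap at least one SNAP call (assumed criterion)."""
    return [g for g in augustus_calls if any(overlaps(g, s) for s in snap_calls)]


# Hypothetical coordinates, for illustration only.
augustus_calls = [
    GeneCall("scaffold_1", 100, 900, "+"),
    GeneCall("scaffold_1", 5000, 6200, "-"),
    GeneCall("scaffold_2", 300, 1500, "+"),
]
snap_calls = [
    GeneCall("scaffold_1", 150, 880, "+"),
    GeneCall("scaffold_2", 2000, 2600, "+"),
]

retained = supported_by_both(augustus_calls, snap_calls)
print(retained)
```

In this toy input only the first AUGUSTUS call is retained, because it is the only one with an overlapping SNAP call on the same scaffold and strand. A stricter criterion, such as a minimum reciprocal overlap, would reduce the retained set further.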
We inferred orthogroups using OrthoFinder v2.5.5 (Emms & Kelly 2019) for the three fruit fly species with an annotated genome assembly in this study (C. capitata, C. quilicii and Z. cucurbitae). In addition, we downloaded protein sequence data for Drosophila melanogaster Meigen (GCA_000001215.4), Anopheles darlingi Root (GCA_000211455.3), Musca domestica Linnaeus (GCF_030504385.1), Rhagoletis pomonella (Walsh) (GCF_013731165.1) and Bactrocera tryoni (Froggatt) (GCF_016617805.1). Sequences were aligned using DIAMOND and gene trees were inferred using FastTree. The STAG algorithm combined with the STRIDE rooting method, both implemented in OrthoFinder, was then used to infer a species tree with realistic branch lengths from the full set of gene trees (Emms & Kelly 2017). A time-calibrated tree was constructed by transforming the species tree rendered by OrthoFinder into an ultrametric tree and calibrating it based on the split between A. darlingi and the rest of the taxa (240.8 MYA) as inferred from TIMETREE5 (timetree.org).
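Once a tree is ultrametric, calibrating it on a single node amounts to rescaling all node heights by one constant factor. The Python sketch below shows only that rescaling step, assuming the calibrated split between A. darlingi and the remaining taxa is the root of the tree. The method used to make the tree ultrametric is not specified in the text and is not reproduced here, and the node names and relative heights are placeholders, not values from the study.

```python
CALIBRATION_AGE_MYA = 240.8  # A. darlingi vs. remaining taxa, as stated in the text


def calibrate(node_heights, calibration_node, calibration_age):
    """Rescale relative node heights so the calibration node has the given age."""
    reference = node_heights[calibration_node]
    if reference <= 0:
        raise ValueError("calibration node must have a positive height")
    scale = calibration_age / reference
    return {node: height * scale for node, height in node_heights.items()}


# Placeholder relative heights (distance from the tips), for illustration only.
relative_heights = {"root": 2.0, "node_A": 0.5, "node_B": 0.25}

ages = calibrate(relative_heights, "root", CALIBRATION_AGE_MYA)
for node, age in ages.items():
    print(f"{node}: {age:.1f} MYA")
```

Because every height is multiplied by the same factor, the relative order and proportions of the divergence times are unchanged; only the absolute time scale is set by the calibration point.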
PD, SV, LE, MDM, MV (RMCA, BE) – Conceptualization, funding acquisition, original draft preparation and data submission.
PA, JT, MK (SU, ZA) – Conceptualization, development of the inbred lines, provision of field samples, review and editing.
AM (CRI, ZA) - Conceptualization, development of the inbred lines, provision of field samples, review and editing.
DC, LC (EMU, MZ), LB (National FF lab, MZ) - Conceptualization, provision of field samples, review and editing.
MM, RM, AK, JT (SUA, TZ), JB (UDOM, TZ) - Conceptualization, review and editing.
HD (CIRAD – La Réunion, FR) - Conceptualization, funding acquisition, development of the inbred lines, provision of field samples, review and editing.
All five genome assemblies have been deposited on the NCBI data repository.
National Centre for Biotechnology Information. BioProject: Five new genome assemblies of Tephritid pest species. Accession number: PRJDB18489; https://www.ncbi.nlm.nih.gov/bioproject/PRJDB18489/.
GenBank assemblies for the five tephritid species can be consulted using the following identifiers:
National Centre for Biotechnology Information. GCA_043005645.1: Bactrocera zonata; https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_043005645.1/.
National Centre for Biotechnology Information. GCA_043005455.1: Ceratitis capitata; https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_043005455.1/.
National Centre for Biotechnology Information. GCA_043005495.1: Ceratitis quilicii; https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_043005495.1/.
National Centre for Biotechnology Information. GCA_043005725.1: Ceratitis rosa; https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_043005725.1/.
National Centre for Biotechnology Information. GCA_043005565.1: Zeugodacus cucurbitae; https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_043005565.1/.
Annotation files for C. capitata, C. quilicii and Z. cucurbitae are stored at Zenodo (https://zenodo.org/records/13928607).
Zenodo: Genome sequence and .gff annotation of three pest fruit flies (Tephritidae), DOI: https://doi.org/10.5281/zenodo.13928607 (Royal Museum for Central Africa 2024).
The project contains the following underlying data:
Data are available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).
Zenodo: A new genome sequence resource for five invasive fruit flies of agricultural concern: Ceratitis capitata, C. quilicii, C. rosa, Zeugodacus cucurbitae and Bactrocera zonata (Diptera, Tephritidae), DOI: https://doi.org/10.5281/zenodo.14186560 (Deschepper 2024).
The project contains the following extended data:
Data are available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).
Are the rationale for sequencing the genome and the species significance clearly described?
Yes
Are the protocols appropriate and is the work technically sound?
Yes
Are sufficient details of the sequencing and extraction, software used, and materials provided to allow replication by others?
Yes
Are the datasets clearly presented in a usable and accessible format, and the assembly and annotation available in an appropriate subject-specific repository?
Partly
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Insect genomics
Are the rationale for sequencing the genome and the species significance clearly described?
Yes
Are the protocols appropriate and is the work technically sound?
Yes
Are sufficient details of the sequencing and extraction, software used, and materials provided to allow replication by others?
Yes
Are the datasets clearly presented in a usable and accessible format, and the assembly and annotation available in an appropriate subject-specific repository?
Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Evolutionary genetics
Version 1: 06 Dec 24