Introduction

Pathogenic E. coli has been the subject of extensive research and stands among the bacteria most closely associated with fatal infections worldwide1,2. According to a study conducted by the GBD 2019 Antimicrobial Resistance Collaborators, in 2019, 33 pathogenic bacteria were responsible for 7.7 million global deaths, with pathogenic E. coli ranking as the second leading cause, contributing to over 500,000 deaths, particularly prevalent in Sub-Saharan Africa, Asia, and Oceania2. Typically inhabiting the intestinal microbiota of warm-blooded animals1,3,4, E. coli demonstrates a diverse presence across more than 500 species and can be classified into eight main phylogroups: A, B1, B2, C, D, E, F, and G1,5. While the majority of E. coli populations exist as commensal organisms within their hosts, specific subpopulations, notably diarrheagenic E. coli (DEC) and extra-intestinal pathogenic E. coli(ExPEC), exhibit high pathogenicity in both humans and animals, leading to intestinal and extraintestinal infections6.

The selective pressure exerted by a host or environment, primarily constituted by commensal E. coli, significantly influences the genetic makeup of these bacteria, potentially transforming them into pathogens7,8,9,10,11. Thus, emerges as a pivotal mechanism driving the evolution of virulence and antimicrobial resistance in E. coli, particularly through the acquisition of mobile genetic elements such as phages, plasmids, and integrative and conjugative elements, often organized as Pathogenicity Islands (PAIs)12. Pathogenic E. coli frequently harbors virulence genes within PAIs, such as the locus of enterocyte effacement (LEE) identified in E. coli O157:H7. Notably, the high-pathogenicity island (HPI), originally identified in Yersinia strains, is recurrently found in various members of the Enterobacteriaceaefamily13, including pathogenic E. coli strains14,15. This HPI encompasses multiple genetic markers responsible for encoding yersiniabactin-mediated iron uptake, a factor that enhances iron metabolism within the host and subsequently triggers septicemia in mammals and birds15. While the clinical significance of this island is well-documented, studies have primarily reported the presence of HPI in E. coli from diverse hosts16,17 (Karch et al., 1999; Ju et al., 2013). However, information regarding its prevalence in E. coli originating from food sources remains scarce.

Moreover, another genomic island harboring genetic factors for heat resistance has been highlighted in E. coli, this island called transmissible locus of stress tolerance (tLST), was described for the first time in an E. coli strain isolated from a beef processing plant18,19,20. The tLST harbors multistress genes and exhibits four known variants, each spanning a size range of 14–30 kb, detected in either the plasmid or chromosome of E. coli and other members of the Enterobacteriaceaefamily, with the recent identification of two additional20,21. The tLST variants are composed of genes or orfsgrouped to express proteins with functions against heat, envelope stress and oxidative stress18. However, each variant may contain one or more different genes or orfs21,22.

In commensal E. coli, the prevalence of tLST can be as high as 10%, while in pathogenic E. coli, it is 2.7%23,24. Despite this disparity, studies have increasingly focused on identifying investigating tLST in E. coli isolated from animal food sources3,8,24,25,26,27,28. Long-term temporal assessments of resistant strain emergence often rely on genomic tools capable of estimating lineage molecular clocks29. However, such an approach has not yet been disclosed for tLST. In this study, we sequenced the genomes of E. coli strains isolated from pasteurized milk, previously characterized for their heat resistance abilities8. Our comprehensive characterization focuses on estimating the molecular clock for the emergence of heat resistance variants, assessing the global profile of other resistance genes and characterizing the virulence profile with emphasis on the presence of HPI genes.

Methods

Sampling acquiring, whole-genome sequencing, and Denovo assembly

In this study, we utilized a collection of 29 E. colistrains isolated from pasteurized milk in Mato Grosso state, Brazil8. These strains were chosen for their heat resistance phenotype and genotype, previously characterized by PCR. These strains were preserved in brain heart infusion (BHI) broth (Oxoid, Hampshire, England) containing 15% glycerol. Upon reactivation on MacConkey agar (MerckTM, Darmstadt, Germany), the strains were incubated at 37 °C for 18–24 h. Subsequently, a single colony from each agar plate was suspended in BHI broth and incubated overnight at 37 °C. Total DNA extraction was performed using the Qiagen DNeasy Blood and Tissue commercial kit (Hilden, Germany), following the manufacturer’s instructions.

Sequencing libraries were then prepared using the NEBNext1 Library Preparation Kit (New England Biolabs Inc., Ipswich, MA, USA), according to the manufacturer’s protocol. Sequencing was conducted using the Illumina HiSeq 2500 platform (Illumina Inc., San Diego, CA, USA), ensuring a minimum coverage of 100x. Quality assessment of reads was conducted using FastQC v0.11.5 and processed using fastp v0.23.430. De novo assembly was carried out using Unicycler with the standard pipeline (https://github.com/rrwick/Unicycler). Assembly metrics were checked using QUAST v531.

In addition to the twenty-nine E. coli strains sequenced in the present study, we downloaded 764 genomes available E. coli genome assemblies from Brazil (as of October 23, 2023), sequenced between 1964 and 2021, from the Pathogen Detection Web Browser on the National Center for Biotechnology Information (NCBI) database (https://www.ncbi.nlm.nih.gov/pathogens/). Our sequences have been deposited in NCBI GenBank through the BioProject: PRJNA1106916. All datasets analyzed in this study are avaiable in the GenBank database (https://www.ncbi.nlm.nih.gov/genbank), and corresponding accession numbers are provided in the supplementary information (Table S1).

In silico determination of genotype profiles

We employed the MLST software developed by Torsten Seemann (Seemann T, MLST Github: https://github.com/tseemann/mlst) to automatically analyze the assembled genomes of E. coli. This analysis was conducted using the traditional MLST server, PubMLST (https://www.cge.cbs.dtu.dk//services/MLST/), which utilizes whole genome sequence (WGS) data for bacterial sequence type (ST) identification32. For typing schemes and ST identification, we selected seven specific housekeeping genes: aroC (chorismate synthase), dnaN (DNA polymerase III beta subunit), hemD (uroporphyrinogen III cosynthase), hisD (histidinol dehydrogenase), purE (phosphoribosylaminoimidazole carboxylase), sucA (alpha-ketoglutarate dehydrogenase), and thrA (aspartokinase + homoserine dehydrogenase). These genes were used as targets in the analysis.

To determine the Core Genome Multi-Locus Sequence Type (cgMLST), we utilized the open-access cgMLSTFinder software provided by the Center for Genomic Epidemiology (https://bitbucket.org/genomicepidemiology/cgmlstfinder/src/master/). This software employs rapid and accurate K-mer alignment of reads33 against EnteroBase to identify cgMLST schemes based on alleles found in the 3,002 Salmonellagenes34. E. coli phylogroup determination and serotyping was performed using EzClermont (https://github.com/nickp60/EzClermont) and ECTyper v1.0.0 (https://github.com/phac-nml/ecoli_serotyping).

We utilized Seemann T.‘s ABRicate software (v1.0.0, https://github.com/tseemann/abricate) for mass screening sequences obtained from the virulence finder database (VFDB) contained 310 genes, AMRFinderPlus (NCBI) and the Biofilms Structural Database - biosim (https://biofilms.biosim.pt) with 51 genes. Our screening criteria included a minimum per base identity of 95% and 60% coverage. The coverage threshold was deliberately set to ensure comprehensive screening, accounting for genes potentially lying on the edge of a contig or spanning multiple contigs due to possible non-perfect assembly.

To identify antimicrobial resistance genes and plasmids, we utilized STARAMR 0.7.1 software (https://github.com/phac-nml/staramr). This software conducts scans of bacterial genome contigs against the ResFinder35, PointFinder36, and PlasmidFinder37 databases. It generates a comprehensive report summarizing the detected antimicrobial resistance genes, plasmids, and predicts drug resistance based on the identified resistance genes. Our scans were conducted with stringent criteria, employing a minimum DNA identity threshold of 95% and a minimum DNA coverage of 60% for all genome alignments against the databases.

To characterize the structure of the tLST variants and HPI, we inserted all reference genomes into the Geneious Prime software. This step allowed us to extract the target sequences for further analysis. The cutoff was subsequently applied considering the presence of at least one ORF or protein. For this purpose, for tLST a cutoff of 11–90% was considered for genomes that partially contained the island, and 99–100% for complete genomes. For HPI, the cutoff was 11–89% for partial genomes, and 90–100% for complete genomes. For tLST, the target variants, including tLSTCP010237, tLSTCP016838, tLSTCP061755, tLSTLDYJ01000141, and tLSTKY416992,were annotated using BLAST, and completeness percentages were obtained. Similarly, for the high pathogenicity island (HPI), we considered two variants based on prior studies14,15 from Yersinia pestis (accession number AL590842) and E. coli (accession number AY233333). These sequences were employed for further structural characterization and analysis.

Pan-genome analysis and phylogenetic inference

Pan-genome analysis and phylogenetic inference were conducted for all the 793 E. coli strains using Panaroo v1.3.4 38. The whole genome alignment was performed with a minimum percentage identity of 95% for BLASTp, which a gene had to be present in at least 99% of the isolates to be considered core. To ensure standardized genome annotation, all genomes were annotated using Prokka v1.14.539. The pan-genome visualization was achieved by employing the postprocessing scripts provided by Roary v3.13.040. For the generation of a maximum likelihood phylogeny from the core genome alignment, IQtree v2.0.341was utilized, employing a GTR + F + I + G4 substitution model with 1000 bootstrap replications to support the tree nodes. Complementary annotation was performed using emapper v2.042with default parameters, relying on eggNOG orthology data43. Diamond protein alignment44was employed for sequence searches. For interactive visualization of phylogenetics, metadata, and gene content, the browser application Phandango45.

Genome-wide Association Study (GWAS)

To explore potential associations between some genetic classifications and the presence of the locus tLST in E. coli isolates, a genome-wide association study (GWAS) was conducted using Scoary2 v0.0.1546. The genetic classifications analyzed included MLST, phylogroups, and serovars. Binary matrices were constructed to represent the presence (1) or absence (0) of each genetic classification across all isolates, as detailed in the Gene_presence_absence.csv file (Table S5). Similarly, the presence or absence of the locus tLST in the isolates was recorded in a binary format, where (1) indicated its presence and (0) its absence, as provided in the trait.csv file (Table S5). The correlation between genetic classifications and the presence of the locus tLST was assessed by odds ratios, Fisher’s exact test p-values (fisher_p), and empirical p-values (empirical_p), with adjustments for multiple testing using the Benjamini-Hochberg correction method to produce adjusted p-values (fisher_q). Sensitivity and specificity metrics were also calculated to evaluate the predictive strength of each genetic classification in relation to the presence of the locus. The workflow involved importing the presence/absence matrices for genetic classifications and the locus tLST into Scoary2, performing association analyses, and interpreting the outputs. Scoary2 provided key metrics, including odds ratios to quantify the strength of associations, Fisher’s exact test p-values to assess statistical significance, and sensitivity and specificity measures to determine predictive performance. The results were exported as a tab-delimited file (Table S5) for further analysis and visualization. Genetic classifications with significant associations (fisher_q < 0.05) and high odds ratios were prioritized as potential markers associated with the locus tLST.

Molecular clock analysis of the tLST

To estimate time-scale phylogenies from tLST, we initiated the process by performing a core SNPs alignment to the reference sequence of the tLST extracted from the strain AW1.7 (LDYJ01000141.1). We specifically selected nineteen E. coli genomes containing the tLST, all belonging to the ST10 lineage. This selection aimed to minimize variation rates between lineages and bolster the robustness of the biological clock, as suggested by Duchêne et al.47. The ST10 was chosen due to its global prevalence among E. coli genomes across environmental, clinical, and food sources48,49. Seventeen of these genomes were obtained from the study conducted by Zhang and Yang21, while two additional ST10 strains, 1206 and 1207, sequenced in our study, were included for temporal analysis. To ensure accurate phylogenetic reconstructions, we identified and excluded recombinant regions within the core genome alignment using Gubbins v2.4.150. Subsequently, the final set of high-quality core-genome SNPs was extracted from the alignment file using SNP-sites v2.5.151. TreeTime v0.11.152 was employed to estimate the time of emergence of the common tLST ancestor. This software handles the estimation of the temporal evolution of the tLST within E. coli.

Results

The prevalence of tLST in Brazilian E. coli genomes

Regarding the eight tLST variants described to date20,21, we focused our investigation on five specific variants (tLSTCP010237, tLSTCP016838, tLSTCP061755, tLSTLDYJ01000141, tLSTKY416992) to describe prevalence of tLST in Brazilian E. coli genomes (Table S2). Notably, three variants (tLST3a, tLST1, and tLST2) reported by Zhang and Yang21 exhibited nucleotide similarity ranging from 99 to 100% with tLSTa and tLST1, as previously described by Wang et al.20. The lack of standardized nomenclature of these described variants18,20,21,22,24,26,28,53,54,55 can often lead to misunderstandings among studies. To mitigate confusion, we employed only five of the described variants, referencing them by their sequence access codes. Furthermore, in our study, genomes that contained partial tLST were classified as those that included at least one ORF or protein (Fig. 1).

Fig. 1
figure 1

Schematic representation of the transmissible stress tolerance locus (tLST) variants extracted from the target sequences for mapping Brazilian E. coli genomes. Each colored arrow with its respective accession number refers to one of the five tLST variants used in this study (CP010237, CP016838, CP061755, LDYJ01000141, and KY416992). Regions of the cds described as null, predicted to encode a protein in the tLST, are described as hypothetical proteins.

Our investigation included 793 Brazilian E. coli genomes collected between 1964 and 2021, including twenty-nine strains sequenced in this study that were isolated from pasteurized milk in Mato Grosso (as detailed in Machado et al.8). Additionally, 764 Brazilian E. coli genomes available in the NCBI Pathogen Detection repository were considered for comprehensive analysis. Our findings revealed the presence of tLST variants in 10% (80/793) of the investigated genomes (Fig. 2). Among these, twenty-four contained the complete tLST sequence, twenty-one exhibited the tLSTCP010237 variant, and three presented the tLSTLDYJ01000141 variant. Moreover, seventy-three genomes showed partial tLST sequences, ranging from 11 to 67% of the sequence. Notably, these variants were identified within both chromosomes and plasmids (Table S2). Genomes from humans, most of which are distributed mainly between clades C and D (Fig. 2), have been identified by E. coli, which belongs to phylogroups D, C, and F (clade C) and B2 (clade D).

Fig. 2
figure 2

Phylogenetic tree based on the core and accessory genome of the 793 Brazilian E. coli strains circulating at the human-animal-food-envinronmental. In the green part of the tree, black refers to complete presence (99-100%) of proteins or ORFs in transmissible Locus of Stress Tolerance (tLST) and/or High-Pathogenic Island (HPI) variants in E. coli genomes, and yellow refers to partial presence (11-67%) of tLST and/or HPI on these genomes.

The inference of molecular-clock phylogenies in the tLST

We performed a time-scaled phylogenetic analysis to reveal the evolutionary trajectory of tLST in E. coli (Fig. 3A). The TreeTime model applied to assess the temporal phylogenetic signal exhibited a convergent regression with an value of 0.27. Our analysis estimated the most recent common ancestor (MRCA) of tLST to have emerged around 1914.4, with a confidence interval (CI) of 25.28 (Fig. 3B). The analysis also revealed the divergence of the common ancestor into two primary populations, classified as clades A and B. Clade A (n = 5) diverged from a common ancestor dating back to 1938. Notably, this clade encompassed ancient genomes GCF_900636295.1 and GCF_014295295.1, isolated in 1953 and 1954, respectively. Conversely, clade B (n = 13) included genomes from more recently isolated strains, including our strains 1206 and 1207. This clade diverged from a common ancestor around 1972 (Fig. 3A).

Fig. 3
figure 3

Molecular clock of temporal estimative about E. coli's tLST emergence. Figure A refers to the phylogenetic tree of E. coli ST10 harboring tLST selected in our study. Figure B refersto the estimate (root date) of the most recent common ancestor (MRCA) in relation to tLST, with a confidenceinterval of +/-25.28.

E. coli Phylogrouping, Serotyping, and distribution of virulence genes

The most frequent phylogroups were A, B1, and B2, distributed across a wide range of serovars (Table S1). A comparative analysis of serovar distribution between the Brazilian genomes and the entire EnteroBase database (https://enterobase.warwick.ac.uk/) revealed exclusive detection of serotypes O120:H32, O131:H46, and O156:H1 within the genomes from Brazil. Among the Top Seven E. coli serotypes most associated with foodborne infections56, all were identified within Brazilian E. coli genomes. Their frequencies were as follows: O45 (15/793), O103 (9/793), O26 (8/793), O145 (4/793), O157 (10/793), O111 (2/793), and O121 (1/793). Predominantly, these serotypes were linked with phylogroup B1. However, our strains isolated from milk sources (serotypes O107, O88, 0128, O176, O15, O88, O61, O18, and O13/O135) did not belong to the most commonly associated serotypes with foodborne infections (Table S1).

In relation to High Pathogenicity Island (HPI) genes, the prevalence was 38.5% (306/793) between E. coli genomes, with 273/306 presenting complete sequences (Table S3). Among these, thirty-seven genomes were sourced from food, and notably, four E. coli strains isolated from milk harbored the complete HPIY.pestis (AL590842) variant within their chromosomes (Fig. 2). Also, the analysis of virulence gene distribution across Brazilian E. coli genomes indicates a relatively low prevalence of Shiga toxin-producing strains. Among the 793 genomes studied, only eleven exhibited the presence of stx genes. However, our investigation revealed the presence of other noteworthy virulence-associated genes, particularly those linked to adherence factors (fdeC, papC, papH, papA), as well as associated with secretion systems and protections (espX1 and iss) (Table S1). The fdeC gene stood out as the most prevalent among the studied genomes, identified in 719 out of 793 genomes. In addition, to better characterize the virulence genome of our milk´s strains, we examined virulence genes using the AMRFinderPlus database containing 310 genes through the E. coli virulence database (VFDB). We identified a frequency ranging from 123 to 165 out of 310 genes, of which the most notable were other genes for protectins (ompT); other adhesins that are part of the avian pathogenic E. coli (APEC) system (fimA, fimH, stg, ecp, vgrG/clpV) and Enterohaemorrhagic E. coli (EHEC) (eaeH); genes related to Adherent-Invasive E. coli (AIEC) invasion/effectors (ibeB, ibeC), and fitness (yjaA) (Table S4).

Characterization of resistance, persistence and mobilome about E. coli genomes

The resistome was assessed for antimicrobial resistance genes (AMR) and genes linked to stress (acid, biocide and metals). More than forty genes have been found related to resistance to sixteen main classes of antibiotics identified among the genomes of E. coli from Brazil (Table S2). Among the most frequent classes between genomes, we highlight the β-lactam resistance genes in 497/793, fluoroquinolones resistance genes in 421/793, sulfonamides in 411/793, aminoglycosides resistance genes in 397/793, tetracycline in 356/793, quinolones in 354/793, and dihydrofolate reductase inhibitor pathway genes in 341/793. Concerning stress-related genes, the most frequent genes among genomes were: ermE for resistance to biocide (492/793), mer for mercury resistance (197/793), pco and ncr for resistance to copper/metal (117/793), ars for resistance to arsenic (53/793), qac for resistance to quaternary ammonium (286/793), sil for resistance to silver/copper (125/793), and ter for resistance to tellurium (68/793). For persistence/biofilm genes, among fifty-one related genes described on the BioSim platform (https://biosim.pt/)57, we found a frequency of seventeen genes in E. coli from Brazil. The genes found were mainly related to the expression of proteins for dispersion, extracellular polymeric substance (EPS), motility and quorum sensing. Regarding mobilome, we found a large number of plasmids belonging to incompatibility groups, with the group col (385/793) (col/col, colpV), and Inc (638/793) (IncB-C, IncF-like, IncH-like, IncI-like, IncN-like, IncQ-like, IncX-like, and IncY-like) the most frequent. We also noticed that the genomes of E. coli from milk were the only ones to show a higher number of stress-related genes (Table S2).

Diversity of sequence types and clonality among genomes

Our investigation unveiled a wide array of Multilocus Sequence Types (MLST) in Brazilian E. coli genomes, identifying over 200 distinct types. Of significance, comparing with the worldwide metadata available in the EnteroBase revealed eleven MLSTs exclusively present in genomes from Brazil (ST13178, ST13403, ST13404, ST13405, ST13406, ST13407, ST13408, ST1440, ST8329, ST8386, ST8628, and ST8901). The internationally recognized STs, such as ST10, ST38, ST90, ST117, ST224, ST354, ST410, ST457, ST648, and ST744, were predominantly identified in genomes sourced from human, animal, and food sources. Among our genomes obtained from milk, we identified eight distinct STs. Notably, six of these STs (ST10, ST480, ST540, ST2600, ST6158, and ST14775) were present in multiple strains, sharing commonalities such as an identical core genome (cgMLST) and similar accessory genes related to both virulence and resistance that may suggest they are clone.

The pan-genome analysis identified 27,511 genes. Among these, 2,664 genes were found to be conserved core genes, present in over 99% of the strains, while 21,801 were accessory genes, present in at least one strain (Fig. 2). We divided the phylogenetic tree into four clades (Fig. 2A, B, C, D) and noticed that as the genomes distanced from the common ancestor (clade A), the diversity of accessory genes increased (clade D).

Associations between MLST, serotype, phylogroup and tLST

This study investigated the association between specific genetic classifications and the presence of E. coli phenotypes, employing Scoary2 for statistical analysis. The dataset resulted in significant findings based on odds ratio, corrected p-values (fisher_q), sensitivity, and specificity. Genetic classifications such as “ST480” and “Serovar_O128:H11” showed infinite odds ratios, indicating perfect associations in the observed dataset. Both classifications exhibited 100% specificity, signifying their exclusive presence in isolates with tLST. Fisher’s exact test yielded extremely low p-values (5.07 × 10⁻¹³ and 5.75 × 10⁻¹², respectively), supporting their statistical significance even after multiple testing correction (adjusted fisher_q). Despite their high specificity, the sensitivity of “ST480” and “Serovar_O128:H11” was relatively low (15.0% and 13.75%, respectively). It suggests these classifications are associated with specific subsets of isolates rather than being universally present in all tLST positive samples. Other classifications, such as “ST14775” and “Serovar_O61:H34,” also exhibited infinite odds ratios and 100% specificity. However, their sensitivity values were markedly lower (6.25% and 3.75%, respectively). While these findings underscore their potential relevance, their low sensitivity limits their utility as comprehensive markers. The phylogroup_A classification demonstrated a high odds ratio (6.96), indicating moderate association. Its sensitivity was significantly higher (61.25%) compared to other classifications, though its specificity was slightly reduced (81.49%). It suggests phylogroup A is more widely distributed across isolates with the tLST but is not an exclusive marker. The adjusted fisher_q values reinforced the robustness of the findings. Classifications with the lowest fisher_q values, such as “ST480” and “Serovar_O128:H11,” remained statistically significant after correction. In contrast, classifications like “Serovar_O61:H34,” which had a higher fisher_q (0.443), require cautious interpretation as they may reflect weaker associations. These results highlight “ST480” and “Serovar_O128:H11” as highly specific markers for the tLST presence. However, their limited sensitivity indicates the involvement of additional genetic factors or interactions in the tLST. The “Phylogroup_A” classification emerges as a broader marker with a balanced sensitivity and specificity, making it potentially applicable to a wider range of isolates.

Discussion

Our study identified the variant tLSTCP010237 as the most prevalent among the full-length variants analyzed. This variant, exhibiting an 80% similarity to the initial tLSTLDYJ01000141 (LHR1), notably differed in genes such as ftsH, responsible for encoding the FtsH protein involved in the quality control of membrane proteins58. Additionally, it featured cls and ybhOgenes, associated with cardiolipin synthase protein production, responding to increased activity in environments with high osmolarity59. Another variant prevalent among genomes containing partial tLST was tLSTKY416992. Differing from the initial tLST variant, this version spans up to 30 kb and showcases genetic alterations from the LHR1 (Fig. 1). Unique genes within this variant encompass a mrkABCD operon, recognized for its role in biofilm formation on biotic surfaces and epithelial cells in E. coli60. Similarly to Boll et al.60, we found this variant associated with E. coli isolated from animal/food sources (milk), as well as human sources, that harborized until seventeen different biofilm genes (Table S1).

Except for the GCA_029589455.1 genome, we exclusively identified partial tLST variants among E. coli strains originating from clinical sources (Table S2). We observed that human genomes, most of which were distributed between clades C and D (Fig. 2), were identified by E. coli belonging to phylogroups D, E and F (clade C) and B2 (clade D), while phylogroups A and B1 were mostly present in the genomes of clades A and B. Studies indicate that virulent extraintestinal E. coli strains belong largely to these phylogroups, while phylogroups A and B1 are more associated with commensal E. coli61,62,63. Furthermore, genomes containing the greatest diversity of virulence genes, including adherence and invasion factors in hosts such as humans, were found in clades C and D, where the tLST was only partially present. On the other hand, clades A and B, where the genomes of commensal E. coli with the presence of the complete tLST predominated, isolated particularly from animals and food (with those from milk), were characterized primarily by resistance genes, with mainly HPI pathogenicity genes. The prevalence of tLST in E. coli belongs to the phylogroups characterized by commensal E. coli, such as phylogroup A, has been identified previously21. Even so, in our analysis of associations between MLST, serovar, and phylogroup, only phylogroup A was moderately associated with genomes with tLST, presenting high sensitivity and odds ratio (Table S5).

Since the selective pressure suffered by bacteria in environments with the presence of heat and other chemical sanitizers such as chlorine is more pronounced in food processing environments64, E. coli from clinical sources that migrate to a processing environment (or vice versa) may acquire certain tLST resistance genes, corroborating their survival. The acquisition of accessory genes under the influence of environmental or host challenges can occur via horizontal transfer through plasmids, transposons; or by the expression of genes that are activated by sigma factors when in cellular demand65. These horizontal gene transfer events, as well as recombination events or even limited genetic data, may explain the characteristics of the Brazilian bifurcation genomes, as was also seen in our previous work with E. coli from Peru66. Despite the significance of potential cross-environmental acquisition of tLST resistance genes, the events underpinning the emergence of tLST remain inadequately studied. Studies using Bayesian analyses have been instrumental in estimating the emergence periods of various bacterial lineages based on their phylogeny and ecological contexts47,67,68,69. In our temporal analysis using the TreeTime tool52 we estimated the closest ancestor to the emergence of tLST in E. coli to be around 1914.4 (with R20.27). Despite being originally designed for viral genomes, this tool has shown applicability to bacterial genomes as well70.

While our estimated R2value for the ancestry clock may be comparatively lower compared to other bacterial populations, previous studies47 have highlighted various critical aspects influencing the robustness of temporal clocks. For instance, lower population substitution rates, as seen in Mycobacterium tuberculosis in their study, often coincide with lower R2values, affecting the strength of the ancestry clock. In addition, other factors such as the association between evolutionary rate and sampling time, genome size and G + C content can interfere with the strength of the clock47,71. Considering these influential factors, especially the relatively low variability observed in the core genome of E. coli, which generally undergoes fewer alterations compared to its larger accessory genome72, it’s conceivable that the substitution rates observed in the genomes selected for our temporal analysis could have influenced the coefficient of variation (R2). Understanding the nuances and factors influencing the temporal behavior of genetic elements like tLST could pave the way for more refined and accurate temporal estimations in bacterial populations.

While pinpointing the exact societal event that catalyzed the emergence of tLST remains elusive, certain industry milestones align temporally with its appearance. For instance, significant episodes in the dairy industry coincide with the emergence of tLST. Notably, the installation of the first small-scale commercial milk pasteurizer in New York in 190773, marked a pivotal shift. Eight years later (1916), pasteurization was already occurring on a large scale in the United States, and in 1924 the Standard Milk Ordinance specifically mandated pasteurization at temperatures not below 61.1 °C for 30 min74,75. Similarly to our study8, a previous study have elucidated the impact of sub-pasteurization temperatures (57–68 °C) on the development of thermoresistance in E. coli 76. Likewise, this resistance in E. coli can overcome sublethal temperatures (up to 80 °C)54. Moreover, the acquisition of tLST might have instigated natural selection among E. coli populations harboring virulence genes, such as stxgenes, which might be incompatible with tLST3,55.

Despite the absence of stx genes in our isolates from pasteurized milk, our study identified Y. pestis High Pathogenicity Island (HPI) genes in 38.5% of the E. coli genomes, including four genomes sourced from pasteurized milk (Table S3). The presence of HPI has already been demonstrated in E. coli which causes septicemia in animals and humans15,17,77, and in Brazil our group has demonstrated the presence of HPI in Salmonellagenomes with a prevalence of over 99%78. However, our findings mark the first identification of HPI in foodborne and human-source E. coli strains harboring tLST. The prevalence of tLST, typically associated with commensal E. coli genomes23, among ready-to-eat foods like pasteurized milk raises critical concerns. Our data suggests the possibility of highly pathogenic E. coli strains in such food products, representing a persistent pathogen in both the food industry and potentially in human hosts, increasing the risk of infection. Surprisingly, within the genomes that contained the complete tLST, our analysis revealed over 123 virulence, adherence, and accessory genes found in pathogenic strains of E. coli ExPEC, APEC, and EHEC12,79.

Among the clinical (animal and human) and environmental genomes, we observed sequence types recognized internationally within the One Health framework80. Notably, these strains carried plasmid incompatibility groups (such as incX4, Incl2, and incHI2) known for harboring AMR genes80,81. In E. coli from milk, persistence genes linked to biofilm formation, as well as those conferring resistance to metals and sanitizers, stood out in particular. Studies suggest that biofilm-producing E. colion abiotic surfaces tends to exhibit resistance to various agents, including biocides82,83,84. The environment in which the bacteria are found will determine the adaptability and resilience of E. coli in distinct environments. Interestingly, the bacterial machinery in our genome repertoire signifies E. coli’s capacity to thrive and circulate in different ecosystems, whether from food to host or from host to food.

Conclusions

Our investigation revealed a 10% prevalence of tLST in E. coli, predominantly observed in milk-derived genomes, with the dominant variant being tLSTCP010237. Remarkably, our temporal analysis traced the emergence of tLST back to around 1914, coinciding with major societal events. This ancestral link highlights potential links between historical events and the emergence of the variant. The widespread presence of tLST variants in different E. coli sources demonstrates the adaptive capacity of bacteria in diverse environments. Moreover, our study unveiled a substantial prevalence (38.5%) of Y. pestis’ HPI genes, with four of these genomes originating from E. coli from milk, suggesting a potential risk in foodborne pathogens. An in-depth analysis of virulence genes, resistance mechanisms, and sequence types depicted a dichotomy: strains associated with food sources showed a higher prevalence of genes conferring resistance to stressors like heat, metals, and biocides, reflecting the selective pressures of food processing environments. Conversely, clinical strains predominantly carried virulence and antimicrobial resistance genes, highlighting the impact of therapeutic interventions. This comprehensive assessment highlights the multifaceted nature of E. coli, which is equipped with genes for resistance and virulence. The versatility of this pathogen is a major challenge that requires immediate and concerted control measures. The coexistence of genetic elements that facilitate survival in different environments underlines the urgent need for rigorous surveillance and strategic interventions to mitigate the potential risks posed by this pathogen in both food processing and clinical settings.