Introduction

Artificial intelligence (AI) is transforming healthcare, with an explosion in the number of studied AI applications in medicine1. However, the adoption of AI into clinical practice has been slow and often fragmented, in part due to the significant regulatory, safety, and ethical challenges unique to healthcare AI. These challenges necessitate careful testing and collaborative development to ensure AI technologies are safe, ethical, private, equitable, user-friendly, and achieve their intended impact. Initial national and international regulatory actions acknowledge the need for timely guardrails around rapidly evolving AI technologies. Notable examples include the European Union AI Act2,3; the US White House executive order on the Safe, Secure, and Trustworthy Development and Use of AI4; and the healthcare-specific HTI-1 final rule from the US Department of Health and Human Services' Assistant Secretary for Technology Policy and Office of the National Coordinator for Health Information Technology (ONC), which mandates transparency for AI algorithms that are part of ONC-certified health information technology5. Likewise, others have proposed frameworks and best practices for governance of healthcare AI at the international, national6, and local levels7, with safety, trust8, ethics6,9, and equity10,11,12,13 as core tenets of the responsible use of AI in healthcare.

Major frameworks and guidelines for AI focus on model evaluation rather than implementation, including IBM’s AI Lifecycle Management framework for industry AI pipelines14; SPIRIT-AI and CONSORT-AI guidelines for reporting on clinical trials involving AI15,16,17,18,19,20; and the HTI-1 framework for AI transparency in healthcare5. The AI Lifecycle Management framework focuses on the technical stages of AI model development, deployment, and maintenance but may lack specificity in tailoring these stages to healthcare’s unique safety and ethical requirements. SPIRIT-AI and CONSORT-AI provide guidance on reporting standards for clinical trials involving AI but do not offer a stepwise approach to implementation, monitoring, and scaling AI tools in clinical practice. Finally, while HTI-1 provides regulatory guidance for transparency and ethical AI use, it does not define specific implementation stages for healthcare organizations to systematically validate and scale AI tools.

Guidance on implementation approaches for pragmatic deployment in this space is arguably nascent; we therefore advocate for a structured approach that is rapid enough to drive clinical change in a reasonable timeframe yet measured enough to yield defined outcomes and allow for scalability. Others have previously discussed practical concerns around scaling AI in healthcare21 and an approach to AI healthcare product development22. Here, we offer a framework for healthcare organizations to implement AI technologies safely and with impact, beyond scientific research, using in-house developed tools or vendor-based solutions.

Clinical trials framework

Clinical research regulated by the US Food and Drug Administration (FDA) proceeds in four phases: Phase 1, assessing safety and drug dosage in 20–100 healthy individuals or individuals with the disease; Phase 2, measuring efficacy and identifying side effects in hundreds of individuals with the disease; Phase 3, evaluating efficacy and benefit at larger scale and monitoring adverse reactions in 300–3000 individuals, often relative to standard of care; and Phase 4, post-market monitoring of safety and efficacy in thousands of individuals after FDA approval23. We propose a clinical trials informed framework for AI implementation in healthcare systems at scale that mirrors the four phases of FDA-regulated clinical trials, with phases covering safety, efficacy, effectiveness, and post-deployment monitoring, to systematically address critical concerns such as regulatory compliance, patient safety, and model validation. This approach ensures that AI solutions undergo rigorous validation at each stage, creating a foundation for safe and effective clinical integration. By following these stages, healthcare AI can transition more seamlessly from research settings into routine practice, minimizing risks while maximizing patient outcomes. This approach is similar to the one the American Medical Informatics Association, the US national organization of informatics professionals, researchers, and other members, has taken with case studies of AI24.

Although we recognize that the term AI encompasses a wide range of systems, in this perspective we focus on AI solutions based on machine learning or large language models; these concepts may nonetheless apply to other AI systems such as computer-interpretable clinical guidelines25 and argumentation for medical decision-making26. We also recognize that these principles may be most relevant in the US, which has its own AI regulations and a comparatively high clinical administrative burden.

Our clinical trials informed framework for AI solution deployment in healthcare comprises four phases (Table 1) and is intended for implementation, not regulation, of AI solutions that are not necessarily medical devices. Because these AI solutions are neither medical devices nor drugs, the framework includes pragmatic consideration of the clinical workflows required in the informatics realm.

Table 1 Clinical Trials Framework for Artificial Intelligence Applications

Phase 1: safety

This initial phase assesses the foundational safety of the AI model or tool. Here, models are deployed in a controlled, non-production setting, where they do not influence clinical decisions. Testing can be done retrospectively or in “silent mode,”27 where predictions are observed without impacting patient care. For example, a large language model might be used in clinical trial screening by evaluating retrospective electronic health record (EHR) notes to determine patient eligibility without risking patient outcomes28; the evaluation may also include validation/bias analyses to measure fairness across different patient demographics29, ensuring the model does not inadvertently disadvantage specific groups.
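To make the fairness component of this phase more concrete, the sketch below illustrates one way a retrospective, “silent mode” bias analysis might be structured: computing sensitivity, specificity, and flag rates for each demographic subgroup before the model is allowed to influence care. This is a simplified illustration rather than a description of any specific institution’s pipeline; the column names (y_true, y_score, race_ethnicity) and the 0.5 decision threshold are assumptions for the example.

```python
# Hypothetical retrospective subgroup check for a binary risk model ("silent mode").
# Column names and the decision threshold are illustrative assumptions.
import pandas as pd

def subgroup_performance(df: pd.DataFrame, group_col: str, threshold: float = 0.5) -> pd.DataFrame:
    """Sensitivity, specificity, and flag rate per demographic subgroup."""
    rows = []
    for group, sub in df.groupby(group_col):
        pred = sub["y_score"] >= threshold        # would-be alerts, never shown to clinicians
        tp = (pred & (sub["y_true"] == 1)).sum()
        fn = (~pred & (sub["y_true"] == 1)).sum()
        tn = (~pred & (sub["y_true"] == 0)).sum()
        fp = (pred & (sub["y_true"] == 0)).sum()
        rows.append({
            group_col: group,
            "n": len(sub),
            "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
            "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
            "flag_rate": pred.mean(),             # share of patients the model would flag
        })
    return pd.DataFrame(rows)

# Usage (with a retrospective, labeled cohort assembled by the implementation team):
# report = subgroup_performance(cohort, group_col="race_ethnicity")
# Large gaps in sensitivity or flag_rate across groups would prompt review before Phase 2.
```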

Phase 2: efficacy

In the second phase, the model’s efficacy is examined prospectively under ideal conditions, often by integrating it into live clinical environments with limited visibility to clinical staff. This phase tests whether the AI can perform accurately and beneficially in real-time workflows. During this phase, models are typically run “in the background,” allowing them to process real-world data without impacting clinical decision-making until performance is thoroughly vetted. Teams begin to organize data pipelines so that hospital data can be input into the model, and they identify which team members (such as a nurse or physician) will act on the output at which steps in clinical workflows. Teams must also determine when output will be displayed and how to deliver it in a timely, interpretable form. Examples include using AI to predict admission rates in the emergency department, where results are hidden from end-users to refine accuracy without influencing care30, and an AI-based acute coronary syndrome detection tool31, in which the ingestion of real-world data allowed for optimization of equity and fairness32,33.
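As one illustration of what running “in the background” can look like operationally, the sketch below scores newly arrived encounters on a schedule and writes the predictions to an audit log only, with no write-back to the EHR. The fetching function, model interface, and log location are hypothetical placeholders for site-specific components; this is a minimal example rather than a production pipeline.

```python
# Minimal sketch of a background ("silent") scoring job; nothing is surfaced to clinicians.
# fetch_new_encounters(), the model interface, and the log path are hypothetical placeholders.
import csv
import datetime as dt
from pathlib import Path

LOG_PATH = Path("silent_mode_predictions.csv")   # assumed location of the audit log

def run_silent_mode(model, fetch_new_encounters):
    """Score newly arrived encounters and append results to an audit log only."""
    encounters = fetch_new_encounters()           # e.g., the last 24 hours of ED visits
    with LOG_PATH.open("a", newline="") as f:
        writer = csv.writer(f)
        for enc in encounters:
            score = model.predict_proba([enc["features"]])[0][1]   # probability of the outcome
            writer.writerow([dt.datetime.now().isoformat(), enc["encounter_id"], round(score, 4)])
    # Deliberately no write-back to the EHR: output stays invisible to end users until
    # accuracy, calibration, and workflow fit have been vetted by the implementation team.
```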

Phase 3: effectiveness and/or comparison to existing standard

In this phase, the AI tool is deployed more broadly, and its effectiveness is assessed relative to current standards of care. In contrast to phase 2, which focuses on efficacy, a measure of an outcome under ideal circumstances, phase 3 focuses on effectiveness, a measure of benefit in a pragmatic real-world clinical setting34. This phase incorporates health outcome metrics, demonstrating real-world impact on patient care and clinician workflows. Implementation teams evaluate the model’s generalizability by testing it across various patient populations and clinical settings, measuring geographic and domain-specific performance29. A real-world example is ambient documentation, a generative AI platform being piloted by Stanford and Mass General Brigham (MGB) across multiple clinical specialties that converts patient-clinician conversations into draft notes, which the clinician reviews and edits before signing in the EHR system. The quality and usability of these notes are being compared to notes written by the clinicians themselves, while outcome measures of clinician experience and burnout are being assessed rigorously. An additional example is AI-generated in-basket draft replies, in which patient message content is sent securely to an EHR vendor’s OpenAI GPT-435 instance and a draft reply is generated, then edited by a clinical staff member (in our implementations, a physician, advanced practice provider, nurse, or pharmacist)36. Time spent replying to in-basket messages is being assessed to determine whether the AI technology is improving efficiency. Comparisons between clinician-written and AI-generated draft replies in domains such as professionalism and tone are examples of evaluation that extends beyond traditional process outcomes37.
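For the in-basket example, the efficiency question ultimately reduces to a paired comparison of clinician reply times before and after draft replies are enabled. The sketch below shows one simple way such a comparison could be run; the numbers are toy data, not study results, and the non-parametric test is our assumption rather than the published analysis plan.

```python
# Toy paired comparison of per-clinician reply times (seconds) before and after AI drafts.
# Values are illustrative only; the Wilcoxon signed-rank test is an assumed analysis choice.
import numpy as np
from scipy import stats

baseline = np.array([95.0, 120.0, 88.0, 140.0, 110.0, 102.0])     # pre-deployment medians
with_drafts = np.array([80.0, 118.0, 90.0, 121.0, 95.0, 99.0])    # post-deployment medians

stat, p_value = stats.wilcoxon(baseline, with_drafts)              # paired, non-parametric
print(f"Median change: {np.median(with_drafts - baseline):+.1f} s, p = {p_value:.3f}")
```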

Phase 4: monitoring (scaled and post-deployment surveillance)

After scaled deployment, AI tools require ongoing surveillance to track performance, safety, and equity over time. Continuous monitoring identifies any drift in model performance, while user feedback helps maintain alignment with clinical needs and safety standards. This phase ensures that as AI models evolve or face data shifts, they are recalibrated to remain effective and unbiased. The integration of monitoring systems into routine workflows allows for rapid identification of adverse events or bias, supporting sustained model integrity in clinical practice. Systems to detect model drift38 can inform model updates or de-implementation of ineffective AI solutions. Adopting existing methodology from traditional clinical decision support initiatives, such as override comments as a feedback mechanism for improving clinical decision support39 and the Vanderbilt Clickbusters initiative, which iteratively reviews clinical alerts to turn off unneeded alerts and to improve or add more targeted alerts40, can help ensure better clinical uptake and intervention efficacy. In addition, teams should disseminate findings so that other institutions can learn and share best practices.
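Model drift detection, mentioned above, can often be operationalized with relatively simple distribution checks. The sketch below computes a Population Stability Index (PSI) between a reference window of model scores and the current period; a PSI above roughly 0.2 is a common rule of thumb for meaningful drift. The threshold and the escalation hook are assumptions for illustration, not a prescribed monitoring standard.

```python
# Illustrative drift check: Population Stability Index (PSI) between a reference score
# distribution and the current period. Threshold and escalation are assumed examples.
import numpy as np

def population_stability_index(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """PSI of `current` against quantile bins derived from `reference`."""
    edges = np.unique(np.quantile(reference, np.linspace(0, 1, bins + 1)))
    current = np.clip(current, edges[0], edges[-1])        # keep out-of-range scores in end bins
    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_frac = np.histogram(current, bins=edges)[0] / len(current)
    ref_frac = np.clip(ref_frac, 1e-6, None)               # avoid division by zero / log(0)
    cur_frac = np.clip(cur_frac, 1e-6, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

# Example escalation logic (hypothetical threshold and governance hook):
# if population_stability_index(reference_scores, this_month_scores) > 0.2:
#     notify_model_governance_committee()
```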

Deploying AI at scale in healthcare systems faces several challenges, particularly when it comes to aligning AI-generated guidance between specialty practices and primary care. Stanford, for example, has evaluated patient prediction models that illustrate many of these challenges32. One major issue is a mismatch in recommendations, in which AI models trained in specialty settings may not align well with primary care workflows or guidelines. Furthermore, lack of coverage and reimbursement for certain tests or treatments recommended by AI may limit usage in real-world practice. Additionally, healthcare populations are often fragmented across multiple practices, with a third of patients in the United States lacking a primary care provider41. This fragmentation complicates patient management and follow-up, as adherence to AI-suggested interventions may fall through the cracks. This monitoring phase requires ongoing model assessment, feedback loops, and potential recalibration, which can be logistically complex.

Discussion

Using a clinical trials framework for healthcare AI provides a pragmatic, structured, stepwise approach to evaluating and scaling novel AI solutions in care delivery. This framework emphasizes patient safety, efficacy, and real-world applicability. By mirroring the rigorous processes of traditional clinical trials, it offers a robust path to validate AI tools comprehensively, ensuring these technologies benefit diverse populations without introducing unintended risks. This approach also addresses the unique challenges of healthcare AI, including regulatory variability, ethical considerations, model drift, and data generalizability, while emphasizing continuous monitoring to sustain model integrity over time. Other healthcare-focused frameworks concentrate more on reporting standards (such as SPIRIT-AI and CONSORT-AI) or regulatory guidance (such as HTI-1).

In the US, while external clinical decision support may be considered a medical device42 and potentially be subject to formal FDA review43, healthcare organizations can deploy in-house AI models without FDA certification, allowing for significant flexibility in internal clinical use. The need for rigorous and often prolonged evaluation of external solutions, in turn, limits their immediate market availability. This regulatory flexibility contrasts with requirements in other jurisdictions, where most clinical AI tools must be certified before use. Addressing such regulatory variation is essential for ensuring the framework’s applicability across global healthcare settings, balancing flexibility for internal use with structured validation for external deployment.

This framework may be less applicable to AI applications in broader healthcare settings, such as public health or community health programs, where direct clinical workflow integration is not always feasible or necessary. In addition, the clinical trials approach to AI-based healthcare technologies may not be applicable to small- to medium-sized healthcare organizations, which may implement these tools only once they have already reached the ongoing monitoring stage. Analogous to traditional bench or clinical research, these AI clinical trials are more likely to occur at larger academic medical centers, as they require resources, financial investment, and AI-specific technical expertise44. However, while large academic medical centers are likely to lead these efforts, it is crucial that the lessons learned from these initiatives are shared across all healthcare communities, including community healthcare centers and safety net hospitals. By disseminating knowledge and best practices, we can ensure that all populations benefit from safe, effective, and equitable AI solutions.

We recognize that there are distinct challenges to monitoring AI-based technologies in healthcare that may limit some generalizability of findings. For example, with ambient documentation, our institutions have observed differences in configurations, underlying large language models, device support, and EHR integration across different vendors, compounded by rapid platform feature and model changes in a competitive vendor market. On the implementation side, institutions launch ambient documentation technology with different user specialties and different numbers of users. Standardized benchmarks and metrics may help mitigate some of this variability in experience and performance. For example, in Phase 2 of our framework, test case libraries for regular validation (test messages, standardized recordings) could periodically be used by vendors to monitor performance.
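One concrete form such a test case library could take is a small regression harness that replays standardized inputs against the deployed service after each vendor or model update and flags outputs that diverge from previously accepted references. The sketch below uses a crude lexical similarity measure purely for illustration; the file format, the generate_draft() function, and the similarity threshold are all hypothetical, and real evaluations would pair automated checks with human review.

```python
# Hypothetical regression harness for a standardized test case library; the library format,
# generate_draft() service call, and similarity threshold are illustrative assumptions.
import json
from difflib import SequenceMatcher

def run_regression_suite(generate_draft, library_path="test_case_library.json",
                         min_similarity=0.8):
    """Flag test cases whose new output diverges from the previously accepted reference."""
    with open(library_path) as f:
        cases = json.load(f)    # e.g., [{"id": ..., "input": ..., "reference_output": ...}, ...]
    regressions = []
    for case in cases:
        new_output = generate_draft(case["input"])
        similarity = SequenceMatcher(None, case["reference_output"], new_output).ratio()
        if similarity < min_similarity:             # lexical proxy only; not a clinical judgment
            regressions.append({"id": case["id"], "similarity": round(similarity, 2)})
    return regressions          # any entries here would prompt human review before rollout
```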

When deploying AI in healthcare, it is essential to prioritize outcomes and safety rather than solely focus on process measures and model performance, as we highlight in Phase 3. While metrics such as AI-drafted note accuracy or draft reply generation times are important, they do not fully capture the real-world impact of AI on patient care. AI solutions must demonstrate that they improve health outcomes, reduce harm, and contribute to better overall patient experiences45. Emphasizing patient safety across Phases 1–4 ensures that AI solutions are used responsibly, minimizing the risk of unintended consequences such as exacerbating health disparities or introducing bias. By shifting the focus toward meaningful outcomes, especially the equity impact of AI solutions at different levels of health, ranging from individual to population-level health13, healthcare systems can better assess the true value of AI solutions and ensure they enhance care in ways that align with the broader goals of equity and quality improvement.

Each of the phases we highlight relies on the availability of high-quality, diverse datasets for testing and validation. However, data quality and representation vary widely, particularly for underrepresented patient groups, which could limit the framework’s effectiveness in promoting equitable AI. More diverse, cross-institutional data will allow us to test the fairness and generalizability of the AI solutions we develop, which should be evaluated in Phase 2 of our framework. While the specifics of how institutions should approach implementation of these technologies can be debated, there is a clear need for greater regulatory guidance on their use, echoing other calls for a careful approach that recognizes the unique challenges of generative AI46 and incorporates the input of the aforementioned stakeholders. There is also a need for better systems and regulations to enable more federated, cross-institutional pooling of data to improve the performance of these tools.

We advocate that there is a pressing need for broad stakeholder engagement, governmental support (e.g., NIH funding), and industry sponsorship to rigorously and systematically study AI technologies, thereby enabling novel AI solutions to be validated and scaled across healthcare systems. Groups like the MIT Task Force on the Work of the Future47, the Coalition for Health AI (CHAI)48, and other more solution-specific interinstitutional collaborations can provide shared lessons. MGB and Stanford are both part of CHAI. MGB is part of the Ambient Clinical Documentation Collaborative, a group of academic medical centers implementing ambient documentation, to share insights and “invent the wheel” together. Stanford plays a lead role in many of these organizations, as well as promoting local initiatives such as Responsible AI for Safe and Equitable Health (RAISE Health)49 and Human-Centered AI50.

Finally, as informatics and healthcare system leaders construct and implement AI for pragmatic use in clinical and administrative workflows, teams must consider a solution’s financial viability during early planning stages. While AI offers alluring potential, it may not be appropriate for answering a specific question or solving a specific problem if cost becomes unsustainable. Cost considerations include not only the initial technical cost of building the AI solution, but also costs related to uptake, training of staff, trust-building with communities regarding safe and equitable healthcare AI applications, and maintenance of these solutions; these costs should then be weighed against return on investment. Cost should be factored in with pragmatic outcomes, patient-oriented outcomes, or other meaningful outcomes to justify testing and scaling the technology. This mindset will prevent unnecessary repetition of pilots that do not yield scalable, financially tenable solutions.

Importantly, we recognize that healthcare AI is a rapidly evolving field, and the framework may require adaptation across international regulatory environments and differing clinical settings. By sharing implementation insights and best practices, particularly from early adopters, we aim to support broader, equitable adoption of AI tools across all healthcare environments, from large academic centers to community hospitals. Ultimately, this framework provides a pathway for safe and effective AI in healthcare, aligning technological advancement with the goals of patient-centered outcomes, equity, and long-term societal benefit.

In conclusion, while AI holds promise for transforming healthcare, its deployment must be approached with caution and rigor. Adopting a clinical trials framework ensures that AI solutions are thoroughly tested for safety, efficacy, and effectiveness before widespread implementation. Teams should measure patient outcomes, safety, and equity rather than solely focusing on process improvements or model performance. By sharing lessons learned from early adopters, including academic medical centers, across all healthcare settings, we can ensure that AI solutions are both effective and equitable, benefiting diverse populations and improving the quality of care for all.