Background & Summary

In recent years, the field of medical image analysis has undergone a transformative shift with the integration of machine learning (ML) techniques, fundamentally expanding the landscape of diagnostic and therapeutic strategies. The advancement of this field hinges on the availability of large, diverse, and well-annotated datasets, which are pivotal for training robust and effective ML models. However, the process of collecting raw medical images and bringing them into a format suitable for ML applications is often complex and fraught with challenges. Medical images vary in format, size, and other parameters and therefore require extensive preprocessing and standardization, a task that becomes increasingly complex when multiple datasets from multiple domains need to be integrated into a cohesive, standardized format.

Another challenge is the scarcity of annotated datasets in medical imaging, particularly for rare diseases or specific conditions. This limitation has sparked the exploration of few-shot learning (FSL) methodologies, which are designed to make accurate predictions based on a limited number of training examples. While humans are adept at learning and making accurate predictions from minimal information, achieving comparable levels of performance in machines remains a significant challenge.

Cross-domain few-shot learning (CD-FSL) allows models to leverage the information learned for a task in one domain and apply it to a task in another domain, potentially reducing the need for extensive data in every new task. This ability is of immense value in medical imaging, especially for rare conditions. We define a domain as a particular type of subject, i.e. the studied anatomical region in the medical context, combined with a particular imaging modality. Despite its potential, cross-domain transfer poses significant challenges, given the inherent variability in imaging modalities, disease presentations, and data characteristics across different medical fields and clinics. In addition to the variability in domains, the nature of tasks involved in medical image analysis varies considerably, encompassing a range of classification types such as binary, multi-class, or multi-label, and differing in the number of target labels or classes. Developing algorithms capable of navigating these complexities is essential for effective knowledge transfer. To facilitate the development of such algorithms, good preprocessing and standardization across the different domains are even more critical in this context.

Addressing these challenges, we introduce the Medical Imaging Meta-Dataset (MedIMeta), a novel multi-domain, multi-task meta-dataset designed to facilitate the development and standardized evaluation of ML models and cross-domain FSL algorithms for medical image classification. MedIMeta contains 19 medical imaging datasets spanning 10 different domains and encompassing 54 distinct medical tasks, offering opportunities for both single-task and multi-task training. Many of the tasks are diagnostic or immediately relevant to a diagnosis. Additionally, MedIMeta contains auxiliary tasks (such as patient sex prediction) that may not have immediate clinical relevance but may nevertheless be of interest; such tasks are also relevant for multi-task training and for training FSL algorithms, which benefit from a large number of tasks.

For improved practicality, each dataset within MedIMeta is standardized to a size of 224 × 224 pixels, which matches the input size commonly used in pre-trained models. Furthermore, the dataset comes with pre-made splits to ensure ease of use and standardized benchmarking. We meticulously preprocessed the data and release a user-friendly Python package to directly load images for use in PyTorch.

This makes MedIMeta exceptionally accessible to ML researchers. This ease of access to a diverse set of realistic medical tasks, with no need for additional preprocessing, can serve as a bridge between medical professionals and the ML community, fostering interdisciplinary collaboration. Moreover, with its rich array of tasks and domains, MedIMeta is an ideal platform for studying and developing cross-domain few-shot learning techniques in medical imaging.

In addition to the meta-dataset itself, this paper presents a technical validation of MedIMeta, demonstrating its utility through fully supervised and CD-FSL baselines. The validation confirms the dataset’s reliability and robustness, establishing it as a credible benchmark for research in ML for medical image analysis.

Related datasets

Existing meta-datasets can be divided into two categories: those consisting of multiple datasets from a single domain, and those comprising data from multiple domains. An overview is shown in Table 1.

Table 1 Comparison of different datasets.

Single-domain meta-datasets

Single-domain meta-datasets offer an easy way to benchmark few-shot learning techniques such as meta-learning. One of the first meta-datasets in this category was Omniglot1, which consists of handwritten characters from a wide range of alphabets. More challenging meta-datasets derived from natural images were obtained by subsampling the widely used ImageNet2 or CIFAR3 datasets. Examples include the MiniImageNet4 and TieredImageNet5 datasets, as well as CIFAR-FS6 and FC1007.

Multi-domain meta-datasets

While single-domain meta-datasets offer an easy, standardized way to evaluate few-shot learning techniques, they often lack realism: in real-world FSL problems, data rarely stem from a single domain. This realization has led to the release of several multi-domain meta-datasets.

The Visual Decathlon dataset8 is one of the first multi-domain datasets. It consists of 10 datasets covering different visual tasks, including traffic sign and flower recognition, and also includes the Omniglot1 and CIFAR3 datasets discussed earlier.

Later, Triantafillou et al. released another collection of 10 datasets coined the “Meta-Dataset”9. While it partially overlaps with the Visual Decathlon dataset, the “Meta-Dataset” was specifically designed to benchmark few-shot learning algorithms on multiple domains. However, it does not contain any medical datasets.

With a similar motivation, Zhai et al. released the Visual Task Adaptation Benchmark (VTAB)10. This meta-dataset consists of 19 tasks, again partially subsuming the previous datasets. A particular property of the VTAB benchmark is the inclusion of datasets from three different domains covering natural image understanding, structured scene understanding, and specialized tasks. To our knowledge, VTAB is also the first meta-dataset to include medical images.

Dumoulin et al.11 later unified VTAB10 and Meta-Dataset9 into a larger Few-Shot Classification Benchmark. Guo et al.12 collected multiple previously available datasets from multiple domains for benchmarking CD-FSL methods. Meta-Album13 is a collection of 40 different datasets from multiple domains and follows a similar goal to our work. While it contains more datasets than MedIMeta, it does not contain any medical domains except for microscopy.

MedMNISTv214 is a collection of 12 medical imaging datasets from 9 different domains and additionally contains 3D datasets. It is similar in spirit to our work, but we go significantly beyond MedMNIST in the number of tasks and in task realism. The images in MedMNISTv2 have a very low resolution of 28 × 28 pixels, which obscures fine details that may be clinically relevant. In contrast, we process all images at high resolution and make them available at an image size of 224 × 224 pixels, which preserves more clinically relevant features and is the typical resolution used in pre-trained neural networks. Additionally, MedMNISTv2 does not contain any multi-task datasets, whereas MedIMeta contains a wide variety of tasks, including binary, multi-class, and multi-label classification as well as ordinal regression.

In some of the overlapping datasets, we found significant problems with MedMNIST’s preprocessing, which we improve upon. Specifically, we found that some of the center-cropped images in MedMNIST had their relevant part, i.e., the part of the image showing the disease, cropped away. We fix this problem by zero-padding these datasets instead. Additionally, some datasets in MedMNISTv2 may have training-test splits that place images of the same subject in multiple splits. We instead generated splits that take subject information into account, as sketched below.
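For illustration, a subject-aware split can be generated along the following lines. This is a minimal sketch with illustrative names and split fractions; the actual split generation is part of our released dataset-creation scripts:

    import numpy as np

    def subject_aware_split(subject_ids, fractions=(0.8, 0.1, 0.1), seed=0):
        # Assign whole subjects, not individual images, to the splits,
        # so that no subject appears in more than one split.
        rng = np.random.default_rng(seed)
        subjects = rng.permutation(np.unique(subject_ids))
        n_train = int(fractions[0] * len(subjects))
        n_val = int(fractions[1] * len(subjects))
        split_subjects = {
            "train": subjects[:n_train],
            "val": subjects[n_train:n_train + n_val],
            "test": subjects[n_train + n_val:],
        }
        return {name: np.flatnonzero(np.isin(subject_ids, subs))
                for name, subs in split_subjects.items()}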

While Meta-Dataset, VTAB, Meta-Album and MedMNIST contain different domains, only VTAB and MedMNIST contain a variety of different tasks. MedIMeta is the only one of these collections in which individual datasets contain multiple tasks, providing users with the option to train multi-task algorithms. Out of these benchmarks, only MedMNIST contains a significant number of medical tasks. Our proposed MedIMeta contains 54 medical tasks and is easily extensible to tasks from other fields, such as natural images. Furthermore, we provide utilities for converting other datasets (e.g. ImageNet) into the MedIMeta format, making it easy to integrate MedIMeta with other data sources.

Methods

We release the MedIMeta dataset, a novel, highly standardized meta-dataset comprising 19 publicly available datasets with a total of 54 tasks. In the following, we describe the source datasets and the data generation in detail.

Dataset

All datasets included in the MedIMeta dataset have either been previously published under an open license that allows redistribution under a CC-BY-SA or CC-BY-SA-NC license, or we obtained explicit permission to do so. In addition to having an open license, we selected these datasets based on three criteria: suitability for defining at least one classification task on the data, an image size suitable for rescaling to our target size without producing noticeable artifacts, and a minimum of 100 images. All images in MedIMeta were standardized to an image size of 224 × 224 pixels. We provide pre-defined training, validation and testing splits for all 19 datasets. If data splits were already defined in the source data, we used the pre-existing splits; otherwise we generated our own. Most datasets include more than one classification task. Typically, there is one main diagnostic task and several auxiliary tasks. Most of these tasks were already present in the source datasets; in some instances, we created additional tasks not present in the source data. Table 2 gives an overview of all datasets, tasks, and their key properties. Figure 1 displays example images for each dataset. In the following, we describe each dataset in detail, referring to it by its dataset ID as well as its full name.

  • aml, AML Cytomorphology: Morphological dataset of leukocytes with expert-labeled single-cell images from peripheral blood smears of patients with acute myeloid leukemia (AML) and patients without signs of hematological malignancy, derived from the Munich AML Morphology Dataset15. The 18,365 original images were resized to 224 × 224 pixels using bi-cubic interpolation (no images were up-scaled) and converted to RGB format by removing the transparency channel. We adopted the original multi-class classification task with 15 morphological classes from the source dataset.

  • bus, Breast Ultrasound: Dataset of breast ultrasound images of women between 25 and 75 years old, derived from the Breast Ultrasound Images Dataset16. The 780 original images were converted to grayscale and their masks to binary format. Images and masks were zero-padded to a square shape and resized to 224 × 224 pixels using bi-cubic interpolation and nearest neighbor interpolation respectively (no images were up-scaled). The multi-class tumor classification task with normal, benign, and malignant examples was adopted without modifications from the source dataset. We additionally defined a binary classification task between malignant tumors and other images.

  • crc, Colorectal Cancer: Dataset of image patches from hematoxylin & eosin (H&E) stained histological images of human colorectal cancer (CRC) and healthy tissue, derived from the NCT-CRC-HE-100K and CRC-VAL-HE-7K datasets17. The 107,180 original images from the training and validation sets were not modified, as they already had the right shape and size for MedIMeta. We adopted the multi-class tissue classification task with 9 labels from the source dataset without modifications.

  • cxr, Chest X-ray Multi-disease: Dataset of frontal-view X-ray chest images, derived from the ChestX-ray14 dataset18. The 112,120 original images were resized to 224 × 224 pixels using bi-cubic interpolation (no images were up-scaled); the 519 images that were originally in RGBA format were converted to grayscale. We provide a multi-label thorax disease classification task with 14 labels adopted without modifications from the source dataset. We additionally provide a binary classification task of the patient sex derived from the labels present in the original data.

  • derm, Dermatoscopy: Dataset of dermatoscopic images of common pigmented skin lesions from different populations acquired and stored by different modalities, derived from the HAM10000 dataset19. The 11,720 original images were center-cropped to a square shape and resized to 224 × 224 pixels using bi-cubic interpolation (no images were up-scaled). We provide the multi-class disease classification task with 7 labels defined in the challenge hosted by the International Skin Imaging Collaboration (ISIC)20.

  • dr_regular, Diabetic Retinopathy (Regular Fundus): Dataset of fundus images with diabetic retinopathy grades and image quality annotations, derived from the DeepDRiD dataset21. The 2,000 original images were center-cropped to a square shape and resized to 224 × 224 pixels using bi-cubic interpolation (no images were up-scaled). Following the annotations present in the original data, we provide 5 tasks: diabetic retinopathy grade (ordinal regression task with 5 labels), sufficient image quality for gradability (binary classification task), strength of artifact (ordinal regression task with 6 labels), image clarity (ordinal regression task with 5 labels), and field definition (ordinal regression task with 5 labels).

  • dr_uwf, Diabetic Retinopathy (Ultra-widefield Fundus): Dataset of ultra-widefield fundus images with annotations for diabetic retinopathy grading, derived from the DeepDRiD dataset21. Only the 250 original images without missing labels were kept. They were center-cropped to a square shape and resized to 224 × 224 pixels using bi-cubic interpolation (no images were up-scaled). We adopted the DR grading task (ordinal regression) with 5 labels from the source dataset without modifications.

  • fundus, Fundus Multi-disease: Multi-disease retinal fundus dataset of images captured using three different fundus cameras with 45 conditions annotated through adjudicated consensus of two senior retinal experts as well as an overall disease presence label, derived from the Retinal Fundus Multi-disease Image Dataset22. The 3,200 images were center-cropped to a square shape and resized to 224 × 224 pixels using bi-cubic interpolation (no images were up-scaled). The original disease presence binary classification task and disease multi-label classification task with 45 labels were directly derived from the annotations provided by the original dataset.

  • glaucoma, Glaucoma-specific fundus images: Glaucoma-specific Indian ethnicity retinal fundus dataset of images acquired using three devices, where five expert ophthalmologists provided annotations on whether the subject is suspect for glaucoma or not, derived from the Cháksu dataset23. The 1,345 original images and their masks were zero-padded to a square shape and resized to 224 × 224 pixels using bi-cubic interpolation and nearest neighbor interpolation respectively (no images were up-scaled). We used the glaucoma suspect majority vote annotation to derive a binary classification task.

  • mammo_calc, Mammography (Calcifications): Dataset of cropped regions of interest (calcifications), derived from the Curated Breast Imaging Subset of the Digital Database for Screening Mammography (CBIS-DDSM)24. The 1,872 images were obtained by extending the regions of interest (bounding boxes) to a square shape with a minimum size of 224 × 224 pixels, and extracting the resulting region crops from the full original images. The region crops were then resized to 224 × 224 pixels using bi-cubic interpolation. From the annotations, 3 tasks were derived: pathology type (binary classification task), calcification type (multi-label classification task with 14 labels), and calcification distribution (multi-label classification task with 5 labels).

  • mammo_mass, Mammography (Masses): Dataset of cropped regions of interest (masses) from CBIS-DDSM. The 1,696 images were preprocessed as described for Mammography (Calcifications). From the annotations, 3 tasks were derived: pathology type (binary classification task), mass shape (multi-label classification task with 8 labels), and mass margins (multi-label classification task with 5 labels).

  • oct, OCT: Dataset of validated Optical Coherence Tomography (OCT) images labeled for disease classification, derived from25. The 84,484 original images were center-cropped to a square shape and resized to 224 × 224 pixels using bi-cubic interpolation (no images were up-scaled). The original dataset contains a multi-class disease classification task, with three different diseases and a healthy class, which we adopt without modifications. Additionally, we provide a binary task for whether the image warrants urgent referral to a specialist based on the annotations present in the original data.

  • organs_axial, Axial Organ Slices: Dataset of axial image slices of 11 different organs, extracted from the Liver Tumor Segmentation Benchmark (LiTS) dataset26 and the corresponding organ bounding box annotations from27. We derived a multi-class organ classification task with 11 labels by extracting a cropped image of each individual organ in each of the CT volumes using the bounding box annotations, obtaining a total of 1,645 organ images. We removed 106 images for which the voxel size information was missing. The axes of one image were permuted to bring it into the same format as the other images. The images and masks were sliced from the original 3D volumes by taking the center of the organ bounding box in the axial plane. The Hounsfield units of the images were transformed into grayscale images by applying a window with a width of 400 and a level of 50, which are typical values for abdominal CT imaging. The images and masks were cropped to a square size in physical space by centering at the center of the bounding box and expanding the smaller side. The resulting images and masks were resized to 224 × 224 pixels using bi-cubic and nearest neighbor interpolation, respectively. For visualization purposes, we additionally provide images averaged over the 10% central slices, with the projected bounding boxes of all organs extracted from the image drawn on top.

  • organs_coronal, Coronal Organ Slices: Dataset of coronal image slices of 11 different organs, extracted from the LiTS dataset. The images were processed the same as described for the Axial Organ Slices dataset, except that the coronal projections were used.

  • organs_sagittal, Sagittal Organ Slices: Dataset of sagittal image slices of 11 different organs, extracted from the LiTS dataset. The images were processed the same as described for the Axial Organ Slices dataset, except that the sagittal projections were used.

  • pbc, Peripheral Blood Cells: Dataset of microscopic peripheral blood cell images of individual normal cells, captured from individuals without infection, with hematologic or oncologic disease and free of any pharmacologic treatment at the moment of blood collection, derived from28. The 17,092 original images were center-cropped to a square shape and resized to 224 × 224 pixels using bi-cubic interpolation (no images were up-scaled). We adopted the original multi-class blood cell classification task with 8 labels from the source dataset without modifications to the annotations.

  • pneumonia, Pediatric Pneumonia: Dataset of pediatric chest X-ray images labeled for pneumonia classification, derived from25. The 5,856 original images were zero-padded to a square shape and resized to 224 × 224 pixels using bi-cubic interpolation (some images were up-scaled); the 283 images that were originally in RGB format were converted to grayscale. From the original annotations, we derived a binary classification task for pneumonia presence as well as a multi-class task differentiating between normal images, bacterial pneumonia and viral pneumonia.

  • skinl_derm, Skin Lesion Evaluation (Dermoscopy): A dataset containing dermoscopic color images of skin lesions, along with corresponding labels for seven different evaluation criteria and the diagnosis, derived from29. The images were zero-padded to obtain a square image and then resized to 224 × 224 pixels using bi-cubic interpolation. We adopted an overall diagnostic multi-class task, as well as separate classification tasks for each of the seven diagnostic criteria from the source dataset. Tasks containing infrequent labels have additional grouped versions which bundle the infrequent labels together into more frequent labels. This grouping is provided by the source dataset.

  • skinl_photo, Skin Lesion Evaluation (Clinical Photography): A dataset containing clinical color photography images of skin lesions, along with corresponding labels for seven different evaluation criteria and the diagnosis, derived from29. This dataset contains the same subjects as Skin Lesion Evaluation (Dermoscopy) and images were preprocessed in the same manner. The tasks are also identical to Skin Lesion Evaluation (Dermoscopy).

Table 2 All MedIMeta tasks.
Fig. 1 Example images of all MedIMeta datasets.

We open-source all code for creating the above datasets from their respective source materials at https://github.com/StefanoWoerner/medimeta-dataset-scripts. Our published source code contains easy-to-use utility functions to extend MedIMeta with additional datasets. As examples, we provide several additional pipelines that create more datasets in the same format from other medical data which is publicly available but not published under a license that allows redistribution of derivative works.
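The shared preprocessing recipe described above (zero-padding or center-cropping to a square, bi-cubic resizing to 224 × 224 pixels, and, for the CT-derived organ datasets, Hounsfield-unit windowing) can be sketched as follows. The function names are our own illustration rather than the released utilities:

    from PIL import Image
    import numpy as np

    def pad_to_square(img):
        # Zero-pad a PIL image to a square shape, keeping the content centered.
        w, h = img.size
        side = max(w, h)
        canvas = Image.new(img.mode, (side, side), 0)
        canvas.paste(img, ((side - w) // 2, (side - h) // 2))
        return canvas

    def standardize(img, size=224):
        # Pad to square, then resize with bi-cubic interpolation
        # (masks use nearest-neighbor interpolation instead).
        return pad_to_square(img).resize((size, size), Image.BICUBIC)

    def window_hounsfield(hu_slice, level=50, width=400):
        # Map Hounsfield units to uint8 grayscale with an abdominal CT window.
        lo, hi = level - width / 2, level + width / 2
        clipped = np.clip(hu_slice, lo, hi)
        return ((clipped - lo) / (hi - lo) * 255).astype(np.uint8)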

Python package

We release a Python package called medimeta that enables convenient data loading; it is installable via pip install medimeta. Users can load data as single datasets or as cross-domain batches, and the package also supports loading few-shot tasks with support and query sets. Additionally, it is fully compatible with TorchCross30, a library for cross-domain and few-shot learning.

Data Records

We have organized and made available the data files of the MedIMeta dataset on Zenodo31. The data for all datasets within MedIMeta can be accessed via the provided DOI. Each dataset is packaged in a single zip file for convenience and can be downloaded either separately or in one batch together with all other datasets.

Each zip file contains a structured and organized collection of data, designed to facilitate ease of use and comprehensive understanding of the dataset. The content of these zip files is as follows:

  • images This folder contains all the image files for the dataset as uint8 TIFFs, named sequentially (e.g., 000000.tiff, 000001.tiff, etc.), ensuring easy access and exploration for human viewers.

  • splits This directory includes text files (train.txt, val.txt, and test.txt) listing the image paths belonging to each respective split.

  • original_splits For source datasets with pre-existing split definitions, this directory contains those original splits, allowing users to adhere to the original data partitioning if desired. The format is the same as in the splits directory.

  • task_labels Each task within the dataset is accompanied by an .npy file named with the respective task’s name (e.g., task_name_1.npy, task_name_2.npy, etc.) contained in this folder. Each .npy file contains a single NumPy array that represents the labels associated with that specific task.

  • annotations.csv This file provides a comprehensive set of annotations, such as the patient id or imaging plane, for the images in the dataset, offering detailed insights and data points for those interested. It also contains all task labels to make them more readily accessible to human readers.

  • images.hdf5 This file contains all the images from the images folder, formatted as a single dataset with dimensions N × H × W × C, where N represents the number of images, H the height, W the width, and C the number of channels. The HDF5 file is useful for reading data in machine learning applications (see the reading sketch after this list).

  • info.yaml This file contains all relevant information about the dataset, including its ID, name and description, the number of images in each split, domain identifier, task definitions, and attribution information.

  • LICENSE Each dataset is accompanied by its specific license file. We publish all datasets under a Creative Commons license: the majority under CC BY-SA 4.0, while some datasets carry a non-commercial license due to the source material licensing. The specific CC license for each dataset is listed in Table 2.
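For users who prefer to bypass the medimeta package, these files can be read directly. The sketch below assumes the OCT dataset has been extracted to data/MedIMeta/oct; the path, HDF5 key handling and task name are illustrative:

    import h5py
    import numpy as np
    import yaml

    base = "data/MedIMeta/oct"  # assumed extraction path for the OCT dataset

    with open(f"{base}/info.yaml") as fh:
        info = yaml.safe_load(fh)  # dataset ID, name, task definitions, ...

    with h5py.File(f"{base}/images.hdf5", "r") as fh:
        key = list(fh.keys())[0]   # the file holds a single N x H x W x C image array
        images = fh[key][:]

    labels = np.load(f"{base}/task_labels/Disease.npy")  # labels for the "Disease" task

    with open(f"{base}/splits/train.txt") as fh:
        train_paths = [line.strip() for line in fh]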

Technical Validation

In order to validate our proposed dataset, we used it in two distinct learning scenarios. First, we performed simple supervised training on each of the datasets on its primary task. Second, we investigated the utility of our dataset for CD-FSL. In the following, we describe the experimental setups for the two scenarios in more detail.

Supervised learning experiments

For the supervised experiments, we trained ResNet-18 and ResNet-50 models32 on the primary task of each dataset. All networks were initialized with pre-trained weights from ImageNet2. Early stopping was performed using the AUROC on the respective validation sets. We performed a simple hyper-parameter search over data augmentation, learning rate and weight decay. We note that the official test split of the Diabetic Retinopathy (Ultra-widefield Fundus) dataset does not contain any samples of the class “PDR”. Moreover, the dataset contains only two patients with this class in total, which prevents creating a custom train-test split with a better class balance. To account for this, we trained and evaluated both models without this class.
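A minimal sketch of the model setup, assuming a current torchvision version (hyper-parameters, augmentation and the training loop are omitted):

    import torch.nn as nn
    from torchvision.models import resnet18, ResNet18_Weights

    def make_model(num_classes):
        model = resnet18(weights=ResNet18_Weights.IMAGENET1K_V1)  # ImageNet initialization
        model.fc = nn.Linear(model.fc.in_features, num_classes)   # task-specific output layer
        return model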

Cross-domain few-shot learning (CD-FSL) experiments

In this evaluation, we compare 5-shot performance using three different CD-FSL approaches (described below). We evaluated the performance for each dataset in a leave-one-out fashion, using one task as the target task and the tasks from all datasets with no domain or subject overlap with the target as source tasks to transfer knowledge from. The knowledge transfer was achieved by first training a common backbone on the source tasks, and then fine-tuning the network to the target task using a small support set of 5 labeled examples per class. Evaluation was performed on a distinct query set consisting of 10 samples per class from the same task. We exclude all classes with fewer than 15 samples from the fine-tuning and evaluation, since at least 15 samples per class are needed to sample the disjoint support and query sets. Figure 2 illustrates the training and evaluation procedure.

Fig. 2 An overview of the CD-FSL scenario: the few-shot learner is first trained on the meta-dataset of highly diverse training data. It is then adapted to a new task from a new domain using the labeled examples from the support set of a few-shot task. Performance is assessed using a query set from the same task.

Because performance can vary substantially between runs depending on the quality of the 5 labeled examples per class, we reran the experiments 100 times for each target task to obtain a more robust estimate of the performance.
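For a multi-class target task, episode sampling can be sketched as follows (a minimal illustration with our own function name; multi-label tasks require analogous per-label handling):

    import numpy as np

    def sample_episode(labels, n_support=5, n_query=10, seed=None):
        # Sample disjoint support and query indices per class; classes with
        # fewer than n_support + n_query samples are excluded beforehand.
        rng = np.random.default_rng(seed)
        support, query = [], []
        for c in np.unique(labels):
            idx = rng.permutation(np.flatnonzero(labels == c))
            support.extend(idx[:n_support])
            query.extend(idx[n_support:n_support + n_query])
        return np.array(support), np.array(query)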

Baseline CD-FSL algorithms

ImageNet pre-training (IN-PT)

The simplest baseline we investigated initializes the common backbone network with ImageNet weights and then directly fine-tunes it on the target task.

Multi-domain multi-task pre-training (mm-PT)

Pre-training using ImageNet lacks specificity to the medical domain. Incrementally pre-training on a series of available datasets may offer a strategy to learn from many related datasets rather than just one33. However, incremental pre-training may suffer from catastrophic forgetting of earlier tasks34. To address this issue, we propose a multi-domain multi-task pre-training schedule, where for each model update we sample a batch from a random source task. This strategy may facilitate learning representations suitable for a wide range of tasks. The algorithm is summarized in the supplemental materials.
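A single mm-PT update can be sketched as follows. The names are illustrative, and the task-specific output head per source task is our assumption; the full algorithm is given in the supplemental materials:

    import random

    def mm_pt_step(backbone, heads, task_iters, loss_fns, optimizer):
        task = random.choice(list(task_iters))  # draw a random source task
        x, y = next(task_iters[task])           # one batch from that task
        logits = heads[task](backbone(x))       # shared backbone, task-specific head
        loss = loss_fns[task](logits, y)        # task-appropriate loss (CE, BCE, ...)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()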

Multi-domain Multi-task MAML (mm-MAML)

Model-agnostic meta-learning (MAML)35 has been shown to be a promising strategy for CD-FSL36. MAML first “learns to learn” from a set of training tasks before learning the desired test task. However, MAML assumes identical task types and numbers of classes across tasks, which is not realistic in practical settings. Here, we employed our previously proposed Multi-domain Multi-task MAML (mm-MAML) strategy37, in which an individual classification layer is used for each class, simply initialized with zeros. The algorithm is summarized in the supplemental materials.
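The zero-initialized classification weights can be sketched as follows (a minimal sketch, where feat_dim denotes the feature dimension of the backbone):

    import torch.nn as nn

    def zero_init_head(feat_dim, num_classes):
        head = nn.Linear(feat_dim, num_classes)
        nn.init.zeros_(head.weight)  # zero-initialized classification weights
        nn.init.zeros_(head.bias)
        return head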

Results of validation experiments

Table 3 shows the results for the single-domain baselines. The models were able to achieve high performance in terms of AUROC for most datasets. The comparatively lower performance on some of the tasks reflects their difficulty: all tasks with scores below 80 contain classes with very few examples, making these tasks harder to learn. Since our fully supervised baselines are generic and simple methods not tailored to a specific task, we do not expect state-of-the-art performance on all datasets. As expected, the more complex ResNet-50 model achieved slightly higher performance than the ResNet-18 for most datasets.

Table 3 AUROC (%) on the test set for the fully supervised baselines.

Table 4 shows 5-shot results for the CD-FSL baselines described earlier. Surprisingly, simple fine-tuning from pre-trained ImageNet weights performed as well as or better than fine-tuning from the mm-PT and mm-MAML baselines. This may indicate that the pre-training we performed before fine-tuning is too simple to bring a meaningful benefit. At the same time, it is also apparent that simple few-shot learning methods such as these do not come close to the performance of fully supervised training. We therefore conclude that MedIMeta offers sufficient complexity for evaluating future few-shot methods, with the fully supervised results setting an upper bound for future few-shot methods to aim for.

Table 4 AUROC (%) for the CD-FSL baselines averaged across 100 5-shot episodes using a ResNet-18 and ResNet-50.

Usage Notes

All datasets contained in the MedIMeta dataset31 can be downloaded from Zenodo. Using the code provided in our data loaders repository, all tasks in MedIMeta can easily be loaded as PyTorch datasets for single-domain, cross-domain and few-shot scenarios. No further pre-processing is required, but it is possible to pass any TorchVision transforms to the dataset class when initializing it. A simple step-by-step example of how to load a single dataset is as follows.

  1. Download the zip file for the dataset you would like to use (e.g. OCT) from the Zenodo record31, and extract it to a directory of choice. Let us use ./data/MedIMeta here.

  2. Install the Python package via pip install medimeta.

  3. The data can now be used as a PyTorch dataset. The following code snippet instantiates the dataset for the Disease task of the OCT dataset, assuming the data is stored in the data/MedIMeta directory.

    from medimeta import MedIMeta

    dataset = MedIMeta("data/MedIMeta", "oct", "Disease")
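From here, standard PyTorch tooling applies; for example (a minimal sketch, assuming the dataset yields image-label pairs):

    from torch.utils.data import DataLoader

    loader = DataLoader(dataset, batch_size=32, shuffle=True)
    images, labels = next(iter(loader))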


The repository also contains an “examples” directory with several usage examples for single-domain training, cross-domain training, and few-shot fine-tuning.

Practical limitations

We have identified a number of practical limitations of our dataset, which we briefly discuss in this section. Firstly, our meta-dataset contains several datasets which vary substantially in the number of samples. Some of these datasets are rather small, limiting their practical use for applications that require a large amount of data. Secondly, some of the tasks are separated from their clinical context and may therefore lack clinical realism. For instance, medical professionals holistically evaluate multiple mammography views of the same patient instead of only looking at a small region of interest. When balancing practicality for machine learning with clinical realism, we have consciously prioritized the former while keeping the clinical tasks as realistic as possible. Thirdly, the scope of our meta-dataset does not completely encompass all modalities and anatomical regions typically seen in medical imaging, and it is comprised solely of 2D images. In clinical practice, 3D images, videos, and other formats are commonly used in addition to 2D images. Additionally, clinicians often score images on a multidimensional scale, while most of the datasets included in MedIMeta include only classification labels. Another potential limitation is the common size and format of all images in MedIMeta, which might not be optimal for every individual application domain. However, this is a very significant advantage for ease of use and, once again, a conscious trade-off in favor of practicality for machine learning.