Background & Summary

In recent years, the field of medical image analysis has undergone a transformative shift with the integration of machine learning (ML) techniques, fundamentally expanding the landscape of diagnostic and therapeutic strategies. The advancement of this field hinges on the availability of large, diverse, and well-annotated datasets, which are pivotal for training robust and effective ML models. However, the process of collecting raw medical images and bringing them into a format suitable for ML applications is often complex and fraught with challenges. Medical images vary in format, size, and other parameters and therefore require extensive preprocessing and standardization, a task that becomes increasingly complex when multiple datasets from multiple domains need to be integrated into a cohesive, standardized format.

Another challenge is the scarcity of annotated datasets in medical imaging, particularly for rare diseases or specific conditions. This limitation has sparked the exploration of few-shot learning (FSL) methodologies, which are designed to make accurate predictions based on a limited number of training examples. While humans are adept at learning and making accurate predictions from minimal information, achieving comparable levels of performance in machines remains a significant challenge.

Cross-domain few-shot learning (CD-FSL) allows models to leverage the information learned for a task in one domain and apply it to a task in another domain, potentially reducing the need for extensive data in every new task. This ability is of immense value in medical imaging, especially for rare conditions. We define a domain as a particular type of subject, i.e. the studied anatomical region in the medical context, combined with a particular imaging modality. Despite its potential, cross-domain transfer poses significant challenges, given the inherent variability in imaging modalities, disease presentations, and data characteristics across different medical fields and clinics. In addition to the variability in domains, the nature of tasks involved in medical image analysis varies considerably, encompassing a range of classification types such as binary, multi-class, or multi-label, and differing in the number of target labels or classes. Developing algorithms capable of navigating these complexities is essential for effective knowledge transfer. To facilitate the development of such algorithms, good preprocessing and standardization across the different domains are even more critical in this context.

Addressing these challenges, we introduce the Medical Imaging Meta-Dataset (MedIMeta), a novel multi-domain, multi-task meta-dataset designed to facilitate the development and standardized evaluation of ML models and cross-domain FSL algorithms for medical image classification. MedIMeta contains 19 medical imaging datasets spanning 10 different domains and encompassing 54 distinct medical tasks, offering opportunities for both single-task and multi-task training. Many of the tasks are diagnostic or immediately relevant to a diagnosis. Additionally, MedIMeta contains auxiliary tasks (such as patient sex prediction) that may not have immediate clinical relevance but may nevertheless be of interest; such tasks are also relevant for multi-task training and for training FSL algorithms, which benefit from a large number of tasks.

For improved practicality, each dataset within MedIMeta is standardized to a size of 224 × 224 pixels, which matches the input size commonly used in pre-trained models. Furthermore, the dataset comes with pre-made splits to ensure ease of use and standardized benchmarking. We meticulously preprocessed the data and release a user-friendly Python package to directly load images for use in PyTorch.

This makes MedIMeta exceptionally accessible to ML researchers. This ease of access to a diverse set of realistic medical tasks, with no need for additional preprocessing, can serve as a bridge between medical professionals and the ML community, fostering interdisciplinary collaboration. Moreover, with its rich array of tasks and domains, MedIMeta is an ideal platform for studying and developing cross-domain few-shot learning techniques in medical imaging.

In addition to the meta-dataset itself, this paper presents a technical validation of MedIMeta, demonstrating its utility through fully supervised and CD-FSL baselines. The validation confirms the dataset’s reliability and robustness, establishing it as a credible benchmark for research in ML for medical image analysis.

Related datasets

Existing meta-datasets can be divided into two categories: those consisting of multiple datasets from a single domain, and those comprising data from multiple domains. An overview is shown in Table 1.

Table 1 Comparison of different datasets.

Single-domain meta-datasets

Single-domain meta-datasets offer an easy way to benchmark few-shot learning techniques such as meta-learning. One of the first meta-datasets in this category was Omniglot1, which consists of handwritten characters from a wide range of alphabets. More challenging meta-datasets derived from natural images were obtained by subsampling the widely used ImageNet2 or CIFAR3 datasets. Examples include the MiniImageNet4 and TieredImageNet5 datasets, as well as CIFAR-FS6 and FC1007.

Multi-domain meta-datasets

While single-domain meta-datasets offer an easy, standardized way to evaluate few-shot learning techniques, they often lack realism: in real-world FSL problems, data rarely stem from a single domain. This realization has led to the release of several multi-domain meta-datasets.

The Visual Decathlon dataset8 is one of the first multi-domain datasets. It consists of 10 datasets covering different visual tasks, including traffic sign and flower recognition, and also includes the Omniglot1 and CIFAR3 datasets discussed earlier.

Later, Triantafillou et al. released another collection of 10 datasets coined the “Meta-Dataset”9. While it partially overlaps with the Visual Decathlon dataset, the “Meta-Dataset” was specifically designed to benchmark few-shot learning algorithms on multiple domains. However, it does not contain any medical datasets.

With a similar motivation, Zhai et al. released the Visual Task Adaptation Benchmark (VTAB)10. This meta-dataset consists of 19 tasks, again partially subsuming the previous datasets. A particular property of the VTAB benchmark is the inclusion of datasets from three different domains covering natural image understanding, structured scene understanding, and specialized tasks. To our knowledge, VTAB is also the first meta-dataset to include medical images.

Dumoulin et al.11 later unified VTAB10 and Meta-Dataset9 into a larger Few-Shot Classification Benchmark. Guo et al.12 collected multiple previously available datasets from multiple domains for benchmarking CD-FSL methods. Meta-Album13 is a collection of 40 different datasets from multiple domains and follows a similar goal to our work. While it contains more datasets than MedIMeta, it does not contain any medical domains except for microscopy.

MedMNISTv214 is a collection of 12 medical imaging datasets from 9 different domains and additionally contains 3D datasets. It is similar in spirit to our work, but we go significantly beyond MedMNIST in the number of tasks and in task realism. The images in MedMNISTv2 have a very low resolution of 28 × 28 pixels, which obscures fine details that may be clinically relevant. In contrast, we process all images at high resolution and make them available at an image size of 224 × 224 pixels, which preserves more clinically relevant features and is the typical resolution used in pre-trained neural networks. Additionally, MedMNISTv2 does not contain any multi-task datasets, whereas MedIMeta contains a wide variety of tasks, including binary, multi-class, and multi-label classification as well as ordinal regression.

In some of the overlapping datasets, we found significant problems with MedMNIST’s preprocessing, which we improve upon. Specifically, we found that some of the center-cropped images in MedMNIST had their relevant part, i.e., the part of the image showing the disease, cropped away. We fix this problem by zero-padding these datasets instead. Additionally, some datasets in MedMNISTv2 may have training-test splits that place images of the same subject in multiple splits. We instead generated splits that take subject information into account, as sketched below.
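For illustration, a subject-aware split can be generated along the following lines. This is a minimal sketch with illustrative names and split fractions; the actual split generation is part of our released dataset-creation scripts:

    import numpy as np

    def subject_aware_split(subject_ids, fractions=(0.8, 0.1, 0.1), seed=0):
        # Assign whole subjects, not individual images, to the splits,
        # so that no subject appears in more than one split.
        rng = np.random.default_rng(seed)
        subjects = rng.permutation(np.unique(subject_ids))
        n_train = int(fractions[0] * len(subjects))
        n_val = int(fractions[1] * len(subjects))
        split_subjects = {
            "train": subjects[:n_train],
            "val": subjects[n_train:n_train + n_val],
            "test": subjects[n_train + n_val:],
        }
        return {name: np.flatnonzero(np.isin(subject_ids, subs))
                for name, subs in split_subjects.items()}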

While Meta-Dataset, VTAB, Meta-Album and MedMNIST contain different domains, only VTAB and MedMNIST contain a variety of different tasks. MedIMeta is the only one of these collections in which individual datasets contain multiple tasks, providing users with the option to train multi-task algorithms. Out of these benchmarks, only MedMNIST contains a significant number of medical tasks. Our proposed MedIMeta contains 54 medical tasks and is easily extensible to tasks from other fields, such as natural images. Furthermore, we provide utilities for converting other datasets (e.g. ImageNet) into the MedIMeta format, making it easy to integrate MedIMeta with other data sources.

Methods

We release the MedIMeta dataset, a novel, highly standardized meta-dataset comprising 19 publicly available datasets with a total of 54 tasks. In the following, we describe the source datasets and the data generation in detail.

Dataset

All datasets included in the MedIMeta dataset have either been previously published under an open license that allows redistribution under a CC-BY-SA or CC-BY-SA-NC license, or we obtained explicit permission to do so. In addition to having an open license, we selected these datasets based on three criteria: suitability for defining at least one classification task on the data, an image size suitable for rescaling to our target size without producing noticeable artifacts, and a minimum of 100 images. All images in MedIMeta were standardized to an image size of 224 × 224 pixels. We provide pre-defined training, validation and testing splits for all 19 datasets. If data splits were already defined in the source data, we used the pre-existing splits; otherwise we generated our own. Most datasets include more than one classification task. Typically, there is one main diagnostic task and several auxiliary tasks. Most of these tasks were already present in the source datasets; in some instances, we created additional tasks not present in the source data. Table 2 gives an overview of all datasets, tasks, and their key properties. Figure 1 displays example images for each dataset. In the following, we describe each dataset in detail, referring to it by its dataset ID as well as its full name.

  • aml, AML Cytomorphology: Morphological dataset of leukocytes with expert-labeled single-cell images from peripheral blood smears of patients with acute myeloid leukemia (AML) and patients without signs of hematological malignancy, derived from the Munich AML Morphology Dataset15. The 18,365 original images were resized to 224 × 224 pixels using bi-cubic interpolation (no images were up-scaled) and converted to RGB format by removing the transparency channel. We adopted the original multi-class classification task with 15 morphological classes from the source dataset.

  • bus, Breast Ultrasound: Dataset of breast ultrasound images of women between 25 and 75 years old, derived from the Breast Ultrasound Images Dataset16. The 780 original images were converted to grayscale and their masks to binary format. Images and masks were zero-padded to a square shape and resized to 224 × 224 pixels using bi-cubic interpolation and nearest neighbor interpolation respectively (no images were up-scaled). The multi-class tumor classification task with normal, benign, and malignant examples was adopted without modifications from the source dataset. We additionally defined a binary classification task between malignant tumors and other images.

  • crc, Colorectal Cancer: Dataset of image patches from hematoxylin & eosin (H&E) stained histological images of human colorectal cancer (CRC) and healthy tissue, derived from the NCT-CRC-HE-100K and CRC-VAL-HE-7K datasets17. The 107,180 original images from the training and validation sets were not modified, as they already had the right shape and size for MedIMeta. We adopted the multi-class tissue classification task with 9 labels from the source dataset without modifications.

  • cxr, Chest X-ray Multi-disease: Dataset of frontal-view X-ray chest images, derived from the ChestX-ray14 dataset18. The 112,120 original images were resized to 224 × 224 pixels using bi-cubic interpolation (no images were up-scaled); the 519 images that were originally in RGBA format were converted to grayscale. We provide a multi-label thorax disease classification task with 14 labels adopted without modifications from the source dataset. We additionally provide a binary classification task of the patient sex derived from the labels present in the original data.

  • derm, Dermatoscopy: Dataset of dermatoscopic images of common pigmented skin lesions from different populations acquired and stored by different modalities, derived from the HAM10000 dataset19. The 11,720 original images were center-cropped to a square shape and resized to 224 × 224 pixels using bi-cubic interpolation (no images were up-scaled). We provide the multi-class disease classification task with 7 labels defined in the challenge hosted by the International Skin Imaging Collaboration (ISIC)20.

  • dr_regular, Diabetic Retinopathy (Regular Fundus): Dataset of fundus images with diabetic retinopathy grades and image quality annotations, derived from the DeepDRiD dataset21. The 2,000 original images were center-cropped to a square shape and resized to 224 × 224 pixels using bi-cubic interpolation (no images were up-scaled). Following the annotations present in the original data, we provide 5 tasks: diabetic retinopathy grade (ordinal regression task with 5 labels), sufficient image quality for gradability (binary classification task), strength of artifact (ordinal regression task with 6 labels), image clarity (ordinal regression task with 5 labels), and field definition (ordinal regression task with 5 labels).

  • dr_uwf, Diabetic Retinopathy (Ultra-widefield Fundus): Dataset of ultra-widefield fundus images with annotations for diabetic retinopathy grading, derived from the DeepDRiD dataset21. Only the 250 original images without missing labels were kept. They were center-cropped to a square shape and resized to 224 × 224 pixels using bi-cubic interpolation (no images were up-scaled). We adopted the DR grading task (ordinal regression) with 5 labels from the source dataset without modifications.

  • fundus, Fundus Multi-disease: Multi-disease retinal fundus dataset of images captured using three different fundus cameras with 45 conditions annotated through adjudicated consensus of two senior retinal experts as well as an overall disease presence label, derived from the Retinal Fundus Multi-disease Image Dataset22. The 3,200 images were center-cropped to a square shape and resized to 224 × 224 pixels using bi-cubic interpolation (no images were up-scaled). The original disease presence binary classification task and disease multi-label classification task with 45 labels were directly derived from the annotations provided by the original dataset.

  • glaucoma, Glaucoma-specific fundus images: Glaucoma-specific Indian ethnicity retinal fundus dataset of images acquired using three devices, where five expert ophthalmologists provided annotations on whether the subject is suspect for glaucoma or not, derived from the Cháksu dataset23. The 1,345 original images and their masks were zero-padded to a square shape and resized to 224 × 224 pixels using bi-cubic interpolation and nearest neighbor interpolation respectively (no images were up-scaled). We used the glaucoma suspect majority vote annotation to derive a binary classification task.

  • mammo_calc, Mammography (Calcifications): Dataset of cropped regions of interest (calcifications), derived from the Curated Breast Imaging Subset of the Digital Database for Screening Mammography (CBIS-DDSM)24. The 1,872 images were obtained by extending the regions of interest (bounding boxes) to a square shape with a minimum size of 224 × 224 pixels, and extracting the resulting region crops from the full original images. The region crops were then resized to 224 × 224 pixels using bi-cubic interpolation. From the annotations, 3 tasks were derived: pathology type (binary classification task), calcification type (multi-label classification task with 14 labels), and calcification distribution (multi-label classification task with 5 labels).

  • mammo_mass, Mammography (Masses): Dataset of cropped regions of interest (masses) from CBIS-DDSM. The 1,696 images were preprocessed as described for Mammography (Calcifications). From the annotations, 3 tasks were derived: pathology type (binary classification task), mass shape (multi-label classification task with 8 labels), and mass margins (multi-label classification task with 5 labels).

  • oct, OCT: Dataset of validated Optical Coherence Tomography (OCT) images labeled for disease classification, derived from25. The 84,484 original images were center-cropped to a square shape and resized to 224 × 224 pixels using bi-cubic interpolation (no images were up-scaled). The original dataset contains a multi-class disease classification task, with three different diseases and a healthy class, which we adopt without modifications. Additionally, we provide a binary task for whether the image warrants urgent referral to a specialist based on the annotations present in the original data.

  • organs_axial, Axial Organ Slices: Dataset of axial image slices of 11 different organs, extracted from the Liver Tumor Segmentation Benchmark (LiTS) dataset26 and the corresponding organ bounding box annotations from27. We derived a multi-class organ classification task with 11 labels by extracting a cropped image of each individual organ in each of the CT volumes using the bounding box annotations, obtaining a total of 1,645 organ images. We removed 106 images for which the voxel size information was missing. The axes of one image were permuted to bring it into the same format as the other images. The images and masks were sliced from the original 3D volumes by taking the center of the organ bounding box in the axial plane. The Hounsfield units of the images were transformed into grayscale images by applying a window with a width of 400 and a level of 50, which are typical values for abdominal CT imaging. The images and masks were cropped to a square size in physical space by centering at the center of the bounding box and expanding the smaller side. The resulting images and masks were resized to 224 × 224 pixels using bi-cubic and nearest neighbor interpolation, respectively. For visualization purposes, we additionally provide images averaged over the 10% central slices, with the projected bounding boxes of all organs extracted from the image drawn on top.

  • organs_coronal, Coronal Organ Slices: Dataset of coronal image slices of 11 different organs, extracted from the LiTS dataset. The images were processed the same as described for the Axial Organ Slices dataset, except that the coronal projections were used.

  • organs_sagittal, Sagittal Organ Slices: Dataset of sagittal image slices of 11 different organs, extracted from the LiTS dataset. The images were processed the same as described for the Axial Organ Slices dataset, except that the sagittal projections were used.

  • pbc, Peripheral Blood Cells: Dataset of microscopic peripheral blood cell images of individual normal cells, captured from individuals without infection, with hematologic or oncologic disease and free of any pharmacologic treatment at the moment of blood collection, derived from28. The 17,092 original images were center-cropped to a square shape and resized to 224 × 224 pixels using bi-cubic interpolation (no images were up-scaled). We adopted the original multi-class blood cell classification task with 8 labels from the source dataset without modifications to the annotations.

  • pneumonia, Pediatric Pneumonia: Dataset of pediatric chest X-ray images labeled for pneumonia classification, derived from25. The 5,856 original images were zero-padded to a square shape and resized to 224 × 224 pixels using bi-cubic interpolation (some images were up-scaled); the 283 images that were originally in RGB format were converted to grayscale. From the original annotations, we derived a binary classification task for pneumonia presence as well as a multi-class task differentiating between normal images, bacterial pneumonia and viral pneumonia.

  • skinl_derm, Skin Lesion Evaluation (Dermoscopy): A dataset containing dermoscopic color images of skin lesions, along with corresponding labels for seven different evaluation criteria and the diagnosis, derived from29. The images were zero-padded to obtain a square image and then resized to 224 × 224 pixels using bi-cubic interpolation. We adopted an overall diagnostic multi-class task, as well as separate classification tasks for each of the seven diagnostic criteria from the source dataset. Tasks containing infrequent labels have additional grouped versions which bundle the infrequent labels together into more frequent labels. This grouping is provided by the source dataset.

  • skinl_photo, Skin Lesion Evaluation (Clinical Photography): A dataset containing clinical color photography images of skin lesions, along with corresponding labels for seven different evaluation criteria and the diagnosis, derived from29. This dataset contains the same subjects as Skin Lesion Evaluation (Dermoscopy) and images were preprocessed in the same manner. The tasks are also identical to Skin Lesion Evaluation (Dermoscopy).

Table 2 All MedIMeta tasks.
Fig. 1 Example images of all MedIMeta datasets.

We open-source all code for creating the above datasets from their respective source materials at https://github.com/StefanoWoerner/medimeta-dataset-scripts. Our published source code contains easy-to-use utility functions to extend MedIMeta with additional datasets. As examples, we provide several additional pipelines that create more datasets in the same format from other medical data which is publicly available but not published under a license that allows redistribution of derivative works.
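The shared preprocessing recipe described above (zero-padding or center-cropping to a square, bi-cubic resizing to 224 × 224 pixels, and, for the CT-derived organ datasets, Hounsfield-unit windowing) can be sketched as follows. The function names are our own illustration rather than the released utilities:

    from PIL import Image
    import numpy as np

    def pad_to_square(img):
        # Zero-pad a PIL image to a square shape, keeping the content centered.
        w, h = img.size
        side = max(w, h)
        canvas = Image.new(img.mode, (side, side), 0)
        canvas.paste(img, ((side - w) // 2, (side - h) // 2))
        return canvas

    def standardize(img, size=224):
        # Pad to square, then resize with bi-cubic interpolation
        # (masks use nearest-neighbor interpolation instead).
        return pad_to_square(img).resize((size, size), Image.BICUBIC)

    def window_hounsfield(hu_slice, level=50, width=400):
        # Map Hounsfield units to uint8 grayscale with an abdominal CT window.
        lo, hi = level - width / 2, level + width / 2
        clipped = np.clip(hu_slice, lo, hi)
        return ((clipped - lo) / (hi - lo) * 255).astype(np.uint8)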

Python package

We release a Python package called medimeta that enables convenient data loading; it is installable via pip install medimeta. Users can load data as single datasets or as cross-domain batches, and the package also supports loading few-shot tasks with support and query sets. Additionally, it is fully compatible with TorchCross30, a library for cross-domain and few-shot learning.

Data Records

We have organized and made available the data files of the MedIMeta dataset on Zenodo31. The data for all datasets within MedIMeta can be accessed via the provided DOI. Each dataset is packaged in a single zip file for convenience and can be downloaded either separately or in one batch together with all other datasets.

Each zip file contains a structured and organized collection of data, designed to facilitate ease of use and comprehensive understanding of the dataset. The content of these zip files is as follows:

  • images This folder contains all the image files for the dataset as uint8 TIFFs, named sequentially (e.g., 000000.tiff, 000001.tiff, etc.), ensuring easy access and exploration for human viewers.

  • splits This directory includes text files (train.txt, val.txt, and test.txt) listing the image paths belonging to each respective split.

  • original_splits For source datasets with pre-existing split definitions, this directory contains those original splits, allowing users to adhere to the original data partitioning if desired. The format is the same as in the splits directory.

  • task_labels Each task within the dataset is accompanied by an .npy file named with the respective task’s name (e.g., task_name_1.npy, task_name_2.npy, etc.) contained in this folder. Each .npy file contains a single NumPy array that represents the labels associated with that specific task.

  • annotations.csv This file provides a comprehensive set of annotations, such as the patient id or imaging plane, for the images in the dataset, offering detailed insights and data points for those interested. It also contains all task labels to make them more readily accessible to human readers.

  • images.hdf5 This file contains all the images from the images folder, formatted as a single dataset with dimensions N × H × W × C, where N represents the number of images, H the height, W the width, and C the number of channels. The HDF5 file is useful for reading data in machine learning applications (see the reading sketch after this list).

  • info.yaml This file contains all relevant information about the dataset, including its ID, name and description, the number of images in each split, domain identifier, task definitions, and attribution information.

  • LICENSE Each dataset is accompanied by its specific license file. We publish all datasets under a Creative Commons license: the majority under CC BY-SA 4.0, while some datasets carry a non-commercial license due to the source material licensing. The specific CC license for each dataset is listed in Table 2.
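For users who prefer to bypass the medimeta package, these files can be read directly. The sketch below assumes the OCT dataset has been extracted to data/MedIMeta/oct; the path, HDF5 key handling and task name are illustrative:

    import h5py
    import numpy as np
    import yaml

    base = "data/MedIMeta/oct"  # assumed extraction path for the OCT dataset

    with open(f"{base}/info.yaml") as fh:
        info = yaml.safe_load(fh)  # dataset ID, name, task definitions, ...

    with h5py.File(f"{base}/images.hdf5", "r") as fh:
        key = list(fh.keys())[0]   # the file holds a single N x H x W x C image array
        images = fh[key][:]

    labels = np.load(f"{base}/task_labels/Disease.npy")  # labels for the "Disease" task

    with open(f"{base}/splits/train.txt") as fh:
        train_paths = [line.strip() for line in fh]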

Technical Validation

In order to validate our proposed dataset, we used it in two distinct learning scenarios. First, we performed simple supervised training on each of the datasets on its primary task. Second, we investigated the utility of our dataset for CD-FSL. In the following, we describe the experimental setups for the two scenarios in more detail.

Supervised learning experiments

For the supervised experiments, we trained ResNet-18 and ResNet-50 models32 on the primary task of each dataset. All networks were initialized with pre-trained weights from ImageNet2. Early stopping was performed using the AUROC on the respective validation sets. We performed a simple hyper-parameter search over data augmentation, learning rate and weight decay. We note that the official test split of the Diabetic Retinopathy (Ultra-widefield Fundus) dataset does not contain any samples of the class “PDR”. Moreover, the dataset contains only two patients with this class in total, which prevents creating a custom train-test split with a better class balance. To account for this, we trained and evaluated both models without this class.
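A minimal sketch of the model setup, assuming a current torchvision version (hyper-parameters, augmentation and the training loop are omitted):

    import torch.nn as nn
    from torchvision.models import resnet18, ResNet18_Weights

    def make_model(num_classes):
        model = resnet18(weights=ResNet18_Weights.IMAGENET1K_V1)  # ImageNet initialization
        model.fc = nn.Linear(model.fc.in_features, num_classes)   # task-specific output layer
        return model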

Cross-domain few-shot learning (CD-FSL) experiments

In this evaluation, we compare 5-shot performance using three different CD-FSL approaches (described below). We evaluated the performance for each dataset in a leave-one-out fashion, using one task as the target task and the tasks from all datasets with no domain or subject overlap with the target as source tasks to transfer knowledge from. The knowledge transfer was achieved by first training a common backbone on the source tasks, and then fine-tuning the network to the target task using a small support set of 5 labeled examples per class. Evaluation was performed on a distinct query set consisting of 10 samples per class from the same task. We exclude all classes with fewer than 15 samples from the fine-tuning and evaluation, since at least 15 samples per class are needed to sample the disjoint support and query sets. Figure 2 illustrates the training and evaluation procedure.

Fig. 2 An overview of the CD-FSL scenario: the few-shot learner is first trained on the meta-dataset of highly diverse training data. It is then adapted to a new task from a new domain using the labeled examples from the support set of a few-shot task. Performance is assessed using a query set from the same task.

Because performance can vary substantially between runs depending on the quality of the 5 labeled examples per class, we reran the experiments 100 times for each target task to obtain a more robust estimate of the performance.
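For a multi-class target task, episode sampling can be sketched as follows (a minimal illustration with our own function name; multi-label tasks require analogous per-label handling):

    import numpy as np

    def sample_episode(labels, n_support=5, n_query=10, seed=None):
        # Sample disjoint support and query indices per class; classes with
        # fewer than n_support + n_query samples are excluded beforehand.
        rng = np.random.default_rng(seed)
        support, query = [], []
        for c in np.unique(labels):
            idx = rng.permutation(np.flatnonzero(labels == c))
            support.extend(idx[:n_support])
            query.extend(idx[n_support:n_support + n_query])
        return np.array(support), np.array(query)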

Baseline CD-FSL algorithms

ImageNet pre-training (IN-PT)

The simplest baseline we investigated initializes the common backbone network with ImageNet weights and then directly fine-tunes it on the target task.

Multi-domain multi-task pre-training (mm-PT)

Pre-training using ImageNet lacks specificity to the medical domain. Incrementally pre-training on a series of available datasets may offer a strategy to learn from many related datasets rather than just one33. However, incremental pre-training may suffer from catastrophic forgetting of earlier tasks34. To address this issue, we propose a multi-domain multi-task pre-training schedule, where for each model update we sample a batch from a random source task. This strategy may facilitate learning representations suitable for a wide range of tasks. The algorithm is summarized in the supplemental materials.
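A single mm-PT update can be sketched as follows. The names are illustrative, and the task-specific output head per source task is our assumption; the full algorithm is given in the supplemental materials:

    import random

    def mm_pt_step(backbone, heads, task_iters, loss_fns, optimizer):
        task = random.choice(list(task_iters))  # draw a random source task
        x, y = next(task_iters[task])           # one batch from that task
        logits = heads[task](backbone(x))       # shared backbone, task-specific head
        loss = loss_fns[task](logits, y)        # task-appropriate loss (CE, BCE, ...)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()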

Multi-domain Multi-task MAML (mm-MAML)

Model-agnostic meta-learning (MAML)35 has been shown to be a promising strategy for CD-FSL36. MAML first “learns to learn” from a set of training tasks before learning the desired test task. However, MAML assumes identical task types and numbers of classes across tasks, which is not realistic in practical settings. Here, we employed our previously proposed Multi-domain Multi-task MAML (mm-MAML) strategy37, in which an individual classification layer is used for each class, simply initialized with zeros. The algorithm is summarized in the supplemental materials.
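The zero-initialized classification weights can be sketched as follows (a minimal sketch, where feat_dim denotes the feature dimension of the backbone):

    import torch.nn as nn

    def zero_init_head(feat_dim, num_classes):
        head = nn.Linear(feat_dim, num_classes)
        nn.init.zeros_(head.weight)  # zero-initialized classification weights
        nn.init.zeros_(head.bias)
        return head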

Results of validation experiments

Table 3 shows the results for the single-domain baselines. The models were able to achieve high performance in terms of AUROC for most datasets. The comparatively lower performance on some of the tasks reflects their difficulty: all tasks with scores below 80 contain classes with very few examples, making these tasks harder to learn. Since our fully supervised baselines are generic and simple methods not tailored to a specific task, we do not expect state-of-the-art performance on all datasets. As expected, the more complex ResNet-50 model achieved slightly higher performance than the ResNet-18 for most datasets.

Table 3 AUROC (%) on the test set for the fully supervised baselines.

Table 4 shows 5-shot results for the CD-FSL baselines described earlier. Surprisingly, simple fine-tuning from pre-trained ImageNet weights performed as well as or better than fine-tuning from the mm-PT and mm-MAML baselines. This may indicate that the pre-training we performed before fine-tuning is too simple to bring a meaningful benefit. At the same time, it is also apparent that simple few-shot learning methods such as these do not come close to the performance of fully supervised training. We therefore conclude that MedIMeta offers sufficient complexity for evaluating future few-shot methods, with the fully supervised results setting an upper bound for future few-shot methods to aim for.

Table 4 AUROC (%) for the CD-FSL baselines averaged across 100 5-shot episodes using a ResNet-18 and ResNet-50.

Usage Notes

All datasets contained in the MedIMeta dataset31 can be downloaded from Zenodo. Using the code provided in our data loaders repository, all tasks in MedIMeta can easily be loaded as PyTorch datasets for single-domain, cross-domain and few-shot scenarios. No further pre-processing is required, but it is possible to pass any TorchVision transforms to the dataset class when initializing it. A simple step-by-step example of how to load a single dataset is as follows.

  1. Download the zip file for the dataset you would like to use (e.g. OCT) from the Zenodo record31, and extract it to a directory of choice. Let us use ./data/MedIMeta here.

  2. Install the Python package via pip install medimeta.

  3. The data can now be used as a PyTorch dataset. The following code snippet instantiates the dataset for the Disease task of the OCT dataset, assuming the data is stored in the data/MedIMeta directory.

    from medimeta import MedIMeta

    dataset = MedIMeta("data/MedIMeta", "oct", "Disease")
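From here, standard PyTorch tooling applies; for example (a minimal sketch, assuming the dataset yields image-label pairs):

    from torch.utils.data import DataLoader

    loader = DataLoader(dataset, batch_size=32, shuffle=True)
    images, labels = next(iter(loader))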


The repository also contains an “examples” directory with several usage examples for single-domain training, cross-domain training, and few-shot fine-tuning.

Practical limitations

We have identified a number of practical limitations of our dataset, which we briefly discuss in this section. Firstly, our meta-dataset contains several datasets which vary substantially in the number of samples. Some of these datasets are rather small, limiting their practical use for applications that require a large amount of data. Secondly, some of the tasks are separated from their clinical context and may therefore lack clinical realism. For instance, medical professionals holistically evaluate multiple mammography views of the same patient instead of only looking at a small region of interest. When balancing practicality for machine learning with clinical realism, we have consciously prioritized the former while keeping the clinical tasks as realistic as possible. Thirdly, the scope of our meta-dataset does not completely encompass all modalities and anatomical regions typically seen in medical imaging, and it is comprised solely of 2D images. In clinical practice, 3D images, videos, and other formats are commonly used in addition to 2D images. Additionally, clinicians often score images on a multidimensional scale, while most of the datasets included in MedIMeta include only classification labels. Another potential limitation is the common size and format of all images in MedIMeta, which might not be optimal for every individual application domain. However, this is a very significant advantage for ease of use and, once again, a conscious trade-off in favor of practicality for machine learning.