Introduction

Monitoring and tracking of cell motion is a key component for understanding disease mechanisms and evaluating treatments1,2. Cell migration, i.e., cell motion that is accompanied by morphological changes, and mitosis, i.e., cell division into two daughters, occur during development, wound healing, and immune responses. In cancer, for example, cell migration is critical in almost all stages of metastasis. Quantifying the progression of cell cycles enables us to understand the effect of drug treatments. In addition, tissue engineering and stem cell manufacturing require quantification of stem cell proliferation in vitro3,4,5.

Time-lapse optical microscopy techniques have been widely used for studying cell migration. The most common techniques toward this end are fluorescence microscopy, phase contrast (PhC), and differential interference contrast (DIC). These technologies produce image sequences, often on the order of terabytes, with large numbers of frames and dense or sparse cell environments depending on the assay. Many labs use manual cell tracking techniques. Such techniques require a significant amount of time and are characterized by poor reproducibility. Hence, there is a clear need for the development of automated, reproducible cell segmentation and tracking methods.

Automated cell tracking mainly entails finding and segmenting all cell occurrences in a time-lapse sequence and defining temporal relationships. This task is complicated by the following factors: (1) variability of cell shapes, sizes, and intensity distributions, (2) variability of cell contrast produced by different microscopy techniques, and (3) temporal and spatial resolution limitations and varying contrast-to-noise ratio, even within a single frame of a sequence. Therefore, the development of generalizable computer-based tools for cell tracking is a popular open problem in the research community5,6,7,8. Recent software tools implement segmentation and tracking stages9,10,11 but were mostly developed for specific microscopy techniques such as fluorescence10, or phase contrast microscopy11. Due to the significance and popularity of the cell tracking problem, computer-based cell segmentation and tracking competitions, such as the Cell Tracking Challenge (CTC)6,12, have emerged to provide a common basis for performance comparisons. Furthermore, numerous techniques have been published to tackle cell segmentation, detection and tracking9,13,14,15,16,17,18,19,20,21.

In this work, we propose a methodology for generalizable cell segmentation and tracking that can be applied to multiple types of cells and microscopy imaging techniques, which we denote as CSTQ (Cell Segmentation Tracking and Quantification). A frequent pitfall in segmentation of microscopy sequences is the unpredictable change of contrast and noise levels between different sequences, and even from frame to frame within a sequence. To address this key issue, we aim to identify the optimal scale for cell detection and segmentation using multi-scale space-time interest point detectors. Analysis of space, time, and space-time interest points in a multi-scale hierarchical strategy has yielded promising results for multiple computer vision tasks, such as segmentation and object detection22,23,24,25,26,27,28. Here we propose to use these detectors (i) to select an optimal scale for each frame, (ii) to identify interest points on the cells, and (iii) to guide segmentation of cell boundaries. We also introduce a multi-layer neural network to detect the cell regions and complete the cell segmentation stage. Cell detection classes are frequently imbalanced, as there are far more noisy background regions than cell regions. This imbalance introduces classification bias. To address this issue, we propose to use the method of Density-Based Spatial Clustering of Applications with Noise (DBSCAN)29,30,31 before training, to identify the main cluster prototypes in the feature space and resample the training samples. DBSCAN is an unsupervised nonparametric technique that has also been previously proposed for 3D cell separation9, and in a U-Net boosting training strategy for brain tumor segmentation in MRI32. In the cell tracking stage, we use a graph-based technique to identify cell migration and cell mitosis events. First, we employ optical flow-based motion compensation and cell matching. Next, we form cell tracklets by linking cell instances across frames, and we identify cell migration and mitosis events to complete the tracking stage.

A key contribution of the proposed technique is that it incorporates the concept of space-time interest point detection and characterization into automatic scale selection, cell detection, and segmentation. An additional strength is the use of a machine learning technique with class prototype balancing for cell detection. Furthermore, our method offers a well-defined and structured mathematical framework that uses graphs for track generation and cell event detection. We evaluated the segmentation, detection and tracking performance of our method on four fluorescence and two phase contrast time-lapse sequences of the Cell Tracking Challenge. In addition, we compared our technique to top performing techniques from CTC. The performance evaluation results indicate that the proposed methodology is competitive with these techniques, and it generalizes very well to diverse cell types and sizes, and diverse spatial and temporal resolutions.

Methods

In this section we detail the two main components of our framework: cell segmentation and detection, and cell tracking.

Cell segmentation and detection

We display the main stages for cell segmentation and detection in Fig. 1. We describe each of these stages next, with emphasis on multi-scale interest point detection for automated scale selection and segmentation, and neural network-based foreground/background separation for cell detection.

Fig. 1
figure 1

The main stages of cell segmentation and detection. Multi-scale interest point detection is followed by scale selection and superpixel segmentation by watershed. Then the cell regions are detected by a neural network. This is followed by cell boundary delineation and cell cluster separation.

Multi-scale cell interest point detection and segmentation

Spatio-temporal anisotropic diffusion. This filter is applied by solving a system of three coupled partial differential equations (PDEs), each resembling the diffusion equation with anisotropic diffusion33,34. This system is solved for each frame at time t and its direct temporal neighbors \(\tau = \{ t - 1, t, t + 1 \}\):

$$\begin{aligned} \frac{\partial L(x,y,\tau ,s)}{\partial s} = g(\vert \nabla L(x,y,\tau ,s) \vert ) \cdot \Delta L(x,y,\tau ,s) + \nabla g(\vert \nabla L(x,y,\tau ,s) \vert ) \cdot \nabla L(x,y,\tau ,s). \end{aligned}$$
(1)

Here, x and y are the spatial indices of the pixels, and s is the scale index. This stage generates multi-scale diffused frame maps. The diffused frame maps are utilized for spatio-temporal feature detection, segmentation, and foreground/background separation.
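As a concrete illustration, the following Python sketch applies explicit Perona–Malik-style diffusion steps of Eq. (1) to a three-frame stack; the exponential conductance function, the step size, the gradient threshold K, and the number of scales are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def g(grad_mag, K=0.05):
    """Conductance function; the exponential form is an assumption."""
    return np.exp(-(grad_mag / K) ** 2)

def diffuse_spatiotemporal(stack, n_scales=8, dt=0.15, K=0.05):
    """Perona-Malik-style diffusion on a (t-1, t, t+1) frame stack.

    `stack` has shape (3, H, W); returns one diffused stack per scale,
    approximating Eq. (1) with an explicit scheme (wrap-around boundary
    handling via np.roll is a simplification).
    """
    L = stack.astype(np.float64).copy()
    scales = []
    for _ in range(n_scales):
        # differences toward each of the 6 spatio-temporal neighbors
        dN = np.roll(L, 1, axis=1) - L   # y - 1
        dS = np.roll(L, -1, axis=1) - L  # y + 1
        dW = np.roll(L, 1, axis=2) - L   # x - 1
        dE = np.roll(L, -1, axis=2) - L  # x + 1
        dB = np.roll(L, 1, axis=0) - L   # previous frame
        dF = np.roll(L, -1, axis=0) - L  # next frame
        # divergence of g(|grad L|) * grad L, one term per neighbor
        L = L + dt * (g(np.abs(dN), K) * dN + g(np.abs(dS), K) * dS
                      + g(np.abs(dW), K) * dW + g(np.abs(dE), K) * dE
                      + g(np.abs(dB), K) * dB + g(np.abs(dF), K) * dF)
        scales.append(L.copy())
    return scales
```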

Spatio-temporal Interest Point (Feature) Detectors. The goal is to detect corners, blob structures, and other feature types in the spatio-temporal domain22,27. Earlier techniques have used Laplacian-of-Gaussian blob detectors at a single scale to detect and count cell nuclei35,36. We expect that the use of non-linear diffusion models will enable the detection of non-spherical features that may correspond to non-circular cell shapes, and of how they evolve across multiple scales. We also estimate motion displacement vector fields of the moving regions37,38,39,40 and add them to the feature set for machine learning-based cell detection.

Spatio-temporal Moments: These are the spatio-temporal second moments22 computed by pairwise products of the first-order derivatives of the pixel intensities, as an extension of the Harris corner detector. This approach computes the structure tensor

$$\begin{aligned} { M(x,y,t) = w(\cdot ,\sigma _s^2,\sigma _t^2) * \left( \begin{array}{ccc} L_x^2(x,y,t) & L_x(x,y,t)L_y(x,y,t) & L_x(x,y,t)L_t(x,y,t) \\ L_y(x,y,t)L_x(x,y,t) & L_y^2(x,y,t) & L_y(x,y,t)L_t(x,y,t) \\ L_t(x,y,t)L_x(x,y,t) & L_t(x,y,t)L_y(x,y,t) & L_t^2(x,y,t) \end{array} \right) , } \end{aligned}$$
(2)

where \(w(\cdot ,\sigma _s^2,\sigma _t^2)\) denotes a smoothing kernel with integration scales \(\sigma _s^2\) and \(\sigma _t^2\) for the spatial and temporal coordinates, respectively. The points of interest are then identified by the Harris corner detection criterion

$$\begin{aligned} S(x,y,t) = det(M(x,y,t)) - \kappa \cdot trace^3(M(x,y,t)). \end{aligned}$$
(3)
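A compact sketch of Eqs. (2)–(3) in Python (NumPy/SciPy) might look as follows; the derivative scale, the integration scales, and the value of \(\kappa\) are illustrative choices.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def harris3d_response(L, sigma_d=1.0, sigma_s=2.0, sigma_t=1.5, kappa=0.005):
    """Spatio-temporal Harris response (Eqs. 2-3) on a (T, H, W) volume."""
    # first-order derivatives along t, y, x of the pre-smoothed volume
    Lt, Ly, Lx = np.gradient(gaussian_filter(L.astype(np.float64), sigma_d))
    smooth = lambda A: gaussian_filter(A, (sigma_t, sigma_s, sigma_s))
    # smoothed second-moment (structure tensor) entries
    Mxx, Myy, Mtt = smooth(Lx * Lx), smooth(Ly * Ly), smooth(Lt * Lt)
    Mxy, Mxt, Myt = smooth(Lx * Ly), smooth(Lx * Lt), smooth(Ly * Lt)
    detM = (Mxx * (Myy * Mtt - Myt ** 2)
            - Mxy * (Mxy * Mtt - Myt * Mxt)
            + Mxt * (Mxy * Myt - Myy * Mxt))
    traceM = Mxx + Myy + Mtt
    return detM - kappa * traceM ** 3   # Eq. (3)
```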

Spatio-temporal Hessian: This approach computes the Hessian matrix H composed by the second order derivatives of the pixel intensities in the space-time domain

$$\begin{aligned} H(x,y,t) = \left( \begin{array}{ccc} L_{xx}(x,y,t) & L_{xy}(x,y,t) & L_{xt}(x,y,t) \\ L_{yx}(x,y,t) & L_{yy}(x,y,t) & L_{yt}(x,y,t) \\ L_{tx}(x,y,t) & L_{ty}(x,y,t) & L_{tt}(x,y,t) \end{array} \right) . \end{aligned}$$
(4)

It measures the strength of the interest points using the determinant magnitude, \(S(x,y,t) = \vert det(H(x,y,t)) \vert\).

Spatial Hessian on Temporal Derivative: This approach computes the spatial Hessian matrix on first-order temporal derivatives \(L_t\), which we denote by \(H_t\). It measures the strength of the interest points using the determinant \(S(x,y,t) = \vert det(H_t(x,y,t)) \vert .\)
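Both Hessian-based detectors can be sketched in a few lines; the smoothing scale below is an assumption, and boundary handling is left to the scipy defaults.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def hessian_det_response(L, sigma=1.5, on_temporal_derivative=False):
    """|det H| interest-point strength on a (T, H, W) volume (Eq. 4).

    With on_temporal_derivative=True, the 2x2 spatial Hessian is computed
    on L_t instead (the H_t variant).
    """
    V = gaussian_filter(L.astype(np.float64), sigma)
    if on_temporal_derivative:
        V = np.gradient(V, axis=0)                        # L_t
        Vy, Vx = np.gradient(V, axis=1), np.gradient(V, axis=2)
        Vyy, Vyx = np.gradient(Vy, axis=1), np.gradient(Vy, axis=2)
        Vxx = np.gradient(Vx, axis=2)
        return np.abs(Vyy * Vxx - Vyx ** 2)               # 2x2 determinant
    Vt, Vy, Vx = np.gradient(V)                           # full 3x3 Hessian
    Vtt, Vty, Vtx = np.gradient(Vt)
    _, Vyy, Vyx = np.gradient(Vy)
    _, _, Vxx = np.gradient(Vx)
    det = (Vtt * (Vyy * Vxx - Vyx ** 2)
           - Vty * (Vty * Vxx - Vyx * Vtx)
           + Vtx * (Vty * Vyx - Vyy * Vtx))
    return np.abs(det)
```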

Motion map computation and cell deformation. We estimate the motion of each cell in the prior frame by the Combined Local/Global Optical Flow (CLGOF) technique37,38,39,40. We use the computed motion vector field as a region descriptor for cell detection, and to warp the cells of each prior frame onto the cells of each processed frame.

Automatic scale selection in the spatio-temporal domain. In this stage, we aim to identify the image scale that preserves meaningful cell visual content while suppressing noise. For each frame, we develop a stack of multi-scale feature descriptors in the spatio-temporal domain. The feature loci are defined by regional maxima on each scale. We identify the optimal scale as the one that minimizes the number of noisy features, which should diminish rapidly in the initial scales, while preserving the features of cells, which may persist longer along the scale-space continuum. To this end, we analyze the second-order derivatives of features across multiple scales, similarly to approaches applied to natural video sequences27. First, we normalize the feature vectors so that we can automatically choose the diffused map with the best feature descriptors. Next, we compute the second-order derivative of the spatio-temporal feature vectors to determine a decision criterion for scale selection. Then, the selected-scale diffused map, along with the regional maxima, drives the subsequent cell segmentation and detection stages. Figure 2 illustrates an example of scale generation, keypoint computation, and scale selection.
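The selection rule can be sketched as follows; the peak-detection parameters and the elbow criterion (largest second difference of the per-scale feature count) are one plausible reading of the curvature analysis described above, not the exact implementation.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def count_interest_points(response, footprint=5, rel_thresh=0.1):
    """Number of regional maxima of one scale's feature response map."""
    peaks = response == maximum_filter(response, size=footprint)
    return int(np.sum(peaks & (response > rel_thresh * response.max())))

def select_scale(responses):
    """Pick the scale where the feature count stops changing rapidly.

    `responses` is the per-scale list of normalized feature maps; noisy
    features vanish quickly over the first scales, so the point of
    largest curvature of the count curve marks the transition.
    """
    counts = np.array([count_interest_points(r) for r in responses], float)
    d2 = np.diff(counts, n=2)              # discrete second derivative
    return int(np.argmax(np.abs(d2)) + 1)  # d2[k] is centered at scale k+1
```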

Fig. 2
figure 2

Example of scale generation and automatic selection. It shows the generation of spatio-temporal non-linear scale space, computation of feature map, and the analysis of interest points (features) over the generated scales for automated scale selection.

Watershed and interest point-based segmentation. This stage generates markers to drive the watershed segmentation by fusing the regional interest point feature maxima with the regional minima of the edge map computed on the selected diffused map. First, we invert the stochastic map produced by Parzen density estimation to form regions separated by spatio-temporal discontinuities. Then, we fuse the feature maxima with the density map and produce watershed superpixels.
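A minimal sketch of this marker-controlled watershed using scikit-image, assuming the feature response map and the Parzen density map have already been computed; the minimum peak distance is an illustrative parameter.

```python
import numpy as np
from skimage.segmentation import watershed
from skimage.feature import peak_local_max

def watershed_superpixels(feature_response, density_map, min_dist=5):
    """Marker-controlled watershed over the inverted density relief.

    Markers come from regional maxima of the feature response; the
    inverted Parzen density map supplies the relief whose ridges
    separate the superpixels.
    """
    relief = 1.0 - density_map / density_map.max()   # inverted stochastic map
    peaks = peak_local_max(feature_response, min_distance=min_dist)
    markers = np.zeros(feature_response.shape, dtype=int)
    markers[tuple(peaks.T)] = np.arange(1, len(peaks) + 1)
    return watershed(relief, markers)
```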

Machine learning-based cell detection

Here we introduce a model that can classify the watershed superpixels into cells (foreground) and non-cell (background) classes. We employed a fully connected neural network to perform this classification. We trained the network on static and dynamic descriptors formed by the regional averages and variances of diffused map intensities, spatio-temporal descriptors, motion displacement vector magnitudes, spatio-temporal intensity gradient magnitudes, and regional areas. The training vectors are designed to enable the model to learn foreground pixels for cells with irregular shapes, variable feature magnitudes and rates of motion, and varying contrast to noise ratios.

One difficulty in effective training of this machine learning model is the class imbalance between cell and non-cell regions as the class of non-cells typically has many more samples than the cell class. This imbalance frequently leads to classification bias. In this work, we propose to use DBSCAN (Density-Based Spatial Clustering of Applications with Noise)29,30,31 to identify the main cluster prototypes formed in the feature space by the training samples. We measure the cluster sizes (or cardinalities) and perform balancing of the cluster sizes by undersampling the cluster with the most elements. This unsupervised learning stage balances class prototypes to improve neural network training.

DBSCAN contributes a density-based cluster model together with an algorithm that extracts the clusters. It is based on the assumption that clusters correspond to areas of high density that are separated by areas of lower density. The data points are classified into core points, border points, and noise. It performs nonparametric density estimation by computing the number of points within the \(\epsilon\)-neighborhood of each query point. DBSCAN effectively computes the transitive closure of a density-reachability relation. It can identify clusters of arbitrary shapes and sizes, while being less sensitive to noise and outliers.

To balance the cluster cardinalities in the feature space for effective training of the cell detection network without overfitting, we under-sample the most populous cluster to equalize its size with the size of the second most populous cluster. This operation reduces class imbalance in the training data without changing the structure of the data in the feature space. In DBSCAN we used Mahalanobis distances for improved modeling of arbitrary shapes of clusters.
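The balancing step can be sketched as follows with scikit-learn; the eps and min_samples values are assumptions, and whitening the features beforehand is a shortcut that makes Euclidean DBSCAN equivalent to using Mahalanobis distances.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def dbscan_balance(X, y, eps=0.8, min_samples=10, seed=0):
    """Undersample the most populous DBSCAN cluster down to the size of
    the second most populous one, as described above.

    X: (N, d) regional feature vectors; y: (N,) cell/non-cell labels.
    """
    rng = np.random.default_rng(seed)
    # whitening: Euclidean distance on Xw equals Mahalanobis distance on X
    Xw = (X - X.mean(0)) @ np.linalg.cholesky(np.linalg.inv(np.cov(X.T)))
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(Xw)
    clusters, sizes = np.unique(labels[labels >= 0], return_counts=True)
    if len(clusters) < 2:
        return X, y                      # nothing to balance
    order = np.argsort(sizes)[::-1]
    big, target = clusters[order[0]], sizes[order[1]]
    members = np.flatnonzero(labels == big)
    drop = rng.choice(members, size=len(members) - target, replace=False)
    keep = np.ones(len(X), dtype=bool)
    keep[drop] = False
    return X[keep], y[keep]
```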

Next, we train the neural network model on the balanced dataset produced by the previous step. We employed a network with a layer structure of \(9 \times 12 \times 6 \times 2\) nodes; the fully connected layers are followed by ReLU activations. The first layer takes as input the regional feature vectors, and the network predicts whether the input region descriptor corresponds to a cell or a non-cell region. We employ the cross-entropy loss function for training.
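A minimal PyTorch sketch of this detector; the optimizer, learning rate, and epoch count are illustrative, not values from the paper.

```python
import torch
import torch.nn as nn

# 9 x 12 x 6 x 2 fully connected network with ReLU activations
model = nn.Sequential(
    nn.Linear(9, 12), nn.ReLU(),
    nn.Linear(12, 6), nn.ReLU(),
    nn.Linear(6, 2),               # output logits: non-cell vs. cell
)
criterion = nn.CrossEntropyLoss()  # cross-entropy on raw logits
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train(features, labels, epochs=200):
    """features: (N, 9) float32 tensor of balanced regional descriptors,
    e.g. torch.tensor(X_bal, dtype=torch.float32);
    labels: (N,) long tensor with 1 = cell, 0 = non-cell."""
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = criterion(model(features), labels)
        loss.backward()
        optimizer.step()
```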

Cell boundary refinement and cell cluster separation

In this stage, we aim to refine the cell boundaries detected in the previous stage of foreground/background separation. We employ a region-based energy-minimizing level-set model with a local clustering criterion to delineate the cell regions. Next, we identify and separate adherent (or clustered) cells. We compute the solidity to identify the clustered cells, and then we compute the signed distance map of the identified regions. The watershed lines of the distance map determine the boundaries that divide the cells.
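The solidity-based cluster separation can be sketched with scikit-image as follows; the solidity threshold and the minimum seed distance are illustrative parameters.

```python
import numpy as np
from scipy import ndimage as ndi
from skimage.measure import regionprops
from skimage.segmentation import watershed
from skimage.feature import peak_local_max

def split_clustered_cells(labels_in, solidity_thresh=0.9, min_dist=7):
    """Divide low-solidity (clustered) components along the watershed
    lines of their distance map, as described above."""
    out = labels_in.copy()
    for region in regionprops(out):
        if region.solidity >= solidity_thresh:
            continue                                  # convex enough: keep
        sub = out[region.slice] == region.label
        dist = ndi.distance_transform_edt(sub)        # interior distance map
        seeds = peak_local_max(dist, min_distance=min_dist, labels=sub)
        if len(seeds) < 2:
            continue                                  # nothing to split
        markers = np.zeros_like(out[region.slice])
        markers[tuple(seeds.T)] = np.arange(1, len(seeds) + 1)
        parts = watershed(-dist, markers, mask=sub)   # watershed lines divide
        out[region.slice][sub] = parts[sub] + out.max()   # fresh labels
    return out
```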

Cell tracking

A cell tracking technique should effectively manage key cell events in a sequence, such as cell migration (cell motion that is accompanied by morphological changes), cell mitosis (division into two daughters), cell disappearance (leaving the field of view or collision), cell apoptosis/necrosis (cell death), new cells entering the field of view, and cell reappearance (re-entering the field of view). This is a challenging task, mainly because of low temporal resolution and variable types of cell motion. For example, cells may exhibit Brownian motion, which makes motion estimation difficult. We developed a graph optimization technique that can be divided into the following stages: bi-frame cell matching, construction of cell tracks, and cell event detection41, as shown in Fig. 3.

Fig. 3
figure 3

Cell tracking method overview. Bi-frame cell matching is applied first by motion compensation and max likelihood matching. Cell tracklets are formed by linking cell correspondences over the complete time sequence. A graph structure is used to store the cell relationships and detect migration and mitosis events.

Bi-frame cell matching

In this stage, we find cell correspondences between two consecutive frames, which we denote the current frame and the prior frame. We find the most probable association by determining the maximum-likelihood match for each cell of the current frame among all cells of the prior frame, warped by the motion field computed by CLGOF as described above.

Let \(X_t = \{x_1^t, x_2^t, \ldots , x_{n_t}^t\}\) and \(\Xi _t = \{ \xi _1^t, \xi _2^t, \ldots , \xi _{n_t}^t\}\) be the set of observations and the set of states, respectively, for \(n_t\) cells at time t. We assume a Markov process such that the transition probability depends only on the state of the mother cell or the same cell in the previous frame, \(P(S_{t+1} | S_{t})\). We compute the likelihood \(\pi _{ij}^{t+1}\) that cell i in frame \(t+1\) is connected with cell j in frame t based on the spatial proximity between cells of the warped indicator function \(\hat{L}(\omega + \hat{\delta },t+1)\) and cells of the current indicator function \(L(\omega , t)\):

$$\begin{aligned} \hat{j} = arg \max _j \pi _{ij}^{t+1} = arg \max _j p(x_i^{t+1} \vert x_{j}^{t},\xi _j^{t}). \end{aligned}$$
(5)

We also use a reject option that signifies cell appearance when the maximum likelihood is very low. After making a decision, we assign the predicted cell class \(\hat{j}\) to the cell indicator map.
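A sketch of this bi-frame matching, assuming the likelihood of Eq. (5) is approximated by the relative overlap between each current cell and the motion-warped prior cells; the reject threshold is an illustrative parameter.

```python
import numpy as np

def match_cells(curr_labels, warped_prev_labels, n_curr, n_prev, reject=0.1):
    """Max-likelihood bi-frame matching (Eq. 5) by warped-mask overlap.

    Returns {current cell id -> matched prior cell id, or None when the
    reject option fires (cell appearance)}.
    """
    matches = {}
    for i in range(1, n_curr + 1):
        region = curr_labels == i
        area = region.sum()
        # overlap histogram against all warped prior labels at once
        counts = np.bincount(warped_prev_labels[region], minlength=n_prev + 1)
        counts[0] = 0                                   # ignore background
        j = int(np.argmax(counts))
        likelihood = counts[j] / max(area, 1)
        matches[i] = j if likelihood >= reject else None
    return matches
```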

Construction of cell tracks

The cell associations are stored in a structure of linked lists. To form the cell tracks, we traverse the linked lists from the last frame to the first. The identified maximum likelihood tracks form a forest of minimal cost chains.

Concretely, we traverse the set of cell states in reverse chronological order and store the cell track information in a structure with elements \(Q = \left[ t, \xi _{m}^{t}, t_p, \xi _{n}^{t_p} \right]\), which contain the frame id t, the cell label (indicator) id \(\xi _{m}^{t}\), the previous frame id \(t_p\), and the previous label id \(\xi _{n}^{t_p}\).

This stage produces a cell track structure that holds the tracks \(\phi _i\) and addresses cell appearance and disappearance. However, some tracks may be partially overlapping in cell division occurrences, where different cells have a common ancestor. We address these cases by identifying the overlapping parts of two tracks \(\phi _j\) and \(\phi _k\), and creating three new tracks, one for the mother \(\phi _p = \phi _j \cap \phi _k\), and the two daughters \(\phi _q = \phi _j - \left( \phi _j \cap \phi _k \right)\) and \(\phi _r = \phi _k - \left( \phi _j \cap \phi _k \right)\).
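The track-construction pass over the Q records, including the splitting of overlapping tracks into a mother and two daughters, might be sketched as follows.

```python
def build_tracks(records):
    """Construct tracks from Q records (t, label, t_prev, label_prev)
    by reverse-chronological traversal of the linked lists.

    Returns tracks as chronological lists of (t, label) nodes; shared
    prefixes of two tracks become a mother track plus two daughters.
    """
    parent = {(t, m): (tp, n) for t, m, tp, n in records if n is not None}
    endpoints = set(parent) - set(parent.values())     # nodes with no successor
    tracks = []
    for node in endpoints:                             # walk each chain backward
        chain = [node]
        while chain[-1] in parent:
            chain.append(parent[chain[-1]])
        tracks.append(chain[::-1])                     # chronological order
    resolved = []
    while tracks:                                      # resolve division overlaps
        tr = tracks.pop()
        for other in tracks:
            common = [a for a, b in zip(tr, other) if a == b]
            if common:                                 # shared ancestry found
                tracks.remove(other)
                resolved.append(common)                # mother track
                for d in (tr[len(common):], other[len(common):]):
                    if d:
                        tracks.append(d)               # daughter tracks
                break
        else:
            resolved.append(tr)
    return resolved
```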

Cell event detection

Cell tracking results can be represented using an acyclic oriented graph \(G=(V, E)\) that consists of a vertex set V and an edge set E such that \(E \subset V \times V\). The graph vertices represent the detected cells and its edges represent temporal relations between the cells. The condition for an ordered vertex pair \(E^{t_1,t_2}_{i,j}\left( \xi _i^{t_1}, \xi _j^{t_2} \right)\) to belong to the edge set E is expressed as:

$$\begin{aligned} \left( i=j \wedge t_2 = t_1 + 1 \right) \vee \left( i \ne j \wedge t_1 < t_2 \wedge P(\xi _j^{t_2}) = \xi _i^{t_1} \right) \Rightarrow E^{t_1,t_2}_{i,j}\left( \xi _i^{t_1}, \xi _j^{t_2} \right) \in E. \end{aligned}$$
(6)

The function \(P:\Xi \rightarrow \Xi\), where \(\Xi\) is the universe of cell states \(\xi _i\), returns the mother \(\xi _m\) of an entity \(\xi _i\), that is \(\xi _m = P(\xi _i)\). The first condition in the previous expression represents cell migration, while the second condition represents cell division. We assign graph node labels \(L_G(V)\) to the cells in each indicator image, which determines a global indicator function \(F_L: \Omega \rightarrow L_G\).
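The event graph of Eq. (6) can be assembled from the resolved tracks; in this sketch, mother_of is a hypothetical mapping implementing P at the track level, and the two edge types mirror the two conditions of Eq. (6).

```python
def build_event_graph(tracks, mother_of):
    """Assemble the acyclic oriented graph of Eq. (6).

    tracks: {track_id: [(t, label), ...]} in chronological order;
    mother_of: {track_id: mother track_id} implementing P.
    """
    edges = []
    for tid, nodes in tracks.items():
        for u, v in zip(nodes, nodes[1:]):      # i = j and t2 = t1 + 1
            edges.append((u, v, "migration"))
        if tid in mother_of:                    # P(xi_j) = xi_i and t1 < t2
            mother_nodes = tracks[mother_of[tid]]
            edges.append((mother_nodes[-1], nodes[0], "division"))
    return edges
```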

Experiments and results

In our performance evaluation experiments, we used Cell Tracking Challenge (CTC) datasets12. CTC datasets consist of time-lapse microscopy sequences and annotations that can serve as benchmarks for cell segmentation, detection and tracking. Our experiments can be divided into two main parts. In the first experiment, we evaluated the performance of our method on CTC training sequences. In the second experiment, we evaluated the performance of our technique in comparison to results of other published techniques that participated in CTC and produced results on the same sequences. Next, we describe the CTC datasets, the evaluation measures, and the results for each experiment.

Datasets

We utilized four fluorescence (Fluo) microscopy datasets and two phase-contrast (PhC) microscopy datasets for training and testing. Each dataset contains two sequences labeled 01 and 02, along with two sets of reference masks named the gold standard corpus or gold truth (GT) and the silver standard corpus or silver truth (ST). The CTC organizers provide the participating teams with the GT and ST datasets for the training sequences, but reference datasets for the challenge datasets are only available to CTC organizers for evaluation.

We evaluated our method on the following fluorescence microscopy sequences: low resolution cytoplasm of rat mesenchymal stem cells Fluo-C2DL-MSC (MSC), low resolution nuclei of cervical cancer cells Fluo-N2DL-HeLa (HeLa), high resolution nuclei of mouse embryonic stem cells Fluo-N2DH-GOWT1 (GOWT1), and high resolution simulated nuclei of human leukemia cells (HL-60) Fluo-N2DH-SIM+ (SIM+). The phase contrast microscopy datasets include high resolution cytoplasm of glioblastoma-astrocytoma U373 cells PhC-C2DH-U373 (U373) and low resolution cytoplasm of pancreatic stem cells PhC-C2DL-PSC (PSC). The datasets vary in noise levels, cell densities, numbers of cells leaving and entering the field of view, resolution, and numbers of mitotic events.

The reference label set consists of sparse GT and dense ST reference datasets. The GT reference maps were manually annotated by experts with background in biology at three different institutions, but do not necessarily contain masks for all cells and all frames. ST reference datasets contain computer-generated annotations from the top results submitted by former CTC participants. The CTC organizers used segmentation fusion techniques to produce the ST masks. These datasets contain cell masks for all frames and all cells. Because GT and ST reference datasets offer complementary information, we use both of them for our evaluations.

Performance evaluation measures

To evaluate the cell segmentation, detection, and tracking performance of our method and the baseline methods, we utilized the Dice Similarity Coefficient (DSC), the Jaccard index, and the CTC implementations of the SEG, DET, and TRA measures. The SEG, DET, and TRA measures are particularly useful as they serve as common bases for comparisons with other techniques participating in the CTC challenge. We denote by \(R_s\) the set of all binary cell regions delineated by our cell segmentation method, while \(R_{Ref}\) is the set of cell pixels from the reference region.

SEG measure – It measures the Jaccard index between test and reference data, for each cell that has a reference mask. Then the method sets to zero the indices of cells for which \(\left| {R_s}\cap R_{Ref} \right| \le 0.5 \cdot \left| R_{Ref} \right|\). It finally computes the average of all the individual indices to yield SEG.
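A sketch of the SEG computation consistent with this description:

```python
import numpy as np

def seg_measure(ref_labels, test_labels):
    """Mean Jaccard index over reference cells, zeroed whenever the
    best-overlapping test cell covers at most half of the reference mask."""
    scores = []
    for r in np.unique(ref_labels[ref_labels > 0]):
        ref = ref_labels == r
        overlap = np.bincount(test_labels[ref])  # test labels inside ref cell
        overlap[0] = 0                           # ignore background
        s = int(np.argmax(overlap))
        inter = overlap[s]
        if inter <= 0.5 * ref.sum():             # detection condition failed
            scores.append(0.0)
            continue
        union = ref.sum() + (test_labels == s).sum() - inter
        scores.append(inter / union)             # Jaccard index of this cell
    return float(np.mean(scores)) if scores else 0.0
```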

DET measure – DET calculates detection accuracy by comparing the nodes of the acyclic graphs generated by the tested algorithm with nodes of the acyclic graphs of the reference masks. It is defined as

$$\begin{aligned} DET = 1- min(AOGM_D, AOGM_{D0})/AOGM_{D0}. \end{aligned}$$
(7)

Here \(AOGM_D\) is the cost of transforming a set of nodes produced by the tested method into the set of reference nodes. \(AOGM_{D0}\) is the cost of creating the set of reference nodes.

TRA measure – It is based on comparison of acyclic oriented graphs produced by the tested and reference tracking methods. TRA is defined as a normalized Acyclic Oriented Graph Matching (AOGM) measure:

$$\begin{aligned} TRA = 1- min(AOGM, AOGM_0)/AOGM_0, \end{aligned}$$
(8)

where \(AOGM_0\) is the AOGM value required for creating the reference graph from scratch.

Results on CTC training data: train on sequence 1 and test on sequence 2

First, we present the results on CTC training data. In this experiment, we trained the foreground/background classifier on sequence 01 of each dataset and performed segmentation and tracking on sequence 02. Table 1 contains segmentation, detection and tracking measurements against GT and ST reference data. The dataset Fluo-N2DH-SIM+ has the same gold and silver truth labels because it is simulated. In addition, there is no ST reference data for detection and tracking, so we list the DET-GT and TRA-GT results only. From these results, we observe general agreement between the Jaccard and SEG measures for most of the datasets. Slightly higher agreement is observed between the DET and TRA measures. Overall, our technique yields the following average rates: Jaccard-ST: 0.741, SEG-ST: 0.785, SEG-GT: 0.729, DET-GT: 0.895, TRA-GT: 0.881. We note that the SEG-ST and SEG-GT rates are not directly comparable because ST contains dense annotations, but GT contains sparse annotations. We display segmentation label maps on a frame of three sequences in Fig. 4. We observe good agreement between the cell label maps produced by our method and the reference label maps. In Fig. 5, we show a visualization of tracking output that enables monitoring of the main cell migration and mitosis events.

Table 1 Segmentation, detection and tracking performance measures of our method on the second sequences of the CTC training datasets using the first sequences for training.
Fig. 4
figure 4

Segmentation comparisons: (a) Intensity frame, (b) GT reference segmentation, (c) ST reference segmentation, (d) Segmentation by the proposed method. Fluo-C2DL-MSC-02 (first row): This sequence contains overlapping elongated cells and has low temporal resolution. Our technique produced results that are notably close to the ST and GT references. PhC-C2DL-PSC-02 (second row): PSC has numerous small cells in each frame, with fine shape details, and several cell divisions occur over time. Our technique produced very good detection and segmentation on this dataset. Fluo-N2DH-GOWT1-02 (third row): This sequence is characterized by significant variability in contrast to noise ratios within each frame and over multiple time frames, that make cell detection difficult. The neural network classifier helped to detect cells of low contrast.

Fig. 5
figure 5

Visualization of cell tracking results for Fluo-N2DH-SIM+-02: (a) cell trajectories in 2D + time showing migration and division, (b) cell lineage tree showing cell events.

Results on CTC challenge data

Next, we present results from our experiments on CTC challenge data. Our method's code and results were submitted to CTC and reproduced by the challenge organizers, as for all participants. The organizers did not share reference data for these sequences. Hence, they calculated and posted the evaluation performance measures of each method on the challenge website. Table 2 contains the SEG, DET, and TRA results of our method on each test sequence.

Table 2 SEG, DET, and TRA measures of our method on CTC challenge datasets.

Furthermore, in Tables 3 and 4 we list the SEG and TRA measures of all CTC participant techniques that were evaluated on all of the 6 datasets utilized in our test set. We also include the average performance measures of the 9 techniques that met the inclusion criterion, and our results for reference. Figure 6 shows boxplots of SEG and TRA values produced by the methods listed in Tables 3 and 4. Among the techniques that produced segmentation and tracking results for the same datasets, our technique ranked above the 70th percentile for 3 datasets in SEG and TRA values. Our technique scored mean SEG of 0.738 and mean TRA of 0.895 over all the tested sequences, ranking among the top 4 and top 3 techniques, respectively. It also achieves the top SEG and TRA performances among all compared methods for Fluo-C2DL-MSC.

Table 3 SEG measure comparisons on all six challenge datasets.
Table 4 TRA measure comparisons on all six challenge datasets. Our method is labeled as DESU-US.
Fig. 6
figure 6

Performance evaluation of the six challenge datasets: (a) SEG measure comparisons, (b) TRA measure comparisons. Of note is the balanced performance of the proposed method (labeled as DESU-US) over heterogeneous datasets that supports its generalizability.

Performance evaluation on external datasets

To assess the generalizability of this framework, we evaluated our segmentation and tracking performance on external datasets acquired in the context of tissue engineering research, specifically cyber-physical organoid formation sequences42,43. In this downstream application, the task is to segment and track cells and micro-robots. We display segmentation and tracking results in Figs. 7 and 8, respectively. These results indicate good segmentation accuracy and the capacity to track the cells and micro-robots, which could help enable a visual control loop for cyber-physical organoid formation.

Fig. 7
figure 7

Segmentation comparisons on external datasets comprising cells and a cellbot: (a) Intensity frame, (b) GT reference segmentation, (c) segmentation by the proposed method.

Fig. 8
figure 8

Visualization of cell trajectories in 2D + time showing migration and division for external datasets comprising (a) cells and a cellbot, and (b) multiple microrobots.

Discussion

By comparing the box plots of Fig. 6, we observe that the proposed technique yields very good performance for all test datasets. These results indicate the generalizability of our technique to diverse datasets in terms of cell size and shape, density of environments, spatial and temporal scales, and contrast mechanisms. We summarize the main challenges presented by the utilized datasets next. The Fluo-C2DL-MSC sequences have overlapping elongated cells, with varying shapes and intensity profiles, and low spatial and temporal resolution. HeLa sequences present challenges for cell separation in a dense environment. PhC-C2DH-U373 has similar intensity levels in the cell cytoplasm and the background, with thick cell borders complicating cell boundary delineation. Fluo-N2DH-GOWT1 is characterized by significant variability in contrast-to-noise ratios within each frame and over multiple time frames, which makes cell detection difficult. PhC-C2DL-PSC has numerous small cells in each frame, with fine shape details, and several cell divisions occur over time.

To illustrate challenging cases, in Fig. 9 we show examples from two sequences where there is variability in contrast and noise. To quantify the image characteristics, we computed the Signal-to-Noise Ratio (SNR) and Contrast-to-Noise Ratio (CNR)34. For the Fluo-N2DH-SIM+-02 sequence, frames 3 and 34, the SNR values are 1.876 and 0.988, and the CNR values are 6.447 and 3.252, respectively. In this dataset, we can visually observe the significant discrepancy in signal and contrast levels. For the Fluo-N2DH-GOWT1-02 sequence, frames 3 and 89, the SNR values are 24.138 and 8.912, and the CNR values are 25.467 and 7.254, respectively. In this dataset, there is noticeable variability among the contrast levels of individual cells. The scale selection module helps to address these challenges by selecting the optimal scale of feature representation for each frame. Our machine learning-based cell detection module learns to separate cells from background efficiently despite contrast variations.

Another challenge that may be encountered is high cell density. In Fig. 10, we display cell confluency and segmentation accuracy graphs for two datasets, Fluo-N2DL-HeLa-02 and PhC-C2DL-PSC-02, that include many mitotic events and clustered cells. Of note is that while cell confluency increases, the DSC and Jaccard scores do not decrease, indicating a good level of robustness with respect to cell density variations. A key factor for this characteristic is the cell boundary refinement stage. With respect to computational time, increasing cell density implies increased run times, mainly observed in the tracking stage. The segmentation run time depends on multiple factors such as the image matrix size, the number of generated scales, and the number of boundary refinement iterations. Table 5 lists the average segmentation and tracking run times per frame for the CTC test datasets for our Matlab implementation running on a system with an Intel Xeon E5-2630 CPU at 2.4 GHz.

Fig. 9
figure 9

Examples of variability in signal, contrast and noise levels within a dataset and between datasets. First row: Fluo-N2DH-SIM+-02, (a) frame 3 (SNR: 1.876, CNR: 6.447), and (b) frame 34 (SNR: 0.988, CNR: 3.252). Second row: Fluo-N2DH-GOWT1-02, (a) frame 3 (SNR: 24.138, CNR: 25.467), and (b) frame 89 (SNR: 8.912, CNR: 7.254).

Fig. 10
figure 10

Graphs of cell confluency, DSC, and Jaccard values versus frame number for two datasets that include many cell mitosis events. We note that segmentation accuracy expressed by DSC and Jaccard is largely independent of increasing cell confluency that implies denser environments.

Table 5 Segmentation and tracking run time analysis.

The high TRA rates show that our methodology can accurately create cell tracks and identify cell migration and cell mitosis. As seen in Fig. 5, our method yields results that enable the monitoring and quantification of cell events in high-throughput studies. These capabilities are significant for the diagnosis of diseases such as cancer, drug development, and personalized medicine. However, over-segmentation may lead to false positive detection of cell mitosis events, which limits tracking accuracy. Such inaccuracies may be addressed by including additional features in the cell matching stage, at the cost of increased computation.

An original contribution of this work is the incorporation of multi-scale spatio-temporal features to perform automated scale selection, cell detection, and segmentation. This approach selects a scale that automatically adjusts to the visual content of each frame, addresses changes in contrast and noise levels, and reduces the need for hyperparameter tuning. Another key component of this approach is neural network-based detection of cell regions with class prototype balancing. We first identify the main cluster prototypes formed by the features to balance the class representations in the training sample set. Our results showed that our neural network with DBSCAN-based re-sampling improves cell detection and segmentation compared to our previous approaches that employed parametric and nonparametric statistical tests40,41. Relative to a previous version of our work that employed nonparametric foreground/background separation41, the neural network approach yielded the following performance improvements over the six CTC datasets: from 0.713 to 0.754 in average SEG-GT, from 0.886 to 0.895 in average DET-GT, and from 0.871 to 0.881 in average TRA-GT.

In conclusion, we introduced a comprehensive and interpretable cell segmentation and tracking framework. Our extensive performance evaluation experiments showed its applicability to analysis of heterogeneous cell microscopy sequences. Future research directions include the development of a joint cell segmentation and classification approach, and graph neural network44,45,46 based optimization techniques for cell tracking and cell event detection.