1 Introduction

Object detection (OD) is an essential task in computer vision that involves identifying and localizing objects in images or videos. The ability to detect objects automatically is crucial in surveillance, agriculture, infrastructure inspection, and search and rescue operations [1]. Over the years, deep learning-based techniques [2], particularly convolutional neural networks (CNNs) [3], have made significant progress on OD tasks. Deep learning-based object detectors are classified into single-shot detectors, such as You Only Look Once (YOLO) and the Single-Shot Detector (SSD), and region-proposal-based detectors, such as R-CNN, Fast R-CNN, Faster R-CNN, R-FCN, and Mask R-CNN [4]. OD in aerial drone imagery is far more challenging, as factors such as altitude, camera angle, object scale, overlap, occlusion, motion blur, lack of labelled data, the flat and small appearance of objects, limited onboard computation for real-time processing, and lack of contextual information hinder overall detection capability [5]. Detection accuracy and real-time performance must also be traded off when detecting small-scale objects [6,7,8], and single-pixel shifts can cause significant interference and misdetection owing to the scarcity of background and foreground information [9]. These are some of the pertinent challenges for OD in UAVs [9].

Although recent approaches have significantly improved object detection performance, OD models are still considered opaque black boxes. It is not clear to a broader audience why a model predicts what it predicts, raising serious concerns and the need to build models that are more transparent and understandable to humans [10]. Owing to the persistent accuracy-interpretability tradeoff, i.e., the higher the complexity of the model, the lower its interpretability, deep learning models are often viewed as black boxes whose underlying behaviour and decision-making process are difficult to comprehend [11]. Explainable AI (XAI) is a research field that aims to understand and interpret the working of machine learning models and allows the interpretation of AI-generated insights [12]. The field has gained significant attention as the use of ML models in real-world settings has increased, especially in high-risk domains such as healthcare, autonomous driving, drone-based surveillance, and rescue operations [13]. Given the mission-critical applications of drone imagery, trustworthy and human-interpretable explanations are crucial for informed decision-making and for validating AI-generated insights in human-AI collaborative tasks [14]. Drone images are also subject to ethical considerations such as surveillance, security, and compliance with data protection, privacy, and transparency laws worldwide. Explainability is particularly important in critical applications such as defence, healthcare, and autonomous vehicles, where trust, transparency, and safety are prerequisites for real-world adoption and deployment of machine learning models [15].

The combination of OD and XAI has several potential benefits. By gaining insight into the model's working, researchers and end users can better understand how the model makes decisions, identify potential biases or errors, improve accuracy and reliability, and increase users' trust in the model outcomes [16]. However, while much work has been done on XAI in other domains, XAI specific to OD in drone imagery remains underexplored. This work aims to fill this gap by developing and evaluating XAI techniques tailored to the unique challenges of OD in drone imagery [4]. The experimentation is carried out on the open-domain AU-AIR dataset [40], which is used for surveillance. The object detector performance is improved using ensembling techniques, and the OD outcomes are made explainable with XAI techniques and further evaluated. The significant contributions of the work are as follows:

  • Development of an integrated pipeline combining ensemble learning and explainability for multi-scale object detection in drone imagery.

  • Comparative analysis of various OD models for ensembling and voting strategies for multi-scale object detection.

  • Demonstration and implementation of explainability techniques to improve the interpretability and trustworthiness of OD predictions, with EigenCAM-based evaluation ensuring robust and explainable detections.

2 Related work

2.1 Deep learning-based object detection

Recently, drone-based OD has received increased attention because of its many applications, from surveillance [17] to agriculture [18], and research on OD specific to drone imagery has grown accordingly [19,20,21]. In the context of unimodal OD, which detects objects using a single modality such as RGB images, several works have explored deep learning models. The popular Faster R-CNN [22] and YOLO [23] models achieve state-of-the-art performance on OD tasks. Deep learning-based ensemble techniques have recently been used for multi-scale OD applications such as drone-based OD, pedestrian detection, and autonomous driving [9,10,11]. Recent studies have investigated deep learning-based object detection in challenging and unstructured environments [24, 25]. Object detection accuracy for small objects remains low [26]; a method for detecting small objects, especially in low-resolution images, is reported in [27]. An ensemble of CNNs for object detection in constrained environments is proposed in [24, 25]. A hybrid vision transformer-CNN with efficient knowledge distillation is applied to classify remote sensing images [28, 29]. In [30], an efficient and robust knowledge transfer network named ERKT-Net is proposed, designed to provide a lightweight yet accurate CNN classifier for remote sensing images. In [31], an enhanced vision transformer-based object detector for remote sensing images is presented.

2.2 Explainability for OD

Current machine learning models impose a tradeoff: prioritizing explainability tends to lower accuracy, while prioritizing accuracy tends to lower explainability. Previously, machine learning and pattern recognition required specialized knowledge to create a feature extractor that converts raw data into a feature vector, whereas deep learning automatically extracts abstract features [32]. In recent years, explainability for OD has developed significantly [33, 34], but the field is still at an early stage. Several researchers have proposed explainable OD models that provide insight into how and why objects are detected. These models typically incorporate attention and saliency mechanisms to highlight the most informative regions of an image for OD [35]. Several explainability techniques have been proposed for the task to date [36,37,38]. While some exciting applications of explainability are seen in agriculture [39] and driver assistance [34], they lack the evaluation and validation that is crucial for real-world adoption and deployment.

In this context, our work addresses significant object detection gaps in drone imagery by presenting an end-to-end framework that integrates ensemble learning and explainable OD techniques. A 3% increase in mAP on the AU-AIR dataset over the SOTA demonstrates the potential of ensemble learning to improve OD performance. Evaluating the XAI outcomes with a perturbation-based feature ablation strategy applied to EigenCAM validates the detections, improving reliability and trustworthiness in real-world deployment settings.

3 Materials and methods

3.1 Dataset description

The work is based on drone imagery, focusing on the image datasets obtained from uncrewed aerial vehicles (UAVs). Although various UAV datasets are available, the AU-AIR dataset [40] is chosen for experimentation and demonstration in this work. The AU-AIR dataset was created using a DJI Phantom 4 drone. AU-AIR is the first multimodal UAV dataset for object detection in aerial images with onboard sensor information for autonomous aerial surveillance. It includes various annotations such as object detection, object tracking, and weather conditions. In addition to flight data, the dataset also contains visual data and object annotations. It is designed for traffic monitoring and comprises eight video streams, collectively lasting over two hours.

The majority of the videos were captured in Aarhus, Denmark. The dataset includes aerial videos and information on time, GPS coordinates, UAV altitude, IMU data, and velocity. The videos were taken from angles ranging from 45 to 90 degrees and at heights between 5 and 30 m above ground level. The video frames contain bounding boxes that indicate instances of different object categories related to traffic surveillance, and the flight data is included with each frame. The dataset comprises a total of 32,823 annotated video frames, each labelled with object categories and the corresponding flight information, for a total of 132,034 annotations. Eight object categories are labelled: humans, cars, vans, trucks, motorcycles, bicycles, buses, and trailers. The dataset also contains sensor information logged during the video recordings, in addition to the visual information and object annotations, making it an optimal choice for experimenting with multi-scale object detection from drone imagery. Figure 1 below shows the classwise distribution of objects in the AU-AIR dataset.

Fig. 1 Classwise distribution of objects in the AU-AIR dataset

For each extracted frame in the dataset, the following attributes are available:

  • d, t: current date of a frame, current time stamp of the frame

  • la, lo, a: UAV latitude, UAV longitude, UAV altitude

  • φ, θ, ψ: roll angle, pitch angle, yaw angle of the UAV

  • Vx, Vy, Vz: x-axis speed, y-axis speed, z-axis speed.

Figure 1 shows the class imbalance in the AU-AIR dataset. In addition, there are inaccuracies in the image annotation formats; class mismatches and haphazard image sequencing relative to the pre-trained weights were also identified, demanding meticulous alignment of class indices. Rectifying this misalignment was vital to ensure seamless image-annotation correspondence. Various augmentation techniques and manual image-annotation matching were carried out to handle the class imbalance.

3.2 Model architecture

Figure 2 depicts the proposed model architecture with ensembling and XAI. The left block illustrates the distinct pipelines for the object detection models. The middle block demonstrates the application of ensemble learning: the prediction model refines the bounding box coordinates, predicting the bounding box and class score in one step and incorporating voting strategies on the extracted bounding boxes. The right block elaborates on the XAI implemented in the unimodal pipeline, encompassing gradient maps and the EigenCAM feature map obtained through upsampling. This work initially considered three OD models, namely YOLOv3-Tiny, MobileNetV2-SSDLite, and YOLOv5s + RetinaNet (a few extracted layers), for experimentation. We also experimented with the YOLOv8 model. For this particular dataset, YOLOv5 with a few RetinaNet layers yielded the best results, so YOLOv5 was selected to produce the optimal bounding box predictions in the ensembling step. YOLOv5 and YOLOv8 were employed for ensembling, and an affirmative score-weighting strategy yielded superior results.

Fig. 2 Proposed model architecture with ensembling and explainability

3.3 Model optimization

The study employed a set of carefully chosen hyper-parameters: initial learning rate (lr0 = 0.01), momentum (0.937), weight decay (0.0005), epochs (150), warmup momentum (0.8), warmup bias learning rate (0.1), box loss coefficient (0.05), class loss coefficient (0.5), class loss power (1.0), object loss coefficient (1.0), object loss power (1.0), IoU threshold (0.2), anchor threshold (4.0), focal loss gamma (0.0), HSV augmentation factors (h = 0.015, s = 0.7, v = 0.4), rotation degrees (0.0), translation magnitude (0.1), scaling factor (0.5), shearing factor (0.0), perspective distortion factor (0.0), vertical flipping probability (0.0), and horizontal flipping probability (0.5).

For optimisation, stochastic gradient descent (SGD) is employed with a learning rate of 0.01, distributed across three parameter groups: 57 weight tensors with no decay, 60 weight tensors with a decay of 0.0005, and 60 biases.
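A minimal sketch of how such a three-group SGD optimizer is typically assembled in PyTorch is shown below. The helper name, the BatchNorm-based grouping rule, and the Nesterov flag are illustrative assumptions rather than the paper's exact code; the group sizes (57/60/60) follow from the model's layer structure.

```python
import torch
import torch.nn as nn

def build_sgd_optimizer(model: nn.Module, lr=0.01, momentum=0.937, weight_decay=0.0005):
    # Hypothetical helper mirroring the three parameter groups described above:
    # BatchNorm weights (no decay), other weights (decayed), and biases (no decay).
    no_decay, decay, biases = [], [], []
    for m in model.modules():
        if hasattr(m, "bias") and isinstance(m.bias, nn.Parameter):
            biases.append(m.bias)
        if isinstance(m, nn.BatchNorm2d):
            no_decay.append(m.weight)
        elif hasattr(m, "weight") and isinstance(m.weight, nn.Parameter):
            decay.append(m.weight)

    optimizer = torch.optim.SGD(no_decay, lr=lr, momentum=momentum, nesterov=True)
    optimizer.add_param_group({"params": decay, "weight_decay": weight_decay})
    optimizer.add_param_group({"params": biases})  # biases are never decayed
    return optimizer
```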

Additionally, selected augmentation techniques, such as blur, median blur, greyscaling, and Contrast Limited Adaptive Histogram Equalization (CLAHE), with respective application probabilities (p = 0.01) and specified parameter ranges, are incorporated to augment the training data. Ensemble learning with the YOLOv5 model for object detection aims to enhance prediction accuracy by minimizing generalization errors. The ensemble approach mitigates prediction errors by ensuring diversity and independence among the base models, improving performance.
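A sketch of the augmentation set described above, using the Albumentations library, is given below; the blur limits and CLAHE parameters are illustrative defaults rather than values reported in the paper.

```python
import albumentations as A

# Each augmentation fires with probability p = 0.01, as described above.
# bbox_params keeps YOLO-format boxes consistent with the transformed image.
train_transform = A.Compose(
    [
        A.Blur(blur_limit=7, p=0.01),
        A.MedianBlur(blur_limit=7, p=0.01),
        A.ToGray(p=0.01),  # greyscaling
        A.CLAHE(clip_limit=4.0, tile_grid_size=(8, 8), p=0.01),
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)
```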

The IoU metric, which measures the overlap between predicted and ground-truth bounding boxes, is employed in the ensemble algorithm to cluster detections based on their bounding box overlaps and class affiliations. This yields a list R comprising subsets [DR1, DR2, …, DRm], where each DRi is a collection of detections. Any pair of detections d1 and d2 within a DRi must satisfy the following conditions:

  • The bounding box overlap condition requires \(IoU(d_{1}^{bbox}, d_{2}^{bbox}) > 0.5\), where \(d_{1}^{bbox}\) and \(d_{2}^{bbox}\) denote the bounding boxes of d1 and d2, respectively, with IoU quantifying their overlap.

  • The class matching condition requires \(d_{1}^{class} = d_{2}^{class}\), guaranteeing that grouped detections share the same class.

Subsequently, each DRi in list R corresponds to a specific region within the image. The size of DRi is a critical factor in determining whether the algorithm infers the presence of an object in that region: larger DRi sizes indicate a heightened likelihood of an object's presence within the corresponding area.
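A minimal sketch of this grouping step is given below. The `Detection` container and the greedy assignment order are assumptions made for illustration, while the IoU > 0.5 and class-match conditions follow the definitions above.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Detection:
    bbox: tuple   # (x1, y1, x2, y2)
    cls: int      # predicted class index
    conf: float   # confidence score

def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def group_detections(detections: List[Detection], iou_thr=0.5) -> List[List[Detection]]:
    """Greedily cluster detections that share a class and overlap by
    IoU > iou_thr, producing the list R = [DR1, DR2, ...] described above."""
    groups: List[List[Detection]] = []
    for det in detections:
        for group in groups:
            if all(d.cls == det.cls and iou(d.bbox, det.bbox) > iou_thr for d in group):
                group.append(det)
                break
        else:  # no compatible group found: start a new region
            groups.append([det])
    return groups
```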

3.4 Baseline networks

YOLOv5 was configured for benchmarking using the default parameters. The AU-AIR dataset is split into 60% training, 10% validation, and 30% testing samples, following the original AU-AIR paper for a fair comparison. The object detectors are adapted to the number of classes in the AU-AIR dataset (8 classes) by changing their last layers. The implemented pipeline is depicted below in Fig. 3. Ensemble models are developed to improve detector performance, and the detections are explained by applying post hoc explainable AI techniques to surface the underlying rationale, with further validation and evaluation.

Fig. 3 Implemented pipeline for explainable OD

3.5 Model performance assessment

To evaluate OD performance, mean average precision (mAP) [41] is used as the metric, which involves computing the average precision (AP) for each object category at recall values from 0 to 1. An intersection over union (IoU) threshold of 0.5 is used here. mAP is then calculated as the mean of these AP values. The computational resources employed include an NVIDIA RTX™ A4000 GPU with 6,144 CUDA® cores and 16 GB of graphics memory.

The YOLOv5-based detection pipeline with the ensembling technique predicts bounding boxes and class probabilities directly from an image. Predefined anchor boxes aid the localization of this single-shot detector, which extracts hierarchical features from the input image and combines features across small, medium, and large scales, improving detection performance across object sizes. The image is resized and normalized before passing through the network. The predictions, comprising bounding box coordinates (x, y, w, h), object confidence scores, and class probabilities, are filtered with non-maximum suppression, which removes redundant bounding boxes and keeps the most confident predictions. Since object detection involves multiple classes, mAP is the mean of AP across all classes:

\(mAP=\frac{1}{N}\sum_{i=1}^{N}AP_{i}\)

where N is the number of object classes and \(AP_{i}\) is the average precision for class i.
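The sketch below computes per-class AP as the area under the precision-recall curve (all-point interpolation, as in standard detection benchmarks) and averages over classes as in the formula above; it assumes the precision and recall arrays have already been accumulated per class.

```python
import numpy as np

def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    """Area under the precision-recall curve for one class,
    with recall sorted in ascending order (all-point interpolation)."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([1.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]   # monotone precision envelope
    idx = np.where(r[1:] != r[:-1])[0]         # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def mean_average_precision(per_class_ap: dict) -> float:
    """mAP = mean of per-class APs, matching the formula above."""
    return sum(per_class_ap.values()) / len(per_class_ap)
```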

Data and class balancing are performed through resampling and augmentation, and the adopted ensembling approach further mitigates data bias across classes.

3.6 Model parameters

To establish a benchmark for OD performance, the YOLOv5s model is trained on the AU-AIR dataset, and its results are compared with those reported in the original paper using YOLOv3-Tiny and MobileNetV2-SSDLite. Table 1 depicts the classwise mAP scores; the custom-built YOLOv5 model surpasses the baselines for all classes in the dataset, providing improved detection performance. The classwise results are graphically represented in Fig. 4, showing improved mean average precision compared with the baseline results.

Table 1 Classwise comparative results of the proposed model with the baseline mAP
Fig. 4 Classwise comparison of different models on the AU-AIR dataset

Figure 5 showcases the F1-confidence curve and the confusion matrix for the YOLOv5 model, summarizing the class-wise detection performance.

Fig. 5 F1-confidence curve for the YOLOv5 model with confusion matrix

4 Results and discussion

4.1 Ensemble learning

Object detection performance can be significantly improved through ensembling [42]. Although ensembling increases overhead, it improves accuracy by reducing false positives and false negatives, balancing speed and precision, and enhancing the overall robustness of the detections. A score-weighting strategy is implemented: the prediction scores are weighted based on the confidence scores of each classifier, so classifiers with higher confidence scores have more influence. The score-weighting strategy combines models of varied reliability and confidence and is inherently scalable to more models. When using ensemble learning [43, 44] for object detection and for the explainability of the task, the choice of voting strategy is critical in determining the performance of the overall model [44]. The unanimous voting strategy, which requires agreement from all models, can degrade performance if some models perform worse than others.

On the other hand, the affirmative and consensus voting strategies accept correct predictions from a subset of models even if others miss them, thus improving overall performance. The choice of voting strategy depends on the specific requirements of the object detection task and the level of model uncertainty. The unanimous strategy can be helpful when dealing with high-confidence predictions, while the affirmative and consensus strategies suit settings with model uncertainty. Ensembling does make predictions harder to debug and understand, as boxes are drawn from multiple models, and it increases inference time.
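The three strategies can be sketched as a threshold on how many distinct models must contribute a detection to each grouped region (see Section 3.3). The `model_id` attribute, recording the source detector of each box, is an assumed extension of the detection container used earlier; this is a sketch of the strategies discussed above, not the exact production implementation.

```python
def vote(groups, n_models, strategy="affirmative"):
    """Keep a region's detection only if enough distinct models voted for it."""
    required = {
        "affirmative": 1,                # any single model suffices
        "consensus": n_models // 2 + 1,  # strict majority of models
        "unanimous": n_models,           # every model must agree
    }[strategy]
    kept = []
    for group in groups:                 # one group per image region
        if len({d.model_id for d in group}) >= required:
            kept.append(max(group, key=lambda d: d.conf))  # most confident box
    return kept
```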

For the OD task, three models, namely YOLOv3-Tiny, MobileNetV2-SSDLite [45, 46], and YOLOv5s + RetinaNet layers [23], are evaluated and observed to achieve mAPs of 35%, 20%, and 43%, respectively, as shown in Table 1. Ensemble learning of the YOLOv5 and YOLOv8 models with two score-weighting strategies, affirmative and unanimous, is reported. The affirmative voting strategy enabled the capture of accurate detections from particular models, thereby enhancing the overall performance of the detector. The mathematical modelling for ensembling and threshold suppression is presented in Algorithm 1 below.

5 Algorithm 1

Algorithm 1 Mathematical modelling for threshold suppression and ensembling

In the above pseudocode, output represents the output tensor of the YOLO architecture, conf represents the confidence score of a bounding box, and (p1, p2, …, pc) represents the class probabilities. The per-class ensembling results are shown in Figs. 6 and 7.
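As a companion to the pseudocode, the sketch below decodes such an output tensor: each row is assumed to be laid out as (x, y, w, h, conf, p1…pC), rows with low joint scores are suppressed by a confidence threshold, and torchvision's NMS removes redundant overlapping boxes. The threshold values are illustrative.

```python
import torch
from torchvision.ops import nms

def xywh_to_xyxy(xywh: torch.Tensor) -> torch.Tensor:
    """Convert centre-format (x, y, w, h) boxes to corner format."""
    xy, wh = xywh[:, :2], xywh[:, 2:]
    return torch.cat([xy - wh / 2, xy + wh / 2], dim=1)

def threshold_suppress(output: torch.Tensor, conf_thr=0.25, iou_thr=0.45):
    """Decode a YOLO output tensor of shape (num_boxes, 5 + C).
    Rows whose joint score conf * max(p1..pC) falls below conf_thr are
    suppressed; NMS then keeps the most confident non-overlapping boxes."""
    conf, probs = output[:, 4], output[:, 5:]
    cls_prob, cls_id = probs.max(dim=1)
    score = conf * cls_prob              # joint box/class confidence
    keep = score > conf_thr
    boxes = xywh_to_xyxy(output[keep, :4])
    score, cls_id = score[keep], cls_id[keep]
    kept = nms(boxes, score, iou_thr)    # torchvision non-maximum suppression
    return boxes[kept], score[kept], cls_id[kept]
```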

Fig. 6 Graphical representation of classwise ensemble learning results

Fig. 7 Classwise ensemble learning results with (a) YOLOv5 and (b) ensembling of YOLOv5 and YOLOv8 with affirmative score weighting

In continuation, the effectiveness of ensemble learning for OD tasks was also evaluated by considering the YOLOv5s and YOLOv8s models, fine-tuned with equivalent hyper-parameters and trained on the same data subset. The affirmative voting strategy was adopted, as it avoids excluding correct predictions, which can occur under the unanimous and consensus strategies and lead to false negatives that affect the recall curve and thereby the average precision of individual classes. The detections were evaluated using the IoU metric. Findings show that the affirmative voting strategy resulted in a 3% increase in mean average precision, as shown in Table 2, demonstrating the potential of ensemble learning to improve the performance of OD models.

Table 2 mAP obtained for various classes with YOLOv5 + YOLOv8 and the affirmative score-weighting strategy

Figure 7 provides comparative object detection results for the YOLOv5 model (Fig. 7a) and the combined affirmative strategy using the YOLOv5 and YOLOv8 models (Fig. 7b). The confidence scores of YOLOv5 alone are lower than those of YOLOv5 + YOLOv8. The ensemble reduces false positives and detection uncertainty, with improved confidence compared with YOLOv5 alone. The voting strategy discards low-confidence, noisy detections and leverages the complementary strengths of YOLOv5 and YOLOv8, ensuring robust outcomes. While YOLOv8 alone might offer improvements in certain aspects, the empirical evidence in the figure suggests that YOLOv5, combined with the affirmative voting strategy, yields superior results for this task and dataset; therefore, YOLOv5 was selected to produce the optimal bounding box predictions. The results depicted in Fig. 8 highlight the importance of selecting an appropriate voting strategy to enhance the accuracy and robustness of ensemble models for explainable multi-class OD tasks.

Fig. 8 Comparative analysis of per-class APs. The orange shade represents the increase in AP for each class (Table 2) compared with the original APs of the YOLOv5 model (Table 1)

The study demonstrates the potential of ensemble learning and explainable AI in improving the performance and interpretability of object detection models in drone imagery. The results in Fig. 8 show that ensemble learning with the affirmative voting strategy can significantly enhance the performance of OD models, making them more robust and reliable. The study also emphasizes the need for explainable AI in high-risk domains, where trust in AI-generated insights is essential. The figures and tables support these findings, providing a comprehensive understanding of the effectiveness of the proposed pipeline.

5.1 Explainable OD

Objects in drone imagery are multi-scale. Addressing this challenge requires precise localization and detection of discriminative regions by separating foreground from background information. EigenCAM is selected to infuse explainability, justifying the model outcomes with a rationale that can be interpreted with human intuition. EigenCAM requires no retraining or layer modification. It targets the salient features aligned with the direction of the principal components, highlighting the part of the image that induces the greatest magnitude of activation and thereby generating visual explanations. EigenCAM is robust against classification errors made by the fully connected layers of CNNs and does not rely on the backpropagation of gradients, class relevance scores, maximum activation locations, or any other form of feature weighting. EigenCAM [48] computes and visualizes the principal components of the learned features/representations from the convolutional layers, producing sharper and more localized heat maps than GradCAM [49], making it a suitable choice.

The EigenCAM method is integrated into the proposed model after observing its rigorous evaluation against other prominent techniques, such as Grad-CAM [49], Grad-CAM++ [50], and CNN-fixations, on well-established datasets for tasks such as weakly-supervised localization and object localization in the presence of adversarial noise. EigenCAM is class-agnostic, and its critical advantage is compatibility with any CNN model, requiring no adjustments or retraining, with robustness against adversarial perturbations [48]. The EigenCAM results are depicted in Fig. 9: the left column shows the original image, the middle column the EigenCAM heatmap, and the right column the detection results. The activation regions in the heatmap align well with the detected objects, indicating that the model focuses on the right features for the task. The OD model, class-specific filtering, and non-maximum suppression thresholding generate the bounding boxes shown in column 3; these correspond to the high-activation areas in the EigenCAM heatmap in column 2, verifying that the bounding boxes align with the detected objects. The bounding boxes are precise, all relevant objects are detected, and the explanations are consistent, show precise localization, and highlight the attribution of essential features towards each detection, making the detector explainable and helping developers and end users comprehend the OD outcomes. Heatmap activations along image borders are typically due to boundary artefacts and an edge detection bias in the dataset. High gradient values are also observed in the sky when an object is occluded, which can be attributed to contextual reasoning, a lack of strong object features, dataset biases, and contrast effects: the model makes a decision based on the available information, and when the target object is missing, it assigns importance to the visible background or sky.
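For intuition, a minimal from-scratch sketch of the EigenCAM computation [48] is shown below: the feature map of a chosen convolutional layer is projected onto its first principal component and normalized into a heatmap. Capturing the activations via a forward hook, and the sign-flip heuristic, are implementation assumptions.

```python
import numpy as np
import torch

def eigen_cam(activations: torch.Tensor) -> np.ndarray:
    """EigenCAM heatmap from a conv feature map of shape (C, H, W):
    project the centred activations onto their first principal component."""
    C, H, W = activations.shape
    A = activations.detach().reshape(C, H * W).T.cpu().numpy()  # (H*W, C)
    A = A - A.mean(axis=0)                             # centre each channel
    _, _, vt = np.linalg.svd(A, full_matrices=False)   # principal directions
    cam = A @ vt[0]                                    # first-PC projection
    if cam.sum() < 0:                                  # SVD sign is arbitrary
        cam = -cam
    cam = np.maximum(cam.reshape(H, W), 0)             # keep positive saliency
    cam = cam / (cam.max() + 1e-9)                     # normalize to [0, 1]
    return cam  # upsample to the input resolution before overlaying
```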

Fig. 9 Visualization of the original image vs. its corresponding EigenCAM heatmap and the detection output for EigenCAM-based OD

Figure 9 presents the EigenCAM results, demonstrating the effectiveness of the proposed approach in infusing explainability into the object detection model. The middle column shows the heat maps generated by EigenCAM, highlighting the regions of the image that contribute most to the model's predictions, while the right column shows the corresponding detections. Together, they provide a spatial and visual explanation of the model's decision-making process, allowing developers and end users to comprehend the rationale behind the object detection outcomes. The results are consistent, precise, and localized, attributing the importance of specific features to each detection and making the detector explainable and trustworthy for real-world applications in drone imagery.

5.2 Evaluation of XAI results

Evaluation of XAI models is vital for their reliable and trustworthy deployment. For UAV-based OD, explanation generation and the assessment or validation of model outcomes have yet to be widely addressed. XAI outcomes can be evaluated with application-grounded, functionally-grounded, and human-grounded evaluation methods, which are goal- and application-specific [51]. However, the methodology and the choice of evaluation metrics still lack standardization [52]. This work follows application- and goal-specific reasoning for the detected objects. The perturbation-based occlusion or ablation strategy adds or deletes important features to observe their impact on the overall model predictions and the model's reliance on them, lending robustness to the evaluation of model outcomes.

A comprehensive evaluation methodology is implemented for the explanations generated by EigenCAM, a model- and class-agnostic interpretation technique for understanding neural networks. The evaluation process perturbs/ablates the input image by deleting the most significant features and observing the resultant impact on the explanation. The perturbation is applied pixel-wise or patch-wise through techniques such as occlusion masking, blurring, or replacing parts of the image. The chosen perturbation-based strategy is less noisy and more reliable, fulfilling the application-specific need for robust and explainable object detection in drone imagery. The perturbation simulates the absence or modification of specific features, allowing us to understand the model's reliance on these features for its predictions. EigenCAM is then applied to the perturbed images to generate explanation heat maps highlighting the regions contributing most to the model's decision. The resulting visualizations provide valuable insights into the model's decision-making and reliance. By comparing the explanations before (Fig. 9) and after perturbation/ablation (Fig. 10), we can assess the impact of feature ablation on the explanation, gaining a deeper understanding of the model's reasoning and of the feature importances that significantly affect its outcomes. Masking, ablating, or deleting essential features drastically degrades detection accuracy, shifting attention towards irrelevant regions away from the masked objects. Figure 10(a) and (b) show a perturbed image and its EigenCAM results. The model's strong reliance on these masked features, evidenced by the abrupt drop in detection accuracy, confirms that the explanations are faithful, making the OD framework more reliable, robust, transparent, and trustworthy, which is highly desired in the mission-critical application of multimodal/multi-sensor-based OD in drone imagery.
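A sketch of the ablation step is shown below: pixels in the top fraction of the (upsampled) EigenCAM heatmap are occluded, after which the detector and EigenCAM are re-run on the perturbed image to compare explanations. The top-fraction threshold and the fill value are illustrative assumptions.

```python
import numpy as np

def occlude_top_features(image: np.ndarray, heatmap: np.ndarray,
                         top_frac: float = 0.1, fill: int = 0) -> np.ndarray:
    """Mask the pixels whose heatmap scores fall in the top `top_frac`
    fraction (pixel-wise occlusion masking). `heatmap` must already be
    upsampled to the image's height and width."""
    thr = np.quantile(heatmap, 1.0 - top_frac)
    mask = heatmap >= thr            # the most important regions
    perturbed = image.copy()
    perturbed[mask] = fill           # delete the essential features
    return perturbed                 # re-run detection + EigenCAM on this
```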

Fig. 10 (a) Perturbed/occluded objects and (b) corresponding EigenCAM-based detection results

6 Conclusion

This paper describes the development of an end-to-end pipeline for multi-scale OD in drone imagery with an ensembling approach and XAI. AU-AIR, a multimodal dataset, was chosen to demonstrate and evaluate the proposed approach. An object detector with a customized YOLOv5 model and ensemble-based voting strategies was designed. Findings show that the affirmative voting strategy resulted in a 3% increase in mean average precision, demonstrating the potential of ensemble learning to improve the performance of OD models. To establish trust, transparency, and reliability in the detections, explainability was infused into the model with EigenCAM, which provides heat maps highlighting the essential features affecting the model predictions. The explanations were further evaluated and validated by applying an ablation-based perturbation method that deletes, masks, or removes the features EigenCAM marks as crucial, in order to observe the model's reliance on them, ensuring the accuracy of the detector and enhancing the robustness, reliability, explainability, and transparency of the overall system. The findings show that integrating ensembling and explainability makes the proposed object detection pipeline robust and explainable, an appropriate fit for real-world deployment, with the trust, transparency, and accountability critical for reliable object detection in drone imagery. Future work can aim to reduce the complexity and dimensionality of the proposed architecture and address domain-specific challenges, for example by employing advanced lightweight architectures for real-time performance, including more models in the ensembling approach, and applying diverse XAI techniques for consistent results, which can further improve OD accuracy and performance and make the system more explainable and adaptable to open-world scenarios.