1 Introduction

Object detection (OD) is an essential task in computer vision that involves identifying and localizing objects in images or videos. The ability to detect objects automatically is crucial in surveillance, agriculture, infrastructure inspection, and search and rescue operations [1]. Over the years, deep learning-based techniques [2], particularly convolutional neural networks (CNNs) [3], have made significant progress on OD tasks. Deep learning-based object detectors are classified into single-shot detectors, such as You Only Look Once (YOLO) and the Single-Shot Detector (SSD), and region-proposal-based detectors, such as R-CNN, Fast R-CNN, Faster R-CNN, R-FCN, and Mask R-CNN [4]. OD in aerial drone imagery is far more challenging, as factors such as altitude, camera angle, object scale, overlap, occlusion, motion blur, lack of labelled data, the flat and small appearance of objects, limited onboard computation for real-time processing, and lack of contextual information hinder overall detection capability [5]. Detection accuracy and real-time performance must also be traded off when detecting small-scale objects [6,7,8], and single-pixel shifts can cause significant interference and misdetection owing to the scarcity of background and foreground information [9]. These are some of the pertinent challenges for OD in UAVs [9].

Although recent approaches have significantly improved object detection performance, OD models are still considered opaque black boxes. It is not clear to a broader audience why a model predicts what it predicts, raising serious concerns and the need to build models that are more transparent and understandable to humans [10]. Owing to the persistent accuracy-interpretability tradeoff, i.e., the higher the complexity of the model, the lower its interpretability, deep learning models are often viewed as black boxes whose underlying behaviour and decision-making process are difficult to comprehend [11]. Explainable AI (XAI) is a research field that aims to understand and interpret the working of machine learning models and allows the interpretation of AI-generated insights [12]. The field has gained significant attention as the use of ML models in real-world settings has increased, especially in high-risk domains such as healthcare, autonomous driving, drone-based surveillance, and rescue operations [13]. Given the mission-critical applications of drone imagery, trustworthy and human-interpretable explanations are crucial for informed decision-making and for validating AI-generated insights in human-AI collaborative tasks [14]. Drone images are also subject to ethical considerations such as surveillance, security, and compliance with data protection, privacy, and transparency laws worldwide. Explainability is particularly important in critical applications such as defence, healthcare, and autonomous vehicles, where trust, transparency, and safety are prerequisites for real-world adoption and deployment of machine learning models [15].

The combination of OD and XAI has several potential benefits. By gaining insight into the model's working, researchers and end users can better understand how the model makes decisions, identify potential biases or errors, improve accuracy and reliability, and increase users' trust in the model outcomes [16]. However, while much work has been done on XAI in other domains, XAI specific to OD in drone imagery remains underexplored. This work aims to fill this gap by developing and evaluating XAI techniques tailored to the unique challenges of OD in drone imagery [4]. The experimentation is carried out on the open-domain AU-AIR dataset [40], which is used for surveillance. The object detector performance is improved using ensembling techniques, and the OD outcomes are made explainable with XAI techniques and further evaluated. The significant contributions of the work are as follows:

  • Development of an integrated pipeline combining ensemble learning and explainability for multi-scale object detection in drone imagery.

  • Comparative analysis of various OD models for ensembling and voting strategies for multi-scale object detection.

  • Demonstration and implementation of explainability techniques to improve the interpretability and trustworthiness of OD predictions, with EigenCAM-based evaluation ensuring robust and explainable detections.

2 Related work

2.1 Deep learning-based object detection

Recently, drone-based OD has received increased attention because of its many applications, from surveillance [17] to agriculture [18], and research on OD specific to drone imagery has grown accordingly [19,20,21]. In the context of unimodal OD, which detects objects using a single modality such as RGB images, several works have explored deep learning models. The popular Faster R-CNN [22] and YOLO [23] models achieve state-of-the-art performance on OD tasks. Deep learning-based ensemble techniques have recently been used for multi-scale OD applications such as drone-based OD, pedestrian detection, and autonomous driving [9,10,11]. Recent studies have investigated deep learning-based object detection in challenging and unstructured environments [24, 25]. Object detection accuracy for small objects remains low [26]; a method for detecting small objects, especially in low-resolution images, is reported in [27]. An ensemble of CNNs for object detection in constrained environments is proposed in [24, 25]. A hybrid vision transformer-CNN with efficient knowledge distillation is applied to classify remote sensing images [28, 29]. In [30], an efficient and robust knowledge transfer network named ERKT-Net is proposed, designed to provide a lightweight yet accurate CNN classifier for remote sensing images. In [31], an enhanced vision transformer-based object detector for remote sensing images is presented.

2.2 Explainability for OD

Current machine learning models impose a tradeoff: prioritizing explainability tends to lower accuracy, while prioritizing accuracy tends to lower explainability. Previously, machine learning and pattern recognition required specialized knowledge to create a feature extractor that converts raw data into a feature vector, whereas deep learning automatically extracts abstract features [32]. In recent years, explainability for OD has developed significantly [33, 34], but the field is still at an early stage. Several researchers have proposed explainable OD models that provide insight into how and why objects are detected. These models typically incorporate attention and saliency mechanisms to highlight the most informative regions of an image for OD [35]. Several explainability techniques have been proposed for the task to date [36,37,38]. While some exciting applications of explainability are seen in agriculture [39] and driver assistance [34], they lack the evaluation and validation that is crucial for real-world adoption and deployment.

In this context, our work addresses significant object detection gaps in drone imagery by presenting an end-to-end framework that integrates ensemble learning and explainable OD techniques. A 3% increase in mAP on the AU-AIR dataset over the SOTA demonstrates the potential of ensemble learning to improve OD performance. Evaluating the XAI outcomes with a perturbation-based feature ablation strategy applied to EigenCAM validates the detections, improving reliability and trustworthiness in real-world deployment settings.

3 Materials and methods

3.1 Dataset description

The work is based on drone imagery, focusing on the image datasets obtained from uncrewed aerial vehicles (UAVs). Although various UAV datasets are available, the AU-AIR dataset [40] is chosen for experimentation and demonstration in this work. The AU-AIR dataset was created using a DJI Phantom 4 drone. AU-AIR is the first multimodal UAV dataset for object detection in aerial images with onboard sensor information for autonomous aerial surveillance. It includes various annotations such as object detection, object tracking, and weather conditions. In addition to flight data, the dataset also contains visual data and object annotations. It is designed for traffic monitoring and comprises eight video streams, collectively lasting over two hours.

The majority of the videos were captured in Aarhus, Denmark. The dataset includes aerial videos and information on time, GPS coordinates, UAV altitude, IMU data, and velocity. The videos were taken from angles ranging from 45 to 90 degrees and at heights between 5 and 30 m above ground level. The video frames contain bounding boxes that indicate instances of different object categories related to traffic surveillance, and the flight data is included with each frame. The dataset comprises a total of 32,823 annotated video frames, each labelled with object categories and the corresponding flight information, for a total of 132,034 annotations. Eight object categories are labelled: humans, cars, vans, trucks, motorcycles, bicycles, buses, and trailers. The dataset also contains sensor information logged during the video recordings, in addition to the visual information and object annotations, making it an optimal choice for experimenting with multi-scale object detection from drone imagery. Figure 1 below shows the classwise distribution of objects in the AU-AIR dataset.

Fig. 1 Classwise distribution of objects in the AU-AIR dataset

For each extracted frame in the dataset, the following attributes are available:

  • d, t: current date of a frame, current time stamp of the frame

  • la, lo, a: UAV latitude, UAV longitude, UAV altitude

  • φ, θ, ψ: roll angle, pitch angle, yaw angle of the UAV

  • Vx, Vy, Vz: x-axis speed, y-axis speed, z-axis speed.

Figure 1 shows the class imbalance in the AU-AIR dataset. In addition, there are inaccuracies in the image annotation formats; class mismatches and haphazard image sequencing relative to the pre-trained weights were also identified, demanding meticulous alignment of class indices. Rectifying this misalignment was vital to ensure seamless image-annotation correspondence. Various augmentation techniques and manual image-annotation matching were carried out to handle the class imbalance.

3.2 Model architecture

Figure 2 depicts the proposed model architecture with ensembling and XAI. The left block illustrates the distinct pipelines for the object detection models. The middle block demonstrates the application of ensemble learning: the prediction model refines the bounding box coordinates, predicting the bounding box and class score in one step and incorporating voting strategies on the extracted bounding boxes. The right block elaborates on the XAI implemented in the unimodal pipeline, encompassing gradient maps and the EigenCAM feature map obtained through upsampling. This work initially considered three OD models, namely YOLOv3-Tiny, MobileNetV2-SSDLite, and YOLOv5s + RetinaNet (a few extracted layers), for experimentation. We also experimented with the YOLOv8 model. For this particular dataset, YOLOv5 with a few RetinaNet layers yielded the best results, so YOLOv5 was selected to produce the optimal bounding box predictions in the ensembling step. YOLOv5 and YOLOv8 were employed for ensembling, and an affirmative score-weighting strategy yielded superior results.

Fig. 2 Proposed model architecture with ensembling and explainability

3.3 Model optimization

The study employed a set of carefully chosen hyper-parameters: initial learning rate (lr0 = 0.01), momentum (0.937), weight decay (0.0005), epochs (150), warmup momentum (0.8), warmup bias learning rate (0.1), box loss coefficient (0.05), class loss coefficient (0.5), class loss power (1.0), object loss coefficient (1.0), object loss power (1.0), IoU threshold (0.2), anchor threshold (4.0), focal loss gamma (0.0), HSV augmentation factors (h = 0.015, s = 0.7, v = 0.4), rotation degrees (0.0), translation magnitude (0.1), scaling factor (0.5), shearing factor (0.0), perspective distortion factor (0.0), vertical flipping probability (0.0), and horizontal flipping probability (0.5).

For optimisation, stochastic gradient descent (SGD) is employed with a learning rate of 0.01, distributed across three parameter groups: 57 weight tensors with no decay, 60 weight tensors with a decay of 0.0005, and 60 biases.
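A minimal sketch of how such a three-group SGD optimizer is typically assembled in PyTorch is shown below. The helper name, the BatchNorm-based grouping rule, and the Nesterov flag are illustrative assumptions rather than the paper's exact code; the group sizes (57/60/60) follow from the model's layer structure.

```python
import torch
import torch.nn as nn

def build_sgd_optimizer(model: nn.Module, lr=0.01, momentum=0.937, weight_decay=0.0005):
    # Hypothetical helper mirroring the three parameter groups described above:
    # BatchNorm weights (no decay), other weights (decayed), and biases (no decay).
    no_decay, decay, biases = [], [], []
    for m in model.modules():
        if hasattr(m, "bias") and isinstance(m.bias, nn.Parameter):
            biases.append(m.bias)
        if isinstance(m, nn.BatchNorm2d):
            no_decay.append(m.weight)
        elif hasattr(m, "weight") and isinstance(m.weight, nn.Parameter):
            decay.append(m.weight)

    optimizer = torch.optim.SGD(no_decay, lr=lr, momentum=momentum, nesterov=True)
    optimizer.add_param_group({"params": decay, "weight_decay": weight_decay})
    optimizer.add_param_group({"params": biases})  # biases are never decayed
    return optimizer
```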

Additionally, selected augmentation techniques, such as blur, median blur, greyscaling, and Contrast Limited Adaptive Histogram Equalization (CLAHE), with respective application probabilities (p = 0.01) and specified parameter ranges, are incorporated to augment the training data. Ensemble learning with the YOLOv5 model for object detection aims to enhance prediction accuracy by minimizing generalization errors. The ensemble approach mitigates prediction errors by ensuring diversity and independence among the base models, improving performance.
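A sketch of the augmentation set described above, using the Albumentations library, is given below; the blur limits and CLAHE parameters are illustrative defaults rather than values reported in the paper.

```python
import albumentations as A

# Each augmentation fires with probability p = 0.01, as described above.
# bbox_params keeps YOLO-format boxes consistent with the transformed image.
train_transform = A.Compose(
    [
        A.Blur(blur_limit=7, p=0.01),
        A.MedianBlur(blur_limit=7, p=0.01),
        A.ToGray(p=0.01),  # greyscaling
        A.CLAHE(clip_limit=4.0, tile_grid_size=(8, 8), p=0.01),
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)
```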

The IoU metric, which measures the overlap between predicted and ground-truth bounding boxes, is employed in the ensemble algorithm to cluster detections based on their bounding box overlaps and class affiliations. This yields a list R comprising subsets [DR1, DR2, …, DRm], where each DRi is a collection of detections. Any pair of detections d1 and d2 within a DRi must satisfy the following conditions:

  • The bounding box overlap condition requires \(IoU(d_{1}^{bbox}, d_{2}^{bbox}) > 0.5\), where \(d_{1}^{bbox}\) and \(d_{2}^{bbox}\) denote the bounding boxes of d1 and d2, respectively, with IoU quantifying their overlap.

  • The class matching condition requires \(d_{1}^{class} = d_{2}^{class}\), guaranteeing that grouped detections share the same class.

Subsequently, each DRi in list R corresponds to a specific region within the image. The size of DRi is a critical factor in determining whether the algorithm infers the presence of an object in that region: larger DRi sizes indicate a heightened likelihood of an object's presence within the corresponding area.
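A minimal sketch of this grouping step is given below. The `Detection` container and the greedy assignment order are assumptions made for illustration, while the IoU > 0.5 and class-match conditions follow the definitions above.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Detection:
    bbox: tuple   # (x1, y1, x2, y2)
    cls: int      # predicted class index
    conf: float   # confidence score

def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def group_detections(detections: List[Detection], iou_thr=0.5) -> List[List[Detection]]:
    """Greedily cluster detections that share a class and overlap by
    IoU > iou_thr, producing the list R = [DR1, DR2, ...] described above."""
    groups: List[List[Detection]] = []
    for det in detections:
        for group in groups:
            if all(d.cls == det.cls and iou(d.bbox, det.bbox) > iou_thr for d in group):
                group.append(det)
                break
        else:  # no compatible group found: start a new region
            groups.append([det])
    return groups
```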

3.4 Baseline networks

YOLOv5 was configured for benchmarking using the default parameters. The AU-AIR dataset is split into 60% training, 10% validation, and 30% testing samples, following the original AU-AIR paper for a fair comparison. The object detectors are adapted to the number of classes in the AU-AIR dataset (8 classes) by changing their last layers. The implemented pipeline is depicted below in Fig. 3. Ensemble models are developed to improve detector performance, and the detections are explained by applying post hoc explainable AI techniques to surface the underlying rationale, with further validation and evaluation.

Fig. 3 Implemented pipeline for explainable OD

3.5 Model performance assessment

To evaluate OD performance, mean average precision (mAP) [41] is used as the metric, which involves computing the average precision (AP) for each object category at recall values from 0 to 1. An intersection over union (IoU) threshold of 0.5 is used here. mAP is then calculated as the mean of these AP values. The computational resources employed include an NVIDIA RTX™ A4000 GPU with 6,144 CUDA® cores and 16 GB of graphics memory.

The YOLOv5-based detection pipeline with the ensembling technique predicts bounding boxes and class probabilities directly from an image. Predefined anchor boxes aid the localization of this single-shot detector, which extracts hierarchical features from the input image and combines features across small, medium, and large scales, improving detection performance across object sizes. The image is resized and normalized before passing through the network. The predictions, comprising bounding box coordinates (x, y, w, h), object confidence scores, and class probabilities, are filtered with non-maximum suppression, which removes redundant bounding boxes and keeps the most confident predictions. Since object detection involves multiple classes, mAP is the mean of AP across all classes:

\(mAP=\frac{1}{N}\sum_{i=1}^{N}AP_{i}\)

where N is the number of object classes and \(AP_{i}\) is the average precision for class i.
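The sketch below computes per-class AP as the area under the precision-recall curve (all-point interpolation, as in standard detection benchmarks) and averages over classes as in the formula above; it assumes the precision and recall arrays have already been accumulated per class.

```python
import numpy as np

def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    """Area under the precision-recall curve for one class,
    with recall sorted in ascending order (all-point interpolation)."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([1.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]   # monotone precision envelope
    idx = np.where(r[1:] != r[:-1])[0]         # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def mean_average_precision(per_class_ap: dict) -> float:
    """mAP = mean of per-class APs, matching the formula above."""
    return sum(per_class_ap.values()) / len(per_class_ap)
```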

Data and class balancing are performed through resampling and augmentation, and the adopted ensembling approach further mitigates data bias across classes.

3.6 Model parameters

To establish a benchmark for OD performance, the YOLOv5s model is trained on the AU-AIR dataset, and its results are compared with those reported in the original paper using YOLOv3-Tiny and MobileNetV2-SSDLite. Table 1 depicts the classwise mAP scores; the custom-built YOLOv5 model surpasses the baselines for all classes in the dataset, providing improved detection performance. The classwise results are graphically represented in Fig. 4, showing improved mean average precision compared with the baseline results.

Table 1 Classwise comparative results of the proposed model with the baseline mAP
Fig. 4 Classwise comparison of different models on the AU-AIR dataset

Figure 5 showcases the F1-confidence curve and the confusion matrix for the YOLOv5 model, summarizing the class-wise detection performance.

Fig. 5 F1-confidence curve for the YOLOv5 model with confusion matrix

4 Results and discussion

4.1 Ensemble learning

Object detection performance can be significantly improved through ensembling [42]. Although ensembling increases overhead, it improves accuracy by reducing false positives and false negatives, balancing speed and precision, and enhancing the overall robustness of the detections. A score-weighting strategy is implemented: the prediction scores are weighted based on the confidence scores of each classifier, so classifiers with higher confidence scores have more influence. The score-weighting strategy combines models of varied reliability and confidence and is inherently scalable to more models. When using ensemble learning [43, 44] for object detection and for the explainability of the task, the choice of voting strategy is critical in determining the performance of the overall model [44]. The unanimous voting strategy, which requires agreement from all models, can degrade performance if some models perform worse than others.

On the other hand, the affirmative and consensus voting strategies accept correct predictions from a subset of models even if others miss them, thus improving overall performance. The choice of voting strategy depends on the specific requirements of the object detection task and the level of model uncertainty. The unanimous strategy can be helpful when dealing with high-confidence predictions, while the affirmative and consensus strategies suit settings with model uncertainty. Ensembling does make predictions harder to debug and understand, as boxes are drawn from multiple models, and it increases inference time.
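The three strategies can be sketched as a threshold on how many distinct models must contribute a detection to each grouped region (see Section 3.3). The `model_id` attribute, recording the source detector of each box, is an assumed extension of the detection container used earlier; this is a sketch of the strategies discussed above, not the exact production implementation.

```python
def vote(groups, n_models, strategy="affirmative"):
    """Keep a region's detection only if enough distinct models voted for it."""
    required = {
        "affirmative": 1,                # any single model suffices
        "consensus": n_models // 2 + 1,  # strict majority of models
        "unanimous": n_models,           # every model must agree
    }[strategy]
    kept = []
    for group in groups:                 # one group per image region
        if len({d.model_id for d in group}) >= required:
            kept.append(max(group, key=lambda d: d.conf))  # most confident box
    return kept
```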

For the OD task, three models, namely YOLOv3-Tiny, MobileNetV2-SSDLite [45, 46], and YOLOv5s + RetinaNet layers [23], are evaluated and observed to achieve mAPs of 35%, 20%, and 43%, respectively, as shown in Table 1. Ensemble learning of the YOLOv5 and YOLOv8 models with two score-weighting strategies, affirmative and unanimous, is reported. The affirmative voting strategy enabled the capture of accurate detections from particular models, thereby enhancing the overall performance of the detector. The mathematical modelling for ensembling and threshold suppression is presented in Algorithm 1 below.

5 Algorithm 1

Algorithm 1 Mathematical modelling for threshold suppression and ensembling

In the above pseudocode, output represents the output tensor of the YOLO architecture, conf represents the confidence score of a bounding box, and (p1, p2, …, pc) represents the class probabilities. The per-class ensembling results are shown in Figs. 6 and 7.
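As a companion to the pseudocode, the sketch below decodes such an output tensor: each row is assumed to be laid out as (x, y, w, h, conf, p1…pC), rows with low joint scores are suppressed by a confidence threshold, and torchvision's NMS removes redundant overlapping boxes. The threshold values are illustrative.

```python
import torch
from torchvision.ops import nms

def xywh_to_xyxy(xywh: torch.Tensor) -> torch.Tensor:
    """Convert centre-format (x, y, w, h) boxes to corner format."""
    xy, wh = xywh[:, :2], xywh[:, 2:]
    return torch.cat([xy - wh / 2, xy + wh / 2], dim=1)

def threshold_suppress(output: torch.Tensor, conf_thr=0.25, iou_thr=0.45):
    """Decode a YOLO output tensor of shape (num_boxes, 5 + C).
    Rows whose joint score conf * max(p1..pC) falls below conf_thr are
    suppressed; NMS then keeps the most confident non-overlapping boxes."""
    conf, probs = output[:, 4], output[:, 5:]
    cls_prob, cls_id = probs.max(dim=1)
    score = conf * cls_prob              # joint box/class confidence
    keep = score > conf_thr
    boxes = xywh_to_xyxy(output[keep, :4])
    score, cls_id = score[keep], cls_id[keep]
    kept = nms(boxes, score, iou_thr)    # torchvision non-maximum suppression
    return boxes[kept], score[kept], cls_id[kept]
```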

Fig. 6 Graphical representation of classwise ensemble learning results

Fig. 7 Classwise ensemble learning results with (a) YOLOv5 and (b) ensembling of YOLOv5 and YOLOv8 with affirmative score weighting

In continuation, the effectiveness of ensemble learning for OD tasks was also evaluated by considering the YOLOv5s and YOLOv8s models, fine-tuned with equivalent hyper-parameters and trained on the same data subset. The affirmative voting strategy was adopted, as it avoids excluding correct predictions, which can occur under the unanimous and consensus strategies and lead to false negatives that affect the recall curve and thereby the average precision of individual classes. The detections were evaluated using the IoU metric. Findings show that the affirmative voting strategy resulted in a 3% increase in mean average precision, as shown in Table 2, demonstrating the potential of ensemble learning to improve the performance of OD models.

Table 2 mAP obtained for various classes with YOLOv5 + YOLOv8 and the affirmative score-weighting strategy

Figure 7 provides comparative object detection results for the YOLOv5 model (Fig. 7a) and the combined affirmative strategy using the YOLOv5 and YOLOv8 models (Fig. 7b). The confidence scores of YOLOv5 alone are lower than those of YOLOv5 + YOLOv8. The ensemble reduces false positives and detection uncertainty, with improved confidence compared with YOLOv5 alone. The voting strategy discards low-confidence, noisy detections and leverages the complementary strengths of YOLOv5 and YOLOv8, ensuring robust outcomes. While YOLOv8 alone might offer improvements in certain aspects, the empirical evidence in the figure suggests that YOLOv5, combined with the affirmative voting strategy, yields superior results for this task and dataset; therefore, YOLOv5 was selected to produce the optimal bounding box predictions. The results depicted in Fig. 8 highlight the importance of selecting an appropriate voting strategy to enhance the accuracy and robustness of ensemble models for explainable multi-class OD tasks.

Fig. 8 Comparative analysis of per-class APs. The orange shade represents the increase in AP for each class (Table 2) compared with the original APs of the YOLOv5 model (Table 1)

The study demonstrates the potential of ensemble learning and explainable AI in improving the performance and interpretability of object detection models in drone imagery. The results in Fig. 8 show that ensemble learning with the affirmative voting strategy can significantly enhance the performance of OD models, making them more robust and reliable. The study also emphasizes the need for explainable AI in high-risk domains, where trust in AI-generated insights is essential. The figures and tables support these findings, providing a comprehensive understanding of the effectiveness of the proposed pipeline.

5.1 Explainable OD

Objects in drone imagery are multi-scale. Addressing this challenge requires precise localization and detection of discriminative regions by separating foreground from background information. EigenCAM is selected to infuse explainability, justifying the model outcomes with a rationale that can be interpreted with human intuition. EigenCAM requires no retraining or layer modification. It targets the salient features aligned with the direction of the principal components, highlighting the part of the image that induces the greatest magnitude of activation and thereby generating visual explanations. EigenCAM is robust against classification errors made by the fully connected layers of CNNs and does not rely on the backpropagation of gradients, class relevance scores, maximum activation locations, or any other form of feature weighting. EigenCAM [48] computes and visualizes the principal components of the learned features/representations from the convolutional layers, producing sharper and more localized heat maps than GradCAM [49], making it a suitable choice.

The EigenCAM method is integrated into the proposed model after observing its rigorous evaluation against other prominent techniques, such as Grad-CAM [49], Grad-CAM++ [50], and CNN-fixations, on well-established datasets for tasks such as weakly-supervised localization and object localization in the presence of adversarial noise. EigenCAM is class-agnostic, and its critical advantage is compatibility with any CNN model, requiring no adjustments or retraining, with robustness against adversarial perturbations [48]. The EigenCAM results are depicted in Fig. 9: the left column shows the original image, the middle column the EigenCAM heatmap, and the right column the detection results. The activation regions in the heatmap align well with the detected objects, indicating that the model focuses on the right features for the task. The OD model, class-specific filtering, and non-maximum suppression thresholding generate the bounding boxes shown in column 3; these correspond to the high-activation areas in the EigenCAM heatmap in column 2, verifying that the bounding boxes align with the detected objects. The bounding boxes are precise, all relevant objects are detected, and the explanations are consistent, show precise localization, and highlight the attribution of essential features towards each detection, making the detector explainable and helping developers and end users comprehend the OD outcomes. Heatmap activations along image borders are typically due to boundary artefacts and an edge detection bias in the dataset. High gradient values are also observed in the sky when an object is occluded, which can be attributed to contextual reasoning, a lack of strong object features, dataset biases, and contrast effects: the model makes a decision based on the available information, and when the target object is missing, it assigns importance to the visible background or sky.
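For intuition, a minimal from-scratch sketch of the EigenCAM computation [48] is shown below: the feature map of a chosen convolutional layer is projected onto its first principal component and normalized into a heatmap. Capturing the activations via a forward hook, and the sign-flip heuristic, are implementation assumptions.

```python
import numpy as np
import torch

def eigen_cam(activations: torch.Tensor) -> np.ndarray:
    """EigenCAM heatmap from a conv feature map of shape (C, H, W):
    project the centred activations onto their first principal component."""
    C, H, W = activations.shape
    A = activations.detach().reshape(C, H * W).T.cpu().numpy()  # (H*W, C)
    A = A - A.mean(axis=0)                             # centre each channel
    _, _, vt = np.linalg.svd(A, full_matrices=False)   # principal directions
    cam = A @ vt[0]                                    # first-PC projection
    if cam.sum() < 0:                                  # SVD sign is arbitrary
        cam = -cam
    cam = np.maximum(cam.reshape(H, W), 0)             # keep positive saliency
    cam = cam / (cam.max() + 1e-9)                     # normalize to [0, 1]
    return cam  # upsample to the input resolution before overlaying
```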

Fig. 9 Visualization of the original image vs. its corresponding EigenCAM heatmap and the detection output for EigenCAM-based OD

Figure 9 presents the EigenCAM results, demonstrating the effectiveness of the proposed approach in infusing explainability into the object detection model. The middle column shows the heat maps generated by EigenCAM, highlighting the regions of the image that contribute most to the model's predictions, while the right column shows the corresponding detections. Together, they provide a spatial and visual explanation of the model's decision-making process, allowing developers and end users to comprehend the rationale behind the object detection outcomes. The results are consistent, precise, and localized, attributing the importance of specific features to each detection and making the detector explainable and trustworthy for real-world applications in drone imagery.

5.2 Evaluation of XAI results

Evaluation of XAI models is vital for their reliable and trustworthy deployment. For UAV-based OD, explanation generation and the assessment or validation of model outcomes have yet to be widely addressed. XAI outcomes can be evaluated with application-grounded, functionally-grounded, and human-grounded evaluation methods, which are goal- and application-specific [51]. However, the methodology and the choice of evaluation metrics still lack standardization [52]. This work follows application- and goal-specific reasoning for the detected objects. The perturbation-based occlusion or ablation strategy adds or deletes important features to observe their impact on the overall model predictions and the model's reliance on them, lending robustness to the evaluation of model outcomes.

A comprehensive evaluation methodology is implemented for the explanations generated by EigenCAM, a model- and class-agnostic interpretation technique for understanding neural networks. The evaluation process perturbs/ablates the input image by deleting the most significant features and observing the resultant impact on the explanation. The perturbation is applied pixel-wise or patch-wise through techniques such as occlusion masking, blurring, or replacing parts of the image. The chosen perturbation-based strategy is less noisy and more reliable, fulfilling the application-specific need for robust and explainable object detection in drone imagery. The perturbation simulates the absence or modification of specific features, allowing us to understand the model's reliance on these features for its predictions. EigenCAM is then applied to the perturbed images to generate explanation heat maps highlighting the regions contributing most to the model's decision. The resulting visualizations provide valuable insights into the model's decision-making and reliance. By comparing the explanations before (Fig. 9) and after perturbation/ablation (Fig. 10), we can assess the impact of feature ablation on the explanation, gaining a deeper understanding of the model's reasoning and of the feature importances that significantly affect its outcomes. Masking, ablating, or deleting essential features drastically degrades detection accuracy, shifting attention towards irrelevant regions away from the masked objects. Figure 10(a) and (b) show a perturbed image and its EigenCAM results. The model's strong reliance on these masked features, evidenced by the abrupt drop in detection accuracy, confirms that the explanations are faithful, making the OD framework more reliable, robust, transparent, and trustworthy, which is highly desired in the mission-critical application of multimodal/multi-sensor-based OD in drone imagery.
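A sketch of the ablation step is shown below: pixels in the top fraction of the (upsampled) EigenCAM heatmap are occluded, after which the detector and EigenCAM are re-run on the perturbed image to compare explanations. The top-fraction threshold and the fill value are illustrative assumptions.

```python
import numpy as np

def occlude_top_features(image: np.ndarray, heatmap: np.ndarray,
                         top_frac: float = 0.1, fill: int = 0) -> np.ndarray:
    """Mask the pixels whose heatmap scores fall in the top `top_frac`
    fraction (pixel-wise occlusion masking). `heatmap` must already be
    upsampled to the image's height and width."""
    thr = np.quantile(heatmap, 1.0 - top_frac)
    mask = heatmap >= thr            # the most important regions
    perturbed = image.copy()
    perturbed[mask] = fill           # delete the essential features
    return perturbed                 # re-run detection + EigenCAM on this
```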

Fig. 10 (a) Perturbed/occluded objects and (b) corresponding EigenCAM-based detection results

6 Conclusion

This paper describes the development of an end-to-end pipeline for multi-scale OD in drone imagery with an ensembling approach and XAI. AU-AIR, a multimodal dataset, was chosen to demonstrate and evaluate the proposed approach. An object detector with a customized YOLOv5 model and ensemble-based voting strategies was designed. Findings show that the affirmative voting strategy resulted in a 3% increase in mean average precision, demonstrating the potential of ensemble learning to improve the performance of OD models. To establish trust, transparency, and reliability in the detections, explainability was infused into the model with EigenCAM, which provides heat maps highlighting the essential features affecting the model predictions. The explanations were further evaluated and validated by applying an ablation-based perturbation method that deletes, masks, or removes the features EigenCAM marks as crucial, in order to observe the model's reliance on them, ensuring the accuracy of the detector and enhancing the robustness, reliability, explainability, and transparency of the overall system. The findings show that integrating ensembling and explainability makes the proposed object detection pipeline robust and explainable, an appropriate fit for real-world deployment, with the trust, transparency, and accountability critical for reliable object detection in drone imagery. Future work can aim to reduce the complexity and dimensionality of the proposed architecture and address domain-specific challenges, for example by employing advanced lightweight architectures for real-time performance, including more models in the ensembling approach, and applying diverse XAI techniques for consistent results, which can further improve OD accuracy and performance and make the system more explainable and adaptable to open-world scenarios.