Abstract
Direct inverse analysis of faults in machinery systems such as gears using first principles is intrinsically difficult, owing to the multiple time- and length-scales involved in vibration modeling. As such, data-driven approaches have become the mainstream, with supervised training generally deemed effective. Nevertheless, existing techniques often fall short in their ability to generalize from discrete data labels to the continuous spectrum of possible faults, a shortcoming that is further compounded by various uncertainties. This research proposes an interpretability-enhanced deep learning framework that incorporates Bayesian principles, effectively transforming convolutional neural networks (CNNs) into dynamic predictive models and significantly amplifying their generalizability while offering more accessible insights into the model's reasoning processes. Our approach is distinguished by a novel implementation of Bayesian inference, enabling the navigation of the probabilistic nuances of gear fault severities. By integrating variational inference into the deep learning architecture, we present a methodology that excels in leveraging limited data labels to reveal insights into both observed and unobserved fault conditions. This approach improves the model's capacity for uncertainty estimation and probabilistic generalization. Experimental validation on a lab-scale gear setup demonstrated the framework's superior performance, achieving nearly 100% accuracy in classifying known fault conditions, even in the presence of significant noise, and maintaining 96.15% accuracy when dealing with unseen fault severities. These results underscore the method's capability in discovering implicit relations between known and unseen faults, facilitating extended fault diagnosis, and effectively managing large degrees of measurement uncertainty.
1 Introduction
Rotating machinery systems, integral to various sectors including energy, manufacturing, and propulsion, necessitate rigorous condition monitoring and fault diagnosis to ensure their reliability and safety [1]. Among the diverse signals available for fault detection, vibration analysis is particularly valued for its rich informational content and low instrumentation costs. A quintessential example of such machinery is gear transmission systems, where the complexity of gear vibrations, spanning multiple time and length scales, makes direct inverse analysis of faults from response anomalies impractical. Consequently, the field has predominantly adopted data-driven approaches, comparing measurements from healthy baseline systems against those under operation to identify deviations indicative of faults. Traditional methodologies often incorporate signal processing techniques, such as wavelet transforms, to measure vibration responses and extract features reflecting gear health status [2–5]. These techniques, particularly effective in elucidating variations in underlying features following fault occurrences, have been extensively documented in the literature [6–8]. The integration of signal processing with machine learning has further augmented the potential of these techniques, enhancing feature classification capabilities as computational power has grown [9–12]. However, this early work relied on manual feature extraction, which proved ineffective for several reasons. The process is inherently subjective and heavily dependent on the expertise and judgment of the analyst, which can result in inconsistencies and the potential oversight of critical features. Additionally, manual feature extraction often fails to adequately capture the complex and subtle nuances of gear faults, especially when these faults develop gradually or occur under varying operational conditions. Moreover, such manual intervention is neither scalable nor efficient for continuous monitoring in large-scale industrial applications.
In recent years, machine learning, and particularly deep learning, has established itself as a powerful tool for fault diagnosis. Deep learning neural networks, especially convolutional neural networks (CNNs) [13], leverage powerful computing facilities to enhance feature extraction capabilities directly from raw data. These networks can process vast amounts of data, extracting significant features while maintaining a relatively small number of trainable parameters, which alleviates the computational burden during model training [14–16]. The ability to automatically extract and process these features makes deep learning particularly effective and popular in handling complex diagnostic tasks. Under the umbrella of deep learning, supervised learning has shown considerable effectiveness. It relies on labeled data to train models, allowing for precise model tuning and validation. Examples of its application include work by Kim and Choi, who utilized a CNN for fault diagnosis based on signal segmentation [17], and Chen et al., who developed an adaptive neural network tailored to variable rotating speeds [18]. Shi et al. further extended these capabilities by introducing a Bidirectional Convolutional Long Short-Term Memory (BiConvLSTM) method that integrates vibration and rotational speed signals for enhanced fault classification [19]. Despite its strengths, supervised learning faces challenges, particularly the requirement for extensive labeled datasets, which can be difficult and costly to acquire in the context of gearboxes where the fault conditions form a continuous spectrum. This limitation can restrict the model's ability to generalize across diverse and unseen fault scenarios. On the other hand, unsupervised learning, which operates without labeled data, is often explored as a solution for situations characterized by small dataset sizes. This approach aims to autonomously discover patterns within the data, as demonstrated by Li et al. with an automatic encoder for gear state classification [20] and the use of a stacked sparse auto-encoder [21]. Zhou and Tang extended this approach into a semisupervised realm using a deep convolutional generative adversarial network to cope with limited fault labels [22]. While these methods are adept at making the most of small data volumes, they face significant challenges in ensuring model accuracy and maintaining reliability across varying operational conditions. Moreover, the challenge in our gearbox fault diagnosis scenario is not merely small data size but specifically the scarcity of labeled data, which is crucial for training supervised learning models.
In the case of gearbox, the diverse and often subtle nature of gearbox faults requires extensive labeled datasets to ensure accurate model training and validation. However, obtaining such comprehensive labeling is impractical and costly, particularly given the continuous spectrum of potential fault conditions in gearbox systems. This limitation significantly impacts the performance of data-driven diagnostic approaches, especially deep learning techniques, which are heavily reliant on the quality and expansiveness of training data. The complex dynamics of gearbox operations introduce additional layers of difficulty; the nonlinear vibrations and various operational anomalies compound the challenges, leading to frequent misclassifications of unfamiliar fault types as those within the limited scope of the training set. Such errors are exacerbated by inherent uncertainties in machinery operation, including variability in assembly and environmental noise, which can obscure diagnostic insights. Despite various advancements in data augmentation, such as transfer learning and semisupervised learning techniques [23–25], the issues associated with limited labels and the persistent noise and uncertainty have seen limited progress in mitigation. Efforts like fuzzy neural networks (FNNs) [26,27] have been explored to extend classification capabilities to unknown fault scenarios using fuzzy logic, but these too fall short in integrating the uncertainties of real-world operational conditions into the classification process. The core of these challenges lies in the deterministic nature of most existing deep learning models, which are ill-equipped to handle the nuanced spectrum of uncertainty inherent in complex engineering systems like gearboxes. To truly advance the reliability and applicability of diagnostics in such contexts, a shift toward uncertainty-aware methodologies is necessary.
Therefore, it becomes imperative to integrate probabilistic analysis and account for prediction uncertainties within the diagnostic process. Given the inherent uncertainties and measurement noise in the data used for training, gear fault diagnosis must be approached as a probabilistic issue, akin to making diagnoses in medical practice. This probabilistic perspective introduces a crucial dimension of analysis, enabling the handling of unknown or unseen fault scenarios effectively. The nature of gear vibration is inherently complex, making it unfeasible to rely solely on deterministic models for fault classification. Instead, probabilistic methods allow for the assessment of similarity levels and implicit relationships among fault conditions in a probabilistic sense. Such analysis enables the classification of an unseen fault by comparing it probabilistically with known scenarios, thus allowing for confidence-based decision-making. This is particularly useful when direct deterministic extrapolation cannot accurately classify an unseen fault due to the variability and complexity of the signals involved. Bayesian approaches, with their capacity to provide rich probabilistic interpretations of outcomes, are particularly suited to this task. These methods model the uncertainty directly through the use of posterior distributions, offering a robust framework for incorporating both the known and the unknown variables affecting fault diagnosis [26]. Examples of Bayesian-based methods in machine learning include the Gaussian process, naïve Bayesian, and Gaussian mixture models, which all emphasize probabilistic prediction [28–31]. However, while these models are powerful, they often face scalability limitations [32–34] in high-dimensional spaces or require extensive datasets, which can be a significant constraint in practical applications. 
For instance, Gaussian processes are known for their computational complexity, which scales cubically with the number of data points [30], posing a challenge for their use in large-scale industrial environments. By adopting these uncertainty-aware methodologies, we can significantly enhance the accuracy and reliability of fault diagnosis systems. Such approaches not only provide a deeper understanding of the data and its inherent uncertainties but also improve the generalization capability of the diagnostic models, ensuring they are robust against the diverse and unpredictable nature of gearbox faults.
Building on the foundation of addressing the challenges posed by uncertainties and unseen fault scenarios in gearbox fault diagnosis, this research proposes a method to bridge the significant gap between theoretical models and their practical application. Our central hypothesis is that adopting a probabilistic analysis through advanced Bayesian methodologies can effectively meet these challenges. Specifically, our approach employs a deep learning framework that leverages modern Bayesian analytics, a strategy that has gained traction across various research domains for its robust feature learning capabilities and its proficiency in probabilistic prediction [35–39]. This innovative framework stands out by incorporating state-of-the-art Bayesian optimization techniques in the training process, enabling precise estimation of the network's unknown parameters and their posterior distributions. Such methods significantly enhance our understanding of data correlations under conditions of uncertainty and are particularly effective at minimizing overfitting with limited datasets, while also simplifying the need for extensive hyper-parameter tuning. However, customizing and implementing this advanced Bayesian framework to address gear fault diagnosis, especially under conditions of limited labels and significant uncertainties, presents unique challenges. For instance, the computational complexity of this framework exceeds that of traditional CNNs, necessitating more sophisticated computational strategies to extract comprehensive information about the network's parameters effectively. Furthermore, the success of the posterior distribution calculations within this framework critically depends on the selection of appropriate priors, which are inherently based on specific hypotheses. Addressing these challenges to extend the capabilities of fault diagnosis in the face of limited data and significant uncertainties forms the core motivation of our research. 
By harnessing the power of advanced Bayesian analytics, our approach not only confronts uncertainties but also transforms them into actionable insights. By applying these refined Bayesian techniques, we aim to calculate posterior distributions of unknown parameters, providing a probabilistic understanding of gear faults that far exceeds binary or deterministic outputs. Different from traditional Bayesian Convolutional Neural Network (BCNN) applications, our approach specifically tailors the BCNN framework to the intricate task of gear fault diagnosis, focusing on the continuous spectrum of fault severities. By integrating BCNN with customized methods for classification and uncertainty quantification, our model goes beyond standard fault classification to also quantify prediction confidence. This capability is crucial for reliable diagnostics, particularly when dealing with unseen fault severities. Additionally, our approach introduces innovative techniques for analyzing and interpreting probabilistic outputs, offering deeper insights into the nuanced relationships between different fault types. This method allows for recognizing the continuous nature of fault progression rather than confining them to discrete categories, thereby promising to elevate the reliability and applicability of diagnostics in complex engineering systems.
The remainder of this paper is organized as follows. In Sec. 2, the concept and formulation of the Bayesian-enriched deep learning framework are presented, in which the probabilistic optimization strategy that facilitates model training is mathematically detailed. Section 3 outlines the data acquisition and case setup of the laboratory gear testbed employed in this research, as well as the details of the specific Bayesian-influenced deep model. In Sec. 4, comprehensive case demonstrations are conducted to showcase the unique aspects of the proposed Bayesian-enriched deep learning framework in dealing with uncertainties and unseen faults. Concluding remarks are summarized in Sec. 5.
2 Bayesian-Enriched Deep Learning Framework for Fault Diagnosis
In this section, we present the formulation of the enhanced fault diagnosis framework based on Bayesian-enriched deep learning. Our particular focus is on the Bayesian approach to neural network training and on the variational inference approach used to alleviate the computational burden.
2.1 Bayesian Inference-Based Convolutional Neural Network Training Architecture.
Our framework enhances traditional convolutional neural networks by integrating Bayesian principles, significantly improving interpretability and robustness in gear fault diagnosis. Unlike conventional models that rely on fixed weights and biases, our architecture employs probabilistic weights and biases, enabling the network to naturally account for the uncertainties inherent in the diagnostic process. The primary innovation in our architecture lies in its ability to leverage Bayesian inference to transform convolutional layers into probabilistic entities. This approach captures the inherent uncertainties in feature extraction, leading to a more robust and comprehensive representation of the input data. By treating each weight and bias as a random variable drawn from a probability distribution, our model can better handle the variability and complexity of gear fault scenarios.
Incorporating Bayesian inference into the convolutional layers allows our model to maintain uncertainty information throughout the network. This probabilistic treatment extends to the fully connected layers, which aggregate features extracted by the convolutional layers while preserving the uncertainty information. Consequently, the model can provide predictions in the form of probability distributions, offering not only the predicted class but also the associated uncertainty quantification. This probabilistic output is crucial for making informed decisions, particularly in scenarios with limited data and significant measurement noise. The training process of our architecture involves optimizing these probabilistic weights and biases using Bayesian inference techniques [40]. Specifically, we employ variational inference to approximate the posterior distribution of the network parameters. The purpose of model training is to find the optimal posterior distribution [41,42] of these parameters, which inherently incorporates the uncertainty in the model. This approach enables efficient and scalable training, even in high-dimensional spaces, by converting the complex posterior inference problem into a more manageable optimization task.
Our framework's ability to quantify uncertainty offers a significant advantage over traditional deterministic models. By providing a measure of confidence in the predictions, our model enhances the reliability of fault diagnosis, especially in unseen and uncertain scenarios. This probabilistic approach allows for a nuanced interpretation of fault severities, acknowledging the continuous nature of fault progression rather than constraining them to discrete categories. In summary, our Bayesian-enriched architecture transforms the way convolutional neural networks handle gear fault diagnosis by embedding probabilistic reasoning at every stage. This methodology significantly improves the model's ability to generalize from limited data and enhances its performance in predicting and understanding complex fault scenarios. The specific construction of this architecture, including detailed layer configurations and the data conversion process, will be elaborated in the following Sec. 3.
2.2 Variational Inference for Probabilistic Parameter Optimization.
This classical expression, denoted as Eq. (1), involves several probabilistic components that integrate the likelihood, prior, and marginal likelihood of the data, which are essential for Bayesian inference. Since the analytical solution of the posterior probability density function (PDF) is intractable, numerical approximations are commonly employed, which are oftentimes computationally intensive. This issue is further exacerbated by the high dimension of the network parameters, owing to the large scale of the BCNN model in gear diagnosis. While Markov chain Monte Carlo (MCMC) methods, such as the Gibbs sampler, have been shown to approximate the true posterior well through expedited sampling, directly sampling a high-dimensional posterior is still costly [43–46]. Even when the posterior PDF can be numerically approximated, it is generally a nonstandard probability distribution that cannot be well characterized by a parametric representation. With such a nonstandard posterior PDF, the maximum a posteriori estimate is usually further employed for point (deterministic) estimation, i.e., it yields the single solution corresponding to the largest probability value [47,48]. Fundamentally, this cannot serve the probabilistic parameter optimization required for BCNN training.
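For orientation, the quantities discussed above can be written in a standard form. The notation is an assumption introduced here, because the original symbols are not reproduced in this text: $\mathbf{w}$ collects the network weights and biases, $\mathcal{D}$ is the training data, $q_{\theta}(\mathbf{w})$ is a variational distribution with parameters $\theta$, and $\mathbf{w}^{(i)}$ are samples drawn from it. The first line is Bayes' theorem. The second is the usual variational objective $\mathcal{F}$, which equals the Kullback-Leibler divergence between $q_{\theta}$ and the posterior up to a constant that does not depend on $\theta$. The third is a Monte Carlo estimate of that objective using $M$ samples.

```latex
\begin{align}
p(\mathbf{w}\mid\mathcal{D}) &= \frac{p(\mathcal{D}\mid\mathbf{w})\,p(\mathbf{w})}{p(\mathcal{D})},
\qquad p(\mathcal{D}) = \int p(\mathcal{D}\mid\mathbf{w})\,p(\mathbf{w})\,\mathrm{d}\mathbf{w} \\
\mathcal{F}(\mathcal{D},\theta) &= \mathrm{KL}\!\left[q_{\theta}(\mathbf{w})\,\|\,p(\mathbf{w})\right]
 - \mathbb{E}_{q_{\theta}(\mathbf{w})}\!\left[\log p(\mathcal{D}\mid\mathbf{w})\right],
\qquad \theta^{*} = \arg\min_{\theta}\,\mathcal{F}(\mathcal{D},\theta) \\
\mathcal{F}(\mathcal{D},\theta) &\approx \frac{1}{M}\sum_{i=1}^{M}
 \left[\log q_{\theta}(\mathbf{w}^{(i)}) - \log p(\mathbf{w}^{(i)}) - \log p(\mathcal{D}\mid\mathbf{w}^{(i)})\right],
\qquad \mathbf{w}^{(i)} \sim q_{\theta}(\mathbf{w})
\end{align}
```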
where M is a sufficiently large number to ensure the approximation convergence.
Here, the cost function is contributed by the cross-entropy loss. Combining the variational inference approach with the stochastic gradient descent (SGD) algorithm makes probabilistic backpropagation readily available for model training. It is worth mentioning that the reparameterization scheme is adopted so that standard backpropagation remains applicable even though the training involves a stochastic sampling step [54]. Batch training, which is widely used for neural networks, is adopted here: a single batch of training samples is used for the optimization (Eq. (7)) in each iteration.
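The reparameterization step mentioned above is commonly written as follows. This is a sketch under stated assumptions rather than the paper's own equation: a factorized Gaussian variational distribution is assumed, $\mu$ and $\sigma$ are the per-parameter mean and standard deviation, $\epsilon$ is an auxiliary noise sample, $\mathbf{w}$ is the sampled set of weights and biases, and $\mathcal{L}$ is the per-batch objective. Because the randomness is isolated in $\epsilon$, the sampled $\mathbf{w}$ is a differentiable function of $(\mu, \sigma)$, so SGD can update both.

```latex
\begin{align}
\mathbf{w} &= \mu + \sigma \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) \\
\nabla_{\mu}\mathcal{L} &= \frac{\partial \mathcal{L}}{\partial \mathbf{w}} + \frac{\partial \mathcal{L}}{\partial \mu},
\qquad
\nabla_{\sigma}\mathcal{L} = \frac{\partial \mathcal{L}}{\partial \mathbf{w}} \odot \epsilon + \frac{\partial \mathcal{L}}{\partial \sigma}
\end{align}
```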
To seamlessly integrate the variational inference approach into neural network training, Gal and Ghahramani developed the Monte Carlo dropout method, which is embedded into each layer that has weights and biases to be optimized [55]. This design is equivalent to applying variational inference with a specific variational distribution to neural network training. Gal further validated that the variational predictive distribution can be obtained by randomly drawing Bernoulli random variables through dropout during inference [56]. This method provides a practical solution for BCNN model training and lays the foundation for the implementation in this research.
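The idea of keeping dropout active at inference time can be illustrated with a small, pure-Python sketch. It is not the paper's implementation: the single dense layer, the function names, the keep probability, and the number of passes are all hypothetical choices made for illustration. Each forward pass draws a fresh Bernoulli mask, and the spread of the resulting softmax outputs serves as an estimate of predictive uncertainty.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def dropout_forward(x, weights, biases, keep_prob, rng):
    """One stochastic pass: Bernoulli mask on the inputs, then dense + softmax."""
    # Inverted dropout: surviving inputs are rescaled by 1 / keep_prob.
    masked = [xi / keep_prob if rng.random() < keep_prob else 0.0 for xi in x]
    logits = [
        sum(w * m for w, m in zip(row, masked)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(logits)


def mc_dropout_predict(x, weights, biases, keep_prob=0.8, n_passes=200, seed=0):
    """Mean and variance of the class probabilities over repeated dropout passes."""
    rng = random.Random(seed)
    passes = [dropout_forward(x, weights, biases, keep_prob, rng) for _ in range(n_passes)]
    n_classes = len(biases)
    mean = [sum(p[c] for p in passes) / n_passes for c in range(n_classes)]
    var = [sum((p[c] - mean[c]) ** 2 for p in passes) / n_passes for c in range(n_classes)]
    return mean, var


# Toy example with hypothetical weights: 4 input features, 3 classes.
W = [[0.5, -0.2, 0.1, 0.3], [-0.4, 0.6, 0.2, -0.1], [0.1, 0.1, -0.5, 0.4]]
b = [0.0, 0.1, -0.1]
x = [1.0, 2.0, -1.0, 0.5]
mean_probs, var_probs = mc_dropout_predict(x, W, b)
```

Setting `keep_prob=1.0` disables the mask, so every pass is identical and the variance collapses to zero, which is the deterministic CNN limit.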
2.3 Probabilistic Prediction Upon Harnessing Bayesian Convolutional Neural Network.
Here, the posterior predictive distribution is a marginal PDF obtained by integrating over the posterior PDF of the network weights and biases (Eq. (1)). For each deterministic input, the corresponding output is probabilistic because the weights and biases are randomly sampled from the optimized variational distribution. While the weights and biases of the network follow standard statistical distributions, the output of the final layer, i.e., the class probabilities, may not have an analytical form due to the nonlinear activation functions applied. Therefore, Monte Carlo analysis is usually employed to numerically identify this probability distribution for a given input. Without loss of generality, the outputs of all layers other than the input layer in the BCNN become probabilistic and can be extracted in a similar way. This probabilistic nature allows one to account for the effect of uncertainty in prediction, which is the unique strength of the BCNN.
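In a standard form, the posterior predictive distribution and its Monte Carlo estimate read as follows. The symbols are assumptions introduced for illustration, since the original notation is not reproduced in this text: $\mathbf{x}^{*}$ is a new input, $y^{*}$ its class label, $\mathbf{w}$ the weights and biases, $\mathcal{D}$ the training data, $q_{\theta}$ the optimized variational distribution, and $T$ the number of Monte Carlo draws.

```latex
p(y^{*}\mid\mathbf{x}^{*},\mathcal{D})
 = \int p(y^{*}\mid\mathbf{x}^{*},\mathbf{w})\,p(\mathbf{w}\mid\mathcal{D})\,\mathrm{d}\mathbf{w}
 \;\approx\; \frac{1}{T}\sum_{t=1}^{T} p(y^{*}\mid\mathbf{x}^{*},\mathbf{w}_{t}),
 \qquad \mathbf{w}_{t}\sim q_{\theta}(\mathbf{w})
```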
The probabilistic predictions generated by our model enable the detection of unseen fault scenarios by assessing the associated uncertainty. When the model encounters an input that does not closely match any known fault patterns, the increased uncertainty in the prediction can indicate an unseen fault scenario. This is particularly valuable for distinguishing between very similar fault conditions, such as different severities of the same fault type. Traditional deterministic models often struggle with these nuances, especially when the severity of the fault varies along a continuous spectrum.
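One common way to turn Monte Carlo outputs into an "unseen fault" indicator is the entropy of the averaged class probabilities. The sketch below assumes that choice; the paper's own uncertainty measure is not specified in this section, and the function names and the threshold value are hypothetical.

```python
import math


def predictive_summary(mc_probs):
    """Summarize Monte Carlo softmax outputs (one probability vector per pass)."""
    n_passes = len(mc_probs)
    n_classes = len(mc_probs[0])
    mean = [sum(p[c] for p in mc_probs) / n_passes for c in range(n_classes)]
    # Entropy of the mean prediction: low when passes agree on one class.
    entropy = -sum(m * math.log(m) for m in mean if m > 0.0)
    label = max(range(n_classes), key=mean.__getitem__)
    return label, mean, entropy


def flag_unseen(mc_probs, threshold):
    """Flag an input as a possible unseen condition when the entropy is high."""
    label, _, entropy = predictive_summary(mc_probs)
    return entropy > threshold, label, entropy


# Passes that agree (known fault) versus passes that disagree (possible unseen fault).
confident = [[0.98, 0.01, 0.01]] * 5
ambiguous = [[0.90, 0.05, 0.05], [0.05, 0.90, 0.05], [0.05, 0.05, 0.90]]
known_flag, known_label, known_entropy = flag_unseen(confident, threshold=0.5)
unseen_flag, _, unseen_entropy = flag_unseen(ambiguous, threshold=0.5)
```

In practice the threshold would have to be calibrated on held-out data from known conditions; the value 0.5 above is only a placeholder.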
While traditional CNNs can categorize samples into predefined classes, they lack the capability to quantify uncertainty in their predictions. This limitation can lead to overconfident classifications, even when encountering unseen fault severities. In contrast, our Bayesian-inference enhanced framework can discern subtle differences between fault severities by modeling the probabilistic nuances of each condition. This approach provides a more detailed understanding of the similarities and discrepancies between similar conditions, making the model's reasoning more transparent and interpretable to humans. The ability to capture these nuances enhances the model's reliability in practical applications, where distinguishing between varying fault severities is crucial. The practical advantages of our approach, including its superior handling of unseen and uncertain scenarios, will be demonstrated in the subsequent case analysis. The workflow of the proposed framework is shown in Fig. 1.
3 Fault Diagnosis Implementation on Lab-Scale Gear Testbed
In this section, we present the details of the lab-scale gear testbed used for collecting training data with various faulty scenarios as well as the experimental data. We further report the setup and the configuration details of the BCNN constructed in this research.
3.1 Experimental Testbed and Data Collected.
Gear vibration responses contain rich and arguably implicit features that can reflect the health condition of the underlying system. In this research, we measure the vibration information directly from a lab-scale gear testbed (Fig. 2) and use the signals to conduct the subsequent gear fault diagnosis. In this testbed, a 32-tooth pinion and an 80-tooth gear are installed on the first stage input shaft, whereas a 48-tooth pinion and a 64-tooth gear are installed on the second stage output shaft. A motor is used to control the gear speed following a trapezoidal profile with speed-increase, constant-speed, and speed-decrease phases. The gear speed is related to the input shaft speed, which is measured through a tachometer. An accelerometer is placed in the vicinity to measure the gear vibration signals, which are acquired using a dSPACE system with a 20 kHz sampling frequency. The time synchronous averaging (TSA) approach is employed to minimize the measurement uncertainty and to convert the signals between the time-even and angle-even domains, leading to response signals with reduced noncoherent components [2]. This facilitates the succeeding gear fault diagnosis analysis for methodology validation. The ability to handle variable speeds in our experimental setup has significant implications for real-world applications, such as wind turbines, which operate under varying speed conditions. The probabilistic approach of our framework effectively accounts for the changes in operational dynamics, improving fault diagnosis reliability in such complex and variable environments.
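The TSA step can be sketched in pure Python as follows. This is an illustration under assumptions, not the authors' processing chain: it assumes once-per-revolution tachometer pulse times, linear interpolation between pulses, and averaging over blocks of four revolutions with 900 angle-even points per revolution (which matches the 3600 points per sample reported for the dataset). All function and variable names are hypothetical.

```python
import math
import random
from bisect import bisect_right


def interp(x, xp, fp):
    """Piecewise-linear interpolation of the points (xp, fp) at x (xp increasing)."""
    if x <= xp[0]:
        return fp[0]
    if x >= xp[-1]:
        return fp[-1]
    i = bisect_right(xp, x)  # xp[i - 1] <= x < xp[i]
    frac = (x - xp[i - 1]) / (xp[i] - xp[i - 1])
    return fp[i - 1] + frac * (fp[i] - fp[i - 1])


def angle_even_resample(t, signal, pulse_times, points_per_rev):
    """Resample a time-even signal onto a uniform shaft-angle grid."""
    n_rev = len(pulse_times) - 1
    rev_count = list(range(n_rev + 1))  # shaft angle, in revolutions, at each pulse
    resampled = []
    for k in range(n_rev * points_per_rev):
        angle = k / points_per_rev
        t_k = interp(angle, rev_count, pulse_times)  # time at which this angle is reached
        resampled.append(interp(t_k, t, signal))
    return resampled


def time_synchronous_average(angle_signal, block_len):
    """Average consecutive angle-even blocks to suppress noncoherent components."""
    n_blocks = len(angle_signal) // block_len
    if n_blocks == 0:
        raise ValueError("signal shorter than one averaging block")
    return [
        sum(angle_signal[b * block_len + j] for b in range(n_blocks)) / n_blocks
        for j in range(block_len)
    ]


# Synthetic check: constant 10 rev/s shaft, 20 kHz sampling, 32 cycles per revolution.
fs, speed, duration = 20_000, 10.0, 4.0
rng = random.Random(0)
t = [i / fs for i in range(int(fs * duration) + 1)]
noisy = [math.sin(2 * math.pi * 32 * speed * ti) + rng.gauss(0.0, 0.5) for ti in t]
pulses = [k / speed for k in range(int(speed * duration) + 1)]
angle_signal = angle_even_resample(t, noisy, pulses, points_per_rev=900)
tsa = time_synchronous_average(angle_signal, block_len=3600)
```

Because the tooth-mesh component repeats exactly from block to block while the noise does not, averaging the ten blocks in this synthetic example lowers the residual noise relative to any single block.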
In this research, we purposely introduce nine different fault conditions, shown in Fig. 3, into the pinion on the input shaft. Chipping tips are created by removing a certain amount of material from the pinion. Specifically, pinion material losses of 0.15 mm, 0.24 mm, 0.38 mm, 0.48 mm, and 0.68 mm are introduced, leading to five different damaged teeth with increasing severity levels. For each fault condition, 104 signals are collected, yielding a total of 936 (104 × 9) samples. The details of the dataset are given in Table 1, where each fault condition is assigned a fault type identification (ID) number. Unless otherwise specified, these fault-type IDs are used throughout this research to indicate the associated fault conditions. The appearances of the pinions under different fault conditions are shown in Fig. 3. In each sample, 3600 angle-even data points are recorded over the course of 4 gear revolutions, resulting in 3600 time-domain features. These features are represented by the time-series signals, which are related to different fault conditions as illustrated in Fig. 4. The time-series signals can be stored in the form of either numeric array data or image data. In this research, we use the image format to establish the deep learning model since it can potentially be extended to other computer vision-based fault diagnosis tasks. All the image data are shared and can be found via the public link.

Fig. 4 Vibration signal samples associated with different fault conditions: (a) healthy, (b) missing tooth, (c) crack, (d) spalling, (e) chipping tip 1 (least severe), (f) chipping tip 2, (g) chipping tip 3, (h) chipping tip 4, and (i) chipping tip 5 (most severe)
Table 1 Experimental data summary
Type | Fault condition | Data size |
---|---|---|
1 | Healthy | 104 |
2 | Missing tooth | 104 |
3 | Crack | 104 |
4 | Spalling | 104 |
5 | Chipping_tip_5 (least severe) | 104 |
6 | Chipping_tip_4 | 104 |
7 | Chipping_tip_3 | 104 |
8 | Chipping_tip_2 | 104 |
9 | Chipping_tip_1 (most severe) | 104 |
3.2 Bayesian Convolutional Neural Network Model Construction.
Following the formulation outlined in Sec. 2, we develop the BCNN model using the Python TensorFlow framework [53]. As mentioned, a BCNN is fundamentally a CNN built upon Bayesian learning, so its architecture is analogous to that of a CNN. While there are certain guidelines for the design of deep learning models in the literature [54], the implementation is usually case specific, aiming to ensure adequate training while avoiding both underfitting and overfitting. In neural network training, monitoring the training and validation accuracy with respect to epoch is a well-known way to assess the adequacy of model learning. Following the aforementioned guidelines and learning performance assessment, we develop the architecture of the BCNN model as shown in Fig. 5. Specifically, this architecture consists of six layers, which are detailed in Table 2. The hyperparameters listed in Table 2, including the input shape and layer configurations, were carefully selected to optimize the performance of the BCNN model. The input shape of 256 × 256 × 1 was chosen to match the resolution of the signal plot images, ensuring that the model captures essential details from the vibration signals. The convolutional layers, with the filter sizes and activation functions (ReLU) specified, were designed to efficiently extract features while keeping the model computationally manageable. These configurations were determined through empirical testing to provide a balance between model complexity and diagnostic accuracy. The size of the output layer is selected according to the case investigated.
Table 2 BCNN layer configuration
Layer | Output shape | Parameter number |
---|---|---|
Input | 256 × 256 × 1 | 0 |
Convolutional (filter: ) (ReLU) | | 608 |
Convolutional (filter: ) (ReLU) | | 36,928 |
Convolutional (filter: ) (ReLU) | | 147,584 |
Flatten | 131,072 | 0 |
Dense (Softmax) | | 2,097,160 |
Note: Padding is applied, and stride is set as 2 in both directions of 2D feature map. Total number of trainable parameters is 2,282,280.
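The parameter counts in Table 2 can be cross-checked with a short calculation. The configuration used below is inferred from the listed numbers and is not stated in the table: 3 × 3 kernels, 32/64/128 filters, "same" padding with stride 2, a mean and a scale for every kernel weight with one deterministic bias per output channel, and eight output units in the dense layer. These are assumptions, and the helper names are hypothetical.

```python
def bayesian_conv_params(kernel, c_in, c_out):
    """Mean and scale for each kernel weight, plus one deterministic bias per filter."""
    return 2 * kernel * kernel * c_in * c_out + c_out


def bayesian_dense_params(n_in, n_out):
    """Mean and scale for each weight, plus one deterministic bias per output unit."""
    return 2 * n_in * n_out + n_out


def same_padding_output(size, stride):
    """Spatial output size of a 'same'-padded convolution (ceiling division)."""
    return -(-size // stride)


size = 256                    # input images are 256 x 256 x 1
channels = [1, 32, 64, 128]   # assumed filter counts, inferred from Table 2
counts = []
for c_in, c_out in zip(channels, channels[1:]):
    counts.append(bayesian_conv_params(3, c_in, c_out))
    size = same_padding_output(size, 2)

flat = size * size * channels[-1]                 # features after the flatten layer
counts.append(bayesian_dense_params(flat, 8))     # assumed eight output units
total = sum(counts)
nine_class_dense = bayesian_dense_params(flat, 9)
```

Under these assumptions the script reproduces the listed per-layer counts, the flatten size of 131,072, and the total of 2,282,280. With the same assumptions, a nine-unit output layer would instead hold 2,359,305 parameters, so the dense-layer count in Table 2 appears to correspond to an eight-class output.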
In this research, our goal is to tackle the challenges of uncertainties and unknown faults in gear fault diagnosis. While the BCNN technique holds promise for probabilistic prediction and for combining empirical experience with deep learning inference, its true performance can only be tested by systematic case investigations. To this end, we analyze three cases: (1) normal classification, where the entire original dataset is used; (2) noisy classification, where the original dataset is contaminated with various levels of noise; and (3) extended classification, where the training data involves only a subset of the available fault labels while the testing data contains signals corresponding to unseen fault scenarios.
As a demonstration, we use nine nodes in the output layer for the first two cases, where all nine data labels are utilized for model establishment. It is worth noting that the weights and biases to be optimized in the BCNN model are represented probabilistically, i.e., using a mean and a variance. Recall from Sec. 2 that each network unknown is characterized by its mean and variance. The total number of trainable parameters is hence much greater than that of the corresponding CNN model.
4 Case Investigations
In this section, we present the details of the implementation and validation of the proposed BCNN to tackle the challenges of uncertainties and unseen faults. As mentioned, three separate cases are analyzed. We start from the usual classification with nine classes of gear conditions using the original dataset collected experimentally, followed by cases of dataset with further noise contamination and training data employing only a subset of fault labels. The performances of the last two cases will be analyzed with respect to the first case, in order to highlight the distinctive features of the proposed BCNN methodology.
4.1 Normal Classification of Existing/Known Faults Using the Entire Original Dataset.
In this first case, we follow the usual machine learning practice of splitting the entire original dataset (outlined in Table 1) into 80% training and 20% testing data. To maintain label balance, the split is applied to the subdataset of each fault condition, where 83 out of 104 samples are used as training data and the remaining 21 are used as testing data. This is known as the stratified data split [55]. In total, we therefore have 747 (83 × 9) training and 189 (21 × 9) testing data samples. The BCNN model with the architecture outlined in Table 2 is deployed for training, and 5% of the training data is held out for validation during training. The experiments were conducted using the Anaconda distribution with the Spyder IDE. The model was implemented using the TensorFlow and Keras libraries, which are well suited for deep learning tasks. The hardware platform consisted of an Alienware desktop equipped with a high-performance CPU. Other operating hyperparameters and metrics used for training are listed in Table 3. The hyperparameters in Table 3 were selected to enhance the training process of the BCNN model. The tuning process involved adjusting key hyperparameters such as the learning rate, batch size, and number of epochs. These hyperparameters were predefined within specific ranges based on empirical knowledge from similar studies, ensuring that variations in outcomes were minimal. We employed grid search to systematically explore different combinations and identify the settings that led to effective model convergence and minimized overfitting. A learning rate of 0.0002 was chosen based on preliminary experiments showing that it provided stable and efficient convergence. The batch size and number of epochs were set to 9 and 100, respectively, to ensure that the model was trained effectively without overfitting. SGD was selected as the optimization algorithm for its proven effectiveness in large-scale learning tasks, with adjustments made to balance learning efficiency and stability. ACC, the most intuitive classification accuracy metric, is used for evaluating the training and validation accuracy. It is simply the ratio of correctly predicted observations to the total observations, which can be retrieved from the confusion matrix [56]. The training and validation accuracy curves with respect to epoch are plotted in Fig. 6. The training accuracy curve exhibits a smooth and continuous increase, whereas the validation accuracy curve exhibits an overall increasing tendency with slight oscillations. Both accuracies approach nearly 100% as training proceeds, showing adequate model learning without underfitting or overfitting.
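The per-class split described in this subsection can be sketched in a few lines. This is a minimal illustration rather than the authors' code: it assumes nine fault types with 104 samples each and takes 83 training samples per type, which is what the reported totals of 747 and 189 imply. The function name `stratified_split` and the random seed are hypothetical.

```python
import random


def stratified_split(labels, n_train_per_class, seed=0):
    """Return (train_idx, test_idx) with a fixed number of training samples per class."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        members = by_class[label][:]
        rng.shuffle(members)  # shuffle within the class only, so labels stay balanced
        train_idx.extend(members[:n_train_per_class])
        test_idx.extend(members[n_train_per_class:])
    return train_idx, test_idx


# Assumed layout: 9 fault types, 104 samples per type.
labels = [fault_type for fault_type in range(1, 10) for _ in range(104)]
train_idx, test_idx = stratified_split(labels, n_train_per_class=83)
print(len(train_idx), len(test_idx))  # 747 189
```

Splitting inside each class, instead of shuffling the pooled dataset, guarantees that every fault type contributes the same number of testing samples.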
Operating hyperparameters and metrics for classification analysis
Optimization algorithm | Learning rate | Epoch size | Batch size | Loss metric | Accuracy metric |
---|---|---|---|---|---|
Stochastic gradient descent (SGD) | 0.0002 | 100 | 9 | Categorical cross-entropy and KL divergence | ACC |
Once the model is established, it can be directly used for the classification of testing data. As noted, the BCNN model inherently behaves as a stochastic neural network because of the probabilistic weights and biases incorporated. Therefore, it can produce a probabilistic prediction for each deterministic testing sample. Specifically, we adopt Monte Carlo analysis and run the emulation 1000 times to gather the distribution information of each testing sample. As an illustration, Fig. 7 gives the prediction results of two testing samples with actual fault types 1 (i.e., healthy) and 2 (i.e., missing tooth), respectively. The results are no longer point estimates; they are instead probabilistic. Here, we use the kernel density to represent such probabilistic information since it produces a smoother and more continuous distribution curve than the histogram [57]. The probabilistic distributions are obtained for all nine fault conditions/types. By comparing and interpreting these distributions, reliable decision-making can be realized.

Illustration of probabilistic predictions over particular testing samples using kernel density: (a) sample corresponding to true fault type 1 and (b) sample corresponding to true fault type 2
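As a rough illustration of how such distributions can be gathered, the sketch below repeats a stochastic forward pass 1000 times and builds a Gaussian kernel density estimate for one class. It is not the trained BCNN: `stochastic_predict` is a hypothetical stand-in that perturbs logits with random noise, and the bandwidth, grid, and noise scale are assumed values chosen only for demonstration.

```python
import math
import random
import statistics


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def stochastic_predict(rng, n_classes=9, true_class=0):
    """Hypothetical stand-in for one BCNN forward pass with freshly sampled weights."""
    logits = [rng.gauss(0.0, 0.3) for _ in range(n_classes)]
    logits[true_class] += 1.0  # assumed bias toward the true class
    return softmax(logits)


def kde_on_grid(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate at each grid point."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
        for x in grid
    ]


rng = random.Random(0)
draws = [stochastic_predict(rng) for _ in range(1000)]  # 1000 Monte Carlo runs
per_class = list(zip(*draws))  # per_class[k] holds the 1000 probabilities of class k
means = [statistics.fmean(p) for p in per_class]
stds = [statistics.stdev(p) for p in per_class]
grid = [i / 200 for i in range(201)]  # probability axis from 0 to 1
density_true = kde_on_grid(per_class[0], grid, bandwidth=0.02)
```

The per-class `means` and `stds` are the quantities that the discussion of confidence relies on, while `density_true` is the kind of smooth curve that a kernel density plot displays.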
Figure 7(a) points to fault type 1 as the most probable prediction since its probability mean is larger than that of the other fault types. The widths of the distributions are relatively small, showing the high confidence level of the prediction. Similarly, Fig. 7(b) points to fault type 2 being identified. Both are correct predictions when compared with the actual faults. One may notice that “True” or “False” is indicated in each of the aforementioned prediction plots. The y-axis shows the frequency of predicted mean probabilities within specific bins. The decision-making for these two testing samples (Fig. 7) is clearly confident. Notably, incorrect classifications often have mean probabilities around 0.1, indicating that these fault types are similar or challenging to distinguish. This highlights the value of probabilistic predictions in managing uncertainty and enhancing the reliability of fault diagnosis. However, in some scenarios, the predicted distribution mean values of all fault labels are very close, and the distribution widths may also be large. This makes it difficult for the BCNN model to reach a reliable decision by itself. Hence, in general, one should also allow the BCNN model to indicate “Uncertain” in order to avoid false predictions. The metric that defines unconfident decision-making is subjective and may depend on empirical knowledge. In this case, we set a threshold of 0.2 on the prediction probability mean. If the probability mean values of all labels are less than the threshold, the decision-making/classification is considered unconfident, even though the fault can still be identified as the fault label with the highest probability mean. Once an unconfident classification takes place, all prediction plots will be marked as “False.” To assist reliable decision-making, empirical experience and knowledge can be further incorporated, similar to human health diagnosis. This is a significant and unique strength of the BCNN model, which provides various options to increase the flexibility of decision-making [58–61].
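The confidence rule above can be written as a small function. This is a sketch under stated assumptions: the function name `decide` and the example probability vectors are hypothetical, and a prediction is treated as confident when the largest probability mean reaches the 0.2 threshold.

```python
def decide(mean_probs, true_label, threshold=0.2):
    """Classify one sample from its per-class probability means.

    Returns (predicted_label, is_accurate, is_confident); labels are 1-based fault types.
    """
    best = max(range(len(mean_probs)), key=lambda k: mean_probs[k])
    predicted = best + 1
    # Unconfident when every class mean falls below the threshold.
    confident = mean_probs[best] >= threshold
    return predicted, predicted == true_label, confident


# Illustrative probability means only (not measured data).
clear_case = [0.25] + [0.09375] * 8
close_case = [0.16, 0.14] + [0.10] * 7
print(decide(clear_case, true_label=1))  # (1, True, True)
print(decide(close_case, true_label=1))  # (1, True, False)
```

In the second example the label with the highest mean is still correct, but the sample would be flagged for empirical review because no class reaches the threshold.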
Recall that 21 testing samples are employed for each fault label. We can calculate the mean and standard deviation of the prediction probability distributions for each testing sample (Fig. 7), and then gather this statistical information for all samples corresponding to the different fault conditions to establish the probabilistic prediction result over the entire testing space, as shown in Fig. 8. Consistent with the observations from Fig. 7, the probability mean of the testing samples for the true fault type is much larger than that for the other fault types, indicating accurate classification in a probabilistic sense. Additionally, the probability standard deviations are small overall. The probability standard deviation of the testing samples for the true fault type is slightly greater than that for the other fault types. However, the standard deviation as a percentage of the mean is still small because of the associated higher probability mean. The results collectively illustrate accurate decision-making with a high confidence level. The probabilistic nature of the BCNN model thus aligns with the general decision-making logic for a stochastic process in the real world.



Classification probability outputs of testing samples corresponding to different fault types in Case 1: (a) true fault type 1, (b) true fault type 2, (c) true fault type 3, (d) true fault type 4, (e) true fault type 5, (f) true fault type 6, (g) true fault type 7, (h) true fault type 8, and (i) true fault type 9



4.2 Classification of Existing Faults Under Contaminated Measurement.
The gear fault data employed in this research were measured under strictly controlled motor speed, followed by signal processing using the TSA technique [2]. The uncertainties and noise in the dataset are thus minimized. Nevertheless, in practical situations, measurements are inevitably contaminated with various noises and even assembly uncertainties. Moreover, many real-time fault diagnosis systems are autonomous and are oftentimes applied to direct measurements without any data denoising. To examine the feasibility of this method for fault diagnosis under uncertainties, we purposely contaminate the current measurements by introducing different levels of white noise, i.e., 3% and 10% standard deviations of vibration signal amplitudes, respectively. The discrepancy between signals becomes more obvious at the higher noise level, as shown in Fig. 9. The results obtained will be compared with those presented in Sec. 4.1. We anticipate that the proposed BCNN can handle noise effectively by providing probabilistic decision-making.

Illustration of vibration signal samples with different noise levels: (a) nominal sample without additional noise, (b) sample with additional noise 3%, and (c) sample with additional noise 10%
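The noise contamination can be sketched as follows. The exact amplitude reference used in the experiments is not stated here, so this example makes an explicit assumption: the noise is zero-mean Gaussian with a standard deviation equal to the noise level times the standard deviation of the signal. The sine wave is a synthetic stand-in for a TSA vibration signal, and `add_white_noise` is a hypothetical helper name.

```python
import math
import random
import statistics


def add_white_noise(signal, level, rng):
    """Add zero-mean Gaussian noise whose std is `level` times the signal's std (assumed)."""
    sigma = level * statistics.pstdev(signal)
    return [x + rng.gauss(0.0, sigma) for x in signal]


rng = random.Random(0)
# Synthetic stand-in signal: five cycles of a unit sine over 1000 points.
signal = [math.sin(2.0 * math.pi * 5.0 * t / 1000.0) for t in range(1000)]
noisy_3 = add_white_noise(signal, 0.03, rng)   # 3% noise level
noisy_10 = add_white_noise(signal, 0.10, rng)  # 10% noise level
```

Because the noise scale is tied to each signal's own spread, the same relative contamination is applied regardless of the absolute vibration amplitude.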
We first carry out the analysis under the 3% noise level following the same procedure outlined in Sec. 4.1 and obtain the prediction results of two particular samples (Fig. 10) and the aggregated prediction results over the entire testing space (Fig. 11). Figure 10 indicates accurate fault classification when an additional 3% noise is introduced. Compared with Fig. 7, the slight difference is a small reduction of the probability mean from 0.25 to 0.24 (Fig. 7(a) versus Fig. 10(a)), accompanied by a small increase in the distribution width for the actual fault condition. While this indicates that the confidence level of decision-making for these particular samples decreases slightly, which is expected, the BCNN model is still capable of conducting the fault diagnosis task successfully. The aggregated results (Fig. 11) reflect the probabilistic nature of decision-making for all testing samples with different associated true faults. They are essentially identical to the results shown in Fig. 8, illustrating the good performance of the BCNN in handling the contaminated data.

Illustration of probabilistic predictions over particular testing samples in Case 2 (with additional noise 3%): (a) sample corresponding to true fault type 1 and (b) sample corresponding to true fault type 2



Classification probability outputs of testing samples corresponding to different fault types (with additional noise 3%): (a) true fault type 1, (b) true fault type 2, (c) true fault type 3, (d) true fault type 4, (e) true fault type 5, (f) true fault type 6, (g) true fault type 7, (h) true fault type 8, and (i) true fault type 9



We further increase the noise level to 10% to examine how robust the model is in dealing with a large degree of measurement uncertainty. The results obtained under different noise levels are compared in Fig. 12. For concise illustration, only the probability mean and standard deviation of the testing samples for the actual/true fault condition are included. Clearly, the predictions under different noise levels exhibit very small discrepancies. It appears that the worst case (10% noise level) increases the bands of both the probability mean and standard deviation for certain fault conditions, i.e., fault types 1, 8, and 9. This is reasonable since the noise naturally interferes with the decision-making. The consequence is that, even though the decision-making is accurate, its confidence level may be reduced. While the predicted probability distributions illustrate the unique advantage of the BCNN model, the crisp testing/classification accuracy is worth examining since it is a direct indicator of model performance. Without loss of generality, each testing sample is classified as the fault label with the highest probability mean. By comparing against the actual faults of the testing samples, the classification accuracy can be easily computed. Table 4 gives the number of accurately classified samples. Recall that we have 189 testing samples in total. The classification accuracies of the cases with different noise levels are all 100%, showing the excellent performance of the BCNN model. In addition, the decision-making is categorized as either confident or unconfident based upon the probability mean threshold mentioned in Sec. 4.1. When the probability mean values of all fault labels are less than the specified threshold, we regard the decision-making as unconfident. Therefore, the accurately classified samples consist of samples that are either confidently or unconfidently identified. In other words, the confidently and unconfidently classified samples are two subsets of the accurately classified samples. The same probability mean threshold, i.e., 0.2, is adopted here, and the numbers of samples with confident and unconfident decision-making, respectively, are appended in Table 4. The accurate decision-making of all testing samples is highly confident, which consistently validates the proposed methodology.

Classification probability outputs of testing samples with different actual fault types with respect to different additional noise levels: (a) probability mean and (b) probability standard deviation (solid lines denote the lower or upper bound, and dashed lines denote the mean value of the distribution)
Testing result with respect to different noise levels
Noise level | Accurate | Confident | Unconfident |
---|---|---|---|
No additional noise | 189 | 189 | 0 |
Additional noise 3% | 189 | 189 | 0 |
Additional noise 10% | 189 | 189 | 0 |
4.3 Extended Classification With Unseen Faults.
One of the goals of this research is to explore a practical way of conducting extended classification for gear diagnosis. That is, leveraging the BCNN model, we diagnose the gear system using testing data from fault conditions that the model has not seen during the training phase. This scenario is quite common in practical implementation, as gears feature (infinitely) many fault conditions, whereas the training dataset is always limited, especially in terms of fault labels. Our hypothesis is that the probabilistic nature of the BCNN framework can provide insights into unseen faults with respect to the existing fault labels employed in training, and produce useful guidance for decision-making. Recall that our dataset includes five different severity levels of the chipping fault. In practice, chipping is a continuously varying fault condition. Therefore, in our subsequent case study, we purposely hold out one of the chipping tip severity levels in BCNN training. In other words, only four chipping tip severity levels are treated as known fault labels to train the BCNN model, and the data of the remaining chipping tip severity level are used to evaluate the capacity of the BCNN to cope with unseen fault conditions. We analyze two different scenarios, i.e., holding out a minor chipping tip case and holding out a severe chipping tip case.
4.3.1 Scenario 1: Classification of a Minor Chipping Tip Fault as Unseen Fault Condition.
In this scenario, we choose the chipping tip 4, i.e., fault type 6 (Table 1) as the unseen fault. Correspondingly, we use 104 samples of chipping tip 4 as testing data and the rest of samples corresponding to other fault conditions as the training data. Since this classification analysis is not standard, the metric for assessing the classification accuracy needs to be redefined. That is, the classification is accurate when the unseen fault is identified as being close to the neighboring fault type. For this particular scenario, as long as the testing sample is classified into chipping tip 5 (fault type 5) or chipping tip 3 (type 7), it is counted as an accurately classified sample. The BCNN model architecture shown in Table 2 is employed. The major difference is that the size of output layer reduces to 8 because one fault label, i.e., chipping tip 4 is hidden from training. The same operating parameters and metrics (except the classification accuracy), as tabulated in Table 3, are also used. Once model training is completed, the BCNN model is employed for the extended classification analysis.
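The redefined accuracy metric can be illustrated with a short tally. This is a sketch, not the authors' evaluation script: `extended_tally` is a hypothetical helper, the input pairs are made-up values for demonstration, and the 0.2 confidence threshold follows the rule set in Sec. 4.1.

```python
from collections import Counter


def extended_tally(samples, accepted_labels, threshold=0.2):
    """Count outcomes for held-out samples.

    samples: list of (predicted_label, highest_probability_mean) pairs.
    accepted_labels: known fault types adjacent in severity to the unseen fault.
    """
    counts = Counter()
    for predicted, max_mean in samples:
        outcome = "accurate" if predicted in accepted_labels else "inaccurate"
        confidence = "confident" if max_mean >= threshold else "unconfident"
        counts[(outcome, confidence)] += 1
    return counts


# Illustrative inputs only: unseen fault type 6 with neighboring types 5 and 7.
samples = [(5, 0.31), (7, 0.22), (1, 0.18), (5, 0.17), (1, 0.26)]
counts = extended_tally(samples, accepted_labels={5, 7})
accurate = sum(n for (outcome, _), n in counts.items() if outcome == "accurate")
accuracy = accurate / len(samples)
print(accuracy)  # 0.6
```

Keeping the outcome and the confidence flag as separate keys makes it straightforward to report how many misclassified samples were at least flagged as unconfident.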
Similarly, we first look into the probabilistic prediction result of one particular sample, as shown in Fig. 13. Compared with the result of the normal classification analysis reported in Sec. 4.1, the interference of faults other than the actual fault becomes more significant. Specifically, even though the classification result is type 5, which is correct according to the above metric, fault type 1 (i.e., healthy condition) appears to play a role. To further elucidate the impact of different fault labels on the classification results, the prediction distributions of all 104 testing samples are generated and shown in Fig. 14. Fault type 1 is indeed a major interference according to the associated distribution of probability mean values (Fig. 14(a)). This observation can be explained by the fact that a minor chipping tip generally induces only a small vibration change compared with the healthy condition. The features of the vibration measurements under these two conditions are thus similar and difficult to discriminate. However, the distribution of probability standard deviations for fault type 1 is wider than that of fault type 5 or 7 (the correct faults to be identified), implying that a larger degree of uncertainty exists in those predictions. Intuitively, fault type 5 will be the classified fault condition for most of the testing samples.
We then analyze the probabilistic information shown in Fig. 14 to obtain the testing/classification accuracy and confidence level of the predictions, which are shown in Table 5. The same threshold, i.e., 0.2, is applied to indicate the confidence level of decision-making. Consistently, fault type 1 is found to be the major source of misclassification, to which 14 samples are misclassified. The testing accuracy is thus computed as 86.5% (90/104). Although all 14 of these samples are misclassified, 13 of them are classified unconfidently. In other words, the BCNN model may also consider those 13 samples to be either fault type 5 or 7, which matches the ground truth. This opens the possibility of enhancing the diagnosis accuracy when empirical judgment is further applied to the samples that are classified unconfidently. In total, 54 (13 + 41) samples may have pivot features between those of fault types 1 and 5, which makes them difficult to classify correctly with a high confidence level. The fundamental reason is that the features in some testing samples may exhibit a strongly nonlinear relation with respect to the fault severity. For this reason, we can conclude that the current testing accuracy of this particular scenario is satisfactory.
Testing result (scenario 1 in Case 3)
Minor chipping tip | Classified as | Total | Confident | Unconfident |
---|---|---|---|---|
Inaccurate | Type 1 | 14 | 1 | 13 |
Accurate | Type 5 | 90 | 49 | 41 |
In order to highlight the enhanced performance of BCNN model for extended fault diagnosis, we also create a CNN model with the same architecture configuration for performance comparison. It is worth noting that the number of trainable parameters is significantly reduced (i.e., 1,141,256) in the CNN model because the weights and biases become deterministic. The same training and test data split is used for classification analysis. The result shows that 60 out of 104 samples will be correctly classified as fault type 5, and the rest will be misclassified as fault type 1. The testing accuracy hence is 57.69% (60/104), which is significantly worse than the 86.5% (90/104) achieved by the BCNN model. Additionally, the CNN model only supports deterministic decision-making without considering the uncertainty effect, leading to irreversible false prediction. Overall, the BCNN model greatly outperforms its counterpart, illustrating its feasibility in performing extended fault diagnosis.
4.3.2 Scenario 2: Classification of a Severe Chipping Tip Fault as Unseen Fault Condition.
To further validate the BCNN, we formulate another scenario, and revisit the BCNN modeling and analysis for performance reexamination. In the second scenario, we hold out the data samples belonging to chipping_tip_2 (fault type 8), which is a severe chipping tip severity. Once the model is established following the same procedures indicated above, we can carry out the testing process using the hold-out samples. In this scenario, the classification is considered correct if a testing sample is classified as being close to chipping_tip_3 (fault type 7) or chipping_tip_1 (fault type 9). The prediction results over one particular sample and entire testing space are shown in Figs. 15 and 16, respectively. The classification pattern of type 9 (chipping_tip_1) in the plot is significant as it shows the model correctly identifies type 8 as most similar to type 9, validating the model's capability to discern nuanced differences between fault severities. As can be seen clearly, fault type 2, i.e., missing tooth becomes the major interference. This makes very good sense since the severe chipping tip resembles the missing tooth in terms of the fault severity. Through analyzing the prediction distribution information, we can obtain the testing result as tabulated in Table 6. Only 4 testing samples are misclassified to fault type 2 and the rest all are correctly classified as fault type 8, yielding the high testing accuracy, i.e., 96.15% (100/104). Among those correctly classified samples, 85 samples are classified confidently, showing that type 8 indeed is the most probable fault to represent the unseen fault. On the other hand, all misclassified samples are subject to unconfident decision-making, which may also have the probability to belong to the fault type 8. Similarly, a CNN model with the same architecture is established to perform classification analysis. 79 samples are correctly classified (33 and 46 samples are classified as fault type 7 and 9, respectively). 
The testing accuracy is thus 75.96% (79/104), which is still much lower than that of the BCNN model. Again, the results in this scenario demonstrate the effectiveness of the proposed BCNN for extended fault diagnosis.
To determine the threshold for prediction confidence in future cases, one can observe the probability distribution across classes. In our analysis, the 0.2 threshold corresponds to a point of minimum occurrence in the distribution, marking the transition from confident to uncertain predictions. This natural threshold can be identified visually or refined using statistical methods such as kernel density estimation or analysis of distribution overlaps to enhance confidence assessment.
In comparison to other fault diagnosis methods, such as Gaussian process, naive Bayesian, CNN, and fuzzy neural network (FNN) methods, our BCNN approach uniquely combines effective feature extraction with robust uncertainty quantification. While methods like Gaussian process and naive Bayesian offer probabilistic outputs, they struggle with scalability and capturing complex feature relationships. CNNs excel in feature extraction but lack uncertainty quantification, and FNNs require extensive manual tuning. BCNN addresses these limitations by integrating CNN's strengths with Bayesian inference, offering a more comprehensive and flexible solution for fault diagnosis.
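One possible way to locate the point of minimum occurrence mentioned in the threshold discussion is to estimate a kernel density over the highest probability means and take the valley between its two main peaks. The sketch below is only an illustration under stated assumptions: `valley_threshold` is a hypothetical helper, the bandwidth is an assumed value, and the two clusters of probability means are synthetic rather than taken from the experiments.

```python
import math
from statistics import NormalDist


def kde_on_grid(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate at each grid point."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
        for x in grid
    ]


def valley_threshold(samples, bandwidth=0.02, n_grid=401):
    """Return the density minimum between the two highest KDE peaks, or None."""
    lo, hi = min(samples), max(samples)
    grid = [lo + (hi - lo) * i / (n_grid - 1) for i in range(n_grid)]
    density = kde_on_grid(samples, grid, bandwidth)
    peaks = [
        i for i in range(1, n_grid - 1)
        if density[i - 1] < density[i] >= density[i + 1]
    ]
    if len(peaks) < 2:
        return None  # the distribution shows no clear valley
    left, right = sorted(sorted(peaks, key=lambda i: density[i])[-2:])
    valley = min(range(left, right + 1), key=lambda i: density[i])
    return grid[valley]


# Synthetic highest-probability means: an uncertain cluster and a confident cluster.
uncertain = NormalDist(0.13, 0.02)
confident = NormalDist(0.30, 0.03)
max_means = [uncertain.inv_cdf((i + 0.5) / 40) for i in range(40)]
max_means += [confident.inv_cdf((i + 0.5) / 60) for i in range(60)]
threshold = valley_threshold(max_means)
print(threshold)
```

Returning `None` when fewer than two peaks are found avoids reporting a threshold for a distribution that does not separate into confident and uncertain groups.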
Testing result (scenario 2 in Case 3)
Severe chipping tip | Classified as | Total | Confident | Unconfident |
---|---|---|---|---|
Inaccurate | Type 2 | 4 | 0 | 4 |
Accurate | Type 8 | 100 | 85 | 15 |
While the proposed framework demonstrates significant potential, it also presents certain limitations that should be addressed in future research. The computational complexity associated with Bayesian inference, especially when scaling to larger systems, poses a challenge for real-time applications in industrial settings. Additionally, the accuracy of the model's predictions is sensitive to the selection of appropriate priors, which could lead to suboptimal outcomes if not carefully chosen. Furthermore, although the framework has been successfully validated on a lab-scale gearbox system, its generalizability to other types of machinery and more complex fault scenarios requires further exploration. Future research should focus on enhancing computational efficiency, developing adaptive methods for prior selection, and expanding the framework's applicability to a broader range of industrial diagnostics.
5 Conclusion
In this study, we introduce a novel probabilistic fault diagnosis framework that leverages the strengths of Bayesian inference in convolutional neural networks. Unlike traditional deep learning methods, our approach enhances fault diagnosis by incorporating probabilistic decision-making, accounting for prediction uncertainties and thereby bolstering diagnostic robustness. This framework inherits the feature extraction capabilities of CNNs while adopting a Bayesian methodology to probabilistically understand hidden data correlations. Through the use of variational inference, expedited by integrating Monte Carlo dropout techniques into the network layers, we streamline the Bayesian-based learning process. Our model has been applied to the fault diagnosis of a lab-scale gearbox system, showcasing its ability to accurately diagnose gear faults even in the presence of significant noise. Notably, our approach excels in extended diagnostics, identifying unknown or unseen faults by their proximity to known fault categories, thus offering crucial insights for practical decision-making. This research marks a significant step forward in applying probabilistic decision-making to gear fault diagnosis, addressing the critical challenges of uncertainties and unseen faults commonly encountered in real-world scenarios. While the developed framework shows significant promise for application across various machinery diagnosis endeavors, its scalability and applicability could be further enhanced by refining computational efficiency and exploring broader industrial scenarios.
Acknowledgment
The corresponding author also appreciates the start-up fund support from The Hong Kong Polytechnic University.
Funding Data
NSF (Grant Nos. CMMI-2138522 and IIS-1741174; Funder ID: 10.13039/100000001).
Data Availability Statement
The datasets generated and supporting the findings of this article are obtainable from the corresponding author upon reasonable request.
Nomenclature
- BCNN = Bayesian convolutional neural networks
- BiConvLSTM = bidirectional convolutional long short-term memory
- CNNs = convolutional neural networks
- FNNs = fuzzy neural networks
- KL = Kullback-Leibler
- MCMC = Markov Chain Monte Carlo
- PDF = probability density function
- ReLU = rectified linear unit
- SGD = stochastic gradient descent
- TSA = time synchronous averaging