An Ensemble Feature Selection and Nested Cross-Validation Approach Using miRNA Profiles for the Automated Detection of Usher Syndrome
Laboratory Genetics and Genomics
Primary Categories:
- Laboratory Genetics
Secondary Categories:
- Laboratory Genetics
Introduction: Usher syndrome is a rare genetic disorder characterized by hearing loss, vision impairment, and balance issues. Traditional testing for Usher syndrome involves a combination of clinical assessments, genetic testing, and, in some cases, imaging studies. These methods are effective but can be time-intensive, costly, and inaccessible for many, often requiring multiple visits to specialized clinics and labs. Given these limitations, there is a critical need for more efficient, non-invasive, and automated diagnostic approaches.
Recent research suggests that microRNAs (miRNAs), which play an important role in regulating gene expression, may serve as biomarkers for Usher syndrome, with distinct miRNA profiles observed in affected individuals compared to healthy controls. These findings suggest that miRNA expression profiling could be a promising alternative to traditional diagnostic methods.
Methods: We developed a machine learning pipeline for automated detection and classification of Usher syndrome based on miRNA expression data. Our approach integrates ensemble feature selection with nested cross-validation. We used four feature selection methods (Recursive Feature Elimination, Random Forest Importance, k-best, and LASSO) to identify the top 10 miRNAs distinguishing Usher samples from controls. Gene pathway analysis was performed to evaluate the functional potential of the selected miRNAs. We trained and evaluated five classifiers: Logistic Regression (LR), Random Forest, Support Vector Machine (SVM), Extreme Gradient Boosting, and AdaBoost. To improve model robustness, we used a nested cross-validation approach combining Leave-P-out Cross-Validation (LPOCV) and Stratified K-Fold Cross-Validation (SKFCV). LPOCV split the data into training and validation sets, and SKFCV was applied within each training set to train and test each model. The top-performing model, along with the selected features, was then validated across the LPOCV validation sets to assess its generalizability.
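The two core ideas in this pipeline, combining several feature selectors and nesting a stratified split inside a leave-P-out split, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how the four selectors were combined, so simple vote counting is assumed here, and the value of P, the number of folds, the feature names (`miR-a` and so on), and all function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def ensemble_vote(selections, top_n):
    """Rank features by how many selectors chose them (ties broken by name)."""
    votes = Counter(f for chosen in selections.values() for f in set(chosen))
    ranked = sorted(votes, key=lambda f: (-votes[f], f))
    return ranked[:top_n]


def leave_p_out(indices, p):
    """Yield (train, validation) index lists for every validation set of size p."""
    indices = list(indices)
    for held_out in combinations(indices, p):
        held = set(held_out)
        yield [i for i in indices if i not in held], list(held_out)


def stratified_k_fold(indices, labels, k):
    """Yield (train, test) splits, dealing each class round-robin into k folds."""
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels[i] for i in indices)):
        members = [i for i in indices if labels[i] == cls]
        for pos, i in enumerate(members):
            folds[pos % k].append(i)
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, test


# Hypothetical selector outputs; the names are placeholders, not study results.
selections = {
    "rfe": ["miR-a", "miR-b", "miR-c"],
    "rf_importance": ["miR-b", "miR-c", "miR-d"],
    "k_best": ["miR-b", "miR-a", "miR-e"],
    "lasso": ["miR-c", "miR-b", "miR-a"],
}
top = ensemble_vote(selections, top_n=3)

# Toy cohort of six samples (0 = control, 1 = Usher); P = 2 is an assumption.
labels = [0, 0, 0, 1, 1, 1]
outer = list(leave_p_out(range(len(labels)), p=2))

# Inner loop: stratified folds built only from each outer training set.
inner_counts = [
    sum(1 for _ in stratified_k_fold(train, labels, k=2)) for train, _ in outer
]
```

In a real pipeline, feature selection and model fitting would run inside each inner training fold, and the held-out LPOCV samples would be used only for the final check, so that no information from the validation samples leaks into training.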
Results: Model performance was evaluated using 60 miRNA microarray profiles (29 controls and 31 Usher samples). During training, both the LR and SVM models performed strongly, achieving mean accuracy of 96%, mean sensitivity of 100%, mean specificity of 92%, mean F1 score of 98%, and mean area under the curve (AUC) of 98%. However, SVM sensitivity was less stable, occasionally dropping to 85%, so LR was selected as the final model for further validation. In the validation phase, LR achieved mean accuracy of 96.9%, mean sensitivity of 99%, mean specificity of 95%, mean F1 score of 94.89%, and mean AUC of 93%. In other words, the LR model correctly separated Usher samples from controls 96.9% of the time and correctly identified a true Usher profile in 99% of instances.
Moreover, we identified ten key miRNAs: hsa-miR-148a-3p, hsa-miR-183-5p, hsa-miR-146a-5p, hsa-miR-28-5p, hsa-miR-30c-5p, hsa-miR-551b-3p, hsa-miR-642a-5p, hsa-miR-181a-5p, hsa-miR-28-3p, and hsa-miR-182-5p. Pathway analysis revealed that hsa-miR-183-5p and hsa-miR-182-5p are integral to retinal development and photoreceptor cell survival, targeting genes within apoptotic and neuroinflammatory pathways—processes known to exacerbate retinal degeneration seen in Usher syndrome. Similarly, hsa-miR-146a-5p and hsa-miR-148a-3p have established roles in immune regulation and inflammation, indicating their potential involvement in neuroinflammation and tissue remodeling in the auditory and retinal cells affected by Usher syndrome. Other miRNAs such as hsa-miR-30c-5p and hsa-miR-551b-3p participate in cellular processes including synaptic plasticity and cellular stress response, which may be disrupted in sensory cells.
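For readers less familiar with the performance metrics reported in this section, the sketch below shows how accuracy, sensitivity, specificity, and F1 score follow from the confusion-matrix counts of a single split. The function name and the counts are hypothetical and chosen only for illustration; they are not the study's data. AUC is omitted because it requires predicted scores rather than counts.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return accuracy, sensitivity, specificity, and F1 from confusion counts.

    Assumes each denominator is non-zero (both classes are present).
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # true positive rate (Usher detected)
        "specificity": tn / (tn + fp),   # true negative rate (controls cleared)
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


# Hypothetical counts for one validation split (not the study's data).
metrics = classification_metrics(tp=9, fp=2, tn=8, fn=1)
```

The reported means would then be averages of these per-split values across all cross-validation splits.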
Conclusion: These results suggest that miRNA expression profiling, coupled with machine learning, offers a promising alternative to traditional methods for diagnosing Usher syndrome.