Learning phenotypes from 4M+ patients with paired genotypes and phenotypes

Laboratory Genetics and Genomics
  • Primary Categories:
    • Clinical Genetics
  • Secondary Categories:
    • Clinical Genetics
Introduction:


Current methods for defining clinical criteria for variant classification struggle to scale to the vast number of genes and variants encountered in clinical genetic testing. Furthermore, expert-defined criteria may not always accurately reflect real-world phenotypic presentations. A very large real-world dataset (4M+ patients) of paired phenotype and genotype information, combined with a machine learning approach that includes large language models (LLMs), offers a flexible way to learn phenotypes across clinical areas.

Methods:


A pretrained LLM (GatorTron) was fine-tuned via curriculum learning on a corpus of clinical free text extracted from test requisition forms. Patient labels were derived from the results of genetic testing. This model produced a text score that was combined with other demographic and phenotypic data, such as ICD-10 codes, age at testing, and sex, to train a second model (XGBoost) that generated a patient score. The patient score represented the probability that the patient's clinical presentation was consistent with a particular genetic condition. Topic modeling of high-scoring patient text yielded phenotypes associated with each condition. Partial dependence plots and SHAP scores were inspected to identify relationships between features and the patient score. Age-specific ascertainment rates for individual phenotypes were assessed for several conditions.
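The two-stage design and the partial dependence inspection described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `build_features`, `ICD_VOCAB`, and `toy_patient_score` are hypothetical names, the three ICD-10 codes are an illustrative vocabulary, and a hand-weighted logistic function stands in for the trained XGBoost model so that the example runs with the Python standard library alone.

```python
from math import exp
from typing import Callable, Sequence

# Hypothetical, tiny ICD-10 vocabulary (assumption, for illustration only).
ICD_VOCAB = ["E84.9", "J47.9", "N46.9"]


def build_features(text_score: float, age: float, sex: str,
                   icd_codes: Sequence[str]) -> list[float]:
    """Concatenate the stage-1 text score with structured patient data.

    Layout: [text_score, age, is_female, one indicator per ICD_VOCAB code].
    """
    codes = set(icd_codes)
    return ([text_score, age, 1.0 if sex == "F" else 0.0]
            + [1.0 if code in codes else 0.0 for code in ICD_VOCAB])


def toy_patient_score(x: Sequence[float]) -> float:
    """Stand-in for the second-stage model (NOT XGBoost).

    The weights below are invented for illustration only.
    """
    z = -2.0 + 3.0 * x[0] - 0.01 * x[1] + 1.5 * x[3] + 1.0 * x[4]
    return 1.0 / (1.0 + exp(-z))


def partial_dependence(model: Callable[[Sequence[float]], float],
                       rows: Sequence[Sequence[float]],
                       feature_index: int,
                       grid: Sequence[float]) -> list[float]:
    """Average model output with one feature forced to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = list(row)          # copy so the input is untouched
            modified[feature_index] = value
            total += model(modified)
        curve.append(total / len(rows))
    return curve


# Three made-up patients: (text score, age at testing, sex, ICD-10 codes).
patients = [
    build_features(0.9, 14, "M", ["J47.9"]),
    build_features(0.2, 45, "F", []),
    build_features(0.7, 32, "M", ["N46.9"]),
]

# Partial dependence of the patient score on the text score (feature 0).
pd_text = partial_dependence(toy_patient_score, patients, 0, [0.0, 0.5, 1.0])
```

With a real gradient-boosted model, `toy_patient_score` would be replaced by the trained model's probability output; the partial dependence loop itself is model-agnostic.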

Results:


Across a wide range of clinical areas, including 1,876 unique genes associated with 1,334 molecular conditions, the patient score model demonstrated high performance (test AUC ≥ 0.8) in 728 conditions. Partial dependence plots and SHAP scores were generated for all 1,334 conditions. Several conditions yielded notable phenotypic findings. Among mismatch repair genes, MSH6 showed a stronger association with endometrial cancer than with colorectal cancer, relative to MSH2, MLH1, or PMS2. For the cystic fibrosis (CFTR) model (test AUC = 0.92), topic modeling of high-scoring phenotypes yielded distinct phenotypes with age-specific incidence. For example, bronchiectasis/recurrent infection was observed at a median age of 14, and absent vas deferens was observed in probands tested at reproductive age (median age of 32).
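The per-condition evaluation above (AUC on a held-out test set, then a count of conditions at or above 0.8) can be illustrated as follows. This is a sketch with made-up numbers, not the study's data or code: `roc_auc` uses the standard pairwise (Mann-Whitney) definition of AUC, and `count_high_performing` and the placeholder condition names are hypothetical.

```python
from typing import Mapping, Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Pairwise AUC: chance that a positive outscores a negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def count_high_performing(auc_by_condition: Mapping[str, float],
                          threshold: float = 0.8) -> int:
    """Number of conditions whose test AUC meets the threshold."""
    return sum(1 for auc in auc_by_condition.values() if auc >= threshold)


# Toy test set for one condition: 3 of the 4 positive/negative pairs
# are ranked correctly.
example_auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])

# Placeholder conditions with invented AUC values.
toy_aucs = {"condition_a": 0.92, "condition_b": 0.80, "condition_c": 0.65}
n_high = count_high_performing(toy_aucs)
```

The quadratic pairwise loop is chosen for clarity; on test sets of realistic size, a sort-based rank computation gives the same value far more efficiently.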

Conclusion:


Our study demonstrates that combining large-scale, real-world datasets with advanced machine learning techniques, including fine-tuned LLMs, makes it possible to learn and predict the phenotypic presentations associated with a wide array of genetic conditions. This approach offers a scalable solution for identifying phenotypic patterns in clinical genetics. By quantifying the contributions of individual phenotypic features and demographics, our model can enhance the accuracy of genetic diagnoses and support personalized medicine initiatives.
