Use of artificial intelligence tools and electronic health record data for pancreatic cancer risk prediction

Jan. 08, 2025

Pancreatic cancer (PC) is the third-leading cause of cancer deaths in the United States, with an extremely low five-year survival rate of 13%. According to Mayo Clinic gastroenterologist and pancreatologist Shounak Majumder, M.D., this poor prognosis is largely attributed to the fact that most patients are diagnosed at a stage when the malignancy is either locally advanced or has metastasized to distant sites. Dr. Majumder directs the High-Risk Pancreas Clinic, which conducts PC screening in individuals with certain familial and genetic risk factors at Mayo Clinic in Rochester, Minnesota.

Detecting PC at an early, asymptomatic stage can positively impact survival rates, but currently there is no population-based screening strategy for this disease. Dr. Majumder and colleagues are actively engaged in research exploring new ways to address these challenges. Noting that accurate assessment of risk factor status requires manual review of the electronic health record (EHR) by experts, Dr. Majumder and colleagues studied the use of natural language processing (NLP) for automated extraction of PC risk factors from unstructured clinical notes in the EHR. The results of this study were published in Pancreatology in 2024.

In a systematic review published in the American Journal of Gastroenterology in 2024, Dr. Majumder and colleagues extracted and reviewed data from 30 studies to assess machine learning (ML) methods for predicting PC risk and identifying novel risk factors from EHR data.

In this Q&A, Dr. Majumder discusses this vein of research and what the recent findings say about a potential role for AI in the creation of tools designed to accurately identify pancreatic cancer risk and novel risk factors using EHR data.

Why is this an important area of research?

Based on expert consensus, PC screening is currently considered in individuals with extensive family history of the disease and germline variants in PC susceptibility genes. Identifying these high-risk individuals using data within the EHR and connecting them to appropriate screening programs requires time and expertise that are not widely available. While this approach of risk-based screening has helped shift diagnosis to an earlier stage and prolong survival, 80% to 85% of PC cases are sporadic, occurring in individuals without known familial or genetic risk. These issues pose a significant barrier to the paradigm of risk-based PC screening. Therefore, there is a critical need to automate the identification of individuals with familial and genetic risk of PC and to identify novel risk factors for sporadic PC.

AI- and ML-based applications are poised to transform health data summarization and visualization capabilities. This presents an opportunity to leverage advances in AI and ML capabilities to develop EHR-based applications that accurately identify both known and novel risk factors for PC.

What is the significance of the findings presented in your most recent publications, and how might they guide clinical practice?

In the two Mayo Clinic research publications highlighted here, we developed NLP algorithms that identify familial and genetic risk of PC from unstructured clinical notes within the EHR. We also performed a systematic review to gain insight into the current state of EHR-based AI-ML approaches for estimating patient-level risk of PC.

In our NLP algorithm study, we concluded that rule-based NLP algorithms applied to unstructured clinical notes within the EHR are highly sensitive for automated identification of PC risk factors. These findings are a first step toward automated detection of the high-risk patient population that would benefit from risk-based PC screening. This is especially relevant in the context of PC, which is a lethal cancer for which there is no population-level screening.
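To make the idea of a rule-based NLP pass concrete, here is a minimal Python sketch of how pattern rules might flag familial and genetic risk mentions in free-text notes, along with the usual sensitivity calculation (true positives divided by true positives plus false negatives). It is an illustration only and not the algorithm from the published study: the term lists, the gene list, the crude negation filter, and the function names flag_note and sensitivity are all assumptions made for this example.

```python
import re

# Hypothetical, illustrative rules. These are NOT the rules used in the study.
FAMILY_TERMS = (
    r"(?:mother|father|sister|brother|aunt|uncle|"
    r"grandmother|grandfather|sibling|parent)"
)
PC_TERMS = r"(?:pancreatic (?:cancer|adenocarcinoma)|pancreas cancer)"
GENES = ("BRCA1", "BRCA2", "PALB2", "ATM", "CDKN2A")  # example gene list only

# Crude negation filter: any sentence containing one of these cues is skipped.
NEGATION = re.compile(r"\b(?:no|denies|negative for|without)\b", re.IGNORECASE)

# A relative and a PC term in the same sentence, in either order.
FAMILY_RULE = re.compile(
    rf"\b{FAMILY_TERMS}\b[^.]*?\b{PC_TERMS}\b"
    rf"|\b{PC_TERMS}\b[^.]*?\b{FAMILY_TERMS}\b",
    re.IGNORECASE,
)
# A gene name followed by "mutation" or "variant" in the same sentence.
GENE_RULE = re.compile(
    r"\b(?:" + "|".join(GENES) + r")\b[^.]*?\b(?:mutation|variant)\b",
    re.IGNORECASE,
)


def flag_note(note: str) -> dict:
    """Return which risk-factor categories the rules detect in one note."""
    result = {"familial": False, "genetic": False}
    # Split into sentences on whitespace that follows ., ! or ?
    for sentence in re.split(r"(?<=[.!?])\s+", note):
        if NEGATION.search(sentence):
            continue  # skip negated sentences entirely (a simplification)
        if FAMILY_RULE.search(sentence):
            result["familial"] = True
        if GENE_RULE.search(sentence):
            result["genetic"] = True
    return result


def sensitivity(predicted, actual) -> float:
    """Sensitivity = TP / (TP + FN), compared against expert-review labels."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fn = sum(1 for p, a in zip(predicted, actual) if a and not p)
    return tp / (tp + fn) if (tp + fn) else float("nan")
```

In a sketch like this, the labels passed as actual would come from manual expert review of the same notes. Skipping every sentence that contains a negation cue is deliberately simple and would miss true mentions in mixed sentences, which is one reason real clinical NLP rules need careful validation.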

In our systematic review, we found that several groups have aimed to develop ML models using EHR data to predict PC risk, with variable success. Most studies relied on a curated set of known predictors to develop their models instead of utilizing unbiased approaches using the full spectrum of EHR data, such as combining structured data with unstructured data from clinical notes. Moreover, missing data were underreported, and explainable-AI techniques were underutilized. To address these issues, and based on our interpretation of published studies, we have summarized a list of best practices and recommendations for consideration in future studies focusing on EHR-based AI-ML model development for PC.

Can you elaborate on how additional research might further advance the field?

The performance of rule-based NLP algorithms for identification of familial and genetic risk of PC can be further enhanced by incorporating emerging tools such as large language models, with subsequent validation in a real-world primary care cohort. In ongoing studies, we are exploring pathways to clinical implementation of this digital risk phenotyping tool as we seek to understand the impact on both patient- and healthcare professional-level outcomes.

Additional research will also need to focus on developing EHR-based AI-ML models for identification of novel risk factors for sporadic PC within diverse real-world population cohorts, leveraging longitudinal data. While the focus is on the development of the most accurate AI-ML models for estimating PC risk, it will be equally important to minimize the risk of inaccurate, biased estimates that lack explainability.

For more information

Sarwal D, et al. Identification of pancreatic cancer risk factors from clinical notes using natural language processing. Pancreatology. 2024;24:572.

Mishra AK, et al. Machine learning models for pancreatic cancer risk prediction using electronic health record data: A systematic review and assessment. The American Journal of Gastroenterology. 2024;119:1466.
