Tokenization Tradeoffs in Structured EHR Foundation Models
arXiv:2603.15644v1 Announce Type: new Abstract: Foundation models for structured electronic health records (EHRs) are pretrained on longitudinal sequences of timestamped clinical events to learn adaptable patient representations. Tokenization -- how these timelines are converted into discrete model inputs -- determines what information is preserved, how efficiently it is encoded, and which relationships must be learned versus precomputed. Yet the impact of tokenization design choices on downstream performance and computational efficiency remains largely unexplored. Here, we pretrained a transformer on pediatric EHR data under a factorial design, varying tokenization along event encoding, time encoding, and workflow annotation. We evaluated area-under-the-receiver-operating-characteristic curve across 74 clinical prediction tasks. Joint event encoding and positional time encoding outperformed their alternatives (73/74 and 71/74 tasks) while requiring 39.5% and 9.6% fewer pretraining floating-point operations, respectively. Targeted ablations traced the joint encoding advantage to local binding efficiency, that is, code-attribute pairs are combined into single tokens, rather than split across tokens that the model must learn to associate during pretraining. External evaluation on an adult intensive care unit cohort demonstrated that this advantage generalizes despite substantial vocabulary mismatch, while temporal and workflow effects remain institution-specific. These results establish tokenization as a tractable lever for improving both the performance and efficiency of EHR foundation models.
Executive Summary
The article examines how tokenization design choices affect the performance and efficiency of foundation models for structured electronic health records (EHRs). The authors pretrained a transformer on pediatric EHR data under a factorial design, varying tokenization along event encoding, time encoding, and workflow annotation, and evaluated AUROC across 74 clinical prediction tasks. Joint event encoding and positional time encoding outperformed their alternatives on 73/74 and 71/74 tasks, respectively, while requiring 39.5% and 9.6% fewer pretraining floating-point operations. The study establishes tokenization as a tractable lever for improving both the performance and efficiency of EHR foundation models.
Key Points
- ▸ Tokenization design choices significantly impact the performance and efficiency of EHR foundation models
- ▸ Joint event encoding and positional time encoding outperform their alternatives in most clinical prediction tasks
- ▸ The advantage of joint encoding is attributed to local binding efficiency, where code-attribute pairs are combined into single tokens
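The local-binding idea behind the joint-encoding advantage can be illustrated with a toy sketch. Note this is a hypothetical illustration of the two strategies, not the authors' actual vocabulary or code: the token format (`CODE|ATTRIBUTE`) and the example events are assumptions.

```python
# Toy contrast between split and joint event encoding for a clinical timeline.
# Event names and the "|" separator are illustrative assumptions.

def tokenize_split(events):
    """Split encoding: each code and attribute becomes its own token,
    so the model must learn to bind the pair during pretraining."""
    tokens = []
    for code, attr in events:
        tokens.append(code)
        tokens.append(attr)
    return tokens

def tokenize_joint(events):
    """Joint encoding: each code-attribute pair is fused into one token,
    so the binding is precomputed rather than learned."""
    return [f"{code}|{attr}" for code, attr in events]

timeline = [("LAB_GLUCOSE", "HIGH"), ("MED_INSULIN", "ADMINISTERED")]

print(tokenize_split(timeline))
# 4 tokens: ['LAB_GLUCOSE', 'HIGH', 'MED_INSULIN', 'ADMINISTERED']
print(tokenize_joint(timeline))
# 2 tokens: ['LAB_GLUCOSE|HIGH', 'MED_INSULIN|ADMINISTERED']
```

The joint variant halves the sequence length here, which is consistent with the paper's observation that joint encoding reduces pretraining compute, at the cost of a larger vocabulary.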
Merits
Improved Performance
The study demonstrates that optimized tokenization improves performance on nearly all of the 74 clinical prediction tasks evaluated, with direct implications for downstream clinical decision support
Increased Efficiency
Joint event encoding and positional time encoding require 39.5% and 9.6% fewer pretraining floating-point operations, respectively, making the model more computationally efficient to train
Demerits
Institution-Specific Temporal and Workflow Effects
The study finds that temporal and workflow effects remain institution-specific, which may limit the generalizability of the results across different healthcare settings
Vocabulary Mismatch
The study notes that there is a substantial vocabulary mismatch between the pediatric EHR data used for training and the adult intensive care unit cohort used for external evaluation, which may affect the model's performance
Expert Commentary
The article makes a significant contribution to the field of EHR foundation models, highlighting the importance of tokenization design choices for model performance and efficiency. The findings have practical implications for clinical machine learning, particularly with regard to building more effective and efficient EHR modeling pipelines. However, the study also raises questions about the explainability and transparency of these models, which future research must address. Overall, the article underscores the need for further work on EHR foundation models that generalize across healthcare settings while delivering accurate and reliable predictions.
Recommendations
- ✓ Future studies should investigate the impact of tokenization design choices on EHR foundation models in different healthcare settings
- ✓ The development of more effective and efficient EHR systems should prioritize data standardization and optimized tokenization