Multi-Method Validation of Large Language Model Medical Translation Across High- and Low-Resource Languages
arXiv:2603.22642v1 Announce Type: new Abstract: Language barriers affect 27.3 million U.S. residents with non-English language preference, yet professional medical translation remains costly and often unavailable. We evaluated four frontier large language models (GPT-5.1, Claude Opus 4.5, Gemini 3 Pro, Kimi K2) translating 22 medical documents into 8 languages spanning high-resource (Spanish, Chinese, Russian, Vietnamese), medium-resource (Korean, Arabic), and low-resource (Tagalog, Haitian Creole) categories using a five-layer validation framework. Across 704 translation pairs, all models achieved high semantic preservation (LaBSE greater than 0.92), with no significant difference between high- and low-resource languages (p = 0.066). Cross-model back-translation confirmed results were not driven by same-model circularity (delta = -0.0009). Inter-model concordance across four independently trained models was high (LaBSE: 0.946), and lexical borrowing analysis showed no correlation between English term retention and fidelity scores in low-resource languages (rho = +0.018, p = 0.82). These converging results suggest frontier LLMs preserve medical meaning across resource levels, with implications for language access in healthcare.
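The reported rho = +0.018 is a Spearman rank correlation between per-document English term retention and fidelity scores. As an illustration of that statistic only (the retention and fidelity values below are invented, not taken from the paper), a minimal pure-Python Spearman's rho can be sketched as:

```python
from statistics import mean

def rank(values):
    # Assign 1-based ranks, averaging ranks across ties (the standard
    # convention for Spearman's rho).
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average position of the tied run, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    # Spearman's rho = Pearson correlation of the rank vectors.
    rx, ry = rank(x), rank(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-document values (NOT from the study): fraction of
# English terms retained vs. LaBSE fidelity score.
retention = [0.05, 0.12, 0.08, 0.20, 0.15]
fidelity = [0.95, 0.93, 0.96, 0.94, 0.92]
print(f"rho = {spearman_rho(retention, fidelity):+.3f}")
```

A rho near zero, as the paper reports for low-resource languages, would indicate that documents retaining more untranslated English terms do not systematically score higher or lower on fidelity.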
Executive Summary
This study presents a rigorous multi-method validation of the medical translation capabilities of leading large language models across high-, medium-, and low-resource languages. Using four frontier LLMs (GPT-5.1, Claude Opus 4.5, Gemini 3 Pro, and Kimi K2), the researchers evaluated translation performance with a five-layer framework across 22 medical documents and eight languages. The findings indicate consistently high semantic preservation (LaBSE > 0.92) regardless of resource classification, suggesting that frontier LLMs may provide a viable alternative to traditional, costly professional translation services. The absence of a significant performance gap between resource levels, together with the lack of circularity bias, further supports the generalizability of these results, and the convergence of independent validation metrics across multiple models strengthens the reliability of the conclusions.
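The semantic-preservation scores cited here are LaBSE-based: source and translated (or back-translated) sentences are embedded and compared by cosine similarity. The sketch below shows only that comparison step, using small toy vectors as stand-ins for real 768-dimensional LaBSE embeddings (an assumption for illustration; producing the embeddings requires a sentence-embedding model, which is omitted here):

```python
import math

def cosine_similarity(u, v):
    # cos(u, v) = u·v / (|u| |v|): the similarity score that, computed
    # over LaBSE sentence embeddings, yields the fidelity values in the study.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy stand-ins for embeddings of a source sentence and its
# back-translation (real LaBSE vectors have 768 dimensions).
source = [0.12, 0.85, 0.31, 0.40]
back_translation = [0.10, 0.88, 0.29, 0.42]
score = cosine_similarity(source, back_translation)
print(f"semantic preservation: {score:.3f}")
```

In the study's framework, a score above 0.92 on such comparisons would count as high semantic preservation; cross-model back-translation repeats the same comparison with a different model producing the back-translation, guarding against same-model circularity.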
Key Points
- ▸ All models achieved high semantic preservation (LaBSE > 0.92)
- ▸ No significant difference between high- and low-resource language translation performance
- ▸ Inter-model concordance was high (LaBSE: 0.946)
Merits
Robust Validation Framework
The use of a five-layer validation framework across diverse language categories enhances the credibility of the findings.
Cross-Model Consistency
High inter-model concordance across independently trained models strengthens the validity of the translation efficacy claims.
Demerits
Limited Real-World Application Data
The study is based on simulated medical documents rather than actual clinical translation scenarios, which may limit applicability.
Absence of Human Review
No human evaluators were involved in the validation process, potentially missing nuances in medical terminology or cultural context.
Expert Commentary
This work represents a significant advance in evaluating the practicality of LLMs for medical translation. The convergence of metrics across multiple models and across resource levels is particularly compelling. While the study avoids direct clinical validation, the statistical convergence of semantic preservation across diverse linguistic contexts suggests that these models are sufficiently reliable for preliminary translation in non-critical contexts. Importantly, the absence of bias toward resource level implies that LLMs may bridge gaps in medical access without exacerbating existing disparities. However, the study's limitations, particularly the absence of human review and the reliance on synthetic documents, must be addressed before clinical adoption. Future research should incorporate clinical validation panels and cross-cultural sensitivity assessments to strengthen the evidence base. In the meantime, policymakers should cautiously explore LLM-assisted translation as a stopgap pending more rigorous clinical trials.
Recommendations
- ✓ Healthcare organizations should pilot LLM-based translation systems in low-resource language settings with oversight committees to monitor accuracy and usability.
- ✓ Regulatory bodies should initiate dialogues with LLM developers to establish standards for medical translation accuracy, particularly for low-resource languages.
Sources
Original: arXiv - cs.CL