General Summary
Exploring Large Multimodal Models (LMMs) in Medical Visual Question Answering (Med-VQA) reveals hidden reliability gaps. Despite achieving high accuracy on existing benchmarks, these models falter under more robust evaluation conditions, often performing worse than random guessing on medical diagnosis questions. This finding underscores the need for more rigorous testing datasets and evaluation methods to ensure their efficacy in critical medical applications.
Limitations of Current Benchmarks
While LMMs exhibit impressive performance on standard benchmarks, this study highlights a significant gap between those results and real-world reliability. Standard benchmarks may fail to adequately challenge the nuanced understanding required for medical diagnostics, leading to an overestimation of model efficacy.
Introduction of the ProbMed Dataset
The researchers introduce the Probing Evaluation for Medical Diagnosis (ProbMed) dataset to address this evaluation shortfall. The dataset is designed to rigorously assess LMM performance on medical imaging. It employs probing evaluation, pairing each original question with a negated counterpart built around hallucinated attributes, thus testing both the model's diagnostic accuracy and its reasoning capabilities.
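To make the pairing idea concrete, below is a minimal Python sketch of how a probing pair might be scored; the `ask_model` helper, the field names, and the example findings are illustrative assumptions rather than the paper's actual code or data.

```python
# Minimal sketch of probing-style scoring: an item only counts as correct if the
# model answers BOTH the original question and its adversarial counterpart
# (a negated question built around a hallucinated attribute) correctly.

def ask_model(image_path: str, question: str) -> str:
    """Hypothetical stand-in for an LMM API call; expected to return 'yes' or 'no'."""
    raise NotImplementedError

def score_probing_pair(image_path: str, pair: dict) -> bool:
    """Credit the pair only when both answers match the ground truth."""
    original_ok = ask_model(image_path, pair["question"]) == pair["answer"]
    adversarial_ok = ask_model(image_path, pair["adversarial_question"]) == pair["adversarial_answer"]
    return original_ok and adversarial_ok

# Illustrative pair: the adversarial question asks about a finding that is NOT
# present in the image, so the correct answer flips to "no".
example_pair = {
    "question": "Is there cardiomegaly in this chest X-ray?",
    "answer": "yes",
    "adversarial_question": "Is there a rib fracture in this chest X-ray?",
    "adversarial_answer": "no",
}
```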
Challenges in Procedural Diagnosis
In addition to probing evaluation, the study emphasizes the importance of procedural diagnosis. This requires the model to reason across various diagnostic dimensions for each image, including modality recognition, organ identification, clinical findings, abnormalities, and positional grounding. These multi-faceted questions ensure a comprehensive assessment of the model’s diagnostic reasoning.
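As a rough illustration of what procedural evaluation over these dimensions could look like, the sketch below chains question categories per image and tracks per-dimension accuracy; the category names mirror the list above, while the data structures and helper names are assumptions made for illustration, not ProbMed's implementation.

```python
# Rough sketch of procedural diagnosis: each image is probed along a fixed chain
# of diagnostic dimensions, and accuracy is tracked per dimension so that
# fine-grained weaknesses (e.g. positional grounding) remain visible.

from collections import defaultdict

DIMENSIONS = [
    "modality",     # e.g. X-ray vs. CT vs. MRI
    "organ",        # which organ the image shows
    "findings",     # clinical findings present in the image
    "abnormality",  # whether an abnormality exists at all
    "position",     # where a finding is located (positional grounding)
]

def evaluate_procedurally(dataset, score_pair):
    """dataset: iterable of (image_path, {dimension: question_pair}) items;
    score_pair: a pair-scoring function such as score_probing_pair above."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for image_path, questions_by_dim in dataset:
        for dim in DIMENSIONS:
            if dim not in questions_by_dim:
                continue
            total[dim] += 1
            if score_pair(image_path, questions_by_dim[dim]):
                correct[dim] += 1
    return {dim: correct[dim] / total[dim] for dim in total}
```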
Performance of Top Models
The evaluation reveals that leading LMMs like GPT-4o, GPT-4V, and Gemini Pro perform surprisingly poorly, often worse than random guessing, on specialized diagnostic questions. This stark underperformance underscores the limitations of these models in handling fine-grained medical inquiries, which are crucial for accurate diagnoses.
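For intuition on what "worse than random" means under paired questioning, the back-of-the-envelope snippet below contrasts the naive random baseline for a single yes/no question with the baseline for a probing pair where both answers must be correct; the paper's exact baseline construction may differ.

```python
# Naive random baselines for yes/no questions.
single_question_baseline = 0.5         # a random guesser is right half the time
paired_question_baseline = 0.5 * 0.5   # both answers in a pair must be correct

print(f"single-question random baseline: {single_question_baseline:.0%}")  # 50%
print(f"paired-question random baseline: {paired_question_baseline:.0%}")  # 25%
```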
Struggles with General Questions
Moreover, models such as LLaVA-Med struggle even with more general medical questions. This finding suggests that, despite their advanced architectures, these models lack the domain-specific expertise required for reliable application in medical contexts.
Transferability of Expertise
Interestingly, results from models like CheXagent demonstrate some transfer of expertise across different imaging modalities of the same organ. This indicates that specialized domain knowledge remains essential for improving model performance, while also highlighting the potential for cross-modality learning.
Call for More Robust Evaluations
The study concludes with a strong call for more robust evaluation mechanisms to ensure the reliability of LMMs in the critical field of medical diagnosis. Current LMMs, while advanced, have yet to reach the level of reliability necessary for practical application in medical diagnostics, necessitating further research and development.
Resource
Read more in "Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA".