IJSSER

Title: TESTING THE ACCURACY OF MODERN LLMS IN ANSWERING GENERAL MEDICAL PROMPTS
Authors: Sahil Narula, Sanaa Karkera, Rushil Challa, Sarina Virmani, Nithya Chilukuri, Mason Elkas, Nidhi Thammineni, Ankita Kamath, Parth Jaiswal and Abhishek Krishnan
Affiliation: Duke University, United States
MLA 8: Narula, Sahil, et al. "TESTING THE ACCURACY OF MODERN LLMS IN ANSWERING GENERAL MEDICAL PROMPTS." Int. j. of Social Science and Economic Research, vol. 8, no. 9, Sept. 2023, pp. 2793-2802, doi.org/10.46609/IJSSER.2023.v08i09.021. Accessed Sept. 2023.
APA 6: Narula, S., Karkera, S., Challa, R., Virmani, S., Chilukuri, N., Elkas, M., & Thammineni, N. (2023, September). TESTING THE ACCURACY OF MODERN LLMS IN ANSWERING GENERAL MEDICAL PROMPTS. Int. j. of Social Science and Economic Research, 8(9), 2793-2802. Retrieved from https://doi.org/10.46609/IJSSER.2023.v08i09.021
Chicago: Narula, Sahil, Sanaa Karkera, Rushil Challa, Sarina Virmani, Nithya Chilukuri, Mason Elkas, Nidhi Thammineni, Ankita Kamath, Parth Jaiswal, and Abhishek Krishnan. "TESTING THE ACCURACY OF MODERN LLMS IN ANSWERING GENERAL MEDICAL PROMPTS." Int. j. of Social Science and Economic Research 8, no. 9 (September 2023), 2793-2802. Accessed September, 2023. https://doi.org/10.46609/IJSSER.2023.v08i09.021.
ABSTRACT: The rising use of large language models (LLMs) for answering medical questions calls for an evaluation of their accuracy, especially given the implications for public health. This study used a test suite of 500 medical prompts, with responses evaluated by a panel of medical experts for factual accuracy, contextual relevance, and potential risk. The responses from state-of-the-art LLMs were also compared with answers from a control group of medical students. Results indicated a high level of accuracy among LLMs, with a median score of 88%. LLMs performed well on general wellness questions (92% accuracy) but were less reliable for specialized medical queries (80% accuracy), where the control group of medical students outperformed them. In conclusion, LLMs demonstrate a high degree of factual accuracy for general medical information but are less reliable for specialized or complex health-related queries. Given their widespread use, LLMs could serve as a preliminary source of general medical advice, but their limitations underscore the need to consult experts about specialized medical conditions. Future work should focus on enhancing the models' capabilities in specialized domains and on evaluating the ethical implications of using LLMs to disseminate medical information. This study serves as a baseline for the responsible use of AI in healthcare.
