Evaluating the Accuracy of Avey: A Clinical Vignettes Study

To evaluate the accuracy of Avey, we designed a comprehensive scientific methodology built on the standard clinical vignette approach. Following this methodology, we compiled and peer-reviewed 400 vignettes with 7 external medical doctors, using a super-majority voting scheme. To the best of our knowledge, this yielded the largest benchmark vignette suite in the domain. Moreover, we defined and used 7 standard accuracy metrics, one of which measures, for the first time in the field, the ranking quality of the differential diagnoses generated by self-diagnostic systems and doctors.
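As a rough illustration of how a super-majority vote over reviewer decisions might work, the sketch below accepts a vignette only when the share of approving reviewers reaches a threshold. The function name `passes_supermajority` and the two-thirds threshold are assumptions made for this example; the text above does not state the exact rule used in the study.

```python
from fractions import Fraction


def passes_supermajority(votes, threshold=Fraction(2, 3)):
    """Return True if the share of approving votes meets the threshold.

    `votes` is a list of booleans, one per reviewer (True = approve).
    The default two-thirds threshold is a hypothetical choice, not a
    value taken from the study.
    """
    if not votes:
        raise ValueError("at least one vote is required")
    approvals = sum(1 for vote in votes if vote)
    # Exact rational arithmetic avoids floating-point edge cases at the boundary.
    return Fraction(approvals, len(votes)) >= threshold


# Seven reviewers, five of whom approve the vignette.
accepted = passes_supermajority([True, True, True, True, True, False, False])
# Seven reviewers, only four of whom approve.
rejected = passes_supermajority([True, True, True, True, False, False, False])
```

With seven reviewers and a two-thirds threshold, five approvals are enough to accept a vignette, while four are not.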

We used our benchmark vignette suite and accuracy metrics to study the performance of Avey and 5 other major self-diagnostic systems, namely, Ada [1], K Health [2], Buoy [3], Babylon [4], and WebMD [5]. Results show that Avey significantly outperforms the 5 systems. In addition, we compared Avey’s performance against highly seasoned primary care physicians with an average of 16.6 years of experience. Results show that Avey compares favourably to the physicians and even outperforms them on some accuracy metrics, including the ability to rank diseases correctly within the differential lists and to place the main diagnoses at the top of the lists.
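To make the idea of scoring where the main diagnosis lands in a differential list more concrete, here is a minimal sketch of two common position-based measures: a top-k hit and the reciprocal rank. These are generic illustrations, not necessarily the 7 metrics used in the study, and the function names and the example differential are hypothetical.

```python
def main_diagnosis_rank(differential, main_diagnosis):
    """Return the 1-based position of the main diagnosis, or None if absent."""
    for position, disease in enumerate(differential, start=1):
        if disease == main_diagnosis:
            return position
    return None


def hit_at_k(differential, main_diagnosis, k):
    """True if the main diagnosis appears within the first k entries."""
    rank = main_diagnosis_rank(differential, main_diagnosis)
    return rank is not None and rank <= k


def reciprocal_rank(differential, main_diagnosis):
    """1/rank of the main diagnosis, or 0.0 if it is missing from the list."""
    rank = main_diagnosis_rank(differential, main_diagnosis)
    return 0.0 if rank is None else 1.0 / rank


# Hypothetical differential list returned for one vignette.
differential = ["migraine", "tension headache", "sinusitis"]
main = "tension headache"

top1 = hit_at_k(differential, main, k=1)
top3 = hit_at_k(differential, main, k=3)
rr = reciprocal_rank(differential, main)
```

Averaging such per-vignette scores over a benchmark suite gives one number per system or physician, which is what makes side-by-side comparison possible.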

Copyright © Rimads 2022 All Rights Reserved