Bibliography
Douglas G. Altman and Martin Bland. Diagnostic tests 2: predictive values. British Medical Journal, 309(6947):102, July 1994. doi:10.1136/bmj.309.6947.102.
Ildar Z. Batyrshin, Nailya Kubysheva, Valery Solovyev, and Luis A. Villa-Vargas. Visualization of similarity measures for binary data and 2x2 tables. Computación y Sistemas, 20(3):345–353, September 2016. doi:10.13053/cys-20-3-2457.
Christopher D. Brown and Herbert T. Davis. Receiver operating characteristics curves and related decision measures: a tutorial. Chemometrics and Intelligent Laboratory Systems, 80(1):24–38, January 2006. doi:10.1016/j.chemolab.2005.05.004.
Gurol Canbek, Seref Sagiroglu, Tugba Taskaya Temizel, and Nazife Baykal. Binary classification performance measures/metrics: a comprehensive visualized roadmap to gain new insights. In International Conference on Computer Science and Engineering (UBMK), 821–826. Antalya, Turkey, October 2017. Institute of Electrical and Electronics Engineers (IEEE). doi:10.1109/UBMK.2017.8093539.
Jacob Cohen. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46, April 1960. doi:10.1177/001316446002000104.
Andrew R. Conn, Katya Scheinberg, and Luis N. Vicente. Introduction to Derivative-Free Optimization. Volume 8. Society for Industrial and Applied Mathematics, Philadelphia, Pennsylvania, USA, January 2009. doi:10.1137/1.9780898718768.
Tom Fawcett. An introduction to ROC analysis. Pattern Recognition Letters, 27(8):861–874, June 2006. doi:10.1016/j.patrec.2005.10.010.
Ronen Fluss, David Faraggi, and Benjamin Reiser. Estimation of the Youden index and its associated cutoff point. Biometrical Journal, 47(4):458–472, August 2005. doi:10.1002/bimj.200410135.
Ian A. Gardner and Matthias Greiner. Receiver‐operating characteristic curves and likelihood ratios: improvements over traditional methods for the evaluation and application of veterinary clinical pathology tests. Veterinary Clinical Pathology, 35(1):8–17, March 2006. doi:10.1111/j.1939-165x.2006.tb00082.x.
Afina S. Glas, Jeroen Lijmer, Martin H. Prins, Gouke Bonsel, and Patrick M. M. Bossuyt. The diagnostic odds ratio: a single indicator of test performance. Journal of Clinical Epidemiology, 56(11):1129–1135, November 2003. doi:10.1016/S0895-4356(03)00177-X.
John C. Gower and Pierre Legendre. Metric and Euclidean properties of dissimilarity coefficients. Journal of Classification, 3(1):5–48, March 1986. doi:10.1007/bf01896809.
Anaïs Halin, Sébastien Piérard, Anthony Cioppa, and Marc Van Droogenbroeck. A hitchhiker's guide to understanding performances of two-class classifiers. arXiv, 2024. arXiv:2412.04377, doi:10.48550/arXiv.2412.04377.
Thomas F. Heston. Standardizing predictive values in diagnostic imaging research. Journal of Magnetic Resonance Imaging, 33(2):505, January 2011. doi:10.1002/jmri.22466.
Uzay Kaymak, Arie Ben-David, and Rob Potharst. The AUK: a simple alternative to the AUC. Engineering Applications of Artificial Intelligence, 25(5):1082–1089, August 2012.
Sébastien Piérard, Adrien Deliège, Anaïs Halin, and Marc Van Droogenbroeck. A methodology to evaluate strategies predicting rankings on unseen domains. In IEEE International Conference on Multimedia and Expo Workshops (ICMEW), Workshop on Big Surveillance Data Analysis and Processing (BIG-Surv), 1–6. Nantes, France, June-July 2025.
Sébastien Piérard, Anaïs Halin, Anthony Cioppa, Adrien Deliège, and Marc Van Droogenbroeck. The Tile: a 2D map of ranking scores for two-class classification. arXiv, 2024. arXiv:2412.04309, doi:10.48550/arXiv.2412.04309.
Sébastien Piérard, Anaïs Halin, Anthony Cioppa, Adrien Deliège, and Marc Van Droogenbroeck. Foundations of the theory of performance-based ranking. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, Tennessee, USA, June 2025.
Sébastien Piérard and Marc Van Droogenbroeck. Summarizing the performances of a background subtraction algorithm measured on several videos. In IEEE International Conference on Image Processing (ICIP), 3234–3238. Abu Dhabi, United Arab Emirates, October 2020. doi:10.1109/ICIP40778.2020.9190865.
David M. W. Powers. Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation. arXiv, 2020. arXiv:2010.16061, doi:10.48550/arXiv.2010.16061.
Matthijs J. Warrens. The effect of combining categories on Bennett, Alpert and Goldstein's $S$. Statistical Methodology, 9(3):341–352, May 2012. doi:10.1016/j.stamet.2011.09.001.
Daniel S. Wilks. Statistical methods in the atmospheric sciences. Elsevier, fourth edition, 2020. doi:10.1016/C2017-0-03921-6.
Matthias Wimmer, Bernd Radig, and Michael Beetz. A person and context specific approach for skin color classification. In IEEE International Conference on Pattern Recognition (ICPR), 39–42. Hong Kong, China, 2006. Institute of Electrical and Electronics Engineers (IEEE). doi:10.1109/icpr.2006.151.
William John Youden. Index for rating diagnostic tests. Cancer, 3(1):32–35, January 1950. doi:10.1002/1097-0142(1950)3:1<32::aid-cncr2820030106>3.0.co;2-3.