Bibliography

[1]

Douglas G. Altman and Martin Bland. Diagnostic tests 2: predictive values. British Medical Journal, 309(6947):102, July 1994. doi:10.1136/bmj.309.6947.102.

[2]

Ildar Z. Batyrshin, Nailya Kubysheva, Valery Solovyev, and Luis A. Villa-Vargas. Visualization of similarity measures for binary data and 2x2 tables. Computación y Sistemas, 20(3):345–353, September 2016. doi:10.13053/cys-20-3-2457.

[3]

Christopher D. Brown and Herbert T. Davis. Receiver operating characteristics curves and related decision measures: a tutorial. Chemometrics and Intelligent Laboratory Systems, 80(1):24–38, January 2006. doi:10.1016/j.chemolab.2005.05.004.

[4]

Gurol Canbek, Seref Sagiroglu, Tugba Taskaya Temizel, and Nazife Baykal. Binary classification performance measures/metrics: a comprehensive visualized roadmap to gain new insights. In International Conference on Computer Science and Engineering (UBMK), 821–826. Antalya, Turkey, October 2017. Institute of Electrical and Electronics Engineers (IEEE). doi:10.1109/UBMK.2017.8093539.

[5]

Jacob Cohen. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46, April 1960. doi:10.1177/001316446002000104.

[6]

Andrew R. Conn, Katya Scheinberg, and Luis N. Vicente. Introduction to Derivative-Free Optimization. Volume 8. Society for Industrial and Applied Mathematics, Philadelphia, Pennsylvania, USA, January 2009. doi:10.1137/1.9780898718768.

[7]

Tom Fawcett. An introduction to ROC analysis. Pattern Recognition Letters, 27(8):861–874, June 2006. doi:10.1016/j.patrec.2005.10.010.

[8]

Ronen Fluss, David Faraggi, and Benjamin Reiser. Estimation of the Youden index and its associated cutoff point. Biometrical Journal, 47(4):458–472, August 2005. doi:10.1002/bimj.200410135.

[9]

Ian A. Gardner and Matthias Greiner. Receiver‐operating characteristic curves and likelihood ratios: improvements over traditional methods for the evaluation and application of veterinary clinical pathology tests. Veterinary Clinical Pathology, 35(1):8–17, March 2006. doi:10.1111/j.1939-165x.2006.tb00082.x.

[10]

Afina S. Glas, Jeroen Lijmer, Martin H. Prins, Gouke Bonsel, and Patrick M. M. Bossuyt. The diagnostic odds ratio: a single indicator of test performance. Journal of Clinical Epidemiology, 56(11):1129–1135, November 2003. doi:10.1016/S0895-4356(03)00177-X.

[11]

John C. Gower and Pierre Legendre. Metric and Euclidean properties of dissimilarity coefficients. Journal of Classification, 3(1):5–48, March 1986. doi:10.1007/bf01896809.

[12]

Anaïs Halin, Sébastien Piérard, Anthony Cioppa, and Marc Van Droogenbroeck. A hitchhiker's guide to understanding performances of two-class classifiers. arXiv, 2024. arXiv:2412.04377, doi:10.48550/arXiv.2412.04377.

[13]

Thomas F. Heston. Standardizing predictive values in diagnostic imaging research. Journal of Magnetic Resonance Imaging, 33(2):505, January 2011. doi:10.1002/jmri.22466.

[14]

Uzay Kaymak, Arie Ben-David, and Rob Potharst. The AUK: a simple alternative to the AUC. Engineering Applications of Artificial Intelligence, 25(5):1082–1089, August 2012.

[15]

Sébastien Piérard, Adrien Deliège, Anaïs Halin, and Marc Van Droogenbroeck. A methodology to evaluate strategies predicting rankings on unseen domains. In IEEE International Conference on Multimedia and Expo Workshops (ICMEW), Workshop on Big Surveillance Data Analysis and Processing (BIG-Surv), 1–6. Nantes, France, June-July 2025.

[16]

Sébastien Piérard, Anaïs Halin, Anthony Cioppa, Adrien Deliège, and Marc Van Droogenbroeck. The Tile: a 2D map of ranking scores for two-class classification. arXiv, 2024. arXiv:2412.04309, doi:10.48550/arXiv.2412.04309.

[17]

Sébastien Piérard, Anaïs Halin, Anthony Cioppa, Adrien Deliège, and Marc Van Droogenbroeck. Foundations of the theory of performance-based ranking. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, Tennessee, USA, June 2025.

[18]

Sébastien Piérard and Marc Van Droogenbroeck. Summarizing the performances of a background subtraction algorithm measured on several videos. In IEEE International Conference on Image Processing (ICIP), 3234–3238. Abu Dhabi, United Arab Emirates, October 2020. doi:10.1109/ICIP40778.2020.9190865.

[19]

David M. W. Powers. Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation. arXiv, 2020. arXiv:2010.16061, doi:10.48550/arXiv.2010.16061.

[20]

Matthijs J. Warrens. The effect of combining categories on Bennett, Alpert and Goldstein's $S$. Statistical Methodology, 9(3):341–352, May 2012. doi:10.1016/j.stamet.2011.09.001.

[21]

Daniel S. Wilks. Statistical Methods in the Atmospheric Sciences. Elsevier, fourth edition, 2020. doi:10.1016/C2017-0-03921-6.

[22]

Matthias Wimmer, Bernd Radig, and Michael Beetz. A person and context specific approach for skin color classification. In IEEE International Conference on Pattern Recognition (ICPR), 39–42. Hong Kong, China, 2006. Institute of Electrical and Electronics Engineers (IEEE). doi:10.1109/icpr.2006.151.

[23]

William John Youden. Index for rating diagnostic tests. Cancer, 3(1):32–35, January 1950. doi:10.1002/1097-0142(1950)3:1<32::aid-cncr2820030106>3.0.co;2-3.