90% F1 Score in Relation Triple Extraction: Is it Real? [PDF]
Accepted in GenBench workshop @ EMNLP ...
Pratik Saini +3 more
openalex +3 more sources
The F1 score has been widely used to measure the performance of machine learning models. However, it varies with the ratio of the positive class in the training data, $\pi$.
Hyeon Gyu Kim, Yoohyun Park
doaj +2 more sources
This study compares various F1-score variants—micro, macro, and weighted—to assess their performance in evaluating text-based emotion classification. Lexicon distillation is employed using the multilabel emotion-annotated datasets XED and GoEmotions.
Maria Cristina Hinojosa Lee +2 more
doaj +3 more sources
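The entry above contrasts micro, macro, and weighted F1 averaging. As an illustrative sketch (not code from the paper), the three averages can be computed from scratch for single-label multi-class predictions; `f1_scores` is a hypothetical helper name:

```python
from collections import Counter

def f1_scores(y_true, y_pred):
    """Return (micro, macro, weighted) F1 for single-label multi-class data."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but true class was t
            fn[t] += 1
    per_class = {}
    for c in labels:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        per_class[c] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    # Macro: unweighted mean of per-class F1 (every class counts equally).
    macro = sum(per_class.values()) / len(labels)
    # Micro: pool TP/FP/FN over classes; for single-label multi-class
    # this collapses to plain accuracy.
    TP, FP, FN = sum(tp.values()), sum(fp.values()), sum(fn.values())
    micro = 2 * TP / (2 * TP + FP + FN) if TP else 0.0
    # Weighted: per-class F1 averaged by class support in y_true.
    support = Counter(y_true)
    weighted = sum(per_class[c] * support[c] for c in labels) / len(y_true)
    return micro, macro, weighted
```

On imbalanced data the three values diverge: macro-F1 is pulled down by rare classes the model handles poorly, while micro-F1 tracks overall accuracy.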
The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. [PDF]
Background: To evaluate binary classifications and their confusion matrices, scientific researchers can employ several statistical rates, according to the goal of the experiment they are investigating. Despite being a crucial issue in machine learning, no widespread consensus has yet been reached on a single preferred measure.
Chicco D, Jurman G.
europepmc +6 more sources
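The entry above argues for MCC over F1 in binary evaluation. A minimal sketch (not from Chicco & Jurman; `f1_and_mcc` is a hypothetical helper) shows the key difference on a degenerate classifier:

```python
import math

def f1_and_mcc(tp, fp, fn, tn):
    """Compute binary F1 and the Matthews correlation coefficient
    from the four confusion-matrix counts."""
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return f1, mcc

# "Always predict positive" on a 90/10 imbalanced set:
# F1 = 1.8/1.9 ≈ 0.947 looks excellent, while MCC = 0.0
# correctly flags a classifier with no discriminative power.
f1, mcc = f1_and_mcc(tp=90, fp=10, fn=0, tn=0)
```

F1 ignores true negatives entirely, which is exactly why MCC diverges from it on imbalanced data.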
High-F1-score Recognition of Input Gestures while Holding Smartphone by Reducing False Detection Due to Walking Noise [PDF]
Ryo Katsuma, Tomoaki Amiya
openalex +2 more sources
Estimating the Uncertainty of Average F1 Scores [PDF]
In multi-class text classification, the performance (effectiveness) of a classifier is usually measured by micro-averaged and macro-averaged F1 scores. However, the scores themselves do not tell us how reliable they are in terms of forecasting the classifier's future performance on unseen data.
Zhang, Dell, Wang, J., Zhao, X.
openaire +1 more source
X-ray is not inferior to CT in terms of F1 score in the diagnosis of foreign body aspiration: a recall, precision and F1 score performance analysis based on bronchoscopically proven cases. [PDF]
Sarac F, Yazici M.
europepmc +3 more sources
A Bayesian Hierarchical Model for Comparing Average F1 Scores [PDF]
In multi-class text classification, the performance (effectiveness) of a classifier is usually measured by micro-averaged and macro-averaged F1 scores. However, the scores themselves do not tell us how reliable they are in terms of forecasting the classifier's future performance on unseen data.
Zhang, Dell +3 more
openaire +1 more source
Time to retire F1-binary score for action unit detection
Detecting action units is an important task in face analysis, especially in facial expression recognition. This is due, in part, to the idea that expressions can be decomposed into multiple action units. To evaluate systems that detect action units, F1-binary score is often used as the evaluation metric.
Saurabh Hinduja +3 more
openaire +2 more sources
Confidence interval for micro-averaged F1 and macro-averaged F1 scores [PDF]
A binary classification problem is common in the medical field, and we often use sensitivity, specificity, accuracy, and negative and positive predictive values as measures of the performance of a binary predictor. In computer science, a classifier is usually evaluated with precision (positive predictive value) and recall (sensitivity). As a single summary
Kanae Takahashi +3 more
openaire +2 more sources
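Several entries above concern the reliability of a reported F1 score. As a minimal sketch of the idea (a simple percentile bootstrap, not the analytic interval derived in the paper; `bootstrap_f1_ci` is a hypothetical helper), one can resample the test set to attach an uncertainty interval to a binary F1:

```python
import random

def binary_f1(y_true, y_pred):
    """Binary F1 from paired 0/1 label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def bootstrap_f1_ci(y_true, y_pred, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap (1 - alpha) confidence interval for binary F1."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        scores.append(binary_f1([y_true[i] for i in idx],
                                [y_pred[i] for i in idx]))
    scores.sort()
    lo = scores[int(alpha / 2 * n_boot)]
    hi = scores[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

A wide interval on a small test set is exactly the situation these papers warn about: the point F1 alone says little about future performance.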

