Cost-effective labeling of data

Learning classification models often relies on data that are labeled or annotated by a human expert. In general, the more expertise and time the labeling process requires, the more costly it is to label the data. In addition, there may be constraints on how many data instances one expert can feasibly label. The goal of this project is to find ways of reducing the number of labels needed while preserving or improving the quality of the models built from them. Our work on data labeling covers three directions:

Active learning/sampling.
Learning with auxiliary soft-label information.
Learning from multiple experts.

Active learning/sampling

We are developing an active learning (sampling) framework that aims to find, from an originally unlabeled set, a labeled set of instances (patient cases) or models that helps us evaluate, as accurately as possible, various statistics (such as sensitivity and specificity) of predictive rules and models. Our goal is to use the framework to support offline evaluation of clinical alerting rules, and of their subsequent refinements, by estimating their performance statistics on past data.
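To make the estimation task concrete, here is a minimal sketch of a simple stratified-sampling baseline. It is an illustration under our own assumptions, not one of the sampling strategies studied in this project: cases are split by whether the rule fired, a fixed number of cases per group is sent for labeling, and the labeled proportions are scaled back up to the group sizes. The names `stratified_estimate`, `oracle`, and `n_per_stratum`, and the synthetic data, are hypothetical.

```python
import random

def stratified_estimate(predictions, oracle, n_per_stratum, seed=0):
    """Estimate sensitivity and specificity of a 0/1 rule from a stratified labeled sample.

    predictions: list of 0/1 rule outputs for all (unlabeled) cases
    oracle: function mapping a case index to its true 0/1 label (stands in for the expert)
    n_per_stratum: number of cases to label in each group (rule fired / did not fire)
    """
    rng = random.Random(seed)
    strata = {0: [], 1: []}
    for i, p in enumerate(predictions):
        strata[p].append(i)

    est_pos, est_neg = {}, {}
    for p, idx in strata.items():
        k = min(n_per_stratum, len(idx))
        sample = rng.sample(idx, k)
        # Fraction of truly positive cases among the labeled sample of this group.
        pos_rate = sum(oracle(i) for i in sample) / k if k else 0.0
        # Scale the sampled proportions up to the full group size.
        est_pos[p] = pos_rate * len(idx)
        est_neg[p] = (1.0 - pos_rate) * len(idx)

    tp, fn = est_pos[1], est_pos[0]
    tn, fp = est_neg[0], est_neg[1]
    sens = tp / (tp + fn) if tp + fn else float("nan")
    spec = tn / (tn + fp) if tn + fp else float("nan")
    return sens, spec

# Synthetic example: the rule fired on 10 of 100 cases; `truth` is made-up data.
predictions = [1] * 10 + [0] * 90
truth = [1] * 8 + [0] * 2 + [1] * 9 + [0] * 81
sens, spec = stratified_estimate(predictions, lambda i: truth[i], n_per_stratum=5)
print(f"sensitivity ~ {sens:.2f}, specificity ~ {spec:.2f}")
```

With only a few labels per group the estimates are noisy; choosing which cases to label so that this noise shrinks fastest is the kind of question an active sampling strategy addresses.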

Related Publications:

H. Valizadegan, S. Amizadeh, M. Hauskrecht.
Sampling Strategies to Evaluate the Performance of Unknown Predictors.
SIAM Data Mining Conference, Anaheim, CA, April 2012.

Learning with auxiliary soft-label information

We have developed a new machine learning framework in which the binary class labels typically used to learn binary classification models are enriched with soft-label information reflecting the expert's more refined view of the class a labeled instance belongs to. Soft-label information can be represented either as (1) a probabilistic (or numeric) score, e.g., the chance of the patient having the disease is 0.7, or (2) a qualitative category, such as weak or strong agreement with the patient having the disease. The cost of obtaining this additional information is typically small compared to the original binary label assessment. We have demonstrated that the framework reduces the number of examples one needs to label when they are selected randomly from the originally unlabeled set. We are currently studying the combination of our framework with active learning sample selection strategies.
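One simple way to see how a probabilistic score can enter training is to use it directly as the target of a logistic regression. The sketch below is only that, a minimal illustration under our own assumptions; it is not the framework described in the publications listed here, which may use the soft labels differently. The function names and the toy data are hypothetical.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_soft_logistic(X, soft_labels, lr=0.5, epochs=2000):
    """Gradient descent on cross-entropy where each target is a score in [0, 1]."""
    d, n = len(X[0]), len(X)
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for x, t in zip(X, soft_labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - t  # the gradient has the same form for hard and soft targets
            for j in range(d):
                gw[j] += err * x[j]
            gb += err
        w = [wi - lr * g / n for wi, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def predict_proba(w, b, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Toy data: one feature, with expert scores instead of 0/1 labels (made-up values).
X = [[0.0], [1.0], [2.0], [3.0]]
soft = [0.1, 0.3, 0.7, 0.9]
w, b = fit_soft_logistic(X, soft)
print([round(predict_proba(w, b, x), 2) for x in X])
```

A score of 0.7 pulls the model less strongly toward the positive class than a hard label of 1 would, which is how the extra information can make each labeled example more informative.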

Related Publications:

Q. Nguyen, H. Valizadegan, and M. Hauskrecht.
Learning classification models with soft-label information.
Journal of the American Medical Informatics Association, 21:3, pp. 501-508, 2014.
Q. Nguyen, H. Valizadegan, and M. Hauskrecht.
Learning classification with auxiliary probabilistic information.
IEEE International Conference on Data Mining, Vancouver, Canada, December 2011.
Q. Nguyen, H. Valizadegan, A. Seybert, and M. Hauskrecht.
Sample-efficient learning with auxiliary class-label information.
Annual American Medical Informatics Association (AMIA) Symposium, October 2011.

Learning from multiple experts

The labels used for building the models may come from multiple experts, and the experts may have different subjective opinions on how some of the patient cases should be labeled. We have studied this scenario by designing a new multi-expert learning framework that assumes information on who labeled each case is available. Our framework explicitly models different sources of disagreement and lets us naturally combine labels from different human experts to obtain (1) a consensus classification model representing the model the group of experts converges to, and (2) individual expert models.
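The idea of learning a consensus model together with individual expert models can be sketched with a simple coupled logistic regression: each expert gets its own weight vector, fit to that expert's labels, and a penalty pulls every expert's weights toward a shared consensus vector. This is a minimal illustration under our own assumptions, not the framework from the publications listed here; `fit_multi_expert`, the coupling strength `lam`, and the toy data are hypothetical.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def fit_multi_expert(data_by_expert, dim, lam=0.1, lr=0.1, epochs=5000):
    """data_by_expert maps an expert id to a list of (features, 0/1 label) pairs."""
    experts = sorted(data_by_expert)
    models = {k: [0.0] * dim for k in experts}
    consensus = [0.0] * dim
    for _ in range(epochs):
        for k in experts:
            cases, w = data_by_expert[k], models[k]
            grad = [0.0] * dim
            for x, y in cases:
                err = sigmoid(dot(w, x)) - y
                for j in range(dim):
                    grad[j] += err * x[j] / len(cases)
            # Log-loss gradient plus a pull toward the current consensus weights.
            models[k] = [w[j] - lr * (grad[j] + lam * (w[j] - consensus[j]))
                         for j in range(dim)]
        # For a squared-distance penalty, the best consensus is the mean of the expert weights.
        consensus = [sum(models[k][j] for k in experts) / len(experts)
                     for j in range(dim)]
    return consensus, models

# Toy data: two experts label the same cases but disagree on the case x = 2.
# Each feature vector is [x, 1.0]; the constant 1.0 acts as a bias term.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
data = {
    "expert_a": [([x, 1.0], 1 if x >= 2 else 0) for x in xs],
    "expert_b": [([x, 1.0], 1 if x >= 3 else 0) for x in xs],
}
consensus, models = fit_multi_expert(data, dim=2)
for name in ("expert_a", "expert_b"):
    print(name, round(sigmoid(dot(models[name], [2.0, 1.0])), 2))
print("consensus", round(sigmoid(dot(consensus, [2.0, 1.0])), 2))
```

On the disputed case the consensus prediction falls between the two expert models, while the individual models keep each expert's own labeling tendency. Knowing who labeled each case is what makes the per-expert models possible.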

Related Publications:

H. Valizadegan, Q. Nguyen, and M. Hauskrecht.
Learning Classification Models from Multiple Experts.
Journal of Biomedical Informatics, 46:6, pp. 1125-1135, 2013.
H. Valizadegan, Q. Nguyen, and M. Hauskrecht.
Learning Medical Diagnosis Models from Multiple Experts.
Annual American Medical Informatics Association Symposium, Chicago, IL, November 2012.

Funding:

NIH. 1R01LM010019. Using medical records repositories to improve the alert system design. PI: Hauskrecht, September 2009 - September 2013.

The web page is updated by milos.