Previous: Interface  

 

Learning Problem Formulations

 

malibu is designed to handle a number of machine learning problems. Consider the following three questions when selecting a problem formulation:

  1. What type of information do I have?
  2. What do I want to learn?
  3. What assumptions do I need to make?

 

Classification

 

Let's consider a simple classification problem where we are given a set of labeled training data (D). The training data consists of two elements: a fixed-length feature vector (x) and a class label (y). The feature vector is a set of numerical characteristics (or attributes) encoding what is known about the data. The class label denotes what needs to be learned (or predicted) from the data.

 

 

Given the above training data, the goal is to find some decision function (g) that maps the feature vector to the class label.

 

 

A common approach is to use a real-valued prediction function and use the sign function to obtain a discreet output (i.e. if this value is greater than 0 output 1, otherwise output -1).

 

 

The classifier searchs the function space for a function that minimizes the classification error, i.e. the probability that g(x) ¹ y. Here, c denotes our classifier (classification function).

 

 

So far, we have described how the classifier selects a function using the true error or generalization error. However, this is not an observable quantity because the distribution D is unknown. Instead, we are given a sample, S, which is (hopefully) drawn i.i.d. from D and our classifier selects the best function using the empirical error (loss function). Note, I() is a function that returns 1 if the condition is true and 0 otherwise.

 

 

 

Langford, J. Practical prediction theory for classification. In Proceedings of International Conference on Machine Learning. Washington, DC USA. 2003.

 

 

Importance-weighted Classification

 

Now, building on the above defintion, lets consider a classification problem where the cost of misclassifying certain examples is greater the others. The training data in this case consists of three elements: a fixed-length feature vector (x), a class label (y) and the importance of the example (w).

 

 

Given the above training data, the goal is to find some decision function (g) that maps the feature vector to the class label (while considering each examples importance).

 

 

The classifier must find a function that minimizes the expected cost (importance weighted error). Note, the classifier is not allowed to use the importance weight at prediction time.

 

 

 

Zadrozny, Bianca, John Langford, and Naoki Abe. "Cost-Sensitive Learning by Cost-Proportionate Example Weighting." Paper presented at the Third IEEE International Conference on Data Mining, Melbourne, Florida 2003.

 

 

Cost-sensitive Classification

 

This type of learning is a restricted verison of the previous, importance weighted learning. Here, we label the importance of an example by considering its class label. In other words, examples labeled with a certain class will be weighted more/less as compared to the other class.

 

Cost-sensitive classification is generally done in two cases. First, this technique is importance when the cost of misclassifying one class is larger than the other (missing a cancer prediction). Second, this technique is also used when the datasets are skewed toward some class; in this case, the algorithm tries to balance the output.

 

 

Zadrozny, Bianca, John Langford, and Naoki Abe. "Cost-Sensitive Learning by Cost-Proportionate Example Weighting." Paper presented at the Third IEEE International Conference on Data Mining, Melbourne, Florida 2003.

 

 

Probability Estimates

 

Estimating the probability of the classification output are important when predictions are not used in isolation. That is, when a Doctor needs advice or the prediction is used in a larger system. Let's consider a simple classification problem (considered before) where we are given a set of labeled training data (D). Here, the class labels have changed from {-1,1} to {0,1}.

 

 

Now, we will learn a predictor that output not only a classification but also a probability estimate, i.e. the probability the example is positive.

 

 

The classifier must find the probabilistic function that minimizes the mean squared error.

 

 

 

Langford, John, and Bianca Zadrozny. "Estimating Class Membership Probabilities Using Classifier Learners" Paper presented at the Tenth International Workshop on Artificial Intelligence and Statistics, Barbados 2005.

 

 

Multipe-instance Learning

 

In the multiple-instance learning setting reintroduced by (Dietterich et al., 1997), the learning algorithm is given weaker access to labeled information. Instead of having individually labeled instances, the learning algorithm is given a labeled collection of instances where the label is positive if at least one instance in the collection is positive otherwise the label is negative.

 

 

We want to learn a predictor that outputs a prediction similar to classification where the feature vector is mapped to some class label.

 

 

These predictions can be combined in order to classify a bag of instances.

 

 

There exist a number of loss functions to tackle the multiple instance learning problem. In other words, the optimal function that best maps the input bags to the output label. One way to look at the problem is to cast it as a classification problem where any algorithm parameters are tuned on the bag-level.

 

 

 

Dietterich, T. G., R. H. Lathrop and T. Lozano-Pérez. Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence. 89:31-71, 1997.

 

Blum, A. and A. Kalai. A note on learning from multiple-instance examples. Machine Learning. 30:23-29, 1998.

 

 

Previous: Interface