Business Intelligence & general IT stuff

Saturday, November 8, 2008

Some Data mining terminology

Confusion matrix

From Wikipedia, the free encyclopedia

In the field of artificial intelligence, a confusion matrix is a visualization tool typically used in supervised learning (in unsupervised learning it is typically called a matching matrix). Each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class. One benefit of a confusion matrix is that it is easy to see if the system is confusing two classes (i.e. commonly mislabelling one as another).

When a data set is unbalanced (when the number of samples in different classes vary greatly) the error rate of a classifier is not representative of the true performance of the classifier. This can easily be understood by an example: If there are for example 990 samples from class A and only 10 samples from class B, the classifier can easily be biased towards class A. If the classifier classifies all the samples as class A, the accuracy will be 99%. This is not a good indication of the classifier's true performance. The classifier has a 100% recognition rate for class A but a 0% recognition rate for class B.

In the example confusion matrix below, of the 8 actual cats, the system predicted that three were dogs, and of the six dogs, it predicted that one was a rabbit and two were cats. We can see from the matrix that the system in question has trouble distinguishing between cats and dogs, but can make the distinction between rabbits and other types of animals pretty well.

Example confusion matrix
CatDogRabbit
Cat530
Dog231
Rabbit0211



Table of Confusion

In Predictive Analytics, a Table of Confusion, also known as a confusion matrix, is a table with two rows and two columns that reports the number of True Negatives, False Positives, False Negatives, and True Positives.

 actual value
 pntotal
prediction
outcome
p'True
Positive
False
Positive
P'
n'False
Negative
True
Negative
N'
totalPN

Table 1: Table of Confusion.

For example, consider a model which predicts for 10,000 Insurance Claims whether each case is Fraudulent. This model correctly predicts 9,700 non-fraudulent cases, and 100 fraudulent cases. The model also incorrectly predicts 150 cases which are not fraudulent to be fraudulent, and 50 cases which are fraudulent to be non-fraudulent. The resulting Table of Confusion is shown below.

 actual value
 pntotal
prediction
outcome
p'100150P'
n'509700N'
totalPN

Table 2: Example Table of Confusion.


Mean absolute error


In statistics, the mean absolute error is a quantity used to measure how close forecasts or predictions are to the eventual outcomes. The mean absolute error (MAE) is given by

\mbox{MAE} = \frac{1}{n}\sum_{i=1}^n \left| f_i-y_i\right| =\frac{1}{n}\sum_{i=1}^n \left| e_i \right|.

As the name suggests, the mean absolute error is an average of the absolute errors ei = fi − yi, where fi is the prediction and yi the true value. Note that alternative formulations may include relative frequencies as weight factors.

The mean absolute error is a common measure of forecast error in time series analysis, where the terms "mean absolute deviation" is sometimes used in confusion with the more standard definition of mean absolute deviation. The same confusion exists more generally.


Mean absolute error

From Wikipedia, the free encyclopedia

In statistics, the mean absolute error is a quantity used to measure how close forecasts or predictions are to the eventual outcomes. The mean absolute error (MAE) is given by

\mbox{MAE} = \frac{1}{n}\sum_{i=1}^n \left| f_i-y_i\right| =\frac{1}{n}\sum_{i=1}^n \left| e_i \right|.

As the name suggests, the mean absolute error is an average of the absolute errors ei = fi − yi, where fi is the prediction and yi the true value. Note that alternative formulations may include relative frequencies as weight factors.

The mean absolute error is a common measure of forecast error in time series analysis, where the terms "mean absolute deviation" is sometimes used in confusion with the more standard definition of mean absolute deviation. The same confusion exists more generally.