October 17, 2014

\(\DeclareMathOperator*{\argmin}{arg\,min}\) \(\DeclareMathOperator*{\logit}{\text{logit}}\)

SLG Logistics

Let's thank Brian Gaines, Josh Day, and Will Burton for their great talks on Regularization & Penalization in Regression!

Reminder: Slides, code, and videos are available on Justin Post's website.

Today: First of our three-part classification miniseries.

On the horizon:

  • Oct. 24: Suchit Mehrotra – Bayesian Classification Methods
  • Oct. 31: TBD – TBD

Outline

  1. Why Classification?
  2. Notation
  3. The Two Cultures
  4. Data Models
  5. Algorithmic Models
  6. Summary

Why Classification?

We're Born to Classify!

Examples of Supervised Classification

  • Automatic Speech Recognition
    • Siri
  • Natural Language Processing
    • Google Translate
    • Part-of-speech tagging and syntactic parsing
  • Image Classification
    • Facial recognition
    • Traffic sign identification in self-driving cars
    • Brain tumor diagnosis
  • Consumer Analysis
    • Demographic analysis
    • "Customers who bought this item also bought…"

Notation

Notation

  • \(y_i\): output or response variable (categorical)
    • \(i = 1,...,n\)
  • \(x_{ij}\): input, predictor, or feature variable
    • \(j = 1,...,p\)

Matrix notation

\(\pmb{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} \) and \(\quad \pmb{X} = \begin{pmatrix} \pmb{x_1}^T \\ \pmb{x_2}^T \\ \vdots \\ \pmb{x_n}^T \end{pmatrix} = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}\)

Classes

Let classes be indexed by \(k=1,\ldots,K\).

If \(K=2\), then \(Y\) is binary, taking values in \(\{0, 1\}\).

Otherwise, if \(K>2\), then \(Y\) takes values in \(\{1,2,\ldots,K\}\).

The Two Cultures

Statistical Modeling: The Two Cultures

Breiman (2001)

Data Modeling

Algorithmic Modeling

Data Models

OLS Regression

Simple idea: Model \(P(Y = 1\,|\,\pmb{x}) = \pmb{x}^T\pmb{\beta} + \beta_0\)

library(ggplot2)
ggplot(cancer, aes(x = thickness, y = malignant)) + 
    stat_smooth(method = "lm", se=FALSE) + geom_point(size = 3, alpha = 0.1)

[Figure: OLS line fit to the cancer data, malignant vs. thickness (chunk cancer-ols)]

OLS Regression

Pros:

  • Your mother still loves you.

Cons:

  • Predicts probabilities \(<0\) and \(>1\)
  • If \(K > 2\), the response \(\pmb{Y}\) is difficult to define appropriately, if at all.

Surely we can do better.

Logistic Regression

Solution:

Do not model the probabilities linearly! Instead, use a function that is bounded between 0 and 1, such as the logistic function:

\(P(Y = 1\,|\,\pmb{x}) = \frac{\exp(\pmb{x}^T\pmb{\beta} + \beta_0)}{1 + \exp(\pmb{x}^T\pmb{\beta} + \beta_0)} \implies \log\left(\frac{P(Y = 1\,|\,\pmb{x})}{1 - P(Y = 1\,|\,\pmb{x})}\right) = \pmb{x}^T\pmb{\beta} + \beta_0\)

  • \(\frac{\exp(\pmb{x}^T\pmb{\beta} + \beta_0)}{1 + \exp(\pmb{x}^T\pmb{\beta} + \beta_0)}\) is the logistic function applied to our data \(\pmb{x}\); hence, logistic regression.
  • \(\log\left(\frac{p}{1-p}\right)\) is called the log-odds or logit of \(p\).
  • The logit is linear in \(\pmb{x}\), making this a linear classifier.
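
As a quick check on the two equivalent forms above, here is a minimal sketch using base R's plogis() and qlogis(), which implement the logistic function and the logit, respectively (the eta values below are arbitrary):

# plogis() is the logistic function; qlogis() is its inverse, the logit
eta <- seq(-4, 4, by = 2)  # arbitrary linear predictor values x'beta + beta_0
p   <- plogis(eta)         # P(Y = 1 | x) = exp(eta) / (1 + exp(eta))
all.equal(qlogis(p), eta)  # taking the logit of p recovers the linear predictor
## [1] TRUE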

Logistic Regression

ggplot(cancer, aes(x = thickness, y = malignant)) + 
    stat_smooth(method = "glm", family = "binomial", se = FALSE) + 
    geom_point(size = 3, alpha = 0.1)

[Figure: logistic regression fit to the cancer data, malignant vs. thickness (chunk cancer-logistic)]

Logistic Regression

Pros:

  • Familiar framework: regression!
  • Allows for statistical inference on the parameters.

Cons:

  • Extensions to \(K>2\) (multinomial logistic regression) exist but are less widely used.

Linear Discriminant Analysis (LDA)

In logistic regression (LR) we directly model \(P(Y = k\,|\,\pmb{X} = \pmb{x})\) via the logistic function.

In LDA we model \(P(\pmb{X} = \pmb{x}\,|\,Y = k)\) and Bayes' Theorem tells us that

\[ P(Y = k\,|\,\pmb{X} = \pmb{x}) = \frac{P(\pmb{X} = \pmb{x}\,|\,Y = k)P(Y = k)}{\sum_{l=1}^K P(\pmb{X} = \pmb{x}\,|\,Y = l)P(Y = l)} \] where \(\sum_{k=1}^K P(Y=k) = 1\).

Linear Discriminant Analysis (LDA)

LDA places two primary assumptions on \(P(\pmb{X} = \pmb{x}\,|\,Y = k)\):

  1. Conditional on \(Y = k\), \(\pmb{X}\) follows a \(p\)-dimensional Gaussian distribution with mean \(\pmb{\mu_k}\) and covariance matrix \(\pmb{\Sigma_k}\).
  2. All classes share a common covariance matrix, i.e., \(\pmb{\Sigma_k} = \pmb{\Sigma}\) for all \(k=1,\ldots,K\).

Then the class density of \(\pmb{X}\) given \(Y = k\) is \[ P(\pmb{X} = \pmb{x}\,|\,Y = k) = (2\pi)^{-p/2}|\pmb{\Sigma}|^{-1/2}\exp\left\{-\frac{1}{2}(\pmb{x} - \pmb{\mu_k})^T\pmb{\Sigma}^{-1}(\pmb{x} - \pmb{\mu_k})\right\}\]
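
To see the formula in action, here is a minimal sketch checking it against mvtnorm::dmvnorm() for a made-up \(\pmb{\mu_k}\) and \(\pmb{\Sigma}\) (all values below are arbitrary):

library(mvtnorm)
p     <- 2
mu_k  <- c(1, -1)                      # made-up class-k mean
Sigma <- matrix(c(1, 0.3, 0.3, 1), p)  # made-up common covariance matrix
x     <- c(0.5, 0)

# the class density written out as above ...
dens <- (2 * pi)^(-p / 2) * det(Sigma)^(-1 / 2) *
    exp(-0.5 * t(x - mu_k) %*% solve(Sigma) %*% (x - mu_k))

# ... agrees with the multivariate normal density from mvtnorm
all.equal(as.numeric(dens), dmvnorm(x, mean = mu_k, sigma = Sigma))
## [1] TRUE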

Linear Discriminant Analysis (LDA)

Consider the log-ratio between classes \(k\) and \(l\), \[ \begin{aligned} \log\frac{P(Y=k\,|\,\pmb{X} = \pmb{x})}{P(Y = l\,|\,\pmb{X} = \pmb{x})} &= \log\frac{P(\pmb{X} = \pmb{x}\,|\,Y = k)}{P(\pmb{X} = \pmb{x}\,|\,Y = l)} + \log\frac{P(Y=k)}{P(Y=l)} \\ &= \pmb{x}^T\pmb{\Sigma}^{-1}(\pmb{\mu_k} - \pmb{\mu_l}) - \frac{1}{2}(\pmb{\mu_k} + \pmb{\mu_l})^T\pmb{\Sigma}^{-1}(\pmb{\mu_k} - \pmb{\mu_l}) + \log\frac{P(Y=k)}{P(Y=l)}. \end{aligned} \]

(Notice the normalizing constant \((2\pi)^{-p/2}|\pmb{\Sigma}|^{-1/2}\) and quadratic term \(-\frac{1}{2}\pmb{x}^T\pmb{\Sigma}^{-1}\pmb{x}\) cancel due to the common covariance assumption.)

If this log-ratio is positive, \(\pmb{x}\) is more likely to have arisen from class \(k\) than from class \(l\).

Linear Discriminant Analysis (LDA)

Over all classes \(k=1,\ldots,K\), the data \(\pmb{x}\) is most likely to have arisen from the class \(k\) that maximizes \[ \delta_k(\pmb{x}) = \pmb{x}^T\pmb{\Sigma}^{-1}\pmb{\mu_k} - \frac{1}{2}\pmb{\mu_k}^T\pmb{\Sigma}^{-1}\pmb{\mu_k} + \log P(Y=k). \]

\(\delta_k(\pmb{x})\) is linear in \(\pmb{x}\), so LDA is a linear classifier (like LR).
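
Here is a rough sketch of computing these scores directly from plug-in estimates (class means, pooled covariance, and sample class proportions for \(P(Y=k)\)) on the built-in iris data; in practice, MASS::lda(), used in the spam example below, handles this estimation for us.

X <- as.matrix(iris[, 1:4])
y <- iris$Species

mus    <- lapply(split(as.data.frame(X), y), colMeans)  # class means mu_k
priors <- table(y) / length(y)                          # sample estimates of P(Y = k)
Sigma  <- Reduce(`+`, lapply(split(as.data.frame(X), y),
                 function(d) cov(d) * (nrow(d) - 1))) / (nrow(X) - 3)  # pooled covariance
Sinv   <- solve(Sigma)

# discriminant scores delta_k(x) for a single observation x
delta <- function(x) sapply(names(mus), function(k) {
    mu <- mus[[k]]
    drop(x %*% Sinv %*% mu - 0.5 * t(mu) %*% Sinv %*% mu + log(priors[[k]]))
})
names(which.max(delta(X[1, ])))  # class with the largest score
## [1] "setosa"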

Linear Discriminant Analysis (LDA)

  • Dashed Lines – optimal Bayes decision boundary
  • Solid Lines – decision boundary estimated by LDA

(image from An Introduction to Statistical Learning, pg. 143)

Linear Discriminant Analysis (LDA)

Pros:

  • Easily handles an arbitrary number of classes \(K\).
  • Provides added stability of parameter estimates via priors on class size.

Cons:

  • Gaussian assumption may not be appropriate.

Green Eggs and Spam

spambase <- read.csv("data/spambase.csv")
# column 58 (X1) is the class label: 1 = spam, 0 = not spam
head(spambase)[1:3, c(58, 1:10)]
##   X1   X0 X0.64 X0.64.1 X0.1 X0.32 X0.2 X0.3 X0.4 X0.5 X0.6
## 1  1 0.21  0.28    0.50    0  0.14 0.28 0.21 0.07 0.00 0.94
## 2  1 0.06  0.00    0.71    0  1.23 0.19 0.19 0.12 0.64 0.25
## 3  1 0.00  0.00    0.00    0  0.63 0.00 0.31 0.63 0.31 0.63
tail(spambase)[-(1:3), c(58, 1:10)]
##      X1   X0 X0.64 X0.64.1 X0.1 X0.32 X0.2 X0.3 X0.4 X0.5 X0.6
## 4598  0 0.30     0    0.30    0  0.00    0    0    0    0    0
## 4599  0 0.96     0    0.00    0  0.32    0    0    0    0    0
## 4600  0 0.00     0    0.65    0  0.00    0    0    0    0    0

Green Eggs and Spam

# Some variables...
# word_freq_make
# word_freq_address
# word_freq_all
# word_freq_3d
# word_freq_our
# word_freq_over
# word_freq_remove
# word_freq_internet

# Make training and testing sets: roughly 20% of rows go to the test set
# (no seed is set, so the exact split and the results below will vary)
test <- runif(nrow(spambase)) < 0.2
spam.test  <- spambase[test, ]
spam.train <- spambase[!test, ]

Green Eggs and Spam

Logistic Regression

fit <- glm(X1 ~ ., data = spam.train, family = "binomial")
pred <- predict(fit, newdata = spam.test, type = "response")
(tab <- table(pred = as.numeric(pred > 0.5), true = spam.test[, 58]))
##     true
## pred   0   1
##    0 538  37
##    1  29 365
sum(diag(tab))/sum(tab)  # classification rate
## [1] 0.9319

Green Eggs and Spam

Linear Discriminant Analysis

library(MASS)
fit <- lda(X1 ~ ., data = spam.train)
pred <- predict(fit, spam.test)$class
(tab <- table(pred = pred, true = spam.test[, 58]))
##     true
## pred   0   1
##    0 539  80
##    1  28 322
sum(diag(tab))/sum(tab)  # classification rate
## [1] 0.8885

Which do I use?

Logistic Regression

  • Response \(Y\) is binary (\(K=2\))
  • Used when the Gaussian assumption of LDA is not valid

Linear Discriminant Analysis

  • Response \(Y\) is not binary (\(K > 2\))
  • More powerful than LR if the Gaussian assumption is valid
  • Can incorporate prior information on relative size of classes through \(P(Y = k)\)

Algorithmic Models

\(k\)-Nearest Neighbors

Here, \(k\) refers to the number of nearby points considered, NOT the number of classes.

The class membership of any unclassified object is determined by a "majority vote" from the classes of its \(k\) nearest neighbors. (Ties are typically broken at random.)

Of course, the distance metric applied to any two data points \(\pmb{x_1}\) and \(\pmb{x_2}\) defines which neighbors are nearest.

  • Euclidean \(\quad D(\pmb{x_1},\pmb{x_2}) = \left\{\sum_{j=1}^p (x_{1j} - x_{2j})^2\right\}^{1/2}\)
  • Manhattan \(\quad D(\pmb{x_1},\pmb{x_2}) = \sum_{j=1}^p |x_{1j} - x_{2j}|\)
  • Chebyshev \(\quad D(\pmb{x_1},\pmb{x_2}) = \max_{1 \le j \le p} |x_{1j} - x_{2j}|\)
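
For example, a minimal sketch of \(k\)-NN with Euclidean distance via class::knn() on the built-in iris data (the 100-row training split below is arbitrary):

library(class)

train <- sample(nrow(iris), 100)          # arbitrary training indices
pred  <- knn(train = iris[train, 1:4],    # training features
             test  = iris[-train, 1:4],   # test features
             cl    = iris$Species[train], # training class labels
             k     = 5)                   # each test point gets a 5-neighbor vote
table(pred = pred, true = iris$Species[-train])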

\(k\)-Nearest Neighbors