October 17, 2014

\(\DeclareMathOperator*{\argmin}{arg\,min}\) \(\DeclareMathOperator*{\logit}{\text{logit}}\)

SLG Logistics

Let's thank Brian Gaines, Josh Day, and Will Burton for their great talks on Regularization & Penalization in Regression!

Reminder: Slides, code, and videos are available on Justin Post's website.

Today: First of our three-part classification miniseries.

On the horizon:

  • Oct. 24: Suchit Mehrotra – Bayesian Classification Methods
  • Oct. 31: TBD – TBD

Outline

  1. Why Classification?
  2. Notation
  3. The Two Cultures
  4. Data Models
  5. Algorithmic Models
  6. Summary

Why Classification?

We're Born to Classify!

Examples of Supervised Classification

  • Automatic Speech Recognition
    • Siri
  • Natural Language Processing
    • Google Translate
    • Part-of-speech tagging and syntactic parsing
  • Image Classification
    • Facial recognition
    • Traffic sign identification in self-driving cars
    • Brain tumor diagnosis
  • Consumer Analysis
    • Demographic analysis
    • "Customers who bought this item also bought…"

Notation

Notation

  • \(y_i\): output or response variable (categorical)
    • \(i = 1,...,n\)
  • \(x_{ij}\): input, predictor, or feature variable
    • \(j = 1,...,p\)

Matrix notation

\(\pmb{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} \) and \(\quad \pmb{X} = \begin{pmatrix} \pmb{x_1}^T \\ \pmb{x_2}^T \\ \vdots \\ \pmb{x_n}^T \end{pmatrix} = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}\)

Classes

Let classes be indexed by \(k=1,\ldots,K\).

If \(K=2\), then \(Y\) is binary, taking values in \(\{0, 1\}\).

Otherwise, if \(K>2\), then \(Y\) takes values in \(\{1,2,\ldots,K\}\).

The Two Cultures

Statistical Modeling: The Two Cultures

Breiman (2001)

Data Modeling

Algorithmic Modeling

Data Models

OLS Regression

Simple idea: Model \(P(Y = 1\,|\,\pmb{x}) = \pmb{x}^T\pmb{\beta} + \beta_0\)

library(ggplot2)
ggplot(cancer, aes(x = thickness, y = malignant)) + 
    stat_smooth(method = "lm", se=FALSE) + geom_point(size = 3, alpha = 0.1)

[Figure: OLS line fit to the cancer data, malignant vs. thickness (chunk cancer-ols)]

OLS Regression

Pros:

  • Your mother still loves you.

Cons:

  • Predicts probabilities \(<0\) and \(>1\)
  • If \(K > 2\), the response \(\pmb{Y}\) is difficult to define appropriately, if at all.

Surely we can do better.

Logistic Regression

Solution:

Do not model the probabilities linearly! Instead, use a function that is bounded between 0 and 1, such as the logistic function:

\(P(Y = 1\,|\,\pmb{x}) = \frac{\exp(\pmb{x}^T\pmb{\beta} + \beta_0)}{1 + \exp(\pmb{x}^T\pmb{\beta} + \beta_0)} \implies \log\left(\frac{P(Y = 1\,|\,\pmb{x})}{1 - P(Y = 1\,|\,\pmb{x})}\right) = \pmb{x}^T\pmb{\beta} + \beta_0\)

  • \(\frac{\exp(\pmb{x}^T\pmb{\beta} + \beta_0)}{1 + \exp(\pmb{x}^T\pmb{\beta} + \beta_0)}\) is the logistic function applied to our data \(\pmb{x}\); hence, logistic regression.
  • \(\log\left(\frac{p}{1-p}\right)\) is called the log-odds or logit of \(p\).
  • The logit is linear in \(\pmb{x}\), making this a linear classifier.
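
As a quick check on the two equivalent forms above, here is a minimal sketch using base R's plogis() and qlogis(), which implement the logistic function and the logit, respectively (the eta values below are arbitrary):

# plogis() is the logistic function; qlogis() is its inverse, the logit
eta <- seq(-4, 4, by = 2)  # arbitrary linear predictor values x'beta + beta_0
p   <- plogis(eta)         # P(Y = 1 | x) = exp(eta) / (1 + exp(eta))
all.equal(qlogis(p), eta)  # taking the logit of p recovers the linear predictor
## [1] TRUE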

Logistic Regression

ggplot(cancer, aes(x = thickness, y = malignant)) + 
    stat_smooth(method = "glm", family = "binomial", se = FALSE) + 
    geom_point(size = 3, alpha = 0.1)

[Figure: logistic regression fit to the cancer data, malignant vs. thickness (chunk cancer-logistic)]

Logistic Regression

Pros:

  • Familiar framework: regression!
  • Allows for statistical inference on the parameters.

Cons:

  • Extensions to \(K>2\) (multinomial logistic regression) exist but are less widely used.

Linear Discriminant Analysis (LDA)

In logistic regression (LR) we directly model \(P(Y = k\,|\,\pmb{X} = \pmb{x})\) via the logistic function.

In LDA we model \(P(\pmb{X} = \pmb{x}\,|\,Y = k)\) and Bayes' Theorem tells us that

\[ P(Y = k\,|\,\pmb{X} = \pmb{x}) = \frac{P(\pmb{X} = \pmb{x}\,|\,Y = k)P(Y = k)}{\sum_{l=1}^K P(\pmb{X} = \pmb{x}\,|\,Y = l)P(Y = l)} \] where \(\sum_{k=1}^K P(Y=k) = 1\).

Linear Discriminant Analysis (LDA)

LDA places two primary assumptions on \(P(\pmb{X} = \pmb{x}\,|\,Y = k)\):

  1. Conditional on \(Y = k\), \(\pmb{X}\) follows a \(p\)-dimensional Gaussian distribution with mean \(\pmb{\mu_k}\) and covariance matrix \(\pmb{\Sigma_k}\).
  2. All classes share a common covariance matrix, i.e., \(\pmb{\Sigma_k} = \pmb{\Sigma}\) for all \(k=1,\ldots,K\).

Then the class density of \(\pmb{X}\) given \(Y = k\) is \[ P(\pmb{X} = \pmb{x}\,|\,Y = k) = (2\pi)^{-p/2}|\pmb{\Sigma}|^{-1/2}\exp\left\{-\frac{1}{2}(\pmb{x} - \pmb{\mu_k})^T\pmb{\Sigma}^{-1}(\pmb{x} - \pmb{\mu_k})\right\}\]
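
To see the formula in action, here is a minimal sketch checking it against mvtnorm::dmvnorm() for a made-up \(\pmb{\mu_k}\) and \(\pmb{\Sigma}\) (all values below are arbitrary):

library(mvtnorm)
p     <- 2
mu_k  <- c(1, -1)                      # made-up class-k mean
Sigma <- matrix(c(1, 0.3, 0.3, 1), p)  # made-up common covariance matrix
x     <- c(0.5, 0)

# the class density written out as above ...
dens <- (2 * pi)^(-p / 2) * det(Sigma)^(-1 / 2) *
    exp(-0.5 * t(x - mu_k) %*% solve(Sigma) %*% (x - mu_k))

# ... agrees with the multivariate normal density from mvtnorm
all.equal(as.numeric(dens), dmvnorm(x, mean = mu_k, sigma = Sigma))
## [1] TRUE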

Linear Discriminant Analysis (LDA)

Consider the log-ratio between classes \(k\) and \(l\), \[ \begin{aligned} \log\frac{P(Y=k\,|\,\pmb{X} = \pmb{x})}{P(Y = l\,|\,\pmb{X} = \pmb{x})} &= \log\frac{P(\pmb{X} = \pmb{x}\,|\,Y = k)}{P(\pmb{X} = \pmb{x}\,|\,Y = l)} + \log\frac{P(Y=k)}{P(Y=l)} \\ &= \pmb{x}^T\pmb{\Sigma}^{-1}(\pmb{\mu_k} - \pmb{\mu_l}) - \frac{1}{2}(\pmb{\mu_k} + \pmb{\mu_l})^T\pmb{\Sigma}^{-1}(\pmb{\mu_k} - \pmb{\mu_l}) + \log\frac{P(Y=k)}{P(Y=l)}. \end{aligned} \]

(Notice the normalizing constant \((2\pi)^{-p/2}|\pmb{\Sigma}|^{-1/2}\) and quadratic term \(-\frac{1}{2}\pmb{x}^T\pmb{\Sigma}^{-1}\pmb{x}\) cancel due to the common covariance assumption.)

If this log-ratio is positive, \(\pmb{x}\) is more likely to have arisen from class \(k\) than from class \(l\).

Linear Discriminant Analysis (LDA)

Over all classes \(k=1,\ldots,K\), the data \(\pmb{x}\) is most likely to have arisen from the class \(k\) that maximizes \[ \delta_k(\pmb{x}) = \pmb{x}^T\pmb{\Sigma}^{-1}\pmb{\mu_k} - \frac{1}{2}\pmb{\mu_k}^T\pmb{\Sigma}^{-1}\pmb{\mu_k} + \log P(Y=k). \]

\(\delta_k(\pmb{x})\) is linear in \(\pmb{x}\), so LDA is a linear classifier (like LR).
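
Here is a rough sketch of computing these scores directly from plug-in estimates (class means, pooled covariance, and sample class proportions for \(P(Y=k)\)) on the built-in iris data; in practice, MASS::lda(), used in the spam example below, handles this estimation for us.

X <- as.matrix(iris[, 1:4])
y <- iris$Species

mus    <- lapply(split(as.data.frame(X), y), colMeans)  # class means mu_k
priors <- table(y) / length(y)                          # sample estimates of P(Y = k)
Sigma  <- Reduce(`+`, lapply(split(as.data.frame(X), y),
                 function(d) cov(d) * (nrow(d) - 1))) / (nrow(X) - 3)  # pooled covariance
Sinv   <- solve(Sigma)

# discriminant scores delta_k(x) for a single observation x
delta <- function(x) sapply(names(mus), function(k) {
    mu <- mus[[k]]
    drop(x %*% Sinv %*% mu - 0.5 * t(mu) %*% Sinv %*% mu + log(priors[[k]]))
})
names(which.max(delta(X[1, ])))  # class with the largest score
## [1] "setosa"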

Linear Discriminant Analysis (LDA)

  • Dashed Lines – optimal Bayes decision boundary
  • Solid Lines – decision boundary estimated by LDA

(image from An Introduction to Statistical Learning, pg. 143)

Linear Discriminant Analysis (LDA)

Pros:

  • Easily handles an arbitrary number of classes \(K\).
  • Provides added stability of parameter estimates via priors on class size.

Cons:

  • Gaussian assumption may not be appropriate.

Green Eggs and Spam

spambase <- read.csv("data/spambase.csv")
# column 58 (X1) is the class label: 1 = spam, 0 = not spam
head(spambase)[1:3, c(58, 1:10)]
##   X1   X0 X0.64 X0.64.1 X0.1 X0.32 X0.2 X0.3 X0.4 X0.5 X0.6
## 1  1 0.21  0.28    0.50    0  0.14 0.28 0.21 0.07 0.00 0.94
## 2  1 0.06  0.00    0.71    0  1.23 0.19 0.19 0.12 0.64 0.25
## 3  1 0.00  0.00    0.00    0  0.63 0.00 0.31 0.63 0.31 0.63
tail(spambase)[-(1:3), c(58, 1:10)]
##      X1   X0 X0.64 X0.64.1 X0.1 X0.32 X0.2 X0.3 X0.4 X0.5 X0.6
## 4598  0 0.30     0    0.30    0  0.00    0    0    0    0    0
## 4599  0 0.96     0    0.00    0  0.32    0    0    0    0    0
## 4600  0 0.00     0    0.65    0  0.00    0    0    0    0    0

Green Eggs and Spam

# Some variables...
# word_freq_make
# word_freq_address
# word_freq_all
# word_freq_3d
# word_freq_our
# word_freq_over
# word_freq_remove
# word_freq_internet

# Make training and testing sets: roughly 20% of rows go to the test set
# (no seed is set, so the exact split and the results below will vary)
test <- runif(nrow(spambase)) < 0.2
spam.test  <- spambase[test, ]
spam.train <- spambase[!test, ]

Green Eggs and Spam

Logistic Regression

fit <- glm(X1 ~ ., data = spam.train, family = "binomial")
pred <- predict(fit, newdata = spam.test, type = "response")
(tab <- table(pred = as.numeric(pred > 0.5), true = spam.test[, 58]))
##     true
## pred   0   1
##    0 538  37
##    1  29 365
sum(diag(tab))/sum(tab)  # classification rate
## [1] 0.9319

Green Eggs and Spam

Linear Discriminant Analysis

library(MASS)
fit <- lda(X1 ~ ., data = spam.train)
pred <- predict(fit, spam.test)$class
(tab <- table(pred = pred, true = spam.test[, 58]))
##     true
## pred   0   1
##    0 539  80
##    1  28 322
sum(diag(tab))/sum(tab)  # classification rate
## [1] 0.8885

Which do I use?

Logistic Regression

  • Response \(Y\) is binary (\(K=2\))
  • Used when the Gaussian assumption of LDA is not valid

Linear Discriminant Analysis

  • Response \(Y\) is not binary (\(K > 2\))
  • More powerful than LR if the Gaussian assumption is valid
  • Can incorporate prior information on relative size of classes through \(P(Y = k)\)

Algorithmic Models

\(k\)-Nearest Neighbors

Here, \(k\) refers to the number of nearby points considered, NOT the number of classes.

The class membership of any unclassified object is determined by a "majority vote" from the classes of its \(k\) nearest neighbors. (Ties are typically broken at random.)

Of course, the distance metric applied to any two data points \(\pmb{x_1}\) and \(\pmb{x_2}\) defines which neighbors are nearest.

  • Euclidean \(\quad D(\pmb{x_1},\pmb{x_2}) = \left\{\sum_{j=1}^p (x_{1j} - x_{2j})^2\right\}^{1/2}\)
  • Manhattan \(\quad D(\pmb{x_1},\pmb{x_2}) = \sum_{j=1}^p |x_{1j} - x_{2j}|\)
  • Chebyshev \(\quad D(\pmb{x_1},\pmb{x_2}) = \max_{1 \le j \le p} |x_{1j} - x_{2j}|\)
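
For example, a minimal sketch of \(k\)-NN with Euclidean distance via class::knn() on the built-in iris data (the 100-row training split below is arbitrary):

library(class)

train <- sample(nrow(iris), 100)          # arbitrary training indices
pred  <- knn(train = iris[train, 1:4],    # training features
             test  = iris[-train, 1:4],   # test features
             cl    = iris$Species[train], # training class labels
             k     = 5)                   # each test point gets a 5-neighbor vote
table(pred = pred, true = iris$Species[-train])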

\(k\)-Nearest Neighbors