Naive Bayes vs Logistic Regression

I recently came across Tom Mitchell's comparison of Naive Bayes and Logistic Regression. Under certain assumptions, Naive Bayes can be interpreted in terms of Logistic Regression; furthermore, as the number of training examples approaches infinity, and under those same assumptions, the NB and LR classifiers become identical. In this blog I'll explain the bias of Naive Bayes, including why we use Naive Bayes instead of unbiased learning of Bayes classifiers, along with the relationship between Naive Bayes and Logistic Regression.

1. Why Naive Bayes?

Consider the unbiased learning of Bayes classifiers. Given a supervised learning problem, we want to estimate a function f that maps input X to output Y, f : X -> Y, or in other words, P(Y|X), where X is the vector of attributes and Y can take on k classes.
Applying Bayes rule, we have the following:

$$P(Y = y_k \mid X) = \frac{P(X \mid Y = y_k)\,P(Y = y_k)}{\sum_{j} P(X \mid Y = y_j)\,P(Y = y_j)}$$
X is a vector of n attributes; let's assume each attribute takes on either 0 or 1, so X itself can take on 2^n distinct joint values. To represent P(X|Y=y_k) we then need 2^n parameters; however, since these probabilities must sum to 1 over all values of X, only 2^n - 1 of them are free per class. With Y taking on k classes, the total is k(2^n - 1).
As we can see, the number of parameters scales exponentially as the number of attributes increases. If n = 30 and k = 2, the number of parameters needed is 2(2^30 - 1), about 2.1 billion! This is far too expensive to estimate reliably, and therefore unreasonable to use, especially when the number of attributes is large.
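To make this explosion concrete, here's a tiny Python sketch (my own illustration, not from the book) that enumerates the joint values of X that P(X|Y=y_k) must cover for a small n:

```python
from itertools import product

# Tiny example: n = 3 binary attributes, k = 2 classes.
# Unbiased learning of P(X | Y = y_k) needs one probability per joint
# setting of X, and those probabilities sum to 1 within each class.
n, k = 3, 2
joint_values = list(product([0, 1], repeat=n))
print(len(joint_values))            # 2^3 = 8 joint settings of (X1, X2, X3)
print(len(joint_values) - 1)        # 7 free parameters per class
print(k * (len(joint_values) - 1))  # 14 parameters in total for k = 2
```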

However, under a specific constraint, namely that each Xi is conditionally independent of every other attribute (and of every subset of the other attributes) given Y, the number of parameters needed reduces significantly. Let's see how.
Under the above constraint, we can rewrite P(X|Y) as the following:

$$P(X_1, \ldots, X_n \mid Y) = \prod_{i=1}^{n} P(X_i \mid Y)$$

since, by the chain rule,

$$P(X_1, X_2 \mid Y) = P(X_1 \mid X_2, Y)\,P(X_2 \mid Y)$$

and, if X1 and X2 are conditionally independent given Y, P(X1|X2,Y) = P(X1|Y); the same argument extends to all n attributes.
We are still under the assumption that each attribute can take on 2 values and Y can take on k values. The number of parameters needed to represent P(Xi|Y=y_k) is 1, since P(Xi=0|Y=y_k) = 1 - P(Xi=1|Y=y_k). We have n attributes and k classes, so the total number of parameters is kn. In the case that each Xi can take on j values, the number of parameters needed is k(j-1)n, while for unbiased learning of Bayes classifiers it is k(j^n - 1). This is a significant reduction in the number of parameters, from an exponential function of n to a linear function of n. Looking again at the example above, for n = 30 and k = 2, the number of parameters needed by Naive Bayes is kn = 60.
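As a quick sanity check on these counts, here's a small sketch (the function names are mine) that evaluates both formulas:

```python
# Parameter counts when each of the n attributes takes j values and Y takes k values.
def naive_bayes_params(n, j, k):
    return k * (j - 1) * n       # k(j-1)n: (j-1) free values per attribute per class

def unbiased_bayes_params(n, j, k):
    return k * (j ** n - 1)      # k(j^n - 1): a full joint table per class

n, j, k = 30, 2, 2
print(naive_bayes_params(n, j, k))     # 60
print(unbiased_bayes_params(n, j, k))  # 2147483646 (~2.1 billion)
```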

So, we can see that under the constraint (bias) that every attribute is conditionally independent of the others given Y, the number of parameters needed drops significantly, making the model much cheaper to estimate, especially when the number of attributes is large.

2. Logistic Regression

I'll introduce logistic regression with the same problem to solve as above: given a supervised learning problem, we want to estimate a function f that maps input X to output Y, f : X -> Y, or in other words, P(Y|X), where X is the vector of attributes and Y can take on k classes.
Let Y be a boolean value. We then have the following:

$$P(Y = 1 \mid X) = \frac{\exp\left(w_0 + \sum_{i=1}^{n} w_i X_i\right)}{1 + \exp\left(w_0 + \sum_{i=1}^{n} w_i X_i\right)}, \qquad P(Y = 0 \mid X) = \frac{1}{1 + \exp\left(w_0 + \sum_{i=1}^{n} w_i X_i\right)}$$
Let the threshold be 0.5, i.e. if P(Y=1|X) > P(Y=0|X), we'll assign 1 to Y, and vice versa.
Therefore we will assign 1 to Y if

$$\frac{P(Y = 1 \mid X)}{P(Y = 0 \mid X)} = \exp\left(w_0 + \sum_{i=1}^{n} w_i X_i\right) > 1.$$

Taking the natural log of both sides, this leads to the condition that we will assign 1 to Y if

$$w_0 + \sum_{i=1}^{n} w_i X_i > 0$$
and 0 otherwise. This is exactly the decision rule of the perceptron algorithm!
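Here's a minimal sketch of this decision rule, with arbitrary weights chosen purely for illustration; it checks that thresholding P(Y=1|X) at 0.5 gives the same answer as testing the sign of w0 + sum(wi * Xi):

```python
import math

def p_y1_given_x(w0, w, x):
    """Logistic regression posterior P(Y = 1 | X)."""
    z = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return math.exp(z) / (1.0 + math.exp(z))

# Arbitrary weights and input, purely for illustration.
w0, w = -1.0, [2.0, -0.5, 1.5]
x = [1, 0, 1]

z = w0 + sum(wi * xi for wi, xi in zip(w, x))
print(p_y1_given_x(w0, w, x) > 0.5)  # True: posterior above the 0.5 threshold
print(z > 0)                         # True: same decision from the sign of the linear score
```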

3. Naive Bayes vs Logistic Regression

We have the following assumptions for Naive Bayes:
  • Y is a boolean (Bernoulli distribution).
  • X = (X1, X2, ..., Xn) is a vector of boolean attributes.
  • The Xi's are conditionally independent given Y.
Let

$$\pi = P(Y = 1), \qquad \theta_{i1} = P(X_i = 1 \mid Y = 1), \qquad \theta_{i0} = P(X_i = 1 \mid Y = 0)$$

We also know that

$$P(Y = 0) = 1 - \pi, \qquad P(X_i = 0 \mid Y = 1) = 1 - \theta_{i1}, \qquad P(X_i = 0 \mid Y = 0) = 1 - \theta_{i0}$$

Applying Bayes rule together with the conditional independence assumption, we have

$$P(Y = 1 \mid X) = \frac{\pi \prod_{i=1}^{n} \theta_{i1}^{X_i} (1 - \theta_{i1})^{1 - X_i}}{\pi \prod_{i=1}^{n} \theta_{i1}^{X_i} (1 - \theta_{i1})^{1 - X_i} + (1 - \pi) \prod_{i=1}^{n} \theta_{i0}^{X_i} (1 - \theta_{i0})^{1 - X_i}} = \frac{\exp\left(w_0 + \sum_{i=1}^{n} w_i X_i\right)}{1 + \exp\left(w_0 + \sum_{i=1}^{n} w_i X_i\right)}$$

We can see that under the constraints above, Naive Bayes can be interpreted in terms of Logistic Regression, where

$$w_0 = \ln \frac{\pi}{1 - \pi} + \sum_{i=1}^{n} \ln \frac{1 - \theta_{i1}}{1 - \theta_{i0}}, \qquad w_i = \ln \frac{\theta_{i1}\,(1 - \theta_{i0})}{\theta_{i0}\,(1 - \theta_{i1})}$$
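To see the correspondence numerically, here's a small sketch (the values of π and the θ's are made up for illustration) that computes the posterior directly from the Naive Bayes factorization and again through the weights above, and checks that the two agree:

```python
import math

# Made-up Naive Bayes parameters for n = 3 binary attributes.
pi = 0.4                   # P(Y = 1)
theta1 = [0.8, 0.3, 0.6]   # theta_i1 = P(X_i = 1 | Y = 1)
theta0 = [0.2, 0.5, 0.7]   # theta_i0 = P(X_i = 1 | Y = 0)

def nb_posterior(x):
    """P(Y = 1 | X) computed directly from the Naive Bayes factorization."""
    p1, p0 = pi, 1 - pi
    for xi, t1, t0 in zip(x, theta1, theta0):
        p1 *= t1 if xi else (1 - t1)
        p0 *= t0 if xi else (1 - t0)
    return p1 / (p1 + p0)

# The corresponding logistic regression weights.
w0 = math.log(pi / (1 - pi)) + sum(math.log((1 - t1) / (1 - t0))
                                   for t1, t0 in zip(theta1, theta0))
w = [math.log(t1 * (1 - t0) / (t0 * (1 - t1))) for t1, t0 in zip(theta1, theta0)]

def lr_posterior(x):
    """P(Y = 1 | X) from the logistic form with the weights above."""
    z = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return math.exp(z) / (1.0 + math.exp(z))

x = [1, 0, 1]
print(nb_posterior(x), lr_posterior(x))  # both print the same probability
```

On finite data, a discriminatively trained logistic regression would generally pick different weights; the two coincide only in the infinite-data limit mentioned in the introduction.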

Details for the case where X is a vector of continuous random variables, under the assumption that P(Xi|Y=y_k) is a Gaussian distribution, can be found in Mitchell's Machine Learning book.

I find it super fascinating to see the connection between Naive Bayes and Logistic Regression, something that I've never thought of before. In the next blog I'll show a small coding example that uses both Naive Bayes and Logistic Regression to solve the same problem.

Reference: Mitchell's Machine Learning book.
