Methodology of the Naïve Bayes Algorithm

This chapter provides more insight into the Naïve Bayes algorithm and shows how the method works. It also describes how our model will be developed, which data sets will be used in the process and how they were chosen. Finally, it covers feature selection and how it will be applied.


Bayes' rule:

$$P(H \mid E) = \frac{P(E \mid H)\, P(H)}{P(E)}$$

The fundamental concept of Bayes' rule is that the probability of a hypothesis or an event (H) can be calculated from some observed evidence (E). From Bayes' rule, we have:
1. The prior probability of H, or P(H): This is the probability of the event before observing the evidence.
2. The posterior probability of H, or P(H | E): This is the probability of the event after observing the evidence.
For example, to estimate the probability of a mail being classified as belonging to the Human Resources (HR) class, we usually use evidence such as how frequently words like “Employment” are used.

Using the equation above, let ‘HR’ be the event that a mail belongs to HR and ‘Employment’ be the evidence that the word “Employment” appears in the mail. Then we have

$$P(\mathrm{HR} \mid \mathrm{Employment}) = \frac{P(\mathrm{Employment} \mid \mathrm{HR})\, P(\mathrm{HR})}{P(\mathrm{Employment})}$$

P(Employment | HR) is the probability that the word “Employment” occurs in a mail to HR. Of course, “Employment” could occur in many other mail classes such as Joint Venture or Procurement and Contracting, but here we only consider “Employment” in the context of class “HR”. This probability can be obtained from historical mail collections.
P(HR) is the prior probability of the HR class. This probability can be estimated from records, for example, the share of HR mails among all mails received throughout a year.
P(Employment) is the probability of the word “Employment” occurring in any mail. Again, this can be estimated from the records, but the evidence is usually not recorded as well as the main event. Therefore, the full evidence, i.e., P(Employment), is sometimes hard to obtain.
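
The HR example can be worked through numerically. The following is a minimal Python sketch, not the implementation used in this work: all counts are invented for illustration, and the function name posterior is hypothetical. It also assumes that, when P(Employment) is not recorded directly, it can be recovered from the per-class quantities with the law of total probability, here using only two classes (HR and everything else).

```python
from fractions import Fraction  # exact arithmetic, easy to check by hand


def posterior(likelihood, prior, evidence):
    """Bayes' rule: P(H | E) = P(E | H) * P(H) / P(E)."""
    if evidence == 0:
        raise ValueError("P(E) must be non-zero")
    return likelihood * prior / evidence


# Hypothetical counts from a historical mail collection (illustration only).
total_mails = 1000      # all mails in the collection
hr_mails = 200          # mails that belonged to HR
hr_with_word = 120      # HR mails containing "Employment"
other_with_word = 40    # non-HR mails containing "Employment"

p_hr = Fraction(hr_mails, total_mails)                  # P(HR)
p_word_given_hr = Fraction(hr_with_word, hr_mails)      # P(Employment | HR)
p_word_given_other = Fraction(other_with_word, total_mails - hr_mails)

# Law of total probability:
# P(Employment) = P(Employment | HR) P(HR) + P(Employment | not HR) P(not HR)
p_word = p_word_given_hr * p_hr + p_word_given_other * (1 - p_hr)

p_hr_given_word = posterior(p_word_given_hr, p_hr, p_word)
print(p_hr_given_word)  # 3/4
```

With these made-up counts, observing “Employment” raises the probability of the HR class from the prior of 1/5 to a posterior of 3/4.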

As the example above shows, we can predict the outcome of an event by observing a collection of data. Generally, it is “better” to have more evidence to support the prediction of an event: typically, the more evidence we can gather, the better the classification accuracy that can be obtained. However, the evidence must relate to the event (it must make sense). For example, if we add “Purchase Order” as evidence to the above example, the model might yield worse performance. This is because the “HR” class is not related to the evidence “Purchase Order”, i.e., if Purchase Order appears in a mail, it doesn't mean that the mail is meant for HR.
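
One way to see why unrelated evidence does not help is to compare the posterior under informative and uninformative evidence. The sketch below is an illustration under assumed numbers, not a result from our data: the probabilities are invented, the helper posterior_two_class is hypothetical, and only two classes (HR and not HR) are considered. When a word is equally likely in HR and non-HR mails, the posterior equals the prior, so the word says nothing about the class.

```python
from fractions import Fraction


def posterior_two_class(prior, p_e_given_h, p_e_given_not_h):
    """P(H | E) for two classes, with P(E) from the law of total probability."""
    p_e = p_e_given_h * prior + p_e_given_not_h * (1 - prior)
    return p_e_given_h * prior / p_e


prior_hr = Fraction(1, 5)  # assumed P(HR)

# Informative evidence: the word is far more common in HR mails.
informative = posterior_two_class(prior_hr, Fraction(3, 5), Fraction(1, 20))

# Unrelated evidence: the word is equally common inside and outside HR.
unrelated = posterior_two_class(prior_hr, Fraction(1, 10), Fraction(1, 10))

print(informative)  # 3/4
print(unrelated)    # 1/5, identical to the prior
```

Evidence of the second kind adds nothing useful to the prediction. With probabilities estimated from limited data it can also add noise, which is consistent with the caution above.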

If we have more evidence for developing our Naïve Bayes classifier, we may run into the problem of dependencies, that is to say, some evidence may depend on one or more other pieces of evidence. For instance, the presence of the word...

