
Information gain analysis

ID3 uses information gain as its attribute selection measure. This measure is based on pioneering work by Claude Shannon on information theory, which studied the value or “information content” of messages. Let node N represent or hold the tuples of partition D. The attribute with the highest information gain is chosen as the splitting attribute for node N. This attribute minimizes the information needed to classify the tuples in the resulting partitions and reflects the least randomness or “impurity” in these partitions. Such an approach minimizes the expected number of tests needed to classify a given tuple and guarantees that a simple (but not necessarily the simplest) tree is found. The expected information needed to classify a tuple in D is given by

Info(D) = -Σ_{i=1}^{m} p_i log2(p_i),

where p_i is the nonzero probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_{i,D}|/|D|, and m is the number of classes. A log function to the base 2 is used because the information is encoded in bits.

However, it is quite likely that the partitions will be impure (e.g., a partition may contain tuples from several different classes rather than from a single class). How much more information would we still need (after the partitioning) in order to arrive at an exact classification? This amount is measured by

Info_A(D) = Σ_{j=1}^{v} (|D_j|/|D|) × Info(D_j),

where attribute A partitions D into v subsets, {D_1, ..., D_v}.

The term |D_j|/|D| acts as the weight of the jth partition. Info_A(D) is the expected information required to classify a tuple from D based on the partitioning by A. The smaller the expected information (still) required, the greater the purity of the partitions. Information gain is defined as the difference between the original information requirement (i.e., based on just the proportion of classes) and the new requirement (i.e., obtained after partitioning on A). That is,

Gain(A) = Info(D) - Info_A(D).

In other words, Gain(A) tells us how much would be gained by branching on A. It is the expected reduction in the information requirement caused by knowing the value of A. The attribute A with the highest information gain, Gain(A), is chosen as the splitting attribute at node N. This is equivalent to saying that we want to partition on the attribute A that would do the “best classification,” so that the amount of information still required to finish classifying the tuples is minimal (i.e., minimum Info_A(D)).
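The definitions above can be sketched in Python. This is a minimal illustration, not the ID3 reference implementation; the helper names `info`, `info_after_split`, and `gain` are my own:

```python
import math
from collections import Counter, defaultdict

def info(labels):
    # Info(D) = -sum_i p_i * log2(p_i): expected information (entropy)
    # needed to classify a tuple, from the class-label proportions.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_after_split(rows, labels, attr):
    # Info_A(D): entropy of each partition induced by attribute `attr`,
    # weighted by the partition's share |D_j|/|D| of the tuples.
    n = len(labels)
    partitions = defaultdict(list)
    for row, label in zip(rows, labels):
        partitions[row[attr]].append(label)
    return sum(len(part) / n * info(part) for part in partitions.values())

def gain(rows, labels, attr):
    # Gain(A) = Info(D) - Info_A(D): expected reduction in the
    # information requirement from branching on `attr`.
    return info(labels) - info_after_split(rows, labels, attr)
```

ID3 would call `gain` for every candidate attribute at a node and split on the maximizer.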

Example

Induction of a decision tree using information gain. Table 6. presents a training set, D, of class-labeled tuples randomly selected from the AllElectronics customer database. (The data are adapted from [Qui86]. In this example, each attribute is discrete-valued. Continuous-valued attributes have been generalized.) The class label attribute, buys computer, has two distinct values (namely, {yes, no}); therefore, there are two distinct classes (that is, m = 2). Let class C1 correspond to yes and class C2 correspond to no. There are nine tuples of class yes and five tuples of class no. A (root) node N is created for the tuples in D. To find the splitting criterion for these tuples, we must compute the information gain of each attribute. We first compute the expected information needed to classify a tuple in D:

Info(D) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940 bits.

Next, we need to compute the expected information requirement for each attribute. Let’s start with the attribute age. We need to look at the distribution of yes and no tuples for each category of age. For the age category youth, there are two yes tuples and three no tuples. For the category middle aged, there are four yes tuples and zero no tuples. For the category senior, there are three yes tuples and two no tuples. Using Equation (6.2), the expected information needed to classify a tuple in D if the tuples are partitioned according to age is

Info_age(D) = (5/14) × (-(2/5) log2(2/5) - (3/5) log2(3/5))
            + (4/14) × (-(4/4) log2(4/4))
            + (5/14) × (-(3/5) log2(3/5) - (2/5) log2(2/5))
            = 0.694 bits.

Hence, the gain in information from such a partitioning would be

Gain(age) = Info(D) - Info_age(D) = 0.940 - 0.694 = 0.246 bits.
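As a sanity check on the arithmetic, the age computation can be reproduced directly from the class counts stated above (a minimal sketch; `entropy` is an illustrative helper, not from the source):

```python
import math

def entropy(counts):
    # Info from class counts: -sum p_i * log2(p_i), with 0*log(0) taken as 0.
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

# Whole training set: 9 "yes" tuples vs. 5 "no" tuples.
info_d = entropy([9, 5])            # ~0.940 bits

# Partitions by age: youth (2 yes, 3 no), middle aged (4, 0), senior (3, 2).
info_age = (5/14) * entropy([2, 3]) \
         + (4/14) * entropy([4, 0]) \
         + (5/14) * entropy([3, 2])  # ~0.694 bits

gain_age = info_d - info_age        # ~0.246 bits (0.940 - 0.694)
```

The small discrepancy in the last digit (0.2468 vs. 0.246) comes from the text rounding the intermediate values before subtracting.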

Similarly, we can compute Gain(income) = 0.029 bits, Gain(student) = 0.151 bits, and Gain(credit...
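Running the same computation for every attribute and taking the maximum selects the splitting attribute for the root. The 14-tuple table below is a reconstruction consistent with the distributions described in the text (the data are adapted from Quinlan's classic example); the exact rows are an assumption, not verbatim source data:

```python
import math
from collections import Counter, defaultdict

# Reconstructed training set: (age, income, student, credit_rating, buys_computer).
DATA = [
    ("youth", "high", "no", "fair", "no"),
    ("youth", "high", "no", "excellent", "no"),
    ("middle_aged", "high", "no", "fair", "yes"),
    ("senior", "medium", "no", "fair", "yes"),
    ("senior", "low", "yes", "fair", "yes"),
    ("senior", "low", "yes", "excellent", "no"),
    ("middle_aged", "low", "yes", "excellent", "yes"),
    ("youth", "medium", "no", "fair", "no"),
    ("youth", "low", "yes", "fair", "yes"),
    ("senior", "medium", "yes", "fair", "yes"),
    ("youth", "medium", "yes", "excellent", "yes"),
    ("middle_aged", "medium", "no", "excellent", "yes"),
    ("middle_aged", "high", "yes", "fair", "yes"),
    ("senior", "medium", "no", "excellent", "no"),
]
ATTRS = {"age": 0, "income": 1, "student": 2, "credit_rating": 3}

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain(data, attr_idx):
    # Gain(A) = Info(D) - Info_A(D) for the attribute at column attr_idx.
    labels = [row[-1] for row in data]
    parts = defaultdict(list)
    for row in data:
        parts[row[attr_idx]].append(row[-1])
    expected = sum(len(p) / len(data) * entropy(p) for p in parts.values())
    return entropy(labels) - expected

gains = {name: gain(DATA, idx) for name, idx in ATTRS.items()}
best = max(gains, key=gains.get)  # age has the highest gain, so it splits node N
```

With this table, `gains` reproduces the figures in the text (up to rounding of intermediate values), and age is chosen as the splitting attribute.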
