We propose a generic bot detection system for an endpoint host. It classifies each destination contacted by the host as benign or malicious by examining the traffic the host generates. The system rests on the assumption that user activity on an endpoint host occurs at random times, so the traffic it produces, which we call user-induced traffic, behaves randomly. Bot C&C traffic, on the other hand, is programmed when the bot is coded or configured and is expected to behave regularly. This difference in behavior is captured using three features extracted from the traffic: the time gap between flows to a destination, the number of packets in flows to a destination, and the number of bytes in flows to a destination. A flow is a set of packets that share the same Flow ID (Source IP, Source Port, Destination IP, Destination Port, Protocol). The entropy of each feature is used to model the behavior of both bot and user-induced traffic. We first characterize both classes of traffic and then derive a set of fuzzy rules that describe their behavior. Fuzziness is introduced so that the difference in traffic behavior can be described in natural-language terms. The following sections describe the system in detail.
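To make the three per-destination features concrete, the following is a minimal sketch of how their entropies could be computed. It is an illustration under stated assumptions rather than the system's actual implementation: the text above does not specify the entropy estimator, so plain Shannon entropy over the empirical distribution of values is assumed. The function names, the flow record layout (start time, destination IP, packet count, byte count), the use of the destination IP alone as the grouping key, and the bin_seconds parameter are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    total = len(values)
    if total == 0:
        return 0.0
    counts = Counter(values)
    # Sum of p * log2(1/p) over the distinct observed values.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def destination_entropies(flows, bin_seconds=1.0):
    """Compute per-destination entropies of time gap, packets, and bytes.

    `flows` is an iterable of (start_time, dst_ip, n_packets, n_bytes)
    tuples; this record layout is an assumption made for the sketch.
    Time gaps are rounded to multiples of `bin_seconds` (also an assumption)
    so that nearly equal gaps fall into the same bin.
    """
    by_destination = defaultdict(list)
    for start_time, dst_ip, n_packets, n_bytes in flows:
        by_destination[dst_ip].append((start_time, n_packets, n_bytes))

    result = {}
    for dst_ip, records in by_destination.items():
        records.sort()  # order flows by start time
        times = [r[0] for r in records]
        gaps = [round((later - earlier) / bin_seconds)
                for earlier, later in zip(times, times[1:])]
        result[dst_ip] = {
            "timegap": shannon_entropy(gaps),
            "packets": shannon_entropy([r[1] for r in records]),
            "bytes": shannon_entropy([r[2] for r in records]),
        }
    return result
```

Under these assumptions, a destination contacted at a fixed interval with identically sized flows yields entropies of zero for all three features, while irregular contact with varying flow sizes yields higher values. This matches the regular-versus-random distinction drawn above.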
Traffic Characterization
From a review of the literature, we found only a few works [34-38] that analyze bot behavior. These works indicate that the only invariant in bot behavior is the bot's communication with its C&C server. Hence the C&C channel is the weak link through which a bot's presence can be detected. From these analyses, we conclude that bots communicate periodically with their masters to fetch commands, report status, post stolen information, and so on. We therefore observe and characterize both bot and user-induced traffic in order to infer how the two can be differentiated.
User-induced traffic is characterized using data collected from an untainted Windows XP host over 42 days. The data come from 10 distinct users and are captured with Microsoft Network Monitor 3.4. A web user session has been found to last 15 minutes or less [24]. We therefore choose a slightly larger timeslot of 30 minutes for traffic characterization. A sliding window is maintained over the timeslots, advancing 10 minutes at a time. We also define a Flow Set as the set of flows to the same destination in any time period. In the context of traffic characterization, we have chosen the time period to be a...

