A log is a file that records the events which happen while an operating system or software runs. It may include any activity: information about a single keystroke, the complete record of communication between two machines, system errors, inter-process communication, update events, server activities, client sessions, browsing history, and so on. Logs provide good insight into the various states of a system at any instant, and their analytical and statistical study can help manage systems and mine useful knowledge about users. Log data is voluminous, grows at a very fast rate, and varies in structure across applications, usages and servers. It thus possesses the key characteristics of Big Data.
Three approaches that have been explored are term-level analysis, query-level analysis and session-level analysis. It is now time to take a step further, introduce some intelligence into the analyzer, and exploit more complex models to view the data from various perspectives. Big Data analysis and data mining answer this need. The upcoming sections discuss how Big Data technology and data mining algorithms facilitate handling the voluminous and rapidly growing log data.
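As a minimal sketch of session-level analysis, the common heuristic of splitting a user's query stream into sessions whenever the gap between consecutive entries exceeds a threshold (often 30 minutes) can be expressed as follows. The function name, the 30-minute threshold and the toy log entries are illustrative assumptions, not taken from any particular system:

```python
from datetime import datetime, timedelta

def sessionize(entries, gap_minutes=30):
    """Group (user, timestamp) log entries into sessions: a new
    session starts when the gap to the user's previous entry
    exceeds gap_minutes (a common sessionization heuristic)."""
    sessions = {}  # user -> list of sessions, each a list of timestamps
    gap = timedelta(minutes=gap_minutes)
    for user, ts in sorted(entries, key=lambda e: e[1]):
        user_sessions = sessions.setdefault(user, [])
        if user_sessions and ts - user_sessions[-1][-1] <= gap:
            user_sessions[-1].append(ts)  # continue current session
        else:
            user_sessions.append([ts])    # start a new session
    return sessions

log = [
    ("alice", datetime(2024, 1, 1, 9, 0)),
    ("alice", datetime(2024, 1, 1, 9, 10)),
    ("alice", datetime(2024, 1, 1, 11, 0)),  # gap > 30 min: new session
]
print({u: len(s) for u, s in sessionize(log).items()})  # → {'alice': 2}
```

Term-level and query-level analysis would instead operate on the individual terms or whole queries inside each entry; sessionization is what gives the analyzer a per-user behavioural context.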
Big Data is an emerging area of interest for researchers, being a powerful source for predicting unforeseen results and an able supporter of decision making. Log data is voluminous, grows at a very fast rate, and varies in structure across applications, usages and servers. Thus it possesses the key characteristics of Big Data.
The heterogeneity of logs can be handled by NoSQL databases such as HBase (part of the Hadoop ecosystem), Amazon Dynamo and Google Bigtable, because they provide a schema-less and scalable model. They are primarily of four types: key-value stores, column-oriented stores, document stores and graph databases.
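To make the schema-less point concrete, the following sketch mimics a document-oriented store with a plain in-memory dictionary of JSON documents; it is a stand-in for what a real document database provides, and the record fields and identifiers are invented for illustration. Note that each log source keeps whatever fields it emits, with no shared schema enforced:

```python
import json

store = {}  # doc_id -> JSON document (in-memory stand-in for a document store)

def put(doc_id, record):
    """Store a record as-is; no schema is imposed on its fields."""
    store[doc_id] = json.dumps(record)

# Heterogeneous records from three different log sources:
put("web-1",  {"ts": "2024-01-01T09:00Z", "ip": "10.0.0.1", "url": "/home"})
put("app-1",  {"ts": "2024-01-01T09:01Z", "level": "ERROR", "msg": "timeout"})
put("sensor", {"ts": "2024-01-01T09:02Z", "temp_c": 41.7})

print(len(store))  # → 3
```

A relational database would force all three sources into one table schema (or three separate tables); the schema-less model lets them coexist and still be queried by shared fields such as the timestamp.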
1.1.5. Challenges in Log Analysis
a. Loss of semantics when writing log messages makes it difficult to retrieve their correct meaning.
b. Poor log quality and excessive logging can adversely affect analysis.
c. Sampling-based logging can miss crucial but rare events.
d. The power of any analysis is limited by the information actually present in the logs.
e. A mechanism is needed to distinguish an adversary from a legitimate user performing correct operations.
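Challenge (c) can be illustrated with a short sketch of sampling-based logging, where each event is kept independently with some small probability. The function name and event labels are illustrative assumptions:

```python
import random

def sampled_log(events, rate, seed=None):
    """Sampling-based logging: keep each event independently
    with probability `rate`, dropping the rest."""
    rng = random.Random(seed)
    return [e for e in events if rng.random() < rate]

# 10,000 routine events plus a single rare, crucial failure event.
events = ["request_ok"] * 10_000 + ["disk_failure"]

kept = sampled_log(events, rate=0.01, seed=1)
# At a 1% sampling rate, the lone "disk_failure" event survives
# with probability only 0.01, so it is almost always lost.
print(len(kept), "disk_failure" in kept)
```

Roughly 100 of the 10,001 events are retained, but the one event an operator most needs to see is the one most likely to vanish, which is precisely the trade-off this challenge describes.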
1.2. Introduction to Big Data
In 2010, Apache Hadoop defined Big Data as "datasets which could not be captured, managed, and processed by general computers within an acceptable scope". This interpretation varies across three levels: Big Data, Very Big Data and Massive Data. Big Data is a conceptual idea where, besides size, the rate of growth of incoming data, the types of data, the completeness of data and many other factors play an equally important role. It is a trending domain where much untapped potential lies and is expected to be explored in the coming years. Figure-3 shows some important results presented in the webinar "Big Data Opportunities in Vertical Industries" hosted by Gartner in July 2012.
Data can vary from terabytes to petabytes or even more, so the requirements for analysis depend highly on the available computing infrastructure and on scalable algorithms. This defines several parameters for evaluating a data set as Big Data. Figure-4 shows the key parameters for such evaluation: volume, variety, velocity, veracity, viscosity and value.
1.2.1. Handling Big Data
a. Data Generation
Input data may include day-to-day data generated by enterprises, sensor nodes or IoT devices, biomedical analyses, and other sources.
b. Data Acquisition
Generated data is of little worth if it is not acquired in time, particularly in the case of time-sensitive data. Thus, collection and transmission of data to the data centres...