Analysis Of Large Log Files


Analysis of large log files
Kasper Laursen s093078
Kongens Lyngby 2012 IMM-B.Sc.-2012-37

Technical University of Denmark
Informatics and Mathematical Modelling
Building 321, DK-2800 Kongens Lyngby, Denmark
Phone +45 4525 3351, Fax +45 4588 2673
reception@imm.dtu.dk
www.imm.dtu.dk
IMM-B.Sc.-2012-37

Summary
This thesis covers pattern recognition in large log files using clustering analysis, in the form of mini-batch K-means clustering and data fitting, to find abnormal traffic in network flows provided by DeIC, formerly The Danish Research Network.
The implementation is a modified clustering algorithm using the Mahalanobis distance. In the analysis, more than 10^9 network flows from a single day were split into different clusters, and outliers were detected. The calculations of the clustering analysis took less than 13 hours, which means that outliers can be detected the following day. The implementation and analysis could be further improved by selecting a different set of fields from the log files, by a parallel implementation of the mini-batch K-means clustering algorithm, and by a more thorough analysis of the detected outliers.
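As a rough illustration of how the Mahalanobis distance can flag outliers relative to a cluster, the following minimal sketch estimates the mean and covariance of one cluster and scores candidate points against it. It is not the thesis implementation: the restriction to two dimensions, the function names, and the threshold of 3.0 are assumptions made only for this example.

```python
import math


def mean_and_cov_2d(points):
    """Return the mean and the 2x2 sample covariance matrix of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), ((sxx, sxy), (sxy, syy))


def mahalanobis_2d(point, mean, cov):
    """Compute sqrt((x - mu)^T S^-1 (x - mu)) for a 2-D point."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    dx = point[0] - mean[0]
    dy = point[1] - mean[1]
    # The inverse of [[a, b], [c, d]] is (1 / det) * [[d, -b], [-c, a]].
    q = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.sqrt(q)


def find_outliers(cluster_points, candidates, threshold=3.0):
    """Return the candidates farther than `threshold` from the cluster.

    The threshold value is an illustrative assumption, not a value
    taken from the thesis.
    """
    mean, cov = mean_and_cov_2d(cluster_points)
    return [p for p in candidates if mahalanobis_2d(p, mean, cov) > threshold]


if __name__ == "__main__":
    cluster = [(0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.5)]
    print(find_outliers(cluster, [(0.5, 0.5), (1, 1), (5, 5)]))
```

Unlike the Euclidean distance, this measure scales each direction by the spread of the cluster, so a point is judged by how unusual it is for that particular cluster and not by its raw distance from the centre.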


Preface
This bachelor thesis was prepared at the Department of Informatics and Mathematical Modelling at the Technical University of Denmark in fulfillment of the requirements for acquiring a B.Sc.Eng. degree in Software Technology.
Lyngby, 14 December 2012
Kasper Laursen


Acknowledgements
I would like to thank my supervisor Robin Sharp for weekly meetings and support throughout the whole project.
Thanks to The Danish Research Network for providing network log files for this analysis.
I would like to give special thanks to Rasmus Jul Hansen for proofreading this project, to Simon Laursen for discussion and finalizing the report, and to Søren Løvborg for proofreading, help and discussion throughout the whole project phase.


Contents
1 Introduction
2 Preliminaries
  2.1 Machine learning
  2.2 Clustering analysis
  2.3 Network flow
  2.4 Summary
3 Problem analysis
4 Handling large datasets
  4.1 Large log files
  4.2 Log file variables
  4.3 Scaling of variables
  4.4 Packed format
  4.5 Summary
5 Implementation
  5.1 Mini-batch K-means
  5.2 Data fitting
  5.3 Summary
6 Analysis of log files
  6.1...
