Knowledge Discovery in Databases: An Overview
The term Data Mining has long been used, and still is, to designate the activity of pulling useful information from databases. The term is now recognized as applying to only one activity within a much larger process for extracting knowledge from opaque databases. The overall process is known as Knowledge Discovery in Databases (KDD). It comprises many subprocesses which, when linked together, provide a firm foundation for knowledge acquisition from large databases. Many tools, techniques, and disciplines come together under the umbrella of KDD.
Today, data mining attracts much interest in government, business, and research circles. As computer use in these areas has grown, so has the desire to let computers do work that used to be done by humans. The problem now is that the data to be analyzed has become too large and cumbersome for one person, or even teams of people, to tackle without help from computers. Computers are no longer mere crunchers of numbers; they now find the patterns that humans used to find. This growth has given rise to a vast body of knowledge concerned with the process of data analysis. As with much other information, the Internet is used to make this ever-growing body of knowledge available. Many general sources of information [a,b,c] are now online, and they are updated and expanded on an almost constant basis. The use of the Internet to disseminate and collect information is itself a consideration in this field. The amount of information is expanding at such a rate that older methods of distribution, such as paper journals and books, have yielded to the online world, where dissemination is almost instantaneous. Many companies maintain a presence on the Internet in domains dedicated to data analysis; GTE, IBM, and Microsoft, to name but a few, all maintain sites. The U.S. Government has sponsored conferences [i,j], while some companies maintain data sets for perusal on the Internet [e]. Many universities do research in this field and maintain extensive online resources [l].
First, we must clarify the terms that are used in this field. Data mining is the process in which an algorithm is used to extract patterns from data. It is only one part of a much larger process: data must first be identified and collected, then analyzed using various tools, and finally the results of the analysis must be judged. Data mining corresponds to the middle tasks in this overall process. This overall process is called "Knowledge discovery in databases, [which] is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data" [7 p.6]. An analogy to actual mining helps to clarify the situation. When a company wishes to mine a...