Data Mining Essay


1 Data Pre-processing
1.1 k-mers extraction

Let Ka = (a1, a2, ..., ak) denote a k-mer, i.e., a contiguous subsequence of length k, where a = 1, …, S and S is the total number of k-mers in the sequence. A sequence of length L yields L – k + 1 k-mers, which are extracted by sliding a window of length k along the sequence one position at a time.
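
The sliding-window procedure can be sketched in a few lines of Python. This is a minimal illustration rather than the author's implementation; the function name extract_kmers is an assumption introduced here.

```python
def extract_kmers(sequence, k):
    """Return all k-mers of `sequence` using a sliding window of length k."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A sequence of length L yields L - k + 1 windows (none if L < k).
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


kmers = extract_kmers("ACGTAC", 3)
print(kmers)       # ['ACG', 'CGT', 'GTA', 'TAC']
print(len(kmers))  # 6 - 3 + 1 = 4
```

Overlapping windows are kept on purpose: consecutive k-mers share k - 1 bases, which is what gives the L – k + 1 count.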

1.2 Generation Of Position Frequency Matrices

For the positive dataset, 500 sequences were used to calculate k-mer frequencies from three successive windows: (1) window A, from -75 to -26 bp upstream of the polyA site; (2) window B, from -25 to -1 bp upstream of the polyA site; and (3) window C, from 1 to 25 bp downstream of the polyA site. The highly informative k-mer frequencies (HIK) feature vector consists of the cumulated monomer, dimer, and trimer frequencies for the three regions. This gives 3 regions x 4 monomer frequencies, 3 x 16 dimer frequencies, and 3 x 64 trimer frequencies, for a total of 12 + 48 + 192 = 252 features. The negative dataset was computed from frequencies in similarly spaced windows taken from the beginning of 500 other, independent sequences (windows: A, -300 to -251 bp; B, -251 to -226 bp; and C, -225 to -201 bp).
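
As a rough sketch of how such a 252-dimensional vector could be assembled for one positive sequence, consider the Python below. Several details are assumptions made for illustration only: the names window_slice, kmer_frequencies and hik_features are hypothetical, frequencies are computed as relative frequencies within each window, the polyA site is given as a 0-based index of the first base after the site, and the features are ordered window by window.

```python
from itertools import product

ALPHABET = "ACGT"
# Windows A, B, C as inclusive (start, end) offsets relative to the polyA site.
# Negative offsets are upstream; there is no offset 0 in this convention.
WINDOWS = [(-75, -26), (-25, -1), (1, 25)]


def window_slice(seq, site, start, end):
    """Return the bases at offsets start..end (inclusive) relative to `site`.

    Assumption: `site` is the 0-based index of the first base after the
    polyA site, so offset -1 is seq[site - 1] and offset +1 is seq[site].
    """
    def to_index(offset):
        return site + offset if offset < 0 else site + offset - 1

    return seq[to_index(start):to_index(end) + 1]


def kmer_frequencies(region, k):
    """Relative frequency of every possible k-mer in `region`, in fixed order."""
    counts = {"".join(p): 0 for p in product(ALPHABET, repeat=k)}
    total = 0
    for i in range(len(region) - k + 1):
        kmer = region[i:i + k]
        if kmer in counts:  # skip windows containing other symbols, e.g. 'N'
            counts[kmer] += 1
            total += 1
    return [c / total if total else 0.0 for c in counts.values()]


def hik_features(seq, site):
    """Concatenate 1-, 2- and 3-mer frequencies over the three windows."""
    if site < 75 or site + 25 > len(seq):
        raise ValueError("sequence too short around the polyA site")
    features = []
    for start, end in WINDOWS:
        region = window_slice(seq, site, start, end)
        for k in (1, 2, 3):
            features.extend(kmer_frequencies(region, k))
    return features


features = hik_features("ACGT" * 30, site=80)
print(len(features))  # 3 * (4 + 16 + 64) = 252
```

The same function could be applied to the negative windows by changing the offsets in WINDOWS.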

1.3 Background Probability Feature

The label space is written as Y = {p, n}, indicating that a sequence with a polyA site is detected (positive class label p) or not detected (negative class label n). A classifier, i.e., a mapping from instance space to label space, is found by learning from a set of examples. An example is of the form z = (x, y) with x ∈ X and y ∈ Y. The symbol Z will be used as a compact notation for X × Y. Training data are a sequence of examples:

S = (x1, y1), …, (xn, yn) = z1, …, zn, (1)

where each example is generated by the same unknown probability distribution P over Z.
c′(x) represents the number of times the sub-sequence y is found in the set of k-mers in a node.
Let π(a, a′) denote the conditional probability p(a′|a) of the first-order Markov chain (MC), where a, a′ ∈ {A, C, G, T}.
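
One simple way to obtain values for π(a, a′) is to count adjacent nucleotide pairs in a set of k-mers and normalise each row. The sketch below does this in Python; the function name transition_probabilities and the use of a pseudocount to avoid zero probabilities are assumptions made here and are not taken from the text.

```python
ALPHABET = "ACGT"


def transition_probabilities(kmers, pseudocount=1.0):
    """Estimate pi[a][b] = p(b | a) of a first-order Markov chain.

    Counts every adjacent pair (a, b) in the given k-mers. The pseudocount
    is an assumption that keeps unseen transitions from getting probability 0.
    """
    if pseudocount <= 0:
        raise ValueError("pseudocount must be positive")
    counts = {a: {b: pseudocount for b in ALPHABET} for a in ALPHABET}
    for kmer in kmers:
        for a, b in zip(kmer, kmer[1:]):
            if a in counts and b in counts[a]:  # ignore symbols such as 'N'
                counts[a][b] += 1
    pi = {}
    for a in ALPHABET:
        row_total = sum(counts[a].values())
        pi[a] = {b: counts[a][b] / row_total for b in ALPHABET}
    return pi


pi = transition_probabilities(["AAC", "AAG"])
print(pi["A"]["A"])  # (1 + 2) / 8 = 0.375
```

Each row pi[a] sums to 1, so it is a valid conditional distribution over the next nucleotide.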

1.4 Relative Mismatch Score Feature

A relative mismatch score is required to assign k-mers to the nodes during learning. The score of a k-mer Kj = (b1 b2 … bk) with respect to the PSSM-based model assigned to node Vl can be calculated as,
Here, k is the length of the k-mer, and f(bi, i) is the probability of nucleotide bi at position i. Thus the score of a k-mer Kj with respect to the MC-based model Mmc of node Vl is calculated by:
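
The scoring formulas themselves are not reproduced in the text, so the following Python is only a generic illustration of how f(bi, i) can be estimated and used. It builds a position frequency matrix from the k-mers of a node and scores a k-mer by its log-likelihood, a common PSSM score that is swapped in here and may differ from the relative mismatch score intended by the author. The names position_frequencies and log_likelihood_score, and the pseudocount, are assumptions.

```python
import math

ALPHABET = "ACGT"


def position_frequencies(kmers, pseudocount=1.0):
    """Estimate f(b, i): probability of nucleotide b at position i."""
    if not kmers:
        raise ValueError("at least one k-mer is required")
    k = len(kmers[0])
    if any(len(kmer) != k for kmer in kmers):
        raise ValueError("all k-mers must have the same length")
    matrix = []
    for i in range(k):
        # The pseudocount (an assumption) keeps every probability above zero.
        column = {b: pseudocount for b in ALPHABET}
        for kmer in kmers:
            if kmer[i] in column:
                column[kmer[i]] += 1
        total = sum(column.values())
        matrix.append({b: c / total for b, c in column.items()})
    return matrix


def log_likelihood_score(kmer, matrix):
    """Sum of log f(b_i, i) over the positions of the k-mer (stand-in score)."""
    if len(kmer) != len(matrix):
        raise ValueError("k-mer length does not match the matrix")
    return sum(math.log(matrix[i][b]) for i, b in enumerate(kmer))


node_kmers = ["ACG", "ACG", "ACT"]
f = position_frequencies(node_kmers)
# A k-mer resembling the node's members scores higher than a dissimilar one.
print(log_likelihood_score("ACG", f) > log_likelihood_score("TTA", f))  # True
```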

2 Generation of Training and Testing Datasets

2.1 Pattern pairs for building the learner models

Once a winning node for a k-mer K is found, to construct a prediction set for an unlabeled instance xn+1, each possible label y ∈ Y is tried as a label for the instance [15, 14, 42]. In each try we form the example zn+1 = (xn+1, y) and add it to the training data S. Then, each example in the extended sequence:

(x1, y1), ...
