
1 Data Pre-processing

1.1 k-mers extraction

Assume Ka = (a1, a2, …, ak) is a k-mer, i.e., a contiguous subsequence of length k, where a = 1, …, S and S is the total number of k-mers in that sequence. For a sequence of length L, a sliding window of length k yields a total of L − k + 1 k-mers.
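The sliding-window extraction above can be sketched in a few lines; the function name `extract_kmers` is ours, not from the paper:

```python
def extract_kmers(seq, k):
    """Slide a window of length k along seq; yields L - k + 1 k-mers."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

# Example: a sequence of length L = 7 with k = 3 gives 7 - 3 + 1 = 5 k-mers.
print(extract_kmers("ACGTACG", 3))  # ['ACG', 'CGT', 'GTA', 'TAC', 'ACG']
```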

1.2 Generation Of Position Frequency Matrices

For the positive dataset, 500 sequences were used to calculate k-mer frequencies from three successive windows: (1) window A, from −75 to −26 bp upstream of the polyA site; (2) window B, from −25 to −1 bp upstream of the polyA site; and (3) window C, from 1 to 25 bp downstream of the polyA site. The highly informative k-mer frequencies (HIK) feature vector consists of the cumulated frequencies of all monomer, dimer, and trimer frequencies for the three regions. This gives 3 regions × 4 monomer frequencies, 3 × 16 dimer frequencies, and 3 × 64 trimer frequencies, for a total of 252 features. The negative dataset was computed from frequencies in similarly sized windows, but taken from the beginning of 500 other independent sequences (windows: A, −300 to −251 bp; B, −250 to −226 bp; and C, −225 to −201 bp).
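A minimal sketch of the HIK feature construction follows. The function names and the exact slicing of windows A, B, and C around the polyA site index are our assumptions for illustration; only the window coordinates and the 252-feature count come from the text:

```python
from itertools import product
from collections import Counter

BASES = "ACGT"

def kmer_freqs(window, k):
    """Frequencies of all 4**k k-mers of length k in a window, in a fixed
    lexicographic order (AA.., AC.., ...)."""
    counts = Counter(window[i:i + k] for i in range(len(window) - k + 1))
    total = max(sum(counts.values()), 1)
    return [counts["".join(p)] / total for p in product(BASES, repeat=k)]

def hik_features(seq, site):
    """Concatenate monomer/dimer/trimer frequencies for windows A, B, C
    around a polyA site at 0-based index `site`: 3 * (4 + 16 + 64) = 252."""
    windows = [seq[site - 75:site - 25],   # A: -75 .. -26
               seq[site - 25:site],        # B: -25 .. -1
               seq[site:site + 25]]        # C: +1 .. +25
    feats = []
    for w in windows:
        for k in (1, 2, 3):
            feats.extend(kmer_freqs(w, k))
    return feats
```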

1.3 Background Probability Feature

The label space is written as Y = {p, n}, indicating whether a sequence with a polyA site is detected (positive class label p) or not detected (negative class label n). A classifier, i.e., a mapping from the instance space to the label space, is found by learning from a set of examples. An example is of the form z = (x, y) with x ∈ X and y ∈ Y. The symbol Z will be used as a compact notation for X × Y. Training data are a sequence of examples:

S = ((x1, y1), …, (xn, yn)) = (z1, …, zn),  (1)

where each example is generated by the same unknown probability distribution

P over Z.  (2)

c′(x) denotes the number of times the sub-sequence x occurs in the set of k-mers assigned to a node.

Let π(a, a′) denote the conditional probability p(a′ | a) of the first-order Markov chain (MC), where a, a′ ∈ {A, C, G, T}.
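The transition probabilities π(a, a′) can be estimated from adjacent nucleotide pairs in the training sequences. This is a sketch under our own assumptions; the paper does not specify the estimator, and the add-one smoothing here is a choice we make to avoid zero probabilities:

```python
from collections import defaultdict

def estimate_transition_probs(sequences):
    """Estimate first-order MC transition probabilities pi[a][b] ~ p(b | a)
    from counts of adjacent nucleotide pairs, with add-one smoothing."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):  # all adjacent pairs (a, b)
            counts[a][b] += 1
    pi = {}
    for a in "ACGT":
        total = sum(counts[a][b] + 1 for b in "ACGT")
        pi[a] = {b: (counts[a][b] + 1) / total for b in "ACGT"}
    return pi
```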

1.4 Relative Mismatch Score Feature

A relative mismatch score is required to assign k-mers to the nodes during learning. The score of a k-mer Kj = (b1 b2 … bk) with respect to the PSSM-based model assigned to node Vl can be calculated as

S_pssm(Kj, Vl) = Σ_{i=1..k} log f(bi, i).  (3)

Here k is the length of the k-mer, and f(bi, i) is the probability of nucleotide bi at position i. Similarly, the score of a k-mer Kj with respect to the MC-based model Mmc of node Vl is calculated by

S_mc(Kj, Vl) = log p(b1) + Σ_{i=2..k} log π(b_{i−1}, b_i).  (4)
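Both scores can be sketched as log-probability sums. Note that the exact functional forms of Eqs. (3) and (4) are not fully legible in this copy, so the log-sum form below is our assumption, as are the function names:

```python
import math

def pssm_score(kmer, f):
    """Score a k-mer under a PSSM: f[i][b] = probability of base b at
    position i. Assumed log-sum form of Eq. (3)."""
    return sum(math.log(f[i][b]) for i, b in enumerate(kmer))

def mc_score(kmer, pi, p0):
    """Score a k-mer under a first-order MC with initial probabilities p0
    and transitions pi[a][b]. Assumed log-sum form of Eq. (4)."""
    s = math.log(p0[kmer[0]])
    for a, b in zip(kmer, kmer[1:]):
        s += math.log(pi[a][b])
    return s
```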

2 Generation of Training and Testing Datasets

2.1 Pattern Pairs for Building the Learner Models

Once a winning node for a k-mer K is found, a prediction set for an unlabeled instance x_{n+1} is constructed by trying each possible label y ∈ Y as a label for the instance [15, 14, 42]. In each try we form the example z_{n+1} = (x_{n+1}, y) and add it to the training data S. Then, each example in the extended sequence:

(x1, y1), …
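The label-trial step above can be sketched as follows; the function name `candidate_extensions` is ours, and the label set {p, n} comes from the label space Y defined earlier:

```python
def candidate_extensions(training, x_new, labels=("p", "n")):
    """For each candidate label y in Y, form the extended example sequence
    S + [(x_new, y)], one extension per tried label."""
    return {y: training + [(x_new, y)] for y in labels}
```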
