1 | \documentclass{llncs}
|
---|
2 | \usepackage{multirow}
|
---|
3 |
|
---|
4 | \title{Databases and Data Mining --- Assignment 4}
|
---|
5 |
|
---|
6 | \author{Jonathan K. Vis, Hans Wortel and Rick van der Zwet}
|
---|
7 |
|
---|
8 | \institute{Leiden Institute of Advanced Computer Science (LIACS),
|
---|
9 | Niels~Bohrweg~1,
|
---|
10 | 2333~CA~Leiden,
|
---|
11 | The~Netherlands
|
---|
12 | }
|
---|
13 |
|
---|
14 | \begin{document}
|
---|
15 | \maketitle
|
---|
16 |
|
---|
17 | \section*{Introduction}
|
---|
18 | Finding genes in DNA can be extremely difficult, due to the large data-sets, not
|
---|
19 | clear distinction between useful and useless information and other factors.
|
---|
20 |
|
---|
21 | Hidden Markov Models (\emph{HMMs}) introduced in the late 1960s and early
|
---|
22 | 1970s and nicely explained by L.A.~Rabiner in his field of speech recognition
|
---|
23 | \cite{hmm-speech}, could also be used in the field of gene recognition as shown
|
---|
24 | by A.~Krogh~\cite{ecoli}. Very briefly a \emph{HMM} has one basic guideline.
|
---|
25 | You are able to make predictions about the model (using statistics) and you are
|
---|
26 | able to view the output of the model. You are how-ever not able to
|
---|
27 | understand/see the inner workings of the model.
|
---|
28 |
|
---|
29 | As the \emph{E. coli} genome was not fully known by that time, this report will
|
---|
30 | focus in making a \emph{HMM} based on the complete genome sequence of the
|
---|
31 | \emph{Escherichia coli} O157:H7 strain EDL933, sequenced by Madison et
|
---|
32 | al.~\footnote{http://www.genome.wisc.edu/sequencing/o157.htm}.
|
---|
33 |
|
---|
34 | The data-set we used contig 1 (\emph{AE005174v2-1.fas}) and contig 2
|
---|
35 | (\emph{AE005174v2-2.fas}) both taken from the research website. They are
|
---|
36 | presented in the \emph{FASTA}
|
---|
37 | format~\footnote{http://en.wikipedia.org/wiki/FASTA\_format}. This format is
|
---|
38 | text-based format representing nucleotide sequences (in our case) in which base
|
---|
39 | pairs are represented using single-letter codes. The \emph{FASTA} format can
|
---|
40 | easy be parsed, although a bunch a specific rules has to be
|
---|
41 | followed\footnote{https://proteomecommons.org/tranche/examples/proteomecommons-fasta/fasta.jsp}
|
---|
42 |
|
---|
43 | The exact locations of the genes are to be found in the \emph{EDL933} table
|
---|
44 | taken again from the research website.
|
---|
45 |
|
---|
46 | \subsection*{Statistics}
|
---|
47 |
|
---|
48 | When sequencing genes, codons (i.e. nucleotide triplets) are extremely
|
---|
49 | important as specific combinations are specific markers within the
|
---|
50 | \emph{E.~coli} genome. The start location influences the codons you will find.
|
---|
51 | Take for example the sequence 'AGCTA', the 'C' in this case could appear in 3
|
---|
52 | codons, giving the start location 0,1,2 this will make 'AGC','GCT','CTA'. As
|
---|
53 | genes could be read both ways the right-to-left codons 'ATC','TCG','CGA' are
|
---|
54 | also valid. Those 6 combinations are the so called \emph{reading frames}.
|
---|
55 |
|
---|
56 | % result.txt properly does not cover this
|
---|
57 | % Before processing the contig files the overlap of 527 nucleotides between the
|
---|
58 | % contigs (end of contig 2 had overlap with begin contig 1) was removed. There is
|
---|
59 | % also a gap of 4000 nucleotides introduced between contig 1 and contig 2.
|
---|
60 |
|
---|
61 | Statistics of individual nucleotides is to be found for contig 1 in
|
---|
62 | table~\ref{tab:in-contig1} and contig 2 is shown in table~\ref{tab:in-contig2}.
|
---|
63 | The unknown values has been removed from the results (1726 vs 4229). This shows
|
---|
64 | that all nucleotides are relatively well spread throughout the genome.
|
---|
65 |
|
---|
66 | \begin{table}[ht]
|
---|
67 | \caption{Frequency of individual nucleotides in contig 1}
|
---|
68 | \centering
|
---|
69 | \begin{tabular}{c | r | r | r | r}
|
---|
70 | Nucleotide & \multicolumn{2}{|c|}{All} & \multicolumn{2}{|c}{In Gene} \\
|
---|
71 | \hline
|
---|
72 | a & 420765 & 24.7\% & 365055 & 24.2\% \\
|
---|
73 | c & 413594 & 24.3\% & 375977 & 24.9\% \\
|
---|
74 | t & 421486 & 24.7\% & 363547 & 24.1\% \\
|
---|
75 | g & 447937 & 26.3\% & 406047 & 26.9\% \\
|
---|
76 | \hline \hline
|
---|
77 | total & 1703782 & 100\% & 1510626 & 100\% \\
|
---|
78 | \end{tabular}
|
---|
79 | \label{tab:in-contig1}
|
---|
80 | \end{table}
|
---|
81 |
|
---|
82 | \begin{table}[ht]
|
---|
83 | \caption{Frequency of individual nucleotides in contig 2}
|
---|
84 | \centering
|
---|
85 | \begin{tabular}{c | r | r | r | r}
|
---|
86 | Nucleotide & \multicolumn{2}{|c|}{All} & \multicolumn{2}{|c}{In Gene} \\
|
---|
87 | \hline
|
---|
88 | a & 932625 & 24.9\% & 818308 & 24.3\% \\
|
---|
89 | c & 963480 & 25.7\% & 881689 & 26.2\% \\
|
---|
90 | t & 928752 & 24.8\% & 816242 & 24.3\% \\
|
---|
91 | g & 925343 & 24.7\% & 848201 & 25.2\% \\
|
---|
92 | \hline \hline
|
---|
93 | total & 3750200 & 100\% & 3364440 & 100\% \\
|
---|
94 | \end{tabular}
|
---|
95 | \label{tab:in-contig2}
|
---|
96 | \end{table}
|
---|
97 |
|
---|
98 | The codons (sum of reading frames) for contig 1 shown
|
---|
99 | in~\ref{tab:contig1-codon}, contig 2 is listed in
|
---|
100 | table~\ref{tab:contig2-codon}. This tables shows a significance between codons
|
---|
101 | and there percentages to be part of a gene, although the results has to be
|
---|
102 | interpreted with care as the reading frames are not clear.
|
---|
103 |
|
---|
104 | \begin{table}[ht]
|
---|
105 | \caption{Frequency of codons in contig 1}
|
---|
106 | \centering
|
---|
107 | \begin{tabular}{c | r | r | r || c | r | r | r}
|
---|
108 | Codon & All & In Gene & NOT In Gene & Codon & All & In Gene & NOT In Gene \\
|
---|
109 | \hline
|
---|
110 | taa & 25300 & 19666 & 22.27\% & gaa & 32713 & 29042 & 11.22\% \\
|
---|
111 | tta & 25556 & 20255 & 20.74\% & gat & 32453 & 28832 & 11.16\% \\
|
---|
112 | ata & 23973 & 19088 & 20.38\% & aac & 29250 & 25997 & 11.12\% \\
|
---|
113 | ttt & 39653 & 31687 & 20.09\% & ccc & 15693 & 13970 & 10.98\% \\
|
---|
114 | tat & 25469 & 20400 & 19.90\% & gga & 23690 & 21099 & 10.94\% \\
|
---|
115 | aat & 30468 & 24584 & 19.31\% & atc & 29619 & 26383 & 10.93\% \\
|
---|
116 | att & 31292 & 25263 & 19.27\% & ggg & 21610 & 19311 & 10.64\% \\
|
---|
117 | tag & 9388 & 7600 & 19.05\% & tcc & 20000 & 17899 & 10.51\% \\
|
---|
118 | cta & 9244 & 7555 & 18.27\% & gtg & 27553 & 24953 & 9.44\% \\
|
---|
119 | aaa & 40987 & 33719 & 17.73\% & gtc & 20059 & 18172 & 9.41\% \\
|
---|
120 | tgt & 23583 & 19746 & 16.27\% & tgc & 34588 & 31411 & 9.19\% \\
|
---|
121 | aca & 21443 & 18293 & 14.69\% & gca & 34290 & 31155 & 9.14\% \\
|
---|
122 | gta & 20357 & 17398 & 14.54\% & cac & 22072 & 20068 & 9.08\% \\
|
---|
123 | cat & 26817 & 22955 & 14.40\% & ctg & 40230 & 36729 & 8.70\% \\
|
---|
124 | ctt & 21268 & 18296 & 13.97\% & gct & 29419 & 26911 & 8.53\% \\
|
---|
125 | tct & 20394 & 17572 & 13.84\% & ggt & 29209 & 26744 & 8.44\% \\
|
---|
126 | agt & 18920 & 16305 & 13.82\% & gac & 20877 & 19149 & 8.28\% \\
|
---|
127 | aag & 23852 & 20561 & 13.80\% & cgt & 26382 & 24199 & 8.27\% \\
|
---|
128 | ttg & 28105 & 24301 & 13.53\% & agc & 28682 & 26395 & 7.97\% \\
|
---|
129 | agg & 20621 & 17862 & 13.38\% & cag & 38817 & 35751 & 7.90\% \\
|
---|
130 | gag & 17538 & 15213 & 13.26\% & acc & 25982 & 23959 & 7.79\% \\
|
---|
131 | aga & 21373 & 18557 & 13.18\% & acg & 25634 & 23670 & 7.66\% \\
|
---|
132 | tca & 29748 & 25935 & 12.82\% & tcg & 23445 & 21676 & 7.55\% \\
|
---|
133 | cct & 17669 & 15407 & 12.80\% & tgg & 34254 & 31742 & 7.33\% \\
|
---|
134 | ctc & 14849 & 12974 & 12.63\% & cga & 24724 & 22955 & 7.15\% \\
|
---|
135 | act & 18106 & 15820 & 12.63\% & ccg & 31610 & 29350 & 7.15\% \\
|
---|
136 | gtt & 30111 & 26347 & 12.50\% & gcc & 31092 & 28891 & 7.08\% \\
|
---|
137 | atg & 30330 & 26547 & 12.47\% & cca & 27790 & 25841 & 7.01\% \\
|
---|
138 | tac & 18973 & 16638 & 12.31\% & ggc & 35638 & 33190 & 6.87\% \\
|
---|
139 | tga & 33813 & 29655 & 12.30\% & cgg & 33684 & 31425 & 6.71\% \\
|
---|
140 | caa & 25552 & 22440 & 12.18\% & cgc & 36963 & 34633 & 6.30\% \\
|
---|
141 | ttc & 29021 & 25609 & 11.76\% & gcg & 41053 & 38507 & 6.20\% \\
|
---|
142 | \end{tabular}
|
---|
143 | \label{tab:contig1-codon}
|
---|
144 | \end{table}
|
---|
145 |
|
---|
146 | \begin{table}[ht]
|
---|
147 | \caption{Frequency of condons in contig 2}
|
---|
148 | \centering
|
---|
149 | \begin{tabular}{c | r | r | r || c | r | r | r}
|
---|
150 | Codon & All & In Gene & NOT In Gene & Codon & All & In Gene & NOT In Gene \\
|
---|
151 | \hline
|
---|
152 | taa & 57022 & 45603 & 20.03\% & tcc & 48188 & 43145 & 10.47\% \\
|
---|
153 | tta & 56799 & 45893 & 19.20\% & gaa & 66528 & 59609 & 10.40\% \\
|
---|
154 | ata & 53771 & 43900 & 18.36\% & ttc & 69980 & 62761 & 10.32\% \\
|
---|
155 | tat & 52554 & 43009 & 18.16\% & ggg & 35982 & 32295 & 10.25\% \\
|
---|
156 | aaa & 90017 & 74087 & 17.70\% & ccc & 42293 & 37976 & 10.21\% \\
|
---|
157 | att & 68567 & 56580 & 17.48\% & atc & 71291 & 64106 & 10.08\% \\
|
---|
158 | aat & 69249 & 57488 & 16.98\% & gga & 44736 & 40249 & 10.03\% \\
|
---|
159 | cta & 21962 & 18280 & 16.77\% & gat & 67808 & 61030 & 10.00\% \\
|
---|
160 | ttt & 91643 & 76490 & 16.53\% & cac & 56551 & 51543 & 8.86\% \\
|
---|
161 | tag & 21710 & 18284 & 15.78\% & gca & 76986 & 70469 & 8.47\% \\
|
---|
162 | aca & 49541 & 42363 & 14.49\% & gtg & 50596 & 46385 & 8.32\% \\
|
---|
163 | tgt & 46989 & 40725 & 13.33\% & gac & 43889 & 40286 & 8.21\% \\
|
---|
164 | cat & 64442 & 55969 & 13.15\% & gtc & 44001 & 40397 & 8.19\% \\
|
---|
165 | cct & 42669 & 37175 & 12.88\% & tgc & 75500 & 69609 & 7.80\% \\
|
---|
166 | aag & 50103 & 43705 & 12.77\% & cgt & 57354 & 53003 & 7.59\% \\
|
---|
167 | ctc & 36551 & 31915 & 12.68\% & acg & 58525 & 54099 & 7.56\% \\
|
---|
168 | ctt & 52621 & 45971 & 12.64\% & ctg & 82451 & 76286 & 7.48\% \\
|
---|
169 | agg & 39391 & 34455 & 12.53\% & acc & 61306 & 56745 & 7.44\% \\
|
---|
170 | tct & 46376 & 40586 & 12.48\% & agc & 65049 & 60249 & 7.38\% \\
|
---|
171 | aga & 45930 & 40346 & 12.16\% & gct & 63263 & 58638 & 7.31\% \\
|
---|
172 | act & 41282 & 36269 & 12.14\% & cag & 85371 & 79148 & 7.29\% \\
|
---|
173 | gag & 33303 & 29291 & 12.05\% & ggt & 57375 & 53332 & 7.05\% \\
|
---|
174 | tca & 70886 & 62481 & 11.86\% & cga & 55011 & 51352 & 6.65\% \\
|
---|
175 | agt & 40095 & 35356 & 11.82\% & cca & 71910 & 67148 & 6.62\% \\
|
---|
176 | gta & 41980 & 37101 & 11.62\% & ggc & 70465 & 65829 & 6.58\% \\
|
---|
177 | tga & 65912 & 58336 & 11.49\% & ccg & 70506 & 65879 & 6.56\% \\
|
---|
178 | tac & 43233 & 38278 & 11.46\% & tcg & 56497 & 52857 & 6.44\% \\
|
---|
179 | caa & 62869 & 55722 & 11.37\% & cgg & 68302 & 63908 & 6.43\% \\
|
---|
180 | ttg & 59753 & 53161 & 11.03\% & gcc & 75588 & 70783 & 6.36\% \\
|
---|
181 | atg & 60417 & 53782 & 10.98\% & tgg & 64884 & 61044 & 5.92\% \\
|
---|
182 | aac & 67047 & 59741 & 10.90\% & cgc & 91544 & 86312 & 5.72\% \\
|
---|
183 | gtt & 65266 & 58406 & 10.51\% & gcg & 86699 & 81754 & 5.70\% \\
|
---|
184 | \end{tabular}
|
---|
185 | \label{tab:contig2-codon}
|
---|
186 | \end{table}
|
---|
187 |
|
---|
188 | \section*{Hidden Markov Model}
|
---|
189 | The preferred library for designing and modeling a Hidden Markov Model (HMM), the General Hidden Markov Model library (GHMM)\footnote{http://ghmm.sourceforge.net/ and http://ghmm.org/}, was found to be unusable on the computers within LIACS mostly due to the absence of the Python source code.
|
---|
190 | As students have no root access, installing the Python source code proved impossible.
|
---|
191 | Therefore we choose to use a MATLAB toolkit for Hidden Markov Models. \\
|
---|
192 | Our first attempt was to mimic the design of the HMM referenced during the lectures \cite{ecoli}.
|
---|
193 | Although we used a simplified version, we were unable to recreate the expected results.
|
---|
194 | Some analysis of simple HMMs showed that so-called silent states (no emission or observation) are not correctly (intuitively) handled by the MATLAB toolkit.
|
---|
195 |
|
---|
196 | \subsection*{Model Topology}
|
---|
197 | As opposed to the presented model in \cite{ecoli} we use only four states.
|
---|
198 | These states, however, have very much the same biological meaning: a state for the intergenic region, a state for the gene coding regions, a state for start codons and a state for stop codons.
|
---|
199 | Instead of using nucleotides for observation, we use complete codons for observation. \\
|
---|
200 | The model is parametrized by the results from the statistics from our dataset.
|
---|
201 | Additional Baum-Welch training is not performed as it was found to be not improving the results.
|
---|
202 |
|
---|
203 | \subsection*{Executing the HMM}
|
---|
204 | A number of MATLAB scripts form the framework of our model.
|
---|
205 | Some MATLAB data files are also needed.
|
---|
206 | These files are created from the dataset and contain information about the gene coding regions and their respective orientation. \\
|
---|
207 | The MATLAB HMM toolkit is included in the source directory.
|
---|
208 | To run the HMM simply execute the script \texttt{run\_hmm} from the MATLAB command window.
|
---|
209 | Please note that due to the complexity and size of the data, the execution of the script may take some minutes.
|
---|
210 | The results are shown, after completion, in the MATLAB command window. \\
|
---|
211 | Files in the source directory:
|
---|
212 | \begin{itemize}
|
---|
213 | \item \texttt{c1.mat} MATLAB data file containing the nucleotides from \texttt{AEOO5174v2-1.fas}
|
---|
214 | \item \texttt{c2.mat} MATLAB data file containing gene nucleotides from \texttt{AEOO5174v2-2.fas}
|
---|
215 | \item \texttt{cton.m} MATLAB function which translates codons to nucleotides
|
---|
216 | \item \texttt{find\_genes.m} MATLAB function which uses the HMM toolkit to find genes in a sequence of nucleotides
|
---|
217 | \item \texttt{gen\_hmm.m} MATLAB function which creates the HMM from a sequence and the gene coding region information
|
---|
218 | \item \texttt{ntoc.m} MATLAB function which translates nucleotides to codons
|
---|
219 | \item \texttt{ranges1.mat} MATLAB data file containing the gene coding and orientation information for \texttt{AEOO5174v2-1.fas}
|
---|
220 | \item \texttt{ranges2.mat} MATLAB data file containing the gene coding and orientation information for \texttt{AEOO5174v2-2.fas}
|
---|
221 | \item \texttt{run\_hmm.m} MATLAB script which builds, executes and test the HMM
|
---|
222 | \item \texttt{test\_res.m} MATLAB function which tests the found genes against the gene coding region information
|
---|
223 | \end{itemize}
|
---|
224 |
|
---|
225 | \section*{Results}
|
---|
226 | A typical output of our HMM is as follows:
|
---|
227 |
|
---|
228 | \begin{verbatim}
|
---|
229 | % nucleotides in gene recognized
|
---|
230 | 94.3267
|
---|
231 |
|
---|
232 | % nucleotides outside gene recognized
|
---|
233 | 91.6520
|
---|
234 |
|
---|
235 | % genes in table found
|
---|
236 | 57.9845
|
---|
237 |
|
---|
238 | % of genes found not in table
|
---|
239 | 49.7807
|
---|
240 |
|
---|
241 | % genes in table found with error margin <= 10 codons
|
---|
242 | 65.3692
|
---|
243 |
|
---|
244 | % of genes found (with error margin) not in table
|
---|
245 | 43.3849
|
---|
246 |
|
---|
247 | % genes in table found with error margin <= 20 codons
|
---|
248 | 71.6875
|
---|
249 |
|
---|
250 | % of genes found (with error margin) not in table
|
---|
251 | 37.9127
|
---|
252 | \end{verbatim}
|
---|
253 | \noindent
|
---|
254 | As is clear from the HMM output, our HMM correctly classifies $58 \%$ of the genes.
|
---|
255 | If we allow for a small error ($\le 20$ codons), the HMM can correctly classify $72 \%$ of the genes. \\
|
---|
256 | Although this does not sound as impressive as the results in \cite{ecoli}, if we look at the nucleotide level, we see that $94 \%$ of the individual nucleotides are correctly classified as gene coding.
|
---|
257 |
|
---|
258 | \bibliographystyle{plain}
|
---|
259 | \bibliography{report}
|
---|
260 |
|
---|
261 | \end{document}
|
---|