source: liacs/dbdm/dbdm_4/report/report.tex@ 210

Last change on this file since 210 was 67, checked in by Rick van der Zwet, 15 years ago

DBDM - Assignment 4 - report

File size: 13.0 KB
Line 
1\documentclass{llncs}
2\usepackage{multirow}
3
4\title{Databases and Data Mining --- Assignment 4}
5
6\author{Jonathan K. Vis, Hans Wortel and Rick van der Zwet}
7
8\institute{Leiden Institute of Advanced Computer Science (LIACS),
9Niels~Bohrweg~1,
102333~CA~Leiden,
11The~Netherlands
12}
13
14\begin{document}
15\maketitle
16
17\section*{Introduction}
18Finding genes in DNA can be extremely difficult, due to the large data-sets, not
19clear distinction between useful and useless information and other factors.
20
21Hidden Markov Models (\emph{HMMs}) introduced in the late 1960s and early
221970s and nicely explained by L.A.~Rabiner in his field of speech recognition
23\cite{hmm-speech}, could also be used in the field of gene recognition as shown
24by A.~Krogh~\cite{ecoli}. Very briefly a \emph{HMM} has one basic guideline.
25You are able to make predictions about the model (using statistics) and you are
26able to view the output of the model. You are how-ever not able to
27understand/see the inner workings of the model.
28
29As the \emph{E. coli} genome was not fully known by that time, this report will
30focus in making a \emph{HMM} based on the complete genome sequence of the
31\emph{Escherichia coli} O157:H7 strain EDL933, sequenced by Madison et
32al.~\footnote{http://www.genome.wisc.edu/sequencing/o157.htm}.
33
34The data-set we used contig 1 (\emph{AE005174v2-1.fas}) and contig 2
35(\emph{AE005174v2-2.fas}) both taken from the research website. They are
36presented in the \emph{FASTA}
37format~\footnote{http://en.wikipedia.org/wiki/FASTA\_format}. This format is
38text-based format representing nucleotide sequences (in our case) in which base
39pairs are represented using single-letter codes. The \emph{FASTA} format can
40easy be parsed, although a bunch a specific rules has to be
41followed\footnote{https://proteomecommons.org/tranche/examples/proteomecommons-fasta/fasta.jsp}
42
43The exact locations of the genes are to be found in the \emph{EDL933} table
44taken again from the research website.
45
46\subsection*{Statistics}
47
48When sequencing genes, codons (i.e. nucleotide triplets) are extremely
49important as specific combinations are specific markers within the
50\emph{E.~coli} genome. The start location influences the codons you will find.
51Take for example the sequence 'AGCTA', the 'C' in this case could appear in 3
52codons, giving the start location 0,1,2 this will make 'AGC','GCT','CTA'. As
53genes could be read both ways the right-to-left codons 'ATC','TCG','CGA' are
54also valid. Those 6 combinations are the so called \emph{reading frames}.
55
56% result.txt properly does not cover this
57% Before processing the contig files the overlap of 527 nucleotides between the
58% contigs (end of contig 2 had overlap with begin contig 1) was removed. There is
59% also a gap of 4000 nucleotides introduced between contig 1 and contig 2.
60
61Statistics of individual nucleotides is to be found for contig 1 in
62table~\ref{tab:in-contig1} and contig 2 is shown in table~\ref{tab:in-contig2}.
63The unknown values has been removed from the results (1726 vs 4229). This shows
64that all nucleotides are relatively well spread throughout the genome.
65
66\begin{table}[ht]
67\caption{Frequency of individual nucleotides in contig 1}
68\centering
69\begin{tabular}{c | r | r | r | r}
70Nucleotide & \multicolumn{2}{|c|}{All} & \multicolumn{2}{|c}{In Gene} \\
71\hline
72a & 420765 & 24.7\% & 365055 & 24.2\% \\
73c & 413594 & 24.3\% & 375977 & 24.9\% \\
74t & 421486 & 24.7\% & 363547 & 24.1\% \\
75g & 447937 & 26.3\% & 406047 & 26.9\% \\
76\hline \hline
77total & 1703782 & 100\% & 1510626 & 100\% \\
78\end{tabular}
79\label{tab:in-contig1}
80\end{table}
81
82\begin{table}[ht]
83\caption{Frequency of individual nucleotides in contig 2}
84\centering
85\begin{tabular}{c | r | r | r | r}
86Nucleotide & \multicolumn{2}{|c|}{All} & \multicolumn{2}{|c}{In Gene} \\
87\hline
88a & 932625 & 24.9\% & 818308 & 24.3\% \\
89c & 963480 & 25.7\% & 881689 & 26.2\% \\
90t & 928752 & 24.8\% & 816242 & 24.3\% \\
91g & 925343 & 24.7\% & 848201 & 25.2\% \\
92\hline \hline
93total & 3750200 & 100\% & 3364440 & 100\% \\
94\end{tabular}
95\label{tab:in-contig2}
96\end{table}
97
98The codons (sum of reading frames) for contig 1 shown
99in~\ref{tab:contig1-codon}, contig 2 is listed in
100table~\ref{tab:contig2-codon}. This tables shows a significance between codons
101and there percentages to be part of a gene, although the results has to be
102interpreted with care as the reading frames are not clear.
103
104\begin{table}[ht]
105\caption{Frequency of codons in contig 1}
106\centering
107\begin{tabular}{c | r | r | r || c | r | r | r}
108Codon & All & In Gene & NOT In Gene & Codon & All & In Gene & NOT In Gene \\
109\hline
110taa & 25300 & 19666 & 22.27\% & gaa & 32713 & 29042 & 11.22\% \\
111tta & 25556 & 20255 & 20.74\% & gat & 32453 & 28832 & 11.16\% \\
112ata & 23973 & 19088 & 20.38\% & aac & 29250 & 25997 & 11.12\% \\
113ttt & 39653 & 31687 & 20.09\% & ccc & 15693 & 13970 & 10.98\% \\
114tat & 25469 & 20400 & 19.90\% & gga & 23690 & 21099 & 10.94\% \\
115aat & 30468 & 24584 & 19.31\% & atc & 29619 & 26383 & 10.93\% \\
116att & 31292 & 25263 & 19.27\% & ggg & 21610 & 19311 & 10.64\% \\
117tag & 9388 & 7600 & 19.05\% & tcc & 20000 & 17899 & 10.51\% \\
118cta & 9244 & 7555 & 18.27\% & gtg & 27553 & 24953 & 9.44\% \\
119aaa & 40987 & 33719 & 17.73\% & gtc & 20059 & 18172 & 9.41\% \\
120tgt & 23583 & 19746 & 16.27\% & tgc & 34588 & 31411 & 9.19\% \\
121aca & 21443 & 18293 & 14.69\% & gca & 34290 & 31155 & 9.14\% \\
122gta & 20357 & 17398 & 14.54\% & cac & 22072 & 20068 & 9.08\% \\
123cat & 26817 & 22955 & 14.40\% & ctg & 40230 & 36729 & 8.70\% \\
124ctt & 21268 & 18296 & 13.97\% & gct & 29419 & 26911 & 8.53\% \\
125tct & 20394 & 17572 & 13.84\% & ggt & 29209 & 26744 & 8.44\% \\
126agt & 18920 & 16305 & 13.82\% & gac & 20877 & 19149 & 8.28\% \\
127aag & 23852 & 20561 & 13.80\% & cgt & 26382 & 24199 & 8.27\% \\
128ttg & 28105 & 24301 & 13.53\% & agc & 28682 & 26395 & 7.97\% \\
129agg & 20621 & 17862 & 13.38\% & cag & 38817 & 35751 & 7.90\% \\
130gag & 17538 & 15213 & 13.26\% & acc & 25982 & 23959 & 7.79\% \\
131aga & 21373 & 18557 & 13.18\% & acg & 25634 & 23670 & 7.66\% \\
132tca & 29748 & 25935 & 12.82\% & tcg & 23445 & 21676 & 7.55\% \\
133cct & 17669 & 15407 & 12.80\% & tgg & 34254 & 31742 & 7.33\% \\
134ctc & 14849 & 12974 & 12.63\% & cga & 24724 & 22955 & 7.15\% \\
135act & 18106 & 15820 & 12.63\% & ccg & 31610 & 29350 & 7.15\% \\
136gtt & 30111 & 26347 & 12.50\% & gcc & 31092 & 28891 & 7.08\% \\
137atg & 30330 & 26547 & 12.47\% & cca & 27790 & 25841 & 7.01\% \\
138tac & 18973 & 16638 & 12.31\% & ggc & 35638 & 33190 & 6.87\% \\
139tga & 33813 & 29655 & 12.30\% & cgg & 33684 & 31425 & 6.71\% \\
140caa & 25552 & 22440 & 12.18\% & cgc & 36963 & 34633 & 6.30\% \\
141ttc & 29021 & 25609 & 11.76\% & gcg & 41053 & 38507 & 6.20\% \\
142\end{tabular}
143\label{tab:contig1-codon}
144\end{table}
145
146\begin{table}[ht]
147\caption{Frequency of condons in contig 2}
148\centering
149\begin{tabular}{c | r | r | r || c | r | r | r}
150Codon & All & In Gene & NOT In Gene & Codon & All & In Gene & NOT In Gene \\
151\hline
152taa & 57022 & 45603 & 20.03\% & tcc & 48188 & 43145 & 10.47\% \\
153tta & 56799 & 45893 & 19.20\% & gaa & 66528 & 59609 & 10.40\% \\
154ata & 53771 & 43900 & 18.36\% & ttc & 69980 & 62761 & 10.32\% \\
155tat & 52554 & 43009 & 18.16\% & ggg & 35982 & 32295 & 10.25\% \\
156aaa & 90017 & 74087 & 17.70\% & ccc & 42293 & 37976 & 10.21\% \\
157att & 68567 & 56580 & 17.48\% & atc & 71291 & 64106 & 10.08\% \\
158aat & 69249 & 57488 & 16.98\% & gga & 44736 & 40249 & 10.03\% \\
159cta & 21962 & 18280 & 16.77\% & gat & 67808 & 61030 & 10.00\% \\
160ttt & 91643 & 76490 & 16.53\% & cac & 56551 & 51543 & 8.86\% \\
161tag & 21710 & 18284 & 15.78\% & gca & 76986 & 70469 & 8.47\% \\
162aca & 49541 & 42363 & 14.49\% & gtg & 50596 & 46385 & 8.32\% \\
163tgt & 46989 & 40725 & 13.33\% & gac & 43889 & 40286 & 8.21\% \\
164cat & 64442 & 55969 & 13.15\% & gtc & 44001 & 40397 & 8.19\% \\
165cct & 42669 & 37175 & 12.88\% & tgc & 75500 & 69609 & 7.80\% \\
166aag & 50103 & 43705 & 12.77\% & cgt & 57354 & 53003 & 7.59\% \\
167ctc & 36551 & 31915 & 12.68\% & acg & 58525 & 54099 & 7.56\% \\
168ctt & 52621 & 45971 & 12.64\% & ctg & 82451 & 76286 & 7.48\% \\
169agg & 39391 & 34455 & 12.53\% & acc & 61306 & 56745 & 7.44\% \\
170tct & 46376 & 40586 & 12.48\% & agc & 65049 & 60249 & 7.38\% \\
171aga & 45930 & 40346 & 12.16\% & gct & 63263 & 58638 & 7.31\% \\
172act & 41282 & 36269 & 12.14\% & cag & 85371 & 79148 & 7.29\% \\
173gag & 33303 & 29291 & 12.05\% & ggt & 57375 & 53332 & 7.05\% \\
174tca & 70886 & 62481 & 11.86\% & cga & 55011 & 51352 & 6.65\% \\
175agt & 40095 & 35356 & 11.82\% & cca & 71910 & 67148 & 6.62\% \\
176gta & 41980 & 37101 & 11.62\% & ggc & 70465 & 65829 & 6.58\% \\
177tga & 65912 & 58336 & 11.49\% & ccg & 70506 & 65879 & 6.56\% \\
178tac & 43233 & 38278 & 11.46\% & tcg & 56497 & 52857 & 6.44\% \\
179caa & 62869 & 55722 & 11.37\% & cgg & 68302 & 63908 & 6.43\% \\
180ttg & 59753 & 53161 & 11.03\% & gcc & 75588 & 70783 & 6.36\% \\
181atg & 60417 & 53782 & 10.98\% & tgg & 64884 & 61044 & 5.92\% \\
182aac & 67047 & 59741 & 10.90\% & cgc & 91544 & 86312 & 5.72\% \\
183gtt & 65266 & 58406 & 10.51\% & gcg & 86699 & 81754 & 5.70\% \\
184\end{tabular}
185\label{tab:contig2-codon}
186\end{table}
187
188\section*{Hidden Markov Model}
189The preferred library for designing and modeling a Hidden Markov Model (HMM), the General Hidden Markov Model library (GHMM)\footnote{http://ghmm.sourceforge.net/ and http://ghmm.org/}, was found to be unusable on the computers within LIACS mostly due to the absence of the Python source code.
190As students have no root access, installing the Python source code proved impossible.
191Therefore we choose to use a MATLAB toolkit for Hidden Markov Models. \\
192Our first attempt was to mimic the design of the HMM referenced during the lectures \cite{ecoli}.
193Although we used a simplified version, we were unable to recreate the expected results.
194Some analysis of simple HMMs showed that so-called silent states (no emission or observation) are not correctly (intuitively) handled by the MATLAB toolkit.
195
196\subsection*{Model Topology}
197As opposed to the presented model in \cite{ecoli} we use only four states.
198These states, however, have very much the same biological meaning: a state for the intergenic region, a state for the gene coding regions, a state for start codons and a state for stop codons.
199Instead of using nucleotides for observation, we use complete codons for observation. \\
200The model is parametrized by the results from the statistics from our dataset.
201Additional Baum-Welch training is not performed as it was found to be not improving the results.
202
203\subsection*{Executing the HMM}
204A number of MATLAB scripts form the framework of our model.
205Some MATLAB data files are also needed.
206These files are created from the dataset and contain information about the gene coding regions and their respective orientation. \\
207The MATLAB HMM toolkit is included in the source directory.
208To run the HMM simply execute the script \texttt{run\_hmm} from the MATLAB command window.
209Please note that due to the complexity and size of the data, the execution of the script may take some minutes.
210The results are shown, after completion, in the MATLAB command window. \\
211Files in the source directory:
212\begin{itemize}
213\item \texttt{c1.mat} MATLAB data file containing the nucleotides from \texttt{AEOO5174v2-1.fas}
214\item \texttt{c2.mat} MATLAB data file containing gene nucleotides from \texttt{AEOO5174v2-2.fas}
215\item \texttt{cton.m} MATLAB function which translates codons to nucleotides
216\item \texttt{find\_genes.m} MATLAB function which uses the HMM toolkit to find genes in a sequence of nucleotides
217\item \texttt{gen\_hmm.m} MATLAB function which creates the HMM from a sequence and the gene coding region information
218\item \texttt{ntoc.m} MATLAB function which translates nucleotides to codons
219\item \texttt{ranges1.mat} MATLAB data file containing the gene coding and orientation information for \texttt{AEOO5174v2-1.fas}
220\item \texttt{ranges2.mat} MATLAB data file containing the gene coding and orientation information for \texttt{AEOO5174v2-2.fas}
221\item \texttt{run\_hmm.m} MATLAB script which builds, executes and test the HMM
222\item \texttt{test\_res.m} MATLAB function which tests the found genes against the gene coding region information
223\end{itemize}
224
225\section*{Results}
226A typical output of our HMM is as follows:
227
228\begin{verbatim}
229% nucleotides in gene recognized
230 94.3267
231
232% nucleotides outside gene recognized
233 91.6520
234
235% genes in table found
236 57.9845
237
238% of genes found not in table
239 49.7807
240
241% genes in table found with error margin <= 10 codons
242 65.3692
243
244% of genes found (with error margin) not in table
245 43.3849
246
247% genes in table found with error margin <= 20 codons
248 71.6875
249
250% of genes found (with error margin) not in table
251 37.9127
252\end{verbatim}
253\noindent
254As is clear from the HMM output, our HMM correctly classifies $58 \%$ of the genes.
255If we allow for a small error ($\le 20$ codons), the HMM can correctly classify $72 \%$ of the genes. \\
256Although this does not sound as impressive as the results in \cite{ecoli}, if we look at the nucleotide level, we see that $94 \%$ of the individual nucleotides are correctly classified as gene coding.
257
258\bibliographystyle{plain}
259\bibliography{report}
260
261\end{document}
Note: See TracBrowser for help on using the repository browser.