Context Navigation

← Previous Revision
Latest Revision
Next Revision →
Blame
Revision Log

report.tex@ 210

Last change on this file since 210 was 67, checked in by Rick van der Zwet, 15 years ago
DBDM - Assignment 4 - report
File size: 13.0 KB

Line
1	\documentclass{llncs}
2	\usepackage{multirow}
3
4	\title{Databases and Data Mining --- Assignment 4}
5
6	\author{Jonathan K. Vis, Hans Wortel and Rick van der Zwet}
7
8	\institute{Leiden Institute of Advanced Computer Science (LIACS),
9	Niels~Bohrweg~1,
10	2333~CA~Leiden,
11	The~Netherlands
12	}
13
14	\begin{document}
15	\maketitle
16
17	\section*{Introduction}
18	Finding genes in DNA can be extremely difficult, due to the large data-sets, not
19	clear distinction between useful and useless information and other factors.
20
21	Hidden Markov Models (\emph{HMMs}) introduced in the late 1960s and early
22	1970s and nicely explained by L.A.~Rabiner in his field of speech recognition
23	\cite{hmm-speech}, could also be used in the field of gene recognition as shown
24	by A.~Krogh~\cite{ecoli}. Very briefly a \emph{HMM} has one basic guideline.
25	You are able to make predictions about the model (using statistics) and you are
26	able to view the output of the model. You are how-ever not able to
27	understand/see the inner workings of the model.
28
29	As the \emph{E. coli} genome was not fully known by that time, this report will
30	focus in making a \emph{HMM} based on the complete genome sequence of the
31	\emph{Escherichia coli} O157:H7 strain EDL933, sequenced by Madison et
32	al.~\footnote{http://www.genome.wisc.edu/sequencing/o157.htm}.
33
34	The data-set we used contig 1 (\emph{AE005174v2-1.fas}) and contig 2
35	(\emph{AE005174v2-2.fas}) both taken from the research website. They are
36	presented in the \emph{FASTA}
37	format~\footnote{http://en.wikipedia.org/wiki/FASTA\_format}. This format is
38	text-based format representing nucleotide sequences (in our case) in which base
39	pairs are represented using single-letter codes. The \emph{FASTA} format can
40	easy be parsed, although a bunch a specific rules has to be
41	followed\footnote{https://proteomecommons.org/tranche/examples/proteomecommons-fasta/fasta.jsp}
42
43	The exact locations of the genes are to be found in the \emph{EDL933} table
44	taken again from the research website.
45
46	\subsection*{Statistics}
47
48	When sequencing genes, codons (i.e. nucleotide triplets) are extremely
49	important as specific combinations are specific markers within the
50	\emph{E.~coli} genome. The start location influences the codons you will find.
51	Take for example the sequence 'AGCTA', the 'C' in this case could appear in 3
52	codons, giving the start location 0,1,2 this will make 'AGC','GCT','CTA'. As
53	genes could be read both ways the right-to-left codons 'ATC','TCG','CGA' are
54	also valid. Those 6 combinations are the so called \emph{reading frames}.
55
56	% result.txt properly does not cover this
57	% Before processing the contig files the overlap of 527 nucleotides between the
58	% contigs (end of contig 2 had overlap with begin contig 1) was removed. There is
59	% also a gap of 4000 nucleotides introduced between contig 1 and contig 2.
60
61	Statistics of individual nucleotides is to be found for contig 1 in
62	table~\ref{tab:in-contig1} and contig 2 is shown in table~\ref{tab:in-contig2}.
63	The unknown values has been removed from the results (1726 vs 4229). This shows
64	that all nucleotides are relatively well spread throughout the genome.
65
66	\begin{table}[ht]
67	\caption{Frequency of individual nucleotides in contig 1}
68	\centering
69	\begin{tabular}{c \| r \| r \| r \| r}
70	Nucleotide & \multicolumn{2}{\|c\|}{All} & \multicolumn{2}{\|c}{In Gene} \\
71	\hline
72	a & 420765 & 24.7\% & 365055 & 24.2\% \\
73	c & 413594 & 24.3\% & 375977 & 24.9\% \\
74	t & 421486 & 24.7\% & 363547 & 24.1\% \\
75	g & 447937 & 26.3\% & 406047 & 26.9\% \\
76	\hline \hline
77	total & 1703782 & 100\% & 1510626 & 100\% \\
78	\end{tabular}
79	\label{tab:in-contig1}
80	\end{table}
81
82	\begin{table}[ht]
83	\caption{Frequency of individual nucleotides in contig 2}
84	\centering
85	\begin{tabular}{c \| r \| r \| r \| r}
86	Nucleotide & \multicolumn{2}{\|c\|}{All} & \multicolumn{2}{\|c}{In Gene} \\
87	\hline
88	a & 932625 & 24.9\% & 818308 & 24.3\% \\
89	c & 963480 & 25.7\% & 881689 & 26.2\% \\
90	t & 928752 & 24.8\% & 816242 & 24.3\% \\
91	g & 925343 & 24.7\% & 848201 & 25.2\% \\
92	\hline \hline
93	total & 3750200 & 100\% & 3364440 & 100\% \\
94	\end{tabular}
95	\label{tab:in-contig2}
96	\end{table}
97
98	The codons (sum of reading frames) for contig 1 shown
99	in~\ref{tab:contig1-codon}, contig 2 is listed in
100	table~\ref{tab:contig2-codon}. This tables shows a significance between codons
101	and there percentages to be part of a gene, although the results has to be
102	interpreted with care as the reading frames are not clear.
103
104	\begin{table}[ht]
105	\caption{Frequency of codons in contig 1}
106	\centering
107	\begin{tabular}{c \| r \| r \| r \|\| c \| r \| r \| r}
108	Codon & All & In Gene & NOT In Gene & Codon & All & In Gene & NOT In Gene \\
109	\hline
110	taa & 25300 & 19666 & 22.27\% & gaa & 32713 & 29042 & 11.22\% \\
111	tta & 25556 & 20255 & 20.74\% & gat & 32453 & 28832 & 11.16\% \\
112	ata & 23973 & 19088 & 20.38\% & aac & 29250 & 25997 & 11.12\% \\
113	ttt & 39653 & 31687 & 20.09\% & ccc & 15693 & 13970 & 10.98\% \\
114	tat & 25469 & 20400 & 19.90\% & gga & 23690 & 21099 & 10.94\% \\
115	aat & 30468 & 24584 & 19.31\% & atc & 29619 & 26383 & 10.93\% \\
116	att & 31292 & 25263 & 19.27\% & ggg & 21610 & 19311 & 10.64\% \\
117	tag & 9388 & 7600 & 19.05\% & tcc & 20000 & 17899 & 10.51\% \\
118	cta & 9244 & 7555 & 18.27\% & gtg & 27553 & 24953 & 9.44\% \\
119	aaa & 40987 & 33719 & 17.73\% & gtc & 20059 & 18172 & 9.41\% \\
120	tgt & 23583 & 19746 & 16.27\% & tgc & 34588 & 31411 & 9.19\% \\
121	aca & 21443 & 18293 & 14.69\% & gca & 34290 & 31155 & 9.14\% \\
122	gta & 20357 & 17398 & 14.54\% & cac & 22072 & 20068 & 9.08\% \\
123	cat & 26817 & 22955 & 14.40\% & ctg & 40230 & 36729 & 8.70\% \\
124	ctt & 21268 & 18296 & 13.97\% & gct & 29419 & 26911 & 8.53\% \\
125	tct & 20394 & 17572 & 13.84\% & ggt & 29209 & 26744 & 8.44\% \\
126	agt & 18920 & 16305 & 13.82\% & gac & 20877 & 19149 & 8.28\% \\
127	aag & 23852 & 20561 & 13.80\% & cgt & 26382 & 24199 & 8.27\% \\
128	ttg & 28105 & 24301 & 13.53\% & agc & 28682 & 26395 & 7.97\% \\
129	agg & 20621 & 17862 & 13.38\% & cag & 38817 & 35751 & 7.90\% \\
130	gag & 17538 & 15213 & 13.26\% & acc & 25982 & 23959 & 7.79\% \\
131	aga & 21373 & 18557 & 13.18\% & acg & 25634 & 23670 & 7.66\% \\
132	tca & 29748 & 25935 & 12.82\% & tcg & 23445 & 21676 & 7.55\% \\
133	cct & 17669 & 15407 & 12.80\% & tgg & 34254 & 31742 & 7.33\% \\
134	ctc & 14849 & 12974 & 12.63\% & cga & 24724 & 22955 & 7.15\% \\
135	act & 18106 & 15820 & 12.63\% & ccg & 31610 & 29350 & 7.15\% \\
136	gtt & 30111 & 26347 & 12.50\% & gcc & 31092 & 28891 & 7.08\% \\
137	atg & 30330 & 26547 & 12.47\% & cca & 27790 & 25841 & 7.01\% \\
138	tac & 18973 & 16638 & 12.31\% & ggc & 35638 & 33190 & 6.87\% \\
139	tga & 33813 & 29655 & 12.30\% & cgg & 33684 & 31425 & 6.71\% \\
140	caa & 25552 & 22440 & 12.18\% & cgc & 36963 & 34633 & 6.30\% \\
141	ttc & 29021 & 25609 & 11.76\% & gcg & 41053 & 38507 & 6.20\% \\
142	\end{tabular}
143	\label{tab:contig1-codon}
144	\end{table}
145
146	\begin{table}[ht]
147	\caption{Frequency of condons in contig 2}
148	\centering
149	\begin{tabular}{c \| r \| r \| r \|\| c \| r \| r \| r}
150	Codon & All & In Gene & NOT In Gene & Codon & All & In Gene & NOT In Gene \\
151	\hline
152	taa & 57022 & 45603 & 20.03\% & tcc & 48188 & 43145 & 10.47\% \\
153	tta & 56799 & 45893 & 19.20\% & gaa & 66528 & 59609 & 10.40\% \\
154	ata & 53771 & 43900 & 18.36\% & ttc & 69980 & 62761 & 10.32\% \\
155	tat & 52554 & 43009 & 18.16\% & ggg & 35982 & 32295 & 10.25\% \\
156	aaa & 90017 & 74087 & 17.70\% & ccc & 42293 & 37976 & 10.21\% \\
157	att & 68567 & 56580 & 17.48\% & atc & 71291 & 64106 & 10.08\% \\
158	aat & 69249 & 57488 & 16.98\% & gga & 44736 & 40249 & 10.03\% \\
159	cta & 21962 & 18280 & 16.77\% & gat & 67808 & 61030 & 10.00\% \\
160	ttt & 91643 & 76490 & 16.53\% & cac & 56551 & 51543 & 8.86\% \\
161	tag & 21710 & 18284 & 15.78\% & gca & 76986 & 70469 & 8.47\% \\
162	aca & 49541 & 42363 & 14.49\% & gtg & 50596 & 46385 & 8.32\% \\
163	tgt & 46989 & 40725 & 13.33\% & gac & 43889 & 40286 & 8.21\% \\
164	cat & 64442 & 55969 & 13.15\% & gtc & 44001 & 40397 & 8.19\% \\
165	cct & 42669 & 37175 & 12.88\% & tgc & 75500 & 69609 & 7.80\% \\
166	aag & 50103 & 43705 & 12.77\% & cgt & 57354 & 53003 & 7.59\% \\
167	ctc & 36551 & 31915 & 12.68\% & acg & 58525 & 54099 & 7.56\% \\
168	ctt & 52621 & 45971 & 12.64\% & ctg & 82451 & 76286 & 7.48\% \\
169	agg & 39391 & 34455 & 12.53\% & acc & 61306 & 56745 & 7.44\% \\
170	tct & 46376 & 40586 & 12.48\% & agc & 65049 & 60249 & 7.38\% \\
171	aga & 45930 & 40346 & 12.16\% & gct & 63263 & 58638 & 7.31\% \\
172	act & 41282 & 36269 & 12.14\% & cag & 85371 & 79148 & 7.29\% \\
173	gag & 33303 & 29291 & 12.05\% & ggt & 57375 & 53332 & 7.05\% \\
174	tca & 70886 & 62481 & 11.86\% & cga & 55011 & 51352 & 6.65\% \\
175	agt & 40095 & 35356 & 11.82\% & cca & 71910 & 67148 & 6.62\% \\
176	gta & 41980 & 37101 & 11.62\% & ggc & 70465 & 65829 & 6.58\% \\
177	tga & 65912 & 58336 & 11.49\% & ccg & 70506 & 65879 & 6.56\% \\
178	tac & 43233 & 38278 & 11.46\% & tcg & 56497 & 52857 & 6.44\% \\
179	caa & 62869 & 55722 & 11.37\% & cgg & 68302 & 63908 & 6.43\% \\
180	ttg & 59753 & 53161 & 11.03\% & gcc & 75588 & 70783 & 6.36\% \\
181	atg & 60417 & 53782 & 10.98\% & tgg & 64884 & 61044 & 5.92\% \\
182	aac & 67047 & 59741 & 10.90\% & cgc & 91544 & 86312 & 5.72\% \\
183	gtt & 65266 & 58406 & 10.51\% & gcg & 86699 & 81754 & 5.70\% \\
184	\end{tabular}
185	\label{tab:contig2-codon}
186	\end{table}
187
188	\section*{Hidden Markov Model}
189	The preferred library for designing and modeling a Hidden Markov Model (HMM), the General Hidden Markov Model library (GHMM)\footnote{http://ghmm.sourceforge.net/ and http://ghmm.org/}, was found to be unusable on the computers within LIACS mostly due to the absence of the Python source code.
190	As students have no root access, installing the Python source code proved impossible.
191	Therefore we choose to use a MATLAB toolkit for Hidden Markov Models. \\
192	Our first attempt was to mimic the design of the HMM referenced during the lectures \cite{ecoli}.
193	Although we used a simplified version, we were unable to recreate the expected results.
194	Some analysis of simple HMMs showed that so-called silent states (no emission or observation) are not correctly (intuitively) handled by the MATLAB toolkit.
195
196	\subsection*{Model Topology}
197	As opposed to the presented model in \cite{ecoli} we use only four states.
198	These states, however, have very much the same biological meaning: a state for the intergenic region, a state for the gene coding regions, a state for start codons and a state for stop codons.
199	Instead of using nucleotides for observation, we use complete codons for observation. \\
200	The model is parametrized by the results from the statistics from our dataset.
201	Additional Baum-Welch training is not performed as it was found to be not improving the results.
202
203	\subsection*{Executing the HMM}
204	A number of MATLAB scripts form the framework of our model.
205	Some MATLAB data files are also needed.
206	These files are created from the dataset and contain information about the gene coding regions and their respective orientation. \\
207	The MATLAB HMM toolkit is included in the source directory.
208	To run the HMM simply execute the script \texttt{run\_hmm} from the MATLAB command window.
209	Please note that due to the complexity and size of the data, the execution of the script may take some minutes.
210	The results are shown, after completion, in the MATLAB command window. \\
211	Files in the source directory:
212	\begin{itemize}
213	\item \texttt{c1.mat} MATLAB data file containing the nucleotides from \texttt{AEOO5174v2-1.fas}
214	\item \texttt{c2.mat} MATLAB data file containing gene nucleotides from \texttt{AEOO5174v2-2.fas}
215	\item \texttt{cton.m} MATLAB function which translates codons to nucleotides
216	\item \texttt{find\_genes.m} MATLAB function which uses the HMM toolkit to find genes in a sequence of nucleotides
217	\item \texttt{gen\_hmm.m} MATLAB function which creates the HMM from a sequence and the gene coding region information
218	\item \texttt{ntoc.m} MATLAB function which translates nucleotides to codons
219	\item \texttt{ranges1.mat} MATLAB data file containing the gene coding and orientation information for \texttt{AEOO5174v2-1.fas}
220	\item \texttt{ranges2.mat} MATLAB data file containing the gene coding and orientation information for \texttt{AEOO5174v2-2.fas}
221	\item \texttt{run\_hmm.m} MATLAB script which builds, executes and test the HMM
222	\item \texttt{test\_res.m} MATLAB function which tests the found genes against the gene coding region information
223	\end{itemize}
224
225	\section*{Results}
226	A typical output of our HMM is as follows:
227
228	\begin{verbatim}
229	% nucleotides in gene recognized
230	94.3267
231
232	% nucleotides outside gene recognized
233	91.6520
234
235	% genes in table found
236	57.9845
237
238	% of genes found not in table
239	49.7807
240
241	% genes in table found with error margin <= 10 codons
242	65.3692
243
244	% of genes found (with error margin) not in table
245	43.3849
246
247	% genes in table found with error margin <= 20 codons
248	71.6875
249
250	% of genes found (with error margin) not in table
251	37.9127
252	\end{verbatim}
253	\noindent
254	As is clear from the HMM output, our HMM correctly classifies $58 \%$ of the genes.
255	If we allow for a small error ($\le 20$ codons), the HMM can correctly classify $72 \%$ of the genes. \\
256	Although this does not sound as impressive as the results in \cite{ecoli}, if we look at the nucleotide level, we see that $94 \%$ of the individual nucleotides are correctly classified as gene coding.
257
258	\bibliographystyle{plain}
259	\bibliography{report}
260
261	\end{document}

Note: See TracBrowser for help on using the repository browser.

Context Navigation

source: liacs/dbdm/dbdm_4/report/report.tex@ 210

Download in other formats: