Context Navigation

report.tex

Last change on this file was 67, checked in by Rick van der Zwet, 15 years ago
DBDM - Assignment 4 - report
File size: 13.0 KB

Rev	Line
[67]	1	\documentclass{llncs}
	2	\usepackage{multirow}
	3
	4	\title{Databases and Data Mining --- Assignment 4}
	5
	6	\author{Jonathan K. Vis, Hans Wortel and Rick van der Zwet}
	7
	8	\institute{Leiden Institute of Advanced Computer Science (LIACS),
	9	Niels~Bohrweg~1,
	10	2333~CA~Leiden,
	11	The~Netherlands
	12	}
	13
	14	\begin{document}
	15	\maketitle
	16
	17	\section*{Introduction}
	18	Finding genes in DNA can be extremely difficult, due to the large data-sets, not
	19	clear distinction between useful and useless information and other factors.
	20
	21	Hidden Markov Models (\emph{HMMs}) introduced in the late 1960s and early
	22	1970s and nicely explained by L.A.~Rabiner in his field of speech recognition
	23	\cite{hmm-speech}, could also be used in the field of gene recognition as shown
	24	by A.~Krogh~\cite{ecoli}. Very briefly a \emph{HMM} has one basic guideline.
	25	You are able to make predictions about the model (using statistics) and you are
	26	able to view the output of the model. You are how-ever not able to
	27	understand/see the inner workings of the model.
	28
	29	As the \emph{E. coli} genome was not fully known by that time, this report will
	30	focus in making a \emph{HMM} based on the complete genome sequence of the
	31	\emph{Escherichia coli} O157:H7 strain EDL933, sequenced by Madison et
	32	al.~\footnote{http://www.genome.wisc.edu/sequencing/o157.htm}.
	33
	34	The data-set we used contig 1 (\emph{AE005174v2-1.fas}) and contig 2
	35	(\emph{AE005174v2-2.fas}) both taken from the research website. They are
	36	presented in the \emph{FASTA}
	37	format~\footnote{http://en.wikipedia.org/wiki/FASTA\_format}. This format is
	38	text-based format representing nucleotide sequences (in our case) in which base
	39	pairs are represented using single-letter codes. The \emph{FASTA} format can
	40	easy be parsed, although a bunch a specific rules has to be
	41	followed\footnote{https://proteomecommons.org/tranche/examples/proteomecommons-fasta/fasta.jsp}
	42
	43	The exact locations of the genes are to be found in the \emph{EDL933} table
	44	taken again from the research website.
	45
	46	\subsection*{Statistics}
	47
	48	When sequencing genes, codons (i.e. nucleotide triplets) are extremely
	49	important as specific combinations are specific markers within the
	50	\emph{E.~coli} genome. The start location influences the codons you will find.
	51	Take for example the sequence 'AGCTA', the 'C' in this case could appear in 3
	52	codons, giving the start location 0,1,2 this will make 'AGC','GCT','CTA'. As
	53	genes could be read both ways the right-to-left codons 'ATC','TCG','CGA' are
	54	also valid. Those 6 combinations are the so called \emph{reading frames}.
	55
	56	% result.txt properly does not cover this
	57	% Before processing the contig files the overlap of 527 nucleotides between the
	58	% contigs (end of contig 2 had overlap with begin contig 1) was removed. There is
	59	% also a gap of 4000 nucleotides introduced between contig 1 and contig 2.
	60
	61	Statistics of individual nucleotides is to be found for contig 1 in
	62	table~\ref{tab:in-contig1} and contig 2 is shown in table~\ref{tab:in-contig2}.
	63	The unknown values has been removed from the results (1726 vs 4229). This shows
	64	that all nucleotides are relatively well spread throughout the genome.
	65
	66	\begin{table}[ht]
	67	\caption{Frequency of individual nucleotides in contig 1}
	68	\centering
	69	\begin{tabular}{c \| r \| r \| r \| r}
	70	Nucleotide & \multicolumn{2}{\|c\|}{All} & \multicolumn{2}{\|c}{In Gene} \\
	71	\hline
	72	a & 420765 & 24.7\% & 365055 & 24.2\% \\
	73	c & 413594 & 24.3\% & 375977 & 24.9\% \\
	74	t & 421486 & 24.7\% & 363547 & 24.1\% \\
	75	g & 447937 & 26.3\% & 406047 & 26.9\% \\
	76	\hline \hline
	77	total & 1703782 & 100\% & 1510626 & 100\% \\
	78	\end{tabular}
	79	\label{tab:in-contig1}
	80	\end{table}
	81
	82	\begin{table}[ht]
	83	\caption{Frequency of individual nucleotides in contig 2}
	84	\centering
	85	\begin{tabular}{c \| r \| r \| r \| r}
	86	Nucleotide & \multicolumn{2}{\|c\|}{All} & \multicolumn{2}{\|c}{In Gene} \\
	87	\hline
	88	a & 932625 & 24.9\% & 818308 & 24.3\% \\
	89	c & 963480 & 25.7\% & 881689 & 26.2\% \\
	90	t & 928752 & 24.8\% & 816242 & 24.3\% \\
	91	g & 925343 & 24.7\% & 848201 & 25.2\% \\
	92	\hline \hline
	93	total & 3750200 & 100\% & 3364440 & 100\% \\
	94	\end{tabular}
	95	\label{tab:in-contig2}
	96	\end{table}
	97
	98	The codons (sum of reading frames) for contig 1 shown
	99	in~\ref{tab:contig1-codon}, contig 2 is listed in
	100	table~\ref{tab:contig2-codon}. This tables shows a significance between codons
	101	and there percentages to be part of a gene, although the results has to be
	102	interpreted with care as the reading frames are not clear.
	103
	104	\begin{table}[ht]
	105	\caption{Frequency of codons in contig 1}
	106	\centering
	107	\begin{tabular}{c \| r \| r \| r \|\| c \| r \| r \| r}
	108	Codon & All & In Gene & NOT In Gene & Codon & All & In Gene & NOT In Gene \\
	109	\hline
	110	taa & 25300 & 19666 & 22.27\% & gaa & 32713 & 29042 & 11.22\% \\
	111	tta & 25556 & 20255 & 20.74\% & gat & 32453 & 28832 & 11.16\% \\
	112	ata & 23973 & 19088 & 20.38\% & aac & 29250 & 25997 & 11.12\% \\
	113	ttt & 39653 & 31687 & 20.09\% & ccc & 15693 & 13970 & 10.98\% \\
	114	tat & 25469 & 20400 & 19.90\% & gga & 23690 & 21099 & 10.94\% \\
	115	aat & 30468 & 24584 & 19.31\% & atc & 29619 & 26383 & 10.93\% \\
	116	att & 31292 & 25263 & 19.27\% & ggg & 21610 & 19311 & 10.64\% \\
	117	tag & 9388 & 7600 & 19.05\% & tcc & 20000 & 17899 & 10.51\% \\
	118	cta & 9244 & 7555 & 18.27\% & gtg & 27553 & 24953 & 9.44\% \\
	119	aaa & 40987 & 33719 & 17.73\% & gtc & 20059 & 18172 & 9.41\% \\
	120	tgt & 23583 & 19746 & 16.27\% & tgc & 34588 & 31411 & 9.19\% \\
	121	aca & 21443 & 18293 & 14.69\% & gca & 34290 & 31155 & 9.14\% \\
	122	gta & 20357 & 17398 & 14.54\% & cac & 22072 & 20068 & 9.08\% \\
	123	cat & 26817 & 22955 & 14.40\% & ctg & 40230 & 36729 & 8.70\% \\
	124	ctt & 21268 & 18296 & 13.97\% & gct & 29419 & 26911 & 8.53\% \\
	125	tct & 20394 & 17572 & 13.84\% & ggt & 29209 & 26744 & 8.44\% \\
	126	agt & 18920 & 16305 & 13.82\% & gac & 20877 & 19149 & 8.28\% \\
	127	aag & 23852 & 20561 & 13.80\% & cgt & 26382 & 24199 & 8.27\% \\
	128	ttg & 28105 & 24301 & 13.53\% & agc & 28682 & 26395 & 7.97\% \\
	129	agg & 20621 & 17862 & 13.38\% & cag & 38817 & 35751 & 7.90\% \\
	130	gag & 17538 & 15213 & 13.26\% & acc & 25982 & 23959 & 7.79\% \\
	131	aga & 21373 & 18557 & 13.18\% & acg & 25634 & 23670 & 7.66\% \\
	132	tca & 29748 & 25935 & 12.82\% & tcg & 23445 & 21676 & 7.55\% \\
	133	cct & 17669 & 15407 & 12.80\% & tgg & 34254 & 31742 & 7.33\% \\
	134	ctc & 14849 & 12974 & 12.63\% & cga & 24724 & 22955 & 7.15\% \\
	135	act & 18106 & 15820 & 12.63\% & ccg & 31610 & 29350 & 7.15\% \\
	136	gtt & 30111 & 26347 & 12.50\% & gcc & 31092 & 28891 & 7.08\% \\
	137	atg & 30330 & 26547 & 12.47\% & cca & 27790 & 25841 & 7.01\% \\
	138	tac & 18973 & 16638 & 12.31\% & ggc & 35638 & 33190 & 6.87\% \\
	139	tga & 33813 & 29655 & 12.30\% & cgg & 33684 & 31425 & 6.71\% \\
	140	caa & 25552 & 22440 & 12.18\% & cgc & 36963 & 34633 & 6.30\% \\
	141	ttc & 29021 & 25609 & 11.76\% & gcg & 41053 & 38507 & 6.20\% \\
	142	\end{tabular}
	143	\label{tab:contig1-codon}
	144	\end{table}
	145
	146	\begin{table}[ht]
	147	\caption{Frequency of condons in contig 2}
	148	\centering
	149	\begin{tabular}{c \| r \| r \| r \|\| c \| r \| r \| r}
	150	Codon & All & In Gene & NOT In Gene & Codon & All & In Gene & NOT In Gene \\
	151	\hline
	152	taa & 57022 & 45603 & 20.03\% & tcc & 48188 & 43145 & 10.47\% \\
	153	tta & 56799 & 45893 & 19.20\% & gaa & 66528 & 59609 & 10.40\% \\
	154	ata & 53771 & 43900 & 18.36\% & ttc & 69980 & 62761 & 10.32\% \\
	155	tat & 52554 & 43009 & 18.16\% & ggg & 35982 & 32295 & 10.25\% \\
	156	aaa & 90017 & 74087 & 17.70\% & ccc & 42293 & 37976 & 10.21\% \\
	157	att & 68567 & 56580 & 17.48\% & atc & 71291 & 64106 & 10.08\% \\
	158	aat & 69249 & 57488 & 16.98\% & gga & 44736 & 40249 & 10.03\% \\
	159	cta & 21962 & 18280 & 16.77\% & gat & 67808 & 61030 & 10.00\% \\
	160	ttt & 91643 & 76490 & 16.53\% & cac & 56551 & 51543 & 8.86\% \\
	161	tag & 21710 & 18284 & 15.78\% & gca & 76986 & 70469 & 8.47\% \\
	162	aca & 49541 & 42363 & 14.49\% & gtg & 50596 & 46385 & 8.32\% \\
	163	tgt & 46989 & 40725 & 13.33\% & gac & 43889 & 40286 & 8.21\% \\
	164	cat & 64442 & 55969 & 13.15\% & gtc & 44001 & 40397 & 8.19\% \\
	165	cct & 42669 & 37175 & 12.88\% & tgc & 75500 & 69609 & 7.80\% \\
	166	aag & 50103 & 43705 & 12.77\% & cgt & 57354 & 53003 & 7.59\% \\
	167	ctc & 36551 & 31915 & 12.68\% & acg & 58525 & 54099 & 7.56\% \\
	168	ctt & 52621 & 45971 & 12.64\% & ctg & 82451 & 76286 & 7.48\% \\
	169	agg & 39391 & 34455 & 12.53\% & acc & 61306 & 56745 & 7.44\% \\
	170	tct & 46376 & 40586 & 12.48\% & agc & 65049 & 60249 & 7.38\% \\
	171	aga & 45930 & 40346 & 12.16\% & gct & 63263 & 58638 & 7.31\% \\
	172	act & 41282 & 36269 & 12.14\% & cag & 85371 & 79148 & 7.29\% \\
	173	gag & 33303 & 29291 & 12.05\% & ggt & 57375 & 53332 & 7.05\% \\
	174	tca & 70886 & 62481 & 11.86\% & cga & 55011 & 51352 & 6.65\% \\
	175	agt & 40095 & 35356 & 11.82\% & cca & 71910 & 67148 & 6.62\% \\
	176	gta & 41980 & 37101 & 11.62\% & ggc & 70465 & 65829 & 6.58\% \\
	177	tga & 65912 & 58336 & 11.49\% & ccg & 70506 & 65879 & 6.56\% \\
	178	tac & 43233 & 38278 & 11.46\% & tcg & 56497 & 52857 & 6.44\% \\
	179	caa & 62869 & 55722 & 11.37\% & cgg & 68302 & 63908 & 6.43\% \\
	180	ttg & 59753 & 53161 & 11.03\% & gcc & 75588 & 70783 & 6.36\% \\
	181	atg & 60417 & 53782 & 10.98\% & tgg & 64884 & 61044 & 5.92\% \\
	182	aac & 67047 & 59741 & 10.90\% & cgc & 91544 & 86312 & 5.72\% \\
	183	gtt & 65266 & 58406 & 10.51\% & gcg & 86699 & 81754 & 5.70\% \\
	184	\end{tabular}
	185	\label{tab:contig2-codon}
	186	\end{table}
	187
	188	\section*{Hidden Markov Model}
	189	The preferred library for designing and modeling a Hidden Markov Model (HMM), the General Hidden Markov Model library (GHMM)\footnote{http://ghmm.sourceforge.net/ and http://ghmm.org/}, was found to be unusable on the computers within LIACS mostly due to the absence of the Python source code.
	190	As students have no root access, installing the Python source code proved impossible.
	191	Therefore we choose to use a MATLAB toolkit for Hidden Markov Models. \\
	192	Our first attempt was to mimic the design of the HMM referenced during the lectures \cite{ecoli}.
	193	Although we used a simplified version, we were unable to recreate the expected results.
	194	Some analysis of simple HMMs showed that so-called silent states (no emission or observation) are not correctly (intuitively) handled by the MATLAB toolkit.
	195
	196	\subsection*{Model Topology}
	197	As opposed to the presented model in \cite{ecoli} we use only four states.
	198	These states, however, have very much the same biological meaning: a state for the intergenic region, a state for the gene coding regions, a state for start codons and a state for stop codons.
	199	Instead of using nucleotides for observation, we use complete codons for observation. \\
	200	The model is parametrized by the results from the statistics from our dataset.
	201	Additional Baum-Welch training is not performed as it was found to be not improving the results.
	202
	203	\subsection*{Executing the HMM}
	204	A number of MATLAB scripts form the framework of our model.
	205	Some MATLAB data files are also needed.
	206	These files are created from the dataset and contain information about the gene coding regions and their respective orientation. \\
	207	The MATLAB HMM toolkit is included in the source directory.
	208	To run the HMM simply execute the script \texttt{run\_hmm} from the MATLAB command window.
	209	Please note that due to the complexity and size of the data, the execution of the script may take some minutes.
	210	The results are shown, after completion, in the MATLAB command window. \\
	211	Files in the source directory:
	212	\begin{itemize}
	213	\item \texttt{c1.mat} MATLAB data file containing the nucleotides from \texttt{AEOO5174v2-1.fas}
	214	\item \texttt{c2.mat} MATLAB data file containing gene nucleotides from \texttt{AEOO5174v2-2.fas}
	215	\item \texttt{cton.m} MATLAB function which translates codons to nucleotides
	216	\item \texttt{find\_genes.m} MATLAB function which uses the HMM toolkit to find genes in a sequence of nucleotides
	217	\item \texttt{gen\_hmm.m} MATLAB function which creates the HMM from a sequence and the gene coding region information
	218	\item \texttt{ntoc.m} MATLAB function which translates nucleotides to codons
	219	\item \texttt{ranges1.mat} MATLAB data file containing the gene coding and orientation information for \texttt{AEOO5174v2-1.fas}
	220	\item \texttt{ranges2.mat} MATLAB data file containing the gene coding and orientation information for \texttt{AEOO5174v2-2.fas}
	221	\item \texttt{run\_hmm.m} MATLAB script which builds, executes and test the HMM
	222	\item \texttt{test\_res.m} MATLAB function which tests the found genes against the gene coding region information
	223	\end{itemize}
	224
	225	\section*{Results}
	226	A typical output of our HMM is as follows:
	227
	228	\begin{verbatim}
	229	% nucleotides in gene recognized
	230	94.3267
	231
	232	% nucleotides outside gene recognized
	233	91.6520
	234
	235	% genes in table found
	236	57.9845
	237
	238	% of genes found not in table
	239	49.7807
	240
	241	% genes in table found with error margin <= 10 codons
	242	65.3692
	243
	244	% of genes found (with error margin) not in table
	245	43.3849
	246
	247	% genes in table found with error margin <= 20 codons
	248	71.6875
	249
	250	% of genes found (with error margin) not in table
	251	37.9127
	252	\end{verbatim}
	253	\noindent
	254	As is clear from the HMM output, our HMM correctly classifies $58 \%$ of the genes.
	255	If we allow for a small error ($\le 20$ codons), the HMM can correctly classify $72 \%$ of the genes. \\
	256	Although this does not sound as impressive as the results in \cite{ecoli}, if we look at the nucleotide level, we see that $94 \%$ of the individual nucleotides are correctly classified as gene coding.
	257
	258	\bibliographystyle{plain}
	259	\bibliography{report}
	260
	261	\end{document}

Note: See TracBrowser for help on using the repository browser.

Context Navigation

source: liacs/dbdm/dbdm_4/report/report.tex

Download in other formats: