[67] | 1 | \documentclass{llncs}
|
---|
| 2 | \usepackage{multirow}
|
---|
| 3 |
|
---|
| 4 | \title{Databases and Data Mining --- Assignment 4}
|
---|
| 5 |
|
---|
| 6 | \author{Jonathan K. Vis, Hans Wortel and Rick van der Zwet}
|
---|
| 7 |
|
---|
| 8 | \institute{Leiden Institute of Advanced Computer Science (LIACS),
|
---|
| 9 | Niels~Bohrweg~1,
|
---|
| 10 | 2333~CA~Leiden,
|
---|
| 11 | The~Netherlands
|
---|
| 12 | }
|
---|
| 13 |
|
---|
| 14 | \begin{document}
|
---|
| 15 | \maketitle
|
---|
| 16 |
|
---|
| 17 | \section*{Introduction}
|
---|
| 18 | Finding genes in DNA can be extremely difficult, due to the large data-sets, not
|
---|
| 19 | clear distinction between useful and useless information and other factors.
|
---|
| 20 |
|
---|
| 21 | Hidden Markov Models (\emph{HMMs}) introduced in the late 1960s and early
|
---|
| 22 | 1970s and nicely explained by L.A.~Rabiner in his field of speech recognition
|
---|
| 23 | \cite{hmm-speech}, could also be used in the field of gene recognition as shown
|
---|
| 24 | by A.~Krogh~\cite{ecoli}. Very briefly a \emph{HMM} has one basic guideline.
|
---|
| 25 | You are able to make predictions about the model (using statistics) and you are
|
---|
| 26 | able to view the output of the model. You are how-ever not able to
|
---|
| 27 | understand/see the inner workings of the model.
|
---|
| 28 |
|
---|
| 29 | As the \emph{E. coli} genome was not fully known by that time, this report will
|
---|
| 30 | focus in making a \emph{HMM} based on the complete genome sequence of the
|
---|
| 31 | \emph{Escherichia coli} O157:H7 strain EDL933, sequenced by Madison et
|
---|
| 32 | al.~\footnote{http://www.genome.wisc.edu/sequencing/o157.htm}.
|
---|
| 33 |
|
---|
| 34 | The data-set we used contig 1 (\emph{AE005174v2-1.fas}) and contig 2
|
---|
| 35 | (\emph{AE005174v2-2.fas}) both taken from the research website. They are
|
---|
| 36 | presented in the \emph{FASTA}
|
---|
| 37 | format~\footnote{http://en.wikipedia.org/wiki/FASTA\_format}. This format is
|
---|
| 38 | text-based format representing nucleotide sequences (in our case) in which base
|
---|
| 39 | pairs are represented using single-letter codes. The \emph{FASTA} format can
|
---|
| 40 | easy be parsed, although a bunch a specific rules has to be
|
---|
| 41 | followed\footnote{https://proteomecommons.org/tranche/examples/proteomecommons-fasta/fasta.jsp}
|
---|
| 42 |
|
---|
| 43 | The exact locations of the genes are to be found in the \emph{EDL933} table
|
---|
| 44 | taken again from the research website.
|
---|
| 45 |
|
---|
| 46 | \subsection*{Statistics}
|
---|
| 47 |
|
---|
| 48 | When sequencing genes, codons (i.e. nucleotide triplets) are extremely
|
---|
| 49 | important as specific combinations are specific markers within the
|
---|
| 50 | \emph{E.~coli} genome. The start location influences the codons you will find.
|
---|
| 51 | Take for example the sequence 'AGCTA', the 'C' in this case could appear in 3
|
---|
| 52 | codons, giving the start location 0,1,2 this will make 'AGC','GCT','CTA'. As
|
---|
| 53 | genes could be read both ways the right-to-left codons 'ATC','TCG','CGA' are
|
---|
| 54 | also valid. Those 6 combinations are the so called \emph{reading frames}.
|
---|
| 55 |
|
---|
| 56 | % result.txt properly does not cover this
|
---|
| 57 | % Before processing the contig files the overlap of 527 nucleotides between the
|
---|
| 58 | % contigs (end of contig 2 had overlap with begin contig 1) was removed. There is
|
---|
| 59 | % also a gap of 4000 nucleotides introduced between contig 1 and contig 2.
|
---|
| 60 |
|
---|
| 61 | Statistics of individual nucleotides is to be found for contig 1 in
|
---|
| 62 | table~\ref{tab:in-contig1} and contig 2 is shown in table~\ref{tab:in-contig2}.
|
---|
| 63 | The unknown values has been removed from the results (1726 vs 4229). This shows
|
---|
| 64 | that all nucleotides are relatively well spread throughout the genome.
|
---|
| 65 |
|
---|
| 66 | \begin{table}[ht]
|
---|
| 67 | \caption{Frequency of individual nucleotides in contig 1}
|
---|
| 68 | \centering
|
---|
| 69 | \begin{tabular}{c | r | r | r | r}
|
---|
| 70 | Nucleotide & \multicolumn{2}{|c|}{All} & \multicolumn{2}{|c}{In Gene} \\
|
---|
| 71 | \hline
|
---|
| 72 | a & 420765 & 24.7\% & 365055 & 24.2\% \\
|
---|
| 73 | c & 413594 & 24.3\% & 375977 & 24.9\% \\
|
---|
| 74 | t & 421486 & 24.7\% & 363547 & 24.1\% \\
|
---|
| 75 | g & 447937 & 26.3\% & 406047 & 26.9\% \\
|
---|
| 76 | \hline \hline
|
---|
| 77 | total & 1703782 & 100\% & 1510626 & 100\% \\
|
---|
| 78 | \end{tabular}
|
---|
| 79 | \label{tab:in-contig1}
|
---|
| 80 | \end{table}
|
---|
| 81 |
|
---|
| 82 | \begin{table}[ht]
|
---|
| 83 | \caption{Frequency of individual nucleotides in contig 2}
|
---|
| 84 | \centering
|
---|
| 85 | \begin{tabular}{c | r | r | r | r}
|
---|
| 86 | Nucleotide & \multicolumn{2}{|c|}{All} & \multicolumn{2}{|c}{In Gene} \\
|
---|
| 87 | \hline
|
---|
| 88 | a & 932625 & 24.9\% & 818308 & 24.3\% \\
|
---|
| 89 | c & 963480 & 25.7\% & 881689 & 26.2\% \\
|
---|
| 90 | t & 928752 & 24.8\% & 816242 & 24.3\% \\
|
---|
| 91 | g & 925343 & 24.7\% & 848201 & 25.2\% \\
|
---|
| 92 | \hline \hline
|
---|
| 93 | total & 3750200 & 100\% & 3364440 & 100\% \\
|
---|
| 94 | \end{tabular}
|
---|
| 95 | \label{tab:in-contig2}
|
---|
| 96 | \end{table}
|
---|
| 97 |
|
---|
| 98 | The codons (sum of reading frames) for contig 1 shown
|
---|
| 99 | in~\ref{tab:contig1-codon}, contig 2 is listed in
|
---|
| 100 | table~\ref{tab:contig2-codon}. This tables shows a significance between codons
|
---|
| 101 | and there percentages to be part of a gene, although the results has to be
|
---|
| 102 | interpreted with care as the reading frames are not clear.
|
---|
| 103 |
|
---|
| 104 | \begin{table}[ht]
|
---|
| 105 | \caption{Frequency of codons in contig 1}
|
---|
| 106 | \centering
|
---|
| 107 | \begin{tabular}{c | r | r | r || c | r | r | r}
|
---|
| 108 | Codon & All & In Gene & NOT In Gene & Codon & All & In Gene & NOT In Gene \\
|
---|
| 109 | \hline
|
---|
| 110 | taa & 25300 & 19666 & 22.27\% & gaa & 32713 & 29042 & 11.22\% \\
|
---|
| 111 | tta & 25556 & 20255 & 20.74\% & gat & 32453 & 28832 & 11.16\% \\
|
---|
| 112 | ata & 23973 & 19088 & 20.38\% & aac & 29250 & 25997 & 11.12\% \\
|
---|
| 113 | ttt & 39653 & 31687 & 20.09\% & ccc & 15693 & 13970 & 10.98\% \\
|
---|
| 114 | tat & 25469 & 20400 & 19.90\% & gga & 23690 & 21099 & 10.94\% \\
|
---|
| 115 | aat & 30468 & 24584 & 19.31\% & atc & 29619 & 26383 & 10.93\% \\
|
---|
| 116 | att & 31292 & 25263 & 19.27\% & ggg & 21610 & 19311 & 10.64\% \\
|
---|
| 117 | tag & 9388 & 7600 & 19.05\% & tcc & 20000 & 17899 & 10.51\% \\
|
---|
| 118 | cta & 9244 & 7555 & 18.27\% & gtg & 27553 & 24953 & 9.44\% \\
|
---|
| 119 | aaa & 40987 & 33719 & 17.73\% & gtc & 20059 & 18172 & 9.41\% \\
|
---|
| 120 | tgt & 23583 & 19746 & 16.27\% & tgc & 34588 & 31411 & 9.19\% \\
|
---|
| 121 | aca & 21443 & 18293 & 14.69\% & gca & 34290 & 31155 & 9.14\% \\
|
---|
| 122 | gta & 20357 & 17398 & 14.54\% & cac & 22072 & 20068 & 9.08\% \\
|
---|
| 123 | cat & 26817 & 22955 & 14.40\% & ctg & 40230 & 36729 & 8.70\% \\
|
---|
| 124 | ctt & 21268 & 18296 & 13.97\% & gct & 29419 & 26911 & 8.53\% \\
|
---|
| 125 | tct & 20394 & 17572 & 13.84\% & ggt & 29209 & 26744 & 8.44\% \\
|
---|
| 126 | agt & 18920 & 16305 & 13.82\% & gac & 20877 & 19149 & 8.28\% \\
|
---|
| 127 | aag & 23852 & 20561 & 13.80\% & cgt & 26382 & 24199 & 8.27\% \\
|
---|
| 128 | ttg & 28105 & 24301 & 13.53\% & agc & 28682 & 26395 & 7.97\% \\
|
---|
| 129 | agg & 20621 & 17862 & 13.38\% & cag & 38817 & 35751 & 7.90\% \\
|
---|
| 130 | gag & 17538 & 15213 & 13.26\% & acc & 25982 & 23959 & 7.79\% \\
|
---|
| 131 | aga & 21373 & 18557 & 13.18\% & acg & 25634 & 23670 & 7.66\% \\
|
---|
| 132 | tca & 29748 & 25935 & 12.82\% & tcg & 23445 & 21676 & 7.55\% \\
|
---|
| 133 | cct & 17669 & 15407 & 12.80\% & tgg & 34254 & 31742 & 7.33\% \\
|
---|
| 134 | ctc & 14849 & 12974 & 12.63\% & cga & 24724 & 22955 & 7.15\% \\
|
---|
| 135 | act & 18106 & 15820 & 12.63\% & ccg & 31610 & 29350 & 7.15\% \\
|
---|
| 136 | gtt & 30111 & 26347 & 12.50\% & gcc & 31092 & 28891 & 7.08\% \\
|
---|
| 137 | atg & 30330 & 26547 & 12.47\% & cca & 27790 & 25841 & 7.01\% \\
|
---|
| 138 | tac & 18973 & 16638 & 12.31\% & ggc & 35638 & 33190 & 6.87\% \\
|
---|
| 139 | tga & 33813 & 29655 & 12.30\% & cgg & 33684 & 31425 & 6.71\% \\
|
---|
| 140 | caa & 25552 & 22440 & 12.18\% & cgc & 36963 & 34633 & 6.30\% \\
|
---|
| 141 | ttc & 29021 & 25609 & 11.76\% & gcg & 41053 & 38507 & 6.20\% \\
|
---|
| 142 | \end{tabular}
|
---|
| 143 | \label{tab:contig1-codon}
|
---|
| 144 | \end{table}
|
---|
| 145 |
|
---|
| 146 | \begin{table}[ht]
|
---|
| 147 | \caption{Frequency of condons in contig 2}
|
---|
| 148 | \centering
|
---|
| 149 | \begin{tabular}{c | r | r | r || c | r | r | r}
|
---|
| 150 | Codon & All & In Gene & NOT In Gene & Codon & All & In Gene & NOT In Gene \\
|
---|
| 151 | \hline
|
---|
| 152 | taa & 57022 & 45603 & 20.03\% & tcc & 48188 & 43145 & 10.47\% \\
|
---|
| 153 | tta & 56799 & 45893 & 19.20\% & gaa & 66528 & 59609 & 10.40\% \\
|
---|
| 154 | ata & 53771 & 43900 & 18.36\% & ttc & 69980 & 62761 & 10.32\% \\
|
---|
| 155 | tat & 52554 & 43009 & 18.16\% & ggg & 35982 & 32295 & 10.25\% \\
|
---|
| 156 | aaa & 90017 & 74087 & 17.70\% & ccc & 42293 & 37976 & 10.21\% \\
|
---|
| 157 | att & 68567 & 56580 & 17.48\% & atc & 71291 & 64106 & 10.08\% \\
|
---|
| 158 | aat & 69249 & 57488 & 16.98\% & gga & 44736 & 40249 & 10.03\% \\
|
---|
| 159 | cta & 21962 & 18280 & 16.77\% & gat & 67808 & 61030 & 10.00\% \\
|
---|
| 160 | ttt & 91643 & 76490 & 16.53\% & cac & 56551 & 51543 & 8.86\% \\
|
---|
| 161 | tag & 21710 & 18284 & 15.78\% & gca & 76986 & 70469 & 8.47\% \\
|
---|
| 162 | aca & 49541 & 42363 & 14.49\% & gtg & 50596 & 46385 & 8.32\% \\
|
---|
| 163 | tgt & 46989 & 40725 & 13.33\% & gac & 43889 & 40286 & 8.21\% \\
|
---|
| 164 | cat & 64442 & 55969 & 13.15\% & gtc & 44001 & 40397 & 8.19\% \\
|
---|
| 165 | cct & 42669 & 37175 & 12.88\% & tgc & 75500 & 69609 & 7.80\% \\
|
---|
| 166 | aag & 50103 & 43705 & 12.77\% & cgt & 57354 & 53003 & 7.59\% \\
|
---|
| 167 | ctc & 36551 & 31915 & 12.68\% & acg & 58525 & 54099 & 7.56\% \\
|
---|
| 168 | ctt & 52621 & 45971 & 12.64\% & ctg & 82451 & 76286 & 7.48\% \\
|
---|
| 169 | agg & 39391 & 34455 & 12.53\% & acc & 61306 & 56745 & 7.44\% \\
|
---|
| 170 | tct & 46376 & 40586 & 12.48\% & agc & 65049 & 60249 & 7.38\% \\
|
---|
| 171 | aga & 45930 & 40346 & 12.16\% & gct & 63263 & 58638 & 7.31\% \\
|
---|
| 172 | act & 41282 & 36269 & 12.14\% & cag & 85371 & 79148 & 7.29\% \\
|
---|
| 173 | gag & 33303 & 29291 & 12.05\% & ggt & 57375 & 53332 & 7.05\% \\
|
---|
| 174 | tca & 70886 & 62481 & 11.86\% & cga & 55011 & 51352 & 6.65\% \\
|
---|
| 175 | agt & 40095 & 35356 & 11.82\% & cca & 71910 & 67148 & 6.62\% \\
|
---|
| 176 | gta & 41980 & 37101 & 11.62\% & ggc & 70465 & 65829 & 6.58\% \\
|
---|
| 177 | tga & 65912 & 58336 & 11.49\% & ccg & 70506 & 65879 & 6.56\% \\
|
---|
| 178 | tac & 43233 & 38278 & 11.46\% & tcg & 56497 & 52857 & 6.44\% \\
|
---|
| 179 | caa & 62869 & 55722 & 11.37\% & cgg & 68302 & 63908 & 6.43\% \\
|
---|
| 180 | ttg & 59753 & 53161 & 11.03\% & gcc & 75588 & 70783 & 6.36\% \\
|
---|
| 181 | atg & 60417 & 53782 & 10.98\% & tgg & 64884 & 61044 & 5.92\% \\
|
---|
| 182 | aac & 67047 & 59741 & 10.90\% & cgc & 91544 & 86312 & 5.72\% \\
|
---|
| 183 | gtt & 65266 & 58406 & 10.51\% & gcg & 86699 & 81754 & 5.70\% \\
|
---|
| 184 | \end{tabular}
|
---|
| 185 | \label{tab:contig2-codon}
|
---|
| 186 | \end{table}
|
---|
| 187 |
|
---|
| 188 | \section*{Hidden Markov Model}
|
---|
| 189 | The preferred library for designing and modeling a Hidden Markov Model (HMM), the General Hidden Markov Model library (GHMM)\footnote{http://ghmm.sourceforge.net/ and http://ghmm.org/}, was found to be unusable on the computers within LIACS mostly due to the absence of the Python source code.
|
---|
| 190 | As students have no root access, installing the Python source code proved impossible.
|
---|
| 191 | Therefore we choose to use a MATLAB toolkit for Hidden Markov Models. \\
|
---|
| 192 | Our first attempt was to mimic the design of the HMM referenced during the lectures \cite{ecoli}.
|
---|
| 193 | Although we used a simplified version, we were unable to recreate the expected results.
|
---|
| 194 | Some analysis of simple HMMs showed that so-called silent states (no emission or observation) are not correctly (intuitively) handled by the MATLAB toolkit.
|
---|
| 195 |
|
---|
| 196 | \subsection*{Model Topology}
|
---|
| 197 | As opposed to the presented model in \cite{ecoli} we use only four states.
|
---|
| 198 | These states, however, have very much the same biological meaning: a state for the intergenic region, a state for the gene coding regions, a state for start codons and a state for stop codons.
|
---|
| 199 | Instead of using nucleotides for observation, we use complete codons for observation. \\
|
---|
| 200 | The model is parametrized by the results from the statistics from our dataset.
|
---|
| 201 | Additional Baum-Welch training is not performed as it was found to be not improving the results.
|
---|
| 202 |
|
---|
| 203 | \subsection*{Executing the HMM}
|
---|
| 204 | A number of MATLAB scripts form the framework of our model.
|
---|
| 205 | Some MATLAB data files are also needed.
|
---|
| 206 | These files are created from the dataset and contain information about the gene coding regions and their respective orientation. \\
|
---|
| 207 | The MATLAB HMM toolkit is included in the source directory.
|
---|
| 208 | To run the HMM simply execute the script \texttt{run\_hmm} from the MATLAB command window.
|
---|
| 209 | Please note that due to the complexity and size of the data, the execution of the script may take some minutes.
|
---|
| 210 | The results are shown, after completion, in the MATLAB command window. \\
|
---|
| 211 | Files in the source directory:
|
---|
| 212 | \begin{itemize}
|
---|
| 213 | \item \texttt{c1.mat} MATLAB data file containing the nucleotides from \texttt{AEOO5174v2-1.fas}
|
---|
| 214 | \item \texttt{c2.mat} MATLAB data file containing gene nucleotides from \texttt{AEOO5174v2-2.fas}
|
---|
| 215 | \item \texttt{cton.m} MATLAB function which translates codons to nucleotides
|
---|
| 216 | \item \texttt{find\_genes.m} MATLAB function which uses the HMM toolkit to find genes in a sequence of nucleotides
|
---|
| 217 | \item \texttt{gen\_hmm.m} MATLAB function which creates the HMM from a sequence and the gene coding region information
|
---|
| 218 | \item \texttt{ntoc.m} MATLAB function which translates nucleotides to codons
|
---|
| 219 | \item \texttt{ranges1.mat} MATLAB data file containing the gene coding and orientation information for \texttt{AEOO5174v2-1.fas}
|
---|
| 220 | \item \texttt{ranges2.mat} MATLAB data file containing the gene coding and orientation information for \texttt{AEOO5174v2-2.fas}
|
---|
| 221 | \item \texttt{run\_hmm.m} MATLAB script which builds, executes and test the HMM
|
---|
| 222 | \item \texttt{test\_res.m} MATLAB function which tests the found genes against the gene coding region information
|
---|
| 223 | \end{itemize}
|
---|
| 224 |
|
---|
| 225 | \section*{Results}
|
---|
| 226 | A typical output of our HMM is as follows:
|
---|
| 227 |
|
---|
| 228 | \begin{verbatim}
|
---|
| 229 | % nucleotides in gene recognized
|
---|
| 230 | 94.3267
|
---|
| 231 |
|
---|
| 232 | % nucleotides outside gene recognized
|
---|
| 233 | 91.6520
|
---|
| 234 |
|
---|
| 235 | % genes in table found
|
---|
| 236 | 57.9845
|
---|
| 237 |
|
---|
| 238 | % of genes found not in table
|
---|
| 239 | 49.7807
|
---|
| 240 |
|
---|
| 241 | % genes in table found with error margin <= 10 codons
|
---|
| 242 | 65.3692
|
---|
| 243 |
|
---|
| 244 | % of genes found (with error margin) not in table
|
---|
| 245 | 43.3849
|
---|
| 246 |
|
---|
| 247 | % genes in table found with error margin <= 20 codons
|
---|
| 248 | 71.6875
|
---|
| 249 |
|
---|
| 250 | % of genes found (with error margin) not in table
|
---|
| 251 | 37.9127
|
---|
| 252 | \end{verbatim}
|
---|
| 253 | \noindent
|
---|
| 254 | As is clear from the HMM output, our HMM correctly classifies $58 \%$ of the genes.
|
---|
| 255 | If we allow for a small error ($\le 20$ codons), the HMM can correctly classify $72 \%$ of the genes. \\
|
---|
| 256 | Although this does not sound as impressive as the results in \cite{ecoli}, if we look at the nucleotide level, we see that $94 \%$ of the individual nucleotides are correctly classified as gene coding.
|
---|
| 257 |
|
---|
| 258 | \bibliographystyle{plain}
|
---|
| 259 | \bibliography{report}
|
---|
| 260 |
|
---|
| 261 | \end{document}
|
---|