\documentclass{llncs}
\usepackage{multirow}

\title{Databases and Data Mining --- Assignment 4}

\author{Jonathan K. Vis, Hans Wortel and Rick van der Zwet}

\institute{Leiden Institute of Advanced Computer Science (LIACS),
Niels~Bohrweg~1,
2333~CA~Leiden,
The~Netherlands
}

\begin{document}
\maketitle

\section*{Introduction}
Finding genes in DNA can be extremely difficult, due to the large data-sets, not
clear distinction between useful and useless information and other factors.

Hidden Markov Models (\emph{HMMs}) introduced in the late 1960s and early
1970s and nicely explained by L.A.~Rabiner in his field of speech recognition
\cite{hmm-speech}, could also be used in the field of gene recognition as shown
by A.~Krogh~\cite{ecoli}. Very briefly a \emph{HMM} has one basic guideline.
You are able to make predictions about the model (using statistics) and you are
able to view the output of the model. You are how-ever not able to
understand/see the inner workings of the model.

As the \emph{E. coli} genome was not fully known by that time, this report will
focus in making a \emph{HMM} based on the complete genome sequence of the
\emph{Escherichia coli} O157:H7 strain EDL933, sequenced by Madison et
al.~\footnote{http://www.genome.wisc.edu/sequencing/o157.htm}.

The data-set we used contig 1 (\emph{AE005174v2-1.fas}) and contig 2
(\emph{AE005174v2-2.fas}) both taken from the research website. They are
presented in the \emph{FASTA}
format~\footnote{http://en.wikipedia.org/wiki/FASTA\_format}.  This format is
text-based format representing nucleotide sequences (in our case) in which base
pairs are represented using single-letter codes.  The \emph{FASTA} format can
easy be parsed, although a bunch a specific rules has to be
followed\footnote{https://proteomecommons.org/tranche/examples/proteomecommons-fasta/fasta.jsp}

The exact locations of the genes are to be found in the \emph{EDL933} table
taken again from the research website.

\subsection*{Statistics}

When sequencing genes, codons (i.e. nucleotide triplets) are extremely
important as specific combinations are specific markers within the
\emph{E.~coli} genome. The start location influences the codons you will find.
Take for example the sequence 'AGCTA', the 'C' in this case could appear in 3
codons, giving the start location 0,1,2 this will make 'AGC','GCT','CTA'. As
genes could be read both ways the right-to-left codons 'ATC','TCG','CGA' are
also valid. Those 6 combinations are the so called \emph{reading frames}.

% result.txt properly does not cover this
% Before processing the contig files the overlap of 527 nucleotides between the
% contigs (end of contig 2 had overlap with begin contig 1) was removed. There is
% also a gap of 4000 nucleotides introduced between contig 1 and contig 2.

Statistics of individual nucleotides is to be found for contig 1 in
table~\ref{tab:in-contig1} and contig 2 is shown in table~\ref{tab:in-contig2}.
The unknown values has been removed from the results (1726 vs 4229). This shows
that all nucleotides are relatively well spread throughout the genome.

\begin{table}[ht]
\caption{Frequency of individual nucleotides in contig 1}
\centering
\begin{tabular}{c | r | r | r | r}
Nucleotide & \multicolumn{2}{|c|}{All} & \multicolumn{2}{|c}{In Gene} \\
\hline
a & 420765 & 24.7\% & 365055 & 24.2\% \\
c & 413594 & 24.3\% & 375977 & 24.9\% \\
t & 421486 & 24.7\% & 363547 & 24.1\% \\
g & 447937 & 26.3\% & 406047 & 26.9\% \\
\hline \hline
total & 1703782 & 100\% & 1510626  & 100\% \\
\end{tabular}
\label{tab:in-contig1}
\end{table}

\begin{table}[ht]
\caption{Frequency of individual nucleotides in contig 2}
\centering
\begin{tabular}{c | r | r | r | r}
Nucleotide & \multicolumn{2}{|c|}{All} & \multicolumn{2}{|c}{In Gene} \\
\hline
a & 932625 & 24.9\% & 818308 & 24.3\% \\
c & 963480 & 25.7\% & 881689 & 26.2\% \\
t & 928752 & 24.8\% & 816242 & 24.3\% \\
g & 925343 & 24.7\% & 848201 & 25.2\% \\
\hline \hline
total & 3750200 & 100\% & 3364440 & 100\% \\
\end{tabular}
\label{tab:in-contig2}
\end{table}

The codons (sum of reading frames) for contig 1 shown
in~\ref{tab:contig1-codon}, contig 2 is listed in
table~\ref{tab:contig2-codon}. This tables shows a significance between codons
and there percentages to be part of a gene, although the results has to be
interpreted with care as the reading frames are not clear.

\begin{table}[ht]
\caption{Frequency of codons in contig 1}
\centering
\begin{tabular}{c | r | r | r || c | r | r | r}
Codon & All & In Gene & NOT In Gene & Codon & All & In Gene & NOT In Gene \\
\hline
taa & 25300 & 19666 & 22.27\% & gaa & 32713 & 29042 & 11.22\% \\
tta & 25556 & 20255 & 20.74\% & gat & 32453 & 28832 & 11.16\% \\
ata & 23973 & 19088 & 20.38\% & aac & 29250 & 25997 & 11.12\% \\
ttt & 39653 & 31687 & 20.09\% & ccc & 15693 & 13970 & 10.98\% \\
tat & 25469 & 20400 & 19.90\% & gga & 23690 & 21099 & 10.94\% \\
aat & 30468 & 24584 & 19.31\% & atc & 29619 & 26383 & 10.93\% \\
att & 31292 & 25263 & 19.27\% & ggg & 21610 & 19311 & 10.64\% \\
tag & 9388 & 7600 & 19.05\% & tcc & 20000 & 17899 & 10.51\% \\
cta & 9244 & 7555 & 18.27\% & gtg & 27553 & 24953 & 9.44\% \\
aaa & 40987 & 33719 & 17.73\% & gtc & 20059 & 18172 & 9.41\% \\
tgt & 23583 & 19746 & 16.27\% & tgc & 34588 & 31411 & 9.19\% \\
aca & 21443 & 18293 & 14.69\% & gca & 34290 & 31155 & 9.14\% \\
gta & 20357 & 17398 & 14.54\% & cac & 22072 & 20068 & 9.08\% \\
cat & 26817 & 22955 & 14.40\% & ctg & 40230 & 36729 & 8.70\% \\
ctt & 21268 & 18296 & 13.97\% & gct & 29419 & 26911 & 8.53\% \\
tct & 20394 & 17572 & 13.84\% & ggt & 29209 & 26744 & 8.44\% \\
agt & 18920 & 16305 & 13.82\% & gac & 20877 & 19149 & 8.28\% \\
aag & 23852 & 20561 & 13.80\% & cgt & 26382 & 24199 & 8.27\% \\
ttg & 28105 & 24301 & 13.53\% & agc & 28682 & 26395 & 7.97\% \\
agg & 20621 & 17862 & 13.38\% & cag & 38817 & 35751 & 7.90\% \\
gag & 17538 & 15213 & 13.26\% & acc & 25982 & 23959 & 7.79\% \\
aga & 21373 & 18557 & 13.18\% & acg & 25634 & 23670 & 7.66\% \\
tca & 29748 & 25935 & 12.82\% & tcg & 23445 & 21676 & 7.55\% \\
cct & 17669 & 15407 & 12.80\% & tgg & 34254 & 31742 & 7.33\% \\
ctc & 14849 & 12974 & 12.63\% & cga & 24724 & 22955 & 7.15\% \\
act & 18106 & 15820 & 12.63\% & ccg & 31610 & 29350 & 7.15\% \\
gtt & 30111 & 26347 & 12.50\% & gcc & 31092 & 28891 & 7.08\% \\
atg & 30330 & 26547 & 12.47\% & cca & 27790 & 25841 & 7.01\% \\
tac & 18973 & 16638 & 12.31\% & ggc & 35638 & 33190 & 6.87\% \\
tga & 33813 & 29655 & 12.30\% & cgg & 33684 & 31425 & 6.71\% \\
caa & 25552 & 22440 & 12.18\% & cgc & 36963 & 34633 & 6.30\% \\
ttc & 29021 & 25609 & 11.76\% & gcg & 41053 & 38507 & 6.20\% \\
\end{tabular}
\label{tab:contig1-codon}
\end{table}

\begin{table}[ht]
\caption{Frequency of condons in contig 2}
\centering
\begin{tabular}{c | r | r | r || c | r | r | r}
Codon & All & In Gene & NOT In Gene & Codon & All & In Gene & NOT In Gene \\
\hline
taa & 57022 & 45603 & 20.03\% & tcc & 48188 & 43145 & 10.47\% \\
tta & 56799 & 45893 & 19.20\% & gaa & 66528 & 59609 & 10.40\% \\
ata & 53771 & 43900 & 18.36\% & ttc & 69980 & 62761 & 10.32\% \\
tat & 52554 & 43009 & 18.16\% & ggg & 35982 & 32295 & 10.25\% \\
aaa & 90017 & 74087 & 17.70\% & ccc & 42293 & 37976 & 10.21\% \\
att & 68567 & 56580 & 17.48\% & atc & 71291 & 64106 & 10.08\% \\
aat & 69249 & 57488 & 16.98\% & gga & 44736 & 40249 & 10.03\% \\
cta & 21962 & 18280 & 16.77\% & gat & 67808 & 61030 & 10.00\% \\
ttt & 91643 & 76490 & 16.53\% & cac & 56551 & 51543 & 8.86\% \\
tag & 21710 & 18284 & 15.78\% & gca & 76986 & 70469 & 8.47\% \\
aca & 49541 & 42363 & 14.49\% & gtg & 50596 & 46385 & 8.32\% \\
tgt & 46989 & 40725 & 13.33\% & gac & 43889 & 40286 & 8.21\% \\
cat & 64442 & 55969 & 13.15\% & gtc & 44001 & 40397 & 8.19\% \\
cct & 42669 & 37175 & 12.88\% & tgc & 75500 & 69609 & 7.80\% \\
aag & 50103 & 43705 & 12.77\% & cgt & 57354 & 53003 & 7.59\% \\
ctc & 36551 & 31915 & 12.68\% & acg & 58525 & 54099 & 7.56\% \\
ctt & 52621 & 45971 & 12.64\% & ctg & 82451 & 76286 & 7.48\% \\
agg & 39391 & 34455 & 12.53\% & acc & 61306 & 56745 & 7.44\% \\
tct & 46376 & 40586 & 12.48\% & agc & 65049 & 60249 & 7.38\% \\
aga & 45930 & 40346 & 12.16\% & gct & 63263 & 58638 & 7.31\% \\
act & 41282 & 36269 & 12.14\% & cag & 85371 & 79148 & 7.29\% \\
gag & 33303 & 29291 & 12.05\% & ggt & 57375 & 53332 & 7.05\% \\
tca & 70886 & 62481 & 11.86\% & cga & 55011 & 51352 & 6.65\% \\
agt & 40095 & 35356 & 11.82\% & cca & 71910 & 67148 & 6.62\% \\
gta & 41980 & 37101 & 11.62\% & ggc & 70465 & 65829 & 6.58\% \\
tga & 65912 & 58336 & 11.49\% & ccg & 70506 & 65879 & 6.56\% \\
tac & 43233 & 38278 & 11.46\% & tcg & 56497 & 52857 & 6.44\% \\
caa & 62869 & 55722 & 11.37\% & cgg & 68302 & 63908 & 6.43\% \\
ttg & 59753 & 53161 & 11.03\% & gcc & 75588 & 70783 & 6.36\% \\
atg & 60417 & 53782 & 10.98\% & tgg & 64884 & 61044 & 5.92\% \\
aac & 67047 & 59741 & 10.90\% & cgc & 91544 & 86312 & 5.72\% \\
gtt & 65266 & 58406 & 10.51\% & gcg & 86699 & 81754 & 5.70\% \\
\end{tabular}
\label{tab:contig2-codon}
\end{table}

\section*{Hidden Markov Model}
The preferred library for designing and modeling a Hidden Markov Model (HMM), the General Hidden Markov Model library (GHMM)\footnote{http://ghmm.sourceforge.net/ and http://ghmm.org/}, was found to be unusable on the computers within LIACS mostly due to the absence of the Python source code.
As students have no root access, installing the Python source code proved impossible.
Therefore we choose to use a MATLAB toolkit for Hidden Markov Models. \\
Our first attempt was to mimic the design of the HMM referenced during the lectures \cite{ecoli}.
Although we used a simplified version, we were unable to recreate the expected results.
Some analysis of simple HMMs showed that so-called silent states (no emission or observation) are not correctly (intuitively) handled by the MATLAB toolkit.

\subsection*{Model Topology}
As opposed to the presented model in \cite{ecoli} we use only four states.
These states, however, have very much the same biological meaning: a state for the intergenic region, a state for the gene coding regions, a state for start codons and a state for stop codons.
Instead of using nucleotides for observation, we use complete codons for observation. \\
The model is parametrized by the results from the statistics from our dataset.
Additional Baum-Welch training is not performed as it was found to be not improving the results.

\subsection*{Executing the HMM}
A number of MATLAB scripts form the framework of our model.
Some MATLAB data files are also needed.
These files are created from the dataset and contain information about the gene coding regions and their respective orientation. \\
The MATLAB HMM toolkit is included in the source directory.
To run the HMM simply execute the script \texttt{run\_hmm} from the MATLAB command window.
Please note that due to the complexity and size of the data, the execution of the script may take some minutes.
The results are shown, after completion, in the MATLAB command window. \\
Files in the source directory:
\begin{itemize}
\item \texttt{c1.mat} MATLAB data file containing the nucleotides from \texttt{AEOO5174v2-1.fas}
\item \texttt{c2.mat} MATLAB data file containing gene nucleotides from \texttt{AEOO5174v2-2.fas}
\item \texttt{cton.m} MATLAB function which translates codons to nucleotides
\item \texttt{find\_genes.m} MATLAB function which uses the HMM toolkit to find genes in a sequence of nucleotides
\item \texttt{gen\_hmm.m} MATLAB function which creates the HMM from a sequence and the gene coding region information
\item \texttt{ntoc.m} MATLAB function which translates nucleotides to codons
\item \texttt{ranges1.mat} MATLAB data file containing the gene coding and orientation information for \texttt{AEOO5174v2-1.fas}
\item \texttt{ranges2.mat} MATLAB data file containing the gene coding and orientation information for \texttt{AEOO5174v2-2.fas}
\item \texttt{run\_hmm.m} MATLAB script which builds, executes and test the HMM
\item \texttt{test\_res.m} MATLAB function which tests the found genes against the gene coding region information
\end{itemize}

\section*{Results}
A typical output of our HMM is as follows:

\begin{verbatim}
% nucleotides in gene recognized
   94.3267

% nucleotides outside gene recognized
   91.6520

% genes in table found
   57.9845

% of genes found not in table
   49.7807

% genes in table found with error margin <= 10 codons
   65.3692

% of genes found (with error margin) not in table
   43.3849

% genes in table found with error margin <= 20 codons
   71.6875

% of genes found (with error margin) not in table
   37.9127
\end{verbatim}
\noindent
As is clear from the HMM output, our HMM correctly classifies $58 \%$ of the genes.
If we allow for a small error ($\le 20$ codons), the HMM can correctly classify $72 \%$ of the genes. \\
Although this does not sound as impressive as the results in \cite{ecoli}, if we look at the nucleotide level, we see that $94 \%$ of the individual nucleotides are correctly classified as gene coding.

\bibliographystyle{plain}
\bibliography{report}

\end{document}