\documentclass[a4paper,12pt]{article}
\usepackage{hyperref}
\usepackage{a4wide}
%\usepackage{indentfirst}
\usepackage[english]{babel}
\usepackage{graphics}
%\usepackage[pdftex]{graphicx}
\usepackage{latexsym}
\usepackage{fancyvrb}
\usepackage{fancyhdr}
\pagestyle{fancyplain}
\newcommand{\tstamp}{\today}
\newcommand{\id}{$ $Id: report.tex 166 2007-05-14 08:08:58Z rick $ $}
\lfoot[\fancyplain{\tstamp}{\tstamp}]{\fancyplain{\tstamp}{\tstamp}}
\cfoot[\fancyplain{\id}{\id}]{\fancyplain{\id}{\id}}
\rfoot[\fancyplain{\thepage}{\thepage}]{\fancyplain{\thepage}{\thepage}}

\title{Challenges in Computer Science \\
  {\large Assignment 3 -- accesslog}}
\author{Rick van der Zwet\\
  \texttt{}\\
  \\
  LIACS\\
  Leiden Universiteit\\
  Niels Bohrweg 1\\
  2333 CA Leiden\\
  Nederland}
\date{\today}

\begin{document}
\maketitle

\section{Introduction}
\label{sec:introduction}
The assignment is the following:
\begin{quote}
Analyse a web server access log (using Perl) and find something
`interesting'. Write a three-page article about your findings.
\end{quote}

\section{Problem}
Direct relations are not visible inside the web server access log, so the
data needs to be processed in order to find useful `connections'.
Moreover, not all data is available at all times.

\section{Theory}
\subsection{Apache httpd access log}
We are processing an access log of the Apache httpd
server\footnote{\url{http://httpd.apache.org/}}, which comes in a number
of predefined
formats\footnote{\url{http://httpd.apache.org/docs/1.3/logs.html}}. The
format we will analyze is the \textsl{combined log format}, which contains
the most information of all the predefined formats. An access log line is
formatted as below.
\begin{verbatim}
66.196.90.99 - - [01/Jun/2004:04:04:06 +0200] "GET
/~kmakhija/daily/16thJun HTTP/1.0" 304 - "-" "Mozilla/5.0
(compatible; Yahoo! Slurp;
http://help.yahoo.com/help/us/ysearch/slurp)"
\end{verbatim}

Detailed
explanation\footnote{\url{http://httpd.apache.org/docs/1.3/logs.html#combined}}:
\begin{description}
\item[66.196.90.99] This is the IP address of the client (remote host)
  which made the request to the server. If a proxy server exists between
  the user and the server, this address will be the address of the proxy,
  rather than that of the originating machine.
\item[-] The ``hyphen'' in the output indicates that the requested piece
  of information is not available. In this case, the information that is
  not available is the RFC 1413 identity of the client, determined by
  identd on the client's machine. This information is highly unreliable
  and should almost never be used except on tightly controlled internal
  networks.
\item[-] This is the userid of the person requesting the document, as
  determined by HTTP authentication. The same value is typically provided
  to CGI scripts in the REMOTE\_USER environment variable. If the status
  code for the request (see below) is 401, then this value should not be
  trusted because the user is not yet authenticated. If the document is
  not password protected, this entry will be ``-'', just like the
  previous one.
\item[01/Jun/2004:04:04:06 +0200] The time that the server finished
  processing the request. The format is:
\begin{verbatim}
[day/month/year:hour:minute:second zone]
day    = 2*digit
month  = 3*letter
year   = 4*digit
hour   = 2*digit
minute = 2*digit
second = 2*digit
zone   = (`+' | `-') 4*digit
\end{verbatim}
\item["GET /\~{}kmakhija/daily/16thJun HTTP/1.0"] The request line from
  the client is given in double quotes. The request line contains a great
  deal of useful information. First, the method used by the client is
  \textsl{GET}. Second, the client requested the resource
  \textsl{/\~{}kmakhija/daily/16thJun}, and third, the client used the
  protocol \textsl{HTTP/1.0}.
\item[304] This is the status code that the server sends back to the
  client. This information is very valuable, because it reveals whether
  the request resulted in a successful response (codes beginning in 2), a
  redirection (codes beginning in 3), an error caused by the client
  (codes beginning in 4), or an error in the server (codes beginning in
  5). The full list of possible status codes can be found in the HTTP
  specification (RFC 2616, section
  10).\footnote{\url{http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html}}
\item[-] The last entry indicates the size of the object returned to the
  client, not including the response headers. If no content was returned
  to the client, this value will be ``-''.
\item["-"] The ``Referer'' (sic) HTTP request header. This gives the site
  that the client reports having been referred from. (This should be the
  page that links to or includes the page requested.)
\item["Mozilla/5.0 (compatible; Yahoo! Slurp;
  http://help.yahoo.com/help/us/ysearch/slurp)"] The User-Agent HTTP
  request header. This is the identifying information that the client
  browser reports about itself.
\end{description}
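All nine fields of such a line can be picked apart with a single regular
expression. The fragment below is a minimal stand-alone sketch of that
step; it is an illustration only, not necessarily how the appendix scripts
parse their input, and the variable names are mine.
\begin{verbatim}
#!/usr/bin/perl
# Minimal sketch: split one combined-log-format line into its nine
# fields. Illustration only; malformed lines are silently skipped.
use strict;
use warnings;

my $combined = qr{
    ^(\S+)\ (\S+)\ (\S+)      # client IP, identd user, HTTP auth user
    \ \[([^\]]+)\]            # timestamp
    \ "([^"]*)"               # request line
    \ (\d{3})\ (\S+)          # status code, response size
    \ "([^"]*)"\ "([^"]*)"$   # referer, user-agent
}x;

while (my $line = <>) {
    chomp $line;
    my ($host, $ident, $user, $time, $request,
        $status, $bytes, $referer, $agent) = $line =~ $combined
        or next;
    $bytes = 0 if $bytes eq '-';   # '-' means no body was returned
    print "$host requested '$request' -> $status ($bytes bytes)\n";
}
\end{verbatim}
Run on the test file, this prints one summary line per well-formed
request; each of the experiments below needs only a few of these nine
fields.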
\subsection{Relations}
Looking at the information available, there is a huge number of things
that could be checked (or cross-matched). It would be far too difficult
to have a program find interconnections by itself, so we will define a
few relations ourselves and check them using a program.

\section{Implementation}
Perl is the easiest way to accomplish this goal, since it is built around
plain-text processing in the first place.

\section{Experiments}
The results of the experiments below are the outputs produced for the
test input file \textsl{/scratch/wwwlog/www.access\_log.8}.

\subsection{Which ratio/number of pages is requested but does not exist
(anymore)?}
Check the status code of every request; a 404 marks the requested URL as
unknown. Perl code in Appendix~\ref{exists.pl}.
\begin{verbatim}
404/total hits: 63774/987696 (6%)
Different 404 url's: 7590 (11%)
\end{verbatim}

\subsection{What is the human/robot ratio?}
Look at the User-Agent and mark it as a robot if the string contains
\textsl{bot}, \textsl{spider}, \textsl{slurp}, \textsl{search},
\textsl{crawler}, \textsl{checker}, \textsl{downloader} or
\textsl{worm}; see the sketch below. Perl code in
Appendix~\ref{robot.pl}.
\begin{verbatim}
robot/others user-agent: 186/7711 (2%)
\end{verbatim}
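The test itself is small enough to sketch inline. Below is a minimal,
hypothetical version using the keyword list above; the sub name
\textsl{is\_robot} is my own, and robot.pl in the appendix holds the
actual code.
\begin{verbatim}
#!/usr/bin/perl
# Hypothetical sketch of the robot test described above;
# robot.pl in the appendix is the authoritative version.
use strict;
use warnings;

my @keywords = qw(bot spider slurp search crawler checker downloader worm);
my $robot_re = join '|', @keywords;

sub is_robot {
    my ($agent) = @_;
    return $agent =~ /$robot_re/i;   # case-insensitive substring match
}

print is_robot($_) ? "robot: $_\n" : "human: $_\n"
    for "Mozilla/5.0 (compatible; Yahoo! Slurp)",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)";
\end{verbatim}
For example, the \textsl{Yahoo! Slurp} User-Agent from the example log
line in the Theory section matches on \textsl{slurp} and is therefore
counted as a robot.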
\subsection{Which documents generated the most bandwidth?}
Count the number of hits on each page and multiply it by the size of the
page. Perl code in Appendix~\ref{bandwidth.pl}.
\begin{verbatim}
Total Bandwidth (bytes): 2504223027
top 10 bandwidth
 1: /~dlf/reisfotos04/foto's.zip [143839972 (5%)]
 2: /%7Ehvdspek/cgi-bin/fetchlog.cgi [103325194 (4%)]
 3: /~moosten/quackknoppen.zip [19990886 (0%)]
 4: /~swolff/Maradonna.mpg [19955712 (0%)]
 5: /~phaazebr/weblog [15061021 (0%)]
 6: /home/dlf/klr/download/inaug2004.ppt [10070836 (0%)]
 7: /~dlf/klr/download/inaug2004.ppt [10064996 (0%)]
 8: /~sgroot/londen-small.wmv [9017829 (0%)]
 9: /~eras045/serious/final_report.ps [8845382 (0%)]
10: /~erwin/SR2002/SpeechRecognition2002/Student%20Projects/
    Recognition_Algorithms_II/timing%20RES.xls [8744448 (0%)]
\end{verbatim}

\subsection{Will a certain IP use multiple user-agents?}
Check whether multiple User-Agents show up at a single IP address, which
hints at a NAT gateway or proxy server. Perl code in
Appendix~\ref{nat-proxy.pl}.
\begin{verbatim}
proxy/others hosts: 5086/71214 (7%)
\end{verbatim}

\subsection{Which IP ranges access the web server?}
Collect every IP address and try to group the addresses into ranges. We
will ignore hostnames, because their IP addresses might have changed a
few times already. This needs some more logic, such as knowledge of the
IP subnets, so we will skip this one.

\section{Conclusion}
Simple relations like basic statistics are easy to find, but the more
sophisticated ones have to be thought out and designed by hand. Finding
good relations takes a lot of time and is very hard to automate. Perl is
a quick way to process small amounts of data; when processing more data,
I recommend writing a small (wrapper) binary program to (pre)process the
data.

\section*{Appendix}
\subsection{common.pl}
\label{common.pl}
\VerbatimInput{common.pl}
\newpage
\subsection{robot.pl}
\label{robot.pl}
\VerbatimInput{robot.pl}
\newpage
\subsection{bandwidth.pl}
\label{bandwidth.pl}
\VerbatimInput{bandwidth.pl}
\newpage
\subsection{exists.pl}
\label{exists.pl}
\VerbatimInput{exists.pl}
\newpage
\subsection{nat-proxy.pl}
\label{nat-proxy.pl}
\VerbatimInput{nat-proxy.pl}

\end{document}