\documentclass[a4paper,12pt]{article}
\usepackage{hyperref}
\usepackage{a4wide}
%\usepackage{indentfirst}
\usepackage[english]{babel}
\usepackage{graphics}
%\usepackage[pdftex]{graphicx}
\usepackage{latexsym}
\usepackage{fancyvrb}
\usepackage{fancyhdr}
\pagestyle{fancyplain}
\newcommand{\tstamp}{\today}
\newcommand{\id}{$ $Id: report.tex 166 2007-05-14 08:08:58Z rick $ $}
\lfoot[\fancyplain{\tstamp}{\tstamp}]{\fancyplain{\tstamp}{\tstamp}}
\cfoot[\fancyplain{\id}{\id}]{\fancyplain{\id}{\id}}
\rfoot[\fancyplain{\thepage}{\thepage}]{\fancyplain{\thepage}{\thepage}}

\title{Challenges in Computer Science \\
  {\large Assignment 3 -- accesslog}}
\author{Rick van der Zwet\\
  \texttt{}\\
  \\
  LIACS\\
  Leiden Universiteit\\
  Niels Bohrweg 1\\
  2333 CA Leiden\\
  Nederland}
\date{\today}

\begin{document}
\maketitle

\section{Introduction}
\label{sec:introduction}
The assignment is the following:
\begin{quote}
Analyse a web server access log (using Perl) and find something
`interesting'. Write a three-page article about your findings.
\end{quote}

\section{Problem}
Direct relations are not visible inside the web server access log, so the
data needs to be processed in order to find useful `connections'.
Moreover, not all data is available at all times.

\section{Theory}
\subsection{Apache httpd access log}
We are processing an access log of the Apache httpd
server\footnote{\url{http://httpd.apache.org/}}, which comes in a number
of predefined
formats\footnote{\url{http://httpd.apache.org/docs/1.3/logs.html}}. The
format we will analyze is the \textsl{combined log format}, which contains
the most information of all the predefined formats. An access log line is
formatted as below.
\begin{verbatim}
66.196.90.99 - - [01/Jun/2004:04:04:06 +0200] "GET
/~kmakhija/daily/16thJun HTTP/1.0" 304 - "-" "Mozilla/5.0
(compatible; Yahoo! Slurp;
http://help.yahoo.com/help/us/ysearch/slurp)"
\end{verbatim}

Detailed
explanation\footnote{\url{http://httpd.apache.org/docs/1.3/logs.html#combined}}:
\begin{description}
\item[66.196.90.99] This is the IP address of the client (remote host)
  which made the request to the server. If a proxy server exists between
  the user and the server, this address will be the address of the proxy,
  rather than that of the originating machine.
\item[-] The ``hyphen'' in the output indicates that the requested piece
  of information is not available. In this case, the information that is
  not available is the RFC 1413 identity of the client, determined by
  identd on the client's machine. This information is highly unreliable
  and should almost never be used except on tightly controlled internal
  networks.
\item[-] This is the userid of the person requesting the document, as
  determined by HTTP authentication. The same value is typically provided
  to CGI scripts in the REMOTE\_USER environment variable. If the status
  code for the request (see below) is 401, then this value should not be
  trusted because the user is not yet authenticated. If the document is
  not password protected, this entry will be ``-'', just like the
  previous one.
\item[01/Jun/2004:04:04:06 +0200] The time that the server finished
  processing the request. The format is:
\begin{verbatim}
[day/month/year:hour:minute:second zone]
day    = 2*digit
month  = 3*letter
year   = 4*digit
hour   = 2*digit
minute = 2*digit
second = 2*digit
zone   = (`+' | `-') 4*digit
\end{verbatim}
\item["GET /\~{}kmakhija/daily/16thJun HTTP/1.0"] The request line from
  the client is given in double quotes. The request line contains a great
  deal of useful information. First, the method used by the client is
  \textsl{GET}. Second, the client requested the resource
  \textsl{/\~{}kmakhija/daily/16thJun}, and third, the client used the
  protocol \textsl{HTTP/1.0}.
\item[304] This is the status code that the server sends back to the
  client. This information is very valuable, because it reveals whether
  the request resulted in a successful response (codes beginning in 2), a
  redirection (codes beginning in 3), an error caused by the client
  (codes beginning in 4), or an error in the server (codes beginning in
  5). The full list of possible status codes can be found in the HTTP
  specification (RFC 2616, section
  10).\footnote{\url{http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html}}
\item[-] The last entry indicates the size of the object returned to the
  client, not including the response headers. If no content was returned
  to the client, this value will be ``-''.
\item["-"] The ``Referer'' (sic) HTTP request header. This gives the site
  that the client reports having been referred from. (This should be the
  page that links to or includes the page requested.)
\item["Mozilla/5.0 (compatible; Yahoo! Slurp;
  http://help.yahoo.com/help/us/ysearch/slurp)"] The User-Agent HTTP
  request header. This is the identifying information that the client
  browser reports about itself.
\end{description}
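All nine fields of such a line can be picked apart with a single regular
expression. The fragment below is a minimal stand-alone sketch of that
step; it is an illustration only, not necessarily how the appendix scripts
parse their input, and the variable names are mine.
\begin{verbatim}
#!/usr/bin/perl
# Minimal sketch: split one combined-log-format line into its nine
# fields. Illustration only; malformed lines are silently skipped.
use strict;
use warnings;

my $combined = qr{
    ^(\S+)\ (\S+)\ (\S+)      # client IP, identd user, HTTP auth user
    \ \[([^\]]+)\]            # timestamp
    \ "([^"]*)"               # request line
    \ (\d{3})\ (\S+)          # status code, response size
    \ "([^"]*)"\ "([^"]*)"$   # referer, user-agent
}x;

while (my $line = <>) {
    chomp $line;
    my ($host, $ident, $user, $time, $request,
        $status, $bytes, $referer, $agent) = $line =~ $combined
        or next;
    $bytes = 0 if $bytes eq '-';   # '-' means no body was returned
    print "$host requested '$request' -> $status ($bytes bytes)\n";
}
\end{verbatim}
Run on the test file, this prints one summary line per well-formed
request; each of the experiments below needs only a few of these nine
fields.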
\subsection{Relations}
Looking at the information available, there is a huge number of things
that could be checked (or cross-matched). It would be far too difficult
to have a program find interconnections by itself, so we will define a
few relations ourselves and check them using a program.

\section{Implementation}
Perl is the easiest way to accomplish this goal, since it is built around
plain-text processing in the first place.

\section{Experiments}
The results of the experiments below are the outputs produced for the
test input file \textsl{/scratch/wwwlog/www.access\_log.8}.

\subsection{Which ratio/number of pages is requested but does not exist
(anymore)?}
Check the status code of every request; a 404 marks the requested URL as
unknown. Perl code in Appendix~\ref{exists.pl}.
\begin{verbatim}
404/total hits: 63774/987696 (6%)
Different 404 url's: 7590 (11%)
\end{verbatim}

\subsection{What is the human/robot ratio?}
Look at the User-Agent and mark it as a robot if the string contains
\textsl{bot}, \textsl{spider}, \textsl{slurp}, \textsl{search},
\textsl{crawler}, \textsl{checker}, \textsl{downloader} or
\textsl{worm}; see the sketch below. Perl code in
Appendix~\ref{robot.pl}.
\begin{verbatim}
robot/others user-agent: 186/7711 (2%)
\end{verbatim}
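The test itself is small enough to sketch inline. Below is a minimal,
hypothetical version using the keyword list above; the sub name
\textsl{is\_robot} is my own, and robot.pl in the appendix holds the
actual code.
\begin{verbatim}
#!/usr/bin/perl
# Hypothetical sketch of the robot test described above;
# robot.pl in the appendix is the authoritative version.
use strict;
use warnings;

my @keywords = qw(bot spider slurp search crawler checker downloader worm);
my $robot_re = join '|', @keywords;

sub is_robot {
    my ($agent) = @_;
    return $agent =~ /$robot_re/i;   # case-insensitive substring match
}

print is_robot($_) ? "robot: $_\n" : "human: $_\n"
    for "Mozilla/5.0 (compatible; Yahoo! Slurp)",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)";
\end{verbatim}
For example, the \textsl{Yahoo! Slurp} User-Agent from the example log
line in the Theory section matches on \textsl{slurp} and is therefore
counted as a robot.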
\subsection{Which documents generated the most bandwidth?}
Count the number of hits on each page and multiply it by the size of the
page. Perl code in Appendix~\ref{bandwidth.pl}.
\begin{verbatim}
Total Bandwidth (bytes): 2504223027
top 10 bandwidth
 1: /~dlf/reisfotos04/foto's.zip [143839972 (5%)]
 2: /%7Ehvdspek/cgi-bin/fetchlog.cgi [103325194 (4%)]
 3: /~moosten/quackknoppen.zip [19990886 (0%)]
 4: /~swolff/Maradonna.mpg [19955712 (0%)]
 5: /~phaazebr/weblog [15061021 (0%)]
 6: /home/dlf/klr/download/inaug2004.ppt [10070836 (0%)]
 7: /~dlf/klr/download/inaug2004.ppt [10064996 (0%)]
 8: /~sgroot/londen-small.wmv [9017829 (0%)]
 9: /~eras045/serious/final_report.ps [8845382 (0%)]
10: /~erwin/SR2002/SpeechRecognition2002/Student%20Projects/
    Recognition_Algorithms_II/timing%20RES.xls [8744448 (0%)]
\end{verbatim}

\subsection{Will a certain IP use multiple user-agents?}
Check whether multiple User-Agents show up at a single IP address, which
hints at a NAT gateway or proxy server. Perl code in
Appendix~\ref{nat-proxy.pl}.
\begin{verbatim}
proxy/others hosts: 5086/71214 (7%)
\end{verbatim}

\subsection{Which IP ranges access the web server?}
Collect every IP address and try to group the addresses into ranges. We
will ignore hostnames, because their IP addresses might have changed a
few times already. This needs some more logic, such as knowledge of the
IP subnets, so we will skip this one.

\section{Conclusion}
Simple relations like basic statistics are easy to find, but the more
sophisticated ones have to be thought out and designed by hand. Finding
good relations takes a lot of time and is very hard to automate. Perl is
a quick way to process small amounts of data; when processing more data,
I recommend writing a small (wrapper) binary program to (pre)process the
data.

\section*{Appendix}
\subsection{common.pl}
\label{common.pl}
\VerbatimInput{common.pl}
\newpage
\subsection{robot.pl}
\label{robot.pl}
\VerbatimInput{robot.pl}
\newpage
\subsection{bandwidth.pl}
\label{bandwidth.pl}
\VerbatimInput{bandwidth.pl}
\newpage
\subsection{exists.pl}
\label{exists.pl}
\VerbatimInput{exists.pl}
\newpage
\subsection{nat-proxy.pl}
\label{nat-proxy.pl}
\VerbatimInput{nat-proxy.pl}

\end{document}