\documentclass[a4paper,12pt]{article}
\usepackage{hyperref}
\usepackage{a4wide}
%\usepackage{indentfirst}
\usepackage[english]{babel}
\usepackage{graphics}
%\usepackage[pdftex]{graphicx}
\usepackage{latexsym}
\usepackage{fancyvrb}
\usepackage{fancyhdr}

\pagestyle{fancyplain}
\newcommand{\tstamp}{\today}
\newcommand{\id}{$ $Id: report.tex 166 2007-05-14 08:08:58Z rick $ $}
\lfoot[\fancyplain{\tstamp}{\tstamp}] {\fancyplain{\tstamp}{\tstamp}}
\cfoot[\fancyplain{\id}{\id}] {\fancyplain{\id}{\id}}
\rfoot[\fancyplain{\thepage}{\thepage}] {\fancyplain{\thepage}{\thepage}}


\title{ Challenges in Computer Science \\
\large{Assignment 3 - accesslog}}
\author{Rick van der Zwet\\
\texttt{<hvdzwet@liacs.nl>}\\
\\
LIACS\\
Leiden Universiteit\\
Niels Bohrweg 1\\
2333 CA Leiden\\
Nederland}
\date{\today}
\begin{document}
\maketitle
\section{Introduction}
\label{foo}
The assignment is as follows:
\begin{quote}
Analyse a web server accesslog -using Perl- and find something
`interesting'. Write a three-page article about your findings.
\end{quote}

\section{Problem}
Relations are not directly visible in a web server accesslog; the data
has to be processed to find useful `connections'. Moreover, not every
field is available for every request.

\section{Theory}
\subsection{Apache httpd accesslog}
We are processing an accesslog of the Apache httpd
server\footnote{http://httpd.apache.org/}, which has a number of predefined
formats\footnote{http://httpd.apache.org/docs/1.3/logs.html}. The format we
analyse is the \textsl{combined log format}, which contains more
information than the other predefined formats. An accesslog line is
formatted as below.
\begin{verbatim}
- - [01/Jun/2004:04:04:06 +0200] "GET /~kmakhija/daily/16thJun HTTP/1.0"
304 - "-" "Mozilla/5.0 (compatible; Yahoo! Slurp;
http://help.yahoo.com/help/us/ysearch/slurp)"
\end{verbatim}
Detailed explanation\footnote{http://httpd.apache.org/docs/1.3/logs.html\#combined}:
\begin{description}
\item[] This is the IP address of the client (remote host) which
made the request to the server. If a proxy server exists between the user and
the server, this address will be the address of the proxy, rather than the
originating machine.
\item[-] The ``hyphen'' in the output indicates that the requested piece of
information is not available. In this case, the information that is not
available is the RFC 1413 identity of the client determined by identd on the
client's machine. This information is highly unreliable and should almost never
be used except on tightly controlled internal networks.
\item[-] This is the
userid of the person requesting the document as determined by HTTP
authentication. The same value is typically provided to CGI scripts in the
REMOTE\_USER environment variable. If the status code for the request (see
below) is 401, then this value should not be trusted because the user is not
yet authenticated. If the document is not password protected, this entry will
be ``-'' just like the previous one.
\item[01/Jun/2004:04:04:06 +0200] The time
that the server finished processing the request. The format is:
\begin{verbatim}
[day/month/year:hour:minute:second zone]
day = 2*digit
month = 3*letter
year = 4*digit
hour = 2*digit
minute = 2*digit
second = 2*digit
zone = (`+' | `-') 4*digit
\end{verbatim}
\item["GET /\textasciitilde{}kmakhija/daily/16thJun HTTP/1.0"] The request line from the
client is given in double quotes. The request line contains a great deal of
useful information. First, the method used by the client is \textsl{GET}.
Second, the client requested the resource
\textsl{/\textasciitilde{}kmakhija/daily/16thJun},
and third, the client used the protocol \textsl{HTTP/1.0}.
\item[304] This is the status code that the server sends back to the client.
This information is very valuable, because it reveals whether the request
resulted in a successful response (codes beginning in 2), a redirection (codes
beginning in 3), an error caused by the client (codes beginning in 4), or an
error in the server (codes beginning in 5). The full list of possible status
codes can be found in the HTTP specification (RFC2616 section
10).\footnote{http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html}
\item[-] This entry indicates the size of the object returned to the
client, not including the response headers. If no content was returned to the
client, this value will be ``-''.
\item["-"] The ``Referer'' (sic) HTTP request header. This gives the site that
the client reports having been referred from. (This should be the page that
links to or includes the page requested).
\item["Mozilla/5.0 (compatible; Yahoo! Slurp;
http://help.yahoo.com/help/us/ysearch/slurp)"] The User-Agent HTTP request
header. This is the identifying information that the client browser reports
about itself.
\end{description}
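As an illustration, the fields described above can be split off with a single
regular expression. The sketch below is not the code from the appendix: the
sub name \texttt{parse\_line} and the hash keys are our own choice, and the
client address \texttt{192.0.2.1} is a made-up documentation address that
stands in for the address left out of the example line.
\begin{verbatim}
#!/usr/bin/perl
use strict;
use warnings;

# Field names are our own choice, in the order they occur in the line.
my @names = qw(host ident user time request
               status bytes referer agent);

# Split one combined-log line into named fields.
# Returns a hash reference, or nothing if the line does not match.
sub parse_line {
    my ($line) = @_;
    my @f = $line =~ m{
        ^(\S+)\ (\S+)\ (\S+)\        # client, identd, userid
        \[([^\]]+)\]\                # time
        "(.*?)"\ (\d{3})\ (\d+|-)\   # request, status, size
        "(.*?)"\ "(.*?)"\s*$         # referer, user-agent
    }x or return;
    my %field;
    @field{@names} = @f;
    $field{bytes} = 0 if $field{bytes} eq '-';   # no content returned
    return \%field;
}

# Example line from the text, with an assumed client address in front.
my $line = '192.0.2.1 - - [01/Jun/2004:04:04:06 +0200] '
         . '"GET /~kmakhija/daily/16thJun HTTP/1.0" 304 - "-" '
         . '"Mozilla/5.0 (compatible; Yahoo! Slurp; '
         . 'http://help.yahoo.com/help/us/ysearch/slurp)"';
my $f = parse_line($line);
my ($method, $url, $proto) = split ' ', $f->{request};
print "$f->{host} $method $url $f->{status} $f->{bytes}\n";
# prints: 192.0.2.1 GET /~kmakhija/daily/16thJun 304 0
\end{verbatim}
Lines that do not match the pattern (for example truncated requests) are
rejected as a whole, so a caller can simply skip them.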

\subsection{References}
Looking at the information available, there is a very large number of
things that could be checked (or cross-matched). It would be far too difficult
to have a program find interconnections by itself, so we define a few
ourselves and check them with a program.

\section{Implementation}
Perl is the easiest way to accomplish this goal, because it was built for
plain-text processing in the first place.

\section{Experiments}

The results below are the output of the scripts when run on the test input
file \textsl{/scratch/wwwlog/www.access\_log.8}.

\subsection{What fraction of the requested pages does not exist
(anymore)?} For every request, check the status code returned for the URL.
A URL that returns a 404 is marked as unknown.
The Perl code is in Appendix~\ref{exists.pl}.
\begin{verbatim}
404/total hits: 63774/987696 (6%)
Different 404 url's: 7590 (11%)
\end{verbatim}
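The counting itself can be sketched in a few lines. This is an illustration
only and not the script from the appendix: the subs \texttt{count\_missing}
and \texttt{fake\_line} are hypothetical names, and the four log lines are
made up for the demonstration.
\begin{verbatim}
#!/usr/bin/perl
use strict;
use warnings;

# Count all hits, 404 hits and distinct 404 URLs in a list of lines.
sub count_missing {
    my @lines = @_;
    my ($total, $missing, %missing_url) = (0, 0);
    for my $line (@lines) {
        # quoted request line, followed by the three-digit status code
        next unless $line =~ m/"\S+ (\S+)[^"]*" (\d{3}) /;
        $total++;
        if ($2 eq '404') {
            $missing++;
            $missing_url{$1}++;
        }
    }
    return ($total, $missing, scalar keys %missing_url);
}

# Hypothetical helper that builds a log line for the demonstration.
sub fake_line {
    my ($url, $status) = @_;
    return qq{192.0.2.1 - - [01/Jun/2004:04:04:06 +0200] }
         . qq{"GET $url HTTP/1.0" $status - "-" "-"};
}

my ($total, $missing, $distinct) = count_missing(
    fake_line('/a', 200),    fake_line('/gone', 404),
    fake_line('/gone', 404), fake_line('/old', 404),
);
printf "404/total hits: %d/%d (%d%%)\n",
    $missing, $total, $total ? 100 * $missing / $total : 0;
print "Different 404 url's: $distinct\n";
# prints: 404/total hits: 3/4 (75%)
#         Different 404 url's: 2
\end{verbatim}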

\subsection{What is the human/robot ratio?}
Look at every User-Agent and mark it as a robot if the string contains one of
\textsl{bot, spider, slurp, search, crawler, checker, downloader, worm}.
The Perl code is in Appendix~\ref{robot.pl}.
\begin{verbatim}
robot/others user-agent: 186/7711 (2%)
\end{verbatim}
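A keyword test of this kind might look as follows. The sketch assumes that
the match is case-insensitive (so that \textsl{Slurp} matches \textsl{slurp});
the sub name \texttt{is\_robot} is hypothetical and the second User-Agent is
a made-up browser string.
\begin{verbatim}
#!/usr/bin/perl
use strict;
use warnings;

# Keywords from the text; matching case-insensitively is an assumption.
my $robot_re =
    qr/bot|spider|slurp|search|crawler|checker|downloader|worm/i;

# Return 1 if the User-Agent string contains one of the keywords.
sub is_robot {
    my ($agent) = @_;
    return $agent =~ $robot_re ? 1 : 0;
}

my @agents = (
    'Mozilla/5.0 (compatible; Yahoo! Slurp; '
        . 'http://help.yahoo.com/help/us/ysearch/slurp)',
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
);
my $robots = grep { is_robot($_) } @agents;
printf "robots: %d of %d user-agents\n", $robots, scalar @agents;
# prints: robots: 1 of 2 user-agents
\end{verbatim}
A plain substring test like this also flags any browser whose name happens to
contain one of the keywords, so the resulting count is an estimate.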

\subsection{Which documents generated the most bandwidth?}
For each page, multiply the number of hits by the size of the page.
The Perl code is in Appendix~\ref{bandwidth.pl}.
\begin{verbatim}
Total Bandwidth (bytes): 2504223027
top 10 bandwidth
1: /~dlf/reisfotos04/foto's.zip [143839972 (5%)]
2: /%7Ehvdspek/cgi-bin/fetchlog.cgi [103325194 (4%)]
3: /~moosten/quackknoppen.zip [19990886 (0%)]
4: /~swolff/Maradonna.mpg [19955712 (0%)]
5: /~phaazebr/weblog [15061021 (0%)]
6: /home/dlf/klr/download/inaug2004.ppt [10070836 (0%)]
7: /~dlf/klr/download/inaug2004.ppt [10064996 (0%)]
8: /~sgroot/londen-small.wmv [9017829 (0%)]
9: /~eras045/serious/final_report.ps [8845382 (0%)]
10:
/~erwin/SR2002/SpeechRecognition2002/Student%20Projects/Recognition_Algorithms_II/timing%20RES.xls
[8744448 (0%)]
\end{verbatim}
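One way to obtain such a ranking is sketched below. Note that it uses a
slightly different technique from the one described: instead of multiplying
hits by a fixed page size, it adds up the size logged for each individual
request, which gives the same result whenever every hit returns the whole
document. The subs \texttt{bandwidth\_per\_url} and \texttt{fake\_line} are
hypothetical names and the log lines are made up.
\begin{verbatim}
#!/usr/bin/perl
use strict;
use warnings;

# Sum the logged response sizes per URL; '-' (no content) counts as 0.
sub bandwidth_per_url {
    my @lines = @_;
    my %bytes;
    for my $line (@lines) {
        next unless $line =~ m/"\S+ (\S+)[^"]*" \d{3} (\d+|-) /;
        $bytes{$1} += ($2 eq '-' ? 0 : $2);
    }
    return %bytes;
}

# Hypothetical helper that builds a log line for the demonstration.
sub fake_line {
    my ($url, $bytes) = @_;
    return qq{192.0.2.1 - - [01/Jun/2004:04:04:06 +0200] }
         . qq{"GET $url HTTP/1.0" 200 $bytes "-" "-"};
}

my %bytes = bandwidth_per_url(
    fake_line('/big.zip', 5000),   fake_line('/big.zip', 5000),
    fake_line('/index.html', 200), fake_line('/index.html', '-'),
);
my $total = 0;
$total += $_ for values %bytes;
print "Total Bandwidth (bytes): $total\n";

# Rank the URLs by the number of bytes they generated.
my @top = sort { $bytes{$b} <=> $bytes{$a} } keys %bytes;
for my $i (0 .. $#top) {
    my $b = $bytes{ $top[$i] };
    printf "%d: %s [%d (%d%%)]\n",
        $i + 1, $top[$i], $b, $total ? 100 * $b / $total : 0;
}
# prints: Total Bandwidth (bytes): 10200
#         1: /big.zip [10000 (98%)]
#         2: /index.html [200 (1%)]
\end{verbatim}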

\subsection{Does a single IP address use multiple User-Agents?}
Check whether more than one User-Agent occurs for the same IP address. Such an
address is counted as a proxy (or NAT gateway).
The Perl code is in Appendix~\ref{nat-proxy.pl}.
\begin{verbatim}
proxy/others hosts: 5086/71214 (7%)
\end{verbatim}
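The bookkeeping for this check fits in a hash of hashes. The sketch below is
an illustration and not the appendix script; \texttt{agents\_per\_host} and
\texttt{fake\_line} are hypothetical names, and the addresses and User-Agent
strings are made up.
\begin{verbatim}
#!/usr/bin/perl
use strict;
use warnings;

# Map each client address to the set of User-Agents seen from it.
sub agents_per_host {
    my @lines = @_;
    my %seen;
    for my $line (@lines) {
        # first field is the client, last quoted field the User-Agent
        next unless $line =~ m/^(\S+) .* "([^"]*)"\s*$/;
        $seen{$1}{$2} = 1;
    }
    return %seen;
}

# Hypothetical helper that builds a log line for the demonstration.
sub fake_line {
    my ($host, $agent) = @_;
    return qq{$host - - [01/Jun/2004:04:04:06 +0200] }
         . qq{"GET / HTTP/1.0" 200 10 "-" "$agent"};
}

my %seen = agents_per_host(
    fake_line('192.0.2.1', 'BrowserA'),
    fake_line('192.0.2.1', 'BrowserB'),
    fake_line('192.0.2.2', 'BrowserA'),
);
my @multi = grep { scalar(keys %{ $seen{$_} }) > 1 } keys %seen;
printf "hosts with several user-agents: %d of %d\n",
    scalar @multi, scalar keys %seen;
# prints: hosts with several user-agents: 1 of 2
\end{verbatim}
Several User-Agents behind one address only hint at a proxy or NAT gateway; a
single machine with two browsers produces the same pattern.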

\subsection{Which IP ranges access the web server?}
Collect every IP address and try to group them into ranges. We ignore
hostnames, because their IP addresses may have changed several times since
the request was logged.

This needs more logic, such as knowledge of the IP subnets, so we
skip this one.
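
For completeness, a crude first step could look like the sketch below. It
assumes that every range is a fixed /24 prefix, which ignores how subnets are
really allocated, so it illustrates the idea rather than solving the problem.
The sub name \texttt{count\_by\_prefix} is hypothetical and the hosts are
made-up documentation addresses.
\begin{verbatim}
#!/usr/bin/perl
use strict;
use warnings;

# Count hosts per /24 prefix; hostnames (non-IPv4 strings) are skipped.
sub count_by_prefix {
    my @hosts = @_;
    my %count;
    for my $host (@hosts) {
        next unless
            $host =~ m/^(\d{1,3}\.\d{1,3}\.\d{1,3})\.\d{1,3}$/;
        $count{"$1.0/24"}++;
    }
    return %count;
}

my %count = count_by_prefix(
    '192.0.2.1', '192.0.2.77', '198.51.100.5', 'host.example.org');
print "$_: $count{$_}\n" for sort keys %count;
# prints: 192.0.2.0/24: 2
#         198.51.100.0/24: 1
\end{verbatim}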


\section{Conclusion}
Simple relations such as statistics are easy to find, but the more
sophisticated ones have to be thought out and designed. Finding good
relations takes a lot of time and is very hard to automate.

Perl is a quick way to process small amounts of data. When
processing more data, I recommend writing a small (wrapper) binary program to
(pre)process the data.

%\begin{thebibliography}{XX}
%
%\end{thebibliography}

\section*{Appendix}

\subsection{common.pl}
\label{common.pl}
\VerbatimInput{common.pl}
\newpage

\subsection{robot.pl}
\label{robot.pl}
\VerbatimInput{robot.pl}
\newpage

\subsection{bandwidth.pl}
\label{bandwidth.pl}
\VerbatimInput{bandwidth.pl}
\newpage

\subsection{exists.pl}
\label{exists.pl}
\VerbatimInput{exists.pl}
\newpage

\subsection{nat-proxy.pl}
\label{nat-proxy.pl}
\VerbatimInput{nat-proxy.pl}
\end{document}