\documentclass[a4paper,12pt]{article}
\usepackage{hyperref}
\usepackage{a4wide}
%\usepackage{indentfirst}
\usepackage[english]{babel}
\usepackage{graphics}
%\usepackage[pdftex]{graphicx}
\usepackage{latexsym}
\usepackage{fancyvrb}
\usepackage{fancyhdr}

\pagestyle{fancyplain}
\newcommand{\tstamp}{\today}
\newcommand{\id}{$ $Id: report.tex 166 2007-05-14 08:08:58Z rick $ $}
\lfoot[\fancyplain{\tstamp}{\tstamp}] {\fancyplain{\tstamp}{\tstamp}}
\cfoot[\fancyplain{\id}{\id}] {\fancyplain{\id}{\id}}
\rfoot[\fancyplain{\thepage}{\thepage}] {\fancyplain{\thepage}{\thepage}}


\title{ Challenges in Computer Science \\
\large{Assignment 3 - accesslog}}
\author{Rick van der Zwet\\
\texttt{<hvdzwet@liacs.nl>}\\
\\
LIACS\\
Leiden Universiteit\\
Niels Bohrweg 1\\
2333 CA Leiden\\
Nederland}
\date{\today}
\begin{document}
\maketitle
\section{Introduction}
\label{foo}
The assignment is as follows:
\begin{quote}
Analyse a web server accesslog -using Perl- and find something
`interesting'. Write a three-page article about your findings.
\end{quote}

\section{Problem}
Direct relations are not visible inside the web server accesslog, so the
data has to be processed to find useful `connections'. Not all data is
available at all times.

\section{Theory}
\subsection{Apache httpd accesslog}
We are processing an accesslog of the Apache httpd
server\footnote{http://httpd.apache.org/}, which has a number of predefined
formats\footnote{http://httpd.apache.org/docs/1.3/logs.html}. The format we
analyze is the \textsl{combined log format}, which contains the most
information of the predefined formats. An accesslog line is formatted as
below (a single line in the log, wrapped here to fit the page).
\begin{verbatim}
66.196.90.99 - - [01/Jun/2004:04:04:06 +0200] "GET
/~kmakhija/daily/16thJun HTTP/1.0" 304 - "-" "Mozilla/5.0
(compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
\end{verbatim}
A detailed explanation of each field\footnote{http://httpd.apache.org/docs/1.3/logs.html\#combined}:
\begin{description}
\item[66.196.90.99] This is the IP address of the client (remote host) which
made the request to the server. If a proxy server exists between the user and
the server, this address will be the address of the proxy, rather than the
originating machine.
\item[-] The ``hyphen'' in the output indicates that the requested piece of
information is not available. In this case, the information that is not
available is the RFC 1413 identity of the client determined by identd on the
client's machine. This information is highly unreliable and should almost never
be used except on tightly controlled internal networks.
\item[-] This is the userid of the person requesting the document as
determined by HTTP authentication. The same value is typically provided to CGI
scripts in the REMOTE\_USER environment variable. If the status code for the
request (see below) is 401, then this value should not be trusted because the
user is not yet authenticated. If the document is not password protected, this
entry will be ``-'' just like the previous one.
\item[01/Jun/2004:04:04:06 +0200] The time that the server finished
processing the request. The format is:
\begin{verbatim}
[day/month/year:hour:minute:second zone]
day = 2*digit
month = 3*letter
year = 4*digit
hour = 2*digit
minute = 2*digit
second = 2*digit
zone = (`+' | `-') 4*digit
\end{verbatim}
\item["GET /\textasciitilde{}kmakhija/daily/16thJun HTTP/1.0"] The request
line from the client is given in double quotes. The request line contains a
great deal of useful information. First, the method used by the client is
\textsl{GET}. Second, the client requested the resource
\textsl{/\textasciitilde{}kmakhija/daily/16thJun}, and third, the client used
the protocol \textsl{HTTP/1.0}.
\item[304] This is the status code that the server sends back to the client.
This information is very valuable, because it reveals whether the request
resulted in a successful response (codes beginning in 2), a redirection (codes
beginning in 3), an error caused by the client (codes beginning in 4), or an
error in the server (codes beginning in 5). The full list of possible status
codes can be found in the HTTP specification (RFC2616 section
10).\footnote{http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html}
\item[-] This entry indicates the size of the object returned to the
client, not including the response headers. If no content was returned to the
client, this value will be ``-''.
\item["-"] The ``Referer'' (sic) HTTP request header. This gives the site that
the client reports having been referred from. (This should be the page that
links to or includes the page requested.)
\item["Mozilla/5.0 (compatible; Yahoo! Slurp;
http://help.yahoo.com/help/us/ysearch/slurp)"] The User-Agent HTTP request
header. This is the identifying information that the client browser reports
about itself.
\end{description}
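
The field layout described above can be captured in a single regular
expression. The following is a minimal sketch in Python, for illustration
only; it is not the Perl code from the appendix. The names
\texttt{LINE\_RE} and \texttt{parse\_line} are hypothetical, and the pattern
assumes that neither the request line nor the headers contain escaped double
quotes.
\begin{verbatim}
import re

# host ident user [time] "request" status size "referer" "agent"
LINE_RE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # A size of "-" means no content was returned; count it as 0 bytes.
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    return fields

# The example line from this section, joined back into one line.
sample = ('66.196.90.99 - - [01/Jun/2004:04:04:06 +0200] '
          '"GET /~kmakhija/daily/16thJun HTTP/1.0" 304 - "-" '
          '"Mozilla/5.0 (compatible; Yahoo! Slurp; '
          'http://help.yahoo.com/help/us/ysearch/slurp)"')
entry = parse_line(sample)
print(entry["host"], entry["status"], entry["request"])
\end{verbatim}
Lines that do not match the pattern yield \texttt{None}, so a caller can
count or skip malformed entries instead of aborting.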

\subsection{References}
Looking at the information available, there is a very large number of
things that could be checked (or cross-matched). It would be far too difficult
to have a program find interconnections by itself, so we need to define a few
ourselves and check them (using a program).

\section{Implementation}
Perl is the easiest way to accomplish this goal, because it was designed
for plain text processing in the first place.

\section{Experiments}

All results below are the output of the scripts on the test input file
\textsl{/scratch/wwwlog/www.access\_log.8}.

\subsection{What fraction of the requested pages do not exist (anymore)?}
Check the status code of every requested URL. A request with status 404 is
counted as a page that does not exist.
Perl code at Appendix~\ref{exists.pl}.
\begin{verbatim}
404/total hits: 63774/987696 (6%)
Different 404 url's: 7590 (11%)
\end{verbatim}
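
The counting step can be sketched as follows. This is an illustrative
Python sketch, not the Perl script from the appendix. The function name
\texttt{not\_found\_stats} and the demo data are made up, and the input is
assumed to be (URL, status) pairs taken from already parsed log lines. The
second percentage in the output above is consistent with distinct 404 URLs
divided by 404 hits (7590/63774), which is what the sketch assumes.
\begin{verbatim}
def not_found_stats(requests):
    """requests: iterable of (url, status) pairs."""
    total = 0
    missing_hits = 0
    missing_urls = set()
    for url, status in requests:
        total += 1
        if status == 404:
            missing_hits += 1
            missing_urls.add(url)
    return total, missing_hits, len(missing_urls)

# Made-up example data, not taken from the real log.
demo = [("/a", 200), ("/gone", 404), ("/gone", 404), ("/b", 404)]
total, hits, urls = not_found_stats(demo)
print("404/total hits: %d/%d (%d%%)"
      % (hits, total, 100 * hits // total))
print("Different 404 URLs: %d (%d%%)"
      % (urls, 100 * urls // hits))
\end{verbatim}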

\subsection{What is the robot/human ratio?}
Classify every User-Agent string and mark it as a robot if it contains one of
\textsl{bot, spider, slurp, search, crawler, checker, downloader, worm}.
Perl code at Appendix~\ref{robot.pl}.
\begin{verbatim}
robot/others user-agent: 186/7711 (2%)
\end{verbatim}
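
A minimal Python sketch of this keyword test is given below, again for
illustration only and not the appendix script. It assumes that the match is
case-insensitive and that the ratio is taken over distinct User-Agent
strings (the output above counts user-agents, not hits). The names
\texttt{is\_robot} and \texttt{robot\_ratio} are hypothetical.
\begin{verbatim}
ROBOT_WORDS = ("bot", "spider", "slurp", "search", "crawler",
               "checker", "downloader", "worm")

def is_robot(agent):
    """True if the User-Agent contains a keyword (any case)."""
    lowered = agent.lower()
    return any(word in lowered for word in ROBOT_WORDS)

def robot_ratio(agents):
    """Count robots among the distinct User-Agent strings."""
    distinct = set(agents)
    robots = sum(1 for agent in distinct if is_robot(agent))
    return robots, len(distinct)

# One robot (from the example log line) and one made-up browser,
# the latter seen twice.
demo = [
    "Mozilla/5.0 (compatible; Yahoo! Slurp; "
    "http://help.yahoo.com/help/us/ysearch/slurp)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
]
robots, distinct = robot_ratio(demo)
print("robot/others user-agent: %d/%d" % (robots, distinct))
\end{verbatim}
A substring test like this is cheap but coarse: any browser whose
identification happens to contain one of the keywords is also counted as a
robot.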

\subsection{Which documents generated the most bandwidth?}
Count the number of hits on each page and multiply it by the size of
the page.
Perl code at Appendix~\ref{bandwidth.pl}.
\begin{verbatim}
Total Bandwidth (bytes): 2504223027
top 10 bandwidth
1: /~dlf/reisfotos04/foto's.zip [143839972 (5%)]
2: /%7Ehvdspek/cgi-bin/fetchlog.cgi [103325194 (4%)]
3: /~moosten/quackknoppen.zip [19990886 (0%)]
4: /~swolff/Maradonna.mpg [19955712 (0%)]
5: /~phaazebr/weblog [15061021 (0%)]
6: /home/dlf/klr/download/inaug2004.ppt [10070836 (0%)]
7: /~dlf/klr/download/inaug2004.ppt [10064996 (0%)]
8: /~sgroot/londen-small.wmv [9017829 (0%)]
9: /~eras045/serious/final_report.ps [8845382 (0%)]
10:
/~erwin/SR2002/SpeechRecognition2002/Student%20Projects/Recognition_Algorithms_II/timing%20RES.xls
[8744448 (0%)]
\end{verbatim}
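
The sketch below shows one way to obtain such a ranking. It is illustrative
Python, not the appendix script, and it sums the logged size of every
request per URL, which equals hits times size when a page always has the
same size. The function names and demo data are made up, and a logged size
of ``-'' is assumed to have been converted to 0 beforehand.
\begin{verbatim}
from collections import defaultdict

def bandwidth_per_url(requests):
    """requests: iterable of (url, size_in_bytes) pairs."""
    totals = defaultdict(int)
    for url, size in requests:
        totals[url] += size
    return totals

def top_urls(totals, count=10):
    """Return the `count` URLs with the largest byte totals."""
    ranked = sorted(totals.items(),
                    key=lambda item: item[1], reverse=True)
    return ranked[:count]

# Made-up example data, not taken from the real log.
demo = [("/big.zip", 5000), ("/index.html", 300),
        ("/big.zip", 5000), ("/index.html", 0)]
totals = bandwidth_per_url(demo)
grand_total = sum(totals.values())
print("Total Bandwidth (bytes): %d" % grand_total)
for rank, (url, size) in enumerate(top_urls(totals), start=1):
    share = 100 * size // grand_total
    print("%d: %s [%d (%d%%)]" % (rank, url, size, share))
\end{verbatim}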

\subsection{Will a certain IP use multiple user-agents?}
Check whether more than one User-Agent is seen from the same IP address. Such
an address is counted as a (NAT) proxy.
Perl code at Appendix~\ref{nat-proxy.pl}.
\begin{verbatim}
proxy/others hosts: 5086/71214 (7%)
\end{verbatim}
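
Grouping the User-Agents per host can be sketched as below. This is
illustrative Python, not the appendix script; the function name
\texttt{hosts\_with\_many\_agents} and the demo data are made up, and the
input is assumed to be (host, User-Agent) pairs from already parsed log
lines.
\begin{verbatim}
from collections import defaultdict

def hosts_with_many_agents(requests):
    """requests: iterable of (host, user_agent) pairs."""
    agents_by_host = defaultdict(set)
    for host, agent in requests:
        agents_by_host[host].add(agent)
    shared = [host for host, agents in agents_by_host.items()
              if len(agents) > 1]
    return sorted(shared), len(agents_by_host)

# Made-up example data, not taken from the real log.
demo = [("10.0.0.1", "Firefox"), ("10.0.0.1", "MSIE"),
        ("10.0.0.2", "Firefox"), ("10.0.0.2", "Firefox")]
shared, total = hosts_with_many_agents(demo)
print("proxy/others hosts: %d/%d" % (len(shared), total))
\end{verbatim}
Note that a single machine with two browsers installed is indistinguishable
from a proxy under this test, so the count is an upper bound at best.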

\subsection{Which IP ranges access the web server?}
Collect every IP address and try to group them into ranges. Hostnames are
ignored, because their IP addresses may have changed several times since the
request was logged.

This needs more logic, such as knowledge of the IP subnets, so this
experiment is skipped.
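
For reference, a crude approximation that needs no subnet knowledge is to
group addresses by a fixed prefix length. The Python sketch below is
illustrative only and was not part of the experiments; the function name
\texttt{count\_by\_prefix}, the default of 16 bits and the demo addresses
(apart from the one in the example log line) are assumptions. A fixed
prefix does not reflect how address blocks are really allocated.
\begin{verbatim}
import ipaddress
from collections import Counter

def count_by_prefix(hosts, prefix_len=16):
    """Count hits per IPv4 network of the given prefix length."""
    counts = Counter()
    for host in hosts:
        try:
            address = ipaddress.IPv4Address(host)
        except ValueError:
            continue  # hostname or malformed address: ignored
        network = ipaddress.IPv4Network(
            "%s/%d" % (address, prefix_len), strict=False)
        counts[str(network)] += 1
    return counts

demo = ["66.196.90.99", "66.196.91.3", "10.1.2.3",
        "www.example.org"]
for network, hits in count_by_prefix(demo).most_common():
    print(network, hits)
\end{verbatim}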


\section{Conclusion}
Simple relations like statistics are easy to find, but the more
sophisticated ones have to be thought out and designed. Finding good
relations takes a lot of time and is very hard to automate.

Perl is a quick way to process small amounts of data. When processing
more data I recommend writing a small (wrapper) binary program to
(pre)process the data.

%\begin{thebibliography}{XX}
%
%\end{thebibliography}

\section*{Appendix}

\subsection{common.pl}
\label{common.pl}
\VerbatimInput{common.pl}
\newpage

\subsection{robot.pl}
\label{robot.pl}
\VerbatimInput{robot.pl}
\newpage

\subsection{bandwidth.pl}
\label{bandwidth.pl}
\VerbatimInput{bandwidth.pl}
\newpage

\subsection{exists.pl}
\label{exists.pl}
\VerbatimInput{exists.pl}
\newpage

\subsection{nat-proxy.pl}
\label{nat-proxy.pl}
\VerbatimInput{nat-proxy.pl}
\end{document}