\documentclass[a4paper,12pt]{article}
\usepackage{hyperref}
\usepackage{a4wide}
%\usepackage{indentfirst}
\usepackage[english]{babel}
\usepackage{graphics}
%\usepackage[pdftex]{graphicx}
\usepackage{latexsym}
\usepackage{fancyvrb}
\usepackage{fancyhdr}

\pagestyle{fancyplain}
\newcommand{\tstamp}{\today}
\newcommand{\id}{$ $Id: report.tex 166 2007-05-14 08:08:58Z rick $ $}
\lfoot[\fancyplain{\tstamp}{\tstamp}]{\fancyplain{\tstamp}{\tstamp}}
\cfoot[\fancyplain{\id}{\id}]{\fancyplain{\id}{\id}}
\rfoot[\fancyplain{\thepage}{\thepage}]{\fancyplain{\thepage}{\thepage}}

\title{Challenges in Computer Science \\
{\large Assignment 3 - accesslog}}
\author{Rick van der Zwet\\
\texttt{<hvdzwet@liacs.nl>}\\
\\
LIACS\\
Leiden Universiteit\\
Niels Bohrweg 1\\
2333 CA Leiden\\
Nederland}
\date{\today}
\begin{document}
\maketitle
\section{Introduction}
\label{foo}
The assignment is the following:
\begin{quote}
Analyse a web server accesslog -- using Perl -- and find something
`interesting'. Write a three-page article about your findings.
\end{quote}

\section{Problem}
Direct relations are not visible in the web server accesslog, so the
data will have to be processed to find useful `connections'. Not all
data will be available at all times.

\section{Theory}
\subsection{Apache httpd accesslog}
We are processing an accesslog of the Apache httpd
server\footnote{\url{http://httpd.apache.org/}}, which has a number of
predefined formats\footnote{\url{http://httpd.apache.org/docs/1.3/logs.html}}.
The format we will analyze is the \textsl{combined log format}, which
contains the most information compared to the others. An accesslog
line is formatted as below.
\begin{verbatim}
66.196.90.99 - - [01/Jun/2004:04:04:06 +0200] "GET
/~kmakhija/daily/16thJun HTTP/1.0" 304 - "-" "Mozilla/5.0
(compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
\end{verbatim}
Detailed explanation\footnote{\url{http://httpd.apache.org/docs/1.3/logs.html#combined}}:
\begin{description}
\item[66.196.90.99] This is the IP address of the client (remote host)
which made the request to the server. If a proxy server exists between
the user and the server, this address will be the address of the proxy,
rather than the originating machine.
\item[-] The ``hyphen'' in the output indicates that the requested piece
of information is not available. In this case, the information that is
not available is the RFC 1413 identity of the client determined by
identd on the client's machine. This information is highly unreliable
and should almost never be used except on tightly controlled internal
networks.
\item[-] This is the userid of the person requesting the document as
determined by HTTP authentication. The same value is typically provided
to CGI scripts in the REMOTE\_USER environment variable. If the status
code for the request (see below) is 401, then this value should not be
trusted because the user is not yet authenticated. If the document is
not password protected, this entry will be ``-'' just like the previous
one.
\item[01/Jun/2004:04:04:06 +0200] The time that the server finished
processing the request. The format is:
\begin{verbatim}
[day/month/year:hour:minute:second zone]
day = 2*digit
month = 3*letter
year = 4*digit
hour = 2*digit
minute = 2*digit
second = 2*digit
zone = (`+' | `-') 4*digit
\end{verbatim}
\item[``GET /\textasciitilde{}kmakhija/daily/16thJun HTTP/1.0''] The
request line from the client is given in double quotes. The request
line contains a great deal of useful information. First, the method
used by the client is \textsl{GET}. Second, the client requested the
resource \textsl{/\textasciitilde{}kmakhija/daily/16thJun}, and third,
the client used the protocol \textsl{HTTP/1.0}.
\item[304] This is the status code that the server sends back to the
client. This information is very valuable, because it reveals whether
the request resulted in a successful response (codes beginning in 2), a
redirection (codes beginning in 3), an error caused by the client
(codes beginning in 4), or an error in the server (codes beginning
in 5). The full list of possible status codes can be found in the HTTP
specification (RFC 2616 section
10).\footnote{\url{http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html}}
\item[-] The last entry indicates the size of the object returned to
the client, not including the response headers. If no content was
returned to the client, this value will be ``-''.
\item[``-''] The ``Referer'' (sic) HTTP request header. This gives the
site that the client reports having been referred from. (This should be
the page that links to or includes the page requested.)
\item[``Mozilla/5.0 (compatible; Yahoo! Slurp;
http://help.yahoo.com/help/us/ysearch/slurp)''] The User-Agent HTTP
request header. This is the identifying information that the client
browser reports about itself.
\end{description}

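Such a line can be taken apart with a single regular expression. Below
is a minimal sketch in Perl; the field names are our own choice and not
taken from the scripts in the appendix.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parse one combined-log line into its fields; returns a hash
# reference, or undef when the line does not match the format.
sub parse_line {
    my ($line) = @_;
    $line =~ m/^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+) "([^"]*)" "([^"]*)"$/
        or return undef;
    return {
        host    => $1, ident   => $2, user   => $3,
        time    => $4, request => $5, status => $6,
        bytes   => $7, referer => $8, agent  => $9,
    };
}
```

A stricter parser would also validate the timestamp and the request
method, but for log analysis this pattern is usually sufficient.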
\subsection{References}
Looking at the information available, there is a very large number of
things to check (or cross-match). It would be far too difficult to have
a program find interconnections by itself, so we will need to define a
few and check them (using a program).

\section{Implementation}
Perl is the easiest way to accomplish this goal, since it is based on
plain-text processing in the first place.

\section{Experiments}

The results of the experiments are the outputs for the test input file
\textsl{/scratch/wwwlog/www.access\_log.8}.

\subsection{Which ratio/number of pages are requested but do not exist
(anymore)?}
Check the status code of every URL; a 404 will be marked as unknown.
Perl code at Appendix~\ref{exists.pl}.
\begin{verbatim}
404/total hits: 63774/987696 (6%)
Different 404 url's: 7590 (11%)
\end{verbatim}

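The counting behind these numbers can be sketched as follows; this is a
simplified version, the full script is in Appendix~\ref{exists.pl}.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Extract (url, status) from a combined-log line, or return an
# empty list when the line does not look like one.
sub url_status {
    my ($line) = @_;
    return $line =~ m{"\S+ (\S+) [^"]*" (\d{3}) } ? ($1, $2) : ();
}

# Count total hits, 404 hits and distinct 404 URLs from a log file
# given on the command line.
if (@ARGV) {
    my ($total, $notfound, %urls404) = (0, 0);
    while (my $line = <>) {
        my ($url, $status) = url_status($line) or next;
        $total++;
        if ($status == 404) {
            $notfound++;
            $urls404{$url}++;
        }
    }
    printf "404/total hits: %d/%d (%d%%)\n",
        $notfound, $total, $total ? 100 * $notfound / $total : 0;
    printf "Different 404 url's: %d\n", scalar keys %urls404;
}
```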
\subsection{What will be the ratio human/robot?}
Find the type of the User-Agent and mark it as a robot if the string
contains \textsl{bot, spider, slurp, search, crawler, checker,
downloader, worm}. Perl code at Appendix~\ref{robot.pl}.
\begin{verbatim}
robot/others user-agent: 186/7711 (2%)
\end{verbatim}

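The classification is a simple case-insensitive substring match against
the keyword list above; a sketch of the idea (the full script is in
Appendix~\ref{robot.pl}):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# User-Agent substrings that mark a client as a robot.
my @keywords = qw(bot spider slurp search crawler checker downloader worm);
my $keyword_re = join '|', map { quotemeta } @keywords;

# Return 1 when the agent string looks like a robot, 0 otherwise.
sub is_robot {
    my ($agent) = @_;
    return $agent =~ m/(?:$keyword_re)/i ? 1 : 0;
}

# Tally distinct User-Agents from a log file given on the command line.
if (@ARGV) {
    my %seen;
    while (my $line = <>) {
        $seen{$1} = 1 if $line =~ m/"([^"]*)"$/;  # last quoted field
    }
    my $robots = grep { is_robot($_) } keys %seen;
    printf "robot/others user-agent: %d/%d\n", $robots, scalar keys %seen;
}
```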
\subsection{Which documents generated the most bandwidth?}
Collect the number of hits on a certain page and multiply it by the
size of the page. Perl code at Appendix~\ref{bandwidth.pl}.
\begin{verbatim}
Total Bandwidth (bytes): 2504223027
top 10 bandwidth
1: /~dlf/reisfotos04/foto's.zip [143839972 (5%)]
2: /%7Ehvdspek/cgi-bin/fetchlog.cgi [103325194 (4%)]
3: /~moosten/quackknoppen.zip [19990886 (0%)]
4: /~swolff/Maradonna.mpg [19955712 (0%)]
5: /~phaazebr/weblog [15061021 (0%)]
6: /home/dlf/klr/download/inaug2004.ppt [10070836 (0%)]
7: /~dlf/klr/download/inaug2004.ppt [10064996 (0%)]
8: /~sgroot/londen-small.wmv [9017829 (0%)]
9: /~eras045/serious/final_report.ps [8845382 (0%)]
10:
/~erwin/SR2002/SpeechRecognition2002/Student%20Projects/Recognition_Algorithms_II/timing%20RES.xls
[8744448 (0%)]
\end{verbatim}

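Since the combined format logs the size of every individual response,
summing that field per URL gives the same result as hits times page
size. A sketch of that accumulation (the full script is in
Appendix~\ref{bandwidth.pl}):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Extract (url, bytes) from a combined-log line; responses without a
# body log '-' for the size and therefore do not match \d+.
sub url_bytes {
    my ($line) = @_;
    return $line =~ m{"\S+ (\S+) [^"]*" \d{3} (\d+) } ? ($1, $2) : ();
}

# Sum bytes per URL for a log file given on the command line and
# print the total plus the top 10 consumers.
if (@ARGV) {
    my %bytes;
    my $total = 0;
    while (my $line = <>) {
        my ($url, $size) = url_bytes($line) or next;
        $bytes{$url} += $size;
        $total       += $size;
    }
    printf "Total Bandwidth (bytes): %d\n", $total;
    my @top = sort { $bytes{$b} <=> $bytes{$a} } keys %bytes;
    splice @top, 10 if @top > 10;
    my $rank = 0;
    printf "%d: %s [%d]\n", ++$rank, $_, $bytes{$_} for @top;
}
```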
\subsection{Will a certain IP use multiple user-agents?}
Check whether there are multiple User-Agents per IP.
Perl code at Appendix~\ref{nat-proxy.pl}.
\begin{verbatim}
proxy/others hosts: 5086/71214 (7%)
\end{verbatim}

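The check amounts to keeping a set of User-Agents per IP and counting
the IPs whose set has more than one member. A sketch (the full script
is in Appendix~\ref{nat-proxy.pl}):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Record the User-Agent of one log line into %$agents, keyed by IP.
sub record {
    my ($agents, $line) = @_;
    $agents->{$1}{$2} = 1 if $line =~ m/^(\S+) .* "([^"]*)"$/;
}

# Flag IPs that used more than one User-Agent as possible NAT
# gateways or proxies.
if (@ARGV) {
    my %agents;
    while (my $line = <>) {
        record(\%agents, $line);
    }
    my $proxies = grep { keys %{ $agents{$_} } > 1 } keys %agents;
    printf "proxy/others hosts: %d/%d\n", $proxies, scalar keys %agents;
}
```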
\subsection{Which IP ranges access the web server?}
Collect every IP address and try to group them into ranges. We will
ignore hostnames, because their IPs might have changed a few times
already.

This would need some more logic, like knowledge of the IP subnets, so
we will skip this one.

\section{Conclusion}
Simple relations like statistics are easy to find, but the more
sophisticated ones have to be thought out and designed. Finding good
relations will take a lot of time and will be very hard to automate.

Using Perl is a quick way to process small amounts of data; when
processing more data, I recommend writing a small (wrapper) binary
program to (pre)process the data.

%\begin{thebibliography}{XX}
%
%\end{thebibliography}

\section*{Appendix}

\subsection{common.pl}
\label{common.pl}
\VerbatimInput{common.pl}
\newpage

\subsection{robot.pl}
\label{robot.pl}
\VerbatimInput{robot.pl}
\newpage

\subsection{bandwidth.pl}
\label{bandwidth.pl}
\VerbatimInput{bandwidth.pl}
\newpage

\subsection{exists.pl}
\label{exists.pl}
\VerbatimInput{exists.pl}
\newpage

\subsection{nat-proxy.pl}
\label{nat-proxy.pl}
\VerbatimInput{nat-proxy.pl}
\end{document}