\documentclass[a4paper,12pt]{article}
\usepackage{hyperref}
\usepackage{a4wide}
%\usepackage{indentfirst}
\usepackage[english]{babel}
\usepackage{graphics}
%\usepackage[pdftex]{graphicx}
\usepackage{latexsym}
\usepackage{fancyvrb}
\usepackage{fancyhdr}

\pagestyle{fancyplain}
\newcommand{\tstamp}{\today}
\newcommand{\id}{$ $Id: report.tex 166 2007-05-14 08:08:58Z rick $ $}
\lfoot[\fancyplain{\tstamp}{\tstamp}] {\fancyplain{\tstamp}{\tstamp}}
\cfoot[\fancyplain{\id}{\id}] {\fancyplain{\id}{\id}}
\rfoot[\fancyplain{\thepage}{\thepage}] {\fancyplain{\thepage}{\thepage}}

\title{Challenges in Computer Science \\
\large{Assignment 3 - accesslog}}
\author{Rick van der Zwet\\
  \texttt{<hvdzwet@liacs.nl>}\\
  \\
  LIACS\\
  Leiden Universiteit\\
  Niels Bohrweg 1\\
  2333 CA Leiden\\
  Nederland}
\date{\today}
\begin{document}
\maketitle
\section{Introduction}
\label{foo}
The assignment is the following:
\begin{quote}
Analyse a web server accesslog -using Perl- and find something
`interesting'. Write a three-page article about your findings.
\end{quote}

\section{Problem}
Direct relations are not visible inside the web server accesslog; the
data will need to be processed to find useful `connections'. Not all
data will be available at all times.

\section{Theory}
\subsection{Apache httpd accesslog}
We are processing an accesslog of the Apache httpd
server\footnote{http://httpd.apache.org/}, which has a number of
predefined formats\footnote{http://httpd.apache.org/docs/1.3/logs.html}.
The format we will analyze is the \textsl{combined log format}, which
contains the most information compared to the other formats. An
accesslog line is formatted as below.
\begin{verbatim}
66.196.90.99 - - [01/Jun/2004:04:04:06 +0200] "GET
/~kmakhija/daily/16thJun HTTP/1.0" 304 - "-" "Mozilla/5.0
(compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
\end{verbatim}
Detailed explanation\footnote{http://httpd.apache.org/docs/1.3/logs.html\#combined}:
\begin{description}
\item[66.196.90.99] This is the IP address of the client (remote host) which
made the request to the server. If a proxy server exists between the user and
the server, this address will be the address of the proxy, rather than the
originating machine.
\item[-] The "hyphen" in the output indicates that the requested piece of
information is not available. In this case, the information that is not
available is the RFC 1413 identity of the client determined by identd on the
client's machine. This information is highly unreliable and should almost never
be used except on tightly controlled internal networks.
\item[-] This is the userid of the person requesting the document as
determined by HTTP authentication. The same value is typically provided to CGI
scripts in the REMOTE\_USER environment variable. If the status code for the
request (see below) is 401, then this value should not be trusted because the
user is not yet authenticated. If the document is not password protected, this
entry will be "-" just like the previous one.
\item[01/Jun/2004:04:04:06 +0200] The time
that the server finished processing the request. The format is:
\begin{verbatim}
[day/month/year:hour:minute:second zone]
day = 2*digit
month = 3*letter
year = 4*digit
hour = 2*digit
minute = 2*digit
second = 2*digit
zone = (`+' | `-') 4*digit
\end{verbatim}
\item["GET /~kmakhija/daily/16thJun HTTP/1.0"] The request line from the
client is given in double quotes. The request line contains a great deal of
useful information. First, the method used by the client is \textsl{GET}.
Second, the client requested the resource \textsl{/~kmakhija/daily/16thJun},
and third, the client used the protocol \textsl{HTTP/1.0}.
\item[304] This is the status code that the server sends back to the client.
This information is very valuable, because it reveals whether the request
resulted in a successful response (codes beginning in 2), a redirection (codes
beginning in 3), an error caused by the client (codes beginning in 4), or an
error in the server (codes beginning in 5). The full list of possible status
codes can be found in the HTTP specification (RFC2616 section
10).\footnote{http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html}
\item[-] The last entry indicates the size of the object returned to the
client, not including the response headers. If no content was returned to the
client, this value will be "-".
\item["-"] The "Referer" (sic) HTTP request header. This gives the site that
the client reports having been referred from. (This should be the page that
links to or includes the page requested.)
\item["Mozilla/5.0 (compatible; Yahoo! Slurp;
http://help.yahoo.com/help/us/ysearch/slurp)"] The User-Agent HTTP request
header. This is the identifying information that the client browser reports
about itself.
\end{description}

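To make the fields concrete, a combined-format line can be split with a
single regular expression. The sketch below is illustrative only and is
not one of the appendix scripts; the variable names are my own:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Example line in combined log format (the sample above, unwrapped).
my $line = '66.196.90.99 - - [01/Jun/2004:04:04:06 +0200] '
    . '"GET /~kmakhija/daily/16thJun HTTP/1.0" 304 - "-" '
    . '"Mozilla/5.0 (compatible; Yahoo! Slurp; '
    . 'http://help.yahoo.com/help/us/ysearch/slurp)"';

# host ident user [time] "request" status bytes "referer" "agent"
my ($host, $ident, $user, $time, $request,
    $status, $bytes, $referer, $agent) = $line =~
    m{^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+) "([^"]*)" "([^"]*)"$};

print "$host requested '$request' -> status $status\n";
```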
\subsection{References}
Looking at the information available, there is a very large number of
things one could check (or cross-match). It would be far too difficult to
have a program find interconnections by itself, so we will need to define
a few relations ourselves and check them (by using a program).

\section{Implementation}
Perl is the easiest way to accomplish this goal, since it is based on
plain-text processing in the first place.

\section{Experiments}

The results of the experiments are the outputs for the test input file
\textsl{/scratch/wwwlog/www.access\_log.8}.

\subsection{Which ratio/number of pages is requested but does not exist
(anymore)?}
Check every URL and its status code; a 404 status marks the URL as
non-existent.
Perl code at Appendix~\ref{exists.pl}.
\begin{verbatim}
404/total hits: 63774/987696 (6%)
Different 404 url's: 7590 (11%)
\end{verbatim}

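The core of the check can be sketched as below. This is a simplified
version, not the script from Appendix~\ref{exists.pl}; the subroutine
name is made up for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count 404 responses and distinct 404 URLs in a list of log lines.
# (Illustrative sketch, not the appendix script.)
sub count_404 {
    my @lines = @_;
    my ($total, $notfound, %urls) = (0, 0);
    for my $line (@lines) {
        # The status code is the first field after the quoted request.
        next unless $line =~ m/"[^"]*" (\d{3}) /;
        $total++;
        if ($1 == 404) {
            $notfound++;
            # Remember each distinct URL that gave a 404.
            my ($url) = $line =~ m/"\S+ (\S+)[^"]*"/;
            $urls{$url}++ if defined $url;
        }
    }
    return ($notfound, $total, scalar keys %urls);
}
```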
\subsection{What will be the ratio human/robot?}
Find the type of each User-Agent: mark it robot if the string contains
\textsl{bot, spider, slurp, search, crawler, checker, downloader, worm}.
Perl code at Appendix~\ref{robot.pl}.
\begin{verbatim}
robot/others user-agent: 186/7711 (2%)
\end{verbatim}

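The classification itself is a single case-insensitive match. A minimal
sketch (the real code is in Appendix~\ref{robot.pl}; the subroutine name
here is illustrative):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return 1 if a User-Agent string looks like a robot, 0 otherwise.
# (Illustrative sketch, not the appendix script.)
sub is_robot {
    my ($agent) = @_;
    return $agent =~ /bot|spider|slurp|search|crawler|checker|downloader|worm/i
        ? 1 : 0;
}

print is_robot('Mozilla/5.0 (compatible; Yahoo! Slurp)') ? "robot\n" : "human\n";
```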
\subsection{Which documents generated the most bandwidth?}
Collect the number of hits on a certain page and multiply by the size of
the page.
Perl code at Appendix~\ref{bandwidth.pl}.
\begin{verbatim}
Total Bandwidth (bytes): 2504223027
top 10 bandwidth
1: /~dlf/reisfotos04/foto's.zip [143839972 (5%)]
2: /%7Ehvdspek/cgi-bin/fetchlog.cgi [103325194 (4%)]
3: /~moosten/quackknoppen.zip [19990886 (0%)]
4: /~swolff/Maradonna.mpg [19955712 (0%)]
5: /~phaazebr/weblog [15061021 (0%)]
6: /home/dlf/klr/download/inaug2004.ppt [10070836 (0%)]
7: /~dlf/klr/download/inaug2004.ppt [10064996 (0%)]
8: /~sgroot/londen-small.wmv [9017829 (0%)]
9: /~eras045/serious/final_report.ps [8845382 (0%)]
10:
/~erwin/SR2002/SpeechRecognition2002/Student%20Projects/Recognition_Algorithms_II/timing%20RES.xls
[8744448 (0%)]
\end{verbatim}

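The accumulation step can be sketched as below. This is a simplified
variant, not the script from Appendix~\ref{bandwidth.pl}, and the
subroutine name is made up; it sums the actual byte count per request,
which coincides with hits times page size whenever the page size does not
change between requests:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sum the bytes sent per URL and return the total plus the top-$n URLs.
# (Illustrative sketch, not the appendix script.)
sub top_bandwidth {
    my ($n, @lines) = @_;
    my %bytes;
    my $total = 0;
    for my $line (@lines) {
        # URL from the request line, size from the field after the status.
        # Requests without a byte count ("-") are skipped.
        next unless $line =~ m/"\S+ (\S+)[^"]*" \d{3} (\d+) /;
        $bytes{$1} += $2;
        $total     += $2;
    }
    my @top = sort { $bytes{$b} <=> $bytes{$a} } keys %bytes;
    $#top = $n - 1 if @top > $n;    # keep at most $n entries
    return ($total, @top);
}
```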
\subsection{Will a certain IP use multiple user-agents?}
Check whether multiple User-Agents show up for a single IP address.
Perl code at Appendix~\ref{nat-proxy.pl}.
\begin{verbatim}
proxy/others hosts: 5086/71214 (7%)
\end{verbatim}

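A sketch of this check (simplified; the real code is in
Appendix~\ref{nat-proxy.pl}, and the subroutine name is illustrative):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count hosts that used more than one distinct User-Agent string.
# (Illustrative sketch, not the appendix script.)
sub multi_agent_hosts {
    my @lines = @_;
    my %seen;    # host => { user-agent => 1 }
    for my $line (@lines) {
        # Host is the first field, User-Agent the last quoted field.
        next unless $line =~ m/^(\S+) .* "([^"]*)"\s*$/;
        $seen{$1}{$2} = 1;
    }
    my $multi = grep { keys %{ $seen{$_} } > 1 } keys %seen;
    return ($multi, scalar keys %seen);
}
```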
\subsection{Which IP ranges access the web server?}
Collect every IP address and try to put them into ranges. We will ignore
hostnames since their IP addresses might have changed a few times already.

This would need some more logic, such as knowledge of the IP subnets, so
we will skip this one.

\section{Conclusion}
Simple relations like statistics are easy to find, but the more
sophisticated ones have to be thought out and designed. Finding good
relations will take a lot of time and will be very hard to automate.

Using Perl is a quick way to process small amounts of data; when
processing more data I recommend writing a small (wrapper) binary program
to (pre)process the data.

%\begin{thebibliography}{XX}
%
%\end{thebibliography}

\section*{Appendix}

\subsection{common.pl}
\label{common.pl}
\VerbatimInput{common.pl}
\newpage

\subsection{robot.pl}
\label{robot.pl}
\VerbatimInput{robot.pl}
\newpage

\subsection{bandwidth.pl}
\label{bandwidth.pl}
\VerbatimInput{bandwidth.pl}
\newpage

\subsection{exists.pl}
\label{exists.pl}
\VerbatimInput{exists.pl}
\newpage

\subsection{nat-proxy.pl}
\label{nat-proxy.pl}
\VerbatimInput{nat-proxy.pl}
\end{document}