\documentclass[a4paper,12pt]{article}
\usepackage{hyperref}
\usepackage{a4wide}
%\usepackage{indentfirst}
\usepackage[english]{babel}
\usepackage{graphics}
%\usepackage[pdftex]{graphicx}
\usepackage{latexsym}
\usepackage{fancyvrb}
\usepackage{fancyhdr}

\pagestyle{fancyplain}
\newcommand{\tstamp}{\today}
\newcommand{\id}{$ $Id: report.tex 166 2007-05-14 08:08:58Z rick $ $}
\lfoot[\fancyplain{\tstamp}{\tstamp}]{\fancyplain{\tstamp}{\tstamp}}
\cfoot[\fancyplain{\id}{\id}]{\fancyplain{\id}{\id}}
\rfoot[\fancyplain{\thepage}{\thepage}]{\fancyplain{\thepage}{\thepage}}

\title{Challenges in Computer Science \\
{\large Assignment 3 - accesslog}}
\author{Rick van der Zwet\\
\texttt{<hvdzwet@liacs.nl>}\\
\\
LIACS\\
Leiden Universiteit\\
Niels Bohrweg 1\\
2333 CA Leiden\\
Nederland}
\date{\today}
\begin{document}
\maketitle
\section{Introduction}
\label{foo}
The assignment is the following:
\begin{quote}
Analyse a web server accesslog -- using Perl -- and find something
`interesting'. Write a three-page article about your findings.
\end{quote}

\section{Problem}
Direct relations are not visible in the web server accesslog, so the
data will have to be processed to find useful `connections'. Not all
data will be available at all times.

\section{Theory}
\subsection{Apache httpd accesslog}
We are processing an accesslog of the Apache httpd
server\footnote{\url{http://httpd.apache.org/}}, which has a number of
predefined formats\footnote{\url{http://httpd.apache.org/docs/1.3/logs.html}}.
The format we will analyze is the \textsl{combined log format}, which
contains the most information compared to the others. An accesslog
line is formatted as below.
\begin{verbatim}
66.196.90.99 - - [01/Jun/2004:04:04:06 +0200] "GET
/~kmakhija/daily/16thJun HTTP/1.0" 304 - "-" "Mozilla/5.0
(compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
\end{verbatim}
Detailed explanation\footnote{\url{http://httpd.apache.org/docs/1.3/logs.html#combined}}:
\begin{description}
\item[66.196.90.99] This is the IP address of the client (remote host)
which made the request to the server. If a proxy server exists between
the user and the server, this address will be the address of the proxy,
rather than the originating machine.
\item[-] The ``hyphen'' in the output indicates that the requested piece
of information is not available. In this case, the information that is
not available is the RFC 1413 identity of the client determined by
identd on the client's machine. This information is highly unreliable
and should almost never be used except on tightly controlled internal
networks.
\item[-] This is the userid of the person requesting the document as
determined by HTTP authentication. The same value is typically provided
to CGI scripts in the REMOTE\_USER environment variable. If the status
code for the request (see below) is 401, then this value should not be
trusted because the user is not yet authenticated. If the document is
not password protected, this entry will be ``-'' just like the previous
one.
\item[01/Jun/2004:04:04:06 +0200] The time that the server finished
processing the request. The format is:
\begin{verbatim}
[day/month/year:hour:minute:second zone]
day = 2*digit
month = 3*letter
year = 4*digit
hour = 2*digit
minute = 2*digit
second = 2*digit
zone = (`+' | `-') 4*digit
\end{verbatim}
\item[``GET /\textasciitilde{}kmakhija/daily/16thJun HTTP/1.0''] The
request line from the client is given in double quotes. The request
line contains a great deal of useful information. First, the method
used by the client is \textsl{GET}. Second, the client requested the
resource \textsl{/\textasciitilde{}kmakhija/daily/16thJun}, and third,
the client used the protocol \textsl{HTTP/1.0}.
\item[304] This is the status code that the server sends back to the
client. This information is very valuable, because it reveals whether
the request resulted in a successful response (codes beginning in 2), a
redirection (codes beginning in 3), an error caused by the client
(codes beginning in 4), or an error in the server (codes beginning
in 5). The full list of possible status codes can be found in the HTTP
specification (RFC 2616 section
10).\footnote{\url{http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html}}
\item[-] The last entry indicates the size of the object returned to
the client, not including the response headers. If no content was
returned to the client, this value will be ``-''.
\item[``-''] The ``Referer'' (sic) HTTP request header. This gives the
site that the client reports having been referred from. (This should be
the page that links to or includes the page requested.)
\item[``Mozilla/5.0 (compatible; Yahoo! Slurp;
http://help.yahoo.com/help/us/ysearch/slurp)''] The User-Agent HTTP
request header. This is the identifying information that the client
browser reports about itself.
\end{description}

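Such a line can be taken apart with a single regular expression. Below
is a minimal sketch in Perl; the field names are our own choice and not
taken from the scripts in the appendix.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parse one combined-log line into its fields; returns a hash
# reference, or undef when the line does not match the format.
sub parse_line {
    my ($line) = @_;
    $line =~ m/^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+) "([^"]*)" "([^"]*)"$/
        or return undef;
    return {
        host    => $1, ident   => $2, user   => $3,
        time    => $4, request => $5, status => $6,
        bytes   => $7, referer => $8, agent  => $9,
    };
}
```

A stricter parser would also validate the timestamp and the request
method, but for log analysis this pattern is usually sufficient.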
\subsection{References}
Looking at the information available, there is a very large number of
things to check (or cross-match). It would be far too difficult to have
a program find interconnections by itself, so we will need to define a
few and check them (using a program).

\section{Implementation}
Perl is the easiest way to accomplish this goal, since it is based on
plain-text processing in the first place.

\section{Experiments}

The results of the experiments are the outputs for the test input file
\textsl{/scratch/wwwlog/www.access\_log.8}.

\subsection{Which ratio/number of pages are requested but do not exist
(anymore)?}
Check the status code of every URL; a 404 will be marked as unknown.
Perl code at Appendix~\ref{exists.pl}.
\begin{verbatim}
404/total hits: 63774/987696 (6%)
Different 404 url's: 7590 (11%)
\end{verbatim}

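The counting behind these numbers can be sketched as follows; this is a
simplified version, the full script is in Appendix~\ref{exists.pl}.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Extract (url, status) from a combined-log line, or return an
# empty list when the line does not look like one.
sub url_status {
    my ($line) = @_;
    return $line =~ m{"\S+ (\S+) [^"]*" (\d{3}) } ? ($1, $2) : ();
}

# Count total hits, 404 hits and distinct 404 URLs from a log file
# given on the command line.
if (@ARGV) {
    my ($total, $notfound, %urls404) = (0, 0);
    while (my $line = <>) {
        my ($url, $status) = url_status($line) or next;
        $total++;
        if ($status == 404) {
            $notfound++;
            $urls404{$url}++;
        }
    }
    printf "404/total hits: %d/%d (%d%%)\n",
        $notfound, $total, $total ? 100 * $notfound / $total : 0;
    printf "Different 404 url's: %d\n", scalar keys %urls404;
}
```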
\subsection{What will be the ratio human/robot?}
Find the type of the User-Agent and mark it as a robot if the string
contains \textsl{bot, spider, slurp, search, crawler, checker,
downloader, worm}. Perl code at Appendix~\ref{robot.pl}.
\begin{verbatim}
robot/others user-agent: 186/7711 (2%)
\end{verbatim}

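The classification is a simple case-insensitive substring match against
the keyword list above; a sketch of the idea (the full script is in
Appendix~\ref{robot.pl}):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# User-Agent substrings that mark a client as a robot.
my @keywords = qw(bot spider slurp search crawler checker downloader worm);
my $keyword_re = join '|', map { quotemeta } @keywords;

# Return 1 when the agent string looks like a robot, 0 otherwise.
sub is_robot {
    my ($agent) = @_;
    return $agent =~ m/(?:$keyword_re)/i ? 1 : 0;
}

# Tally distinct User-Agents from a log file given on the command line.
if (@ARGV) {
    my %seen;
    while (my $line = <>) {
        $seen{$1} = 1 if $line =~ m/"([^"]*)"$/;  # last quoted field
    }
    my $robots = grep { is_robot($_) } keys %seen;
    printf "robot/others user-agent: %d/%d\n", $robots, scalar keys %seen;
}
```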
\subsection{Which documents generated the most bandwidth?}
Collect the number of hits on a certain page and multiply it by the
size of the page. Perl code at Appendix~\ref{bandwidth.pl}.
\begin{verbatim}
Total Bandwidth (bytes): 2504223027
top 10 bandwidth
1: /~dlf/reisfotos04/foto's.zip [143839972 (5%)]
2: /%7Ehvdspek/cgi-bin/fetchlog.cgi [103325194 (4%)]
3: /~moosten/quackknoppen.zip [19990886 (0%)]
4: /~swolff/Maradonna.mpg [19955712 (0%)]
5: /~phaazebr/weblog [15061021 (0%)]
6: /home/dlf/klr/download/inaug2004.ppt [10070836 (0%)]
7: /~dlf/klr/download/inaug2004.ppt [10064996 (0%)]
8: /~sgroot/londen-small.wmv [9017829 (0%)]
9: /~eras045/serious/final_report.ps [8845382 (0%)]
10:
/~erwin/SR2002/SpeechRecognition2002/Student%20Projects/Recognition_Algorithms_II/timing%20RES.xls
[8744448 (0%)]
\end{verbatim}

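Since the combined format logs the size of every individual response,
summing that field per URL gives the same result as hits times page
size. A sketch of that accumulation (the full script is in
Appendix~\ref{bandwidth.pl}):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Extract (url, bytes) from a combined-log line; responses without a
# body log '-' for the size and therefore do not match \d+.
sub url_bytes {
    my ($line) = @_;
    return $line =~ m{"\S+ (\S+) [^"]*" \d{3} (\d+) } ? ($1, $2) : ();
}

# Sum bytes per URL for a log file given on the command line and
# print the total plus the top 10 consumers.
if (@ARGV) {
    my %bytes;
    my $total = 0;
    while (my $line = <>) {
        my ($url, $size) = url_bytes($line) or next;
        $bytes{$url} += $size;
        $total       += $size;
    }
    printf "Total Bandwidth (bytes): %d\n", $total;
    my @top = sort { $bytes{$b} <=> $bytes{$a} } keys %bytes;
    splice @top, 10 if @top > 10;
    my $rank = 0;
    printf "%d: %s [%d]\n", ++$rank, $_, $bytes{$_} for @top;
}
```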
\subsection{Will a certain IP use multiple user-agents?}
Check whether there are multiple User-Agents per IP.
Perl code at Appendix~\ref{nat-proxy.pl}.
\begin{verbatim}
proxy/others hosts: 5086/71214 (7%)
\end{verbatim}

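The check amounts to keeping a set of User-Agents per IP and counting
the IPs whose set has more than one member. A sketch (the full script
is in Appendix~\ref{nat-proxy.pl}):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Record the User-Agent of one log line into %$agents, keyed by IP.
sub record {
    my ($agents, $line) = @_;
    $agents->{$1}{$2} = 1 if $line =~ m/^(\S+) .* "([^"]*)"$/;
}

# Flag IPs that used more than one User-Agent as possible NAT
# gateways or proxies.
if (@ARGV) {
    my %agents;
    while (my $line = <>) {
        record(\%agents, $line);
    }
    my $proxies = grep { keys %{ $agents{$_} } > 1 } keys %agents;
    printf "proxy/others hosts: %d/%d\n", $proxies, scalar keys %agents;
}
```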
\subsection{Which IP ranges access the web server?}
Collect every IP address and try to group them into ranges. We will
ignore hostnames, because their IPs might have changed a few times
already.

This would need some more logic, like knowledge of the IP subnets, so
we will skip this one.

\section{Conclusion}
Simple relations like statistics are easy to find, but the more
sophisticated ones have to be thought out and designed. Finding good
relations will take a lot of time and will be very hard to automate.

Using Perl is a quick way to process small amounts of data; when
processing more data, I recommend writing a small (wrapper) binary
program to (pre)process the data.

%\begin{thebibliography}{XX}
%
%\end{thebibliography}

\section*{Appendix}

\subsection{common.pl}
\label{common.pl}
\VerbatimInput{common.pl}
\newpage

\subsection{robot.pl}
\label{robot.pl}
\VerbatimInput{robot.pl}
\newpage

\subsection{bandwidth.pl}
\label{bandwidth.pl}
\VerbatimInput{bandwidth.pl}
\newpage

\subsection{exists.pl}
\label{exists.pl}
\VerbatimInput{exists.pl}
\newpage

\subsection{nat-proxy.pl}
\label{nat-proxy.pl}
\VerbatimInput{nat-proxy.pl}
\end{document}