report.tex@ 193

Last change on this file since 193 was 2, checked in by Rick van der Zwet, 15 years ago
Initial import of data of old repository ('data') worth keeping (e.g. tracking means of URL access statistics)
File size: 8.0 KB

Rev	Line
[2]	1	\documentclass[a4paper,12pt]{article}
	2	\usepackage{hyperref}
	3	\usepackage{a4wide}
	4	%\usepackage{indentfirst}
	5	\usepackage[english]{babel}
	6	\usepackage{graphics}
	7	%\usepackage[pdftex]{graphicx}
	8	\usepackage{latexsym}
	9	\usepackage{fancyvrb}
	10	\usepackage{fancyhdr}
	11
	12	\pagestyle{fancyplain}
	13	\newcommand{\tstamp}{\today}
	14	\newcommand{\id}{$ $Id: report.tex 166 2007-05-14 08:08:58Z rick $ $}
	15	\lfoot[\fancyplain{\tstamp}{\tstamp}] {\fancyplain{\tstamp}{\tstamp}}
	16	\cfoot[\fancyplain{\id}{\id}] {\fancyplain{\id}{\id}}
	17	\rfoot[\fancyplain{\thepage}{\thepage}] {\fancyplain{\thepage}{\thepage}}
	18
	19
	20	\title{ Challenges in Computer Science \\
	21	\large{Assignment 3 - accesslog}}
	22	\author{Rick van der Zwet\\
	23	\texttt{<hvdzwet@liacs.nl>}\\
	24	\\
	25	LIACS\\
	26	Leiden Universiteit\\
	27	Niels Bohrweg 1\\
	28	2333 CA Leiden\\
	29	Nederland}
	30	\date{\today}
	31	\begin{document}
	32	\maketitle
	33	\section{Introduction}
	34	\label{foo}
	35	The assignment is as follows:
	36	\begin{quote}
	37	Analyse a web server accesslog -using Perl- and find something
	38	`interesting'. Write a three-page article about your findings.
	39	\end{quote}
	40
	41	\section{Problem}
	42	Direct relations are not visible in the web server accesslog, so the
	43	data has to be processed to find useful `connections'. Not every
	44	field is available for every request.
	45
	46	\section{Theory}
	47	\subsection{Apache httpd accesslog}
	48	We are processing an accesslog of the Apache httpd server%
	49	\footnote{http://httpd.apache.org/}, which has several predefined formats%
	50	\footnote{http://httpd.apache.org/docs/1.3/logs.html}. The format we
	51	will analyze is the \textsl{combined log format}, which contains
	52	the most information of the predefined formats. An
	53	accesslog line is formatted as below.
	54	\begin{verbatim}
	55	66.196.90.99 - - [01/Jun/2004:04:04:06 +0200] "GET
	56	/~kmakhija/daily/16thJun HTTP/1.0" 304 - "-" "Mozilla/5.0
	57	(compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
	58	\end{verbatim}
	59	Detailed explanation\footnote{http://httpd.apache.org/docs/1.3/logs.html\#combined}:
	60	\begin{description}
	61	\item[66.196.90.99] This is the IP address of the client (remote host) which
	62	made the request to the server. If a proxy server exists between the user and
	63	the server, this address will be the address of the proxy, rather than the
	64	originating machine.
	65	\item[-] The ``hyphen'' in the output indicates that the requested piece of
	66	information is not available. In this case, the information that is not
	67	available is the RFC 1413 identity of the client determined by identd on the
	68	client's machine. This information is highly unreliable and should almost never
	69	be used except on tightly controlled internal networks.
	70	\item[-] This is the
	71	userid of the person requesting the document as determined by HTTP
	72	authentication. The same value is typically provided to CGI scripts in the
	73	REMOTE\_USER environment variable. If the status code for the request (see
	74	below) is 401, then this value should not be trusted because the user is not
	75	yet authenticated. If the document is not password protected, this entry will
	76	be ``-'' just like the previous one.
	77	\item[01/Jun/2004:04:04:06 +0200] The time
	78	that the server finished processing the request. The format is:
	79	\begin{verbatim}
	80	[day/month/year:hour:minute:second zone]
	81	day = 2*digit
	82	month = 3*letter
	83	year = 4*digit
	84	hour = 2*digit
	85	minute = 2*digit
	86	second = 2*digit
	87	zone = (`+' | `-') 4*digit
	88	\end{verbatim}
	89	\item["GET /\textasciitilde{}kmakhija/daily/16thJun HTTP/1.0"] The request line from the
	90	client is given in double quotes. The request line contains a great deal of
	91	useful information. First, the method used by the client is \textsl{GET}.
	92	Second, the client requested the resource \textsl{/\textasciitilde{}kmakhija/daily/16thJun},
	93	and third, the client used the protocol \textsl{HTTP/1.0}.
	94	\item[304] This is the status code that the server sends back to the client.
	95	This information is very valuable, because it reveals whether the request
	96	resulted in a successful response (codes beginning in 2), a redirection (codes
	97	beginning in 3), an error caused by the client (codes beginning in 4), or an
	98	error in the server (codes beginning in 5). The full list of possible status
	99	codes can be found in the HTTP specification (RFC2616 section
	100	10).\footnote{http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html}
	101	\item[-] This entry indicates the size of the object returned to the
	102	client, not including the response headers. If no content was returned to the
	103	client, this value will be ``-''.
	104	\item["-"] The ``Referer'' (sic) HTTP request header. This gives the site that
	105	the client reports having been referred from. (This should be the page that
	106	links to or includes the page requested.)
	107	\item["Mozilla/5.0 (compatible; Yahoo! Slurp;
	108	http://help.yahoo.com/help/us/ysearch/slurp)"] The User-Agent HTTP request
	109	header. This is the identifying information that the client browser reports
	110	about itself.
	111	\end{description}
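
The field layout above can be captured with a single regular expression. The
following is a minimal sketch and not the code from \texttt{common.pl}: the
helper name \texttt{parse\_line} and the hash keys are assumptions made for
illustration, and the pattern does not handle escaped quotes inside a field.
\begin{verbatim}
use strict;
use warnings;

# One combined-log line; /x allows the layout and comments below.
my $LINE_RE = qr{
    ^(\S+)\s(\S+)\s(\S+)\s     # host, identd, userid
    \[([^\]]+)\]\s             # timestamp
    "(.*?)"\s                  # request line
    (\d{3})\s(\S+)\s           # status, bytes
    "(.*?)"\s"(.*?)"\s*$       # referer, user-agent
}x;

# Hypothetical helper: return a hash ref with the fields of one line,
# or nothing if the line does not match the combined log format.
sub parse_line {
    my ($line) = @_;
    my @f = $line =~ $LINE_RE or return;
    # The request line is "METHOD URL PROTOCOL".
    my ($method, $url, $proto) = split ' ', $f[4], 3;
    return {
        host    => $f[0], ident => $f[1], user  => $f[2],
        time    => $f[3],
        method  => $method, url => $url, proto => $proto,
        status  => $f[5],
        bytes   => ($f[6] eq '-' ? 0 : $f[6]),   # "-" means no body
        referer => $f[7], agent => $f[8],
    };
}

my $sample = '66.196.90.99 - - [01/Jun/2004:04:04:06 +0200] '
    . '"GET /~kmakhija/daily/16thJun HTTP/1.0" 304 - "-" '
    . '"Mozilla/5.0 (compatible; Yahoo! Slurp; '
    . 'http://help.yahoo.com/help/us/ysearch/slurp)"';
my $f = parse_line($sample);
print "$f->{host} $f->{method} $f->{url} $f->{status}\n";
\end{verbatim}
Returning nothing for a non-matching line lets the caller skip malformed
entries instead of aborting the whole run.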
	112
	113	\subsection{References}
	114	Looking at the information available, there is a very large number of
	115	things to check (or crossmatch). It is far too difficult to have a
	116	program find interconnections by itself, so we need to define a few and check
	117	them (by using a program).
	118
	119	\section{Implementation}
	120	Perl is the easiest way to accomplish this goal, because it was
	121	designed for plain-text processing in the first place.
	122
	123	\section{Experiments}
	124
	125	All results below are the output of the scripts run on the test input file
	126	\textsl{/scratch/wwwlog/www.access\_log.8}.
	127
	128	\subsection{What ratio/number of requested pages do not exist
	129	(anymore)?} Check the status code of every requested URL. A URL that
	130	returns a 404 is marked as unknown.
	131	The Perl code is in Appendix~\ref{exists.pl}.
	132	\begin{verbatim}
	133	404/total hits: 63774/987696 (6%)
	134	Different 404 url's: 7590 (11%)
	135	\end{verbatim}
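
The counting step described above can be sketched as follows. This is a
minimal illustration and not the contents of \texttt{exists.pl}: the function
name \texttt{count\_missing}, the input shape and the sample records are made
up, and relating the distinct 404 URLs to all distinct URLs is only a guess at
what the second percentage means.
\begin{verbatim}
use strict;
use warnings;

# Count 404 hits and distinct 404 URLs.
# Input: array ref of [url, status] pairs (hypothetical shape).
sub count_missing {
    my ($hits) = @_;
    my (%seen, %missing);
    my $hits_404 = 0;
    for my $hit (@$hits) {
        my ($url, $status) = @$hit;
        $seen{$url} = 1;
        next unless $status == 404;
        $hits_404++;
        $missing{$url} = 1;
    }
    return {
        total_hits => scalar(@$hits),
        hits_404   => $hits_404,
        urls       => scalar(keys %seen),
        urls_404   => scalar(keys %missing),
    };
}

# Made-up records, for illustration only.
my @hits = (['/a', 200], ['/b', 404], ['/b', 404],
            ['/c', 304], ['/d', 404]);
my $s = count_missing(\@hits);
printf "404/total hits: %d/%d (%d%%)\n",
    $s->{hits_404}, $s->{total_hits},
    100 * $s->{hits_404} / $s->{total_hits};
printf "Different 404 URLs: %d of %d (%d%%)\n",
    $s->{urls_404}, $s->{urls},
    100 * $s->{urls_404} / $s->{urls};
\end{verbatim}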
	136
	137	\subsection{What is the ratio human/robot?}
	138	Collect the User-Agent strings and mark one as a robot if it contains any of
	139	\textsl{bot, spider, slurp, search, crawler, checker, downloader, worm}.
	140	The Perl code is in Appendix~\ref{robot.pl}.
	141	\begin{verbatim}
	142	robot/others user-agent: 186/7711 (2%)
	143	\end{verbatim}
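
The keyword test described above might look like the sketch below. It is not
the code from \texttt{robot.pl}: the name \texttt{is\_robot} is hypothetical,
case-insensitive matching is an assumption, and the sample User-Agent strings
are only examples. A plain substring test like this will also flag ordinary
browsers whose string happens to contain one of the words.
\begin{verbatim}
use strict;
use warnings;

# Keywords taken from the text; joined into one alternation.
my @keywords = qw(bot spider slurp search crawler
                  checker downloader worm);
my $alternation = join '|', map { quotemeta($_) } @keywords;
my $ROBOT_RE = qr/$alternation/i;   # case-insensitive: an assumption

# Hypothetical helper: 1 if the User-Agent looks like a robot, else 0.
sub is_robot {
    my ($agent) = @_;
    return $agent =~ $ROBOT_RE ? 1 : 0;
}

# Example strings only; count each distinct User-Agent once.
my @agents = (
    'Mozilla/5.0 (compatible; Yahoo! Slurp; '
        . 'http://help.yahoo.com/help/us/ysearch/slurp)',
    'Googlebot/2.1 (+http://www.google.com/bot.html)',
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
);
my %seen;
my ($robots, $total) = (0, 0);
for my $agent (grep { !$seen{$_}++ } @agents) {
    $total++;
    $robots++ if is_robot($agent);
}
print "robots: $robots of $total distinct user-agents\n";
\end{verbatim}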
	144
	145	\subsection{Which documents generated the most bandwidth?}
	146	Collect the number of hits on a certain page and multiply by the size of
	147	the page.
	148	The Perl code is in Appendix~\ref{bandwidth.pl}.
	149	\begin{verbatim}
	150	Total Bandwidth (bytes): 2504223027
	151	top 10 bandwidth
	152	1: /~dlf/reisfotos04/foto's.zip [143839972 (5%)]
	153	2: /%7Ehvdspek/cgi-bin/fetchlog.cgi [103325194 (4%)]
	154	3: /~moosten/quackknoppen.zip [19990886 (0%)]
	155	4: /~swolff/Maradonna.mpg [19955712 (0%)]
	156	5: /~phaazebr/weblog [15061021 (0%)]
	157	6: /home/dlf/klr/download/inaug2004.ppt [10070836 (0%)]
	158	7: /~dlf/klr/download/inaug2004.ppt [10064996 (0%)]
	159	8: /~sgroot/londen-small.wmv [9017829 (0%)]
	160	9: /~eras045/serious/final_report.ps [8845382 (0%)]
	161	10:
	162	/~erwin/SR2002/SpeechRecognition2002/Student%20Projects/Recognition_Algorithms_II/timing%20RES.xls
	163	[8744448 (0%)]
	164	\end{verbatim}
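
A ranking like the one above can be produced along the following lines. This
is a sketch under stated assumptions and not the code from
\texttt{bandwidth.pl}: it sums the logged byte counts per URL instead of
multiplying hits by one page size (the two agree when every response for a URL
has the same size), and the function names and sample records are invented.
\begin{verbatim}
use strict;
use warnings;

# Sum response bytes per URL. Input: array ref of [url, bytes]
# pairs, where bytes is '-' if no body was sent (hypothetical shape).
sub bandwidth_per_url {
    my ($hits) = @_;
    my %bytes;
    for my $hit (@$hits) {
        my ($url, $size) = @$hit;
        $bytes{$url} += ($size eq '-' ? 0 : $size);
    }
    return \%bytes;
}

# Return the $n URLs with the highest byte totals, largest first.
sub top_urls {
    my ($bytes, $n) = @_;
    my @sorted = sort { $bytes->{$b} <=> $bytes->{$a} || $a cmp $b }
                 keys %$bytes;
    $n = @sorted if $n > @sorted;
    return @sorted[0 .. $n - 1];
}

# Made-up records, for illustration only.
my @hits = (['/big.zip', 5000], ['/index.html', 300],
            ['/big.zip', 5000], ['/index.html', '-']);
my $bytes = bandwidth_per_url(\@hits);
my $total = 0;
$total += $_ for values %$bytes;
my $rank = 0;
for my $url (top_urls($bytes, 10)) {
    printf "%d: %s [%d (%d%%)]\n", ++$rank, $url,
        $bytes->{$url}, 100 * $bytes->{$url} / $total;
}
\end{verbatim}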
	165
	166	\subsection{Does a single IP address use multiple user-agents?}
	167	Check whether one IP address sends multiple User-Agent strings, which suggests a NAT or proxy.
	168	The Perl code is in Appendix~\ref{nat-proxy.pl}.
	169	\begin{verbatim}
	170	proxy/others hosts: 5086/71214 (7%)
	171	\end{verbatim}
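
The check can be sketched with a hash of hashes keyed by host. This is an
illustration only, not the code from \texttt{nat-proxy.pl}: the function name
\texttt{multi\_agent\_hosts}, the input shape and the sample records are
assumptions.
\begin{verbatim}
use strict;
use warnings;

# Return the hosts seen with more than one distinct User-Agent.
# Input: array ref of [host, agent] pairs (hypothetical shape).
sub multi_agent_hosts {
    my ($hits) = @_;
    my %agents;    # host => { agent => 1, ... }
    for my $hit (@$hits) {
        my ($host, $agent) = @$hit;
        $agents{$host}{$agent} = 1;
    }
    my @multi = grep { scalar(keys %{ $agents{$_} }) > 1 }
                keys %agents;
    @multi = sort @multi;
    return @multi;
}

# Made-up records, for illustration only.
my @hits = (
    ['192.0.2.1', 'Mozilla/4.0'], ['192.0.2.1', 'Opera/7.5'],
    ['192.0.2.2', 'Mozilla/4.0'], ['192.0.2.2', 'Mozilla/4.0'],
);
my @multi = multi_agent_hosts(\@hits);
print "hosts with multiple user-agents: @multi\n";
\end{verbatim}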
	172
	173	\subsection{Which IP ranges access the web server?}
	174	Collect every IP address and try to put them into ranges. We ignore
	175	hostnames, because their IP addresses might have changed a few times already.
	176
	177	This needs more logic, such as knowledge of the IP subnets, so we
	178	skip this one.
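
As a rough starting point only, addresses could be bucketed by a fixed-length
prefix. This sketch was not part of the experiments, and it rests on the
assumption that keeping the first three octets (a /24) is a usable stand-in
for a real subnet, which is often not the case. The function name
\texttt{count\_by\_prefix} and the sample hosts are illustrative.
\begin{verbatim}
use strict;
use warnings;

# Count hosts per fixed-size IPv4 prefix. $octets is the number of
# leading octets to keep (3 corresponds to a /24). Entries that are
# not dotted-quad IPv4 addresses, such as hostnames, are skipped.
sub count_by_prefix {
    my ($hosts, $octets) = @_;
    my %count;
    for my $host (@$hosts) {
        my @parts = $host =~
            /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/
            or next;
        next if grep { $_ > 255 } @parts;
        my $prefix = join '.', @parts[0 .. $octets - 1];
        $count{$prefix}++;
    }
    return \%count;
}

# Sample hosts, for illustration only.
my @hosts = ('66.196.90.99', '66.196.90.12',
             '192.0.2.7', 'example.org');
my $count = count_by_prefix(\@hosts, 3);
print "$_: $count->{$_}\n" for sort keys %$count;
\end{verbatim}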
	179
	180
	181	\section{Conclusion}
	182	Simple relations like statistics are easy to find, but the more
	183	sophisticated ones have to be thought out and designed. Finding good
	184	relations takes a lot of time and is very hard to automate.
	185
	186	Perl is a quick way to process small amounts of data. When
	187	processing more data, I recommend writing a small (wrapper) binary program to
	188	(pre)process the data.
	189
	190	%\begin{thebibliography}{XX}
	191	%
	192	%\end{thebibliography}
	193
	194	\section*{Appendix}
	195
	196	\subsection{common.pl}
	197	\label{common.pl}
	198	\VerbatimInput{common.pl}
	199	\newpage
	200
	201	\subsection{robot.pl}
	202	\label{robot.pl}
	203	\VerbatimInput{robot.pl}
	204	\newpage
	205
	206	\subsection{bandwidth.pl}
	207	\label{bandwidth.pl}
	208	\VerbatimInput{bandwidth.pl}
	209	\newpage
	210
	211	\subsection{exists.pl}
	212	\label{exists.pl}
	213	\VerbatimInput{exists.pl}
	214	\newpage
	215
	216	\subsection{nat-proxy.pl}
	217	\label{nat-proxy.pl}
	218	\VerbatimInput{nat-proxy.pl}
	219	\end{document}
	220
