\documentclass{report}
\usepackage{graphicx}

\title{Databases and Data Mining --- Assignment 5}
\author{Rick van der Zwet}


\begin{document}
\newcommand{\question}[2]{\begin{quotation}\noindent{\Large``}#1. \textit{#2}{\Large''}\end{quotation}}
\maketitle

\question{1a}{Propose an algorithm, in pseudo-code, for the following: The automatic
generation of a concept hierarchy for numerical data based on the equal-width
partitioning rule.}
\begin{verbatim}
input = []          # Input array of all input numbers
num_intervals = %User input of number of intervals needed%
max = maximum(input)
min = minimum(input)
interval_width = (max - min) / num_intervals

output = []         # Output array, where the values of interval k
                    # are stored in output[k]

for value in input:
    # Find the correct bin, offset by the minimum value
    interval = floor((value - min) / interval_width)
    # The maximum value lands exactly on the upper bound;
    # clamp it into the last bin
    if interval == num_intervals:
        interval = num_intervals - 1
    endif
    output[interval].append(value)  # Put the value inside the bin
endfor
\end{verbatim}
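
The pseudo-code translates almost directly into Python. Below is a minimal
runnable sketch; the function name \texttt{equal\_width} is my own choice and
not part of the assignment:
\begin{verbatim}
def equal_width(values, num_intervals):
    # Partition values into num_intervals bins of equal width.
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_intervals
    output = [[] for _ in range(num_intervals)]
    for value in values:
        k = int((value - lo) / width)   # index of the bin
        k = min(k, num_intervals - 1)   # clamp the maximum value
        output[k].append(value)
    return output

# Example: 8 values over the range 1..10, 3 bins of width 3
print(equal_width([1, 2, 3, 4, 5, 6, 7, 10], 3))
# [[1, 2, 3], [4, 5, 6], [7, 10]]
\end{verbatim}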

\question{1b}{Propose an algorithm, in pseudo-code, for the following: The automatic
generation of a concept hierarchy for numerical data based on the equal-frequency
partitioning rule.}
\begin{verbatim}
input = []          # Input array of all input numbers
num_intervals = %User input of number of intervals needed%
input_length = length(input)  # Number of items in the list
interval_width = input_length / num_intervals

sorted_input = sorted(input)  # Sort items on value, from small to large

output = []         # Output array, where the values of interval k
                    # are stored in output[k]

interval = 0
counter = 0
for value in sorted_input:
    output[interval].append(value)  # Put the value inside the bin

    # Once the bin has been filled, continue with the next one
    counter++
    if counter >= interval_width and interval < num_intervals - 1:
        interval++
        counter = 0
    endif
endfor
\end{verbatim}
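
Again a minimal runnable Python sketch; the slicing-based formulation (handing
the remainder items to the earlier bins) is my own simplification:
\begin{verbatim}
def equal_frequency(values, num_intervals):
    # Partition the sorted values into num_intervals bins of
    # (nearly) equal size.
    data = sorted(values)
    n = len(data)
    output, start = [], 0
    for k in range(num_intervals):
        size = n // num_intervals + (1 if k < n % num_intervals else 0)
        output.append(data[start:start + size])
        start += size
    return output

# Example: 8 values into 3 bins of sizes 3, 3 and 2
print(equal_frequency([7, 1, 5, 2, 10, 4, 3, 6], 3))
# [[1, 2, 3], [4, 5, 6], [7, 10]]
\end{verbatim}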

\question{2}{Suppose that a data warehouse consists of four dimensions,
date, spectator, location, and game, and two measures, count, and charge, where
charge is the fare that a spectator pays when watching a game on a given date.
Spectators may be students, adults, or seniors, with each category having its own
charge rate.}
\question{2a}{Draw a star schema diagram for the data warehouse.}
See figure~\ref{fig:star}.
% http://it.toolbox.com/blogs/enterprise-solutions/star-schema-modelling-data-warehouse-20803
\begin{figure}[htp]
\centering
\includegraphics[scale=0.6]{star-diagram.eps}
\caption{Star schema diagram for the data warehouse of question 2a}
\label{fig:star}
\end{figure}

\question{2b}{Starting with the base cuboid [date, spectator, location, game],
what specific OLAP operations should one perform in order to list the total
charge paid by student spectators at GM\_Place in 2004?}
% http://en.wikipedia.org/wiki/OLAP_cube#OLAP_operations
You first need to \emph{roll up} date to the year level and then \emph{slice}
on the condition \texttt{location == 'GM\_Place'} (note that GM\_Place is a
location, not a game). Next you slice on \texttt{date.year == '2004'}, which
yields all charges for GM Place in 2004, and then slice on
\texttt{spectator.type == 'student'}. Lastly, sum all the remaining charges in
the display phase (\texttt{pivot}).
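
The same sequence can be imitated on a flat fact table. A hypothetical
\texttt{pandas} sketch (the table and its column names are my own assumptions,
not given in the assignment):
\begin{verbatim}
import pandas as pd

# Toy base cuboid: one row per (date, spectator, location, game) cell.
fact = pd.DataFrame({
    "location":  ["GM_Place", "GM_Place", "Stadium_B"],
    "year":      [2004, 2004, 2004],
    "spectator": ["student", "adult", "student"],
    "charge":    [10.0, 25.0, 12.0],
})

# Three slices, then aggregation of the charge measure.
total = fact[(fact.location == "GM_Place")
             & (fact.year == 2004)
             & (fact.spectator == "student")]["charge"].sum()
print(total)  # 10.0
\end{verbatim}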

\question{2c}{Bitmap indexing is useful in data warehousing. Taking this cube
as an example, briefly discuss advantages and problems of using a bitmap index
structure.}
% http://en.wikipedia.org/wiki/Bitmap_index
Bitmap indexing is useful here for dimensions with low cardinality, such as
the spectator type. As there are only three options (student, adult, senior),
three bit vectors suffice to store all possible values, with one bit per
record in each vector. Searches then reduce to fast bitwise operations (AND,
OR, NOT), although their results still have to be translated back into record
identifiers before they can be presented. Another advantage is that bitmap
indexes compress very well, as the bit patterns are highly repetitive. A
problem is that the index becomes large and expensive to maintain for
high-cardinality dimensions such as date.
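
A toy illustration of the idea in Python, using integers as bit vectors (the
encoding is my own, purely for illustration):
\begin{verbatim}
# Spectator type per row of the fact table.
records = ["student", "adult", "student", "senior", "student"]

# One bit vector per distinct value; bit i is set if row i matches.
bitmaps = {}
for i, kind in enumerate(records):
    bitmaps[kind] = bitmaps.get(kind, 0) | (1 << i)

# "student OR senior" is a single bitwise OR ...
mask = bitmaps["student"] | bitmaps["senior"]
# ... but the result must be decoded back into row numbers.
rows = [i for i in range(len(records)) if mask & (1 << i)]
print(rows)  # [0, 2, 3, 4]
\end{verbatim}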


\question{3}{A data cube, C, has n dimensions, and each dimension
has exactly p distinct values in the base cuboid. Assume that there are no concept
hierarchies associated with the dimensions. What is the maximum number of cells
possible (including both base cells and aggregate cells) in the data cube, C?}
% http://www2.cs.uregina.ca/~dbd/cs831/notes/dcubes/dcubes.html
Each cell fixes every dimension either to one of its $p$ distinct values or to
the aggregate value `*' (``all''), giving $p + 1$ choices per dimension. Take
for example 2 dimensions with 4 distinct values each: this gives a total of
$(4+1)^2 = 25$ cells, of which $4^2 = 16$ are base cells and the remaining 9
are aggregate cells. In general the maximum number of cells is $(p+1)^n$.
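
A brute-force check of this count in Python (a verification sketch I added;
\texttt{"*"} marks an aggregated dimension):
\begin{verbatim}
from itertools import product

n, p = 2, 4
# Each dimension takes one of p values or '*' (aggregated away).
cells = list(product(list(range(p)) + ["*"], repeat=n))
base = [c for c in cells if "*" not in c]
print(len(cells), len(base))  # 25 16, i.e. (p+1)**n and p**n
\end{verbatim}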

\question{4}{The Apriori algorithm uses prior knowledge of subset support properties.}
\question{4a}{Prove that all nonempty subsets of a frequent itemset must also be frequent.}
% http://en.wikipedia.org/wiki/Apriori_algorithm
If an itemset $I$ is frequent, its support count reaches the minimal support
threshold. Every transaction that contains $I$ also contains any nonempty
subset $I_{sub}$ of $I$, so the count of $I_{sub}$ is the same or larger
(larger when some transactions contain $I_{sub}$ without containing all of
$I$). Hence $support(I_{sub}) \ge support(I) \ge min\_sup$, and $I_{sub}$ is
also frequent whenever $I$ is.

\question{4b}{Given frequent itemset $l$ and subset $s$ of $l$, prove that the confidence of the rule
``$s' \Rightarrow (l - s')$'' cannot be more than the confidence of the rule ``$s \Rightarrow (l - s)$'', where $s'$
is a subset of $s$.}
The confidence of $s \Rightarrow (l - s)$ equals $support(l) / support(s)$.
Since $s'$ is a subset of $s$, we know from 4a that
$support(s') \ge support(s)$, and therefore
$support(l) / support(s') \le support(l) / support(s)$: the rule with the
smaller antecedent can never have the higher confidence. Intuitively, if
shrinking the antecedent could raise the confidence, repeating the argument
would leave the empty set on the left-hand side and the full itemset $l$ on
the right, so that a rule like ``customers buying nothing also buy beer''
would be the most confident rule of all, which is absurd.

\question{5}{An optimization in frequent item set mining is mining closed patterns, or mining max
patterns instead. Describe the main differences of mining closed patterns and mining
max patterns.}
% http://www.dataminingarticles.com/closed-maximal-itemsets.html
| 128 | By definition the itemset $X$ is closed means $X$ is frequent and there exists
|
---|
| 129 | no super-pattern $X /in Y$ wth the same support as $X$. While $X$ is max if $X$
|
---|
| 130 | is frequent and there exists no \textit{frequent} super-pattern $X /in Y$.
|
---|
| 131 | Maximum frequent itemsets has the downside that you don't not know the actual
|
---|
| 132 | support of those sub-itemsets. The closed frequent itemsets helps purging a lot
|
---|
| 133 | of itemsets that are not needed.
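
A small worked example in Python (the transactions are of my own choosing)
makes the difference concrete:
\begin{verbatim}
from itertools import combinations

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "b"}, {"b", "c"}]
min_sup = 2

def support(items):
    return sum(items <= t for t in transactions)

universe = set().union(*transactions)
frequent = {frozenset(c): support(set(c))
            for r in range(1, len(universe) + 1)
            for c in combinations(sorted(universe), r)
            if support(set(c)) >= min_sup}

closed = [x for x in frequent
          if not any(x < y and frequent[y] == frequent[x] for y in frequent)]
maximal = [x for x in frequent if not any(x < y for y in frequent)]
print(sorted(map(sorted, closed)))   # [['a', 'b'], ['b'], ['b', 'c']]
print(sorted(map(sorted, maximal)))  # [['a', 'b'], ['b', 'c']]
\end{verbatim}
Here $\{b\}$ is closed but not maximal, and $\{a\}$ is frequent but not closed,
because $\{a, b\}$ has the same support.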

\question{6}{The price of each item in a store is nonnegative. For
each of the following cases, identify the kinds of constraint they represent (e.g.
anti-monotonic, monotonic, succinct) and briefly discuss how to mine such association
rules efficiently:}
% www.cs.sfu.ca/CC/741/jpei/slides/ConstrainedFrequentPatterns.pdf
\question{6a}{Containing one free item and other items the sum of whose prices is at least \$190.}
This is \emph{monotonic}: adding items will never lower the sum of prices, so
once an itemset satisfies the constraint, all of its supersets do as well.
(The ``containing one free item'' part is \emph{succinct}: the qualifying
items can be enumerated up front.) Mining is best done by first selecting the
free items and then ordering the other items by price, starting with the most
expensive ones. As soon as the sum of a candidate reaches \$190, all of its
supersets are known to match as well; then remove the last added item and try
to complete the sum again, and continue until the most expensive item has been
removed and tried. Finally, \emph{join} the result with all the free items and
the list is complete.
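
A small Python sketch of how the monotone property saves checks during
enumeration (item names and prices are invented for illustration):
\begin{verbatim}
prices = {"tv": 150, "camera": 80, "radio": 40, "pen": 2}
free_items = ["keychain"]                 # given away for free
items = sorted(prices, key=prices.get, reverse=True)  # expensive first

def mine(start, chosen, total, satisfied, out):
    # Monotonicity: once total >= 190, every superset qualifies,
    # so the constraint is never re-checked below this point.
    if satisfied or total >= 190:
        out.append(chosen)
        satisfied = True
    for i in range(start, len(items)):
        mine(i + 1, chosen + [items[i]], total + prices[items[i]],
             satisfied, out)

result = []
mine(0, [], 0, False, result)
# Join each qualifying itemset with a free item to complete the list.
print([combo + [f] for combo in result for f in free_items])
\end{verbatim}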

\question{6b}{Where the average price of all the items is between \$120 and \$520.}
Each bound on its own is a \emph{convertible} constraint once the items are
arranged in price-descending order. The lower bound is convertible
\emph{anti-monotonic}: if an itemset $S$ violates $avg(S.price) \ge \$120$,
then so does every itemset having $S$ as a prefix, because only cheaper items
are appended and the average can only drop further. The upper bound is
convertible \emph{monotonic}: if $S$ satisfies $avg(S.price) \le \$520$, then
so does every itemset having $S$ as a prefix, for the same reason.

Requiring both conditions at once, however, makes the combined constraint
neither \emph{anti-monotonic} nor \emph{monotonic}. A fast way to generate
this set is to use the algorithm of 6a, but adapt the bound to the itemset
size on the fly: checking whether the average of $k$ items lies between \$120
and \$520 is the same as checking whether their sum lies between
$k \cdot 120$ and $k \cdot 520$.

\question{7}{Suppose a city has installed hundreds of surveillance cameras at strategic locations in
busy streets. All the captured video (each stream being 640 pixels x 480 pixels, 24-
bits (RGB) per pixel, 25fps) is streamed to a central observation post in real time.
Describe an efficient system that would allow real time data mining and continuously
querying these video streams for abnormal events, such as explosions, people in
panic, etc.. Also discuss the computational costs and the memory requirements of
your system.}

Every camera generates $640 \times 480 \times 24\,\mathrm{bit} \times
25\,\mathrm{fps} = 184.32\,\mathrm{Mbit/s} \approx 21.9\,\mathrm{MiB/s}$. I
would first use a compression algorithm to reduce the data sent over the wire,
transmitting only the changes between frames instead of every full frame.
Secondly, abnormality detection is applied to the streams using heuristics.
Detected abnormalities are kept uncompressed and uncut, while all other
footage is stored at a much lower frame rate (1 fps); with a compression ratio
of 4:1 this results in a storage stream of approximately 19 GB per camera per
day. The processing power required depends on the algorithms used, but a
500 MHz CPU per camera should fit the bill when simple heuristics are used.
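
A quick back-of-the-envelope check of these numbers in Python (my own
verification sketch):
\begin{verbatim}
# Raw stream per camera
bits_per_sec = 640 * 480 * 24 * 25
print(bits_per_sec / 1e6)        # 184.32 Mbit/s
print(bits_per_sec / 8 / 2**20)  # ~21.97 MiB/s

# Archived stream: 1 fps, 4:1 compression
bytes_per_day = 640 * 480 * 3 * 86400 / 4
print(bytes_per_day / 2**30)     # ~18.5 GiB/camera/day
\end{verbatim}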

\question{8}{A flight data warehouse for a travel agent consists of six
dimensions: traveler, departure (city), departure\_time, arrival (city), arrival\_time,
and flight; and two measures count, and avg\_fare, where avg\_fare stores the concrete
fare at the lowest level but the average fare at the other levels. Suppose the cube is
fully materialized. Starting with the base cuboid [traveler, departure, departure\_time,
arrival, arrival\_time, flight], what specific OLAP operations (e.g.\ roll-up flight to
airline, etc.) should one perform in order to list the average fare per month for each
business traveler who flies American Airlines (AA) from Los Angeles (LA) in the
year 2007?}

Note: this is a bit tricky, as we use departure\_time to determine the year of
the flight and arrival\_time to determine its month. When these do not fall in
the same month, we might run into inconsistencies.

\begin{verbatim}
* Roll-up departure_time to 'Year'
* Roll-up flight to 'Airline'
* Dice departure = 'LA' and departure_time = '2007' and flight = 'American Airlines'
* Roll-up traveler to 'Consumer/Business'
* Roll-up arrival_time to 'Month'
* Slice traveler = 'Business'
\end{verbatim}

\question{9}{In graph mining what would be the advantage of the described apriori-based approach
over the pattern growth based approach (see lecture slides) and vice versa.}
The advantage of the apriori-based approach lies in its simplicity: its setup
and the computations required are straightforward, but that does not make it
efficient.

On the other hand, apriori uses the costly process of candidate generation and
testing, which makes it considerably `heavier' than the pattern-growth-based
approach; pattern growth also avoids the costly repeated database scans.

% http://en.wikipedia.org/wiki/Association_rule_learning

\end{document}