Context Navigation

report.tex@ 269

Last change on this file since 269 was 73, checked in by Rick van der Zwet, 15 years ago
Final result of hard work
File size: 10.9 KB

Line
1	\documentclass{report}
2	\usepackage{graphicx}
3
4	\title{Databases and Data Mining --- Assignment 5}
5	\author{Rick van der Zwet}
6
7
8	\begin{document}
9	\newcommand{\question}[2]{\begin{quotation}\noindent{\Large``}#1. \textit{#2}{\Large''}\end{quotation}}
10	\maketitle
11
12	\question{1a}{Propose an algorithm, in pseudo-code, for the following: The automatic
13	generation of a concept hierarchy for numerical data based on the equal-width
14	partitioning rule.}
15	\begin{verbatim}
16	input = [] # Input array of all input numbers
17	num_intervals = %User input of number of intervals needed%
18	max = maximum(input)
19	min = minimum(input)
20	interval_width = (max - min) / num_intervals
21
22	output = [] # Output array, where the values of interval k
23	# is stored in value output[k]
24
25	for value in input:
26	interval = value / interval_width # Find it's correct bin
27	output[interval].append(value) # Put the value inside the bin
28	endfor
29	\end{verbatim}
30
31	\question{1b}{Propose an algorithm, in pseudo-code, for the following: The automatic
32	generation of a concept hierarchy for numerical data based on the equal-frequency
33	partitioning rule.}
34	\begin{verbatim}
35	input = [] # Input array of all input numbers
36	num_intervals = %User input of number of intervals needed%
37	input_length = length(input) # Number of items in list
38	interval_width = input_length / num_intervals
39
40	sorted_input = sorted(input) # Sort items on value from small to large
41
42	output = [] # Output array, where the values of interval k
43	# is stored in value output[k]
44
45	interval = 0
46	counter = 0
47	for value in sorted_input:
48	output[interval].append(value) # Put the value inside the bin
49
50	# If the width has been 'filled' continue with the next one
51	counter++
52	if counter > interval_width:
53	interval++
54	counter = 0
55	endif
56	endfor
57	\end{verbatim}
58
59	\question{2}{Suppose that a data warehouse consists of four dimensions,
60	date, spectator, location, and game, and two measures, count, and charge, where
61	charge is the fare that a spectator pays when watching a game on a given date.
62	Spectators may be students, adults, or seniors, with each category having its own
63	charge rate.}
64	\question{2a}{Draw a star schema diagram for the data warehouse.}
65	See figure~\ref{fig:star}.
66	% http://it.toolbox.com/blogs/enterprise-solutions/star-schema-modelling-data-warehouse-20803
67	\begin{figure}[htp]
68	\centering
69	\includegraphics[scale=0.6]{star-diagram.eps}
70	\caption{Star schema diagram 2a data warehose}
71	\label{fig:star}
72	\end{figure}
73
74	\question{2b}{Starting with the base cuboid [date, spectator, location, game],
75	what specific OLAP operations should one perform in order to list the total
76	charge paid by student spectators at GM\_Place 2004?}
77	% http://en.wikipedia.org/wiki/OLAP_cube#OLAP_operations
78	You first will need to \emph{slice} on the condition \texttt{game ==
79	'GM\_Place'}. Secondly you will need to slice on \texttt{date.year == '2004'}.
80	This will give you all the charges for GM Place in 2004. Next we slice to
81	\texttt{spectator.type == 'student'}. Lastly we sum all the charges in the
82	display phase (\texttt{pivot}).
83
84	\question{2c}{Bitmap indexing is useful in data warehousing. Taking this cube
85	as an example, briefly discuss advantages and problems of using a bitmap index
86	structure.}
87	% http://en.wikipedia.org/wiki/Bitmap_index
88	Bitmap indexing in this case is useful to have a compact representation of for
89	example the spectator type. As this can only be 4 options a four bit
90	representation fits the need to store all possible combinations. This required
91	binary operators to do searches in this set are quick, but their results will
92	need to be processed before beeing able to represent it.
93	One other advantage is the fact that the bitmap indexing compresses really well
94	as patterns are reoccuring.
95
96
97	\question{3}{A data cube, C, has n dimensions, and each dimension
98	has exactly p distinct values in the base cuboid. Assume that there are no concept
99	hierarchies associated with the dimensions. What is the maximum number of cells
100	possible (including both base cells and aggregate cells) in the data cube, C?}
101	% http://www2.cs.uregina.ca/~dbd/cs831/notes/dcubes/dcubes.html
102	Take for example 2 dimensions with each 4 distinct values in the base cuboid,
103	this will a total of $4^2=16$ possibilities. Taking this more general will this be $p^n$.
104
105	\question{4}{The Apriori algorithm uses prior knowledge of subset support properties.}
106	\question{4a}{Prove that all nonempty subsets of a frequent itemset must also be frequent.}
107	% http://en.wikipedia.org/wiki/Apriori_algorithm
108	If an itemset $I$ is frequent it means it's occurence classifies a minimal
109	support level, hence if you take a subset of $I$ ($I_{sub}$). The count will remain the
110	same -or larger if the subset is also part of an other itemset $J$, which has
111	distint matches from $I$- as subset $I$. So $I_{sub}$ will also be frequent if
112	$I$ is frequent.
113
114	\question{4b}{Given frequent itemset $l$ and subset $s$ of $l$, prove that the confidence of the rule
115	â$sâ => (l - sâ)$â cannot be more than the confidence of the rule â$s => (l - s)$â where $sâ$
116	is a subset of $s$.}
117	If $s'$ has an higher confidence than $s$, than a smaller itemset have a higher
118	conditional propability of also containing a larger itemset. If the repeat this
119	assumption, you will end up on the left side with a empty set and on the right
120	the full itemset $l$, which then would have the highest confidence. While this
121	would not be valid. As rules like 'Customer buying nothing', will also buy
122	'beer' can not be existing.
123
124	\question{5}{An optimization in frequent item set mining is mining closed patterns, or mining max
125	patterns instead. Describe the main differences of mining closed patterns and mining
126	max patterns.}
127	% http://www.dataminingarticles.com/closed-maximal-itemsets.html
128	By definition the itemset $X$ is closed means $X$ is frequent and there exists
129	no super-pattern $X /in Y$ wth the same support as $X$. While $X$ is max if $X$
130	is frequent and there exists no \textit{frequent} super-pattern $X /in Y$.
131	Maximum frequent itemsets has the downside that you don't not know the actual
132	support of those sub-itemsets. The closed frequent itemsets helps purging a lot
133	of itemsets that are not needed.
134
135	\question{6}{The price of each item in a store is nonnegative. For
136	each of the following cases, identify the kinds of constraint they represent (e.g.
137	anti-monotonic, monotonic, succinct) and briefly discuss how to mine such association
138	rules efficiently:}
139	% www.cs.sfu.ca/CC/741/jpei/slides/ConstrainedFrequentPatterns.pdf
140	\question{6a}{Containing one free item and other items the sum of whose prices is at least \$190.}
141	This is \emph{monotonic}, as adding items will never lower the sum of prices.
142	Mining this will be bestly done by first finding the free items and next order
143	the other items based on price, starting with the most expensive ones first.
144	The soon you get to \$190 can al it's supersets to matching as well, next you
145	remove the last item and try to match up again. Continue this till you removed
146	the most expensive item and tried to match up again. Next \emph{JOIN} with all
147	the free items and the list is complete.
148
149	\question{6b}{Where the average price of all the items is between \$120 and \$520.}
150	This is convertible \emph{anti-monotonic} and convertible \emph{monotonic} if
151	you look at one contrain. \emph{anti-monotonic} If itemset $S$ violates the
152	sub-constraint $avg(S.Price) \le \$520$. so does every itemset with $S$ in
153	prefix item value descending order. \emph{monotonic} If itemset $S$ constraint
154	$avg(S.Price) \ge \$120$ so does every itemset having S as prefix item values
155	descending order.
156
157	Satifing both conditions how-ever does make it neither \emph{anti-monotonic}
158	nor \emph{monotonic}. A fast way to generate this set is to use the algoritm
159	used in 6a but modify the number constraint on-the-fly. Like average of 3 items
160	between a centrain range (120 -- 520) It the same of checking whether the sum
161	of 3 items is between ($3120$ -- $3520$).
162
163	\question{7}{Suppose a city has installed hundreds of surveillance cameras at strategic locations in
164	busy streets. All the captured video (each stream being 640 pixels x 480 pixels, 24-
165	bits (RGB) per pixel, 25fps) is streamed to a central observation post in real time.
166	Describe an efficient system that would allow real time data mining and continuously
167	querying these video streams for abnormal events, such as explosions, people in
168	panic, etc.. Also discuss the computational costs and the memory requirements of
169	your system.}
170
171	Every camera generates $640 * 480 * 24 bits * 25fps = 180.000 kbit/sec \approx
172	21.9 MByte/sec$. I would first use a compression algoritm to send-the-data over
173	the wire by sending the changes only and not the full frame every time. Secondly
174	applying abnomality detections using heuristics. Keeping the detected
175	abnormalities available un-compressed and uncut and save all other data in a
176	much lower resolution (1fps), with a compression ratio of 4:1 this would make
177	storage stream of approx 19GB/camera/day. Processing power required depends on
178	the algoritms used, but a 500Mhz CPU/camera should fit the bill when simple
179	algoritms are used.
180
181	\question{8}{A flight data warehouse for a travel agent consists of six
182	dimensions: traveler, departure (city), departure\_time, arrival (city), arrival\_time,
183	and flight; and two measures count, and avg\_fare, where avg\_fare stores the concrete
184	fare at the lowest level but the average fare at the other levels. Suppose the cube is
185	fully materialized. Starting with the base cuboid [traveler, departure, departure\_time,
186	arrival, arrival\_time, flight], what specific OLAP operations (e.g roll-up flight to
187	airline, etc.) should one perform in order to list the average fare per month for each
188	business traveler who flies American Airlines (AA) from Los Angeles (LA) in the
189	year 2007?}
190
191	Note bit tricky as we use the departure\_time to determine the year and
192	arrival\_time to determine the month of the flight. When these values are not
193	set to the same day we might run into inconsistencies.
194
195
196	\begin{verbatim}
197	* Roll-up departure\_time to 'Year'
198	* Roll-up flight to 'Airline'
199	* Dice departure = 'LA' and departure_time = '2007' and flight = 'American Airlines'
200	* Roll-up traveler to 'Consumer/Business'
201	* Roll-up arrival_time to 'Month'
202	* Slice traveler = 'Business'
203	\end{verbatim}
204
205	\question{9}{In graph mining what would be the advantage of the described apriori-based approach
206	over the pattern growth based approach (see lecture slides) and vice versa.}
207	The advantage comes in the simplicity of the algotitm when it get to
208	computations needed and it's setup, but this does not make it effient.
209
210	On the other hand apriori uses the costly process of candidate generation and
211	testing, making it more 'heavy' then pattern grow based. and also avoid costly
212	database scans.
213
214	% http://en.wikipedia.org/wiki/Association_rule_learning
215
216	\bibliography{report}
217
218	\end{document}

Note: See TracBrowser for help on using the repository browser.

Context Navigation

source: liacs/dbdm/dbdm_5/report.tex@ 269

Download in other formats: