source: liacs/dbdm/dbdm_5/report.tex@ 344

Last change on this file since 344 was 73, checked in by Rick van der Zwet, 15 years ago

Final result of hard work

File size: 10.9 KB
RevLine 
[71]1\documentclass{report}
[73]2\usepackage{graphicx}
[71]3
4\title{Databases and Data Mining --- Assignment 5}
5\author{Rick van der Zwet}
6
7
8\begin{document}
[73]9\newcommand{\question}[2]{\begin{quotation}\noindent{\Large``}#1. \textit{#2}{\Large''}\end{quotation}}
[71]10\maketitle
11
12\question{1a}{Propose an algorithm, in pseudo-code, for the following: The automatic
13generation of a concept hierarchy for numerical data based on the equal-width
14partitioning rule.}
[72]15\begin{verbatim}
16input = [] # Input array of all input numbers
17num_intervals = %User input of number of intervals needed%
18max = maximum(input)
19min = minimum(input)
20interval_width = (max - min) / num_intervals
[71]21
[73]22output = [] # Output array, where the values of interval k
23 # is stored in value output[k]
24
[72]25for value in input:
26 interval = value / interval_width # Find it's correct bin
27 output[interval].append(value) # Put the value inside the bin
28endfor
29\end{verbatim}
30
[71]31\question{1b}{Propose an algorithm, in pseudo-code, for the following: The automatic
32generation of a concept hierarchy for numerical data based on the equal-frequency
33partitioning rule.}
[72]34\begin{verbatim}
35input = [] # Input array of all input numbers
36num_intervals = %User input of number of intervals needed%
37input_length = length(input) # Number of items in list
38interval_width = input_length / num_intervals
[71]39
[72]40sorted_input = sorted(input) # Sort items on value from small to large
41
[73]42output = [] # Output array, where the values of interval k
43 # is stored in value output[k]
[72]44
45interval = 0
46counter = 0
47for value in sorted_input:
48 output[interval].append(value) # Put the value inside the bin
49
50 # If the width has been 'filled' continue with the next one
51 counter++
52 if counter > interval_width:
53 interval++
54 counter = 0
55 endif
56endfor
57\end{verbatim}
58
[71]59\question{2}{Suppose that a data warehouse consists of four dimensions,
60 date, spectator, location, and game, and two measures, count, and charge, where
61 charge is the fare that a spectator pays when watching a game on a given date.
62 Spectators may be students, adults, or seniors, with each category having its own
63 charge rate.}
64\question{2a}{Draw a star schema diagram for the data warehouse.}
[73]65See figure~\ref{fig:star}.
[72]66% http://it.toolbox.com/blogs/enterprise-solutions/star-schema-modelling-data-warehouse-20803
[73]67\begin{figure}[htp]
68\centering
69\includegraphics[scale=0.6]{star-diagram.eps}
70\caption{Star schema diagram 2a data warehose}
71\label{fig:star}
72\end{figure}
[71]73
[72]74\question{2b}{Starting with the base cuboid [date, spectator, location, game],
75what specific OLAP operations should one perform in order to list the total
76charge paid by student spectators at GM\_Place 2004?}
77% http://en.wikipedia.org/wiki/OLAP_cube#OLAP_operations
78You first will need to \emph{slice} on the condition \texttt{game ==
79'GM\_Place'}. Secondly you will need to slice on \texttt{date.year == '2004'}.
80This will give you all the charges for GM Place in 2004. Next we slice to
81\texttt{spectator.type == 'student'}. Lastly we sum all the charges in the
82display phase (\texttt{pivot}).
83
84\question{2c}{Bitmap indexing is useful in data warehousing. Taking this cube
85as an example, briefly discuss advantages and problems of using a bitmap index
86structure.}
87% http://en.wikipedia.org/wiki/Bitmap_index
88Bitmap indexing in this case is useful to have a compact representation of for
89example the spectator type. As this can only be 4 options a four bit
90representation fits the need to store all possible combinations. This required
91binary operators to do searches in this set are quick, but their results will
92need to be processed before beeing able to represent it.
93One other advantage is the fact that the bitmap indexing compresses really well
94as patterns are reoccuring.
95
96
[71]97\question{3}{A data cube, C, has n dimensions, and each dimension
98 has exactly p distinct values in the base cuboid. Assume that there are no concept
99 hierarchies associated with the dimensions. What is the maximum number of cells
100 possible (including both base cells and aggregate cells) in the data cube, C?}
[72]101% http://www2.cs.uregina.ca/~dbd/cs831/notes/dcubes/dcubes.html
102Take for example 2 dimensions with each 4 distinct values in the base cuboid,
103this will a total of $4^2=16$ possibilities. Taking this more general will this be $p^n$.
[71]104
105\question{4}{The Apriori algorithm uses prior knowledge of subset support properties.}
106\question{4a}{Prove that all nonempty subsets of a frequent itemset must also be frequent.}
[72]107% http://en.wikipedia.org/wiki/Apriori_algorithm
108If an itemset $I$ is frequent it means it's occurence classifies a minimal
109support level, hence if you take a subset of $I$ ($I_{sub}$). The count will remain the
110same -or larger if the subset is also part of an other itemset $J$, which has
111distint matches from $I$- as subset $I$. So $I_{sub}$ will also be frequent if
112$I$ is frequent.
113
114\question{4b}{Given frequent itemset $l$ and subset $s$ of $l$, prove that the confidence of the rule
115 “$s’ => (l - s’)$” cannot be more than the confidence of the rule “$s => (l - s)$” where $s’$
116 is a subset of $s$.}
117If $s'$ has an higher confidence than $s$, than a smaller itemset have a higher
118conditional propability of also containing a larger itemset. If the repeat this
119assumption, you will end up on the left side with a empty set and on the right
120the full itemset $l$, which then would have the highest confidence. While this
121would not be valid. As rules like 'Customer buying nothing', will also buy
122'beer' can not be existing.
123
[71]124\question{5}{An optimization in frequent item set mining is mining closed patterns, or mining max
125 patterns instead. Describe the main differences of mining closed patterns and mining
126 max patterns.}
[72]127% http://www.dataminingarticles.com/closed-maximal-itemsets.html
128By definition the itemset $X$ is closed means $X$ is frequent and there exists
129no super-pattern $X /in Y$ wth the same support as $X$. While $X$ is max if $X$
130is frequent and there exists no \textit{frequent} super-pattern $X /in Y$.
131Maximum frequent itemsets has the downside that you don't not know the actual
132support of those sub-itemsets. The closed frequent itemsets helps purging a lot
133of itemsets that are not needed.
[71]134
135\question{6}{The price of each item in a store is nonnegative. For
136each of the following cases, identify the kinds of constraint they represent (e.g.
[73]137anti-monotonic, monotonic, succinct) and briefly discuss how to mine such association
[71]138rules efficiently:}
[72]139% www.cs.sfu.ca/CC/741/jpei/slides/ConstrainedFrequentPatterns.pdf
[73]140\question{6a}{Containing one free item and other items the sum of whose prices is at least \$190.}
141This is \emph{monotonic}, as adding items will never lower the sum of prices.
142Mining this will be bestly done by first finding the free items and next order
143the other items based on price, starting with the most expensive ones first.
144The soon you get to \$190 can al it's supersets to matching as well, next you
145remove the last item and try to match up again. Continue this till you removed
146the most expensive item and tried to match up again. Next \emph{JOIN} with all
147the free items and the list is complete.
[72]148
[71]149\question{6b}{Where the average price of all the items is between \$120 and \$520.}
[73]150This is convertible \emph{anti-monotonic} and convertible \emph{monotonic} if
151you look at one contrain. \emph{anti-monotonic} If itemset $S$ violates the
152sub-constraint $avg(S.Price) \le \$520$. so does every itemset with $S$ in
153prefix item value descending order. \emph{monotonic} If itemset $S$ constraint
154$avg(S.Price) \ge \$120$ so does every itemset having S as prefix item values
155descending order.
[71]156
[73]157Satifing both conditions how-ever does make it neither \emph{anti-monotonic}
158nor \emph{monotonic}. A fast way to generate this set is to use the algoritm
159used in 6a but modify the number constraint on-the-fly. Like average of 3 items
160between a centrain range (120 -- 520) It the same of checking whether the sum
161of 3 items is between ($3*120$ -- $3*520$).
162
[71]163\question{7}{Suppose a city has installed hundreds of surveillance cameras at strategic locations in
164busy streets. All the captured video (each stream being 640 pixels x 480 pixels, 24-
165bits (RGB) per pixel, 25fps) is streamed to a central observation post in real time.
166Describe an efficient system that would allow real time data mining and continuously
167querying these video streams for abnormal events, such as explosions, people in
168panic, etc.. Also discuss the computational costs and the memory requirements of
169your system.}
[72]170
[73]171Every camera generates $640 * 480 * 24 bits * 25fps = 180.000 kbit/sec \approx
17221.9 MByte/sec$. I would first use a compression algoritm to send-the-data over
173the wire by sending the changes only and not the full frame every time. Secondly
174applying abnomality detections using heuristics. Keeping the detected
175abnormalities available un-compressed and uncut and save all other data in a
176much lower resolution (1fps), with a compression ratio of 4:1 this would make
177storage stream of approx 19GB/camera/day. Processing power required depends on
178the algoritms used, but a 500Mhz CPU/camera should fit the bill when simple
179algoritms are used.
[72]180
[71]181\question{8}{A flight data warehouse for a travel agent consists of six
182dimensions: traveler, departure (city), departure\_time, arrival (city), arrival\_time,
183and flight; and two measures count, and avg\_fare, where avg\_fare stores the concrete
184fare at the lowest level but the average fare at the other levels. Suppose the cube is
185fully materialized. Starting with the base cuboid [traveler, departure, departure\_time,
186arrival, arrival\_time, flight], what specific OLAP operations (e.g roll-up flight to
187airline, etc.) should one perform in order to list the average fare per month for each
188business traveler who flies American Airlines (AA) from Los Angeles (LA) in the
189year 2007?}
[72]190
[73]191Note bit tricky as we use the departure\_time to determine the year and
192arrival\_time to determine the month of the flight. When these values are not
193set to the same day we might run into inconsistencies.
[72]194
[73]195
196\begin{verbatim}
197* Roll-up departure\_time to 'Year'
198* Roll-up flight to 'Airline'
199* Dice departure = 'LA' and departure_time = '2007' and flight = 'American Airlines'
200* Roll-up traveler to 'Consumer/Business'
201* Roll-up arrival_time to 'Month'
202* Slice traveler = 'Business'
203\end{verbatim}
204
[71]205\question{9}{In graph mining what would be the advantage of the described apriori-based approach
206over the pattern growth based approach (see lecture slides) and vice versa.}
[73]207The advantage comes in the simplicity of the algotitm when it get to
208computations needed and it's setup, but this does not make it effient.
[71]209
[73]210On the other hand apriori uses the costly process of candidate generation and
211testing, making it more 'heavy' then pattern grow based. and also avoid costly
212database scans.
213
[72]214% http://en.wikipedia.org/wiki/Association_rule_learning
215
[71]216\bibliography{report}
217
218\end{document}
Note: See TracBrowser for help on using the repository browser.