\documentclass{report}
\usepackage{graphicx}

\title{Databases and Data Mining --- Assignment 5}
\author{Rick van der Zwet}


\begin{document}
\newcommand{\question}[2]{\begin{quotation}\noindent{\Large``}#1. \textit{#2}{\Large''}\end{quotation}}
\maketitle

\question{1a}{Propose an algorithm, in pseudo-code, for the following: The automatic
generation of a concept hierarchy for numerical data based on the equal-width
partitioning rule.}
\begin{verbatim}
input = []    # Input array of all input numbers
num_intervals = %User input of number of intervals needed%
max_value = maximum(input)
min_value = minimum(input)
interval_width = (max_value - min_value) / num_intervals

output = []   # Output array, where the values of interval k
              # are stored in output[k]

for value in input:
    # Find the correct bin; the maximum value is clamped into the last bin
    interval = floor((value - min_value) / interval_width)
    if interval == num_intervals:
        interval = num_intervals - 1
    endif
    output[interval].append(value)    # Put the value inside the bin
endfor
\end{verbatim}
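For reference, a minimal runnable version of the same idea in Python (the
function and variable names are my own):
\begin{verbatim}
def equal_width_bins(values, num_intervals):
    """Partition values into num_intervals bins of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_intervals
    bins = [[] for _ in range(num_intervals)]
    for v in values:
        # The maximum value would land in bin num_intervals; clamp it.
        k = min(int((v - lo) / width), num_intervals - 1)
        bins[k].append(v)
    return bins

print(equal_width_bins([1, 3, 5, 7, 9, 11], 3))
# [[1, 3], [5, 7], [9, 11]]
\end{verbatim}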

\question{1b}{Propose an algorithm, in pseudo-code, for the following: The automatic
generation of a concept hierarchy for numerical data based on the equal-frequency
partitioning rule.}
\begin{verbatim}
input = []    # Input array of all input numbers
num_intervals = %User input of number of intervals needed%
input_length = length(input)    # Number of items in list
interval_depth = ceiling(input_length / num_intervals)    # Items per bin

sorted_input = sorted(input)    # Sort items on value from small to large

output = []   # Output array, where the values of interval k
              # are stored in output[k]

interval = 0
counter = 0
for value in sorted_input:
    output[interval].append(value)    # Put the value inside the bin

    # If the current bin has been filled, continue with the next one
    counter++
    if counter >= interval_depth:
        interval++
        counter = 0
    endif
endfor
\end{verbatim}
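Again a minimal runnable Python version for reference (names are my own;
rounding the depth up keeps the number of bins at most \texttt{num\_intervals}):
\begin{verbatim}
import math

def equal_frequency_bins(values, num_intervals):
    """Partition values into bins holding (roughly) equally many items."""
    depth = math.ceil(len(values) / num_intervals)  # items per bin
    ordered = sorted(values)
    return [ordered[i:i + depth] for i in range(0, len(ordered), depth)]

print(equal_frequency_bins([5, 1, 9, 3, 11, 7], 3))
# [[1, 3], [5, 7], [9, 11]]
\end{verbatim}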

\question{2}{Suppose that a data warehouse consists of four dimensions,
date, spectator, location, and game, and two measures, count, and charge, where
charge is the fare that a spectator pays when watching a game on a given date.
Spectators may be students, adults, or seniors, with each category having its own
charge rate.}
\question{2a}{Draw a star schema diagram for the data warehouse.}
See figure~\ref{fig:star}.
% http://it.toolbox.com/blogs/enterprise-solutions/star-schema-modelling-data-warehouse-20803
\begin{figure}[htp]
  \centering
  \includegraphics[scale=0.6]{star-diagram.eps}
  \caption{Star schema diagram for the data warehouse of question 2a}
  \label{fig:star}
\end{figure}

\question{2b}{Starting with the base cuboid [date, spectator, location, game],
what specific OLAP operations should one perform in order to list the total
charge paid by student spectators at GM\_Place in 2004?}
% http://en.wikipedia.org/wiki/OLAP_cube#OLAP_operations
You first need to \emph{slice} on the condition \texttt{location ==
'GM\_Place'} (note that GM\_Place is a location, not a game). Secondly, roll up
date to the year level and slice on \texttt{date.year == '2004'}. This gives
you all the charges for GM Place in 2004. Next we slice on
\texttt{spectator.type == 'student'}. Lastly we sum all the charges in the
display phase (\texttt{pivot}).
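
As an illustration, the same selection and aggregation over a flat fact table
in Python with pandas (the table layout and column names are assumptions of
mine):
\begin{verbatim}
import pandas as pd

# Hypothetical flat view of the fact table joined with its dimensions.
facts = pd.DataFrame({
    'year':           [2004, 2004, 2004, 2003],
    'location':       ['GM_Place', 'GM_Place', 'Other', 'GM_Place'],
    'spectator_type': ['student', 'adult', 'student', 'student'],
    'charge':         [10.0, 25.0, 10.0, 10.0],
})

total = facts[(facts['location'] == 'GM_Place')
              & (facts['year'] == 2004)
              & (facts['spectator_type'] == 'student')]['charge'].sum()
print(total)  # 10.0
\end{verbatim}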

\question{2c}{Bitmap indexing is useful in data warehousing. Taking this cube
as an example, briefly discuss advantages and problems of using a bitmap index
structure.}
% http://en.wikipedia.org/wiki/Bitmap_index
Bitmap indexing is useful in this case to obtain a compact representation of,
for example, the spectator type: as it can only take three distinct values
(student, adult, senior), three bit vectors suffice to encode every possible
value. Searches on this index reduce to fast bitwise operations, although
their results need to be translated back into records before they can be
presented. Another advantage is that a bitmap index compresses really well, as
its bit patterns are highly repetitive. A problem is that bitmap indexes only
pay off for low-cardinality dimensions such as spectator type; for
high-cardinality dimensions such as date they grow large, and keeping them up
to date under inserts is expensive.
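
A toy sketch of the idea in Python (the representation is my own: one bit
vector per distinct value, stored as an integer bitset, with queries as
bitwise \texttt{AND}):
\begin{verbatim}
spectators = ['student', 'adult', 'student', 'senior', 'student']
years      = [2004, 2004, 2003, 2004, 2004]

# Build one bit vector per distinct value; bit i represents record i.
type_bitmaps, year_bitmaps = {}, {}
for row, kind in enumerate(spectators):
    type_bitmaps[kind] = type_bitmaps.get(kind, 0) | (1 << row)
for row, year in enumerate(years):
    year_bitmaps[year] = year_bitmaps.get(year, 0) | (1 << row)

# "student AND 2004" is a single bitwise AND over two vectors.
hits = type_bitmaps['student'] & year_bitmaps[2004]
print([row for row in range(len(spectators)) if hits & (1 << row)])
# [0, 4]
\end{verbatim}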

\question{3}{A data cube, C, has n dimensions, and each dimension
has exactly p distinct values in the base cuboid. Assume that there are no concept
hierarchies associated with the dimensions. What is the maximum number of cells
possible (including both base cells and aggregate cells) in the data cube, C?}
% http://www2.cs.uregina.ca/~dbd/cs831/notes/dcubes/dcubes.html
In every cell, each dimension takes either one of its $p$ distinct values or
the aggregate value $*$ (``all''), giving $p+1$ choices per dimension, so the
maximum number of cells is $(p+1)^n$. Take for example $n=2$ dimensions with
$p=4$ distinct values each: the base cuboid contributes $4^2=16$ base cells,
and together with the aggregate cells the cube holds at most $(4+1)^2=25$
cells.

\question{4}{The Apriori algorithm uses prior knowledge of subset support properties.}
\question{4a}{Prove that all nonempty subsets of a frequent itemset must also be frequent.}
% http://en.wikipedia.org/wiki/Apriori_algorithm
If an itemset $I$ is frequent, its occurrence count meets the minimum support
threshold. Every transaction that contains $I$ also contains each nonempty
subset $I_{sub}$ of $I$, so the count of $I_{sub}$ is at least that of $I$; it
can only be larger, when $I_{sub}$ also occurs in transactions that do not
contain all of $I$. Hence $I_{sub}$ is frequent whenever $I$ is.
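
A quick numerical illustration in Python (the transactions are invented):
\begin{verbatim}
transactions = [
    {'beer', 'chips', 'salsa'},
    {'beer', 'chips'},
    {'beer', 'salsa'},
    {'chips'},
]

def support(itemset):
    """Number of transactions containing every item of itemset."""
    return sum(itemset <= t for t in transactions)

# Every transaction containing {beer, chips} also contains {beer},
# so the subset's support can only be equal or larger.
print(support({'beer', 'chips'}))  # 2
print(support({'beer'}))           # 3
\end{verbatim}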

\question{4b}{Given frequent itemset $l$ and subset $s$ of $l$, prove that the
confidence of the rule ``$s' \Rightarrow (l - s')$'' cannot be more than the
confidence of the rule ``$s \Rightarrow (l - s)$'', where $s'$ is a subset of $s$.}
The confidence of a rule $s \Rightarrow (l - s)$ equals
$\mathrm{support}(l)/\mathrm{support}(s)$. Since $s'$ is a subset of $s$, by
question 4a we have $\mathrm{support}(s') \ge \mathrm{support}(s)$, so the
rule based on $s'$ divides by a denominator that is at least as large, and its
confidence is at most that of the rule based on $s$. Intuitively: if shrinking
the left-hand side could raise the confidence, then repeating this would leave
an empty set on the left and the full itemset $l$ on the right as the rule
with the highest confidence, while rules like ``customers buying nothing will
also buy beer'' cannot be the strongest rules.
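
Written out as a single chain of (in)equalities:
\[
\mathrm{conf}(s' \Rightarrow (l - s'))
  = \frac{\mathrm{support}(l)}{\mathrm{support}(s')}
  \le \frac{\mathrm{support}(l)}{\mathrm{support}(s)}
  = \mathrm{conf}(s \Rightarrow (l - s)).
\]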

\question{5}{An optimization in frequent item set mining is mining closed patterns, or mining max
patterns instead. Describe the main differences of mining closed patterns and mining
max patterns.}
% http://www.dataminingarticles.com/closed-maximal-itemsets.html
By definition an itemset $X$ is closed if $X$ is frequent and there exists no
super-pattern $Y \supset X$ with the same support as $X$, while $X$ is max if
$X$ is frequent and there exists no \textit{frequent} super-pattern $Y
\supset X$. Maximal frequent itemsets have the downside that you do not know
the actual support of their sub-itemsets, whereas closed frequent itemsets
preserve this information while still purging a lot of itemsets that are not
needed.
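
A brute-force toy example in Python makes the difference concrete (the
dataset and the minimum support are invented):
\begin{verbatim}
from itertools import combinations

transactions = [{'a', 'b', 'c'}, {'a', 'b'}, {'a', 'c'}, {'a'}]
min_support = 2
items = sorted(set().union(*transactions))

def support(itemset):
    return sum(itemset <= t for t in transactions)

frequent = [frozenset(c)
            for r in range(1, len(items) + 1)
            for c in combinations(items, r)
            if support(set(c)) >= min_support]

# Closed: no frequent superset with the *same* support.
closed = [x for x in frequent
          if not any(x < y and support(y) == support(x) for y in frequent)]
# Max: no frequent superset at all.
maximal = [x for x in frequent if not any(x < y for y in frequent)]

print([sorted(x) for x in closed])   # [['a'], ['a', 'b'], ['a', 'c']]
print([sorted(x) for x in maximal])  # [['a', 'b'], ['a', 'c']]
\end{verbatim}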

\question{6}{The price of each item in a store is nonnegative. For
each of the following cases, identify the kinds of constraint they represent (e.g.
anti-monotonic, monotonic, succinct) and briefly discuss how to mine such association
rules efficiently:}
% www.cs.sfu.ca/CC/741/jpei/slides/ConstrainedFrequentPatterns.pdf
\question{6a}{Containing one free item and other items the sum of whose prices is at least \$190.}
This is \emph{monotonic}: adding items will never lower the sum of prices, so
once an itemset satisfies the constraint all of its supersets do too. Mining
is best done by first finding the free items and then ordering the other
items by price, starting with the most expensive ones. As soon as such a
prefix reaches \$190, all of its supersets match as well; next remove the
last added item and try to reach \$190 again, and continue until the most
expensive item has been removed and tried. Finally \emph{join} the result
with the free items, and the list is complete.
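
A minimal sketch of the constraint check in Python (the price table is
invented, and I read ``one free item'' as ``at least one''):
\begin{verbatim}
prices = {'tv': 200, 'dvd': 150, 'cd': 50, 'pen': 0, 'bag': 0}

def satisfies(itemset):
    """At least one free item, other items summing to at least $190."""
    has_free = any(prices[i] == 0 for i in itemset)
    paid_sum = sum(prices[i] for i in itemset if prices[i] > 0)
    return has_free and paid_sum >= 190

print(satisfies({'pen', 'tv'}))        # True
print(satisfies({'pen', 'cd'}))        # False
# Monotone: adding items never invalidates a satisfying itemset.
print(satisfies({'pen', 'tv', 'cd'}))  # True
\end{verbatim}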

\question{6b}{Where the average price of all the items is between \$120 and \$520.}
This constraint is convertible \emph{anti-monotonic} and convertible
\emph{monotonic} if you look at each bound separately, using one item order:
list the items by descending price. Convertible \emph{anti-monotonic}: if an
itemset $S$ violates the sub-constraint $avg(S.Price) \ge \$120$, then so
does every itemset having $S$ as a prefix in this order, since the appended
items are ever cheaper and only pull the average further down. Convertible
\emph{monotonic}: if $S$ satisfies the sub-constraint $avg(S.Price) \le
\$520$, then so does every itemset having $S$ as a prefix in this order.

Satisfying both conditions at once, however, makes the combined constraint
neither \emph{anti-monotonic} nor \emph{monotonic}. A fast way to generate
the result set is to reuse the algorithm of question 6a but adjust the bound
on-the-fly: checking whether the average of, say, 3 items lies in the range
\$120 -- \$520 is the same as checking whether their sum lies in the range
$3 \times \$120$ -- $3 \times \$520$.
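
The bound rewriting is easy to sanity-check in Python:
\begin{verbatim}
def avg_in_range(item_prices, lo=120, hi=520):
    """avg(S) in [lo, hi]  <=>  sum(S) in [k*lo, k*hi] with k = |S|."""
    k, total = len(item_prices), sum(item_prices)
    return k * lo <= total <= k * hi

print(avg_in_range([150, 200, 400]))  # True  (average is 250)
print(avg_in_range([600, 600]))       # False (average is 600)
\end{verbatim}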

\question{7}{Suppose a city has installed hundreds of surveillance cameras at strategic locations in
busy streets. All the captured video (each stream being 640 pixels x 480 pixels, 24-
bits (RGB) per pixel, 25fps) is streamed to a central observation post in real time.
Describe an efficient system that would allow real time data mining and continuously
querying these video streams for abnormal events, such as explosions, people in
panic, etc. Also discuss the computational costs and the memory requirements of
your system.}

Every camera generates $640 \times 480 \times 24\,\mathrm{bits} \times
25\,\mathrm{fps} \approx 184\,\mathrm{Mbit/s} \approx 22\,\mathrm{MiB/s}$. I
would first use a compression algorithm to send the data over the wire,
transmitting only the changes between frames rather than every full frame.
Secondly, abnormality detection is applied using heuristics. The detected
abnormalities are kept available uncompressed and uncut, while all other data
is saved at a much lower rate (1 fps); with a compression ratio of 4:1 this
yields a storage stream of approximately 20 GB per camera per day. The
processing power required depends on the algorithms used, but a 500 MHz CPU
per camera should fit the bill when simple algorithms are used.
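
For reference, the storage figure works out as follows (an uncompressed RGB
frame is $640 \times 480 \times 3$ bytes, stored at 1 fps with 4:1
compression):
\[
\frac{640 \times 480 \times 3\,\mathrm{B}}{4} \times 86{,}400\,\mathrm{s/day}
\approx 19.9\,\mathrm{GB/day}.
\]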

\question{8}{A flight data warehouse for a travel agent consists of six
dimensions: traveler, departure (city), departure\_time, arrival (city), arrival\_time,
and flight; and two measures count, and avg\_fare, where avg\_fare stores the concrete
fare at the lowest level but the average fare at the other levels. Suppose the cube is
fully materialized. Starting with the base cuboid [traveler, departure, departure\_time,
arrival, arrival\_time, flight], what specific OLAP operations (e.g.\ roll-up flight to
airline, etc.) should one perform in order to list the average fare per month for each
business traveler who flies American Airlines (AA) from Los Angeles (LA) in the
year 2007?}

Note that this is a bit tricky, as we use departure\_time to determine the
year and arrival\_time to determine the month of a flight. When these do not
fall within the same month, we may run into inconsistencies.

\begin{verbatim}
* Roll-up departure_time to 'Year'
* Roll-up flight to 'Airline'
* Dice departure = 'LA' and departure_time = '2007' and flight = 'American Airlines'
* Roll-up traveler to 'Consumer/Business'
* Roll-up arrival_time to 'Month'
* Slice traveler = 'Business'
\end{verbatim}
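
The same pipeline over a flat view of the cube in pandas (column names and
data are assumptions of mine):
\begin{verbatim}
import pandas as pd

flights = pd.DataFrame({
    'traveler':      ['Alice', 'Alice', 'Bob', 'Alice'],
    'traveler_type': ['Business', 'Business', 'Consumer', 'Business'],
    'departure':     ['LA', 'LA', 'LA', 'LA'],
    'dep_year':      [2007, 2007, 2007, 2007],
    'arr_month':     [1, 1, 2, 2],
    'airline':       ['AA', 'AA', 'AA', 'AA'],
    'fare':          [300.0, 500.0, 200.0, 400.0],
})

sel = flights[(flights['departure'] == 'LA')
              & (flights['dep_year'] == 2007)
              & (flights['airline'] == 'AA')
              & (flights['traveler_type'] == 'Business')]
print(sel.groupby(['traveler', 'arr_month'])['fare'].mean())
# Alice, month 1 -> 400.0 ; Alice, month 2 -> 400.0
\end{verbatim}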

\question{9}{In graph mining what would be the advantage of the described apriori-based approach
over the pattern growth based approach (see lecture slides) and vice versa.}
The advantage of the apriori-based approach lies in the simplicity of its
setup and of the computations needed, but this does not make it efficient.

The pattern-growth-based approach, on the other hand, avoids the costly
process of candidate generation and testing that makes apriori 'heavy', and
it also avoids repeated, costly database scans, at the price of a more
involved algorithm.
% http://en.wikipedia.org/wiki/Association_rule_learning

\end{document}