\documentclass{report}
\usepackage{graphicx}

\title{Databases and Data Mining --- Assignment 5}
\author{Rick van der Zwet}


\begin{document}
\newcommand{\question}[2]{\begin{quotation}\noindent{\Large``}#1. \textit{#2}{\Large''}\end{quotation}}
\maketitle

\question{1a}{Propose an algorithm, in pseudo-code, for the following: The automatic
generation of a concept hierarchy for numerical data based on the equal-width
partitioning rule.}
\begin{verbatim}
input = []          # Input array of all input numbers
num_intervals = %User input of number of intervals needed%
max = maximum(input)
min = minimum(input)
interval_width = (max - min) / num_intervals

output = []         # Output array, where the values of interval k
                    # are stored in output[k]

for value in input:
    # Find the correct bin, offset by the minimum value
    interval = floor((value - min) / interval_width)
    # The maximum value lands exactly on the upper bound;
    # clamp it into the last bin
    if interval == num_intervals:
        interval = num_intervals - 1
    endif
    output[interval].append(value)  # Put the value inside the bin
endfor
\end{verbatim}
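
The pseudo-code translates almost directly into Python. Below is a minimal
runnable sketch; the function name \texttt{equal\_width} is my own choice and
not part of the assignment:
\begin{verbatim}
def equal_width(values, num_intervals):
    # Partition values into num_intervals bins of equal width.
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_intervals
    output = [[] for _ in range(num_intervals)]
    for value in values:
        k = int((value - lo) / width)   # index of the bin
        k = min(k, num_intervals - 1)   # clamp the maximum value
        output[k].append(value)
    return output

# Example: 8 values over the range 1..10, 3 bins of width 3
print(equal_width([1, 2, 3, 4, 5, 6, 7, 10], 3))
# [[1, 2, 3], [4, 5, 6], [7, 10]]
\end{verbatim}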

\question{1b}{Propose an algorithm, in pseudo-code, for the following: The automatic
generation of a concept hierarchy for numerical data based on the equal-frequency
partitioning rule.}
\begin{verbatim}
input = []          # Input array of all input numbers
num_intervals = %User input of number of intervals needed%
input_length = length(input)  # Number of items in the list
interval_width = input_length / num_intervals

sorted_input = sorted(input)  # Sort items on value, from small to large

output = []         # Output array, where the values of interval k
                    # are stored in output[k]

interval = 0
counter = 0
for value in sorted_input:
    output[interval].append(value)  # Put the value inside the bin

    # Once the bin has been filled, continue with the next one
    counter++
    if counter >= interval_width and interval < num_intervals - 1:
        interval++
        counter = 0
    endif
endfor
\end{verbatim}
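
Again a minimal runnable Python sketch; the slicing-based formulation (handing
the remainder items to the earlier bins) is my own simplification:
\begin{verbatim}
def equal_frequency(values, num_intervals):
    # Partition the sorted values into num_intervals bins of
    # (nearly) equal size.
    data = sorted(values)
    n = len(data)
    output, start = [], 0
    for k in range(num_intervals):
        size = n // num_intervals + (1 if k < n % num_intervals else 0)
        output.append(data[start:start + size])
        start += size
    return output

# Example: 8 values into 3 bins of sizes 3, 3 and 2
print(equal_frequency([7, 1, 5, 2, 10, 4, 3, 6], 3))
# [[1, 2, 3], [4, 5, 6], [7, 10]]
\end{verbatim}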

\question{2}{Suppose that a data warehouse consists of four dimensions,
date, spectator, location, and game, and two measures, count, and charge, where
charge is the fare that a spectator pays when watching a game on a given date.
Spectators may be students, adults, or seniors, with each category having its own
charge rate.}
\question{2a}{Draw a star schema diagram for the data warehouse.}
See figure~\ref{fig:star}.
% http://it.toolbox.com/blogs/enterprise-solutions/star-schema-modelling-data-warehouse-20803
\begin{figure}[htp]
\centering
\includegraphics[scale=0.6]{star-diagram.eps}
\caption{Star schema diagram for the data warehouse of question 2a}
\label{fig:star}
\end{figure}

\question{2b}{Starting with the base cuboid [date, spectator, location, game],
what specific OLAP operations should one perform in order to list the total
charge paid by student spectators at GM\_Place in 2004?}
% http://en.wikipedia.org/wiki/OLAP_cube#OLAP_operations
You first need to \emph{roll up} date to the year level and then \emph{slice}
on the condition \texttt{location == 'GM\_Place'} (note that GM\_Place is a
location, not a game). Next you slice on \texttt{date.year == '2004'}, which
yields all charges for GM Place in 2004, and then slice on
\texttt{spectator.type == 'student'}. Lastly, sum all the remaining charges in
the display phase (\texttt{pivot}).
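
The same sequence can be imitated on a flat fact table. A hypothetical
\texttt{pandas} sketch (the table and its column names are my own assumptions,
not given in the assignment):
\begin{verbatim}
import pandas as pd

# Toy base cuboid: one row per (date, spectator, location, game) cell.
fact = pd.DataFrame({
    "location":  ["GM_Place", "GM_Place", "Stadium_B"],
    "year":      [2004, 2004, 2004],
    "spectator": ["student", "adult", "student"],
    "charge":    [10.0, 25.0, 12.0],
})

# Three slices, then aggregation of the charge measure.
total = fact[(fact.location == "GM_Place")
             & (fact.year == 2004)
             & (fact.spectator == "student")]["charge"].sum()
print(total)  # 10.0
\end{verbatim}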

\question{2c}{Bitmap indexing is useful in data warehousing. Taking this cube
as an example, briefly discuss advantages and problems of using a bitmap index
structure.}
% http://en.wikipedia.org/wiki/Bitmap_index
Bitmap indexing is useful here for dimensions with low cardinality, such as
the spectator type. As there are only three options (student, adult, senior),
three bit vectors suffice to store all possible values, with one bit per
record in each vector. Searches then reduce to fast bitwise operations (AND,
OR, NOT), although their results still have to be translated back into record
identifiers before they can be presented. Another advantage is that bitmap
indexes compress very well, as the bit patterns are highly repetitive. A
problem is that the index becomes large and expensive to maintain for
high-cardinality dimensions such as date.
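
A toy illustration of the idea in Python, using integers as bit vectors (the
encoding is my own, purely for illustration):
\begin{verbatim}
# Spectator type per row of the fact table.
records = ["student", "adult", "student", "senior", "student"]

# One bit vector per distinct value; bit i is set if row i matches.
bitmaps = {}
for i, kind in enumerate(records):
    bitmaps[kind] = bitmaps.get(kind, 0) | (1 << i)

# "student OR senior" is a single bitwise OR ...
mask = bitmaps["student"] | bitmaps["senior"]
# ... but the result must be decoded back into row numbers.
rows = [i for i in range(len(records)) if mask & (1 << i)]
print(rows)  # [0, 2, 3, 4]
\end{verbatim}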


\question{3}{A data cube, C, has n dimensions, and each dimension
has exactly p distinct values in the base cuboid. Assume that there are no concept
hierarchies associated with the dimensions. What is the maximum number of cells
possible (including both base cells and aggregate cells) in the data cube, C?}
% http://www2.cs.uregina.ca/~dbd/cs831/notes/dcubes/dcubes.html
Each cell fixes every dimension either to one of its $p$ distinct values or to
the aggregate value `*' (``all''), giving $p + 1$ choices per dimension. Take
for example 2 dimensions with 4 distinct values each: this gives a total of
$(4+1)^2 = 25$ cells, of which $4^2 = 16$ are base cells and the remaining 9
are aggregate cells. In general the maximum number of cells is $(p+1)^n$.
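
A brute-force check of this count in Python (a verification sketch I added;
\texttt{"*"} marks an aggregated dimension):
\begin{verbatim}
from itertools import product

n, p = 2, 4
# Each dimension takes one of p values or '*' (aggregated away).
cells = list(product(list(range(p)) + ["*"], repeat=n))
base = [c for c in cells if "*" not in c]
print(len(cells), len(base))  # 25 16, i.e. (p+1)**n and p**n
\end{verbatim}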

\question{4}{The Apriori algorithm uses prior knowledge of subset support properties.}
\question{4a}{Prove that all nonempty subsets of a frequent itemset must also be frequent.}
% http://en.wikipedia.org/wiki/Apriori_algorithm
If an itemset $I$ is frequent, its support count reaches the minimal support
threshold. Every transaction that contains $I$ also contains any nonempty
subset $I_{sub}$ of $I$, so the count of $I_{sub}$ is the same or larger
(larger when some transactions contain $I_{sub}$ without containing all of
$I$). Hence $support(I_{sub}) \ge support(I) \ge min\_sup$, and $I_{sub}$ is
also frequent whenever $I$ is.

\question{4b}{Given frequent itemset $l$ and subset $s$ of $l$, prove that the confidence of the rule
``$s' \Rightarrow (l - s')$'' cannot be more than the confidence of the rule ``$s \Rightarrow (l - s)$'', where $s'$
is a subset of $s$.}
The confidence of $s \Rightarrow (l - s)$ equals $support(l) / support(s)$.
Since $s'$ is a subset of $s$, we know from 4a that
$support(s') \ge support(s)$, and therefore
$support(l) / support(s') \le support(l) / support(s)$: the rule with the
smaller antecedent can never have the higher confidence. Intuitively, if
shrinking the antecedent could raise the confidence, repeating the argument
would leave the empty set on the left-hand side and the full itemset $l$ on
the right, so that a rule like ``customers buying nothing also buy beer''
would be the most confident rule of all, which is absurd.

\question{5}{An optimization in frequent item set mining is mining closed patterns, or mining max
patterns instead. Describe the main differences of mining closed patterns and mining
max patterns.}
% http://www.dataminingarticles.com/closed-maximal-itemsets.html
| 128 | By definition the itemset $X$ is closed means $X$ is frequent and there exists
|
---|
| 129 | no super-pattern $X /in Y$ wth the same support as $X$. While $X$ is max if $X$
|
---|
| 130 | is frequent and there exists no \textit{frequent} super-pattern $X /in Y$.
|
---|
| 131 | Maximum frequent itemsets has the downside that you don't not know the actual
|
---|
| 132 | support of those sub-itemsets. The closed frequent itemsets helps purging a lot
|
---|
| 133 | of itemsets that are not needed.
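
A small worked example in Python (the transactions are of my own choosing)
makes the difference concrete:
\begin{verbatim}
from itertools import combinations

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "b"}, {"b", "c"}]
min_sup = 2

def support(items):
    return sum(items <= t for t in transactions)

universe = set().union(*transactions)
frequent = {frozenset(c): support(set(c))
            for r in range(1, len(universe) + 1)
            for c in combinations(sorted(universe), r)
            if support(set(c)) >= min_sup}

closed = [x for x in frequent
          if not any(x < y and frequent[y] == frequent[x] for y in frequent)]
maximal = [x for x in frequent if not any(x < y for y in frequent)]
print(sorted(map(sorted, closed)))   # [['a', 'b'], ['b'], ['b', 'c']]
print(sorted(map(sorted, maximal)))  # [['a', 'b'], ['b', 'c']]
\end{verbatim}
Here $\{b\}$ is closed but not maximal, and $\{a\}$ is frequent but not closed,
because $\{a, b\}$ has the same support.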

\question{6}{The price of each item in a store is nonnegative. For
each of the following cases, identify the kinds of constraint they represent (e.g.
anti-monotonic, monotonic, succinct) and briefly discuss how to mine such association
rules efficiently:}
% www.cs.sfu.ca/CC/741/jpei/slides/ConstrainedFrequentPatterns.pdf
\question{6a}{Containing one free item and other items the sum of whose prices is at least \$190.}
This is \emph{monotonic}: adding items will never lower the sum of prices, so
once an itemset satisfies the constraint, all of its supersets do as well.
(The ``containing one free item'' part is \emph{succinct}: the qualifying
items can be enumerated up front.) Mining is best done by first selecting the
free items and then ordering the other items by price, starting with the most
expensive ones. As soon as the sum of a candidate reaches \$190, all of its
supersets are known to match as well; then remove the last added item and try
to complete the sum again, and continue until the most expensive item has been
removed and tried. Finally, \emph{join} the result with all the free items and
the list is complete.
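
A small Python sketch of how the monotone property saves checks during
enumeration (item names and prices are invented for illustration):
\begin{verbatim}
prices = {"tv": 150, "camera": 80, "radio": 40, "pen": 2}
free_items = ["keychain"]                 # given away for free
items = sorted(prices, key=prices.get, reverse=True)  # expensive first

def mine(start, chosen, total, satisfied, out):
    # Monotonicity: once total >= 190, every superset qualifies,
    # so the constraint is never re-checked below this point.
    if satisfied or total >= 190:
        out.append(chosen)
        satisfied = True
    for i in range(start, len(items)):
        mine(i + 1, chosen + [items[i]], total + prices[items[i]],
             satisfied, out)

result = []
mine(0, [], 0, False, result)
# Join each qualifying itemset with a free item to complete the list.
print([combo + [f] for combo in result for f in free_items])
\end{verbatim}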

\question{6b}{Where the average price of all the items is between \$120 and \$520.}
Each bound on its own is a \emph{convertible} constraint once the items are
arranged in price-descending order. The lower bound is convertible
\emph{anti-monotonic}: if an itemset $S$ violates $avg(S.price) \ge \$120$,
then so does every itemset having $S$ as a prefix, because only cheaper items
are appended and the average can only drop further. The upper bound is
convertible \emph{monotonic}: if $S$ satisfies $avg(S.price) \le \$520$, then
so does every itemset having $S$ as a prefix, for the same reason.

Requiring both conditions at once, however, makes the combined constraint
neither \emph{anti-monotonic} nor \emph{monotonic}. A fast way to generate
this set is to use the algorithm of 6a, but adapt the bound to the itemset
size on the fly: checking whether the average of $k$ items lies between \$120
and \$520 is the same as checking whether their sum lies between
$k \cdot 120$ and $k \cdot 520$.

\question{7}{Suppose a city has installed hundreds of surveillance cameras at strategic locations in
busy streets. All the captured video (each stream being 640 pixels x 480 pixels, 24-
bits (RGB) per pixel, 25fps) is streamed to a central observation post in real time.
Describe an efficient system that would allow real time data mining and continuously
querying these video streams for abnormal events, such as explosions, people in
panic, etc.. Also discuss the computational costs and the memory requirements of
your system.}

Every camera generates $640 \times 480 \times 24\,\mathrm{bit} \times
25\,\mathrm{fps} = 184.32\,\mathrm{Mbit/s} \approx 21.9\,\mathrm{MiB/s}$. I
would first use a compression algorithm to reduce the data sent over the wire,
transmitting only the changes between frames instead of every full frame.
Secondly, abnormality detection is applied to the streams using heuristics.
Detected abnormalities are kept uncompressed and uncut, while all other
footage is stored at a much lower frame rate (1 fps); with a compression ratio
of 4:1 this results in a storage stream of approximately 19 GB per camera per
day. The processing power required depends on the algorithms used, but a
500 MHz CPU per camera should fit the bill when simple heuristics are used.
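
A quick back-of-the-envelope check of these numbers in Python (my own
verification sketch):
\begin{verbatim}
# Raw stream per camera
bits_per_sec = 640 * 480 * 24 * 25
print(bits_per_sec / 1e6)        # 184.32 Mbit/s
print(bits_per_sec / 8 / 2**20)  # ~21.97 MiB/s

# Archived stream: 1 fps, 4:1 compression
bytes_per_day = 640 * 480 * 3 * 86400 / 4
print(bytes_per_day / 2**30)     # ~18.5 GiB/camera/day
\end{verbatim}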

\question{8}{A flight data warehouse for a travel agent consists of six
dimensions: traveler, departure (city), departure\_time, arrival (city), arrival\_time,
and flight; and two measures count, and avg\_fare, where avg\_fare stores the concrete
fare at the lowest level but the average fare at the other levels. Suppose the cube is
fully materialized. Starting with the base cuboid [traveler, departure, departure\_time,
arrival, arrival\_time, flight], what specific OLAP operations (e.g.\ roll-up flight to
airline, etc.) should one perform in order to list the average fare per month for each
business traveler who flies American Airlines (AA) from Los Angeles (LA) in the
year 2007?}

Note: this is a bit tricky, as we use departure\_time to determine the year of
the flight and arrival\_time to determine its month. When these do not fall in
the same month, we might run into inconsistencies.

\begin{verbatim}
* Roll-up departure_time to 'Year'
* Roll-up flight to 'Airline'
* Dice departure = 'LA' and departure_time = '2007' and flight = 'American Airlines'
* Roll-up traveler to 'Consumer/Business'
* Roll-up arrival_time to 'Month'
* Slice traveler = 'Business'
\end{verbatim}

\question{9}{In graph mining what would be the advantage of the described apriori-based approach
over the pattern growth based approach (see lecture slides) and vice versa.}
The advantage of the apriori-based approach lies in its simplicity: its setup
and the computations required are straightforward, but that does not make it
efficient.

On the other hand, apriori uses the costly process of candidate generation and
testing, which makes it considerably `heavier' than the pattern-growth-based
approach; pattern growth also avoids the costly repeated database scans.

% http://en.wikipedia.org/wiki/Association_rule_learning

\end{document}