Changeset 72

Timestamp:

Jan 31, 2010, 7:30:26 PM (15 years ago)

Author:

Rick van der Zwet

Message:

First small bits

File:

: 1 edited

liacs/dbdm/dbdm_5/report.tex (modified) (6 diffs)

liacs/dbdm/dbdm_5/report.tex

-              r71
+              r72
 generation of a concept hierarchy for numerical data based on the equal-width
 partitioning rule.}
+\begin{verbatim}
+input = [] # Input array of all input numbers
+num_intervals = %User input of number of intervals needed%
+max = maximum(input)
+min = minimum(input)
+interval_width = (max - min) / num_intervals
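+# Output array of num_intervals empty bins; the values of interval k are stored in output[k]
+output = [[] for k in 0 .. num_intervals - 1]
+for value in input:
+  # Find its bin: offset by min and round down; the maximum goes into the last bin
+  interval = minimum(floor((value - min) / interval_width), num_intervals - 1)
+  output[interval].append(value)    # Put the value inside the bin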
+endfor
+\end{verbatim}
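
The equal-width pseudo-code above can be turned into a small runnable sketch. The function name `equal_width_bins` and the sample data are assumptions made for illustration; the sketch offsets each value by the minimum, rounds down, and keeps the maximum value in the last bin.

```python
import math


def equal_width_bins(values, num_intervals):
    """Partition values into num_intervals bins that all span the same width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_intervals
    bins = [[] for _ in range(num_intervals)]
    for value in values:
        if width == 0:
            # All values are identical: everything goes into the first bin.
            index = 0
        else:
            # Offset by the minimum, round down, and clamp so that the
            # maximum value lands in the last bin instead of one past it.
            index = min(math.floor((value - lo) / width), num_intervals - 1)
        bins[index].append(value)
    return bins


bins = equal_width_bins([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 3)
print(bins)
```

Each bin covers a value range of `(max - min) / num_intervals`, so the number of values per bin depends on how the data is distributed.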
 \question{1b}{Propose an algorithm, in pseudo-code, for the following: The automatic
 generation of a concept hierarchy for numerical data based on the equal-frequency
 partitioning rule.}
+\begin{verbatim}
+input = [] # Input array of all input numbers
+num_intervals = %User input of number of intervals needed%
+input_length = length(input) # Number of items in list
+interval_width = floor(input_length / num_intervals) # Number of values per bin
+sorted_input = sorted(input) # Sort items on value from small to large
+# Output array of num_intervals empty bins; the values of interval k are stored in output[k]
+output = [[] for k in 0 .. num_intervals - 1]
+interval = 0
+counter = 0
+for value in sorted_input:
+  output[interval].append(value)    # Put the value inside the bin
+  # If the bin has been 'filled' continue with the next one;
+  # any remainder ends up in the last bin
+  counter++
+  if counter >= interval_width and interval < num_intervals - 1:
+    interval++
+    counter = 0
+  endif
+endfor
+\end{verbatim}
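
A runnable sketch of equal-frequency partitioning is given here. It is a variation on the pseudo-code above, not a literal translation: instead of a counter it derives the bin from the position in the sorted list, which spreads any remainder over the bins. The function name `equal_frequency_bins` and the sample data are assumptions.

```python
def equal_frequency_bins(values, num_intervals):
    """Partition values into num_intervals bins holding (almost) equally many values."""
    ordered = sorted(values)
    n = len(ordered)
    bins = [[] for _ in range(num_intervals)]
    for position, value in enumerate(ordered):
        # Integer arithmetic maps positions 0..n-1 onto bins 0..num_intervals-1,
        # so bin sizes differ by at most one value.
        index = position * num_intervals // n
        bins[index].append(value)
    return bins


bins = equal_frequency_bins([7, 1, 9, 3, 5, 2, 8, 4, 10, 6], 3)
print(bins)
```

Unlike the equal-width rule, the bin boundaries follow the data: dense regions get narrow bins and sparse regions get wide ones.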
 \question{2}{Suppose that a data warehouse consists of four dimensions,
 …
     charge rate.}
 \question{2a}{Draw a star schema diagram for the data warehouse.}
-\question{2b}{Starting with the base cuboid [date, spectator, location, game], what specific
-        OLAP operations should one perform in order to list the total charge paid by
-        student spectators at GM\_Place 2004?}
-\question{2c}{Bitmap indexing is useful in data warehousing. Taking this cube as an example,
-        briefly discuss advantages and problems of using a bitmap index structure.}
+% http://it.toolbox.com/blogs/enterprise-solutions/star-schema-modelling-data-warehouse-20803
+\question{2b}{Starting with the base cuboid [date, spectator, location, game],
+what specific OLAP operations should one perform in order to list the total
+charge paid by student spectators at GM\_Place 2004?}
+% http://en.wikipedia.org/wiki/OLAP_cube#OLAP_operations
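+First \emph{roll up} \texttt{date} to the year level and \texttt{game} to
+\texttt{all}, as the individual days and games are not needed. Then
+\emph{slice} on the condition \texttt{location == 'GM\_Place'} and on
+\texttt{date.year == '2004'}. This gives all the charges for GM\_Place in 2004.
+Next slice on \texttt{spectator.type == 'student'}. The roll-up sums the
+charges, so the remaining cell holds the total charge; a \texttt{pivot} only
+rotates the axes in the display phase and does not aggregate anything.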
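
The selection and summation for question 2b can be illustrated on a handful of made-up fact rows. Everything in this sketch is an assumption for illustration: the row values, the helper `dice`, and the reading of GM\_Place as a value of the location dimension.

```python
# Hypothetical fact rows of the base cuboid; all values are made up.
facts = [
    {"year": 2004, "spectator": "student", "location": "GM_Place", "game": "hockey", "charge": 20},
    {"year": 2004, "spectator": "student", "location": "GM_Place", "game": "basketball", "charge": 15},
    {"year": 2004, "spectator": "adult", "location": "GM_Place", "game": "hockey", "charge": 40},
    {"year": 2003, "spectator": "student", "location": "GM_Place", "game": "hockey", "charge": 18},
    {"year": 2004, "spectator": "student", "location": "Other_Arena", "game": "hockey", "charge": 22},
]


def dice(rows, **conditions):
    """Keep only the rows that match every dimension=value condition."""
    return [row for row in rows
            if all(row[dim] == val for dim, val in conditions.items())]


# Select on three dimensions at once (a dice, i.e. several slices combined).
selected = dice(facts, year=2004, spectator="student", location="GM_Place")

# Summing over the remaining rows corresponds to rolling game up to 'all'.
total_charge = sum(row["charge"] for row in selected)
print(total_charge)
```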
+\question{2c}{Bitmap indexing is useful in data warehousing. Taking this cube
+as an example, briefly discuss advantages and problems of using a bitmap index
+structure.}
+% http://en.wikipedia.org/wiki/Bitmap_index
+Bitmap indexing is useful in this case to get a compact representation of, for
+example, the spectator type. As this can only be one of 4 options, four bit
+vectors (one per option, with one bit per row) are enough to index the column.
+The bitwise operators required to do searches in this set are quick, but their
+results need to be processed further before they can be presented.
+One other advantage is the fact that a bitmap index compresses really well,
+as patterns are recurring.
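
A minimal sketch of a bitmap index, assuming six made-up fact rows and using Python integers as bit vectors. The helper `build_bitmap_index` and the column contents are hypothetical; the point is that a conjunctive query becomes one bitwise AND, while the result still has to be decoded into row numbers.

```python
# Hypothetical dimension values of six fact rows (row 0 .. row 5).
spectator_types = ["student", "adult", "student", "senior", "adult", "student"]
locations = ["GM_Place", "GM_Place", "Other", "GM_Place", "Other", "GM_Place"]


def build_bitmap_index(column):
    """Return {value: int}; bit i of the int is set if row i holds that value."""
    index = {}
    for row, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row)
    return index


type_index = build_bitmap_index(spectator_types)
location_index = build_bitmap_index(locations)

# The query "student AND GM_Place" is a single bitwise AND of two bit vectors.
hits = type_index["student"] & location_index["GM_Place"]

# The result is itself a bit vector and must be decoded before it can be shown.
matching_rows = [row for row in range(len(spectator_types)) if (hits >> row) & 1]
print(matching_rows)
```

One bit vector is needed per distinct value, so the index grows with the cardinality of the dimension.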
 \question{3}{A data cube, C, has n dimensions, and each dimension
 …
     hierarchies associated with the dimensions. What is the maximum number of cells
     possible (including both base cells and aggregate cells) in the data cube, C?}
+% http://www2.cs.uregina.ca/~dbd/cs831/notes/dcubes/dcubes.html
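+Take for example 2 dimensions with 4 distinct values each: the base cuboid
+then has at most $4^2=16$ cells. Counting the aggregate cells as well, every
+dimension can also take the value \texttt{*} (all), which gives $(4+1)^2=25$
+cells. In general the maximum is $(p+1)^n$ cells, of which $p^n$ are base cells.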
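
The cell count can be checked by brute force for small cases. This sketch assumes that an aggregate cell is written with `*` (all) in one or more dimensions, so each dimension has `p` base values plus `*`; the function name `count_cells` is hypothetical.

```python
from itertools import product


def count_cells(p, n):
    """Count base cells and all cells (base + aggregate) of an n-dimensional cube."""
    values = list(range(p))
    # Base cells: every dimension holds one of its p distinct values.
    base_cells = sum(1 for _ in product(values, repeat=n))
    # All cells: every dimension holds one of its p values or '*' (aggregated).
    all_cells = sum(1 for _ in product(values + ["*"], repeat=n))
    return base_cells, all_cells


print(count_cells(4, 2))
```

The two counts match the closed forms $p^n$ and $(p+1)^n$.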
 \question{4}{The Apriori algorithm uses prior knowledge of subset support properties.}
 \question{4a}{Prove that all nonempty subsets of a frequent itemset must also be frequent.}
-\question{4b}{Given frequent itemset l and subset s of l, prove that the confidence of the rule “s’
-         => (l-s’)” cannot be more than the confidence of the rule “s => (l − s)” where s’
-         is a subset of s.}
+% http://en.wikipedia.org/wiki/Apriori_algorithm
+If an itemset $I$ is frequent, its occurrence count satisfies the minimum
+support level. Now take a nonempty subset $I_{sub}$ of $I$. Every transaction
+that contains $I$ also contains $I_{sub}$, so the count of $I_{sub}$ is the
+same as that of $I$, or larger if $I_{sub}$ is also part of another itemset
+$J$ whose matches are distinct from those of $I$. So $I_{sub}$ is also
+frequent if $I$ is frequent.
+\question{4b}{Given frequent itemset $l$ and subset $s$ of $l$, prove that the confidence of the rule
+         ``$s' \Rightarrow (l - s')$'' cannot be more than the confidence of the rule ``$s \Rightarrow (l - s)$'' where $s'$
+         is a subset of $s$.}
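+Both rules cover the same itemset $l$, so
+$\mathit{conf}(s \Rightarrow (l - s)) = \mathit{sup}(l) / \mathit{sup}(s)$ and
+$\mathit{conf}(s' \Rightarrow (l - s')) = \mathit{sup}(l) / \mathit{sup}(s')$.
+As $s'$ is a subset of $s$, every transaction containing $s$ also contains
+$s'$, so $\mathit{sup}(s') \geq \mathit{sup}(s)$ and the rule for $s'$ cannot
+have the higher confidence. Intuitively: if $s'$ had a higher confidence than
+$s$, then a smaller itemset would have a higher conditional probability of
+also containing the larger itemset. If you repeat this assumption, you end up
+with an empty set on the left side and the full itemset $l$ on the right,
+which would then have the highest confidence. This is not valid, as a rule
+like `customers buying nothing also buy beer' cannot exist.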
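
The relation between the two confidences can be checked on a toy example. The transactions and item names below are made up for illustration, and `support` and `confidence` are hypothetical helpers; confidence is computed as the support of the full itemset divided by the support of the rule's left-hand side.

```python
# Hypothetical transaction database.
transactions = [
    {"beer", "chips", "salsa"},
    {"beer", "chips"},
    {"beer", "chips", "salsa"},
    {"beer"},
    {"chips", "salsa"},
]


def support(itemset):
    """Number of transactions that contain every item of itemset."""
    return sum(1 for t in transactions if itemset <= t)


def confidence(antecedent, itemset):
    """Confidence of the rule antecedent => (itemset - antecedent)."""
    return support(itemset) / support(antecedent)


itemset_l = frozenset({"beer", "chips", "salsa"})
s = frozenset({"beer", "chips"})   # subset of itemset_l
s_prime = frozenset({"beer"})      # subset of s

# The smaller left-hand side matches at least as many transactions,
# so its rule has a confidence that is at most that of the larger one.
print(support(s_prime), support(s))
print(confidence(s_prime, itemset_l), confidence(s, itemset_l))
```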
 \question{5}{An optimization in frequent item set mining is mining closed patterns, or mining max
     patterns instead. Describe the main differences of mining closed patterns and mining
     max patterns.}
+% http://www.dataminingarticles.com/closed-maximal-itemsets.html
+By definition an itemset $X$ is closed if $X$ is frequent and there exists
+no super-pattern $Y \supset X$ with the same support as $X$, while $X$ is max
+if $X$ is frequent and there exists no \textit{frequent} super-pattern
+$Y \supset X$. Max frequent itemsets have the downside that you do not know
+the actual support of their sub-itemsets. Closed frequent itemsets help to
+purge a lot of itemsets that are not needed, while the support of every
+frequent itemset can still be derived from them.
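
The difference between closed and max patterns shows up on a small example. This is a brute-force sketch with made-up transactions and an assumed minimum support of 2; it enumerates all itemsets, which is only feasible for tiny inputs and is not how a real miner would work.

```python
from itertools import combinations

# Hypothetical transaction database and minimum support count.
transactions = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b"}, {"a"}]
min_support = 2


def support(itemset):
    """Number of transactions that contain every item of itemset."""
    return sum(1 for t in transactions if itemset <= t)


# Enumerate every nonempty itemset and keep the frequent ones with their support.
items = sorted(set().union(*transactions))
frequent = {}
for size in range(1, len(items) + 1):
    for combo in combinations(items, size):
        itemset = frozenset(combo)
        count = support(itemset)
        if count >= min_support:
            frequent[itemset] = count

# Closed: no proper superset with the same support.
closed = {x for x in frequent
          if not any(x < y and frequent[y] == frequent[x] for y in frequent)}
# Max: no proper superset that is frequent at all.
maximal = {x for x in frequent if not any(x < y for y in frequent)}

print(sorted(sorted(x) for x in closed))
print(sorted(sorted(x) for x in maximal))
```

Here the closed set keeps the itemsets where the support changes, while the max set only keeps the largest frequent itemset and so loses the higher supports of its subsets.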
 \question{6}{The price of each item in a store is nonnegative. For
 …
 antimonotonic, monotonic, succinct) and briefly discuss how to mine such association
 rules efficiently:}
+% www.cs.sfu.ca/CC/741/jpei/slides/ConstrainedFrequentPatterns.pdf
 \question{6a}{Containing one free item and other items the sum of whose prices is at least \$190.}
 …
 panic, etc.. Also discuss the computational costs and the memory requirements of
 your system.}
+No compression, pre-processing, abnormality detection using heuristics,
+uncompressed storage for direct retrieval, compressed storage for long-term
+storage.
 \question{8}{A flight data warehouse for a travel agent consists of six
 dimensions: traveler, departure (city), departure\_time, arrival (city), arrival\_time,
 …
 business traveler who flies American Airlines (AA) from Los Angeles (LA) in the
 year 2007?}
 \question{9}{In graph mining what would be the advantage of the described apriori-based approach
 over the pattern growth based approach (see lecture slides) and vice versa.}
+% http://en.wikipedia.org/wiki/Association_rule_learning
 \bibliography{report}
