Ignore:
Timestamp:
Jan 31, 2010, 7:30:26 PM (15 years ago)
Author:
Rick van der Zwet
Message:

First small bits

File:
1 edited

Legend:

Unmodified
Added
Removed
  • liacs/dbdm/dbdm_5/report.tex

    r71 r72  
    1212generation of a concept hierarchy for numerical data based on the equal-width
    1313partitioning rule.}
     14\begin{verbatim}
     15input = [] # Input array of all input numbers
     16num_intervals = %User input of number of intervals needed%
     17max = maximum(input)
     18min = minimum(input)
     19interval_width = (max - min) / num_intervals
     20
     21output = [] # Output array, where the values of interval k is stored in value output[k]
     22for value in input:
     23  interval = value / interval_width # Find it's correct bin
     24  output[interval].append(value)    # Put the value inside the bin
     25endfor
     26
     27\end{verbatim}
    1428
    1529\question{1b}{Propose an algorithm, in pseudo-code, for the following: The automatic
    1630generation of a concept hierarchy for numerical data based on the equal-frequency
    1731partitioning rule.}
     32\begin{verbatim}
     33input = [] # Input array of all input numbers
     34num_intervals = %User input of number of intervals needed%
     35input_length = length(input) # Number of items in list
     36interval_width = input_length / num_intervals
     37
     38sorted_input = sorted(input) # Sort items on value from small to large
     39
     40output = [] # Output array, where the values of interval k is stored in value output[k]
     41
     42interval = 0
     43counter = 0
     44for value in sorted_input:
     45  output[interval].append(value)    # Put the value inside the bin
     46
     47  # If the width has been 'filled' continue with the next one
     48  counter++
     49  if counter > interval_width:
     50    interval++
     51    counter = 0
     52  endif
     53endfor
     54\end{verbatim}
    1855
    1956\question{2}{Suppose that a data warehouse consists of four dimensions,
     
    2360    charge rate.}
    2461\question{2a}{Draw a star schema diagram for the data warehouse.}
    25 \question{2b}{Starting with the base cuboid [date, spectator, location, game], what specific
    26         OLAP operations should one perform in order to list the total charge paid by
    27         student spectators at GM\_Place 2004?}
    28 \question{2c}{Bitmap indexing is useful in data warehousing. Taking this cube as an example,
    29         briefly discuss advantages and problems of using a bitmap index structure.}
     62% http://it.toolbox.com/blogs/enterprise-solutions/star-schema-modelling-data-warehouse-20803
     63
     64\question{2b}{Starting with the base cuboid [date, spectator, location, game],
     65what specific OLAP operations should one perform in order to list the total
     66charge paid by student spectators at GM\_Place 2004?}
     67% http://en.wikipedia.org/wiki/OLAP_cube#OLAP_operations
     68You first will need to \emph{slice} on the condition \texttt{game ==
     69'GM\_Place'}. Secondly you will need to slice on \texttt{date.year == '2004'}.
     70This will give you all the charges for GM Place in 2004. Next we slice to
     71\texttt{spectator.type == 'student'}. Lastly we sum all the charges in the
     72display phase (\texttt{pivot}).
     73
     74\question{2c}{Bitmap indexing is useful in data warehousing. Taking this cube
     75as an example, briefly discuss advantages and problems of using a bitmap index
     76structure.}
     77% http://en.wikipedia.org/wiki/Bitmap_index
     78Bitmap indexing in this case is useful to have a compact representation of for
     79example the spectator type. As this can only be 4 options a four bit
     80representation fits the need to store all possible combinations. This required
     81binary operators to do searches in this set are quick, but their results will
     82need to be processed before beeing able to represent it.
     83One other advantage is the fact that the bitmap indexing compresses really well
     84as patterns are reoccuring.
     85
    3086
    3187\question{3}{A data cube, C, has n dimensions, and each dimension
     
    3389    hierarchies associated with the dimensions. What is the maximum number of cells
    3490    possible (including both base cells and aggregate cells) in the data cube, C?}
     91% http://www2.cs.uregina.ca/~dbd/cs831/notes/dcubes/dcubes.html
     92Take for example 2 dimensions with each 4 distinct values in the base cuboid,
     93this will a total of $4^2=16$ possibilities. Taking this more general will this be $p^n$.
    3594
    3695\question{4}{The Apriori algorithm uses prior knowledge of subset support properties.}
    3796\question{4a}{Prove that all nonempty subsets of a frequent itemset must also be frequent.}
    38 \question{4b}{Given frequent itemset l and subset s of l, prove that the confidence of the rule “s’
    39          => (l-s’)” cannot be more than the confidence of the rule “s => (l – s)” where s’
    40          is a subset of s.}
     97% http://en.wikipedia.org/wiki/Apriori_algorithm
     98If an itemset $I$ is frequent it means it's occurence classifies a minimal
     99support level, hence if you take a subset of $I$ ($I_{sub}$). The count will remain the
     100same -or larger if the subset is also part of an other itemset $J$, which has
     101distint matches from $I$- as subset $I$. So $I_{sub}$ will also be frequent if
     102$I$ is frequent.
     103
     104\question{4b}{Given frequent itemset $l$ and subset $s$ of $l$, prove that the confidence of the rule
     105         â€œ$s’ => (l - s’)$” cannot be more than the confidence of the rule “$s => (l - s)$” where $s’$
     106         is a subset of $s$.}
     107If $s'$ has an higher confidence than $s$, than a smaller itemset have a higher
     108conditional propability of also containing a larger itemset. If the repeat this
     109assumption, you will end up on the left side with a empty set and on the right
     110the full itemset $l$, which then would have the highest confidence. While this
     111would not be valid. As rules like 'Customer buying nothing', will also buy
     112'beer' can not be existing.
     113
    41114\question{5}{An optimization in frequent item set mining is mining closed patterns, or mining max
    42115    patterns instead. Describe the main differences of mining closed patterns and mining
    43116    max patterns.}
     117% http://www.dataminingarticles.com/closed-maximal-itemsets.html
     118By definition the itemset $X$ is closed means $X$ is frequent and there exists
     119no super-pattern $X /in Y$ wth the same support as $X$. While $X$ is max if $X$
     120is frequent and there exists no \textit{frequent} super-pattern $X /in Y$.
     121Maximum frequent itemsets has the downside that you don't not know the actual
     122support of those sub-itemsets. The closed frequent itemsets helps purging a lot
     123of itemsets that are not needed.
    44124
    45125\question{6}{The price of each item in a store is nonnegative. For
     
    47127antimonotonic, monotonic, succinct) and briefly discuss how to mine such association
    48128rules efficiently:}
     129
     130% www.cs.sfu.ca/CC/741/jpei/slides/ConstrainedFrequentPatterns.pdf
    49131
    50132\question{6a}{Containing one free item and other items the sum of whose prices is at least \$190.}
     
    58140panic, etc.. Also discuss the computational costs and the memory requirements of
    59141your system.}
     142
     143No compression, pre-processing, abnomality detection using heuristics,
     144un-compressed storage for direct retrival. compressed storage for long term
     145storage.
     146
    60147\question{8}{A flight data warehouse for a travel agent consists of six
    61148dimensions: traveler, departure (city), departure\_time, arrival (city), arrival\_time,
     
    67154business traveler who flies American Airlines (AA) from Los Angeles (LA) in the
    68155year 2007?}
     156
     157
    69158\question{9}{In graph mining what would be the advantage of the described apriori-based approach
    70159over the pattern growth based approach (see lecture slides) and vice versa.}
     160
     161% http://en.wikipedia.org/wiki/Association_rule_learning
    71162
    72163\bibliography{report}
Note: See TracChangeset for help on using the changeset viewer.