Changeset 72
- Timestamp:
- Jan 31, 2010, 7:30:26 PM (15 years ago)
- File:
-
- 1 edited
Legend:
- Unmodified
- Added
- Removed
-
liacs/dbdm/dbdm_5/report.tex
r71 r72 12 12 generation of a concept hierarchy for numerical data based on the equal-width 13 13 partitioning rule.} 14 \begin{verbatim} 15 input = [] # Input array of all input numbers 16 num_intervals = %User input of number of intervals needed% 17 max = maximum(input) 18 min = minimum(input) 19 interval_width = (max - min) / num_intervals 20 21 output = [] # Output array, where the values of interval k is stored in value output[k] 22 for value in input: 23 interval = value / interval_width # Find it's correct bin 24 output[interval].append(value) # Put the value inside the bin 25 endfor 26 27 \end{verbatim} 14 28 15 29 \question{1b}{Propose an algorithm, in pseudo-code, for the following: The automatic 16 30 generation of a concept hierarchy for numerical data based on the equal-frequency 17 31 partitioning rule.} 32 \begin{verbatim} 33 input = [] # Input array of all input numbers 34 num_intervals = %User input of number of intervals needed% 35 input_length = length(input) # Number of items in list 36 interval_width = input_length / num_intervals 37 38 sorted_input = sorted(input) # Sort items on value from small to large 39 40 output = [] # Output array, where the values of interval k is stored in value output[k] 41 42 interval = 0 43 counter = 0 44 for value in sorted_input: 45 output[interval].append(value) # Put the value inside the bin 46 47 # If the width has been 'filled' continue with the next one 48 counter++ 49 if counter > interval_width: 50 interval++ 51 counter = 0 52 endif 53 endfor 54 \end{verbatim} 18 55 19 56 \question{2}{Suppose that a data warehouse consists of four dimensions, … … 23 60 charge rate.} 24 61 \question{2a}{Draw a star schema diagram for the data warehouse.} 25 \question{2b}{Starting with the base cuboid [date, spectator, location, game], what specific 26 OLAP operations should one perform in order to list the total charge paid by 27 student spectators at GM\_Place 2004?} 28 \question{2c}{Bitmap indexing is useful in data warehousing. Taking this cube as an example, 29 briefly discuss advantages and problems of using a bitmap index structure.} 62 % http://it.toolbox.com/blogs/enterprise-solutions/star-schema-modelling-data-warehouse-20803 63 64 \question{2b}{Starting with the base cuboid [date, spectator, location, game], 65 what specific OLAP operations should one perform in order to list the total 66 charge paid by student spectators at GM\_Place 2004?} 67 % http://en.wikipedia.org/wiki/OLAP_cube#OLAP_operations 68 You first will need to \emph{slice} on the condition \texttt{game == 69 'GM\_Place'}. Secondly you will need to slice on \texttt{date.year == '2004'}. 70 This will give you all the charges for GM Place in 2004. Next we slice to 71 \texttt{spectator.type == 'student'}. Lastly we sum all the charges in the 72 display phase (\texttt{pivot}). 73 74 \question{2c}{Bitmap indexing is useful in data warehousing. Taking this cube 75 as an example, briefly discuss advantages and problems of using a bitmap index 76 structure.} 77 % http://en.wikipedia.org/wiki/Bitmap_index 78 Bitmap indexing in this case is useful to have a compact representation of for 79 example the spectator type. As this can only be 4 options a four bit 80 representation fits the need to store all possible combinations. This required 81 binary operators to do searches in this set are quick, but their results will 82 need to be processed before beeing able to represent it. 83 One other advantage is the fact that the bitmap indexing compresses really well 84 as patterns are reoccuring. 85 30 86 31 87 \question{3}{A data cube, C, has n dimensions, and each dimension … … 33 89 hierarchies associated with the dimensions. What is the maximum number of cells 34 90 possible (including both base cells and aggregate cells) in the data cube, C?} 91 % http://www2.cs.uregina.ca/~dbd/cs831/notes/dcubes/dcubes.html 92 Take for example 2 dimensions with each 4 distinct values in the base cuboid, 93 this will a total of $4^2=16$ possibilities. Taking this more general will this be $p^n$. 35 94 36 95 \question{4}{The Apriori algorithm uses prior knowledge of subset support properties.} 37 96 \question{4a}{Prove that all nonempty subsets of a frequent itemset must also be frequent.} 38 \question{4b}{Given frequent itemset l and subset s of l, prove that the confidence of the rule âsâ 39 => (l-sâ)â cannot be more than the confidence of the rule âs => (l â s)â where sâ 40 is a subset of s.} 97 % http://en.wikipedia.org/wiki/Apriori_algorithm 98 If an itemset $I$ is frequent it means it's occurence classifies a minimal 99 support level, hence if you take a subset of $I$ ($I_{sub}$). The count will remain the 100 same -or larger if the subset is also part of an other itemset $J$, which has 101 distint matches from $I$- as subset $I$. So $I_{sub}$ will also be frequent if 102 $I$ is frequent. 103 104 \question{4b}{Given frequent itemset $l$ and subset $s$ of $l$, prove that the confidence of the rule 105 â$sâ => (l - sâ)$â cannot be more than the confidence of the rule â$s => (l - s)$â where $sâ$ 106 is a subset of $s$.} 107 If $s'$ has an higher confidence than $s$, than a smaller itemset have a higher 108 conditional propability of also containing a larger itemset. If the repeat this 109 assumption, you will end up on the left side with a empty set and on the right 110 the full itemset $l$, which then would have the highest confidence. While this 111 would not be valid. As rules like 'Customer buying nothing', will also buy 112 'beer' can not be existing. 113 41 114 \question{5}{An optimization in frequent item set mining is mining closed patterns, or mining max 42 115 patterns instead. Describe the main differences of mining closed patterns and mining 43 116 max patterns.} 117 % http://www.dataminingarticles.com/closed-maximal-itemsets.html 118 By definition the itemset $X$ is closed means $X$ is frequent and there exists 119 no super-pattern $X /in Y$ wth the same support as $X$. While $X$ is max if $X$ 120 is frequent and there exists no \textit{frequent} super-pattern $X /in Y$. 121 Maximum frequent itemsets has the downside that you don't not know the actual 122 support of those sub-itemsets. The closed frequent itemsets helps purging a lot 123 of itemsets that are not needed. 44 124 45 125 \question{6}{The price of each item in a store is nonnegative. For … … 47 127 antimonotonic, monotonic, succinct) and briefly discuss how to mine such association 48 128 rules efficiently:} 129 130 % www.cs.sfu.ca/CC/741/jpei/slides/ConstrainedFrequentPatterns.pdf 49 131 50 132 \question{6a}{Containing one free item and other items the sum of whose prices is at least \$190.} … … 58 140 panic, etc.. Also discuss the computational costs and the memory requirements of 59 141 your system.} 142 143 No compression, pre-processing, abnomality detection using heuristics, 144 un-compressed storage for direct retrival. compressed storage for long term 145 storage. 146 60 147 \question{8}{A flight data warehouse for a travel agent consists of six 61 148 dimensions: traveler, departure (city), departure\_time, arrival (city), arrival\_time, … … 67 154 business traveler who flies American Airlines (AA) from Los Angeles (LA) in the 68 155 year 2007?} 156 157 69 158 \question{9}{In graph mining what would be the advantage of the described apriori-based approach 70 159 over the pattern growth based approach (see lecture slides) and vice versa.} 160 161 % http://en.wikipedia.org/wiki/Association_rule_learning 71 162 72 163 \bibliography{report}
Note:
See TracChangeset
for help on using the changeset viewer.