[71] | 1 | \documentclass{report}
|
---|
| 2 |
|
---|
| 3 | \title{Databases and Data Mining --- Assignment 5}
|
---|
| 4 | \author{Rick van der Zwet}
|
---|
| 5 |
|
---|
| 6 |
|
---|
| 7 | \begin{document}
|
---|
| 8 | \newcommand{\question}[2]{\begin{quotation}\noindent{\Large``}#1. #2{\Large''}\end{quotation}}
|
---|
| 9 | \maketitle
|
---|
| 10 |
|
---|
| 11 | \question{1a}{Propose an algorithm, in pseudo-code, for the following: The automatic
|
---|
| 12 | generation of a concept hierarchy for numerical data based on the equal-width
|
---|
| 13 | partitioning rule.}
|
---|
| 14 |
|
---|
| 15 | \question{1b}{Propose an algorithm, in pseudo-code, for the following: The automatic
|
---|
| 16 | generation of a concept hierarchy for numerical data based on the equal-frequency
|
---|
| 17 | partitioning rule.}
|
---|
| 18 |
|
---|
| 19 | \question{2}{Suppose that a data warehouse consists of four dimensions,
|
---|
| 20 | date, spectator, location, and game, and two measures, count, and charge, where
|
---|
| 21 | charge is the fare that a spectator pays when watching a game on a given date.
|
---|
| 22 | Spectators may be students, adults, or seniors, with each category having its own
|
---|
| 23 | charge rate.}
|
---|
| 24 | \question{2a}{Draw a star schema diagram for the data warehouse.}
|
---|
| 25 | \question{2b}{Starting with the base cuboid [date, spectator, location, game], what specific
|
---|
| 26 | OLAP operations should one perform in order to list the total charge paid by
|
---|
| 27 | student spectators at GM\_Place 2004?}
|
---|
| 28 | \question{2c}{Bitmap indexing is useful in data warehousing. Taking this cube as an example,
|
---|
| 29 | briefly discuss advantages and problems of using a bitmap index structure.}
|
---|
| 30 |
|
---|
| 31 | \question{3}{A data cube, C, has n dimensions, and each dimension
|
---|
| 32 | has exactly p distinct values in the base cuboid. Assume that there are no concept
|
---|
| 33 | hierarchies associated with the dimensions. What is the maximum number of cells
|
---|
| 34 | possible (including both base cells and aggregate cells) in the data cube, C?}
|
---|
| 35 |
|
---|
| 36 | \question{4}{The Apriori algorithm uses prior knowledge of subset support properties.}
|
---|
| 37 | \question{4a}{Prove that all nonempty subsets of a frequent itemset must also be frequent.}
|
---|
| 38 | \question{4b}{Given frequent itemset l and subset s of l, prove that the confidence of the rule âsâ
|
---|
| 39 | => (l-sâ)â cannot be more than the confidence of the rule âs => (l â s)â where sâ
|
---|
| 40 | is a subset of s.}
|
---|
| 41 | \question{5}{An optimization in frequent item set mining is mining closed patterns, or mining max
|
---|
| 42 | patterns instead. Describe the main differences of mining closed patterns and mining
|
---|
| 43 | max patterns.}
|
---|
| 44 |
|
---|
| 45 | \question{6}{The price of each item in a store is nonnegative. For
|
---|
| 46 | each of the following cases, identify the kinds of constraint they represent (e.g.
|
---|
| 47 | antimonotonic, monotonic, succinct) and briefly discuss how to mine such association
|
---|
| 48 | rules efficiently:}
|
---|
| 49 |
|
---|
| 50 | \question{6a}{Containing one free item and other items the sum of whose prices is at least \$190.}
|
---|
| 51 | \question{6b}{Where the average price of all the items is between \$120 and \$520.}
|
---|
| 52 |
|
---|
| 53 | \question{7}{Suppose a city has installed hundreds of surveillance cameras at strategic locations in
|
---|
| 54 | busy streets. All the captured video (each stream being 640 pixels x 480 pixels, 24-
|
---|
| 55 | bits (RGB) per pixel, 25fps) is streamed to a central observation post in real time.
|
---|
| 56 | Describe an efficient system that would allow real time data mining and continuously
|
---|
| 57 | querying these video streams for abnormal events, such as explosions, people in
|
---|
| 58 | panic, etc.. Also discuss the computational costs and the memory requirements of
|
---|
| 59 | your system.}
|
---|
| 60 | \question{8}{A flight data warehouse for a travel agent consists of six
|
---|
| 61 | dimensions: traveler, departure (city), departure\_time, arrival (city), arrival\_time,
|
---|
| 62 | and flight; and two measures count, and avg\_fare, where avg\_fare stores the concrete
|
---|
| 63 | fare at the lowest level but the average fare at the other levels. Suppose the cube is
|
---|
| 64 | fully materialized. Starting with the base cuboid [traveler, departure, departure\_time,
|
---|
| 65 | arrival, arrival\_time, flight], what specific OLAP operations (e.g roll-up flight to
|
---|
| 66 | airline, etc.) should one perform in order to list the average fare per month for each
|
---|
| 67 | business traveler who flies American Airlines (AA) from Los Angeles (LA) in the
|
---|
| 68 | year 2007?}
|
---|
| 69 | \question{9}{In graph mining what would be the advantage of the described apriori-based approach
|
---|
| 70 | over the pattern growth based approach (see lecture slides) and vice versa.}
|
---|
| 71 |
|
---|
| 72 | \bibliography{report}
|
---|
| 73 |
|
---|
| 74 | \end{document}
|
---|