\documentclass{report}
\usepackage{graphicx}
\title{Databases and Data Mining --- Assignment 5}
\author{Rick van der Zwet}

\begin{document}
\newcommand{\question}[2]{\begin{quotation}\noindent{\Large``}#1. \textit{#2}{\Large''}\end{quotation}}
\maketitle

\question{1a}{Propose an algorithm, in pseudo-code, for the following: The
automatic generation of a concept hierarchy for numerical data based on the
equal-width partitioning rule.}

\begin{verbatim}
input = []        # Input array of all input numbers
num_intervals = %User input of number of intervals needed%

max = maximum(input)
min = minimum(input)
interval_width = (max - min) / num_intervals

output = []       # Output array, where the values of interval k
                  # are stored in output[k]

for value in input:
    # Find the correct bin; offset by the minimum before dividing
    interval = floor((value - min) / interval_width)
    if interval == num_intervals:   # The maximum itself falls in the last bin
        interval = num_intervals - 1
    endif
    output[interval].append(value)  # Put the value inside the bin
endfor
\end{verbatim}

\question{1b}{Propose an algorithm, in pseudo-code, for the following: The
automatic generation of a concept hierarchy for numerical data based on the
equal-frequency partitioning rule.}

\begin{verbatim}
input = []        # Input array of all input numbers
num_intervals = %User input of number of intervals needed%

input_length = length(input)                  # Number of items in the list
interval_size = input_length / num_intervals  # Number of items per interval
sorted_input = sorted(input)  # Sort items on value from small to large

output = []       # Output array, where the values of interval k
                  # are stored in output[k]

interval = 0
counter = 0
for value in sorted_input:
    output[interval].append(value)  # Put the value inside the bin
    counter++
    # Once the interval has been 'filled', continue with the next one
    if counter >= interval_size:
        interval++
        counter = 0
    endif
endfor
\end{verbatim}

\question{2}{Suppose that a data warehouse consists of four dimensions, date,
spectator, location, and game, and two measures, count, and charge, where
charge is the fare that a spectator pays when watching a game on a given date.
Spectators may be students, adults, or seniors, with each category having its
own charge rate.}

\question{2a}{Draw a star schema diagram for the data warehouse.}

See figure~\ref{fig:star}.
% http://it.toolbox.com/blogs/enterprise-solutions/star-schema-modelling-data-warehouse-20803
\begin{figure}[htp]
  \centering
  \includegraphics[scale=0.6]{star-diagram.eps}
  \caption{Star schema diagram of the data warehouse of question 2a}
  \label{fig:star}
\end{figure}

\question{2b}{Starting with the base cuboid [date, spectator, location, game],
what specific OLAP operations should one perform in order to list the total
charge paid by student spectators at GM\_Place 2004?}

% http://en.wikipedia.org/wiki/OLAP_cube#OLAP_operations
First \emph{slice} on the condition \texttt{location == 'GM\_Place'}, then
slice on \texttt{date.year == '2004'}; this yields all charges for GM\_Place
in 2004. Next slice on \texttt{spectator.type == 'student'}. Finally, sum all
remaining charges in the display phase (\texttt{pivot}).

\question{2c}{Bitmap indexing is useful in data warehousing. Taking this cube
as an example, briefly discuss advantages and problems of using a bitmap index
structure.}

% http://en.wikipedia.org/wiki/Bitmap_index
Bitmap indexing is useful here to obtain a compact representation of
low-cardinality dimensions such as the spectator type: with only three
distinct values (student, adult, senior), three bit vectors of one bit per
record describe the whole column. Searches then reduce to fast bitwise
operations on these vectors, although the resulting bit vectors still have to
be translated back into records before they can be presented. Another
advantage is that bitmap indexes compress very well, because the bit patterns
are highly repetitive. The main problem is that this only pays off for
low-cardinality attributes; for high-cardinality attributes the index becomes
large, and updates are relatively expensive.
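To make this concrete, a minimal Python sketch of a bitmap index over the
spectator-type column (the example rows below are made up for illustration):

\begin{verbatim}
# One bit vector per distinct spectator type, one bit per record.
spectators = ['student', 'adult', 'student', 'senior', 'adult', 'student']

bitmap = {value: [int(s == value) for s in spectators]
          for value in ('student', 'adult', 'senior')}

# A selection such as spectator.type == 'student' is a single lookup, and
# combined conditions become bitwise AND/OR over the vectors.
print(bitmap['student'])   # [1, 0, 1, 0, 0, 1]
print([i for i, bit in enumerate(bitmap['student']) if bit])  # rows 0, 2, 5
\end{verbatim}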
\question{3}{A data cube, C, has n dimensions, and each dimension has exactly
p distinct values in the base cuboid. Assume that there are no concept
hierarchies associated with the dimensions. What is the maximum number of
cells possible (including both base cells and aggregate cells) in the data
cube, C?}

% http://www2.cs.uregina.ca/~dbd/cs831/notes/dcubes/dcubes.html
Take for example 2 dimensions, each with 4 distinct values in the base cuboid;
the base cuboid then has $4^2 = 16$ cells. Besides its 4 concrete values,
every dimension can also be aggregated to `all' ($*$), so each dimension
offers $4 + 1 = 5$ choices and the cube contains at most $5^2 = 25$ cells: 16
base cells, $4 + 4$ one-dimensional aggregate cells and 1 apex cell. In
general the maximum number of cells is $(p+1)^n$.

\question{4}{The Apriori algorithm uses prior knowledge of subset support
properties.}

\question{4a}{Prove that all nonempty subsets of a frequent itemset must also
be frequent.}

% http://en.wikipedia.org/wiki/Apriori_algorithm
If an itemset $I$ is frequent, its support count meets the minimal support
threshold. Every transaction that contains $I$ also contains every nonempty
subset $I_{sub} \subseteq I$, so the support count of $I_{sub}$ is at least as
large as that of $I$ (it can only be larger, namely when $I_{sub}$ also occurs
in transactions that do not contain all of $I$). Hence $I_{sub}$ is frequent
whenever $I$ is frequent.

\question{4b}{Given frequent itemset $l$ and subset $s$ of $l$, prove that the
confidence of the rule ``$s' \Rightarrow (l - s')$'' cannot be more than the
confidence of the rule ``$s \Rightarrow (l - s)$'' where $s'$ is a subset of
$s$.}

The confidence of such a rule is $conf(s \Rightarrow (l - s)) =
support(l)/support(s)$. Because $s' \subseteq s$, question 4a gives
$support(s') \ge support(s)$, so $support(l)/support(s') \le
support(l)/support(s)$: the rule built from the smaller antecedent $s'$ can
never have a higher confidence. Intuitively, if it could, repeatedly shrinking
the antecedent would make the rule with the empty set on the left and the full
itemset $l$ on the right the most confident rule of all, which is absurd: a
rule like ``customers buying nothing also buy beer'' carries no information.

\question{5}{An optimization in frequent item set mining is mining closed
patterns, or mining max patterns instead. Describe the main differences of
mining closed patterns and mining max patterns.}

% http://www.dataminingarticles.com/closed-maximal-itemsets.html
By definition an itemset $X$ is \emph{closed} if $X$ is frequent and there
exists no super-pattern $Y \supset X$ with the same support as $X$, while $X$
is \emph{max} (maximal) if $X$ is frequent and there exists no
\textit{frequent} super-pattern $Y \supset X$. Mining maximal frequent
itemsets has the downside that the actual support of the sub-itemsets is no
longer known, whereas closed frequent itemsets preserve those supports and
still purge a large number of redundant itemsets.

\question{6}{The price of each item in a store is nonnegative. For each of the
following cases, identify the kinds of constraint they represent (e.g.
anti-monotonic, monotonic, succinct) and briefly discuss how to mine such
association rules efficiently:}

% www.cs.sfu.ca/CC/741/jpei/slides/ConstrainedFrequentPatterns.pdf
\question{6a}{Containing one free item and other items the sum of whose prices
is at least \$190.}

This constraint is \emph{monotonic}: adding items never lowers the sum of
prices, so once an itemset satisfies it, all of its supersets do as well.
Mining is best done by first finding the free items and then ordering the
other items by price, most expensive first. As soon as a candidate reaches
\$190, all of its supersets match too and the constraint no longer needs to be
checked on that branch; then drop the last added item and try to reach the sum
again, and continue until even the most expensive items can no longer reach
\$190, as sketched below. Finally join the result with the free items and the
list is complete.
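A small sketch of that enumeration, with hypothetical item names and prices
(support counting against the transaction database is left out; only the
monotone price check is shown):

\begin{verbatim}
from itertools import combinations

# Hypothetical items, sorted by descending price
items = [('tv', 250), ('dvd', 120), ('radio', 90), ('cable', 15)]

def qualifying(prefix, total, rest, out, threshold=190):
    if total >= threshold:
        # Monotonicity: the prefix and every extension of it satisfy the
        # constraint, so they are emitted without looking at prices again.
        for k in range(len(rest) + 1):
            for extra in combinations(rest, k):
                out.append([n for n, _ in prefix] + [n for n, _ in extra])
        return out
    for i, (name, price) in enumerate(rest):
        qualifying(prefix + [(name, price)], total + price,
                   rest[i + 1:], out, threshold)
    return out

print(qualifying([], 0, items, []))  # every itemset whose prices sum to >= 190
\end{verbatim}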
\question{6b}{Where the average price of all the items is between \$120 and
\$520.}

This constraint is convertible \emph{anti-monotonic} and convertible
\emph{monotonic} if you look at each sub-constraint separately, with the items
ordered by descending price. \emph{Anti-monotonic}: if an itemset $S$ violates
the sub-constraint $avg(S.price) \ge \$120$, then so does every itemset having
$S$ as a prefix in that order, since appending cheaper items only lowers the
average. \emph{Monotonic}: if $S$ satisfies the sub-constraint $avg(S.price)
\le \$520$, then so does every itemset having $S$ as a prefix, for the same
reason. Satisfying both conditions at once, however, makes the combined
constraint neither anti-monotonic nor monotonic. A fast way to generate this
set is to reuse the enumeration of 6a but adapt the bound on the fly to the
size of the candidate: requiring the average of $k$ items to lie between \$120
and \$520 is the same as requiring their sum to lie between $k \cdot 120$ and
$k \cdot 520$.

\question{7}{Suppose a city has installed hundreds of surveillance cameras at
strategic locations in busy streets. All the captured video (each stream being
640 pixels x 480 pixels, 24-bits (RGB) per pixel, 25fps) is streamed to a
central observation post in real time. Describe an efficient system that would
allow real time data mining and continuously querying these video streams for
abnormal events, such as explosions, people in panic, etc. Also discuss the
computational costs and the memory requirements of your system.}

Every camera generates $640 \times 480 \times 24\,\mathrm{bit} \times
25\,\mathrm{fps} = 184{,}320\,\mathrm{kbit/s} \approx 23\,\mathrm{MByte/s}$. I
would first compress the data before sending it over the wire, transmitting
only the changes between frames instead of every full frame. Secondly,
abnormality detection is applied using heuristics. Detected abnormalities are
kept uncompressed and uncut, while all other footage is stored at a much lower
frame rate (1 fps); with a compression ratio of 4:1 this amounts to a storage
stream of roughly 20 GB per camera per day. The processing power required
depends on the algorithms used, but a 500 MHz CPU per camera should fit the
bill when simple algorithms are used.
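The arithmetic behind these figures, as a small Python check (the 1 fps
archive rate, the 4:1 compression ratio and the decimal units are the
assumptions stated above):

\begin{verbatim}
# Back-of-the-envelope check of the bandwidth and storage figures.
width, height, bits_per_pixel, fps = 640, 480, 24, 25

raw_bits_per_sec = width * height * bits_per_pixel * fps
print(raw_bits_per_sec / 1000)       # 184320.0 kbit/s per camera
print(raw_bits_per_sec / 8 / 1e6)    # ~23 MByte/s per camera

# Archive stream: 1 fps, 4:1 compression, stored for a full day (86400 s).
archive_bytes_per_day = (width * height * bits_per_pixel / 4) / 8 * 1 * 86400
print(archive_bytes_per_day / 1e9)   # ~19.9 GB per camera per day
\end{verbatim}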
\question{8}{A flight data warehouse for a travel agent consists of six
dimensions: traveler, departure (city), departure\_time, arrival (city),
arrival\_time, and flight; and two measures count, and avg\_fare, where
avg\_fare stores the concrete fare at the lowest level but the average fare at
the other levels. Suppose the cube is fully materialized. Starting with the
base cuboid [traveler, departure, departure\_time, arrival, arrival\_time,
flight], what specific OLAP operations (e.g roll-up flight to airline, etc.)
should one perform in order to list the average fare per month for each
business traveler who flies American Airlines (AA) from Los Angeles (LA) in
the year 2007?}

Note that this is a bit tricky: we use departure\_time to determine the year
and arrival\_time to determine the month of the flight. When these values do
not fall on the same day we might run into inconsistencies.

\begin{verbatim}
* Roll-up departure_time to 'Year'
* Roll-up flight to 'Airline'
* Dice departure = 'LA' and departure_time = '2007'
       and flight = 'American Airlines'
* Roll-up traveler to 'Consumer/Business'
* Roll-up arrival_time to 'Month'
* Slice traveler = 'Business'
\end{verbatim}

\question{9}{In graph mining what would be the advantage of the described
apriori-based approach over the pattern growth based approach (see lecture
slides) and vice versa.}

The advantage of the Apriori-based approach lies in the simplicity of its
setup and of the computations needed in each step, but that does not make it
efficient: it relies on the costly process of candidate generation and
testing, and on repeated database scans. The pattern-growth-based approach
avoids candidate generation and those costly scans, which makes it less
`heavy', at the price of a more involved algorithm.
% http://en.wikipedia.org/wiki/Association_rule_learning

\end{document}