[2] | 1 | %
|
---|
| 2 | % $Id: report.tex 571 2008-04-20 17:31:04Z rick $
|
---|
| 3 | %
|
---|
| 4 |
|
---|
| 5 | \documentclass[12pt,a4paper]{article}
|
---|
| 6 |
|
---|
| 7 | \frenchspacing
|
---|
| 8 | \usepackage[english,dutch]{babel}
|
---|
| 9 | \selectlanguage{dutch}
|
---|
| 10 | \usepackage{graphicx}
|
---|
| 11 | \usepackage{url}
|
---|
| 12 | \usepackage{multicol}
|
---|
| 13 | \usepackage{fancybox}
|
---|
| 14 | \usepackage{amssymb,amsmath}
|
---|
| 15 | \usepackage{float}
|
---|
| 16 | \floatstyle{ruled}
|
---|
| 17 | \newfloat{result}{thp}{lop}
|
---|
| 18 | \floatname{result}{Result}
|
---|
| 19 |
|
---|
| 20 | \title{Churn in Telecom dataset \\
|
---|
| 21 | \large{Databases and Datamining, 2009}}
|
---|
| 22 | \author{Jonathan Vis, Rick van der Zwet\\
|
---|
| 23 | \texttt{<$\{$jvis,hvdzwet$\}$@liacs.nl>}}
|
---|
| 24 | \date{\today}
|
---|
| 25 |
|
---|
| 26 |
|
---|
| 27 | \begin{document}
|
---|
| 28 | \newcommand{\wekacmd}[1]{\begin{quote}\small{\texttt{#1}}\end{quote}}
|
---|
| 29 | \newcommand{\unixcmd}[1]{\begin{quote}\small{\texttt{#1}}\end{quote}}
|
---|
| 30 |
|
---|
| 31 |
|
---|
| 32 | \maketitle
|
---|
| 33 |
|
---|
| 34 | \section{Introduction}
|
---|
| 35 | This report is focused towards finding association rule learning to find
|
---|
| 36 | relations between variables in large databases. This will be done using
|
---|
| 37 | Weka\footnote{http://www.cs.waikato.ac.nz/ml/weka/} and a telecom churn
|
---|
| 38 | dataset\footnote{http://www.dataminingconsultant.com/DKD.htm}.
|
---|
| 39 |
|
---|
| 40 | \section{Problem description}
|
---|
| 41 | \emph{Churning} -moving to a different company)- today is still a major deal
|
---|
| 42 | within companies. Having to understand why a customer choose to go for an other
|
---|
| 43 | company is crusial in finding flaws in the product-range or services. As more
|
---|
| 44 | and more data about the consumer get stored, trying to find relations why
|
---|
| 45 | he/she churned is becoming more and more interesting.
|
---|
| 46 |
|
---|
| 47 | \section{Statistics}
|
---|
| 48 | Our dataset has 3333 entries and 21 attributes, which the charactistics shown
|
---|
| 49 | in table~\ref{overview}.
|
---|
| 50 | \begin{table}[ht]
|
---|
| 51 | \caption{Statistical report of churn dataset}
|
---|
| 52 | \centering
|
---|
| 53 | \begin{tabular}{l || r | r | r | r | r | r | r | r | r |}
|
---|
| 54 | Item & Type & Distinct & Missing & Unique & Min & Max & Mean & StdDev \\
|
---|
| 55 | \hline \hline
|
---|
| 56 | State & Nominal & 51 & 0 & NaN & NaN & NaN & NaN & NaN \\
|
---|
| 57 | Account Length & Numeric & 212 & 0 & 16 & 1 & 1 & 101 & 40 \\
|
---|
| 58 | Area Code & Numeric & 3 & 0 & 0 & 408 & 510 & 437 & 42 \\
|
---|
| 59 | Phone & Nominal & 3333 & 0 & 3333 & NaN & NaN & NaN & NaN \\
|
---|
| 60 | Int'l Plan & Nominal & 2 & 0 & NaN & NaN & NaN & NaN & NaN \\
|
---|
| 61 | VMail Plan & Nominal & 2 & 0 & NaN & NaN & NaN & NaN & NaN \\
|
---|
| 62 | VMail Msg & Numeric & 46 & 0 & 4 & 0 & 51 & 8 & 14 \\
|
---|
| 63 | Day Mins & Numeric & 1667 & 0 & 770 & 0 & 351 & 180 & 54 \\
|
---|
| 64 | Days Calls & Numeric & 119 & 0 & 10 & 0 & 165 & 100 & 20 \\
|
---|
| 65 | Days Charge & Numeric & 1667 & 0 & 770 & 0 & 60 & 30 & 9 \\
|
---|
| 66 | Eve Mins & Numeric & 1611 & 0 & 709 & 0 & 364 & 201 & 51 \\
|
---|
| 67 | Eve Calls & Numeric & 123 & 0 & 17 & 0 & 170 & 100 & 20 \\
|
---|
| 68 | Eve Charge & Numeric & 1440 & 0 & 585 & 0 & 31 & 17 & 4 \\
|
---|
| 69 | Night Mins & Numeric & 1591 & 0 & 586 & 23 & 395 & 201 & 51 \\
|
---|
| 70 | Night Calls & Numeric & 120 & 0 & 11 & 33 & 175 & 100 & 19 \\
|
---|
| 71 | Night Charge & Numeric & 933 & 0 & 236 & 1 & 18 & 9 & 2 \\
|
---|
| 72 | Intl Mins & Numeric & 162 & 0 & 16 & 0 & 20 & 10 & 3 \\
|
---|
| 73 | Intl Calls & Numeric & 21 & 0 & 3 & 0 & 20 & 4.5 & 2.5 \\
|
---|
| 74 | Intl Charge & Numeric & 162 & 0 & 16 & 0 & 5.4 & 2.8 & 0.8 \\
|
---|
| 75 | CusServ Calls & Numeric & 10 & 0 & 0 & 0 & 9 & 1.5 & 1.3 \\
|
---|
| 76 | Churn & Nominal & 2 & 0 & 0 & NaN & NaN & NaN & NaN
|
---|
| 77 | \end{tabular}
|
---|
| 78 | \label{overview}
|
---|
| 79 | \end{table}
|
---|
| 80 |
|
---|
| 81 | We can consider ourself lucky by having an complete dataset. None of the
|
---|
| 82 | attributes is missing at an entry. How-ever this does not mean the data is
|
---|
| 83 | considered error-free. There might be human-errors or others of some kind
|
---|
| 84 | inside the dataset.
|
---|
| 85 |
|
---|
| 86 | \section{Approach}
|
---|
| 87 | As finding association rules needs discrete values, we will discretize the
|
---|
| 88 | attributes tagged as Numeric in table~\ref{overview}. At this process we will
|
---|
| 89 | take a special look of the actual meaning of the attribute. Having 3.5 Customer
|
---|
| 90 | Server Calls is going to be a bit impossible, so make sure to set the binaries
|
---|
| 91 | of all \emph(bins) to rounded values. We will also try \emph(Weka) feature of
|
---|
| 92 | automatic discretizion: \wekacmd{weka.filters.unsupervised.attribute.Discretize
|
---|
| 93 | -unset-class-temporarily -O -B 10 -M -1.0 -R first-last}.
|
---|
| 94 |
|
---|
| 95 | Secondly table~\ref{overview} shows an number of entries which are related.
|
---|
| 96 | Like for example \emph{Day Calls} and \emph{Day Charge}. One could argue that
|
---|
| 97 | both are relevant, as an consumer might churn as making many calls turns out to
|
---|
| 98 | be problematic (bad signal, quality for example). While you could also argue
|
---|
| 99 | the price will determine the churn. We will discard the values related to
|
---|
| 100 | minutes and call numbers and solely focus on the \emph{Charge}.
|
---|
| 101 | The \emph{Phone} attribute also shows some interesting feature. We will make an
|
---|
| 102 | new attributecalled \emph{Phone-prefix} which is the first 3 numbers of the
|
---|
| 103 | \emph{Phone} number, to see whether this give some fine gain grouping over area
|
---|
| 104 | code. Also the combination \emph{Area Code}-\emph{Phone-prefix} will be
|
---|
| 105 | researched.
|
---|
| 106 |
|
---|
| 107 | \section{Implementaion}
|
---|
| 108 | Creating phone-prefix column using standard unix tools:
|
---|
| 109 | \unixcmd{cut -f 4 -d, churn\_ooo.csv | cut -c 1-4,10 | paste -d, -
|
---|
| 110 | churn\_ooo.csv | sed '1s/"Pho/"Phone-Prefix"/' > churn\_parsed.csv}. Using
|
---|
| 111 | \emph{Weka} deleted the columns \texttt{Day Mins, Day Calls, Eve Mins, Eve
|
---|
| 112 | Calls, Night Mins, Night Calls, Intl Mins, Intl Calls} as we believe they are
|
---|
| 113 | subsets of \emph{Charge}.
|
---|
| 114 | \emph{Phone} is a unique identifier for every entry, not allowing any
|
---|
| 115 | generalization. So it is ignored/deleted.
|
---|
| 116 |
|
---|
| 117 | Using \emph{Weka} embedded discretize function on all Numeric Columns of
|
---|
| 118 | table~\ref{overview}. Any charge value was taken to be full integer values e.g
|
---|
| 119 | rounded currency: \wekacmd{weka.filters.unsupervised.attribute.NumericCleaner
|
---|
| 120 | -min -1.7E308 -min-default -1.708 -max 1.7E308 -max-default 1.7E308 -closeto 0.0
|
---|
| 121 | -closeto-default 0.0 -closeto-tolerance 1.0E-6 -R 8-11 -decimals 0} And made
|
---|
| 122 | discrete: \wekacmd{weka.filters.unsupervised.attribute.NumericToNominal -R 8-11}
|
---|
| 123 |
|
---|
| 124 | Account length was set to be a bin of 'weight' 1, assuming a 1
|
---|
| 125 | months\footnote{This might as well be days, years of some other value, but
|
---|
| 126 | assuming fixed phone contracts, given the range (0-244) months seems to be
|
---|
| 127 | make most sense} contract:
|
---|
| 128 | \wekacmd{weka.filters.unsupervised.attribute.NumericToNominal -R 3}
|
---|
| 129 |
|
---|
| 130 | \emph{Phone-Prefix} is set to be a set on every number unique:
|
---|
| 131 | \wekacmd{weka.filters.unsupervised.attribute.NumericToNominal -R 1}
|
---|
| 132 | \emph{CustServ Calls} it set to be rounded values, as one cannot make half calls:
|
---|
| 133 | \wekacmd{weka.filters.unsupervised.attribute.NumericToNominal -R 12}
|
---|
| 134 |
|
---|
| 135 | Then running the association algoritm \emph{Apriori}, with the \emph{Churn}
|
---|
| 136 | value as the result of the equations:
|
---|
| 137 | \begin{result}
|
---|
| 138 | \caption{ association algoritm \emph{Apriori} - run 1}
|
---|
| 139 | \scriptsize
|
---|
| 140 | \begin{verbatim}
|
---|
| 141 | 1. Area Code=415 Int'l Plan=no VMail Plan=yes 423 ==> Churn?=False. 405 conf :(0.96)
|
---|
| 142 | 2. Int'l Plan=no VMail Plan=yes 830 ==> Churn?=False. 786 conf:(0.95)
|
---|
| 143 | ...
|
---|
| 144 | \end{verbatim}
|
---|
| 145 | \end{result}
|
---|
| 146 |
|
---|
| 147 | Mostly negative results e.g. proving when a consumer is not going to churn.
|
---|
| 148 | Secondly data seems to specialise, rule 1 for example is a specialisation of
|
---|
| 149 | rule 2. Try running without the requirement that churn needs to be on the right
|
---|
| 150 | end of the rule.
|
---|
| 151 |
|
---|
| 152 | \begin{result}
|
---|
| 153 | \caption{ association algoritm \emph{Apriori} - run 2}
|
---|
| 154 | \begin{verbatim}
|
---|
| 155 | 1. VMail Message=0 2411 ==> VMail Plan=no 2411 conf:(1)
|
---|
| 156 | 2. VMail Plan=no 2411 ==> VMail Message=0 2411 conf:(1)
|
---|
| 157 | ...
|
---|
| 158 | \end{verbatim}
|
---|
| 159 | \end{result}
|
---|
| 160 |
|
---|
| 161 | Seems like \emph{VMail Plan=no} seems to equal the \emph{VMail Message=0}. So
|
---|
| 162 | \emph{VMail Plan} can be safely deleted from the attribute list. \emph{VMail
|
---|
| 163 | Message} does not seems to be a very clear description for it's claimed
|
---|
| 164 | purpose. Running without \emph{VMail Plan} did not show improvement. Hence we
|
---|
| 165 | decided to go for an normalisation on the Churn number. By taking a random
|
---|
| 166 | sample of $Churn=False$ values such that it equals the number of $Churn=True$
|
---|
| 167 | values\footnote{\scriptsize\texttt{sort -r -t, -k 22,22 churn\_parsed.csv | sed
|
---|
| 168 | '484,2366d' > churn\_equal.csv}} we re-run the experiments, but found no
|
---|
| 169 | improvement in the experiments.
|
---|
| 170 |
|
---|
| 171 | \section{Conclusions}
|
---|
| 172 | The churn dataset does not classify itself properly associations rules. Mainly
|
---|
| 173 | due to the fact that the so called 'hidden factors' for churning, like `if
|
---|
| 174 | calling more than X minutes at rate Y I will churn'. cannot be mined using this
|
---|
| 175 | current dataset. Further research could include this relations by means of
|
---|
| 176 | formula's, but it require domain specific knowledge to include for example
|
---|
| 177 | relations between \emph{Day Min} and {Day Calls}. Alternative methods like
|
---|
| 178 | scat and plat analysis\footnote{Like done at
|
---|
| 179 | http://meru.cecs.missouri.edu/courses/cecs401\_data\_mining/projects/group2/finproject1.htm}
|
---|
| 180 | seems to lead to more promensing results. This could (of course) also be done
|
---|
| 181 | in Weka. Take for a brief preview in Appendix 1.
|
---|
| 182 | \newpage
|
---|
| 183 | \section{Appendix 1}
|
---|
| 184 | Result using \emph{Weka} classifier: \wekacmd{weka.classifiers.trees.J48 -C 0.25
|
---|
| 185 | -M 2} shows interesting details, like \wekacmd{\scriptsize'Day Mins' > 254.4 and 'VMail
|
---|
| 186 | Plan' = no and 'Eve Mins > 187.7 => True.}
|
---|
| 187 | \begin{result}
|
---|
| 188 | \caption{J48 pruned tree of raw churn dataset}
|
---|
| 189 | \scriptsize
|
---|
| 190 | \begin{verbatim}
|
---|
| 191 | === Classifier model (full training set) ===
|
---|
| 192 | J48 pruned tree
|
---|
| 193 | ------------------
|
---|
| 194 | Day Mins <= 264.4
|
---|
| 195 | | CustServ Calls <= 3
|
---|
| 196 | | | Int'l Plan = no
|
---|
| 197 | | | | Day Mins <= 223.2: False. (2221.0/60.0)
|
---|
| 198 | | | | Day Mins > 223.2
|
---|
| 199 | | | | | Eve Mins <= 242.3: False. (296.0/22.0)
|
---|
| 200 | | | | | Eve Mins > 242.3
|
---|
| 201 | | | | | | VMail Plan = yes: False. (20.0)
|
---|
| 202 | | | | | | VMail Plan = no
|
---|
| 203 | | | | | | | Night Mins <= 174.2
|
---|
| 204 | | | | | | | | Day Mins <= 246.8: False. (12.0)
|
---|
| 205 | | | | | | | | Day Mins > 246.8: True. (5.0/1.0)
|
---|
| 206 | | | | | | | Night Mins > 174.2: True. (50.0/8.0)
|
---|
| 207 | | | Int'l Plan = yes
|
---|
| 208 | | | | Intl Calls <= 2: True. (51.0)
|
---|
| 209 | | | | Intl Calls > 2
|
---|
| 210 | | | | | Intl Mins <= 13.1: False. (173.0/7.0)
|
---|
| 211 | | | | | Intl Mins > 13.1: True. (43.0)
|
---|
| 212 | | CustServ Calls > 3
|
---|
| 213 | | | Day Mins <= 160.2
|
---|
| 214 | | | | Eve Charge <= 19.83: True. (79.0/3.0)
|
---|
| 215 | | | | Eve Charge > 19.83
|
---|
| 216 | | | | | Day Mins <= 120.5: True. (10.0)
|
---|
| 217 | | | | | Day Mins > 120.5: False. (13.0/3.0)
|
---|
| 218 | | | Day Mins > 160.2
|
---|
| 219 | | | | Eve Charge <= 12.05
|
---|
| 220 | | | | | Eve Calls <= 125: True. (16.0/2.0)
|
---|
| 221 | | | | | Eve Calls > 125: False. (3.0)
|
---|
| 222 | | | | Eve Charge > 12.05: False. (130.0/24.0)
|
---|
| 223 | Day Mins > 264.4
|
---|
| 224 | | VMail Plan = yes: False. (53.0/6.0)
|
---|
| 225 | | VMail Plan = no
|
---|
| 226 | | | Eve Mins <= 187.7
|
---|
| 227 | | | | Day Mins <= 280.4: False. (30.0/7.0)
|
---|
| 228 | | | | Day Mins > 280.4: True. (27.0/9.0)
|
---|
| 229 | | | Eve Mins > 187.7: True. (101.0/5.0)
|
---|
| 230 | \end{verbatim}
|
---|
| 231 | \end{result}
|
---|
| 232 | \end{document}
|
---|