1 | %
|
---|
2 | % $Id: report.tex 571 2008-04-20 17:31:04Z rick $
|
---|
3 | %
|
---|
4 |
|
---|
5 | \documentclass[12pt,a4paper]{article}
|
---|
6 |
|
---|
7 | \frenchspacing
|
---|
8 | \usepackage[english,dutch]{babel}
|
---|
9 | \selectlanguage{dutch}
|
---|
10 | \usepackage{graphicx}
|
---|
11 | \usepackage{url}
|
---|
12 | \usepackage{multicol}
|
---|
13 | \usepackage{fancybox}
|
---|
14 | \usepackage{amssymb,amsmath}
|
---|
15 | \usepackage{float}
|
---|
16 | \floatstyle{ruled}
|
---|
17 | \newfloat{result}{thp}{lop}
|
---|
18 | \floatname{result}{Result}
|
---|
19 |
|
---|
20 | \title{Churn in Telecom dataset \\
|
---|
21 | \large{Databases and Datamining, 2009}}
|
---|
22 | \author{Jonathan Vis, Rick van der Zwet\\
|
---|
23 | \texttt{<$\{$jvis,hvdzwet$\}$@liacs.nl>}}
|
---|
24 | \date{\today}
|
---|
25 |
|
---|
26 |
|
---|
27 | \begin{document}
|
---|
28 | \newcommand{\wekacmd}[1]{\begin{quote}\small{\texttt{#1}}\end{quote}}
|
---|
29 | \newcommand{\unixcmd}[1]{\begin{quote}\small{\texttt{#1}}\end{quote}}
|
---|
30 |
|
---|
31 |
|
---|
32 | \maketitle
|
---|
33 |
|
---|
34 | \section{Introduction}
|
---|
35 | This report is focused towards finding association rule learning to find
|
---|
36 | relations between variables in large databases. This will be done using
|
---|
37 | Weka\footnote{http://www.cs.waikato.ac.nz/ml/weka/} and a telecom churn
|
---|
38 | dataset\footnote{http://www.dataminingconsultant.com/DKD.htm}.
|
---|
39 |
|
---|
40 | \section{Problem description}
|
---|
41 | \emph{Churning} -moving to a different company)- today is still a major deal
|
---|
42 | within companies. Having to understand why a customer choose to go for an other
|
---|
43 | company is crusial in finding flaws in the product-range or services. As more
|
---|
44 | and more data about the consumer get stored, trying to find relations why
|
---|
45 | he/she churned is becoming more and more interesting.
|
---|
46 |
|
---|
47 | \section{Statistics}
|
---|
48 | Our dataset has 3333 entries and 21 attributes, which the charactistics shown
|
---|
49 | in table~\ref{overview}.
|
---|
50 | \begin{table}[ht]
|
---|
51 | \caption{Statistical report of churn dataset}
|
---|
52 | \centering
|
---|
53 | \begin{tabular}{l || r | r | r | r | r | r | r | r | r |}
|
---|
54 | Item & Type & Distinct & Missing & Unique & Min & Max & Mean & StdDev \\
|
---|
55 | \hline \hline
|
---|
56 | State & Nominal & 51 & 0 & NaN & NaN & NaN & NaN & NaN \\
|
---|
57 | Account Length & Numeric & 212 & 0 & 16 & 1 & 1 & 101 & 40 \\
|
---|
58 | Area Code & Numeric & 3 & 0 & 0 & 408 & 510 & 437 & 42 \\
|
---|
59 | Phone & Nominal & 3333 & 0 & 3333 & NaN & NaN & NaN & NaN \\
|
---|
60 | Int'l Plan & Nominal & 2 & 0 & NaN & NaN & NaN & NaN & NaN \\
|
---|
61 | VMail Plan & Nominal & 2 & 0 & NaN & NaN & NaN & NaN & NaN \\
|
---|
62 | VMail Msg & Numeric & 46 & 0 & 4 & 0 & 51 & 8 & 14 \\
|
---|
63 | Day Mins & Numeric & 1667 & 0 & 770 & 0 & 351 & 180 & 54 \\
|
---|
64 | Days Calls & Numeric & 119 & 0 & 10 & 0 & 165 & 100 & 20 \\
|
---|
65 | Days Charge & Numeric & 1667 & 0 & 770 & 0 & 60 & 30 & 9 \\
|
---|
66 | Eve Mins & Numeric & 1611 & 0 & 709 & 0 & 364 & 201 & 51 \\
|
---|
67 | Eve Calls & Numeric & 123 & 0 & 17 & 0 & 170 & 100 & 20 \\
|
---|
68 | Eve Charge & Numeric & 1440 & 0 & 585 & 0 & 31 & 17 & 4 \\
|
---|
69 | Night Mins & Numeric & 1591 & 0 & 586 & 23 & 395 & 201 & 51 \\
|
---|
70 | Night Calls & Numeric & 120 & 0 & 11 & 33 & 175 & 100 & 19 \\
|
---|
71 | Night Charge & Numeric & 933 & 0 & 236 & 1 & 18 & 9 & 2 \\
|
---|
72 | Intl Mins & Numeric & 162 & 0 & 16 & 0 & 20 & 10 & 3 \\
|
---|
73 | Intl Calls & Numeric & 21 & 0 & 3 & 0 & 20 & 4.5 & 2.5 \\
|
---|
74 | Intl Charge & Numeric & 162 & 0 & 16 & 0 & 5.4 & 2.8 & 0.8 \\
|
---|
75 | CusServ Calls & Numeric & 10 & 0 & 0 & 0 & 9 & 1.5 & 1.3 \\
|
---|
76 | Churn & Nominal & 2 & 0 & 0 & NaN & NaN & NaN & NaN
|
---|
77 | \end{tabular}
|
---|
78 | \label{overview}
|
---|
79 | \end{table}
|
---|
80 |
|
---|
81 | We can consider ourself lucky by having an complete dataset. None of the
|
---|
82 | attributes is missing at an entry. How-ever this does not mean the data is
|
---|
83 | considered error-free. There might be human-errors or others of some kind
|
---|
84 | inside the dataset.
|
---|
85 |
|
---|
86 | \section{Approach}
|
---|
87 | As finding association rules needs discrete values, we will discretize the
|
---|
88 | attributes tagged as Numeric in table~\ref{overview}. At this process we will
|
---|
89 | take a special look of the actual meaning of the attribute. Having 3.5 Customer
|
---|
90 | Server Calls is going to be a bit impossible, so make sure to set the binaries
|
---|
91 | of all \emph(bins) to rounded values. We will also try \emph(Weka) feature of
|
---|
92 | automatic discretizion: \wekacmd{weka.filters.unsupervised.attribute.Discretize
|
---|
93 | -unset-class-temporarily -O -B 10 -M -1.0 -R first-last}.
|
---|
94 |
|
---|
95 | Secondly table~\ref{overview} shows an number of entries which are related.
|
---|
96 | Like for example \emph{Day Calls} and \emph{Day Charge}. One could argue that
|
---|
97 | both are relevant, as an consumer might churn as making many calls turns out to
|
---|
98 | be problematic (bad signal, quality for example). While you could also argue
|
---|
99 | the price will determine the churn. We will discard the values related to
|
---|
100 | minutes and call numbers and solely focus on the \emph{Charge}.
|
---|
101 | The \emph{Phone} attribute also shows some interesting feature. We will make an
|
---|
102 | new attributecalled \emph{Phone-prefix} which is the first 3 numbers of the
|
---|
103 | \emph{Phone} number, to see whether this give some fine gain grouping over area
|
---|
104 | code. Also the combination \emph{Area Code}-\emph{Phone-prefix} will be
|
---|
105 | researched.
|
---|
106 |
|
---|
107 | \section{Implementaion}
|
---|
108 | Creating phone-prefix column using standard unix tools:
|
---|
109 | \unixcmd{cut -f 4 -d, churn\_ooo.csv | cut -c 1-4,10 | paste -d, -
|
---|
110 | churn\_ooo.csv | sed '1s/"Pho/"Phone-Prefix"/' > churn\_parsed.csv}. Using
|
---|
111 | \emph{Weka} deleted the columns \texttt{Day Mins, Day Calls, Eve Mins, Eve
|
---|
112 | Calls, Night Mins, Night Calls, Intl Mins, Intl Calls} as we believe they are
|
---|
113 | subsets of \emph{Charge}.
|
---|
114 | \emph{Phone} is a unique identifier for every entry, not allowing any
|
---|
115 | generalization. So it is ignored/deleted.
|
---|
116 |
|
---|
117 | Using \emph{Weka} embedded discretize function on all Numeric Columns of
|
---|
118 | table~\ref{overview}. Any charge value was taken to be full integer values e.g
|
---|
119 | rounded currency: \wekacmd{weka.filters.unsupervised.attribute.NumericCleaner
|
---|
120 | -min -1.7E308 -min-default -1.708 -max 1.7E308 -max-default 1.7E308 -closeto 0.0
|
---|
121 | -closeto-default 0.0 -closeto-tolerance 1.0E-6 -R 8-11 -decimals 0} And made
|
---|
122 | discrete: \wekacmd{weka.filters.unsupervised.attribute.NumericToNominal -R 8-11}
|
---|
123 |
|
---|
124 | Account length was set to be a bin of 'weight' 1, assuming a 1
|
---|
125 | months\footnote{This might as well be days, years of some other value, but
|
---|
126 | assuming fixed phone contracts, given the range (0-244) months seems to be
|
---|
127 | make most sense} contract:
|
---|
128 | \wekacmd{weka.filters.unsupervised.attribute.NumericToNominal -R 3}
|
---|
129 |
|
---|
130 | \emph{Phone-Prefix} is set to be a set on every number unique:
|
---|
131 | \wekacmd{weka.filters.unsupervised.attribute.NumericToNominal -R 1}
|
---|
132 | \emph{CustServ Calls} it set to be rounded values, as one cannot make half calls:
|
---|
133 | \wekacmd{weka.filters.unsupervised.attribute.NumericToNominal -R 12}
|
---|
134 |
|
---|
135 | Then running the association algoritm \emph{Apriori}, with the \emph{Churn}
|
---|
136 | value as the result of the equations:
|
---|
137 | \begin{result}
|
---|
138 | \caption{ association algoritm \emph{Apriori} - run 1}
|
---|
139 | \scriptsize
|
---|
140 | \begin{verbatim}
|
---|
141 | 1. Area Code=415 Int'l Plan=no VMail Plan=yes 423 ==> Churn?=False. 405 conf :(0.96)
|
---|
142 | 2. Int'l Plan=no VMail Plan=yes 830 ==> Churn?=False. 786 conf:(0.95)
|
---|
143 | ...
|
---|
144 | \end{verbatim}
|
---|
145 | \end{result}
|
---|
146 |
|
---|
147 | Mostly negative results e.g. proving when a consumer is not going to churn.
|
---|
148 | Secondly data seems to specialise, rule 1 for example is a specialisation of
|
---|
149 | rule 2. Try running without the requirement that churn needs to be on the right
|
---|
150 | end of the rule.
|
---|
151 |
|
---|
152 | \begin{result}
|
---|
153 | \caption{ association algoritm \emph{Apriori} - run 2}
|
---|
154 | \begin{verbatim}
|
---|
155 | 1. VMail Message=0 2411 ==> VMail Plan=no 2411 conf:(1)
|
---|
156 | 2. VMail Plan=no 2411 ==> VMail Message=0 2411 conf:(1)
|
---|
157 | ...
|
---|
158 | \end{verbatim}
|
---|
159 | \end{result}
|
---|
160 |
|
---|
161 | Seems like \emph{VMail Plan=no} seems to equal the \emph{VMail Message=0}. So
|
---|
162 | \emph{VMail Plan} can be safely deleted from the attribute list. \emph{VMail
|
---|
163 | Message} does not seems to be a very clear description for it's claimed
|
---|
164 | purpose. Running without \emph{VMail Plan} did not show improvement. Hence we
|
---|
165 | decided to go for an normalisation on the Churn number. By taking a random
|
---|
166 | sample of $Churn=False$ values such that it equals the number of $Churn=True$
|
---|
167 | values\footnote{\scriptsize\texttt{sort -r -t, -k 22,22 churn\_parsed.csv | sed
|
---|
168 | '484,2366d' > churn\_equal.csv}} we re-run the experiments, but found no
|
---|
169 | improvement in the experiments.
|
---|
170 |
|
---|
171 | \section{Conclusions}
|
---|
172 | The churn dataset does not classify itself properly associations rules. Mainly
|
---|
173 | due to the fact that the so called 'hidden factors' for churning, like `if
|
---|
174 | calling more than X minutes at rate Y I will churn'. cannot be mined using this
|
---|
175 | current dataset. Further research could include this relations by means of
|
---|
176 | formula's, but it require domain specific knowledge to include for example
|
---|
177 | relations between \emph{Day Min} and {Day Calls}. Alternative methods like
|
---|
178 | scat and plat analysis\footnote{Like done at
|
---|
179 | http://meru.cecs.missouri.edu/courses/cecs401\_data\_mining/projects/group2/finproject1.htm}
|
---|
180 | seems to lead to more promensing results. This could (of course) also be done
|
---|
181 | in Weka. Take for a brief preview in Appendix 1.
|
---|
182 | \newpage
|
---|
183 | \section{Appendix 1}
|
---|
184 | Result using \emph{Weka} classifier: \wekacmd{weka.classifiers.trees.J48 -C 0.25
|
---|
185 | -M 2} shows interesting details, like \wekacmd{\scriptsize'Day Mins' > 254.4 and 'VMail
|
---|
186 | Plan' = no and 'Eve Mins > 187.7 => True.}
|
---|
187 | \begin{result}
|
---|
188 | \caption{J48 pruned tree of raw churn dataset}
|
---|
189 | \scriptsize
|
---|
190 | \begin{verbatim}
|
---|
191 | === Classifier model (full training set) ===
|
---|
192 | J48 pruned tree
|
---|
193 | ------------------
|
---|
194 | Day Mins <= 264.4
|
---|
195 | | CustServ Calls <= 3
|
---|
196 | | | Int'l Plan = no
|
---|
197 | | | | Day Mins <= 223.2: False. (2221.0/60.0)
|
---|
198 | | | | Day Mins > 223.2
|
---|
199 | | | | | Eve Mins <= 242.3: False. (296.0/22.0)
|
---|
200 | | | | | Eve Mins > 242.3
|
---|
201 | | | | | | VMail Plan = yes: False. (20.0)
|
---|
202 | | | | | | VMail Plan = no
|
---|
203 | | | | | | | Night Mins <= 174.2
|
---|
204 | | | | | | | | Day Mins <= 246.8: False. (12.0)
|
---|
205 | | | | | | | | Day Mins > 246.8: True. (5.0/1.0)
|
---|
206 | | | | | | | Night Mins > 174.2: True. (50.0/8.0)
|
---|
207 | | | Int'l Plan = yes
|
---|
208 | | | | Intl Calls <= 2: True. (51.0)
|
---|
209 | | | | Intl Calls > 2
|
---|
210 | | | | | Intl Mins <= 13.1: False. (173.0/7.0)
|
---|
211 | | | | | Intl Mins > 13.1: True. (43.0)
|
---|
212 | | CustServ Calls > 3
|
---|
213 | | | Day Mins <= 160.2
|
---|
214 | | | | Eve Charge <= 19.83: True. (79.0/3.0)
|
---|
215 | | | | Eve Charge > 19.83
|
---|
216 | | | | | Day Mins <= 120.5: True. (10.0)
|
---|
217 | | | | | Day Mins > 120.5: False. (13.0/3.0)
|
---|
218 | | | Day Mins > 160.2
|
---|
219 | | | | Eve Charge <= 12.05
|
---|
220 | | | | | Eve Calls <= 125: True. (16.0/2.0)
|
---|
221 | | | | | Eve Calls > 125: False. (3.0)
|
---|
222 | | | | Eve Charge > 12.05: False. (130.0/24.0)
|
---|
223 | Day Mins > 264.4
|
---|
224 | | VMail Plan = yes: False. (53.0/6.0)
|
---|
225 | | VMail Plan = no
|
---|
226 | | | Eve Mins <= 187.7
|
---|
227 | | | | Day Mins <= 280.4: False. (30.0/7.0)
|
---|
228 | | | | Day Mins > 280.4: True. (27.0/9.0)
|
---|
229 | | | Eve Mins > 187.7: True. (101.0/5.0)
|
---|
230 | \end{verbatim}
|
---|
231 | \end{result}
|
---|
232 | \end{document}
|
---|