source: liacs/dbdm/dbdm_3/report.tex@ 314

Last change on this file since 314 was 2, checked in by Rick van der Zwet, 15 years ago

Initial import of data of old repository ('data') worth keeping (e.g. tracking
means of URL access statistics)

File size: 10.1 KB
RevLine 
[2]1%
2% $Id: report.tex 571 2008-04-20 17:31:04Z rick $
3%
4
5\documentclass[12pt,a4paper]{article}
6
7\frenchspacing
8\usepackage[english,dutch]{babel}
9\selectlanguage{dutch}
10\usepackage{graphicx}
11\usepackage{url}
12\usepackage{multicol}
13\usepackage{fancybox}
14\usepackage{amssymb,amsmath}
15\usepackage{float}
16\floatstyle{ruled}
17\newfloat{result}{thp}{lop}
18\floatname{result}{Result}
19
20\title{Churn in Telecom dataset \\
21\large{Databases and Datamining, 2009}}
22\author{Jonathan Vis, Rick van der Zwet\\
23 \texttt{<$\{$jvis,hvdzwet$\}$@liacs.nl>}}
24\date{\today}
25
26
27\begin{document}
28\newcommand{\wekacmd}[1]{\begin{quote}\small{\texttt{#1}}\end{quote}}
29\newcommand{\unixcmd}[1]{\begin{quote}\small{\texttt{#1}}\end{quote}}
30
31
32\maketitle
33
34\section{Introduction}
35This report is focused towards finding association rule learning to find
36relations between variables in large databases. This will be done using
37Weka\footnote{http://www.cs.waikato.ac.nz/ml/weka/} and a telecom churn
38dataset\footnote{http://www.dataminingconsultant.com/DKD.htm}.
39
40\section{Problem description}
41\emph{Churning} -moving to a different company)- today is still a major deal
42within companies. Having to understand why a customer choose to go for an other
43company is crusial in finding flaws in the product-range or services. As more
44and more data about the consumer get stored, trying to find relations why
45he/she churned is becoming more and more interesting.
46
47\section{Statistics}
48Our dataset has 3333 entries and 21 attributes, which the charactistics shown
49in table~\ref{overview}.
50\begin{table}[ht]
51\caption{Statistical report of churn dataset}
52\centering
53\begin{tabular}{l || r | r | r | r | r | r | r | r | r |}
54Item & Type & Distinct & Missing & Unique & Min & Max & Mean & StdDev \\
55\hline \hline
56State & Nominal & 51 & 0 & NaN & NaN & NaN & NaN & NaN \\
57Account Length & Numeric & 212 & 0 & 16 & 1 & 1 & 101 & 40 \\
58Area Code & Numeric & 3 & 0 & 0 & 408 & 510 & 437 & 42 \\
59Phone & Nominal & 3333 & 0 & 3333 & NaN & NaN & NaN & NaN \\
60Int'l Plan & Nominal & 2 & 0 & NaN & NaN & NaN & NaN & NaN \\
61VMail Plan & Nominal & 2 & 0 & NaN & NaN & NaN & NaN & NaN \\
62VMail Msg & Numeric & 46 & 0 & 4 & 0 & 51 & 8 & 14 \\
63Day Mins & Numeric & 1667 & 0 & 770 & 0 & 351 & 180 & 54 \\
64Days Calls & Numeric & 119 & 0 & 10 & 0 & 165 & 100 & 20 \\
65Days Charge & Numeric & 1667 & 0 & 770 & 0 & 60 & 30 & 9 \\
66Eve Mins & Numeric & 1611 & 0 & 709 & 0 & 364 & 201 & 51 \\
67Eve Calls & Numeric & 123 & 0 & 17 & 0 & 170 & 100 & 20 \\
68Eve Charge & Numeric & 1440 & 0 & 585 & 0 & 31 & 17 & 4 \\
69Night Mins & Numeric & 1591 & 0 & 586 & 23 & 395 & 201 & 51 \\
70Night Calls & Numeric & 120 & 0 & 11 & 33 & 175 & 100 & 19 \\
71Night Charge & Numeric & 933 & 0 & 236 & 1 & 18 & 9 & 2 \\
72Intl Mins & Numeric & 162 & 0 & 16 & 0 & 20 & 10 & 3 \\
73Intl Calls & Numeric & 21 & 0 & 3 & 0 & 20 & 4.5 & 2.5 \\
74Intl Charge & Numeric & 162 & 0 & 16 & 0 & 5.4 & 2.8 & 0.8 \\
75CusServ Calls & Numeric & 10 & 0 & 0 & 0 & 9 & 1.5 & 1.3 \\
76Churn & Nominal & 2 & 0 & 0 & NaN & NaN & NaN & NaN
77\end{tabular}
78\label{overview}
79\end{table}
80
81We can consider ourself lucky by having an complete dataset. None of the
82attributes is missing at an entry. How-ever this does not mean the data is
83considered error-free. There might be human-errors or others of some kind
84inside the dataset.
85
86\section{Approach}
87As finding association rules needs discrete values, we will discretize the
88attributes tagged as Numeric in table~\ref{overview}. At this process we will
89take a special look of the actual meaning of the attribute. Having 3.5 Customer
90Server Calls is going to be a bit impossible, so make sure to set the binaries
91of all \emph(bins) to rounded values. We will also try \emph(Weka) feature of
92automatic discretizion: \wekacmd{weka.filters.unsupervised.attribute.Discretize
93-unset-class-temporarily -O -B 10 -M -1.0 -R first-last}.
94
95Secondly table~\ref{overview} shows an number of entries which are related.
96Like for example \emph{Day Calls} and \emph{Day Charge}. One could argue that
97both are relevant, as an consumer might churn as making many calls turns out to
98be problematic (bad signal, quality for example). While you could also argue
99the price will determine the churn. We will discard the values related to
100minutes and call numbers and solely focus on the \emph{Charge}.
101The \emph{Phone} attribute also shows some interesting feature. We will make an
102new attributecalled \emph{Phone-prefix} which is the first 3 numbers of the
103\emph{Phone} number, to see whether this give some fine gain grouping over area
104code. Also the combination \emph{Area Code}-\emph{Phone-prefix} will be
105researched.
106
107\section{Implementaion}
108Creating phone-prefix column using standard unix tools:
109\unixcmd{cut -f 4 -d, churn\_ooo.csv | cut -c 1-4,10 | paste -d, -
110churn\_ooo.csv | sed '1s/"Pho/"Phone-Prefix"/' > churn\_parsed.csv}. Using
111\emph{Weka} deleted the columns \texttt{Day Mins, Day Calls, Eve Mins, Eve
112Calls, Night Mins, Night Calls, Intl Mins, Intl Calls} as we believe they are
113subsets of \emph{Charge}.
114\emph{Phone} is a unique identifier for every entry, not allowing any
115generalization. So it is ignored/deleted.
116
117Using \emph{Weka} embedded discretize function on all Numeric Columns of
118table~\ref{overview}. Any charge value was taken to be full integer values e.g
119rounded currency: \wekacmd{weka.filters.unsupervised.attribute.NumericCleaner
120-min -1.7E308 -min-default -1.708 -max 1.7E308 -max-default 1.7E308 -closeto 0.0
121-closeto-default 0.0 -closeto-tolerance 1.0E-6 -R 8-11 -decimals 0} And made
122discrete: \wekacmd{weka.filters.unsupervised.attribute.NumericToNominal -R 8-11}
123
124Account length was set to be a bin of 'weight' 1, assuming a 1
125months\footnote{This might as well be days, years of some other value, but
126assuming fixed phone contracts, given the range (0-244) months seems to be
127make most sense} contract:
128\wekacmd{weka.filters.unsupervised.attribute.NumericToNominal -R 3}
129
130\emph{Phone-Prefix} is set to be a set on every number unique:
131\wekacmd{weka.filters.unsupervised.attribute.NumericToNominal -R 1}
132\emph{CustServ Calls} it set to be rounded values, as one cannot make half calls:
133\wekacmd{weka.filters.unsupervised.attribute.NumericToNominal -R 12}
134
135Then running the association algoritm \emph{Apriori}, with the \emph{Churn}
136value as the result of the equations:
137\begin{result}
138\caption{ association algoritm \emph{Apriori} - run 1}
139\scriptsize
140\begin{verbatim}
141 1. Area Code=415 Int'l Plan=no VMail Plan=yes 423 ==> Churn?=False. 405 conf :(0.96)
142 2. Int'l Plan=no VMail Plan=yes 830 ==> Churn?=False. 786 conf:(0.95)
143 ...
144\end{verbatim}
145\end{result}
146
147Mostly negative results e.g. proving when a consumer is not going to churn.
148Secondly data seems to specialise, rule 1 for example is a specialisation of
149rule 2. Try running without the requirement that churn needs to be on the right
150end of the rule.
151
152\begin{result}
153\caption{ association algoritm \emph{Apriori} - run 2}
154\begin{verbatim}
155 1. VMail Message=0 2411 ==> VMail Plan=no 2411 conf:(1)
156 2. VMail Plan=no 2411 ==> VMail Message=0 2411 conf:(1)
157 ...
158\end{verbatim}
159\end{result}
160
161Seems like \emph{VMail Plan=no} seems to equal the \emph{VMail Message=0}. So
162\emph{VMail Plan} can be safely deleted from the attribute list. \emph{VMail
163Message} does not seems to be a very clear description for it's claimed
164purpose. Running without \emph{VMail Plan} did not show improvement. Hence we
165decided to go for an normalisation on the Churn number. By taking a random
166sample of $Churn=False$ values such that it equals the number of $Churn=True$
167values\footnote{\scriptsize\texttt{sort -r -t, -k 22,22 churn\_parsed.csv | sed
168'484,2366d' > churn\_equal.csv}} we re-run the experiments, but found no
169improvement in the experiments.
170
171\section{Conclusions}
172The churn dataset does not classify itself properly associations rules. Mainly
173due to the fact that the so called 'hidden factors' for churning, like `if
174calling more than X minutes at rate Y I will churn'. cannot be mined using this
175current dataset. Further research could include this relations by means of
176formula's, but it require domain specific knowledge to include for example
177relations between \emph{Day Min} and {Day Calls}. Alternative methods like
178scat and plat analysis\footnote{Like done at
179http://meru.cecs.missouri.edu/courses/cecs401\_data\_mining/projects/group2/finproject1.htm}
180seems to lead to more promensing results. This could (of course) also be done
181in Weka. Take for a brief preview in Appendix 1.
182\newpage
183\section{Appendix 1}
184Result using \emph{Weka} classifier: \wekacmd{weka.classifiers.trees.J48 -C 0.25
185-M 2} shows interesting details, like \wekacmd{\scriptsize'Day Mins' > 254.4 and 'VMail
186Plan' = no and 'Eve Mins > 187.7 => True.}
187\begin{result}
188\caption{J48 pruned tree of raw churn dataset}
189\scriptsize
190\begin{verbatim}
191=== Classifier model (full training set) ===
192J48 pruned tree
193------------------
194Day Mins <= 264.4
195| CustServ Calls <= 3
196| | Int'l Plan = no
197| | | Day Mins <= 223.2: False. (2221.0/60.0)
198| | | Day Mins > 223.2
199| | | | Eve Mins <= 242.3: False. (296.0/22.0)
200| | | | Eve Mins > 242.3
201| | | | | VMail Plan = yes: False. (20.0)
202| | | | | VMail Plan = no
203| | | | | | Night Mins <= 174.2
204| | | | | | | Day Mins <= 246.8: False. (12.0)
205| | | | | | | Day Mins > 246.8: True. (5.0/1.0)
206| | | | | | Night Mins > 174.2: True. (50.0/8.0)
207| | Int'l Plan = yes
208| | | Intl Calls <= 2: True. (51.0)
209| | | Intl Calls > 2
210| | | | Intl Mins <= 13.1: False. (173.0/7.0)
211| | | | Intl Mins > 13.1: True. (43.0)
212| CustServ Calls > 3
213| | Day Mins <= 160.2
214| | | Eve Charge <= 19.83: True. (79.0/3.0)
215| | | Eve Charge > 19.83
216| | | | Day Mins <= 120.5: True. (10.0)
217| | | | Day Mins > 120.5: False. (13.0/3.0)
218| | Day Mins > 160.2
219| | | Eve Charge <= 12.05
220| | | | Eve Calls <= 125: True. (16.0/2.0)
221| | | | Eve Calls > 125: False. (3.0)
222| | | Eve Charge > 12.05: False. (130.0/24.0)
223Day Mins > 264.4
224| VMail Plan = yes: False. (53.0/6.0)
225| VMail Plan = no
226| | Eve Mins <= 187.7
227| | | Day Mins <= 280.4: False. (30.0/7.0)
228| | | Day Mins > 280.4: True. (27.0/9.0)
229| | Eve Mins > 187.7: True. (101.0/5.0)
230\end{verbatim}
231\end{result}
232\end{document}
Note: See TracBrowser for help on using the repository browser.