%
% $Id: report.tex 571 2008-04-20 17:31:04Z rick $
%

\documentclass[12pt,a4paper]{article}

\frenchspacing
\usepackage[english,dutch]{babel}
\selectlanguage{dutch}
\usepackage{graphicx}
\usepackage{url}
\usepackage{multicol}
\usepackage{fancybox}
\usepackage{amssymb,amsmath}
\usepackage{float}
\floatstyle{ruled}
\newfloat{result}{thp}{lop}
\floatname{result}{Result}

\title{Churn in Telecom dataset \\
\large{Databases and Datamining, 2009}}
\author{Jonathan Vis, Rick van der Zwet\\
  \texttt{<$\{$jvis,hvdzwet$\}$@liacs.nl>}}
\date{\today}


\begin{document}
\newcommand{\wekacmd}[1]{\begin{quote}\small{\texttt{#1}}\end{quote}}
\newcommand{\unixcmd}[1]{\begin{quote}\small{\texttt{#1}}\end{quote}}


\maketitle

\section{Introduction}
This report is focused towards finding association rule learning to find
relations between variables in large databases. This will be done using
Weka\footnote{http://www.cs.waikato.ac.nz/ml/weka/} and a telecom churn
dataset\footnote{http://www.dataminingconsultant.com/DKD.htm}.

\section{Problem description}
\emph{Churning} -moving to a different company)- today is still a major deal
within companies. Having to understand why a customer choose to go for an other
company is crusial in finding flaws in the product-range or services. As more
and more data about the consumer get stored, trying to find relations why
he/she churned is becoming more and more interesting.

\section{Statistics}
Our dataset has 3333 entries and 21 attributes, which the charactistics shown
in table~\ref{overview}.
\begin{table}[ht]
\caption{Statistical report of churn dataset}
\centering
\begin{tabular}{l || r | r | r | r | r | r | r | r | r |}
Item & Type & Distinct & Missing & Unique & Min & Max & Mean & StdDev \\
\hline \hline
State & Nominal & 51 & 0 & NaN & NaN & NaN & NaN & NaN \\
Account Length & Numeric & 212 & 0 & 16 & 1 & 1 & 101 & 40 \\
Area Code & Numeric & 3 & 0 & 0 & 408 & 510 & 437 & 42 \\
Phone & Nominal & 3333 & 0 & 3333 & NaN & NaN & NaN & NaN \\
Int'l Plan & Nominal & 2 & 0 & NaN & NaN & NaN & NaN & NaN \\
VMail Plan & Nominal & 2 & 0 & NaN & NaN & NaN & NaN & NaN \\
VMail Msg & Numeric & 46 & 0 & 4 & 0 & 51 & 8 & 14 \\
Day Mins & Numeric & 1667 & 0 & 770 & 0 & 351 & 180 & 54 \\
Days Calls & Numeric & 119 & 0 & 10 & 0 & 165 & 100 & 20 \\
Days Charge & Numeric & 1667 & 0 & 770 & 0 & 60 & 30 & 9 \\
Eve Mins & Numeric & 1611 & 0 & 709 & 0 & 364 & 201 & 51 \\
Eve Calls & Numeric & 123 & 0 & 17 & 0 & 170 & 100 & 20 \\
Eve Charge & Numeric & 1440 & 0 & 585 & 0 & 31 & 17 & 4 \\
Night Mins & Numeric & 1591 & 0 & 586 & 23 & 395 & 201 & 51 \\
Night Calls & Numeric & 120 & 0 & 11 & 33 & 175 & 100 & 19 \\
Night Charge & Numeric & 933 & 0 & 236 & 1 & 18 & 9 & 2 \\
Intl Mins & Numeric & 162 & 0 & 16 & 0 & 20 & 10 & 3 \\
Intl Calls & Numeric & 21 & 0 & 3 & 0 & 20 & 4.5 & 2.5 \\
Intl Charge & Numeric & 162 & 0 & 16 & 0 & 5.4 & 2.8 & 0.8 \\
CusServ Calls & Numeric & 10 & 0 & 0 & 0 & 9 & 1.5 & 1.3 \\
Churn & Nominal & 2 & 0 & 0 & NaN & NaN & NaN & NaN
\end{tabular}
\label{overview}
\end{table}

We can consider ourself lucky by having an complete dataset. None of the
attributes is missing at an entry. How-ever this does not mean the data is
considered error-free. There might be human-errors or others of some kind
inside the dataset.

\section{Approach}
As finding association rules needs discrete values, we will discretize the
attributes tagged as Numeric in table~\ref{overview}. At this process we will
take a special look of the actual meaning of the attribute. Having 3.5 Customer
Server Calls is going to be a bit impossible, so make sure to set the binaries
of all \emph(bins) to rounded values. We will also try \emph(Weka) feature of
automatic discretizion: \wekacmd{weka.filters.unsupervised.attribute.Discretize
-unset-class-temporarily -O -B 10 -M -1.0 -R first-last}.

Secondly table~\ref{overview} shows an number of entries which are related.
Like for example \emph{Day Calls} and \emph{Day Charge}. One could argue that
both are relevant, as an consumer might churn as making many calls turns out to
be problematic (bad signal, quality for example). While you could also argue
the price will determine the churn. We will discard the values related to
minutes and call numbers and solely focus on the \emph{Charge}.
The \emph{Phone} attribute also shows some interesting feature. We will make an
new attributecalled \emph{Phone-prefix} which is the first 3 numbers of the
\emph{Phone} number, to see whether this give some fine gain grouping over area
code. Also the combination \emph{Area Code}-\emph{Phone-prefix} will be
researched.

\section{Implementaion}
Creating phone-prefix column using standard unix tools:
\unixcmd{cut -f 4 -d, churn\_ooo.csv | cut -c 1-4,10 | paste -d, -
churn\_ooo.csv | sed '1s/"Pho/"Phone-Prefix"/' > churn\_parsed.csv}.  Using
\emph{Weka} deleted the columns \texttt{Day Mins, Day Calls, Eve Mins, Eve
Calls, Night Mins, Night Calls, Intl Mins, Intl Calls} as we believe they are
subsets of \emph{Charge}.
\emph{Phone} is a unique identifier for every entry, not allowing any
generalization. So it is ignored/deleted.

Using \emph{Weka} embedded discretize function on all Numeric Columns of
table~\ref{overview}. Any charge value was taken to be full integer values e.g
rounded currency: \wekacmd{weka.filters.unsupervised.attribute.NumericCleaner
-min -1.7E308 -min-default -1.708 -max 1.7E308 -max-default 1.7E308 -closeto 0.0
-closeto-default 0.0 -closeto-tolerance 1.0E-6 -R 8-11 -decimals 0} And made
discrete: \wekacmd{weka.filters.unsupervised.attribute.NumericToNominal -R 8-11}

Account length was set to be a bin of 'weight' 1, assuming a 1
months\footnote{This might as well be days, years of some other value, but
assuming fixed phone contracts, given the range (0-244) months seems to be
make most sense} contract:
\wekacmd{weka.filters.unsupervised.attribute.NumericToNominal -R 3}

\emph{Phone-Prefix} is set to be a set on every number unique:
\wekacmd{weka.filters.unsupervised.attribute.NumericToNominal -R 1}
\emph{CustServ Calls} it set to be rounded values, as one cannot make half calls:
\wekacmd{weka.filters.unsupervised.attribute.NumericToNominal -R 12}

Then running the association algoritm \emph{Apriori}, with the \emph{Churn}
value as the result of the equations:
\begin{result}
\caption{ association algoritm \emph{Apriori} - run 1}
\scriptsize
\begin{verbatim}
 1. Area Code=415 Int'l Plan=no VMail Plan=yes 423 ==> Churn?=False. 405 conf :(0.96)
 2. Int'l Plan=no VMail Plan=yes 830 ==> Churn?=False. 786    conf:(0.95)
 ...
\end{verbatim}
\end{result}

Mostly negative results e.g. proving when a consumer is not going to churn.
Secondly data seems to specialise, rule 1 for example is a specialisation of
rule 2. Try running without the requirement that churn needs to be on the right
end of the rule.

\begin{result}
\caption{ association algoritm \emph{Apriori} - run 2}
\begin{verbatim}
 1. VMail Message=0 2411 ==> VMail Plan=no 2411    conf:(1)
 2. VMail Plan=no 2411 ==> VMail Message=0 2411    conf:(1)
 ...
\end{verbatim}
\end{result}

Seems like \emph{VMail Plan=no} seems to equal the \emph{VMail Message=0}. So
\emph{VMail Plan} can be safely deleted from the attribute list. \emph{VMail
Message} does not seems to be a very clear description for it's claimed
purpose. Running without \emph{VMail Plan} did not show improvement. Hence we
decided to go for an normalisation on the Churn number. By taking a random
sample of $Churn=False$ values such that it equals the number of $Churn=True$
values\footnote{\scriptsize\texttt{sort -r -t, -k 22,22 churn\_parsed.csv | sed
'484,2366d' > churn\_equal.csv}} we re-run the experiments, but found no
improvement in the experiments.

\section{Conclusions}
The churn dataset does not classify itself properly associations rules. Mainly
due to the fact that the so called 'hidden factors' for churning, like `if
calling more than X minutes at rate Y I will churn'. cannot be mined using this
current dataset. Further research could include this relations by means of
formula's, but it require domain specific knowledge to include for example
relations between \emph{Day Min} and {Day Calls}.  Alternative methods like
scat and plat analysis\footnote{Like done at
http://meru.cecs.missouri.edu/courses/cecs401\_data\_mining/projects/group2/finproject1.htm}
seems to lead to more promensing results. This could (of course) also be done
in Weka. Take for a brief preview in Appendix 1.
\newpage
\section{Appendix 1}
Result using \emph{Weka} classifier: \wekacmd{weka.classifiers.trees.J48 -C 0.25
-M 2} shows interesting details, like \wekacmd{\scriptsize'Day Mins' > 254.4 and 'VMail
Plan' = no and 'Eve Mins > 187.7 => True.}
\begin{result}
\caption{J48 pruned tree of raw churn dataset}
\scriptsize
\begin{verbatim}
=== Classifier model (full training set) ===
J48 pruned tree
------------------
Day Mins <= 264.4
|   CustServ Calls <= 3
|   |   Int'l Plan = no
|   |   |   Day Mins <= 223.2: False. (2221.0/60.0)
|   |   |   Day Mins > 223.2
|   |   |   |   Eve Mins <= 242.3: False. (296.0/22.0)
|   |   |   |   Eve Mins > 242.3
|   |   |   |   |   VMail Plan = yes: False. (20.0)
|   |   |   |   |   VMail Plan = no
|   |   |   |   |   |   Night Mins <= 174.2
|   |   |   |   |   |   |   Day Mins <= 246.8: False. (12.0)
|   |   |   |   |   |   |   Day Mins > 246.8: True. (5.0/1.0)
|   |   |   |   |   |   Night Mins > 174.2: True. (50.0/8.0)
|   |   Int'l Plan = yes
|   |   |   Intl Calls <= 2: True. (51.0)
|   |   |   Intl Calls > 2
|   |   |   |   Intl Mins <= 13.1: False. (173.0/7.0)
|   |   |   |   Intl Mins > 13.1: True. (43.0)
|   CustServ Calls > 3
|   |   Day Mins <= 160.2
|   |   |   Eve Charge <= 19.83: True. (79.0/3.0)
|   |   |   Eve Charge > 19.83
|   |   |   |   Day Mins <= 120.5: True. (10.0)
|   |   |   |   Day Mins > 120.5: False. (13.0/3.0)
|   |   Day Mins > 160.2
|   |   |   Eve Charge <= 12.05
|   |   |   |   Eve Calls <= 125: True. (16.0/2.0)
|   |   |   |   Eve Calls > 125: False. (3.0)
|   |   |   Eve Charge > 12.05: False. (130.0/24.0)
Day Mins > 264.4
|   VMail Plan = yes: False. (53.0/6.0)
|   VMail Plan = no
|   |   Eve Mins <= 187.7
|   |   |   Day Mins <= 280.4: False. (30.0/7.0)
|   |   |   Day Mins > 280.4: True. (27.0/9.0)
|   |   Eve Mins > 187.7: True. (101.0/5.0)
\end{verbatim}
\end{result}
\end{document}