%
% $Id: report.tex 571 2008-04-20 17:31:04Z rick $
%

\documentclass[12pt,a4paper]{article}

\frenchspacing
\usepackage[english,dutch]{babel}
\selectlanguage{dutch}
\usepackage{graphicx}
\usepackage{url}
\usepackage{multicol}
\usepackage{fancybox}
\usepackage{amssymb,amsmath}
\usepackage{float}
\floatstyle{ruled}
\newfloat{result}{thp}{lop}
\floatname{result}{Result}

\title{Churn in Telecom dataset \\
\large{Databases and Datamining, 2009}}
\author{Jonathan Vis, Rick van der Zwet\\
\texttt{<$\{$jvis,hvdzwet$\}$@liacs.nl>}}
\date{\today}


\begin{document}
\newcommand{\wekacmd}[1]{\begin{quote}\small{\texttt{#1}}\end{quote}}
\newcommand{\unixcmd}[1]{\begin{quote}\small{\texttt{#1}}\end{quote}}


\maketitle

\section{Introduction}
This report applies association rule learning to find relations between
variables in a large database. This is done using
Weka\footnote{http://www.cs.waikato.ac.nz/ml/weka/} and a telecom churn
dataset\footnote{http://www.dataminingconsultant.com/DKD.htm}.

\section{Problem description}
\emph{Churning} (moving to a different company) is still a major concern
for companies today. Understanding why a customer chose to go to another
company is crucial for finding flaws in the product range or services. As more
and more data about consumers is stored, finding relations that explain why
a customer churned is becoming increasingly interesting.

\section{Statistics}
Our dataset has 3333 entries and 21 attributes, whose characteristics are shown
in table~\ref{overview}.
\begin{table}[ht]
\caption{Statistical report of churn dataset}
\centering
\begin{tabular}{l || r | r | r | r | r | r | r | r |}
Item & Type & Distinct & Missing & Unique & Min & Max & Mean & StdDev \\
\hline \hline
State & Nominal & 51 & 0 & NaN & NaN & NaN & NaN & NaN \\
Account Length & Numeric & 212 & 0 & 16 & 1 & 1 & 101 & 40 \\
Area Code & Numeric & 3 & 0 & 0 & 408 & 510 & 437 & 42 \\
Phone & Nominal & 3333 & 0 & 3333 & NaN & NaN & NaN & NaN \\
Int'l Plan & Nominal & 2 & 0 & NaN & NaN & NaN & NaN & NaN \\
VMail Plan & Nominal & 2 & 0 & NaN & NaN & NaN & NaN & NaN \\
VMail Msg & Numeric & 46 & 0 & 4 & 0 & 51 & 8 & 14 \\
Day Mins & Numeric & 1667 & 0 & 770 & 0 & 351 & 180 & 54 \\
Day Calls & Numeric & 119 & 0 & 10 & 0 & 165 & 100 & 20 \\
Day Charge & Numeric & 1667 & 0 & 770 & 0 & 60 & 30 & 9 \\
Eve Mins & Numeric & 1611 & 0 & 709 & 0 & 364 & 201 & 51 \\
Eve Calls & Numeric & 123 & 0 & 17 & 0 & 170 & 100 & 20 \\
Eve Charge & Numeric & 1440 & 0 & 585 & 0 & 31 & 17 & 4 \\
Night Mins & Numeric & 1591 & 0 & 586 & 23 & 395 & 201 & 51 \\
Night Calls & Numeric & 120 & 0 & 11 & 33 & 175 & 100 & 19 \\
Night Charge & Numeric & 933 & 0 & 236 & 1 & 18 & 9 & 2 \\
Intl Mins & Numeric & 162 & 0 & 16 & 0 & 20 & 10 & 3 \\
Intl Calls & Numeric & 21 & 0 & 3 & 0 & 20 & 4.5 & 2.5 \\
Intl Charge & Numeric & 162 & 0 & 16 & 0 & 5.4 & 2.8 & 0.8 \\
CustServ Calls & Numeric & 10 & 0 & 0 & 0 & 9 & 1.5 & 1.3 \\
Churn & Nominal & 2 & 0 & 0 & NaN & NaN & NaN & NaN
\end{tabular}
\label{overview}
\end{table}

We are lucky to have a complete dataset: no entry is missing a value for any
attribute. However, this does not mean the data is error-free. The dataset may
still contain human errors or errors of some other kind.

\section{Approach}
As finding association rules requires discrete values, we will discretize the
attributes tagged as Numeric in table~\ref{overview}. In this process we will
pay special attention to the actual meaning of each attribute. Having 3.5
Customer Service Calls is impossible, so we make sure to set the boundaries
of all \emph{bins} to rounded values. We will also try the \emph{Weka} feature
for automatic discretization:
\wekacmd{weka.filters.unsupervised.attribute.Discretize
-unset-class-temporarily -O -B 10 -M -1.0 -R first-last}

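To give a feeling for what unsupervised discretization into ten equal-width
bins amounts to, the following Python sketch maps numeric values to bin
indices. It is only an illustration under our own assumptions: the function
name and the sample values are hypothetical, and it is not the Weka
implementation.

```python
def equal_width_bins(values, bins=10):
    """Map each numeric value to the index (0..bins-1) of its equal-width bin."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so everything lands in the first bin.
        return [0 for _ in values]
    width = (hi - lo) / bins
    # The maximum value is clamped into the last bin instead of opening an extra one.
    return [min(int((v - lo) / width), bins - 1) for v in values]


# Hypothetical charge values, not taken from the dataset.
charges = [0.0, 12.5, 30.0, 45.9, 60.0]
print(equal_width_bins(charges))  # prints [0, 2, 5, 7, 9]
```

With a range of 0 to 60 every bin is 6 units wide, so bin boundaries only fall
on rounded values when the range happens to divide evenly.
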
Secondly, table~\ref{overview} shows a number of attributes that are related,
for example \emph{Day Calls} and \emph{Day Charge}. One could argue that
both are relevant: a consumer might churn because making many calls turns out
to be problematic (bad signal or quality, for example), while one could also
argue that the price determines the churn. We will discard the values related
to minutes and number of calls and focus solely on the \emph{Charge}.
The \emph{Phone} attribute also shows an interesting feature. We will create a
new attribute called \emph{Phone-prefix}, consisting of the first 3 digits of
the \emph{Phone} number, to see whether this gives a finer-grained grouping
than the area code. The combination \emph{Area Code}-\emph{Phone-prefix} will
also be investigated.

\section{Implementation}
Creating the phone-prefix column using standard unix tools:
\unixcmd{cut -f 4 -d, churn\_ooo.csv | cut -c 1-4,10 | paste -d, -
churn\_ooo.csv | sed '1s/"Pho/"Phone-Prefix"/' > churn\_parsed.csv}
Using \emph{Weka} we deleted the columns \texttt{Day Mins, Day Calls, Eve Mins,
Eve Calls, Night Mins, Night Calls, Intl Mins, Intl Calls}, as we believe their
information is already covered by the corresponding \emph{Charge} attributes.
\emph{Phone} is a unique identifier for every entry and does not allow any
generalization, so it is deleted.

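The same prefix column can be sketched in Python, which may be easier to follow
than the character offsets in the pipeline above. This is a minimal sketch
under assumptions: the function name, the column name \texttt{Phone} as lookup
key and the sample rows are hypothetical, and the phone number is assumed to
start with its three prefix digits.

```python
import csv
import io


def add_phone_prefix(csv_text, phone_column="Phone"):
    """Return CSV text with a leading Phone-Prefix column (first 3 characters of the phone)."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    idx = header.index(phone_column)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Phone-Prefix"] + header)
    for row in reader:
        # Prepend the prefix, mirroring the `paste -d, -` step of the pipeline.
        writer.writerow([row[idx][:3]] + row)
    return out.getvalue()


# Made-up sample rows in the assumed format, not real dataset entries.
sample = "State,Phone\nKS,123-4567\nOH,765-4321\n"
print(add_phone_prefix(sample))
```
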
We used the discretize functions embedded in \emph{Weka} on all Numeric columns
of table~\ref{overview}. Every charge value was rounded to a full integer
value, i.e. rounded currency:
\wekacmd{weka.filters.unsupervised.attribute.NumericCleaner
-min -1.7E308 -min-default -1.708 -max 1.7E308 -max-default 1.7E308 -closeto 0.0
-closeto-default 0.0 -closeto-tolerance 1.0E-6 -R 8-11 -decimals 0}
and made discrete:
\wekacmd{weka.filters.unsupervised.attribute.NumericToNominal -R 8-11}

Account length was discretized with a bin width of 1, assuming a contract unit
of one month\footnote{This might as well be days, years or some other unit, but
assuming fixed phone contracts and given the range (0-244), months seem to
make most sense.}:
\wekacmd{weka.filters.unsupervised.attribute.NumericToNominal -R 3}

\emph{Phone-Prefix} is made nominal, with every distinct number as its own
value:
\wekacmd{weka.filters.unsupervised.attribute.NumericToNominal -R 1}
\emph{CustServ Calls} is restricted to whole values, as one cannot make half a
call:
\wekacmd{weka.filters.unsupervised.attribute.NumericToNominal -R 12}

We then ran the association algorithm \emph{Apriori}, with the \emph{Churn}
value as the consequent of the rules:
\begin{result}
\caption{Association algorithm \emph{Apriori} - run 1}
\scriptsize
\begin{verbatim}
1. Area Code=415 Int'l Plan=no VMail Plan=yes 423 ==> Churn?=False. 405 conf:(0.96)
2. Int'l Plan=no VMail Plan=yes 830 ==> Churn?=False. 786 conf:(0.95)
...
\end{verbatim}
\end{result}

The results are mostly negative, i.e. they indicate when a consumer is not
going to churn. Secondly, the rules tend to specialise: rule 1, for example, is
a specialisation of rule 2. Next we try running without the requirement that
churn needs to be on the right-hand side of the rule.

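The confidence values reported by \emph{Apriori} can be checked by hand: the
confidence of a rule is the number of entries matching the complete rule
divided by the number of entries matching its left-hand side. A small Python
sketch, using the counts of run 1 and a hypothetical helper function:

```python
def confidence(antecedent_count, rule_count):
    """Confidence of a rule: entries matching the whole rule / entries matching its left-hand side."""
    return rule_count / antecedent_count


# Counts taken from the two rules of run 1.
print(round(confidence(423, 405), 2))  # rule 1, prints 0.96
print(round(confidence(830, 786), 2))  # rule 2, prints 0.95
```
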
\begin{result}
\caption{Association algorithm \emph{Apriori} - run 2}
\begin{verbatim}
1. VMail Message=0 2411 ==> VMail Plan=no 2411 conf:(1)
2. VMail Plan=no 2411 ==> VMail Message=0 2411 conf:(1)
...
\end{verbatim}
\end{result}

It appears that \emph{VMail Plan=no} is equivalent to \emph{VMail Message=0},
so \emph{VMail Plan} can safely be deleted from the attribute list. \emph{VMail
Message} does not seem to be a very clear description of its claimed
purpose. Running without \emph{VMail Plan} did not show improvement. Hence we
decided to balance the dataset on the Churn value. By taking a random
sample of \emph{Churn=False} entries such that their number equals the number
of \emph{Churn=True}
entries\footnote{\scriptsize\texttt{sort -r -t, -k 22,22 churn\_parsed.csv | sed
'484,2366d' > churn\_equal.csv}} we re-ran the experiments, but found no
improvement in the results.

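Balancing the two classes by random undersampling can be sketched as follows in
Python. This is an illustration only and not the command used for the
experiments: the function name, the row representation as dictionaries and the
sample data are assumptions, and the class labels follow the \texttt{Churn?}
values shown in the \emph{Apriori} output.

```python
import random


def undersample(rows, label_key="Churn?", minority="True.", seed=0):
    """Keep all minority rows plus an equally sized random sample of the other rows."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    minority_rows = [r for r in rows if r[label_key] == minority]
    majority_rows = [r for r in rows if r[label_key] != minority]
    # Assumes the minority class really is the smaller one; min() guards the other case.
    kept = rng.sample(majority_rows, min(len(minority_rows), len(majority_rows)))
    return minority_rows + kept


# Hypothetical toy data: 3 churners and 7 non-churners.
rows = [{"id": i, "Churn?": "True." if i < 3 else "False."} for i in range(10)]
balanced = undersample(rows)
print(len(balanced))  # prints 6
```
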
\section{Conclusions}
The churn dataset does not lend itself well to association rules, mainly
because the so-called `hidden factors' for churning, like `if
calling more than X minutes at rate Y, I will churn', cannot be mined from the
current dataset. Further research could include such relations by means of
formulas, but this requires domain-specific knowledge, for example to include
relations between \emph{Day Mins} and \emph{Day Calls}. Alternative methods like
scat and plat analysis\footnote{Like done at
http://meru.cecs.missouri.edu/courses/cecs401\_data\_mining/projects/group2/finproject1.htm}
seem to lead to more promising results. This could (of course) also be done
in Weka. A brief preview is given in Appendix 1.
\newpage
\section{Appendix 1}
The \emph{Weka} classifier \wekacmd{weka.classifiers.trees.J48 -C 0.25
-M 2} shows interesting details, like \wekacmd{\scriptsize'Day Mins' > 264.4 and 'VMail
Plan' = no and 'Eve Mins' > 187.7 => True.}
\begin{result}
\caption{J48 pruned tree of raw churn dataset}
\scriptsize
\begin{verbatim}
=== Classifier model (full training set) ===
J48 pruned tree
------------------
Day Mins <= 264.4
| CustServ Calls <= 3
| | Int'l Plan = no
| | | Day Mins <= 223.2: False. (2221.0/60.0)
| | | Day Mins > 223.2
| | | | Eve Mins <= 242.3: False. (296.0/22.0)
| | | | Eve Mins > 242.3
| | | | | VMail Plan = yes: False. (20.0)
| | | | | VMail Plan = no
| | | | | | Night Mins <= 174.2
| | | | | | | Day Mins <= 246.8: False. (12.0)
| | | | | | | Day Mins > 246.8: True. (5.0/1.0)
| | | | | | Night Mins > 174.2: True. (50.0/8.0)
| | Int'l Plan = yes
| | | Intl Calls <= 2: True. (51.0)
| | | Intl Calls > 2
| | | | Intl Mins <= 13.1: False. (173.0/7.0)
| | | | Intl Mins > 13.1: True. (43.0)
| CustServ Calls > 3
| | Day Mins <= 160.2
| | | Eve Charge <= 19.83: True. (79.0/3.0)
| | | Eve Charge > 19.83
| | | | Day Mins <= 120.5: True. (10.0)
| | | | Day Mins > 120.5: False. (13.0/3.0)
| | Day Mins > 160.2
| | | Eve Charge <= 12.05
| | | | Eve Calls <= 125: True. (16.0/2.0)
| | | | Eve Calls > 125: False. (3.0)
| | | Eve Charge > 12.05: False. (130.0/24.0)
Day Mins > 264.4
| VMail Plan = yes: False. (53.0/6.0)
| VMail Plan = no
| | Eve Mins <= 187.7
| | | Day Mins <= 280.4: False. (30.0/7.0)
| | | Day Mins > 280.4: True. (27.0/9.0)
| | Eve Mins > 187.7: True. (101.0/5.0)
\end{verbatim}
\end{result}
\end{document}
