WEKA Tutorial #1.2 – How to Build a Data Mining Model from Scratch

November 7, 2019


So, let’s see where it is installed. Okay, it’s right here, WEKA 3.8.3, so
I’ll click on WEKA 3.8. Okay, I’ll just click OK here; it’s just an information
message telling us that we can use many learning schemes and tools and that
there’s a Tools menu. Click OK, and let’s get started building our model by
clicking on the Explorer button.

The WEKA Explorer allows you to intuitively build your prediction model by
clicking on specific functions. Let’s get started by importing the data set.
The data set that we’re going to use in this practical tutorial is the famous
Iris data set. The Iris data set is a public domain data set that is commonly
used as an example for teaching data mining. To find it, click on “Open File”,
go to the C drive, then go to “Program Files”. I’m not sure whether it is
“Program Files” or “Program Files (x86)”, so let me try the first one. There
should be a folder called WEKA. Okay, so it’s in “Program Files” > WEKA 3.8;
go to the “data” subfolder and then find “iris”. Okay, here we go, “iris.arff”.
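
For readers curious what the Explorer reads when it opens an ARFF file, here is a minimal sketch in Python. It assumes the usual ARFF layout (a header of `@ATTRIBUTE` lines followed by comma-separated rows after `@DATA`). The embedded excerpt is only an illustrative three-row sample, not the full 150-row file, and `parse_arff` is a hypothetical helper written for this sketch, not part of WEKA.

```python
# Illustrative three-row excerpt in ARFF layout (assumption: the real
# iris.arff shipped with WEKA has the same header and 150 data rows).
ARFF_EXCERPT = """\
@RELATION iris

@ATTRIBUTE sepallength REAL
@ATTRIBUTE sepalwidth REAL
@ATTRIBUTE petallength REAL
@ATTRIBUTE petalwidth REAL
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}

@DATA
5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
"""


def parse_arff(text):
    """Hypothetical minimal parser: returns (attributes, rows)."""
    attributes = []  # list of (name, type) pairs from the header
    rows = []        # list of rows, each a list of strings
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and ARFF comments
        lowered = line.lower()
        if in_data:
            rows.append(line.split(","))
        elif lowered.startswith("@attribute"):
            # Header lines look like: @ATTRIBUTE <name> <type>
            _, name, attr_type = line.split(None, 2)
            attributes.append((name, attr_type))
        elif lowered.startswith("@data"):
            in_data = True
    return attributes, rows


attributes, rows = parse_arff(ARFF_EXCERPT)
print([name for name, _ in attributes])
print(len(rows), "rows in this excerpt")
```

The five attribute names correspond to the five variables that the Explorer lists once the file is opened.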
They also have the iris 2D file, but let’s go with “iris.arff”. Click on
“Open” and this is what we see. Let’s have a look at what the various features
and menus are doing. This panel tells you about the attributes, or variables,
that this data set has, and we can see that there are a total of five
variables: the sepal length, the sepal width, the petal length, the petal
width and the class (label). This is a data set of 150 closely related flowers
called Iris. They are described by four variables: the length and the width of
the sepal, and the length and the width of the petal. Each flower is given a
corresponding label as being either a setosa, a versicolor or a virginica.

Here we can see again that this is the Iris data set: it has a total of 150
flowers and five attributes, or variables. If you click on one of them, you
will see some description of it on the right. For the class, you can see that
there are 50 Iris setosa, 50 Iris versicolor and 50 Iris virginica. There are
no missing data here, which is good to know. Click on the first variable and
we can see that the minimum value is 4.3, the maximum value is 7.9, and the
mean is 5.843 with a standard deviation of 0.828. We get the same information
by clicking on the subsequent variables: the minimum, maximum, mean and
standard deviation. Notice that these numbers differ from variable to
variable: the mean is the average value of each variable, and the standard
deviation tells us the variability of each variable.

Before we build a prediction model, let’s first normalize our data, because
each variable has a different minimum and maximum. As you can see, the first
one has a minimum of 4.3 and a maximum of 7.9.
The second variable has a minimum of 2 and a maximum of 4.4. The third one,
the petal length, has a minimum of 1 and a maximum of 6.9. The fourth variable
has a minimum of 0.1 and a maximum of 2.5. We also notice that the mean and
standard deviation of each one of them are different. So let’s get started by
first normalizing or standardizing our variables.

Let me begin by normalizing the minimum and maximum to be 0 and 1. We can do
that very easily in WEKA: click on the arrow on unsupervised > attribute, then
click on “Normalize” and then “Apply”. Before I click on the button, notice
that the minimums and maximums are 0.1 and 2.5, 1 and 6.9, 2 and 4.4, and 4.3
and 7.9. I’ll click “Apply” here; notice what changes. The minimum and maximum
values become 0 and 1, and the mean and standard deviation have also changed.
Same thing here: the minimum and maximum are 0 and 1. Third variable, same
thing, 0 and 1. Fourth variable, 0 and 1. I can undo that, which brings us
back to the original state.

Alternatively, instead of “Normalize” I could use “Standardize”. It’s in the
same subfolder, filter > unsupervised > attribute, so I’ll find “Standardize”
and click on it. Before I click “Apply”, notice the means and standard
deviations that will be altered: they are 5.8 and 0.8 here, 3 and 0.4, 3.7 and
1.7, and 1.199 and 0.763. Now click on “Apply”. The mean becomes 0 and the
standard deviation becomes 1. This happens for every one of them, the third
variable and also the fourth variable. However, nothing happens to the class,
because that is the class (label) that we are going to predict in our
classification.

In data mining there are many tasks that you can do. You could visualize the
data, cluster the data, classify the data, or build a regression model. For
this example, because our class, or output label, is a qualitative label, we
will perform classification.
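
To make the two pre-processing options described above concrete, here is a minimal Python sketch of what normalizing and standardizing do to a single variable. The five sample values are hypothetical, chosen only so that the minimum and maximum match the sepal length range quoted earlier (4.3 and 7.9). The function names are made up for this sketch, and it assumes the standard deviation is the sample standard deviation (n - 1 in the denominator); that is an assumption about WEKA's behavior, not something shown in the video.

```python
import statistics


def min_max_normalize(values):
    """Rescale values so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no range to rescale; map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Shift and scale values to mean 0 and standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (assumption)
    return [(v - mean) / sd for v in values]


# Hypothetical sepal length sample spanning the range quoted in the text.
sepal_length = [4.3, 5.1, 5.8, 6.4, 7.9]

normalized = min_max_normalize(sepal_length)
standardized = standardize(sepal_length)

print("normalized:  ", [round(v, 3) for v in normalized])
print("standardized:", [round(v, 3) for v in standardized])
```

Either transformation is applied to each numeric variable separately, which is why the class label is left untouched.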
By classification we mean that we will categorize each of the 150 flowers into
one of the three class labels: setosa, versicolor or virginica.

This step is called data pre-processing, where we normalize or standardize the
variables. Decide on one or the other, either normalize or standardize, but
not both. When you are ready, go to the next step, which is to click on the
“Classify” tab. Let’s click on that and then choose a classifier. How about a
decision tree? Let’s begin with J48, which is essentially the C4.5 algorithm
by Ross Quinlan. Click on J48. The default test option is 10-fold
cross-validation. I’m going to cover how you can split your data set into
training and testing sets, and how you can do cross-validation, in a future
video, so in this tutorial I’m going to stick to the default. Click on the
“Start” button and your prediction model will be constructed. You can see that
it takes only a couple of seconds, not even a second, maybe half a second, to
create your model.

Here is the summary of your prediction model. Let’s start by scrolling up to
the top. This provides a description of the algorithm that you’re using, J48:
you have a sample size of 150, you have 5 variables, and you are using 10-fold
cross-validation. This is the resulting decision tree of your prediction
model: there are a total of 5 leaves and the size of the tree is 9.

Then these are the performance metrics of your prediction model. You see that
you have 96% accuracy, correctly classifying 144 out of 150 flowers into one
of the three classes, and six that are misclassified. We have the kappa
statistic here, the mean absolute error, the root mean squared error and
others as well. We are also provided with the true positive rate, false
positive rate, precision, recall, F-measure, MCC (the Matthews correlation
coefficient), ROC, and the class. So here we are given the performance metrics
for each of the three classes, this is the weighted average over all three
classes, and the confusion matrix is provided below.

So what is the confusion matrix? It allows you to see how your prediction
model is confused. If you look under the hood, you have 50 flowers for each
class: 50 Iris setosa, 50 Iris versicolor and 50 Iris virginica. “A”, right
here, represents Iris setosa. Out of the 50 Iris setosa we have correctly
classified 49, and one of them is misclassified as “B”, which is Iris
versicolor. So one flower that is supposed to be classified as Iris setosa was
misclassified as Iris versicolor. Going to the second row, we see that 47 Iris
versicolor have been correctly classified and 3 Iris versicolor have been
misclassified as Iris virginica. Going to the third row, 48 Iris virginica
have been correctly classified as Iris virginica; however, two of the Iris
virginica have been misclassified as “B”, or Iris versicolor. So this is very
useful in helping us to understand the confusion made by our prediction model.

So until next time, I’m Chanin Nantasenamat on the Data Professor channel. If
you haven’t subscribed yet, please consider subscribing and clicking on the
notification bell so that you will be notified of the next video. I’ll see you
in the next one!
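
As a recap of the confusion matrix walked through above, the following Python sketch recomputes the headline numbers from the counts read out in the video. The zero cells are inferred from the fact that each class has 50 flowers. The kappa calculation uses the standard Cohen's kappa formula, which is assumed here to be what the summary reports as the kappa statistic.

```python
labels = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]

# Rows are the actual class, columns are the predicted class (A, B, C).
confusion = [
    [49, 1, 0],   # setosa: 49 correct, 1 predicted as versicolor
    [0, 47, 3],   # versicolor: 47 correct, 3 predicted as virginica
    [0, 2, 48],   # virginica: 48 correct, 2 predicted as versicolor
]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(labels)))
accuracy = correct / total

# Per-class recall (true positive rate): diagonal count over the row total.
row_totals = [sum(row) for row in confusion]
col_totals = [sum(col) for col in zip(*confusion)]
recall = [confusion[i][i] / row_totals[i] for i in range(len(labels))]

# Cohen's kappa: agreement beyond what is expected by chance.
expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
kappa = (accuracy - expected) / (1 - expected)

print(f"correct: {correct} of {total}, accuracy: {accuracy:.2%}")
for name, value in zip(labels, recall):
    print(f"recall for {name}: {value:.2f}")
print(f"kappa: {kappa:.2f}")
```

The diagonal holds the correctly classified flowers, so summing it and dividing by the total gives the accuracy shown in the summary.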
