So, let’s see where it is installed. Okay, so it’s right here, WEKA 3.8.3, so I’ll click on WEKA 3.8. Okay, so I just click on OK here. It’s just an informational message telling us that many learning schemes and tools are available, and that there’s a tools menu. Okay, click on OK, and so let’s get started building our model, so why don’t you click on the Explorer button. The WEKA Explorer allows you to intuitively build your prediction model by clicking

on specific functions. So let’s get started by importing the data set; and

the data set that we’re going to use in this practical tutorial will be the

famous Iris data set. The Iris data set is a public-domain data set that is commonly used as an example for teaching data mining. So let’s find it: click

on “Open File”, go to the C drive, go to “Program Files”; I’m not sure whether it is “Program Files” here or “Program Files (x86)”. So let me try the first one; there should be a folder called WEKA. Okay, so it’s in “Program Files” > “WEKA 3.8”; go to the “data” subfolder and then find “iris”. Okay, here we go, “iris.arff”.
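Before opening it, it may help to know what an .arff file contains. ARFF is WEKA’s plain-text data format: a header declaring the relation and its attributes, followed by the data rows. The iris.arff header looks roughly like this (a sketch from memory, so treat the exact attribute names as an approximation):

```text
@RELATION iris

@ATTRIBUTE sepallength  REAL
@ATTRIBUTE sepalwidth   REAL
@ATTRIBUTE petallength  REAL
@ATTRIBUTE petalwidth   REAL
@ATTRIBUTE class        {Iris-setosa,Iris-versicolor,Iris-virginica}

@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
...
```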

Now they also have the iris 2D version, but let’s go with “iris.arff”. Click on “Open” and this is what we see. So let’s have a look at what the various features or menus do. This panel here tells you about the

attributes or variables that this data set has, and we can see that there are a total of five: the sepal length, the sepal width, the petal length, the petal width and the class (label). So this is a data set of 150 closely related flowers called

Iris and they are described by four variables: the length and the width of

the sepal, and the length and the width of the petal. Each flower is then given a corresponding label: either a setosa, a versicolor or a

virginica. And so here we can see again that this is the Iris data set: it has a total of 150 flowers and the five attributes or variables that we see here. If you click on an attribute, to the right you will see some description of it. So you can see that there are 50 Iris setosa, 50 Iris versicolor and 50 Iris virginica. And there is no missing

data here, which is good to know. If we click on the first variable we can see that the minimum value is 4.3, the maximum value is 7.9, and the mean value is 5.843 with a standard deviation of 0.828; we get the same information by clicking on the subsequent variables: the minimum, the maximum, the mean and the standard deviation. Notice that the mean reflects the average value of each variable, while the standard deviation tells us the variability of each variable. So

before we build a prediction model, let’s first normalize our data, because each variable has a different minimum and maximum. As you can see, the first one has a minimum of 4.3 and a maximum of 7.9.
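Since every variable sits on its own scale, the two rescalings used in this tutorial can be sketched in plain Python (a rough illustration of the arithmetic, not WEKA’s actual filter code; note this sketch uses the population standard deviation, which may differ from WEKA’s implementation):

```python
def min_max_normalize(values):
    """Rescale values so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (population std)."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

# Sepal length runs from 4.3 to 7.9, so after min-max normalization
# 4.3 maps to 0, 7.9 maps to 1, and 6.1 lands halfway at 0.5.
print(min_max_normalize([4.3, 6.1, 7.9]))
```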

The second variable has a minimum of 2 and a maximum of 4.4. The third one, the petal length, has a minimum of 1 and a maximum of 6.9. The fourth variable has a minimum of 0.1 and a maximum of 2.5, and we notice that the mean and standard deviation of each one of them are different. So let’s get started by first normalizing or standardizing our

variable. So let me begin by normalizing the minimum and maximum to be 0 and 1. We can do that very easily in WEKA: click on the “Choose” button under Filter, go to filters > unsupervised > attribute, click on “Normalize” and then “Apply”. Before I click the button, notice that the minimum and maximum pairs are 0.1 and 2.5, 1 and 6.9, 2 and 4.4, and 4.3 and 7.9. So I’ll click “Apply” here and notice what changes: the minimum and maximum values become 0 and 1, and we also notice that the mean and standard deviation have changed. Here, same thing, the minimum and maximum are 0 and 1. Third variable, same thing, 0 and 1. Fourth variable, 0 and 1. Alternatively, I can undo that, which brings us back to the original state. So alternatively, instead of Normalize I could use Standardize; it’s in the same subfolder, filters > unsupervised > attribute, where I’ll find “Standardize”. So

click on “Standardize” and then “Apply”. But notice that the mean and the standard deviation will be altered: we see that they are 5.8 and 0.8 here, then 3 and 0.4, 3.7 and 1.7, and 1.199 and 0.763. So click on “Apply”, and the mean becomes 0 and the standard deviation becomes 1. This will happen for every one of them: the third variable and also the fourth variable. However, nothing happens to the class, because that is the class (label) on which we are going to make our

classification. So in data mining there are many tasks you can do: you could visualize the data, cluster the data, classify the data, or build a regression model. But for this example, because our class or output label is qualitative, we will perform classification.
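To make that concrete: a classifier is just a rule mapping the four measurements to a label. As a toy sketch (with hand-picked thresholds chosen for illustration, not the tree WEKA will learn), such a rule might look like:

```python
def classify(petal_length, petal_width):
    """A hand-made toy decision rule for Iris; thresholds are illustrative only."""
    if petal_length < 2.5:       # setosa petals are far shorter than the others
        return "Iris-setosa"
    if petal_width < 1.7:        # versicolor petals tend to be narrower
        return "Iris-versicolor"
    return "Iris-virginica"

print(classify(1.4, 0.2))   # measurements typical of a setosa flower
print(classify(4.5, 1.4))   # typical of a versicolor flower
print(classify(5.8, 2.2))   # typical of a virginica flower
```

A learned decision tree, like the one built in the next step, finds thresholds of this kind automatically from the data.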

By classification we mean that we will categorize each of the 150 flowers into one of the three class labels: either setosa, versicolor or virginica. That step of normalizing or standardizing the variables is called data pre-processing. Decide on one or the other, either normalize or standardize, but not both. When you are ready, go to the next step, which is to click on the “Classify” tab. So let’s click on that, go to the classifier and choose one. How about a decision tree? Let’s begin with J48, which is essentially using

the C4.5 algorithm by Ross Quinlan. So click on J48; the default test option is 10-fold cross-validation. I’m going to cover in a future video how you can split your data set into training and testing sets and how you can do cross-validation, so in this tutorial I’m going to stick to the default. We’ll click on the “Start” button and your prediction model will be constructed; you can see that it takes less than a second, maybe half a second, to create your model. So here, this is the summary of

your prediction model. Let’s start by scrolling up to the top. This provides a description of the run: you’re using the J48 algorithm, you have a sample size of 150, you have 5 variables, you are using 10-fold cross-validation, and this is the resulting decision tree created inside your prediction model. There are a total of 5 leaves and the size of the tree is 9,

and then these are the performance metrics of your prediction model. You can see that you have 96% accuracy, correctly classifying 144 out of 150 flowers into one of the three classes, with six flowers misclassified. We also have the kappa statistic here, the mean absolute error, the root mean squared error and others as well, and we are provided with the true positive rate, false positive rate, precision, recall, F-measure, MCC (the Matthews correlation coefficient), ROC area and the class. So here we are given the performance metrics for each of the three classes, and then this is the weighted average over all three classes, and the

confusion matrix is provided below. So what is the confusion matrix? It allows you to see where your prediction model is confused. If you look under the hood, you have 50 flowers for each class: 50 for Iris setosa, 50 for Iris versicolor and 50 for Iris virginica. “A”, right here, represents Iris setosa, and for the Iris setosa, out of 50 we have correctly classified 49 and misclassified 1 of them as “B”, which is Iris versicolor. So we can see that one flower that was supposed to be classified as Iris setosa was misclassified as Iris versicolor. Going to the second row, we see that 47 Iris versicolor have been correctly classified and 3 Iris versicolor have been misclassified as Iris virginica. Going to the third row, 48 Iris virginica have been correctly classified as Iris virginica; however, two of the Iris virginica have been misclassified as “B”, or Iris versicolor. So this is very useful in helping us understand the confusion made by our prediction model.

So until next time, I’m Chanin Nantasenamat on the Data Professor channel. If you haven’t subscribed yet, please consider subscribing and clicking on the notification bell so that you will be notified of the next video. I’ll see you in the next one!
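For readers following along, the confusion matrix and accuracy discussed above can be sanity-checked with a few lines of Python (the counts below are the ones reported in this tutorial’s run):

```python
# Row = true class, column = predicted class, in the order
# setosa, versicolor, virginica (counts from the tutorial's run).
confusion = [
    [49, 1, 0],   # Iris-setosa:     49 correct, 1 called versicolor
    [0, 47, 3],   # Iris-versicolor: 47 correct, 3 called virginica
    [0, 2, 48],   # Iris-virginica:  48 correct, 2 called versicolor
]

correct = sum(confusion[i][i] for i in range(3))   # diagonal = correct
total = sum(sum(row) for row in confusion)         # all flowers
print(correct, total, correct / total)             # 144 150 0.96
```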
