5. Visual Programming with Orange Tool.
How do we split our data into training and testing data in Orange?
There are two ways to split data into training and testing sets:
1. Predictions
2. Test & Score
The Predictions widget and the Test & Score widget are used differently, as they perform different tasks.
Predictions
The Predictions widget is used to make predictions on test data with an already trained model. It does not perform any kind of cross-validation: you train a model on the training set and connect that model to the Predictions widget to score the test set. The results will therefore differ from those of Test & Score.
We split the data into training and testing datasets because we want to estimate how well the trained model will predict unseen (future) data.
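For readers who prefer scripting, here is a minimal sketch of the same idea in Orange's Python API; it assumes the Orange3 package is installed and that an Orange Table accepts an array of row indices. The split proportion and random seed are illustrative.

```python
# A minimal sketch, assuming Orange3; the 70/30 split and seed are illustrative.
import numpy as np
import Orange

data = Orange.data.Table("iris")            # 150 instances, 3 classes

rng = np.random.default_rng(42)
indices = rng.permutation(len(data))
cut = int(0.7 * len(data))                  # keep ~70% for training
train, test = data[indices[:cut]], data[indices[cut:]]

# Train a model on the training set only (no cross-validation here)
model = Orange.classification.LogisticRegressionLearner()(train)

# "Predictions": apply the trained model to the held-out test set
predicted = model(test)
print(predicted[:10])                       # predicted class indices
```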
Here I will show you how to use the Data Sampler widget together with Test & Score.
The Data Sampler widget implements several data sampling methods. It outputs a sampled and a complementary dataset (with instances from the input set that are not included in the sampled dataset). The output is processed after the input dataset is provided and Sample Data is pressed.
How to efficiently use cross-validation in Orange?
Now, we will use the Data Sampler to split the data into training and testing parts. We are using the iris data, which we loaded with the File widget. In Data Sampler, we split the data with a Fixed proportion of data, keeping 70% of data instances in the sample.
Now it is time to bring in our test data (the remaining 30%) for testing. Connect the Data Sampler to Test & Score once again and set the connection to Remaining Data → Test Data.
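As a rough scripting equivalent of this Data Sampler → Test & Score workflow, assuming Orange3's TestOnTestData evaluation helper, the 70/30 split and the Remaining Data → Test Data link might look like this:

```python
# A minimal sketch, assuming Orange3's TestOnTestData helper: train on the
# 70% sample, score on the remaining 30% (Remaining Data -> Test Data).
import numpy as np
import Orange

data = Orange.data.Table("iris")
rng = np.random.default_rng(0)
indices = rng.permutation(len(data))
cut = int(0.7 * len(data))
sample, remaining = data[indices[:cut]], data[indices[cut:]]

learners = [Orange.classification.LogisticRegressionLearner()]
results = Orange.evaluation.TestOnTestData(sample, remaining, learners)
print("CA: ", Orange.evaluation.CA(results))
print("AUC:", Orange.evaluation.AUC(results))
```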
Test & Score
The Test & Score widget is used to evaluate a model on a given dataset. It performs cross-validation with the number of folds you define. If you set the number of folds to 10, it splits the dataset into 10 portions and runs 10 rounds of evaluation, using 9/10 of the dataset as the training set and the remaining 1/10 as the test set. Each round uses a different portion as the test set.
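A minimal sketch of the same 10-fold cross-validation in Orange's scripting API (the learner choice is just an example):

```python
# A minimal sketch, assuming Orange3's CrossValidation helper.
import Orange

data = Orange.data.Table("iris")
logreg = Orange.classification.LogisticRegressionLearner()

# 10 folds: each round trains on 9/10 of the data and tests on the
# remaining 1/10, so every instance is tested exactly once.
results = Orange.evaluation.CrossValidation(data, [logreg], k=10)
print("CA:  %.3f" % Orange.evaluation.CA(results)[0])
print("AUC: %.3f" % Orange.evaluation.AUC(results)[0])
```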
What is the effect of this on model output/accuracy?
It seems that Logistic Regression (LogReg) still performs well.
What is the effect of splitting the data on the classification result and the classification model?
The widget supports various sampling methods. Cross-validation is one of them. It splits the data into a given number of folds (usually 5 or 10). The algorithm is tested by holding out the examples from one fold at a time; the model is induced from the other folds, and the examples from the held-out fold are classified. This is repeated for all the folds.
One round of cross-validation involves partitioning a sample of data into complementary subsets, performing the analysis on one subset (called the training set), and validating the analysis on the other subset (called the validation set or testing set). To reduce variability, in most methods multiple rounds of cross-validation are performed using different partitions, and the validation results are combined (e.g. averaged) over the rounds to give an estimate of the model’s predictive performance.
In summary, cross-validation combines (averages) measures of fitness in prediction to derive a more accurate estimate of model prediction performance.
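To make the partitioning and averaging explicit, here is a hypothetical hand-rolled cross-validation loop; the fold count and variable names are illustrative and not part of Orange's API:

```python
# A hand-rolled k-fold loop, only to make the partitioning and averaging
# explicit; fold count and variable names are illustrative.
import numpy as np
import Orange

data = Orange.data.Table("iris")
learner = Orange.classification.LogisticRegressionLearner()

k = 5
indices = np.random.default_rng(1).permutation(len(data))
folds = np.array_split(indices, k)

accuracies = []
for i in range(k):
    test_idx = folds[i]
    train_idx = np.hstack([folds[j] for j in range(k) if j != i])
    model = learner(data[train_idx])     # induce the model from the other folds
    predicted = model(data[test_idx])    # classify the held-out fold
    accuracies.append(np.mean(predicted == data.Y[test_idx]))

# Combine (average) the per-fold scores into one performance estimate
print("Mean classification accuracy over %d folds: %.3f" % (k, np.mean(accuracies)))
```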
That’s all for now. 🙃