Splitting data
Splitting your data into training and testing sets lets you build a model on one portion of the data and check it against the other.
Data pre-processing is the technique of transforming raw data into an understandable format. Real-world data is often incomplete, inconsistent, and likely to contain errors. Pre-processing your data improves its quality and helps resolve these issues. One common data pre-processing practice is to split data into training and testing sets. You use the training data to build a model and then run the model against the testing data. Because the testing data was not used to build the model, scoring it helps you minimize the effects of data discrepancies and better diagnose the model.
There are two common scenarios for this:
Train a model to score existing data.
Re-use an existing trained model to score new data.
Scenario 1: Training a model for scoring
Some organizations train a model for scoring when they have so much data that building a solid model on all of it takes too long. They randomly select a portion of the data for training and then use the model built from that portion to score the rest of the data.
Split the data into training and testing sets.
Type "split", or click + Tool from the toolbar to add a Split tool.
In the sidebar, enter the percentages for the two data sets.
Tip
The general recommended ratio is 80% for training and 20% for testing.
Add an Export tool for each output node.
Click Build.
Note
Building your pipe applies all selected tools to your entire data set to calculate your model's accuracy. Building your pipe regularly helps keep your builds fast.
Once the calculation successfully completes, select an Export tool and click Download. Repeat for the other Export tool.
Build a model with the training data set.
Create a second pipe and import the training data set.
Name the pipe "Training data model".
Add a Classifier tool.
Click Build.
Score the testing data set.
In the Training data model pipe, select the Data tool and click Build.
Note
A build sends data through the pipe and scores it.
If it's your first time creating a build, click Get started.
Type a name for your build.
From the Data drop-down list, select your testing data set as the data source.
Click Build.
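For readers who want to see the same idea outside the tool, the split, train, and score steps above can be sketched in plain Python. This is a minimal illustration only and not how the Split or Classifier tools work internally. The sample data, the function names (split_rows, train_centroids, score), and the nearest-centroid model are all assumptions made for the example.

```python
import random

def split_rows(rows, train_fraction=0.8, seed=0):
    """Randomly split rows into (training, testing) lists."""
    shuffled = list(rows)                    # copy so the input is not mutated
    random.Random(seed).shuffle(shuffled)    # seeded so the split is repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def train_centroids(training_rows):
    """Build a tiny stand-in model: the mean feature value for each label."""
    totals, counts = {}, {}
    for value, label in training_rows:
        totals[label] = totals.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    return {label: totals[label] / counts[label] for label in totals}

def score(model, value):
    """Predict the label whose mean is closest to the given value."""
    return min(model, key=lambda label: abs(model[label] - value))

# Hypothetical data: (feature value, label) pairs.
rows = [(float(i), "low") for i in range(50)]
rows += [(float(i), "high") for i in range(100, 150)]

# 80% for training and 20% for testing, as in the Tip above.
training, testing = split_rows(rows, train_fraction=0.8, seed=42)

# Build the model from the training set only, then score the testing set.
model = train_centroids(training)
correct = sum(score(model, value) == label for value, label in testing)
accuracy = correct / len(testing)

print(f"{len(training)} training rows, {len(testing)} testing rows")
print(f"accuracy on testing data: {accuracy:.2f}")
```

The testing rows are never passed to train_centroids, so the accuracy reflects how the model handles data it has not seen.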
Scenario 2: Scoring with an existing model
Some organizations use data from a period of time to build a model and then use that model to score incoming data for predictions. For example, if the model is built from part of a year's data, the predictions for the rest of the year are based on the same model. In this case, you simply apply your built model to the new data set.
Select the Data tool in your built model.
Click Build.
From the Data drop-down list, select your new data set as the data source.
Click Build.
Once the build successfully completes, review the data in the row viewer.
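Conceptually, this scenario means loading a model that was built earlier and applying it, unchanged, to rows it has never seen. The sketch below illustrates that idea in plain Python. It is not the tool's implementation: the stored model format (a JSON object of mean values per label), the score function, and the sample values are assumptions made for the example.

```python
import json

def score(model, value):
    """Predict the label whose stored mean is closest to the given value."""
    return min(model, key=lambda label: abs(model[label] - value))

# Hypothetical model saved after an earlier build: mean feature value per label.
saved_model = json.dumps({"low": 24.5, "high": 124.5})

# Later: load the existing model without retraining it.
model = json.loads(saved_model)

# Hypothetical new data arriving after the model was built.
new_rows = [3.0, 140.0, 60.0, 90.0]

# Score each new row with the same model.
predictions = [(value, score(model, value)) for value in new_rows]
for value, label in predictions:
    print(f"{value:>6.1f} -> {label}")
```

Because the model is only loaded and never refitted, every prediction for the new data comes from the same model.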