
Symon.AI help center

Splitting data into training and testing sets


Data pre-processing is the technique of transforming raw data into an understandable format. Real-world data is often incomplete and inconsistent, and it is likely to contain errors. Pre-processing your data improves its quality and helps resolve these issues. One common pre-processing practice is to split the data into training and testing sets: you use the training set to build a model, and then run the model against the testing set to see how it performs on data it was not built from. Training a model for scoring in this way helps you minimize the effects of data discrepancies and better diagnose the model.

There are two common scenarios for this:

  1. Train a model to score existing data.

  2. Re-use an existing trained model to score new data.

Scenario 1: Training a model for scoring

Some organizations train a model for scoring when they have so much data that building a solid model from all of it takes too long. They randomly select a portion of the data for training, and then use the model built from that portion to score the rest of the data.
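To make the random selection concrete, here is a minimal sketch in plain Python of what a percentage-based split does. It is an illustration only, not how the Split tool is implemented. The function name `split_rows`, the `seed` parameter, and the 100-record stand-in data set are assumptions made for this example.

```python
import random

def split_rows(rows, train_fraction=0.8, seed=42):
    """Randomly split rows into (training, testing) lists.

    Hypothetical helper for illustration; not part of Symon.AI.
    """
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    shuffled = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(100))  # stand-in for 100 imported records
training, testing = split_rows(rows)
print(len(training), len(testing))
```

Fixing the seed makes the same records land in the same set on every run, which helps when you want to compare models built from the same training set.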

  1. Split the data into training and testing sets.

    1. Create a new pipe and import your data.

    2. Type "split", or click the Add button to add a Split tool.

    3. In the sidebar, enter the percentages for the two data sets, and click Apply.

      Tip

      The general recommended ratio is 80% for training and 20% for testing.

    4. Add an Export tool for each output node.

    5. Click Build.

      Note

      Building your pipe applies all selected tools to your entire dataset to calculate your model's accuracy. Building your pipe regularly helps keep your Builds fast.

    6. Once the calculation successfully completes, select an Export tool and click Download. Repeat for the other Export tool.

  2. Build a model with the training data set.

    1. Create a second pipe and import the training data set.

    2. Name the pipe "Training data model".

    3. Add a Classifier tool.

    4. Click Build.

  3. Score the testing data set.

    1. In the Training data model pipe, select the Data tool and click Run.

      Note

      A run sends data through the pipe and scores it.

    2. If it's your first time creating a run, click Get started.

      • If you've already created a run, click New run.

    3. Type a name for your run.

    4. From the Data drop-down list, select your testing data set as the data source.

    5. Click Make runs.

    6. Once the run successfully completes, click View to review the data.
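Outside the product, steps 2 and 3 above (build a model from the training set, then score the testing set) can be sketched as follows. This is a toy example under stated assumptions: the nearest-centroid classifier is a simple stand-in chosen for brevity and is not the algorithm behind the Classifier tool, and the function names and sample records are hypothetical.

```python
from collections import defaultdict
from math import dist

def train_centroids(rows):
    """Build a toy model: the mean feature vector (centroid) of each label."""
    grouped = defaultdict(list)
    for features, label in rows:
        grouped[label].append(features)
    return {
        label: tuple(sum(column) / len(column) for column in zip(*vectors))
        for label, vectors in grouped.items()
    }

def score(model, features):
    """Return the label whose centroid is closest to the given features."""
    return min(model, key=lambda label: dist(model[label], features))

# Hypothetical, already-split data: ((feature_1, feature_2), label)
training = [((1.0, 1.2), "low"), ((0.8, 1.0), "low"),
            ((4.0, 4.2), "high"), ((4.4, 3.8), "high")]
testing = [((1.1, 0.9), "low"), ((4.1, 4.0), "high")]

# Build the model from the training set only
model = train_centroids(training)

# Score the testing set and compare predictions with the known labels
predictions = [score(model, features) for features, _ in testing]
correct = sum(p == label for p, (_, label) in zip(predictions, testing))
accuracy = correct / len(testing)
print(predictions, accuracy)
```

Because the testing records were never used to build the model, the accuracy measured on them gives a fairer picture of how the model behaves on unseen data.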

Scenario 2: Scoring with an existing model

Some organizations use data from one period of time to build a model, and then use that model to score incoming data for predictions. For example, a model built from data for the early part of a year can produce the predictions for the rest of the year. To do this, apply the new data set to your built model.

  1. Import your new data set into Symon.AI.

  2. Select the Data tool in your built model and click Run.

  3. Click New run.

  4. Type a name for your run.

  5. From the Data drop-down list, select your new data set as the data source.

  6. Click Make runs.

  7. Once the run successfully completes, click View to review the data.
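The idea behind this scenario (build once, then score new data without retraining) can be sketched in plain Python. This is an illustration under stated assumptions, not a description of how Symon.AI stores or runs models: the least-squares line is a stand-in model, JSON is an assumed storage format, and the function names and monthly figures are hypothetical.

```python
import json

def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return {"slope": slope, "intercept": mean_y - slope * mean_x}

def score(model, xs):
    """Apply an already-built model to new inputs."""
    return [model["slope"] * x + model["intercept"] for x in xs]

# Build once from an earlier period (hypothetical: month number -> units sold)
historical_months = [1, 2, 3, 4]
historical_units = [10.0, 12.0, 14.0, 16.0]
saved_model = json.dumps(fit_line(historical_months, historical_units))

# Later: reload the stored model and score new data without retraining
model = json.loads(saved_model)
predictions = score(model, [5, 6])
print(predictions)
```

The model parameters are fixed at build time, so every later prediction reflects the period the model was built from. If incoming data drifts away from that period, rebuilding the model is worth considering.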