Symon.AI help center

One Hot Encoder and Category Encoder

Abstract

In Symon.AI, you have several different strategies to apply encoding. However, when using any of the smart tools in Symon.AI, applying an encoder is not necessary because it's automatically applied.

When working in data science, you may be accustomed to turning everything into a number. This is because data science tools often don't accept text fields. Typically, you start with fields containing categorical information and transform them into numbers. Then you can use a machine learning package to solve classification and regression problems. The different ways of transforming these text fields into numbers are called encodings.

One of the simplest encoders is one hot encoding. The one hot encoder takes each distinct value of a categorical column and breaks it out into its own column. It is best used for nominal columns with low cardinality, that is, columns whose categories are names or labels with no implied order. Examples of nominal columns are color and restaurant type.

Here is a more practical example:

A column that describes the lead status has four possible values, so it will become four columns:

  1. Email received

  2. Email opened

  3. Lead updated

  4. Meeting booked

After applying the one hot encoder, the categories are represented by numbers instead. Each new column corresponds to one of the possible lead statuses and contains a 1 when the row has that status. For example, a lead with the status Email received is encoded as:

  1. Email_received = 1

  2. Email_opened = 0

  3. Lead_updated = 0

  4. Meeting_booked = 0

In every row, exactly one column has a value of 1 and the other three columns have a value of 0.
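The lead status example above can be sketched in plain Python. This is only an illustration of the technique, not how Symon.AI implements its one hot encoder. The names `STATUSES` and `one_hot` are hypothetical and chosen for this example.

```python
# Illustrative one hot encoding in plain Python (not Symon.AI's implementation).
# STATUSES and one_hot are hypothetical names used only for this sketch.
STATUSES = ["Email received", "Email opened", "Lead updated", "Meeting booked"]


def one_hot(value, categories):
    """Return a dict with one 0/1 column per category."""
    if value not in categories:
        raise ValueError(f"Unknown category: {value!r}")
    # The column for the matching category gets 1; every other column gets 0.
    return {
        category.replace(" ", "_"): int(category == value)
        for category in categories
    }


encoded = one_hot("Email received", STATUSES)
print(encoded)
# {'Email_received': 1, 'Email_opened': 0, 'Lead_updated': 0, 'Meeting_booked': 0}
```

Because each category gets its own column, the number of columns grows with the number of unique values, which is why this approach suits low-cardinality columns.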

The other type of encoder in Symon.AI is the category encoder. It is best used for nominal columns with high cardinality, meaning more than 20 unique values, such as countries, cities, or sales representative IDs.

For ordinal columns, where the categories have a natural order, you can use replacement tools to perform ordinal encoding. For example, you can map how qualified the leads are according to this scale:

  • 0 - Unknown

  • 1 - Poor fit

  • 2 - Medium fit

  • 3 - Good fit
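The scale above amounts to a lookup from label to rank. A minimal sketch in plain Python follows. It is not Symon.AI's replacement tool. The names `FIT_SCALE` and `encode_fit` are hypothetical, and treating unrecognized labels as Unknown is an assumption made for this example.

```python
# Illustrative ordinal encoding in plain Python (not Symon.AI's replacement tool).
# FIT_SCALE and encode_fit are hypothetical names used only for this sketch.
FIT_SCALE = {"Unknown": 0, "Poor fit": 1, "Medium fit": 2, "Good fit": 3}


def encode_fit(label):
    """Replace a fit label with its rank on the scale.

    Assumption: labels that are not on the scale are treated as Unknown (0).
    """
    return FIT_SCALE.get(label, FIT_SCALE["Unknown"])


leads = ["Good fit", "Poor fit", "Unknown", "Medium fit"]
encoded = [encode_fit(label) for label in leads]
print(encoded)  # [3, 1, 0, 2]
```

Unlike one hot encoding, this keeps a single column and preserves the order of the categories, so a larger number means a better fit.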

Tip

The one hot encoder works best with fewer than 12 categories. If you have more than 12 categories, it's better to roll up the categories into higher levels, for example by defining a parent category for each small set of sub-categories.
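Rolling up categories can be sketched as a lookup from each sub-category to its parent, applied before encoding. The restaurant categories, the `PARENT_CATEGORY` mapping, the `roll_up` helper, and the "Other" fallback below are all hypothetical and only illustrate the idea.

```python
# Illustrative roll-up of sub-categories into parent categories.
# The categories, PARENT_CATEGORY, and roll_up are hypothetical examples.
PARENT_CATEGORY = {
    "Sushi": "Asian",
    "Ramen": "Asian",
    "Thai": "Asian",
    "Pizza": "Italian",
    "Pasta": "Italian",
    "Burgers": "American",
    "BBQ": "American",
}


def roll_up(sub_category, default="Other"):
    """Map a sub-category to its parent; unmapped values go to a catch-all."""
    return PARENT_CATEGORY.get(sub_category, default)


restaurants = ["Sushi", "Pizza", "BBQ", "Ramen", "Tapas"]
rolled = [roll_up(name) for name in restaurants]
print(sorted(set(rolled)))  # ['American', 'Asian', 'Italian', 'Other']
```

Here seven sub-categories, plus any unmapped ones, collapse into four parent categories, which keeps the number of one hot columns small.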