Symon.AI help center

Data profiling correlation methods

Abstract

Correlation is a bivariate analysis that measures the strength of association between two sets (columns) of data and the direction of the relationship.

Pearson's r

Pearson's correlation coefficient measures the linear relationship between two numerical columns, so Symon.AI won't return results for text columns. Its value lies between -1 and +1, with -1 indicating total negative linear correlation, 0 indicating no linear correlation, and +1 indicating total positive linear correlation. It's calculated by dividing the covariance of the two columns by the product of their standard deviations. This method reflects only linear relationships, so it can miss more complex dependencies.
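To make the calculation concrete, here is a minimal sketch of Pearson's r in plain Python. It illustrates the textbook formula and is not Symon.AI's implementation; the function name `pearson_r` and the sample columns are made up for this example.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's r: covariance divided by the product of the standard deviations."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("columns must have the same length, with at least two rows")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of products of deviations; the 1/n factors cancel in the ratio.
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("r is undefined when a column is constant")
    return cov_xy / sqrt(var_x * var_y)


# Hypothetical sample columns.
hours = [1, 2, 3, 4]
print(pearson_r(hours, [2, 4, 6, 8]))  # perfectly linear, increasing
print(pearson_r(hours, [8, 6, 4, 2]))  # perfectly linear, decreasing
# A symmetric U-shape is a strong dependency, but not a linear one.
print(pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4]))
```

The first two calls give values at the +1 and -1 ends of the scale, while the U-shaped data scores 0 even though the second column is fully determined by the first.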

Spearman’s ρ

Spearman's rank correlation coefficient measures the monotonic correlation between two numerical columns, so Symon.AI will not return results for text columns. It measures how well the ranking of one column's values predicts the ranking of the other column's values, and it lowers the coefficient according to how far apart the two rankings are. It is better than Pearson's r at catching nonlinear, order-preserving correlations but is weaker at measuring linear relationships. It penalizes large differences in ranking more heavily than Kendall's τ, which makes it more sensitive to values ranked far out of place (outliers). Its value lies between -1 and +1, with -1 indicating total negative monotonic correlation, 0 indicating no monotonic correlation, and +1 indicating total positive monotonic correlation.
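One common way to compute Spearman's ρ is to replace each value with its rank and then apply Pearson's formula to the ranks. The sketch below follows that approach in plain Python. It assumes tied values share the average of their ranks, which may differ from how Symon.AI handles ties, and the helper names are made up for this example.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 upward; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson_r(x, y):
    """Pearson's r on two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("undefined when a column is constant")
    return cov_xy / sqrt(var_x * var_y)


def spearman_rho(x, y):
    """Spearman's rho: Pearson's r computed on the ranks of x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("columns must have the same length, with at least two rows")
    return pearson_r(average_ranks(x), average_ranks(y))


# Hypothetical data: y grows as the cube of x, so the order is preserved
# but the relationship is not a straight line.
x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]
print(spearman_rho(x, y))  # the rankings agree completely
print(pearson_r(x, y))     # lower, because the relationship is curved
```

Because the two rankings are identical, ρ reaches the top of its scale here, while Pearson's r on the raw values stays below it.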

Kendall's τ

Similar to Spearman's coefficient, the Kendall rank correlation coefficient (τ) measures the ordinal association between two numerical columns, so Symon.AI will not return results for text columns. It measures how well the ranking of one column's values predicts the ranking of the other column's values. It compares the two rankings pair by pair and lowers the coefficient for every pair of rows that is ordered one way in the first column and the opposite way in the second (a discordant pair). It is better at measuring nonlinear correlations than Pearson's r but is weaker at measuring linear relationships. It penalizes every difference in rank order equally, so it is less sensitive to outliers than Spearman's ρ. It is also more accurate with small sample sizes. Its value lies between -1 and +1, with -1 indicating total negative correlation, 0 indicating no correlation, and +1 indicating total positive correlation.
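The pairwise comparison can be sketched in a few lines of Python. This example computes the simplest variant, usually called tau-a, in which tied pairs count as neither concordant nor discordant. Which variant Symon.AI uses is not stated here, so treat the choice, along with the function name and sample data, as an assumption for illustration.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / total number of row pairs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("columns must have the same length, with at least two rows")
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1  # both columns order this pair the same way
        elif direction < 0:
            discordant += 1  # the columns order this pair in opposite ways
        # direction == 0 means a tie in at least one column: counted as neither
    total_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / total_pairs


# Hypothetical data: only the last two rows are ordered differently in y.
x = [1, 2, 3, 4, 5]
y = [1, 2, 3, 5, 4]
print(kendall_tau_a(x, y))  # 9 concordant pairs, 1 discordant pair, 10 pairs in total
```

Every discordant pair costs the same amount no matter how far apart the two ranks are, which is the equal penalty described above.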

Phik (φk)

Phik is Symon.AI's default choice. It measures the dependency between two columns of any type, categorical or numerical (string or numeric columns), and captures nonlinear dependencies, so Symon.AI will return results for any column choice. When comparing two normally distributed numerical columns, it is roughly equivalent to Pearson's r. Its value lies between 0 and 1, with 0 indicating no correlation and 1 indicating complete correlation.

Cramér's V (φc)

Cramér's V measures the association between two categorical variables. Symon.AI assumes only text columns are categorical, so it will not return results for other column types. It's based on the well-known chi-squared test and measures how far the joint distribution of categories in the two columns departs from what would be expected if the columns were independent. The more strongly the categories in one column depend on those in the other, the higher the score. Its value lies between 0 and 1, with 0 indicating no correlation and 1 indicating complete correlation.
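As a rough illustration of how the chi-squared statistic turns into a score between 0 and 1, the sketch below computes the uncorrected form of Cramér's V in plain Python. It is not Symon.AI's implementation, which may apply a bias correction, and the function name and sample columns are made up for this example.

```python
from collections import Counter
from math import sqrt


def cramers_v(a, b):
    """Uncorrected Cramér's V for two equal-length categorical columns."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("columns must be non-empty and have the same length")
    n = len(a)
    observed = Counter(zip(a, b))  # joint counts per (category_a, category_b)
    totals_a = Counter(a)          # marginal counts for the first column
    totals_b = Counter(b)          # marginal counts for the second column
    k = min(len(totals_a), len(totals_b)) - 1
    if k == 0:
        raise ValueError("each column needs at least two distinct categories")
    chi2 = 0.0
    for cat_a, count_a in totals_a.items():
        for cat_b, count_b in totals_b.items():
            # Count expected in this cell if the columns were independent.
            expected = count_a * count_b / n
            chi2 += (observed.get((cat_a, cat_b), 0) - expected) ** 2 / expected
    return sqrt(chi2 / (n * k))


# Hypothetical text columns.
color = ["red", "red", "blue", "blue"]
print(cramers_v(color, ["hot", "hot", "cold", "cold"]))  # second column fully determined by color
print(cramers_v(color, ["hot", "cold", "hot", "cold"]))  # second column independent of color
```

The first call lands at the top of the scale because every color maps to exactly one temperature label, and the second lands at the bottom because each label appears equally often under each color.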