
Data Cleaning

An overview of the categories of data cleaning, general techniques, and available platforms.

What is Normalization?

Normalization can mean several things within the context of statistical programming.

Normalization can refer to re-scaling variables produced using different measurement systems (such as customer ratings for products or services), to assigning percentile, quartile, or other quantile ranks, or to scaling the values of a variable so they fall between 0 and 1 (or between -1 and 1).
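
As a minimal sketch of two of these senses (assuming Python with pandas; the ratings values below are invented for illustration), min-max scaling and quantile ranking might look like this:

    import pandas as pd

    # Hypothetical ratings collected on a 1-10 scale
    ratings = pd.Series([3.0, 7.5, 9.0, 4.5, 10.0])

    # Min-max scaling: re-scale values so they fall between 0 and 1
    scaled = (ratings - ratings.min()) / (ratings.max() - ratings.min())

    # Quantile ranks: each value's percentile rank within the variable
    percentile_ranks = ratings.rank(pct=True)

    print(scaled.tolist())           # [0.0, 0.64..., 0.86..., 0.21..., 1.0]
    print(percentile_ranks.tolist()) # [0.2, 0.6, 0.8, 0.4, 1.0]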

One especially important example is transforming numerical variables so that they fit, or at least more closely match, a normal (Gaussian) distribution. This type of normalization is often necessary to prepare data so that it meets the assumptions of linear regression.
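
As a rough sketch of this kind of normalization (assuming Python with NumPy and SciPy; the simulated variable is invented for illustration), a log or Box-Cox power transform can pull a right-skewed, strictly positive variable closer to a normal shape:

    import numpy as np
    from scipy import stats

    # Simulate a right-skewed (log-normal) variable
    rng = np.random.default_rng(0)
    skewed_data = rng.lognormal(mean=0.0, sigma=1.0, size=500)

    # A simple log transform is often enough for positive, right-skewed data
    log_transformed = np.log(skewed_data)

    # Box-Cox estimates a power transform; it requires strictly positive values
    boxcox_transformed, fitted_lambda = stats.boxcox(skewed_data)

    # Skewness should move toward 0 (the skewness of a normal distribution)
    print(stats.skew(skewed_data), stats.skew(boxcox_transformed))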

Assumptions of Linear Regression

Linear regression (commonly fit by Ordinary Least Squares, or OLS) is a type of model that predicts a value (the dependent variable) based on one or more other values (the independent variable(s)). Linear regression is a powerful tool, but several conditions must be met in order for it to be reliable. These four (sometimes five) assumptions are, in no particular order:

1. Linearity - In order to perform regression, there must be an approximately linear relationship between the independent variable(s) and the dependent variable. If the independent and dependent variables are not linearly related, the regression model will have little predictive power. (A sketch of fitting a model and checking assumptions 1 through 4 follows this list.)

2. Normality of Residuals - When establishing a predictive relationship between the independent variable(s) and the dependent variable, the residuals (observed values minus predicted values) should be normally distributed. If they are not, this implies that there may be bias in the model.

3. Homoscedasticity (also spelled homoskedasticity, also known as "homogeneity of variance") - Not only should the residuals be normally distributed, they should also have constant variance. If the spread of the residuals widens or narrows across the range of predicted values, the variance is said to be heteroscedastic (also spelled heteroskedastic).

[Image: plots showing homoscedastic and heteroscedastic relationships between two variables, along with their residuals from regression.]

4. Independence - Data points in the dataset should not influence one another, and whether one observation is included should not depend on any other observation. (There should also be no duplicate data points.)

(5.) No Multicollinearity - In Multiple Linear Regression*, independent variables should not be highly correlated with one another. Correlation among independent variables indicates that one may be affecting the other, or that both may be affected by an unobserved variable, and it makes the individual coefficient estimates unstable. (Multicollinearity is commonly screened with variance inflation factors, as sketched after this list.)

*Multiple Linear Regression is linear regression with more than one independent variable, not to be confused with Multivariate Regression, which models more than one dependent variable.
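
The following is a hedged sketch, not a definitive recipe, of checking assumptions 1 through 4 (assuming Python with NumPy, SciPy, and statsmodels; the simulated data and variable names are invented for illustration):

    import numpy as np
    import statsmodels.api as sm
    from scipy import stats
    from statsmodels.stats.diagnostic import het_breuschpagan
    from statsmodels.stats.stattools import durbin_watson

    # Simulate a linear relationship with independent, constant-variance noise
    rng = np.random.default_rng(1)
    x = rng.uniform(0, 10, size=200)
    y = 2.0 * x + 5.0 + rng.normal(0, 1, size=200)

    # Fit an Ordinary Least Squares model
    X = sm.add_constant(x)  # add an intercept column
    results = sm.OLS(y, X).fit()
    residuals = results.resid

    # Assumption 1: linearity (R-squared is a rough gauge of the linear fit)
    print(f"R-squared = {results.rsquared:.3f}")

    # Assumption 2: normality of residuals (Shapiro-Wilk test;
    # a small p-value suggests the residuals are not normal)
    shapiro_stat, shapiro_p = stats.shapiro(residuals)
    print(f"Shapiro-Wilk p = {shapiro_p:.3f}")

    # Assumption 3: homoscedasticity (Breusch-Pagan test;
    # a small p-value suggests heteroscedasticity)
    bp_lm, bp_p, bp_f, bp_f_p = het_breuschpagan(residuals, X)
    print(f"Breusch-Pagan p = {bp_p:.3f}")

    # Assumption 4: independence (Durbin-Watson statistic; for ordered data,
    # values near 2 suggest no autocorrelation in the residuals)
    print(f"Durbin-Watson = {durbin_watson(residuals):.2f}")

In practice, these tests complement, rather than replace, visual checks such as residual plots and Q-Q plots.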
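
For assumption 5, multicollinearity is commonly screened with variance inflation factors (VIFs). A minimal sketch, again assuming statsmodels and invented data; VIFs above roughly 5 to 10 are a common rule of thumb for problematic collinearity:

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    # Simulate two predictors, the second highly correlated with the first
    rng = np.random.default_rng(2)
    x1 = rng.normal(size=200)
    x2 = 0.9 * x1 + rng.normal(scale=0.2, size=200)

    X = sm.add_constant(np.column_stack([x1, x2]))

    # Compute a VIF for each predictor (skipping the intercept, column 0)
    for i in (1, 2):
        print(f"VIF for predictor {i}: {variance_inflation_factor(X, i):.2f}")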