Data Cleaning and Preprocessing

Jan 26, 2023

Data cleaning and preprocessing is an essential step in the data science process. It involves identifying and correcting any errors, inconsistencies, or missing values in the data. This step is crucial because dirty data can lead to inaccurate conclusions and poor decision-making.

One common issue that needs to be addressed during data cleaning is missing values. In Python, missing values can be identified with the isnull() method of a pandas DataFrame (isna() is an equivalent alias). For example, the code below returns a Boolean mask indicating which values in a DataFrame are missing:

import pandas as pd

df = pd.read_csv("data.csv")
missing_values = df.isnull()
print(missing_values)
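A full Boolean mask is hard to read on anything but a tiny table, so it is usually summarized. The sketch below counts missing values per column and measures the share of incomplete rows; the toy DataFrame and its column names are made up for illustration and stand in for data.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for data.csv
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40],
    "city": ["Lagos", "Abuja", None, "Kano"],
})

# Number of missing values in each column
missing_per_column = df.isnull().sum()

# Share of rows that contain at least one missing value
share_incomplete_rows = df.isnull().any(axis=1).mean()

print(missing_per_column)
print(share_incomplete_rows)
```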

There are several ways to handle missing values, depending on the situation. One common approach is to simply remove any rows or columns that contain missing values. However, this method can discard valuable data. Another approach is to fill in missing values, either with a fixed value or a statistic such as the column median, or by interpolating between neighboring values. For example, using the fillna() method in pandas, we can replace missing values with a specific value, such as 0 (a choice that only makes sense when 0 is a plausible value for the column):

df = df.fillna(0)
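To compare the options mentioned above side by side, here is a minimal sketch on a made-up temperature column (the data and column name are assumptions). It shows dropping rows, filling with the column median, and linear interpolation.

```python
import numpy as np
import pandas as pd

# Hypothetical readings with two gaps
df = pd.DataFrame({"temperature": [20.0, np.nan, 24.0, np.nan, 30.0]})

# Option 1: drop every row that has a missing value
dropped = df.dropna()

# Option 2: fill gaps with the median of the observed values
filled_median = df.fillna(df["temperature"].median())

# Option 3: estimate each gap from its neighbors (linear interpolation)
interpolated = df.interpolate(method="linear")

print(interpolated)
```

Which option is appropriate depends on the data: interpolation assumes the rows are ordered in a meaningful way, such as by time.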

Another important aspect of data cleaning is dealing with outliers. Outliers are values that are significantly different from the rest of the data. They can be caused by errors in data collection or measurement and can skew the overall results. In Python, the zscore() function from the scipy.stats module can be used to identify outliers. The code below returns the absolute z-score of each value in a given column:

import numpy as np
from scipy import stats

# Absolute z-score of every value in the column
z = np.abs(stats.zscore(df["column_name"]))

Values with a z-score greater than a certain threshold (typically 3 or 3.5) are considered outliers. These values can be removed or replaced with the median value of the column.
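As a rough sketch of both options, the example below flags values whose absolute z-score exceeds 3, then either drops them or replaces them with the column median. The data is invented for illustration; note that in very small samples no value can reach a z-score of 3, so the threshold only becomes useful with enough rows.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical column: 19 ordinary values and one obvious outlier (100)
values = [10, 11, 9, 10, 12, 10, 11, 9, 10, 10,
          11, 9, 10, 12, 10, 11, 9, 10, 10, 100]
df = pd.DataFrame({"column_name": values}, dtype=float)

z = np.abs(stats.zscore(df["column_name"]))
threshold = 3  # assumed cut-off; 3.5 is another common choice
is_outlier = z > threshold

# Option 1: remove the outlying rows
df_removed = df[~is_outlier]

# Option 2: replace outliers with the column median
df_replaced = df.copy()
df_replaced.loc[is_outlier, "column_name"] = df["column_name"].median()

print(df_removed.shape, df_replaced["column_name"].max())
```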

Data preprocessing also includes data transformation, which involves converting the data into a format that is suitable for analysis. One common data transformation technique is normalization, which scales the data to a specific range, such as 0 to 1. In Python, the MinMaxScaler class from the sklearn.preprocessing module can be used to normalize numeric data:

from sklearn.preprocessing import MinMaxScaler

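scaler = MinMaxScaler()
# Scale only the numeric columns; fit_transform returns a NumPy array
numeric_columns = df.select_dtypes(include="number").columns
df_normalized = scaler.fit_transform(df[numeric_columns])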
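Min-max scaling maps each value x to (x - min) / (max - min), computed per column. The sketch below, on made-up height and weight columns, wraps the scaler output back into a DataFrame and checks it against the formula applied by hand.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical numeric data
df = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "weight_kg": [50.0, 60.0, 80.0, 90.0],
})

scaler = MinMaxScaler()
scaled = scaler.fit_transform(df)  # NumPy array, one column per feature

# Restore column names and index so the result is still a DataFrame
df_normalized = pd.DataFrame(scaled, columns=df.columns, index=df.index)

# The same transformation written out by hand
df_manual = (df - df.min()) / (df.max() - df.min())

print(df_normalized)
```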

Another data transformation technique is encoding categorical variables. Categorical variables are variables that have a limited number of possible values, such as gender or product type. In Python, the OneHotEncoder class from the sklearn.preprocessing module can be used to convert a categorical variable into one binary indicator column per category (by default, fit_transform returns the result as a sparse matrix):

from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()
df_encoded = encoder.fit_transform(df[["categorical_column"]])

In SQL, data cleaning and preprocessing can be done with ordinary queries and statements. For example, to delete every row in which a given column is NULL, the following command can be used:

DELETE FROM table_name
WHERE column_name IS NULL;
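Because DELETE is irreversible, it is worth counting the affected rows first. The sketch below runs the statement from Python against an in-memory SQLite database; the table contents and the id column are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_name (id INTEGER, column_name REAL)")
conn.executemany(
    "INSERT INTO table_name VALUES (?, ?)",
    [(1, 10.0), (2, None), (3, 12.5), (4, None)],  # None is stored as NULL
)

# Check how many rows would be removed before deleting anything
null_count = conn.execute(
    "SELECT COUNT(*) FROM table_name WHERE column_name IS NULL"
).fetchone()[0]

conn.execute("DELETE FROM table_name WHERE column_name IS NULL")
remaining = conn.execute("SELECT id FROM table_name ORDER BY id").fetchall()

print(null_count, remaining)
```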

To find and remove outliers, the following command can be used:

WITH stats AS (
    SELECT AVG(column_name) AS mean_value,
           STDEV(column_name) AS std_value
    FROM table_name
)
DELETE FROM table_name
WHERE column_name > (SELECT mean_value + 3 * std_value FROM stats)
   OR column_name < (SELECT mean_value - 3 * std_value FROM stats);

This command uses a Common Table Expression (CTE) to compute the average and standard deviation of the column, and then deletes any rows where the value of the column is more than 3 standard deviations away from the average. It is written for SQL Server, where the standard deviation function is called STDEV(); several other databases name it STDDEV() or STDDEV_SAMP().
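Not every database offers a standard deviation aggregate (SQLite, for instance, has none built in). One possible workaround, sketched here with an in-memory SQLite table and invented values, is to compute the bounds in Python and pass them to the DELETE statement as parameters.

```python
import sqlite3
import statistics

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_name (column_name REAL)")

# Hypothetical data: 19 ordinary values and one outlier (100)
values = [10, 11, 9, 10, 12, 10, 11, 9, 10, 10,
          11, 9, 10, 12, 10, 11, 9, 10, 10, 100]
conn.executemany(
    "INSERT INTO table_name VALUES (?)", [(float(v),) for v in values]
)

# Compute mean and sample standard deviation on the client side
rows = [r[0] for r in conn.execute("SELECT column_name FROM table_name")]
mean = statistics.mean(rows)
std = statistics.stdev(rows)

# Delete rows more than 3 standard deviations from the mean
conn.execute(
    "DELETE FROM table_name WHERE column_name > ? OR column_name < ?",
    (mean + 3 * std, mean - 3 * std),
)

remaining_count = conn.execute("SELECT COUNT(*) FROM table_name").fetchone()[0]
remaining_max = conn.execute("SELECT MAX(column_name) FROM table_name").fetchone()[0]
print(remaining_count, remaining_max)
```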

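Standard SQL has no built-in function for min-max scaling, but the same result can be computed from the column's minimum and maximum using window functions. For example, to scale a column to the range 0 to 1:

SELECT (column_name - MIN(column_name) OVER ()) * 1.0
       / NULLIF(MAX(column_name) OVER () - MIN(column_name) OVER (), 0) AS normalized_value
FROM table_name;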
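Min-max scaling can also be written with plain aggregates, which works in essentially any SQL dialect. The sketch below tries this from Python on an in-memory SQLite table with made-up values; NULLIF guards against division by zero when every value in the column is the same.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_name (column_name REAL)")
conn.executemany(
    "INSERT INTO table_name VALUES (?)",
    [(50.0,), (60.0,), (80.0,), (90.0,)],  # hypothetical values
)

# (x - min) / (max - min), with min and max taken from scalar subqueries
query = """
SELECT column_name,
       (column_name - (SELECT MIN(column_name) FROM table_name))
       / NULLIF((SELECT MAX(column_name) FROM table_name)
                - (SELECT MIN(column_name) FROM table_name), 0) AS normalized_value
FROM table_name
ORDER BY column_name
"""
normalized = [row[1] for row in conn.execute(query)]
print(normalized)
```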

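Finally, to summarize categorical variables, MySQL and SQLite provide an aggregate function called GROUP_CONCAT() (PostgreSQL and SQL Server offer STRING_AGG() instead), which concatenates the values of a categorical column within each group into a single string in a new column. Note that this collects the categories per group rather than encoding them as numbers. For example: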

SELECT column_name, GROUP_CONCAT(categorical_column) AS new_column
FROM table_name
GROUP BY column_name;
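The sketch below runs the GROUP_CONCAT() query on an in-memory SQLite table with invented rows. It also shows a different technique that is not part of the query above: CASE expressions that build 0/1 indicator columns, which is one common way to get a one-hot style encoding in SQL. The category names 'red' and 'blue' are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_name (column_name TEXT, categorical_column TEXT)")
conn.executemany(
    "INSERT INTO table_name VALUES (?, ?)",
    [("a", "red"), ("a", "blue"), ("b", "red")],  # hypothetical rows
)

# GROUP_CONCAT joins the categories of each group into one string;
# the order inside the string is not guaranteed, so sort after splitting
grouped = {
    key: sorted(joined.split(","))
    for key, joined in conn.execute(
        "SELECT column_name, GROUP_CONCAT(categorical_column) "
        "FROM table_name GROUP BY column_name"
    )
}

# Alternative: one 0/1 indicator column per known category
indicators = conn.execute("""
    SELECT categorical_column,
           CASE WHEN categorical_column = 'red' THEN 1 ELSE 0 END AS is_red,
           CASE WHEN categorical_column = 'blue' THEN 1 ELSE 0 END AS is_blue
    FROM table_name
    ORDER BY rowid
""").fetchall()

print(grouped)
print(indicators)
```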

In conclusion, data cleaning and preprocessing are essential steps in the data science process. They involve identifying and correcting errors, inconsistencies, and missing values in the data. By using the above techniques, data scientists and analysts can make their data more reliable and accurate, allowing them to make better-informed decisions based on their analysis.

Written by Ofili Lewis