
Data Preprocessing Techniques Every ML Beginner Should Know (Step 1)

Welcome Tech Geeks!

Introduction

In machine learning, raw data is often messy, incomplete, and unorganized. Before you feed this data to a machine learning model, you need to prepare it. This process is called data preprocessing. It ensures the data is clean, consistent, and ready to be understood by the model.

Think of it like preparing ingredients before cooking. If your vegetables aren't cleaned or cut properly, you can't cook a great meal. Similarly, without preprocessing, machine learning models won't deliver accurate predictions.

In this blog, we will explore the key data preprocessing techniques every ML beginner must know. By the end, you'll have a solid understanding of how to clean, transform, and prepare data to boost the performance of machine learning models.


What is Data Preprocessing?

Data preprocessing is the process of converting raw data into a clean, structured, and usable format. It removes errors, fills in missing information, and prepares the data so that the machine learning model can understand it.

Why is it important?

  1. Better Accuracy: Clean data leads to more accurate models.
  2. Faster Training: Well-structured data speeds up the training process.
  3. Avoids Errors: Unprocessed data can cause errors in predictions.

Key Data Preprocessing Techniques

1. Handling Missing Values

Sometimes, data may be incomplete, with missing information in certain rows or columns. For example, a table of customer data may have missing ages or missing email addresses. If we don't handle missing data, it can affect the performance of the machine learning model.

How to Handle Missing Data?

  • Remove Missing Data: If only a few data points are missing, you can simply remove those rows or columns.
  • Imputation: If a lot of data is missing, deleting it might not be a good idea. Instead, we can fill in the missing values using:
    • Mean: Replace missing values with the average of the column.
    • Median: Use the median value of the column.
    • Mode: Use the most common value from the column.
    • Advanced Methods: Use K-Nearest Neighbors (KNN) or machine learning models to predict and fill missing values (a KNN sketch follows the example below).

Example (Using Python):

import pandas as pd

# Sample data with missing values
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, None, 30, 28],
        'Salary': [50000, 60000, None, 52000]}
df = pd.DataFrame(data)

# Fill missing Age with the median value
df['Age'] = df['Age'].fillna(df['Age'].median())

# Fill missing Salary with the mean value
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())

print(df)
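The advanced option mentioned above, KNN imputation, fills each gap using the most similar rows instead of a single column statistic. Here is a minimal sketch using scikit-learn's KNNImputer (the n_neighbors value is chosen only for illustration):

from sklearn.impute import KNNImputer
import pandas as pd

data = {'Age': [25, None, 30, 28],
        'Salary': [50000, 60000, None, 52000]}
df = pd.DataFrame(data)

# Replace each missing value with the average of the 2 most similar rows
imputer = KNNImputer(n_neighbors=2)
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(df_imputed)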

2. Data Transformation

Data transformation involves converting raw data into a format that is easier for the machine learning model to understand.

Key Techniques:

  • Feature Scaling: The range of values can vary widely across columns (like Age: 18-80, Salary: 10,000-100,000). Features on larger scales can dominate distance calculations and slow down training, so we rescale them.
    • Min-Max Scaling: Rescales each value to the range 0 to 1 using (x - min) / (max - min).
    • Standardization (Z-score Normalization): Rescales values to have a mean of 0 and a standard deviation of 1 using (x - mean) / standard deviation.

Example (Using Python):

from sklearn.preprocessing import MinMaxScaler, StandardScaler
import pandas as pd

data = {'Age': [18, 25, 30, 45, 60],
        'Salary': [10000, 20000, 50000, 80000, 120000]}
df = pd.DataFrame(data)

# Min-Max Scaling: rescales each column to [0, 1]
scaler = MinMaxScaler()
df_scaled = scaler.fit_transform(df)

# Standardization: mean 0, standard deviation 1
scaler_std = StandardScaler()
df_standardized = scaler_std.fit_transform(df)

print("Min-Max Scaled Data:\n", df_scaled)
print("Standardized Data:\n", df_standardized)
  • Encoding Categorical Data:
    If your data has text categories like 'Male', 'Female', 'Yes', 'No', you'll need to convert them into numbers so the model can understand them.
    • Label Encoding: Converts each category into a number (scikit-learn's LabelEncoder assigns them alphabetically, e.g., Female=0, Male=1).
    • One-Hot Encoding: Creates a separate column for each category and marks 1 or 0 for its presence.

Example (Using Python):

from sklearn.preprocessing import LabelEncoder
import pandas as pd

data = {'Gender': ['Male', 'Female', 'Female', 'Male'],
        'Country': ['India', 'USA', 'India', 'Germany']}
df = pd.DataFrame(data)

# Label Encoding: map each category to an integer
le = LabelEncoder()
df['Gender'] = le.fit_transform(df['Gender'])

# One-Hot Encoding: one binary column per country
df_encoded = pd.get_dummies(df, columns=['Country'])

print(df)
print(df_encoded)
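pd.get_dummies is convenient for quick experiments, but in a modeling pipeline you may prefer scikit-learn's OneHotEncoder, which remembers the categories it learned during fitting and can apply them consistently to new data. A minimal sketch, assuming scikit-learn 1.2+ (where the dense-output flag is named sparse_output):

from sklearn.preprocessing import OneHotEncoder
import pandas as pd

data = {'Country': ['India', 'USA', 'India', 'Germany']}
df = pd.DataFrame(data)

# Fit the encoder and get one binary column per country
encoder = OneHotEncoder(sparse_output=False)
encoded = encoder.fit_transform(df[['Country']])
df_encoded = pd.DataFrame(encoded, columns=encoder.get_feature_names_out(['Country']))

print(df_encoded)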

3. Feature Selection and Feature Engineering

Feature selection identifies which data points (or features) have the most influence on model performance, while feature engineering creates new, more informative features from the raw data.

Feature Selection:

  • Remove features that have little or no effect on the prediction.
  • Techniques include correlation analysis and feature importance from models like Random Forest, as sketched below.
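One common approach is to train a quick Random Forest and inspect which features it relies on. A minimal sketch on scikit-learn's built-in Iris dataset (chosen here only as a stand-in for your own data):

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

# Toy dataset with four numeric features
X, y = load_iris(return_X_y=True, as_frame=True)

model = RandomForestClassifier(random_state=42)
model.fit(X, y)

# Rank features from most to least important
for name, score in sorted(zip(X.columns, model.feature_importances_),
                          key=lambda pair: -pair[1]):
    print(f"{name}: {score:.3f}")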

Feature Engineering:

  • Create new features from existing ones. For example, instead of using “Date of Birth,” create an “Age” column.
  • Combine features to create new useful features (a small sketch follows the example below).

Example (Using Python):

import pandas as pd

data = {'Date_of_Birth': ['2000-01-01', '1995-06-15', '1980-09-30']}
df = pd.DataFrame(data)

# Engineer an Age feature from Date of Birth
df['Date_of_Birth'] = pd.to_datetime(df['Date_of_Birth'])
df['Age'] = pd.Timestamp.today().year - df['Date_of_Birth'].dt.year

print(df)
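Combining features works the same way. Here is a toy sketch that derives a made-up Salary_per_Age ratio from two existing columns (the feature itself is hypothetical, purely for illustration):

import pandas as pd

data = {'Age': [25, 30, 45], 'Salary': [50000, 60000, 90000]}
df = pd.DataFrame(data)

# New feature combining two existing ones (hypothetical ratio)
df['Salary_per_Age'] = df['Salary'] / df['Age']

print(df)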

4. Splitting Data into Training and Testing Sets

To evaluate how well a machine learning model performs, we split the data into:

  • Training Set: Used to train the model.
  • Testing Set: Used to evaluate the model's accuracy on unseen data.

How to Split Data?

  • Common practice: 80% for training, 20% for testing.
  • If you have limited data, use cross-validation to make the best use of it (see the sketch after the example below).

Example (Using Python):

from sklearn.model_selection import train_test_split
import pandas as pd

data = {'Age': [18, 25, 30, 45, 60],
        'Salary': [10000, 20000, 50000, 80000, 120000],
        'Purchased': [0, 1, 0, 1, 1]}
df = pd.DataFrame(data)

X = df[['Age', 'Salary']]
y = df['Purchased']

# Split into training (80%) and testing (20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training Data:\n", X_train)
print("Testing Data:\n", X_test)
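Cross-validation, mentioned above, splits the data into several folds so every sample gets used for both training and validation. A minimal sketch using scikit-learn's cross_val_score on the built-in Iris dataset (used here only as a stand-in for your own data):

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: train on 4 folds, validate on the 5th, repeat
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())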

Best Practices for Data Preprocessing

  1. Visualize the Data: Plot data before and after preprocessing to see changes.
  2. Start with Simple Methods: Use simple imputation (like mean or median) first.
  3. Avoid Data Leakage: Fit preprocessing steps like scalers on the training data only, then apply them to the test data (see the sketch after this list).
  4. Check Feature Importance: Only keep relevant features for better accuracy.
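To make the data-leakage point concrete, here is a minimal sketch: the scaler learns its statistics from the training split only, and the test split is transformed with those same statistics (the toy arrays are invented for illustration):

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import numpy as np

# Toy data: [Age, Salary]
X = np.array([[18, 10000], [25, 20000], [30, 50000], [45, 80000], [60, 120000]])
y = np.array([0, 1, 0, 1, 1])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training data only...
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# ...then apply the same transformation to the test data
X_test_scaled = scaler.transform(X_test)

print("Scaled training data:\n", X_train_scaled)
print("Scaled test data:\n", X_test_scaled)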

Conclusion

Data preprocessing is one of the most important steps in machine learning. Without clean, transformed, and well-structured data, even the best machine learning models can fail. In this blog, we discussed key techniques like handling missing data, scaling, encoding, feature selection, and splitting datasets.

If you're new to ML, mastering these techniques will give you a huge advantage. As a next step, try to apply these concepts to your own datasets.

Up Next: In the next blog, we’ll explore Machine Learning Algorithms Explained — where we’ll simplify complex concepts like Linear Regression, Decision Trees, and Random Forests.


📘 1. Beginner’s Guide to Machine Learning: Simplified and Explained

Start your ML journey with this easy-to-understand guide covering the basics, types of ML, and how models learn from data.

🔗 Read it here


📘 2. Top 10 Essential Libraries for Training Machine Learning Models

Learn about the top libraries every ML practitioner must know, from model building to data handling and visualization, with practical examples.

🔗 Explore it here

Stay tuned, and as always, “The future of tech is in your hands—keep building, keep dreaming!” 
