
Data Preprocessing Techniques Every ML Beginner Should Know (Step 1)

Welcome Tech Geeks!

Introduction

In machine learning, raw data is often messy, incomplete, and unorganized. Before you feed this data to a machine learning model, you need to prepare it. This process is called data preprocessing. It ensures the data is clean, consistent, and ready to be understood by the model.

Think of it like preparing ingredients before cooking. If your vegetables aren't cleaned or cut properly, you can't cook a great meal. Similarly, without preprocessing, machine learning models won't deliver accurate predictions.

In this blog, we will explore the key data preprocessing techniques every ML beginner must know. By the end, you'll have a solid understanding of how to clean, transform, and prepare data to boost the performance of machine learning models.


What is Data Preprocessing?

Data preprocessing is the process of converting raw data into a clean, structured, and usable format. It removes errors, fills in missing information, and prepares the data so that the machine learning model can understand it.

Why is it important?

  1. Better Accuracy: Clean data leads to more accurate models.
  2. Faster Training: Well-structured data speeds up the training process.
  3. Avoids Errors: Unprocessed data can cause errors in predictions.

Key Data Preprocessing Techniques

1. Handling Missing Values

Sometimes, data may be incomplete, with missing information in certain rows or columns. For example, a table of customer data may have missing ages or missing email addresses. If we don't handle missing data, it can affect the performance of the machine learning model.

How to Handle Missing Data?

  • Remove Missing Data: If only a few data points are missing, you can simply remove those rows or columns.
  • Imputation: If a lot of data is missing, deleting it might not be a good idea. Instead, we can fill in the missing values using:
    • Mean: Replace missing values with the average of the column.
    • Median: Use the median value of the column.
    • Mode: Use the most common value from the column.
    • Advanced Methods: Use K-Nearest Neighbors (KNN) or machine learning models to predict and fill missing values (a KNN sketch follows the example below).

Example (Using Python):

import pandas as pd

# Sample data with missing values
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, None, 30, 28],
        'Salary': [50000, 60000, None, 52000]}
df = pd.DataFrame(data)

# Fill missing Age with the median value
df['Age'] = df['Age'].fillna(df['Age'].median())

# Fill missing Salary with the mean value
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())

print(df)
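The advanced option mentioned above, KNN imputation, fills each gap using the most similar rows instead of a single column statistic. Here is a minimal sketch using scikit-learn's KNNImputer (the n_neighbors value is chosen only for illustration):

from sklearn.impute import KNNImputer
import pandas as pd

data = {'Age': [25, None, 30, 28],
        'Salary': [50000, 60000, None, 52000]}
df = pd.DataFrame(data)

# Replace each missing value with the average of the 2 most similar rows
imputer = KNNImputer(n_neighbors=2)
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(df_imputed)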

2. Data Transformation

Data transformation involves converting raw data into a format that is easier for the machine learning model to understand.

Key Techniques:

  • Feature Scaling: The range of values can vary widely across columns (like Age: 18-80, Salary: 10,000-100,000). Features on larger scales can dominate distance calculations and slow down training, so we rescale them.
    • Min-Max Scaling: Rescales each value to the range 0 to 1 using (x - min) / (max - min).
    • Standardization (Z-score Normalization): Rescales values to have a mean of 0 and a standard deviation of 1 using (x - mean) / standard deviation.

Example (Using Python):

from sklearn.preprocessing import MinMaxScaler, StandardScaler
import pandas as pd

data = {'Age': [18, 25, 30, 45, 60],
        'Salary': [10000, 20000, 50000, 80000, 120000]}
df = pd.DataFrame(data)

# Min-Max Scaling: rescales each column to [0, 1]
scaler = MinMaxScaler()
df_scaled = scaler.fit_transform(df)

# Standardization: mean 0, standard deviation 1
scaler_std = StandardScaler()
df_standardized = scaler_std.fit_transform(df)

print("Min-Max Scaled Data:\n", df_scaled)
print("Standardized Data:\n", df_standardized)
  • Encoding Categorical Data:
    If your data has text categories like 'Male', 'Female', 'Yes', 'No', you'll need to convert them into numbers so the model can understand them.
    • Label Encoding: Converts each category into a number (scikit-learn's LabelEncoder assigns them alphabetically, e.g., Female=0, Male=1).
    • One-Hot Encoding: Creates a separate column for each category and marks 1 or 0 for its presence.

Example (Using Python):

from sklearn.preprocessing import LabelEncoder
import pandas as pd

data = {'Gender': ['Male', 'Female', 'Female', 'Male'],
        'Country': ['India', 'USA', 'India', 'Germany']}
df = pd.DataFrame(data)

# Label Encoding: map each category to an integer
le = LabelEncoder()
df['Gender'] = le.fit_transform(df['Gender'])

# One-Hot Encoding: one binary column per country
df_encoded = pd.get_dummies(df, columns=['Country'])

print(df)
print(df_encoded)
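pd.get_dummies is convenient for quick experiments, but in a modeling pipeline you may prefer scikit-learn's OneHotEncoder, which remembers the categories it learned during fitting and can apply them consistently to new data. A minimal sketch, assuming scikit-learn 1.2+ (where the dense-output flag is named sparse_output):

from sklearn.preprocessing import OneHotEncoder
import pandas as pd

data = {'Country': ['India', 'USA', 'India', 'Germany']}
df = pd.DataFrame(data)

# Fit the encoder and get one binary column per country
encoder = OneHotEncoder(sparse_output=False)
encoded = encoder.fit_transform(df[['Country']])
df_encoded = pd.DataFrame(encoded, columns=encoder.get_feature_names_out(['Country']))

print(df_encoded)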

3. Feature Selection and Feature Engineering

Feature selection identifies which data points (or features) have the most influence on model performance, while feature engineering creates new, more informative features from the raw data.

Feature Selection:

  • Remove features that have little or no effect on the prediction.
  • Techniques include correlation analysis and feature importance from models like Random Forest, as sketched below.
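One common approach is to train a quick Random Forest and inspect which features it relies on. A minimal sketch on scikit-learn's built-in Iris dataset (chosen here only as a stand-in for your own data):

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

# Toy dataset with four numeric features
X, y = load_iris(return_X_y=True, as_frame=True)

model = RandomForestClassifier(random_state=42)
model.fit(X, y)

# Rank features from most to least important
for name, score in sorted(zip(X.columns, model.feature_importances_),
                          key=lambda pair: -pair[1]):
    print(f"{name}: {score:.3f}")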

Feature Engineering:

  • Create new features from existing ones. For example, instead of using “Date of Birth,” create an “Age” column.
  • Combine features to create new useful features (a small sketch follows the example below).

Example (Using Python):

import pandas as pd

data = {'Date_of_Birth': ['2000-01-01', '1995-06-15', '1980-09-30']}
df = pd.DataFrame(data)

# Engineer an Age feature from Date of Birth
df['Date_of_Birth'] = pd.to_datetime(df['Date_of_Birth'])
df['Age'] = pd.Timestamp.today().year - df['Date_of_Birth'].dt.year

print(df)
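Combining features works the same way. Here is a toy sketch that derives a made-up Salary_per_Age ratio from two existing columns (the feature itself is hypothetical, purely for illustration):

import pandas as pd

data = {'Age': [25, 30, 45], 'Salary': [50000, 60000, 90000]}
df = pd.DataFrame(data)

# New feature combining two existing ones (hypothetical ratio)
df['Salary_per_Age'] = df['Salary'] / df['Age']

print(df)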

4. Splitting Data into Training and Testing Sets

To evaluate how well a machine learning model performs, we split the data into:

  • Training Set: Used to train the model.
  • Testing Set: Used to evaluate the model's accuracy on unseen data.

How to Split Data?

  • Common practice: 80% for training, 20% for testing.
  • If you have limited data, use cross-validation to make the best use of it (see the sketch after the example below).

Example (Using Python):

from sklearn.model_selection import train_test_split
import pandas as pd

data = {'Age': [18, 25, 30, 45, 60],
        'Salary': [10000, 20000, 50000, 80000, 120000],
        'Purchased': [0, 1, 0, 1, 1]}
df = pd.DataFrame(data)

X = df[['Age', 'Salary']]
y = df['Purchased']

# Split into training (80%) and testing (20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training Data:\n", X_train)
print("Testing Data:\n", X_test)
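Cross-validation, mentioned above, splits the data into several folds so every sample gets used for both training and validation. A minimal sketch using scikit-learn's cross_val_score on the built-in Iris dataset (used here only as a stand-in for your own data):

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: train on 4 folds, validate on the 5th, repeat
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())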

Best Practices for Data Preprocessing

  1. Visualize the Data: Plot data before and after preprocessing to see changes.
  2. Start with Simple Methods: Use simple imputation (like mean or median) first.
  3. Avoid Data Leakage: Fit preprocessing steps like scalers on the training data only, then apply them to the test data (see the sketch after this list).
  4. Check Feature Importance: Only keep relevant features for better accuracy.
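To make the data-leakage point concrete, here is a minimal sketch: the scaler learns its statistics from the training split only, and the test split is transformed with those same statistics (the toy arrays are invented for illustration):

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import numpy as np

# Toy data: [Age, Salary]
X = np.array([[18, 10000], [25, 20000], [30, 50000], [45, 80000], [60, 120000]])
y = np.array([0, 1, 0, 1, 1])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training data only...
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# ...then apply the same transformation to the test data
X_test_scaled = scaler.transform(X_test)

print("Scaled training data:\n", X_train_scaled)
print("Scaled test data:\n", X_test_scaled)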

Conclusion

Data preprocessing is one of the most important steps in machine learning. Without clean, transformed, and well-structured data, even the best machine learning models can fail. In this blog, we discussed key techniques like handling missing data, scaling, encoding, feature selection, and splitting datasets.

If you're new to ML, mastering these techniques will give you a huge advantage. As a next step, try to apply these concepts to your own datasets.

Up Next: In the next blog, we’ll explore Machine Learning Algorithms Explained — where we’ll simplify complex concepts like Linear Regression, Decision Trees, and Random Forests.


📘 1. Beginner’s Guide to Machine Learning: Simplified and Explained

Start your ML journey with this easy-to-understand guide covering the basics, types of ML, and how models learn from data.

🔗 Read it here


📘 2. Top 10 Essential Libraries for Training Machine Learning Models

Learn about the top libraries every ML practitioner must know, from model building to data handling and visualization, with practical examples.

🔗 Explore it here

Stay tuned, and as always, “The future of tech is in your hands—keep building, keep dreaming!” 
