Data Cleaning Crash Course

# Load the dataset
data(mtcars)

# Simulate messy data (just for demonstration purposes)
mtcars$mpg[c(5, 10, 15)] <- NA  # Introduce missing values
mtcars$cyl[c(3, 8)] <- NA       # More missing values in categorical column
mtcars <- rbind(mtcars, mtcars[1, ])  # Introduce a duplicate row

# View the first few rows of the messy data
head(mtcars)
A data.frame: 6 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 NA 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout NA 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1

3. Installing and Loading Necessary Packages

We’ll use the dplyr package, part of the tidyverse, which provides a set of tools for data manipulation and cleaning.

# Install dplyr if you don't already have it
# install.packages("dplyr")

# Load the package (required for the pipe %>% and the dplyr verbs used below)
library(dplyr)

4. Checking for Missing Values

The first step in data cleaning is to check for missing values in your dataset.

# Check for missing values in the dataset
colSums(is.na(mtcars))
 mpg  cyl disp   hp drat   wt qsec   vs   am gear carb 
   3    2    0    0    0    0    0    0    0    0    0 

Explanation:
  • is.na(): Identifies missing values.
  • colSums(): Sums the number of missing values per column.

5. Handling Missing Data

There are several ways to handle missing data, including removing rows with missing values, filling them in with a specific value (such as the mean or median), or using more advanced imputation techniques.

Removing Rows with Missing Values

# Remove rows with any missing values
mtcars_cleaned <- na.omit(mtcars)

# View cleaned dataset
head(mtcars_cleaned)
A data.frame: 6 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2

Filling Missing Values with Mean or Median

You can fill missing numeric values with the column mean or median.

# Fill missing values in the 'mpg' column with the mean
mtcars$mpg[is.na(mtcars$mpg)] <- mean(mtcars$mpg, na.rm = TRUE)

# Fill missing values in the 'cyl' column with the mode
mtcars$cyl[is.na(mtcars$cyl)] <- as.numeric(names(sort(table(mtcars$cyl), decreasing = TRUE)[1]))

# Verify that missing values are filled
colSums(is.na(mtcars))
 mpg  cyl disp   hp drat   wt qsec   vs   am gear carb 
   0    0    0    0    0    0    0    0    0    0    0 

Explanation:
  • mean(mtcars$mpg, na.rm = TRUE): Calculates the mean while ignoring NA values.
  • table(): Creates a frequency table (used here to find the mode).
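
If a numeric column is skewed, the median is often a safer fill value than the mean. Here is a minimal sketch of a median fill, shown on mpg purely for illustration (in this example its missing values were already filled with the mean above, so this line is a no-op):

# Fill missing values in a numeric column with the median
mtcars$mpg[is.na(mtcars$mpg)] <- median(mtcars$mpg, na.rm = TRUE)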

6. Dealing with Duplicates

Duplicate rows can distort your analysis, so it’s important to remove them.

# Check for duplicate rows
sum(duplicated(mtcars))
1

# Remove duplicate rows
mtcars <- mtcars[!duplicated(mtcars), ]

# Verify removal
sum(duplicated(mtcars))
0

Explanation:
  • duplicated(): Returns a logical vector indicating which rows are duplicates of earlier rows.
  • !duplicated(): Filters out the duplicated rows, keeping the first occurrence of each.
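
Since we are already using dplyr, distinct() offers an equivalent one-liner. This is just an alternative sketch producing the same result as the base-R approach above (note that dplyr may not preserve row names, so the base-R version is preferable if you need to keep the car names):

# Alternative: remove duplicate rows with dplyr's distinct()
mtcars_distinct <- mtcars %>%
  distinct()

# Verify removal
sum(duplicated(mtcars_distinct))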

7. Data Type Conversion

Sometimes, columns have incorrect data types, such as numeric columns being stored as characters. You need to convert them to the correct type.

Checking Data Types

# Check the data types of all columns
str(mtcars)
'data.frame':   32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 20.5 ...
 $ cyl : num  6 6 8 6 8 6 8 8 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Converting Data Types

Convert the cyl column from numeric to factor since it represents categorical data.

# Convert 'cyl' to a factor
mtcars$cyl <- as.factor(mtcars$cyl)

# Verify the conversion
str(mtcars)
'data.frame':   32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 20.5 ...
 $ cyl : Factor w/ 3 levels "4","6","8": 2 2 3 2 3 2 3 3 1 2 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
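
As noted at the start of this section, numeric values are sometimes stored as characters. Here is a minimal sketch of that conversion; the disp_chr vector is hypothetical and created only for illustration:

# Hypothetical example: suppose 'disp' had been read in as character
disp_chr <- as.character(mtcars$disp)
class(disp_chr)   # "character"

# Convert back to numeric
disp_num <- as.numeric(disp_chr)
class(disp_num)   # "numeric"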

8. Renaming Columns

Renaming columns is often necessary to make your dataset easier to work with.

# Rename columns using dplyr
mtcars <- mtcars %>%
  rename(Miles_Per_Gallon = mpg, Cylinders = cyl)

# View the renamed dataset
head(mtcars)

Explanation:
  • rename(): Renames columns, using the syntax new_name = old_name.

9. Filtering and Subsetting Data

Filtering allows you to work with a subset of the data based on certain conditions.

Filtering Rows

# Filter rows where MPG is greater than 20
mtcars_filtered <- mtcars %>%
  filter(Miles_Per_Gallon > 20)

# View the filtered data
head(mtcars_filtered)


Selecting Specific Columns

# Select only the 'Miles_Per_Gallon' and 'Cylinders' columns
mtcars_subset <- mtcars %>%
  select(Miles_Per_Gallon, Cylinders)

# View the subsetted data
head(mtcars_subset)

Explanation:
  • filter(): Filters rows based on a condition.
  • select(): Selects specific columns from the dataset.
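
Because dplyr verbs chain naturally with the pipe, you can also combine both steps in a single pass. A small sketch (the 20-MPG threshold is purely illustrative):

# Filter and select in one pipeline
mtcars_small <- mtcars %>%
  filter(Miles_Per_Gallon > 20) %>%
  select(Miles_Per_Gallon, Cylinders)

# View the result
head(mtcars_small)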

10. Dealing with Outliers

Outliers can skew your analysis, so it’s important to detect and handle them. Here’s how to detect outliers using the Interquartile Range (IQR) method.

Identifying Outliers

# Calculate the IQR for the 'Miles_Per_Gallon' column
Q1 <- quantile(mtcars$Miles_Per_Gallon, 0.25)
Q3 <- quantile(mtcars$Miles_Per_Gallon, 0.75)
IQR <- Q3 - Q1

# Define outlier bounds
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR

# Identify outliers
outliers <- mtcars %>%
  filter(Miles_Per_Gallon < lower_bound | Miles_Per_Gallon > upper_bound)

# View the outliers
outliers

Removing Outliers

# Remove outliers
mtcars_no_outliers <- mtcars %>%
  filter(Miles_Per_Gallon >= lower_bound & Miles_Per_Gallon <= upper_bound)

# Verify removal of outliers
summary(mtcars_no_outliers$Miles_Per_Gallon)

Explanation:
  • quantile(): Calculates quantiles.
  • The IQR (interquartile range, Q3 - Q1) is used to define the lower and upper bounds for outliers.

11. Transforming Variables

Data transformation can help make your data more suitable for analysis. Common transformations include creating new variables, normalizing, or scaling data.

Creating New Variables

# Create a new variable 'wt_kg' by converting 'wt' from 1000 lbs to kg
mtcars <- mtcars %>%
  mutate(wt_kg = wt * 453.592)

# View the new variable
head(mtcars)

Scaling and Normalizing Data

# Normalize 'Miles_Per_Gallon' (min-max scaling)
mtcars <- mtcars %>%
  mutate(Miles_Per_Gallon_scaled = (Miles_Per_Gallon - min(Miles_Per_Gallon)) / (max(Miles_Per_Gallon) - min(Miles_Per_Gallon)))

# View the scaled column
head(mtcars$Miles_Per_Gallon_scaled)

Explanation:
  • mutate(): Adds new variables or modifies existing ones.
  • Scaling helps bring all values into a comparable range, which is especially useful for machine learning tasks.
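
Besides min-max scaling, another common option is z-score standardization (subtract the mean, divide by the standard deviation). A minimal sketch using base R's scale(); the new column name is illustrative:

# Standardize 'Miles_Per_Gallon' (z-score: mean 0, standard deviation 1)
mtcars <- mtcars %>%
  mutate(Miles_Per_Gallon_z = as.numeric(scale(Miles_Per_Gallon)))

# View the standardized column
head(mtcars$Miles_Per_Gallon_z)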

12. Conclusion

In this tutorial, we have covered the essential steps in data cleaning using R. Here’s a quick recap:

  • Checking and Handling Missing Data: Use functions like na.omit() or fill missing values with mean, median, or mode.
  • Dealing with Duplicates: Identify and remove duplicate rows.
  • Data Type Conversion: Convert columns to their appropriate data types.
  • Renaming Columns: Rename columns for better clarity.
  • Filtering and Subsetting Data: Filter rows and select specific columns based on your needs.
  • Handling Outliers: Identify and remove outliers using the IQR method.
  • Transforming Data: Create new variables, scale, or normalize existing data.

These techniques form the foundation of effective data cleaning.

Once your data is cleaned, it’s ready for analysis, visualization, or modeling.
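
To tie the steps together, here is a minimal end-to-end sketch on a fresh copy of mtcars; the imputation choice, renamed columns, and derived variable mirror the examples above and are illustrative rather than prescriptive:

# Start from a fresh copy of the dataset
data(mtcars)

mtcars_clean <- mtcars %>%
  distinct() %>%                                                       # drop duplicate rows
  mutate(mpg = ifelse(is.na(mpg), mean(mpg, na.rm = TRUE), mpg)) %>%   # impute missing mpg with the mean
  mutate(cyl = as.factor(cyl)) %>%                                     # convert to the correct type
  rename(Miles_Per_Gallon = mpg, Cylinders = cyl) %>%                  # clearer column names
  mutate(wt_kg = wt * 453.592)                                         # derived variable

head(mtcars_clean)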

What’s on your mind? Put it in the comments!