Data Cleaning Crash Course

# Load the dataset
data(mtcars)

# Simulate messy data (just for demonstration purposes)
mtcars$mpg[c(5, 10, 15)] <- NA  # Introduce missing values
mtcars$cyl[c(3, 8)] <- NA       # More missing values in categorical column
mtcars <- rbind(mtcars, mtcars[1, ])  # Introduce a duplicate row

# View the first few rows of the messy data
head(mtcars)
A data.frame: 6 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 NA 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout NA 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1

3. Installing and Loading Necessary Packages

We’ll use the dplyr package, part of the tidyverse, which provides a set of tools for data manipulation and cleaning.

# Install dplyr if you don't already have it
# install.packages("dplyr")

# Load the package (required for the pipe %>% and the dplyr verbs used below)
library(dplyr)

4. Checking for Missing Values

The first step in data cleaning is to check for missing values in your dataset.

# Check for missing values in the dataset
colSums(is.na(mtcars))
 mpg  cyl disp   hp drat   wt qsec   vs   am gear carb 
   3    2    0    0    0    0    0    0    0    0    0 

Explanation:
  • is.na(): Identifies missing values.
  • colSums(): Sums the number of missing values per column.

5. Handling Missing Data

There are several ways to handle missing data, including removing rows with missing values, filling them in with a specific value (such as the mean or median), or using more advanced imputation techniques.

Removing Rows with Missing Values

# Remove rows with any missing values
mtcars_cleaned <- na.omit(mtcars)

# View cleaned dataset
head(mtcars_cleaned)
A data.frame: 6 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2

Filling Missing Values with Mean or Median

You can fill missing numeric values with the column mean or median.

# Fill missing values in the 'mpg' column with the mean
mtcars$mpg[is.na(mtcars$mpg)] <- mean(mtcars$mpg, na.rm = TRUE)

# Fill missing values in the 'cyl' column with the mode
mtcars$cyl[is.na(mtcars$cyl)] <- as.numeric(names(sort(table(mtcars$cyl), decreasing = TRUE)[1]))

# Verify that missing values are filled
colSums(is.na(mtcars))
 mpg  cyl disp   hp drat   wt qsec   vs   am gear carb 
   0    0    0    0    0    0    0    0    0    0    0 

Explanation:
  • mean(mtcars$mpg, na.rm = TRUE): Calculates the mean while ignoring NA values.
  • table(): Creates a frequency table (used here to find the mode).
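
If a numeric column is skewed, the median is often a safer fill value than the mean. Here is a minimal sketch of a median fill, shown on mpg purely for illustration (in this example its missing values were already filled with the mean above, so this line is a no-op):

# Fill missing values in a numeric column with the median
mtcars$mpg[is.na(mtcars$mpg)] <- median(mtcars$mpg, na.rm = TRUE)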

6. Dealing with Duplicates

Duplicate rows can distort your analysis, so it’s important to remove them.

# Check for duplicate rows
sum(duplicated(mtcars))
1

# Remove duplicate rows
mtcars <- mtcars[!duplicated(mtcars), ]

# Verify removal
sum(duplicated(mtcars))
0

Explanation:
  • duplicated(): Returns a logical vector indicating which rows are duplicates of earlier rows.
  • !duplicated(): Filters out the duplicated rows, keeping the first occurrence of each.
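
Since we are already using dplyr, distinct() offers an equivalent one-liner. This is just an alternative sketch producing the same result as the base-R approach above (note that dplyr may not preserve row names, so the base-R version is preferable if you need to keep the car names):

# Alternative: remove duplicate rows with dplyr's distinct()
mtcars_distinct <- mtcars %>%
  distinct()

# Verify removal
sum(duplicated(mtcars_distinct))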

7. Data Type Conversion

Sometimes, columns have incorrect data types, such as numeric columns being stored as characters. You need to convert them to the correct type.

Checking Data Types

# Check the data types of all columns
str(mtcars)
'data.frame':   32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 20.5 ...
 $ cyl : num  6 6 8 6 8 6 8 8 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Converting Data Types

Convert the cyl column from numeric to factor since it represents categorical data.

# Convert 'cyl' to a factor
mtcars$cyl <- as.factor(mtcars$cyl)

# Verify the conversion
str(mtcars)
'data.frame':   32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 20.5 ...
 $ cyl : Factor w/ 3 levels "4","6","8": 2 2 3 2 3 2 3 3 1 2 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
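
As noted at the start of this section, numeric values are sometimes stored as characters. Here is a minimal sketch of that conversion; the disp_chr vector is hypothetical and created only for illustration:

# Hypothetical example: suppose 'disp' had been read in as character
disp_chr <- as.character(mtcars$disp)
class(disp_chr)   # "character"

# Convert back to numeric
disp_num <- as.numeric(disp_chr)
class(disp_num)   # "numeric"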

8. Renaming Columns

Renaming columns is often necessary to make your dataset easier to work with.

# Rename columns using dplyr
mtcars <- mtcars %>%
  rename(Miles_Per_Gallon = mpg, Cylinders = cyl)

# View the renamed dataset
head(mtcars)

Explanation:
  • rename(): Renames columns, using the syntax new_name = old_name.

9. Filtering and Subsetting Data

Filtering allows you to work with a subset of the data based on certain conditions.

Filtering Rows

# Filter rows where MPG is greater than 20
mtcars_filtered <- mtcars %>%
  filter(Miles_Per_Gallon > 20)

# View the filtered data
head(mtcars_filtered)


Selecting Specific Columns

# Select only the 'Miles_Per_Gallon' and 'Cylinders' columns
mtcars_subset <- mtcars %>%
  select(Miles_Per_Gallon, Cylinders)

# View the subsetted data
head(mtcars_subset)

Explanation:
  • filter(): Filters rows based on a condition.
  • select(): Selects specific columns from the dataset.
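
Because dplyr verbs chain naturally with the pipe, you can also combine both steps in a single pass. A small sketch (the 20-MPG threshold is purely illustrative):

# Filter and select in one pipeline
mtcars_small <- mtcars %>%
  filter(Miles_Per_Gallon > 20) %>%
  select(Miles_Per_Gallon, Cylinders)

# View the result
head(mtcars_small)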

10. Dealing with Outliers

Outliers can skew your analysis, so it’s important to detect and handle them. Here’s how to detect outliers using the Interquartile Range (IQR) method.

Identifying Outliers

# Calculate the IQR for the 'Miles_Per_Gallon' column
Q1 <- quantile(mtcars$Miles_Per_Gallon, 0.25)
Q3 <- quantile(mtcars$Miles_Per_Gallon, 0.75)
IQR <- Q3 - Q1

# Define outlier bounds
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR

# Identify outliers
outliers <- mtcars %>%
  filter(Miles_Per_Gallon < lower_bound | Miles_Per_Gallon > upper_bound)

# View the outliers
outliers

Removing Outliers

# Remove outliers
mtcars_no_outliers <- mtcars %>%
  filter(Miles_Per_Gallon >= lower_bound & Miles_Per_Gallon <= upper_bound)

# Verify removal of outliers
summary(mtcars_no_outliers$Miles_Per_Gallon)

Explanation:
  • quantile(): Calculates quantiles.
  • The IQR (interquartile range, Q3 - Q1) is used to define the lower and upper bounds for outliers.

11. Transforming Variables

Data transformation can help make your data more suitable for analysis. Common transformations include creating new variables, normalizing, or scaling data.

Creating New Variables

# Create a new variable 'wt_kg' by converting 'wt' from 1000 lbs to kg
mtcars <- mtcars %>%
  mutate(wt_kg = wt * 453.592)

# View the new variable
head(mtcars)

Scaling and Normalizing Data

# Normalize 'Miles_Per_Gallon' (min-max scaling)
mtcars <- mtcars %>%
  mutate(Miles_Per_Gallon_scaled = (Miles_Per_Gallon - min(Miles_Per_Gallon)) / (max(Miles_Per_Gallon) - min(Miles_Per_Gallon)))

# View the scaled column
head(mtcars$Miles_Per_Gallon_scaled)

Explanation:
  • mutate(): Adds new variables or modifies existing ones.
  • Scaling helps bring all values into a comparable range, which is especially useful for machine learning tasks.
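
Besides min-max scaling, another common option is z-score standardization (subtract the mean, divide by the standard deviation). A minimal sketch using base R's scale(); the new column name is illustrative:

# Standardize 'Miles_Per_Gallon' (z-score: mean 0, standard deviation 1)
mtcars <- mtcars %>%
  mutate(Miles_Per_Gallon_z = as.numeric(scale(Miles_Per_Gallon)))

# View the standardized column
head(mtcars$Miles_Per_Gallon_z)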

12. Conclusion

In this tutorial, we have covered the essential steps in data cleaning using R. Here’s a quick recap:

  • Checking and Handling Missing Data: Use functions like na.omit() or fill missing values with mean, median, or mode.
  • Dealing with Duplicates: Identify and remove duplicate rows.
  • Data Type Conversion: Convert columns to their appropriate data types.
  • Renaming Columns: Rename columns for better clarity.
  • Filtering and Subsetting Data: Filter rows and select specific columns based on your needs.
  • Handling Outliers: Identify and remove outliers using the IQR method.
  • Transforming Data: Create new variables, scale, or normalize existing data.

These techniques form the foundation of effective data cleaning.

Once your data is cleaned, it’s ready for analysis, visualization, or modeling.
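
To tie the steps together, here is a minimal end-to-end sketch on a fresh copy of mtcars; the imputation choice, renamed columns, and derived variable mirror the examples above and are illustrative rather than prescriptive:

# Start from a fresh copy of the dataset
data(mtcars)

mtcars_clean <- mtcars %>%
  distinct() %>%                                                       # drop duplicate rows
  mutate(mpg = ifelse(is.na(mpg), mean(mpg, na.rm = TRUE), mpg)) %>%   # impute missing mpg with the mean
  mutate(cyl = as.factor(cyl)) %>%                                     # convert to the correct type
  rename(Miles_Per_Gallon = mpg, Cylinders = cyl) %>%                  # clearer column names
  mutate(wt_kg = wt * 453.592)                                         # derived variable

head(mtcars_clean)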

What’s on your mind? Put it in the comments!