```r
# Load the dataset
data(mtcars)

# Simulate messy data (just for demonstration purposes)
mtcars$mpg[c(5, 10, 15)] <- NA        # Introduce missing values
mtcars$cyl[c(3, 8)] <- NA             # More missing values in a categorical column
mtcars <- rbind(mtcars, mtcars[1, ])  # Introduce a duplicate row

# View the first few rows of the messy data
head(mtcars)
```
A data.frame: 6 × 11

|                   |  mpg | cyl | disp |  hp | drat |    wt |  qsec | vs | am | gear | carb |
|-------------------|------|-----|------|-----|------|-------|-------|----|----|------|------|
| Mazda RX4         | 21.0 |   6 |  160 | 110 | 3.90 | 2.620 | 16.46 |  0 |  1 |    4 |    4 |
| Mazda RX4 Wag     | 21.0 |   6 |  160 | 110 | 3.90 | 2.875 | 17.02 |  0 |  1 |    4 |    4 |
| Datsun 710        | 22.8 |  NA |  108 |  93 | 3.85 | 2.320 | 18.61 |  1 |  1 |    4 |    1 |
| Hornet 4 Drive    | 21.4 |   6 |  258 | 110 | 3.08 | 3.215 | 19.44 |  1 |  0 |    3 |    1 |
| Hornet Sportabout |   NA |   8 |  360 | 175 | 3.15 | 3.440 | 17.02 |  0 |  0 |    3 |    2 |
| Valiant           | 18.1 |   6 |  225 | 105 | 2.76 | 3.460 | 20.22 |  1 |  0 |    3 |    1 |
## 3. Installing and Loading Necessary Packages
We’ll use the dplyr package, part of the tidyverse, which provides a set of tools for data manipulation and cleaning.
```r
# Install and load the dplyr package
# install.packages("dplyr")  # Uncomment if dplyr is not yet installed
library(dplyr)
```
## 4. Checking for Missing Values
The first step in data cleaning is to check for missing values in your dataset.
```r
# Check for missing values in the dataset
colSums(is.na(mtcars))
```
| mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|-----|-----|------|----|------|----|------|----|----|------|------|
|   3 |   2 |    0 |  0 |    0 |  0 |    0 |  0 |  0 |    0 |    0 |
Explanation:

- `is.na()`: identifies missing values.
- `colSums()`: sums the number of missing values per column.
## 5. Handling Missing Data
There are several ways to handle missing data, including removing rows with missing values, filling them in with a specific value (such as the mean or median), or using more advanced imputation techniques.
### Removing Rows with Missing Values
```r
# Remove rows with any missing values
mtcars_cleaned <- na.omit(mtcars)

# View the cleaned dataset
head(mtcars_cleaned)
```
A data.frame: 6 × 11

|                |  mpg | cyl |  disp |  hp | drat |    wt |  qsec | vs | am | gear | carb |
|----------------|------|-----|-------|-----|------|-------|-------|----|----|------|------|
| Mazda RX4      | 21.0 |   6 | 160.0 | 110 | 3.90 | 2.620 | 16.46 |  0 |  1 |    4 |    4 |
| Mazda RX4 Wag  | 21.0 |   6 | 160.0 | 110 | 3.90 | 2.875 | 17.02 |  0 |  1 |    4 |    4 |
| Hornet 4 Drive | 21.4 |   6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 |  1 |  0 |    3 |    1 |
| Valiant        | 18.1 |   6 | 225.0 | 105 | 2.76 | 3.460 | 20.22 |  1 |  0 |    3 |    1 |
| Duster 360     | 14.3 |   8 | 360.0 | 245 | 3.21 | 3.570 | 15.84 |  0 |  0 |    3 |    4 |
| Merc 230       | 22.8 |   4 | 140.8 |  95 | 3.92 | 3.150 | 22.90 |  1 |  0 |    4 |    2 |
### Filling Missing Values with the Mean, Median, or Mode
You can fill missing numeric values with the column mean or median, and missing categorical values with the most frequent value (the mode).
```r
# Fill missing values in the 'mpg' column with the mean
mtcars$mpg[is.na(mtcars$mpg)] <- mean(mtcars$mpg, na.rm = TRUE)

# Fill missing values in the 'cyl' column with the mode (most frequent value)
mtcars$cyl[is.na(mtcars$cyl)] <- as.numeric(names(sort(table(mtcars$cyl), decreasing = TRUE)[1]))

# Verify that missing values are filled
colSums(is.na(mtcars))
```
| mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|-----|-----|------|----|------|----|------|----|----|------|------|
|   0 |   0 |    0 |  0 |    0 |  0 |    0 |  0 |  0 |    0 |    0 |
Explanation:

- `mean(mtcars$mpg, na.rm = TRUE)`: calculates the mean while ignoring `NA` values.
- `table()`: creates a frequency table (used here to find the mode).
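The subsection heading also mentions the median; swapping it in for the mean is a one-line change, sketched here purely for illustration:

```r
# Alternative: fill missing 'mpg' values with the median instead of the mean
mtcars$mpg[is.na(mtcars$mpg)] <- median(mtcars$mpg, na.rm = TRUE)
```

The median is often preferred when a column contains extreme values, since it is less affected by outliers than the mean.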
## 6. Dealing with Duplicates
Duplicate rows can distort your analysis, so it’s important to remove them.
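The original code cell for this step is not reproduced here; a minimal sketch using base R's `duplicated()` and dplyr's `distinct()` might look like this:

```r
# Count duplicate rows
sum(duplicated(mtcars))

# Remove duplicate rows, keeping the first occurrence
mtcars <- mtcars %>% distinct()

# Confirm that no duplicates remain
sum(duplicated(mtcars))
```

By default `distinct()` compares entire rows; pass specific column names if duplicates should be judged on a subset of variables.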
## 8. Renaming Columns
Renaming columns is often necessary to make your dataset easier to work with.
```r
# Rename columns using dplyr
mtcars <- mtcars %>%
  rename(Miles_Per_Gallon = mpg, Cylinders = cyl)

# View the renamed dataset
head(mtcars)
```

Explanation:

- `rename()`: renames columns, using the syntax `new_name = old_name`.

## 9. Filtering and Subsetting Data
Filtering allows you to work with a subset of the data based on certain conditions.

### Filtering Rows

```r
# Filter rows where MPG is greater than 20
mtcars_filtered <- mtcars %>%
  filter(Miles_Per_Gallon > 20)

# View the filtered data
head(mtcars_filtered)
```
### Selecting Specific Columns
```r
# Select only the 'Miles_Per_Gallon' and 'Cylinders' columns
mtcars_subset <- mtcars %>%
  select(Miles_Per_Gallon, Cylinders)

# View the subsetted data
head(mtcars_subset)
```
Explanation:

- `filter()`: filters rows based on a condition.
- `select()`: selects specific columns from the dataset.
## 10. Dealing with Outliers
Outliers can skew your analysis, so it’s important to detect and handle them. Here’s how to detect outliers using the Interquartile Range (IQR) method.
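The accompanying code cell is not preserved in this copy; a minimal sketch of the IQR approach applied to `Miles_Per_Gallon` (assuming the renamed dataset from Section 8) might look like this:

```r
# Compute quartiles and the interquartile range for Miles_Per_Gallon
Q1 <- quantile(mtcars$Miles_Per_Gallon, 0.25)
Q3 <- quantile(mtcars$Miles_Per_Gallon, 0.75)
IQR_value <- Q3 - Q1

# Define lower and upper bounds using the 1.5 * IQR rule
lower_bound <- Q1 - 1.5 * IQR_value
upper_bound <- Q3 + 1.5 * IQR_value

# Keep only rows that fall within the bounds
mtcars_no_outliers <- mtcars %>%
  filter(Miles_Per_Gallon >= lower_bound, Miles_Per_Gallon <= upper_bound)

# View the result
head(mtcars_no_outliers)
```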
Explanation:

- `quantile()`: calculates quantiles.
- The IQR (Interquartile Range) is used to define lower and upper bounds for outliers.
## 11. Transforming Variables
Data transformation can help make your data more suitable for analysis. Common transformations include creating new variables, normalizing, or scaling data.
### Creating New Variables
```r
# Create a new variable 'wt_kg' by converting 'wt' from 1000 lbs to kg
mtcars <- mtcars %>%
  mutate(wt_kg = wt * 453.592)

# View the new variable
head(mtcars)
```
Explanation:

- `mutate()`: adds new variables or modifies existing ones.
- Scaling helps bring all values into a comparable range, which is especially useful for machine learning tasks.
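Scaling itself is not shown above; a short sketch using base R's `scale()` (the `hp_scaled` column name is only illustrative) could look like this:

```r
# Standardize 'hp' to mean 0 and standard deviation 1 (column name is illustrative)
mtcars <- mtcars %>%
  mutate(hp_scaled = as.numeric(scale(hp)))

# View the original and scaled columns side by side
head(mtcars[, c("hp", "hp_scaled")])
```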
## 12. Conclusion
In this tutorial, we have covered the essential steps in data cleaning using R. Here’s a quick recap:
- **Checking and Handling Missing Data**: use functions like `na.omit()` or fill missing values with the mean, median, or mode.
- **Dealing with Duplicates**: identify and remove duplicate rows.
- **Data Type Conversion**: convert columns to their appropriate data types (see the sketch after this list).
- **Renaming Columns**: rename columns for better clarity.
- **Filtering and Subsetting Data**: filter rows and select specific columns based on your needs.
- **Handling Outliers**: identify and remove outliers using the IQR method.
- **Transforming Data**: create new variables, scale, or normalize existing data.
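The data type conversion step is mentioned in this recap, but its section is not preserved in this copy; a minimal sketch of what it typically involves (converting categorical columns to factors, with the column choices here only illustrative) might look like this:

```r
# Convert categorical columns to factors (column choices are illustrative)
mtcars <- mtcars %>%
  mutate(
    Cylinders = as.factor(Cylinders),
    vs = as.factor(vs),
    am = as.factor(am)
  )

# Check the updated column types
str(mtcars)
```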
These techniques form the foundation of effective data cleaning.
Once your data is cleaned, it’s ready for analysis, visualization, or modeling.