R

Why would anyone learn R? Well, choosing between R and Python for data science depends on several factors, including the specific requirements of the task, personal preference, and existing infrastructure. Here are some scenarios where someone might choose R over Python:

When to Use R for Data Science
  1. Statistical Analysis and Data Visualization:
    • R has a rich ecosystem of packages specifically designed for statistical analysis, making it ideal for tasks such as hypothesis testing, linear modeling, and time series analysis.
    • The ggplot2 package in R is renowned for its expressive and powerful plotting capabilities, which are often preferred by statisticians.
  2. Academic and Research Purposes:
    • R is widely used in academic and research environments, especially in disciplines like statistics, bioinformatics, and social sciences.
    • Many academic papers and research publications provide R code for reproducibility, facilitating collaboration and peer review.
  3. Data Manipulation and Cleaning:
    • R provides powerful tools like the dplyr and tidyr packages for data manipulation and cleaning, allowing users to efficiently reshape, filter, and summarize datasets.
  4. Specialized Statistical Tests:
    • R has extensive support for specialized statistical tests and procedures, including advanced regression models, survival analysis, and Bayesian statistics.
    • Researchers and statisticians often find R’s comprehensive statistical libraries beneficial for performing complex analyses.
  5. Integrated Development Environment (IDE):
    • RStudio, the predominant IDE for R, offers a dedicated environment with built-in tools for data visualization, package management, and interactive analysis.
  6. Community and Package Ecosystem:
    • The R community is highly focused on statistical computing and data analysis, with a vast repository of specialized packages and libraries catering to diverse analytical needs.
Considerations for Choosing Between R and Python
  • Learning Curve: R’s syntax and approach may be more intuitive for statisticians and researchers familiar with statistical methods.
  • Task-Specific Requirements: Evaluate whether the task is primarily statistical analysis, where R excels, or requires broader integration with other technologies (e.g., web development or production machine learning systems).
  • Team Skills and Preferences: Consider the existing expertise and preferences of the team members who will be working on the project.
  • Interoperability: Python’s versatility and integration capabilities may be advantageous when working within a broader data ecosystem that includes web applications, machine learning models, and data engineering pipelines.

Introduction to R

Objectives
  • Understand what R is and its uses in data science.
  • Set up R and RStudio.
  • Learn basic R syntax and data types.
What is R?

R is a programming language and environment commonly used for statistical computing, data analysis, and graphical representation. Developed by statisticians Ross Ihaka and Robert Gentleman, R has become a vital tool in data science due to its versatility and the extensive library of packages available.

Setting Up R and RStudio
  1. Install R: Download the installer for your operating system from CRAN (https://cran.r-project.org) and run it.
  2. Install RStudio: Download the free RStudio Desktop from Posit (https://posit.co/download/rstudio-desktop/) and run the installer. RStudio requires R to be installed first.

Basic R Syntax and Data Types

R as a Calculator

R can be used to perform basic arithmetic operations.

# Addition
3 + 4

# Subtraction
10 - 2

# Multiplication
5 * 6

# Division
9 / 3

# Exponentiation
2^3

# Modulus (remainder)
10 %% 3
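
R also provides an integer-division operator, %/%, which pairs naturally with the modulus operator above.

# Integer division (quotient without the remainder)
10 %/% 3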
Variables and Assignment

In R, you can store values in variables using the assignment operator <-. (The = operator also works for assignment, but <- is the established R convention.)

# Assigning values to variables
x <- 5
y <- 10

# Performing operations with variables
z <- x + y

# Printing the value of z
print(z)
Data Types

R supports several basic data types.

Numeric: Represents numbers.

num <- 42

Character: Represents text strings.

char <- "Hello, R!"

Logical: Represents TRUE or FALSE values.

log <- TRUE

Vector: A sequence of data elements of the same type.

vec <- c(1, 2, 3, 4, 5)
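
You can check any value's type with the class() function, a handy sanity check when a function rejects its input.

# Inspect the types of the values created above
class(num)   # "numeric"
class(char)  # "character"
class(log)   # "logical"
class(vec)   # "numeric" (a vector of numeric values)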
Basic Functions

R comes with many built-in functions that can perform various tasks.

Mathematical functions:

sqrt(16)  # Square root
abs(-5) # Absolute value

Character functions:

nchar("Hello")    # Number of characters
tolower("HELLO") # Convert to lowercase

Logical functions:

all(c(TRUE, FALSE, TRUE))  # Check if all are TRUE
any(c(TRUE, FALSE, TRUE)) # Check if any are TRUE
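
Comparison operators produce the logical values these functions consume, and they reappear later when filtering data.

5 > 3             # TRUE
5 == 4            # FALSE
c(1, 5, 10) >= 5  # FALSE TRUE TRUE (comparisons are vectorized)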
Working with Data Structures

Vectors: An ordered collection of elements of the same type.

numbers <- c(1, 2, 3, 4, 5)

Matrices: Two-dimensional arrays where elements are of the same type.

matrix_1 <- matrix(1:9, nrow=3, ncol=3)

Data Frames: Table-like structures where each column can contain different types of data.

df <- data.frame(
  name = c("John", "Jane", "Doe"),
  age = c(23, 25, 31),
  height = c(177, 165, 180)
)

Lists: Collections of elements that can contain different types of data.

lst <- list(name="John", age=23, height=177)
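
Elements of these structures are accessed with [ ], [[ ]], and $. A few common patterns, using the objects created above:

numbers[2]         # second element of the vector: 2
matrix_1[1, 3]     # row 1, column 3 of the matrix: 7
df$name            # the 'name' column of the data frame
df[df$age > 24, ]  # rows of the data frame where age > 24
lst$age            # the 'age' element of the list: 23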
Hands-On Exercise

Create a vector of your five favorite numbers.

fav_numbers <- c(7, 13, 21, 34, 42)

Calculate the mean of these numbers.

mean(fav_numbers)

Create a data frame with columns for name, age, and favorite color for three people.

friends_df <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  age = c(25, 30, 35),
  favorite_color = c("blue", "green", "red")
)

Print the data frame.

print(friends_df)

Data Manipulation with dplyr

Objectives
  • Understand the basics of the dplyr package.
  • Learn to perform data manipulation tasks such as filtering, selecting, mutating, summarizing, and grouping data.
Introduction to dplyr

dplyr is a powerful R package for data manipulation, providing a consistent set of verbs that help in performing common data manipulation tasks. It’s part of the tidyverse collection of R packages designed for data science.

To start, install and load the dplyr package.

install.packages("dplyr")
library(dplyr)

Sample Dataset

For this lesson, we’ll use the built-in mtcars dataset.

data(mtcars)
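
Before manipulating a dataset, it helps to take a quick look at its shape and column types.

# Peek at the first rows and the structure
head(mtcars)
str(mtcars)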

Basic Verbs in dplyr

  1. filter(): Subset rows based on conditions.
  2. select(): Select columns by name.
  3. mutate(): Create new columns or modify existing ones.
  4. summarize(): Reduce multiple values to a single summary.
  5. group_by(): Group data for summary operations.
  6. arrange(): Order rows by values of columns.

Filtering Rows with filter()

The filter() function is used to subset rows based on conditions.

# Filter cars with mpg greater than 20
filtered_cars <- filter(mtcars, mpg > 20)
print(filtered_cars)

Selecting Columns with select()

The select() function is used to choose specific columns from a dataset.

# Select the mpg, cyl, and hp columns
selected_columns <- select(mtcars, mpg, cyl, hp)
print(selected_columns)

Creating and Modifying Columns with mutate()

The mutate() function is used to add new columns or modify existing ones.

# Add a new column for weight in kilograms
mtcars_kg <- mutate(mtcars, weight_kg = wt * 453.592)
print(mtcars_kg)

Summarizing Data with summarize()

The summarize() function is used to generate summary statistics.

# Calculate the average mpg
avg_mpg <- summarize(mtcars, avg_mpg = mean(mpg))
print(avg_mpg)

Grouping Data with group_by()

The group_by() function is used to group data before applying summary functions.

# Calculate the average mpg by number of cylinders
avg_mpg_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarize(avg_mpg = mean(mpg))
print(avg_mpg_by_cyl)

Arranging Rows with arrange()

The arrange() function is used to reorder rows of a dataset.

# Arrange the data by mpg in descending order
arranged_cars <- arrange(mtcars, desc(mpg))
print(arranged_cars)

Chaining Operations with the Pipe Operator (%>%)

dplyr functions can be chained together using the pipe operator %>% to create more readable and concise code.

# Chain multiple operations: filter, select, and arrange
result <- mtcars %>%
  filter(mpg > 20) %>%
  select(mpg, cyl, hp) %>%
  arrange(desc(mpg))
print(result)
Hands-On Exercise
  1. Filter the mtcars dataset for cars with 6 cylinders.
cars_6_cyl <- filter(mtcars, cyl == 6)
print(cars_6_cyl)
  2. Select the columns mpg and hp for cars with 4 cylinders.
cars_4_cyl <- filter(mtcars, cyl == 4)
selected_cars_4_cyl <- select(cars_4_cyl, mpg, hp)
print(selected_cars_4_cyl)
  3. Add a new column to mtcars that contains the horsepower-to-weight ratio.
mtcars_with_ratio <- mutate(mtcars, hp_to_wt = hp / wt)
print(mtcars_with_ratio)
  4. Calculate the average horsepower (hp) for cars grouped by the number of gears (gear).
avg_hp_by_gear <- mtcars %>%
  group_by(gear) %>%
  summarize(avg_hp = mean(hp))
print(avg_hp_by_gear)

Data Tidying with tidyr

Objectives
  • Understand the basics of the tidyr package.
  • Learn to use tidyr functions to transform data into a tidy format.
  • Explore key functions: gather(), spread(), separate(), unite(), pivot_longer(), and pivot_wider().
Introduction to tidyr

tidyr is an R package that provides a set of functions designed to help you clean and organize your data into a tidy format. In a tidy dataset:

  • Each variable forms a column.
  • Each observation forms a row.
  • Each type of observational unit forms a table.

To start, install and load the tidyr package.

install.packages("tidyr")
library(tidyr)

Sample Dataset

We’ll use a small dataset to demonstrate tidyr functions. Let’s create a sample dataset.

# Sample dataset
df <- data.frame(
  id = 1:3,
  year_2019 = c(10, 20, 30),
  year_2020 = c(15, 25, 35)
)
print(df)

Key Functions in tidyr

pivot_longer()

pivot_longer() is used to transform data from wide format to long format, making it easier to analyze.

# Transform data from wide to long format
df_long <- pivot_longer(df, cols = starts_with("year"), names_to = "year", values_to = "value")
print(df_long)
pivot_wider()

pivot_wider() is used to transform data from long format to wide format, which can be useful for certain types of analyses and visualizations.

# Transform data from long to wide format
df_wide <- pivot_wider(df_long, names_from = "year", values_from = "value")
print(df_wide)
separate()

separate() splits a single column into multiple columns based on a delimiter.

# Sample dataset
df2 <- data.frame(
  name = c("John_Smith", "Jane_Doe"),
  age = c(25, 30)
)
print(df2)

# Separate the 'name' column into 'first_name' and 'last_name'
df_separated <- separate(df2, name, into = c("first_name", "last_name"), sep = "_")
print(df_separated)
unite()

unite() combines multiple columns into a single column.

# Combine 'first_name' and 'last_name' into 'full_name'
df_united <- unite(df_separated, full_name, first_name, last_name, sep = " ")
print(df_united)

Additional Functions

drop_na()

drop_na() removes rows containing missing values.

# Sample dataset with missing values
df3 <- data.frame(
  id = 1:3,
  value = c(10, NA, 30)
)
print(df3)

# Remove rows with missing values
df_clean <- drop_na(df3)
print(df_clean)
fill()

fill() fills in missing values with the previous or next value.

# Sample dataset with missing values
df4 <- data.frame(
  id = 1:4,
  value = c(10, NA, 30, NA)
)
print(df4)

# Fill missing values with the previous value
df_filled <- fill(df4, value, .direction = "down")
print(df_filled)
Hands-On Exercise
  1. Create a dataset with columns for student_id, exam1_score, and exam2_score.
students_df <- data.frame(
  student_id = 1:3,
  exam1_score = c(85, 90, 78),
  exam2_score = c(88, 92, 81)
)
print(students_df)
  2. Use pivot_longer() to convert the dataset to long format with columns student_id, exam, and score (names_pattern strips the "_score" suffix so the exam column holds "exam1" and "exam2").
students_long <- pivot_longer(students_df, cols = starts_with("exam"),
                              names_to = "exam", names_pattern = "(exam[0-9]+)_score",
                              values_to = "score")
print(students_long)
  3. Separate the exam column into the test name and its number (e.g., "exam1" -> "exam", "1").
students_long <- separate(students_long, exam, into = c("exam", "number"), sep = 4)
print(students_long)
  4. Use pivot_wider() to convert the dataset back to wide format.
students_wide <- pivot_wider(students_long, names_from = c(exam, number),
                             values_from = score, names_sep = "")
print(students_wide)

Data Visualization with ggplot2

Objectives
  • Understand the basics of the ggplot2 package.
  • Learn to create various types of plots using ggplot2.
  • Explore key components and functions of ggplot2.
Introduction to ggplot2

ggplot2 is a data visualization package for R, part of the tidyverse suite of packages. It is based on the Grammar of Graphics, which provides a coherent system for describing and building graphs.

To start, install and load the ggplot2 package.

install.packages("ggplot2")
library(ggplot2)

Sample Dataset

We’ll use the built-in mtcars dataset for our examples.

data(mtcars)

Basic Components of ggplot2

A ggplot2 plot is constructed with the ggplot() function plus components such as aes() mappings, geom_* layers, facet_* specifications, and more.

1. ggplot()

The ggplot() function initializes a plot object.

# Initialize a plot with mtcars dataset
p <- ggplot(mtcars)
2. aes()

The aes() function specifies the aesthetic mappings, describing how variables in the data are mapped to visual properties (aesthetics) of the plot.

# Initialize a plot with mtcars dataset and set aesthetic mappings
p <- ggplot(mtcars, aes(x = wt, y = mpg))
3. geom_*

Geometric objects (geom_*) define the type of plot. Common geoms include geom_point() for scatter plots, geom_line() for line plots, geom_bar() for bar plots, and geom_histogram() for histograms.

Scatter Plot
# Scatter plot of weight vs. mpg
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()
Line Plot
# Line plot of mpg over index
ggplot(mtcars, aes(x = seq_along(mpg), y = mpg)) +
  geom_line()
Bar Plot
# Bar plot of count of cars per number of cylinders
ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar()
Histogram
# Histogram of mpg
ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 5)

Customizing Plots

Titles and Labels

You can add titles and axis labels using labs().

# Scatter plot with titles and labels
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(
    title = "Scatter plot of MPG vs. Weight",
    x = "Weight (1000 lbs)",
    y = "Miles per Gallon"
  )
Themes

Themes control the overall appearance of the plot. You can use predefined themes or create custom ones.

# Scatter plot with a minimal theme
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  theme_minimal()
Colors

You can map variables to colors using the color aesthetic.

# Scatter plot with points colored by number of cylinders
ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point()
Facets

Faceting allows you to split your data into subsets and display those subsets as multiple panels.

# Faceted scatter plot by number of cylinders
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  facet_wrap(~ cyl)

Saving Plots

You can save plots using the ggsave() function.

# Save the last plot to a file
ggsave("scatter_plot.png")
Hands-On Exercise
  1. Create a histogram of the hp (horsepower) variable from the mtcars dataset.
ggplot(mtcars, aes(x = hp)) +
  geom_histogram(binwidth = 20) +
  labs(
    title = "Histogram of Horsepower",
    x = "Horsepower",
    y = "Count"
  )
  2. Create a scatter plot of mpg vs. disp (displacement), colored by the number of gears (gear).
ggplot(mtcars, aes(x = disp, y = mpg, color = factor(gear))) +
  geom_point() +
  labs(
    title = "Scatter plot of MPG vs. Displacement",
    x = "Displacement (cu.in.)",
    y = "Miles per Gallon"
  )
  3. Create a faceted bar plot showing the count of cars for each number of cylinders, with facets for each number of gears.
ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar() +
  facet_wrap(~ gear) +
  labs(
    title = "Bar plot of Cylinder Count by Gears",
    x = "Number of Cylinders",
    y = "Count"
  )

Statistical Methods with R

Objectives
  • Understand basic statistical methods and concepts.
  • Learn to perform statistical analysis using R.
  • Explore key statistical functions and techniques in R.
Introduction to Statistical Methods

Statistical methods are techniques used to collect, analyze, interpret, and present data. In this lesson, we will cover the following topics:

  • Descriptive statistics
  • Inferential statistics
  • Hypothesis testing
  • Correlation and regression analysis

To start, ensure you have R installed and loaded.

install.packages("tidyverse")
install.packages("MASS")
library(tidyverse)
library(MASS)

Descriptive Statistics

Descriptive statistics summarize and describe the main features of a dataset. These include measures of central tendency and measures of variability.

Measures of Central Tendency
  • Mean: The average of a set of numbers.
# Calculate mean
data <- c(4, 8, 15, 16, 23, 42)
mean(data)
  • Median: The middle value of a dataset when it is ordered.
# Calculate median
median(data)
  • Mode: The value that appears most frequently in a dataset.
# R has no built-in statistical mode (base mode() reports an object's storage
# type), so compute it from a frequency table
mode_val <- as.numeric(names(sort(table(data), decreasing = TRUE))[1])
mode_val
Measures of Variability
  • Variance: The average of the squared differences from the mean.
# Calculate variance
var(data)
  • Standard Deviation: The square root of the variance, representing the average distance from the mean.
# Calculate standard deviation
sd(data)
  • Range: The difference between the maximum and minimum values.
# Calculate range
range_val <- max(data) - min(data)
range_val
  • Interquartile Range (IQR): The range of the middle 50% of the data.
# Calculate IQR
IQR(data)
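
The summary() function reports several of these measures at once.

# Minimum, quartiles, median, and mean in one call
summary(data)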

Inferential Statistics

Inferential statistics allow us to make inferences and draw conclusions about a population based on a sample.

Hypothesis Testing

Hypothesis testing is a method used to determine if there is enough evidence to reject a null hypothesis.

  • t-test: Used to compare the means of two groups.
# Perform a t-test
group1 <- c(5, 6, 7, 8, 9)
group2 <- c(7, 8, 9, 10, 11)
t.test(group1, group2)
  • Chi-Square Test: Used to test the association between categorical variables.
# Create a 2x2 contingency table of counts
observed <- matrix(c(10, 20, 30, 40), nrow = 2)
# Perform a chi-square test
chisq.test(observed)
Correlation and Regression Analysis

Correlation and regression analysis are used to examine relationships between variables.

  • Correlation: Measures the strength and direction of the relationship between two variables.
# Calculate correlation
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 6, 8, 10)
cor(x, y)
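cor() returns only the coefficient; cor.test() additionally tests whether the correlation differs significantly from zero. It is applied here to mtcars, since the toy vectors above are perfectly correlated.
# Test whether a correlation is significantly different from zero
cor.test(mtcars$wt, mtcars$mpg)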
  • Linear Regression: Models the relationship between a dependent variable and one or more independent variables.
# Fit a linear regression model
data(mtcars)
model <- lm(mpg ~ wt + hp, data = mtcars)
summary(model)
ANOVA (Analysis of Variance)

ANOVA is used to compare the means of three or more groups.

# Perform ANOVA
data(iris)
anova_model <- aov(Sepal.Length ~ Species, data = iris)
summary(anova_model)
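
When an ANOVA is significant, a post-hoc test such as Tukey's HSD shows which pairs of groups actually differ.

# Pairwise group comparisons following the ANOVA
TukeyHSD(anova_model)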
Hands-On Exercise
  1. Load the iris dataset and calculate descriptive statistics for the Sepal.Length variable.
data(iris)
summary(iris$Sepal.Length)  # min, quartiles, median, mean, max
sd(iris$Sepal.Length)       # standard deviation
  2. Perform a t-test to compare the Sepal.Length between the setosa and versicolor species.
setosa <- subset(iris, Species == "setosa")$Sepal.Length
versicolor <- subset(iris, Species == "versicolor")$Sepal.Length
t.test(setosa, versicolor)
  3. Calculate the correlation between Sepal.Length and Petal.Length in the iris dataset.
cor(iris$Sepal.Length, iris$Petal.Length)
  4. Fit a linear regression model to predict Sepal.Length using Petal.Length and Petal.Width.
model <- lm(Sepal.Length ~ Petal.Length + Petal.Width, data = iris)
summary(model)
  5. Perform an ANOVA to compare Sepal.Width across different species in the iris dataset.
anova_model <- aov(Sepal.Width ~ Species, data = iris)
summary(anova_model)

Machine Learning with R

Objectives
  • Understand the basics of machine learning.
  • Learn to implement key machine learning algorithms in R.
  • Explore essential machine learning packages in R.
Introduction to Machine Learning

Machine learning is a subset of artificial intelligence that focuses on building systems that can learn from and make decisions based on data. In this lesson, we’ll cover the following topics:

  • Supervised learning
  • Unsupervised learning
  • Model evaluation

We’ll use the following packages for our examples:

install.packages("tidyverse")
install.packages("caret")
install.packages("randomForest")
install.packages("e1071")
install.packages("rpart")
library(tidyverse)
library(caret)
library(randomForest)
library(e1071)
library(rpart)

Supervised Learning

Supervised learning involves training a model on a labeled dataset, which means that each training example is paired with an output label. We’ll cover two main types of supervised learning: regression and classification.

1. Regression

Regression is used to predict a continuous value. We’ll use the mtcars dataset to predict the mpg (miles per gallon) using linear regression.

# Load the mtcars dataset
data(mtcars)

# Split the data into training and testing sets
set.seed(123)
training_samples <- mtcars$mpg %>%
  createDataPartition(p = 0.8, list = FALSE)
train_data <- mtcars[training_samples, ]
test_data <- mtcars[-training_samples, ]

# Train a linear regression model
model <- lm(mpg ~ ., data = train_data)

# Make predictions
predictions <- model %>% predict(test_data)

# Evaluate the model
RMSE <- sqrt(mean((predictions - test_data$mpg)^2))
RMSE
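
Alongside RMSE, R-squared (the share of variance in the outcome explained by the predictions) is commonly reported; it can be computed directly on the test set.

# R-squared on the test set
ss_res <- sum((test_data$mpg - predictions)^2)
ss_tot <- sum((test_data$mpg - mean(test_data$mpg))^2)
1 - ss_res / ss_tot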
2. Classification

Classification is used to predict a categorical label. We’ll use the iris dataset to classify the species of iris flowers using a random forest.

# Load the iris dataset
data(iris)

# Split the data into training and testing sets
set.seed(123)
training_samples <- iris$Species %>%
  createDataPartition(p = 0.8, list = FALSE)
train_data <- iris[training_samples, ]
test_data <- iris[-training_samples, ]

# Train a random forest model
model <- randomForest(Species ~ ., data = train_data)

# Make predictions
predictions <- model %>% predict(test_data)

# Evaluate the model
confusionMatrix(predictions, test_data$Species)

Unsupervised Learning

Unsupervised learning involves training a model on data without labeled responses. We’ll cover clustering as an example of unsupervised learning.

Clustering

Clustering is used to group data points into clusters based on similarity. We’ll use the iris dataset for k-means clustering.

# Load the iris dataset
data(iris)

# Remove the species column for clustering
iris_data <- iris[, -5]

# Perform k-means clustering
set.seed(123)
kmeans_result <- kmeans(iris_data, centers = 3)

# Add the cluster assignments to the original dataset
iris$Cluster <- as.factor(kmeans_result$cluster)

# Visualize the clusters
ggplot(iris, aes(Petal.Length, Petal.Width, color = Cluster)) +
  geom_point() +
  labs(title = "K-means Clustering of Iris Dataset")

Model Evaluation

Evaluating the performance of a machine learning model is crucial to ensure its effectiveness. We’ve already seen some evaluation techniques like RMSE for regression and confusion matrix for classification. Let’s explore cross-validation.

Cross-Validation

Cross-validation is a technique for assessing how a model will generalize to an independent dataset. We’ll use 10-fold cross-validation for the random forest model on the iris dataset.

# Define the control using a 10-fold cross-validation
train_control <- trainControl(method = "cv", number = 10)

# Train the model
model <- train(Species ~ ., data = iris, method = "rf", trControl = train_control)

# Summarize the results
print(model)
Hands-On Exercise
  1. Load the Boston dataset from the MASS package and use linear regression to predict the median value of owner-occupied homes (medv).
library(MASS)
data(Boston)

# Split the data into training and testing sets
set.seed(123)
training_samples <- Boston$medv %>%
  createDataPartition(p = 0.8, list = FALSE)
train_data <- Boston[training_samples, ]
test_data <- Boston[-training_samples, ]

# Train a linear regression model
model <- lm(medv ~ ., data = train_data)

# Make predictions
predictions <- model %>% predict(test_data)

# Evaluate the model
RMSE <- sqrt(mean((predictions - test_data$medv)^2))
RMSE
  2. Use the iris dataset and an SVM (Support Vector Machine) to classify the species of iris flowers.
# Load the iris dataset
data(iris)

# Split the data into training and testing sets
set.seed(123)
training_samples <- iris$Species %>%
  createDataPartition(p = 0.8, list = FALSE)
train_data <- iris[training_samples, ]
test_data <- iris[-training_samples, ]

# Train an SVM model
model <- svm(Species ~ ., data = train_data)

# Make predictions
predictions <- model %>% predict(test_data)

# Evaluate the model
confusionMatrix(predictions, test_data$Species)
  3. Perform hierarchical clustering on the iris dataset and visualize the dendrogram.
# Load the iris dataset
data(iris)

# Remove the species column for clustering
iris_data <- iris[, -5]

# Scale the data
iris_data <- scale(iris_data)

# Perform hierarchical clustering
hc <- hclust(dist(iris_data), method = "complete")

# Plot the dendrogram
plot(hc, labels = iris$Species, main = "Hierarchical Clustering Dendrogram")

Machine Learning with R Part 2

Objectives
  • Dive deeper into advanced machine learning techniques.
  • Learn about ensemble methods, hyperparameter tuning, and advanced model evaluation.
  • Explore essential advanced machine learning packages in R.
Introduction

In this lesson, we will cover advanced topics in machine learning with R, including:

  • Ensemble methods (e.g., boosting and bagging)
  • Hyperparameter tuning
  • Advanced model evaluation techniques
  • Model deployment

We will use the following packages:

install.packages("tidyverse")
install.packages("caret")
install.packages("randomForest")
install.packages("e1071")
install.packages("xgboost")
install.packages("ROCR")
install.packages("shiny")
library(tidyverse)
library(caret)
library(randomForest)
library(e1071)
library(xgboost)
library(ROCR)
library(shiny)

Ensemble Methods

Ensemble methods combine multiple models to improve performance. Two popular ensemble methods are bagging and boosting.

Bagging (Bootstrap Aggregating)

Bagging trains multiple models on bootstrapped subsets of the training data and averages (or votes on) their predictions. Random forests build on bagging by also sampling a random subset of features at each split.

# Load the iris dataset
data(iris)

# Split the data into training and testing sets
set.seed(123)
training_samples <- iris$Species %>%
  createDataPartition(p = 0.8, list = FALSE)
train_data <- iris[training_samples, ]
test_data <- iris[-training_samples, ]

# Train a random forest model
model <- randomForest(Species ~ ., data = train_data, ntree = 100)

# Make predictions
predictions <- model %>% predict(test_data)

# Evaluate the model
confusionMatrix(predictions, test_data$Species)
Boosting

Boosting sequentially trains models, each trying to correct the errors of its predecessor. XGBoost is a popular boosting algorithm.

# Prepare the data for XGBoost
train_matrix <- model.matrix(Species ~ . - 1, data = train_data)
train_label <- as.numeric(train_data$Species) - 1
test_matrix <- model.matrix(Species ~ . - 1, data = test_data)
test_label <- as.numeric(test_data$Species) - 1

# Train an XGBoost model
dtrain <- xgb.DMatrix(data = train_matrix, label = train_label)
dtest <- xgb.DMatrix(data = test_matrix, label = test_label)
params <- list(objective = "multi:softprob", num_class = 3, eval_metric = "mlogloss")
model <- xgboost(params = params, data = dtrain, nrounds = 100)

# Make predictions
predictions <- predict(model, dtest)
# multi:softprob returns one probability per class per observation, ordered by
# observation, so reshape row-wise before taking the most likely class
predicted_labels <- max.col(matrix(predictions, ncol = 3, byrow = TRUE)) - 1

# Evaluate the model
confusionMatrix(as.factor(predicted_labels), as.factor(test_label))
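
XGBoost can also report which features drove the model's decisions via its importance scores.

# Feature importance from the trained booster
importance <- xgb.importance(feature_names = colnames(train_matrix), model = model)
print(importance)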

Hyperparameter Tuning

Hyperparameter tuning is the process of finding the optimal settings for a machine learning model to improve its performance. The caret package provides tools for hyperparameter tuning.

# Define the control using a grid search
train_control <- trainControl(method = "cv", number = 10)

# Define the parameter grid
tune_grid <- expand.grid(mtry = c(2, 3, 4), splitrule = "gini", min.node.size = 1)

# Train the model using grid search
model <- train(Species ~ ., data = iris, method = "ranger", trControl = train_control, tuneGrid = tune_grid)

# Summarize the results
print(model)

Advanced Model Evaluation

Advanced model evaluation techniques provide more insight into model performance. ROC curves and AUC (Area Under the Curve) are common evaluation metrics for classification models.

# Load the iris dataset and convert to binary classification
data(iris)
iris_binary <- iris %>% filter(Species != "virginica")
iris_binary$Species <- factor(iris_binary$Species)

# Split the data
set.seed(123)
training_samples <- iris_binary$Species %>%
  createDataPartition(p = 0.8, list = FALSE)
train_data <- iris_binary[training_samples, ]
test_data <- iris_binary[-training_samples, ]

# Train a logistic regression model
model <- glm(Species ~ ., data = train_data, family = binomial)

# Make predictions
predictions <- predict(model, test_data, type = "response")

# Evaluate the model using ROC and AUC
pred <- prediction(predictions, test_data$Species)
perf <- performance(pred, "tpr", "fpr")
plot(perf, col = "blue", main = "ROC Curve")
abline(a = 0, b = 1, lty = 2, col = "red")

# Calculate AUC
auc <- performance(pred, "auc")
auc@y.values[[1]]

Model Deployment

Deploying a model means making it available for use in a production environment. Shiny is a package that allows you to build interactive web applications to showcase your models.

# Define the UI
ui <- fluidPage(
  titlePanel("Iris Species Predictor"),
  sidebarLayout(
    sidebarPanel(
      numericInput("sepal_length", "Sepal Length:", value = 5.0),
      numericInput("sepal_width", "Sepal Width:", value = 3.5),
      numericInput("petal_length", "Petal Length:", value = 1.5),
      numericInput("petal_width", "Petal Width:", value = 0.3),
      actionButton("predict", "Predict")
    ),
    mainPanel(
      textOutput("prediction")
    )
  )
)

# Define the server
server <- function(input, output) {
  model <- randomForest(Species ~ ., data = iris, ntree = 100)

  observeEvent(input$predict, {
    new_data <- data.frame(
      Sepal.Length = input$sepal_length,
      Sepal.Width = input$sepal_width,
      Petal.Length = input$petal_length,
      Petal.Width = input$petal_width
    )
    prediction <- predict(model, new_data)
    output$prediction <- renderText({ paste("Predicted Species:", prediction) })
  })
}

# Run the application
shinyApp(ui = ui, server = server)
Hands-On Exercise
  1. Load the Boston dataset from the MASS package and use gradient boosting to predict the median value of owner-occupied homes (medv).
library(MASS)
data(Boston)

# Prepare the data (set a seed for a reproducible split)
set.seed(123)
train_index <- createDataPartition(Boston$medv, p = 0.8, list = FALSE)
train_data <- Boston[train_index, ]
test_data <- Boston[-train_index, ]

# Train a gradient boosting model
train_matrix <- model.matrix(medv ~ . - 1, data = train_data)
train_label <- train_data$medv
test_matrix <- model.matrix(medv ~ . - 1, data = test_data)
test_label <- test_data$medv

dtrain <- xgb.DMatrix(data = train_matrix, label = train_label)
dtest <- xgb.DMatrix(data = test_matrix, label = test_label)
params <- list(objective = "reg:squarederror", eval_metric = "rmse")
model <- xgboost(params = params, data = dtrain, nrounds = 100)

# Make predictions
predictions <- predict(model, dtest)
rmse <- sqrt(mean((predictions - test_label)^2))
rmse
  2. Perform hyperparameter tuning on a random forest model using the caret package on the iris dataset.
# Define the control using a grid search
train_control <- trainControl(method = "cv", number = 10)

# Define the parameter grid
tune_grid <- expand.grid(mtry = c(2, 3, 4), splitrule = "gini", min.node.size = 1)

# Train the model using grid search
model <- train(Species ~ ., data = iris, method = "ranger", trControl = train_control, tuneGrid = tune_grid)

# Summarize the results
print(model)
  3. Use the e1071 package to train an SVM model on the iris dataset and evaluate its performance using ROC and AUC.
# Load the iris dataset and convert to binary classification
data(iris)
iris_binary <- iris %>% filter(Species != "virginica")
iris_binary$Species <- factor(iris_binary$Species)

# Split the data
set.seed(123)
training_samples <- iris_binary$Species %>%
  createDataPartition(p = 0.8, list = FALSE)
train_data <- iris_binary[training_samples, ]
test_data <- iris_binary[-training_samples, ]

# Train an SVM model
model <- svm(Species ~ ., data = train_data, probability = TRUE)

# Make predictions
predictions <- predict(model, test_data, probability = TRUE)
# Index the probability matrix by class name, since column order is not guaranteed
probabilities <- attr(predictions, "probabilities")[, "versicolor"]

# Evaluate the model using ROC and AUC
pred <- prediction(probabilities, test_data$Species)
perf <- performance(pred, "tpr", "fpr")
plot(perf, col = "blue", main = "ROC Curve")
abline(a = 0, b = 1, lty = 2, col = "red")

# Calculate AUC
auc <- performance(pred, "auc")
auc@y.values[[1]]