Why would anyone learn R? Well, choosing between R and Python for data science depends on several factors, including the specific requirements of the task, personal preference, and existing infrastructure. Here are some scenarios where someone might choose R over Python:
When to Use R for Data Science
- Statistical Analysis and Data Visualization:
- R has a rich ecosystem of packages specifically designed for statistical analysis, making it ideal for tasks such as hypothesis testing, linear modeling, and time series analysis.
- The ggplot2 package in R is renowned for its expressive and powerful plotting capabilities, which are often preferred by statisticians.
- Academic and Research Purposes:
- R is widely used in academic and research environments, especially in disciplines like statistics, bioinformatics, and social sciences.
- Many academic papers and research publications provide R code for reproducibility, facilitating collaboration and peer review.
- Data Manipulation and Cleaning:
- R provides powerful tools like the dplyr and tidyr packages for data manipulation and cleaning, allowing users to efficiently reshape, filter, and summarize datasets.
- Specialized Statistical Tests:
- R has extensive support for specialized statistical tests and procedures, including advanced regression models, survival analysis, and Bayesian statistics.
- Researchers and statisticians often find R’s comprehensive statistical libraries beneficial for performing complex analyses.
- Integrated Development Environment (IDE):
- RStudio, the predominant IDE for R, offers a dedicated environment with built-in tools for data visualization, package management, and interactive analysis.
- Community and Package Ecosystem:
- The R community is highly focused on statistical computing and data analysis, with a vast repository of specialized packages and libraries catering to diverse analytical needs.
Considerations for Choosing Between R and Python
- Learning Curve: R’s syntax and approach may be more intuitive for statisticians and researchers familiar with statistical methods.
- Task-Specific Requirements: Evaluate whether the task involves primarily statistical analysis, where R excels, or requires broader integration with other technologies (e.g., web development, machine learning).
- Team Skills and Preferences: Consider the existing expertise and preferences of the team members who will be working on the project.
- Interoperability: Python’s versatility and integration capabilities may be advantageous when working within a broader data ecosystem that includes web applications, machine learning models, and data engineering pipelines.
Introduction to R
Objectives
- Understand what R is and its uses in data science.
- Set up R and RStudio.
- Learn basic R syntax and data types.
What is R?
R is a programming language and environment commonly used for statistical computing, data analysis, and graphical representation. Developed by statisticians Ross Ihaka and Robert Gentleman, R has become a vital tool in data science due to its versatility and the extensive library of packages available.
Setting Up R and RStudio
- Install R:
- Download and install R from the CRAN website.
- Install RStudio:
- Download and install RStudio from the RStudio website.
Basic R Syntax and Data Types
R as a Calculator
R can be used to perform basic arithmetic operations.
# Addition
3 + 4
# Subtraction
10 - 2
# Multiplication
5 * 6
# Division
9 / 3
# Exponentiation
2^3
# Modulus (remainder)
10 %% 3
Variables and Assignment
In R, you can store values in variables using the assignment operator <-.
# Assigning values to variables
x <- 5
y <- 10
# Performing operations with variables
z <- x + y
# Printing the value of z
print(z)
Data Types
R supports several basic data types.
Numeric: Represents numbers.
num <- 42
Character: Represents text strings.
char <- "Hello, R!"
Logical: Represents TRUE or FALSE values.
log <- TRUE
Vector: A sequence of data elements of the same type.
vec <- c(1, 2, 3, 4, 5)
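You can check the type of any value with the class() function:
# Inspect the type of each value defined above
class(num)   # "numeric"
class(char)  # "character"
class(log)   # "logical"
class(vec)   # "numeric", since all elements are numbers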
Basic Functions
R comes with many built-in functions that can perform various tasks.
Mathematical functions:
sqrt(16) # Square root
abs(-5) # Absolute value
Character functions:
nchar("Hello") # Number of characters
tolower("HELLO") # Convert to lowercase
Logical functions:
all(c(TRUE, FALSE, TRUE)) # Check if all are TRUE
any(c(TRUE, FALSE, TRUE)) # Check if any are TRUE
Working with Data Structures
Vectors: An ordered collection of elements of the same type.
numbers <- c(1, 2, 3, 4, 5)
Matrices: Two-dimensional arrays where elements are of the same type.
matrix_1 <- matrix(1:9, nrow=3, ncol=3)
Data Frames: Table-like structures where each column can contain different types of data.
df <- data.frame(
name = c("John", "Jane", "Doe"),
age = c(23, 25, 31),
height = c(177, 165, 180)
)
Lists: Collections of elements that can contain different types of data.
lst <- list(name="John", age=23, height=177)
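Elements of these structures can be accessed by position (R uses 1-based indexing) or by name. For example, using the objects created above:
# Indexing into vectors, matrices, data frames, and lists
numbers[2]       # second element of the vector: 2
matrix_1[1, 3]   # row 1, column 3 of the matrix: 7
df$age           # the age column of the data frame
df[1, "name"]    # row 1 of the name column: "John"
lst$name         # the name element of the list: "John"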
Hands-On Exercise
Create a vector of your five favorite numbers.
fav_numbers <- c(7, 13, 21, 34, 42)
Calculate the mean of these numbers.
mean(fav_numbers)
Create a data frame with columns for name, age, and favorite color for three people.
friends_df <- data.frame(
name = c("Alice", "Bob", "Charlie"),
age = c(25, 30, 35),
favorite_color = c("blue", "green", "red")
)
Print the data frame.
print(friends_df)
Data Manipulation with dplyr
Objectives
- Understand the basics of the dplyr package.
- Learn to perform data manipulation tasks such as filtering, selecting, mutating, summarizing, and grouping data.
Introduction to dplyr
dplyr is a powerful R package for data manipulation, providing a consistent set of verbs that help in performing common data manipulation tasks. It’s part of the tidyverse collection of R packages designed for data science.
To start, install and load the dplyr package.
install.packages("dplyr")
library(dplyr)
Sample Dataset
For this lesson, we’ll use the built-in mtcars dataset.
data(mtcars)
Basic Verbs in dplyr
- filter(): Subset rows based on conditions.
- select(): Select columns by name.
- mutate(): Create new columns or modify existing ones.
- summarize(): Reduce multiple values to a single summary.
- group_by(): Group data for summary operations.
- arrange(): Order rows by values of columns.
Filtering Rows with filter()
The filter() function is used to subset rows based on conditions.
# Filter cars with mpg greater than 20
filtered_cars <- filter(mtcars, mpg > 20)
print(filtered_cars)
Selecting Columns with select()
The select() function is used to choose specific columns from a dataset.
# Select the mpg, cyl, and hp columns
selected_columns <- select(mtcars, mpg, cyl, hp)
print(selected_columns)
Creating and Modifying Columns with mutate()
The mutate() function is used to add new columns or modify existing ones.
# Add a new column for weight in kilograms
mtcars_kg <- mutate(mtcars, weight_kg = wt * 453.592)
print(mtcars_kg)
Summarizing Data with summarize()
The summarize() function is used to generate summary statistics.
# Calculate the average mpg
avg_mpg <- summarize(mtcars, avg_mpg = mean(mpg))
print(avg_mpg)
Grouping Data with group_by()
The group_by() function is used to group data before applying summary functions. The example below uses the pipe operator %>%, which is explained at the end of this lesson.
# Calculate the average mpg by number of cylinders
avg_mpg_by_cyl <- mtcars %>%
group_by(cyl) %>%
summarize(avg_mpg = mean(mpg))
print(avg_mpg_by_cyl)
Arranging Rows with arrange()
The arrange() function is used to reorder the rows of a dataset.
# Arrange the data by mpg in descending order
arranged_cars <- arrange(mtcars, desc(mpg))
print(arranged_cars)
Chaining Operations with the Pipe Operator (%>%)
dplyr functions can be chained together using the pipe operator %>% to create more readable and concise code.
# Chain multiple operations: filter, select, and arrange
result <- mtcars %>%
filter(mpg > 20) %>%
select(mpg, cyl, hp) %>%
arrange(desc(mpg))
print(result)
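For comparison, here is the same pipeline written as nested function calls, which is what %>% does behind the scenes (read from the inside out):
# Equivalent nested form of the pipeline above
result <- arrange(select(filter(mtcars, mpg > 20), mpg, cyl, hp), desc(mpg))
print(result)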
Hands-On Exercise
- Filter the mtcars dataset for cars with 6 cylinders.
cars_6_cyl <- filter(mtcars, cyl == 6)
print(cars_6_cyl)
- Select the columns mpg and hp for cars with 4 cylinders.
cars_4_cyl <- filter(mtcars, cyl == 4)
selected_cars_4_cyl <- select(cars_4_cyl, mpg, hp)
print(selected_cars_4_cyl)
- Add a new column to mtcars that contains the horsepower-to-weight ratio.
mtcars_with_ratio <- mutate(mtcars, hp_to_wt = hp / wt)
print(mtcars_with_ratio)
- Calculate the average horsepower (hp) for cars grouped by the number of gears (gear).
avg_hp_by_gear <- mtcars %>%
group_by(gear) %>%
summarize(avg_hp = mean(hp))
print(avg_hp_by_gear)
Data Tidying with tidyr
Objectives
- Understand the basics of the tidyr package.
- Learn to use tidyr functions to transform data into a tidy format.
- Explore key functions: pivot_longer(), pivot_wider(), separate(), unite(), drop_na(), and fill().
Introduction to tidyr
tidyr is an R package that provides a set of functions designed to help you clean and organize your data into a tidy format. In a tidy dataset:
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
To start, install and load the tidyr package.
install.packages("tidyr")
library(tidyr)
Sample Dataset
We’ll use a small dataset to demonstrate tidyr functions. Let’s create a sample dataset.
# Sample dataset
df <- data.frame(
id = 1:3,
year_2019 = c(10, 20, 30),
year_2020 = c(15, 25, 35)
)
print(df)
Key Functions in tidyr
pivot_longer()
pivot_longer() is used to transform data from wide format to long format, making it easier to analyze.
# Transform data from wide to long format
df_long <- pivot_longer(df, cols = starts_with("year"), names_to = "year", values_to = "value")
print(df_long)
pivot_wider()
pivot_wider() is used to transform data from long format to wide format, which can be useful for certain types of analyses and visualizations.
# Transform data from long to wide format
df_wide <- pivot_wider(df_long, names_from = "year", values_from = "value")
print(df_wide)
separate()
separate() splits a single column into multiple columns based on a delimiter.
# Sample dataset
df2 <- data.frame(
name = c("John_Smith", "Jane_Doe"),
age = c(25, 30)
)
print(df2)
# Separate the 'name' column into 'first_name' and 'last_name'
df_separated <- separate(df2, name, into = c("first_name", "last_name"), sep = "_")
print(df_separated)
unite()
unite() combines multiple columns into a single column.
# Combine 'first_name' and 'last_name' into 'full_name'
df_united <- unite(df_separated, full_name, first_name, last_name, sep = " ")
print(df_united)
Additional Functions
drop_na()
drop_na() removes rows containing missing values.
# Sample dataset with missing values
df3 <- data.frame(
id = 1:3,
value = c(10, NA, 30)
)
print(df3)
# Remove rows with missing values
df_clean <- drop_na(df3)
print(df_clean)
fill()
fill() fills in missing values with the previous or next value.
# Sample dataset with missing values
df4 <- data.frame(
id = 1:4,
value = c(10, NA, 30, NA)
)
print(df4)
# Fill missing values with the previous value
df_filled <- fill(df4, value, .direction = "down")
print(df_filled)
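To fill with the next value instead, set .direction = "up":
# Fill missing values with the next value
# (the trailing NA has no following value, so it remains NA)
df_filled_up <- fill(df4, value, .direction = "up")
print(df_filled_up)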
Hands-On Exercise
- Create a dataset with columns for student_id, exam1_score, and exam2_score.
students_df <- data.frame(
student_id = 1:3,
exam1_score = c(85, 90, 78),
exam2_score = c(88, 92, 81)
)
print(students_df)
- Use pivot_longer() to convert the dataset to long format with columns student_id, exam, and score.
students_long <- pivot_longer(students_df, cols = starts_with("exam"), names_to = "exam", values_to = "score")
print(students_long)
- Separate the exam column into the exam name and a suffix (e.g., “exam1_score” -> “exam1”, “score”).
students_long <- separate(students_long, exam, into = c("exam", "suffix"), sep = "_")
print(students_long)
- Use pivot_wider() to convert the dataset back to wide format.
students_wide <- pivot_wider(students_long, names_from = exam, values_from = score)
print(students_wide)
Data Visualization with ggplot2
Objectives
- Understand the basics of the ggplot2 package.
- Learn to create various types of plots using ggplot2.
- Explore key components and functions of ggplot2.
Introduction to ggplot2
ggplot2 is a data visualization package for R, part of the tidyverse suite of packages. It is based on the Grammar of Graphics, which provides a coherent system for describing and building graphs.
To start, install and load the ggplot2 package.
install.packages("ggplot2")
library(ggplot2)
Sample Dataset
We’ll use the built-in mtcars dataset for our examples.
data(mtcars)
Basic Components of ggplot2
A ggplot2 plot is constructed using the ggplot() function, along with layers such as geoms (geom_*), aesthetic mappings (aes()), facets (facet_*), and more.
1. ggplot()
The ggplot() function initializes a plot object.
# Initialize a plot with mtcars dataset
p <- ggplot(mtcars)
2. aes()
The aes() function specifies the aesthetic mappings, describing how variables in the data are mapped to visual properties (aesthetics) of the plot.
# Initialize a plot with mtcars dataset and set aesthetic mappings
p <- ggplot(mtcars, aes(x = wt, y = mpg))
3. geom_*
Geometric objects (geom_*) define the type of plot. Common geoms include geom_point() for scatter plots, geom_line() for line plots, geom_bar() for bar plots, and geom_histogram() for histograms.
Scatter Plot
# Scatter plot of weight vs. mpg
ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_point()
Line Plot
# Line plot of mpg over index
ggplot(mtcars, aes(x = seq_along(mpg), y = mpg)) +
geom_line()
Bar Plot
# Bar plot of count of cars per number of cylinders
ggplot(mtcars, aes(x = factor(cyl))) +
geom_bar()
Histogram
# Histogram of mpg
ggplot(mtcars, aes(x = mpg)) +
geom_histogram(binwidth = 5)
Customizing Plots
Titles and Labels
You can add titles and axis labels using labs().
# Scatter plot with titles and labels
ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_point() +
labs(
title = "Scatter plot of MPG vs. Weight",
x = "Weight (1000 lbs)",
y = "Miles per Gallon"
)
Themes
Themes control the overall appearance of the plot. You can use predefined themes or create custom ones.
# Scatter plot with a minimal theme
ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_point() +
theme_minimal()
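Beyond the predefined themes such as theme_minimal(), the theme() function lets you override individual elements. A small sketch:
# Scatter plot with individually customized theme elements
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(title = "MPG vs. Weight") +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold"),  # bold the title
    panel.grid.minor = element_blank()         # remove minor grid lines
  )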
Colors
You can map variables to colors using the color aesthetic.
# Scatter plot with points colored by number of cylinders
ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
geom_point()
Facets
Faceting allows you to split your data into subsets and display those subsets as multiple panels.
# Faceted scatter plot by number of cylinders
ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_point() +
facet_wrap(~ cyl)
Saving Plots
You can save plots using the ggsave() function.
# Save the last plot to a file
ggsave("scatter_plot.png")
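By default, ggsave() saves the most recently displayed plot and infers the format from the file extension. You can also pass a plot object and set the size and resolution explicitly:
# Save a specific plot object with explicit dimensions (inches) and resolution
p <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()
ggsave("scatter_plot.png", plot = p, width = 6, height = 4, dpi = 300)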
Hands-On Exercise
- Create a histogram of the hp (horsepower) variable from the mtcars dataset.
ggplot(mtcars, aes(x = hp)) +
geom_histogram(binwidth = 20) +
labs(
title = "Histogram of Horsepower",
x = "Horsepower",
y = "Count"
)
- Create a scatter plot of mpg vs. disp (displacement), colored by the number of gears (gear).
ggplot(mtcars, aes(x = disp, y = mpg, color = factor(gear))) +
geom_point() +
labs(
title = "Scatter plot of MPG vs. Displacement",
x = "Displacement (cu.in.)",
y = "Miles per Gallon"
)
- Create a faceted bar plot showing the count of cars for each number of cylinders, with facets for each number of gears.
ggplot(mtcars, aes(x = factor(cyl))) +
geom_bar() +
facet_wrap(~ gear) +
labs(
title = "Bar plot of Cylinder Count by Gears",
x = "Number of Cylinders",
y = "Count"
)
Statistical Methods with R
Objectives
- Understand basic statistical methods and concepts.
- Learn to perform statistical analysis using R.
- Explore key statistical functions and techniques in R.
Introduction to Statistical Methods
Statistical methods are techniques used to collect, analyze, interpret, and present data. In this lesson, we will cover the following topics:
- Descriptive statistics
- Inferential statistics
- Hypothesis testing
- Correlation and regression analysis
To start, install and load the required packages.
install.packages("tidyverse")
install.packages("MASS")
library(tidyverse)
library(MASS)
Descriptive Statistics
Descriptive statistics summarize and describe the main features of a dataset. These include measures of central tendency and measures of variability.
Measures of Central Tendency
- Mean: The average of a set of numbers.
# Calculate mean
data <- c(4, 8, 15, 16, 23, 42)
mean(data)
- Median: The middle value of a dataset when it is ordered.
# Calculate median
median(data)
- Mode: The value that appears most frequently in a dataset.
# Calculate the mode (R has no built-in function for the statistical mode)
# Note: every value in this sample is unique, so they all tie for the mode
mode_val <- as.numeric(names(sort(table(data), decreasing = TRUE))[1])
mode_val
Measures of Variability
- Variance: The average of the squared differences from the mean. R’s var() computes the sample variance, dividing by n - 1 rather than n (a manual check follows this list).
# Calculate variance
var(data)
- Standard Deviation: The square root of the variance, representing the average distance from the mean.
# Calculate standard deviation
sd(data)
- Range: The difference between the maximum and minimum values.
# Calculate range (R's range() returns c(min, max), so compute the difference directly)
range_val <- max(data) - min(data)
range_val
- Interquartile Range (IQR): The range of the middle 50% of the data.
# Calculate IQR
IQR(data)
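Two quick checks tie these together: the sample variance can be reproduced by hand (note the n - 1 denominator), and summary() reports the minimum, quartiles, median, mean, and maximum in a single call.
# Sample variance computed manually: sum of squared deviations / (n - 1)
sum((data - mean(data))^2) / (length(data) - 1)
# Five-number summary plus the mean
summary(data)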
Inferential Statistics
Inferential statistics allow us to make inferences and draw conclusions about a population based on a sample.
Hypothesis Testing
Hypothesis testing is a method used to determine if there is enough evidence to reject a null hypothesis.
- t-test: Used to compare the means of two groups.
# Perform a t-test
group1 <- c(5, 6, 7, 8, 9)
group2 <- c(7, 8, 9, 10, 11)
t.test(group1, group2)
- Chi-Square Test: Used to test the association between categorical variables.
# Create a contingency table of observed counts
observed <- matrix(c(10, 20, 30, 40), nrow = 2)
# Perform a chi-square test
chisq.test(observed)
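Both t.test() and chisq.test() return htest objects, so individual components such as the p-value can be extracted programmatically:
# Store the test results and pull out the p-values
t_result <- t.test(group1, group2)
t_result$p.value
chisq_result <- chisq.test(observed)
chisq_result$p.value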
Correlation and Regression Analysis
Correlation and regression analysis are used to examine relationships between variables.
- Correlation: Measures the strength and direction of the relationship between two variables.
# Calculate correlation
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 6, 8, 10)
cor(x, y)
- Linear Regression: Models the relationship between a dependent variable and one or more independent variables.
# Fit a linear regression model
data(mtcars)
model <- lm(mpg ~ wt + hp, data = mtcars)
summary(model)
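A fitted lm object can then be used with predict() to estimate the response for new data; for example, for a hypothetical car weighing 3,000 lbs (wt = 3) with 150 horsepower:
# Predict mpg for a new observation
new_car <- data.frame(wt = 3, hp = 150)
predict(model, newdata = new_car)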
ANOVA (Analysis of Variance)
ANOVA is used to compare the means of three or more groups.
# Perform ANOVA
data(iris)
anova_model <- aov(Sepal.Length ~ Species, data = iris)
summary(anova_model)
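When the ANOVA indicates a significant difference, a post-hoc test such as Tukey’s HSD shows which pairs of groups differ:
# Pairwise comparisons between species
TukeyHSD(anova_model)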
Hands-On Exercise
- Load the iris dataset and calculate descriptive statistics for the Sepal.Length variable.
data(iris)
summary(iris$Sepal.Length)
- Perform a t-test to compare Sepal.Length between the setosa and versicolor species.
setosa <- subset(iris, Species == "setosa")$Sepal.Length
versicolor <- subset(iris, Species == "versicolor")$Sepal.Length
t.test(setosa, versicolor)
- Calculate the correlation between Sepal.Length and Petal.Length in the iris dataset.
cor(iris$Sepal.Length, iris$Petal.Length)
- Fit a linear regression model to predict Sepal.Length using Petal.Length and Petal.Width.
model <- lm(Sepal.Length ~ Petal.Length + Petal.Width, data = iris)
summary(model)
- Perform an ANOVA to compare Sepal.Width across the different species in the iris dataset.
anova_model <- aov(Sepal.Width ~ Species, data = iris)
summary(anova_model)
Machine Learning with R
Objectives
- Understand the basics of machine learning.
- Learn to implement key machine learning algorithms in R.
- Explore essential machine learning packages in R.
Introduction to Machine Learning
Machine learning is a subset of artificial intelligence that focuses on building systems that can learn from and make decisions based on data. In this lesson, we’ll cover the following topics:
- Supervised learning
- Unsupervised learning
- Model evaluation
We’ll use the following packages for our examples:
install.packages("tidyverse")
install.packages("caret")
install.packages("randomForest")
install.packages("e1071")
install.packages("rpart")
library(tidyverse)
library(caret)
library(randomForest)
library(e1071)
library(rpart)
Supervised Learning
Supervised learning involves training a model on a labeled dataset, which means that each training example is paired with an output label. We’ll cover two main types of supervised learning: regression and classification.
1. Regression
Regression is used to predict a continuous value. We’ll use the mtcars dataset to predict mpg (miles per gallon) using linear regression.
# Load the mtcars dataset
data(mtcars)
# Split the data into training and testing sets
set.seed(123)
training_samples <- mtcars$mpg %>%
createDataPartition(p = 0.8, list = FALSE)
train_data <- mtcars[training_samples, ]
test_data <- mtcars[-training_samples, ]
# Train a linear regression model
model <- lm(mpg ~ ., data = train_data)
# Make predictions
predictions <- model %>% predict(test_data)
# Evaluate the model
RMSE <- sqrt(mean((predictions - test_data$mpg)^2))
RMSE
2. Classification
Classification is used to predict a categorical label. We’ll use the iris dataset to classify the species of iris flowers using a random forest.
# Load the iris dataset
data(iris)
# Split the data into training and testing sets
set.seed(123)
training_samples <- iris$Species %>%
createDataPartition(p = 0.8, list = FALSE)
train_data <- iris[training_samples, ]
test_data <- iris[-training_samples, ]
# Train a random forest model
model <- randomForest(Species ~ ., data = train_data)
# Make predictions
predictions <- model %>% predict(test_data)
# Evaluate the model
confusionMatrix(predictions, test_data$Species)
Unsupervised Learning
Unsupervised learning involves training a model on data without labeled responses. We’ll cover clustering as an example of unsupervised learning.
Clustering
Clustering is used to group data points into clusters based on similarity. We’ll use the iris dataset for k-means clustering.
# Load the iris dataset
data(iris)
# Remove the species column for clustering
iris_data <- iris[, -5]
# Perform k-means clustering
set.seed(123)
kmeans_result <- kmeans(iris_data, centers = 3)
# Add the cluster assignments to the original dataset
iris$Cluster <- as.factor(kmeans_result$cluster)
# Visualize the clusters
ggplot(iris, aes(Petal.Length, Petal.Width, color = Cluster)) +
geom_point() +
labs(title = "K-means Clustering of Iris Dataset")
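Because iris also contains the true species labels, a cross-tabulation shows how well the clusters recover them:
# Compare cluster assignments with the actual species
table(iris$Species, iris$Cluster)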
Model Evaluation
Evaluating the performance of a machine learning model is crucial to ensure its effectiveness. We’ve already seen some evaluation techniques like RMSE for regression and confusion matrix for classification. Let’s explore cross-validation.
Cross-Validation
Cross-validation is a technique for assessing how a model will generalize to an independent dataset. We’ll use 10-fold cross-validation for the random forest model on the iris dataset.
# Reload iris so the Cluster column added above is not treated as a predictor
data(iris)
# Define the control using 10-fold cross-validation
train_control <- trainControl(method = "cv", number = 10)
# Train the model
model <- train(Species ~ ., data = iris, method = "rf", trControl = train_control)
# Summarize the results
print(model)
Hands-On Exercise
- Load the Boston dataset from the MASS package and use linear regression to predict the median value of owner-occupied homes (medv).
library(MASS)
data(Boston)
# Split the data into training and testing sets
set.seed(123)
training_samples <- Boston$medv %>%
createDataPartition(p = 0.8, list = FALSE)
train_data <- Boston[training_samples, ]
test_data <- Boston[-training_samples, ]
# Train a linear regression model
model <- lm(medv ~ ., data = train_data)
# Make predictions
predictions <- model %>% predict(test_data)
# Evaluate the model
RMSE <- sqrt(mean((predictions - test_data$medv)^2))
RMSE
- Use the iris dataset and an SVM (Support Vector Machine) to classify the species of iris flowers.
# Load the iris dataset
data(iris)
# Split the data into training and testing sets
set.seed(123)
training_samples <- iris$Species %>%
createDataPartition(p = 0.8, list = FALSE)
train_data <- iris[training_samples, ]
test_data <- iris[-training_samples, ]
# Train an SVM model
model <- svm(Species ~ ., data = train_data)
# Make predictions
predictions <- model %>% predict(test_data)
# Evaluate the model
confusionMatrix(predictions, test_data$Species)
- Perform hierarchical clustering on the iris dataset and visualize the dendrogram.
# Load the iris dataset
data(iris)
# Remove the species column for clustering
iris_data <- iris[, -5]
# Scale the data
iris_data <- scale(iris_data)
# Perform hierarchical clustering
hc <- hclust(dist(iris_data), method = "complete")
# Plot the dendrogram
plot(hc, labels = iris$Species, main = "Hierarchical Clustering Dendrogram")
Machine Learning with R Part 2
Objectives
- Dive deeper into advanced machine learning techniques.
- Learn about ensemble methods, hyperparameter tuning, and advanced model evaluation.
- Explore essential advanced machine learning packages in R.
Introduction
In this lesson, we will cover advanced topics in machine learning with R, including:
- Ensemble methods (e.g., boosting and bagging)
- Hyperparameter tuning
- Advanced model evaluation techniques
- Model deployment
We will use the following packages:
install.packages("tidyverse")
install.packages("caret")
install.packages("randomForest")
install.packages("e1071")
install.packages("xgboost")
install.packages("ROCR")
install.packages("shiny")
library(tidyverse)
library(caret)
library(randomForest)
library(e1071)
library(xgboost)
library(ROCR)
library(shiny)
Ensemble Methods
Ensemble methods combine multiple models to improve performance. Two popular ensemble methods are bagging and boosting.
Bagging (Bootstrap Aggregating)
Bagging involves training multiple models on different subsets of the training data and averaging their predictions. Random forests are a popular bagging method.
# Load the iris dataset
data(iris)
# Split the data into training and testing sets
set.seed(123)
training_samples <- iris$Species %>%
createDataPartition(p = 0.8, list = FALSE)
train_data <- iris[training_samples, ]
test_data <- iris[-training_samples, ]
# Train a random forest model
model <- randomForest(Species ~ ., data = train_data, ntree = 100)
# Make predictions
predictions <- model %>% predict(test_data)
# Evaluate the model
confusionMatrix(predictions, test_data$Species)
Boosting
Boosting sequentially trains models, each trying to correct the errors of its predecessor. XGBoost is a popular boosting algorithm.
# Prepare the data for XGBoost
train_matrix <- model.matrix(Species ~ . - 1, data = train_data)
train_label <- as.numeric(train_data$Species) - 1
test_matrix <- model.matrix(Species ~ . - 1, data = test_data)
test_label <- as.numeric(test_data$Species) - 1
# Train an XGBoost model
dtrain <- xgb.DMatrix(data = train_matrix, label = train_label)
dtest <- xgb.DMatrix(data = test_matrix, label = test_label)
params <- list(objective = "multi:softprob", num_class = 3, eval_metric = "mlogloss")
model <- xgboost(params = params, data = dtrain, nrounds = 100)
# Make predictions (a vector of class probabilities, one row per observation)
predictions <- predict(model, dtest)
predicted_labels <- max.col(matrix(predictions, ncol = 3, byrow = TRUE)) - 1
# Evaluate the model
confusionMatrix(as.factor(predicted_labels), as.factor(test_label))
Hyperparameter Tuning
Hyperparameter tuning is the process of finding the optimal settings for a machine learning model to improve its performance. The caret package provides tools for hyperparameter tuning.
# Define the control using a grid search
train_control <- trainControl(method = "cv", number = 10)
# Define the parameter grid
tune_grid <- expand.grid(mtry = c(2, 3, 4), splitrule = "gini", min.node.size = 1)
# Train the model using grid search
# Note: method = "ranger" requires the ranger package (install.packages("ranger"))
model <- train(Species ~ ., data = iris, method = "ranger", trControl = train_control, tuneGrid = tune_grid)
# Summarize the results
print(model)
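The best combination found by the search is stored in the fitted train object:
# Best hyperparameter combination selected by cross-validation
model$bestTune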
Advanced Model Evaluation
Advanced model evaluation techniques provide more insight into model performance. ROC curves and AUC (Area Under the Curve) are common evaluation metrics for classification models.
# Load the iris dataset and convert to binary classification
data(iris)
iris_binary <- iris %>% filter(Species != "virginica")
iris_binary$Species <- factor(iris_binary$Species)
# Split the data
set.seed(123)
training_samples <- iris_binary$Species %>%
createDataPartition(p = 0.8, list = FALSE)
train_data <- iris_binary[training_samples, ]
test_data <- iris_binary[-training_samples, ]
# Train a logistic regression model
# Note: setosa and versicolor are linearly separable on these features, so glm()
# may warn that fitted probabilities of 0 or 1 occurred
model <- glm(Species ~ ., data = train_data, family = binomial)
# Make predictions
predictions <- predict(model, test_data, type = "response")
# Evaluate the model using ROC and AUC
pred <- prediction(predictions, test_data$Species)
perf <- performance(pred, "tpr", "fpr")
plot(perf, col = "blue", main = "ROC Curve")
abline(a = 0, b = 1, lty = 2, col = "red")
# Calculate AUC
auc <- performance(pred, "auc")
auc@y.values[[1]]
Model Deployment
Deploying a model means making it available for use in a production environment. Shiny is a package that allows you to build interactive web applications to showcase your models.
# Define the UI
ui <- fluidPage(
titlePanel("Iris Species Predictor"),
sidebarLayout(
sidebarPanel(
numericInput("sepal_length", "Sepal Length:", value = 5.0),
numericInput("sepal_width", "Sepal Width:", value = 3.5),
numericInput("petal_length", "Petal Length:", value = 1.5),
numericInput("petal_width", "Petal Width:", value = 0.3),
actionButton("predict", "Predict")
),
mainPanel(
textOutput("prediction")
)
)
)
# Define the server
server <- function(input, output) {
model <- randomForest(Species ~ ., data = iris, ntree = 100)
observeEvent(input$predict, {
new_data <- data.frame(
Sepal.Length = input$sepal_length,
Sepal.Width = input$sepal_width,
Petal.Length = input$petal_length,
Petal.Width = input$petal_width
)
prediction <- predict(model, new_data)
output$prediction <- renderText({ paste("Predicted Species:", prediction) })
})
}
# Run the application
shinyApp(ui = ui, server = server)
Hands-On Exercise
- Load the Boston dataset from the MASS package and use gradient boosting to predict the median value of owner-occupied homes (medv).
library(MASS)
data(Boston)
# Prepare the data (set a seed so the split is reproducible)
set.seed(123)
train_index <- createDataPartition(Boston$medv, p = 0.8, list = FALSE)
train_data <- Boston[train_index, ]
test_data <- Boston[-train_index, ]
# Train a gradient boosting model
train_matrix <- model.matrix(medv ~ . - 1, data = train_data)
train_label <- train_data$medv
test_matrix <- model.matrix(medv ~ . - 1, data = test_data)
test_label <- test_data$medv
dtrain <- xgb.DMatrix(data = train_matrix, label = train_label)
dtest <- xgb.DMatrix(data = test_matrix, label = test_label)
params <- list(objective = "reg:squarederror", eval_metric = "rmse")
model <- xgboost(params = params, data = dtrain, nrounds = 100)
# Make predictions
predictions <- predict(model, dtest)
rmse <- sqrt(mean((predictions - test_label)^2))
rmse
- Perform hyperparameter tuning on a random forest model using the caret package on the iris dataset.
# Define the control using a grid search
train_control <- trainControl(method = "cv", number = 10)
# Define the parameter grid
tune_grid <- expand.grid(mtry = c(2, 3, 4), splitrule = "gini", min.node.size = 1)
# Train the model using grid search
model <- train(Species ~ ., data = iris, method = "ranger", trControl = train_control, tuneGrid = tune_grid)
# Summarize the results
print(model)
- Use the e1071 package to train an SVM model on the iris dataset and evaluate its performance using ROC and AUC.
# Load the iris dataset and convert to binary classification
data(iris)
iris_binary <- iris %>% filter(Species != "virginica")
iris_binary$Species <- factor(iris_binary$Species)
# Split the data
set.seed(123)
training_samples <- iris_binary$Species %>%
createDataPartition(p = 0.8, list = FALSE)
train_data <- iris_binary[training_samples, ]
test_data <- iris_binary[-training_samples, ]
# Train an SVM model
model <- svm(Species ~ ., data = train_data, probability = TRUE)
# Make predictions with class probabilities attached
predictions <- predict(model, test_data, probability = TRUE)
# Select the probability column by class name, since column order is not guaranteed
probabilities <- attr(predictions, "probabilities")[, "versicolor"]
# Evaluate the model using ROC and AUC
pred <- prediction(probabilities, test_data$Species)
perf <- performance(pred, "tpr", "fpr")
plot(perf, col = "blue", main = "ROC Curve")
abline(a = 0, b = 1, lty = 2, col = "red")
# Calculate AUC
auc <- performance(pred, "auc")
auc@y.values[[1]]