Using contains Function with Two Arguments in R
Introduction
The dplyr package in R provides a convenient way to perform data manipulation tasks. Its select function chooses columns from a dataset, and selection helpers such as contains pick columns by the text in their names. In this article, we look at what happens when contains is given two strings, and at how to select only the columns whose names contain both of them.
Dataset and Problem Statement
We start by examining a sample dataset, dat1, which contains several variables:
| Trust_01_T1 | Trust_02_T1 | Trust_03_T1 | Trust_01_T2 | Trust_02_T2 | Trust_03_T2 | Cont_01_T1 | Cont_01_T2 |
|---|---|---|---|---|---|---|---|
| 5 | 1 | 2 | 1 | 5 | 3 | 1 | 1 |
| 3 | 1 | 3 | 3 | 4 | 2 | 1 | 2 |
| 2 | 1 | 3 | 1 | 3 | 1 | 2 | 2 |
| 4 | 2 | 5 | 5 | 3 | 2 | 3 | 3 |
| 5 | 1 | 4 | 1 | 2 | 2 | 4 | 5 |
The goal is to select the columns whose names contain both Trust and T1. Passing both strings to contains, as in contains(c("Trust", "T1")), does not achieve this: the helper takes the union of the matches, so it returns every column whose name contains either Trust or T1.
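The union behaviour can be checked on a small example. The sketch below assumes dplyr is installed and uses a hypothetical reduced data frame, dat_small, built from the first two rows of four of the columns above; only the column names matter here.

```r
library(dplyr)

# Hypothetical reduced stand-in for dat1 (first two rows, four columns)
dat_small <- data.frame(
  Trust_01_T1 = c(5, 3),
  Trust_01_T2 = c(1, 3),
  Cont_01_T1  = c(1, 1),
  Cont_01_T2  = c(1, 2)
)

# contains() with two strings keeps names that match EITHER string
either <- dat_small %>% select(contains(c("Trust", "T1")))
names(either)
```

Only Cont_01_T2 is dropped, because it is the only name that contains neither string.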
Solution
To solve this problem, we can use a regular expression (regex) with the matches selection helper, which is available through the dplyr package. The pattern ^Trust_.*T1$ requires a column name to start with Trust_ and end with T1, with any characters in between.
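To see what the pattern accepts before using it in a pipeline, it can be tried against the column names with base R's grepl. This is only an illustrative check; the vector cols is typed in by hand from the table above. Note that grepl is case-sensitive by default, whereas matches ignores case by default.

```r
# Column names of the sample dataset, typed in by hand
cols <- c("Trust_01_T1", "Trust_02_T1", "Trust_03_T1",
          "Trust_01_T2", "Trust_02_T2", "Trust_03_T2",
          "Cont_01_T1", "Cont_01_T2")

# ^Trust_ anchors the start, .* allows anything in between, T1$ anchors the end
hits <- grepl("^Trust_.*T1$", cols)
cols[hits]
```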
Step 1: Load Required Libraries
We first need to load the required libraries.
library(dplyr)
Step 2: Define the Dataset
Next, we define our dataset, dat1, which is a table with various variables.
dat1 <- read.table(header = TRUE, text="
Trust_01_T1 Trust_02_T1 Trust_03_T1 Trust_01_T2 Trust_02_T2 Trust_03_T2 Cont_01_T1 Cont_01_T2
5 1 2 1 5 3 1 1
3 1 3 3 4 2 1 2
2 1 3 1 3 1 2 2
4 2 5 5 3 2 3 3
5 1 4 1 2 2 4 5
")
Step 3: Use the matches Function
We can now use the matches function to select columns that match our regex pattern. This will return a data frame with only the desired columns.
dat1 %>%
  select(matches("^Trust_.*T1$"))
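This code will return:
| Trust_01_T1 | Trust_02_T1 | Trust_03_T1 |
|---|---|---|
| 5 | 1 | 2 |
| 3 | 1 | 3 |
| 2 | 1 | 3 |
| 4 | 2 | 5 |
| 5 | 1 | 4 |
All three columns, Trust_01_T1, Trust_02_T1, and Trust_03_T1, are returned because each name starts with Trust_ and ends with T1. The T2 columns and the Cont columns are excluded.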
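As an alternative to a regular expression, selection helpers can be combined with the & operator to take the intersection of two conditions. The sketch below assumes a reasonably recent dplyr and tidyselect (boolean operators in selections arrived with tidyselect 1.0.0) and uses only the first two rows of the data to stay short.

```r
library(dplyr)

# First two rows of the sample data, enough to show the selection
dat1 <- read.table(header = TRUE, text = "
Trust_01_T1 Trust_02_T1 Trust_03_T1 Trust_01_T2 Trust_02_T2 Trust_03_T2 Cont_01_T1 Cont_01_T2
5 1 2 1 5 3 1 1
3 1 3 3 4 2 1 2
")

# Names that contain BOTH strings
both <- dat1 %>% select(contains("Trust") & contains("T1"))

# Stricter variant: anchored at the start and the end of the name
anchored <- dat1 %>% select(starts_with("Trust") & ends_with("T1"))

names(both)
```

For these column names both variants select the same columns as the regex. The anchored variant is closer in meaning to ^Trust_.*T1$, since contains would also accept a name with T1 in the middle.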
Additional Operations with Selected Columns
If we want to perform additional operations on the selected columns, such as calculating mean values or performing aggregations, we can use the across function from the dplyr package.
For example, let’s calculate the mean value of Trust_01_T1 and Trust_03_T1.
dat1 %>%
  select(matches("^Trust_.*T1$")) %>%
  # apply mean to each of the two named columns
  summarise(across(all_of(c("Trust_01_T1", "Trust_03_T1")), .fns = mean))
This will return:
| Trust_01_T1 | Trust_03_T1 |
|---|---|
| 3.8 | 3.4 |
Note that all_of selects the two columns by name, and the .fns argument of across supplies the function to apply to each of them, here mean.
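The selection and the summary can also be combined in one step, since across accepts selection helpers directly. The following is a sketch assuming dplyr 1.0.0 or later; trust_t1_means is a name introduced here for illustration. It computes the mean of every column that matches the pattern rather than only the two named above.

```r
library(dplyr)

dat1 <- read.table(header = TRUE, text = "
Trust_01_T1 Trust_02_T1 Trust_03_T1 Trust_01_T2 Trust_02_T2 Trust_03_T2 Cont_01_T1 Cont_01_T2
5 1 2 1 5 3 1 1
3 1 3 3 4 2 1 2
2 1 3 1 3 1 2 2
4 2 5 5 3 2 3 3
5 1 4 1 2 2 4 5
")

# One-row data frame holding the mean of each column matching the pattern
trust_t1_means <- dat1 %>%
  summarise(across(matches("^Trust_.*T1$"), mean))

trust_t1_means
```

Skipping the separate select call keeps the other columns available in case later steps of the pipeline need them.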
Conclusion
In this article, we saw that passing two strings to contains selects the columns that match either string, and that the matches helper with a regular expression selects only the columns whose names contain both. We demonstrated the selection on a sample dataset and showed how to summarise the selected columns with across.
Last modified on 2025-01-07