Converting Semi-Structured Data into a Tidy Format using R

Introduction to Semi-Structured Data in R

As data becomes increasingly important for businesses and organizations, the need to work with diverse types of data sources grows. In this article, we will explore how to convert semi-structured data into a tidy format using R. We will focus on the read_xlsx function from the xlsx package and the mutate and add_column functions from the tidyr package.

Understanding Semi-Structured Data

Semi-structured data is data that has some level of organization, but not as rigidly structured as tabular or relational data. Examples include CSV files with inconsistent column names, Excel worksheets with varying data formats, and XML files with complex structures.

Using `read_xlsx` to Load Semi-Structured Data

The read_xlsx function is used to load semi-structured data from Excel workbooks. It allows us to specify the range of cells to read in and the column names to use for the resulting dataframe. In this example, we will use read_xlsx to load a single worksheet from an Excel workbook.

# Load necessary libraries
library(xlsx)
library(tidyverse)

# Set the path to the Excel file
file <- "C:/Path/To/Book1.xlsx"

# Read in the first sheet of the workbook using read_xlsx
dframe <- read_xlsx(path = file, range = "'Original Data'!A1:G50", col_names = FALSE)

Preprocessing Semi-Structured Data

After loading the data, we need to preprocess it to prepare it for analysis. In this example, we will move the location name to its own column and delete the first row.

# Move the location name to its own column
dframe <- dframe %>% 
  mutate(location = dframe[[1,1]])

# Delete the first row
dframe <- dframe[-c(1),]

Adding Additional Columns

We will also add two additional columns to the dataframe: budget_type and category. The budget_type column will be initialized with a value of “empty”, while the category column will also be initialized.

# Add/define the budget_type and category columns
dframe <- dframe %>% 
  add_column(budget_type = "empty", category = "empty")

Looping Through Data

To move down the data set, we will use a loop to iterate through each row. For each row, we will check if the value in the CLASS_CODE column is either “Baseline” or “Scope Changes”. If it is, we will assign the corresponding value to the budget_type column.

# Loop 1: Move down the data set and label every line with 'CLASS CODE' as being 
# either "Baseline" or "Scope Changes"
for (row in 1:nrow(dframe)){
  
  if (dframe[[row,1]] %in% c("Baseline","Scope Changes")){
    budget_type <- dframe[[row,1]]
  }
  
  if (!is.na(dframe[[row,1]])){
    if (str_detect(dframe[[row,1]], "[0-9]{4}") == TRUE){
      dframe[[row, "budget_type"]] <- budget_type
    }
  }
}

Looping Through Data Again

To move up the data set, we will use another loop to iterate through each row. For each row, we will check if the value in the CLASS_CODE column is not “empty” and the corresponding category has been assigned.

# Loop 2: Move up the data set and assign the budget_type value to the 
# category column.
for (row in nrow(dframe):1){
  
  if ( dframe[[row,2]] == "Total" || 
       is.na(dframe[[row,2]]) || 
       dframe[[row, 2]] == "Classification" ) {
    # delete rows where the 2nd column is &lt;blank&gt;, 'Classification', or 'Total'. 
    dframe <- dframe[-row,]
  } else {    
      if ( !is.na(dframe[[row,2]]) &amp;&amp; is.na(dframe[[row,1]]) ){
        # if row no 'CLASS_CODE' but has value in 2nd column, assign value to
        # category then delete the row entirely.
        category <- dframe[[row,2]]
        dframe <- dframe[-row,]
        
      } else if ( str_detect(dframe[[row,1]], "[:digit:]{4}") ){
        # if row has 'CLASS CODE', then label the category column with the
        # stored value.
        dframe[[row, "category"]] <- category
      }
    }
}

Final Steps

After looping through the data, we need to assign the names from the character vector set at the beginning.

# Assign the names from the character vector set at the beginning
names(dframe) <- names

Finally, we can print out the resulting dataframe.

# Print out the resulting dataframe
dframe

Conclusion

In this article, we explored how to convert semi-structured data into a tidy format using R. We used the read_xlsx function from the xlsx package and the mutate and add_column functions from the tidyr package. By looping through the data and assigning values based on certain conditions, we were able to transform the semi-structured data into a tidy dataframe that can be easily analyzed.

Note: The code used in this article is just an example and may need to be modified to suit your specific needs. Additionally, the dput function used at the beginning of the script is used to create a minimal reproducible example, but it may not work for all datasets.

Last modified on 2024-04-30