Topic Modelling by Group using LDA in R
=====================================

Topic modelling is a technique for discovering hidden topics or themes in unstructured text. This article shows how to fit a separate topic model for each group in a dataset using Latent Dirichlet Allocation (LDA) in R.

Introduction to LDA

LDA is a popular unsupervised machine learning algorithm for topic modelling. It is a generative probabilistic model: each document is treated as a mixture of topics, and each topic is a probability distribution over words.
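
To make the "mixture of topics" idea concrete, here is a minimal base-R sketch of the generative story LDA assumes. The vocabulary, the sizes, and the helper `rdirichlet1` are all made up for illustration; fitting a model runs this story in reverse, inferring the topics from observed words.

```r
# Minimal base-R sketch of LDA's generative story (toy sizes, hypothetical vocabulary)
set.seed(42)

vocab    <- c("price", "cost", "staff", "service", "delivery", "delay")
n_topics <- 2
doc_len  <- 20

# Hypothetical helper: one draw from a Dirichlet distribution via normalised gamma draws
rdirichlet1 <- function(alpha) {
  g <- rgamma(length(alpha), shape = alpha, rate = 1)
  g / sum(g)
}

# Each topic is a probability distribution over the vocabulary (rows sum to 1)
beta <- t(sapply(seq_len(n_topics), function(k) rdirichlet1(rep(0.5, length(vocab)))))
colnames(beta) <- vocab

# Each document has its own mixture of topics
theta <- rdirichlet1(rep(0.5, n_topics))

# Generate one document: pick a topic for every word slot, then a word from that topic
z     <- sample(seq_len(n_topics), size = doc_len, replace = TRUE, prob = theta)
words <- vapply(z, function(k) sample(vocab, size = 1, prob = beta[k, ]), character(1))

print(round(theta, 3))
print(table(words))
```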

In this article, we will use the topicmodels package to fit the LDA models and the udpipe package to annotate the text and build the document-term matrices.

Preparing the Dataset

Before modelling, we need to prepare the dataset. It consists of comments from different groups: the text is stored in the Feedback.Comments column and the group in the division_name column. We first annotate the comments and then convert them into a document-term matrix, the input format LDA expects.
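
As an illustration of the input shape this walkthrough assumes, the sketch below builds a tiny stand-in data frame. The column names `Feedback.Comments` and `division_name` are the ones used in the article's code; the rows themselves are invented.

```r
# Hypothetical input: one row per comment, with the group it belongs to
x <- data.frame(
  division_name     = c("Sales", "Sales", "Support", "Support"),
  Feedback.Comments = c("The pricing page is confusing.",
                        "Great discounts on the new product.",
                        "Support staff were helpful and quick.",
                        "Long waiting time on the phone."),
  stringsAsFactors  = FALSE
)

# Number of comments available per group
print(table(x$division_name))
```

With real data, each group needs far more text than this for LDA to produce meaningful topics.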

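# Install the packages once, then load them
# install.packages(c("udpipe", "topicmodels"))
library(udpipe)
library(topicmodels)

# Load the dataset
x <- read.csv("your_file.csv", stringsAsFactors = FALSE)

# Annotate the comments to obtain lemmas and part-of-speech tags.
# Assumption: the text is English; the model is downloaded on first use.
comments <- udpipe(data.frame(doc_id = as.character(seq_len(nrow(x))),
                              text = x$Feedback.Comments,
                              stringsAsFactors = FALSE),
                   object = "english")

# Carry the group over to every token and give each sentence a unique id
comments$division_name  <- x$division_name[as.integer(comments$doc_id)]
comments$topic_level_id <- unique_identifier(comments, fields = c("doc_id", "paragraph_id", "sentence_id"))

# Keep only nouns and adjectives
dtf <- subset(comments, upos %in% c("NOUN", "ADJ"))

# Count lemma frequencies per sentence, then cast them to a document-term matrix (DTM)
dtf <- document_term_frequencies(dtf, document = "topic_level_id", term = "lemma")
dtm <- document_term_matrix(dtf)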

Topic Modelling for Each Group

Now that we have our dataset prepared, let’s perform topic modelling for each group. We will use the LDA function from the topicmodels package to create a model for each group.
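
The split-then-apply pattern behind this step can be sketched in base R on its own. The token table below is invented; its column names `topic_level_id` and `lemma` mirror the ones used in the article's code, and `table()` stands in for the package functions that build the real document-term matrix.

```r
# Toy token table: one row per (document, lemma) occurrence; all values are hypothetical
tokens <- data.frame(
  division_name    = c("Sales", "Sales", "Sales", "Support", "Support", "Support"),
  topic_level_id   = c(1, 1, 2, 3, 4, 4),
  lemma            = c("price", "discount", "price", "staff", "phone", "staff"),
  stringsAsFactors = FALSE
)

# Split the rows by group, then build one document-term count matrix per group
per_group <- split(tokens, tokens$division_name)
dtms <- lapply(per_group, function(g) {
  unclass(table(document = g$topic_level_id, term = g$lemma))
})

print(names(dtms))
print(dtms$Sales)
```

Each group ends up with its own vocabulary and its own set of documents, which is why the models fitted per group are independent of one another.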

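# Create an LDA model for each group.
# `comments` must be the annotated data frame (columns upos, lemma,
# topic_level_id and division_name); split() yields one data frame per division.
set.seed(1)
m <- lapply(split(comments, comments$division_name),
            function(group) {
              dtf <- subset(group, upos %in% c("NOUN", "ADJ"))
              dtf <- document_term_frequencies(dtf, document = "topic_level_id", term = "lemma")
              dtm <- document_term_matrix(dtf)
              # seed takes one value per random start (nstart = 5)
              LDA(dtm, k = 4, method = "Gibbs",
                  control = list(nstart = 5, burnin = 2000, best = TRUE, seed = 1:5))
            })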

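# Get the topic-term probabilities for each model.
# LDA models are S4 objects, so use posterior() rather than `$`;
# the result is a topics x terms matrix.
topicterminology <- lapply(m, function(model) {
  posterior(model)$terms
})

# Extract the words with their corresponding probability for each topic
scores <- lapply(topicterminology, function(beta) {
  data.frame(topic = rep(seq_len(nrow(beta)), times = ncol(beta)),
             term  = rep(colnames(beta), each = nrow(beta)),
             prob  = as.vector(beta),
             stringsAsFactors = FALSE)
})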
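
A full table of term probabilities is long, so in practice one usually keeps only the most probable terms per topic. The following base-R sketch shows one way to do that on a small, invented topics x terms matrix; `top_n_terms` is a hypothetical helper, not a function from either package.

```r
# Hypothetical topic-term matrix: 2 topics (rows) x 4 terms (columns), rows sum to 1
beta <- matrix(c(0.50, 0.30, 0.15, 0.05,
                 0.10, 0.10, 0.30, 0.50),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("1", "2"), c("price", "discount", "staff", "phone")))

# Long format: one row per (topic, term) pair
long <- data.frame(topic = rep(rownames(beta), times = ncol(beta)),
                   term  = rep(colnames(beta), each = nrow(beta)),
                   prob  = as.vector(beta),
                   stringsAsFactors = FALSE)

# Keep the n most probable terms within each topic
top_n_terms <- function(df, n = 2) {
  df <- df[order(df$topic, -df$prob), ]
  df <- df[ave(df$prob, df$topic, FUN = seq_along) <= n, ]
  rownames(df) <- NULL
  df
}

print(top_n_terms(long, n = 2))
```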

Grouping by Division Name

Now that we have our model for each group, let’s extract the words with their corresponding probability for each topic and group them by division name.

# Create a list to collect the results, one element per division
result <- list()

# Extract the division names from the dataset
divisions <- unique(as.character(x$division_name))

# Loop through each division
for (i in divisions) {
  # Get the scores for this division; `scores` is a list named by division
  temp_scores <- scores[[i]]
  if (is.null(temp_scores)) next

  # Record which division the rows belong to
  temp_scores$division_name <- i
  result[[i]] <- temp_scores
}

# Combine everything into one data frame and print it
result <- do.call(rbind, result)
rownames(result) <- NULL
print(result)
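
For reporting, it can help to collapse the terms of each topic into a single label per division. The sketch below assumes a combined table with columns `division_name`, `topic`, `term` and `prob` (the `division_name` and `topic` columns are assumptions about how the results were assembled) and uses invented values.

```r
# Hypothetical combined results: two divisions, two topics each, two terms per topic
result <- data.frame(
  division_name    = rep(c("Sales", "Support"), each = 4),
  topic            = rep(c(1, 1, 2, 2), times = 2),
  term             = c("discount", "price", "order", "invoice",
                       "staff", "phone", "delay", "ticket"),
  prob             = c(0.20, 0.40, 0.35, 0.25, 0.50, 0.10, 0.20, 0.30),
  stringsAsFactors = FALSE
)

# Put the most probable term first within each division and topic
result <- result[order(result$division_name, result$topic, -result$prob), ]

# Collapse the terms into one label per (division, topic)
topic_labels <- aggregate(term ~ division_name + topic, data = result,
                          FUN = paste, collapse = ", ")
print(topic_labels)
```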

Conclusion

In this article, we have explored how to perform topic modelling for each group in your dataset using Latent Dirichlet Allocation (LDA) in R. We used the R package called topicmodels to create a model for each group and extracted the words with their corresponding probability for each topic.

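Grouping the results by division name shows which terms characterise the topics of each group, which helps us understand the themes within the dataset. Because a separate model is fitted for each division, topic numbers are not comparable across divisions; compare the top terms of the topics instead.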

Last modified on 2024-07-16