Watts-Lab · pradnayapathak · Oct 22, 2024 · Nov 12, 2024 · Nov 21, 2024 · Dec 1, 2024
diff --git a/docs/source/features_conceptual/lsm.rst b/docs/source/features_conceptual/lsm.rst
@@ -0,0 +1,36 @@
+.. _LSM:
+
+LANGUAGE STYLE MATCHING
+=======================
+
+High-Level Intuition
+*********************
+Language Style Matching (LSM) measures the degree to which individuals in a conversation align their linguistic styles. It reflects social dynamics like rapport, group cohesion, and interpersonal understanding by analyzing the similarity in function word usage (e.g., pronouns, conjunctions) between group members.
+
+Citation
+*********
+Gonzales, A. L., Hancock, J. T., & Pennebaker, J. W. (2010). Language style matching as a predictor of social dynamics in small groups. Communication Research, 37(1), 3–19. https://doi.org/10.1177/0093650209351468
+
+Implementation Basics 
+**********************
+The code computes LSM from the function words (e.g., pronouns, conjunctions) that each speaker uses in a conversation. For each function word category, it calculates the percentage of a speaker's total words that fall in that category and compares it to the average usage of the same category by the other speakers in the same conversation. The LSM score for a category is 1 minus the absolute difference between these two values divided by their sum, so higher values indicate greater alignment.
+
+Implementation Notes/Caveats 
+*****************************
+This implementation closely follows the methodology described in the source paper, with one omission: it does not calculate Cronbach's alpha.
+
+The handling of some edge cases may also differ from the paper, in particular (1) conversations with only one speaker and (2) function word categories whose counts are zero, where the LSM denominator is zero and a safeguard is needed to avoid division by zero.
+
+Interpreting the Feature 
+*************************
+The answers below describe how to read the LSM columns produced by this feature.
+
+1. **Bounds:** LSM scores range from 0 to 1. A score of 1 indicates perfect linguistic alignment (the speaker uses a function word category at the same rate as the other group members), and 0 indicates no alignment.
+2. **Concrete Example:** A high LSM score (e.g., 0.85) for pronouns suggests that a speaker's pronoun usage closely matches the average usage by other group members.
+   A low LSM score (e.g., 0.2) suggests less alignment in pronoun usage.
+3. **Limitations:** LSM does not capture the content or context of conversations. It focuses purely on function word alignment and may not reflect deeper social or relational dynamics.
+4. **Edge Cases:** If a conversation has only one speaker, there are no other speakers to average over, so no LSM score is calculated and the LSM values are NaN.
+
+Related Features 
+*****************
+LSM may resemble the mimicry score, but it differs in that it homes in on the repetition of function words rather than overall mimicry.
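The comparison described under Implementation Basics can be written compactly. This is a sketch in assumed notation: none of the symbols below appear in the documentation. Reading "average usage" as the mean of the other speakers' percentages is also an assumption. Here $w_{c,i}$ is the number of words from function word category $c$ used by speaker $i$, $W_i$ is that speaker's total word count, and $n$ is the number of speakers in the conversation.

```latex
p_{c,i} = 100 \cdot \frac{w_{c,i}}{W_i}, \qquad
\bar{p}_{c,-i} = \frac{1}{n-1} \sum_{j \neq i} p_{c,j}, \qquad
\mathrm{LSM}_{c,i} = 1 - \frac{\lvert p_{c,i} - \bar{p}_{c,-i} \rvert}{p_{c,i} + \bar{p}_{c,-i}}
```

For instance, a speaker with $p_{c,i} = 12$ in a group whose other members average $\bar{p}_{c,-i} = 8$ would score $1 - 4/20 = 0.8$. The expression is undefined when $n = 1$ or when both percentages are zero, which are the two edge cases listed above.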
diff --git a/src/team_comm_tools/feature_dict.py b/src/team_comm_tools/feature_dict.py
@@ -605,6 +605,30 @@
     "preprocess": [],
     "vect_data": False,
     "bert_sentiment_data": False
+  },
+  "Language Style Matching (LSM)": {
+    "columns": [
+        "total_pronouns_lsm",
+        "conjunction_lexical_wordcount_lsm",
+        "adverbs_lexical_wordcount_lsm",
+        "article_lexical_wordcount_lsm",
+        "quantifier_lexical_wordcount_lsm",
+        "negation_lexical_wordcount_lsm",
+        "preposition_lexical_wordcount_lsm",
+        "indefinite_pronoun_lexical_wordcount_lsm",
+        "auxiliary_verbs_lexical_wordcount_lsm"
+    ],
+    "file": "./features/lsm_features.py",
+    "level": "Conversation",
+    "semantic_grouping": "Language Style",
+    "description": "Measures the alignment of linguistic styles among group members by calculating LSM scores for various function word categories.",
+    "references": "(Gonzales, Hancock, and Pennebaker, 2010)",
+    "wiki_link": "https://github.com/Watts-Lab/team-process-map/wiki/Language-Style-Matching-(LSM)",
+    "function": "calculate_lsm",
+    "dependencies": [],
+    "preprocess": [],
+    "vect_data": False,
+    "bert_sentiment_data": False
   }
 }
 
@@ -619,4 +643,4 @@ def generate_filtered_dict():
 
 if __name__ == "__main__":
   if len(sys.argv) > 1 and sys.argv[1] == 'run':
-      generate_filtered_dict()
+      generate_filtered_dict()
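The `columns` entry registered above appears to follow a naming convention: each name is one of the function word columns used in `calculate_lsm` with an `_lsm` suffix. A minimal sketch of a consistency check is given below. Both lists are copied by hand from this diff, and the variable names (`registered_columns`, `expected_columns`, `missing`, `extra`) are illustrative rather than part of the toolkit.

```python
# Function word columns as listed in calculate_lsm (copied from the diff).
function_word_columns = [
    "total_pronouns",
    "conjunction_lexical_wordcount",
    "adverbs_lexical_wordcount",
    "article_lexical_wordcount",
    "quantifier_lexical_wordcount",
    "negation_lexical_wordcount",
    "preposition_lexical_wordcount",
    "indefinite_pronoun_lexical_wordcount",
    "auxiliary_verbs_lexical_wordcount",
]

# "columns" entry for the LSM feature in feature_dict.py (copied from the diff).
registered_columns = [
    "total_pronouns_lsm",
    "conjunction_lexical_wordcount_lsm",
    "adverbs_lexical_wordcount_lsm",
    "article_lexical_wordcount_lsm",
    "quantifier_lexical_wordcount_lsm",
    "negation_lexical_wordcount_lsm",
    "preposition_lexical_wordcount_lsm",
    "indefinite_pronoun_lexical_wordcount_lsm",
    "auxiliary_verbs_lexical_wordcount_lsm",
]

# Each output column is expected to be an input column plus the "_lsm" suffix.
expected_columns = [f"{column}_lsm" for column in function_word_columns]

# Names the function would produce but the dictionary does not register, and vice versa.
missing = [name for name in expected_columns if name not in registered_columns]
extra = [name for name in registered_columns if name not in expected_columns]

print("missing:", missing)
print("extra:", extra)
```

A check along these lines could catch a category that is added to one file but not the other.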
diff --git a/src/team_comm_tools/features/lsm_features.py b/src/team_comm_tools/features/lsm_features.py
@@ -0,0 +1,72 @@
+import pandas as pd
+
+def calculate_lsm(chat_df):
+
+    """ 
+    Calculates Language Style Matching (LSM) scores for the Team Communication Toolkit.
+
+    Source: Language Style Matching as a Predictor of Social Dynamics in Small Groups by Amy L. Gonzales, Jeffrey T. Hancock, and James W. Pennebaker.
+
+    Args:
+        chat_df (pd.DataFrame): A pandas DataFrame with columns for conversation_id, speaker_id,
+            and various word-level counts (e.g., num_words, conjunction_lexical_wordcount, etc.).
+
+    Returns:
+        pd.DataFrame: One row per (conversation_id, speaker_id) pair, with summed counts plus _percent, _group_avg, and _lsm columns for each function word category.
+    """
+
+    # Create a new column with the sum of all pronouns (first person singular, first person plural, second person, third person)
+    chat_df['total_pronouns'] = (
+        chat_df['first_person_singular_lexical_wordcount'] +
+        chat_df['first_person_plural_lexical_wordcount'] +
+        chat_df['second_person_lexical_wordcount'] +
+        chat_df['third_person_lexical_wordcount']
+    )
+
+    # Group by conversation_id and speaker_id to prepare for LSM calculations
+    grouped_df = chat_df.groupby(['conversation_id', 'speaker_id']).agg({
+        'num_words': 'sum',
+        'conjunction_lexical_wordcount': 'sum',
+        'total_pronouns': 'sum',
+        'adverbs_lexical_wordcount': 'sum',
+        'article_lexical_wordcount': 'sum',
+        'quantifier_lexical_wordcount': 'sum',
+        'negation_lexical_wordcount': 'sum',
+        'preposition_lexical_wordcount': 'sum',
+        'indefinite_pronoun_lexical_wordcount': 'sum',
+        'auxiliary_verbs_lexical_wordcount': 'sum',
+    }).reset_index()
+    # Resets the index so speaker is treated as a normal column
+
+    # Now start calculating LSM score (for each function word category for each person)
+
+    # List of function word columns to calculate percentages for
+    function_word_columns = [
+        'total_pronouns',
+        'conjunction_lexical_wordcount',
+        'adverbs_lexical_wordcount',
+        'article_lexical_wordcount',
+        'quantifier_lexical_wordcount',
+        'negation_lexical_wordcount',
+        'preposition_lexical_wordcount',
+        'indefinite_pronoun_lexical_wordcount',
+        'auxiliary_verbs_lexical_wordcount'
+    ]
+
+    # Loop through each function word column and express it as a percentage of the speaker's total words (e.g., x percent of total words were pronouns)
+    for column in function_word_columns:
+        grouped_df[f'{column}_percent'] = (grouped_df[column] / grouped_df['num_words']) * 100 
+
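+    # Compute group-level sums and counts of the percentage columns (not the raw counts), so the group average is on the same scale as each speaker's percentage
+    percent_columns = [f'{column}_percent' for column in function_word_columns]
+    group_sums = grouped_df.groupby('conversation_id')[percent_columns].transform('sum')
+    group_counts = grouped_df.groupby('conversation_id')[percent_columns].transform('count')
+    # Calculate group averages excluding the current speaker (0/0, i.e. NaN, when the conversation has only one speaker)
+    for column in function_word_columns:
+        grouped_df[f'{column}_group_avg'] = (group_sums[f'{column}_percent'] - grouped_df[f'{column}_percent']) / (group_counts[f'{column}_percent'] - 1)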
+
+    # Calculate LSM score: 1 - |own percent - group average| / (own percent + group average); NaN when both terms are zero
+    for column in function_word_columns:
+        grouped_df[f'{column}_lsm'] = 1 - (abs(grouped_df[f'{column}_percent'] - grouped_df[f'{column}_group_avg']) / (grouped_df[f'{column}_percent'] + grouped_df[f'{column}_group_avg']))
+
+    return grouped_df
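The leave-one-out comparison that the documentation describes can be illustrated on a toy table. The sketch below covers a single category and assumes that "average usage by other speakers" means the mean of the other speakers' percentages. The data, the `speakers` DataFrame, and the short column names (`pct`, `others_avg`, `lsm`) are made up for illustration and are not part of the toolkit.

```python
import pandas as pd

# Toy per-speaker totals; all values are illustrative, not real data.
speakers = pd.DataFrame({
    "conversation_id": ["A", "A", "A", "B"],
    "speaker_id": ["s1", "s2", "s3", "s4"],
    "num_words": [100, 200, 50, 80],
    "total_pronouns": [10, 16, 4, 8],
})

# Percentage of each speaker's words that are pronouns.
speakers["pct"] = speakers["total_pronouns"] / speakers["num_words"] * 100

# Leave-one-out mean: (conversation sum - own value) / (number of speakers - 1).
by_conversation = speakers.groupby("conversation_id")["pct"]
speakers["others_avg"] = (
    by_conversation.transform("sum") - speakers["pct"]
) / (by_conversation.transform("count") - 1)

# LSM for this category. A 0/0 division (single-speaker conversation, or a
# category nobody used) evaluates to NaN in pandas instead of raising an error.
speakers["lsm"] = 1 - (speakers["pct"] - speakers["others_avg"]).abs() / (
    speakers["pct"] + speakers["others_avg"]
)

print(speakers[["conversation_id", "speaker_id", "pct", "others_avg", "lsm"]])
```

In conversation A, speaker s1 uses pronouns at 10 percent against an average of 8 percent for the other two speakers, giving 1 - 2/18, or roughly 0.89. Conversation B has a single speaker, so its leave-one-out average and LSM score are NaN.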