{"id":3654,"date":"2024-01-20T05:58:35","date_gmt":"2024-01-20T05:58:35","guid":{"rendered":"https:\/\/towardsdatascience.com\/evaluating-cinematic-dialogue-which-syntactic-and-semantic-features-are-predictive-of-genre-2c69a71af6e2\/"},"modified":"2025-01-08T15:47:12","modified_gmt":"2025-01-08T15:47:12","slug":"evaluating-cinematic-dialogue-which-syntactic-and-semantic-features-are-predictive-of-genre-2c69a71af6e2","status":"publish","type":"post","link":"https:\/\/towardsdatascience.com\/evaluating-cinematic-dialogue-which-syntactic-and-semantic-features-are-predictive-of-genre-2c69a71af6e2\/","title":{"rendered":"Evaluating Cinematic Dialogue\u200a-\u200aWhich syntactic and semantic features are predictive of genre?"},"content":{"rendered":"
From fragmented speech in thrillers to expletive-laden exchanges in action movies, can we guess a movie's genre simply from the semantic and syntactic characteristics of its dialogue? If so, which ones?
We will investigate whether the nuanced dialogue patterns within a screenplay (its lexicon, structure, and pacing) can be powerful predictors of genre. The focus here is twofold: to leverage syntactic and semantic script characteristics as predictive features, and to underscore the significance of informed feature engineering.
One of the primary gaps in many data science courses is the lack of emphasis on domain expertise and on feature generation, engineering, and selection. Many courses also provide students with pre-existing datasets that are sometimes already cleaned. Moreover, in the workplace, the rush to produce results often overshadows the process of hypothesizing and validating predictive features, leaving little room for domain-specific exploration and understanding.
In my own experience outlined in "Using Multi-Task and Ensemble Learning to Predict Alzheimer’s Cognitive Functioning<\/a>," I witnessed the positive impact of informed feature engineering. Researching known predictors of Alzheimer’s allowed me to question the initial task and data, ultimately leading to the inclusion of key features during modeling.<\/p>\n In this article, I delve into a project that examines movie dialogue to illustrate my approach to research and feature extraction. The focus will be on identifying and analyzing textual, semantic, and syntactic elements within film dialogue, investigating how they interrelate, and evaluating their capacity to accurately predict a movie’s genre.<\/p>\n I like to start every project by conducting a literature review. I begin by jotting down relevant concepts and questions to guide my review. This initial phase is crucial and, depending on the time I have, I intentionally steer clear of research directly related to the modeling problem at hand. The goal is to understand the broader context and seek out supplemental information first. This strategy helps in cultivating an unbiased understanding of the subject matter, ensuring that my approach to the problem is informed, yet not prematurely narrowed by the solutions and methodologies already explored by others.<\/p>\n There is a body of literature that explores the interplay between natural dialogue and our emotions. Screenwriters capture an emotion or mood by capitalizing on textual and syntactical relationships. These vary across genres since different moods are associated with different genres.<\/p>\n We will extract and evaluate the 4 characteristics listed below. In each section, I’ll explain the rationale:<\/p>\n The dataset used here is the Cornell Movie-Dialogs Corpus (MIT License) from Kaggle<\/a>, which was originally retrieved from the ConvoKit toolkit<\/a> (Chang et al., 2020). This is comprised of over 300k<\/strong> spoken lines ** across <\/strong>~220k conversational exchanges derived from <\/strong>61**7 different movies.<\/p>\n We’ll begin by loading data using the The columns are split by I used spaCy – an open-source natural language processing library written in Python and Cython – to process the text. This included cleaning contractions, removing punctuation, and lemmatizing words.<\/p>\n In suspense movies, dialogue is often sparse, showcasing the link between syntax and emotions. When characters are in states of terror, their speech tends to be concise, while nervousness often leads to longer utterances (i.e. rambling), a trait more commonly seen in comedies. Therefore, we will examine the length attributes of each line in the corpus.<\/p>\n In this section, we’ll take a look at:<\/p>\n In the boxplot and the statistics data frame above, we see that:<\/p>\n Less than half of the script lines maintain more than 1 sentence. This informs us that each script line is short<\/strong>, and should be framed accordingly.<\/p>\n The metrics mentioned above were calculated on a ‘per line’ basis within the movie script data. In the next section, we shift our focus to explore the average length of lines per movie, allowing us to examine variations in word length at the movie level.<\/p>\n The "Length Features Statistics DataFrame" figure shows that individual lines in scripts range from 0 to 582 words, with a median of 7 words, which suggests a high degree of variability in dialogue density on a line-by-line basis<\/strong>. 
In contrast, the aggregated movie data shows **a much narrower range**, with a maximum average of 38.69 words per line, indicating that while individual lines can be extremely verbose or concise, movies tend to balance out to a moderate density of words.

With over 39% of script lines containing more than one sentence, the per-line analysis indicates a tendency toward compound or complex sentences. However, the tighter standard deviation in the movie averages (0.29 for sentences) suggests **a consistency in narrative rhythm across different films**, which aim for a steady pace in dialogue delivery.

The contrast between the median length of individual lines (7 words) and the average across movies (11.36 words) implies that **screenwriters often intersperse shorter lines of dialogue with longer monologues or exchanges**. This technique could be a deliberate choice to create dynamic interactions between characters, keep the audience engaged, and give each movie its own tempo and style.

The histograms show a right-skewed distribution, with a central tendency for movies to feature lines averaging 7–13 words. This skewness indicates a minority of films with unusually long lines, which heavily influence the overall average.

After outliers are excluded, the bimodal distribution for words per line becomes more evident, suggesting that there are two common line lengths in scripts. This observation is interesting, as it could reflect different styles or genres within the corpus. The distribution of sentences per line appears approximately normal, with a negligible right skew, indicating a consistent sentence structure across screenplays.

There are various ways to represent a heightened emotional state in a script. One is to use an exclamation point (!) for emphasis; another is to use CAPITALIZATION FOR EMPHASIS. We'll look at the presence of both and see if there's a correlation with the overarching sentiment.

A hyphen placed at the end of a character's dialogue (-) may signify an interruption in their speech or an abrupt pause in the character's thinking (e.g., the character has an epiphany). It can also convey fragmented speech.

I had no prior knowledge or intuition about the relationship between the presence of questions in a script and other features. However, the proportion of questions is easily measurable, and it could be intriguing to explore whether any patterns can be detected.

Below, we see that the proportion of lines with questions, at 31.4%, suggests **a strong preference for interactive dialogue** within movies. This is substantially higher than the proportion of lines with exclamations, at 8.9%, which could indicate that while intense emotional expressions are present, **they are less frequent than interrogative exchanges**.

The boxplot for the count of all-caps words reveals that capitalized words are not common, suggesting that screenwriters may prefer subtler methods of conveying emphasis in dialogue rather than relying on text formatting.

While questions are more common, the range of usage varies widely among movies, potentially reflecting different genres or directorial styles.
For example, a thriller may have more questions built into the dialogue to maintain suspense, whereas a comedy may use exclamations to highlight punchlines.

The histogram for lines that end with a hyphen shows a significant skew toward lower proportions, indicating that such lines are relatively uncommon in movie scripts. This could suggest that interrupted dialogue or sentences leading into actions (often denoted by hyphens) are used sparingly, perhaps to maintain the flow of dialogue or to avoid overusing a device that might otherwise lose its impact.

Part of speech helps us understand the grammatical function of a word in a sentence. For instance, genres like historical or biographical films are often flooded with proper nouns, making the tracking of these and other common tags potentially revealing.

According to "Judging Screenplays by Their Coverage" by Stephen Follows and Josh Cockcroft, "swear words (are) not spread equally across all scripts [...] Comedies are the sweariest, beating Action and Horror scripts by a tiny margin (and) the genres featuring the lowest levels of swearing are Family, Animated and Faith-based scripts" (42).

We'll start by taking a look at the most frequent tags in the text by flattening the text, taking a sample, and using spaCy for POS tagging.

Overall, nouns are by far the most common part of speech, with adjectives and verbs maintaining relatively similar counts. Adverbs are the rarest part of speech in our movies.

I chose to display all four histograms on one plot because it highlights a clear differentiation in the usage of various parts of speech within movie dialogue. Nouns dominate the linguistic landscape, **occupying 40% to 60% of the dialogue**, whereas adverbs range anywhere between 0% and 10%. This prevalence underlines the concrete and tangible nature of film narratives, which often rely on specific nouns to anchor the conversation and set scenes. Adverbs, conversely, appear infrequently, suggesting that **movie dialogue may favor direct and concise language over descriptive or qualifying phrases**.

We'll detect profanity using the 'badwords.txt' list from profanityfilter.

While most movie lines are devoid of profanity, there is a significant presence of it in certain scripts, with a few reaching a proportion as high as 0.37. This might reflect genre, setting, or character development choices, where profanity is used to add realism or intensity, or to delineate characters' personalities.

We'll utilize two sentiment analysis models: NLTK VADER, which is quick but uses a basic rule-based approach, and Flair, which is more accurate but computationally intensive.

NLTK VADER assigns sentiment scores based on individual words and may be biased toward neutrality even in the presence of strong negative words, making it less precise. It also struggles to identify sarcasm or contextual nuance.

Flair is an embedding-based model, which enables it to capture context: words with similar vector representations are often used in similar contexts. The downside of this approach is that it is significantly slower than the naive, rule-based one; the NLTK model took ~4 minutes to run, while the Flair model took ~3 hours.

> "The Pearson correlation evaluates the linear relationship between two continuous variables.
> A relationship is linear when a change in one variable is associated with a proportional change in the other variable. The Spearman correlation evaluates the monotonic relationship between two continuous or ordinal variables. In a monotonic relationship, the variables tend to change together, but not necessarily at a constant rate." (source)

In our analysis, we will use the Spearman correlation coefficient to identify monotonic relationships among all variables.

Below, only the significant correlations are displayed, i.e., those where the p-value for the Spearman correlation is less than 0.05.

I expected to find some significant correlations, such as those between the average number of words and the average number of sentences or the average number of uppercase words. Beyond those, there were a few interesting observations.

As noted above, I was quite surprised to see a correlation between questions and profanity, yet no relationship between exclamation marks and profanity. Therefore, I decided to plot a slope graph to see if we could uncover any relationship there.

Interestingly enough, despite there being no significant correlation between the proportion of exclamations and the proportion of profanity, the largest jump in the proportion of profanity occurs between dialogue with no exclamation marks and dialogue with exclamation marks.

I am going to fast-forward through this last part and provide a brief overview of the modeling process and performance. However, please feel free to let me know if you'd like a more in-depth exploration of the modeling work done here, and I'll release a part two 🙂

Here, we build a classifier to predict whether a movie belongs to the drama genre.

To expedite the modeling phase, we utilized LazyPredict, an AutoML Python package that applies a battery of common machine learning algorithms to a dataset and reports standard metrics for the task.

We then performed hyperparameter tuning on the top four models.

Classically, hyperparameter sweeps are run via grid search (brute force), where all possible combinations of hyperparameters are empirically evaluated. Given that the number of trials grows exponentially with every new hyperparameter, this is usually infeasible. Another approach, random search, randomly samples hyperparameter combinations and can reach a good configuration more efficiently than grid search when the full grid cannot be exhausted.

Instead of either of these options, I utilized Bayesian optimization. This method constructs a Gaussian process to model the black-box objective function over the search space. The overarching advantage is that the search converges toward a solution (as any ML model does) rather than simply trying out hyperparameter combinations blindly.

The F1 score, a harmonic mean of precision and recall, serves as a key indicator of our model's performance.
Precision reflects the model's reliability in correctly identifying a movie as belonging to the drama genre, while recall measures the model's ability to capture all relevant instances of drama movies.

Considering the constraints, such as the absence of a fully developed pipeline for filtering low-variance columns, addressing potential multicollinearity, and performing more extensive feature engineering, the model demonstrated reasonable effectiveness. The following section highlights the features that were most important to the model.

This article was mainly focused on the process of feature generation and on analyzing the data within the context of screenplays. However, if I wanted to work more on modeling, I'd focus on feature engineering, examine the effects of multicollinearity, and spend more time on model selection.

I hope you enjoyed this analysis and that this article showcased the potential of tailoring analyses to the unique characteristics of a field. While I focused on cinematic dialogue, the principles of domain-driven data analysis and modeling are universal. I encourage you to research your chosen domain, remain curious, and get creative with feature engineering during your next modeling task. I would also love to hear about your own experiences with interesting domain-driven analyses, so feel free to write a comment here or email me at christabellepabalan@gmail.com. Thanks!

## Initial Questions
### A few questions I'd jotted down:
### What I found
## What are these syntactical & textual characteristics?
1. Length attributes
2. Types of sentences
3. Part of speech and profanity
4. Sentiment
### Data
### Load the Data
We'll begin by loading the data from the `movie_lines.txt` file.
```python
import os

# Define the directory path where the 'movie_lines.txt' file is located
corpus_directory = 'cornell movie-dialogs corpus'

# Construct the full file path
file_path = os.path.join(corpus_directory, 'movie_lines.txt')

# Open the file in read mode with 'mac_roman' encoding
with open(file_path, 'r', encoding='mac_roman') as file:
    # Read the contents of the file and split them into individual lines
    lines = file.read().splitlines()
```
```python
lines[:2]
['L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!',
 'L1044 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ They do to!']
```
The columns are split by `+++$+++`, so this will be used as the separator to split each line, extract the columns, and read the data into a data frame.
```python
import pandas as pd

# Split each line in 'lines' using ' +++$+++ ' as the separator
preprocessed_list = list(
    map(lambda x: str(x).split(' +++$+++ '), lines)
)

# Define column names for the DataFrame
column_names = ['line', 'speaker_id', 'movie_id', 'name', 'text']

# Create a DataFrame using 'preprocessed_list'
df = pd.DataFrame(preprocessed_list, columns=column_names)

# Display the first 2 rows of the DataFrame
df.head(2)
```
### Preprocess Text
```python
# Transform all contractions to their longer form
df['text'] = df.text.map(clean_contractions)

# Remove all punctuation and punctuation errors in the data
df['text_no_punct'] = df.text.map(remove_punctuation)

# Remove words <2 chars and stopwords, lemmatize, & transform to lowercase
df['clean_text'] = df.text_no_punct.map(
    lambda x: preprocess(' '.join(x))
)
```
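The helper functions above (`clean_contractions`, `remove_punctuation`, `preprocess`) aren't shown in the original snippet. Here is a minimal sketch of what they might look like, assuming `remove_punctuation` returns a list of tokens (which is why `preprocess` receives `' '.join(x)`) and using spaCy for stopword removal and lemmatization:

```python
import string
import spacy

nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

# Hypothetical, abbreviated contraction map; a real one covers many more forms
CONTRACTIONS = {"won't": "will not", "can't": "cannot", "n't": " not",
                "'re": " are", "'ll": " will", "'ve": " have"}

def clean_contractions(text: str) -> str:
    """Expand contractions to their longer form."""
    for short, long_form in CONTRACTIONS.items():
        text = text.replace(short, long_form)
    return text

def remove_punctuation(text: str) -> list:
    """Strip punctuation and return the remaining tokens as a list."""
    cleaned = text.translate(str.maketrans('', '', string.punctuation))
    return cleaned.split()

def preprocess(text: str) -> str:
    """Lowercase, drop stopwords and words under 2 chars, and lemmatize."""
    doc = nlp(text.lower())
    return ' '.join(tok.lemma_ for tok in doc
                    if not tok.is_stop and len(tok.text) >= 2)
```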
## 1. Length Attributes
- The number of words in each line
- The number of sentences in each line
```python
import nltk

# Calculate the number of words in each line
# ('text_no_punct' holds token lists, so len() counts words)
df['num_words'] = df['text_no_punct'].map(len)

# Extract the number of sentences in each line
df['num_sentences'] = df['text'].map(
    lambda x: len(nltk.sent_tokenize(x))
)

# Remove entries with empty or non-textual content
df = df[df['num_words'] != 0]
```
### Statistics Per Movie: Boxplot and Statistics DataFrame
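The boxplot and statistics DataFrame aren't reproduced here; below is a minimal sketch of how the per-movie aggregates might be computed and plotted (the frame and column names are my own):

```python
import matplotlib.pyplot as plt

# Average words and sentences per line, aggregated per movie
movie_stats = (
    df.groupby('movie_id')[['num_words', 'num_sentences']]
      .mean()
      .rename(columns={'num_words': 'avg_words',
                       'num_sentences': 'avg_sentences'})
)

# Summary statistics across all movies
print(movie_stats.describe())

# Boxplot of the average words per line across movies
movie_stats.boxplot(column='avg_words')
plt.title('Average words per line, per movie')
plt.show()
```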
### Distribution on a Per-Movie Basis
### Dialogue Density Variation
### Narrative Rhythm
### Scriptwriting Consistency
### Visualizing the Outliers That Pull the Average to the Right
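A minimal sketch of the outlier-trimmed histogram, reusing the hypothetical `movie_stats` frame from above and an arbitrary 99th-percentile cutoff:

```python
import matplotlib.pyplot as plt

# Drop movies above the 99th percentile of average words per line
cutoff = movie_stats['avg_words'].quantile(0.99)
trimmed = movie_stats[movie_stats['avg_words'] <= cutoff]

trimmed['avg_words'].hist(bins=40)
plt.xlabel('Average words per line')
plt.ylabel('Number of movies')
plt.show()
```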
## 2. Types of Sentences
### Exclamation Points
### Hyphens
### Questions
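A minimal sketch of how these per-line flags might be derived (the column names are my own):

```python
# Presence of exclamation marks and question marks in the raw line
df['has_exclamation'] = df['text'].str.contains('!', regex=False)
df['has_question'] = df['text'].str.contains('?', regex=False)

# Lines whose dialogue trails off or is cut short with a hyphen
df['ends_with_hyphen'] = df['text'].str.rstrip().str.endswith('-')

# Count fully capitalized words of 2+ letters (e.g., "STOP") per line
df['num_allcaps'] = df['text'].str.findall(r'\b[A-Z]{2,}\b').map(len)
```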
## 3. Part of Speech and Profanity
### Part of Speech Tagging
```
NN: noun, singular or mass
JJ: adjective
VB: verb, base form
VBP: verb, non-3rd person singular present
RB: adverb
```
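A minimal sketch of the flatten-sample-tag step described in the text, assuming a sample of 10,000 tokens (the actual sample size isn't stated):

```python
from collections import Counter
import spacy

nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

# Flatten the cleaned lines into one token stream and take a sample
all_words = ' '.join(df['clean_text']).split()
sample_text = ' '.join(all_words[:10_000])

# tok.tag_ yields the fine-grained tags listed above (NN, JJ, VB, ...)
tag_counts = Counter(tok.tag_ for tok in nlp(sample_text))
print(tag_counts.most_common(10))
```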
### Profanity
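A minimal sketch of the profanity feature, assuming `badwords.txt` from the profanityfilter package has been saved locally with one word per line:

```python
# Load the profanity word list (one word per line)
with open('badwords.txt', 'r') as f:
    bad_words = {word.strip().lower() for word in f if word.strip()}

def profanity_proportion(tokens: list) -> float:
    """Share of a line's tokens that appear in the profanity list."""
    if not tokens:
        return 0.0
    return sum(tok.lower() in bad_words for tok in tokens) / len(tokens)

# 'text_no_punct' holds token lists, so this yields a per-line proportion
df['prop_profanity'] = df['text_no_punct'].map(profanity_proportion)
```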
## 4. Sentiment
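The scoring code isn't shown in the article; here is a minimal sketch of how the two models might be applied (the score column names are my own):

```python
from nltk.sentiment.vader import SentimentIntensityAnalyzer  # needs nltk.download('vader_lexicon')
from flair.data import Sentence
from flair.models import TextClassifier

# Rule-based scores with NLTK VADER (fast)
vader = SentimentIntensityAnalyzer()
df['vader_compound'] = df['text'].map(
    lambda x: vader.polarity_scores(x)['compound']
)

# Embedding-based scores with Flair (slow but context-aware)
flair_clf = TextClassifier.load('en-sentiment')

def flair_score(text: str) -> float:
    """Signed confidence: positive for POSITIVE, negative for NEGATIVE."""
    sentence = Sentence(text)
    flair_clf.predict(sentence)
    label = sentence.labels[0]
    return label.score if label.value == 'POSITIVE' else -label.score

df['flair_score'] = df['text'].map(flair_score)
```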
### Visualizing Frequency of Positive, Negative, Non-Neutral and All Words
## Relationships Among Variables
### Correlation
### Only Display Significant Correlations
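A minimal sketch of a significance-filtered Spearman matrix with scipy, assuming the per-movie features live in a DataFrame called `movie_features`:

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

# Correlation and p-value matrices for every pair of feature columns
corr, pvals = spearmanr(movie_features)

# Keep only coefficients whose p-value is below 0.05
significant = np.where(pvals < 0.05, corr, np.nan)

sig_df = pd.DataFrame(significant,
                      index=movie_features.columns,
                      columns=movie_features.columns)
```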
A significant positive correlation between the average number of words and the use of proper nouns (`prop_noun`) may also indicate that more complex dialogues include more specific references to entities or names, which could be characteristic of certain genres like science fiction or fantasy with complex world-building.
### Profanity Against Variables
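A minimal sketch of the comparison behind the slope graph, reusing the hypothetical `has_exclamation` and `prop_profanity` columns from earlier:

```python
# Mean profanity proportion for lines with vs. without exclamation marks
profanity_by_exclaim = df.groupby('has_exclamation')['prop_profanity'].mean()
print(profanity_by_exclaim)
```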
### Profanity Against Variables Summary
### Final Data
## Modeling
### LazyPredict
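A minimal sketch of the LazyPredict pass, assuming `X` holds the per-movie features and `y` the binary drama label:

```python
from lazypredict.Supervised import LazyClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit a battery of standard classifiers and report common metrics
clf = LazyClassifier(verbose=0, ignore_warnings=True)
models, predictions = clf.fit(X_train, X_test, y_train, y_test)
print(models.head(10))
```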
### Hyperparameter Tuning: Bayesian Optimization
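The tuning code isn't shown; below is a minimal sketch using scikit-optimize's `BayesSearchCV` (one Gaussian-process-based implementation, not necessarily the author's) for the Extra Trees model, with a guessed search space:

```python
from skopt import BayesSearchCV
from skopt.space import Integer
from sklearn.ensemble import ExtraTreesClassifier

search = BayesSearchCV(
    ExtraTreesClassifier(random_state=42),
    {
        'n_estimators': Integer(100, 1000),
        'max_depth': Integer(3, 30),
        'min_samples_split': Integer(2, 10),
    },
    n_iter=30,          # number of Bayesian optimization steps
    scoring='f1',
    cv=5,
    random_state=42,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```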
### Manual Hyperparameter Tuning: Extra Trees Classifier
```
Train F1 Score: 0.985
Train Accuracy Score: 0.984

Test F1 Score: 0.696
Test Accuracy Score: 0.675
```
## Future Work
## Concluding Remarks
### References
- Cornell Movie-Dialogs Corpus. Retrieved from https://www.kaggle.com/rajathmc/cornell-moviedialog-corpus/kernels (originally retrieved from ConvoKit; Chang et al., 2020).
- Follows, S. (2019). Judging Screenplays by Their Coverage. Retrieved from https://stephenfollows.com/wp-content/uploads/2019/01/JudgingScreenplaysByTheirCoverage_StephenFollows_c.pdf