Analysis of speech behaviours between genders.

Do speech behaviours related to confidence and uncertainty vary between men and women?

Among all species on Earth, humans are unique in communicating through a symbolic system, i.e., spoken and written language¹. This highly sophisticated system lets humans communicate in a precise and complex manner. Still, communicative speech acts seem to differ between genders. One of the major reported differences between women’s and men’s speech is that men tend to dominate conversations through interruptions and overlaps². Men have also been reported to use stronger expletives, while women tend to use more polite alternatives.

In this project, we investigate how speech varies with gender and how social norms shape language use across genders. We hypothesise that men and women have different speech behaviours, and in particular that women express more uncertainty (doubt). For example, we expect a woman to say “I expect this to do that” while a man would rather say “I know this does that”. Our aim is therefore to analyse whether there is a real difference between genders and, if so, how large it is.


We are interested in answering the following question:

"Do speech behaviours related to confidence and uncertainty vary between men and women?"

To answer this question, we'll go through the following points:

Profession gender difference

To what extent can we observe the differences in communicative acts in relation to gender within a professional area?

Culture gender difference

What are the roles of nationality, culture and education in determining those differences in speech between men and women?

Temporal gender difference

Have gender differences in speech changed over time, between 2015 and 2020?

What is our data?

We analyse data from Quotebank, an open corpus of 178 million speaker-attributed quotations from 2008 to 2020. In this project, we focus only on the most recent quotations, from 2015 to 2020. We combine this dataset with speakers’ information from Wikidata, a collaboratively edited open-source knowledge base.
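As an illustration of this step, here is a minimal sketch of how quotations could be restricted to 2015-2020 and joined with speaker attributes. The toy data frames and the column names (`speaker_qid`, `qid`, `gender`, `date`) are assumptions made for the example and are not necessarily the fields used in the project files.

```python
import pandas as pd

# Toy stand-ins for the two sources; column names are illustrative assumptions.
quotes = pd.DataFrame({
    "quotation": ["I know this works.", "This might work.", "It is done."],
    "speaker_qid": ["Q1", "Q2", "Q1"],
    "date": pd.to_datetime(["2014-06-01", "2016-03-15", "2019-11-30"]),
})
speakers = pd.DataFrame({
    "qid": ["Q1", "Q2"],
    "gender": ["male", "female"],
})


def attach_speaker_info(quotes, speakers, first_year=2015, last_year=2020):
    """Keep quotations from the chosen years and add speaker attributes."""
    # Series.between is inclusive on both ends by default.
    in_range = quotes["date"].dt.year.between(first_year, last_year)
    recent = quotes.loc[in_range]
    # An inner join drops quotations whose speaker has no attribute record.
    merged = recent.merge(speakers, left_on="speaker_qid", right_on="qid", how="inner")
    return merged.drop(columns="qid")


merged = attach_speaker_info(quotes, speakers)
print(merged[["quotation", "gender"]])
```

The inner join is a design choice for the sketch: a quotation with no matching speaker record carries no gender label and could not be used in the analysis anyway.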



Creation of professional & background data frames

To get a general overview of the speakers’ occupations, we focus on four main professional fields: arts, science, economy and politics. Speakers are grouped by profession within each field. Additionally, to assess the roles of nationality, religion and education in a possible cultural gender difference in communicative acts, we built a general data frame with no restriction on profession.
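The grouping described above could look roughly like the following sketch. The occupation-to-field mapping `FIELD_BY_OCCUPATION`, the helper `split_frames` and the column names are hypothetical. Only the four field names come from the text, and the actual list of professions per field is not reproduced here.

```python
import pandas as pd

# Hypothetical mapping from occupation labels to the four professional fields.
FIELD_BY_OCCUPATION = {
    "actor": "arts", "painter": "arts",
    "physicist": "science", "biologist": "science",
    "economist": "economy", "banker": "economy",
    "politician": "politics", "diplomat": "politics",
}

# Toy speaker table; column names are illustrative assumptions.
speakers = pd.DataFrame({
    "qid": ["Q1", "Q2", "Q3", "Q4"],
    "occupation": ["Physicist", "politician", "chef", "painter"],
})


def split_frames(speakers):
    """Return (professional, background) data frames.

    background keeps every speaker, with no condition on profession;
    professional keeps only speakers whose occupation maps to a field.
    """
    normalised = speakers["occupation"].str.strip().str.lower()
    # Occupations missing from the mapping become NaN.
    background = speakers.assign(field=normalised.map(FIELD_BY_OCCUPATION))
    professional = background.dropna(subset=["field"])
    return professional, background


professional, background = split_frames(speakers)
print(professional)
```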

Classification of the quotations

To analyse speech uncertainty, we adapted an existing uncertainty detection classifier³, using 6 features. Uncertainty is signalled by speculative verbs (like suggest or presume), adjectives and adverbs (like probably or possibly), auxiliary verbs (like must or should), or certain tenses and moods (subjunctive, conditional). The classifier is a machine learning method that automatically detects uncertainty in natural language.
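To make the cue categories concrete, here is a deliberately simplified sketch that flags a quotation as uncertain when it contains a known cue word. This is a plain lexicon lookup, not the machine learning classifier used in the project. The cue lists contain only the examples named above, plus a few conditional markers added as an assumption.

```python
import re

# Cue words per category. The first three categories use only the examples
# given in the text; the "conditional" entries are an assumption for the sketch.
CUES = {
    "speculative_verb": {"suggest", "presume"},
    "adverb": {"probably", "possibly"},
    "auxiliary": {"must", "should"},
    "conditional": {"would", "could", "might"},
}


def uncertainty_cues(quotation):
    """Return (token, category) pairs for each cue word found, in text order."""
    tokens = re.findall(r"[a-z']+", quotation.lower())
    return [
        (token, category)
        for token in tokens
        for category, words in CUES.items()
        if token in words
    ]


def is_uncertain(quotation):
    """Flag a quotation as uncertain if it contains at least one cue word."""
    return bool(uncertainty_cues(quotation))


print(uncertainty_cues("This should probably work"))
print(is_uncertain("I know this does that"))
```

Exact word matching misses inflected forms such as "suggests" and cannot see grammatical mood, which is one reason a trained classifier with richer features is preferable to a word list.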

Let's explore our data!

Before investigating our research questions, let's have a look at our dataset.

Speakers gender exploration