Context

Among all species on Earth, humans have the unique ability to communicate through a symbolic system, i.e., verbal and written language¹. This highly sophisticated system enables humans to communicate in a very precise and complex manner. Still, communicative speech acts seem to differ between genders. One of the major differences between women’s and men’s speech is that men have been found to dominate conversations through the use of interruptions and overlaps². Additionally, men tend to use strong expletives, while women tend to use politer versions.

In this project, we investigate gender-related variation in speech, the social norms behind it and how language use differs between genders. We hypothesize that men and women have different speech behaviours and, in particular, that women express more uncertainty (doubt). For example, we expect a woman to say “I expect this to do that” while a man would rather say “I know this does that”. Our aim is therefore to analyse whether there is a real difference between genders and, if so, how large it is.

Goals

We are interested in using this dataset to answer the following question:

"Do speech behaviours related to confidence and uncertainty vary between men and women?"

To answer this question, we'll go through the following points:

Profession gender difference

To what extent can we observe the differences in communicative acts in relation to gender within a professional area?

Culture gender difference

What are the roles of nationality, culture and education in determining those differences in speech between men and women?

Temporal gender difference

Have gender differences in speech changed over time, from 2015 to 2020?

What is our data?

In the following, we analyse data from Quotebank, an open corpus of 178 million speaker-attributed quotations from 2008 to 2020. In this project, we focus only on the most recent quotations, from 2015 to 2020. We combine this dataset with speakers’ information from Wikidata, a collaboratively edited open knowledge base.


Methods

Creation of professional & background data frames

To get a general overview of the speakers’ occupations, we focus on four main professional fields: arts, science, economy and politics. Speakers are grouped by profession within each field. Additionally, to assess the roles of nationality, religion and education in a possible cultural gender difference in communicative acts, we built a general data frame with no condition on the profession.

Classification of the quotations

To analyse speech uncertainty, we adapted an existing uncertainty detection classifier³ that uses 6 features. This classifier is a machine learning method that automatically detects uncertainty in natural language. Uncertainty is signalled by speculative verbs (like suggest or presume), adjectives and adverbs (like probably or possibly), auxiliary verbs (like must or should) or the use of certain tenses or moods (subjunctive, conditional).
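To make the cue categories above concrete, here is a minimal sketch of a keyword-based flagger. It is not the classifier we used, which is a probabilistic model; the cue lists, the function names and the "one cue is enough" rule are assumptions made for illustration only.

```python
import re

# Illustrative cue lexicon (an assumption, not the classifier's feature set).
UNCERTAINTY_CUES = {
    "speculative_verb": ["suggest", "presume", "suppose", "expect", "believe"],
    "adverb_or_adjective": ["probably", "possibly", "perhaps", "maybe", "likely"],
    "auxiliary_verb": ["must", "should", "might", "may", "could", "would"],
}

VERB_ENDINGS = r"(?:s|d|ed|ing)?"


def _compile(words, allow_endings=False):
    # Match whole words; optionally allow simple verb endings (-s, -d, -ed, -ing).
    ending = VERB_ENDINGS if allow_endings else ""
    return re.compile(r"\b(?:" + "|".join(words) + r")" + ending + r"\b", re.IGNORECASE)


CUE_PATTERNS = {
    name: _compile(words, allow_endings=(name == "speculative_verb"))
    for name, words in UNCERTAINTY_CUES.items()
}


def find_uncertainty_cues(quote):
    """Return a dict mapping each cue category to the cues found in the quote."""
    found = {}
    for name, pattern in CUE_PATTERNS.items():
        matches = [m.group(0).lower() for m in pattern.finditer(quote)]
        if matches:
            found[name] = matches
    return found


def is_uncertain(quote):
    """Flag a quote as uncertain if it contains at least one cue."""
    return bool(find_uncertainty_cues(quote))


for quote in ["I expect this to do that", "I know this does that"]:
    print(quote, "->", "uncertain" if is_uncertain(quote) else "certain")
```

A lexicon like this ignores context (for instance, "must" can express obligation rather than doubt), which is one reason to prefer a trained model.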

Let's explore our data!

Before starting to investigate our research questions, let's have a look at what our dataset looks like.

Speakers gender exploration

We see that there are 32 genders present in Wikidata. In this analysis, we will only focus on 2 genders: “Female” and “Male”.

Quotes language exploration

We notice that the majority of the quotes are in English. Still, there are some non-English quotes in the dataset. These are removed from our analysis, limiting our study only to the English quotes.

Female & male speakers ratio

As seen from the data, a great majority of speakers are male, which points to a persistent under-representation of women in the news. Unfortunately, this was expected. Even today, in the early twenty-first century, women continue to receive substantially less media coverage than men, despite their much increased participation in public life. However, only a few studies have systematically examined whether such media bias exists.

Still, a recent study in the American Sociological Review⁴ found that societal-level inequalities are the dominant determinants of continued gender differences in coverage: the media focuses nearly exclusively on the highest strata of occupational and social hierarchies, in which women’s representation has remained poor. As a result, we will focus our analysis on whether speech uncertainty differs between genders in different professional areas, and whether those differences have changed from 2015 to 2020. Additionally, to broaden our analysis, the background of the speakers will also be analysed to find other possible correlations between the speakers’ environments and their speech behaviours.

Let's check how well women are represented in the 4 professional fields we defined.

The female ratio for the different occupation groups

Politicians: 21%
Artists: 34%
Scientists: 25%
Economists: 18%
All occupations: 20%
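As an illustration of how such ratios can be computed, here is a minimal sketch in plain Python. The record layout (dicts with 'gender' and 'occupation' keys), the occupation-to-field mapping and the toy speakers are all assumptions made for the example; they are not our actual data frames.

```python
from collections import defaultdict

# Hypothetical mapping from occupation labels to professional fields.
FIELD_OF_OCCUPATION = {
    "politician": "Politicians",
    "painter": "Artists",
    "singer": "Artists",
    "physicist": "Scientists",
    "economist": "Economists",
}


def female_ratio_by_field(speakers):
    """Share of female speakers per field, keeping only 'female' and 'male'."""
    counts = defaultdict(lambda: {"female": 0, "total": 0})
    for speaker in speakers:
        if speaker["gender"] not in ("female", "male"):
            continue  # the analysis is restricted to two genders
        field = FIELD_OF_OCCUPATION.get(speaker["occupation"])
        if field is None:
            continue  # occupation outside the four fields
        counts[field]["total"] += 1
        if speaker["gender"] == "female":
            counts[field]["female"] += 1
    return {field: c["female"] / c["total"] for field, c in counts.items()}


# Toy speakers, made up for the example.
toy_speakers = [
    {"gender": "female", "occupation": "politician"},
    {"gender": "male", "occupation": "politician"},
    {"gender": "male", "occupation": "politician"},
    {"gender": "male", "occupation": "politician"},
    {"gender": "female", "occupation": "singer"},
    {"gender": "male", "occupation": "painter"},
    {"gender": "male", "occupation": "baker"},
]

ratios = female_ratio_by_field(toy_speakers)
for field, ratio in ratios.items():
    print(f"{field}: {ratio:.0%}")
```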

Results

Analysis of differences in communicative acts in relation to gender within the different professional areas

Looking at the figure above, which pools all quotes from 2015 to 2020, there seem to be only small differences in speech uncertainty between men and women. Intriguingly, women even appear slightly less uncertain than men.

Analysis of speaker's background influence on certainty

What are the roles of the environment (nationality), culture/tradition (religion), and education (whether the speaker obtained a specific academic degree) in determining those differences in speech between men and women? How are the lines drawn between the language we use and the environment around us?

Let's first have a look at the influence of nationality:

The colors range from white, indicating that women account for most of the uncertain speakers in the country, to dark red, indicating that men account for most of the uncertain speakers in the country. Using the cursor, we can get an overview of the distribution of gender uncertainty over the years. It seems that female speakers tend to become more uncertain as the years go by.

Let's then have a look at the influence of religion:

Finally, let's have a look at the influence of academic degrees:

Looking at the figures above, the distribution is fairly consistent across conditions, with about 70% certain and 30% uncertain quotations. Still, the speakers with the least certain speech behaviour are Hindus (only 63% of their quotations are certain). In contrast, speakers with a bachelor's degree are the most certain, with 75% of their quotes being certain.

Analysis of temporal evolution of the gender difference from 2015 to 2020

We can observe a general trend of the quotes towards more uncertainty. We were able to confirm this statistically: certainty drops by about 0.35% per year. We also found that women’s quotes are becoming slightly more certain, by about 0.1% per year.
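A yearly trend of this kind can be estimated with an ordinary least squares fit of the share of certain quotes against the year. The sketch below is a minimal illustration, assuming the shares have already been aggregated per year; the function name and the numbers are made up for the example and are not Quotebank results. It only estimates the slope and does not test its significance.

```python
def yearly_trend(years, certain_share):
    """Least squares slope and intercept of certain_share (in %) against year."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(certain_share) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, certain_share))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up yearly shares of certain quotes (in %), for illustration only.
years = [2015, 2016, 2017, 2018, 2019, 2020]
shares = [71.0, 70.4, 70.5, 69.8, 69.7, 69.1]

slope, intercept = yearly_trend(years, shares)
print(f"Estimated change in certainty: {slope:+.2f} points per year")
```

A negative slope means that the share of certain quotes decreases over the years; fitting men's and women's quotes separately gives one slope per gender.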

We can observe again a general direction of the quotes towards more uncertainty.

Statistical analysis

We performed a linear regression on the certainty label obtained from the Pajean uncertainty classifier to identify important features for certainty prediction. It is important to keep in mind that the classifier has an F-score of 62.8, so its labels are noisy and some of the results we obtain could be due to chance. We considered features with a p-value above 0.05 not statistically significant.
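With gender as the only feature, such a regression reduces to a comparison of group means: the intercept is the share of certain quotes among men, and the gender coefficient is the difference between women's and men's shares. The sketch below illustrates this reading on made-up labels; the function name and the data are assumptions for the example, and it does not compute p-values.

```python
def gender_coefficients(quotes):
    """quotes: list of (is_female, is_certain) pairs with 0/1 entries.

    Returns (baseline, female_effect), which equal the intercept and the
    coefficient of a least squares fit of is_certain on is_female.
    """
    male = [certain for female, certain in quotes if female == 0]
    female = [certain for female, certain in quotes if female == 1]
    baseline = sum(male) / len(male)
    female_effect = sum(female) / len(female) - baseline
    return baseline, female_effect


# Made-up labels: 10 quotes by men (7 certain), 4 quotes by women (3 certain).
toy_quotes = [(0, 1)] * 7 + [(0, 0)] * 3 + [(1, 1)] * 3 + [(1, 0)] * 1

baseline, female_effect = gender_coefficients(toy_quotes)
print(f"baseline (men): {baseline:.2f}, female effect: {female_effect:+.2f}")
```

Once other features such as profession or background are added, the coefficients can no longer be read off from simple group means and a full regression fit is needed.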

Considering all quotes, our first finding was that the average certainty of a man's quote was 0.70, and that a quote from a woman is associated with a 0.19% increase in the probability of being certain. This may seem slightly counterintuitive at first. We summarized our findings in the following diagram:

Then, we selected a smaller portion of the quotes, keeping only quotes from artists, economists, politicians and scientists. Our first surprise was that, within this population, the base probability of a quote being certain was 93%, 20% more than for quotes from all speakers. In this subset, the most certain speakers were estimated to be scientists, with a 3% increase in estimated quote certainty. We also found that the feature lowering the certainty probability the most in our model was being female, with a drop of 1.3%, regardless of the woman's profession.

Exploring the background of the speakers, we found that quotes from Indian speakers are correlated with a substantial 8.5% drop in certainty probability. Similarly, quotes from Hindu speakers are associated with a 7.5% drop in certainty. Holders of a Doctor of Philosophy degree are also associated with lower certainty, with a 7.5% drop.

Interactions between background and gender were rarely significant. This could be due to the highly unequal proportions of males and females in the different background categories. Our most representative case of interaction was a female with a bachelor of science. In our model, having a bachelor of science is correlated with a 1.7% drop in the probability of having a certain quote, and being a female with a bachelor of science is correlated with a 4.5% drop.

Conclusion

Through this notebook, we aimed to analyze the speech difference between women and men using the Quotebank dataset. We started from the hypothesis that women speak less confidently and in a more uncertain way than men. To test this hypothesis, we conducted an analysis with the help of a classifier distinguishing uncertain from certain quotations. We also used Wikidata as a supplementary data source to study the quotations’ speakers more closely. As a first analysis, we found similar certainty levels of approximately 70% for males and females, considering all speakers indiscriminately.

Then, we split the data frames by occupation to measure the impact of each professional field and to remove its bias. Concerning our initial question, “to what extent can we observe the differences in communicative acts in relation to gender within a professional area?”, it seems that there is no significant difference between men and women when compared in the same field of work. However, we note a high overall certainty probability of 90% for this population, 20% more than when considering the full population.

We continued our research by analyzing the backgrounds of the speakers, such as their nationality, religion, and academic degree. Again, background and gender interactions were rarely significant. This could be due to the strong gender imbalance of the Quotebank dataset (~ 75% males & 25% females). As a result, interactions of a specific background with the female gender were measured on much smaller samples.

Finally, we analyzed whether there had been a change over time from 2015 to 2020. We found a statistically significant trend: male quotes were correlated with a certainty drop of 0.35% per year, while women’s quotes are becoming slightly more certain, by 0.1% per year.

It is important to remember that female speakers from Quotebank may not be representative of all women. They represent a subset of women who have acquired high notoriety and acknowledgement. The imbalance of the Quotebank dataset suggests that, still today, the share of successful women amongst women is smaller than the share of successful men amongst men. This could be an important bias in our study, which could have steered it towards apparent equality between genders. Therefore, the real problem does not seem to be the way women express themselves but could rather be the voice we give them.

References

1. Adani S, Cepanec M. Sex differences in early communication development: behavioral and neurobiological indicators of more vulnerable communication system development in boys. Croat Med J. 2019;60(2):141-149. doi:10.3325/cmj.2019.60.141
2. Broadbridge, James. “An Investigation into Differences between Women’s and Men’s Speech.” (2009). Retrieved from http://www.birmingham.ac.uk.
3. Pierre-Antoine Jean, Sébastien Harispe, Sylvie Ranwez, Patrice Bellot, Jacky Montmain. Uncertainty detection in natural language: a probabilistic model. WIMS '16 Proceedings of the 6th International Conference on Web Intelligence, Mining and Semantics, Jun 2016, Nîmes, France. ⟨10.1145/2912845.2912873⟩. ⟨hal-01484994⟩
4. Shor E, van de Rijt A, Miltsov A, Kulkarni V, Skiena S. A Paper Ceiling: Explaining the Persistent Underrepresentation of Women in Printed News. American Sociological Review. 2015;80(5):960-984. doi:10.1177/0003122415596999
