corpus design – Kat Gupta

Apologies for the silence. I am trying to write a conference paper for, um, Thursday and my data is stubbornly refusing to organise itself into categories. In a way I’m quite pleased – I’m now working with two corpora and it’s interesting that they show this difference. One is the Suffrage corpus that I’ve been using until now, created by identifying all the articles in the Times Digital Archive containing suffrag* and pulling them out. The asterisk is a wildcard which means that I don’t need to specify an ending – because it’s got that wildcard in it, the search term will find suffrage, suffragism, suffragette, suffragettes, suffragist, suffragists and so on. It will also identify Suffragan, an ecclesiastical term and one that has nothing to do with the suffrage movement. So the script has an exception in it for that term.
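A minimal sketch of that kind of search in Python, not the actual script: the pattern, the function names and the way the Suffragan exception is handled are all assumptions made for illustration.

```python
import re

# Assumption: this pattern is an illustration, not the script used for the corpus.
# \b anchors the match at the start of a word; \w* plays the role of the asterisk.
SUFFRAG = re.compile(r"\bsuffrag\w*", re.IGNORECASE)

# Words that match the wildcard but should be ignored (the ecclesiastical term).
EXCLUDED_PREFIX = "suffragan"


def suffrage_hits(text):
    """Return every suffrag* word in text, skipping Suffragan and its forms."""
    hits = SUFFRAG.findall(text)
    return [h for h in hits if not h.lower().startswith(EXCLUDED_PREFIX)]


def mentions_suffrage(text):
    """True if the article contains at least one relevant suffrag* word."""
    return bool(suffrage_hits(text))
```

An article that only mentions a Suffragan bishop is skipped, while one that mentions suffragists or suffragettes anywhere is kept.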

The other corpus is composed of Letters to the Editor – the LttE corpus. This sounds very staid and genteel but it actually contains heated exchanges between different factions of the suffrage movement, the Women’s Anti-Suffrage League, various anti-suffragist men and anyone else who felt compelled to stick their oar in. At times it reads more like a blogging flamewar! This corpus was also extracted using suffrag* as a search term, to get texts mentioning suffrage etc; to narrow these down to letters, I looked at the header of each text. The header contains information like the file name, the date it was published in the Times, the title of the article and, crucially, what it’s classified as – News, Editorials, Leaders or, indeed, Letters to the Editor. So this time the script looked for suffrag* and Letters to the Editor in the header.
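Roughly, the header check might look like the sketch below. The header layout is not described above, so the "key: value" lines, the blank line before the body and the field name category are all hypothetical. The sketch also searches the whole text for suffrag*, which is an assumption about where the search term was matched.

```python
import re

# suffrag* with the Suffragan exception folded into the pattern (illustrative only).
SUFFRAG = re.compile(r"\bsuffrag(?!an)\w*", re.IGNORECASE)


def parse_header(raw):
    """Split a text into (header dict, body).

    Assumes a hypothetical layout: 'key: value' lines, a blank line, then the body.
    """
    header_block, _, body = raw.partition("\n\n")
    header = {}
    for line in header_block.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            header[key.strip().lower()] = value.strip()
    return header, body


def is_suffrage_letter(raw):
    """True if the text is classified as a letter and mentions suffrag*."""
    header, _ = parse_header(raw)
    is_letter = header.get("category", "").lower() == "letters to the editor"
    return is_letter and bool(SUFFRAG.search(raw))
```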

Both corpora are divided by year and month, so I have a folder for 1908, 1909, 1910 etc and within those, sub-folders for each month. So if I wanted to, I could compare texts from April 1909 to April 1910, or June 1913 to December 1913, or the first six months of 1911 to the first six months of 1912. I like organising corpora in a way that allows this flexibility.
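To show why this layout is handy, here is a small sketch that picks out the files for a given stretch of months. The folder naming (four-digit year, two-digit month) and the .txt extension are assumptions; the real corpora may be named differently.

```python
from pathlib import Path


def texts_for(corpus_root, year, months):
    """Yield text files for the given year and month numbers.

    Assumes a hypothetical layout of corpus_root/YYYY/MM/*.txt.
    """
    for month in months:
        folder = Path(corpus_root) / f"{year}" / f"{month:02d}"
        if folder.is_dir():
            yield from sorted(folder.glob("*.txt"))


if __name__ == "__main__":
    # First six months of 1911 against the first six months of 1912.
    # "suffrage_corpus" is a placeholder folder name.
    early_1911 = list(texts_for("suffrage_corpus", 1911, range(1, 7)))
    early_1912 = list(texts_for("suffrage_corpus", 1912, range(1, 7)))
    print(len(early_1911), len(early_1912))
```

Because the time period is just a year and a list of months, any of the comparisons above comes down to two calls with different arguments.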

In Chapter Four, I investigated Mutual Information (MI) for suffragist, suffragists, suffragette and suffragettes in each year in the Suffrage corpus, then categorised the words it came up with. Mutual Information is a measure of how closely words are linked together. So, suffragist and banana aren’t linked at all, but as I found, suffragist and violence are linked. I then came up with categories for these words – direct action, gender, politics, law & prison and so on, and compared these categories across the different years.
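For a rough idea of what the calculation involves, the sketch below uses one common formulation of MI in corpus linguistics: the log (base 2) of how often two words are observed together divided by how often they would be expected together by chance. The window size and the function name are assumptions, and this is not necessarily the formula or software behind the figures discussed here.

```python
import math
from collections import Counter


def mi_scores(tokens, node, span=4, min_freq=1):
    """Score each collocate of node with MI = log2(observed / expected).

    span is the number of tokens counted on each side of the node
    (an assumed setting, not one taken from the post).
    """
    n = len(tokens)
    freq = Counter(tokens)
    observed = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            # Collect the words within span tokens to the left and right.
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            observed.update(window)
    scores = {}
    for word, obs in observed.items():
        if obs < min_freq:
            continue
        # Co-occurrences expected if the two words were unrelated.
        expected = freq[node] * freq[word] * 2 * span / n
        scores[word] = math.log2(obs / expected)
    return scores
```

A word that never turns up near the node gets no score at all, and a word that turns up near it more often than chance would predict gets a positive one.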

I’ve now done the same for the LttE. What’s interesting is that there is not much overlap between the words associated with suffragist, suffragists, suffragette and suffragettes in the LttE corpus and the words associated with suffragist, suffragists, suffragette and suffragettes in the Suffrage corpus. Part of this is to do with the different functions of the texts; rather than reporting news, the Letters to the Editor try to argue, advocate and persuade. However, there are also words like inferior, educated and employed in the LttE data – words that seem to be more about the attributes of women or suffragist campaigners. This just doesn’t seem to be a feature in the Suffrage data.

Also interestingly, the categories I came up with don’t work for this corpus. While direct action was a prominent category for the Suffrage corpus, I don’t think I can find a single direct action term in the LttE MI data. Not even things like demonstration, which is pretty innocuous as far as direct action goes.

So what’s going on here? At least part of it is due to the different functions of news reports and what are essentially open letters. But I think there’s also a difference in who was writing the letters. Letters to the Editor offered both suffrage campaigners and anti-suffrage campaigners an opportunity to represent their views themselves, rather than being represented by or mediated through a reporter, editor and others engaged in the production of a news report. I don’t think it’s that strange that the language they use and avoid is different.

Back, after an unwelcome hiatus. I’ve learnt my lesson though, and will be backing my wordpress database up. Regularly.

Anyway, being a linguist of the sweary variety, I was intrigued to see someone on Twitter use Google Labs’ ngram viewer to look at cunt and express surprise and delight that cunt was being used so frequently rather earlier than expected.

I thought the graph looked interesting. The frequency of cunt was rather erratic: an isolated big peak in around 1625-35; an isolated smaller peak in around 1675; peaks in 1690ish and 1705ish; a rather spiky presence between 1705 and 1800; then fairly consistently low frequency until around 1950 when its frequency increases again.

This seemed puzzling – rather than being fairly low-level but present, there were these huge spikes in the 17th century. I decided to have a look at the texts themselves. These turned out to be in Latin, and the following image rather neatly illustrates the two different meanings at work here:

The books themselves seem to be religious texts written in Latin, even if Google’s ever-helpful advertising algorithm seems to interpret things rather differently. As you can see in the first image, I selected texts from the English corpus. It’s possible that the books are assigned a corpus based on their place of publication, but it’s not very intuitive.

I took a closer look at the texts to try and work out what was going on. Some of the texts were in Latin, as this example taken from De paradiso voluptatis quem scriptura sacra Genesis secundo et tertio capite:

However, this was not the only issue. I found at least one example of a musical score – this example taken from Liber primus motectorum quatuor vocibus:

Here, the full lexical item is benedicunt. In both of these examples, cunt is not a full lexical item; I can understand why the layout of the score might have led to it being parsed as a separate item, but I’m a bit confused why the same seems to have happened with dicunt.
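The score case is easy to reproduce. In the sketch below, the syllabified line is an invented example of how words are typically laid out under the notes, not a transcription of the book, and the tokeniser is a deliberately naive one rather than whatever Google actually uses.

```python
import re


def naive_tokens(text):
    """Split on anything that is not a letter, as a simple tokeniser might."""
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]


# The same word as running text and as it might be spaced out under a stave.
running_text = "Benedicunt Dominum"
score_underlay = "Be - ne - di - cunt  Do - mi - num"

print(naive_tokens(running_text))
print(naive_tokens(score_underlay))
```

In the running text the word stays whole; in the spaced-out version the final syllable becomes a token in its own right and gets counted as one.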

The high frequency of cunt can also be attributed to Optical Character Recognition (OCR). Basically, the text is scanned and a computer program tries to convert the images into text. This has varying degrees of accuracy – it can be very good, but things like size and font of print, the paper it was printed on and age of the texts all have an effect. The text obtained through scanning with OCR is then linked to the image.

This example, taken from Incogniti clariss. olim theologi Michaelis Aygnani Carmelitarum Generalis, is probably familiar to those working with OCR scanned texts. The text actually reads cont. but the OCR has read it as cunt. The search program can’t read the image files; all it has to go on are OCR scanned texts. When these aren’t accurate, you get results like these.

I think Google ngram is interesting, but with some caveats. Corpora can be tiny – the researcher can have read every single text in their corpus and know it inside out. Corpora can be large and highly structured, like the British National Corpus. Corpora can be large and the researcher doesn’t need to have read every single text contained in them, but through careful compilation the researcher knows where the texts have come from, where they were published and so on – for example, corpora assembled through LexisNexis. This is a bit different – it’s not really clear what’s even in the collection of texts and the researcher has to trust that Google has put the right texts in the right language section. I’ve seen Google ngrams being used to gauge relative frequencies of two or more phrases, but for now I think I’ll stick to more traditional corpora for most in-depth work.

Mark Davies also has a post comparing the Corpus of Historical American English with Google Books/Culturomics. His post is in-depth, interesting and systematic; I just swear a lot. You should probably read his.

Kat Gupta’s research blog

Tag: corpus design

“To the Editor of the Times…”

swearing with Google ngrams
