Corpus Linguistics 2011

I admit that I was feeling rather grumpy before CL2011. Extracting my data had proved tricky, I worried that the stuff I was working on wasn’t ready to present and I was feeling somewhat anti-social.

However, I ended up having a rather good conference. Part of it is just that corpus linguists tend to be nice people – as one first-time attendee noted to me, people were constructive and helpful when commenting on people’s presentations. This is not always the case – these things can turn into an academic pissing contest – and she was pleasantly surprised. As Costas noted, it can feel a bit like a family reunion (the good kind, I hope). It was nice to catch up with friends, meet new people and extract others from the hilariously awkward situations they managed to create for themselves. I have a story about a red devil tattoo now.

The organisation was impeccable. This was the first conference I’ve been to that was in a dedicated conference centre rather than in a university. I’ve got to say, the food was much better than I’m used to at these things. I won’t name names, but some of us were rather enamoured with the little moussey-cakey things at lunch. The only problem seemed to be with workshop venues – there weren’t computing facilities so attendees were asked to bring their own laptops, but the room assigned to one workshop wasn’t suitable for an active, hands-on workshop.

The conference scheduling was thoughtfully done and I presented in the same session as others working on newspaper discourse, including Anna Marchi. It was interesting both for us and for the audience – we could make links between each other’s papers and also had the chance to talk afterwards.

I do wonder why corpus linguists haven’t really embraced Twitter though. There was a presentation on it (which I livetweeted) but we weren’t told about hashtags, and nobody organised a tweetup or similar. Having seen something of how my astrophysicist sister uses Twitter at her conferences, I think we’re missing out – it looks like a good way of engaging with presentations and finding other conference attendees. Next time, eh?

“To the Editor of the Times…”

Apologies for the silence. I am trying to write a conference paper for, um, Thursday and my data is stubbornly refusing to organise itself into categories. In a way I’m quite pleased – I’m now working with two corpora and it’s interesting that they show this difference. One is the Suffrage corpus that I’ve been using until now, created by identifying all the articles in the Times Digital Archive containing suffrag* and pulling them out. The asterisk is a wildcard which means that I don’t need to specify an ending – because it’s got that wildcard in it, the search term will find suffrage, suffragism, suffragette, suffragettes, suffragist, suffragists and so on. It will also identify Suffragan, an ecclesiastical term and one that has nothing to do with the suffrage movement. So the script has an exception in it for that term.
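
A minimal sketch of how a wildcard search with an exception could look in Python. This is an illustration, not the actual extraction script: the function names and the `EXCLUDED` list are hypothetical, and here the exception drops individual matches rather than whole articles, which is an assumption.

```python
import re

# The suffrag* wildcard: any word that starts with "suffrag", in any case.
SUFFRAG = re.compile(r"\bsuffrag\w*", re.IGNORECASE)

# Hypothetical exception list: forms that match the wildcard but have
# nothing to do with the suffrage movement.
EXCLUDED = {"suffragan", "suffragans"}


def suffrage_hits(text):
    """Return the wildcard matches in text, minus the excluded forms."""
    return [m for m in SUFFRAG.findall(text) if m.lower() not in EXCLUDED]


def is_suffrage_article(text):
    """True if the text contains at least one relevant suffrag* match."""
    return bool(suffrage_hits(text))
```

An article that only mentions a Suffragan Bishop is skipped, while one that mentions both a bishop and suffragettes is still kept.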

The other corpus is composed of Letters to the Editor – the LttE corpus. This sounds very staid and genteel but actually contained heated exchanges between different factions of the suffrage movement, the Women’s Anti-Suffrage League, various anti-suffragist men and anyone else who felt compelled to stick their oar in. At times it reads more like a blogging flamewar! This corpus was extracted using suffrag* as a search term to get letters mentioning suffrage etc; to get the letters I looked at the header of each text. The header contains information like the file name, the date it was published in the Times, the title of the article and, crucially, what it’s classified as – News, Editorials, Leaders or, indeed, Letters to the Editor. So this time the script looked for suffrag* and Letters to the Editor in the header.
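
A rough sketch of the header check, under loud assumptions: the header layout below ("Key: value" lines followed by a blank line) and the `Category` field name are invented for illustration, since the real header format is not shown here. Only the idea of combining the suffrag* search with the classification comes from the description above.

```python
import re

# suffrag* wildcard that skips "Suffragan" via a negative lookahead.
SUFFRAG = re.compile(r"\bsuffrag(?!an)\w*", re.IGNORECASE)


def parse_header(raw):
    """Split a text into (header dict, body).

    Assumes a hypothetical layout: 'Key: value' lines, a blank line,
    then the body of the article.
    """
    head, _, body = raw.partition("\n\n")
    header = {}
    for line in head.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            header[key.strip().lower()] = value.strip()
    return header, body


def is_suffrage_letter(raw):
    """True if the text is classified as a letter and matches suffrag*."""
    header, _ = parse_header(raw)
    is_letter = header.get("category", "").lower() == "letters to the editor"
    return is_letter and SUFFRAG.search(raw) is not None
```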

Both corpora are divided by year and month, so I have a folder for 1908, 1909, 1910 etc and within those, sub-folders for each month. So if I wanted to, I could compare texts from April 1909 to April 1910, or June 1913 to December 1913, or the first six months of 1911 to the first six months of 1912. I like organising corpora in a way that allows this flexibility.
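
To show why this layout is flexible, here is a small sketch that picks out every text between two months. It assumes folders named like `1911/01` (four-digit year, two-digit month) holding `.txt` files; those naming details and the function name are assumptions, not a description of the actual corpus.

```python
from pathlib import Path


def texts_between(root, start, end):
    """Yield .txt files in root/YYYY/MM folders where
    start <= (year, month) <= end, in chronological order."""
    pattern = "[0-9][0-9][0-9][0-9]/[0-9][0-9]"
    for month_dir in sorted(Path(root).glob(pattern)):
        if not month_dir.is_dir():
            continue
        period = (int(month_dir.parent.name), int(month_dir.name))
        if start <= period <= end:
            yield from sorted(month_dir.glob("*.txt"))
```

With this, `texts_between("suffrage_corpus", (1911, 1), (1911, 6))` would give the first six months of 1911, and swapping in `(1912, 1), (1912, 6)` gives the comparison set (the folder name is made up).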

In Chapter Four, I investigated Mutual Information (MI) for suffragist, suffragists, suffragette and suffragettes in each year in the Suffrage corpus, then categorised the words it came up with. Mutual Information is a measure of how closely words are linked together. So, suffragist and banana aren’t linked at all, but as I found, suffragist and violence are linked. I then came up with categories for these words – direct action, gender, politics, law & prison and so on, and compared these categories across the different years.
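
For the curious, a minimal sketch of the usual Mutual Information calculation in collocation work: compare how often two words actually co-occur with how often you would expect them to by chance, and take log base 2 of the ratio. This is the plain version; corpus tools often adjust for the size of the search window, which is left out here. The function name and the example counts in the note below are invented for illustration.

```python
import math


def mutual_information(pair_freq, node_freq, collocate_freq, corpus_size):
    """MI = log2(observed / expected).

    expected is the co-occurrence count predicted if the two words were
    independent: node_freq * collocate_freq / corpus_size.
    """
    if pair_freq == 0:
        return float("-inf")  # the words never co-occur
    expected = node_freq * collocate_freq / corpus_size
    return math.log2(pair_freq / expected)
```

A score of 0 means the pair turns up exactly as often as chance predicts; the higher the score, the stronger the link. With made-up counts, `mutual_information(8, 100, 80, 1_000_000)` is about 9.97, because the pair occurs 1,000 times more often than expected.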

I’ve now done the same for the LttE. What’s interesting is that there is not much overlap between the words associated with suffragist, suffragists, suffragette and suffragettes in the LttE corpus and the words associated with suffragist, suffragists, suffragette and suffragettes in the Suffrage corpus. Part of this is to do with the different functions of the texts; rather than reporting news, the Letters to the Editor try to argue, advocate and persuade. However, there are also words like inferior, educated and employed in the LttE data – words that seem to be more about the attributes of women or suffragist campaigners. This just doesn’t seem to be a feature in the Suffrage data.
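
One simple way to put a number on "not much overlap" is to compare the two collocate lists as sets. This is only a sketch: the word lists in the usage note are made up for illustration and are not my actual results, and Jaccard similarity is just one reasonable choice of measure.

```python
def collocate_overlap(collocates_a, collocates_b):
    """Return (shared words, Jaccard similarity) for two collocate lists.

    Jaccard similarity = size of intersection / size of union,
    so 0.0 means no shared collocates and 1.0 means identical lists.
    """
    a, b = set(collocates_a), set(collocates_b)
    shared = a & b
    union = a | b
    jaccard = len(shared) / len(union) if union else 0.0
    return shared, jaccard
```

For example, comparing the invented lists `["violence", "disturbance", "police", "women"]` and `["inferior", "educated", "employed", "women"]` gives one shared word out of seven distinct words.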

Also interestingly, the categories I came up with don’t work for this corpus. While direct action was a prominent category for the Suffrage corpus, I don’t think I can find a single direct action term in the LttE MI data. Not even things like demonstration, which is pretty innocuous as far as direct action goes.

So what’s going on here? At least part of it is due to the different functions of news reports and what are essentially open letters. But I think there’s also a difference in who was writing the letters. Letters to the Editor offered both suffrage campaigners and anti-suffrage campaigners an opportunity to represent their views themselves, rather than being represented by or mediated through a reporter, editor and others engaged in the production of a news report. I don’t think it’s that strange that the language they use and avoid is different.

Five (plus two) questions from Sophie

Sophie Duncan at Clamorous Voice thought it would be interesting to bring the five question meme to our academic or otherwise real-life blogs. She describes it as a “creative nonfiction thing…little snapshots of what’s going on with people” and well, how could I refuse an offer like that? So here goes, and if you would like five questions from me, comment and ask!

What would you like to ask Christabel Pankhurst?
I always get a bit nervous about “what would you ask [famous person]?” questions because I’m worried that I’ll be like I am in real life and gaze worriedly at them, realise I have no intelligent question or, indeed, response and blurt out something about paneer. So this takes place in an alternate universe where I a) can time-travel and b) am not totally useless at talking to people and c) am cool.

At first I’d probably try to start off with vaguely academic questions, like her thoughts on direct action and how she’d gauge its success, what her intentions were in founding the WSPU and how these changed over time, her thoughts on the role of male suffragists, how she felt about the portrayal of the suffragist movement in the press and so on. And then I’d probably get increasingly nosy about the intra-suffrage movement tensions, so really, tell me exactly how you feel about the NUWSS, and what really happened with the Pethick-Lawrences, and why did you choose to base the WSPU on a military organisation and whose idea was that and ooh, syphilis and white slavery. And then either ask her about falling out with her sister, Sylvia Pankhurst, or possibly present her with a cuddly syphilis. Either way, it would go magnificently.

Sue Perkins or Sandi Toksvig? [This is probably the most important question I’ll ask anyone, nota bene]
I really admire Sandi Toksvig’s knowledge on such a wide range of subjects, how she’s a ferociously intelligent and respected older female broadcaster, presenter and entertainer when there are so few on TV and radio, and how she’s fought discrimination against her and her family due to her sexuality. On the other hand, Sue Perkins is one of the few comedians who can make me laugh and laugh (I saw Mitchell and Webb live and fell asleep, true story), and while she’s self-deprecating she’s also whip-smart and passionate about the arts. On balance I’d say that Sue Perkins is ahead by a whisker, but that’s due to her commitment to empirical research as demonstrated on The Supersizers go….

What is corpus linguistics?
Very, very basically, it involves collecting together machine-readable texts and using a computer program to look for patterns in them. The patterns you look for might be whether a word prefers or avoids other words (collocation), has a certain grammatical function (colligation), is associated with a specific semantic field (semantic preference) or is associated with a set of words or phrases which can reveal (hidden) attitudes (discourse prosody). Some people work with massive corpora, like the Bank of English, and some people work with very small corpora of tens of thousands of words. Some people treat it as a sub-discipline in itself while others treat it as a methodology. As such, there’s tremendous variation in what corpus linguistics is, and the answer you get depends on who you ask.
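
To make "using a computer program to look for patterns" a bit more concrete, here is a toy sketch of the first step in a collocation analysis: counting which words turn up within a few words of a node word. The tokeniser is deliberately crude (lowercase letters and straight apostrophes only), and the function name and default span are my own choices for illustration rather than any particular tool’s behaviour.

```python
import re
from collections import Counter


def collocates(text, node, span=4):
    """Count the words found within `span` tokens either side of `node`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            left = tokens[max(0, i - span):i]
            right = tokens[i + 1:i + 1 + span]
            counts.update(left + right)
    return counts
```

Raw counts like these are only the starting point; a measure such as Mutual Information is then used to separate words that genuinely prefer each other from words that are simply frequent everywhere.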

How and where do you see yourself teaching, in the post-apocalyptic maelstrom of the Higher Ed future?
It’s hard to say. I’m troubled by the attitude that universities are profit-making service providers and students are consumers; I believe it fundamentally changes the relationship between teacher and student. On the other hand, the networks and resources you find in universities are valuable and it’s hard to create them from scratch. The answer is that I’m really not sure; I’d like to do some teaching within the university system, but I’d also like to work with groups outside it – school and college groups, activists, the public and others.

What’re your own newspaper & magazine reading habits?
Being a bit of a cheapskate, it depends if I’m buying them or not. I sometimes buy Diva if I’m faced with a long train journey, but other than that I tend to do most of my reading online. However, if there are magazines or newspapers lying around, I’ll probably read them – National Geographic, New Scientist, the Metro, I’m not particularly fussy. I am also likely to pounce on people’s copies of trashy magazines, especially if they have dodgy real life stories (e.g. I made my mum-in-law out of toast). I probably won’t read the Daily Mail though – I do have some standards.

What’s the best thing about your life right now?
Right now? Possibly the cherry tomatoes, courgette and garlic roasting in the oven that I’m going to make something with for my dinner. It’s a beautiful sunny evening, my window’s open, and I can hear birdsong and collared doves cooing. It’s not the life I thought I was letting myself in for when I first started my PhD at Liverpool, but I’m trying to make the best of it.

What do your mornings look like?
Best avoided.

And now, questions for her!

  • Do you try to get distance from your PhD, and what form does that take?
  • What’s the arts organisation that doesn’t exist, but you really really wish it did?
  • Let’s imagine that you have the chance to go back in time and interact (talk, get drunk with, slap, etc) with any historical figure. They’ll then conveniently bang their head and forget they ever met you. Who would you pick and what would you do with/to them?
  • How has blogging influenced or affected your PhD?
  • What are you most looking forward to?

Comment if you’d like some questions from me.

    Press understanding of the black bloc

    On Saturday, over 500,000 people took part in the March for the Alternative. The Guardian live-blogged it (first part, second part) and for the majority, it was a peaceful and diverse march.

    At some point, some protesters seem to have headed to Oxford Street to engage in some direct action, namely occupying Fortnum & Mason (and were duly kettled upon leaving, having been told they’d be free to leave the area), and in the late evening a large group gathered at Trafalgar Square, apparently to rest, catch up, swap news and so on. At this point something happened, and the police responded by kettling them. People’s experiences could be very different depending on where they were and when – one person was baton charged by the police, Laurie Penny was caught in the Trafalgar Square kettle, this young blogger found himself protecting a girl whose arm was broken by the police in the Trafalgar Square kettle and Katie writes about the march and Trafalgar Square and the aftermath as a St John Ambulance first aider.

    The reaction from the conservative press was predictable but again, people were anxious to distance themselves from those not participating in the march and engaging in different forms of direct action.

    Johann Hari:

    Shame on the media for focusing on a few idiots from yesterday not the inspiring 500,000, and shame on the idiots for giving them the excuse (source)

    They were Black Block, who are entirely different people (and twats) (source)

    Charlie Brooker:

    Confusing these twats with the hundreds of thousands of actual protesters = mistaking football hooligans for footballers. (source)

    La Sophielle has some interesting stuff to say on the distinction between “good” protesters and “bad” protesters:

    All those news outlets with their talk of “splinter groups”, “mobs”, “maelstroms of violence”, “violent minorities” and “masked thugs” who “hijack” things – and don’t forget the bafflingly recurrent remark that those responsible “used Twitter to coordinate actions and cause trouble” – all these news outlets actually don’t care to differentiate between various expressions of political resistance, whatever they may say to the contrary. Protestors come in ‘nice’ or ‘black’ – full stop. I don’t resent this because I resent UK Uncut being “smeared” or lumped in with the black bloc. I resent this because it means that inane dichotomies (legitimate/illegitimate, nice/nasty, peaceful/violent) are shored up in the name of reporting, which in fact serve nothing at all except sensation. (source)

    Aside from the debate about acceptable and unacceptable forms of protest which is probably as old as protest itself, I find it really interesting how the term “black bloc” is used. I understand it as a tactic (as this FAQ explains): a black bloc is a temporary gathering of people with different ideologies and aims working together for the duration of a march etc. Wearing similar clothes promotes solidarity, is highly visible and hinders identification, particularly by Forward Intelligence Teams. What it is not, however, is an organisation. To my knowledge, there is no black bloc membership list. There is no black bloc committee. It forms on the ground, and dissolves afterwards. The individuals involved might have connections to each other, but the black bloc itself is not the organisation that they belong to.

    As a linguist, what I find interesting are the different ways the black bloc is discussed in this current round of articles. Not so much the evaluative stance, but the concept of the black bloc itself. This term is not being disputed in the press – instead, it seems to be misunderstood and the misunderstanding apparently goes unchallenged. I suspect there’s a power dynamic in that those most likely to participate in a black bloc and understand it are not likely to have a powerful voice in the press; the people writing about the black bloc in the newspapers are unlikely to be the ones with direct experience of it. And so “Black Bloc(k)” seems to become an identity rather than a tactic.

    It makes me wonder how prevalent this is, both diachronically and across domains. Is this a fairly standard feature of mainstream press discourse about the black bloc? Is it something more recent – was the black bloc discussed differently in the 1990s/early 2000s/mid-2000s to now? Is the black bloc understood differently when taking part in different kinds of protest e.g. anti-war, environmental, anti-cuts (even if these issues are often closely connected)? Has the term become more widespread, or used more frequently?

    This is the kind of research that lends itself to corpus research methodologies – focusing on a limited number of terms where a) the term is crucial to identifying the group being discussed and b) the term itself is what’s interesting. There may well be instances of “protesters dressed in black” and so on, but I’m not convinced that identifies the protesters explicitly enough to know that it’s a black bloc being discussed. Because the black bloc itself is a somewhat nebulous concept – its power lies in its lack of organisation and definition – it becomes a site for projection. Do you want the black bloc to be full of violent hooligans, justifiably angry disenfranchised working class kids, rentamob thugs? Again, this seems more about identity than discussing the black bloc as a tactic.

    If I didn’t have a conference paper to write I’d be creating a custom corpus with WebBootCaT, but the paper must take precedence. The custom corpus will have to wait a couple of weeks.

    My tools

    Recently I’ve been going to bike maintenance workshops. It’s been an interesting and often satisfying experience. There’s the comfort of knowing how to check your own bike for damage, how to mend punctures and how to be a more self-sufficient cyclist. There’s also a sense of satisfaction about doing something that has physical results, something that gets you covered in bike grease and dirt, something that requires you to work with your hands as well as your mind.

    There seem to be a few books out discussing the issue of working with one’s hands and the dangers of office work. It’s hard to write about this without romanticising manual work from a smug, clean fingernailed, academic stance, from the privileged stance of this being a choice for me, of this being as unnatural for me as Marie Antoinette playing milkmaid at her hameau.

    However, for me at least, there’s also a sense of responsibility; if I use something every day I should be able to understand how some aspects of it work, be able to fix some problems, know when I’m out of my depth. It gives me a better understanding of my tool’s capabilities and limitations. Does it make me a better corpus linguist? In some ways, yes – I built my desktop myself to a specification I designed for corpus linguistics. It’s skewed towards processing power and data storage, and light on the graphics. In other ways, not yet. While a computer is a bit more than a magic box for me, there’s still so much I don’t know about what it can do and how it works. The more sensitive I become to how computers work, what they’re capable of, what they are incapable of, and the implications these have for investigating language use…well, it can only be a good thing, right?

    Bit of a ramble I’m afraid, I’ve caught some horrible student lurgy and am feeling a bit fuzzy-headed.

    swearing with Google ngrams

    Back, after an unwelcome hiatus. I’ve learnt my lesson though, and will be backing my wordpress database up. Regularly.

    Anyway, being a linguist of the sweary variety, I was intrigued to see someone on Twitter use Google Labs’ Ngram Viewer to look at cunt and express surprise and delight that cunt was being used so frequently rather earlier than expected.

    I thought the graph looked interesting.  The frequency of cunt was rather erratic: an isolated big peak in around 1625-35; an isolated smaller peak in around 1675; peaks in 1690ish and 1705ish; a rather spiky presence between 1705 and 1800; then fairly consistently low frequency until around 1950 when its frequency increases again.

    This seemed puzzling – rather than being fairly low-level but present, there were these huge spikes in the 17th century.  I decided to have a look at the texts themselves.  These turned out to be in Latin, and the following image rather neatly illustrates the two different meanings at work here:

    The books themselves seem to be religious texts written in Latin, even if Google’s ever-helpful advertising algorithm seems to interpret things rather differently. As you can see in the first image, I selected texts from the English corpus. It’s possible that the books are assigned a corpus based on their place of publication, but it’s not very intuitive.

    I took a closer look at the texts to try and work out what was going on.  Some of the texts were in Latin, as this example taken from De paradiso voluptatis quem scriptura sacra Genesis secundo et tertio capite:

    However, this was not the only issue.  I found at least one example of a musical score – this example taken from Liber primus motectorum quatuor vocibus:

    Here, the full lexical item is benedicunt. In both of these examples, cunt is not a full lexical item; I can understand why the layout of the score might have led to it being parsed as a separate item, but I’m a bit confused why the same seems to have happened with dicunt.

    The high frequency of cunt can also be attributed to Optical Character Recognition (OCR). Basically, the text is scanned and a computer program tries to convert the images into text.  This has varying degrees of accuracy – it can be very good, but things like size and font of print, the paper it was printed on and age of the texts all have an effect. The text obtained through scanning with OCR is then linked to the image.

    This example, taken from Incogniti clariss. olim theologi Michaelis Aygnani Carmelitarum Generalis, is probably familiar to those working with OCR scanned texts. The text actually reads cont. but the OCR has read it as cunt. The search program can’t read the image files; all it has to go on are OCR scanned texts. When these aren’t accurate, you get results like these.

    I think Google ngram is interesting, but with some caveats. Corpora can be tiny – the researcher can have read every single text in their corpus and know it inside out. Corpora can be large and highly structured, like the British National Corpus. Corpora can be large and the researcher doesn’t need to have read every single text contained in them, but through careful compilation the researcher knows where the texts have come from, where they were published and so on – for example, corpora assembled through LexisNexis. This is a bit different – it’s not really clear what’s even in the collection of texts and the researcher has to trust that Google has put the right texts in the right language section. I’ve seen Google ngrams being used to gauge relative frequencies of two or more phrases, but for now I think I’ll stick to more traditional corpora for most in-depth work.

    Mark Davies also has a post comparing the Corpus of Historical American English with Google Books/Culturomics. His post is in-depth, interesting and systematic; I just swear a lot. You should probably read his.

    it lives! it liiiiiives!

    I’m currently working on something that seems to have mutated out of a chapter. No, wait, that’s a rubbish description.

    In Chapter 4 I examined words derived from Mutual Information for the words suffragette, suffragettes, suffragist and suffragists. Between the historical research and my data-driven categories, I identified the following categories: constitutionalist vs militant, class, geography, gender/gender roles, origins, direct action, legal and prison, proper names, organisational, politics and opposition. I then investigated the direct action category in more detail.

    The terms I looked at (disturbance*, outrage*, violence, crime*, disorder and incident?) were evaluative on a lexico-grammatical level. However, upon reading the texts, I realised that there were other types of evaluation at work in the texts. These were longer and more analytical, operating at the discourse level and could only be discovered by reading the texts. So I read texts. One of the themes that emerged was the tension between organised actions and individual actions. I started planning out Chapter 5, worked out which period to focus on, worked out which articles in that period I was going to analyse, and read a lot of discourse analysis.

    I then started to analyse the articles, only to discover something a bit interesting in the arrangement of texts within the articles. It wasn’t mentioned in the scholarship about historical newspapers I’d read. I’m still searching to see if someone, anyone has researched it. I thought it was interesting though, and talked to my supervisor about it. She encouraged me to explore that idea a bit more; maybe it would be interesting in its own right, maybe it would make another chapter stronger.

    It’s now becoming something that I think makes the link between Chapters 4 and 5 stronger, but which I think ought to be a chapter itself. This is very much data-driven research; I thought I’d be doing some fairly straightforward (critical) discourse analysis and I wasn’t expecting to find something like this, but instead I’ve found something that makes me reconsider the structure of my thesis.

    In a way, I like the chaos. I like having a sense of freedom to explore things, I like being able to say “wow, this is interesting, I should pursue it”, I like getting excited about new things and part of me is going “what, wait, how has no one else discovered this? Am I really the first?” I’m a perfectionist, and I have a hard time committing to something because I’m convinced that if I fussed over it just a little more it would be even better. But at the same time, I’m incredibly conscious of the time constraints and the fact that I need to knuckle down and get this thesis done.

    Does anyone else feel like this about their thesis? How do you decide between sticking solidly to your plan or haring off after something interesting? Am I setting up a false dichotomy here and it’s possible to have a compromise?