Two items.

Feb. 2nd, 2007 02:04 am
aparrish: (Default)
ONE. State of the Union word count analysis done right. The included essay at least pays lip service to the idea that we should "be skeptical of the positivist implications of a statistical analysis of language," and has some other ideas about the iconicity of words in new media etc etc etc. Good software, good read. Does anyone know why Taft said "wool" so much?

aparrish: (nosblech)
I'm here this evening to talk to you about token frequency analyses of the State of the Union. Here are a couple that have been tossed around your interwebs over the past few days:
The ostensible goal of this kind of analysis is to determine, proportionally, what the President is talking about most. I don't think it does that, for the following two reasons.

First: What someone "means," what they're talking about, isn't localized to the individual lexical items in that individual's speech. In other words, there isn't a one-to-one relationship between a word and an act of meaning something. Because of this, token frequency analysis fails to capture what might be important references to a particular thing that don't use the word in question. Bush might say, for example, "the country where our military operations are focused," or even "For the rest of the speech, I will be referring to 'Iraq' with the word 'Smoo.'" Even though Bush is obviously talking about Iraq, the word count for the token "Iraq" wouldn't get incremented in those instances.

Moreover, some words might have their count incremented by uses that don't align with the meaning you'd expect. A President might say, "We should worry about the welfare of all our citizens" or "Now that I think about it, 'homeland security' is a pretty dumb phrase." Words like "welfare" and "homeland security" would get bigger in the tag cloud, even though their meaning in context doesn't line up with what the word out of context seems to mean.
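To make both failure modes concrete, here's a minimal sketch in Python of what a token frequency count does. The tokenizer (lowercase everything, treat each run of letters as a token) and the sample sentences are my own assumptions; the sentences are adapted from the made-up examples above, not quoted from any actual speech, and the real tools surely differ in their details.

```python
import re
from collections import Counter

def token_counts(text):
    """Lowercase the text and count each run of letters (apostrophes allowed) as one token."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

# Invented sample text, adapted from the hypothetical examples in this post.
speech = (
    "The country where our military operations are focused needs our help. "
    "We should worry about the welfare of all our citizens."
)

counts = token_counts(speech)
print(counts["iraq"])     # 0, although the first sentence is about Iraq
print(counts["welfare"])  # 1, although nobody is talking about welfare programs
```

The counter only ever sees strings. It has no way to know that the first sentence is "about" Iraq, or that the second one isn't "about" welfare in the sense a tag cloud reader would assume.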

Our intuition about words is that they have a definition in the dictionary, and that's what they mean, regardless of context (except, maybe, in special circumstances). Token frequency analysis counts on this intuition being true: that a use of a word will, more often than not, correspond with its dictionary definition. This assumption works well for glib analysis, but I don't see any empirical reason to believe it. (In fact, it doesn't seem like a question particularly open to empirical investigation.)

The second reason is that token frequency analyses like the ones above arbitrarily reject certain words. The NYT analysis doesn't allow you to search for words with fewer than three letters; the historical tag cloud page says that its algorithm "removes the most common words like 'the', 'and', 'this', 'that' and some not so common language-specific words like 'hitherto', and 'notwithstanding.'" I can't see a good reason for doing this. If saying the word "Iraq" a lot means that the President is talking about Iraq a lot, why doesn't it follow that saying "and" a lot means that the President likes to conjoin phrases? That saying "the" a lot means that the President likes to pick out one salient referent among many?
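Here's a rough sketch of how much that filtering step changes the picture. The stop list and the three-letter cutoff below are hypothetical stand-ins modeled loosely on the two rules just described; I don't know how either site actually implements them, and the sample sentence is invented.

```python
import re
from collections import Counter

# Hypothetical stop list and length cutoff, loosely based on the rules quoted above.
STOP_WORDS = {"the", "and", "this", "that", "hitherto", "notwithstanding"}
MIN_LENGTH = 3

def count_tokens(text, filtered=True):
    """Count tokens, optionally dropping stop words and very short words first."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if filtered:
        tokens = [t for t in tokens if len(t) >= MIN_LENGTH and t not in STOP_WORDS]
    return Counter(tokens)

sample = "The war in Iraq and the economy and the troops in Iraq"

raw = count_tokens(sample, filtered=False)
kept = count_tokens(sample)

print(raw["the"], raw["and"], raw["iraq"])     # 3 2 2
print(kept["the"], kept["and"], kept["iraq"])  # 0 0 2
```

Unfiltered, "the" is the most frequent token in the sample. Filtered, it vanishes and "Iraq" takes first place. Which words go on the stop list is a decision somebody made before the counting started.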

Geoffrey Pullum at Language Log makes another good point, which is that a token frequency analysis needs to decide whether to count word forms or lexemes. Counting word forms means that you'd have separate counts for bless, blessing, and blesses; a lexeme count would lump all of these together. Both methods have problems: a word form count misses generalizations among words, while a lexeme count lumps together words that might have significant semantic differences in context (Pullum's example is insure and insurance).
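The difference between the two counting schemes is easy to sketch. The form-to-lexeme table below is hand-built and hypothetical, just enough to cover the words mentioned above; a real analysis would need a proper lemmatizer, and somebody would still have to decide which forms belong together.

```python
from collections import Counter

# Hand-built, hypothetical table mapping word forms to lexemes.
LEXEME_OF = {
    "bless": "bless", "blessing": "bless", "blesses": "bless",
    "insure": "insure", "insurance": "insure",
}

# Invented token list, for illustration only.
tokens = ["bless", "blessing", "blesses", "insure", "insurance", "insurance"]

word_form_counts = Counter(tokens)
lexeme_counts = Counter(LEXEME_OF.get(t, t) for t in tokens)

print(word_form_counts["blessing"])  # 1
print(lexeme_counts["bless"])        # 3
print(lexeme_counts["insure"])       # 3, with both uses of "insurance" folded in
```

The word form count scatters bless across three entries; the lexeme count makes "insure" look three times as prominent as it is, because every mention of insurance has been folded into it.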

In conclusion, here are funny videos of talking dogs and hamsters or something doing backflips.
aparrish: (Default)
It has come to my attention that my blog posts recently have all been about video games. My good friend Josh plays a mean A Boy and His Blob, but now claims that this blog is, from his standpoint, "written in code." So the next few posts are geared specifically for you, Josh. I'll try to keep the video game chatter in check.

First up: A surprisingly readable excerpt chapter from The Oxford Introduction to Proto-Indo-European and The Proto-Indo-European World. This is the chapter about Indo-European Fauna. I read it, enjoyed it, and made some notes below.

The beginning of the chapter expresses some suspicion about historical linguistics' ability to help us understand Proto-Indo-European anthropology (given that prehistoric societies aren't in the habit of keeping around names and meanings of words "... for thousands of years as a gesture of benevolence to future historical linguists"). This is a justifiable suspicion, I think, which nevertheless fails to hinder the irresistible speculation that creeps up throughout the rest of the chapter: Did the Proto-Indo-Europeans selectively breed dogs? Maybe not, since we can't reconstruct names for particular breeds. Could the PIE homeland have been in Asia? A reconstruction of *(y)ebh- 'elephant' (from Latin ebus, Sanskrit ibha-) says yes, but couldn't these have been coincidental early borrowings from Egyptian 3bw? We can reconstruct many PIE words for domesticated animals, but few for wild mammals, birds, and fish; for Proto-Uralic, the situation is the exact opposite. Can we use this to support the theory that PIEs had a neolithic economy, and the Proto-Uralics were hunter-gatherers?

Aside from these (fascinating) theories, the chapter is made up of brief histories of words, paragraphs that give a PIE root and its incarnations in attested languages (sometimes ancient, sometimes contemporary). This is the part that I love. The PIE word for bird, *haewei-, becomes Welsh hwyad 'duck', Latin avis 'bird', Albanian vida 'dove', and Greek aietós 'eagle.' The PIE verb *bhei(hx)- 'strike' gives cognates for 'stinging insect': Old Irish bech, Old Church Slavonic bĭčela, Lithuanian bìtė and of course Modern English bee. These are demonstrations of the expressive potential for language, its strange malleability and its persistent familiarity. They're like stories, like grand family histories with poignant (and sometimes unexpected) endings. The effect is hypnotic and I wouldn't mind reading a book entirely composed of these.

Oh, and it's chock-full of nebbish humor (nebbish, from PIE *nebh- 'pasty', cf. Phrygian nihe '(male) virgin', and possibly Ligurian enef 'cuckold'1) and good-natured pedantry. PIE ducks didn't go 'quack quack,' they went 'pad pad.' Swine are notoriously difficult to herd over long distances! Oh, and just so you know, snakes were absent from Ireland even before St. Patrick. I have to admit that this is what I miss most about my linguistics education: those few moments in Ling 100 when Professor Holland (particularly nebbish himself) was able to let his enthusiasm for these obscure tidbits bubble to the surface (along with an unfortunate surfeit of perspiration). This would happen whenever he could break free of the cogsci students, forced to take the class as a requirement, who would whine about his unenthusiastic explanations of generative grammar. I ask you: What man could give a damn about traces and X-bars when in possession of the secret knowledge that cow, bœuf, Latvian guovs and Sanskrit gáu are all cognate (PIE *gwous)?

1 This is an imaginary etymology.
aparrish: (Default)
1. Philip Pullman's His Dark Materials trilogy. I finished it a few days ago. It left me heartbroken. The atheism of the series will most likely be bowdlerized in the forthcoming film, which is a shame. These are three of the most spiritual books I've ever read.

2. Mathematics Elsewhere by Marcia Ascher. Ascher only makes a half-assed attempt to appeal to lay readers, so I had to skip over most of the technical stuff. Still, this book is jet fuel for the anthropological imagination. I wish someone had told me in high school that mathematics could be like this, that it can be about people, not just charts and graphs. (Read the review Piman wrote a few months back, which is how the book ended up on my wishlist in the first place.)

3. L'été meurtrier ("The Killer Summer" or something) by Sebastien Japrisot. It is a testament to Japrisot's skill that I didn't just give up after a hundred pages or so - reading what amounts to a thriller in French can be frustrating when you have to look up two new words per paragraph. Nevertheless, a fun read. Un long dimanche de fiançailles is next in the Japrisot parade - I'm anxious to see which of his baffling narrative techniques they had to flatten in order to bring the book to the big screen. (I did like the movie quite a bit, btw.)

4. Apparently, Chomsky and his Minimalist Program friends now think that the only human capacity specific to language is recursion. In this paper (warning: MS Word document, slightly technical but lots of fun), Steven Pinker and Ray Jackendoff offer a convincing rebuttal. I'm not a big Pinker fan, but it is satisfying to see him so soundly refute the gross simplifications of his mentor.

Page generated Sep. 19th, 2017 06:56 pm