According to the study, Success with Style: Using Writing Style to Predict the Success of Novels, by Stony Brook University’s Vikas Ganjigunte Ashok, Song Feng and Yejin Choi, whether or not a book will sell can be determined by several quantifiable factors.
The researchers downloaded classic literature from the Project Gutenberg archive, added more recent award-winning novels, and analyzed low-ranking books on Amazon, covering genres from science fiction to classic literature and even poetry.
Successful books utilized a high percentage of nouns, adjectives, pronouns, determiners, conjunctions, and prepositions. The researchers found that successful books made greater use of conjunctions that join sentences (“and” or “but”) and of prepositions than less successful books did.
Less successful books had a higher percentage of verbs, adverbs, and foreign words. Such books also relied heavily on clichés and on extreme and negative words. Less successful books also relied on dull verbs that describe direct action, such as “took,” “promised” and “cried,” while more successful books used more verbs that describe thought processing, such as “recognized” and “remembered.”
The least-known books describe actions and emotions while, conversely, the most renowned have a vocabulary associated with reflection, thought and memories.
What the researchers found but didn’t identify by name was lexical density, which contributes to books that readers consistently find engaging, whether books written by masters of the past or by current best-selling authors.
But What Is Lexical Density?
Lexical density is defined as the number of lexical words (or content words) divided by the total number of words. Lexical words give a text its meaning and provide information regarding what the text is about. More precisely, lexical words are simply nouns, adjectives, verbs, and adverbs.
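To make the definition concrete, here is a toy calculator for this content-word version of lexical density. A real analysis would use a part-of-speech tagger to find the nouns, adjectives, verbs, and adverbs; as a rough sketch, this example instead treats any word *not* on a short, hand-made function-word list as a content word, so the list and the sample sentence are my own illustrative assumptions, not part of the study.

```python
# Toy lexical-density calculator: content words / total words.
# Approximation: anything not on a small hand-made function-word
# list counts as a content word (a real analysis would POS-tag).
FUNCTION_WORDS = {
    "a", "an", "the", "and", "but", "or", "of", "to", "in", "on", "at",
    "is", "was", "it", "he", "she", "they", "that", "this", "with", "as",
}

def lexical_density(text):
    # Lowercase and strip surrounding punctuation from each word.
    words = [w.strip(".,;:!?\"'").lower() for w in text.split()]
    words = [w for w in words if w]
    content = [w for w in words if w not in FUNCTION_WORDS]
    return len(content) / len(words)

sample = "She remembered the house and the long road that led to it."
print(round(lexical_density(sample), 2))  # prints 0.42 (5 content words / 12 total)
```

Note how a single reflective verb like “remembered” counts toward density exactly the same as a concrete noun; the measure sees only content versus function words, not which content words you chose.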
And Grammarly says this:
Lexical Density is a term used in text analysis. It measures the ratio of content words to grammatical words. Content words are nouns, adjectives, most verbs, and most adverbs. Grammatical (sometimes called functional) words are pronouns, prepositions, conjunctions, auxiliary verbs, some adverbs, determiners, and interjections.
Lexical density also considers the number of unique words. If you’ve re-used words, you’ve reduced your lexical density.
Well, What Is A Common Lexical Density?
Analyze My Writing says:
Fiction on average tends to score between 49% and 51%. The reader may verify this by trying this experiment.
More general prose tends to have slightly lower lexical densities, between 48% and 50%, as observed in this experiment.
However another website says:
Unfortunately, there is no reference for lexical density as such. It is a well-known measure of lexical variation which is used in many linguistic analyses. If you search the internet for ‘lexical density’ you will find several of these. I do not know who was the first person to use a measure of lexical density in a study but it is now well-known and, as it is in the public domain, no one really references its use anymore in articles, reports, and so on.
As we will see, the lexical density of some current bestsellers differs considerably from the 48% to 51% range tagged for lexical density. I suspect this is because the amount of text used for these calculations is quite small, while an analysis of a whole piece of fiction would show a smaller percentage of lexical density because of the addition of dialogue, which tends to rely more on those elements, verbs and adverbs.
Case in point: I started my comparisons with Grisham’s Gray Mountain, using the first twenty-five hundred words of text, which revealed a lexical density of 37%. However, with the other books I chose the first thousand words, so I figured I’d have to compare apples to apples. Sure enough, the lexical density of the first thousand words of Gray Mountain shot up to 48%. Then I checked the samples used by the Analyze My Writing website and saw they were using either the first twenty sentences or the first paragraph for analysis, where they report lexical densities of 48% to 51%.
This makes sense. One parameter of lexical density is the number of words that are used more than once. The larger the writing sample, the more often words are reused. Smaller word samples mean fewer repeated words, hence the lexical densities of 48% to 51% that they report.
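The sample-size effect described above is easy to demonstrate. The sketch below builds a synthetic “text” whose word frequencies roughly follow a Zipf-like pattern (common words far more frequent than rare ones, as in real prose), then computes the unique-words-to-total-words ratio for growing prefixes. The vocabulary, weights, and sample sizes are all invented for the illustration; the point is only the downward trend.

```python
# Demonstration: the unique-words / total-words ratio falls as the
# sample grows, because frequent words keep repeating.
import random

random.seed(42)
vocab = [f"word{i}" for i in range(500)]
# Zipf-like weights: word0 is much more likely than word499.
weights = [1 / (i + 1) for i in range(500)]
text = random.choices(vocab, weights=weights, k=5000)

def unique_ratio(words):
    return len(set(words)) / len(words)

for n in (100, 1000, 5000):
    print(n, round(unique_ratio(text[:n]), 2))
```

Running this shows the ratio dropping steadily from the 100-word prefix to the full 5,000 words, which matches what I saw moving from a 1,000-word sample of Gray Mountain to a 2,500-word one.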
Where Did This Sucker Come From Anyway?
Researching the exact whys and wherefores of Lexical Density runs a little far afield from the original intent of this post. There are a few clues out there on the web, but I’m not going to spend a week figuring out the intricacies of the history of the inception of Lexical Density.
From what I can find out, Lexical Density was a term coined in 1971 by a J. Ure, who wrote a paper called Lexical density: a computational technique and some findings, published in M. Coulthard (Ed.), Talking about Text (pp. 27-48). Birmingham: English Language Research, University of Birmingham. This paper is referenced multiple times in different research papers. However, in this context Lexical Density was used to measure how English as a Second Language students used the English language in their daily speech. Along the way, someone (and no, I don’t know who) grabbed onto Lexical Density and applied it as a measurement of readability, so much so that we have one website reporting this definition of Lexical Density:
The Lexical Density Test is a Readability Test designed to show how easy or difficult a text is to read. The Lexical Density Test uses the following formula:
Lexical Density = (Number of different words / Total number of words) x 100
The lexical density of a text tries to measure the proportion of the content (lexical) words over the total words. Texts with a lower density are more easily understood.
As a guide, lexically dense text has a lexical density of around 60-70% and those which are not dense have a lower lexical density measure of around 40-50%.
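Notice that the formula this website quotes is actually a unique-words ratio, not the content-words ratio defined earlier. A minimal sketch of that quoted formula, with a sample sentence of my own choosing:

```python
# The website's quoted formula:
# Lexical Density = (number of different words / total number of words) x 100
def lexical_density_test(text):
    words = [w.strip(".,;:!?\"'").lower() for w in text.split()]
    words = [w for w in words if w]
    return len(set(words)) / len(words) * 100

print(round(lexical_density_test("The cat sat on the mat."), 1))  # prints 83.3
```

In that six-word sentence, “the” appears twice, so there are only five different words: 5 / 6 × 100 ≈ 83.3%. This is the version of the measure where reused words directly lower the score, which matters for the sample-size problem discussed above.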
Personally, I have some problems with the formulas and methods for determining Lexical Density. As we’ve seen, computations are taken from very small samples of larger works. The method of computation may work for spoken English because people don’t vary how they speak very often, but this doesn’t hold true in a piece of prose. Writers may not repeat the pattern of word use established in the first paragraphs. We’ll add dialogue, which has an entirely different lexical density, or simply vary how we use our words. As I’ve found in analyzing different pieces of writers’ work, lexical density goes down the more words you look at, most likely because the repetition of words counts against it: the more you write, the more often you will use certain words. If you analyze a whole piece of work, the results may vary so widely from the first one hundred words that you might doubt it was the same piece of work.
So we have a readability test that morphed into a measurement of writing skill, which, for lack of evidence, would remain merely an interesting discussion, except that the good people at Stony Brook stumbled onto and affirmed a key element of lexical density. That element is what they said in their paper, and it bears repeating:
Successful books utilized a high percentage of nouns, adjectives, pronouns, determiners, conjunctions, and prepositions.
The more dense and complicated the novel, the more likely it is to stand out.
And that is worth looking into. I’ll be writing more on this and other measurements in a couple of other posts, tentatively titled “Lexical Density vs. The Rules of Writing” and “Compare Your Writing to Best Selling Authors.”
Success with Style: Using Writing Style to Predict the Success of Novels, http://aclweb.org/anthology/D/D13/D13-1181.pdf
Photo published under a Creative Commons License issued by Flickr user Nina Jean.