Tuesday 27 October 2015

Quantitative Analysis with Question Marks

The fall break started two days ago, so I finally have the leisure to get back to writing a short Python script. I have been working on this project for a while, but as a newbie I move forward pretty slowly. The script is supposed to analyse any text, but in practice every modification I make to it responds to a problem I run into when I use it to analyse quantitatively the quarto edition of Shakespeare’s Much Ado About Nothing. I wonder whether the script has to be tuned for every text. If so, comparing different texts would be impossible. Pursuing this would lead too far, however, so let me instead mull over one specific problem.

In this post I am going to share one type of insight that I gained while working with the quarto text of Much Ado About Nothing. When running the script I encountered a problem with the hyphens in the text: words divided at the end of a line with a hyphen were counted as two separate words. To overcome this I tried to remove these line-final hyphens automatically, but then I ran into a further problem. Either the machine removed the hyphens but left the two halves of the word apart, which was no good, as they remained two separate strings; or it removed the hyphens and joined the two halves, which was no better, because the two lines holding the halves were then merged as well, and this distorted the number of lines. So in the end I removed the hyphens and joined the words manually so as to avoid merging lines. Doing this by hand had a further benefit: I could decide on an individual basis in which line the word was to be placed.
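For what it is worth, the two failure modes described above can be avoided in code as well. The following is only a minimal sketch, not what I actually did (I joined the words by hand): the function name join_line_end_hyphens is made up for illustration, and the policy of always pulling the second half of the word up into the first line is an assumption, whereas by hand I could choose the line case by case.

```python
def join_line_end_hyphens(lines):
    """Rejoin words split by a line-final hyphen without merging the lines.

    Assumed policy: the second half of the word is moved up to the line
    that ends with the hyphen, so the number of lines never changes.
    """
    result = list(lines)
    for i in range(len(result) - 1):
        current = result[i].rstrip()
        if not current.endswith("-"):
            continue
        following = result[i + 1].lstrip()
        if not following:
            continue  # nothing to join with; leave the line untouched
        # Take the first whitespace-delimited token of the next line.
        parts = following.split(None, 1)
        fragment = parts[0]
        rest = parts[1] if len(parts) > 1 else ""
        result[i] = current[:-1] + fragment  # drop the hyphen, attach the half
        result[i + 1] = rest
    return result


# Made-up sample lines, not a quotation from the quarto.
sample = ["I am a turne-", "coate, my lord", "and so on"]
joined = join_line_end_hyphens(sample)
print(joined)
print(len(sample) == len(joined))
```

One caveat: a genuine compound that happens to break at its own hyphen would lose the hyphen under this policy, which is one more reason why checking the cases by hand is worthwhile.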

While working on this task, which took only about 15 minutes, I noticed that compound words divided with a hyphen appear in mid-line position as well. So what I did next was write a short script that collects all these hyphenated compounds, counts the lines in which they occur, and counts the lines of the play. With these numbers in hand I also calculated the relative frequency of the lines in which compounds appear.
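A script of this kind might look roughly like the sketch below. It is a reconstruction under stated assumptions rather than my exact code: the name collect_hyphenated is made up, a compound is taken to be any whitespace-delimited token with a hyphen between two letters, and blank lines are assumed not to count as lines of the play.

```python
import re

# Assumption: a hyphen standing between two letters marks a compound.
HYPHENATED = re.compile(r"[A-Za-z]-[A-Za-z]")


def collect_hyphenated(lines):
    """Return the hyphenated tokens grouped by line, and the line count."""
    per_line = []
    total_lines = 0
    for line in lines:
        if not line.strip():
            continue  # assumption: blank lines are not lines of the play
        total_lines += 1
        hits = [tok for tok in line.split() if HYPHENATED.search(tok)]
        if hits:
            per_line.append(hits)
    return per_line, total_lines


# Made-up sample lines, not quotations from the quarto.
sample = [
    "he is a turne-coate, sir",
    "no compound in this line",
    "",
    "Ote-cake and Sea-cole, sir",
]
per_line, total_lines = collect_hyphenated(sample)
print(per_line)                      # one inner list per line with compounds
print(len(per_line), total_lines)    # lines with compounds, lines in total
print(len(per_line) / total_lines)   # relative frequency of such lines
```

Because the tokens are split on whitespace only, punctuation stays attached to the words, so each line with compounds yields a small list of raw tokens.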

The hyphenated compound words in the quarto edition of Much Ado About Nothing, in order of appearance, are the following:

['turne-coate,'], ['Hare-finder,'], ['Ballad-makers'], ['warre-thoughts,'], ['ouer-heard'], ['March-chicke,'], ['start-vp'], ["heart-burn'd"], ['mid-way'], ['ouer-masterd'], ['day-light.'], ['Schoole-boy,'], ['ouer-ioyed'], ['tooth-picker'], ['sun-burnt,'], ['working-daies,'], ['loue-gods,'], ['kid-foxe'], ['night-rauen,'], ['out-rage'], ['ouer-heardst'], ['hony-suckles'], ['heare-say:'], ['wood-bine'], ['bow-string,'], ['hang-man'], ['tooth-ach.'], ['tooth-ach.'], ['Dutch-man'], ['French-man'], ['lute-string,'], ['tooth-ake,'], ['hobby-horses'], ['Ote-cake', 'Sea-cole,'], ['Sea-cole.'], ['Hot-blouds,'], ['worm-eaten'], ['cod-peece'], ['gentle-woman,'], ['night-gown'], ['Sea-cole,'], ['eie-liddes'], ['ouer-whelmd'], ['candle-wasters:'], ['tooth-ake'], ['milke-sops.'], ['out-facing,', 'fashion-monging'], ['trans-shape'], ['vnder-neath,'], ['gossep-like'], ['Lacke-beard,'], ['grey-hounds'], ['carpet-mongers,'], ['witte-crackers'].

It seems that out of the 2589 lines of the play, hyphenated compounds appear in 54 lines, and two of these lines contain two compounds each, so altogether there are 56 hyphenated compound words in the text. The relative frequency of the lines in which there are hyphenated compounds is 0.0208574739282 (54/2589, about 2.1%). Furthermore, as there are 22,171 words in the text, the relative frequency of hyphenated compound words in the text is 0.00252582201976 (56/22,171, about 0.25%).
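The arithmetic behind these two figures is simple enough to check in a few lines. This is just a sketch that starts from the counts reported above; the variable names are made up for illustration.

```python
# Counts taken from the paragraph above.
total_lines = 2589
lines_with_compounds = 54
total_words = 22171

# 52 lines hold one compound each and 2 lines hold two each.
compounds = 52 * 1 + 2 * 2

line_freq = lines_with_compounds / total_lines
word_freq = compounds / total_words

print(compounds)             # 56
print(round(line_freq, 6))   # 0.020857
print(round(word_freq, 6))   # 0.002526
```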


Now why are these numbers important? Their significance can only be gauged by comparing them with those of another text, or of other texts, because only then may a pattern emerge. But what kind of texts should the play be compared and contrasted with? Those of Shakespeare? Or those of the printer? If Shakespeare’s, then only the quarto editions, as these are close in time? Or all the early prints, i.e. the First and Second Folios as well, as books of the same period? Or only those early printed editions that go back to some form of a manuscript, as Much Ado About Nothing does, because these may reveal something about Shakespeare? Or only those that were published by Andrew Wise and William Aspley, as they were the publishers of the quarto edition of the play, or those that were printed by Valentine Simmes, as it was his employees who, in the final analysis, created the printed text? Or do these features in reality have nothing to do with Shakespeare but rather with the publishers, i.e. Wise and Aspley, or the printer, i.e. Simmes, so that they should be compared only with books that one of these parties produced, not necessarily authored by Shakespeare, as these are the people responsible for the text that we can witness nowadays? In other words, is this statistical analysis related more to studying the history of the book, or the history of spelling, than to studying Shakespeare? Answering these questions might be unavoidable when looking for texts to compare the quarto of Much Ado About Nothing with.