Published on 15 September 2018
random data analysis and visualization :)
I recently saw the news that one of the biggest scientific grant agencies, the Wellcome Trust, is sharing details of eligible applications for its Open Research Fund (ORF). As I am in the process of writing various postdoctoral fellowship applications, I got curious about how other researchers write their proposals. I read a few of these proposals and wondered whether such diverse proposals had anything in common. Some similarity is to be expected, because they were all written for a common theme. Nonetheless, I started analyzing the proposals to look for it. The quick-and-dirty way to check similarity is to look at the grammar of the proposals and the kinds of words they use.
To check whether the grammar analysis was doing its job, I needed a few controls. In addition to the ORF proposals, I looked for other text sources that could be used in this analysis. I selected two different kinds of literary work: ‘On the Origin of Species’ and ‘To Kill a Mockingbird’.
I also wanted to compare with “Harry Potter and the Philosopher’s Stone”, but there is no free version¹ available for this book. The ORF application data had a total of 87 proposals with a total length of 476381 characters (including spaces), about 5475 per proposal. The proposals contain 838.72 words on average. Just to cross-check this number, I looked up the word limit in the official guidelines: it is 850 words. To keep the three texts comparable in size, I picked only the first 476381 characters of ‘On the Origin of Species’ and ‘To Kill a Mockingbird’. Headers were removed before text processing.
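The size matching described above can be sketched in a few lines of plain Python. This is only an illustration under my own assumptions: the function names and the toy strings are hypothetical, and counting words by splitting on whitespace may not match how the numbers above were obtained.

```python
def corpus_stats(proposals):
    """Return total characters, mean characters and mean words per proposal."""
    total_chars = sum(len(p) for p in proposals)  # spaces are included
    mean_chars = total_chars / len(proposals)
    # Assumption: a "word" is any whitespace-separated token.
    mean_words = sum(len(p.split()) for p in proposals) / len(proposals)
    return total_chars, mean_chars, mean_words


def truncate_to(text, n_chars):
    """Keep only the first n_chars characters of a control text."""
    return text[:n_chars]


# Toy stand-ins for the real texts (hypothetical data).
proposals = ["open data for health research", "new tools for researchers"]
control_text = "species " * 20

total_chars, mean_chars, mean_words = corpus_stats(proposals)
control = truncate_to(control_text, total_chars)
print(total_chars, mean_chars, mean_words, len(control))
```

With the real data, `proposals` would hold the 87 ORF texts and `control_text` the header-free text of each book.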
The next thing I needed was a way to analyze grammar. I could just split all the content into words and check which ones are used most commonly, but that would be the crudest thing one can do. I found a better option: the Natural Language Toolkit (nltk) library² for Python. I used this library to tag each word of my text content. In a nutshell, tagging tells you what kind of grammatical object (adjective, adverb, noun, verb, etc.) a word is in the context of the entire sentence. I used the ‘Universal Part-of-Speech Tagset’ for simplicity. Further details regarding the tagging functions can be found here. The code used in this analysis can be found on GitHub.
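As a rough sketch of what such tagging looks like with nltk (not necessarily the code used for this post), `nltk.word_tokenize` splits the text into tokens and `nltk.pos_tag` with `tagset="universal"` assigns one of the twelve Universal tags to each token. The helper name `tag_text` and the hand-written example pairs are my own illustration.

```python
# The twelve tags of the Universal Part-of-Speech Tagset as used by nltk.
UNIVERSAL_TAGS = {"ADJ", "ADP", "ADV", "CONJ", "DET", "NOUN",
                  "NUM", "PRT", "PRON", "VERB", ".", "X"}


def tag_text(text):
    """Tokenize text and tag every token with a Universal POS tag.

    Assumes nltk and its tokenizer and tagger resources are installed.
    """
    import nltk  # imported here so the rest of the sketch runs without nltk

    tokens = nltk.word_tokenize(text)
    return nltk.pos_tag(tokens, tagset="universal")


# Hand-written example of the (word, tag) pairs such a tagger returns.
example_tagged = [("Open", "ADJ"), ("data", "NOUN"), ("helps", "VERB"),
                  ("other", "ADJ"), ("researchers", "NOUN"), (".", ".")]
print(all(tag in UNIVERSAL_TAGS for _, tag in example_tagged))
```

Before `tag_text` can run, the tokenizer, tagger and tagset-mapping resources have to be fetched once with `nltk.download`; the exact resource names vary between nltk versions.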
I then checked the most common words of each grammatical category in all three texts.
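Given a list of (word, tag) pairs from the tagger, the counting step could look like the sketch below. It uses `collections.Counter`. The function name and the toy data are hypothetical, and lower-casing the words before counting is my assumption.

```python
from collections import Counter


def most_common_words(tagged, tag, n=10):
    """Return the n most frequent words carrying the given POS tag."""
    # Assumption: words are lower-cased so 'Data' and 'data' count together.
    counts = Counter(word.lower() for word, t in tagged if t == tag)
    return counts.most_common(n)


# Toy tagged tokens (hypothetical data, not taken from the proposals).
tagged = [("data", "NOUN"), ("research", "NOUN"), ("data", "NOUN"),
          ("open", "ADJ"), ("Data", "NOUN"), ("new", "ADJ"), ("open", "ADJ")]

top_nouns = most_common_words(tagged, "NOUN", n=2)
top_adjectives = most_common_words(tagged, "ADJ", n=1)
print(top_nouns, top_adjectives)
```

Calling the same function with "ADJ", "NOUN", "VERB", "ADV", "ADP" or "PRON" gives the kind of ranking shown in the figures.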
Fig 1: Most common adjectives used.³
As this was the Open Research Fund, I was expecting to get ‘open’ and ‘new’ among the most used adjectives (Fig 1). I was amused to see that On the Origin of Species has ‘same’ and ‘other’ as its top adjectives. This makes sense considering that Darwin was comparing a lot of observations in his famous book.
The next obvious thing, and probably the most important in this context, is to check the most common nouns.
Fig 2: Most common nouns used.³
As seen in Fig 2, ‘Data’, ‘Research’ and ‘Researchers’ were the most used nouns in the ORF proposals. It looks like everyone likes to concentrate on the ‘data’. ‘health’ and ‘community’ are also near the top, because one of the aims of the ORF is ‘making health research more open’. A few more of the most common nouns are shown in Fig 3. Our controls worked as expected: To Kill a Mockingbird gave ‘jem’, ‘atticus’ and ‘dill’ as its most common nouns, which are the names of a few of the main characters of the book. Furthermore, it was not surprising to see ‘species’ and ‘selection’ as the most common nouns in On the Origin of Species. I am a little surprised to see the word ‘science’ at such a low rank in the ORF proposals (Fig 3).
Fig 3: Top 100 nouns used in ORF applications.³
Fig 4: Top 100 nouns used in On the Origin of Species and To Kill a Mockingbird.³
Next, I checked the most common verbs. It is interesting to see how the tense changes between different kinds of literary work. As we are looking at proposals, we get verbs in the future tense. I was surprised to see the similarity between the proposals and On the Origin of Species.
Fig 5: Most common verbs.³
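The Universal tagset lumps every verb form into a single VERB tag, so a closer look at tense needs finer tags. One possible way, sketched below and not part of the analysis above, is to use the Penn Treebank tags that `nltk.pos_tag` returns by default and group them into coarse verb forms. The grouping, the function name and the toy data are my own assumptions. Since English usually marks the future with a modal such as ‘will’, it shows up here under the MD tag.

```python
from collections import Counter

# Coarse grouping of Penn Treebank verb tags (an illustrative choice).
VERB_FORM = {
    "VBD": "past",
    "VBP": "present",
    "VBZ": "present",
    "MD": "modal",            # will, can, should, ...
    "VB": "base",
    "VBG": "gerund/participle",
    "VBN": "past participle",
}


def verb_form_counts(tagged):
    """Count coarse verb forms in a list of (word, Penn tag) pairs."""
    return Counter(VERB_FORM[t] for _, t in tagged if t in VERB_FORM)


# Toy Penn-tagged tokens (hypothetical data).
tagged = [("We", "PRP"), ("will", "MD"), ("develop", "VB"), ("tools", "NNS"),
          ("Jem", "NNP"), ("ran", "VBD"), ("home", "NN"),
          ("It", "PRP"), ("varies", "VBZ")]

counts = verb_form_counts(tagged)
print(counts)
```

Comparing these counts across the three texts would put a number on the tense differences mentioned above.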
Following are a few more plots showing other grammatical objects and their distributions:
Fig 6: Most common adverbs.³
Fig 7: Most common adpositions.³
Fig 8: Most common pronouns.³
In the end, I can’t say I found any clear similarity between these proposals. However, some similar grammatical structures and words certainly keep popping up. This is a very small data set to conclude anything from. It will serve as a probe for developing a new classifier that should be able to classify an article or other text content into its genre based on the grammatical structure and words used.
Code used in the above analysis can be found here.
(Header image is downloaded from Pixabay.com under CC0-license)