See the following link for instructions on installing the environment needed to run a Python (Jupyter) notebook - http://www.open.edu/openlearn/learn-to-code-installation
Next, import the pandas module
from pandas import *
Read in the 5-gram .csv file (the file extension was changed from the original .txt)
The 5-gram file was taken from http://www.ngrams.info/download_coca.asp
Use sep='\t' because the file is tab-separated
df = read_csv('w5c.csv', sep='\t')
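As an aside: if the original .txt file had no header row, pandas can supply the headings at read time via the names parameter instead of the file being edited by hand. A minimal sketch using an in-memory stand-in for w5c.csv (the column names here are assumptions, not the real headings):

```python
import io
import pandas as pd

# In-memory stand-in for a headerless, tab-separated w5c.csv
# (the column names below are illustrative assumptions)
raw = "123\tthe\tat\n45\ta\tat\n"
df = pd.read_csv(io.StringIO(raw), sep="\t", header=None,
                 names=["freq", "onegram", "posone"])
print(df.columns.tolist())  # ['freq', 'onegram', 'posone']
```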
Check that the table headings have been read in OK (note: I added these headings to the .csv file myself)
df.columns
Check the value of a row (the old irow method has been removed from pandas; use iloc instead)
df.iloc[2]
Assign some variable names (not strictly necessary, but it saves some typing when issuing commands)
frequency=df['freq']
posfirst=df['posone']
possecond=df['postwo']
posthird=df['posthree']
posfourth=df['posfour']
posfifth=df['posfive']
onegram=df['onegram']
twogram=df['twogram']
threegram=df['threegram']
fourgram=df['fourgram']
fivegram=df['fivegram']
Counting types of the pattern [at nn1 io at nn1] (see the CLAWS7 Tagset - http://ucrel.lancs.ac.uk/claws7tags.html)
The above pattern is the most common, as given by Towards an n-grammar of English (https://www.academia.edu/18626245/Towards_an_n-grammar_of_English)
patternOne=df[(posfirst == 'at') & (possecond == 'nn1') & (posthird == 'io') & (posfourth == 'at') & (posfifth == 'nn1')]
len(patternOne)
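The filter above is a standard pandas boolean mask. A self-contained sketch with toy data, where only the first row matches the [at nn1 io at nn1] pattern:

```python
import pandas as pd

# Toy frame using the assumed column names from w5c.csv
df = pd.DataFrame({
    "freq": [100, 50, 30],
    "posone": ["at", "ppis1", "at"],
    "postwo": ["nn1", "vd0", "nn1"],
    "posthree": ["io", "xx", "io"],
    "posfour": ["at", "vvi", "jj"],
    "posfive": ["nn1", "to", "nn1"],
})

# Boolean mask: each comparison yields a True/False Series, combined with &
pattern = df[(df["posone"] == "at") & (df["postwo"] == "nn1")
             & (df["posthree"] == "io") & (df["posfour"] == "at")
             & (df["posfive"] == "nn1")]
print(len(pattern))  # 1
```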
Count types of /ppis1 vd0 xx vvi to/
Here we are just checking that even though [I do n't want to] is very frequent (12,659 instances), the syntactic pattern it derives from is not so frequent
countppis = df[(posfirst == 'ppis1') & (possecond == 'vd0') & (posthird == 'xx') & (posfourth == 'vvi') & (posfifth == 'to')]
len(countppis)
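Note the distinction being drawn here: len() counts distinct 5-gram rows (types), whereas summing the freq column would count corpus occurrences (tokens). A toy sketch of the difference:

```python
import pandas as pd

# Toy subset: one very frequent 5-gram plus two rarer ones
subset = pd.DataFrame({"freq": [12659, 40, 25]})

print(len(subset))           # 3 -- three types (distinct 5-grams)
print(subset["freq"].sum())  # 12724 -- total tokens (corpus occurrences)
```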
Examples of the 10 most frequent patternOne 5-grams in decreasing order of frequency (the old sort method has been removed from pandas; use sort_values instead)
patternOne.sort_values('freq', ascending=False).head(10)
Collapse the part of speech columns into one column and assign it to a variable
df['newcol'] = posfirst.map(str) + possecond.map(str) + posthird.map(str) + posfourth.map(str) + posfifth.map(str)
newcol = df['newcol']
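One readability caveat with plain concatenation: a value like 'atnn1' does not show where one tag ends and the next begins. Joining the POS columns with a separator via str.cat is one possible alternative (toy data below; column names as assumed throughout):

```python
import pandas as pd

# Toy frame with one row of POS tags
df = pd.DataFrame({"posone": ["at"], "postwo": ["nn1"], "posthree": ["io"],
                   "posfour": ["at"], "posfive": ["nn1"]})

# str.cat joins this Series with the others, inserting a separator
# between each tag so the boundaries stay visible
df["newcol"] = df["posone"].str.cat(
    [df["postwo"], df["posthree"], df["posfour"], df["posfive"]], sep="_")
print(df["newcol"].iloc[0])  # at_nn1_io_at_nn1
```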
Count how many distinct patterns there are; the total differs by 3 from the Towards an n-grammar of English paper (325,549 vs 325,552)
newcol.nunique()
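nunique() counts the distinct values in a Series, ignoring repeats; a minimal illustration:

```python
import pandas as pd

# Three rows but only two distinct pattern strings
s = pd.Series(["atnn1", "atnn1", "ppis1vd0"])
print(s.nunique())  # 2
```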
Make a count of the top ten 5-gram patterns (using the groupby method). Note that Series.sort took no column argument and has been removed from pandas; use sort_values and keep the sorted result
grouped1 = df.groupby('newcol')['freq'].count()
grouped1 = grouped1.sort_values(ascending=False)
print(grouped1.head(10))
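As a quick sanity check, groupby(...).count() here is counting rows (types) per pattern, not summing frequencies; a self-contained sketch with toy data:

```python
import pandas as pd

# Two rows share the pattern 'atnn1', one row has 'ppis1vd0'
df = pd.DataFrame({"newcol": ["atnn1", "atnn1", "ppis1vd0"],
                   "freq": [10, 5, 12659]})

# count() gives the number of 5-gram rows per pattern, so 'atnn1'
# comes first (2 types) despite 'ppis1vd0' having far more tokens
by_pattern = df.groupby("newcol")["freq"].count().sort_values(ascending=False)
print(by_pattern.head(2))
```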