See the following link for instructions on installing the environment needed to run a Python (Jupyter) notebook: http://www.open.edu/openlearn/learn-to-code-installation

Next, import the pandas module

In [2]:
from pandas import *

Read in the 5-gram .csv file (the file extension was changed from the original .txt)

The 5-gram file was taken from http://www.ngrams.info/download_coca.asp

Use sep='\t' because the file is tab separated

In [3]:
df = read_csv('w5c.csv', sep='\t')
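
If you would rather not edit the downloaded file, the headings can probably be supplied at read time instead. This is a minimal sketch, assuming the original file has no header line and eleven tab-separated fields in the order used in this notebook. The two sample rows are copied from outputs further down, `io.StringIO` stands in for the real file, and `sample_df` and `COLUMNS` are illustrative names.

```python
import io
import pandas as pd

# Column names used in this notebook (added by hand to the .csv file).
COLUMNS = ['freq', 'onegram', 'twogram', 'threegram', 'fourgram', 'fivegram',
           'posone', 'postwo', 'posthree', 'posfour', 'posfive']

# Two rows copied from outputs in this notebook, tab separated, no header line.
sample = (
    "5\ta\tbaby\taspirin\tevery\tday\tat1\tnn1\tnn\tat1\tnnt1\n"
    "3618\tthe\trest\tof\tthe\tworld\tat\tnn1\tio\tat\tnn1\n"
)

# header=None says the first line is data; names supplies the headings.
sample_df = pd.read_csv(io.StringIO(sample), sep='\t', header=None, names=COLUMNS)
print(sample_df.shape)
```

With the real file, the first argument would be the file name rather than the `StringIO` object.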

Check that the table headings have been read correctly (note that I added these headings to the .csv file myself)

In [4]:
df.columns
Out[4]:
Index([u'freq', u'onegram', u'twogram', u'threegram', u'fourgram', u'fivegram',
       u'posone', u'postwo', u'posthree', u'posfour', u'posfive'],
      dtype='object')

Check value of a row

In [5]:
df.iloc[2]
Out[5]:
freq               5
onegram            a
twogram         baby
threegram    aspirin
fourgram       every
fivegram         day
posone           at1
postwo           nn1
posthree          nn
posfour          at1
posfive         nnt1
Name: 2, dtype: object

Assign some variable names (this is not strictly necessary, but it saves some typing when issuing commands)

In [6]:
frequency = df['freq']
posfirst = df['posone']
possecond = df['postwo']
posthird = df['posthree']
posfourth = df['posfour']
posfifth = df['posfive']
onegram = df['onegram']
twogram = df['twogram']
threegram = df['threegram']
fourgram = df['fourgram']
fivegram = df['fivegram']

Count the types matching the pattern [at nn1 io at nn1] (see the CLAWS7 tagset: http://ucrel.lancs.ac.uk/claws7tags.html)

This pattern is the most common one according to Towards an n-grammar of English (https://www.academia.edu/18626245/Towards_an_n-grammar_of_English)

In [7]:
patternOne=df[(posfirst == 'at') & (possecond == 'nn1') & (posthird == 'io') & (posfourth == 'at') & (posfifth == 'nn1')]
In [8]:
len(patternOne)
Out[8]:
7272
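
The five-way comparison above can be wrapped in a small helper so that other tag sequences can be tried without retyping it. This is only a sketch: `match_pattern`, `POS_COLUMNS` and `toy` are hypothetical names, and the toy rows are copied from outputs in this notebook rather than read from the full file.

```python
import pandas as pd

POS_COLUMNS = ['posone', 'postwo', 'posthree', 'posfour', 'posfive']

def match_pattern(frame, tags):
    """Return the rows whose five POS columns equal the five tags, in order."""
    if len(tags) != len(POS_COLUMNS):
        raise ValueError('expected one tag per POS column')
    # Start with every row selected, then narrow down one column at a time.
    mask = pd.Series(True, index=frame.index)
    for column, tag in zip(POS_COLUMNS, tags):
        mask &= frame[column] == tag
    return frame[mask]

# Three rows taken from this notebook's outputs.
toy = pd.DataFrame(
    [['at', 'nn1', 'io', 'at', 'nn1'],     # the rest of the world
     ['at', 'nn1', 'io', 'at', 'nn1'],     # the side of the road
     ['at1', 'nn1', 'nn', 'at1', 'nnt1']], # a baby aspirin every day
    columns=POS_COLUMNS)

print(len(match_pattern(toy, ['at', 'nn1', 'io', 'at', 'nn1'])))
```

On the full table, `len(match_pattern(df, ['at', 'nn1', 'io', 'at', 'nn1']))` should give the same count as `len(patternOne)`.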

Count the types matching the pattern [ppis1 vd0 xx vvi to]

This checks that even though [I do n't want to] is very frequent (12659 instances), the syntactic pattern it derives from is not so frequent

In [9]:
countppis = df[(posfirst == 'ppis1') & (possecond == 'vd0') & (posthird == 'xx') & (posfourth == 'vvi') & (posfifth == 'to')]
len(countppis)
Out[9]:
51
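
The point here rests on the difference between types (distinct 5-grams, one per row) and tokens (occurrences, the freq column). Below is a minimal sketch of computing both for one pattern. The rows are copied from outputs in this notebook; `text` and `pattern` are illustrative column names that do not exist in the real file.

```python
import pandas as pd

# Four rows copied from outputs in this notebook; 'text' and 'pattern' are
# illustrative column names, not columns of the real file.
toy = pd.DataFrame({
    'freq': [3618, 1217, 1174, 5],
    'text': ['the rest of the world', 'the side of the road',
             'the rest of the country', 'a baby aspirin every day'],
    'pattern': ['at nn1 io at nn1', 'at nn1 io at nn1',
                'at nn1 io at nn1', 'at1 nn1 nn at1 nnt1'],
})

matches = toy[toy['pattern'] == 'at nn1 io at nn1']

n_types = len(matches)                 # distinct 5-grams with this pattern
n_tokens = int(matches['freq'].sum())  # total occurrences of those 5-grams
print(n_types, n_tokens)
```

On the full table the same two numbers for a filtered frame such as `countppis` would be `len(countppis)` and `countppis['freq'].sum()`.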

The 10 most frequent 5-grams matching patternOne, in decreasing order of frequency

In [10]:
patternOne.sort_values('freq', ascending=False).head(10)
Out[10]:
        freq  onegram  twogram  threegram  fourgram  fivegram  posone  postwo  posthree  posfour  posfive
980720  3618      the     rest         of       the     world      at     nn1        io       at      nn1
987110  1217      the     side         of       the      road      at     nn1        io       at      nn1
980317  1174      the     rest         of       the   country      at     nn1        io       at      nn1
933714   825      the     fact         of       the    matter      at     nn1        io       at      nn1
931730   764      the      end         of       the     world      at     nn1        io       at      nn1
931715   670      the      end         of       the       war      at     nn1        io       at      nn1
980705   597      the     rest         of       the       way      at     nn1        io       at      nn1
914922   547      the  benefit         of       the     doubt      at     nn1        io       at      nn1
929913   530      the     edge         of       the       bed      at     nn1        io       at      nn1
919738   526      the   center         of       the      room      at     nn1        io       at      nn1

Collapse the part-of-speech columns into one column and assign it to a variable

In [11]:
newcol = df['newcol'] = posfirst.map(str) + possecond.map(str) + posthird.map(str) + posfourth.map(str) + posfifth.map(str)
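
Joining the tags with no separator makes the combined string hard to read and, in principle, could let two different tag sequences collapse into the same string; whether that actually happens with these tags is not checked here. A hedged alternative is to join with a space using `Series.str.cat`. The column name `pattern` and the frame `toy` are illustrative, and the two rows are copied from outputs in this notebook.

```python
import pandas as pd

POS_COLUMNS = ['posone', 'postwo', 'posthree', 'posfour', 'posfive']

# Two rows taken from this notebook's outputs.
toy = pd.DataFrame(
    [['at', 'nn1', 'io', 'at', 'nn1'],
     ['at1', 'nn1', 'nn', 'at1', 'nnt1']],
    columns=POS_COLUMNS)

first, rest = POS_COLUMNS[0], POS_COLUMNS[1:]

# Concatenate the five tag columns row by row, with a space between tags.
toy['pattern'] = toy[first].astype(str).str.cat(
    [toy[column].astype(str) for column in rest], sep=' ')

print(toy['pattern'].tolist())
```

If this form were used, the group labels later in the notebook would read `at nn1 io at nn1` rather than `atnn1ioatnn1`.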

Count how many distinct patterns there are. The total differs by 3 from the figure in the n-grammar of English paper: 325,549 here vs 325,552 there

In [12]:
newcol.nunique()
Out[12]:
325549

Count the 5-gram types for each pattern and list the ten most common patterns (using the groupby method)

In [13]:
grouped1=df.groupby('newcol')['freq'].count()
In [18]:
grouped1 = grouped1.sort_values(ascending=False)
In [19]:
print(grouped1.head(10))
newcol
atnn1ioatnn1      7272
iiatnn1ioat       5012
nn1iiatnn1io      3286
atjjnn1ioat       3104
tovviatnn1io      2379
iiatjjnn1io       2266
at1jjnn1iiat      2242
iiatnn1ioappge    2115
atnn1ioatjj       1953
iiatnn1iiat       1894
Name: freq, dtype: int64
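
The counts above are numbers of 5-gram types per pattern. As a sketch under the assumption that no freq value is missing, `value_counts` should give the same ranking in one step, and summing freq per group gives a token-weighted ranking instead. The frame `toy` and the names `type_counts` and `token_counts` are illustrative; the rows are copied from outputs in this notebook.

```python
import pandas as pd

# Three rows taken from this notebook's outputs, with the collapsed POS column.
toy = pd.DataFrame({
    'freq': [3618, 1217, 5],
    'newcol': ['atnn1ioatnn1', 'atnn1ioatnn1', 'at1nn1nnat1nnt1'],
})

# Types per pattern: one count per row, already sorted in decreasing order.
type_counts = toy['newcol'].value_counts()

# Tokens per pattern: add up the frequencies of the 5-grams in each group.
token_counts = toy.groupby('newcol')['freq'].sum().sort_values(ascending=False)

print(type_counts.head(10))
print(token_counts.head(10))
```

The two rankings need not agree on the full data: a pattern with few types can still account for many tokens, as the [I do n't want to] check earlier suggests.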