See the following link for instructions on installing the environment needed to run a Python (Jupyter) notebook - http://www.open.edu/openlearn/learn-to-code-installation
Next, import the pandas module
from pandas import *
Read in the 5-gram .csv file (the file extension was changed from the original .txt)
The 5-gram file was taken from http://www.ngrams.info/download_coca.asp
Use sep='\t' because the file is tab-separated
df = read_csv('w5c.csv', sep='\t')
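As an aside: if the original .txt file had no header row, pandas can supply the headings at read time via the names parameter instead of the file being edited by hand. A minimal sketch using an in-memory stand-in for w5c.csv (the column names here are assumptions, not the real headings):

```python
import io
import pandas as pd

# In-memory stand-in for a headerless, tab-separated w5c.csv
# (the column names below are illustrative assumptions)
raw = "123\tthe\tat\n45\ta\tat\n"
df = pd.read_csv(io.StringIO(raw), sep="\t", header=None,
                 names=["freq", "onegram", "posone"])
print(df.columns.tolist())  # ['freq', 'onegram', 'posone']
```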
Check that the table headings have been read in OK (note: I added these headings to the .csv file myself)
df.columns
Check the value of a row (the old irow method has been removed from pandas; use iloc instead)
df.iloc[2]
Assign some variable names (not strictly necessary, but it saves some typing when issuing commands)
frequency=df['freq']
posfirst=df['posone']
possecond=df['postwo']
posthird=df['posthree']
posfourth=df['posfour']
posfifth=df['posfive']
onegram=df['onegram']
twogram=df['twogram']
threegram=df['threegram']
fourgram=df['fourgram']
fivegram=df['fivegram']
Counting types of the pattern [at nn1 io at nn1] (see the CLAWS7 Tagset - http://ucrel.lancs.ac.uk/claws7tags.html)
The above pattern is the most common, as given by Towards an n-grammar of English (https://www.academia.edu/18626245/Towards_an_n-grammar_of_English)
patternOne=df[(posfirst == 'at') & (possecond == 'nn1') & (posthird == 'io') & (posfourth == 'at') & (posfifth == 'nn1')]
len(patternOne)
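The filter above is a standard pandas boolean mask. A self-contained sketch with toy data, where only the first row matches the [at nn1 io at nn1] pattern:

```python
import pandas as pd

# Toy frame using the assumed column names from w5c.csv
df = pd.DataFrame({
    "freq": [100, 50, 30],
    "posone": ["at", "ppis1", "at"],
    "postwo": ["nn1", "vd0", "nn1"],
    "posthree": ["io", "xx", "io"],
    "posfour": ["at", "vvi", "jj"],
    "posfive": ["nn1", "to", "nn1"],
})

# Boolean mask: each comparison yields a True/False Series, combined with &
pattern = df[(df["posone"] == "at") & (df["postwo"] == "nn1")
             & (df["posthree"] == "io") & (df["posfour"] == "at")
             & (df["posfive"] == "nn1")]
print(len(pattern))  # 1
```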
Count types of /ppis1 vd0 xx vvi to/
Here we are just checking that even though [I do n't want to] is very frequent (12,659 instances), the syntactic pattern it derives from is not so frequent
countppis = df[(posfirst == 'ppis1') & (possecond == 'vd0') & (posthird == 'xx') & (posfourth == 'vvi') & (posfifth == 'to')]
len(countppis)
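Note the distinction being drawn here: len() counts distinct 5-gram rows (types), whereas summing the freq column would count corpus occurrences (tokens). A toy sketch of the difference:

```python
import pandas as pd

# Toy subset: one very frequent 5-gram plus two rarer ones
subset = pd.DataFrame({"freq": [12659, 40, 25]})

print(len(subset))           # 3 -- three types (distinct 5-grams)
print(subset["freq"].sum())  # 12724 -- total tokens (corpus occurrences)
```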
Examples of the 10 most frequent patternOne 5-grams in decreasing order of frequency (the old sort method has been removed from pandas; use sort_values instead)
patternOne.sort_values('freq', ascending=False).head(10)
Collapse the part of speech columns into one column and assign it to a variable
df['newcol'] = posfirst.map(str) + possecond.map(str) + posthird.map(str) + posfourth.map(str) + posfifth.map(str)
newcol = df['newcol']
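One readability caveat with plain concatenation: a value like 'atnn1' does not show where one tag ends and the next begins. Joining the POS columns with a separator via str.cat is one possible alternative (toy data below; column names as assumed throughout):

```python
import pandas as pd

# Toy frame with one row of POS tags
df = pd.DataFrame({"posone": ["at"], "postwo": ["nn1"], "posthree": ["io"],
                   "posfour": ["at"], "posfive": ["nn1"]})

# str.cat joins this Series with the others, inserting a separator
# between each tag so the boundaries stay visible
df["newcol"] = df["posone"].str.cat(
    [df["postwo"], df["posthree"], df["posfour"], df["posfive"]], sep="_")
print(df["newcol"].iloc[0])  # at_nn1_io_at_nn1
```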
Count how many distinct patterns there are; the total differs by 3 from the Towards an n-grammar of English paper (325,549 vs 325,552)
newcol.nunique()
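nunique() counts the distinct values in a Series, ignoring repeats; a minimal illustration:

```python
import pandas as pd

# Three rows but only two distinct pattern strings
s = pd.Series(["atnn1", "atnn1", "ppis1vd0"])
print(s.nunique())  # 2
```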
Make a count of the top ten 5-gram patterns (using the groupby method). Note that Series.sort took no column argument and has been removed from pandas; use sort_values and keep the sorted result
grouped1 = df.groupby('newcol')['freq'].count()
grouped1 = grouped1.sort_values(ascending=False)
print(grouped1.head(10))
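As a quick sanity check, groupby(...).count() here is counting rows (types) per pattern, not summing frequencies; a self-contained sketch with toy data:

```python
import pandas as pd

# Two rows share the pattern 'atnn1', one row has 'ppis1vd0'
df = pd.DataFrame({"newcol": ["atnn1", "atnn1", "ppis1vd0"],
                   "freq": [10, 5, 12659]})

# count() gives the number of 5-gram rows per pattern, so 'atnn1'
# comes first (2 types) despite 'ppis1vd0' having far more tokens
by_pattern = df.groupby("newcol")["freq"].count().sort_values(ascending=False)
print(by_pattern.head(2))
```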