Analysis & Extraction of Multiword Expressions (MWEs)
Multiword Expressions (MWEs) are phrases that can be treated as a single semantic unit. E.g. swimming pool and climate change. MWEs have application in different areas including: parsing, language generation, language modeling, terminology extraction, and topic models.
Wordview can extract different types of MWEs from a text corpus in any of the supported languages. Wordview by default extracts the following types of MWEs: Light Verb Constructions (LVCs), 2 and 3 word Noun Compounds (NCs), 2 and 3 word Adjective-Noun Compounds (ANCs), and Verb-Noun Compounds (VNCs). However, you can specify other types of MWEs you want to extract using the custom_pattern argument. For more details, see the the documentation.
# First we need to extract ngrams from the corpus
# If this was not done previously, e.g. when running other functions of Wordview,
# you can do it as follows:
from wordview.preprocessing import NgramExtractor
import pandas as pd
imdb_corpus = pd.read_csv("data/IMDB_Dataset_sample.csv")
extractor = NgramExtractor(imdb_corpus, "review")
extractor.extract_ngrams()
extractor.get_ngram_counts(ngram_count_file_path="data/ngram_counts.json")
# Now we can extract MWEs
from wordview.mwes import MWE
import json
mwe_obj = MWE(imdb_corpus, 'review',
ngram_count_file_path='data/ngram_counts.json',
language='EN',
custom_patterns="NP: {<DT>?<JJ>*<NN>}",
only_custom_patterns=False,
)
mwe_obj.extract_mwes(sort=True, top_n=10)
json.dump(mwe_obj.mwes, open('data/mwes.json', 'w'), indent=4)
The above returns the results in a dictionary, that in this example we stored in a json file called data/mwes.json. You can also return the result in a table:
mwe_obj.print_mwe_table()
Which will return a table like this:
╔═════════════════════════╦═══════════════╗
║ LVC ║ Association ║
╠═════════════════════════╬═══════════════╣
║ SHOOT the binding ║ 26.02 ║
║ achieve this elusive ║ 24.7 ║
║ manipulate the wildlife ║ 24.44 ║
║ offset the darker ║ 24.02 ║
║ remove the bindings ║ 24.02 ║
║ Wish that Anthony ║ 23.9 ║
║ Add some French ║ 23.5 ║
║ grab a beer ║ 22.82 ║
║ steal the 42 ║ 22.5 ║
║ invoke the spirit ║ 22.12 ║
╚═════════════════════════╩═══════════════╝
╔══════════════════════╦═══════════════╗
║ NC2 ║ Association ║
╠══════════════════════╬═══════════════╣
║ gordon willis ║ 20.74 ║
║ Smoking Barrels ║ 20.74 ║
║ sadahiv amrapurkar ║ 20.74 ║
║ nihilism nothingness ║ 20.74 ║
║ tomato sauce ║ 20.74 ║
║ Picket Fences ║ 20.74 ║
║ deja vu ║ 19.74 ║
║ cargo bay ║ 19.74 ║
║ zoo souvenir ║ 19.16 ║
║ cake frosting ║ 19.16 ║
╚══════════════════════╩═══════════════╝
╔══════════════════════════════╦═══════════════╗
║ ANC2 ║ Association ║
╠══════════════════════════════╬═══════════════╣
║ bite-sized chunks ║ 20.74 ║
║ lizardly snouts ║ 20.74 ║
║ behind-the-scenes featurette ║ 20.74 ║
║ hidebound conservatives ║ 20.74 ║
║ judicious pruning ║ 20.74 ║
║ substantial gauge ║ 19.74 ║
║ haggish airheads ║ 19.74 ║
║ global warming ║ 19.74 ║
║ Ukrainian flags ║ 19.16 ║
║ well-lit sights ║ 19.16 ║
╚══════════════════════════════╩═══════════════╝
╔═══════════════╦═══════════════╗
║ VPC ║ Association ║
╠═══════════════╬═══════════════╣
║ upside down ║ 12.67 ║
║ Stay away ║ 12.49 ║
║ put together. ║ 11.62 ║
║ sit through ║ 10.93 ║
║ ratchet up ║ 10.83 ║
║ shoot'em up ║ 10.83 ║
║ rip off ║ 10.72 ║
║ hunt down ║ 10.67 ║
║ screw up ║ 10.41 ║
║ scorch out ║ 10.4 ║
╚═══════════════╩═══════════════╝
╔══════════════╦═══════════════╗
║ NP ║ Association ║
╠══════════════╬═══════════════╣
║ every penny ║ 12.78 ║
║ THE END ║ 12.07 ║
║ A JOKE ║ 11.79 ║
║ A LOT ║ 11.05 ║
║ Either way ║ 11.03 ║
║ An absolute ║ 10.72 ║
║ half hour ║ 10.65 ║
║ no qualms ║ 10.47 ║
║ every cliche ║ 10.46 ║
║ another user ║ 10.37 ║
╚══════════════╩═══════════════╝
Notice how many interesting entities are captured, without the need for any labeled data and supervised model. This can speed things up and save much costs in certain applications.