Utilities
Text Cleaning
Cleaning up text can be a tedious task, but most NLP applications need some degree of it.
wordview offers easy-to-use functionality for defining what counts as noise and filtering it
out of your text. For instance, you can choose which pattern to accept via the keep_pattern
argument, which patterns to drop via the drop_patterns argument, and which patterns to replace
via the replace argument. You can also set the maximum allowed token length via maxlen to
filter out very long sequences that are often noise. See the docs to learn more about the
other parameters of clean_text. Here is a worked example:
from wordview.preprocessing import clean_text
# Let's only keep alphanumeric tokens as well as important punctuation marks:
keep_pattern='^[a-zA-Z0-9!.,?\';:$/_-]+$'
# In this corpus, one can frequently see HTML tags such as `< br / >`. So let's drop them:
drop_patterns={'< br / >'}
# By skimming through the text one can frequently see patterns such as !!! or ???. Let's replace them:
replace={'!!!': '!', r'\?\?\?': '?'}
# Finally, let's set the maximum length of a token to 15:
maxlen=15
# Now pass the keyword arguments defined above to clean_text via apply:
imdb_train.text = imdb_train.text.apply(clean_text, keep_pattern=keep_pattern, drop_patterns=drop_patterns, replace=replace, maxlen=maxlen)
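To get a feel for what the keep pattern admits, here is a small check using Python's standard re module on a few made-up tokens; it is only an illustration and does not involve wordview:

import re
# Only tokens made entirely of alphanumeric characters and the listed
# punctuation marks match the keep pattern defined above.
token_filter = re.compile(keep_pattern)
sample_tokens = ['Hello', 'world!', '< br / >', 'naïve', '12,000']
[t for t in sample_tokens if token_filter.match(t)]
# -> ['Hello', 'world!', '12,000']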
Note that clean_text returns tokenized text.
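As a rough sketch of that, assuming clean_text accepts a single string together with the keyword arguments used above (as it does when applied row by row via apply), a one-off call could look like the following; the exact output tokens are illustrative, not guaranteed:

# Hypothetical single-string call; the result is a list of tokens.
cleaned = clean_text('The movie was great !!!',
                     keep_pattern=keep_pattern,
                     replace=replace,
                     maxlen=maxlen)
# e.g. ['The', 'movie', 'was', 'great', '!']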
Hyphenating MWEs
An important use of extracted MWEs is to treat them as a single token. Research shows that when fixed expressions are treated as a single token rather than the sum of their components, they can improve the performance of downstream applications such as classification and NER. Using the hyphenate_mwes function, you can replace the extracted expressions in the corpus with their hyphenated versions (global warming -> global-warming) so that downstream applications treat them as a single token. A worked example can be seen below:
from snlp.mwes import hyphenate_mwes
new_df = hyphenate_mwes(path_to_mwes='tmp/mwes/mwe_data.json',
                        mwe_types=['NC', 'JNC'],
                        df=imdb_train,
                        text_column='text')
new_df.to_csv('tmp/new_df.csv', sep='\t')
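To make the idea concrete, below is a minimal sketch of what hyphenation amounts to, written with plain pandas and re rather than the library's own implementation; the MWE list and data frame are made up for illustration:

import re
import pandas as pd

def hyphenate(text, mwes):
    # Replace each multi-word expression with its hyphenated form so that
    # downstream tokenizers treat it as a single token.
    for mwe in mwes:
        text = re.sub(re.escape(mwe), mwe.replace(' ', '-'), text)
    return text

toy_df = pd.DataFrame({'text': ['Scientists agree that global warming is accelerating.']})
toy_df['text'] = toy_df['text'].apply(hyphenate, mwes=['global warming'])
# -> 'Scientists agree that global-warming is accelerating.'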