Utilities ######### Text Cleaning ------------- Cleaning up the text can be a tedious task, but for most NLP applications we almost always need some degree of it. *wordview* offers easy to use functionalities for filtering noise, customized definition of noise, and cleaning up the text from it. For instance, you can choose what pattern to accept via ``keep_pattern`` argument, what pattern to drop via ``drop_patterns`` argument, and what pattern to replace via ``replace`` argument. Or you can specify the max length of allowed tokens to filter out very long sequences that are often noise. See the docs to learn more about other parameters of ``clean_text``. Here is a worked example: .. code:: python from wordview.preprocessing import clean_text # Let's only keep alphanumeric tokens as well as important punctuation marks: keep_pattern='^[a-zA-Z0-9!.,?\';:$/_-]+$' # In this corpus, one can frequently see HTML tags such as `< br / >`. So let's drop them: drop_patterns={'< br / >'} # By skimming throw the text one can frequently see many patterns such as !!! or ???. Let's replace them: replace={'!!!':'!', '\?\?\?':'?'} # Finally, let's set the maximum length of a token to 15: maxlen=15 # Pass the set keyword arguments to the apply: imdb_train.text = imdb_train.text.apply(clean_text, args=(), keep_pattern=keep_pattern, replace=replace, maxlen=maxlen) **Note** ``clean_text`` returns tokenized text. Hyphenating MWEs ---------------- An important use of extracting MWEs is to treat them as a single token. Research shows that when fixed expressions are treated as a single token rather than the sum of their components, they can improve the performance of downstream applications such as classification and NER. Using the `hyphenate_mwes` function, you can replace the extracted expressions in the corpus with their hyphenated version (global warming --> global-warming) so that they are considered a single token by downstream applications. A worked example can be seen below: .. code:: python from snlp.mwes import hyphenate_mwes new_df = hyphenate_mwes(path_to_mwes='tmp/mwes/mwe_data.json', mwe_types=['NC', 'JNC'], df=imdb_train, text_column='text') new_df.to_csv('tmp/new_df.csv', sep='\t')