Analysis of Anomalies & Outliers
Sometimes, anomalies find their way into the data and degrade the
quality of the downstream ML model. For instance, a classifier that is
trained to classify input documents into N known classes does not know
what to do with an anomalous document; it therefore places it into one of
those classes, which can be completely wrong. Anomaly detection, in this
example, allows us to identify and discard anomalies before running the
classifier. On the other hand, sometimes anomalies are the most interesting
part of our data, and those are exactly the ones we are looking for.
You can use wordview
to identify anomalies in your data. For instance,
you can use NormalDistAnomalies
to identify anomalies based on the (normalized)
distribution of your data. See a worked example below.
from wordview.text_analysis import TextStatsPlots
from wordview.anomaly import NormalDistAnomalies
from sklearn.feature_extraction.text import TfidfVectorizer

# imdb_train is assumed to be a pandas DataFrame with a 'text' column.

# Create a score for each word. It can be, e.g., word frequency:
tsp = TextStatsPlots(df=imdb_train, text_column='text')
token_score_dict = tsp.analysis.token_to_count_dict

# or it can be the inverse document frequency (IDF):
vectorizer = TfidfVectorizer(min_df=1)
X = vectorizer.fit_transform(imdb_train["text"])
idf = vectorizer.idf_
token_score_dict = dict(zip(vectorizer.get_feature_names_out(), idf))

# Use NormalDistAnomalies to identify anomalies.
nda = NormalDistAnomalies(items=token_score_dict)
nda.anomalous_items()
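To build intuition for what a normal-distribution-based detector does, here is a minimal, self-contained sketch of the underlying idea: treat the item scores as (approximately) normally distributed and flag items whose score lies more than a chosen number of standard deviations from the mean. Note that this is an illustrative simplification, not wordview's actual implementation; the function name `zscore_anomalies` and the threshold parameter are hypothetical.

```python
import math

def zscore_anomalies(item_scores: dict, z_threshold: float = 3.0) -> set:
    """Return items whose score deviates from the mean by more than
    z_threshold standard deviations (a simple z-score outlier test)."""
    scores = list(item_scores.values())
    mean = sum(scores) / len(scores)
    variance = sum((s - mean) ** 2 for s in scores) / len(scores)
    std = math.sqrt(variance)
    if std == 0:
        return set()
    return {item for item, s in item_scores.items()
            if abs(s - mean) / std > z_threshold}

# Toy scores: four plausible token scores and one extreme outlier.
scores = {"the": 1.0, "movie": 2.1, "plot": 2.3, "acting": 2.2, "qzxv": 9.5}
print(zscore_anomalies(scores, z_threshold=1.5))  # {'qzxv'}
```

A lower `z_threshold` flags more items as anomalous; in practice the scores are often normalized first (e.g., log- or power-transformed) so the Gaussian assumption holds better.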