Analysis of Anomalies & Outliers
################################

Sometimes anomalies find their way into the data and degrade the quality of the downstream ML model. For instance, a classifier trained to sort input documents into N known classes does not know what to do with an anomalous document, so it assigns it to one of those classes, which can be completely wrong. In this example, anomaly detection lets us identify and discard anomalies before running the classifier. On the other hand, sometimes anomalies are the most interesting part of our data, and they are exactly what we are looking for.

You can use ``wordview`` to identify anomalies in your data. For instance, you can use ``NormalDistAnomalies`` to identify anomalies based on (the normalized) distribution of your data. See a worked example below.

.. code:: python

   from wordview.anomaly import NormalDistAnomalies
   from wordview.text_analysis import TextStatsPlots
   from sklearn.feature_extraction.text import TfidfVectorizer

   # imdb_train is assumed to be a pandas DataFrame with a 'text' column,
   # loaded beforehand.

   # Create a score for words.
   # Option 1: the score can be, e.g., the word frequency.
   tsp = TextStatsPlots(df=imdb_train, text_column='text')
   token_score_dict = tsp.analysis.token_to_count_dict

   # Option 2: or it can be the inverse document frequency (IDF).
   vectorizer = TfidfVectorizer(min_df=1)
   vectorizer.fit(imdb_train["text"])
   idf = vectorizer.idf_
   # get_feature_names() was removed in scikit-learn 1.2;
   # get_feature_names_out() is its replacement.
   token_score_dict = dict(zip(vectorizer.get_feature_names_out(), idf))

   # Use NormalDistAnomalies to identify anomalies.
   nda = NormalDistAnomalies(items=token_score_dict)
   nda.anomalous_items()
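
To build some intuition for distribution-based anomaly detection, the following is a minimal, library-independent sketch. It is not ``wordview``'s implementation: the helper ``zscore_anomalies``, its ``z_threshold`` parameter, and the toy scores are all hypothetical and chosen only for illustration. It assumes the scores are roughly normally distributed and flags every item whose score lies more than ``z_threshold`` standard deviations from the mean.

.. code:: python

   import statistics


   def zscore_anomalies(item_scores, z_threshold=2.0):
       """Return the items whose score is an outlier by z-score.

       Hypothetical helper for illustration; not part of wordview.
       """
       values = list(item_scores.values())
       mean = statistics.fmean(values)
       std = statistics.pstdev(values)  # population standard deviation
       if std == 0:
           # All scores are identical, so nothing can stand out.
           return {}
       return {
           item: score
           for item, score in item_scores.items()
           if abs(score - mean) / std > z_threshold
       }


   # Toy token-to-score mapping: one token has a far larger score than the rest.
   scores = {
       "the": 10, "a": 11, "of": 9, "and": 10, "to": 10,
       "in": 11, "it": 9, "is": 10, "xqzt": 60,
   }
   anomalies = zscore_anomalies(scores)
   print(anomalies)

With these toy scores only ``xqzt`` exceeds the threshold. A lower ``z_threshold`` flags more items, so the value is a trade-off between missed anomalies and false alarms.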