Label Analysis

Wordview provides several statistics for labels in labeled datasets. To use this feature, use the LabelStatsPlots class.

Document-level Labels

For document-level labels, Wordview supports up to four independent categorical or numerical labels. Each label should have exactly one value per document. The following example provides such labels as columns of a pandas DataFrame.

import random

import numpy as np

from wordview.text_analysis import LabelStatsPlots

# `imdb_train` is assumed to be a pandas DataFrame with one row per document.
# In addition to the original label in the `label` column, create three
# random labels for illustration:
imdb_train['label2'] = random.choices(['a', 'b', 'c', 'd'], [0.2, 0.5, 0.8, 0.9], k=imdb_train.shape[0])
imdb_train['numerical_label'] = np.random.randint(1, 500, imdb_train.shape[0])
imdb_train['numerical_label2'] = np.random.randint(1, 500, imdb_train.shape[0])

# Each entry pairs a column name with its label type.
lsp = LabelStatsPlots(df=imdb_train, label_columns=[('label', 'categorical'),
                                                    ('label2', 'categorical'),
                                                    ('numerical_label', 'numerical'),
                                                    ('numerical_label2', 'numerical')
                                                   ])

lsp.show_label_plots()
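
A mismatch between the names in label_columns and the actual DataFrame columns is easy to introduce. The following is a minimal sketch of a sanity check that could be run before constructing LabelStatsPlots. The helper check_label_columns is hypothetical and not part of Wordview; the limit of four labels and the two type names are taken from the description above.

```python
# Hypothetical helper, not part of Wordview: checks a (name, type) label
# specification against the column names of a dataset.
ALLOWED_TYPES = {"categorical", "numerical"}
MAX_LABELS = 4  # limit stated in the text above


def check_label_columns(label_columns, available_columns):
    """Return a list of problems found in the label specification."""
    problems = []
    if len(label_columns) > MAX_LABELS:
        problems.append(
            f"{len(label_columns)} labels given, at most {MAX_LABELS} supported"
        )
    available = set(available_columns)
    for name, label_type in label_columns:
        if name not in available:
            problems.append(f"column '{name}' not found")
        if label_type not in ALLOWED_TYPES:
            problems.append(f"column '{name}' has unknown type '{label_type}'")
    return problems


spec = [
    ("label", "categorical"),
    ("label2", "categorical"),
    ("numerical_label", "numerical"),
    ("numerical_label2", "numerical"),
]
# Stand-in for the column names of the DataFrame (an assumption for this sketch).
columns = ["text", "label", "label2", "numerical_label", "numerical_label2"]

problems = check_label_columns(spec, columns)
print(problems or "label specification looks consistent")
```

With a real DataFrame, the second argument would be its column names, for example check_label_columns(spec, imdb_train.columns).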


Sequence-level Labels

Sequence-level labels (e.g. named entity tags such as PER and LOC) are among the most widely used label types in NLP. Wordview offers functions to show the overall and document-level distributions of sequence labels.
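
To make the two kinds of distribution concrete, here is a minimal sketch in plain Python. It does not use Wordview's API. The toy documents, the (token, tag) representation, and the helper tag_counts are assumptions made for illustration only.

```python
from collections import Counter

# Toy data (illustrative): each document is a list of (token, tag) pairs.
documents = [
    [("Alice", "PER"), ("visited", "O"), ("Paris", "LOC")],
    [("Bob", "PER"), ("met", "O"), ("Carol", "PER"), ("in", "O"), ("Rome", "LOC")],
]


def tag_counts(document):
    """Count how often each tag occurs in a single document."""
    return Counter(tag for _, tag in document)


# Document-level distribution: one Counter per document.
per_document = [tag_counts(doc) for doc in documents]

# Overall distribution: the sum of the per-document counts.
overall = sum(per_document, Counter())

print("overall:", dict(overall))
for index, counts in enumerate(per_document):
    print(f"document {index}:", dict(counts))
```

The overall distribution shows how frequent each tag is across the dataset, while the per-document counts reveal documents in which a tag is unusually dense or absent.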