API
API reference for public classes and methods.
Text Analysis
- class wordview.text_analysis.TextStatsPlots(df: DataFrame, text_column: str, distributions: set = {'doc_len', 'sentence_len', 'word_frequency_zipf'}, pos_tags: set = {'CC', 'CD', 'DT', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNP', 'NNPS', 'NNS', 'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB'})
Represents Text Statistics and Plots.
- __init__(df: DataFrame, text_column: str, distributions: set = {'doc_len', 'sentence_len', 'word_frequency_zipf'}, pos_tags: set = {'CC', 'CD', 'DT', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNP', 'NNPS', 'NNS', 'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB'}) None
Initialize a new TextStatsPlots object with the given arguments.
- Parameters:
df – DataFrame with a text_column that contains the text corpus.
text_column – Specifies the column of DataFrame where text data resides.
distributions –
set of distribution types to generate and plot. Available distributions are:
doc_len: Document lengths
sentence_len: Sentence lengths
word_frequency_zipf: Zipfian word frequency distribution.
Default = {'doc_len', 'sentence_len', 'word_frequency_zipf'}
pos_tags –
A set of target POS tags for downstream analysis.
Default =
{'CC', 'CD', 'DT', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNS', 'NNP', 'NNPS', 'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB'}
- Returns:
None
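A minimal usage sketch based on the signature above (the sample DataFrame, column name, and chosen tag subset are illustrative):

```python
import pandas as pd
from wordview.text_analysis import TextStatsPlots

# Illustrative corpus; any DataFrame with a text column works.
df = pd.DataFrame(
    {"text": ["The quick brown fox jumps over the lazy dog.",
              "Wordview analyses text corpora."]}
)

# Restrict the analysis to document lengths, the Zipfian word-frequency
# distribution, and the noun/adjective POS tags.
tsp = TextStatsPlots(
    df=df,
    text_column="text",
    distributions={"doc_len", "word_frequency_zipf"},
    pos_tags={"NN", "JJ"},
)
```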
- chat(api_key: str = '')
Chat with OpenAI's latest model about the results of Wordview's text analysis. Access the chat UI on your localhost at http://127.0.0.1:5000/
- Parameters:
api_key – OpenAI API key.
- Returns:
None
- return_stats() dict[str, Any]
Returns dataset statistics, including: language(s), number of unique words, number of all words, number of documents, median document length, number of nouns, number of adjectives, and number of verbs.
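For example, reusing the tsp instance sketched above:

```python
# Dictionary of corpus statistics (keys follow the list above).
stats = tsp.return_stats()
print(stats)
```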
- show_bar_plots(pos: str, layout_settings: dict[str, Any] = {}, plot_settings: dict[str, str] = {}) None
Shows POS bar plots.
- Parameters:
pos – Type of POS. Can be any of the Penn POS tags (https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html).
layout_settings –
To customize the plot layout. For example:
layout_settings = {'plot_bgcolor': 'rgba(245, 245, 245, 1)', 'paper_bgcolor': 'rgba(255, 255, 255, 1)', 'hovermode': 'y'}
plot_settings –
To customize the plot colors and other attributes. For example:
plot_settings = {'color': 'darkgreen', 'max_words': 200}
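An illustrative call, assuming the tsp instance from the earlier sketch:

```python
# Bar plot of the most frequent nouns, with example layout and plot settings.
tsp.show_bar_plots(
    pos="NN",
    layout_settings={"plot_bgcolor": "rgba(245, 245, 245, 1)",
                     "paper_bgcolor": "rgba(255, 255, 255, 1)"},
    plot_settings={"color": "darkgreen"},
)
```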
- show_distplot(distribution: str, layout_settings: dict[str, str] = {}, plot_settings: dict[str, str] = {}) None
Shows the distribution plot for the specified distribution.
- Parameters:
distribution –
The distribution for which the plot is to be shown. Available distributions are:
doc_len: document lengths
word_frequency_zipf: Zipfian word frequency distribution.
layout_settings –
To customize the plot layout. For example:
layout_settings = {'plot_bgcolor':'rgba(245, 245, 245, 1)', 'paper_bgcolor': 'rgba(255, 255, 255, 1)', 'hovermode': 'y'}
For a full list of possible options, see: https://plotly.com/python/reference/layout/
plot_settings –
A dictionary of form:
{"<plot_setting>": "<value>"}
for each of the supported plots, to customize the plot colors and other attributes. For example, for the word_frequency_zipf and doc_len plots you can, respectively, pass:
plot_settings = {'theoritical_zipf_colorscale': 'Reds', 'emperical_zipf_colorscale': 'Greens', 'mode': 'markers'}
plot_settings = {'color': 'blue', 'showlegend': False}
You can pass all the attributes for different available distribution plots at once, but not all of them are supported across all plots. The supported attributes will be extracted and used for each distribution type.
- Returns:
None
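Two illustrative calls, one per distribution listed above, assuming the tsp instance from the earlier sketch:

```python
# Document-length distribution with a custom background and bar color.
tsp.show_distplot(
    distribution="doc_len",
    layout_settings={"plot_bgcolor": "rgba(245, 245, 245, 1)"},
    plot_settings={"color": "blue", "showlegend": False},
)

# Zipfian word-frequency distribution with the colorscale settings named above.
tsp.show_distplot(
    distribution="word_frequency_zipf",
    plot_settings={"theoritical_zipf_colorscale": "Reds",
                   "emperical_zipf_colorscale": "Greens",
                   "mode": "markers"},
)
```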
- show_insights()
Prints insights about the dataset.
- show_stats() None
Prints dataset statistics, including: language(s), number of unique words, number of all words, number of documents, median document length, number of nouns, number of adjectives, and number of verbs.
- show_word_clouds(pos: str, layout_settings: dict[str, Any] = {}, plot_settings: dict[str, str] = {}) None
Shows POS word clouds.
- Parameters:
pos – Type of POS. Can be any of the Penn POS tags (https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html).
layout_settings –
To customize the plot layout. For example:
layout_settings = {'plot_bgcolor': 'rgba(245, 245, 245, 1)', 'paper_bgcolor': 'rgba(255, 255, 255, 1)', 'hovermode': 'y'}
plot_settings –
To customize the plot colors and other attributes. For example:
plot_settings = {'color': 'darkgreen', 'max_words': 200}
- Returns:
None
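For example, assuming the tsp instance from the earlier sketch:

```python
# Word cloud of adjectives, capped at 200 words and drawn in dark green.
tsp.show_word_clouds(
    pos="JJ",
    plot_settings={"color": "darkgreen", "max_words": 200},
)
```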
Label Analysis
- class wordview.text_analysis.LabelStatsPlots(df: DataFrame, label_columns: list[Tuple])
Represents Label Statistics and Plots.
- __init__(df: DataFrame, label_columns: list[Tuple]) None
Initialize a new LabelStatsPlots object with the given arguments.
- Parameters:
df – DataFrame with one or more label column/s.
label_columns – list of tuples (column_name, label_type) that specify a label column and its type (categorical or numerical).
- Returns:
None
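A minimal sketch based on the signature above (the DataFrame, column names, and label types are illustrative):

```python
import pandas as pd
from wordview.text_analysis import LabelStatsPlots

# One categorical and one numerical label column, for illustration.
df = pd.DataFrame(
    {
        "text": ["good movie", "terrible plot", "decent acting"],
        "sentiment": ["positive", "negative", "neutral"],
        "rating": [8.5, 2.0, 5.5],
    }
)

lsp = LabelStatsPlots(
    df=df,
    label_columns=[("sentiment", "categorical"), ("rating", "numerical")],
)
```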
- show_label_plots(layout_settings: dict[str, Any] = {}) None
Renders label plots for columns specified in self.label_columns.
- Parameters:
layout_settings –
To customize the plot layout. For example:
layout_settings = {'plot_bgcolor': 'rgba(245, 245, 245, 1)', 'paper_bgcolor': 'rgba(255, 255, 255, 1)', 'hovermode': 'y', 'coloraxis': {'colorscale': 'peach'}, 'coloraxis_showscale': True}
See here for a list of named color scales: https://plotly.com/python/builtin-colorscales/
- Returns:
None
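For example, assuming the lsp instance sketched above:

```python
# Render the label plots with an illustrative layout override.
lsp.show_label_plots(
    layout_settings={
        "plot_bgcolor": "rgba(245, 245, 245, 1)",
        "coloraxis": {"colorscale": "peach"},
        "coloraxis_showscale": True,
    }
)
```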
Extraction & Analysis of Multiword Expressions
- class wordview.mwes.MWE(df: DataFrame, text_column: str, ngram_count_source=None, ngram_count_file_path=None, language: str = 'EN', custom_patterns: str | None = None, only_custom_patterns: bool = False, mwe_frequency_threshold: int = 10, association_threshold: float = 1.0)
Extract MWEs of types LVC, VPC, Noun Compounds, Adjective Compounds, and custom patterns from a text corpus.
- __init__(df: DataFrame, text_column: str, ngram_count_source=None, ngram_count_file_path=None, language: str = 'EN', custom_patterns: str | None = None, only_custom_patterns: bool = False, mwe_frequency_threshold: int = 10, association_threshold: float = 1.0) None
Initializes a new instance of the MWE class.
- Parameters:
df – A pandas DataFrame containing the text corpus.
text_column – The name of the column containing the text.
ngram_count_source – A dictionary containing ngram counts.
ngram_count_file_path – A path to a json file containing ngram counts.
language – The language of the corpus. Currently only 'EN' and 'DE' are supported. Default = 'EN'.
custom_patterns – A string of one or more patterns to match against the tokens. Example of a user-defined pattern:
NP: {<DT>?<JJ>*<NN>} # Noun phrase
You can use multiple and/or nested patterns, separated by a newline character, e.g.:
custom_patterns = '''
VP: {<MD>?<VB.*><NP|PP|CLAUSE>+$} # Verb phrase
PROPN: {<NNP>+} # Proper noun
ADJP: {<RB|RBR|RBS>*<JJ>} # Adjective phrase
ADVP: {<RB.*>+<VB.*><RB.*>*} # Adverb phrase
'''
only_custom_patterns – If True, only the custom patterns will be used to extract MWEs; otherwise, the default patterns will be used as well.
mwe_frequency_threshold – The minimum frequency of an MWE to be considered for extraction. Defaults to 10.
association_threshold – A threshold value for the association measure. Only MWEs with an association measure above this threshold will be returned.
- Returns:
None
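A minimal sketch based on the signature above (the corpus, the custom pattern, and the lowered frequency threshold are illustrative):

```python
import pandas as pd
from wordview.mwes import MWE

df = pd.DataFrame(
    {"text": ["She took a walk in the park.",
              "He gave up after the third try."]}
)

# One custom noun-phrase pattern on top of the built-in MWE types.
mwe = MWE(
    df=df,
    text_column="text",
    language="EN",
    custom_patterns="NP: {<DT>?<JJ>*<NN>}",
    only_custom_patterns=False,
    mwe_frequency_threshold=1,   # lowered only because this corpus is tiny
    association_threshold=1.0,
)
```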
- chat(api_key: str = '')
Chat with OpenAI's latest model about MWEs. Access the chat UI on your localhost at http://127.0.0.1:5001/
- Parameters:
api_key – OpenAI API key.
- Returns:
None
- extract_mwes(sort: bool = True, top_n: int | None = None) dict[str, dict[str, float]]
Extract MWEs from the text corpus and add them to self.mwes.
- Parameters:
sort – If True, the MWEs will be sorted in descending order of association measure.
top_n – If provided, only the top n MWEs will be returned.
- Returns:
None.
- print_mwe_table()
Prints a table of MWEs and their association measures.
- Parameters:
None
- Returns:
None
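For example, assuming the mwe instance sketched above:

```python
# Extract the ten highest-scoring MWEs (stored on the instance) and print them.
mwe.extract_mwes(sort=True, top_n=10)
mwe.print_mwe_table()
```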
Bias Analysis
- class wordview.bias_analysis.bias.BiasDetector(df, text_column)
Bias Detector class for detecting different bias categories in text.
- __init__(df, text_column)
Initializes a BiasDetector object.
- Parameters:
df – A pandas dataframe containing text data.
text_column – The name of the column containing text data.
- Returns:
None
- detect_bias(language: str = 'en') dict[str, dict[str, float]]
Detects bias in the text data.
- Parameters:
language – The language of the text data. Defaults to English.
- Returns:
A dictionary of bias categories and subcategories and their associated bias scores.
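A minimal sketch based on the signatures above (the sample DataFrame is illustrative):

```python
import pandas as pd
from wordview.bias_analysis.bias import BiasDetector

df = pd.DataFrame(
    {"text": ["Example sentences that should be scanned for biased language."]}
)

detector = BiasDetector(df=df, text_column="text")

# {category: {subcategory: score}} per the return description above.
scores = detector.detect_bias(language="en")
```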
- print_bias_table()
Prints a table of the bias scores for each category.
- Parameters:
None
- Returns:
None
- show_bias_plot(colorscale: str | list[list] | None = None, layout_settings: dict | None = None, font_settings: dict = {'bias_subcategory_font': {'size': 16}, 'category_titles': {'size': 18}, 'colorbar_tick_font': {'size': 16}, 'colorbar_title_font': {'size': 18}})
Displays a plotly heatmap of the bias scores for each category.
- Parameters:
colorscale –
The colorscale to use for the heatmap. If not provided, the default colorscale is used. You can define a custom colorscale by providing a list of lists of the form:
cyan_scopecolorscale = [
[0.0, "#E0FFFF"],  # Lightest Cyan
[0.25, "#B3E4E4"], # Lighter Cyan
[0.5, "#66C2C2"],  # Neutral Cyan
[0.75, "#339999"], # Darker Cyan
[1.0, "#006666"],  # Darkest Cyan
]
Or you can use one of the built-in colorscales by providing a string. Examples of available colorscales are: 'aggrnyl', 'agsunset', 'algae', 'amp', 'armyrose', 'balance', 'blackbody', 'bluered', 'blues', 'blugrn', 'bluyl', 'brbg', 'brwnyl', 'bugn', 'bupu', 'burg', 'burgyl', 'cividis', 'curl'.
You can reverse a colorscale by appending '_r' to it, e.g. 'algae_r'. See here for a full list: https://plotly.com/python/builtin-colorscales/
layout_settings –
A dictionary of layout settings to apply to the plot. If not provided, the default layout settings are used. For example:
layout_settings = {'plot_bgcolor': 'rgba(245, 245, 245, 1)', 'paper_bgcolor': 'rgba(255, 255, 255, 1)', 'hovermode': 'y'}
For a full list of possible options, see: https://plotly.com/python/reference/layout/
font_settings –
A dictionary of font sizes for the color bar, sub-categories, and subtitles. Default =
{'colorbar_tick_font': {'size': 16}, 'colorbar_title_font': {'size': 18}, 'bias_subcategory_font': {'size': 16}, 'category_titles': {'size': 18}}
- Returns:
None
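For example, assuming the detector instance sketched above:

```python
# Print the bias table, then show the heatmap with a built-in colorscale
# and an illustrative layout override.
detector.print_bias_table()
detector.show_bias_plot(
    colorscale="cividis",
    layout_settings={"plot_bgcolor": "rgba(245, 245, 245, 1)"},
)
```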
Cluster Analysis
- class wordview.clustering.cluster.Cluster(documents: List[str], vector_model: str = 'tfidf')
Cluster text documents using various clustering algorithms and different vectorization techniques.
- __init__(documents: List[str], vector_model: str = 'tfidf') None
- Parameters:
documents (List) – List of documents.
vector_model (str) – Vectorization technology. Default = tfidf.
- cluster(clustering_algorithm: str = 'kmeans', n_clusters: int = 3, distance_threshold: Any | None = None) None
Cluster documents using the algorithm specified by clustering_algorithm.
- Parameters:
clustering_algorithm (str) – Clustering algorithm. Default = kmeans. Supported algorithms: [Kmeans, AgglomerativeClustering].
n_clusters (int) – Number of clusters. Default = 3.
distance_threshold (float) – Distance threshold for AgglomerativeClustering. Default = None.
- Returns:
None
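A minimal sketch based on the signatures above (the document list is illustrative):

```python
from wordview.clustering.cluster import Cluster

documents = [
    "The market rallied on strong earnings.",
    "Stocks fell amid inflation fears.",
    "The team won the championship game.",
]

# TF-IDF vectorization, then k-means with two clusters.
cl = Cluster(documents=documents, vector_model="tfidf")
cl.cluster(clustering_algorithm="kmeans", n_clusters=2)
```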
Anomaly Analysis
- class wordview.anomaly.NormalDistAnomalies(items: Dict, val_name: str = 'representative_value', gaussianization_strategy: str = 'brute')
- __init__(items: Dict, val_name: str = 'representative_value', gaussianization_strategy: str = 'brute')
Identify anomalies on a normal distribution.
- Parameters:
items – A dictionary of items and their representative value, such as word_count, idf, etc.
val_name – Name of the value in the above dictionary, e.g. word_count, idf, etc. Defaults to representative_value.
gaussianization_strategy – Strategy for gaussianization. Can be any of lambert, brute, or boxcox. Defaults to brute.
- Returns:
None
- anomalous_items(manual: bool = False, z: int = 3, prob: float = 0.001) Set[str]
Identify anomalous items in self.items.
- Parameters:
manual – Whether or not to select anomalous items using manual thresholds. When set to True, manual_thresholds should be specified.
z – Items with a z-score above this value are considered anomalous. Used only when manual is False. Default = 3.
prob – Probability threshold below which items are considered anomalous.
- Returns:
An alphabetically sorted set of anomalous items.
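A minimal sketch based on the signatures above (the item-to-value mapping is illustrative, e.g. words and their IDF scores):

```python
from wordview.anomaly import NormalDistAnomalies

items = {"the": 0.1, "of": 0.2, "protein": 4.2, "quantum": 4.8, "xyzzy": 9.7}

nda = NormalDistAnomalies(
    items=items,
    val_name="idf",
    gaussianization_strategy="brute",
)

# Items whose z-score exceeds 3 after gaussianization.
anomalies = nda.anomalous_items(manual=False, z=3)
```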