API

API reference for public classes and methods.

Text Analysis

class wordview.text_analysis.TextStatsPlots(df: DataFrame, text_column: str, distributions: set = {'doc_len', 'sentence_len', 'word_frequency_zipf'}, pos_tags: set = {'CC', 'CD', 'DT', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNP', 'NNPS', 'NNS', 'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB'})

Represents Text Statistics and Plots.

__init__(df: DataFrame, text_column: str, distributions: set = {'doc_len', 'sentence_len', 'word_frequency_zipf'}, pos_tags: set = {'CC', 'CD', 'DT', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNP', 'NNPS', 'NNS', 'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB'}) None

Initialize a new TextStatsPlots object with the given arguments.

Parameters:
  • df – DataFrame with a text_column that contains the text corpus.

  • text_column – Specifies the column of DataFrame where text data resides.

  • distributions

    set of distribution types to generate and plot. Available distributions are:

    doc_len: Document lengths

    sentence_len: Sentence lengths

    word_frequency_zipf: Zipfian word frequency distribution.

    Default = {'doc_len', 'sentence_len', 'word_frequency_zipf'}

  • pos_tags

    A set of target POS tags for downstream analysis.

    Default = {'CC', 'CD', 'DT', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNS', 'NNP', 'NNPS', 'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB'}

Returns:

None
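
To make the three distribution names concrete, the following standard-library sketch computes the raw quantities they refer to. It is only an illustration: the whitespace tokenizer, the regex sentence splitter, and the function name `toy_distributions` are assumptions and do not reflect how Wordview computes or plots these distributions.

```python
import re
from collections import Counter


def toy_distributions(docs):
    """Compute toy versions of doc_len, sentence_len and word_frequency_zipf.

    Hypothetical helper for illustration only; not part of Wordview.
    """
    # doc_len: number of whitespace-separated tokens per document.
    doc_len = [len(doc.split()) for doc in docs]

    # sentence_len: tokens per sentence, using a naive split on ., ! and ?.
    sentences = [s for doc in docs for s in re.split(r"[.!?]+", doc) if s.strip()]
    sentence_len = [len(s.split()) for s in sentences]

    # word_frequency_zipf: word counts ordered from most to least frequent,
    # i.e. the frequency-by-rank sequence that a Zipf plot is drawn from.
    counts = Counter(w.lower() for doc in docs for w in re.findall(r"\w+", doc))
    word_frequency = [freq for _, freq in counts.most_common()]

    return {
        "doc_len": doc_len,
        "sentence_len": sentence_len,
        "word_frequency_zipf": word_frequency,
    }


docs = ["The cat sat. The cat slept.", "A dog barked!"]
stats = toy_distributions(docs)
```

Here `stats["doc_len"]` holds one length per document, while `stats["word_frequency_zipf"]` holds one frequency per distinct word, sorted by rank.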

chat(api_key: str = '')

Chat with OpenAI’s latest model about the results of Wordview’s text analysis. Access the chat UI in your localhost under http://127.0.0.1:5000/

Parameters:

api_key – OpenAI API key.

Returns:

None

return_stats() dict[str, Any]

Returns dataset statistics, including: language/s, number of unique words, number of all words, number of documents, median document length, number of nouns, number of adjectives, and number of verbs.

show_bar_plots(pos: str, layout_settings: dict[str, Any] = {}, plot_settings: dict[str, str] = {}) None

Shows POS bar plots.

Parameters:
  • pos – Type of POS. Can be any of the Penn POS tags (https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html).

  • layout_settings

    To customize the plot layout. For example:

    layout_settings = {'plot_bgcolor':'rgba(245, 245, 245, 1)',
                       'paper_bgcolor': 'rgba(255, 255, 255, 1)',
                       'hovermode': 'y'}

  • plot_settings

    To customize the plot colors and other attributes. For example:

    plot_settings = {'color': 'darkgreen',
                     'max_words': 200}

show_distplot(distribution: str, layout_settings: dict[str, str] = {}, plot_settings: dict[str, str] = {}) None

Shows the plot for the given distribution.

Parameters:
  • distribution

    The distribution for which the plot is to be shown. Available distributions are:

    doc_len: document lengths

    word_frequency_zipf: Zipfian word frequency distribution.

  • layout_settings

    To customize the plot layout. For example:

    layout_settings = {'plot_bgcolor':'rgba(245, 245, 245, 1)',
                       'paper_bgcolor': 'rgba(255, 255, 255, 1)',
                       'hovermode': 'y'}
    

    For a full list of possible options, see: https://plotly.com/python/reference/layout/

  • plot_settings

    A dictionary of the form {"<plot_setting>": "<value>"} for each of the supported plots, used to customize the plot colors and other attributes. For example, for the word_frequency_zipf and doc_len plots, you can pass, respectively:

    plot_settings = {'theoritical_zipf_colorscale': 'Reds',
                     'emperical_zipf_colorscale': 'Greens',
                     'mode': 'markers'}
    
    plot_settings = {'color': 'blue',
                     'showlegend': False}
    

    You can pass all the attributes for different available distribution plots at once, but not all of them are supported across all plots. The supported attributes will be extracted and used for each distribution type.

Returns:

None
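
The note that one combined plot_settings dictionary can serve several plots, with each plot using only the attributes it supports, can be sketched as a simple key filter. The `SUPPORTED_KEYS` table and the `settings_for` helper below are hypothetical: the key sets are copied from the two examples above and are not a complete or authoritative list of what Wordview accepts.

```python
# Hypothetical per-plot key sets, taken only from the examples in this section.
SUPPORTED_KEYS = {
    "word_frequency_zipf": {
        "theoritical_zipf_colorscale",
        "emperical_zipf_colorscale",
        "mode",
    },
    "doc_len": {"color", "showlegend"},
}


def settings_for(distribution, plot_settings):
    """Return only the entries of plot_settings that apply to one plot type."""
    allowed = SUPPORTED_KEYS.get(distribution, set())
    return {key: value for key, value in plot_settings.items() if key in allowed}


# One dictionary carrying attributes for both plot types at once.
combined = {
    "theoritical_zipf_colorscale": "Reds",
    "emperical_zipf_colorscale": "Greens",
    "mode": "markers",
    "color": "blue",
    "showlegend": False,
}

zipf_settings = settings_for("word_frequency_zipf", combined)
doc_len_settings = settings_for("doc_len", combined)
```

Under this assumption, attributes meant for another plot type are dropped rather than raising an error.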

show_insights()

Prints insights about the dataset.

show_stats() None

Prints dataset statistics, including: language/s, number of unique words, number of all words, number of documents, median document length, number of nouns, number of adjectives, and number of verbs.

show_word_clouds(pos: str, layout_settings: dict[str, Any] = {}, plot_settings: dict[str, str] = {}) None

Shows POS word clouds.

Parameters:
  • pos – Type of POS. Can be any of the Penn POS tags (https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html).

  • layout_settings

    To customize the plot layout. For example:

    layout_settings = {'plot_bgcolor':'rgba(245, 245, 245, 1)',
                       'paper_bgcolor': 'rgba(255, 255, 255, 1)',
                       'hovermode': 'y'}

  • plot_settings

    To customize the plot colors and other attributes. For example:

    plot_settings = {'color': 'darkgreen',
                     'max_words': 200}

Returns:

None

Label Analysis

class wordview.text_analysis.LabelStatsPlots(df: DataFrame, label_columns: list[Tuple])

Represents Label Statistics and Plots.

__init__(df: DataFrame, label_columns: list[Tuple]) None

Initialize a new LabelStatsPlots object with the given arguments.

Parameters:
  • df – DataFrame with one or more label column/s.

  • label_columns – list of tuples (column_name, label_type) that specify a label column and its type (categorical or numerical).

Returns:

None
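
Since label_columns is a list of (column_name, label_type) tuples, it can be useful to check it before constructing the object. The sketch below is a hypothetical pre-flight check; `check_label_columns` is not part of Wordview, and the two allowed type strings are taken from the parameter description above.

```python
ALLOWED_LABEL_TYPES = {"categorical", "numerical"}


def check_label_columns(label_columns, available_columns):
    """Return a list of human-readable problems found in label_columns."""
    problems = []
    for entry in label_columns:
        # Each entry is expected to be a (column_name, label_type) tuple.
        if not (isinstance(entry, tuple) and len(entry) == 2):
            problems.append(f"not a (column_name, label_type) tuple: {entry!r}")
            continue
        column_name, label_type = entry
        if column_name not in available_columns:
            problems.append(f"unknown column: {column_name!r}")
        if label_type not in ALLOWED_LABEL_TYPES:
            problems.append(f"unknown label type: {label_type!r}")
    return problems


# Column names here are made up for the example.
columns = ["text", "sentiment", "rating"]
good = [("sentiment", "categorical"), ("rating", "numerical")]
bad = [("sentiment", "category"), ("stars", "numerical"), "rating"]

good_problems = check_label_columns(good, columns)
bad_problems = check_label_columns(bad, columns)
```

With a pandas DataFrame, `available_columns` would typically be `list(df.columns)`.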

show_label_plots(layout_settings: dict[str, Any] = {}) None

Renders label plots for columns specified in self.label_columns.

Parameters:
  • layout_settings

    To customize the plot layout. For example:

    layout_settings = {'plot_bgcolor':'rgba(245, 245, 245, 1)',
                       'paper_bgcolor': 'rgba(255, 255, 255, 1)',
                       'hovermode': 'y',
                       'coloraxis': {'colorscale': 'peach'},
                       'coloraxis_showscale': True}

    See here for a list of named color scales: https://plotly.com/python/builtin-colorscales/

Returns:

None

Extraction & Analysis of Multiword Expressions

class wordview.mwes.MWE(df: DataFrame, text_column: str, ngram_count_source=None, ngram_count_file_path=None, language: str = 'EN', custom_patterns: str | None = None, only_custom_patterns: bool = False, mwe_frequency_threshold: int = 10, association_threshold: float = 1.0)

Extracts MWEs of the following types from a text corpus: LVC, VPC, Noun Compounds, Adjective Compounds, and custom patterns.

__init__(df: DataFrame, text_column: str, ngram_count_source=None, ngram_count_file_path=None, language: str = 'EN', custom_patterns: str | None = None, only_custom_patterns: bool = False, mwe_frequency_threshold: int = 10, association_threshold: float = 1.0) None

Initializes a new instance of MWE class.

Parameters:
  • df – A pandas DataFrame containing the text corpus.

  • text_column – The name of the column containing the text.

  • ngram_count_source – A dictionary containing ngram counts.

  • ngram_count_file_path – A path to a json file containing ngram counts.

  • language – The language of the corpus. Currently only 'EN' and 'DE' are supported. Defaults to 'EN'.

  • custom_patterns

    A string pattern to match against the tokens. The pattern must be a string of the following form. Example of a user-defined pattern:

    NP: {<DT>?<JJ>*<NN>} # Noun phrase

    You can use multiple and/or nested patterns, separated by a newline character, e.g.:

    custom_patterns = '''
    VP: {<MD>?<VB.*><NP|PP|CLAUSE>+$} # Verb phrase
    PROPN: {<NNP>+} # Proper noun
    ADJP: {<RB|RBR|RBS>*<JJ>} # Adjective phrase
    ADVP: {<RB.*>+<VB.*><RB.*>*} # Adverb phrase'''

  • only_custom_patterns – If True, only the custom patterns will be used to extract MWEs; otherwise, the default patterns will be used as well.

  • mwe_frequency_threshold – The minimum frequency of an MWE to be considered for extraction. Defaults to 10.

  • association_threshold – A threshold value for the association measure. Only MWEs with an association measure above this threshold will be returned.

Returns:

None
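
Because custom patterns are passed as one newline-separated string of `LABEL: {pattern}` rules, it can be convenient to assemble that string from a mapping. The helper `build_custom_patterns` below is hypothetical and only performs string formatting; it does not check that the patterns are valid.

```python
def build_custom_patterns(rules):
    """Join {label: pattern} pairs into one newline-separated pattern string."""
    # Doubled braces in the f-string produce the literal { and } around each pattern.
    return "\n".join(f"{label}: {{{pattern}}}" for label, pattern in rules.items())


rules = {
    "NP": "<DT>?<JJ>*<NN>",  # Noun phrase
    "PROPN": "<NNP>+",       # Proper noun
}
custom_patterns = build_custom_patterns(rules)
```

The resulting string has the same shape as the examples in the parameter description and could be supplied as the custom_patterns argument.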

chat(api_key: str = '')

Chat with OpenAI’s latest model about MWEs. Access the chat UI in your localhost under http://127.0.0.1:5001/

Parameters:

api_key – OpenAI API key.

Returns:

None

extract_mwes(sort: bool = True, top_n: int | None = None) dict[str, dict[str, float]]

Extract MWEs from the text corpus and add them to self.mwes.

Parameters:
  • sort – If True, the MWEs will be sorted in descending order of association measure.

  • top_n – If provided, only the top n MWEs will be returned.

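Returns:

A dictionary of the form dict[str, dict[str, float]] holding the extracted MWEs and their association measures.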
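
The effect of sort and top_n can be pictured on a small nested dictionary. This is a sketch under assumptions: the outer keys are taken to be MWE types, the inner dictionaries map each MWE to its association measure, and top_n is applied per type. The sample scores and the `rank_mwes` helper are made up for illustration and are not Wordview's implementation.

```python
def rank_mwes(mwes, sort=True, top_n=None):
    """Sort each inner dict by score (descending) and keep the first top_n."""
    ranked = {}
    for mwe_type, scores in mwes.items():
        items = list(scores.items())
        if sort:
            # Highest association measure first.
            items.sort(key=lambda pair: pair[1], reverse=True)
        if top_n is not None:
            items = items[:top_n]
        ranked[mwe_type] = dict(items)
    return ranked


# Made-up scores for the example.
sample = {
    "NC": {"coffee shop": 2.1, "swimming pool": 5.4, "bus stop": 3.3},
    "VPC": {"give up": 4.0, "look after": 1.2},
}
top_two = rank_mwes(sample, sort=True, top_n=2)
```

Python dictionaries preserve insertion order, so iterating over `top_two["NC"]` yields the MWEs from strongest to weakest association.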

print_mwe_table()

Prints a table of MWEs and their association measures.

Parameters:

None

Returns:

None

Bias Analysis

class wordview.bias_analysis.bias.BiasDetector(df, text_column)

Bias Detector class for detecting different bias categories in text.

__init__(df, text_column)

Initializes a BiasDetector object.

Parameters:
  • df – A pandas dataframe containing text data.

  • text_column – The name of the column containing text data.

Returns:

None

detect_bias(language: str = 'en') dict[str, dict[str, float]]

Detects bias in the text data.

Parameters:

language – The language of the text data. Defaults to English.

Returns:

A dictionary of bias categories and subcategories and their associated bias scores.
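
The returned structure is a nested dictionary, which can be flattened into rows for inspection or export. In the sketch below, the category names, subcategory names, and scores are invented, and `flatten_bias_scores` is a hypothetical helper; it makes no assumption about the range or sign convention of real scores.

```python
def flatten_bias_scores(scores):
    """Turn {category: {subcategory: score}} into rows sorted by score."""
    rows = [
        (category, subcategory, score)
        for category, subcategories in scores.items()
        for subcategory, score in subcategories.items()
    ]
    # Highest score first; ties keep their original order.
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows


# Invented values for the example.
sample_scores = {
    "gender": {"female": -0.2, "male": 0.1},
    "religion": {"christian": 0.05},
}
rows = flatten_bias_scores(sample_scores)
```

Each row is a (category, subcategory, score) tuple, which maps directly onto a three-column table.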

print_bias_table()

Prints a table of the bias scores for each category.

Parameters:

None

Returns:

None

show_bias_plot(colorscale: str | list[list] | None = None, layout_settings: dict | None = None, font_settings: dict = {'bias_subcategory_font': {'size': 16}, 'category_titles': {'size': 18}, 'colorbar_tick_font': {'size': 16}, 'colorbar_title_font': {'size': 18}})

Displays a plotly heatmap of the bias scores for each category.

Parameters:
  • colorscale

    The colorscale to use for the heatmap. If not provided, the default colorscale is used. You can define a custom colorscale by providing a list of lists of the form:

    cyan_scopecolorscale = [
        [0.0, "#E0FFFF"],   # Lightest Cyan
        [0.25, "#B3E4E4"],  # Lighter Cyan
        [0.5, "#66C2C2"],   # Neutral Cyan
        [0.75, "#339999"],  # Darker Cyan
        [1.0, "#006666"],   # Darkest Cyan
    ]

    Or you can use one of the built-in colorscales by providing a string. Examples of available colorscales are: 'aggrnyl', 'agsunset', 'algae', 'amp', 'armyrose', 'balance', 'blackbody', 'bluered', 'blues', 'blugrn', 'bluyl', 'brbg', 'brwnyl', 'bugn', 'bupu', 'burg', 'burgyl', 'cividis', 'curl'.

    You can reverse a colorscale by appending '_r' to it, e.g. 'algae_r'. See here for a full list: https://plotly.com/python/builtin-colorscales/

  • layout_settings

    A dictionary of layout settings to apply to the plot. If not provided, the default layout settings are used. For example:

    layout_settings = {'plot_bgcolor':'rgba(245, 245, 245, 1)',
                       'paper_bgcolor': 'rgba(255, 255, 255, 1)',
                       'hovermode': 'y'}

    For a full list of possible options, see: https://plotly.com/python/reference/layout/

  • font_settings

    A dictionary of font sizes for color bar, sub-categories, and subtitles. Default:

    font_settings = {'colorbar_tick_font': {'size': 16},
                     'colorbar_title_font': {'size': 18},
                     'bias_subcategory_font': {'size': 16},
                     'category_titles': {'size': 18}}

Returns:

None
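
A custom colorscale is a list of [position, color] stops. The sketch below checks the usual shape of such a list (positions ascending from 0.0 to 1.0) and builds a reversed copy by hand, similar in spirit to what the '_r' suffix does for built-in names. Both helpers are hypothetical and independent of Wordview and Plotly.

```python
def is_valid_colorscale(scale):
    """Check that stops run from 0.0 to 1.0 in ascending order."""
    if len(scale) < 2:
        return False
    positions = [stop[0] for stop in scale]
    return (
        positions[0] == 0.0
        and positions[-1] == 1.0
        and positions == sorted(positions)
    )


def reverse_colorscale(scale):
    """Mirror the stop positions so the colors run in the opposite direction."""
    # Rounding guards against floating-point noise in 1.0 - position.
    return [[round(1.0 - position, 10), color] for position, color in reversed(scale)]


cyan = [
    [0.0, "#E0FFFF"],
    [0.25, "#B3E4E4"],
    [0.5, "#66C2C2"],
    [0.75, "#339999"],
    [1.0, "#006666"],
]
cyan_reversed = reverse_colorscale(cyan)
```

The reversed list keeps the same stop positions but starts with the darkest color instead of the lightest.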

Cluster Analysis

class wordview.clustering.cluster.Cluster(documents: List[str], vector_model: str = 'tfidf')

Cluster text documents using various clustering algorithms and based on different vectorization technologies.

__init__(documents: List[str], vector_model: str = 'tfidf') None

Parameters:
  • documents (List) – List of documents.

  • vector_model (str) – Vectorization technology. Default = tfidf.

cluster(clustering_algorithm: str = 'kmeans', n_clusters: int = 3, distance_threshold: Any | None = None) None

Cluster documents using the algorithm specified by clustering_algorithm.

Parameters:
  • clustering_algorithm (str) – Clustering algorithm. Default = kmeans. Supported algorithms: [Kmeans, AgglomerativeClustering].

  • n_clusters (int) – Number of clusters. Default = 3.

  • distance_threshold (float) – Distance threshold for AgglomerativeClustering. Default = None.

Returns:

None
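
For readers unfamiliar with the default 'tfidf' vector model, the following standard-library sketch shows one common TF-IDF variant (relative term frequency times the natural log of inverse document frequency). It is an illustration only: the tokenizer, the weighting variant, and the function name `tfidf_vectors` are assumptions, and Wordview's own vectorization may differ.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return (vocabulary, vectors) using tf * log(N / df) weights."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vocabulary = sorted(doc_freq)

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        vectors.append(
            [
                (counts[term] / total) * math.log(n_docs / doc_freq[term])
                for term in vocabulary
            ]
        )
    return vocabulary, vectors


documents = ["the red apple", "the red car", "the green tree"]
vocabulary, vectors = tfidf_vectors(documents)
```

A term that occurs in every document, such as "the" here, receives a weight of zero, so it does not influence the distance between documents during clustering.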

Anomaly Analysis

class wordview.anomaly.NormalDistAnomalies(items: Dict, val_name: str = 'representative_value', gaussianization_strategy: str = 'brute')

__init__(items: Dict, val_name: str = 'representative_value', gaussianization_strategy: str = 'brute')

Identify anomalies on a normal distribution.

Parameters:
  • items – A dictionary of items and their representative value, such as word_count, idf, etc.

  • val_name – Name of the value in the above dictionary, e.g. word_count or idf. Defaults to representative_value.

  • gaussianization_strategy – Strategy for gaussianization. Can be any of lambert, brute, or boxcox. Defaults to brute.

Returns:

None

anomalous_items(manual: bool = False, z: int = 3, prob: float = 0.001) Set[str]

Identify anomalous items in self.items.

Parameters:
  • manual – Whether or not to select redundant words using manual thresholds. When set to True, manual_thresholds should be specified.

  • z – Items with a z-score above this value are considered anomalous. Used only when manual is False. Default = 3.

  • prob – Probability threshold below which items are considered anomalous.

Returns:

An alphabetically sorted set of anomalous items.
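
The z-score criterion can be illustrated with the standard library alone. This is a simplified sketch, not the class's implementation: it skips the gaussianization step, uses the population standard deviation, and flags items by the absolute value of their z-score. The helper name `zscore_anomalies` and the sample counts are made up.

```python
import statistics


def zscore_anomalies(items, z=3):
    """Return the names whose value lies more than z standard deviations from the mean."""
    values = list(items.values())
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # All values are identical, so nothing stands out.
        return []
    return sorted(
        name for name, value in items.items() if abs(value - mean) / std > z
    )


# Twenty ordinary items and one extreme value (made-up counts).
word_counts = {f"w{i}": 10 for i in range(20)}
word_counts["rare"] = 500
anomalies = zscore_anomalies(word_counts, z=3)
```

The result is returned as a sorted list, since Python sets carry no ordering of their own.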