Helpers

class helpers.PreProcessing(text)[source]

Bases: object

Class that bundles common text preprocessing and cleaning techniques

lemmatize()[source]

Function to extract root words - Lemmatizing

Parameters

None

Returns

Processed text

Return type

str

remove_characters()[source]

Function to remove special characters

Parameters

None

Returns

Processed text

Return type

str

remove_letters(size)[source]

Function to remove words with fewer than size letters from the text

Parameters
  • size (int) – Minimum number of letters a word must have to be kept

Returns

Processed text

Return type

str

remove_numbers()[source]

Function to remove numbers in a text

Parameters

None

Returns

Processed text

Return type

str

remove_punctuation()[source]

Function to remove punctuation from the text

Parameters

None

Returns

Processed text

Return type

str

remove_stopwords()[source]

Function to remove stopwords from the text

Parameters

None

Returns

Processed text

Return type

str

remove_urls()[source]

Remove URLs from text

Parameters

None

Returns

Text after removing URLs

Return type

str

stemming()[source]

Function to extract root words - Stemming

Parameters

None

Returns

Processed text

Return type

str

text_lowercase()[source]

Function to convert all letters in a text to lowercase

Parameters

None

Returns

Processed text

Return type

str

tokenize()[source]

Function to tokenize the text into phrases and tokens

Parameters

None

Returns

Processed text

Return type

str
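A minimal sketch of how a few of these cleaning steps might be chained, using only the standard library. The class name SimplePreProcessing, the regular expressions, and the choice to store the processed text on the instance as well as return it are assumptions for illustration; they are not taken from the helpers source. Lemmatizing, stemming and stopword removal are left out because they would depend on an external NLP library.

```python
import re
import string


class SimplePreProcessing:
    """Hypothetical stand-in for helpers.PreProcessing (stdlib only)."""

    def __init__(self, text):
        self.text = text

    def text_lowercase(self):
        self.text = self.text.lower()
        return self.text

    def remove_urls(self):
        # Assumed pattern: anything starting with http(s):// or www.
        self.text = re.sub(r"(https?://|www\.)\S+", "", self.text)
        return self.text

    def remove_numbers(self):
        self.text = re.sub(r"\d+", "", self.text)
        return self.text

    def remove_punctuation(self):
        table = str.maketrans("", "", string.punctuation)
        self.text = self.text.translate(table)
        return self.text

    def remove_letters(self, size):
        # Keep only words with at least `size` letters.
        self.text = " ".join(w for w in self.text.split() if len(w) >= size)
        return self.text


pre = SimplePreProcessing("Visit https://example.com NOW, it has 42 great tips!")
pre.text_lowercase()
pre.remove_urls()
pre.remove_numbers()
pre.remove_punctuation()
cleaned = pre.remove_letters(3)
print(cleaned)
```

The order matters in this sketch: URLs are removed before punctuation, since stripping punctuation first would break the URL pattern.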

helpers.counter_cosine_similarity(user_id, counterA, counterB)[source]

Calculate the Counter-based cosine similarity between two keyword lists for a given user_id

Parameters
  • user_id (str) – User ID

  • counterA (List) – Keyword list 1

  • counterB (List) – Keyword list 2

Returns

Dictionary of {User ID: Counter_cosine value}

Return type

Dictionary
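The computation described above can be sketched with collections.Counter. This is one plausible reading of "counter cosine similarity", not the helpers implementation: each keyword list is turned into a count vector and the cosine of the angle between the two vectors is returned under the user ID. The function name and the handling of empty lists are assumptions.

```python
import math
from collections import Counter


def counter_cosine_similarity_sketch(user_id, keywords_a, keywords_b):
    """Cosine similarity between two keyword lists, keyed by user_id."""
    count_a, count_b = Counter(keywords_a), Counter(keywords_b)
    # Dot product over the keywords the two lists share.
    dot = sum(count_a[k] * count_b[k] for k in count_a.keys() & count_b.keys())
    norm_a = math.sqrt(sum(v * v for v in count_a.values()))
    norm_b = math.sqrt(sum(v * v for v in count_b.values()))
    # Assumption: an empty list yields a similarity of 0.0.
    if norm_a == 0 or norm_b == 0:
        return {user_id: 0.0}
    return {user_id: dot / (norm_a * norm_b)}


result = counter_cosine_similarity_sketch(
    "u1", ["nlp", "python", "nlp"], ["nlp", "pandas"]
)
print(result)
```

Here the count vectors are (nlp: 2, python: 1) and (nlp: 1, pandas: 1), so the value is 2 / (sqrt(5) * sqrt(2)).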

helpers.create_tokens(df, func, column_name, n_cores)[source]

Function to create tokens for a given column name

Parameters
  • df (Pandas.DataFrame) – DataFrame

  • func – Function to be applied

  • column_name (str) – The name of the column

  • n_cores (int) – No of CPU cores to be used

Returns

List containing the results of the function applied on each element of the column

Return type

List

helpers.extract_json(base_url, end_url, page)[source]

Extract data from a webpage whose URL is formed by concatenating base_url + page + end_url

Parameters
  • base_url (str) – Base URL

  • end_url (str) – End URL

  • page (str) – The page number

Returns

Response from the webpage

Return type

Dict
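A small sketch of the two pieces this description implies: building the URL in the documented order and decoding a JSON body into a dict. The helper names build_page_url and parse_json_body and the example.com URL are hypothetical, and the HTTP request itself is deliberately left out so the snippet runs offline.

```python
import json


def build_page_url(base_url, end_url, page):
    """Concatenate the pieces in the documented order: base_url + page + end_url."""
    return f"{base_url}{page}{end_url}"


def parse_json_body(body):
    """Decode a JSON response body (text) into Python objects."""
    return json.loads(body)


# Placeholder URL and body, for illustration only.
url = build_page_url("https://example.com/api/items?page=", "&size=50", "3")
payload = parse_json_body('{"page": 3, "items": []}')
print(url)
print(payload)
```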

helpers.get_datetime(date_str, year=True)[source]

Converts string Datetime to Datetime Object

Parameters
  • date_str (str) – Datetime in String

  • year (bool) – If True, returns only the year

Returns

Datetime object

Return type

Datetime
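A hedged sketch of the conversion using datetime.strptime. The input format is not stated in this reference, so the fmt argument and its default are assumptions, as is returning an int when year is True.

```python
from datetime import datetime


def get_datetime_sketch(date_str, year=True, fmt="%Y-%m-%d %H:%M:%S"):
    """Parse date_str; return only the year when `year` is True."""
    parsed = datetime.strptime(date_str, fmt)
    return parsed.year if year else parsed


only_year = get_datetime_sketch("2021-03-15 08:30:00")
full = get_datetime_sketch("2021-03-15 08:30:00", year=False)
print(only_year, full)
```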

helpers.get_formatted_date(data, format_='%m%d%Y')[source]

Function to format date in a required format

Parameters
  • data (List) – List of dates as string

  • format_ (str) – Format in which date should be returned

Returns

Formatted date

Return type

Datetime
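As an illustration of the default '%m%d%Y' format, the sketch below reformats a list of date strings with strptime and strftime. The input_format argument is hypothetical, since this reference does not say how the incoming strings are laid out, and returning a list of strings is likewise an assumption.

```python
from datetime import datetime


def get_formatted_date_sketch(data, format_="%m%d%Y", input_format="%Y-%m-%d"):
    """Reformat each date string in `data` from input_format to format_."""
    return [datetime.strptime(d, input_format).strftime(format_) for d in data]


formatted = get_formatted_date_sketch(["2021-03-15", "2020-12-01"])
print(formatted)
```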

helpers.get_keys(text, ngram=1, ntop=10, generator='Spacy')[source]

Function to extract keywords from a text using a chosen generator

Parameters
  • text (str) – The text from which keywords are to be extracted

  • ngram (int) – No of words used for Ngram

  • ntop (int) – The no of top keywords to be extracted

  • generator (str) – The algorithm to be used for keyword extraction

Returns

List of Keywords

Return type

List
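To make the ngram and ntop parameters concrete, here is a plain frequency-based stand-in written with the standard library. It is not the 'Spacy' generator named in the signature and does not claim to reproduce its ranking; the function name top_ngrams and the word pattern are assumptions.

```python
import re
from collections import Counter


def top_ngrams(text, ngram=1, ntop=10):
    """Return the ntop most frequent n-grams (n = ngram) in text."""
    words = re.findall(r"[a-z]+", text.lower())
    # Slide a window of `ngram` words across the text.
    grams = [" ".join(words[i:i + ngram]) for i in range(len(words) - ngram + 1)]
    return [gram for gram, _ in Counter(grams).most_common(ntop)]


keywords = top_ngrams("data science and data engineering need data", ngram=1, ntop=2)
bigrams = top_ngrams("big data big data tools", ngram=2, ntop=1)
print(keywords, bigrams)
```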

helpers.get_request(url, headers, data='')[source]

Retrieves a webpage using the given headers and payload data and returns the response text

Parameters
  • url (str) – URL from which data needs to be extracted

  • headers (List) – headers for the url

  • data (Dict) – Payload data

Returns

Response from the webpage in text

Return type

Str

helpers.merge_databases(dset1, dset2, on, how='inner')[source]

Function to merge two datasets on the key column ‘on’

Parameters
  • dset1 (Pandas.DataFrame) – Dataset 1

  • dset2 (Pandas.DataFrame) – Dataset 2

  • on (str) – Column on which datasets are to be merged

  • how (str) – Type of join

Returns

Merged DataFrame

Return type

Pandas.DataFrame
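The on and how parameters map naturally onto pandas.merge, so a wrapper along these lines is a reasonable guess at the behaviour; the function name merge_sketch and the sample data are illustrative, and the actual helpers code may differ.

```python
import pandas as pd


def merge_sketch(dset1, dset2, on, how="inner"):
    """Join two DataFrames on a shared column."""
    return pd.merge(dset1, dset2, on=on, how=how)


users = pd.DataFrame({"user_id": ["u1", "u2", "u3"], "name": ["Ana", "Ben", "Cy"]})
scores = pd.DataFrame({"user_id": ["u2", "u3", "u4"], "score": [0.4, 0.9, 0.1]})

inner = merge_sketch(users, scores, on="user_id")               # only u2, u3
outer = merge_sketch(users, scores, on="user_id", how="outer")  # u1 to u4
print(inner)
print(outer)
```

With how='inner' only the IDs present in both datasets survive; 'outer' keeps every ID and fills the missing side with NaN.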

helpers.parallelize(n_cores, func, arg1)[source]

Function to parallelize a task across multiple CPU cores

Parameters
  • n_cores (int) – No of cores of CPU to be used

  • func (Function()) – The function which needs to be parallelized

  • arg1 (List) – List[list of elements, len(list of elements)]

Returns

List containing the results of the function applied on each element in arg1[0]

Return type

List
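A sketch of the calling convention described above, where arg1 packs the elements together with their count. For portability this example swaps in a thread pool from concurrent.futures, so it runs without a __main__ guard; a process pool would be the closer match for CPU-bound work. The name parallelize_sketch is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def parallelize_sketch(n_cores, func, arg1):
    """Apply func to every element of arg1[0] using n_cores workers."""
    elements, _length = arg1  # arg1 = [list of elements, len(list of elements)]
    with ThreadPoolExecutor(max_workers=n_cores) as pool:
        # map() returns results in the same order as the inputs.
        return list(pool.map(func, elements))


words = ["alpha", "beta", "gamma"]
lengths = parallelize_sketch(2, len, [words, len(words)])
print(lengths)
```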

helpers.save_pandas_to_csv(df, output_path, index)[source]

Saves the dataset to CSV file

Parameters
  • df (Pandas.DataFrame) – DataFrame to be saved

  • output_path (str) – Path where the file needs to be saved

  • index (bool) – Whether index should be included while saving

Returns

None

helpers.tokenize(phrase, k=3)[source]

Function which preprocesses and tokenizes a given phrase

Parameters
  • phrase (str) – Phrase to be tokenized

  • k (int) – Minimum length for a word to be retained in the text

Returns

Processed tokens

Return type

List