Helpers

class helpers.PreProcessing(text)

Bases: object

Class providing a collection of text preprocessing and cleaning techniques

lemmatize()

Function to extract root words - Lemmatizing

Parameters: None
Returns: Processed text
Return type: str

remove_characters()

Function to remove special characters

Parameters: None
Returns: Processed text
Return type: str

remove_letters(size)

Function to remove words with fewer than size letters from a text

Parameters: size (int) – Minimum number of letters a word must have to be retained
Returns: Processed text
Return type: str

remove_numbers()

Function to remove numbers in a text

Parameters: None
Returns: Processed text
Return type: str

remove_punctuation()

Function to remove punctuation from the text

Parameters: None
Returns: Processed text
Return type: str

remove_stopwords()

Function to remove stopwords from the text

Parameters: None
Returns: Processed text
Return type: str

remove_urls()

Remove URLs from text

Parameters: None
Returns: Text after removing URLs
Return type: str

stemming()

Function to extract root words - Stemming

Parameters: None
Returns: Processed text
Return type: str

text_lowercase()

Function to convert all letters in a text to lowercase

Parameters: None
Returns: Processed text
Return type: str

tokenize()

Function to tokenize phrases and tokens

Parameters: None
Returns: Processed text
Return type: str
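
The cleaning steps listed above can be illustrated with a minimal standalone sketch. This is not the helpers.PreProcessing source: the class name SimplePreProcessing, the regular expressions, and the choice to store the text on the instance and return it from each method are assumptions made for illustration. Only the standard library is used, so stemming, lemmatizing and stopword removal are left out.

```python
import re
import string


class SimplePreProcessing:
    """Hypothetical stand-in for a text-cleaning class (not helpers.PreProcessing)."""

    def __init__(self, text):
        self.text = text

    def text_lowercase(self):
        self.text = self.text.lower()
        return self.text

    def remove_urls(self):
        # Assumed pattern: anything starting with http(s):// or www. up to whitespace
        self.text = re.sub(r"(https?://|www\.)\S+", "", self.text)
        return self.text

    def remove_numbers(self):
        self.text = re.sub(r"\d+", "", self.text)
        return self.text

    def remove_punctuation(self):
        self.text = self.text.translate(str.maketrans("", "", string.punctuation))
        return self.text

    def remove_letters(self, size):
        # Keep only words with at least `size` letters; also collapses extra spaces
        self.text = " ".join(w for w in self.text.split() if len(w) >= size)
        return self.text


pre = SimplePreProcessing("Visit https://example.com NOW, it has 42 great tips!")
pre.text_lowercase()
pre.remove_urls()
pre.remove_numbers()
pre.remove_punctuation()
cleaned = pre.remove_letters(3)
print(cleaned)  # visit now has great tips
```

The order matters in this sketch: URLs are removed before punctuation, because stripping punctuation first would break the URL pattern.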

helpers.counter_cosine_similarity(user_id, counterA, counterB)

Calculate the counter cosine similarity between two keyword lists for a given user_id

Parameters

user_id (str) – User ID
counterA (List) – Keyword list 1
counterB (List) – Keyword list 2

Returns

Dictionary of {User ID: Counter_cosine value}

Return type

Dictionary
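
As a rough illustration of the computation described above, the sketch below counts each keyword list with collections.Counter and takes the cosine of the angle between the two count vectors: the dot product over shared keywords divided by the product of the vector norms. The function name counter_cosine and the convention of returning 0.0 for an empty list are assumptions, not the documented behaviour of helpers.counter_cosine_similarity.

```python
import math
from collections import Counter


def counter_cosine(user_id, keywords_a, keywords_b):
    """Cosine similarity between two keyword lists, keyed by user_id."""
    a, b = Counter(keywords_a), Counter(keywords_b)
    # Dot product over the keywords that both counters share
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return {user_id: 0.0}  # assumed convention for empty input
    return {user_id: dot / (norm_a * norm_b)}


result = counter_cosine("u1", ["python", "data", "data"], ["data", "ml"])
print(result)
```

Here the only shared keyword is "data" (counts 2 and 1), so the value is 2 / (sqrt(5) * sqrt(2)), roughly 0.632.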

helpers.create_tokens(df, func, column_name, n_cores)

Function to create tokens for a given column name

Parameters

df (Pandas.DataFrame) – DataFrame
func – Function to be applied
column_name (str) – The name of the column
n_cores (int) – Number of CPU cores to be used

Returns

List containing the results of the function applied on each element of the column

Return type

List

helpers.extract_json(base_url, end_url, page)

Extract data from a web page whose URL is formed by concatenating base_url + page + end_url

Parameters

base_url (str) – Base URL
end_url (str) – End URL
page (str) – The page number

Returns

Response from the webpage

Return type

Dict
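
The URL construction described above can be sketched with the standard library. The helper names build_url and fetch_json, the timeout value, and the example.com URL parts are placeholders chosen for illustration; how helpers.extract_json actually performs the request is not stated in this reference.

```python
import json
from urllib.request import urlopen


def build_url(base_url, end_url, page):
    """Concatenate the pieces in the documented order: base_url + page + end_url."""
    return f"{base_url}{page}{end_url}"


def fetch_json(base_url, end_url, page, timeout=10):
    """Hypothetical helper: fetch the page and decode the body as JSON."""
    with urlopen(build_url(base_url, end_url, page), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Placeholder URL parts, for illustration only (no request is made here)
url = build_url("https://example.com/api/items?page=", "&format=json", "2")
print(url)  # https://example.com/api/items?page=2&format=json
```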

helpers.get_datetime(date_str, year=True)

Converts string Datetime to Datetime Object

Parameters

date_str (str) – Datetime in String
year (bool) – If True, extracts and returns the year

Returns

Datetime object

Return type

Datetime
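
A minimal sketch of this kind of conversion, using datetime.strptime. The input format string is an assumption, since this reference does not say which format date_str is expected to use, and returning only the year when year is True is one reading of the parameter description above.

```python
from datetime import datetime


def to_datetime(date_str, year=True, fmt="%Y-%m-%d %H:%M:%S"):
    """Parse date_str with an assumed format; return only the year if `year` is True."""
    parsed = datetime.strptime(date_str, fmt)
    return parsed.year if year else parsed


print(to_datetime("2021-03-15 08:30:00"))              # 2021
print(to_datetime("2021-03-15 08:30:00", year=False))  # 2021-03-15 08:30:00
```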

helpers.get_formatted_date(data, format_='%m%d%Y')

Function to format dates in the required format

Parameters

data (List) – List of dates as string
format_ (str) – Format in which date should be returned

Returns

Formatted date

Return type

Datetime
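
To show what the default format_ value '%m%d%Y' produces, here is a small sketch built on strptime and strftime. The function name format_dates and the input_format parameter are hypothetical, the input dates are assumed to be ISO-style strings, and this sketch returns strings, which may differ from the return type listed above.

```python
from datetime import datetime


def format_dates(data, format_="%m%d%Y", input_format="%Y-%m-%d"):
    """Re-format each date string in `data`; input_format is an assumed input layout."""
    return [datetime.strptime(d, input_format).strftime(format_) for d in data]


formatted = format_dates(["2021-03-15", "1999-12-31"])
print(formatted)  # ['03152021', '12311999']
```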

helpers.get_keys(text, ngram=1, ntop=10, generator='Spacy')

Function to extract keywords from a text using a chosen generator

Parameters

text (str) – The text from which keywords are to be extracted
ngram (int) – Number of words in each n-gram
ntop (int) – The number of top keywords to be extracted
generator (str) – The algorithm to be used for keyword extraction

Returns

List of Keywords

Return type

List
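
The roles of ngram and ntop can be illustrated with a plain frequency count. This sketch deliberately swaps in the simplest possible technique, counting n-grams with collections.Counter, and does not reproduce the 'Spacy' generator or any other algorithm helpers.get_keys may support; the function name top_ngrams and the word pattern are assumptions.

```python
import re
from collections import Counter


def top_ngrams(text, ngram=1, ntop=10):
    """Return the ntop most frequent n-grams of `ngram` words (frequency-based sketch)."""
    words = re.findall(r"[a-z']+", text.lower())
    # Slide a window of `ngram` words over the text
    grams = [" ".join(words[i:i + ngram]) for i in range(len(words) - ngram + 1)]
    return [gram for gram, _ in Counter(grams).most_common(ntop)]


keys = top_ngrams("big data tools make big data work", ngram=2, ntop=1)
print(keys)  # ['big data']
```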

helpers.get_request(url, headers, data='')

Retrieves a webpage with the desired header and payload data and returns the text data

Parameters

url (str) – URL from which data needs to be extracted
headers (List) – headers for the url
data (Dict) – Payload data

Returns

Response from the webpage in text

Return type

Str

helpers.merge_databases(dset1, dset2, on, how='inner')

Function to merge two datasets with key as ‘on’

Parameters

dset1 (Pandas.DataFrame) – Dataset 1
dset2 (Pandas.DataFrame) – Dataset 2
on (str) – Column on which datasets are to be merged
how (str) – Type of join

Returns

Merged DataFrame

Return type

Pandas.DataFrame

helpers.parallelize(n_cores, func, arg1)

Function to parallelize a task across multiple CPU cores

Parameters

n_cores (int) – Number of CPU cores to be used
func (Function()) – The function which needs to be parallelized
arg1 (List) – Two-element list: [list of elements, len(list of elements)]

Returns

List containing the results of the function applied on each element in arg1[0]

Return type

List
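
The general pattern of mapping a function over a list with a pool of workers might look like the sketch below. It uses a thread pool from concurrent.futures as a stand-in so that it runs anywhere without a main guard; for CPU-bound work, ProcessPoolExecutor offers the same map interface but needs a picklable function and, on some platforms, an `if __name__ == "__main__":` guard. The name parallel_map is hypothetical, and the sketch takes the list of elements directly rather than the two-element arg1 described above.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_map(n_workers, func, items):
    """Apply func to every item using n_workers workers; results keep input order."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(func, items))


lengths = parallel_map(2, len, ["alpha", "be", "gamma"])
print(lengths)  # [5, 2, 5]
```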

helpers.save_pandas_to_csv(df, output_path, index)

Saves the dataset to a CSV file

Parameters

df (Pandas.DataFrame) – Dataset to be saved
output_path (str) – Path where the file needs to be saved
index (bool) – Whether index should be included while saving

Returns

None

helpers.tokenize(phrase, k=3)

Function which preprocesses and tokenizes a given phrase

Parameters

phrase (str) – Phrase to be tokenized
k (int) – Minimum length of a word for it to be retained in the text

Returns

Processed tokens

Return type

List
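
A minimal sketch of the effect of k, assuming that preprocessing means lowercasing and keeping alphabetic runs only, and that a word of exactly k letters is retained. The name simple_tokenize and the regular expression are illustrative; the preprocessing steps helpers.tokenize really applies are not listed in this reference.

```python
import re


def simple_tokenize(phrase, k=3):
    """Lowercase the phrase, split it into alphabetic tokens, drop tokens shorter than k."""
    tokens = re.findall(r"[a-z]+", phrase.lower())
    return [t for t in tokens if len(t) >= k]


tokens = simple_tokenize("The AI of 2021 is an open field!")
print(tokens)  # ['the', 'open', 'field']
```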