Helpers
- class helpers.PreProcessing(text)[source]
Bases: object
Class which is equipped with all sorts of Preprocessing & Cleaning techniques
- lemmatize()[source]
Function to extract root words - Lemmatizing
- Parameters
None –
- Returns
Processed text
- Return type
str
- remove_characters()[source]
Function to remove special characters
- Parameters
None –
- Returns
Processed text
- Return type
str
- remove_letters(size)[source]
Function to remove words with fewer than size letters from a text
- Parameters
size (int) – Minimum number of letters a word must have to be kept
- Returns
Processed text
- Return type
str
- remove_numbers()[source]
Function to remove numbers in a text
- Parameters
None –
- Returns
Processed text
- Return type
str
- remove_punctuation()[source]
Function to remove punctuation from the text
- Parameters
None –
- Returns
Processed text
- Return type
str
- remove_stopwords()[source]
Function to remove stopwords from the text
- Parameters
None –
- Returns
Processed text
- Return type
str
- remove_urls()[source]
Remove URLs from text
- Parameters
None –
- Returns
Text after removing URLs
- Return type
str
- stemming()[source]
Function to extract root words - Stemming
- Parameters
None –
- Returns
Processed text
- Return type
str
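As a rough illustration of the kind of cleaning the methods above describe, the following sketch uses only the Python standard library. The function names mirror the documented methods, but the bodies (including the URL pattern) are assumptions and not the class's actual implementation. Stopword removal, stemming, and lemmatizing usually depend on an NLP library and are not reproduced here.

```python
import re
import string


def remove_urls(text: str) -> str:
    """Drop http(s):// and www. links (assumed pattern)."""
    return re.sub(r"(https?://\S+|www\.\S+)", "", text)


def remove_punctuation(text: str) -> str:
    """Delete every ASCII punctuation character."""
    return text.translate(str.maketrans("", "", string.punctuation))


def remove_numbers(text: str) -> str:
    """Delete all runs of digits."""
    return re.sub(r"\d+", "", text)


def remove_letters(text: str, size: int) -> str:
    """Keep only words with at least `size` letters."""
    return " ".join(word for word in text.split() if len(word) >= size)


raw = "Visit https://example.com now, it has 42 new items!"
cleaned = remove_letters(remove_numbers(remove_punctuation(remove_urls(raw))), 3)
print(cleaned)  # Visit now has new items
```

The order of the steps matters in this sketch: URLs are removed first, because stripping punctuation beforehand would break the pattern that recognizes them.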
- helpers.counter_cosine_similarity(user_id, counterA, counterB)[source]
Calculate the counter cosine similarity for each user_id
- Parameters
user_id (str) – User ID
counterA (List) – Keyword list 1
counterB (List) – Keyword list 2
- Returns
Dictionary of {User ID: Counter_cosine value}
- Return type
Dictionary
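One common way to compute a cosine similarity between two keyword lists is to treat each list as a bag of words and compare the count vectors. The sketch below assumes that interpretation; the parameter names keywords_a and keywords_b, and the choice to return 0.0 for an empty list, are illustrative and not taken from the library.

```python
import math
from collections import Counter


def counter_cosine_similarity(user_id, keywords_a, keywords_b):
    """Return {user_id: cosine similarity of the two keyword count vectors}."""
    counter_a, counter_b = Counter(keywords_a), Counter(keywords_b)
    terms = set(counter_a) | set(counter_b)
    # A Counter returns 0 for missing keys, so absent terms add nothing.
    dot = sum(counter_a[t] * counter_b[t] for t in terms)
    norm_a = math.sqrt(sum(c * c for c in counter_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counter_b.values()))
    if norm_a == 0 or norm_b == 0:
        return {user_id: 0.0}  # assumed convention for empty input
    return {user_id: dot / (norm_a * norm_b)}


result = counter_cosine_similarity("u1", ["python", "data", "data"], ["data", "ml"])
print(result)  # value is 2 / sqrt(10), roughly 0.632
```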
- helpers.create_tokens(df, func, column_name, n_cores)[source]
Function to create tokens for a given column name
- Parameters
df (Pandas.DataFrame) – DataFrame
func – Function to be applied
column_name (str) – The name of the column
n_cores (int) – No of CPU cores to be used
- Returns
List containing the results of the function applied on each element of the column
- Return type
List
- helpers.extract_json(base_url, end_url, page)[source]
Extract data from a webpage whose URL is built by concatenating base_url + page + end_url
- Parameters
base_url (str) – Base URL
end_url (str) – End URL
page (str) – The page number
- Returns
Response from the webpage
- Return type
Dict
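The URL composition described above can be sketched as plain string concatenation. The helper name build_page_url and the example.com address are hypothetical; the request itself and the JSON decoding are left out because the source does not say which HTTP client is used.

```python
def build_page_url(base_url: str, end_url: str, page: str) -> str:
    """Join the three parts in the documented order: base_url + page + end_url."""
    return f"{base_url}{page}{end_url}"


# Hypothetical endpoint, for illustration only.
url = build_page_url("https://example.com/api/items?page=", "&per_page=50", "3")
print(url)  # https://example.com/api/items?page=3&per_page=50
```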
- helpers.get_datetime(date_str, year=True)[source]
Converts string Datetime to Datetime Object
- Parameters
date_str (str) – Datetime in String
year (bool) – If True, returns only the year
- Returns
Datetime object
- Return type
Datetime
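A minimal sketch of this behaviour, assuming the input string is in ISO 8601 form (the source does not state which input format the function expects):

```python
from datetime import datetime


def get_datetime(date_str, year=True):
    """Parse an ISO 8601 string; return only the year when `year` is True."""
    parsed = datetime.fromisoformat(date_str)  # assumed input format
    return parsed.year if year else parsed


print(get_datetime("2021-03-15T10:30:00"))              # 2021
print(get_datetime("2021-03-15T10:30:00", year=False))  # 2021-03-15 10:30:00
```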
- helpers.get_formatted_date(data, format_='%m%d%Y')[source]
Function to format date in a required format
- Parameters
data (List) – List of dates as string
format_ (str) – Format in which the date should be returned
- Returns
Formatted date
- Return type
Datetime
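To show what the default format string '%m%d%Y' produces, here is a short sketch using the standard datetime module. It assumes ISO-formatted input strings, which the source does not confirm.

```python
from datetime import datetime

dates = ["2021-03-15", "2020-12-01"]  # assumed ISO input
formatted = [datetime.fromisoformat(d).strftime("%m%d%Y") for d in dates]
print(formatted)  # ['03152021', '12012020']
```

The directives are month, day, and four-digit year, each zero-padded and written without separators.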
- helpers.get_keys(text, ngram=1, ntop=10, generator='Spacy')[source]
Function to extract keywords from a text using a chosen generator
- Parameters
text (str) – The text from which keywords are to be extracted
ngram (int) – No of words used for Ngram
ntop (int) – The no of top keywords to be extracted
generator (str) – The algorithm to be used for keyword extraction
- Returns
List of Keywords
- Return type
List
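To make the roles of ngram and ntop concrete, the sketch below ranks n-grams by raw frequency with the standard library. This is a deliberately simple stand-in and not the 'Spacy' generator or any other algorithm the library may offer. The function name top_ngrams is hypothetical.

```python
import re
from collections import Counter


def top_ngrams(text, ngram=1, ntop=10):
    """Return the `ntop` most frequent n-grams of `ngram` consecutive words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    grams = [" ".join(words[i:i + ngram]) for i in range(len(words) - ngram + 1)]
    return [gram for gram, _ in Counter(grams).most_common(ntop)]


print(top_ngrams("data science and data engineering need data", ngram=1, ntop=1))
# ['data']
print(top_ngrams("big data big data tools", ngram=2, ntop=1))
# ['big data']
```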
- helpers.get_request(url, headers, data='')[source]
Retrieves a webpage with the desired header and payload data and returns the text data
- Parameters
url (str) – URL from which data needs to be extracted
headers (List) – headers for the url
data (Dict) – Payload data
- Returns
Response from the webpage in text
- Return type
str
- helpers.merge_databases(dset1, dset2, on, how='inner')[source]
Function to merge two datasets on the key ‘on’
- Parameters
dset1 (Pandas.DataFrame) – Dataset 1
dset2 (Pandas.DataFrame) – Dataset 2
on (str) – Column on which datasets are to be merged
how (str) – Type of join
- Returns
Merged DataFrame
- Return type
Pandas.DataFrame
- helpers.parallelize(n_cores, func, arg1)[source]
Function to parallelize a task across multiple CPU threads
- Parameters
n_cores (int) – No of cores of CPU to be used
func (Function()) – The function which needs to be parallelized
arg1 (List) – List of the form [list of elements, len(list of elements)]
- Returns
List containing the results of the function applied on each element in arg1[0]
- Return type
List
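The general pattern of mapping a function over a list with a pool of workers can be sketched as follows. This version uses multiprocessing.pool.ThreadPool so that it runs anywhere without a main guard, and it takes a plain list instead of the [elements, length] pair described above; both choices are simplifications and may differ from the library's implementation.

```python
from multiprocessing.pool import ThreadPool


def parallelize(n_workers, func, items):
    """Apply `func` to every element of `items` using `n_workers` workers."""
    with ThreadPool(processes=n_workers) as pool:
        return pool.map(func, items)  # results keep the input order


lengths = parallelize(2, len, ["alpha", "be", "gamma"])
print(lengths)  # [5, 2, 5]
```

For CPU-bound work, multiprocessing.Pool offers the same map interface with separate processes, at the cost of requiring picklable functions and, on some platforms, an `if __name__ == "__main__":` guard.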