RDash - Quickstart Guide

RDash is a recommendation system that captures the opportunities for pursuing external research funds through grants, contracts, and subcontracts based on the scholar’s research profile. RDash-Grants entails analyzing a massive set of solicitations and funding opportunities and selecting the most appropriate one or group of relevant grants by considering the scholar’s preferences and research profile.

RDash consists of two main components:

A webapp (frontend)
Backend

This page documents the backend component of RDash, which uses Natural Language Processing for recommendation.

Setup

Before using the code, clone the repository (currently available only to Taugroup members) and install the required libraries. Both steps can be run from the command line with the snippet below.

git clone https://github.com/taugroup/RDASH.git
cd RDASH
pip install -r requirements.txt

Usage

The end-to-end recommendation system can be broken down into seven steps. Each step and its corresponding command are given below.

Step 1 : Create a list of Scholars (with demographic details and list of publications)

python user_profile_creation.py --univ_name='TAMU'

Step 2 : Create publication database - extract the information from all publications of each user

python extract_publications.py --n_cores=20

Step 3 : Create Analytical database - with representative keywords for each user

python create_analytical_data.py --n_cores=20

Step 4 : Compile list of Grants

python extract_proposals.py

Step 5 : Extract grant details

python main_extractor.py --n_cores=20 --a 'National Science Foundation' 'National Institutes of Health'

Step 6 : Recommend scholars for a Proposal / grant

python recommend_scholars.py --top_k=20 --proposal_id='PD-18-1263' --n_cores=20 --agency='NSF'

Step 7 : Extract proposals to a json for searching

python extract_proposals_titles_db.py
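The seven commands above can also be chained from a single script. The following is a minimal sketch, not part of the repository: the helper names `build_pipeline_commands` and `run_pipeline` are hypothetical, and the flags are copied verbatim from the steps above. It defaults to a dry run that only prints the commands.

```python
import shlex
import subprocess


def build_pipeline_commands(univ_name="TAMU", n_cores=20,
                            agencies=("National Science Foundation",
                                      "National Institutes of Health")):
    """Return the seven quickstart commands, in order, as argument lists."""
    cores = f"--n_cores={n_cores}"
    return [
        ["python", "user_profile_creation.py", f"--univ_name={univ_name}"],
        ["python", "extract_publications.py", cores],
        ["python", "create_analytical_data.py", cores],
        ["python", "extract_proposals.py"],
        ["python", "main_extractor.py", cores, "--a", *agencies],
        ["python", "recommend_scholars.py", "--top_k=20",
         "--proposal_id=PD-18-1263", cores, "--agency=NSF"],
        ["python", "extract_proposals_titles_db.py"],
    ]


def run_pipeline(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for step, cmd in enumerate(commands, start=1):
        print(f"Step {step}: {shlex.join(cmd)}")
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_pipeline(build_pipeline_commands(), dry_run=True)
```

Passing `dry_run=False` runs the steps in order from the repository root and stops at the first command that exits with an error.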

Features

Extracts and creates user/scholar profiles from the TAMU Scholars library through its APIs
Matches and recommends user profiles to research proposals
Identifies similar research profiles for each scholar
Advances opportunities for intelligent research
Recommends the latest relevant articles/publications for literature search and advancement

Workflow

Modules

Below is the documentation for various python modules used in this project.

Automatic_keyword_generator

class automatic_keyword_generator.Keyword_generator(text)[source]

Bases: object

Class containing various algorithms to generate keywords. Algorithms include Yake, Gensim, Rake, Bert, Spacy.

BERT(n_gram=1, top_n=5)[source]

Function containing BERT algorithm to extract keywords

Parameters

n_gram (int) – No of continuous sequence of words to be used
top_n (int) – Ordered on relevancy, the number of top keywords to be returned

Returns

List of extracted keywords

Return type

List

Rake()[source]

Function containing RAKE algorithm to extract keywords

Parameters: None –
Returns: List of extracted keywords
Return type: List

Spacy()[source]

Function containing Spacy algorithm to extract keywords

Parameters: None –
Returns: List of extracted keywords
Return type: List

Yake(max_ngram_size, numOfKeywords, language='en', deduplication_threshold=0.9)[source]

Function containing YAKE algorithm to extract keywords

Parameters

max_ngram_size (int) – Maximum n-gram size (number of words) of a keyword
numOfKeywords (int) – Ordered on relevancy, the number of top keywords to be returned
language (str) – Language of the text (default = en)
deduplication_threshold (float) – Threshold limiting the duplication of words across keywords

Returns

List of extracted keywords

Return type

List

gensim()[source]

Function containing Gensim Algorithm to extract keywords

Parameters: None –
Returns: List of extracted keywords
Return type: List

automatic_keyword_generator.countVectorizer(n_gram, text)[source]

Function to get feature names (words) from the input text

Parameters

text (List of strings) – the text to be used to extract keywords from
n_gram (tuple) – tuple containing minimum and maximum values of n_gram

Returns

feature (read words) learned from the text

Return type

List of words
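To make the expected output of countVectorizer concrete, here is a dependency-free sketch of n-gram feature extraction. It is a stand-in, not the project's code: the function name `ngram_features` and the tokenization rule (lowercased alphanumeric tokens of two or more characters) are assumptions, and the real function may delegate to a library vectorizer instead.

```python
import re


def ngram_features(n_gram, text):
    """Return the sorted, de-duplicated word n-grams found in a list of strings.

    n_gram is a (min_n, max_n) tuple, mirroring the documented signature.
    """
    min_n, max_n = n_gram
    features = set()
    for doc in text:
        # Assumption: lowercase and keep alphanumeric tokens of 2+ characters.
        tokens = re.findall(r"\b\w\w+\b", doc.lower())
        for n in range(min_n, max_n + 1):
            for start in range(len(tokens) - n + 1):
                features.add(" ".join(tokens[start:start + n]))
    return sorted(features)


print(ngram_features((1, 2), ["Deep learning for grant matching"]))
```

With `n_gram=(1, 2)` the result contains both single words and two-word phrases, which is why the parameter is a tuple rather than a single integer.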

Main_extractor

class main_extractor.AgencyDataExtractor(n_cores, agencies, params)[source]

Bases: object

Class which can extract data from the required agency webpages. Agencies currently supported: NIH, NSF.

extract_agency_proposals()[source]

Parent function which calls child functions to retrieve data for each agency. Each child function will save the data to specific files separately.

Parameters: None –
Returns: None

Extract_proposals

class extract_proposals.GrantsDataExtractor(xml_url, csv_url, agencies, params)[source]

Bases: object

Class which will extract data from the Grants.Gov website.

By design, the list of all open proposals is first downloaded from Grants.gov. Then, for each proposal, further data is extracted from its dedicated webpage (for example, on the NSF website).

ExtractCSVData()[source]

Function to extract data from the downloaded CSV file. Once the data is extracted it will be saved as a dataframe - self.metadata

Parameters: None –
Returns: None

ExtractXMLData()[source]

Function to extract data from the XML file. Once the data is extracted it will be saved as a dataframe - self.opps_df

Parameters: None –
Returns: None

ProcessXMLData()[source]

Function to process the extracted XML data. Reformats the columns CloseDate, PostDate, and LastUpdateDate, and identifies open proposals.

Parameters: None –

Returns: None

SaveXMLData()[source]

Function to save all the XML data to CSV files. Specifically, open proposals are saved agency-wise in separate files.

Parameters: None –

Returns: None
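The processing and saving steps described above (parse the date columns, keep open proposals, split them by agency) can be illustrated with plain dictionaries. This is a rough sketch under stated assumptions, not the class's implementation: the `AgencyName` and `OpportunityID` keys, the `%m%d%Y` date format, and the rule that a proposal is open when its CloseDate is today or later are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date, datetime


def parse_date(value, fmt="%m%d%Y"):
    """Parse a date string; return None for empty or malformed values."""
    try:
        return datetime.strptime(value, fmt).date()
    except (TypeError, ValueError):
        return None


def open_proposals_by_agency(rows, today):
    """Group rows whose CloseDate is today or later under their agency."""
    grouped = defaultdict(list)
    for row in rows:
        close = parse_date(row.get("CloseDate"))
        # Assumption: a proposal with no usable close date is not counted as open.
        if close is not None and close >= today:
            grouped[row["AgencyName"]].append(row)
    return dict(grouped)


rows = [
    {"OpportunityID": "A1", "AgencyName": "NSF", "CloseDate": "12312030"},
    {"OpportunityID": "B2", "AgencyName": "NIH", "CloseDate": "01012020"},
    {"OpportunityID": "C3", "AgencyName": "NSF", "CloseDate": ""},
]
open_rows = open_proposals_by_agency(rows, today=date(2025, 1, 1))
print({agency: [r["OpportunityID"] for r in items]
       for agency, items in open_rows.items()})
```

Each value in the returned dictionary could then be written to its own CSV file, one per agency.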

Create_analytical_data

class create_analytical_data.Analytical_Data_Creator(n_cores, univ_name, params)[source]

Bases: object

Class which will create a user (read scholar) database with details from the scholar's profile page and relevant publications.

Parameters

user_organisation (str) – Text from which keywords are to be extracted
i (str) – User ID (dummy_variable)

Returns

Space separated set of keywords

Return type

str

create_publication_data()[source]

Main function to process and compile the final Scholars’ dataset

Parameters: None –
Returns: None

create_user_token_data()[source]

Function to create tokens from Profile page (Organization, Overview and Keyword sections) for all users

Parameters: None –
Returns: None

create_analytical_data.get_author_pubinfo(scholar_df, i, top_n=5, top_title=True)[source]

Function to extract information of top N publications of the author

Parameters

scholar_df (Pandas.DataFrame) – DataFrame containing all of the user's information
i (str) – User ID (dummy_variable)
top_n (int) – Based on relevancy, the number of top titles to be used
top_title (bool) – If True, only the top N publications will be extracted. Otherwise, all publication data will be used.

Returns

Tuple of dictionaries. Each dictionary contains the User_id as key and the keywords from publication titles / user keywords as values

Return type

Tuple

create_analytical_data.user_keywords(user_key, i)[source]

Function to calculate tokens from user’s keywords

Parameters

user_key (str) – Text from which keywords are to be extracted
i (str) – User ID (dummy_variable)

Returns

Space separated set of keywords

Return type

str

create_analytical_data.user_o_keywords(user_overview, i)[source]

Function to calculate tokens from user’s Overview

Parameters

user_overview (str) – Text from which keywords are to be extracted
i (str) – User ID (dummy_variable)

Returns

Space separated set of keywords

Return type

str

create_analytical_data.user_org_keywords(user_organisation, i)[source]

Function to calculate tokens from user’s organization

Parameters

user_organisation (str) – Text from which keywords are to be extracted
i (str) – User ID (dummy_variable)

Returns

Space separated set of keywords

Return type

str

Extract_publications

class extract_publications.Extract_Publications(n_cores, univ_name, params)[source]

Bases: object

Class which will extract all the publication details of all the scholars of a given university

create_univ_publication_data()[source]

Main function which will create the publication data for all the users of a university

Parameters: None –
Returns: None

create_user_publication_data(user_id, pub_ids)[source]

Creates the publication data for a single user by scraping the university webpage. For each publication ID, the function retrieves the complete details of that publication from the university webpage.

Parameters

user_id (str) – User_id for each user whose publications are to be extracted
pub_ids (List) – List of all publication IDs for the user

Returns

Dataframe for each user where each row is a publication of a user. Total no of rows = n_publications

Return type

Pandas.DataFrame

get_publication_ids(user_id, str_)[source]

Returns a dictionary where each user_id is a key and a list of his/her publications as values

Parameters

user_id (str) – User_id for each user whose publications are to be extracted
str_ (str) – WIP

Returns: Dictionary of {User IDs : List of publications}
Return type: Dictionary

save_user_publications(univ_name='TAMU')[source]

Function to save the publication details of each user

Parameters: univ_name (str) – University name of the user - which determines the file names for saving.

User_profile_creation

class user_profile_creation.extract_user_profiles(univ_name, output_path)[source]

Bases: object

Class which can extract profiles of all users from a university

extract_info(url, user_id)[source]

Function to extract a particular user’s information from general university URL

Parameters

url (str) – The URL from which response is to be retrieved
user_id (str) – ID of the particular scholar

Returns

None

extract_profiles()[source]

Function to compile Scholar data of a particular university. The function will first identify the total number of scholars in a university and then get basic summary available for each scholar.

Parameters: None –
Returns: None

get_awards()[source]

Function to extract Research areas of the Scholar from University Page

Parameters: None –
Returns: Research areas of the Scholar, Length of research_areas
Return type: Tuple (List, Int)

get_department()[source]

Function to extract Department of the Scholar from University Page

Parameters: None –
Returns: Department of the Scholar
Return type: str

get_department_info()[source]

Function to extract Department info (including course area) of the Scholar from University Page

Parameters: None –
Returns: Department info of the Scholar
Return type: str

get_email()[source]

Function to extract email of the Scholar from University Page

Parameters: None –
Returns: Email of the Scholar
Return type: str

get_keywords()[source]

Function to extract keywords of the Scholar from University Page

Parameters: None –
Returns: Keywords of the Scholar
Return type: str

get_name()[source]

Function to extract name of the Scholar from University Page

Parameters: None –
Returns: Name of the Scholar
Return type: str

get_netid()[source]

Function to extract NetID (University Unique Identifier) of the Scholar from University Page

Parameters: None –
Returns: NetID of the Scholar
Return type: str

get_npublications()[source]

Function to get the number of publications of the Scholar from University Page

Parameters: None –
Returns: Publications of the Scholar
Return type: str

get_organizations()[source]

Function to extract Organizations of the Scholar from University Page

Parameters: None –
Returns: Organizations of the Scholar
Return type: str

get_overview()[source]

Function to extract Overview of the Scholar from University Page

Parameters: None –
Returns: Overview of the Scholar
Return type: str

get_profile(url, user_id)[source]

Function to extract all details of a scholar from University Page

Parameters

url (str) – The base university URL from which Scholars’ data can be extracted by appending their user_ids
user_id (str) – The university provided User ID of the scholar

Returns

Scholar Data in the form of Pandas.DataFrame

Return type

Pandas.DataFrame

get_publications()[source]

Function to extract publications of the Scholar from University Page

Parameters: None –
Returns: Publications of the Scholar
Return type: str

get_research()[source]

Function to extract Research areas of the Scholar from University Page

Parameters: None –
Returns: Research areas of the Scholar, Length of research_areas
Return type: Tuple (List, Int)

get_title()[source]

Function to extract Preferred title of the Scholar from University Page

Parameters: None –
Returns: Preferred title of the Scholar
Return type: str

user_profile_creation.get_userid(user_dict, key)[source]

Function to get User IDs from JSON

Parameters: user_dict (JSON) – The URL from which response is to be retrieved
Returns: User ID
Return type: str

Helpers

class helpers.PreProcessing(text)[source]

Bases: object

Class which is equipped with all sorts of Preprocessing & Cleaning techniques

lemmatize()[source]

Function to extract root words - Lemmatizing

Parameters: None –
Returns: Processed text
Return type: str

remove_characters()[source]

Function to remove special characters

Parameters: None –
Returns: Processed text
Return type: str

remove_letters(size)[source]

Function to remove words with fewer than size letters from a text

Parameters: size (int) – Minimum number of letters a word must have to be retained
Returns: Processed text
Return type: str

remove_numbers()[source]

Function to remove numbers in a text

Parameters: None –
Returns: Processed text
Return type: str

remove_punctuation()[source]

Function to remove punctuations from the text

Parameters: None –
Returns: Processed text
Return type: str

remove_stopwords()[source]

Function to remove stopwords from the text

Parameters: None –
Returns: Processed text
Return type: str

remove_urls()[source]

Remove URLs from text

Parameters: None –
Returns: Text after removing URLs
Return type: str

stemming()[source]

Function to extract root words - Stemming

Parameters: None –
Returns: Processed text
Return type: str

text_lowercase()[source]

Function to convert all alphabets in a text to lowercase

Parameters: None –
Returns: Processed text
Return type: str

tokenize()[source]

Function to tokenize phrases and tokens

Parameters: None –
Returns: Processed text
Return type: str

helpers.counter_cosine_similarity(user_id, counterA, counterB)[source]

Calculate the counter cosine similarity for each user_id

Parameters

user_id (str) – User ID
counterA (List) – Keyword list 1
counterB (List) – Keyword list 2

Returns

Dictionary of {User ID: Counter_cosine value}

Return type

Dictionary
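Cosine similarity over keyword counts can be written with collections.Counter alone. The sketch below shows the standard formula (dot product of the two count vectors divided by the product of their Euclidean norms). The function names `counter_cosine` and `cosine_for_user` are hypothetical, and returning 0.0 for an empty keyword list is an assumption rather than documented behavior.

```python
import math
from collections import Counter


def counter_cosine(counter_a, counter_b):
    """Cosine similarity between two keyword-count vectors."""
    shared = counter_a.keys() & counter_b.keys()
    dot = sum(counter_a[term] * counter_b[term] for term in shared)
    norm_a = math.sqrt(sum(v * v for v in counter_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counter_b.values()))
    if norm_a == 0 or norm_b == 0:
        # Assumption: an empty keyword list is treated as having no similarity.
        return 0.0
    return dot / (norm_a * norm_b)


def cosine_for_user(user_id, keywords_a, keywords_b):
    """Return {user_id: similarity}, mirroring the documented return shape."""
    score = counter_cosine(Counter(keywords_a), Counter(keywords_b))
    return {user_id: score}


print(cosine_for_user("u1", ["grant", "nlp"], ["nlp", "vision"]))
```

A score near 1 means the two keyword lists share most of their terms in similar proportions, while 0 means they share none.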

helpers.create_tokens(df, func, column_name, n_cores)[source]

Function to create tokens for a given column name

Parameters

df (Pandas.DataFrame) – DataFrame
func – Function to be applied
column_name (str) – The name of the column
n_cores (int) – No of CPU cores to be used

Returns

List containing the results of the function applied on each element of the column

Return type

List

helpers.extract_json(base_url, end_url, page)[source]

Extract data from webpage by appending Base_url+page+End_url

Parameters

base_url (str) – Base URL
end_url (str) – End URL
page (str) – The page number

Returns

Response from the webpage

Return type

Dict
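The URL composition described for extract_json (Base_url + page + End_url) can be sketched with the standard library. The helper names `build_page_url` and `fetch_json`, the timeout value, and the example URL are hypothetical. Only the URL builder is exercised here, since the fetch requires network access.

```python
import json
from urllib.request import urlopen


def build_page_url(base_url, end_url, page):
    """Concatenate base URL, page number, and end URL, in that order."""
    return f"{base_url}{page}{end_url}"


def fetch_json(base_url, end_url, page, timeout=30):
    """Fetch one page and decode its JSON body into Python objects."""
    url = build_page_url(base_url, end_url, page)
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Placeholder URL for illustration only.
print(build_page_url("https://example.org/api/people?page=", "&size=100", 3))
```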

helpers.get_datetime(date_str, year=True)[source]

Converts string Datetime to Datetime Object

Parameters

date_str (str) – Datetime in String
year (bool) – If True, returns the year

Returns

Datetime object

Return type

Datetime

helpers.get_formatted_date(data, format_='%m%d%Y')[source]

Function to format date in a required format

Parameters

data (List) – List of dates as string
format_ (str) – Format in which date should be returned

Returns

Formatted date

Return type

Datetime
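Both date helpers come down to datetime.strptime with a format string. The sketch below is illustrative only: the names `to_datetime` and `parse_dates` are hypothetical, and it assumes the input strings use the same `%m%d%Y` layout as the documented default.

```python
from datetime import datetime


def to_datetime(date_str, fmt="%m%d%Y", year=False):
    """Parse a date string; return only the year when year is True."""
    parsed = datetime.strptime(date_str, fmt)
    return parsed.year if year else parsed


def parse_dates(data, format_="%m%d%Y"):
    """Parse every date string in a list using one shared format."""
    return [datetime.strptime(item, format_) for item in data]


print(to_datetime("03152024", year=True))
print(parse_dates(["03152024", "12012023"]))
```

strptime raises ValueError on strings that do not match the format, so callers working with scraped data may want to catch that exception.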

helpers.get_keys(text, ngram=1, ntop=10, generator='Spacy')[source]

Function to extract keywords from a text using a chosen generator

Parameters

text (str) – The text from which keywords are to be extracted
ngram (int) – No of words used for Ngram
ntop (int) – The no of top keywords to be extracted
generator (str) – The algorithm to be used for keyword extraction

Returns

List of Keywords

Return type

List

helpers.get_request(url, headers, data='')[source]

Retrieves a webpage with the desired header and payload data and returns the text data

Parameters

url (str) – URL from which data needs to be extracted
headers (List) – headers for the url
data (Dict) – Payload data

Returns

Response from the webpage in text

Return type

Str
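A request with custom headers and an optional payload can be assembled with urllib.request from the standard library. This is a sketch under assumptions, not the helper's actual code: the names `build_request` and `get_text` are hypothetical, headers are taken as a dictionary, and the payload is assumed to be JSON-encoded. The project may well use a different HTTP client.

```python
import json
from urllib.request import Request, urlopen


def build_request(url, headers, data=""):
    """Build a Request; an empty payload yields GET, a non-empty one POST."""
    # Assumption: the payload is sent as a JSON-encoded body.
    body = json.dumps(data).encode("utf-8") if data else None
    return Request(url, data=body, headers=dict(headers))


def get_text(url, headers, data="", timeout=30):
    """Send the request and return the response body as text."""
    with urlopen(build_request(url, headers, data), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Placeholder URL and payload for illustration only.
request = build_request("https://example.org/search",
                        {"Accept": "application/json"},
                        {"q": "nsf"})
print(request.get_method(), request.full_url)
```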

helpers.merge_databases(dset1, dset2, on, how='inner')[source]

Function to merge two datasets with key as ‘on’

Parameters

dset1 (Pandas.DataFrame) – Dataset 1
dset2 (Pandas.DataFrame) – Dataset 2
on (str) – Column on which datasets are to be merged
how (str) – Type of join

Returns

Merged dataframe

Return type

Pandas.DataFrame

helpers.parallelize(n_cores, func, arg1)[source]

Function to parallelize the task on multiple CPU threads

Parameters

n_cores (int) – No of cores of CPU to be used
func (Function()) – The function which needs to be parallelized
arg1 (List) – List[list of elements, len(list of elements)]

Returns

List containing the results of the function applied on each element in arg1[0]

Return type

List
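The calling convention above, where arg1 bundles the elements with their count, can be illustrated with concurrent.futures. This sketch uses a thread pool to stay self-contained. The real helper may use processes instead, and the name `parallel_map` is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_map(n_cores, func, arg1):
    """Apply func to every element of arg1[0] using n_cores workers."""
    elements, _count = arg1  # arg1 is [list of elements, len(list of elements)]
    with ThreadPoolExecutor(max_workers=n_cores) as pool:
        # Executor.map returns results in the same order as the inputs.
        return list(pool.map(func, elements))


words = ["grant", "proposal", "scholar"]
print(parallel_map(2, str.upper, [words, len(words)]))
```

For CPU-bound work such as keyword extraction, a process pool would be the usual choice, since threads in CPython share one interpreter lock.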

helpers.save_pandas_to_csv(df, output_path, index)[source]

Saves the dataset to CSV file

Parameters

df (Pandas.DataFrame) – Dataset to be saved
output_path (str) – Path where the file needs to be saved
index (bool) – Whether index should be included while saving

Returns

None

helpers.tokenize(phrase, k=3)[source]

Function which preprocesses and tokenizes a given phrase

Parameters

phrase (str) – Phrase to be tokenized
k (int) – Minimum length of a word for it to be retained in the text

Returns

Processed tokens

Return type

List
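A preprocessing chain of the kind the PreProcessing class offers (lowercasing, URL removal, number removal, punctuation removal, short-word and stopword filtering) can be sketched with the standard library. This is a simplified stand-in: the name `simple_tokenize`, the tiny stopword set, and the order of the steps are assumptions, and lemmatizing or stemming are left out because they need an NLP library.

```python
import re
import string

# Assumption: a tiny illustrative stopword set, not the project's list.
STOPWORDS = frozenset({"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"})

# Translation table that turns every punctuation character into a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def simple_tokenize(phrase, k=3):
    """Clean a phrase and return its tokens of at least k letters."""
    text = phrase.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs first
    text = re.sub(r"\d+", " ", text)           # remove numbers
    text = text.translate(_PUNCT_TO_SPACE)     # remove punctuation
    return [word for word in text.split()
            if len(word) >= k and word not in STOPWORDS]


print(simple_tokenize("Deep Learning for 3D vision: see https://example.org"))
```

URLs are removed before punctuation so that the address is dropped as a whole instead of being split into stray tokens.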
