RDash - Quickstart Guide

RDash is a recommendation system that identifies opportunities for pursuing external research funds through grants, contracts, and subcontracts based on a scholar’s research profile. RDash-Grants analyzes a large set of solicitations and funding opportunities and selects the most relevant grant, or group of grants, by considering the scholar’s preferences and research profile.

RDash consists of two main components:
  • A webapp (frontend)

  • Backend

This page provides details on the backend component of RDash, which uses Natural Language Processing for recommendation.

Setup

Before using the code, first clone the repository (currently available only to Taugroup members) and install the required libraries. Run the snippet below from your command line.

git clone https://github.com/taugroup/RDASH.git
cd RDASH
pip install -r requirements.txt

Usage

The end-to-end recommendation pipeline can be broken down into 7 steps. Each step and its corresponding command are given below.

Step 1 : Create a list of Scholars (with demographic details and list of publications)

python user_profile_creation.py --univ_name='TAMU'

Step 2 : Create publication database - extract the information from all publications of each user

python extract_publications.py --n_cores=20

Step 3 : Create Analytical database - with representative keywords for each user

python create_analytical_data.py --n_cores=20

Step 4 : Compile list of Grants

python extract_proposals.py

Step 5 : Extract grant details

python main_extractor.py --n_cores=20 --a 'National Science Foundation' 'National Institutes of Health'

Step 6 : Recommend scholars for a Proposal / grant

python recommend_scholars.py --top_k=20 --proposal_id='PD-18-1263' --n_cores=20 --agency='NSF'

Step 7 : Extract proposals to a json for searching

python extract_proposals_titles_db.py
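
The seven commands above can also be chained from a single driver script. The following is a minimal sketch, not part of the repository: build_pipeline and run_pipeline are hypothetical helper names, and the script assumes it is run from the repository root with the same arguments shown in the steps above.

```python
import subprocess
import sys


def build_pipeline(univ_name="TAMU", n_cores=20):
    """Return the seven quickstart commands as argument lists, in order."""
    py = sys.executable  # run the scripts with the current interpreter
    cores = f"--n_cores={n_cores}"
    return [
        [py, "user_profile_creation.py", f"--univ_name={univ_name}"],
        [py, "extract_publications.py", cores],
        [py, "create_analytical_data.py", cores],
        [py, "extract_proposals.py"],
        [py, "main_extractor.py", cores, "--a",
         "National Science Foundation", "National Institutes of Health"],
        [py, "recommend_scholars.py", "--top_k=20",
         "--proposal_id=PD-18-1263", cores, "--agency=NSF"],
        [py, "extract_proposals_titles_db.py"],
    ]


def run_pipeline(commands, runner=subprocess.run):
    """Run each command in order; check=True stops at the first failure."""
    for step, cmd in enumerate(commands, start=1):
        print(f"Step {step}: {' '.join(cmd[1:])}")
        runner(cmd, check=True)


# Usage (from the repository root):
#     run_pipeline(build_pipeline(univ_name="TAMU", n_cores=20))
```

Passing check=True makes a failing step raise an exception, so later steps never run on incomplete data.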

Features

  • Extracts and creates user/scholar profiles from the TAMU Scholars library through its APIs

  • Matches and recommends user profiles to research proposals

  • Identifies similar research profiles for each scholar

  • Advances Opportunities for Intelligent Research

  • Recommends the latest relevant articles/publications for literature search and advancement

Workflow

_images/workflow.png

Modules

Below is the documentation for the Python modules used in this project.

Automatic_keyword_generator

class automatic_keyword_generator.Keyword_generator(text)[source]

Bases: object

Class containing various algorithms to generate keywords. Algorithms include Yake, Gensim, Rake, Bert, Spacy.

BERT(n_gram=1, top_n=5)[source]

Function containing BERT algorithm to extract keywords

Parameters
  • n_gram (int) – Number of consecutive words that make up each keyword (n-gram size)

  • top_n (int) – Ordered on relevancy, the number of top keywords to be returned

Returns

List of extracted keywords

Return type

List

Rake()[source]

Function containing RAKE algorithm to extract keywords

Parameters

None

Returns

List of extracted keywords

Return type

List

Spacy()[source]

Function containing Spacy algorithm to extract keywords

Parameters

None

Returns

List of extracted keywords

Return type

List

Yake(max_ngram_size, numOfKeywords, language='en', deduplication_threshold=0.9)[source]

Function containing YAKE algorithm to extract keywords

Parameters
  • max_ngram_size (int) – Maximum number of words (n-gram size) in a keyword

  • numOfKeywords (int) – Ordered on relevancy, the number of top keywords to be returned

  • language (str) – Language of the text (default = en)

  • deduplication_threshold (float) – Threshold limiting the duplication of words across keywords

Returns

List of extracted keywords

Return type

List

gensim()[source]

Function containing Gensim Algorithm to extract keywords

Parameters

None

Returns

List of extracted keywords

Return type

List

automatic_keyword_generator.countVectorizer(n_gram, text)[source]

Function to get feature names (words) from the input text

Parameters
  • text (List of strings) – the text to be used to extract keywords from

  • n_gram (tuple) – tuple containing minimum and maximum values of n_gram

Returns

Features (i.e., words) learned from the text

Return type

List of words
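
To illustrate what extracting feature names for an n-gram range means, here is a minimal standard-library sketch. It is not the project's implementation: ngram_features is a hypothetical name, and the token pattern (alphanumeric tokens of two or more characters, lowercased) is an assumption that mirrors the default behavior of scikit-learn's CountVectorizer, which the function name suggests may be used internally.

```python
import re


def ngram_features(texts, n_gram=(1, 1)):
    """Return the sorted, unique word n-grams found in a list of strings."""
    low, high = n_gram
    features = set()
    for doc in texts:
        # Lowercase and keep alphanumeric tokens of two or more characters.
        tokens = re.findall(r"\b\w\w+\b", doc.lower())
        for n in range(low, high + 1):
            # Slide a window of n tokens across the document.
            for i in range(len(tokens) - n + 1):
                features.add(" ".join(tokens[i:i + n]))
    return sorted(features)


print(ngram_features(["deep learning for grants"], n_gram=(1, 2)))
```

With n_gram=(1, 2) the result contains both single words and two-word phrases, which is why the parameter is a (minimum, maximum) tuple.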

Main_extractor

class main_extractor.AgencyDataExtractor(n_cores, agencies, params)[source]

Bases: object

Class which can extract data from the required agency webpages. Currently supported agencies - NIH, NSF

extract_agency_proposals()[source]

Parent function which calls child functions to retrieve data for each agency. Each child function will save the data to specific files separately.

Parameters

None

Returns

None

Extract_proposals

class extract_proposals.GrantsDataExtractor(xml_url, csv_url, agencies, params)[source]

Bases: object

Class which will extract data from the Grants.Gov website.

As per design, the list of all open proposals is first downloaded from Grants.gov. Then, for each proposal, further data is extracted from its dedicated webpage (for example, on the NSF website).

ExtractCSVData()[source]

Function to extract data from the downloaded CSV file. Once the data is extracted it will be saved as a dataframe - self.metadata

Parameters

None

Returns

None

ExtractXMLData()[source]

Function to extract data from the XML file. Once the data is extracted it will be saved as a dataframe - self.opps_df

Parameters

None

Returns

None

ProcessXMLData()[source]

Function to process the extracted XML data. Reformats the columns CloseDate, PostDate, and LastUpdateDate, and identifies Open Proposals.

Parameters

None

Returns

None
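
The date reformatting and open-proposal filtering described above can be sketched with the standard library. This is an illustration under stated assumptions, not the project's code: it assumes dates arrive as MMDDYYYY strings (the default format of helpers.get_formatted_date), that a proposal with no close date counts as open, and that records are plain dictionaries. OpportunityID, parse_mmddyyyy, and open_proposals are hypothetical names.

```python
from datetime import date, datetime


def parse_mmddyyyy(value):
    """Parse an 'MMDDYYYY' string into a date; return None for blanks."""
    return datetime.strptime(value, "%m%d%Y").date() if value else None


def open_proposals(records, today=None):
    """Keep records whose CloseDate is missing or not before `today`."""
    today = today or date.today()
    kept = []
    for rec in records:
        close = parse_mmddyyyy(rec.get("CloseDate", ""))
        # Assumption: no close date means the proposal is still open.
        if close is None or close >= today:
            kept.append({**rec, "CloseDate": close})
    return kept


sample = [
    {"OpportunityID": "A", "CloseDate": "01152030"},
    {"OpportunityID": "B", "CloseDate": "01152020"},
]
print(open_proposals(sample, today=date(2025, 1, 1)))
```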

SaveXMLData()[source]

Function to save all the XML Data to CSV files. Specifically, Open Proposals will be saved agency-wise in separate files.

Parameters

None

Returns

None

Create_analytical_data

class create_analytical_data.Analytical_Data_Creator(n_cores, univ_name, params)[source]

Bases: object

Class which will create a user (i.e., scholar) database with details from their profile page and relevant publications.

Parameters
  • user_organisation (str) – Text from which keywords are to be extracted

  • i (str) – User ID (dummy_variable)

Returns

Space separated set of keywords

Return type

str

create_publication_data()[source]

Main function to process and compile the final Scholars’ dataset

Parameters

None

Returns

None

create_user_token_data()[source]

Function to create tokens from Profile page (Organization, Overview and Keyword sections) for all users

Parameters

None

Returns

None

create_analytical_data.get_author_pubinfo(scholar_df, i, top_n=5, top_title=True)[source]

Function to extract information of top N publications of the author

Parameters
  • scholar_df (Pandas.DataFrame) – DataFrame containing all of the user’s information

  • i (str) – User ID (dummy_variable)

  • top_n (int) – Based on relevancy, the number of top Titles to be used

  • top_title (bool) – If True, only the top N publications will be extracted. Else all publication data will be used.

Returns

Tuple of Dictionaries. Each dictionary contains User_id as key and keywords from Publication title / User keywords as values

Return type

Tuple

create_analytical_data.user_keywords(user_key, i)[source]

Function to calculate tokens from user’s keywords

Parameters
  • user_key (str) – Text from which keywords are to be extracted

  • i (str) – User ID (dummy_variable)

Returns

Space separated set of keywords

Return type

str

create_analytical_data.user_o_keywords(user_overview, i)[source]

Function to calculate tokens from user’s Overview

Parameters
  • user_overview (str) – Text from which keywords are to be extracted

  • i (str) – User ID (dummy_variable)

Returns

Space separated set of keywords

Return type

str

create_analytical_data.user_org_keywords(user_organisation, i)[source]

Function to calculate tokens from user’s organization

Parameters
  • user_organisation (str) – Text from which keywords are to be extracted

  • i (str) – User ID (dummy_variable)

Returns

Space separated set of keywords

Return type

str

Extract_publications

class extract_publications.Extract_Publications(n_cores, univ_name, params)[source]

Bases: object

Class which will extract all the publication details of all the scholars of a given university

create_univ_publication_data()[source]

Main function which will create the publication data for all the users of a university

Parameters

None

Returns

None

create_user_publication_data(user_id, pub_ids)[source]

Creates the publication data for a single user by scraping the university webpage. For each publication ID, the function retrieves the complete publication details from the university webpage.

Parameters
  • user_id (str) – User_id for each user whose publications are to be extracted

  • pub_ids (List) – List of all publication IDs for the user

Returns

Dataframe for each user where each row is a publication of a user. Total no of rows = n_publications

Return type

Pandas.DataFrame

get_publication_ids(user_id, str_)[source]

Returns a dictionary where each user_id is a key and a list of his/her publications as values

Parameters

  • user_id (str) – User_id for each user whose publications are to be extracted

  • str_ (str) – WIP

Returns

Dictionary of {User IDs : List of publications}

Return type

Dictionary

save_user_publications(univ_name='TAMU')[source]

Function to save the publication details of each user

Parameters

univ_name (str) – University name of the user - which determines the file names for saving.

User_profile_creation

class user_profile_creation.extract_user_profiles(univ_name, output_path)[source]

Bases: object

Class which can extract profiles of all users from a university

extract_info(url, user_id)[source]

Function to extract a particular user’s information from general university URL

Parameters
  • url (str) – The URL from which response is to be retrieved

  • user_id (str) – ID of the particular scholar

Returns

None

extract_profiles()[source]

Function to compile Scholar data of a particular university. The function will first identify the total number of scholars in a university and then get basic summary available for each scholar.

Parameters

None

Returns

None

get_awards()[source]

Function to extract Research areas of the Scholar from University Page

Parameters

None

Returns

Research areas of the Scholar, Length of research_areas

Return type

Tuple (List, Int)

get_department()[source]

Function to extract Department of the Scholar from University Page

Parameters

None

Returns

Department of the Scholar

Return type

str

get_department_info()[source]

Function to extract Department info (including course area) of the Scholar from University Page

Parameters

None

Returns

Department info of the Scholar

Return type

str

get_email()[source]

Function to extract email of the Scholar from University Page

Parameters

None

Returns

Email of the Scholar

Return type

str

get_keywords()[source]

Function to extract keywords of the Scholar from University Page

Parameters

None

Returns

Keywords of the Scholar

Return type

str

get_name()[source]

Function to extract name of the Scholar from University Page

Parameters

None

Returns

Name of the Scholar

Return type

str

get_netid()[source]

Function to extract NetID (University Unique Identifier) of the Scholar from University Page

Parameters

None

Returns

NetID of the Scholar

Return type

str

get_npublications()[source]

Function to get the number of publications of the Scholar from University Page

Parameters

None

Returns

Publications of the Scholar

Return type

str

get_organizations()[source]

Function to extract Organizations of the Scholar from University Page

Parameters

None

Returns

Organizations of the Scholar

Return type

str

get_overview()[source]

Function to extract Overview of the Scholar from University Page

Parameters

None

Returns

Overview of the Scholar

Return type

str

get_profile(url, user_id)[source]

Function to extract all details of a scholar from University Page

Parameters
  • url (str) – The base university URL from which Scholars’ data can be extracted by appending their user_ids

  • user_id (str) – The university provided User ID of the scholar

Returns

Scholar Data in the form of Pandas.DataFrame

Return type

Pandas.DataFrame

get_publications()[source]

Function to extract publications of the Scholar from University Page

Parameters

None

Returns

Publications of the Scholar

Return type

str

get_research()[source]

Function to extract Research areas of the Scholar from University Page

Parameters

None

Returns

Research areas of the Scholar, Length of research_areas

Return type

Tuple (List, Int)

get_title()[source]

Function to extract Preferred title of the Scholar from University Page

Parameters

None

Returns

Preferred title of the Scholar

Return type

str

user_profile_creation.get_userid(user_dict, key)[source]

Function to get User IDs from JSON

Parameters

user_dict (JSON) – The URL from which response is to be retrieved

Returns

User ID

Return type

str

Helpers

class helpers.PreProcessing(text)[source]

Bases: object

Class which is equipped with all sorts of Preprocessing & Cleaning techniques

lemmatize()[source]

Function to extract root words - Lemmatizing

Parameters

None

Returns

Processed text

Return type

str

remove_characters()[source]

Function to remove special characters

Parameters

None

Returns

Processed text

Return type

str

remove_letters(size)[source]

Function to remove words with fewer than size letters from a text

Parameters

None

Returns

Processed text

Return type

str

remove_numbers()[source]

Function to remove numbers in a text

Parameters

None

Returns

Processed text

Return type

str

remove_punctuation()[source]

Function to remove punctuations from the text

Parameters

None

Returns

Processed text

Return type

str

remove_stopwords()[source]

Function to remove stopwords from the text

Parameters

None

Returns

Processed text

Return type

str

remove_urls()[source]

Remove URLs from text

Parameters

None

Returns

Text after removing URLS

Return type

str

stemming()[source]

Function to extract root words - Stemming

Parameters

None

Returns

Processed text

Return type

str

text_lowercase()[source]

Function to convert all letters in a text to lowercase

Parameters

None

Returns

Processed text

Return type

str

tokenize()[source]

Function to tokenize phrases and tokens

Parameters

None

Returns

Processed text

Return type

str

helpers.counter_cosine_similarity(user_id, counterA, counterB)[source]

Calculate the counter cosine similarity for each user_id

Parameters
  • user_id (str) – User ID

  • counterA (List) – Keyword list 1

  • counterB (List) – Keyword list 2

Returns

Dictionary of {User ID: Counter_cosine value}

Return type

Dictionary
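
Cosine similarity between two keyword lists treats each list as a vector of term counts and divides their dot product by the product of their lengths. The sketch below shows that computation with collections.Counter. It is an illustration rather than the project's code: counter_cosine is a hypothetical name, and returning 0.0 for an empty list is an assumption.

```python
import math
from collections import Counter


def counter_cosine(user_id, keywords_a, keywords_b):
    """Cosine similarity between two keyword lists, keyed by user ID."""
    a, b = Counter(keywords_a), Counter(keywords_b)
    # Only terms present in both lists contribute to the dot product.
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return {user_id: 0.0}  # an empty list has no direction to compare
    return {user_id: dot / (norm_a * norm_b)}


print(counter_cosine("u1", ["nlp", "grants", "nlp"], ["nlp", "funding"]))
```

A value near 1 means the two lists share most of their weight on the same keywords, and 0 means they share none.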

helpers.create_tokens(df, func, column_name, n_cores)[source]

Function to create tokens for a given column name

Parameters
  • df (Pandas.DataFrame) – DataFrame

  • func – Function to be applied

  • column_name (str) – The name of the column

  • n_cores (int) – No of CPU cores to be used

Returns

List containing the results of the function applied on each element of the column

Return type

List

helpers.extract_json(base_url, end_url, page)[source]

Extract data from webpage by appending Base_url+page+End_url

Parameters
  • base_url (str) – Base URL

  • end_url (str) – End URL

  • page (str) – The page number

Returns

Response from the webpage

Return type

Dict
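
The URL construction described above (base URL, then page number, then end URL) can be sketched as follows. This is a standard-library illustration, not the project's implementation, which may use a different HTTP client. build_page_url and fetch_json are hypothetical names, and the example URL is a placeholder.

```python
import json
from urllib.request import urlopen


def build_page_url(base_url, end_url, page):
    """Concatenate base URL, page number, and end URL, in that order."""
    return f"{base_url}{page}{end_url}"


def fetch_json(base_url, end_url, page, timeout=30):
    """Download one page and decode the JSON body into a dict."""
    url = build_page_url(base_url, end_url, page)
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


# Placeholder URL parts, for illustration only.
print(build_page_url("https://example.org/api?page=", "&size=10", 3))
```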

helpers.get_datetime(date_str, year=True)[source]

Converts string Datetime to Datetime Object

Parameters
  • date_str (str) – Datetime in String

  • year (bool) – If True, returns only the year

Returns

Datetime object

Return type

Datetime

helpers.get_formatted_date(data, format_='%m%d%Y')[source]

Function to format date in a required format

Parameters
  • data (List) – List of dates as string

  • format_ (str) – Format in which date should be returned

Returns

Formatted date

Return type

Datetime

helpers.get_keys(text, ngram=1, ntop=10, generator='Spacy')[source]

Function to extract keywords from a text using a chosen generator

Parameters
  • text (str) – The text from which keywords are to be extracted

  • ngram (int) – No of words used for Ngram

  • ntop (int) – The no of top keywords to be extracted

  • generator (str) – The algorithm to be used for keyword extraction

Returns

List of Keywords

Return type

List

helpers.get_request(url, headers, data='')[source]

Retrieves a webpage with the desired header and payload data and returns the text data

Parameters
  • url (str) – URL from which data needs to be extracted

  • headers (List) – headers for the url

  • data (Dict) – Payload data

Returns

Response from the webpage in text

Return type

Str

helpers.merge_databases(dset1, dset2, on, how='inner')[source]

Function to merge two datasets using ‘on’ as the key

Parameters
  • dset1 (Pandas.DataFrame) – Dataset 1

  • dset2 (Pandas.DataFrame) – Dataset 2

  • on (str) – Column on which datasets are to be merged

  • how (str) – Type of join

Returns

Merged dataframe

Return type

Pandas.DataFrame

helpers.parallelize(n_cores, func, arg1)[source]

Function to parallelize the task across multiple CPU cores

Parameters
  • n_cores (int) – No of cores of CPU to be used

  • func (Function()) – The function which needs to be parallelized

  • arg1 (List) – List[list of elements, len(list of elements)]

Returns

List containing the results of the function applied on each element in arg1[0]

Return type

List
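
The calling convention above, where arg1 is a two-item list holding the elements and their count, can be sketched like this. Note the substitution: this example uses a thread pool from concurrent.futures as a stand-in so that it runs anywhere without extra setup, whereas the project's helper is documented as using CPU cores and may rely on processes instead. The length check is also an assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def parallelize(n_cores, func, arg1):
    """Apply `func` to every element of arg1[0] using `n_cores` workers."""
    elements, count = arg1
    if len(elements) != count:
        raise ValueError("arg1[1] must equal len(arg1[0])")
    with ThreadPoolExecutor(max_workers=n_cores) as pool:
        # map() preserves input order in the returned results.
        return list(pool.map(func, elements))


words = ["grants", "funding", "nlp"]
print(parallelize(2, str.upper, [words, len(words)]))
```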

helpers.save_pandas_to_csv(df, output_path, index)[source]

Saves the dataset to CSV file

Parameters
  • output_path (str) – Path where the file needs to be saved

  • index (bool) – Whether index should be included while saving

Returns

None

helpers.tokenize(phrase, k=3)[source]

Function which preprocesses and tokenizes a given phrase

Parameters
  • phrase (str) – Phrase to be tokenized

  • k (int) – Minimum length for a word to be retained in the text

Returns

Processed tokens

Return type

List
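
A preprocessing and tokenization pass of this kind can be sketched with the standard library. The example below is an assumption about the order and choice of steps (lowercasing, URL removal, punctuation removal, number removal, minimum word length), loosely following the PreProcessing methods documented above. It omits stopword removal, stemming, and lemmatization, which need an NLP library, and it is not the project's actual implementation.

```python
import re
import string


def tokenize(phrase, k=3):
    """Lowercase, strip URLs, punctuation and digits, then drop short words."""
    text = phrase.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\d+", " ", text)  # remove numbers
    # Keep only words with at least k letters.
    return [word for word in text.split() if len(word) >= k]


print(tokenize("Deep Learning for NSF grants in 2021: see https://nsf.gov!"))
```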
