“Federal RePORTER, a funded research and development grant database, contained a vast amount of information on federally funded R&D and was utilized by researchers and policymakers to uncover insights” (Linehan et al., 2022). Federal RePORTER data contain over 1 million federally funded R&D grants from federal science and technology agencies, dating from 2000 onward (Linehan et al., 2022). Project information included title, funding department, funding agency, principal investigator (PI), organization, start date, and total cost. Federal RePORTER data were submitted by the individual agencies, including HHS, NSF, USDA, NASA, DOD, VA, ED, and EPA.
Federal RePORTER was retired in 2022, although archived data through fiscal year 2020 are available at https://federalreporter.nih.gov/.
Our raw dataset contained 1,262,655 projects from FYs 2008-2020. To prepare these data for analysis, we:
Removed projects that were missing an abstract (42,536 projects),
Deduplicated projects by title, abstract, and FY, removing the repeated entries (71,902 projects), and
Removed abstracts that were short phrases such as “Abstract not provided”.
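The three filtering steps above can be sketched as follows. This is a minimal illustration and not the code used for the analysis: the dictionary keys ("title", "abstract", "fy"), the function name clean_projects, and the placeholder list (beyond "Abstract not provided") are all assumptions made for the example.

```python
# Hypothetical placeholder phrases; the source names only "Abstract not provided".
PLACEHOLDERS = {"abstract not provided", "no abstract provided"}


def clean_projects(projects):
    """Apply the three filters in order.

    Each project is a dict with the hypothetical keys
    "title", "abstract", and "fy".
    """
    seen = set()
    kept = []
    for project in projects:
        abstract = (project.get("abstract") or "").strip()

        # 1. Remove projects that are missing an abstract.
        if not abstract:
            continue

        # 2. Keep only the first entry for each (title, abstract, FY) combination.
        key = (project.get("title"), abstract, project.get("fy"))
        if key in seen:
            continue
        seen.add(key)

        # 3. Remove abstracts that are only a short placeholder phrase.
        if abstract.lower().rstrip(".") in PLACEHOLDERS:
            continue

        kept.append(project)
    return kept
```

In this sketch the same title and abstract appearing in two different fiscal years counts as two distinct projects, since FY is part of the deduplication key.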
Doc2Vec uses the raw abstract text. It takes common words (e.g., the, and, it) and non-alphabetic characters into account to give each document meaning as a whole. The other factorization method, TF-IDF, requires text processing to be effective.
We processed the abstract text of each project by:
Removing phrases such as “description (provided by applicant)”,
Applying standard natural language processing (NLP) methods, specifically tokenization, lemmatization, stop word removal, and addition of bi-grams and tri-grams, and
Discarding single-character tokens and numeric tokens, except those of length four (years).
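A minimal sketch of part of this pipeline is shown below, using only the Python standard library. It covers boilerplate phrase removal, tokenization, stop word removal, and the token filter. Lemmatization and bi-gram/tri-gram detection are omitted because they normally rely on an NLP library. The stop word list, the regular-expression tokenizer, and the function names are assumptions made for illustration, not the original implementation.

```python
import re

# The source names only this boilerplate phrase; others would be added here.
BOILERPLATE = ["description (provided by applicant)"]

# Hypothetical, deliberately tiny stop word list for illustration.
STOP_WORDS = {"the", "and", "it", "a", "an", "of", "to", "in", "is"}


def keep_token(token):
    """Drop length-one tokens and numeric tokens, except four-digit ones (years)."""
    if len(token) == 1:
        return False
    if token.isdigit():
        return len(token) == 4
    return True


def process_abstract(text):
    """Return the filtered token list for one abstract."""
    text = text.lower()
    for phrase in BOILERPLATE:
        text = text.replace(phrase, " ")
    # Simple tokenizer (an assumption): runs of ASCII letters or digits.
    tokens = re.findall(r"[a-z0-9]+", text)
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [t for t in tokens if keep_token(t)]


example = "DESCRIPTION (provided by applicant): In 2015 the 3 labs studied 12 x-ray images."
print(process_abstract(example))
# prints ['2015', 'labs', 'studied', 'ray', 'images']
```

Note that the four-digit rule keeps any four-digit number, not only plausible years, which matches the rule as stated but may retain some non-year values.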
Linehan K, Oh E, Thurston J, Siwe GL, Garrett M, Keller S, Shipp S, Kindlon A, Jankowski J. Detecting Federally Funded Research and Development Trends Using Machine Learning and Information Retrieval Methods. Technical Report BI-2022-1531. Proceedings of the Biocomplexity Institute. University of Virginia; 2022 May. DOI: https://doi.org/10.18130/4c3g-k017.