Below are figures showing the top 10, 20, and 30 topics, respectively, that emerge when non-negative matrix factorization (NMF) is run on a corpus of documents identified as relating to big data. For these figures, the corpus consists of the documents identified using TF-IDF and KNN. This model produced far fewer false positives than Doc2Vec and KNN, making it the more accurate of the two approaches.


1. NMF with 10 topics

The figure shows the 10 main broad research areas that use big data and are funded by a federal agency. These areas are clinical trials, clinical investigations, health communities, brain and neural cognition, food safety, data analysis, conferences, cancer, children, and cells.

Increasing research areas: clinical trials, clinical investigations, and health communities.

Declining research areas: cells and statistical analysis.
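The text does not say how a research area was labeled as increasing or declining. One plausible approach, sketched below purely as an assumption, is to average each topic's share of the document-topic weights per year and take the least-squares slope over time. The function `topic_trend_slopes`, the years, and the weights are hypothetical.

```python
from collections import defaultdict


def topic_trend_slopes(years, doc_topic):
    """Least-squares slope of each topic's mean per-year share (assumed method)."""
    by_year = defaultdict(list)
    for year, row in zip(years, doc_topic):
        total = sum(row)
        if total > 0:                                  # skip documents with no topic weight
            by_year[year].append([w / total for w in row])
    xs = sorted(by_year)
    if len(xs) < 2:
        raise ValueError("need at least two distinct years to estimate a trend")
    mean_x = sum(xs) / len(xs)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slopes = []
    for k in range(len(doc_topic[0])):
        # Mean share of topic k among the documents of each year.
        ys = [sum(r[k] for r in by_year[x]) / len(by_year[x]) for x in xs]
        mean_y = sum(ys) / len(ys)
        cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        slopes.append(cov / var_x)
    return slopes


# Hypothetical data: topic 0 gains share each year, topic 1 loses it.
years = [2010, 2010, 2011, 2011, 2012, 2012]
weights = [[0.2, 0.8], [0.2, 0.8], [0.5, 0.5], [0.5, 0.5], [0.8, 0.2], [0.8, 0.2]]

slopes = topic_trend_slopes(years, weights)
labels = ["increasing" if s > 0 else "declining" for s in slopes]
print(labels)
```

A positive slope marks a topic as increasing and a negative slope as declining. A real analysis would likely also check whether each slope is distinguishable from noise before labeling it.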


2. NMF with 20 topics

We increased the number of topics to gain a more granular understanding of the research areas that use big data. The result is similar to the 10-topic model, with additional research areas emerging in imaging, vaccines, and social behavior.

Increasing research areas: clinical trials, patient care, and data analysis.

Declining research areas: proteins, cells, and lungs.


3. NMF with 30 topics

Finally, we expanded the number of topics to 30 to get a broader view of the topics funded by a federal agency that use big data. In addition to the research areas described for 10 and 20 topics, other research topics that make use of big data include drug development, risk intervention, HIV, conferences and meetings, and language learning and speech.

Increasing research areas: clinical trials and patient care.

Declining research areas: proteins, crops, and cells.