Text Classification

KNN on TF-IDF

Using TF-IDF to vectorize the abstracts and KNN as the classifier, 257,123 abstracts were identified as relating to big data.
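The TF-IDF plus KNN step can be sketched as follows. This is a minimal, dependency-free illustration and not the pipeline that produced the count above: the toy abstracts, the labels big_data and other, the whitespace tokenizer, the smoothed IDF formula, the use of cosine similarity, and k = 3 are all assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def fit_idf(token_lists):
    # Smoothed inverse document frequency: log((1 + N) / (1 + df)) + 1.
    n_docs = len(token_lists)
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    # Sparse TF-IDF vector, L2-normalized; terms unseen in training are dropped.
    counts = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def cosine(a, b):
    # Both vectors are unit length, so the dot product is the cosine similarity.
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def knn_predict(query_vec, train_vecs, train_labels, k=3):
    # Rank training documents by similarity and take a majority vote over the top k.
    ranked = sorted(
        zip(train_vecs, train_labels),
        key=lambda pair: cosine(query_vec, pair[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labeled abstracts (illustrative only).
train_texts = [
    "large scale data mining on distributed hadoop clusters",
    "machine learning for massive genomic data sets",
    "scalable analytics of high volume sensor data streams",
    "clinical trial of a new asthma medication in children",
    "archival study of nineteenth century trade routes",
    "field survey of wetland bird populations",
]
train_labels = ["big_data", "big_data", "big_data", "other", "other", "other"]

train_tokens = [tokenize(t) for t in train_texts]
idf = fit_idf(train_tokens)
train_vecs = [tfidf_vector(tokens, idf) for tokens in train_tokens]

new_abstracts = [
    "distributed machine learning on massive data streams",
    "field study of wetland trade routes",
]
predictions = [
    knn_predict(tfidf_vector(tokenize(text), idf), train_vecs, train_labels, k=3)
    for text in new_abstracts
]
print(predictions)
```

On a corpus of this size, a library implementation with sparse matrices (for example scikit-learn's TfidfVectorizer and KNeighborsClassifier) would be the practical choice; the version here only spells out the computation.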

The sample above shows abstracts that were identified as big data using TF-IDF. For each abstract, information such as its department, agency, project dates, and costs was recorded. This information will be used later to model further trends in the abstracts.

The breakdown by department of the abstracts identified as big data using TF-IDF is shown above. The vast majority of these abstracts were from HSS, which is consistent with the proportion of HSS documents in the corpus.

KNN on Doc2Vec

Using Doc2Vec to embed the abstracts and KNN as the classifier, 484,362 abstracts were identified as relating to big data.

The sample above shows abstracts that were identified as big data using Doc2Vec. For each abstract, information such as its department, agency, project dates, and costs was recorded. This information will be used later to model further trends in the abstracts.