Document Classifier

About:

This app classifies computer science papers from their abstracts and predicts each paper's topic using a machine learning model hosted on Google Cloud Run.

How The Web App Works:

This web app is built with FastAPI and Bootstrap. Bootstrap gives us a clean, responsive website without having to write much custom CSS or JavaScript. I deployed the app with Docker on Google Cloud Run, which makes it a serverless web application. For me, the advantage of serverless is cost effectiveness: I pay very little unless the app gets heavy traffic, and I don't expect people to visit it very often. The trade-off is latency: when the app has been idle, the first request can be slow while a new container starts up, which I can live with. When the user enters the text and hits the "Submit" button, the Python script passes the text to another REST API that serves the model, which predicts the topic of the paper. To learn more about how I deployed this app, check out this blog post.
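The forwarding step described above could be sketched roughly as follows, using only the Python standard library. This is an illustration under assumptions, not the app's actual code: the URL, the "text" request field, and the "topic" response field are all hypothetical, since the real model service's contract is not given here. In the app, a function like predict_topic would be called from the FastAPI route that handles the form submission.

```python
import json
import urllib.request

# Hypothetical endpoint; the real model service URL is not given in this document.
MODEL_URL = "https://model-service.example.com/predict"


def build_prediction_request(abstract: str, url: str = MODEL_URL) -> urllib.request.Request:
    """Package the abstract as a JSON POST request for the model service."""
    # The "text" field name is an assumption about the model API.
    payload = json.dumps({"text": abstract}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict_topic(abstract: str, url: str = MODEL_URL, timeout: float = 30.0) -> str:
    """Send the abstract to the model service and return the predicted topic.

    The generous timeout leaves room for a serverless cold start.
    """
    request = build_prediction_request(abstract, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    # The "topic" field name is an assumption about the response format.
    return body["topic"]
```

Keeping the request-building step separate from the network call makes it easy to check the payload without contacting the model service.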

How The Model Works:

The model for this app is a text classification algorithm built with the scikit-learn library. It was trained on an imbalanced dataset (corpus) that I created from summaries of papers published on arxiv.org. Each paper already came labeled with its category, so I did not need to label the dataset myself. The data was stored in a MongoDB database and used to train a Support Vector Machine that uses weighting to compensate for the imbalance in the number of papers per topic. You can read more about this part of the model development in my first blog post. To improve the model's performance, I used stop word removal and stemming through the Natural Language Toolkit (NLTK). Finally, I persisted the entire pipeline using Joblib. You can read more about this aspect of the model training in the second blog post.
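A pipeline along these lines might look like the sketch below. Several details are assumptions rather than the app's actual configuration: TF-IDF features, LinearSVC as the Support Vector Machine, class_weight="balanced" as the weighting scheme, and the helper names stem_tokenize, build_pipeline, and train_and_save. To avoid a corpus download, the sketch uses scikit-learn's built-in English stop word list in place of NLTK's; the stemming itself uses NLTK's PorterStemmer.

```python
import re

import joblib
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

stemmer = PorterStemmer()


def stem_tokenize(text):
    """Lowercase, split into alphabetic tokens, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stemmer.stem(t) for t in tokens if t not in ENGLISH_STOP_WORDS]


def build_pipeline():
    """Vectorizer plus a linear SVM that reweights classes by inverse frequency."""
    return Pipeline([
        # token_pattern=None because the custom tokenizer replaces the default regex.
        ("tfidf", TfidfVectorizer(tokenizer=stem_tokenize, token_pattern=None, lowercase=False)),
        # class_weight="balanced" is an assumed way to handle the topic imbalance.
        ("svm", LinearSVC(class_weight="balanced")),
    ])


def train_and_save(texts, labels, path):
    """Fit the whole pipeline and persist it as a single Joblib file."""
    model = build_pipeline()
    model.fit(texts, labels)
    joblib.dump(model, path)
    return model
```

Persisting the whole pipeline means the serving API only has to call joblib.load and predict on raw text. One caveat with this design: the custom tokenizer is pickled by reference, so stem_tokenize must be importable under the same module path in the process that loads the model.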