ML Malware Prediction in the Cloud

Abstract:

This work centers on a binary classification problem over millions of observations, each corresponding to a distinct Windows device. By correctly classifying which devices are most likely to acquire malware in the coming time period, we can also get an idea of which factors most influence infection. The work also explores Google Cloud Platform (GCP) and Microsoft Azure as on-demand distributed computing ecosystems, and discusses the class imbalance problem in classification. My results are strong for such rudimentary implementations (for reference, the first-place Kaggle entry scored 0.71 AUROC).
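To make the AUROC metric and the class imbalance concern concrete, here is a minimal sketch in plain Python. It is not the implementation used in this work: the function names, the toy labels, and the scores are all invented for illustration. It shows why accuracy can mislead on imbalanced data while AUROC still separates a useless model from a useful one.

```python
def auroc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. This is the pairwise (Mann-Whitney) form of
    AUROC; it is O(n_pos * n_neg), fine for a toy example only.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy(labels, scores, threshold=0.5):
    """Fraction of correct predictions after thresholding the scores."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


# Hypothetical toy data: 2 infected devices (label 1) out of 10.
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# Model A always predicts "no malware" with the same score.
constant_scores = [0.0] * 10

# Model B ranks the infected devices near the top, but no score reaches 0.5.
ranked_scores = [0.4, 0.3, 0.35, 0.1, 0.1, 0.2, 0.05, 0.15, 0.25, 0.0]

for name, scores in [("constant", constant_scores), ("ranked", ranked_scores)]:
    print(name, "accuracy:", accuracy(labels, scores), "AUROC:", auroc(labels, scores))
```

Both toy models reach the same accuracy because neither ever predicts the minority class at the 0.5 threshold, yet their AUROC values differ because AUROC depends only on how the scores rank infected devices against clean ones.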
