Creating a Model to Detect Malware using Supervised Learning Algorithms - Case Study - IT Assignment Help



SCENARIO
The industry grant from TOBORRM requires that they provide a clear case for whether machine learning algorithms could solve the problem of classifying malicious software. Your task is to build on your previous work by running the data through appropriate machine learning modelling approaches and tuning them to optimise their accuracy.

TASK
You are to train your selected supervised machine learning algorithms using the master dataset provided, and compare their performance to each other and to TOBORRM’s initial attempt to classify the samples.

Part 1 – General data preparation and cleaning.
Import the MLDATASET_PartiallyCleaned.xlsx into R Studio. This dataset is a partially cleaned version of MLDATASET-200000-1612938401.xlsx.
Write the appropriate code in R Studio to prepare and clean the MLDATASET_PartiallyCleaned dataset as follows:
For How.Many.Times.File.Seen, set all values = 65535 to NA;
Convert Threads.Started to a factor whose categories are given by
1 = 1 thread started
2 = 2 threads started
3 = 3 threads started
4 = 4 threads started
5 = 5 or more threads started

Hint: Replace all values greater than 5 with 5, then use the factor(.) function.
Log-transform Characters.in.URL using the log(.) function, and remove the original Characters.in.URL column from the dataset (unless you have overwritten it with the log-transformed data)
Select only the complete cases using the na.omit(.) function, and name the dataset MLDATASET.cleaned.
Briefly outline the preparation and cleaning process in your report and why you believe the above steps were necessary.
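The cleaning steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the expected solution: the toy data frame stands in for the imported MLDATASET_PartiallyCleaned data, its values are made up, and the factor labels and the new column name log.Characters.in.URL are hypothetical choices.

```r
# Toy stand-in for the imported dataset; the values are made up for illustration.
dat <- data.frame(
  How.Many.Times.File.Seen = c(3, 65535, 12, 7, 65535, 1),
  Threads.Started          = c(1, 2, 7, 5, 3, 4),
  Characters.in.URL        = c(20, 35, 50, 80, 15, 120)
)

# 1. Treat the value 65535 as missing.
is.sentinel <- which(dat$How.Many.Times.File.Seen == 65535)
dat$How.Many.Times.File.Seen[is.sentinel] <- NA

# 2. Cap thread counts at 5, then convert to a labelled factor.
dat$Threads.Started[dat$Threads.Started > 5] <- 5
dat$Threads.Started <- factor(
  dat$Threads.Started,
  levels = 1:5,
  labels = c("1", "2", "3", "4", "5+")  # hypothetical labels
)

# 3. Log-transform the URL length and drop the original column.
dat$log.Characters.in.URL <- log(dat$Characters.in.URL)  # hypothetical column name
dat$Characters.in.URL <- NULL

# 4. Keep only the complete cases.
MLDATASET.cleaned <- na.omit(dat)
```

Capping before calling factor(.) means every count above 5 falls into the final category, so the factor has exactly five levels regardless of how large the raw counts are.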
Write the appropriate code in R Studio to partition the data into training and test sets using a 30/70 split. Be sure to set the randomisation seed using your student ID. Export both the training and test datasets as csv files; these will need to be submitted along with your code.
Note that the training set is typically larger than the test set in practice. However, given the size of this dataset, you will only use 30% of the data to train your ML models to save time.
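One possible way to produce the 30/70 partition is sketched below. It assumes a simple random split without stratification; the seed value, the toy data, the "YES"/"NO" coding of Actually.Malicious, and the output file names are all placeholders rather than anything specified in the brief.

```r
set.seed(12345678)  # placeholder; replace with your student ID

# Toy stand-in for MLDATASET.cleaned (hypothetical values and coding).
MLDATASET.cleaned <- data.frame(
  Sample.ID          = 1:100,
  Actually.Malicious = factor(rep(c("YES", "NO"), times = 50))
)

n <- nrow(MLDATASET.cleaned)

# Draw 30% of the row indices at random for the training set.
train.idx <- sample(n, size = round(0.30 * n))

train.set <- MLDATASET.cleaned[train.idx, ]
test.set  <- MLDATASET.cleaned[-train.idx, ]   # remaining 70%

# Export both sets; the file names are hypothetical.
write.csv(train.set, "training_set.csv", row.names = FALSE)
write.csv(test.set,  "test_set.csv",     row.names = FALSE)
```

Setting the seed before calling sample(.) makes the split reproducible, so the exported csv files match what the code regenerates.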

Part 2 – Compare the performances of different machine learning algorithms
Select three supervised learning modelling algorithms to test against one another by running the following code. Make sure you enter your student ID into the command set.seed(.). Your 3 modelling approaches are given by myModels.
library(tidyverse)
set.seed(Enter your student ID)
models.list1 <- c("Logistic Ridge Regression", "Logistic LASSO Regression", "Logistic Elastic-Net Regression")
models.list2 <- c("Classification Tree", "Bagging Tree", "Random Forest")
myModels <- c("Binary Logistic Regression", sample(models.list1,size=1), sample(models.list2,size=1))
myModels %>% data.frame
For each of your ML modelling approaches, you will need to:
Run the ML algorithm in R on the training set with Actually.Malicious as the outcome variable. EXCLUDE Sample.ID and Initial.Statistical.Analysis from the modelling process.
Perform hyperparameter tuning to optimise the model (except for the Binary Logistic Regression model):
Outline your hyperparameter tuning/searching strategy for each of the ML modelling approaches, even if you’re using the same search strategy as the workshop notes.
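The modelling steps above might look roughly like the sketch below. It is an illustration under several assumptions: the training data are simulated, the predictors x1 and x2 are hypothetical, and a classification tree from the rpart package is used as the tuned model only because it is one of the candidates in models.list2. The search strategy shown (a small grid over the complexity parameter cp with 5-fold cross-validation) is one option and is not necessarily the strategy in the workshop notes.

```r
library(rpart)  # a recommended package included with standard R installations

set.seed(12345678)  # placeholder; replace with your student ID

# Simulated training set; x1 and x2 are hypothetical predictors.
n <- 300
train.set <- data.frame(
  Sample.ID = seq_len(n),
  Initial.Statistical.Analysis = sample(c("YES", "NO"), n, replace = TRUE),
  x1 = rnorm(n),
  x2 = rnorm(n)
)
train.set$Actually.Malicious <- factor(
  ifelse(train.set$x1 + 0.5 * train.set$x2 + rnorm(n) > 0, "YES", "NO")
)

# Exclude Sample.ID and Initial.Statistical.Analysis from modelling.
excluded   <- c("Sample.ID", "Initial.Statistical.Analysis")
model.data <- train.set[, !(names(train.set) %in% excluded)]

# Binary logistic regression: no hyperparameters to tune.
fit.glm <- glm(Actually.Malicious ~ ., data = model.data, family = binomial)

# Grid search over cp, scored by 5-fold cross-validated accuracy.
cp.grid <- c(0.001, 0.01, 0.05, 0.1)
folds   <- sample(rep(1:5, length.out = n))

cv.accuracy <- sapply(cp.grid, function(cp) {
  mean(sapply(1:5, function(k) {
    fit  <- rpart(Actually.Malicious ~ ., data = model.data[folds != k, ],
                  method = "class", control = rpart.control(cp = cp))
    pred <- predict(fit, newdata = model.data[folds == k, ], type = "class")
    mean(pred == model.data$Actually.Malicious[folds == k])
  }))
})

# Refit on the full training set with the best-scoring cp.
best.cp  <- cp.grid[which.max(cv.accuracy)]
fit.tree <- rpart(Actually.Malicious ~ ., data = model.data,
                  method = "class", control = rpart.control(cp = best.cp))
```

Every candidate value of cp is scored on the same folds, so differences in cross-validated accuracy reflect the hyperparameter rather than the random partition.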

 
