Data Preparation, Exploring and Modelling - IT Assignment Help

Assignment Task

 

Part A: Data Preparation, Exploring and Modelling

Data Description:
The four CSV files are described in the following table:
File Name: Description
Covid19.csv: The master file, which includes information about the countries, continents, and the daily new cases and daily new deaths in each country.
Tests.csv: Lists the daily COVID-19 tests for each country.
Countries.csv: Provides information about the countries.
Recovered.csv: Presents the daily recovered cases in each country.

Task 1: Data Preparation and Wrangling
Load and read the data from the CSV files and store them in appropriately named dataframes.
Tidy up the dataframe derived from the file “Recovered.csv” to be compatible with the dataframe derived from the file “Covid19.csv”, i.e. every observation should be a record of the recovered patients in one country on a single day.
Change the column names in the dataframes that were loaded from the following files accordingly.
File Name: Ordered New Column Names
Covid19.csv: Code, Country, Continent, Date, NewCases, NewDeaths
Tests.csv: Code, Date, NewTests
Countries.csv: Code, Country, Population, GDP, GDPCapita
Recovered.csv: Country, Date, Recovered
Ensure that all date variables are of the date data type and use the same format across the dataframes.
Treating the dataframe loaded from “Covid19.csv” as the master dataframe, add 5 new variables to it from the other files (Recovered.csv, Tests.csv, Countries.csv). The 5 new variables should be named “Recovered”, “NewTests”, “Population”, “GDP” and “GDPCapita” accordingly.
[Hint: you can use the merge function to facilitate the alignment of the data in the different dataframes.]
Check for NAs in all dataframes and replace them with zero.
Using the existing “Date” variable, add month and week variables to the master dataframe.
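
The tidying and merging steps of Task 1 can be illustrated with a small sketch. The hints in this brief name R functions, so the assignment itself is most likely meant to be solved in R; the Python below only demonstrates the logic using the standard library. It assumes that Recovered.csv is in a wide layout (one row per country, one column per day) and that dates are ISO formatted; neither is stated in the brief. The sample values are made up for illustration.

```python
import csv
import io
from datetime import date, datetime

# Made-up sample standing in for Recovered.csv (ASSUMED wide layout:
# one row per country, one column per day). Values are illustrative only.
RECOVERED_WIDE = """Country,2020-05-04,2020-05-05
Australia,30,25
Italy,1225,
"""

# Made-up sample standing in for the renamed master data from Covid19.csv.
MASTER = [
    {"Code": "AUS", "Country": "Australia", "Date": "2020-05-04", "NewCases": 18},
    {"Code": "AUS", "Country": "Australia", "Date": "2020-05-05", "NewCases": 26},
    {"Code": "ITA", "Country": "Italy", "Date": "2020-05-05", "NewCases": 1221},
]


def parse_date(text: str) -> date:
    """Parse an ISO-formatted date string (the format is an assumption)."""
    return datetime.strptime(text, "%Y-%m-%d").date()


def wide_to_long(csv_text: str) -> list:
    """Turn one-column-per-day rows into one row per country per day."""
    long_rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        country = row.pop("Country")
        for day, value in row.items():
            long_rows.append({
                "Country": country,
                "Date": parse_date(day),
                # Missing values (NAs) are replaced with zero.
                "Recovered": int(value) if value else 0,
            })
    return long_rows


def add_recovered(master: list, recovered: list) -> list:
    """Left-join Recovered onto the master rows by (Country, Date)."""
    lookup = {(r["Country"], r["Date"]): r["Recovered"] for r in recovered}
    merged = []
    for row in master:
        day = parse_date(row["Date"])
        merged.append({
            **row,
            "Date": day,
            # Master rows without a match get zero rather than a missing value.
            "Recovered": lookup.get((row["Country"], day), 0),
            "Month": day.month,
            "Week": day.isocalendar()[1],  # ISO week number
        })
    return merged


master = add_recovered(MASTER, wide_to_long(RECOVERED_WIDE))
```

The same left-join pattern would apply to Tests.csv (keyed by Code and Date) and Countries.csv (keyed by Code). Keeping every master row and filling unmatched values with zero mirrors the instruction to replace NAs with zero.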

Task 2: Exploratory Data Analysis
Add four new variables to the master dataframe (“CumCases”, “CumDeaths”, “CumRecovered”, “CumTests”). These variables should reflect the cumulative relevant data up to the date of the observation, i.e. CumCases for country “X” at date “Y” should reflect the total number of cases in country “X” from the beginning of the recorded data up to date “Y”.
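
A per-country running total of this kind can be sketched as follows. This is a standard-library Python illustration of the logic only (the brief's hints point to R), with made-up rows; the helper name `add_cumulative` is hypothetical. It assumes dates are stored as ISO strings, which sort chronologically.

```python
from collections import defaultdict

# Made-up daily rows; values are illustrative only.
rows = [
    {"Country": "X", "Date": "2020-01-02", "NewCases": 3},
    {"Country": "Y", "Date": "2020-01-01", "NewCases": 10},
    {"Country": "X", "Date": "2020-01-01", "NewCases": 2},
    {"Country": "X", "Date": "2020-01-03", "NewCases": 5},
]


def add_cumulative(rows: list, source: str, target: str) -> list:
    """Add a running total of `source` per country, ordered by date."""
    running = defaultdict(int)
    result = []
    # Sort by country, then date, so each total only covers earlier days.
    for row in sorted(rows, key=lambda r: (r["Country"], r["Date"])):
        running[row["Country"]] += row[source]
        result.append({**row, target: running[row["Country"]]})
    return result


with_cum = add_cumulative(rows, "NewCases", "CumCases")
```

Calling the helper once per pair (NewDeaths to CumDeaths, Recovered to CumRecovered, NewTests to CumTests) would produce the other three variables.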

Find the day with the highest reported death toll across the world. Print the date and the death toll of that day.
Build a graph to show how the cumulative data (Infected Cases, Deaths, Recovered, Tests) change over time for the whole world collectively.
[Hint: Use geom_line, use a log scale for the Y axis for better presentation, and use different colours to distinguish between cases, deaths, and recovered]
Extract the data for the last day (05/05/2020) and save it in a separate dataframe called “lastDay_data”.
[Hint: use the filter function with Date == "2020-05-05"]
Based on the last day data, extract the whole records of the top 10 countries worldwide by current active cases, total confirmed cases, fatality rate, and total tests per million of the population, each in a separate dataframe (i.e. top10activeW, top10casesW, top10fatalityW, top10testsMW).
[Hint: you can use head(arranged_data, n=10) to get the top 10 records]
Based on the last day data, print the up-to-date confirmed, death, and recovered cases, as well as the tests, for every continent.
Build a graph to show the total number of cases over time for the top 10 countries that were obtained in question 7 (use a log scale for the Y axis for better presentation).
[Hint: first you need to get the data of the top-10 countries and then plot their lines]
Build a graph for the top 10 countries with the highest current active cases, which were obtained previously in question 7. The graph should have one subgraph (i.e. using the facet function) for each of these countries, and every subgraph should show how the new cases, new deaths, and new recovered cases changed over time (use a log scale for the Y axis for better presentation, and use different colours to distinguish between new cases, deaths, and recovered).
[Hint: use the geom_line function with the date on the x-axis and the values of each variable on the y-axis]
Build a graph for the top 10 countries with the highest current total tests per one million of the population, which were obtained previously in question 7. This graph should present the total number of infected cases, the total tests so far, and the total tests per million of the population for each country.
[Hint: you can use a bar chart to achieve this task]
Build a graph to present the statistics of all continents, which were obtained previously in question 8 (use a log scale for the Y axis for better presentation, use Continent in the legend, and make sure the x-axis labels do not overlap).
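
The top 10 extractions in this task depend on derived quantities that the brief does not define. The sketch below (standard-library Python for illustration only, since the brief's hints point to R) assumes the common definitions: active cases = CumCases − CumDeaths − CumRecovered, fatality rate = CumDeaths / CumCases, and tests per million = CumTests × 1,000,000 / Population. These formulas, the helper names, and the sample values are all assumptions.

```python
# Made-up last-day records; values are illustrative only.
last_day = [
    {"Country": "A", "CumCases": 1000, "CumDeaths": 50, "CumRecovered": 400,
     "CumTests": 20000, "Population": 1_000_000},
    {"Country": "B", "CumCases": 500, "CumDeaths": 5, "CumRecovered": 100,
     "CumTests": 50000, "Population": 5_000_000},
    {"Country": "C", "CumCases": 200, "CumDeaths": 20, "CumRecovered": 150,
     "CumTests": 3000, "Population": 100_000},
]


def add_rates(row: dict) -> dict:
    """Add derived columns using the ASSUMED definitions described above."""
    cases, population = row["CumCases"], row["Population"]
    return {
        **row,
        "Active": cases - row["CumDeaths"] - row["CumRecovered"],
        # Guard against division by zero for countries with no data.
        "FatalityRate": row["CumDeaths"] / cases if cases else 0.0,
        "TestsPerMillion": (
            row["CumTests"] * 1_000_000 / population if population else 0.0
        ),
    }


def top_n(rows: list, key: str, n: int = 10) -> list:
    """Return the whole records of the n rows with the largest `key`."""
    return sorted(rows, key=lambda r: r[key], reverse=True)[:n]


enriched = [add_rates(r) for r in last_day]
top_active = top_n(enriched, "Active", n=2)
top_fatality = top_n(enriched, "FatalityRate", n=2)
top_tests_m = top_n(enriched, "TestsPerMillion", n=2)
```

Sorting in descending order and slicing the first n rows plays the same role as arranging the data and calling head with n = 10, as the hint suggests.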

Task 3: Data-Driven Modelling
Based on the data of the last day, that you have extracted in the previous task, create a separate dataframe named "cor_data" with the data of these variables (CumCases, CumTests, Population, GDP, GDPCapita).
[Hint: you can use the select function on the lastDay_data dataframe]
Compute the correlation matrix between the variables of the “cor_data” and visualise this correlation matrix.
Visualise the distribution of the cumulative cases in cor_data, both with and without a log transformation of the x-axis scale.
[Hint: you can use the geom_histogram function]
Print the outlier values of the cumulative cases in “cor_data”.
Divide cor_data into training and testing sets, where the training data represents 65% of the number of rows.
Train a linear regression model to predict cumulative cases from the GDP of the countries. Then, evaluate this model on the test data and print the root mean square error value.
Train another linear regression model to predict cumulative cases from all the other variables. Then, evaluate this model on the test data and print the root mean square error value.
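
The split, fit, and evaluate steps for the single-predictor model can be sketched as follows. This is a standard-library Python illustration of the logic only (the brief's hints point to R), using made-up data that lies exactly on a line so the result is easy to check. The helper names and the random seed are hypothetical. The second model, which uses several predictors, needs a multiple-regression routine and is not covered here.

```python
import math
import random


def split_rows(rows: list, train_fraction: float = 0.65, seed: int = 1):
    """Shuffle a copy of the rows and split it into training and test sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def fit_line(xs: list, ys: list):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(actual: list, predicted: list) -> float:
    """Root mean square error between actual and predicted values."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up data with CumCases = 2 * GDP + 1; values are illustrative only.
cor_data = [{"GDP": float(g), "CumCases": 2.0 * g + 1.0} for g in range(1, 21)]

train, test = split_rows(cor_data)
slope, intercept = fit_line(
    [r["GDP"] for r in train], [r["CumCases"] for r in train]
)
predictions = [intercept + slope * r["GDP"] for r in test]
error = rmse([r["CumCases"] for r in test], predictions)
```

The model is fitted on the training rows only and scored on the held-out test rows, so the RMSE reflects how well it predicts data it has not seen.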

 
