COS60008-Introduction to Data Science Report - IT Computer Science Assignment Help

Assignment Task

Task

Introduction
This is an individual assignment and is worth 15% of your final grade. It is intended to evaluate your understanding and practical skills in dealing with the first few steps of a typical data science process. In this assignment, you are provided with three data files, i.e., “data1.csv”, “data2.csv” and “data3.csv”, which form a dataset created by a higher education institution about students enrolled in different undergraduate degrees.
The files “data1.csv” and “data2.csv” contain the same set of students but distinct sets of attributes for describing the student, where each student has a unique ID. The file “data3.csv” contains a different set of students with each student described by all attributes from both “data1.csv” and “data2.csv”.
You are asked to carry out data acquisition, preparation and exploration based on the three data sources according to the given instructions. For example, you need to develop and implement appropriate steps to load and merge the data from the three data files, perform data cleaning, carry out exploratory data analysis, and report your findings. A discussion forum and further announcements for the assignment will be available in Canvas. You are responsible for checking Canvas on a regular basis to stay informed regarding any updates about the assignment.

Task 1 – Data Acquisition and Preparation
First, you need to acquire the three data files “data1.csv”, “data2.csv” and “data3.csv”, which are included in a single .zip file named “assignment1_data.zip” under the menu “Assignments → Assignment 1” in Canvas, and put them into your working folder in the Jupyter Notebook. These data files are adapted from the “Student Drop out and Academic Success” data set in the UCI repository. They contain many student records, with each record describing one student in terms of various attributes. The files “data1.csv” and “data2.csv” contain the same set of students but two distinct sets of attributes for describing a student. In contrast, the file “data3.csv” contains a different set of students, where each student record consists of all attributes from both “data1.csv” and “data2.csv”.
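The relationship between the three files described above suggests a two-step combination: a column-wise join of “data1.csv” and “data2.csv” on the student ID, followed by a row-wise append of “data3.csv”. The following is a minimal sketch of that idea, not a prescribed solution. The column names “ID”, “Age” and “Grade”, the in-memory sample data, and the helper names `load_csv` and `merge_all` are all assumptions made for illustration; the real files use the 38 attributes of the assignment.

```python
import io

import pandas as pd

# Tiny in-memory stand-ins for the three files. The column names
# ("ID", "Age", "Grade") are hypothetical, not the real attribute names.
DATA1 = "ID,Age\n1,19\n2,21\n"
DATA2 = "ID,Grade\n1,130.5\n2,142.0\n"
DATA3 = "ID,Age,Grade\n3,20,125.0\n4,23,150.0\n"


def load_csv(source):
    """Load one CSV into a DataFrame (source is a path or file-like object)."""
    return pd.read_csv(source)


def merge_all(df1, df2, df3, key="ID"):
    """Join df1 and df2 column-wise on the key, then append the rows of df3."""
    # Same students, different attributes: an outer join keeps a student
    # even when the ID appears in only one of the two files.
    same_students = pd.merge(df1, df2, on=key, how="outer")
    # Different students, same attributes: stack the rows.
    return pd.concat([same_students, df3], ignore_index=True)


df1 = load_csv(io.StringIO(DATA1))
df2 = load_csv(io.StringIO(DATA2))
df3 = load_csv(io.StringIO(DATA3))

merged = merge_all(df1, df2, df3)

# Quick sanity checks: row/column counts and uniqueness of the ID.
print(merged.shape)
print(merged["ID"].is_unique)
```

With the real data, `load_csv("data1.csv")` would take the place of the `io.StringIO` calls. Comparing each DataFrame's `shape`, `head()` and `tail()` against the raw file opened in a text editor is one simple way to check that the loaded data match the source.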

The set of 38 possible attributes for a student record and their corresponding value ranges are given below:

[Table: The set of 38 possible attributes for a student record and their corresponding value ranges (table not reproduced here)]

As a data scientist, you will be asked to analyse the data from the three data files. However, before doing that you know that you need to carry out some data preparation operations, e.g., merging and cleaning the data. In this task, you are asked to utilise the Python package “Pandas” to do the following:

1.1. Loading the data from the three data files into three Pandas DataFrames and checking whether the loaded data are equivalent to the data contained in the raw data files.

1.2. Merging the obtained three DataFrames into a single one that should contain all students from the three DataFrames, where each student has a unique ID and is described by the 38 attributes listed above.

1.3. Cleaning the data by using the knowledge you have learned.

You need to deal with the issues existing in the data, e.g., missing values, duplicates, impossible values and extra whitespace. However, you must NOT modify any parts of the data that do not suffer from issues. Failing to do so will lead to a mark reduction.
When dealing with missing values (if any), you can remove an entire row or column ONLY IF more than 50% of its elements are missing. Otherwise, you must find other appropriate cleaning methods to handle missing values.
You must be able to explain how you detect each data issue and why you choose a specific cleaning method to deal with it.
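The cleaning rules above can be sketched as follows. This is only one possible approach under stated assumptions: the toy data, the column names, the valid range for “Age”, and the choice of median/mode imputation are all hypothetical and would need to be justified against the real attribute ranges.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data showing each issue named above. The column names
# and the valid range used for "Age" are assumptions for illustration.
raw = pd.DataFrame({
    "ID": [1, 2, 2, 3, 4],
    "Course": [" Nursing", "Design ", "Design ", "Nursing", None],
    "Age": [19.0, 21.0, 21.0, -5.0, np.nan],
    "Mostly empty": [np.nan, np.nan, np.nan, 1.0, np.nan],
})


def clean(df, text_cols, valid_ranges):
    """Apply the cleaning steps in order and return a new DataFrame."""
    out = df.copy()

    # Extra whitespace: strip only the named text columns.
    for col in text_cols:
        out[col] = out[col].str.strip()

    # Duplicates: drop rows that are identical in every column.
    out = out.drop_duplicates().reset_index(drop=True)

    # Impossible values: mark anything outside the valid range as missing
    # so that it is handled together with the other missing values.
    for col, (low, high) in valid_ranges.items():
        out.loc[~out[col].between(low, high), col] = np.nan

    # Remove a column or row ONLY IF more than 50% of it is missing.
    out = out.loc[:, out.isna().mean() <= 0.5]
    out = out.loc[out.isna().mean(axis=1) <= 0.5].copy()

    # Otherwise impute: median for numeric columns, mode for the rest.
    for col in out.columns:
        if out[col].isna().any():
            if pd.api.types.is_numeric_dtype(out[col]):
                out[col] = out[col].fillna(out[col].median())
            else:
                out[col] = out[col].fillna(out[col].mode().iloc[0])

    return out.reset_index(drop=True)


cleaned = clean(raw, text_cols=["Course"], valid_ranges={"Age": (17, 80)})
print(cleaned)
```

Detection comes before correction in practice: `raw.isna().sum()`, `raw.duplicated().sum()` and `raw.describe()` each expose one class of issue, which makes it easier to explain how each problem was found.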

Task 2 – Data Exploration
Now you have finished Task 1 and obtained a DataFrame composed of the cleaned data. You can start to explore your data by carrying out the following steps:

2.1. Choosing two columns with categorical and numerical values, respectively, and visualising each of them in an appropriate way. Note that you need to explore and identify potentially important columns, and be able to justify your choice, rather than choosing at random.

2.2. Choosing three pairs of columns and exploring the relationship between the two columns involved in each pair via appropriate descriptive statistics and visualisation tools. Your choice of the column pairs should aim to address some “plausible hypotheses” about the data.

2.3. Building a scatter matrix for all numerical columns.
Note: Graphs must contain appropriate titles, axis labels, etc. so that they are self-explanatory. They should be clear enough for readers to read. You can research appropriate categorical text labels, as the data set does not include text descriptions of the numerical codes.
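The three exploration steps above can be sketched as follows, using a bar chart for a categorical column, a histogram for a numerical column, grouped statistics and a correlation for column pairs, and `pandas.plotting.scatter_matrix` for the scatter matrix. The sample values, the column names, and the mapping of the codes 0 and 1 to text labels are assumptions for illustration only.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit this line inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import scatter_matrix

# Hypothetical cleaned data; column names and values are illustrative.
df = pd.DataFrame({
    "Gender": [0, 1, 1, 0, 1, 0],
    "Age at enrollment": [19, 21, 20, 34, 18, 25],
    "Admission grade": [130.5, 142.0, 125.0, 118.0, 150.0, 133.0],
})
gender_labels = {0: "Female", 1: "Male"}  # assumed meaning of the codes
gender_text = df["Gender"].map(gender_labels)

# 2.1: one categorical column (bar chart) and one numerical (histogram).
fig, (ax_cat, ax_num) = plt.subplots(1, 2, figsize=(10, 4))
counts = gender_text.value_counts()
counts.plot.bar(ax=ax_cat)
ax_cat.set_title("Number of students by gender")
ax_cat.set_xlabel("Gender")
ax_cat.set_ylabel("Number of students")

df["Admission grade"].plot.hist(bins=5, ax=ax_num)
ax_num.set_title("Distribution of admission grade")
ax_num.set_xlabel("Admission grade")
ax_num.set_ylabel("Number of students")
fig.tight_layout()

# 2.2: descriptive statistics for two example pairs of columns.
# Categorical vs numerical: summary statistics per group.
group_stats = df.groupby(gender_text)["Admission grade"].describe()
# Numerical vs numerical: Pearson correlation coefficient.
corr = df["Age at enrollment"].corr(df["Admission grade"])
print(group_stats)
print(corr)

# 2.3: scatter matrix over the numerical columns.
numeric_cols = ["Age at enrollment", "Admission grade"]
axes = scatter_matrix(df[numeric_cols], figsize=(6, 6), diagonal="hist")
plt.suptitle("Scatter matrix of numerical columns")
```

Mapping the numeric codes to text labels before plotting keeps the axis ticks readable, which is what the note about categorical text labels asks for. A box plot per group or a scatter plot would be natural companions to the statistics computed for the pairs.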

Task 3 – Report

In this task, you are asked to write a report to elaborate on your analyses and findings in Tasks 1 and 2. You should:

3.1. Create a sub-heading titled “Task 1: Data Acquisition & Preparation” in your report under which you should:

Briefly describe how you addressed this task.
Describe how you merged the data from the three data files.
Describe each of the data issues you detected in data cleaning, explain how you detected it, and justify why you chose a specific data cleaning method to deal with it.
Discuss any problems you encountered when undertaking this task and how you solved them.

3.2. Create a sub-heading named “Task 2: Data Exploration” in your report under which you need to:

Create a sub-section with an appropriate title for each of the three sub-tasks in Task 2.
In the sub-section for sub-task 2.1, for each selected column, include the graph(s) created for that column, and provide a brief explanation of why you chose that column and a specific visualisation method to explore it.
In the sub-section for sub-task 2.2, briefly explain why you chose each of the three pairs of columns (e.g., stating the hypotheses that you intended to address), include the descriptive statistics and graph(s) for each of the three selected pairs, followed by a brief discussion of any interesting findings about the presence or lack of a relationship between the two columns involved.
In the sub-section for sub-task 2.3, include the plot of the scatter matrix, and report your findings from the plot.
