Visualization and Data Processing - IT Assignment Help

Assignment Task
 
 


Assessment 2
The following exercises are designed to assess your understanding of the concepts, implementation and interpretation of topics in Visualization and Data Processing. Some questions may require you to search for and use R functions that we have not used so far. For all of the following questions, submit both code and output.
Note: The questions in this assessment may have multiple correct solutions, so submission of R code is essential. Almost no statistical background is presumed for this assessment. All methods required for the solutions are available on the content pages of Weeks 2-5 of this subject. Some of them have been covered in detail during Collaborate sessions.
Answer sheet and script (code).
All outputs must be followed by a short statement, for full marks. You could use one of the following alternatives in responding to the questions.
Please use a new Word document to provide your responses sequentially, using Section and Question numbers. You could paste your R code, followed by output and discussion, in the Word document.
In a Word document, provide your output and discussion. Submit the annotated (commented) R code separately. Code without annotation won't receive full marks.
If you know R Markdown, you could use that to create an integrated report. However, R Markdown is NOT a requirement for this subject.
A. Visualization: Section Marks 10
Load the R dataset LifeCycleSavings, from the datasets package, into your R session.
Use the values of the variables pop15 and ddpi to create new categorical variables named Pop15Cat and ddpiCat, respectively, each with three categories: "High" (top 20%), "Medium" (middle 60%) and "Low" (bottom 20%). Show your code and the count of cases (countries) within each new categorical variable. Marks (4)
Use a visualization tool to display the relationship between sr and Pop15Cat, stratified by ddpiCat, on a single plot. You must use ggplot2 for visualization. Your plots must have proper labelling and legends wherever applicable. Marks (5)
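One possible way to build the 20/60/20 categories is to cut each variable at its 20th and 80th percentiles. The sketch below is only an illustration under that assumption; the helper name cut_20_60_20 and the working copy lcs are hypothetical, not part of the assignment.

```r
# Load the dataset from the datasets package into a working copy
data(LifeCycleSavings, package = "datasets")
lcs <- LifeCycleSavings

# Hypothetical helper: bottom 20% -> "Low", middle 60% -> "Medium", top 20% -> "High"
cut_20_60_20 <- function(x) {
  breaks <- quantile(x, probs = c(0, 0.2, 0.8, 1), na.rm = TRUE)
  cut(x, breaks = breaks, labels = c("Low", "Medium", "High"),
      include.lowest = TRUE, ordered_result = TRUE)
}

lcs$Pop15Cat <- cut_20_60_20(lcs$pop15)
lcs$ddpiCat  <- cut_20_60_20(lcs$ddpi)

# Count of countries in each category
table(lcs$Pop15Cat)
table(lcs$ddpiCat)
```

Other cut points (for example, ranking the countries and splitting by position) would also satisfy the wording, so the choice should be stated in the accompanying discussion.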
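As an illustration of a single stratified plot, the sketch below draws box plots of sr by Pop15Cat and facets them by ddpiCat. Box plots with facets are one assumed choice among several valid ones; the helper to_cat and the axis wording are hypothetical.

```r
library(ggplot2)

lcs <- datasets::LifeCycleSavings

# Hypothetical helper: 20/60/20 split at the 20th and 80th percentiles
to_cat <- function(x) {
  cut(x, breaks = quantile(x, c(0, 0.2, 0.8, 1)),
      labels = c("Low", "Medium", "High"), include.lowest = TRUE)
}
lcs$Pop15Cat <- to_cat(lcs$pop15)
lcs$ddpiCat  <- to_cat(lcs$ddpi)

# One plot: sr against Pop15Cat, one panel per ddpiCat level
p <- ggplot(lcs, aes(x = Pop15Cat, y = sr, fill = Pop15Cat)) +
  geom_boxplot() +
  facet_wrap(~ ddpiCat, labeller = label_both) +
  labs(title = "Savings ratio by share of population under 15",
       subtitle = "Panels: growth rate of disposable income (ddpiCat)",
       x = "Population under 15 (category)",
       y = "Aggregate personal savings (sr)",
       fill = "Pop15Cat") +
  theme_minimal()

print(p)
```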
B. Data Processing : Section Marks 15
Create an ordered categorical variable, "Pain", with the categories "Low" (coded as 0), "Moderate" (coded as 1) and "Severe" (coded as 2), such that "No" is ordered higher than "Yes". Marks (2)
Five observations in the Pain variable are missing. Additionally, it is given that the variables have been collected for Australian males in the age group 70-80. How would you deal with the missing values? Your answer should not be more than two sentences. Marks (4)
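A minimal sketch of an ordered factor with the Low/Moderate/Severe coding is shown below. The vector pain_codes holds made-up values purely for illustration, and the sketch covers only the 0/1/2 coding, not the "No"/"Yes" ordering mentioned in the question.

```r
# Illustrative codes only (not assignment data); NA marks a missing value
pain_codes <- c(0, 2, 1, NA, 0, 1, 2)

# Map the codes 0/1/2 to labels and declare the order Low < Moderate < Severe
Pain <- factor(pain_codes,
               levels = c(0, 1, 2),
               labels = c("Low", "Moderate", "Severe"),
               ordered = TRUE)

levels(Pain)
table(Pain, useNA = "ifany")

# Ordered factors support comparisons between levels
Pain[2] > Pain[1]
```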
Load the R dataset iris into your session and:
Show the dimensions and the variable names of this dataset using a single R function.
Drop the variable "Species" using a dplyr function, and store the new dataset in a new data frame called iris1.
Change the first four column names of the new data frame to sep_len, sep_wid, pet_len, and pet_wid.
Show your code for all the steps. Print the first six rows of the data frame iris1. Marks (4)
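The steps above could be sketched as follows. Using str() as the single function that reports both dimensions and variable names is one assumed choice; other functions may be equally acceptable.

```r
library(dplyr)

# str() reports the number of observations and variables plus each variable name
str(iris)

# Drop Species with dplyr::select() and store the result as iris1
iris1 <- select(iris, -Species)

# Rename the first four columns
names(iris1)[1:4] <- c("sep_len", "sep_wid", "pet_len", "pet_wid")

# First six rows
head(iris1)
```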
Proximity analysis on iris flowers.
Use an R function to compute the appropriate dissimilarity coefficient for the data frame iris1. Justify your choice.
Submit the code for the function with the dissimilarity coefficient that you chose.
How many distinct dissimilarities can be computed for the 52nd flower in your dataset, using the measure you proposed in B4.1?
Identify the flowers that are a) most similar to and b) least similar to flower 52. Show your code and output for the identification, along with a statement. Marks (5)
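The proximity steps above might be sketched as below. Euclidean distance is assumed here only because all four measurements are numeric and in the same unit; it is not presented as the required answer, and the first four columns of iris stand in for iris1 so that the block runs on its own.

```r
# Stand-in for iris1: the four numeric measurements with the new names
iris1 <- iris[, 1:4]
names(iris1) <- c("sep_len", "sep_wid", "pet_len", "pet_wid")

# Pairwise Euclidean dissimilarities, expanded to a full 150 x 150 matrix
d <- as.matrix(dist(iris1, method = "euclidean"))

# Dissimilarities between flower 52 and every other flower
d52 <- d[52, -52]
length(d52)

# Most similar = smallest dissimilarity; least similar = largest
most_similar  <- names(d52)[which.min(d52)]
least_similar <- names(d52)[which.max(d52)]
most_similar
least_similar
```

Note that which.min() and which.max() return only the first match, so ties, if any, would need a check such as which(d52 == min(d52)).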
C. Data Processing: Section Marks 20
Clinician scientists at Royal Melbourne Hospital are investigating fecal calprotectin (FC) as a non-invasive diagnostic alternative for Inflammatory Bowel Disease (IBD) and Acute Severe Ulcerative Colitis (ASUC). It would also contribute to standard of care. A subset of the dataset is provided as bowel.csv. The data has several missing values, and the causes of missingness are often unknown.
In the following questions, show your R working and output.
In the bowel.csv dataset,
Count the number of missing observations on the variable Hb and in the overall dataset. Marks (2)
Perform a univariate imputation on the variable Hb. Marks (3) Your solution should include:
code,
result, and
justification for the choice of imputation value you used.
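Counting missing values and performing a simple univariate imputation could look like the sketch below. Because bowel.csv is not reproduced here, a small made-up data frame stands in for it; the values, the column Hb_imp, and the choice of the median as the fill value are all assumptions for illustration.

```r
# Made-up stand-in for bowel.csv (in practice: bowel <- read.csv("bowel.csv"))
bowel <- data.frame(
  Hb = c(120, NA, 135, 98, NA, 110),
  FC = c(250, 900, NA, 1200, 400, NA)
)

# Missing values in Hb, in the whole dataset, and per column
sum(is.na(bowel$Hb))
sum(is.na(bowel))
colSums(is.na(bowel))

# Univariate imputation: fill missing Hb with the median of the observed Hb
hb_median <- median(bowel$Hb, na.rm = TRUE)
bowel$Hb_imp <- ifelse(is.na(bowel$Hb), hb_median, bowel$Hb)

bowel
```

The median is less sensitive to extreme values than the mean, which is one possible justification; whether it suits the real Hb distribution would need to be checked on the actual data.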
Select two variables that may have association with Hb.
Justify your choice using dplyr-based tool(s). Show your working in R. Marks (4)
Use your investigation in a.) to replace all missing values in the variable FC. Show your code. Marks (6)
Present a comparison, with discussion, between the two types of imputation in the context of the variable Hb. You may use tools from topics in Weeks 2 and 3 for comparison. Marks (5)
D. Text Analytics: Section Marks 15
Mystery2.RData is a saved R worksession. The response to the questions below must include comments, wherever applicable. This question tests both implementation and conceptual understanding of data analytic tools that you have learned in this subject.
1. What type of R object is Mystery2? How many components are there in Mystery2? Show your code and results in support. Marks (2)
2. Use what you have learned in this subject to clean the object Mystery2.
a. You must use at least 5 cleaning steps.
b. Show your code and the last six rows and first five columns (only) of the tabular data that you created. Marks (4)
3. Create a subset of the tabular data you constructed in Q.D2, retaining only those columns that occur at least 50 times in this dataset. Use a visualization tool to show the frequency distribution (count) of these columns. Hint: Select an appropriate visualization tool from what you learned in Week 3. Marks (4)
4. Using tools learned in this subject, quantify and depict any similarities between the rows of your cleaned data.
Is there any obvious structure in the similarity matrix as depicted by the plot? If so, what does this structure mean in terms of the original data? One or two sentences at the most.
Please use the original tabular data that you created in Q.D2. You have to use an appropriate (relevant) weighting metric described on your content page. Hint: You have to use (at least) one quantitative measure and (at least) one visualization tool to justify your answer. For visualization of the similarity matrix you may use R functions such as levelplot() or image(), or any other suitable plotting function. You would have to research the implementation of these functions. Show your code and results, and comment on your findings. Marks (5)
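The column filter in question 3 might be sketched as follows. The contents of Mystery2 are not known here, so a tiny made-up count table (rows as documents, columns as terms) stands in for the tabular data from Q.D2; the object names dtm, term_totals and dtm_sub are hypothetical, and a base R bar chart is only one possible visualization.

```r
# Made-up stand-in for the tabular data: rows are documents, columns are terms
dtm <- matrix(c(30,  5, 20, 1,
                25,  3, 20, 0,
                10,  2, 20, 4),
              nrow = 3, byrow = TRUE,
              dimnames = list(paste0("doc", 1:3),
                              c("data", "plot", "model", "colour")))

# Total count of each column across all rows
term_totals <- colSums(dtm)

# Keep only columns that occur at least 50 times overall
keep <- term_totals >= 50
dtm_sub <- dtm[, keep, drop = FALSE]

# Frequency distribution of the retained columns
barplot(colSums(dtm_sub),
        main = "Columns occurring at least 50 times",
        xlab = "Column (term)", ylab = "Count")
```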
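One way to quantify and depict row similarities is sketched below in base R. It assumes tf-idf as the weighting and cosine similarity as the quantitative measure, neither of which is confirmed by the brief, and it uses a made-up count table in place of the data from Q.D2.

```r
# Made-up stand-in for the tabular data: rows are documents, columns are terms
dtm <- matrix(c(2, 0, 1, 0,
                1, 1, 0, 0,
                0, 3, 0, 2,
                0, 1, 0, 1),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("doc", 1:4),
                              c("data", "plot", "model", "colour")))

# tf-idf weighting: term frequency within each row times inverse document frequency
tf    <- dtm / rowSums(dtm)
idf   <- log(nrow(dtm) / colSums(dtm > 0))
tfidf <- sweep(tf, 2, idf, `*`)

# Cosine similarity between every pair of rows
norms   <- sqrt(rowSums(tfidf^2))
cos_sim <- (tfidf %*% t(tfidf)) / (norms %o% norms)
round(cos_sim, 2)

# Depict the similarity matrix; block patterns suggest groups of similar rows
image(1:nrow(cos_sim), 1:ncol(cos_sim), cos_sim,
      xlab = "Row", ylab = "Row", main = "Cosine similarity (tf-idf weighted)")
```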


