DATA2x01 - Data Science, Big Data and Data Variety Assignment


Assignment Task

Introduction

Australia is formally divided into “Statistical Area Level 2” (SA2) regions: distinct geographical areas designed to represent communities of between 3,000 and 25,000 people “that interact together socially and economically”. In this assignment, we’ll focus on the 350+ SA2s within the Greater Sydney area, and you will be tasked with spatially integrating several datasets of various formats to calculate a score for how “well-resourced” each region is.

Task 1

Import all datasets (clean if required) into your PostgreSQL server, using a well-defined data schema. These sources include:

  • SA2 Regions: Statistical Area Level 2 (SA2) digital boundaries (feel free to filter this down to the “Greater Sydney” GCC).
  • Businesses: Number of businesses by industry and SA2 region, reported by turnover size.
  • Stops: Locations of all public transport stops (train and bus) in General Transit Feed Specification (GTFS) format.
  • Polls: Locations (and other premises details) of polling places for the 2019 Federal election.
  • Schools: Geographical regions in which students must live to attend primary, secondary and future Government schools.
  • Population: Estimates of the number of people living in each SA2 by age range (for “per capita” calculations).
  • Income: Total earnings statistics by SA2 (for later correlation analysis).
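As one illustration of the cleaning step for the Stops dataset, the sketch below parses rows in the GTFS stops.txt layout and turns each into a point geometry string that PostGIS can read. It is only a sketch under stated assumptions: the sample rows are made up, the function name `stops_to_rows` is hypothetical, and SRID 4326 (WGS84) is assumed for the coordinates.

```python
import csv
import io

# Hypothetical two-row sample in the GTFS stops.txt layout. The column names
# are standard GTFS fields; the values are invented for illustration.
SAMPLE_STOPS = """stop_id,stop_name,stop_lat,stop_lon
200060,Example Station,-33.8830,151.2060
200061,,not-a-number,151.2100
"""


def stops_to_rows(text, srid=4326):
    """Parse GTFS stops and return (stop_id, stop_name, ewkt) tuples.

    Rows with missing, non-numeric, or out-of-range coordinates are skipped
    rather than loaded as bad geometry. SRID 4326 is an assumption.
    """
    rows = []
    for rec in csv.DictReader(io.StringIO(text)):
        try:
            lat = float(rec["stop_lat"])
            lon = float(rec["stop_lon"])
        except (KeyError, TypeError, ValueError):
            continue  # cleaning step: drop rows without usable coordinates
        if not (-90 <= lat <= 90 and -180 <= lon <= 180):
            continue  # also rejects NaN, since NaN fails every comparison
        # EWKT puts longitude (x) before latitude (y).
        ewkt = f"SRID={srid};POINT({lon} {lat})"
        rows.append((rec["stop_id"], rec["stop_name"] or None, ewkt))
    return rows


rows = stops_to_rows(SAMPLE_STOPS)
```

Each tuple could then be passed to a parametrised INSERT that wraps the third value in PostGIS's `ST_GeomFromEWKT`; the table and column names would depend on the schema you define.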

Task 2

Compute a score for how “well-resourced” each individual neighbourhood is according to the following formula, where S is the sigmoid function, z is the normalised z-score, and ‘young people’ are defined as anyone aged 0-19:

$$\text{Score} = S(z_{retail} + z_{health} + z_{stops} + z_{polls} + z_{schools})$$

Feel free to only calculate scores for SA2 regions with a population of at least 100, and you are welcome to extend the scoring function however you deem necessary, so long as a rationale is provided (e.g. other mathematical standardisation techniques, mitigating the impact of outliers, calculating some metrics per-capita or per-sqkm, etc).
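A minimal sketch of the scoring step, assuming the score is the sigmoid of the summed z-scores across metrics. The metric names and values are hypothetical, and the population standard deviation is one of several reasonable choices for the z-score.

```python
import math
from statistics import mean, pstdev


def z_scores(values):
    """Standardise values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant metric carries no information; treat every region as average.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def sigmoid(x):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def score_regions(metrics):
    """metrics maps a metric name to one value per region (same region order).

    Returns one score per region: the sigmoid of the sum of its z-scores.
    """
    standardised = [z_scores(vals) for vals in metrics.values()]
    totals = [sum(zs) for zs in zip(*standardised)]
    return [sigmoid(t) for t in totals]


# Hypothetical values for three regions.
metrics = {
    "retail_per_1000": [2.0, 4.0, 6.0],
    "stops": [10, 20, 30],
}
scores = score_regions(metrics)
```

A region that is exactly average on every metric lands at 0.5, which gives the score a natural midpoint when reading a results table.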

Task 3

Extend the score by sourcing one additional dataset for each group member, and then incorporating all new datasets into your scoring function. For full marks, at least one dataset should be of spatial data, and at least one should be of a type not used so far in this assignment (e.g. JSON, XML, or collated via web scraping). As an example of subject matter for your additional datasets, they could focus on positive aspects for a region such as public facilities or other census statistics, or negative impacts such as crime rates or car accidents.

For either version of your scoring function (or both!), the following subtasks should also be achieved:

  • Visualise your score in an engaging way, and summarise key results in a table (ideally including a useful map-overlay visualisation, or an interactive graph).
  • Include in-depth analysis of your results. Note interesting findings, discuss their limitations, and summarise key conclusions.
  • Determine if there is any correlation between your score and the median income of each region (note the provided income data will not match all our SA2 regions, given the data corresponds to the old 2016 SA2 boundaries, not 2020, but work with those that do match).
  • Ensure at least one useful index (ideally spatial) has been used.
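For the correlation subtask, one possible approach is to keep only the SA2 codes present in both the score and income data, then compute a Pearson coefficient over the matched pairs. The sketch below assumes both inputs are dictionaries keyed by SA2 code; the codes and figures shown are invented.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(vx * vy)


def correlate_on_matching_codes(scores, incomes):
    """Correlate only the regions present in both inputs.

    Returns (r, number_of_matched_regions) so the size of the overlap
    can be reported alongside the coefficient.
    """
    shared = sorted(set(scores) & set(incomes))
    r = pearson([scores[c] for c in shared], [incomes[c] for c in shared])
    return r, len(shared)


# Hypothetical SA2 codes: "D" has no income record, "Z" has no score.
scores = {"A": 0.2, "B": 0.5, "C": 0.8, "D": 0.9}
incomes = {"A": 40000, "B": 50000, "C": 60000, "Z": 99999}
r, matched = correlate_on_matching_codes(scores, incomes)
```

Reporting the matched count next to the coefficient makes it clear how many regions were dropped because of the boundary mismatch.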

Task 4: Advanced Class Only

There are two additional components for DATA2901 students.

1. Create a new version of your score using ranks (r) rather than z-scores (z). As a theoretical example, rather than considering a particular SA2 to have 42 public transport stops, you would use the fact that this would rank it 14th among the regions. This will require a new standardisation technique other than the simple sigmoid z-score summation used before, so additionally consider how to convert these values into a comparable, interpretable score. Compare this new score to your previous one from Task 2 - discuss their differences, and conclude which (if any) is more reliable.

$$\text{Score}_{adv} = f(r_{retail}, r_{health}, r_{stops}, r_{polls}, r_{schools})$$
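The brief leaves the choice of f open. One candidate, sketched here purely as an assumption, is to rescale each rank to the range 0 to 1 (best rank maps to 1, worst to 0) and average across metrics. Ties share the average of the positions they occupy, and the metric names and values are hypothetical.

```python
def ranks_desc(values):
    """Rank 1 = largest value; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the window over every value tied with the one at pos.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks


def rank_score(metrics):
    """Average of rescaled ranks per region; 1.0 is best on every metric."""
    n = len(next(iter(metrics.values())))
    if n < 2:
        raise ValueError("need at least two regions to rank")
    per_metric = [
        [(n - r) / (n - 1) for r in ranks_desc(vals)]
        for vals in metrics.values()
    ]
    return [sum(col) / len(col) for col in zip(*per_metric)]


# Hypothetical values for four regions.
metrics = {
    "stops": [42, 10, 42, 5],
    "polls": [1, 3, 2, 0],
}
adv_scores = rank_score(metrics)
```

Because ranks discard magnitudes, a single extreme region cannot dominate this score the way it can shift a z-score, which is one point worth weighing in the comparison with Task 2.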

2. Use a supervised or unsupervised machine learning technique to add further depth to your results. This task is intentionally broad to allow creative applications, but some examples could include:

  • A regression model to evaluate which features are statistically significant in predicting the median income of a region.
  • A decision tree classifier to predict the broader SA3 region of a particular SA2 area, given some of its features.
  • An unsupervised clustering algorithm to find similarities between SA2s that might otherwise not be considered.
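To illustrate the clustering option, here is a small pure-Python k-means sketch. It assumes each SA2 is described by a tuple of already-standardised numeric features, and it seeds the centroids with the first k points so the result is deterministic; a library implementation with better seeding would normally be preferable. The sample points are invented.

```python
import math


def kmeans(points, k, iterations=50):
    """Plain k-means. The first k points are used as the initial centroids."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids would not move
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Hypothetical two-feature description of five regions.
points = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (8.8, 9.1), (0.9, 1.1)]
labels, centroids = kmeans(points, k=2)
```

Regions that share a label can then be compared against their SA3 or geographic grouping to see whether the clusters reveal similarities that location alone would not suggest.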
