CIS9440 - Data Requirements, Storage, and Modeling for Homework Assignment


Assignment Task

Part 1 - Data Requirements

You are required to choose your own data for this homework.

  • Your data should not come from Kaggle or from any main data source used in the term project.
  • Your data should not be the same as that of any team member in your term project.
  • Your data must contain at least 10 columns and more than 7500 rows.
  • Your data should not contain aggregate data. If there is no alternative, you must consult the instructor first.
  • Your data should not be stock market data or bitcoin market data.
  • You will use the same data for all subsequent homework, so take your time to choose the right data. You are free to change data at a later stage, but you will then have to redo the whole of homework 1, and it will not be graded.
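
The size requirement above can be checked before committing to a dataset. The following is a minimal sketch, assuming the data is available as a CSV file with a header row; the function name `check_dataset_size` and the constants are illustrative, not part of the assignment.

```python
import csv
import io

MIN_COLUMNS = 10   # "at least 10 columns"
MIN_ROWS = 7500    # "more than 7500 rows", so the row count must exceed this


def check_dataset_size(csv_file):
    """Return (n_columns, n_rows, ok) for an open CSV file with a header row."""
    reader = csv.reader(csv_file)
    header = next(reader, [])
    n_rows = sum(1 for row in reader if row)  # blank lines are not counted
    ok = len(header) >= MIN_COLUMNS and n_rows > MIN_ROWS
    return len(header), n_rows, ok


# Small in-memory demonstration; a real file would be opened with
# open(path, newline="") instead.
demo = io.StringIO("a,b\n1,2\n")
print(check_dataset_size(demo))
```

The check only counts rows and columns; whether the data is aggregated or comes from an excluded source still has to be judged by hand.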

Data Sourcing

Your first step with the project is to get familiar with the data. You need to understand how it is structured and, most importantly, find the data dictionary associated with it. If none exists, you will have to build one. The data dictionary should contain the name of each field, its description, its datatype, and any constraints associated with it.
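
When no data dictionary is published, a first draft can be generated from the data itself and then completed by hand. The sketch below assumes a CSV source; the type names ("integer", "decimal", "text") and the NOT NULL / NULLABLE labels are simplifications chosen for illustration, and the descriptions are left empty because they cannot be inferred.

```python
import csv
import io


def infer_type(values):
    """Guess a simple datatype from the non-empty string values of one column."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "unknown"
    for caster, name in ((int, "integer"), (float, "decimal")):
        try:
            for v in non_empty:
                caster(v)
            return name
        except ValueError:
            continue
    return "text"


def draft_data_dictionary(csv_file):
    """Return one entry per column: field, description, datatype, constraints."""
    reader = csv.DictReader(csv_file)
    rows = list(reader)
    entries = []
    for field in reader.fieldnames or []:
        values = [(row.get(field) or "") for row in rows]
        entries.append({
            "field": field,
            "description": "",  # to be written by hand
            "datatype": infer_type(values),
            "constraints": "NOT NULL" if all(v != "" for v in values) else "NULLABLE",
        })
    return entries
```

The resulting list of dictionaries can be written out with `csv.DictWriter` and imported into Excel or Google Sheets for the deliverable.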

You will need to source the data using one or more of the following methods:

  • Web Scraping
  • Web API
  • Connection to Database
  • Connection to Data Store (Cloud Storage)
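
As one illustration of the Web API method, a gathering script might page through a JSON endpoint. This is only a sketch: the URL `https://example.org/api/records` is a placeholder, and the `offset` / `limit` query parameters and the assumption that each response is a JSON array are hypothetical, so they must be adapted to the API you actually use.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://example.org/api/records"  # hypothetical endpoint


def fetch_page(offset, limit):
    """Fetch one page of JSON records from the (hypothetical) endpoint."""
    query = urllib.parse.urlencode({"offset": offset, "limit": limit})
    with urllib.request.urlopen(f"{BASE_URL}?{query}", timeout=30) as resp:
        return json.load(resp)


def fetch_all(get_page=fetch_page, limit=1000):
    """Collect records page by page until a short or empty page is returned."""
    records, offset = [], 0
    while True:
        page = get_page(offset, limit)
        records.extend(page)
        if len(page) < limit:
            return records
        offset += limit
```

Passing the page-fetching function as a parameter keeps the paging logic testable without network access.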

Deliverables

  1. Links to all data sources

  2. Explanation of the data (where it comes from)

  3. Link that shows the data dictionary (Excel, Google Sheets)

  4. GitHub/Azure DevOps/Jira account created

  5. Scripts that gather the data

  6. Git repository created

  7. Your scripts should be stored in a Git repository that is accessible to all members of your team and the professor.

Storage

Your next step is to choose the appropriate data store for your data. Remember that in the previous step you had to source the data using a script or a specific tool. The data stores of choice are the following: a database, S3 storage, or MongoDB. Make sure the data are properly stored and not scattered. If need be, you should also record the date on which the data was stored. It is recommended that you watch the async videos.
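
Recording the storage date can be done by stamping every row at load time. The following sketch uses SQLite from the Python standard library purely as a stand-in for the "database" option; the table name `raw_records` and its columns are assumptions made for illustration.

```python
import sqlite3
from datetime import datetime, timezone


def store_raw_rows(conn, rows):
    """Append raw (source_id, payload) rows, stamping each with the load time."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS raw_records (
               source_id TEXT NOT NULL,
               payload   TEXT NOT NULL,
               loaded_at TEXT NOT NULL
           )"""
    )
    # One UTC timestamp per load, so all rows from the same run share it.
    loaded_at = datetime.now(timezone.utc).isoformat(timespec="seconds")
    conn.executemany(
        "INSERT INTO raw_records (source_id, payload, loaded_at) VALUES (?, ?, ?)",
        [(source_id, payload, loaded_at) for source_id, payload in rows],
    )
    conn.commit()
    return loaded_at
```

The same idea carries over to the other options: for object storage, the load date is often placed in the object key or folder name instead of a column.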

Deliverables 

  • Storage of choice
  • Data Stored in an orderly fashion in the storage.
  • Scripts updated from the first deliverables. You will need to update those scripts to store the data in the chosen storage.
  • Git repository updated.

Modeling

Once you have completed the storage step, you will need to start modeling the data warehouse. Remember that the data warehouse has two main components: a fact table and a dimension table. The fact table must have a surrogate key, as must each dimension table. Modeling can be done using any tool.
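
A fact table and a dimension table with surrogate keys can be sketched as follows. SQLite is used here only so the example runs anywhere; the tables `dim_location` and `fact_event` and all of their columns are hypothetical and should be replaced by the entities in your own data.

```python
import sqlite3

STAR_SCHEMA = """
CREATE TABLE IF NOT EXISTS dim_location (
    location_key INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
    location_id  TEXT NOT NULL UNIQUE,               -- natural key from the source
    city         TEXT,
    state        TEXT
);
CREATE TABLE IF NOT EXISTS fact_event (
    event_key    INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
    location_key INTEGER NOT NULL REFERENCES dim_location (location_key),
    event_date   TEXT NOT NULL,
    amount       REAL
);
"""


def create_warehouse(conn):
    """Create the dimension and fact tables if they do not exist yet."""
    conn.executescript(STAR_SCHEMA)


def location_key_for(conn, location_id, city=None, state=None):
    """Return the surrogate key for a natural key, inserting the row if new."""
    conn.execute(
        "INSERT OR IGNORE INTO dim_location (location_id, city, state) VALUES (?, ?, ?)",
        (location_id, city, state),
    )
    row = conn.execute(
        "SELECT location_key FROM dim_location WHERE location_id = ?", (location_id,)
    ).fetchone()
    return row[0]
```

The fact table refers to the dimension only through `location_key`, so the natural key from the source can change without rewriting fact rows.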

Deliverables 

  • Data Model documented showing the fact table and the dimension table.
  • Scripts that create the Data Warehouse
  • Scripts from previous steps updated.
  • Data Warehouse accessible to everyone in the team and can be accessed through a client (DataGrip, DbSchema, SqlDBM)
  • Git Repository Updated

Part 2 - Homework Steps

You are required to use the data you chose in homework 1. If you want to change your data, you are free to do so. However, you will have to redo the whole of homework 1, and it will not be graded. You are free to use any cloud provider. You are required to check the feedback from the professor.

Transformation

Once you have stored the data, the next step is to transform it. Data should be transformed according to specific business rules. While transforming the data, you should consider the following.

  • Unified date format YYYY-MM-DD
  • Splitting the date into multiple units (Year, Quarter, Month, Day, Hour, etc.)
  • Removing NULL values if necessary
  • Removing duplicate rows if necessary
  • Verifying data against reference data (currency, state, zipcode, county, NAICS, GICS, etc.)
  • Using the correct data type for each new fact generated
  • Adding one or many columns
  • Summing two or more columns
  • Creating a Data Mapping that will be incorporated into your data dictionary tools.
    • It should contain the name of the fields, their data type, their description, the source column and the destination column.

This list is only a subset of what you can do; other transformations are possible. Remember also to update your data dictionary.
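
Several of the transformations listed above (unified YYYY-MM-DD dates, splitting a date into units, removing NULLs and duplicate rows) can be sketched in plain Python. The list of accepted input formats is an assumption, and note that a value such as 03/04/2024 is read here as month/day/year; adjust both to match your source data.

```python
from datetime import datetime

# Input formats this sketch tries, in order; extend the tuple for your own data.
KNOWN_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%Y-%m-%d %H:%M:%S")


def parse_date(text):
    """Parse a date string in any of the known formats."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")


def split_date(text):
    """Return the unified YYYY-MM-DD date together with its separate units."""
    dt = parse_date(text)
    return {
        "date": dt.strftime("%Y-%m-%d"),
        "year": dt.year,
        "quarter": (dt.month - 1) // 3 + 1,
        "month": dt.month,
        "day": dt.day,
        "hour": dt.hour,
    }


def clean_rows(rows, required):
    """Drop rows with missing required fields, then drop exact duplicates."""
    seen, cleaned = set(), []
    for row in rows:
        if any(row.get(col) in (None, "") for col in required):
            continue
        key = tuple(sorted(row.items()))  # assumes hashable values (str, int, ...)
        if key not in seen:
            seen.add(key)
            cleaned.append(row)
    return cleaned
```

Which columns count as required, and whether a NULL should be dropped or replaced by a default, is a business rule that belongs in the data mapping.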

You have the following options:

  1. Use an ETL tool to do the transformation.
  2. Create scripts that do this transformation.
  3. Git repository updated.

Deliverables 

  • Scripts from previous steps updated.
  • Data Mapping Created. Data dictionary updated.
  • Data transformation project created on the cloud if you are using option 1
  • ETL fully created to push the data to the Data Warehouse
  • Git Repository Updated

Modeling

Once you have completed the transformation, you will need to update the model of the data warehouse. Remember that the data warehouse has two main components: a fact table and a dimension table. The fact table must have a surrogate key, as must each dimension table. Modeling can be done using any tool. Your data warehouse should be in Redshift.
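
Because the transformation step splits dates into units, a date dimension is a natural addition to the updated model. The sketch below only generates the rows in Python; loading them into Redshift is left to your ETL. The column names are illustrative, and using an integer of the form YYYYMMDD as the key is one common convention for date dimensions (a plain sequence number would work as well).

```python
from datetime import date, timedelta


def build_date_dimension(start, end):
    """Return one row per calendar day from start to end, inclusive."""
    rows = []
    current = start
    while current <= end:
        rows.append({
            "date_key": int(current.strftime("%Y%m%d")),  # integer key derived from the date
            "full_date": current.isoformat(),
            "year": current.year,
            "quarter": (current.month - 1) // 3 + 1,
            "month": current.month,
            "day": current.day,
        })
        current += timedelta(days=1)
    return rows


# Example: the last days of February 2024 (a leap year) through 1 March.
for row in build_date_dimension(date(2024, 2, 28), date(2024, 3, 1)):
    print(row["date_key"], row["full_date"], row["quarter"])
```

Fact rows can then carry `date_key` instead of a raw date, which lets the date filter in a dashboard act on every chart through the same dimension.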

Deliverables

  • Data Model documented showing the fact table and the dimension table.
  • Scripts that create the Data Warehouse
  • Scripts from previous steps updated.
  • Data Warehouse Created in AWS with Redshift
  • Data Inserted into the Data Warehouse
  • Data Warehouse accessible to everyone in the team and can be accessed through a client (DBeaver, DataGrip)
  • Git Repository Updated

Serving Data

You will be using an online visualization tool to show the data that you have transformed. You should apply all the visualization practices you have seen in all sessions. The following must be part of the visual:

  • A Filtering tool by date or by dimension: When you filter by date, all charts should change based on the filter
  • A Pie Chart
  • A Column Chart
  • A Line Chart
  • A Heat Map

As part of serving the data, you may also create an API that generates a CSV file containing a summary of the data. This part is optional.
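
The core of such an API is a function that turns warehouse rows into CSV text. The following is a minimal sketch, assuming the rows have already been queried into a list of dictionaries; the function name `summary_csv` and the choice of a per-group count and total are illustrative, and exposing the function over HTTP is left to whatever web framework you choose.

```python
import csv
import io
from collections import defaultdict


def summary_csv(rows, group_by, value):
    """Return CSV text with the row count and total of `value` per group."""
    totals = defaultdict(lambda: [0, 0.0])  # group -> [row count, running sum]
    for row in rows:
        entry = totals[row[group_by]]
        entry[0] += 1
        entry[1] += float(row[value])
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([group_by, "row_count", f"total_{value}"])
    for group in sorted(totals):
        count, total = totals[group]
        writer.writerow([group, count, round(total, 2)])
    return buffer.getvalue()
```

Returning the CSV as a string keeps the function independent of the transport: the same text can be written to a file, uploaded to storage, or sent as an HTTP response body.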

Deliverables

  • Git Repository Updated
  • A link using Amazon QuickSight that connects to the Data Warehouse and shows the data.
  • An API, written as a Python script, that generates data in CSV format for use by data analysts.
  • A Power BI or a Tableau link
