CSE5BDC-Big Data Tools & Big Data Processing - IT Computer Science Assignment Help

Assignment Task

Objectives
1. Gain in-depth, hands-on experience with big data tools (Hive, Spark RDDs, and Spark SQL).
2. Solve challenging big data processing tasks by finding highly efficient solutions.
3. Experience processing three different types of real data:
a. Standard multi-attribute data (Bank data)
b. Time series data (Twitter feed data)
c. Bag of words data.
4. Practice using programming APIs to find the best API calls to solve your problem. Consult the API documentation for Hive and Spark (for Spark, look especially under RDD; there are a lot of very useful API calls).


Task 1: Analysing Bank Data 
We will be doing some analytics on real data from a Portuguese banking institution. The data is stored in a semicolon (“;”) delimited format.

The data is supplied with the assignment at the following locations:



Here is a small example of the bank data that we will use to illustrate the subtasks below (we only list a subset of the attributes in this example, see the above table for the description of the attributes):



c) [Spark RDD] Group balance into the following three categories:
a. Low: -infinity to 500
b. Medium: 501 to 1500
c. High: 1501 to +infinity


Report the number of people in each of the above categories. Write the results to “Task_1c-out” in text file format. For the small example data set you should get the following results (output order is not important in this question):

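As a rough guide, here is a minimal Spark (Scala) sketch of one way to approach this subtask. The input path "bank.csv" and the balance column index are assumptions; adjust them to match the supplied data and its attribute table.

import org.apache.spark.{SparkConf, SparkContext}

object Task1c {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("Task1c"))

    val lines  = sc.textFile("bank.csv")           // assumed input path
    val header = lines.first()

    val counts = lines
      .filter(_ != header)                         // drop the header row
      .map(_.split(";")(5).trim.toInt)             // assumed balance column
      .map { b =>
        if (b <= 500) ("low", 1)
        else if (b <= 1500) ("medium", 1)
        else ("high", 1)
      }
      .reduceByKey(_ + _)                          // people per category

    counts.saveAsTextFile("Task_1c-out")
    sc.stop()
  }
}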


Task 2: Analysing Twitter Time Series Data 
In this task we will be doing some analytics on real Twitter data. The data is stored in a tab (“\t”) delimited format.

The data is supplied with the assignment at the following locations:


 

a) [Spark RDD] Find the single row that has the highest count and for that row report the month, count and hashtag name. Print the result to the terminal output using println. So, for the above small example data set the result would be:

month: 200907, count: 1000, hashtagName: abc
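As a sketch only: assuming the tab-delimited columns are (tokenType, month, count, hashtagName) and a hypothetical input path "twitter.tsv" (both are assumptions; check the supplied data), the subtask reduces to a single max by count:

val top = sc.textFile("twitter.tsv")
  .map(_.split("\t"))
  .map(f => (f(1), f(2).toInt, f(3)))             // (month, count, hashtagName)
  .max()(Ordering.by(_._2))                       // row with the highest count

println(s"month: ${top._1}, count: ${top._2}, hashtagName: ${top._3}")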

 

b) [Do twice, once using Hive and once using Spark RDD] Find the hashtag name that was tweeted the most in the entire data set across all months. Report the total number of tweets for that hashtag name. You can either print the result to the terminal or output the result to a text file. So, for the above small example data set the output would be:
abc 1023
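A minimal sketch of the Spark RDD version, under the same assumed column layout and path as in subtask a); the Hive version would group by hashtag name, sum the counts, and keep the largest total.

val (name, total) = sc.textFile("twitter.tsv")
  .map(_.split("\t"))
  .map(f => (f(3), f(2).toInt))                   // (hashtagName, count)
  .reduceByKey(_ + _)                             // total tweets per hashtag
  .max()(Ordering.by(_._2))

println(s"$name $total")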

c) [Spark RDD] Given two months x and y, where y > x, find the hashtag name whose number of tweets increased the most from month x to month y. Ignore the tweets in the months between x and y; just compare the number of tweets at month x and at month y. Report the hashtag name and the number of tweets in months x and y. Ignore any hashtag names that had no tweets in either month x or month y. You can assume that the combination of hashtag and month is unique, so the same hashtag and month combination cannot occur more than once. Print the result to the terminal output using println.

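A minimal sketch, assuming x and y arrive as month strings such as "200907" and using the same assumed column layout as above; the join keeps only hashtags that have tweets in both months.

val x = "200907"                                  // example values only; the
val y = "200912"                                  // real months are inputs

val byTag = sc.textFile("twitter.tsv")
  .map(_.split("\t"))
  .map(f => (f(3), (f(1), f(2).toInt)))           // (hashtag, (month, count))

val atX = byTag.filter(_._2._1 == x).mapValues(_._2)
val atY = byTag.filter(_._2._1 == y).mapValues(_._2)

// the join keeps only hashtags that have tweets in BOTH months
val (tag, (cx, cy)) = atX.join(atY)
  .max()(Ordering.by { case (_, (a, b)) => b - a })

println(s"hashtag: $tag, month $x count: $cx, month $y count: $cy")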


Task 3: Indexing Bag of Words data 
In this task you are asked to create a partitioned index of words to the documents that contain them. Using this index, you can efficiently search for all the documents that contain a particular word.

The data is supplied with the assignment at the following locations:

Complete the following subtasks using Spark:



c) [spark SQL] Load the previously created dataframe stored in parquet format from subtask b). For each document ID in the docIds list (which is provided as a function argument for you), use println to display the following: the document ID, the word with the most occurrences in that document (you can break ties arbitrarily), and the number of occurrences of that word in the document. Skip any document IDs that aren’t found in the dataset. Use an optimisation to prevent loading the parquet file into memory multiple times. If docIds contains “2” and “3”, then the output for the example dataset would be:

[2, motorbike, 702]
[3, boat, 2000]

 

For this subtask specify the document ids as arguments to the script. For example:

$ bash build_and_run.sh 2 3
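A minimal sketch, assuming the dataframe from subtask b) has columns docId, word and count (the names and the parquet path are assumptions). Caching the dataframe is one way to meet the optimisation requirement, since it stops Spark re-reading the parquet file for every document ID.

import org.apache.spark.sql.functions._

// docIds is provided as a function argument per the task description
val df = spark.read.parquet("../docwords.parquet").cache()  // assumed path

for (id <- docIds) {
  df.filter(col("docId") === id)
    .orderBy(col("count").desc)                   // most occurrences first
    .select("docId", "word", "count")
    .take(1)                                      // empty if the id is absent
    .foreach(r => println(r.mkString("[", ", ", "]")))
}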

d) [spark SQL] Load the previously created dataframe stored in parquet format from subtask b). For each word in the queryWords list (which is provided as a function argument for you), use println to display the docId with the most occurrences of that word (you can break ties arbitrarily). Use an optimisation based on how the data is partitioned.
 

If queryWords contains “car” and “truck”, then the output for the example dataset would be:

[car,2]
[truck,3]

 

For this subtask you can specify the query words as arguments to the script. For example:

$ bash build_and_run.sh computer environment power
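A minimal sketch, assuming the parquet data was written partitioned by word in subtask b) (e.g. via partitionBy("word") on the writer), so that filtering on word lets Spark prune down to a single partition instead of scanning the whole dataset.

import org.apache.spark.sql.functions._

// queryWords is provided as a function argument per the task description
val df = spark.read.parquet("../docwords.parquet")          // assumed path

for (w <- queryWords) {
  df.filter(col("word") === w)                    // partition pruning on word
    .orderBy(col("count").desc)
    .take(1)
    .foreach(r => println(s"[$w,${r.getAs[Any]("docId")}]"))
}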


 

Task 4: Creating co-occurring words from Bag of Words data
Using the same data as for task 3, perform the following subtasks:
a) [spark SQL] Remove all rows that reference infrequently occurring words from docwords. Store the resulting dataframe in Parquet format at “../frequent_docwords.parquet” and in CSV format at “Task_4a-out”. An infrequently occurring word is any word that appears fewer than 1000 times in the entire corpus of documents. For the small example input file the expected output is:

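A minimal sketch, again assuming docId/word/count column names and an assumed input path. The per-word totals are computed first, and a join then keeps only the rows whose word reaches the 1000-occurrence threshold.

import org.apache.spark.sql.functions._

val docwords = spark.read.parquet("../docwords.parquet")    // assumed input

val frequent = docwords
  .groupBy("word")
  .agg(sum("count").as("total"))                  // occurrences across corpus
  .filter(col("total") >= 1000)
  .select("word")

val result = docwords.join(frequent, Seq("word")) // keep frequent words only
  .select("docId", "word", "count")

result.write.parquet("../frequent_docwords.parquet")
result.write.csv("Task_4a-out")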


b) [spark SQL] Load up the Parquet file from “../frequent_docwords.parquet” which you created in the previous subtask. Find all pairs of frequent words that occur in the same document and report the number of documents the pair occurs in. Report the pairs in decreasing order of frequency. The solution may take a few minutes to run.
Note there should not be any replicated entries like:

  • (truck, boat) (truck, boat)

Note you should not have the same pair occurring twice in opposite order. Only one of the following should occur:

  • (truck, boat) (boat, truck)

 

Save the results in CSV format at “Task_4b-out”. For the example above, the output should be as follows (it is OK if the text file output format differs from that below but the data contents should be the same):
boat, motorbike, 2
motorbike, plane, 2
boat, plane, 1

 

For example, the following format and content for the text file will also be acceptable (note the order is slightly different; that is also OK since we break frequency ties arbitrarily):
(2,(plane, motorbike))
(2,(motorbike, boat))
(1,(plane, boat))
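A minimal sketch of one way to form the pairs: a self-join on docId pairs up words that share a document, and the word1 < word2 condition removes both self-pairs and reversed duplicates. It assumes one row per (docId, word) combination in the frequent dataframe.

import org.apache.spark.sql.functions._

val dw = spark.read.parquet("../frequent_docwords.parquet")
  .select("docId", "word")

val pairs = dw.as("a")
  .join(dw.as("b"), col("a.docId") === col("b.docId") &&
                    col("a.word") < col("b.word")) // no self/reversed pairs
  .select(col("a.word").as("word1"), col("b.word").as("word2"),
          col("a.docId").as("docId"))
  .groupBy("word1", "word2")
  .agg(count("docId").as("numDocs"))              // documents per pair
  .orderBy(desc("numDocs"))

pairs.write.csv("Task_4b-out")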

 

Bonus Marks:
1. Using Spark, perform the following task on the data set of task 2.
[Spark RDD or Spark SQL] Find the hashtag name with the largest increase in the number of tweets between any two consecutive months, for any hashtag name. Consecutive months means, for example, 200801 to 200802, or 200902 to 200903, etc. Report the hashtag name, the first month's count, and the second month's count using println.

 

For the small example data set of task 2 the output would be:

Hash tag name: mycoolwife
count of month 200812: 100
count of month 200901: 201
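A minimal RDD sketch under the same assumed Twitter layout as in task 2; nextMonth is a hypothetical helper that rolls a yyyyMM string forward by one month, and re-keying each row by the following month lines each count up with the same hashtag's count one month later.

// hypothetical helper: roll a yyyyMM month string forward by one month
def nextMonth(m: String): String = {
  val (y, mo) = (m.take(4).toInt, m.drop(4).toInt)
  if (mo == 12) f"${y + 1}%04d01" else f"$y%04d${mo + 1}%02d"
}

val byTagMonth = sc.textFile("twitter.tsv")
  .map(_.split("\t"))
  .map(f => ((f(3), f(1)), f(2).toInt))           // ((hashtag, month), count)

// re-key each row by the FOLLOWING month; joining with the real counts for
// that month matches up consecutive-month pairs of the same hashtag
val ((tag, m2), ((m1, c1), c2)) = byTagMonth
  .map { case ((t, m), c) => ((t, nextMonth(m)), (m, c)) }
  .join(byTagMonth)
  .max()(Ordering.by { case (_, ((_, a), b)) => b - a })

println(s"Hash tag name: $tag")
println(s"count of month $m1: $c1")
println(s"count of month $m2: $c2")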

 
