Highlights
Objectives
1. Gain in-depth experience playing around with big data tools (Hive, Spark RDDs, and Spark SQL).
2. Solve challenging big data processing tasks by finding highly efficient solutions.
3. Experience processing three different types of real data:
a. Standard multi-attribute data (Bank data)
b. Time series data (Twitter feed data)
c. Bag of words data.
4. Practice using programming APIs to find the best API calls to solve your problem. See the API descriptions for Hive and Spark (for Spark, look especially under RDD; there are a lot of really useful API calls).
ChatGPT and similar AI tools
A key purpose of this assessment task is to test your own ability to complete the assigned tasks. Therefore, the use of ChatGPT, AI tools or chatbots with similar functionality is prohibited for this assessment task. Students who are found to be in breach of this rule will be subject to normal academic misconduct measures. Additionally, students may be engaged to provide an oral validation of their understanding of their submitted work (e.g. coding).
Expected quality of solutions
a) In general, writing more efficient code (fewer reads from and writes to HDFS, and fewer data shuffles) will be rewarded with more marks.
b) This entire assignment can be done using the docker containers supplied in the labs and the supplied data sets without running out of memory. It is time to show your skills!
c) I am not too fussed about the layout of the output. As long as it looks similar to the example outputs for each task, that will be good enough. The idea is not to spend too much time massaging the output into the right format but instead to spend the time solving problems.
d) For Hive queries, we prefer answers that use fewer tables.
The questions in the assignment will be labelled using the following:
Task
1. Analysing Bank Data
We will be doing some analytics on real data from a Portuguese banking institution. The data is stored in a semicolon (“;”) delimited format.
2. Analysing Twitter Time Series Data
In this task we will be doing some analytics on real Twitter data. The data is stored in a tab (“\t”) delimited format.
a) [Spark RDD] Find the single row that has the highest count and for that row report the month, count and hashtag name. Print the result to the terminal output using println. So, for the above small example data set the result would be:
b) [Do twice, once using Hive and once using Spark RDD] Find the hashtag name that was tweeted the most in the entire data set across all months. Report the total number of tweets for that hashtag name. You can either print the result to the terminal or output the result to a text file. So, for the above small example data set the output would be:
c) [Spark RDD] Given two months x and y, where y > x, find the hashtag name whose number of tweets increased the most from month x to month y. Ignore the tweets in the months between x and y, so just compare the number of tweets at month x and at month y. Report the hashtag name and the number of tweets in months x and y. Ignore any hashtag names that had no tweets in either month x or y. You can assume that the combination of hashtag and month is unique, that is, the same hashtag and month combination cannot occur more than once.
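The logic of parts a) to c) can be sketched in plain Python, independent of Hive or Spark. This is only an illustration under stated assumptions: the rows are made up, the (month, count, hashtag) tuple layout and the month strings are assumptions rather than the real file format, and the function names are hypothetical. It is not the required Spark RDD or Hive solution.

```python
from collections import defaultdict
from functools import reduce

# Made-up sample rows as (month, count, hashtag).
# The column order and month format are assumptions, not the real schema.
rows = [
    ("200901", 30, "tagA"),
    ("200901", 50, "tagB"),
    ("200902", 45, "tagA"),
    ("200902", 20, "tagB"),
    ("200903", 10, "tagC"),
]


def top_row(rows):
    """2a: compare rows pairwise and keep the one with the larger count."""
    return reduce(lambda a, b: a if a[1] >= b[1] else b, rows)


def top_hashtag(rows):
    """2b: sum the counts per hashtag, then take the largest total."""
    totals = defaultdict(int)
    for _month, count, tag in rows:
        totals[tag] += count
    return max(totals.items(), key=lambda kv: kv[1])


def biggest_increase(rows, x, y):
    """2c: compare counts at month x and month y only.

    Hashtags without tweets in either month are ignored. Returns
    (hashtag, count_at_x, count_at_y), or None if no hashtag qualifies.
    """
    at_x = {tag: c for m, c, tag in rows if m == x and c > 0}
    at_y = {tag: c for m, c, tag in rows if m == y and c > 0}
    shared = at_x.keys() & at_y.keys()
    if not shared:
        return None
    best = max(shared, key=lambda tag: at_y[tag] - at_x[tag])
    return best, at_x[best], at_y[best]


print(top_row(rows))
print(top_hashtag(rows))
print(biggest_increase(rows, "200901", "200902"))
```

The pairwise comparison in `top_row` and the per-key sum in `top_hashtag` are written so that they need only one pass over the data, which is in the spirit of the efficiency criteria stated earlier.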
3. Indexing Bag of Words data
In this task you are asked to create a partitioned index of words to documents that contain the words. Using this index you can search for all the documents that contain a particular word efficiently.
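To make the idea of a partitioned word-to-document index concrete, here is a minimal plain-Python sketch. The (word, doc_id) input pairs are made up, and partitioning by the word's first letter is purely an assumption chosen for illustration; the text above does not say which partitioning key is intended. The function names are hypothetical.

```python
from collections import defaultdict

# Made-up (word, doc_id) pairs; the real input schema is not shown here.
pairs = [
    ("apple", 1),
    ("apple", 3),
    ("banana", 2),
    ("avocado", 3),
]


def build_index(pairs):
    """Group doc ids by word, and group words by an assumed partition key
    (the first letter of the word). Words are assumed to be non-empty."""
    index = defaultdict(lambda: defaultdict(set))
    for word, doc_id in pairs:
        index[word[0]][word].add(doc_id)
    return index


def lookup(index, word):
    """Search only the partition that could contain the word."""
    return sorted(index.get(word[0], {}).get(word, ()))


index = build_index(pairs)
print(lookup(index, "apple"))
print(lookup(index, "cherry"))
```

The point of the partitioning is visible in `lookup`: a search touches a single partition instead of scanning every word in the index.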
4. Creating co-occurring words from Bag of Words data
a) [Spark SQL] Remove all rows which reference infrequently occurring words from docwords. Store the resulting dataframe in Parquet format at frequent_docwords.parquet and in CSV format at “Task 4a-out”. An infrequently occurring word is any word that appears fewer than 1000 times in the entire corpus of documents.
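The filtering rule can be sketched in plain Python as a two-step process: total the occurrences of each word across the corpus, then keep only rows whose word reaches the threshold. The (doc_id, word, count) row layout and the sample values are assumptions for illustration, and `keep_frequent` is a hypothetical helper. This is not the Spark SQL solution and it does not write Parquet or CSV output.

```python
from collections import Counter

MIN_TOTAL = 1000  # words appearing fewer times than this are infrequent

# Made-up rows as (doc_id, word, count); the real docwords schema may differ.
docwords = [
    (1, "data", 600),
    (2, "data", 500),
    (1, "rare", 10),
    (3, "spark", 1000),
]


def keep_frequent(rows, min_total=MIN_TOTAL):
    """Drop every row whose word occurs fewer than min_total times overall."""
    totals = Counter()
    for _doc_id, word, count in rows:
        totals[word] += count
    return [row for row in rows if totals[row[1]] >= min_total]


frequent = keep_frequent(docwords)
print(frequent)
```

Note that a word is judged by its corpus-wide total, so both "data" rows survive even though neither row reaches 1000 on its own.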