Kafka Producer - IT Assignment Help

Download Solution Order New Solution
Assignment Task

 

This assignment consists of three parts:
Task 1: Kafka Producer - Producing the streaming data, where you can use csv modules
to read and publish the data to the Kafka stream.
Task 2: Kafka Consumer - Consuming the streaming data using Kafka consumer, use the csv module in any python libraries (e.g. Pandas) to process the ingested data from Kafka.
Task 3: Streaming application - Using Spark Structured Streaming together with Spark ML/SQL to process data streams. In task 3, pandas can only be used for plotting steps. The excessive usage of Pandas for data processing is not recommended.

You need to simulate the streaming data production using Kafka, then show some basic streaming classification to display the accumulated average of accuracy
(accumMeanAccuracy)andtotalnumberofflightrecordsforeachtimestamp
(countFlightRecords) after consuming the data. Build a streaming application that integrates
the machine learning model (provided to you) that can classify the flight-delays data stream.

Getting Started
There is no template for this assignment, please organize your answer in order so that it is easy to mark.
You will be using Python 3+ and PySpark 3+, Kafka (kafka-python), and any other python libraries for this assignment such as : numpy, pandas, scipy, and matplotlib.
Consult the tutors or ask Ed Forum if you are using other packages.
Create an Assignment-2B-Task1_flight_producer.ipynb file for data production
Create an Assignment-2B-Task2_flight_consumer.ipynb file for consuming process data using Kafka
Create an Assignment-2B-Task3_streaming_application.ipynb file for consuming and processing data using Spark Structured Streaming
1. Producing the streaming data
In this section, you will need to implement an Apache Kafka producer to simulate the real-time streaming of the data.
Important:
In this task, you need to generate the event timestamp in UTC timezone for each data record in the producer, and then convert the timestamp to unix-timestamp format (keeping UTC timezone) to simulate the “ts” column. For example, if the current time is 2021-9-28 12:39:45 UTC, it should be converted to the value of 1632796806, and
stored in the “ts” column
Please do not use Spark in this task
Event Flight Producer
Write a python program that loads all the data from “flight*.csv”. Save the file as Assignment-2B-Task1_flight_producer.ipynb.
Your program should send X number and Y number of records from each producer following the sequence to the Kafka stream every 5 seconds. X represents the records to send in a particular batch, whereas Y represents the records to send in the next batch (pending records).
There are some steps need to be carried out for this task:
Generate random numbers A and B, whose values are between 70~100 (inclusive) and 5~10 (inclusive) respectively, which are regenerated for each keyFlight. The keyFlights are generated from the column ‘DAY_OF_WEEK’ in the dataset which has
7 unique keys. These values 1, 2, 3, 4, 5, 6, and, 7 represents ‘sunday’,‘monday’, ‘tuesday’, ‘wednesday’,’thursday’,’friday’,’saturday’
You will need to append event time in unix-timestamp format (as mentioned above) for each key. Assuming that there are 7 keys in the flight-dataset as mentioned above,
there will be 7 unix-timestamp for each batch.
Each batch data contains 7 group (7 sub batches) instances generated from each key.
All of them are concatenated in the form of the list of dictionaries.
a.Explanation of a group instances/records
If A1 represents a group of instances/records generated from key = ‘1’ and B1 represents a group of pending instances/records generated from key = ‘1’, thus batch 1 (X1) contains [A1; A2; A3; A4; A5; A6; A7] and batch 1 pending (Y1) contains [B1; B2; B3; B4; B5; B6; B7].
A1 and B1 have the same ‘ts’ as it is generated from the same batch at the same time. The same case is also the same for A2 and B2 and so on.
Given random numbers A and B, the number of instances in A1, A2 and
B1, B2 and so on vary.
b.Explanation of a dictionary.
Dictionary represents an instance of data which output can be seen as follow
i. Exampleofadictionarywithkey
‘DAY_OF_WEEK’:1, ‘month’:1,...} = ‘1’ {‘ts’:1632796806,..,
ii. Exampleofadictionarywithkey = ‘7’ {‘ts’:1632744322,...,
‘DAY_OF_WEEK’:7, ‘month’:3,...}
iii.Dictionary is a part of a sub batch data. A sub batch data is a part of a batch data X. This also applies for pending data B.

At time1: X1 and Y1 are generated on the producer side, but only X1 is sent.
At time2: X2 and Y2 are generated on the producer side, but only X2 and pending data from the previous batch (Y1) are sent to the consumer.
If the data in each key is exhausted, restart from the first sequence again.
Pseudocode for this task:
Take the DAY_OF_WEEK column as the key, name a variable KeyFlightswhich contains the set of keys (7 keys).Createa function getFlightRecords,whichreturnsa variable named flightRecords, which is a dictionary that contains all flight data with theirassociated keys (step 3).
Create a topic called ‘flightTopic’
Create an instance variable called ‘flightProducer’
for each keyFlight in KeyFlightsGenerate A[‘keyFlight’] andB[‘keyFlight’]andgiveboththe timestamp as formatted in 3.b.
Concatenate all A and B. It will form the data batch X and Y respectively.
f.Send X and Y to the consumer following the rule in step 4.
2 Consuming data using Kafka
In this task, we will implement multiple Apache Kafka consumers to consume the data from task 1.
Important:
In this task, use Kafka consumer to consume the data from task 1.
Do not use Spark in this task
Event Flight Consumer
Write a python program that consumes the process events using kafka consumer, visualising the countFlightRecords in real time based on the timestamp. Save the file as Assignment-2B-Task2_flight_consumer.ipynb.
Your program should get the count of countFlightRecords captured by the consumer in the last 2-minutes (use the processing time) for each keyFlight. For this, print the number of flights for keyFlight = ‘1’, keyFlight = ‘2’, and keyFlight = ‘3’ only. Then, use line charts to visualize it. Note, in this task, please use ts as timestamp generated
from producer step (disregard DAY, DAY_OF_WEEK, MONTH, and YEAR as the real timestamp)
Hint - x-axis can be used to represent the timestamp, while y-axis can be used to represent the number of countFlightRecords data can be represented in different color legends.
3. Streaming application using Spark Structured Streaming
In this task, we will implement Spark Structured Streaming to consume the data from task
1 and perform streaming classification.
Important:
You may use Spark Structured Streaming together with Spark SQL and ML.
Write a python program that achieves the following requirements. Save the file as Assignment-2B-Task3_streaming_application.ipynb.
SparkSession is created using a SparkConf object, which would use two local cores
with a proper application name, and use UTC as the timezone 3.
From the Kafka producers in Task 1, ingest the streaming data into Spark Streaming.
Then the streaming data format should be transformed into the proper formats following the file schema in the metadata.
Persist the transformed streaming data in parquet format for flight data. Flight data should be stored in “flight.parquet” in the same folder of your notebook.
Load the machine learning models given, and use the models to classify whether each flight records are delayed. This is based on the assumption that the data has been
labelled.
Using the classification results, monitor the data following the requirements below. For each key in keyFlight = ‘1’, keyFlight = ‘2’, and keyFlight = ‘3’, keep track of the accumulated accuracy for every timestamp in the 2-min window for a total of 6 minutes.
i.Your results should include, number of records flight and for each key, including their accumulated accuracy in each timestamp. ii. Visualise the data in line charts. Prepare a line chart plot to show the number of flights from the start to the most recent.
For this visualisation, You need two subplots. First subplot, the x-axis can be used to represent the timestamp, while y-axis can be used to represent the number of countFlightRecords. For the second subplot, x-axis can be used to represent the timestamp, whereas y-axis can be used to plot the accumMeanAccuracy. For each subplot, the results from all keyFlights (key = ‘1’, key = ‘2’, and key = ‘3’) should be represented in different color legends.

 

This IT Assignment has been solved by our IT experts at My Uni Paper. Our Assignment Writing Experts are efficient to provide a fresh solution to this question. We are serving more than 10000+ Students in Australia, UK & US by helping them to score HD in their academics. Our Experts are well trained to follow all marking rubrics & referencing style.

Get It Done! Today

Country
Applicable Time Zone is AEST [Sydney, NSW] (GMT+11)
+

Every Assignment. Every Solution. Instantly. Deadline Ahead? Grab Your Sample Now.