Basic Programming - Real Time Stream Analysis Using Spark - Computer Science Assessment Answer

Internal Code: 1AICFF


Assignment Task

Goal: The importance of real-time data analytics has been increasing. The goal of this programming assignment is to analyse data streams over a network in real time and derive valuable insights. The two factors this assignment focuses on are the accuracy of the results and how quickly they are obtained.

Dataset: You will be provided with a JAR executable that generates a continuous UDP stream of data. The data consists of CSV lines with the following fields: "version", "count", "sys_uptime", "unix_secs", "unix_nsecs", "flow_sequence", "engine_type", "engine_id", "sampling_interval", "srcaddr", "dstaddr", "nexthop", "input", "output", "dPkts", "doctets", "first", "last", "srcport", "dstport", "pad1", "tcp_flags", "prot", "tos", "src_as", "dst_as", "src_mask", "dst_mask", "pad2".

Command to execute the code:
  java -jar data-generator.jar --destIPaddress <Destination IP address> --destPortNumber <Destination Port Number> --transmissionTime <Total transmission time in minutes> --transmissionRate <Number of packets to send per second>

Example command:
  java -jar data-generator.jar --destIPaddress 127.0.0.1 --destPortNumber 9876 --transmissionTime 10 --transmissionRate 1000

CSV header:
  version, count, uptime, unixTime, unixNanoTime, sequence_number, engine_type, engine_id, sampling_interval, srcaddr, dstaddr, nexthop, input, output, dPkts, doctets, first, last, srcport, dstport, pad1, tcp_flags, prot, tos, src_as, dst_as, src_mask, dst_mask, pad2

Example data:
  5,29,10,1559090710882,1559090710882,59,9,8,46,69.50.4.20,205.90.187.116,56,41,78,30,86,35,37,40,24,45,32,8,28,8,84,6,10,7

Note:
  1. The data is transmitted as UDP packets.
  2. The UDP payload does not contain a CSV header. The payload consists only of CSV data.
  3. The size of each packet varies but does not exceed 1024 bytes.
  4. Each UDP packet consists of 5 CSV lines.
  5. The transmissionRate parameter is used to slow down the transmission rate for debugging and performance tuning purposes. The maximum transmission rate on the DC server is around 65k packets/second; increasing the value beyond 65k has no impact on the speed of transmission. The maximum transmission rate on a personal laptop varies; an optimal value would be around 50k packets/second.

Problem Statement: Solve the following problems using any Apache Spark libraries that would improve your design and performance.
  1. Determine the moving average of the "doctets" field over a time window of one minute. Along with the average, list the largest value in each one-minute window. Use "unixTime" in the data for window time operations. "unixTime" is an epoch timestamp similar to System.currentTimeMillis().
  2. Pattern matching: List all the "srcaddr" and "dstaddr" values that appear twice in a sliding window of 2 minutes. Use "unixTime" in the data for window time operations. "unixTime" is an epoch timestamp similar to System.currentTimeMillis().

The main goal of the assignment is to get the results in real time with minimum time lag. If the source transmission ends in the Nth second, then your Spark code must be able to process all the data in the (N+k)th second. The goal is to make k as small as possible.
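
The payload format described in the notes (no header, several CSV lines per packet, 29 fields per line) can be illustrated with a small parser. This is a minimal plain-Python sketch, not Spark code and not the assignment's solution. The names `Flow` and `parse_payload` are hypothetical, the column positions are read off the CSV header in the task, and dropping malformed lines silently is an assumption.

```python
from typing import List, NamedTuple

# 0-based column positions, read off the CSV header in the task description.
UNIX_TIME, SRCADDR, DSTADDR, DOCTETS = 3, 9, 10, 15
FIELDS_PER_LINE = 29  # number of names in the CSV header


class Flow(NamedTuple):
    """Hypothetical record type holding only the fields the two problems use."""
    unix_time_ms: int
    srcaddr: str
    dstaddr: str
    doctets: int


def parse_payload(payload: bytes) -> List[Flow]:
    """Split one UDP payload into Flow records, skipping malformed lines."""
    flows = []
    for line in payload.decode("ascii", errors="replace").splitlines():
        cols = [c.strip() for c in line.split(",")]
        if len(cols) != FIELDS_PER_LINE:
            continue  # assumption: drop lines that do not have 29 fields
        try:
            flows.append(
                Flow(int(cols[UNIX_TIME]), cols[SRCADDR],
                     cols[DSTADDR], int(cols[DOCTETS]))
            )
        except ValueError:
            continue  # assumption: drop lines with non-numeric time or doctets
    return flows


# The example line from the task, repeated 5 times to mimic one packet.
sample_line = (
    b"5,29,10,1559090710882,1559090710882,59,9,8,46,69.50.4.20,"
    b"205.90.187.116,56,41,78,30,86,35,37,40,24,45,32,8,28,8,84,6,10,7\n"
)
flows = parse_payload(sample_line * 5)
print(len(flows), flows[0])
```

Because a packet never exceeds 1024 bytes, a receiver built on Python's `socket.recvfrom(1024)` could pass each datagram straight to a function like this; in a Spark job the same splitting would happen inside whatever custom UDP source or receiver is used.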
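
To make Problem 1 concrete, the sketch below computes the average and the largest "doctets" value per one-minute window keyed on "unixTime". It is plain Python rather than Spark, and it assumes the windows are non-overlapping (tumbling) and aligned to whole minutes of the epoch; the task's phrase "moving average" could also be read as a sliding window. The function name `window_stats` and the second and third sample records are made up for illustration.

```python
from collections import defaultdict

WINDOW_MS = 60_000  # one minute, in the same unit as unixTime (milliseconds)


def window_stats(records):
    """records: iterable of (unix_time_ms, doctets) pairs.

    Returns {window_start_ms: (average_doctets, max_doctets)}.
    """
    acc = defaultdict(lambda: [0, 0, None])  # running sum, count, max
    for ts, doctets in records:
        start = ts - ts % WINDOW_MS  # align to the start of the minute
        entry = acc[start]
        entry[0] += doctets
        entry[1] += 1
        entry[2] = doctets if entry[2] is None else max(entry[2], doctets)
    return {start: (s / n, m) for start, (s, n, m) in sorted(acc.items())}


records = [
    (1559090710882, 86),  # timestamp and doctets from the task's example line
    (1559090720000, 14),  # made-up record in the same minute
    (1559090775000, 40),  # made-up record in the next minute
]
stats = window_stats(records)
for start, (average, maximum) in stats.items():
    print(start, average, maximum)
```

Keeping only a sum, a count, and a maximum per window means each record is processed in constant time, which matters when the goal is to keep the lag k small. In Spark Structured Streaming the comparable operation would presumably be a grouping on the `window` function over the event-time column with `avg` and `max` aggregates.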
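
Problem 2 can be sketched in the same spirit. The version below assumes that "appears twice" means the same ("srcaddr", "dstaddr") pair occurs at least twice less than 2 minutes apart, and that records arrive in non-decreasing "unixTime" order; neither assumption is stated in the task. Under those assumptions it is enough to remember the last time each pair was seen. The function name `repeated_pairs` and the 10.0.0.x addresses are made up for illustration, and this is plain Python, not Spark.

```python
WINDOW_MS = 120_000  # two minutes, in milliseconds


def repeated_pairs(events):
    """events: iterable of (unix_time_ms, srcaddr, dstaddr), ordered by time.

    Returns the set of (srcaddr, dstaddr) pairs that occur at least twice
    less than WINDOW_MS apart.
    """
    last_seen = {}  # (srcaddr, dstaddr) -> timestamp of the latest occurrence
    hits = set()
    for ts, src, dst in events:
        key = (src, dst)
        prev = last_seen.get(key)
        # With time-ordered input, two occurrences fall inside one window
        # exactly when some consecutive pair of occurrences does.
        if prev is not None and ts - prev < WINDOW_MS:
            hits.add(key)
        last_seen[key] = ts
    return hits


events = [
    (0, "69.50.4.20", "205.90.187.116"),
    (30_000, "10.0.0.1", "10.0.0.2"),
    (90_000, "69.50.4.20", "205.90.187.116"),  # 90 s after first: reported
    (200_000, "10.0.0.1", "10.0.0.2"),         # 170 s after first: not reported
]
pairs = repeated_pairs(events)
print(pairs)
```

The `last_seen` dictionary grows with the number of distinct pairs, so a long-running job would need to evict entries older than the window; out-of-order arrival would also need extra handling, for example a watermark in Spark Structured Streaming.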
