CS6713 - Scalable Algorithms for Data Analysis Assignment


Assignment Task

You are expected to implement everything from scratch and are not expected to use any predefined functions/libraries.

1. Implement the Tidemark algorithm for estimating the number of distinct elements. Test it on the stream consisting of all the numbers in the file and on windows of 50000 numbers each, compare the estimates with the ground truth, and plot the results.
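As a rough illustration of task 1, here is a minimal sketch of a Tidemark-style estimator. It assumes the stream items are non-negative integers smaller than `PRIME`, and it uses the standard-library `random` module only to draw the hash parameters, which is an assumption about what "from scratch" permits. The names `trailing_zeros`, `tidemark`, `windows`, and `PRIME` are hypothetical.

```python
import random

PRIME = (1 << 61) - 1  # assumed to be larger than any item in the stream


def trailing_zeros(x):
    """Count trailing zero bits of x (0 is treated as having 61, the hash width)."""
    if x == 0:
        return 61
    count = 0
    while x & 1 == 0:
        x >>= 1
        count += 1
    return count


def tidemark(stream, seed=0):
    """Estimate the number of distinct integers in the stream."""
    rng = random.Random(seed)
    a = rng.randrange(1, PRIME)  # parameters of the hash h(x) = (a*x + b) mod PRIME
    b = rng.randrange(PRIME)
    z = 0  # largest number of trailing zeros seen so far
    for item in stream:
        z = max(z, trailing_zeros((a * item + b) % PRIME))
    return 2 ** (z + 0.5)


def windows(stream, size=50000):
    """Yield consecutive windows of the given size from a list."""
    for start in range(0, len(stream), size):
        yield stream[start:start + size]


def exact_distinct(stream):
    """Ground truth for comparison."""
    return len(set(stream))
```

A single run has high variance, so in practice one would probably take the median over several seeds before comparing against `exact_distinct`.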

2. Write code to test whether there is a number that appears at least m/10 times in the stream, where m is the length of the stream. If so, what is the frequency of that number? That is, implement the heavy hitters algorithm with k = 10.
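One possible approach to task 2 is a Misra-Gries summary followed by a second pass that counts the surviving candidates exactly. The sketch below assumes the stream fits in a list so it can be read twice, and it keeps k counters (rather than k - 1) so that an item with frequency exactly m/k is still retained. The function names are hypothetical.

```python
def misra_gries_candidates(stream, k=10):
    """First pass: keep at most k counters; every item with frequency >= m/k survives."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k:
            counters[item] = 1
        else:
            # All counters are in use: decrement each and drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


def heavy_hitters(stream, k=10):
    """Return {item: exact frequency} for items appearing at least m/k times."""
    stream = list(stream)
    m = len(stream)
    exact = {candidate: 0 for candidate in misra_gries_candidates(stream, k)}
    # Second pass: count only the candidates exactly.
    for item in stream:
        if item in exact:
            exact[item] += 1
    return {item: freq for item, freq in exact.items() if freq * k >= m}
```

The second pass is what turns the approximate counters into exact frequencies; without it, the counters only give a lower bound.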

3. Implement a Bloom filter with the following sketch sizes: 50, 70, 100, 150, 500, 1000, 2000. Use an appropriate number of hash functions for the sketch size and the number of items in the stream. Treat the first 5% of elements as the test dataset (do not include the test dataset when building the Bloom filter), and report the confusion matrix for each dataset at each of the sketch sizes listed above.
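A minimal sketch of task 3 might look like the following. It assumes integer items, reads "appropriate values of the hash function" as choosing the number of hash functions by the usual rule (bits / items) · ln 2 with a minimum of one, and uses the standard-library `random` and `math` modules only for hash parameters and the logarithm. `BloomFilter`, `confusion_matrix`, and `PRIME` are hypothetical names.

```python
import math
import random

PRIME = (1 << 61) - 1  # assumed to be larger than any item


class BloomFilter:
    def __init__(self, num_bits, expected_items, seed=0):
        self.num_bits = num_bits
        self.bits = [0] * num_bits
        # Usual rule of thumb for the number of hash functions, at least one.
        num_hashes = max(1, round(num_bits / max(1, expected_items) * math.log(2)))
        rng = random.Random(seed)
        self.params = [(rng.randrange(1, PRIME), rng.randrange(PRIME))
                       for _ in range(num_hashes)]

    def _positions(self, item):
        return [((a * item + b) % PRIME) % self.num_bits for a, b in self.params]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))


def confusion_matrix(train, test, num_bits, seed=0):
    """Build the filter from train, query every test item, and count outcomes."""
    truth = set(train)
    bloom = BloomFilter(num_bits, len(truth), seed)
    for item in truth:
        bloom.add(item)
    cm = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for item in test:
        predicted, actual = item in bloom, item in truth
        if predicted and actual:
            cm["TP"] += 1
        elif predicted:
            cm["FP"] += 1
        elif actual:
            cm["FN"] += 1
        else:
            cm["TN"] += 1
    return cm
```

Because a Bloom filter never forgets an inserted item, the `FN` cell should always be zero; only the split between `FP` and `TN` changes with the sketch size.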

4. Implement the Count-Min Sketch algorithm with the following values of (t, k) = {(50, 50), (25, 100), (250, 10), (500, 5)}, where k denotes the sketch dimension and t the number of repetitions. Treat the first 5% of elements as the test dataset (the query items), and report RMSE bar charts for these values of (t, k). The RMSE is defined as follows: for each query item, compute the difference between its ground-truth frequency and its estimate from the sketch, square all these values, add them up, and compute the mean. Note that a smaller RMSE indicates better performance. (Footnote 1: http://fimi.uantwerpen.be/data/)
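The following is a minimal sketch of task 4, assuming integer items, t rows of k counters each, and pairwise-independent hashes of the form (a·x + b) mod p. The `rmse` helper takes the square root of the mean squared error, as the name RMSE suggests; the task text stops at the mean, so drop the final `** 0.5` if that is what is wanted. All names here are hypothetical.

```python
import random

PRIME = (1 << 61) - 1  # assumed to be larger than any item


class CountMinSketch:
    def __init__(self, t, k, seed=0):
        rng = random.Random(seed)
        self.k = k
        # One hash function per row (t repetitions, k counters per row).
        self.params = [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(t)]
        self.table = [[0] * k for _ in range(t)]

    def _col(self, row, item):
        a, b = self.params[row]
        return ((a * item + b) % PRIME) % self.k

    def update(self, item, count=1):
        for row in range(len(self.table)):
            self.table[row][self._col(row, item)] += count

    def estimate(self, item):
        # Collisions only add to a counter, so the minimum is the tightest estimate.
        return min(self.table[row][self._col(row, item)]
                   for row in range(len(self.table)))


def rmse(queries, truth, estimator):
    """Root of the mean squared difference between truth[q] and estimator(q)."""
    total = 0
    for q in queries:
        diff = truth[q] - estimator(q)
        total += diff * diff
    return (total / len(queries)) ** 0.5
```

For example, `rmse(queries, truth, sketch.estimate)` gives one bar per (t, k) setting once `truth` holds the exact frequencies of the query items.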

5. Repeat the above for the Count-Sketch algorithm. In the bar chart, place the Count-Sketch and Count-Min Sketch results side by side for comparison.
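For task 5, a Count-Sketch can be sketched along the same lines, again assuming integer items. Each of the t rows has a bucket hash and a ±1 sign hash, and the estimate is the median of the signed counters. Taking the sign from the low bit of a linear hash is a simplification made for this illustration, and the names are hypothetical.

```python
import random

PRIME = (1 << 61) - 1  # assumed to be larger than any item


def median(values):
    """Median of a non-empty list (mean of the two middle values if even length)."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


class CountSketch:
    def __init__(self, t, k, seed=0):
        rng = random.Random(seed)
        self.k = k
        # Per row: (a, b) for the bucket hash and (c, d) for the sign hash.
        self.params = [tuple(rng.randrange(1, PRIME) for _ in range(4)) for _ in range(t)]
        self.table = [[0] * k for _ in range(t)]

    def _bucket_and_sign(self, row, item):
        a, b, c, d = self.params[row]
        bucket = ((a * item + b) % PRIME) % self.k
        sign = 1 if ((c * item + d) % PRIME) & 1 else -1
        return bucket, sign

    def update(self, item, count=1):
        for row in range(len(self.table)):
            bucket, sign = self._bucket_and_sign(row, item)
            self.table[row][bucket] += sign * count

    def estimate(self, item):
        per_row = []
        for row in range(len(self.table)):
            bucket, sign = self._bucket_and_sign(row, item)
            per_row.append(sign * self.table[row][bucket])
        return median(per_row)
```

Unlike the Count-Min estimate, this one can err in either direction, which is worth keeping in mind when reading the side-by-side bars.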

6. Implement the AMS sketch for estimating the ℓ2 norm of the frequency vector using median-of-means estimates with the following values of (t, k) = {(50, 50), (25, 100), (250, 10), (500, 5)}. Compute the difference between the estimated quantity and the ground-truth ℓ2 norm, and report it in a bar chart.
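A minimal sketch of task 6, under the assumption that t is the number of groups (median taken across them) and k the number of basic estimators averaged within each group. Signs come from the low bit of a random cubic polynomial modulo a prime, a common way to approximate 4-wise independence; items are assumed to be integers and all names are hypothetical.

```python
import random

PRIME = (1 << 61) - 1  # assumed to be larger than any item


def median(values):
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


class AMSSketch:
    def __init__(self, t, k, seed=0):
        rng = random.Random(seed)
        self.t, self.k = t, k
        # One cubic polynomial (four coefficients) per basic estimator.
        self.coeffs = [[tuple(rng.randrange(PRIME) for _ in range(4)) for _ in range(k)]
                       for _ in range(t)]
        self.z = [[0] * k for _ in range(t)]

    @staticmethod
    def _sign(coeffs, item):
        a, b, c, d = coeffs
        h = (((a * item + b) * item + c) * item + d) % PRIME
        return 1 if h & 1 else -1

    def update(self, item, count=1):
        for r in range(self.t):
            for c in range(self.k):
                self.z[r][c] += count * self._sign(self.coeffs[r][c], item)

    def l2_estimate(self):
        # Mean of squared counters within each group, then median across groups.
        means = [sum(v * v for v in row) / self.k for row in self.z]
        return median(means) ** 0.5


def exact_l2(freq):
    """Ground-truth l2 norm from a {item: frequency} dictionary."""
    return sum(f * f for f in freq.values()) ** 0.5
```

Each update touches all t·k counters, so on a long stream it may be cheaper to feed the sketch one `update(item, count)` call per distinct item using precomputed frequencies.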

Note: Kindly submit a Jupyter notebook file. Copy each question into a cell, and write its code in the following cell. The code should be well-commented and self-explanatory. In your code, please set the path of the datasets (preferably) to the desktop location.
