Michael Cochez

Assistant Professor at Vrije Universiteit Amsterdam

Approximating Stream Cardinalities

Goal

In this exercise we will investigate how large a set needs to be before it starts to make sense to use approximate methods to measure its size. Then we will compare the performance of different methods.

Prerequisites

In principle, the content presented during the lectures suffices to implement this task. However, it is certainly beneficial to study the corresponding text (sections 4.1, 4.2 and 4.4) of the Mining of Massive Datasets book. You can also watch the related videos of the Coursera course, week 3: Mining Data Streams (12:01), Sampling a Stream (11:30) and Counting Distinct Elements (25:59). Further, there are the articles about HyperLogLog and K-minimum values, which are covered neither in the book nor in the videos.

Basic usage of MapReduce is explained in Hadoop's MapReduce Tutorial.

Task

The task consists of three parts. First, you have to investigate the limits for set sizes on your own system. Then you need to calculate the exact number of unique elements of a part of the stream using map-reduce. Finally, you will need to use two approximate methods for determining the cardinality.

The data used is generated (download generator). The generator is written in Java and deterministically produces the elements on its standard output (one per line). This makes it possible to make statistical predictions about the outcome and to repeat the same stream multiple times, while there is no need to download a huge (theoretically infinite) amount of data. A thousand elements can be shown as follows: java -jar generator.jar 1000. If you do not specify an amount, the generator will produce an infinite stream of elements. You can ask it to generate 10000 elements and pipe them to your program, like this: $ java -jar generator.jar 10000 | python myanalyzer.py. In your program you then read the elements from standard input.
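
For example, a minimal sketch in Java that reads the elements from standard input one line at a time (the class name StdinCount is only an illustration, not part of the assignment) could look like this:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;

    public class StdinCount {
        public static void main(String[] args) throws IOException {
            BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
            long count = 0;
            String line;
            // Handle one element at a time; the whole stream is never kept in memory.
            while ((line = in.readLine()) != null) {
                count++; // replace with your own per-element processing
            }
            System.out.println("elements seen: " + count);
        }
    }

You would then run it as java -jar generator.jar 10000 | java StdinCount.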

This task is performed either individually or as a pair. You are free to use any programming language you want; however, make sure you choose a language in which you can read from standard input and process the elements as a stream.

Part I

Investigate the limits of the set implementation of the programming language on your computer. The goal of this exercise is not to write an optimized implementation that achieves larger set sizes. Instead, the goal is to get an idea of the magnitude a normal machine can handle.

To test this, write something like this:

make an empty set
while (true){
    start timer
    add 1000000 new elements to the set (you can just use 64 bit integers)
    end timer and print time
}

You do not have to use the generator to produce the items to be added to your set. Instead, you can just use a counter which is incremented after each add operation. You can stop the process once adding a batch of elements takes a long time (> 5 min) or, more commonly, once you run out of memory.
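
As a rough sketch of what this measurement could look like in Java (the class name is illustrative, and a HashSet of boxed longs is just one possible choice of set):

    import java.util.HashSet;
    import java.util.Set;

    public class SetLimit {
        public static void main(String[] args) {
            Set<Long> set = new HashSet<>();
            long next = 0;
            while (true) {
                long start = System.nanoTime();
                // Add a batch of 1,000,000 fresh elements; a counter stands in for the generator.
                for (int i = 0; i < 1_000_000; i++) {
                    set.add(next++);
                }
                double seconds = (System.nanoTime() - start) / 1e9;
                System.out.println(set.size() + " elements, last batch took " + seconds + " s");
            }
        }
    }

Running it with different heap sizes (for example java -Xmx4g SetLimit) lets you relate the largest reachable set size to the available memory.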

Most likely, the timings will not grow linearly. If the set you use is based on hashing, the time for adding an element is essentially constant, independent of the set size. However, once in a while the table used to store the values needs to be resized, which takes longer for larger sets.

The only things you need to report are the amount of memory you have, the language and set type you used, and how many elements you could store. If you have configured specific memory options (such as the JVM heap size), report these as well.

Part II

The teacher has placed data on Amazon S3 containing the first 10000 up to 10^8 elements of the stream, in the bucket s3://data-for-stream-counting-5e33f2e9-bb84-4c40-bb03-5bb086c7adea/. The dataset is also available in larger sizes, but processing these will take quite a bit of time. (You can browse S3, starting from here.)

Using map-reduce, you need to calculate the number of unique elements exactly.

Counting distinct values with MapReduce is fairly straightforward; you can implement it by following the pseudocode below.

map(key, value) -> value, null
reduce(key, values) -> 1, null

A combiner can be used to save communication if the map function produces a lot of duplicates. It could be written as follows:

combine(key, values) -> key, null

You are advised to work with Java. Using Java, you need to export your MapReduce code to a runnable .jar file and place it in your S3 bucket. Do not use anything from the old .mapred. API; use the new .mapreduce. API for everything instead. Read the MapReduce Tutorial which was mentioned above.
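
As a rough, untested sketch of how the pseudocode above could translate to the new API (all class names are placeholders; the calls used come from the standard org.apache.hadoop.mapreduce packages):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class DistinctCount {

        // map(key, value) -> value, null : emit the element itself as the key
        public static class ElementMapper
                extends Mapper<LongWritable, Text, Text, NullWritable> {
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                context.write(value, NullWritable.get());
            }
        }

        // combine(key, values) -> key, null : drop duplicates locally to save communication
        public static class ElementCombiner
                extends Reducer<Text, NullWritable, Text, NullWritable> {
            @Override
            protected void reduce(Text key, Iterable<NullWritable> values, Context context)
                    throws IOException, InterruptedException {
                context.write(key, NullWritable.get());
            }
        }

        // reduce(key, values) -> 1, null : one output line per distinct element
        public static class OneReducer
                extends Reducer<Text, NullWritable, IntWritable, NullWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            @Override
            protected void reduce(Text key, Iterable<NullWritable> values, Context context)
                    throws IOException, InterruptedException {
                context.write(ONE, NullWritable.get());
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "distinct count");
            job.setJarByClass(DistinctCount.class);
            job.setMapperClass(ElementMapper.class);
            job.setCombinerClass(ElementCombiner.class);
            job.setReducerClass(OneReducer.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(NullWritable.class);
            job.setOutputKeyClass(IntWritable.class);
            job.setOutputValueClass(NullWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The input and output paths are given as the two program arguments when launching the job.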

In order to work on the Amazon infrastructure, you first need to get your personal password from http://jyu-aws.appspot.com and log in to AWS. There you can switch to the Elastic MapReduce service and read the instructions on how to run MapReduce jobs. All MapReduce tasks must be run on the cluster in region us-east-1 (US East, N. Virginia).

When starting jobs, use identifiers from which you can be identified (your Korppi ID) to avoid interference with others' work. For storing any data, use your personal bucket for this course, jyu-ties438-username-randomUDDI (already created).

Then launch the MapReduce job as described in step 8 of EMR DeveloperGuide - launch-custom-jar.

The MapReduce job will generate a number of output files in your output bucket, each containing lines with a one. You can use these to determine the size by summing the sizes of the files and dividing by two (each line is a one and a newline, thus two bytes). You can also use counters (either built-in ones or your own) to determine the size. Alternatively, you can run a second MapReduce job to count the number of lines in the files.
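
For instance, if you download the part-r-* output files to a local directory, a small sketch that applies the divide-by-two trick (the class name is illustrative) could be:

    import java.io.IOException;
    import java.nio.file.DirectoryStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class SizeFromBytes {
        public static void main(String[] args) throws IOException {
            long bytes = 0;
            // Sum the sizes of all reducer output files in the directory given as argument.
            try (DirectoryStream<Path> files =
                    Files.newDirectoryStream(Paths.get(args[0]), "part-r-*")) {
                for (Path p : files) {
                    bytes += Files.size(p);
                }
            }
            // Each line is a '1' plus a newline, i.e. two bytes per distinct element.
            System.out.println("distinct elements: " + bytes / 2);
        }
    }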

First test on the small dataset and check that you get the correct result: data10000 should give you 9925 distinct values. Then determine the sizes of the sets up to 10^8.

Part III

Implement two algorithms for cardinality estimation. The first one uses (order) statistics (sampling or K-minimum values) and the second one uses bit-pattern observables (Flajolet-Martin (from the course book), LogLog or HyperLogLog; see also the hints). For both algorithms, you need to calculate the root-mean-square error for 1,000,000 stream elements.
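
As a concrete illustration of the first family, here is a hedged sketch of a K-minimum-values estimator; the class name, the choice k = 1024 and the simplistic hash function are placeholders, not part of the assignment:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.Comparator;
    import java.util.HashSet;
    import java.util.PriorityQueue;
    import java.util.Set;

    public class KMinimumValues {
        public static void main(String[] args) throws IOException {
            final int k = 1024; // sketch size: an accuracy/memory trade-off
            // Largest of the kept hashes on top, so it can be evicted quickly.
            PriorityQueue<Double> kSmallest = new PriorityQueue<>(Comparator.reverseOrder());
            Set<Double> kept = new HashSet<>(); // hint 7: keep the k values unique

            BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
            String line;
            while ((line = in.readLine()) != null) {
                double h = hash(line); // hash of the element, uniform in [0, 1)
                if (kept.contains(h)) continue;
                if (kSmallest.size() < k) {
                    kSmallest.add(h);
                    kept.add(h);
                } else if (h < kSmallest.peek()) {
                    kept.remove(kSmallest.poll());
                    kSmallest.add(h);
                    kept.add(h);
                }
            }

            if (kSmallest.size() < k) {
                System.out.println(kSmallest.size()); // fewer than k distinct hashes: exact
            } else {
                double kthSmallest = kSmallest.peek(); // k-th smallest hash value overall
                System.out.println(Math.round((k - 1) / kthSmallest));
            }
        }

        // Placeholder hash mapping a string to [0, 1); use a better hash in practice.
        private static double hash(String s) {
            long x = s.hashCode() * 0x9E3779B97F4A7C15L;
            return (double) (x >>> 11) / (1L << 53);
        }
    }

You can feed it a stream in the usual way, e.g. java -jar generator.jar 1000000 1 | java KMinimumValues, and compare the output with the exact cardinalities listed below.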

You tell the generator to produce 10000 elements with a seed of 123 as follows: java -jar generator.jar 10000 123. To calculate the error, you need to compute the estimated cardinalities for the 10 different seeds from the table below:

    | seed | exact cardinality (1,000,000) |
    | ---- | ----------------------------- |
    | 1    | 697757                        |
    | 2    | 698466                        |
    | 3    | 698151                        |
    | 4    | 698536                        |
    | 5    | 697079                        |
    | 6    | 697732                        |
    | 7    | 697822                        |
    | 8    | 698286                        |
    | 9    | 698364                        |
    | 10   | 697910                        |

Now, for both experiments (the two algorithms), you can calculate the root-mean-square error as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{10}\sum_{i=1}^{10}\left(\hat{n}_i - n_i\right)^2}$$

where $\hat{n}_i$ is the estimated and $n_i$ the exact cardinality for seed $i$.
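
Assuming you collect the ten estimates of one algorithm in an array (in seed order), the computation is, for example:

    public class Rmse {
        public static void main(String[] args) {
            long[] exact = {697757, 698466, 698151, 698536, 697079,
                            697732, 697822, 698286, 698364, 697910};
            long[] estimated = new long[10]; // fill in your algorithm's estimates, seeds 1..10
            double sum = 0;
            for (int i = 0; i < exact.length; i++) {
                double diff = estimated[i] - exact[i];
                sum += diff * diff;
            }
            System.out.println(Math.sqrt(sum / exact.length));
        }
    }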

Hints

  1. Use a small part of the data during development.
  2. For the programming language, you might just want to choose the one you are most familiar with. Depending on the chosen language, the teacher will be able to help more (or less) with language-specific issues.
  3. Randomized algorithms are difficult to debug. Make debugging feasible by fixing the seed of your random number generator; at least each run will then be the same.
  4. When using Windows, be careful with the command line pipe | operator. It seems that in some cases the shell buffers all output from the first process and only forwards it to the next after the first one ends. If the stream is very large, this will cause it to either write to the disk (slow) or fill up your memory.
  5. If you are using a machine with more than 2 cores, you can run multiple experiments at the same time. Since we are not measuring running time, they won’t affect each other.
  6. The easiest way to develop the MapReduce part is to use a Maven project. The needed dependency is:

     <dependency>
         <groupId>org.apache.hadoop</groupId>
         <artifactId>hadoop-client</artifactId>
         <version>2.4.0</version>
     </dependency>
    
  7. When implementing k-minimum values, make sure that the k values which you keep are unique, i.e. do not contain duplicates. For keeping the k values, you can use a priority queue (as in the sketch in Part III), but check whether it already contains the value before adding.
  8. The Flajolet-Martin algorithm from the course book is very slow because it needs to evaluate a hash function for each counter.
  9. The LogLog and HyperLogLog algorithms were explained slightly wrongly in class, and the same mistake can be found in the linked articles. The number stored in the registers is not the number of leading zeros, but the position of the first 1-bit, with positions counted from 1. So the desired number is the number of leading zeros + 1; see the sketch after this list.
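
To illustrate hint 9, here is a minimal sketch of the register update in (Hyper)LogLog, assuming a 64-bit hash of which the first B bits select the register and the remaining bits provide the bit-pattern observable (B = 10 and the class name are arbitrary choices for this example):

    public class RegisterUpdate {
        static final int B = 10;                      // 2^B = 1024 registers
        static final int[] registers = new int[1 << B];

        // Update the sketch with one 64-bit hash of an element; the hash function
        // itself is left out here, any well-mixed 64-bit hash will do.
        static void add(long hash) {
            int register = (int) (hash >>> (64 - B)); // first B bits pick the register
            long remaining = hash << B;               // the rest is the observable
            // Position of the first 1-bit, counting from 1 (leading zeros + 1),
            // not merely the number of leading zeros (hint 9).
            int rank = Long.numberOfLeadingZeros(remaining) + 1;
            if (rank > 64 - B + 1) rank = 64 - B + 1; // all remaining bits were zero
            registers[register] = Math.max(registers[register], rank);
        }
    }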