Approximating Stream Cardinalities

Goal

In this exercises we will investigate how large a set needs to be before it starts to make sense to use approximate methods to measure its size. Then we will use map-reduce to compute the sizes of some large sets and compare the performance of different methods for estimating the cardinality of a stream of data.

Prerequisites

In principle, the content presented during the lectures suffices to implement this task. However, it is certainly beneficial to study the corresponding text (sections 4.1, 4.2, 4.4) of the Mining of Massive Datasets book. You can also watch the related videos of the online course. Further, there are the articles about HyperLogLog and K-minimum values, which are not covered in the book nor the videos.

Basic usage of map-reduce can be found from the MapReduce Tutorial of Hadoop.

Task

The task consists of three parts. First, you have to investigate the limits for set sizes on your own system. Then, you will need to use two approximate methods for determining the cardinality of a stream of data. Finally, you need to calculate the exact number of unique elements of a part of the stream using map-reduce.

This task is performed either individually or as a pair. You are free to work using the programming language you want, however, make sure you take a language in which you can read from standard input and in which you can handle the elements as a stream.

Part I

Investigate the limits of the set implementation of the programming language on your computer. The goal of this exercise is not to write an optimized implementation to achieve larger set sizes. Instead, the goal is to get an idea of the magnitude which a normal machine can handle.

To test this, write something like this:

make an empty set
while (true){
    start timer
    add 1000000 new elements to the set (you can just use 64 bit integers)
    end timer and print time
}

You do not have to use the generator to produce items to be added to your set. Instead, you can just use a counter which is incremented after each add operation. You can stop the process once you notice that it seems to take a long time (>1 min)to add the batch of elements or when you run out of memory (which seems more frequent). On Linux you can limit the amount of memory the process can use. After executing ulimit -Hv 1000000 processes started in that terminal can not use more than 1GB of memory. (This is not strictly necessary but prevents the machine from swapping.) Note: in the past students have reported hanging systems when not limiting the processes memory, mainly when using Python.

Most likely, the timings will not be linear. If the set you used is based on hashing, the time for adding elements is basically constant independent on the set size. However, once in a while the table used to store the values needs to be re-sized, which will take longer for larger sets.

You need to report

the amount of memory you have set,
the language and set type you used,
a chart of the timings, and
how many elements you could store.

Also specify how you set memory options.

Part II

Implement two algorithms for cardinality estimation. The first one uses (order) statistics (sampling or K-minimum values) and the second one bit pattern observables (Flajolet-Martin (from handbook), LogLog or HyperLogLog - see also hints). For both algorithms, you need to calculate the root-mean-square error for 1,000,000 stream elements.

The data used is generated (download generator). This generator is written in Java and deterministically produces the elements on its standard output (one per line). This generator makes it possible to know the outcome and repeat the same stream multiple times, while there is no need to download a huge (theoretically infinite) amount of data.

By default ( java -jar generator.jar ) the generator will produce an infinite stream of elements.
Thousand elements can be shown as follows: java -jar generator.jar 1000.
Thousand elements randomly generated with a seed value of 4561 can be generated with java -jar generator.jar 1000 4561.
You can ask it to generate 10000 elements and pipe them to your program, like this: $ java -jar generator.jar 10000 | python myanalyzer.py. In your program you then read the elements from the standard input.

For both algorithms you need to estimate the cardinality for streams with seeds 1 till 10 and estimate the error. Concrete, for both experiments (2 algorithms) , you need calculate the root-mean-square error as follows:

RMSE

where the exact cardinalities can be found from the following table. If you notice that your algorithm can handle 1M easily, feel free to use a higher cardinality. The generation of 100M elements takes about 30 seconds.

    | seed |  exact cardinality | exact cardinality | exact cardinality | exact cardinality |
    |      |  (1,000,000)       | (10,000,000)      | (100,000,000)     | (1,000,000,000)   | 
    |------|--------------------|-------------------|-------------------|-------------------|
    |  1   | 697757             | 3753742           | 14899128          | 46929246          |
    |  2   | 698466             | 3755459           | 14899436          | 46932806          |
    |  3   | 698151             | 3753884           | 14905011          | 46926010          |
    |  4   | 698536             | 3755369           | 14901524          | 46927853          |
    |  5   | 697079             | 3752845           | 14907010          | 46923292          |
    |  6   | 697732             | 3753219           | 14901538          | 46930349          |
    |  7   | 697822             | 3754504           | 14899040          | 46920659          |
    |  8   | 698286             | 3755127           | 14903161          | 46924710          |
    |  9   | 698364             | 3753823           | 14900900          | 46925823          |
    |  10  | 697910             | 3754738           | 14901838          | 46929227          |

Part III

In the past we had map-reduce infrastructure available for this exercise. This is unfortuately not the case this year. Hence, we will work with a smaller dataset (max 10M elements) and you will have to do this exercise on your own machine or on a machine in the lab. If you have the resources and time left, you can also work with a larger dataset. In Eclipse, it is possible to run a map-reduce process directly from the IDE with the depedency desribed in the hints section below (other IDEs untested, but should work). It is also possible to run hadoop jobs on a local cluster. You can get this running by following the instructions from the Hadoop documentation. Note, you only need to work in Local (Standalone) Mode, and not in Pseudo-Distributed Mode or Fully-Distributed Mode. There is no need to setup HDFS.

Your task is to count the exact number of distinct items in the datasets of sizes 10000 till 10^7 using map-reduce for one of the seeds above. You will need to create the input files for this task using the generator.

Computing distinct values using map reduce can be done fairly straightforward, you can implement it by following the pseudocode below.

map(key, value) -> value, null
reduce(key, values) -> 1, null

A combiner should be used to save communication, since the map function produces a lot of duplicates. It could be written as follows:

combine(key, values) -> key, null

Read the MapReduce Tutorial which was mentioned above.

The map-reduce job will generate a number of output files in your output location, each containing lines with a one. You can use these to determine the size by getting the size of each file and dividing it by two (each line is a one and a new-line, thus two bytes). You can also use counters (either internal ones or your own) to determine the size. Alternatively, you can use a second map-reduce job to determine the number of lines in the files.

First test on the small dataset and see whether you can get the correct result; seed 1 should give you 9896 distinct values (there was a mistake here, 9925 was not correct). Then determine the sizes of the sets up till 10^8 (you are allowed to try larger sizes if you have time).

Hints

Use a small part of the data during development.
For the programming language, you might just want to choose the one you are most familiar with. Depending on the chosen language the teacher will be able to help more (or less) with language specific issues. For the map reduce program, you are advised to use Java (or another JVM based language).
Randomized algorithms are difficult to debug. Make it somehow possible by fixing the seed of the random number generator. At least each run will be the same.
When using Windows, be careful with the command line pipe | operator. It seems that in some cases the shell buffers all output from the first process and only forwards it to the next after the first one ends. If the stream is very large, this will cause it to either write to the disk (slow) or fill up your memory.
If you are using a machine with more than 2 cores, you can start multiple experiments at the same time. Since we are not measuring running time, they won’t affect each other.

The easiest to develop the map reduce part is using a Maven project. The needed dependency is :

 <dependency>
     <groupId>org.apache.hadoop</groupId>
     <artifactId>hadoop-client</artifactId>
     <version>2.9.0</version>
 </dependency>

When implementing k-minimum values, make sure the the k values which you keep are unique, i.e. do not contain duplicates. For keeping the k values, you could use a priority queue, but check whether it contains the new value before adding.
The Flajolet-Martin algorithm from the course book is very slow because it needs to evaluate a hash function for each counter.
The LogLog and HyperLogLog algorithms in the articles have a small mistake in the linked articles (as was mentioned during the interactive demonstration of the hyperloglog algorithm). The number in the counters is not the number of trailing/leading zeros, but the index of the first 1, with indexes counting from 1 (or in other words, the number of trailing/leading zeroes + 1). So the desired number is the number of zeros + 1.
Do not use anything from the old .mapred. API, use everything from the new .mapreduce. instead.

Michael Cochez

Assistant Professor at Vrije Universiteit Amsterdam