Approximating Stream Cardinalities

Goal

In this exercises we will investigate how large a set needs to be before it starts to make sense to use approximate methods to measure its size. Then we will use map-reduce to compute the sizes of some large sets and compare the performance of different methods for estimating the cardinality of a stream of data.

Prerequisites

In principle, the content presented during the lectures suffices to implement this task. However, it is certainly beneficial to study the corresponding text (sections 4.1, 4.2, 4.4) of the Mining of Massive Datasets book. You can also watch the related videos of the coursera course in week 3 : Mining Data Streams (12:01), Sampling a Stream (11:30) and Completed Counting Distinct Elements (25:59). Further, there are the articles about HyperLogLog and K-minimum values, which are not covered in the book nor the videos.

Basic usage of map-reduce can be found from the MapReduce Tutorial of Hadoop.

Task

The task consists of three parts. First, you have to investigate the limits for set sizes on your own system. Then you need to calculate the exact number of unique elements of a part of the stream using map-reduce. Finally, you will need to use two approximate methods for determining the cardinality of a stream of data.

This task is performed either individually or as a pair. You are free to work using the programming language you want, however, make sure you take a language in which you can read from standard input and in which you can handle the elements as a stream.

Part I

Investigate the limits of the set implementation of the programming language on your computer. The goal of this exercise is not to write an optimized implementation to achieve larger set sizes. Instead, the goal is to get an idea of the magnitude which a normal machine can handle.

To test this, write something like this:

make an empty set
while (true){
    start timer
    add 1000000 new elements to the set (you can just use 64 bit integers)
    end timer and print time
}

You do not have to use the generator to produce items to be added to your set. Instead, you can just use a counter which is incremented after each add operation. You can stop the process once you notice that it seems to take a long time (>1 min)to add the batch of elements or when you run out of memory (which seems more frequent). On Linux you can limit the amount of memory the process can use. After executing ulimit -Hv 5000000 processes started in that terminal can not use more than 5GB of memory. (This is not strictly necessary but prevents the machine from swapping.)

Most likely, the timings will not be linear. If the set you used is based on hashing, the time for adding elements is basically constant independent on the set size. However, once in a while the table used to store the values needs to be re-sized, which will take longer for larger sets.

You need to report

the amount of memory you have,
the language and set type you used
a chart of the timings,
and how many elements you could store.

If you have set specific memory options, you should report these as well.

Part II

In order to work on Amazon infrastructure, you first need to get your personal password from http://jyu-aws.appspot.com and log-in to AWS.

The teacher has placed a set of data on Amazon S3 which contains the first 10000-10^10 elements of a stream in bucket s3://data-for-stream-counting-5e33f2e9-bb84-4c40-bb03-5bb086c7adea/ on amazon web services (change: 10^10 has now been removed since it takes to long to process, causing all other students to wait.). (You can browse S3, starting from here).

You task is to count the exact number of distinct items in the datasets of sizes 10000 till 10^8 using map-reduce. The dataset is there in larger sizes as well, but processing these will take quite a bit of time. To get started, you should switch to the Elastic Map Reduce (EMR) service in AWS. All map-reduce tasks must be run on the cluster Count Distinct Cluster2 which is in availability zone us-east-1 (US East N.Virginia). When starting jobs, use identifiers from which you can be identified (korppi ID) to avoid interference with others’ work. For storing any data, use your personal bucket for this course jyu-ties438-username-randomUDDI (already created).

Computing distinct values using map reduce can be done fairly straightforward, you can implement it by following the pseudocode below.

map(key, value) -> value, null
reduce(key, values) -> 1, null

A combiner should be used to save communication, since the map function produces a lot of duplicates. It could be written as follows:

combine(key, values) -> key, null

If you are working with Java you need to export your mapreduce code to a runnable .jar file and place it in your S3 bucket. (Note: do not use anything from the old .mapred. API, use everything from the new .mapreduce. instead.) Read the MapReduce Tutorial which was mentioned above. Then you have to launch the map-reduce job as described in Submit a Custom JAR Step Using the Console of EMR DeveloperGuide - launch-custom-jar. The map-reduce job will generate a number of output files in your output bucket, each containing lines with a one. You can use these to determine the size by getting the size of each file and dividing it by two (each line is a one and a new-line, thus two bytes). You can also use counters (either internal ones or your own) to determine the size. Alternatively, you can use a second map-reduce job to determine the number of lines in the files.

First test on the small dataset and see whether you can get the correct result data10000 should give you 9925 distinct values. Then determine the sizes of the sets up till 10^8 (you are allowed to try larger sizes if you have time).

Part III

Implement two algorithms for cardinality estimation. The first one uses (order) statistics (sampling or K-minimum values) and the second one bit pattern observables (Flajolet-Martin (from handbook), LogLog or HyperLogLog - see also hints). For both algorithms, you need to calculate the root-mean-square error for 1,000,000 stream elements.

The data used is generated (download generator). This generator is written in Java and deterministically produces the elements on its standard output (one per line). This generator makes it possible to know the outcome and repeat the same stream multiple times, while there is no need to download a huge (theoretically infinite) amount of data.

By default ( java -jar generator2015.jar ) the generator will produce an infinite stream of elements.
Thousand elements can be shown as follows: java -jar generator2015.jar 1000.
Thousand elements randomly generated with a seed value of 4561 can be generated with java -jar generator2015.jar 1000 4561.
You can ask it to generate 10000 elements and pipe them to your program, like this: $ java -jar generator.jar 10000 | python myanalyzer.py. In your program you then read the elements from the standard input.

For both algorithms you need to estimate the cardinality for streams with seeds 1 till 10 and estimate the error. Concrete, for both experiments (2 algorithms) , you need calculate the root-mean-square error as follows:

RMSE

where the exact cardinalities can be found from the following table:

    | seed |  exact cardinality | exact cardinality | exact cardinality | exact cardinality |
    |     | (1,000,000) | (10,000,000) | (100,000,000) | (1,000,000,000) |
    | --- | ----------- | ------------ | ------------- | --------------- |
    | 1   | 697757      | 3753742      | 14899128      | 46929246        |
    | 2   | 698466      | 3755459      | 14899436      | 46932806        |
    | 3   | 698151      | 3753884      | 14905011      | 46926010        |
    | 4   | 698536      | 3755369      | 14901524      | 46927853        |
    | 5   | 697079      | 3752845      | 14907010      | 46923292        |
    | 6   | 697732      | 3753219      | 14901538      | 46930349        |
    | 7   | 697822      | 3754504      | 14899040      | 46920659        |
    | 8   | 698286      | 3755127      | 14903161      | 46924710        |
    | 9   | 698364      | 3753823      | 14900900      | 46925823        |
    | 10  | 697910      | 3754738      | 14901838      | 46929227        |

Hints

Use small part of the data during development.
For the programming language, you might just want to choose the one you are most familiar with. Depending on the chosen language the teacher will be able to help more (or less) with language specific issues.
Randomized algorithms are difficult to debug. Make it somehow possible by fixing the seed of the random number generator. At least each run will be the same.
When using Windows, be careful with the command line pipe | operator. It seems that in some cases the shell buffers all output from the first process and only forwards it to the next after the first one ends. If the stream is very large, this will cause it to either write to the disk (slow) or fill up your memory.
If you are using a machine with more than 2 cores, you can start multiple experiments at the same time. Since we are not measuring running time, they won’t affect each other.

The easiest to develop the map reduce part is using a Maven project. The needed dependency is :

 <dependency>
     <groupId>org.apache.hadoop</groupId>
     <artifactId>hadoop-client</artifactId>
     <version>2.6.0</version> <!-- amazon does not support newer/older versions -->
     <scope>provided</scope>
 </dependency>

When implementing k-minimum values, make sure the the k values which you keep are unique, i.e. do not contain duplicates. For keeping the k values, you could use a priority queue, but check whether it contains the value before adding.
The Flajolet-Martin algorithm from the course book is very slow because it needs to evaluate a hash function for each counter.
The LogLog and HyperLogLog algorithms in the articles have a small mistake in the linked articles (as was mentioned during the interactive demonstration of the hyperloglog algorithm). The number in the counters is not the number of leading zeros, but the index of the first 1, with indexes counting from 1. So the desired number is the number of zeros + 1.

Michael Cochez

Assistant Professor at Vrije Universiteit Amsterdam