Approximating Jaccard Distance Between Documents
Goal
In this exercise, we will implement an algorithm which speeds up the measurement of the Jaccard similarity between documents.
Prerequisites
In principle, the content presented during the lectures suffices to implement this task. However, it is certainly beneficial to study the corresponding text in chapter 3 of the Mining of Massive Datasets book. You can also watch the videos of the Coursera course related to LSH (in week 2).
Task
The task consists of four parts. First, you need to devise a way to convert the documents to sets. Then you need to be able to calculate the exact Jaccard distance between two of these sets. Next, you will use locality-sensitive hashing to calculate an approximation of the distance. Finally, you need to evaluate the quality of the approximation.
For this task, we will use a cleaned subset of the English Wikipedia.
The dataset is available at oksa3.it.jyu.fi:/home/wikidata/wikipedia_utf8_filtered_20pageviews.csv and can also be downloaded from https://blog.lateral.io/2015/06/the-unknown-perils-of-mining-wikipedia/.
On each line of the file, you will find an id and the text of an article.
Each article has to be taken as a document for this exercise.
The second file at your disposal can be found at oksa3.it.jyu.fi:/home/wikidata/words.txt. This file contains all words (punctuation removed, lowercased) occurring in the documents, with their absolute frequencies.
This task is performed either individually or as a pair.
You are free to work using the programming language you want.
You can use oksa3.it.jyu.fi to do this task (log in using ssh). Alternatively, you can do this task on your own computer.
Part I
Read the dataset one line at a time and apply the transformation to a set on each document. This should be done as follows (see the sketch after this list):
- Until all lines are read (if your implementation is not fast enough, you can limit yourself to 50,000 articles):
  - Read a line (there is one document on each line) and split off the article part.
  - Remove punctuation from the article, convert everything to lower case, and split the line into words.
  - Remove the 200 most frequent words (see words.txt) and words which appear less than 5 times (probably errors).
  - Apply one of the following (you can apply multiple if you want and have time left):
    - Apply stemming on each word, add all words to a set.
      - Check Porter stemming implementations for different programming languages.
    - Add the terms to a set in such a way that duplicates are preserved.
      - If the words are [A, A, B], create the set {A1, A2, B1}.
      - Optional: you can also use techniques like TF.IDF to determine the frequency in the final set.
    - Add all 2-grams to a set.
      - If the words are [A, B, C, D, E, F, A, B, C], then the set will be {AB, BC, CD, DE, EF, FA}.
      - Note: using this method will cause most document pairs to be at distance 1.
    - Add all 5-shingles to a set.
      - If the line is "abcd efg", then the set will be {"abcd ", "bcd e", "cd ef", "d efg"}.
  - Write the content of the obtained set to the output file.
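To make the pipeline concrete, here is a minimal Python sketch of the stemming variant. The "id,text" line format, the "word frequency" layout of words.txt, and the use of NLTK's PorterStemmer are assumptions for illustration; adapt them to the actual files and to the variant you pick.

```python
import string
from nltk.stem.porter import PorterStemmer  # one Porter implementation among many

stemmer = PorterStemmer()
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def load_word_filters(words_path, top_k=200, min_count=5):
    """Build filter sets from words.txt (assumed: one 'word frequency' pair per line)."""
    counts = {}
    with open(words_path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) == 2 and parts[1].isdigit():
                counts[parts[0]] = int(parts[1])
    frequent = set(sorted(counts, key=counts.get, reverse=True)[:top_k])
    rare = {w for w, c in counts.items() if c < min_count}
    return frequent, rare

def document_to_set(line, frequent, rare):
    """Transform one dataset line (assumed 'id,text') into a set of stems."""
    _, text = line.split(",", 1)                   # split off the article part
    words = text.translate(PUNCT_TABLE).lower().split()
    kept = (w for w in words if w not in frequent and w not in rare)
    return {stemmer.stem(w) for w in kept}         # stemming variant
```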
Part II
Implement a method or function which calculates the Jaccard distance between two sets of strings. To do this, you first compute the sizes of the union and intersection, then you apply the formula for the Jaccard distance: d(A, B) = 1 − |A ∩ B| / |A ∪ B|.
Now, you should be able to calculate the distance between two of the sets created in the previous part.
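A possible Python version, directly following the formula above (treating two empty sets as identical is a convention of this sketch, not something the task prescribes):

```python
def jaccard_distance(a, b):
    """Jaccard distance d(A, B) = 1 - |A ∩ B| / |A ∪ B| between two sets."""
    if not a and not b:
        return 0.0  # convention of this sketch: two empty sets are identical
    return 1.0 - len(a & b) / len(a | b)
```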
Part III
First, you need to implement a min-hash algorithm and then use it to approximate the distance between two documents. You could implement it using the somewhat optimized algorithm from the course book (section 3.3.5). It may be easier to use this implementation, or to port it to whatever language you are using.
Second, to calculate the approximate distance, we start from what was shown in class:

Pr[h_min(D1) = h_min(D2)] = sim(D1, D2)

And hence

d(D1, D2) = 1 − Pr[h_min(D1) = h_min(D2)]

When applying min-hash n times (using different permutations), we can approximate the distance by

d̂(D1, D2) = |{ i : h_i(D1) ≠ h_i(D2) }| / n

Note that n is a parameter to your approximation. A (column) vector containing the n min-hash evaluations (for each permutation, the index of the first '1' in the document's column of the characteristic matrix) for a document is its signature vector.
Now you should be able to compute the approximate Jaccard distance between a pair of sets.
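Below is a sketch of the hashing-based variant rather than the book's section 3.3.5 algorithm: each hash function maps the set elements to integers and the minimum is taken, which stands in for a random row permutation (this is exactly what the question under "Returning the task" asks about). The prime, the fixed seed, and the crc32 encoding are implementation choices of this sketch.

```python
import random
import zlib

PRIME = (1 << 61) - 1  # a large Mersenne prime for the hash arithmetic

def make_hash_functions(n, seed=42):
    """Parameters for n random hash functions h_i(x) = (a_i*x + b_i) mod PRIME."""
    rng = random.Random(seed)  # fixed seed, as the hints suggest
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(n)]

def signature(doc_set, hash_params):
    """Signature vector: the min-hash value for each hash function.
    Assumes a non-empty set."""
    # crc32 gives a stable integer encoding of each string; Python's
    # built-in hash() is randomized between runs.
    encoded = [zlib.crc32(w.encode("utf-8")) for w in doc_set]
    return [min((a * x + b) % PRIME for x in encoded) for a, b in hash_params]

def approx_distance(sig1, sig2, n):
    """Approximate distance: the fraction of the first n components that disagree."""
    disagree = sum(1 for x, y in zip(sig1[:n], sig2[:n]) if x != y)
    return disagree / n
```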
Part IV
To evaluate the quality of the approximation, we will observe how much it deviates from the real value. Concretely, we will calculate the average absolute error for 100 random distance calculations.
Work as follows:
- Select 100 pairs of documents uniformly at random.
- Load the corresponding sets (200 in total) and compute signature vectors of length 80.
- For each of the 100 distance calculations:
  - Calculate the real distance between D1 and D2.
  - For n in [5, 10, 15, 20, 25, 30, 35, 40, …, 80]:
    - Calculate the approximate distance (using only the first n hash function evaluations).
Finally, for each n, calculate the average error as follows:

error(n) = (1/100) · Σ over the 100 pairs |d(D1, D2) − d̂_n(D1, D2)|
Create a chart in which you plot the error as a function of n. (You can do this using a spreadsheet program or directly from your program; see the sketch below.)
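A sketch of the evaluation loop, plotting with matplotlib (one option; a spreadsheet works just as well). It assumes doc_sets is the list of sets from Part I and reuses the hypothetical helper functions from the earlier sketches.

```python
import random
import matplotlib.pyplot as plt

rng = random.Random(0)  # fixed seed so the evaluation is reproducible
pairs = [tuple(rng.sample(range(len(doc_sets)), 2)) for _ in range(100)]
hash_params = make_hash_functions(80)
sigs = {i: signature(doc_sets[i], hash_params) for pair in pairs for i in pair}

ns = list(range(5, 81, 5))  # n = 5, 10, ..., 80
avg_errors = []
for n in ns:
    total = sum(abs(jaccard_distance(doc_sets[i], doc_sets[j])
                    - approx_distance(sigs[i], sigs[j], n))
                for i, j in pairs)
    avg_errors.append(total / len(pairs))

plt.plot(ns, avg_errors, marker="o")
plt.xlabel("n (number of min-hash functions)")
plt.ylabel("average absolute error")
plt.savefig("error_vs_n.png")
```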
Returning the task
Create a repository in yousource in which you place
- All code you created
- The figure created in part IV
- A readme file with
- your name(s)
- instructions on how to compile/run your application (not needed for Java/Python)
- your answer to the question: “Which behavior did you observe in the created figure?”
- if you used the min-hash implementation provided, your answer to: “Why is it okay to just hash the words instead of doing a proper permutation?”
Note
Normally, you would pre-calculate the signatures for all documents and not keep the documents themselves in memory. After this has been done, you can answer queries for the distance between two documents very quickly. In general, it is a bad idea, or even impossible, to pre-calculate all the distances between documents, since these distances need O(N^2) storage space. For instance, for 2 million documents and 4 bytes per distance, this would be 7451 GiB. If we instead use signatures of length 100, using 4 bytes per min-hash outcome, we need 0.745 GiB of space.
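For reference, the arithmetic behind these figures (using binary gigabytes):

distances: 2·10^6 · (2·10^6 − 1) / 2 pairs × 4 bytes ≈ 8·10^12 bytes ≈ 7451 GiB
signatures: 2·10^6 documents × 100 values × 4 bytes = 8·10^8 bytes ≈ 0.745 GiB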
Hints
- Use a small part of the data during development. If you notice that your implementation is slow, make sure you only compute the min-hashes for documents used in the evaluation.
- For the programming language, you might just want to choose the one you are most familiar with. Depending on the chosen language, the teacher will be able to help more (or less) with language-specific issues.
- Randomized algorithms are difficult to debug. Make debugging feasible by fixing the seed of the random number generator; then at least each run will be the same.
- People using Java should check the Guava Libraries, in particular classes for splitting strings and working with sets.
- The operation for permuting lists is most often called shuffle. For instance, in Java Collections.shuffle(List<?> list, Random rnd) or in Python random.shuffle(x[, random]). Note that both of these permute the list in-place (no copy is made).
- Most of the distances will be 1 or very close to 1. Therefore, it might be that your approximation always returns a distance of 1 for a low n.