TIES438- Big Data Engineering

December 2017 II

Lectures

(each lecture is two blocks of two lecturing hours, i.e., each lecture is 4 contact hours)

Lecture 1 : Introduction lecture A simple example of an approximate algorithm for Hamming distance
- Topics: Distance metrics, min-hashing, random hyperplane hashing, combing rows and bands
- Sources:
  - Mining massive datasets (mmds, see literature below), chapter 1, chapter 3
  - https://people.csail.mit.edu/indyk/p117-andoni.pdf (except for Euclidean distance)
Lecture 2:
- Topics: Locality-sensitive hashing, LSH forest, hashing overview, birthday paradox, preprocessing text for distance computation
- Sources:
  - Mining massive datasets (mmds, see literature below), chapter 1, chapter 3
  - https://people.csail.mit.edu/indyk/p117-andoni.pdf (except for Euclidean distance)
  - The information on hash functions comes from various sources, the most important information is in the sources above. A hash function recommended to use for general (non-secure) hashing is murmur3F (see source ), which is also available in, for example, the Guava library for Java.
- Topics: Data streams, stream sampling (fixed size, fraction), windows (sliding, tumbling), bloom filters
- Sources:
  - mmds chapter 4
  - The wikipedia article on bloom filters is well written
Lecture 3:
- Topics: Dacaying windows, Counting distinct values: sampling, K-min values, Flajolet Martin 1983, LogLog, HyperLogLog
- Sources:
  - mmds chapter 4
  - https://research.neustar.biz/2012/07/09/sketch-of-the-day-k-minimum-values/
  - https://research.neustar.biz/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/
Lecture 4:
- Topics: Curse of dimensionality, clustering, K-means, hierarchical clustering, density based clustering (DBSCAN), CURE, Twister Tries
- Sources:
  - mmds chapter 7
  - http://mathworld.wolfram.com/HypercubeLinePicking.html
  - The basic principles are well explained in the wikipedia article on DBSCAN
  - http://users.jyu.fi/~miselico/papers/twister_tries.pdf
Lecture 5:
- Big data ar rest: storage and processing
- Storage slides
- Processing slides
- Also useful in this context are the sections 2.1, 2.2, and 2.4 of the mmds book.
- More details on the foundations of Spark can be found fromt the usenix paper.
Lecture 6:
- Topics: graphs, PageRank, Personalized PageRank (PPR), Bookmark Coloring Algorithm (BCA), Graph Embeddings (TransE, global embeddings)
- Sources:
  - graphs, PageRank : mmds chapter 5 (section 5.1) An explanation similar to the one in the lecture can be found in the online course.
  - PPR is called topic-sensitive pagerank in mmds section 5.3 (not covered 5.3.4)
  - BCA (this paper also has an explanation of PPR) : Berkhin, Pavel. “Bookmark-coloring algorithm for personalized pagerank computing.” Internet Mathematics 3.1 (2006): 41-62.
  - TransE: we only covered the basic idea behind translational embeddings and the optimization function. Paper : Bordes, Antoine, et al. “Translating embeddings for modeling multi-relational data.” Advances in neural information processing systems. 2013.
  - Global embeddings: We only covered the overall idea from the slides. Paper with all details: Cochez, M., Ristoski, P., Ponzetto, S. P., & Paulheim, H. (2017). Global RDF Vector Space Embeddings. In C. d’Amato, et. al (Eds.), ISWC 2017 - The Semantic Web : 16th International Semantic Web Conference, Proceedings, Part I (pp. 190-207). Lecture Notes in Computer Science, 10587. Cham: Springer.[link]

Exercises (there will be 3 exercises altogether.)

(See also course completion modes below)

Exercise 1 : Locality-sensitive hashing for nearest neighbor finding - classifiers. Finding Similar Items - Fighting Spam [Due 15 January]
Exercise 2: Estimating the number of distinct items in a stream. Approximating Stream Cardinalities [Due 15 January]
Exercise 3: Analyzing Graphs using Spark [Due 15 January]

Individual task

The individual task is the summarization of a research paper related to the Big Data Engineering field. Instructions [Due 15 January]

During the course multiple facets related to the Big Data phenomenon will be studied. First, students will get introduced to large data sets and streaming data. Then, example storage solutions and processing algorithms will be studied. Finally, we will look into hardware considerations and apply the theory on real world datasets related to news, wikipedia, brain analysis, biology, chemistry, etc.

Students who wish to work on a problem specific to their own research should discuss this with the teacher at the beginning of the course.

Learning outcomes:

After completion of this course the students will understand the concepts related to, and the intrinsic characteristics of big amounts of data. The student will then be able to evaluate algorithms and technology to deal with problems in which big amounts of data are involved.

Prerequisites:

The student should know how to program (at least programming 2) and be familiar with algorithms, data structures and computational complexity. Further, the student should have notion of sets, probability theory, linear algebra, and statistics.

Modes of study:

Students should attend the lectures and read the assigned materials. Further, the implementation of algorithms is intended to assist the students in their understanding of the course content.

Completion mode:

The course is completed by either:

implementing the assigned tasks (individually or as a pair) and individually write a summary of one research paper. OR
passing the exam at the end of the intensive course or at a later point in the next semester.

Literature:

Mining massive data sets - Anand Rajaraman, Jure Leskovec, Jeffrey D. Ullman free download from http://www.mmds.org/
The online course on the same book (see https://www.mmds.org ) overlaps partially with the course.
Optional: Motwani, and Raghavan. Randomized Algorithms. Cambridge, UK: Cambridge University Press, 1995. ISBN: 0521474655. (available in JYU trough EBSCOhost https://jyu.finna.fi/Record/jykdok.1485577 )
Optional: Data Clustering: Algorithms and Applications - Charu C. Aggarwal, Chandan K. Reddy 2013 by Chapman and Hall/CRC ISBN 9781466558212

Master thesis topics

See the master thesis supervision page

Michael Cochez

Assistant Professor at Vrije Universiteit Amsterdam