# TIES438 - Big Data Engineering

**Autumn 2015 II** Check here for the 2014 edition

## Lectures

- Lecture 0: Introduction lecture. A simple example of an approximate algorithm for Hamming distance.
- Lecture 1: Distance metrics, document similarity, approximate Jaccard distance using min-hash.
- Lecture 2: Managing with Big Data at rest (part I) – Alexander Semenov (slides)
- Lecture 3: Efficient min-hash, finding near neighbors using LSH, angular distance (random hyperplane hashing) and K-NN classifiers.
- Lecture 4: Managing with Big Data at rest (part II) – Alexander Semenov (slides)
- Lecture 5: Managing with Big Data in motion – Alexander Semenov (slides together with lecture 4 slides)
- Lecture 6: Guest lecture by IBM: Big Data Analytics (slides)
- Lecture 7: Overview of hashing. Data streams, counting distinct items.
- Lecture 8: Counting distinct items (continued), Bloom filters.
- Lecture 9: Decaying windows; graphs: PageRank.
- Lecture 10: Q&A about exercises
- Lecture 11: High performance computing: Parallel computing, MPI (slides)
- Lecture 12: HPC in biochemistry (slides), MPI – collective operations (slides together with lecture 11), and code examples for MPI (yousource, GitHub)
- Lecture 13: Clustering (hierarchical, point assignment)

## Exercises (there will be 3 exercises altogether)

- Exercise 1: Locality-sensitive hashing for approximating distance. Approximating Jaccard Distance Between Documents [Due ~~24 November~~ 1 December]
- Exercise 2: Locality-sensitive hashing for nearest neighbor finding - classifiers. Finding Similar Items - Fighting Spam [Due 1 December]
- Exercise 3: Estimating the number of distinct items in a stream. Approximating Stream Cardinalities [Due 15 December]

## Individual task

- The individual task is to summarize a research paper related to the field of Big Data Engineering. Instructions [Due 31 December (can be sent earlier)]

## Contents:

During the course, multiple facets of the Big Data phenomenon will be studied. First, students will be introduced to large data sets and streaming data. Then, example storage solutions and processing algorithms will be studied. Finally, we will look into hardware considerations and apply the theory to real-world datasets related to news, Wikipedia, brain analysis, biology, chemistry, etc.

Students who wish to work on a problem specific to their own research should discuss this with the teacher at the beginning of the course.

## Learning outcomes:

After completing this course, students will understand the concepts related to, and the intrinsic characteristics of, large amounts of data. They will then be able to evaluate algorithms and technologies for problems that involve large amounts of data.

## Prerequisites:

The student should know how to program (at least Programming 2) and be familiar with algorithms, data structures and computational complexity. Further, the student should have a basic understanding of sets, probability theory, linear algebra, and statistics.

## Modes of study:

Students should attend the lectures and read the assigned materials. Further, the implementation of algorithms is intended to assist the students in their understanding of the course content.

## Completion mode:

The course is completed by implementing the assigned tasks. A small part of the evaluation is done by a quiz during the last lecture. Further, each student should write a summary of one research paper.

## Literature:

- Mining of Massive Datasets - Anand Rajaraman, Jure Leskovec, Jeffrey D. Ullman (free download from http://www.mmds.org/)
- Optional: Motwani and Raghavan. Randomized Algorithms. Cambridge, UK: Cambridge University Press, 1995. ISBN: 0521474655. (available at JYU through EBSCOhost https://jyu.finna.fi/Record/jykdok.1485577 )
- The online course https://www.coursera.org/course/mmds overlaps partially with the course.

## Master's thesis topics

See the master's thesis supervision page

## Links

- Course information in Korppi TIES438