# TIES438 - Big Data Engineering

**Autumn II 2014** Check here for the latest edition

## Lectures

- Lecture 0 : Introduction lecture, short guest lecture
- Lecture 1 - Guest lecture by IBM : Big Data Analytics (the slides are only accessible for course participants in the bucket `slides-ibm-talk` on AWS S3)
- Lecture 2 : Distance measures, Basics of Locality-sensitive hashing, and random hyperplane hashing.
- Lecture 3 : Hashing, combining multiple Locality-sensitive hash functions, and Min-hashing.
- Lecture 4 : NoSQL Databases - by Alexander Semenov
- Lecture 5 : Graphs - PageRank
- Lecture 6 : Parallel computing, Map-reduce - by Alexander Semenov
- Lecture 7 : Streams (data in motion): cardinality estimation
- Lecture 8 : Streams: Bloom filters, decaying windows
- Lecture 9 : High performance computing : Introduction and Message Passing Interface (MPI)
- Lecture 10 : An example of the use of HPC in biochemistry and code examples for MPI (yousource, github)
- Lecture 11 : The curse of dimensionality and clustering: hierarchical clustering and point assignment in large and high dimensional datasets.
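As a rough illustration of the random hyperplane hashing named in Lecture 2, the sketch below hashes each vector to a bit signature (one bit per random hyperplane) and estimates the angle between two vectors from the fraction of differing bits. It is not taken from the course material; the function names, the Gaussian sampling of hyperplane normals, and the signature length are assumptions made for this example.

```python
import math
import random


def random_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random hyperplane normals with Gaussian components (assumed setup)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def signature(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side, else 0."""
    return [
        1 if sum(h * v for h, v in zip(plane, vector)) >= 0 else 0
        for plane in hyperplanes
    ]


def estimate_angle(sig_a, sig_b):
    """Estimate the angle (radians) as pi times the fraction of differing bits."""
    differing = sum(1 for x, y in zip(sig_a, sig_b) if x != y)
    return math.pi * differing / len(sig_a)


def exact_angle(u, v):
    """Exact angle between two non-zero vectors, for comparison."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return math.acos(max(-1.0, min(1.0, dot / norm)))
```

Longer signatures give a tighter estimate at the cost of more dot products per vector; the signature depends only on the direction of a vector, not on its length.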

## Exercises

- Exercise 1 : Random hyperplane hashing for fast approximate image retrieval. Finding Similar Items - Images
- Exercise 2 : Locality-sensitive hashing for approximating distance. Approximating Jaccard Distance Between Documents
- Exercise 3 : Estimating the number of distinct items in a stream. Approximating Stream Cardinalities
- Exercise 4 : Basic usage of MPI and analysis of simulation data. Analysis of photo-isomerization experiments
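To give a flavour of the kind of approximation Exercise 2 is about, here is a minimal min-hashing sketch that estimates the Jaccard similarity of two sets from short signatures (the Jaccard distance is one minus that value). It is only an illustration under stated assumptions, not the exercise solution: the helper names, the use of `zlib.crc32` as a stable base hash, and the hash family `(a*x + b) mod p` are choices made for this example, and the sets are assumed to be non-empty.

```python
import random
import zlib

PRIME = (1 << 61) - 1  # Mersenne prime used as the modulus of the hash family


def make_hash_params(num_hashes, seed=0):
    """Draw (a, b) pairs defining hash functions x -> (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(num_hashes)]


def minhash_signature(items, params):
    """For each hash function, keep the minimum hash value over all items in the set."""
    base = [zlib.crc32(item.encode("utf-8")) for item in items]
    return [min((a * x + b) % PRIME for x in base) for a, b in params]


def estimate_jaccard_similarity(sig_a, sig_b):
    """The fraction of matching signature positions estimates the Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


def exact_jaccard_similarity(set_a, set_b):
    """Exact Jaccard similarity, for comparison on small inputs."""
    return len(set_a & set_b) / len(set_a | set_b)
```

Both signatures must be built with the same `params`; otherwise the positions are not comparable and the estimate is meaningless.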

## Individual task

- The individual task is the summarization of a research paper related to the Big Data Engineering field. Instructions

## Contents:

During the course, multiple facets of the Big Data phenomenon will be studied. First, students will be introduced to large data sets and streaming data. Then, example storage solutions and processing algorithms will be studied. Finally, we will look into hardware considerations and apply the theory to real-world datasets related to news, Wikipedia, brain analysis, biology, chemistry, etc.

Students who wish to work on a problem specific to their own research should discuss this with the teacher at the beginning of the course.

## Learning outcomes:

After completing this course, students will understand the concepts related to, and the intrinsic characteristics of, large amounts of data. They will then be able to evaluate algorithms and technologies for problems that involve large amounts of data.

## Prerequisites:

The student should know how to program (at least Programming 2) and be familiar with algorithms, data structures, and computational complexity. Further, the student should have a notion of sets, probability theory, linear algebra, and statistics.

## Modes of study:

Students should attend the lectures and read the assigned materials. Further, the implementation of algorithms is intended to assist the students in their understanding of the course content.

## Completion mode:

The course is completed by implementing the assigned tasks. A small part of the evaluation is based on quizzes held during the lectures. In addition, each student should write a summary of one research paper.

## Literature:

- Mining of Massive Datasets - Anand Rajaraman, Jure Leskovec, Jeffrey D. Ullman, free download from http://www.mmds.org/
- The online course https://www.coursera.org/course/mmds overlaps partially with the course.

## Links

- Course information in Korppi TIES438