# Content

- **Algorithms**
- Storage
- Hardware

Autumn 2015

Michael (Michael Cochez)

- Main teacher
- PhD student at MIT department
- Related work
- Evolving Knowledge Ecosystems for Big Data Understanding
- Locality-Sensitive Hashing for Massive String-Based Ontology Matching
- Scalable Hierarchical Clustering

- Ongoing:
- Prototype based ontologies
- Context dependent distance metrics
- Scalable graph summarization

Olga (Olga Kushanova)

- Assistant
- Master student at MIT department (WISE)
- Thesis: Co-evolution of data and query clusters

- What is Big Data? Why study this?
- Course content
- Practicalities
- First example

(on blackboard)

- Big Data Problems
- Problems
- Lots of data
- A belief that if you have enough data, then there will be something in there which is valuable
- Generously promoted by vendors of all sorts of tools.

- Volume: amount and dimensionality of data
- Velocity: speed of data accumulation (cf. data streams and concept drift)
- Variety: heterogeneity of data and formats (cf. audio & video)
- Veracity: quality of data can vary significantly (cf. sampling)

- To understand what might be possible
- and what not

- To be able to handle larger datasets
- To be able to create your own tools

- To know what the fuss is about

source: http://www.socmedsean.com/comic-the-critical-element-of-a-successful-big-data-strategy/

- Learn algorithms and generalizable skills, and not …
- Specific APIs
- Specific programming languages
- Certain platforms
- …
- Side targets
- Brush up some mathematics
- Learn the Git distributed version control system
- Use English as a communication language

- **Algorithms**
- Storage
- Hardware

- Hashing
- Nearest neighbor search
- Locality-sensitive hashing
- Map-reduce
- partially joint with TJTSM61 (Business Analytics and Big Data Management)
- Data streams
- sampling, filtering, dvc, counting ones, decaying windows
- Bloom filters
- Clustering
- hierarchical and point assignment
- Graphs
- PageRank
- graph clustering?
- formal concept analysis?
- Recommendation and Frequent item sets
- Set containment
- Holographic graph neuron?
- Material
- Mining of Massive Datasets - Anand Rajaraman, Jure Leskovec, Jeffrey D. Ullman
- Optional: Motwani and Raghavan. Randomized Algorithms. Cambridge, UK: Cambridge University Press, 1995. ISBN: 0521474655. (available at JYU through EBSCOhost https://jyu.finna.fi/Record/jykdok.1485577)
- Scientific and other articles
- The online course https://www.coursera.org/course/mmds overlaps partially with the course.
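To give a flavour of the approximate algorithms in the topic list above, here is a minimal sketch of a Bloom filter, one of the listed data-stream topics. It is an illustration only and not taken from the course material: the class name `BloomFilter`, the parameters `num_bits` and `num_hashes`, and the choice of salted SHA-256 as the hash family are all assumptions made for this example.

```python
import hashlib


class BloomFilter:
    """Approximate set membership: no false negatives, some false positives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        # num_bits and num_hashes are illustrative defaults, not tuned values.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item: str):
        # Derive num_hashes positions by salting one hash function (an assumption;
        # any family of independent hash functions would do).
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(item))


seen = BloomFilter()
for word in ["hashing", "clustering", "graphs"]:
    seen.add(word)
```

Every item that was added is always reported as possibly present. An item that was never added is usually, but not always, reported as absent. That trade of exactness for a small, fixed memory footprint is typical of the techniques listed above.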

- NoSQL (Partially joint with TJTSM61)
- Key-value stores
- Graphs
- Document store
- Column

- Memory characteristics
- hierarchy - transfer speeds

- High Performance Computing
- Short intro

- We only study a small part of the data science process
- We focus more on how we analyze than on what exactly we are analyzing for.
- We do not always care whether what we are doing makes sense.
- We will not interpret the final result.

source: http://en.wikipedia.org/wiki/Data_analysis

- Signal processing (See CIBA120)
- Dimensionality reduction (See summer school - except for LSH)
- Explicit NLP (some techniques are used in NLP)
- Explicit BI or analytics (See TJTSM61)
- A broad overview of data mining (See TIES445)

- oksa3.it.jyu.fi
- 16 HT cores Xeon E5-2670 @ 2.60GHz
- 128 GB RAM

- CSC - IT Center for Science (high performance computing)
- Sisu (~40000 cores, ranked ~51 in the world)
- Taito (~10000 cores)
- Amazon Web Services
- Map reduce

- Weekly cycle
- Two lectures
- One demo session

- Hands-on experience
- All content and tools are freely available

- Individual work
- Pair work
- Small test at the end of lectures

- Grading (Pass/fail, 70% needed)
- 40% individual work
- 50% group work
- 10% quiz
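As a small worked example of the weighting above, the sketch below combines three component scores into a total and checks it against the 70% threshold. It assumes that each component is scored from 0 to 100 and that the threshold applies to the weighted total; the slides do not state either explicitly. The names `WEIGHTS`, `weighted_total` and `passes` are made up for this illustration.

```python
# Weights in percent, as listed above: individual work, group work, quiz.
WEIGHTS = {"individual": 40, "group": 50, "quiz": 10}
PASS_THRESHOLD = 70.0  # assumed to apply to the weighted total


def weighted_total(scores: dict) -> float:
    """Combine component scores (each assumed to be 0-100) into one percentage."""
    return sum(WEIGHTS[part] * scores[part] for part in WEIGHTS) / 100


def passes(scores: dict) -> bool:
    return weighted_total(scores) >= PASS_THRESHOLD


example = {"individual": 80, "group": 70, "quiz": 50}
total = weighted_total(example)  # (40*80 + 50*70 + 10*50) / 100 = 72.0
```

Under these assumptions a weak quiz result can be compensated by the assignments, since the quiz carries only a tenth of the total weight.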

- Write a summary of a research article

- Individual or as a pair
- Usually a programming task

- Today
- Slow start with approximate algorithms?

- Tomorrow
- Distance measures, Hashing, Basics of Locality-sensitive hashing.

- On Wednesday
- First assignment?

- Next week Thursday
- Guest lecture (IBM) in Auditorium 3

- programming (at least programming 2)
- basic algorithms and data structures
- computational complexity.
- mathematics
- basics of sets, probability theory, linear algebra, and statistics.