# Content

- **Algorithms and Processing**
- Storage
- Hardware

December 2017

Michael (Michael Cochez)

- Postdoc researcher
- Scientific advisor

- Related work (PhD)
- Evolving Knowledge Ecosystems for Big Data Understanding
- Locality-Sensitive Hashing
- Scalable Hierarchical Clustering

- Ongoing:
- Knowledge Graph Embedding
- Sampling Data Streams
- Frequent Pattern Mining
- Approximate Nearest Neighbor Search (LSH)
- Evolutionary Computing
- Deep Learning (cancer prediction)
- Privacy preserving KG integration gain
- Parking availability estimation

- What is Big Data? Why study this?
- Course content
- Practicalities

What do you think?

Key/buzzwords

Technologies

Algorithms

- Big Data Problems
- Problems
- Lots of data

- A belief that if you have enough data, there will be something valuable in it
- Eagerly promoted by vendors of all sorts of tools.

- Volume: amount and dimensionality of data
- Velocity: speed of data accumulation (cf. data streams and concept drift)
- Variety: heterogeneity of data and formats (cf. audio & video)
- V…
- Veracity: quality of data can vary significantly (cf. sampling)

- To understand what might be possible
- and what not

- To be able to handle larger datasets
- To be able to create your own tools

- To know what the fuss is about

source: http://www.socmedsean.com/comic-the-critical-element-of-a-successful-big-data-strategy/

- Learn algorithms and generalizable skills, and not …
- Specific APIs
- Specific programming languages
- Certain tools
- …

- Side targets
- Brush up some mathematics
- Learn the Git distributed version control system
- Use English as a communication language

- programming (at least programming 2)
- algorithms and data structures
- computational complexity.
- mathematics
- basics of sets, probability theory, linear algebra, and statistics.

- **Algorithms and Processing**
- Storage
- Hardware

- Hashing
- Nearest neighbor search
- Locality-sensitive hashing
- LSH forest
- Map-reduce
- Practically if resources allow
- Data streams
- sampling, filtering, dvc, counting ones, decaying windows
- Bloom filters
- Clustering
- hierarchical and point assignment
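
To make one of the topics above concrete, here is a minimal sketch of a Bloom filter in Python. It is an illustration only, not the course's reference implementation: the class name `BloomFilter`, the method names, the default sizes, and the choice of salted SHA-256 as the hash family are all assumptions made for this example.

```python
import hashlib


class BloomFilter:
    """Approximate set membership: no false negatives, some false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        # Sizes are arbitrary illustration values, not tuned recommendations.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item):
        # Derive num_hashes positions by salting one hash function with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        # Set every bit position that the item hashes to.
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # True if all positions are set; this can be a false positive.
        return all(self.bits[pos] for pos in self._positions(item))


seen = BloomFilter()
seen.add("stream-element-1")
print(seen.might_contain("stream-element-1"))  # an added item is always reported
```

The filter never forgets an added item, but it may report an item that was never added. That trade-off is what makes it usable for filtering a stream in bounded memory.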

- Graphs
- (personalized) PageRank
- Embedding
- graph clustering?
- formal concept analysis?
- Recommendation and Frequent pattern mining
- Set containment

- NoSQL
- Key-value stores
- Graphs
- Document store
- Column

- Memory characteristics
- hierarchy - transfer speeds

- High Performance Computing
- Short intro
- If resources are obtained!

Mainly blackboard teaching

Some parts from the course website (under construction)

- Material
- Mining of Massive Datasets - Anand Rajaraman, Jure Leskovec, Jeffrey D. Ullman
- Scientific and other articles
- Optional: Motwani and Raghavan. Randomized Algorithms. Cambridge, UK: Cambridge University Press, 1995. ISBN: 0521474655. (available at JYU through EBSCOhost https://jyu.finna.fi/Record/jykdok.1485577 )
- Optional: Data Clustering: Algorithms and Applications - Charu C. Aggarwal, Chandan K. Reddy. Chapman and Hall/CRC, 2013. ISBN: 9781466558212.
- The online course https://www.coursera.org/course/mmds overlaps partially with the course.

- We only study a small part of the data science process
- We focus more on how we analyze than on what exactly we are analyzing for.
- We do not always care whether what we are doing makes sense.
- We will not interpret the final result.

source: http://en.wikipedia.org/wiki/Data_analysis

- Signal processing (See CIBA120)
- Dimensionality reduction (except for LSH; see also TIES445)
- Explicit NLP (some techniques are used in NLP)
- Explicit BI or analytics (See TJTSM61)
- A broad overview of data mining (See TIES445)
- Deep Learning for Cognitive Computing (TIES 4910)
- Watson Technologies (ITKA 352)

- Unclear for now :-(

- Very intensive
- Two lectures / day
- One demo session / day

- Hands-on experience
- All content and tools are freely available

- Individual work
- Pair work

Two options

- If passing in December required:
- Exam

- Otherwise:
- Exam
- Completing all course assignments (pair) + article summary (individual)

- Write a summary of a research article

- Individual or as a pair
- Mostly programming tasks

- Today
- Start with an approximate algorithm
- Locality-sensitive hashing
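
As a preview of the kind of approximate algorithm meant here, the following is a rough sketch of MinHash signatures with banding, one common form of locality-sensitive hashing for set similarity. It is not taken from the course material. The function names, the hash family `(a*x + b) mod p`, the use of CRC32 to turn strings into integers, and the example sets are all assumptions chosen for illustration.

```python
import random
import zlib

PRIME = (1 << 61) - 1  # Mersenne prime, larger than any 32-bit CRC value


def make_hash_params(num_hashes, seed=0):
    # Each (a, b) pair defines one hash function h(x) = (a*x + b) mod PRIME.
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash_signature(items, params):
    # items must be a non-empty collection of strings.
    values = [zlib.crc32(item.encode("utf-8")) for item in items]
    # For every hash function, keep the smallest hash value over the set.
    return [min((a * v + b) % PRIME for v in values) for a, b in params]


def estimate_jaccard(sig_a, sig_b):
    # The fraction of agreeing positions estimates the Jaccard similarity.
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


def band_keys(signature, num_bands):
    # Split the signature into bands; sets sharing any band key become
    # candidate pairs and would land in the same bucket of a hash table.
    rows = len(signature) // num_bands
    return [(band, tuple(signature[band * rows:(band + 1) * rows]))
            for band in range(num_bands)]


params = make_hash_params(num_hashes=20)
doc_a = {"big", "data", "algorithms", "hashing"}
doc_b = {"big", "data", "algorithms", "streams"}
sig_a = minhash_signature(doc_a, params)
sig_b = minhash_signature(doc_b, params)
print(estimate_jaccard(sig_a, sig_b))  # approximates the true Jaccard 3/5
print(set(band_keys(sig_a, 4)) & set(band_keys(sig_b, 4)))  # shared buckets, if any
```

With only 20 hash functions the estimate is noisy. More hash functions tighten it, and the number of bands controls how similar two sets must be before they are likely to share a bucket.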