Analyzing Graphs using Spark
Goal
The goal of this exercise is to get familiar with the basics of a graph library that can be used to analyze large graphs. Spark GraphX will be used to analyze a rather small, but representative, data set in such a way that the devised algorithm can also be applied to larger graphs.
Prerequisites
Besides the content presented during the lectures, you have to read through the documentation of the Spark GraphX library. As a programming language, you can use Scala, Java, or Python. Each of these is well supported in the Spark framework. However, most examples in the documentation are in Scala.
Task
The task consists of four parts. First, you need to set up the environment and get familiar with the Spark framework. Then, you need to load the graph data and familiarize yourself with the Spark GraphX library to prepare for part three, in which you have to compute PageRank on the graph. In the last part, you have to attempt to find communities in the graph and evaluate your algorithm. This task is performed individually or as a pair.
Part I - Setting up the Spark environment
To work on this task, you have to set up a local Spark environment.
If you are using Linux, this is not very complicated, as there is a pre-compiled package which also includes Hadoop.
If you have not set up Spark before, choose release 2.2.1 and package type "Pre-built for Apache Hadoop 2.7 and later" from https://spark.apache.org/downloads.html.
If you are using Windows, you are advised to use a virtual machine with a Linux operating system. If you insist, using Windows directly should be possible, and the following Stack Overflow answer might get you to a working system: https://stackoverflow.com/a/38735202.
Then, if you have not used Spark before, follow the Quick Start guide in the Spark documentation.
Part II - Getting the dataset and loading it into Spark
The dataset we will be using can be downloaded from http://snap.stanford.edu/data/email-Eu-core.html . You need both files to complete the task. Extract both files before using them.
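To get started, you can load the email exchange data into Spark from the Scala shell (where sc is the predefined SparkContext) using:
import org.apache.spark.graphx.GraphLoader
val graph = GraphLoader.edgeListFile(sc, "/PATH/TO/FILE/email-Eu-core.txt")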
This gives you the loaded graph. You are encouraged to experiment with it in the Scala shell. To prepare for these experiments, read through the Spark GraphX guide: https://spark.apache.org/docs/latest/graphx-programming-guide.html . You will also need parts of that guide for the following parts.
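As a starting point for such experiments, the following sketch collects a few basic statistics from a loaded graph. The helper name summarize is hypothetical and not part of GraphX; it assumes the Graph[Int, Int] type that GraphLoader.edgeListFile returns and a graph with at least one edge.

```scala
import org.apache.spark.graphx.{Graph, VertexId}

// Hypothetical helper: returns (vertex count, edge count, vertex with the most incoming edges).
// Assumes the graph has at least one edge; reduce fails on an empty inDegrees RDD.
def summarize(graph: Graph[Int, Int]): (Long, Long, (VertexId, Int)) = {
  // inDegrees holds one (vertexId, inDegree) pair per vertex that receives at least one edge
  val busiest = graph.inDegrees.reduce((a, b) => if (a._2 >= b._2) a else b)
  (graph.numVertices, graph.numEdges, busiest)
}
```

In the Scala shell, summarize(graph) can be called directly on the graph loaded above.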
Part III - Finding the most important person
Assuming that the most important person is the one who receives messages from many different important people, find this person by performing a PageRank computation and then finding the ID of the person with the highest PageRank.
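One possible way to approach this is sketched below, using the pageRank operator that GraphX provides on graphs. The helper name topRanked and the tolerance default of 0.0001 are assumptions chosen for illustration, not values prescribed by the task.

```scala
import org.apache.spark.graphx.{Graph, VertexId}

// Hypothetical helper: runs PageRank until convergence within `tol`
// and returns the (vertexId, rank) pair with the highest rank.
def topRanked(graph: Graph[Int, Int], tol: Double = 0.0001): (VertexId, Double) = {
  // pageRank returns a graph whose vertex attribute is the rank of that vertex
  val ranks = graph.pageRank(tol).vertices
  ranks.reduce((a, b) => if (a._2 >= b._2) a else b)
}
```

Because the edges of the loaded graph point from source to destination as listed in the edge file, vertices that receive edges from highly ranked vertices end up with higher ranks.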
Part IV - Attempting to Group People
This part is open-ended. The second file of the dataset contains the departments in which the people sending the messages work. The goal is to find some way to group the people by these departments using only information from the message graph. Note that even the best algorithms do not perform very well on this task. So, the goal is not to get a very good outcome, but rather to get to know some more of the possibilities of the GraphX library.
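As one possible starting point, and not the expected solution, the sketch below uses the label propagation algorithm that ships with GraphX. The helper names communities and parseLabel are hypothetical, the default of 5 steps is an arbitrary choice, and the parser assumes that each line of the department file holds a node ID and a department ID separated by whitespace.

```scala
import org.apache.spark.graphx.{Graph, VertexId}
import org.apache.spark.graphx.lib.LabelPropagation
import org.apache.spark.rdd.RDD

// Hypothetical helper: assigns each vertex a community ID via label propagation.
// The result contains (vertexId, communityId) pairs.
def communities(graph: Graph[Int, Int], maxSteps: Int = 5): RDD[(VertexId, VertexId)] =
  LabelPropagation.run(graph, maxSteps).vertices

// Hypothetical helper: parses one line of the department file.
// Assumes the format "<nodeId> <departmentId>".
def parseLabel(line: String): (VertexId, Int) = {
  val Array(id, dept) = line.trim.split("\\s+")
  (id.toLong, dept.toInt)
}
```

The department labels could then be loaded with sc.textFile on the second dataset file, followed by map(parseLabel). Label propagation is only one option; other GraphX algorithms such as connected components can be tried in the same way.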
You have to evaluate the approach you created using weighted precision and weighted recall, both of which are described in the Spark documentation on multiclass classification.
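The following sketch shows how such an evaluation might look with the MulticlassMetrics class from spark.mllib. It rests on one assumption that the task does not prescribe: every community is mapped to the department that occurs most often among its members, and that department is used as the prediction for all members. The helper name evaluate is hypothetical.

```scala
import org.apache.spark.graphx.VertexId
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.rdd.RDD

// Hypothetical helper: returns (weighted precision, weighted recall).
// communities: (vertexId, communityId), departments: (vertexId, departmentId)
def evaluate(communities: RDD[(VertexId, VertexId)],
             departments: RDD[(VertexId, Int)]): (Double, Double) = {
  // (community, true department) for every vertex present in both inputs
  val byCommunity = communities.join(departments).values
  // Assumption: the predicted department of a community is its most frequent department
  val majority = byCommunity
    .map { case (community, dept) => ((community, dept), 1) }
    .reduceByKey(_ + _)
    .map { case ((community, dept), n) => (community, (dept, n)) }
    .reduceByKey((a, b) => if (a._2 >= b._2) a else b)
    .mapValues(_._1)
  // MulticlassMetrics expects (prediction, label) pairs of doubles
  val predictionAndLabel = byCommunity
    .join(majority)
    .map { case (_, (dept, predicted)) => (predicted.toDouble, dept.toDouble) }
  val metrics = new MulticlassMetrics(predictionAndLabel)
  (metrics.weightedPrecision, metrics.weightedRecall)
}
```

Other ways of matching communities to departments are possible and may give different scores, so the chosen mapping should be stated when reporting results.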
Returning the task
- Everything is returned in a git repository
- Yousource : https://yousource.it.jyu.fi/
- Make a repository and add the teacher as a collaborator.
- You can also keep using the repository you used for the first task.
- If unfamiliar with git
- Read the Pro Git book online and the instructions on Yousource
- Ask in the group
- The deadline for this task is January 15.
Hints
- Use the Scala shell for experimenting.