Analysis of photoisomerization experiments
Goal
The goal of this task is to write a program for a supercomputer which uses the Message Passing Interface (MPI) to communicate between the processes.
Notes
Some small things might still change in the task description.
The analysis proposed is fairly bad. It makes several assumptions which are not strongly supported with argumentation. Making a deeper analysis of the problem would require much more insight into the problem and knowledge of nonlinear dimensionality reduction techniques. Further, the dataset is actually reasonably small and hence the proposed analysis could be done on a normal computer (we are using the supercomputer to try the MPI infrastructure).
Prerequisites
In principle, the content presented during the lectures suffices to implement this task. This includes the code examples for MPI (yousource, github). More information about MPI can be found in the tutorial on mpitutorial.com. The parts most useful for this task are those on broadcast and collective communication and on scatter and gather. Detailed information can be found on the MPI Forum website.
Details about the hardware used in this task can be found on the CSC webpage about Taito. Some general commands to compile and run code can be found in the MPI examples mentioned above.
Task
The goal of this task is to solve a chemistry problem. It is known that certain molecules change their structure when they interact with photons (light). Further, certain molecule shapes are more beneficial than others, and some shape changes can be used to cause other desired effects. Hence, chemists would like to find out what kind of molecules they need in order to obtain certain shape changes.
In this exercise you are given data (locations and velocities of atoms) of a C1−N2=C3−C4=C5−C6 molecule, which is sensitive to photon interaction. Further, you are given the result of the interaction with the photon as 1 (wanted change observed), -1 (no change observed), or 0 (a change observed, but it is not relevant). The goal is now to see whether we can find features in the data which predict the outcome. Once these features are found, a chemist could prepare the molecules such that they have the right starting shape and hence only the desired change would happen.
For this task, you are provided with an account on the Taito supercomputer and the datasets which have been placed in your home drive. There are two datasets (diabatic and ffsh) which are obtained using different simulation techniques.
As mentioned in the notes, the analysis performed is fairly simple and most likely you will not obtain a good result. You are allowed to experiment with more advanced methods as well but, since the main part of the exercise is to learn how to use the message passing interface (MPI), you have to use multiple communicating processes on the supercomputer.
This task is performed either individually or as a pair. You are free to work using the programming language you want, however, getting the MPI to work in a language other than C, C++ or Fortran might be highly nontrivial.
Part I
Read the data from the file; a simple C parser is provided here. You can also write your own parser. For each molecule which was tested, the data is formatted as follows:
Outcome: 1
16
1SB2 NZ 1 0.039 0.071 0.019 0.3810 0.1582 0.3186
1SB2 HZ 2 0.024 0.139 0.057 0.0575 1.9353 0.1414
****
0.55500 0.17760 0.56290
The fields have the following meaning:
Outcome: the result of interaction with the photon
number of atoms (always 16 for this data)
(atom identifier - 3 fields) (position of the atom in 3 dimensions - x y z) (velocity of the atom in 3 dimensions - vx vy vz)
(atom identifier - 3 fields) (position of the atom in 3 dimensions - x y z) (velocity of the atom in 3 dimensions - vx vy vz)
... repeated 16 times, once for each atom.
0.55500 0.17760 0.56290 -> coordinates of the bounding box, can be ignored.
For what follows, only the 0th process will read the dataset and send information needed to other processes.
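If you write your own parser, the record format above can be handled line by line with sscanf. The following is only a minimal sketch, not the provided parser: the function names parse_atom_line and parse_outcome_line are hypothetical, and it assumes that the three identifier fields are two whitespace-free strings followed by an integer, as in the sample lines.

```c
#include <assert.h>
#include <stdio.h>

#define FEATURES_PER_ATOM 6 /* x y z vx vy vz */

/* Parse one atom line: 3 identifier fields, then position and velocity.
 * Assumption: the identifier is two strings and one integer.
 * Returns 1 on success and 0 if the line does not match this format. */
int parse_atom_line(const char *line, double feat[FEATURES_PER_ATOM])
{
    char residue[16], name[16];
    int index;

    assert(line != NULL && feat != NULL);
    int n = sscanf(line, "%15s %15s %d %lf %lf %lf %lf %lf %lf",
                   residue, name, &index,
                   &feat[0], &feat[1], &feat[2],
                   &feat[3], &feat[4], &feat[5]);
    return n == 9;
}

/* Parse the "Outcome: <value>" header line of a record.
 * Returns 1 on success and 0 otherwise. */
int parse_outcome_line(const char *line, int *outcome)
{
    assert(line != NULL && outcome != NULL);
    return sscanf(line, "Outcome: %d", outcome) == 1;
}
```

A full reader would call parse_outcome_line once per record, read the atom count, call parse_atom_line for each atom, and skip the bounding box line.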
Part II
Now, you are ready to do a simple analysis of the data. We start by observing that for each experiment we have 96 individual features, coming from 16 atoms, each with an x, y, and z coordinate and a velocity in three dimensions (16*6=96). We will work with one process for each dimension, i.e. 96 processes, each running on its own core. See the hints on how to start this job.
You start out by spreading the data over the processes, using MPI_Scatter. Each process will receive the data for one of the dimensions (for example, the 0th process receives the x coordinate of the first atom in every experiment, the first process receives the y coordinate of the first atom in every experiment, etc.). MPI_Scatter sends consecutive blocks of the send buffer to consecutive processes, so all values of one dimension have to be laid out contiguously in memory. So first, you need to transform the data you read such that you can send it using MPI_Scatter. (Alternatively, you could try to use MPI datatypes, but it will take you quite a while to get started with them.)
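The layout change described above amounts to a transpose. The sketch below assumes that the parser stored the data experiment by experiment in one flat array; the function name to_feature_major and this input layout are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Reorder data from experiment-major layout, in[e * n_feat + f],
 * to feature-major layout, out[f * n_exp + e], so that all values of
 * one feature are contiguous and form one block for the scatter. */
void to_feature_major(const double *in, double *out,
                      size_t n_exp, size_t n_feat)
{
    assert(in != NULL && out != NULL && in != out);
    for (size_t e = 0; e < n_exp; e++)
        for (size_t f = 0; f < n_feat; f++)
            out[f * n_exp + e] = in[e * n_feat + f];
}
```

After this step, the block of n_exp values starting at out[f * n_exp] is exactly what process f should receive.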
Next, each process needs the outcome of the experiments (the 0, 1, and -1 values). Use MPI_Bcast to send this data to all processes.
Now, each process has its own small dataset and can perform work based on that. The data in a process looks like this:
(value - the value of the feature, outcome - one of 0, 1, -1)
(value, outcome)
(value, outcome)
...
Now, each process is going to calculate three averages and three corrected sample standard deviations, one for each possible outcome. So, there will be an average and standard deviation of the value for outcome 0, an average and std for outcome 1, and an average and std for outcome -1. Then, for each pair (0, 1), (1, -1) and (0, -1) we will check whether the current feature is significantly different. This could be done by calculating the overlap between the Gaussian curves, but we will use a simpler (and less precise) measure based on Z-scores. What we will compute is the Z-score the mean of the first set would have if it were a sample of the second set, and vice versa.
For example, we look at the pair (0, 1). Then we have the values avg_0, std_0, avg_1, std_1. What we now calculate is first the Z-score of avg_0 if it had been sampled from the distribution with avg_1 and std_1. The Z-score is the number of standard deviations a sample is away from the mean, so concretely:
Z-score = (sample - mean) / std = (avg_0 - avg_1) / std_1
We also need to calculate the score in the other direction:
Z-score = (sample - mean) / std = (avg_1 - avg_0) / std_0
The higher these Z-scores, the more different the sets 0 and 1 are.
You will obtain 6 Z-scores in each process (two for each of the three pairs). The final step is to aggregate all these results back to the 0th process. This can be done using MPI_Gather.
Finally, the 0th process writes this outcome back to the disk.
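The communication pattern of Part II can be sketched as a small, complete MPI program. It is only an illustration under several assumptions: the data is synthetic instead of being read from the file, N_EXP is a made-up number of experiments, the number of features equals the number of processes the job is started with, and the results are printed instead of written to a file.

```c
#include <math.h>
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N_EXP 6     /* hypothetical number of experiments */
#define N_RESULTS 6 /* 3 outcome pairs x 2 directions */

/* Mean and corrected sample std of the values with the given outcome. */
static void stats(const double *v, const int *o, int n, int label,
                  double *mean, double *std)
{
    double sum = 0.0, sq = 0.0;
    int count = 0;
    for (int i = 0; i < n; i++)
        if (o[i] == label) { sum += v[i]; count++; }
    *mean = count > 0 ? sum / count : 0.0;
    for (int i = 0; i < n; i++)
        if (o[i] == label) sq += (v[i] - *mean) * (v[i] - *mean);
    *std = count > 1 ? sqrt(sq / (count - 1)) : 0.0;
}

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int outcomes[N_EXP];
    double *all = NULL, *gathered = NULL;
    if (rank == 0) {
        /* Stand-in for reading the file: synthetic, feature-major data. */
        const int labels[N_EXP] = {0, 0, 1, 1, -1, -1};
        all = malloc((size_t)size * N_EXP * sizeof *all);
        gathered = malloc((size_t)size * N_RESULTS * sizeof *gathered);
        if (all == NULL || gathered == NULL)
            MPI_Abort(MPI_COMM_WORLD, 1);
        for (int e = 0; e < N_EXP; e++)
            outcomes[e] = labels[e];
        for (int f = 0; f < size; f++)
            for (int e = 0; e < N_EXP; e++)
                all[f * N_EXP + e] = f + 0.1 * e * e;
    }

    /* Each process receives the N_EXP values of one feature. */
    double mine[N_EXP];
    MPI_Scatter(all, N_EXP, MPI_DOUBLE, mine, N_EXP, MPI_DOUBLE,
                0, MPI_COMM_WORLD);
    /* Every process needs all outcomes. */
    MPI_Bcast(outcomes, N_EXP, MPI_INT, 0, MPI_COMM_WORLD);

    const int label[3] = {0, 1, -1};
    double mean[3], std[3], z[N_RESULTS];
    for (int k = 0; k < 3; k++)
        stats(mine, outcomes, N_EXP, label[k], &mean[k], &std[k]);
    int r = 0;
    for (int a = 0; a < 3; a++)     /* mean of group a ... */
        for (int b = 0; b < 3; b++) /* ... as a sample of group b */
            if (a != b)
                z[r++] = std[b] > 0.0 ? (mean[a] - mean[b]) / std[b] : 0.0;

    /* Collect the 6 Z-scores of every process on the 0th process. */
    MPI_Gather(z, N_RESULTS, MPI_DOUBLE, gathered, N_RESULTS, MPI_DOUBLE,
               0, MPI_COMM_WORLD);

    if (rank == 0) {
        for (int f = 0; f < size; f++) {
            printf("feature %d:", f);
            for (int k = 0; k < N_RESULTS; k++)
                printf(" %8.3f", gathered[f * N_RESULTS + k]);
            printf("\n");
        }
        free(all);
        free(gathered);
    }
    MPI_Finalize();
    return 0;
}
```

The send buffer of MPI_Scatter and the receive buffer of MPI_Gather are only significant on the root, which is why they may stay NULL on the other processes. Starting the program with a small number of processes is a cheap way to check the pattern before requesting 96 cores.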
Hints

You can experiment with a smaller number of dimensions and hence processes first. This will reduce your waiting time in the queue.

Your batch script should contain the following parameters to start the job on 96 cores; the -p parallel option indicates that you want to place your job in the parallel queue.
#SBATCH -n 96
#SBATCH -p parallel