Analysis of photo-isomerization experiments

Goal

The goal of this task is to write a program for a supercomputer which uses the Message Passing Interface (MPI) to communicate between the processes.

Notes

Some small things might still change in the task description.

The proposed analysis is fairly crude. It makes several assumptions which are not well justified. A deeper analysis would require much more insight into the problem and knowledge of non-linear dimensionality reduction techniques. Further, the dataset is reasonably small, so the proposed analysis could be done on a normal computer (we use the supercomputer to try out the MPI infrastructure).

Prerequisites

In principle, the content presented during the lectures suffices to implement this task. This includes the code examples for MPI (yousource, github). More information about MPI can be found in the tutorial on mpitutorial.com. The parts most useful for this task are those on broadcast and collective communication and on scatter and gather. Detailed information can be found on the MPI Forum website.

Details about the hardware used in this task can be found on the CSC webpage about Taito. Some general commands to compile and run code can be found in the above-mentioned MPI examples.

Task

The goal of this task is to solve a chemistry problem. It is known that certain molecules change their structure when they interact with photons (light). Further, certain molecule shapes are more beneficial than others, and some shape changes can be used to cause other desired effects. Hence, chemists would like to find out what kind of molecules they need in order to obtain certain shape changes.

In this exercise you are given data (positions and velocities of the atoms) of a C1−N2=C3−C4=C5−C6 molecule, which is sensitive to photon interaction. Further, you are given the result of the interaction with the photon as 1 (wanted change observed), -1 (no change observed), or 0 (a change observed, but not a relevant one). The goal is now to see whether we can find features in the data which predict the outcome. Once these features are found, a chemist could prepare the molecules such that they have the right starting shape and hence only the desired change would happen.

For this task, you are provided with an account on the Taito supercomputer and the datasets which have been placed in your home drive. There are two datasets (diabatic and ffsh) which are obtained using different simulation techniques.

As mentioned in the notes, the analysis performed is fairly simple and most likely you will not obtain a good result. You are allowed to experiment with more advanced methods as well but, since the main part of the exercise is to learn how to use the message passing interface (MPI), you have to use multiple communicating processes on the supercomputer.

This task is performed either individually or as a pair. You are free to use the programming language you want. However, getting MPI to work in a language other than C, C++ or Fortran might be highly non-trivial.

Part I

Read the data from the file. A simple C parser is provided here. You can also write your own parser. For each molecule which was tested, the data is formatted as follows:

Outcome: 1
   16	
    1SB2     NZ    1   0.039  -0.071  -0.019  0.3810  0.1582 -0.3186	
    1SB2     HZ    2   0.024  -0.139   0.057 -0.0575 -1.9353 -0.1414	
    ****
    
   0.55500   0.17760   0.56290	

The fields have the following meaning:

Outcome: the result of interaction with the photon
   number of atoms (always 16 for this data)	
    (atom identifier - 3 fields)  (position of the atom in 3 dimensions - x  y  z) (velocity of the atom in 3 dimensions - vx  vy vz)	
    (atom identifier - 3 fields)  (position of the atom in 3 dimensions - x  y  z) (velocity of the atom in 3 dimensions - vx  vy vz)	
    ---- repeated 16 times, once for each atom.
	
   0.55500   0.17760   0.56290	-> coordinates of the bounding box, can be ignored.
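If you write your own parser, the two kinds of lines that matter could be read roughly as follows. This is only a minimal sketch: the function names parse_outcome and parse_atom_line are hypothetical, and it assumes that the fields are separated by whitespace exactly as in the sample record above.

```c
#include <stdio.h>

/* Hypothetical helper: read the value from an "Outcome: <int>" line.
 * Returns 1 on success and 0 if the line has a different shape. */
int parse_outcome(const char *line, int *outcome)
{
    return sscanf(line, "Outcome: %d", outcome) == 1;
}

/* Hypothetical helper: parse one atom line into six numbers.
 * Assumes whitespace-separated fields, as in the sample record.
 * out[0..2] receive the position (x, y, z),
 * out[3..5] receive the velocity (vx, vy, vz).
 * Returns 1 on success and 0 if the expected fields are not found. */
int parse_atom_line(const char *line, double out[6])
{
    /* %*s and %*d read and discard the three atom identifier fields. */
    int n = sscanf(line, "%*s %*s %*d %lf %lf %lf %lf %lf %lf",
                   &out[0], &out[1], &out[2],
                   &out[3], &out[4], &out[5]);
    return n == 6;
}
```

The six numbers of each of the 16 atoms, placed one after the other, form the 96 values of one experiment.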

For what follows, only the 0th process will read the dataset and send the needed information to the other processes.

Part II

Now, you are ready to do a simple analysis of the data. We start by observing that for each experiment we have 96 individual features: 16 atoms, each with an x, y, and z coordinate and a velocity in three dimensions (16*6=96). We will work with one process for each dimension, i.e. 96 processes, each running on its own core. See the hints on how to start this job.

You start out by spreading the data over the processes, using MPI_Scatter. Each process will receive the data for one of the dimensions (for example, process 0 receives the x coordinate of the first atom of every experiment, process 1 receives the y coordinate of the first atom of every experiment, etc.). MPI_Scatter sends consecutive, equally sized blocks of the send buffer to consecutive processes, so all values belonging to one dimension have to be stored contiguously in memory. So first, you need to rearrange the data you read into that layout. (Alternatively, you could try to use derived datatypes, but it will take you quite a while to get started with them.)
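The rearrangement is a plain transposition. The following is a minimal sketch, assuming that the parser stored the data as one row of 96 values per experiment; the function name to_feature_major and its parameters are hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: reorder data from sample-major to feature-major order.
 * in:  n_samples rows of n_features values (one row per experiment).
 * out: n_features rows of n_samples values (one row per feature).
 * Both buffers must hold n_samples * n_features doubles and must not overlap. */
void to_feature_major(const double *in, double *out,
                      size_t n_samples, size_t n_features)
{
    for (size_t s = 0; s < n_samples; s++) {
        for (size_t f = 0; f < n_features; f++) {
            out[f * n_samples + s] = in[s * n_features + f];
        }
    }
}
```

After this step, the block of n_samples values starting at index f * n_samples belongs to feature f, which is the block that process f should receive.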

Next, each process needs the outcomes of the experiments (the 0, 1, and -1 values). Use MPI_Bcast to send this data to all processes.

Now, each process has its own small dataset and can perform work based on that. The data in a process looks like this:

(value - the value of the feature, outcome - one of 0, 1, -1)
(value, outcome)
(value, outcome)
...

Now, each process is going to calculate three averages and three corrected sample standard deviations, one for each possible outcome. So, there will be an average and standard deviation of the value for outcome 0, an average and std for outcome 1, and an average and std for outcome -1. Then, for each pair (0, 1), (1, -1) and (0, -1) we will check whether the current feature is significantly different. This could be done by calculating the overlap between the Gaussian curves, but we will use a simpler (and less precise) measure based on Z-scores. We compute the Z-score which the mean of the first set would have if it were a sample from the second set, and vice versa.

For example, we look at the pair (0, 1). Then we have the values avg_0, std_0, avg_1, std_1. What we now calculate is first the Z-score of avg_0 if it had been sampled from the distribution with avg_1 and std_1. The Z-score is the number of standard deviations a sample is away from the mean, so concretely:

Z-score = |sample - mean|/std = |avg_0 - avg_1|/std_1

We also need to calculate the score in the other direction

Z-score = |sample - mean|/std = |avg_1 - avg_0|/std_0

The higher these Z-scores, the more different the sets 0 and 1 are.

You will obtain six Z-scores in each process (two for each of the three pairs). The final step is to aggregate all these results back to the 0th process. This can be done using MPI_Gather.

Finally, the 0th process writes these results to disk.
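The two formulas above, applied to all three pairs, can be sketched as follows. The function names z_score and pair_scores and the index convention (index 0 for outcome 0, index 1 for outcome 1, index 2 for outcome -1) are assumptions made for this example.

```c
#include <math.h>

/* Z-score of `sample` with respect to a distribution with the given
 * mean and standard deviation: |sample - mean| / std. */
double z_score(double sample, double mean, double std)
{
    return fabs(sample - mean) / std;
}

/* Hypothetical helper: the six Z-scores for one feature.
 * Index convention (an assumption of this sketch):
 * avg[0], std[0] -> outcome 0; avg[1], std[1] -> outcome 1;
 * avg[2], std[2] -> outcome -1. */
void pair_scores(const double avg[3], const double std[3], double out[6])
{
    out[0] = z_score(avg[0], avg[1], std[1]); /* pair (0, 1):  avg_0 vs set 1   */
    out[1] = z_score(avg[1], avg[0], std[0]); /* pair (0, 1):  avg_1 vs set 0   */
    out[2] = z_score(avg[1], avg[2], std[2]); /* pair (1, -1): avg_1 vs set -1  */
    out[3] = z_score(avg[2], avg[1], std[1]); /* pair (1, -1): avg_-1 vs set 1  */
    out[4] = z_score(avg[0], avg[2], std[2]); /* pair (0, -1): avg_0 vs set -1  */
    out[5] = z_score(avg[2], avg[0], std[0]); /* pair (0, -1): avg_-1 vs set 0  */
}
```

Note that a standard deviation of zero leads to a division by zero, so a real implementation would have to decide how to report such features.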
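Put together, the communication pattern of Part II might look like the sketch below. It is not a full solution: the data on process 0 is a placeholder instead of the parsed dataset, the per-process computation is left out, and the value of n_samples and the file name scores.txt are made up for illustration. It assumes that the program is started with one process per feature.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N_SCORES 6 /* two Z-scores for each of the three outcome pairs */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Only process 0 knows the number of experiments after reading the file. */
    int n_samples = 0;
    double *all_values = NULL; /* feature-major, size * n_samples values */
    if (rank == 0) {
        n_samples = 4; /* placeholder, normally obtained from the parser */
        all_values = malloc((size_t)size * n_samples * sizeof *all_values);
        for (int i = 0; i < size * n_samples; i++)
            all_values[i] = (double)i; /* placeholder data */
    }
    MPI_Bcast(&n_samples, 1, MPI_INT, 0, MPI_COMM_WORLD);

    double *my_values = malloc((size_t)n_samples * sizeof *my_values);
    int *outcomes = malloc((size_t)n_samples * sizeof *outcomes);
    if (rank == 0)
        for (int i = 0; i < n_samples; i++)
            outcomes[i] = (i % 3) - 1; /* placeholder outcomes -1, 0, 1 */

    /* Process r receives the r-th block of n_samples values. */
    MPI_Scatter(all_values, n_samples, MPI_DOUBLE,
                my_values, n_samples, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    /* Every process receives all outcomes. */
    MPI_Bcast(outcomes, n_samples, MPI_INT, 0, MPI_COMM_WORLD);

    /* The averages, standard deviations and Z-scores would be computed here. */
    double my_scores[N_SCORES] = {0.0};

    double *all_scores = NULL;
    if (rank == 0)
        all_scores = malloc((size_t)size * N_SCORES * sizeof *all_scores);
    MPI_Gather(my_scores, N_SCORES, MPI_DOUBLE,
               all_scores, N_SCORES, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        FILE *f = fopen("scores.txt", "w"); /* hypothetical output file */
        if (f != NULL) {
            for (int r = 0; r < size; r++) {
                fprintf(f, "%d", r);
                for (int k = 0; k < N_SCORES; k++)
                    fprintf(f, " %g", all_scores[r * N_SCORES + k]);
                fprintf(f, "\n");
            }
            fclose(f);
        }
    }

    free(all_values);
    free(all_scores);
    free(my_values);
    free(outcomes);
    MPI_Finalize();
    return 0;
}
```

With a typical MPI installation this would be compiled with the mpicc wrapper. The send buffer arguments of MPI_Scatter and the receive buffer arguments of MPI_Gather are only significant on the root process, which is why the other processes can pass NULL there.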

Hints

You can experiment with a smaller number of dimensions and hence processes first. This will reduce your waiting time in the queue.
Your batch script should contain the following parameters to start the job on 96 cores. The -p parallel option indicates that you want to place your job in the parallel queue.

#SBATCH -n 96

#SBATCH -p parallel

Michael Cochez

Assistant Professor at Vrije Universiteit Amsterdam