Michael Cochez

Assistant Professor at Vrije Universiteit Amsterdam

Cloud computing - System administration - NoSQL servers

Note

The server to be used in this exercise should be accessible now. Inform the teacher if you cannot log in or something else does not work as expected.

Goal

The goal of this exercise is to get acquainted with NoSQL databases and with implementing a simple web application in Python. A side goal is to train some basic system administration skills, which you will likely need when developing this type of system yourself. You will also be using virtual machines in this exercise.

Prerequisites

Learn how to create a simple HTTP server using python with the bottle framework. (You can use other frameworks if you really want.)

Learn what a NoSQL database is. A general description can be found on Wikipedia. In this exercise we will be using the Riak NoSQL database. Read the basic information in Riakdoc - concepts. If you have not been using Linux lately, you can have a look at this cheat sheet. Read through the Taste of Riak: Python documentation. You will need this later when accessing Riak from your Python code.

Task

This assignment contains a technical part and several reflective questions (see below).

Do NOT copy paste commands without having a clue what they are doing. Some commands need to be adapted before use.

In this exercise you will set up a NoSQL database and then write a small Python server which uses the database as a back-end. As an example of a NoSQL database we will use Riak. The server will be an HTTP server implemented in the Python programming language. The database and the server run on separate virtual machines.

All work in this task needs to be done on oksa3.it.jyu.fi. This machine is reachable over ssh from inside the university network (including KOAS and Kortepohja student village). Students outside the network can reach the machine using VPN, or by first connecting to charra.it.jyu.fi using ssh and then connecting to oksa3 from there. In the Linux terminal a typical ssh command looks like ssh username@host, for example ssh miselico@oksa3.it.jyu.fi. On oksa3, all students in this course have an account and X-forwarding is possible. Windows users can use putty to connect to the server, but will not be able to use X-forwarding and will likely not get a smooth experience.

Note that the home directory on oksa3 is not related to your normal university home directory. All files are local to oksa3.

From now on it is assumed that you are logged in to the server.

Setting up the virtual machines

Before creating the virtual machines, each student needs to have a public/private key pair; the same type as we used for authentication to yousource with git. This key will be used to log-in to the virtual machines. You can create this key pair as follows:

ssh-keygen -t rsa

This creates a private and a public key in the ~/.ssh/ directory.

Next, a script has been prepared to create the machines. This script has been placed in everyone's home directory, and you should read it to see what it does. This script has to be executed only once for each group!

Read the script using:

nano createVMs

Execute the script using:

./createVMs

The script takes your group number as a parameter. Please double check before providing it. You might need to enter your password while this script is running since creating VMs requires root (sudo) access. Take note of the information which the script prints, some of it you will need later. Again, this script can only be run once per group!!

Note that at this point only the student who created the VMs can access them, because only that user has the right key to log in to the virtual machines. Next, you need to add the public keys ( .ssh/id_*.pub ) of the other students in your group to both virtual machines so that they also get access. Adding someone's public key to a Linux machine is done by appending the key to the file located at ~/.ssh/authorized_keys, or more concretely /root/.ssh/authorized_keys, since you log in to the VMs as root.
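
The appending step can also be scripted. The following is only a minimal sketch, assuming Python 3 is available on the machine; add_public_key is a hypothetical helper name, not part of the course material. It appends a key only when that exact line is not yet present, so running it twice does no harm.

```python
from pathlib import Path


def add_public_key(authorized_keys: Path, public_key: str) -> bool:
    """Append public_key to authorized_keys unless it is already listed.

    Returns True if the key was added and False if it was already present.
    (Hypothetical helper for illustration only.)
    """
    key = public_key.strip()
    content = authorized_keys.read_text() if authorized_keys.exists() else ""
    if key in (line.strip() for line in content.splitlines()):
        return False
    # Create ~/.ssh if it is missing; ssh expects it to be private to the user.
    authorized_keys.parent.mkdir(mode=0o700, parents=True, exist_ok=True)
    with authorized_keys.open("a") as handle:
        # Make sure the new key starts on its own line.
        if content and not content.endswith("\n"):
            handle.write("\n")
        handle.write(key + "\n")
    return True


# Possible usage on a VM (the file name of the other student's key is an assumption):
#   add_public_key(Path("/root/.ssh/authorized_keys"),
#                  Path("id_rsa_teammate.pub").read_text())
```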

A final thing you should realize is that the virtual machines can communicate with each other and you can connect to the virtual machines from the host machine, i.e. oksa3. It is, however, not possible (except for people familiar with tunneling) to connect to the created virtual machines from your own workstation. Instead of using your browser for testing something, you can use either wget or curl from the command line to fetch content. Interactive browsing is possible with links or lynx. In order to use these tools on the VMs, you will need to install them yourself. See also the Hints section below.

Setting up the database

Remark: if you want to start working on the HTTP server before you finish the creation of the database, see the hints section.

A virtual machine was created which will be used as the database machine. Use ssh to connect to that machine. The command will look something like ssh root@192.168.122.1xx. The correct address was shown when you created the machines. Now follow these steps:

  1. Install Erlang from source. Use Installing on GNU/Linux.
  2. Check whether Erlang is correctly installed by firing up the Erlang emulator with the command erl
  3. Install Riak from source. See installing on Debian and Ubuntu from source.
    It seems like one dependency is missing from the documentation. You also have to install libpam0g-dev by executing apt-get install libpam0g-dev.
  4. This part has changed, sorry for that. It seems hard to get multiple nodes running on the same host, so instead one node is sufficient. After installing riak, you need to edit the file rel/riak/etc/riak.conf. In that file you have to set listener.http.internal to 192.168.122.1XX:8098 and listener.protobuf.internal to 192.168.122.1XX:8087 where XX is your group number. Then you should be able to start riak by executing rel/riak/bin/riak start.
  5. Make sure you test the database by putting an image into it and fetching it back. In the terminal you don’t have a graphical web browser! To do this you will need to install the curl tool. You can install curl by executing apt-get install curl. Look at the Test the cluster section of Five Minute Install on how you can do a quick test to see whether it works. Change the IP address and port numbers as needed.
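
If you would rather script the round trip from step 5 than type the curl commands by hand, the following is a rough sketch in Python 3 using only the standard library. It assumes the /buckets/<bucket>/keys/<key> form of the Riak HTTP API; object_url, store, fetch and round_trip are hypothetical helper names, and the address in the usage comment is a placeholder for your own database VM.

```python
import urllib.request
from urllib.parse import quote


def object_url(host, port, bucket, key):
    """Build the HTTP URL of one Riak object (bucket and key are URL-escaped)."""
    return "http://{}:{}/buckets/{}/keys/{}".format(
        host, port, quote(bucket, safe=""), quote(key, safe=""))


def store(url, data, content_type):
    """PUT raw bytes under the given object URL and return the HTTP status code."""
    request = urllib.request.Request(
        url, data=data, method="PUT", headers={"Content-Type": content_type})
    with urllib.request.urlopen(request) as response:
        return response.status


def fetch(url):
    """GET the object back as raw bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def round_trip(url, path, content_type="image/jpeg"):
    """Store the file at path, fetch it again and report whether the bytes match."""
    with open(path, "rb") as handle:
        original = handle.read()
    store(url, original, content_type)
    return fetch(url) == original


# Possible usage (replace XX with your group number; the file name is an assumption):
#   url = object_url("192.168.122.1XX", 8098, "images", "1.jpg")
#   print(round_trip(url, "picture.jpg"))
```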

Now you have a NoSQL database set up and you are ready to continue with making the server.

Making the HTTP server

Now, we will make an HTTP server which is providing a minimal phone book service. The HTTP server must be running on the second VM which was created. To login to the machine, you can use the following command: ssh root@192.168.122.2XX.

First, you should check whether you can connect from the web server machine to the database machine by executing commands similar to the ones from step 5 in the previous section.

To make the HTTP server, which must be written in Python, you are advised to use frameworks like, for instance, bottle.

At http://192.168.122.2XX/add/ you must serve an HTML form in which the user can enter a name and phone number. Sending a form back to the server is a POST action. The data entered in the form must be stored in the Riak database cluster. If a value already exists for the name, it should be overwritten. A previously entered phone number is retrieved by visiting http://192.168.122.2XX/search/<name>. The database must be queried to retrieve the phone number.

Remember to change the hostname from localhost to 192.168.122.2XX in the code for starting the server. Otherwise, the server will not be accessible from outside the virtual machine.
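
One framework-free way to see the request flow described above is a plain WSGI application built on the Python 3 standard library. The sketch below is not the bottle solution this exercise asks for; with bottle, each branch would become its own route handler. make_app, FORM and the dict-like store are hypothetical names, and the dict only stands in for the Riak bucket.

```python
from html import escape
from urllib.parse import parse_qs

FORM = (
    '<form method="post" action="/add/">'
    'Name: <input name="name"> '
    'Phone: <input name="phone"> '
    '<input type="submit" value="Save">'
    '</form>'
)


def make_app(store):
    """Return a WSGI app that keeps contacts in store (any dict-like object)."""

    def app(environ, start_response):
        method = environ["REQUEST_METHOD"]
        path = environ.get("PATH_INFO", "")
        status, body = "404 Not Found", "Not found"

        if path == "/add/" and method == "GET":
            status, body = "200 OK", FORM
        elif path == "/add/" and method == "POST":
            # The browser sends the form fields URL-encoded in the request body.
            length = int(environ.get("CONTENT_LENGTH") or 0)
            fields = parse_qs(environ["wsgi.input"].read(length).decode("utf-8"))
            name = fields.get("name", [""])[0]
            phone = fields.get("phone", [""])[0]
            if name and phone:
                store[name] = phone  # an existing entry is overwritten
                status, body = "200 OK", "Saved " + escape(name)
            else:
                status, body = "400 Bad Request", "Both name and phone are required"
        elif path.startswith("/search/") and method == "GET":
            # Non-ASCII names would need extra decoding of PATH_INFO.
            name = path[len("/search/"):]
            phone = store.get(name)
            if phone is None:
                status, body = "404 Not Found", "No entry for " + escape(name)
            else:
                status, body = "200 OK", escape(name) + ": " + escape(phone)

        payload = body.encode("utf-8")
        start_response(status, [
            ("Content-Type", "text/html; charset=utf-8"),
            ("Content-Length", str(len(payload))),
        ])
        return [payload]

    return app


# Possible way to serve it on the web server VM (replace XX with your group number):
#   from wsgiref.simple_server import make_server
#   make_server("192.168.122.2XX", 80, make_app({})).serve_forever()
```

Keeping the storage behind a dict-like object means the same handlers can later be pointed at Riak without changing the request handling.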

In your Python code, you will use a library provided by the Riak developers to connect to the database. You will need to install a few dependencies before installing the Riak Python libraries. All of them can be installed with apt-get (see the Hints section).

  • python-pip
  • protobuf-compiler
  • python-protobuf
  • build-essential
  • python-dev
  • libffi-dev

Check from this page how to install the Riak libraries and connect to your Riak server from Python code. You just installed pip, you can now use it for installation of the Python packages. While reading the tutorial, keep in mind that your database is not on localhost and does not use default ports.

    RiakClient(protocol='http', host='192.168.122.1xx', http_port=8098)
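
Once the client connects, the phone book logic can be kept in a small wrapper. The sketch below assumes the bucket(), new(), store(), get(), .data and .exists calls shown in the Taste of Riak: Python tutorial; PhoneBook and the bucket name group00_phonebook are hypothetical. Whether a plain new() followed by store() replaces an earlier value depends on the bucket settings, so treat the overwrite behaviour as an assumption to verify against your own node.

```python
class PhoneBook:
    """Store and look up phone numbers in a single Riak bucket (illustrative sketch)."""

    def __init__(self, client, bucket_name="group00_phonebook"):
        # client is expected to behave like riak.RiakClient;
        # the default bucket name is a made-up placeholder.
        self._bucket = client.bucket(bucket_name)

    def add(self, name, phone):
        """Write phone under the key name."""
        self._bucket.new(name, data=phone).store()

    def search(self, name):
        """Return the stored phone number, or None if the key does not exist."""
        obj = self._bucket.get(name)
        return obj.data if obj.exists else None


# Possible usage (replace xx with your group number):
#   import riak
#   client = riak.RiakClient(protocol='http', host='192.168.122.1xx', http_port=8098)
#   book = PhoneBook(client)
#   book.add("Alice", "0401234567")
#   print(book.search("Alice"))
```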

Trying to optimize things

Several things can be done in order to improve the way your application works. Some of the techniques in this section do not really make sense for this exercise. Again, the goal is to get familiar with things in order to use them when needed.

Measuring impact

To test the impact of these changes, you should measure the throughput of your application using JMeter on the server. First, download the latest stable version of the software to your account on the server, for example using wget http://the.url. Then extract the archive using tar --extract --verbose --file archive.tar.gz or unzip archive.zip.

Now, in order to execute JMeter, you have to realize that on the server you cannot use the graphical user interface. Instead, you will make the test plan (.jmx file) on your own machine, using the GUI and execute it on the server (on the host machine, not inside a virtual machine). Check from the documentation how to execute the plan.
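
After a run, the result file can be summarized without the GUI. The sketch below assumes the results were saved in JMeter's CSV format with a header row containing at least the timeStamp (milliseconds) and success columns; summarize is a hypothetical helper. The throughput it reports is only an approximation based on the spread of the timestamps, not the exact figure JMeter's own reports show.

```python
import csv


def summarize(result_path):
    """Return (samples, approx_requests_per_second, error_fraction) for a CSV result file."""
    stamps = []
    errors = 0
    with open(result_path, newline="") as handle:
        for row in csv.DictReader(handle):
            stamps.append(int(row["timeStamp"]))  # milliseconds since the epoch
            if row["success"].strip().lower() != "true":
                errors += 1
    if not stamps:
        return 0, 0.0, 0.0
    # Guard against a zero-length interval when there is a single sample.
    duration_s = max((max(stamps) - min(stamps)) / 1000.0, 0.001)
    return len(stamps), len(stamps) / duration_s, errors / len(stamps)


# Possible usage (the file name is an assumption):
#   samples, rps, error_fraction = summarize("results.jtl")
#   print(samples, round(rps, 1), error_fraction)
```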

To make the testing more realistic, you should simulate a slow network link using netem on the virtual machines. You can make the testing more interesting by specifying values in a CSV file which you use as an input to JMeter, i.e. not making the exact same request many times in a row.
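
A CSV input file of this kind can be generated with a few lines of Python 3. This is just a sketch: write_contacts, the group00_ name prefix and the phone number format are made up for illustration, and the column order (name, phone) is an assumption that has to match the variable names you configure in your JMeter test plan.

```python
import csv
import random


def write_contacts(path, count, seed=0):
    """Write count rows of (name, phone) test data to a CSV file."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same file
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        for index in range(count):
            name = "group00_user{:04d}".format(index)
            phone = "0{}".format(rng.randrange(10**8, 10**9))  # "0" plus nine digits
            writer.writerow([name, phone])


# Possible usage:
#   write_contacts("contacts.csv", 1000)
```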

Using protocol buffers instead of HTTP

As you might have read, using HTTP for the client is 2-3 times slower than the protocol buffer API. Currently, your client most likely uses HTTP. Changing it to use protocol buffers is straightforward (because we pre-installed the needed dependencies) and only requires you to change the code for creating the client from riak.RiakClient(host="192.168.122.1XX", port=8098) to something like riak.RiakClient(host="192.168.122.1XX", port=8087, protocol='pbc').

Caching contacts using memcached

Some contacts will be more popular than others. Imagine that, in the scope of this exercise, it makes sense to cache frequently used ones for later use. memcached is one caching solution which is also used in production environments. Install memcached and make use of the caching facilities from your Python web application. (You can install and use the packaged memcached version by running sudo apt-get install memcached.) The memcached daemon runs as a separate process from your Python server.

You then use a client library to connect to it. Note that some memcached clients, for example pylibmc, require you to install libmemcached-dev. pylibmc can be installed using pip install pylibmc.
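
A common way to use such a cache is the cache-aside pattern: look in the cache first, fall back to the database on a miss, and fill the cache with the result. The sketch below is written against any client object offering get(key) and set(key, value, time=...), which pylibmc and similar clients provide; cache_key, cached_lookup, cached_update, the group00: prefix and the 60 second lifetime are hypothetical choices. Names are URL-quoted because memcached keys must not contain whitespace.

```python
from urllib.parse import quote


def cache_key(name, prefix="group00:"):
    """Turn a contact name into a key that is safe to use with memcached."""
    return prefix + quote(name, safe="")


def cached_lookup(name, cache, lookup, ttl=60):
    """Return the phone number for name, asking lookup only on a cache miss."""
    key = cache_key(name)
    phone = cache.get(key)
    if phone is None:
        phone = lookup(name)  # e.g. a function that queries Riak
        if phone is not None:
            cache.set(key, phone, time=ttl)
    return phone


def cached_update(name, phone, cache, save, ttl=60):
    """Write to the database first, then refresh the cached copy."""
    save(name, phone)
    cache.set(cache_key(name), phone, time=ttl)


# Possible usage with pylibmc (memcached assumed to run on the same VM):
#   import pylibmc
#   cache = pylibmc.Client(["127.0.0.1"])
#   phone = cached_lookup("Alice", cache, book_search_function)
```

Refreshing the cache on every update keeps /search/ from serving a stale number after /add/ overwrites an entry.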

Using a faster server framework

Currently, you are most likely using the debug / development setting of the Python web framework (i.e. bottle). Use a production-ready server, like cherrypy, instead. Also, turn off debug mode.

Cancelled: Load balancing your database

This optimization has been cancelled since it seems pretty hard to get multiple riak nodes running on one host.

You have set up three nodes in the database. Your current solution is, however, only querying one of the nodes all the time. Try to come up with and implement a simple mechanism which makes your HTTP server implementation alternate between the nodes. You can also use what is provided in the Riak libraries.

Note that this should not make a difference really since all database nodes are on the same virtual machine.

Reflective questions

Provide an answer to the following questions in a readme file in your repo:

  1. How would you implement a page which shows a list of contacts? See also List keys.
  2. Why would you ever use a set-up with virtual machines in a real (production) environment? Or would you not?
  3. Which of the optimizations made sense, which ones not?
  4. What should be improved in this exercise if it is given to students in the future?

Returning the task

You “return” the task by appending the text from toBeAppendedTo_authorized_keys to the ~/.ssh/authorized_keys file on both virtual machines. This way the teacher is able to log into the machines.

Also place all your self-written files (web server) in this week’s repository.

Hints

  1. riak.config can be found in riakX/etc/riak.config

  2. You can install needed tools on the VMs as you see fit using apt-get install package_name, where package_name is the name of the package to be installed. Keep in mind that the disk space for the virtual machines is fairly limited (free space can be checked with df -h).

  3. You can install the nano text editor to edit files in the terminal. Execute apt-get install nano.

  4. If you want to start working on the HTTP server before you finish the creation of the database, you can use the database server running at host 192.168.122.149. The HTTP port for the server is 8098 and the protocol buffer port is 8087. Keep in mind that others might also be using the server, so use buckets and keys which contain some identifier related to your group.

  5. If you want to create servers to test something else, you can use group numbers between 20 and 40. Run ls /srv/kvm/ on oksa3 to check which ones are in use already. Keep in mind that these test servers will be removed without warning at the end of the course.