Wednesday, February 01, 2012

Why Build an Appliance?

Clustrix sells appliances. We marry our software with industry standard hardware to make plug and play devices that drop in and work on the network. This is same model as my former company Isilon (now EMC), NetApp, and pretty much all successful storage vendors.  Why do this instead of selling software by itself?

Qualifying SSDs is particularly tricky because of the huge variation in quality.
The number one reason for the appliance model is quality control. By reducing the supported hardware set, we drastically increase the QA time we get on the intended hardware. QA time looking for hardware interactions really matter in a product that stores customers’ data. The bar is simply set much higher there.  As an example, we do extensive testing of durability of data on our specific hardware. Depending on the hard drive controller, controller firmware, drives, and drive firmware, we have gotten very different results on exactly when a specific piece of data makes it to stable storage. Many drives and even controllers lie about when the data is safe. Some drives implement tagged queuing or FUA poorly. I’ve even seen drives return from a sync command without actually having synced the data. The only way to ensure data integrity on a storage system is to properly characterize the hardware with an extensive test regime and control every piece of that storage system. With that control, we can form relationships with the vendors to fix bugs that we expose in the hardware and firmware. At Clustrix, we have a variety of different pieces of test software to exercise the disk subsystem and verify the data is safe every time. With that test software, we have rejected many pieces of hardware and countless versions of firmware that didn’t make the cut. The same rigorous process goes into qualifying networking, Infiniband, NVRAM, processing, and memory components. This kind of focused testing and qualification is not possible on a software-only product.

The second critical benefit of the appliance model is much tighter integration between hardware and software. At Clustrix, we are able to monitor the hardware in the box and present that data seamlessly in our “system” database. For example, “SELECT * from system.memory” will give you all the details for the memory installed on the system and tell you if there are any correctable or uncorrectable ECC errors. We have logic to send alerts on correctable errors and safely shut down the node on an uncorrectable error. On Clustrix, this hardware-specific data sits right along side of database-specific data like queries per second and disk full percentage which allows exceptionally easy integration with tools like Nagios and Cacti. This sort of tight integration is only possible on an appliance.

Finally, being an appliance makes the Clustrix database much easier to install, manage, and use. Creating a high performance, bullet-proof database is no longer a science project. You no longer have to put together the pieces, get all the right versions of firmware, get the right versions of the kernel and libraries, the right version of the database software and make sure all the parts are tuned to work together. The Clustrix appliance is an integrated and tuned database right out of the box.

Tuesday, January 31, 2012

Scripting SSH with Python

The other day Aaron Passey, our CTO, pointed out to me that I've written the same bit of code in different languages at each of the last 3 places I've worked. I figured that since this seems to be something commonly useful, yet not obviously available, I'd write something about it.

The scenario is that we have some code which wants to do remote ssh calls. Some variant of this code exists within the Isilon cluster management code, and we use it at Clustrix within our clx command line tool, as well as the database update scripts. We're already running on a Unix box (or variant (osx, linux, etc)) and we have access to an ssh client, so this it totally do-able from code, but not immediately obvious how.

I'll present this in Python, but the same applies to C, or any other language. I'm aware of the Paramiko library for Python which is supposed to have support for this. That may do everything this can do - I don't know. This is 100 lines of code, pretty easy to follow, and maybe instructive. This is all about utilizing ssh and our friendly posix primitives.  After the walkthrough, I've included the entire source at the bottom.

The obvious thing that we'd like to do is fork off an ssh process and read the stdin. The easy way to do this in python is with os.popen2(). This will give us back the stdin and stdout:

(sin, sout) = os.popen2(cmd)

This, however, will not work. ssh wants a psuedo tty (a pty). If it's not running in one, it just exits. This is where the helpful python pty class comes in:

(pid, f) = pty.fork()

Now we've got ssh running in the right environment.  The two args we got back are important: pid is the process id of our forked ssh process, and f is a unix fileno (not to be confused with a python file handle) which is the combined stdin and stdout of the process. It's important to remember that f isn't something reference counted. We're going to need to explicitly close it.

Now that we have the basic mechanism in place, let's build us a little ssh class:

class SSH:
    def __init__(self, ip, passwd, user, port):
        self.ip = ip
        self.passwd = passwd
        self.user = user
        self.port = port


This is structured so you can create one SSH object per target box and reuse it to do different commands. We'll also include the ability to push and pull files. Our first method will be the command handler. It will take only one argument (other than self), which will be the command to run:

    def run_cmd(self, c):
        (pid, f) = pty.fork()
        if pid == 0:
            os.execlp("ssh", "ssh", '-p %d' % self.port,
                      self.user + '@' + self.ip, c)
        else:
            return (pid, f)


As you can see, the pty fork works just like the os fork in that if the pid is 0 it means we're the child, and if non 0 it means we're the parent.

Since this is a raw unix fileno, the file closed condition is a little weird. Reading it will block until something is available and then return results. But when the descriptor closes it throws an os error. I'd rather be able to handle the reads just in a loop (or maybe it's because I'm an old C programmer) so I'm going to wrap the read:

    def _read(self, f):
        x = ''
        try:
            x = os.read(f, 1024)
        except Exception, e:
            # this always fails with io error
            pass
        return x


Once we've got this thing forked, and can read from it, we need to get our results back out. There's on additional thing we need to be prepared for: ssh might want to ask us some questions. We've all seen this before:

harmony:~$ ssh paulmini
The authenticity of host 'paulmini (10.1.2.125)' can't be established.
RSA key fingerprint is 7e:91:5d:5d:06:fe:3f:24:94:84:

c0:75:96:8c:d1:f1.
Are you sure you want to continue connecting (yes/no)?


We just want to say "yes" and move on. ssh might also ask us for a password if we don't have host keys enabled and we'll need to be prepared to handle that. We'll handle all of these actions in an ssh_results method:

    def ssh_results(self, pid, f):

First, let's initialize our output buffer:

        output = ""

Now we'll read our first chunk and see if ssh is asking anything of us. If they want to know if we really want to continue connecting because the target isn't in ssh/known_hosts, we'll say yes. If they ask us for the password we'll provide it.

        got = self._read(f)

        # check for authenticity of host request
        m = re.search("authenticity of host", got)
        if m:
            os.write(f, 'yes\n')
            # Read until we get ack
            while True:
                got = self._read(f)
                m = re.search("Permanently added", got)
                if m:
                    break

            got = self._read(f)
        # check for passwd request
        m = re.search("assword:", got)
        if m:
            # send passwd
            os.write(f, self.passwd + '\n')
            # read two lines
            tmp = self._read(f)
            tmp += self._read(f)
            m = re.search("Permission denied", tmp)
            if m:
                raise Exception("Invalid passwd")
            # passwd was accepted
            got = tmp


Preliminaries done, we can now entire our results loop:

        while got and len(got) > 0:
            output += got
            got = self._read(f)


We've know we've now ready everything because our _read returned empty. This also means that the ssh process has ended. There's a defunct zombie process sitting there and we're going to have to clean that up. We could handle the SIGCHLD and waitpid it, but in this case since we know the pid and we know the process is done, it's much simpler. Also, remember that since f isn't a reference counted Python object we're going to need to manually clean that up:

        os.waitpid(pid, 0)
        os.close(f)
        return output


And that's basically it. With some niceties for pushing and pulling files, and some error handling, the completed code looks like this:

#
# Remote ssh cmds
#

import pty, re, os, sys, stat, getpass

class SSHError(Exception):
    def __init__(self, value):
        self.value = value
    def __str__(self):
        return repr(self.value)

class SSH:

    def __init__(self, ip, passwd, user, port):
        self.ip = ip
        self.passwd = passwd
        self.user = user
        self.port = port

    def run_cmd(self, c):
        (pid, f) = pty.fork()
        if pid == 0:
            os.execlp("ssh", "ssh", '-p %d' % self.port,
                      self.user + '@' + self.ip, c)
        else:
            return (pid, f)
    def push_file(self, src, dst):
        (pid, f) = pty.fork()
        if pid == 0:
            os.execlp("scp", "scp", '-P %d' % self.port,
                      src, self.user + '@' + self.ip + ':' + dst)
        else:
            return (pid, f)

    def push_dir(self, src, dst):
        (pid, f) = pty.fork()
        if pid == 0:
            os.execlp("scp", "scp", '-P %d' % self.port, "-r", src,
                      self.user + '@' + self.ip + ':' + dst)
        else:
            return (pid, f)

    def _read(self, f):
        x = ''
        try:
            x = os.read(f, 1024)
        except Exception, e:
            # this always fails with io error
            pass
        return x

    def ssh_results(self, pid, f):
        output = ""
        got = self._read(f)
        # check for authenticity of host request
        m = re.search("authenticity of host", got)
        if m:
            os.write(f, 'yes\n')
            # Read until we get ack
            while True:
                got = self._read(f)
                m = re.search("Permanently added", got)
                if m:
                    break

            got = self._read(f)
        # check for passwd request
        m = re.search("assword:", got)
        if m:
            # send passwd
            os.write(f, self.passwd + '\n')
            # read two lines
            tmp = self._read(f)
            tmp += self._read(f)
            m = re.search("Permission denied", tmp)
            if m:
                raise Exception("Invalid passwd")
            # passwd was accepted
            got = tmp
        while got and len(got) > 0:
            output += got
            got = self._read(f)
        os.waitpid(pid, 0)
        os.close(f)
        return output

    def cmd(self, c):
        (pid, f) = self.run_cmd(c)
        return self.ssh_results(pid, f)

    def push(self, src, dst):
        s = os.stat(src)
        if stat.S_ISDIR(s[stat.ST_MODE]):
            (pid, f) = self.push_dir(src, dst)
        else:
            (pid, f) = self.push_file(src, dst)
        return self.ssh_results(pid, f)

def ssh_cmd(ip, passwd, cmd, user=getpass.getuser(), port=22):
    s = SSH(ip, passwd, user, port)
    return s.cmd(cmd)

def ssh_push(ip, passwd, src, dst, user=getpass.getuser(), port=22):

    s = SSH(ip, passwd, user, port)
    return s.push(src, dst)



Friday, January 27, 2012

Running a devnode cluster across multiple boxes

In my last post, we used devnodectl to fire up a simple 3-node cluster, with all nodes running on the same system.  This is interesting insofar as it demonstrates the functionality of the cluster, but we're certainly quite resource limited trying to emulate three nodes on one box.  In this post, I'll show how to get devnode running on different physical servers, so you can begin to see the potential of horizontal scaling available with Clustrix.
Along the way I'll also cover:
  • Manual control of devnode (vs. devnodectl)
  • How, where, and what to look for in the logs
Requirements
  • Three Red Hat/CentOS 6 or equivalent clients
  • All clients must be on the same subnet
  • The clients should NOT be running mysqld (so port 3306 will be free to devnode)
Firing up the devnode instances
First install the DevKit RPMs, as described in my last blog post, on each of your clients, and ensure you have a writeable working directory available on each (I'll use /data/clustrix on each).  If you followed along on the last exercise, please stop those nodes (devnodectl stop), and I also recommend cleaning out the prior state with rm -rf /data/clustrix/*.  The flags we'll specify below will overwrite old stuff as needed, but you'd still have the old nodes' 2 and 3 state on your first client, which might become confusing.  
  
Now we're going to start up devnode directly, and also change the flags around quite a bit.  I recommend opening three terminal windows, one for each node; connect (ssh, presumably) to each of your clients with these nodes, then run the following, one in each window:

client1$ /opt/clustrix/bin/devnode -clusterpath /data/clustrix -setpnid 1 -eth eth0 -clean -nclean 4 -vdev-size 2048 -logfile - 

client2$ /opt/clustrix/bin/devnode -clusterpath /data/clustrix -setpnid 2 -eth eth0 -clean -nclean 4 -vdev-size 2048 -logfile - -noautostart

client3$ /opt/clustrix/bin/devnode -clusterpath /data/clustrix -setpnid 3 -eth eth0 -clean -nclean 4 -vdev-size 2048 -logfile - -noautostart

The -clusterpath, -nclean, and -vdev-size flags we talked about last time (recap: where does the simulated node store it's data, and how many/big disks should we have -- note that I'm allocating 8GB per node here).  Let's pick apart the other flags here:
  • -setpnid sets a Physical Node ID -- normally this would come from a node's MAC address
  • -eth tells devnode to use the ethernet interface for inter-node communication; when we created our three node cluster on the same physical machine with devnodectl, it specified -unix to use UNIX sockets instead.  On real nodes, we'd be using InfiniBand for this purpose.
  • -clean tells the cluster to wipe all prior state 
  • -noautostart for the second and third nodes avoids a devnode restart step, as will be explained further below
  • -logfile - means to log to stdout instead of to a log file
This last option is how I prefer to run, because staring at logfiles is how I live.  For our purposes today I think it will be most instructive for you as well.  It should be noted that normally these logs go to /data/clustrix/p1/devnode.log (substitute p2, p3, etc. for other nodes).  

If a bunch of FATAL errors scroll by, the most likely culprit is a port conflict:
2012-01-25 11:22:38 ERROR cp/cp_sock.c:74 cp_bind(): stream_listen(IPv4(0.0.0.0:2048)): Address already in use
2012-01-25 11:22:38 FATAL core/segv.c:93 main_segv_handler(): Program received a fatal signal on core 0  fiber 0
2012-01-25 11:22:38 FATAL core/segv.c:95 main_segv_handler(): C stack trace:
0x000000000052d976 bind_done() <no lines read>
0x00000000008c7c91 scheduler_run_one_item() <no lines read>
0x00000000008c8893 scheduler_main_loop() <no lines read>

The above indicates that port 2048 (the control port, which will cover later) is already in use.  You'd see this if you tried to run the above commands on the same box, instead of 3 different boxes.  If you're running mysqld on one of your boxes, it will fail thusly:
n1 2012-01-25 11:37:47 ERROR mysql/server/mysql_proto.c:1171 listen_on_port(): stream_listen(IPv4(0.0.0.0:3306)): Address already in use
n1 2012-01-25 11:37:47 FATAL dbcore/dbstate.c:104 dbconf_done(): Error handling dbconf chain: Address already in use: dbconf/INIT_MYSQL_PROTO failed (unable to create TCP socket)

You can work around these by specifying the -anyport flag, in which case you'll need to look back through the logs to find which port it's chosen:
2012-01-25 11:41:17 INFO dbcore/driver.ct:90 driver_publish_address(): pnid p3 control port 33274 sw f71a74d89c512ac
n1 2012-01-25 11:41:17 INFO dbcore/driver.ct:90 driver_publish_address(): pnid p3 mysql port 36290 sw f71a74d89c512ac

Adding nodes 2 and 3 to your cluster
Normally fresh Clustrix nodes start up in a "cluster-of-one"; you can connect to any node and then pull in the others to form a larger cluster.  This mechanism involves a process restart (on real nodes, an initd-like process called nanny takes care of this); to avoid this, we used the -noautostart flag when starting nodes 2 and 3, so they don't start up as "cluster-of-one", and can't be accessed via mysql until they are added to a cluster.
So, connect to your first client (client1 above, the one where you did not have the -noautostart flag):
[nparrish@hefty mainline1]$ mysql -h beta001 -u root
mysql> use system;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> select * from available_node_details;
+------+------------+----------------------------------------------------+
| pnid | name       | value                                              |
+------+------------+----------------------------------------------------+
| p3   | cluster    | nparrish                                           | 
| p3   | sw_version | 1112854534402937516                                | 
| p3   | version    | 5.0.45-clustrix-v3.2-7371-0f71a74d89c512ac-release | 
| p3   | iface_ip   | 10.2.12.188                                        | 
| p3   | iface_mac  | 00:25:90:34:69:04                                  | 
| p3   | hostname   | loeb.colo.sproutsys.com                            | 
| p3   | started    | 2012-01-25 20:44:33.037318                         | 
| p2   | cluster    | nparrish                                           | 
| p2   | sw_version | 1112854534402937516                                | 
| p2   | version    | 5.0.45-clustrix-v3.2-7371-0f71a74d89c512ac-release | 
| p2   | iface_ip   | 10.2.12.194                                        | 
| p2   | iface_mac  | 00:25:90:34:70:0a                                  | 
| p2   | hostname   | sainz.colo.sproutsys.com                           | 
| p2   | started    | 2012-01-24 23:52:17.085502                         | 
+------+------------+----------------------------------------------------+
14 rows in set (0.00 sec)

So we're looking at system.available_node_details, which shows the nodes that can be seen on the network (here eth0, on real nodes via InfiniBand), who are not already part of another cluster.  It's sort of a key/value pair table, extended to include the pnid (Physical Node ID -- recall we set this with -setpnid).  We're really most interested in the hostname -- I can see my two other clients, great.  

Now we add these nodes into the cluster using the ALTER CLUSTER query:

mysql> alter cluster add p2;
Query OK, 0 rows affected (0.00 sec)


mysql> alter cluster add p3;
Query OK, 0 rows affected (0.02 sec)


mysql> select * from nodeinfo\G
*************************** 1. row ***************************
        nodeid: 1
       started: 2012-01-25 20:50:50
       ntptime: 2012-01-25 20:54:13
   node uptime: 2012-01-20 23:15:52
      hostname: beta001.colo.sproutsys.com
    iface_name: eth0
      iface_ip: 10.2.13.11
iface_mac_addr: 00:30:48:c3:e7:5c
          pnid: p1
         cores: 12288
*************************** 2. row ***************************
        nodeid: 3
       started: 2012-01-25 20:50:57
       ntptime: 2012-01-25 20:54:13
   node uptime: 2011-04-12 01:05:08
      hostname: loeb.colo.sproutsys.com
    iface_name: eth0
      iface_ip: 10.2.12.188
iface_mac_addr: 00:25:90:34:69:04
          pnid: p3
         cores: 805313044
*************************** 3. row ***************************
        nodeid: 2
       started: 2012-01-25 20:50:53
       ntptime: 2012-01-25 20:54:13
   node uptime: 2011-10-25 17:31:38
      hostname: sainz.colo.sproutsys.com
    iface_name: eth0
      iface_ip: 10.2.12.194
iface_mac_addr: 00:25:90:34:70:0a
          pnid: p2
         cores: 3145744
3 rows in set (0.01 sec)

And you're ready to rock and roll.  (Yes, the cores value is a little funny -- on Clustrix nodes this would tell you how many CPU cores available on each node.)

Accessing Your Cluster
With each node using the standard MySQL port, it's a little less fussy to connect to the cluster, as you no longer need to find and specify a different port for each node.  As before, connect to any node and you see the same database instance.  

Normally a Clustrix cluster is configured with a Virtual IP (VIP), which is a distinct IP address which load balances connections across all the nodes.  This is partially implemented within the base OS of our appliance nodes, so unfortunately we cannot provide this functionality with the DevKit.  You can, however, implement simple load-balancing with a tool like HAProxy (we use this internally as a simple solution to cut over between clusters located on different subnets, where DSR is not possible).  I'll see about writing this up as another blog post.  

Please bear in mind that while we've now got our devnode processes distributed across multiple servers, they are now communicating over ethernet instead of via UNIX sockets (memory).  Ethernet is a far cry from the InfiniBand which connects real Clustrix nodes.  Beyond IB being a higher throughput, lower latency interconnect, our software is tuned for these characteristics, so we're not going to see world-beating performance with our simulated cluster running over ethernet.  For that, I'll refer you to our prior blog post on Percona's TPCC test!

Recap
So we've shown how to get devnode running on different nodes, communicating with each other via ethernet.  While a little more "real" than having them all running on a single box, the performance characteristics are going to be orders of magnitude off from the capabilities of proper Clustrix nodes.  This does provide a simulacra of the platform to develop against, and we'd welcome the opportunity to move you from there to deploying on real hardware.  As always, your feedback or questions are welcome, in the comments or support forum.  

Monday, January 23, 2012

Getting started with Clustrix DevKit

We recently released our Clustrix Developers Kit to enable folks to experiment with the Clustrix solution right away, without having to take delivery of actual hardware or coordinate a hosted evaluation.  The kit includes the devnode binary, which is the database code minus the few hardware-specific bits to run on our appliance hardware, and a devnodectl utility to simplify wrangling multiple instances of devnode together into a functional cluster.  With this kit, it is possible to create a fully functional cluster, load data onto it, and run your application against it.  Obviously there are limitations: CPU, memory, and disk I/O capacity are going to be divided between multiple simulated nodes, and while we can simulate failure scenarios (as we will do in a future post), actual fault tolerance requires Clustrix nodes.

In this post, I'll walk you through:
  1. Installing the RPMs that comprise the DevKit
  2. Using devnodectl to start up a cluster
  3. Accessing the database instance with the mysql client
  4. Dumping and importing data from a MySQL instance
  5. Starting MySQL replication as a slave
Along the way I want to highlight the following features of Clustrix:
  • Drop-in compatibility with MySQL, including replication
  • Single-instance database (no sharding!)
Let's get started!

Installation
First download from the main page at www.clustrix.com, the "Try it now" button.  You'll need to download and install the common and devnode packages as follows:

[root@beta001 v3.2]# rpm -ivh clustrix-common-v3.2-493.x86_64.rpm clustrix-devnode-v3.2-7371.x86_64.rpm
Preparing...                ########################################### [100%]
   1:clustrix-common        ########################################### [ 50%]
   2:clustrix-devnode       ########################################### [100%]

If you run into any errors installing the RPMs, it's likely to be due to missing dependencies, which you may be able to fill in using yum.   Otherwise, you'll probably need to find a client with a more recent Redhat/CentOS install (for this exercise, I'm using CentOS 6).  
What does the RPM actually install?  Everything installs under /opt/clustrix; support libraries in 
lib/, and the two executables we care about in bin/: devnode and devnodectl.  

You'll also need a data directory for the cluster to store files which will serve as virtual disks.  It is highly recommended that you store these on local disk rather than over NFS.  Bear in mind that with proper Clustrix nodes, this storage is provided with SSDs, thus the throughput capacity of your local storage will be a constraining factor in performance of a devnode cluster (a subject I plan to revisit in a future blog post).  Here I'm going to use /data/clustrix, which is the default for the devnodectl utility (if you use a different dir, you'll need to use the -d/--data-dir argument each time you run devnodectl):


[root@beta001 v3.2]# mkdir /data/clustrix
[root@beta001 v3.2]# chown nparrish /data/clustrix

If you are running as a non-root user (which is advised), you'll also want to chown the directory to your username, as shown.

Initializing the cluster
devnodectl takes care of starting up devnode processes, specifying flags so that they automatically join together into a cluster.  In later blog posts we'll take a more manual approach, in order to demonstrate things like adding nodes to an existing cluster, running devnode instances on multiple clients, and fault tolerance features.  But for now, firing up the cluster is as simple as:

[nparrish@beta001 ~]$ /opt/clustrix/bin/devnodectl --init --nodes 3 start

Let's look at the output this returns:

[exec] /opt/clustrix/bin/devnode -clusterpath /data/clustrix -cluster nparrish -setpnid 1 -anyport -unix -glue -noautostart -nclean 4 -vdev-size 256
[exec] /opt/clustrix/bin/devnode -clusterpath /data/clustrix -cluster nparrish -setpnid 2 -anyport -unix -noautostart -nclean 4 -vdev-size 256
[exec] /opt/clustrix/bin/devnode -clusterpath /data/clustrix -cluster nparrish -setpnid 3 -anyport -unix -noautostart -nclean 4 -vdev-size 256

This shows us the actual devnode commands being run.  I'll draw your attention to a few of the options specified:
-clusterpath /data/clustrix: All cluster state, including the files used for virtual disks, will be in this dir.
-nclean 4 -vdev-size 256: This gives us 4 disks of 256MB each, for each of the three nodes.

Waiting for node 1 to enter quorum.
done.


Node 1 [RUNNING]: /data/clustrix/p1
mysql: 3306 control: 52924 healthmon:3581
mysql socket: /data/clustrix/p1/mysql.sock

Node 2 [RUNNING]: /data/clustrix/p2
mysql: 57059 control: 57079 healthmon:52498
mysql socket: /data/clustrix/p2/mysql.sock

Node 3 [RUNNING]: /data/clustrix/p3
mysql: 58157 control: 2048 healthmon:52002
mysql socket: /data/clustrix/p3/mysql.sock

Example command to access your cluster:
mysql -u root -S /data/clustrix/p3/mysql.sock

This tells us how to access each of the three nodes that have been started.  The -anyport command means that each node gets a random MySQL port (after trying for the default 3306, which you can see node 1 gets -- if your client is already running a mysqld instance, all nodes would get some different port number), and this output tells you what port to specify for your clients to connect.  You can also connect via socket, as shown in the example.  The control and healthmon ports will be covered in a future blog post.  

Connecting to the cluster
[nparrish@beta001 ~]$ mysql -h 127.0.0.1 -P 58157 -u root

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 3074
Server version: 5.0.45 

Copyright (c) 2000, 2010, Oracle and/or its affiliates. All rights reserved.
This software comes with ABSOLUTELY NO WARRANTY. This is free software,
and you are welcome to modify and redistribute it under the GPL v2 license

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> select @@version;

+----------------------------------------------------+
| @@version                                          |
+----------------------------------------------------+
| 5.0.45-clustrix-v3.2-7371-0f71a74d89c512ac-release |
+----------------------------------------------------+
1 row in set (0.00 sec)

A few things to note here:
  • This is standard, off-the-shelf mysql command line client (hence the Oracle copyright)
  • I connected using the IP for localhost; mysql client tries to be smart and use the default mysql socket (/var/lib/mysql/mysql.sock) if you specify -h localhost
  • Clustrix reports server version 5.0.45 for compatibility only; from the full version string in @@version,  the relevant Clustrix part starts at v3.2)
  • We are connecting with built-in root user, which initially has no password
Let's create a table and insert a little bit of data:

mysql> use test;
Database changed
mysql> create table foo (id int key, v varchar(100));
Query OK, 0 rows affected (0.09 sec)

mysql> insert into foo values (1, 'bar'), (2, 'baz');
Query OK, 2 rows affected (0.00 sec)

mysql> select * from foo;
+----+------+
| id | v    |
+----+------+
|  1 | bar  |
|  2 | baz  |
+----+------+
2 rows in set (0.03 sec)

Nothing fancy, but let's now connect to a different node and see how things look:

mysql> exit
Bye
[nparrish@beta001 ~]$ mysql -h 127.0.0.1 -P 57059 -u root test
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
[...]
mysql> show tables;
+----------------+
| Tables_in_test |
+----------------+
| foo            |
+----------------+
1 row in set (0.00 sec)

mysql> select * from foo;
+----+------+
| id | v    |
+----+------+
|  2 | baz  |
|  1 | bar  |
+----+------+
2 rows in set (0.01 sec)

Note that here I've connected to a different node, but I have the exact same view of my data.  You might also note that the rows were returned in a different order.  On Clustrix, rows are distributed across multiple nodes, and so if no ORDER BY clause is specified, may be returned in different order.  This differs from MySQL, which implicitly orders by primary key (since it must only read from one place).

mysqldump and clustrix_import
Let's import data from an existing MySQL instance (here running on a different client, called hefty).  We start by dumping with mysqldump:

[nparrish@hefty tmp]$ mysqldump -h 127.0.0.1 --single-transaction --master-data=2 sbtest > /tmp/sbtest.sql
  • --single-transaction ensures that all tables in the database are consistent with respect to eachother
  • --master-data=2 gives us the position in the replication binlog that corresponds to that single transaction, so we can start up a replication slave
  • Here we are just dumping a specific database, sbtest.  We'll also need to create this database on our cluster, as the resulting dump file will not have DROP/CREATE DATABASE statements for it. 
We can then import this using mysql client:
[nparrish@hefty tmp]$ mysql -h beta001 -P 3306 sbtest < /tmp/sbtest.sql

A few things to note here:
  • As noted, we need to indicate which database to import into, so we must first CREATE DATABASE sbtest; then specify sbtest argument to indicate the target database.
  • If we were importing into actual nodes, we'd use a tool called clustrix_import, which does the inserts in parallel; the tool has not been (de)tuned for use with devnode, so using these together will result in an out-of-memory condition.
Recall that these nodes were created with 4 256MB disks each, so by default the cluster will accomodate only a small (~1GB) dump file.  If you want larger disks, use the --drive-size option when running devnodectl --init (see devnodectl -h output for details).

MySQL Replication
We'll dedicate a whole future blog post to replication, but for just a taste let's start up a slave to catch up from the point at which we took the above dump.  First we need to find the log file and position, which is stored as a comment in our dump file:
[nparrish@hefty tmp]$ head -24 /tmp/sbtest.sql 
-- MySQL dump 10.11
--
[....]
-- Position to start replication or point-in-time recovery from
--
-- CHANGE MASTER TO MASTER_LOG_FILE='hefty-bin.000001', MASTER_LOG_POS=610868;

So we just need to add the hostname and user account to use in order to point our cluster at this MySQL instance (note that we can run this on any of our nodes):
mysql> CHANGE MASTER TO MASTER_LOG_FILE='hefty-bin.000001', MASTER_LOG_POS=610868, MASTER_HOST='hefty', MASTER_USER='root';
Query OK, 0 rows affected (0.04 sec)

mysql> start slave;
Query OK, 0 rows affected (0.02 sec)

mysql> show slave status\G
*************************** 1. row ***************************
            Slave_Name: default
          Slave_Status: Running
           Master_Host: hefty
           Master_Port: 3306
           Master_User: root
       Master_Log_File: hefty-bin
         Slave_Enabled: Enabled
          Log_File_Seq: 2
          Log_File_Pos: 267
            Last_Error: no error
     Connection_Status: Connected
  Relay_Log_Bytes_Read: 0
Relay_Log_Current_Size: 0
 Seconds_Behind_Master: 0
1 row in set (0.01 sec)

And we're off and replicating any write events on the MySQL instance.  

Stopping devnodes
To shut down your cluster, run devnodectl stop:
[nparrish@beta001 clustrix]$ /opt/clustrix/bin/devnodectl stop
[node /data/clustrix/p1] kill -9 21089
[node /data/clustrix/p2] kill -9 20078
[node /data/clustrix/p3] kill -9 20079

As noted, the processes are simply stopped with kill -9.  You can restart them with devnodectl start, this time leaving out the --init and --nodes options; note that these processes will probably get different mysql ports than for the prior run.  We'll spend more time with restarting devnode processes when we cover fault tolerance in a future post.  

Recap
So we've covered some of the initial steps you'll be taking with the Clustrix developers kit:
  1. Installing the RPMs
  2. Starting up a simple three node cluster with devnodectl
  3. Connecting to the cluster with mysql 
  4. Using mysqldump to get data from an existing MySQL instance, then import with mysql onto your cluster
  5. Using MySQL replication to set up your cluster as a slave to that MySQL instance
If you're familiar with MySQL, most of this should look pretty familiar, and that's quite the point -- the drop-in compatibility you get with Clustrix means there's no application rewrite involved.  Out of the box you get a cluster which hosts a single-instance database, no federation or sharding mess.  

What's next
Stay tuned for more.  Future topics will include a closer look at how data is stored on Clustrix, deeper coverage of replication capabilities, running devnode on multiple hosts, fault tolerance, examining query performance, and more.  
Please come talk to us on the support forums at https://groups.google.com/a/clustrix.com/group/support-public/topics if you run into any problems or have more technical questions.   

Friday, November 04, 2011

Distributed Database Architectures: Distributed Storage / Centralized Compute

In my previous post I wrote about shared disk architectures and the problems they introduce. It's common to see comparisons between shared disk and shared nothing architectures, but that distinction is too coarse to capture the differences between various shared nothing approaches.

Instead, I'm going to characterize the various "shared-nothing" style systems by their query evaluation architectures. Most systems fall into one of the following buckets:
  • Centralized compute
  • Limited distributed compute
  • Fully distributed compute

Centralized Compute: MySQL Cluster

MySQL Cluster consists of two basic roles used for servicing user queries: a compute role and a storage/data role. The compute node is the front end which takes in the query, plans it, and executes it. The compute node will communicate with the storage nodes remotely to fetch any data relevant to the query.


In the distributed storage model, data is no longer shared between the nodes at the page level. Instead, the storage nodes expose a higher level API which allows the compute node to fetch row ranges based on the available access paths (i.e. indexes).

In such a system, storage level locks associated with the data are now managed exclusively by the storage node itself. A compute node does not cache any data; instead, it always asks the set of storage nodes responsible for the data. The system solved the cache coherence overhead problem.
However, it still suffers from extensive data movement and centralized query evaluation. 

  • Cache coherence overhead
  • Extensive data movement
  • Centralized query evaluation

MySQL Query Evaluation in Action

Consider the following example:

SELECT count(*) FROM mytable WHERE acol = 1 and bcol = 2

Assumptions:
  • an index over acol
  • 10% of the rows in the table match acol = 1
  • 3% of the rows match acol = 1 and bcol = 2
  • total table size 1 Billion rows


In the diagram above, the arrows represent the flow of data through the system. As you can see, most of the query evaluation in the example is done by a single compute node.  The system generated a data movement of 100 million rows, and only a single node performed additional filtering and aggregate count.

It's an improvement over a shared disk system, but it still has some serious limitations. Such a system could be well suited for simple key access (i.e. query touches a few specific rows), but any more complexity will generally result in poor performance

As with the shared disk system, adding more nodes will not help improve single query execution, and queries which operate over large volumes of data have the potential to saturate the message bus between the nodes.