Useful MySQL 5.6 features you get for free in PXC 5.6

December 2, 2013, 7:57 am

≫ Next: The use of Iptables ClusterIP target as a load balancer for PXC, PRM, MHA and NDB

≪ Previous: Q&A: Percona XtraDB Cluster as a MySQL HA solution for OpenStack

I get a lot of questions about Percona XtraDB Cluster 5.6 (PXC 5.6), specifically about whether such and such MySQL 5.6 Community Edition feature is in PXC 5.6. The short answer is: yes, all features in community MySQL 5.6 are in Percona Server 5.6 and, in turn, are in PXC 5.6. Whether or not the new feature is useful in 5.6 really depends on how useful it is in general with Galera.

I thought it would be useful to highlight a few features and try to show them working:

Innodb Fulltext Indexes

Yes, FTS works in Innodb in 5.6, so why wouldn’t it work in PXC 5.6? To test this I used the Sakila database , which contains a single table with FULLTEXT. In the sakila-schema.sql file, it is still designated a MyISAM table:

CREATE TABLE film_text (
  film_id SMALLINT NOT NULL,
  title VARCHAR(255) NOT NULL,
  description TEXT,
  PRIMARY KEY  (film_id),
  FULLTEXT KEY idx_title_description (title,description)
)ENGINE=MyISAM DEFAULT CHARSET=utf8;

I edited that file to change MyISAM to Innodb, loaded the schema and data into my 3 node cluster:

[root@node1 sakila-db]# mysql < sakila-schema.sql
[root@node1 sakila-db]# mysql < sakila-data.sql

and it works seamlessly:

node1 mysql> select title, description, match( title, description) against ('action saga' in natural language mode) as score from sakila.film_text order by score desc limit 5;
+-----------------+-----------------------------------------------------------------------------------------------------------+--------------------+
| title | description | score |
+-----------------+-----------------------------------------------------------------------------------------------------------+--------------------+
| FACTORY DRAGON | A Action-Packed Saga of a Teacher And a Frisbee who must Escape a Lumberjack in The Sahara Desert | 3.0801234245300293 |
| HIGHBALL POTTER | A Action-Packed Saga of a Husband And a Dog who must Redeem a Database Administrator in The Sahara Desert | 3.0801234245300293 |
| MATRIX SNOWMAN | A Action-Packed Saga of a Womanizer And a Woman who must Overcome a Student in California | 3.0801234245300293 |
| REEF SALUTE | A Action-Packed Saga of a Teacher And a Lumberjack who must Battle a Dentist in A Baloon | 3.0801234245300293 |
| SHANE DARKNESS | A Action-Packed Saga of a Moose And a Lumberjack who must Find a Woman in Berlin | 3.0801234245300293 |
+-----------------+-----------------------------------------------------------------------------------------------------------+--------------------+
5 rows in set (0.00 sec)

Sure enough, I can run this query on any node and it works fine:

node3 mysql> select title, description, match( title, description) against ('action saga' in natural language mode) as score from sakila.film_text order by score desc limit 5;
+-----------------+-----------------------------------------------------------------------------------------------------------+--------------------+
| title           | description                                                                                               | score              |
+-----------------+-----------------------------------------------------------------------------------------------------------+--------------------+
| FACTORY DRAGON  | A Action-Packed Saga of a Teacher And a Frisbee who must Escape a Lumberjack in The Sahara Desert         | 3.0801234245300293 |
| HIGHBALL POTTER | A Action-Packed Saga of a Husband And a Dog who must Redeem a Database Administrator in The Sahara Desert | 3.0801234245300293 |
| MATRIX SNOWMAN  | A Action-Packed Saga of a Womanizer And a Woman who must Overcome a Student in California                 | 3.0801234245300293 |
| REEF SALUTE     | A Action-Packed Saga of a Teacher And a Lumberjack who must Battle a Dentist in A Baloon                  | 3.0801234245300293 |
| SHANE DARKNESS  | A Action-Packed Saga of a Moose And a Lumberjack who must Find a Woman in Berlin                          | 3.0801234245300293 |
+-----------------+-----------------------------------------------------------------------------------------------------------+--------------------+
5 rows in set (0.05 sec)
node3 mysql> show create table sakila.film_textG
*************************** 1. row ***************************
       Table: film_text
Create Table: CREATE TABLE `film_text` (
  `film_id` smallint(6) NOT NULL,
  `title` varchar(255) NOT NULL,
  `description` text,
  PRIMARY KEY (`film_id`),
  FULLTEXT KEY `idx_title_description` (`title`,`description`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
1 row in set (0.00 sec)

There might be a few caveats and differences from how FTS works in Innodb vs MyISAM, but it is there.

Minimal replication images

Galera relies heavily on RBR events, but until 5.6 those were entire row copies, even if you only changed a single column in the table. In 5.6 you can change this to send only the updated data using the variable binlog_row_image=minimal.

Using a simple sysbench update test for 1 minute, I can determine the baseline size of the replicated data:

node3 mysql> show global status like 'wsrep_received%';
+----------------------+-----------+
| Variable_name        | Value     |
+----------------------+-----------+
| wsrep_received       | 703       |
| wsrep_received_bytes | 151875070 |
+----------------------+-----------+
2 rows in set (0.04 sec)
... test runs for 1 minute...
node3 mysql> show global status like 'wsrep_received%';
+----------------------+-----------+
| Variable_name        | Value     |
+----------------------+-----------+
| wsrep_received       | 38909     |
| wsrep_received_bytes | 167749809 |
+----------------------+-----------+
2 rows in set (0.17 sec)

This results in 62.3 MB of data replicated in this test.

If I set binlog_row_image=minimal on all nodes and do a rolling restart, I can see how this changes:

node3 mysql> show global status like 'wsrep_received%';
+----------------------+-------+
| Variable_name        | Value |
+----------------------+-------+
| wsrep_received       | 3     |
| wsrep_received_bytes | 236   |
+----------------------+-------+
2 rows in set (0.07 sec)
... test runs for 1 minute...
node3 mysql> show global status like 'wsrep_received%';
+----------------------+----------+
| Variable_name        | Value    |
+----------------------+----------+
| wsrep_received       | 34005    |
| wsrep_received_bytes | 14122952 |
+----------------------+----------+
2 rows in set (0.13 sec)

This yields a mere 13.4MB, that’s 80% smaller, quite a savings! This benefit, of course, fully depends on the types of workloads you are doing.

Durable Memcache Cluster

It turns out this feature does not work properly with Galera, see below for an explanation:

5.6 introduces an Memcached interface for Innodb. This means any standard memcache client can talk to our PXC nodes with the memcache protocol and the data is:

~~Replicated to all nodes~~
~~Durable across the cluster~~
~~Highly available~~
Easy to hash memcache clients across all servers for better cache coherency

To set this up, we need to simply load the innodb_memcache schema from the example and restart the daemon to get a listening memcached port:

[root@node1 ~]# mysql < /usr/share/mysql/innodb_memcached_config.sql
[root@node1 ~]# service mysql restart
Shutting down MySQL (Percona XtraDB Cluster)...... SUCCESS!
Starting MySQL (Percona XtraDB Cluster)...... SUCCESS!
[root@node1 ~]# lsof +p`pidof mysqld` | grep LISTEN
mysqld  31961 mysql   11u  IPv4             140592       0t0      TCP *:tram (LISTEN)
mysqld  31961 mysql   55u  IPv4             140639       0t0      TCP *:memcache (LISTEN)
mysqld  31961 mysql   56u  IPv6             140640       0t0      TCP *:memcache (LISTEN)
mysqld  31961 mysql   59u  IPv6             140647       0t0      TCP *:mysql (LISTEN)

This all appears to work and I can fetch the sample AA row from all the nodes with the memcached interface:

node1 mysql> select * from demo_test;
+-----+--------------+------+------+------+
| c1  | c2           | c3   | c4   | c5   |
+-----+--------------+------+------+------+
| AA  | HELLO, HELLO |    8 |    0 |    0 |
+-----+--------------+------+------+------+
[root@node3 ~]# telnet 127.0.0.1 11211
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
get AA
VALUE AA 8 12
HELLO, HELLO
END

However, if I try to update a row, it does not seem to replicate (even if I set innodb_api_enable_binlog):

[root@node3 ~]# telnet 127.0.0.1 11211
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
set DD 0 0 0
STORED
^]
telnet> quit
Connection closed.
node3 mysql> select * from demo_test;
+----+--------------+------+------+------+
| c1 | c2           | c3   | c4   | c5   |
+----+--------------+------+------+------+
| AA | HELLO, HELLO |    8 |    0 |    0 |
| DD |              |    0 |    1 |    0 |
+----+--------------+------+------+------+
2 rows in set (0.00 sec)
node1 mysql> select * from demo_test;
+-----+--------------+------+------+------+
| c1  | c2           | c3   | c4   | c5   |
+-----+--------------+------+------+------+
| AA  | HELLO, HELLO |    8 |    0 |    0 |
+-----+--------------+------+------+------+

So unfortunately the memcached plugin must use some backdoor to Innodb that Galera is unaware of. I’ve filed a bug on the issue, but it’s not clear if there will be an easy solution or if a whole lot of code will be necessary to make this work properly.

In the short-term, however, you can at least read data from all nodes with the memcached plugin as long as data is only written using the standard SQL interface.

Async replication GTID Integration

Async GTIDs were introduced in 5.6 in order to make CHANGE MASTER easier. You have always been able to use async replication from any cluster node, but now with this new GTID support, it is much easier to failover to another node in the cluster as a new master.

If we take one node out of our cluster to be a slave and enable GTID binary logging on the other two by adding these settings:

server-id = ##
log-bin = cluster_log
log-slave-updates
gtid_mode = ON
enforce-gtid-consistency

If I generate some writes on the cluster, I can see GTIDs are working:

node1 mysql> show master statusG
*************************** 1. row ***************************
File: cluster_log.000001
Position: 573556
Binlog_Do_DB:
Binlog_Ignore_DB:
Executed_Gtid_Set: e941e026-ac70-ee1c-6dc9-40f8d3b5db3f:1-1505
1 row in set (0.00 sec)
node2 mysql> show master statusG
*************************** 1. row ***************************
File: cluster_log.000001
Position: 560011
Binlog_Do_DB:
Binlog_Ignore_DB:
Executed_Gtid_Set: e941e026-ac70-ee1c-6dc9-40f8d3b5db3f:1-1505
1 row in set (0.00 sec)

Notice that we’re at GTID 1505 on both nodes, even though the binary log position happens to be different.

I set up my slave to replicate from node1 (.70.2):

node3 mysql> show slave statusG
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.70.2
                  Master_User: slave
...
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
...
           Retrieved_Gtid_Set: e941e026-ac70-ee1c-6dc9-40f8d3b5db3f:1-1506
            Executed_Gtid_Set: e941e026-ac70-ee1c-6dc9-40f8d3b5db3f:1-1506
                Auto_Position: 1
1 row in set (0.00 sec)

And it’s all caught up. If put some load on the cluster, I can easily change to node2 as my master without needing to stop writes:

node3 mysql> stop slave;
Query OK, 0 rows affected (0.09 sec)
node3 mysql> change master to master_host='192.168.70.3', master_auto_position=1;
Query OK, 0 rows affected (0.02 sec)
node3 mysql> start slave;
Query OK, 0 rows affected (0.02 sec)
node3 mysql> show slave statusG
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.70.3
                  Master_User: slave
...
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
...
            Executed_Gtid_Set: e941e026-ac70-ee1c-6dc9-40f8d3b5db3f:1-3712
                Auto_Position: 1

So this seems to work pretty well. It does turns out there is a bit of a bug, but it’s actually with Xtrabackup — currently the binary logs are not copied in Xtrabackup SST and this can cause GTID inconsistencies within nodes in the cluster. I would expect this to get fixed relatively quickly.

Conclusion

MySQL 5.6 introduces a lot of new interesting features that are even more compelling in the PXC/Galera world. If you want to experiment for yourself, I pushed the Vagrant environment I used to Github at: https://github.com/jayjanssen/pxc_56_features

The post Useful MySQL 5.6 features you get for free in PXC 5.6 appeared first on MySQL Performance Blog.

↧

The use of Iptables ClusterIP target as a load balancer for PXC, PRM, MHA and NDB

January 10, 2014, 12:00 am

≫ Next: ClusterControl for Percona XtraDB Cluster Improves Management and Monitoring

≪ Previous: Useful MySQL 5.6 features you get for free in PXC 5.6

Most technologies achieving high-availability for MySQL need a load-balancer to spread the client connections to a valid database host, even the Tungsten special connector can be seen as a sophisticated load-balancer. People often use hardware load balancer or software solution like haproxy. In both cases, in order to avoid having a single point of failure, multiple load balancers must be used. Load balancers have two drawbacks: they increase network latency and/or they add a validation check load on the database servers. The increased network latency is obvious in the case of standalone load balancers where you must first connect to the load balancer which then completes the request by connecting to one of the database servers. Some workloads like reporting/adhoc queries are not affected by a small increase of latency but other workloads like oltp processing and real-time logging are. Each load balancers must also check regularly if the database servers are in a sane state, so adding more load balancers increases the idle chatting over the network. In order to reduce these impacts, a very different type of load balancer is needed, let me introduce the Iptables ClusterIP target.

Normally, as stated by the RFC 1812 Requirements for IP Version 4 Routers an IP address must be unique on a network and each host must respond only for IPs it own. In order to achieve a load balancing behavior, the Iptables ClusterIP target doesn’t strictly respect the RFC. The principle is simple, each computer in the cluster share an IP address and MAC address with the other members but it answers requests only for a given subset, based on the modulo of a network value which is sourceIP-sourcePort by default. The behavior is controlled by an iptables rule and by the content of the kernel file /proc/net/ipt_CLUSTERIP/VIP_ADDRESS. The kernel /proc file just informs the kernel to which portion of the traffic it should answer. I don’t want to go too deep in the details here since all those things are handled by the Pacemaker resource agent, IPaddr2.

The IPaddr2 Pacemaker resource agent is commonly used for VIP but what is less know is its behavior when defined as part of a clone set. When part of clone set, the resource agent defines a VIP which uses the Iptables ClusterIP target, the iptables rules and the handling of the proc file are all done automatically. That seems very nice in theory but until recently, I never succeeded in having a suitable distribution behavior. When starting the clone set on, let’s say, three nodes, it distributes correctly, one instance on each but if 2 nodes fail and then recover, the clone instances all go to the 3rd node and stay there even after the first two nodes recover. That bugged me for quite a while but I finally modified the resource agent and found a way to have it work correctly. It also now set correctly the MAC address if none is provided to the MAC multicast address domain which starts by “01:00:5E”. The new agent, IPaddr3, is available here. Now, let’s show what we can achieve with it.

We’ll start from the setup described in my previous post and we’ll modify it. First, download and install the IPaddr3 agent.

root@pacemaker-1:~# wget -O /usr/lib/ocf/resource.d/percona/IPaddr3 https://github.com/percona/percona-pacemaker-agents/raw/master/agents/IPaddr3
root@pacemaker-1:~# chmod u+x /usr/lib/ocf/resource.d/percona/IPaddr3

Repeat these steps on all 3 nodes. Then, we’ll modify the pacemaker configuration like this (I’ll explain below):

node pacemaker-1
        attributes standby="off"
node pacemaker-2
        attributes standby="off"
node pacemaker-3
        attributes standby="off"
primitive p_cluster_vip ocf:percona:IPaddr3
        params ip="172.30.212.100" nic="eth1"
        meta resource-stickiness="0"
        op monitor interval="10s"
primitive p_mysql_monit ocf:percona:mysql_monitor
        params reader_attribute="readable_monit" writer_attribute="writable_monit" user="repl_user" password="WhatAPassword" pid="/var/lib/mysql/mysqld.pid" socket="/var/run/mysqld/mysqld.sock" max_slave_lag="5" cluster_type="pxc"
        op monitor interval="1s" timeout="30s" OCF_CHECK_LEVEL="1"
clone cl_cluster_vip p_cluster_vip
        meta clone-max="3" clone-node-max="3" globally-unique="true"
clone cl_mysql_monitor p_mysql_monit
        meta clone-max="3" clone-node-max="1"
location loc-distrib-cluster-vip cl_cluster_vip
        rule $id="loc-distrib-cluster-vip-rule" -1: p_cluster_vip_clone_count gt 1
location loc-enable-cluster-vip cl_cluster_vip
        rule $id="loc-enable-cluster-vip-rule" 2: writable_monit eq 1
location loc-no-cluster-vip cl_cluster_vip
        rule $id="loc-no-cluster-vip-rule" -inf: writable_monit eq 0
property $id="cib-bootstrap-options"
        dc-version="1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff"
        cluster-infrastructure="openais"
        expected-quorum-votes="3"
        stonith-enabled="false"
        no-quorum-policy="ignore"
        last-lrm-refresh="1384275025"
        maintenance-mode="off"

First, the VIP primitive is modified to use the new agent, IPaddr3, and we set resource-stickiness=”0″. Next, we define the cl_cluster_vip clone set using: clone-max=”3″ to have three instances, clone-node-max=”3″ to allow up to three instances on the same node and globally-unique=”true” to tell Pacemaker it has to allocate an instance on a node even if there’s already one. Finally, there’re three location rules needed to get the behavior we want, one using the p_cluster_vip_clone_count attribute and the other two around the writable_monit attribute. Enabling all that gives:

root@pacemaker-1:~# crm_mon -A1
============
Last updated: Tue Jan  7 10:51:38 2014
Last change: Tue Jan  7 10:50:38 2014 via cibadmin on pacemaker-1
Stack: openais
Current DC: pacemaker-2 - partition with quorum
Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
3 Nodes configured, 3 expected votes
6 Resources configured.
============
Online: [ pacemaker-1 pacemaker-2 pacemaker-3 ]
 Clone Set: cl_cluster_vip [p_cluster_vip] (unique)
     p_cluster_vip:0    (ocf::percona:IPaddr3): Started pacemaker-3
     p_cluster_vip:1    (ocf::percona:IPaddr3): Started pacemaker-1
     p_cluster_vip:2    (ocf::percona:IPaddr3): Started pacemaker-2
 Clone Set: cl_mysql_monitor [p_mysql_monit]
     Started: [ pacemaker-1 pacemaker-2 pacemaker-3 ]
Node Attributes:
* Node pacemaker-1:
    + p_cluster_vip_clone_count         : 1
    + readable_monit                    : 1
    + writable_monit                    : 1
* Node pacemaker-2:
    + p_cluster_vip_clone_count         : 1
    + readable_monit                    : 1
    + writable_monit                    : 1
* Node pacemaker-3:
    + p_cluster_vip_clone_count         : 1
    + readable_monit                    : 1
    + writable_monit                    : 1

and the network configuration is:

root@pacemaker-1:~# iptables -L INPUT -n
Chain INPUT (policy ACCEPT)
target     prot opt source               destination
CLUSTERIP  all  --  0.0.0.0/0            172.30.212.100       CLUSTERIP hashmode=sourceip-sourceport clustermac=01:00:5E:91:18:86 total_nodes=3 local_node=1 hash_init=0
root@pacemaker-1:~# cat /proc/net/ipt_CLUSTERIP/172.30.212.100
2
root@pacemaker-2:~# cat /proc/net/ipt_CLUSTERIP/172.30.212.100
3
root@pacemaker-3:~# cat /proc/net/ipt_CLUSTERIP/172.30.212.100
1

In order to test the access, you need to query the VIP from a fourth node:

root@pacemaker-4:~# while [ 1 ]; do mysql -h 172.30.212.100 -u repl_user -pWhatAPassword -BN -e "select variable_value from information_schema.global_variables where variable_name like 'hostname';"; sleep 1; done
pacemaker-1
pacemaker-1
pacemaker-2
pacemaker-2
pacemaker-2
pacemaker-3
pacemaker-2
^C

So, all good… Let’s now desync the pacemaker-1 and pacemaker-2.

root@pacemaker-1:~# mysql -e 'set global wsrep_desync=1;'
root@pacemaker-1:~#
root@pacemaker-2:~# mysql -e 'set global wsrep_desync=1;'
root@pacemaker-2:~#
root@pacemaker-3:~# crm_mon -A1
============
Last updated: Tue Jan  7 10:53:51 2014
Last change: Tue Jan  7 10:50:38 2014 via cibadmin on pacemaker-1
Stack: openais
Current DC: pacemaker-2 - partition with quorum
Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
3 Nodes configured, 3 expected votes
6 Resources configured.
============
Online: [ pacemaker-1 pacemaker-2 pacemaker-3 ]
 Clone Set: cl_cluster_vip [p_cluster_vip] (unique)
     p_cluster_vip:0    (ocf::percona:IPaddr3): Started pacemaker-3
     p_cluster_vip:1    (ocf::percona:IPaddr3): Started pacemaker-3
     p_cluster_vip:2    (ocf::percona:IPaddr3): Started pacemaker-3
 Clone Set: cl_mysql_monitor [p_mysql_monit]
     Started: [ pacemaker-1 pacemaker-2 pacemaker-3 ]
Node Attributes:
* Node pacemaker-1:
    + p_cluster_vip_clone_count         : 1
    + readable_monit                    : 0
    + writable_monit                    : 0
* Node pacemaker-2:
    + p_cluster_vip_clone_count         : 1
    + readable_monit                    : 0
    + writable_monit                    : 0
* Node pacemaker-3:
    + p_cluster_vip_clone_count         : 3
    + readable_monit                    : 1
    + writable_monit                    : 1
root@pacemaker-3:~# cat /proc/net/ipt_CLUSTERIP/172.30.212.100
1,2,3
root@pacemaker-4:~# while [ 1 ]; do mysql -h 172.30.212.100 -u repl_user -pWhatAPassword -BN -e "select variable_value from information_schema.global_variables where variable_name like 'hostname';"; sleep 1; done
pacemaker-3
pacemaker-3
pacemaker-3
pacemaker-3
pacemaker-3
pacemaker-3

Now, if pacemaker-1 and pacemaker-2 are back in sync, we have the desired distribution:

root@pacemaker-1:~# mysql -e 'set global wsrep_desync=0;'
root@pacemaker-1:~#
root@pacemaker-2:~# mysql -e 'set global wsrep_desync=0;'
root@pacemaker-2:~#
root@pacemaker-3:~# crm_mon -A1
============
Last updated: Tue Jan  7 10:58:40 2014
Last change: Tue Jan  7 10:50:38 2014 via cibadmin on pacemaker-1
Stack: openais
Current DC: pacemaker-2 - partition with quorum
Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
3 Nodes configured, 3 expected votes
6 Resources configured.
============
Online: [ pacemaker-1 pacemaker-2 pacemaker-3 ]
 Clone Set: cl_cluster_vip [p_cluster_vip] (unique)
     p_cluster_vip:0    (ocf::percona:IPaddr3): Started pacemaker-3
     p_cluster_vip:1    (ocf::percona:IPaddr3): Started pacemaker-1
     p_cluster_vip:2    (ocf::percona:IPaddr3): Started pacemaker-2
 Clone Set: cl_mysql_monitor [p_mysql_monit]
     Started: [ pacemaker-1 pacemaker-2 pacemaker-3 ]
Node Attributes:
* Node pacemaker-1:
    + p_cluster_vip_clone_count         : 1
    + readable_monit                    : 1
    + writable_monit                    : 1
* Node pacemaker-2:
    + p_cluster_vip_clone_count         : 1
    + readable_monit                    : 1
    + writable_monit                    : 1
* Node pacemaker-3:
    + p_cluster_vip_clone_count         : 1
    + readable_monit                    : 1
    + writable_monit                    : 1

All the clone instances redistributed on all nodes as we wanted.

As a conclusion, Pacemaker with a clone set of IPaddr3 is a very interesting kind of load balancer, especially if you already have pacemaker deployed. It introduces almost no latency, it doesn’t need any other hardware, doesn’t increase the database validation load and is as highly-available as your database is. The only drawback I can see is in a case where the inbound traffic is very important. In that case, all nodes are receiving all the traffic and are equally saturated. With databases and web type traffics, the inbound traffic is usually small. This solution also doesn’t redistribute the connections based on the server load like a load balancer can do but that would be fairly easy to implement with something like a server_load attribute and an agent similar to mysql_monitor but that will check the server load instead of the database status. In such a case, I suggest using much more than 1 VIP clone instance per node to have a better granularity in load distribution. Finally, the ClusterIP target, although still fully supported, has been deprecated in favor of the Cluster-match target. It is basically the same principle and I plan to adapt the IPaddr3 agent to Cluster-match in a near future.

The post The use of Iptables ClusterIP target as a load balancer for PXC, PRM, MHA and NDB appeared first on MySQL Performance Blog.

↧

ClusterControl for Percona XtraDB Cluster Improves Management and Monitoring

February 4, 2014, 6:03 am

≫ Next: Doing a rolling upgrade of Percona XtraDB Cluster from 5.5 to 5.6

≪ Previous: The use of Iptables ClusterIP target as a load balancer for PXC, PRM, MHA and NDB

ClusterControl for Percona XtraDB Cluster is now available in three different versions thanks to our partnership with Severalnines. ClusterControl will make it simpler to manage and monitor Percona XtraDB Cluster, MySQL Cluster, MySQL Replication, or MySQL Galera.

I am very excited about our GA release of Percona XtraDB Cluster 5.6 last week. As Vadim described in his blog post announcing the release, we have brought together the benefits of Percona Server 5.6, Percona XtraBackup and Galera 3 to create a drop-in compatible, open source, state-of-the-art high availability MySQL clustering solution. We could not have done it without the close cooperation and help of our partners at Codership. To learn more, join us on February 5th for a webinar entitled “What’s New in Percona XtraDB Cluster 5.6” presented by Jay Janssen.

While these new features in Percona XtraDB Cluster are highly valuable, we regularly receive feedback from users that they would like more tools to make it easier to manage and monitor their MySQL clusters. In response to that feedback, I’m pleased to announce our partnership with Severalnines to make ClusterControl more available to Percona XtraDB Cluster users and, in fact, all of our Support customers who use MySQL clustering solutions.

Percona XtraDB Cluster users now have three choices:

Download the Community Edition of ClusterControl for free
Purchase the Enterprise Edition of ClusterControl from Severalnines
Become a Percona Support customer for Percona XtraDB Cluster, MySQL Cluster, or MySQL Galera and receive Percona ClusterControl by Severalnines as part of your subscription

On the last option, Percona ClusterControl is a privately branded version of ClusterControl supplied to us by Severalnines. With a Percona Support contract covering Percona XtraDB Cluster, MySQL Cluster, or MySQL Galera, we provide support for your deployment and use of Percona ClusterControl. Percona has worked directly with Severalnines to ensure full compatibility of ClusterControl Community Edition with Percona XtraDB Cluster. The Percona Support team will provide assistance with the configuration and administration of Percona ClusterControl. We will help you get setup and then we will help you make the most of your new management and monitoring tools for the high availability solution of your choice.

We will also provide support for Severalnines ClusterControl Enterprise Edition when it is purchased from Severalnines or Percona and used with a cluster deployment covered by a Percona Support subscription that includes coverage of Percona XtraDB Cluster, MySQL Galera, or MySQL Cluster. ClusterControl Enterprise Edition is the de-facto management solution for Galera-based clusters and it provides the highest levels of monitoring and management capabilities for Percona XtraDB Cluster users.

Percona ClusterControl will provide our Support customers broad flexibility to deploy, monitor, and manage Percona XtraDB Clusters, supporting on-premise and Amazon Web Services deployments, as well as deployments spread across multiple cloud vendors. Percona ClusterControl can be used to automate the deployment of Percona XtraDB Cluster in cross-region, multi-datacenter setups. Percona ClusterControl offers a full range of management functionality for clusters, including:

Node and cluster recovery
Configuration management
Performance health checks
Online backups
Upgrades
Scale out
Cloning of clusters
Deploying HAProxy load balancers

Percona ClusterControl can be used with multiple clustering solutions including MySQL Cluster, MySQL Replication, MySQL Galera, MariaDB Galera Cluster, and MongoDB. Percona Support subscribers that elect coverage for these clustering solutions receive Percona ClusterControl with their subscriptions.

Just as we made Percona XtraDB Cluster a much better solution by partnering with Codership, we now expect it will be a much easier to manage and monitor solution thanks to our partnership with Severalnines. Whether you need Percona Support or not, I hope you will take advantage of these powerful management and monitoring solutions which are now available for Percona XtraDB Cluster. To learn more about using ClusterControl to manage Percona XtraDB Cluster, please join us on February 19th for a webinar entitled “Performance Monitoring and Troubleshooting of Percona XtraDB Cluster” presented by Peter Boros.

The post ClusterControl for Percona XtraDB Cluster Improves Management and Monitoring appeared first on MySQL Performance Blog.

↧

Doing a rolling upgrade of Percona XtraDB Cluster from 5.5 to 5.6

May 8, 2014, 3:00 am

≫ Next: MySQL High Availability With Percona XtraDB Cluster (Percona MySQL Training)

≪ Previous: ClusterControl for Percona XtraDB Cluster Improves Management and Monitoring

Overview

Percona XtraDB Cluster 5.6 has been GA for several months now and people are thinking more and more about moving from 5.5 to 5.6. Most people don’t want to upgrade all at once, but would prefer a rolling upgrade to avoid downtime and ensure 5.6 is behaving in a stable fashion before putting all of production on it. The official guide to a rolling upgrade can be found in the PXC 5.6 manual. This blog post will attempt to summarize the basic process.

However, there are a few caveats to trying to do a rolling 5.6 upgrade from 5.5:

If you mix Galera 2 and Galera 3 nodes, you must set wsrep_provider_options=”socket.checksum=1″ on the Galera 3 nodes for backwards compatibility between Galera versions.
You must set some 5.6 settings for 5.5 compatibility with respect to replication:
1. You can’t enable GTID async replication on the 5.6 nodes
2. You should use –log-bin-use-v1-row-events on the 5.6 nodes
3. You should set binlog_checksum=NONE on the 5.6 nodes
You must not SST a 5.5 donor to a 5.6 joiner as the SST script does not handle mysql_upgrade
You should set the 5.6 nodes read_only and not write to them!

The basic upgrade flow

The basic upgrade flow is:

For some node(s): upgrade to 5.6, do all the above stuff, put them into a read_only pool
Repeat step 1 as desired
Once your 5.6 pool is sufficiently large, cut over writes to the 5.6 pool (turn off read_only, etc.) and upgrade the rest.

This is, in essence, exactly like upgrading a 5.5 master/slave cluster to 5.6 — you upgrade the slaves first, promote a slave and upgrade the master; we just have more masters to think about.

Once your upgrade is fully to 5.6, then you can go back through and remove all the 5.5 backwards compatibility

Why can’t I write to the 5.6 nodes?

The heaviest caveat is probably the fact that in a mixed 5.5 / 5.6 cluster, you are not supposed to write to the 5.6 nodes. Why is that? Well, the reason goes back to MySQL itself. PXC/Galera uses standard RBR binlog events from MySQL for replication. Replication between major MySQL versions is only ever officially supported:

across 1 major version (i.e., 5.5 to 5.6, though multiple version hops do often work)
from a lower version master to a higher version slave (i.e., 5.5 is your master and 5.6 is your slave, but not the other way around)

This compatibility requirement (which has existed for a very long time in MySQL) works great when you have a single Master replication topology, but true multi-master (multi-writer) has obviously never been considered.

Some alternatives

Does writing to the 5.6 nodes REALLY break things?

The restriction on 5.6 masters of 5.5 slaves is probably too strict in many cases. Technically only older to newer replication is ever truly supported, but in practice you may be able to run a mixed cluster with writes to all nodes as long as you are careful. This means (at least) that any modifications to column type formats in the newer version NOT be upgraded while the old version remains active in the cluster. There might be other issues, I’m not sure, I cannot say I’ve tested every possible circumstance.

So, can I truly say I recommend this? I cannot say that officially, but you may find it works fine. As long as you acknowledge that something unforeseen may break your cluster and your migration plan, it may be reasonable. If you decide to explore this option, please test this thoroughly and be willing to accept the consequences of it not working before trying it in production!

Using Async replication to upgrade

Another alternative is rather than trying to mix the clusters and keeping 5.6 nodes read_only, why not just setup the 5.6 cluster as an async slave of your 5.5 cluster and migrate your application to the new cluster when you are ready? This is practically the same as maintaining a split 5.5/5.6 read_write/read_only cluster without so much risk and a smaller list of don’ts. Cutover in this case would be effectively like promoting a 5.6 slave to master, except you would promote the 5.6 cluster.

One caveat with this approach might be dealing with replication throughput: async may not be able to keep up replicating your 5.5 cluster writes to a separate 5.6 cluster. Definitely check out wsrep_preordered to speed things up, it may help. But realize some busy workloads just may not ever be able to use async into another cluster.

Just take the outage

A final alternative for this post is the idea of simply upgrading the entire cluster to 5.6 all at once during a maintenance window. I grant that this defeats the point of a rolling upgrade, but it may offer a lot of simplicity in the longer run.

Conclusion

A rolling PXC / Galera upgrade across major MySQL versions is limited by the fact that there is no official support or reason for Oracle to support newer master to older slave. In practice, it may work much of the time, but these situations should be considered carefully and the risk weighed against all other options.

The post Doing a rolling upgrade of Percona XtraDB Cluster from 5.5 to 5.6 appeared first on MySQL Performance Blog.

↧

MySQL High Availability With Percona XtraDB Cluster (Percona MySQL Training)

May 21, 2014, 3:00 am

≫ Next: Semi-Sync replication performance in MySQL 5.7.4 DMR

≪ Previous: Doing a rolling upgrade of Percona XtraDB Cluster from 5.5 to 5.6

I’ve had the opportunity to train lots of people on Percona XtraDB Cluster (PXC) over the last few years the product has existed. This has taken the form of phone calls, emails, blog posts, webinars, and consulting engagements. This doesn’t count all the time I’ve spent grilling Codership on how things actually work. But it has culminated in the PXC tutorial I have given annually at Percona Live MySQL Conference and Expo in Santa Clara, California for the last two years. Baron even attended this year and had this say:

“Jay Janssen’s tutorial on Percona XtraDB Cluster was impressive. I can’t imagine how much time he must have spent preparing for that.”

(Baron: Thanks for this compliment, BTW) The answer is: I haven’t kept track, but I’ve delivered it 3 times at conferences and it gets better each time. I started working on it during the summer of 2012 for Percona Live NYC that year and have put in countless hours since making it better. This year, despite my voice problems, I was really happy with how it came out.

We’ve had lots of requests for repeats of this tutorial and with more detail. As such, I’ll be doing a 2-day training about MySQL HA with Percona XtraDB Cluster in sunny San Jose, California on July 16th -17th 2014. This Percona MySQL Training session is appropriately titled, “MySQL High Availability With Percona XtraDB Cluster” and you can get more information and also register here.

This course will include an overview of how Percona XtraDB Cluster fits into the MySQL High Availability world, followed by a deep-dive into PXC. All the content of the tutorial will be covered, but it will be spread out with more details on all the components and important bits of using PXC and Galera.

If you’re unfamiliar with my tutorial — it is intensively hands-on. This will not be 2 days of lecture: You’ll be setting up and experimenting with a real cluster on the fly during the class. Expect plenty of command-line work and to leave with a much better sense of how to truly operate and use a Percona XtraDB Cluster.

Once again, you can check out the details on our website and register NOW to attend right here.

The post MySQL High Availability With Percona XtraDB Cluster (Percona MySQL Training) appeared first on MySQL Performance Blog.

↧

Semi-Sync replication performance in MySQL 5.7.4 DMR

May 27, 2014, 6:24 am

≫ Next: Galera data on Percona Cloud Tools (and other MySQL monitoring tools)

≪ Previous: MySQL High Availability With Percona XtraDB Cluster (Percona MySQL Training)

I was interested to hear about semi-sync replication improvements in MySQL’s 5.7.4 DMR release and decided to check it out. I previously blogged about poor semi-sync performance and was pretty disappointed from semi-sync’s performance across WAN distances back then, particularly with many client threads.

The Test

The basic environment of these tests was:

AWS EC2 m3.medium instances
Master in us-east-1, slave in us-west-1 (~78ms ping RTT)
CentOS 6.5
innodb_flush_log_at_trx_commit=1
sync_binlog=1
Semi-sync replication plugin installed and enabled.
GTID’s enabled (except on 5.5)
sysbench 0.5 update_index.lua test, 60 seconds, 250k table size.
MySQL 5.7 was tested with both AFTER_SYNC and AFTER_COMMIT settings for rpl_semi_sync_master_wait_point
I tested Percona XtraDB Cluster 5.6 / Galera 3.5 as well by means of comparison

Without further ado, here’s the TpmC results I got for a single client thread:

These graphs are interactive, so mouse-over for more details. I’m using log scales to better highlight the differences.

The blue bars represent transactions per second (more is better). The red bars represent average latency per transaction per client (less is better). Remember these transactions are synchronously being copied across the US before the client can execute another.

The first test is our control: Async allows ~273 TPS on a single thread. Once we introduce synchronicity, we clearly see the bulk of the time is that round trip. Note that MySQL 5.5 and 5.6 are a bit higher than MySQL 5.7 and Percona XtraDB Cluster, the latter of which show pretty similar results.

Adding parallelism

This gets more interesting to see if we redo the same tests, but with 32 test threads:

In the MySQL 5.5 and 5.6 tests, we can clearly see nasty serialization. Both really don’t allow more performance than single threaded sysbench. I was happy to see, however, that this seems to be dramatically improved in MySQL 5.7, nice job Oracle!

AFTER_SYNC and AFTER_COMMIT vary, but AFTER_SYNC is the default and is preferred over AFTER_COMMIT. The reasoning here is AFTER_SYNC forces the semi-sync wait BEFORE the transaction is committed on the master. The client still must wait for the semi-sync in AFTER_COMMIT, but other transactions may see its data on the master BEFORE we confirm the semi-sync slave has received it. This is potentially bad because if the master crashed at that instant, clients may have read data from the master that did not make it to a failover slave. This is a type of ‘phantom read’ and Yoshinori explains it in more detail here.

What about Percona XtraDB Cluster?

I also want to discuss the Percona XtraDB Cluster results, Galera here is somewhat slower than MySQL 5.7 semi-sync. There may be some enhancements to Galera that can be made (competition is a good thing), but there are still some significant differences here:

Galera allows for writing on any and all nodes, semi-sync does not
Galera introduces the certification process to check for conflicts, Semi-sync does not
Galera is not 2-phase commit and transactions are not committed synchronously anywhere except the node originating the transaction. So, it is similar to Semi-sync in this way.
I ran the Galera tests with no log-bin (Galera does not require it)
I ran the Galera tests with innodb_flush_log_at_trx_commit=1
I set the fc_limit on the second node really high to eliminate Flow control as a bottleneck. In a live cluster, it would typically be needed.
Galera provides parallel slave threads for faster apply, but it doesn’t matter here because I set the fc_limit so high

TL;DR

Semi-sync in MySQL 5.7 looks like a great improvement. Any form of synchronicity is always going to be expensive, particularly over 10s and 100s of milliseconds of latency. With MySQL 5.7, I’d be much more apt to recommend semi-sync as an option than in previous releases. Thanks to Oracle for investing here.

The post Semi-Sync replication performance in MySQL 5.7.4 DMR appeared first on MySQL Performance Blog.

↧

Galera data on Percona Cloud Tools (and other MySQL monitoring tools)

August 29, 2014, 2:58 pm

≫ Next: Q&A: Percona XtraDB Cluster as a MySQL HA solution for OpenStack

≪ Previous: Semi-Sync replication performance in MySQL 5.7.4 DMR

I was talking with a Percona Support customer earlier this week who was looking for Galera data on Percona Cloud Tools. (Percona Cloud Tools, now in free beta, is a hosted service providing access to query performance insights for all MySQL uses.)

The customer mentioned they were already keeping track of some Galera stats on Cacti, and given they were inclined to use Percona Cloud Tools more and more, they wanted to know if it was already supporting Percona XtraDB Cluster. My answer was: “No, not yet: you can install agents in each node (the regular way in the first node, then manually on the other nodes… and when prompted say “No” to create MySQL user and provide the one you’re using already) and monitor them as autonomous MySQL servers – but the concept of cluster and specially the “Galera bits” has yet to be implemented there.”

Except I was wrong.

By “concept of cluster” I mean supporting the notion of group instances, which should allow a single cluster-wide view for metrics and query reports, such as the slow queries (which are recorded locally on the node where the query was run and thus observed in a detached way). This still needs to be implemented indeed, but it’s on the roadmap.

The “Galera bits” I mentioned above are the various “wsrep_” status variables. In fact, those refer to the Write Set REPlication patches (based in the wsrep API), a set of hooks applied to the InnoDB/XtraDB storage engine and other components of MySQL that modifies the way replication works (to put it in a very simplified way), which in turn are used by the Galera library to provide a “generic Synchronous Multi-Master replication plugin for transactional applications.” A patched version of Percona Server together with the Galera libray compose the base of Percona XtraDB Cluster.

As I found out only now, Percona Cloud Tools does collect data from the various wsrep_ variables and it is available for use – it’s just not shown by default. A user only needs to add a dashboard/chart manually on PCT to see these metrics:

Click on the picture to enlarge it

Now, I need to call that customer …

Monitoring the cluster

Since I’m touching this topic I thought it would be useful to include some additional information on monitoring a Galera (Percona XtraDB Cluster in particular) cluster, even though most of what I mention below has already been published in different posts here on the MySQL Performance Blog. There’s a succint documentation page bearing the same title of this section that indicates the main wsrep_ variables you should monitor to check the health status of the cluster and how well it’s coping with load along the time (performance). Remember you can get a grasp of the full set of variables at any time by issuing the following command from one (or each one) of the nodes:

mysql> SHOW GLOBAL STATUS LIKE "wsrep_%";

And for a broader and real time view of the wsrep_ status variables you can use Jay Janssen’s myq_gadgets toolkit, which he modified a couple of years ago to account for Galera.

There’s also a specific Galera-template available in our Percona Monitoring Plugins (PMP) package that you can use in your Cacti server. That would cover the “how well the cluster copes with load along the time,” providing historical graphing. And while there isn’t a Galera specific plugin for Nagios in PMP, Jay explains in another post how you can customize pmp-check-mysql-status to “check any status variable you like,” describing his approach to keep a cluster’s “health status” in check by setting alerts on specific stats, on a per node basis.

VividCortex recently added a set of templates for Galera in their product and you can also rely on Severalnines’ ClusterControl monitoring features to get that “global view” of your cluster that Percona Cloud Tools doesn’t yet provide. Even though ClusterControl is a complete solution for cluster deployment and management, focusing on the automation of the whole process, the monitoring part alone is particularly interesting as it encompasses cluster-wide information in a customized way, including the “Galera bits”. You may want to give it a try as the monitoring features are available in the Community version of the product (and if you’re a Percona customer with a Support contract covering Percona XtraDB Cluster, then you’re entitled to get support for it from us).

One thing I should note that differentiate the monitoring solutions from above is that while you can install Cacti, Nagios and ClusterControl as servers in your own infrastructure both Percona Cloud Tools and VividCortex are hosted, cloud-based services. Having said that, neither one nor the other upload sensitive data to the cloud and both provide options for query obfuscation.

Summary

Contrary to what I believed, Percona Cloud Tools does provide support for “Galera bits” (the wsrep_ status variables), even though it has yet to implement support for the notion of group instances, which will allow for cluster-wide view for metrics and query reports. You can also rely on the Galera template for Cacti provided by Percona Monitoring Plugins for historical graphing and some clever use of Nagios’ pmp-check-mysql-status for customized cluster alerts. VividCortex and ClusterControl also provide monitoring for Galera.

Percona Cloud Tools, now in free beta, is a hosted service providing access to query performance insights for all MySQL uses. After a brief setup, unlock new information about your database and how to improve your applications. Sign up to request access to the beta today.

The post Galera data on Percona Cloud Tools (and other MySQL monitoring tools) appeared first on MySQL Performance Blog.

↧

Q&A: Percona XtraDB Cluster as a MySQL HA solution for OpenStack

November 14, 2014, 6:00 am

≫ Next: MaxScale: A new tool to solve your MySQL scalability problems

≪ Previous: Galera data on Percona Cloud Tools (and other MySQL monitoring tools)

Thanks to all who attended my Nov. 12 webinar titled, “Percona XtraDB Cluster as a MySQL HA Solution for OpenStack.” I had several questions which I covered at the end and a few that I didn’t. You can view the entire webinar and download the slides here.

Q: Is the read,write speed reduced in Galera compared to normal MySQL?

For reads, it’s the same (unless you use the sync_wait feature, used to be called causal reads).

For writes, the cost of replication (~1 RTT to the worst node), plus the cost of certification will be added to each transaction commit. This will have a non-trivial impact on your write performance, for sure. However, I believe most OpenStack meta store use cases should not suffer overly by this performance penalty.

Q: Does state transfers affect a continuing transaction within the nodes?

Mostly, no. The joining node will queue ongoing cluster replication while receiving its state transfer and use that to do its final ‘catchup’. The node donating the state transfer may get locked briefly during a full SST, which could temporarily block writes on that node. If you use the built-in clustercheck (or check the same things it checks), you can avoid this by diverting traffic away from a donor.

Q: Perhaps not the correct webinar for this question, but I was also expecting to hear something about using PXC in combination with OpenStack Trove. If you’ve got time, could you tell something about that?

Trove has not supported the concept of more than a single SQL target. My understanding is a recent improvement here for MongoDB setups may pave the way for more robust Trove instances backed by Percona XtraDB Cluster.

Q: For Loadbalancing using the Java Mysql driver, would you suggest HA proxy or the loadbalancing connection in the java driver. Also how does things work in case of persistent connections and connection pools like dbcp?

Each node in a PXC cluster is just a mysql server that can handle normal mysql connections. Obviously if a node fails, it’s easy to detect that you should probably not keep using that node, but in Percona XtraDB Cluster you need to watch out for things like cluster partitions where nodes lose quorum and stop taking queries (but still allow connections) I believe it’s possible to configure advanced connection pooling setups to properly health check the nodes (unless the necessary features are not in the pool implementation), but I don’t have a reference architecture to point you to.

Q: Are there any manual solutions to avoid deadlocks within a high write context to force a query to execute on all nodes?

Yes, write to a single node, at least for the dataset that has high write volume.

Remember what I said in the webinar: you can only update a given row once per RTT, so there’s an upper cap of throughput on the smallest locking granularity in InnoDB (i.e., a single row).

This manifests itself in two possible ways:

1) In a multi-node writing cluster by triggering deadlocks on conflicts. Only approximately 1 transaction modifying a given row per RTT would NOT receive a deadlock

2) In a single-node writing cluster by experiencing longer lock waits. Transaction times (and subsequently lock times) are extended by the replication and certification time, so other competing transactions will lock wait until the blocking transaction commit. There are no replication conflict deadlocks in this case, but the net effect is exactly the same: only 1 transaction per row per RTT.

Galera offers us high data redundancy at the cost of some throughput. If that doesn’t work for your workload, then asynchronous replication (or perhaps semi-sync in 5.7) will work better for you.

Note that there is wsrep_retry_autocommit. However, this only works for autocommited transactions. If your write volume was so high that you need to increase this a lot to get the conflict rate down, you are likely sacrificing a lot of CPU power (rollbacks are expensive in Innodb!) if a single transaction needs multiple retries to commit. This still doesn’t get around the law: 1 trx per row per RTT at best.

That was all of the questions. Be sure to check out our next OpenStack webinar on December 10 by my colleague Peter Boros. It’s titled “MySQL and OpenStack Deep Dive” and you can register here (and like all of our webinars, it’s free). I also invite you to visit our OpenStack Live 2015 page and learn more about that conference in Santa Clara, California this coming April 13-14. The Call for Papers ends Nov. 16 and speakers get a full-access conference pass. I hope to see you there!

The post Q&A: Percona XtraDB Cluster as a MySQL HA solution for OpenStack appeared first on MySQL Performance Blog.

↧

MaxScale: A new tool to solve your MySQL scalability problems

June 8, 2015, 6:00 am

≫ Next: Playing with Percona XtraDB Cluster in Docker

≪ Previous: Q&A: Percona XtraDB Cluster as a MySQL HA solution for OpenStack

Ever since MySQL replication has existed, people have dreamed of a good solution to automatically split read from write operations, sending the writes to the MySQL master and load balancing the reads over a set of MySQL slaves. While if at first it seems easy to solve, the reality is far more complex.

First, the tool needs to make sure it parses and analyses correctly all the forms of SQL MySQL supports in order to sort writes from reads, something that is not as easy as it seems. Second, it needs to take into account if a session is in a transaction or not.

While in a transaction, the default transaction isolation level in InnoDB, Repeatable-read, and the MVCC framework insure that you’ll get a consistent view for the duration of the transaction. That means all statements executed inside a transaction must run on the master but, when the transaction commits or rollbacks, the following select statements on the session can be again load balanced to the slaves, if the session is in autocommit mode of course.

Then, what do you do with sessions that set variables? Do you restrict those sessions to the master or you replay them to the slave? If you replay the set variable commands, you need to associate the client connection to a set of MySQL backend connections, made of at least a master and a slave. What about temporary objects like with “create temporary table…”? How do you deal when a slave lags behind or what if worse, replication is broken? Those are just a few of the challenges you face when you want to build a tool to perform read/write splitting.

Over the last few years, a few products have tried to tackle the read/write split challenge. The MySQL_proxy was the first attempt I am aware of at solving this problem but it ended up with many limitations. ScaleARC does a much better job and is very usable but it stills has some limitations. The latest contender is MaxScale from MariaDB and this post is a road story of my first implementation of MaxScale for a customer.

Let me first introduce what is MaxScale exactly. MaxScale is an open source project, developed by MariaDB, that aims to be a modular proxy for MySQL. Most of the functionality in MaxScale is implemented as modules, which includes for example, modules for the MySQL protocol, client side and server side.

Other families of available modules are routers, monitors and filters. Routers are used to determine where to send a query, Read/Write splitting is accomplished by the readwritesplit router. The readwritesplit router uses an embedded MySQL server to parse the queries… quite clever and hard to beat in term of query parsing.

There are other routers available, the readconnrouter is basically a round-robin load balancer with optional weights, the schemarouter is a way to shard your data by schema and the binlog router is useful to manage a large number of slaves (have a look at Booking.com’s Jean-François Gagné’s talk at PLMCE15 to see how it can be used).

Monitors are modules that maintain information about the backend MySQL servers. There are monitors for a replicating setup, for Galera and for NDB cluster. Finally, the filters are modules that can be inserted in the software stack to manipulate the queries and the resultsets. All those modules have well defined APIs and thus, writing a custom module is rather easy, even for a non-developer like me, basic C skills are needed though. All event handling in MaxScale uses epoll and it supports multiple threads.

Over the last few months I worked with a customer having a challenging problem. On a PXC cluster, they have more than 30k queries/s and because of their write pattern and to avoid certification issues, they want to have the possibility to write to a single node and to load balance the reads. The application is not able to do the Read/Write splitting so, without a tool to do the splitting, only one node can be used for all the traffic. Of course, to make things easy, they use a lot of Java code that set tons of sessions variables. Furthermore, for ISO 27001 compliance, they want to be able to log all the queries for security analysis (and also for performance analysis, why not?). So, high query rate, Read/Write splitting and full query logging, like I said a challenging problem.

We experimented with a few solutions. One was a hardware load balancer that failed miserably – the implementation was just too simple, using only regular expressions. Another solution we tried was ScaleArc but it needed many rules to whitelist the set session variables and to repeat them to multiple servers. ScaleArc could have done the job but all the rules increases the CPU load and the cost is per CPU. The queries could have been sent to rsyslog and aggregated for analysis.

Finally, the HA implementation is rather minimalist and we had some issues with it. Then, we tried MaxScale. At the time, it was not GA and was (is still) young. Nevertheless, I wrote a query logging filter module to send all the queries to a Kafka cluster and we gave it a try. Kafka is extremely well suited to record a large flow of queries like that. In fact, at 30k qps, the 3 Kafka nodes are barely moving with cpu under 5% of one core. Although we encountered some issues, remember MaxScale is very young, it appeared to be the solution with the best potential and so we moved forward.

The folks at MariaDB behind MaxScale have been very responsive to the problems we encountered and we finally got to a very usable point and the test in the pilot environment was successful. The solution is now been deployed in the staging environment and if all goes well, it will be in production soon. The following figure is simplified view of the internals of MaxScale as configured for the customer:

The blocks in the figure are nearly all defined in the configuration file. We define a TCP listener using the MySQL protocol (client side) which is linked with a router, either the readwritesplit router or the readconn router.

The first step when routing a query is to assign the backends. This is where the read/write splitting decision is made. Also, as part of the steps required to route a query, 2 filters are called, regexp (optional) and Genlog. The regexp filter may be used to hot patch a query and the Genlog filter is the logging filter I wrote for them. The Genlog filter will send a json string containing about what can be found in the MySQL general query log plus the execution time.

Authentication attempts are also logged but the process is not illustrated in the figure. A key point to note, the authentication information is cached by MaxScale and is refreshed upon authentication failure, the refresh process is throttled to avoid overloading the backend servers. The servers are continuously monitored, the interval is adjustable, and the server status are used when the decision to assign a backend for a query is done.

In term of HA, I wrote a simple Pacemaker resource agent for MaxScale that does a few fancy things like load balancing with IPTables (I’ll talk about that in future post). With Pacemaker, we have a full fledge HA solution with quorum and fencing on which we can rely.

Performance wise, it is very good – a single core in a virtual environment was able to read/write split and log to Kafka about 10k queries per second. Although MaxScale supports multiple threads, we are still using a single thread per process, simply because it yields a slightly higher throughput and the custom Pacemaker agent deals with the use of a clone set of MaxScale instances. Remember we started early using MaxScale and the beta versions were not dealing gracefully with threads so we built around multiple single threaded instances.

So, since a conclusion is needed, MaxScale has proven to be a very useful and flexible tool that allows to elaborate solutions to problems that were very hard to tackle before. In particular, if you need to perform read/write splitting, then, try MaxScale, it is best solution for that purpose I have found so far. Keep in touch, I’ll surely write other posts about MaxScale in the near future.

The post MaxScale: A new tool to solve your MySQL scalability problems appeared first on MySQL Performance Blog.

↧

Playing with Percona XtraDB Cluster in Docker

June 30, 2015, 12:00 am

≫ Next: Better high availability: MySQL and Percona XtraDB Cluster with good application design

≪ Previous: MaxScale: A new tool to solve your MySQL scalability problems

Like any good, thus lazy, engineer I don’t like to start things manually. Creating directories, configuration files, specify paths, ports via command line is too boring. I wrote already how I survive in case when I need to start MySQL server (here). There is also the MySQL Sandbox which can be used for the same purpose.

But what to do if you want to start Percona XtraDB Cluster this way? Fortunately we, at Percona, have engineers who created automation solution for starting PXC. This solution uses Docker. To explore it you need:

Clone the pxc-docker repository:
git clone https://github.com/percona/pxc-docker
Install Docker Compose as described here
cd pxc-docker/docker-bld
Follow instructions from the README file:
a) ./docker-gen.sh 5.6 (docker-gen.sh takes a PXC branch as argument, 5.6 is default, and it looks for it on github.com/percona/percona-xtradb-cluster)
b) Optional: docker-compose build (if you see it is not updating with changes).
c) docker-compose scale bootstrap=1 members=2 for a 3 node cluster

Check which ports assigned to containers:

$docker port dockerbld_bootstrap_1 3306
0.0.0.0:32768
$docker port dockerbld_members_1 4567
0.0.0.0:32772
$docker port dockerbld_members_2 4568
0.0.0.0:32776

Now you can connect to MySQL clients as usual:

$mysql -h 0.0.0.0 -P 32768 -uroot
Welcome to the MySQL monitor.  Commands end with ; or g.
Your MySQL connection id is 10
Server version: 5.6.21-70.1 MySQL Community Server (GPL), wsrep_25.8.rXXXX
Copyright (c) 2009-2015 Percona LLC and/or its affiliates
Copyright (c) 2000, 2015, Oracle and/or its affiliates. All rights reserved.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
Type 'help;' or 'h' for help. Type 'c' to clear the current input statement.
mysql>

To change MySQL options either pass it as a mount at runtime with something like volume: /tmp/my.cnf:/etc/my.cnf in docker-compose.yml or connect to container’s bash (docker exec -i -t container_name /bin/bash), then change my.cnf and run docker restart container_name

Notes.

If you don’t want to build use ready-to-use images
If you don’t want to run Docker Compose as root user add yourself to docker group

The post Playing with Percona XtraDB Cluster in Docker appeared first on MySQL Performance Blog.

↧

Better high availability: MySQL and Percona XtraDB Cluster with good application design

December 21, 2015, 1:19 pm

≫ Next: Galera Error Failed to Report Last Committed (Interrupted System Call)

≪ Previous: Playing with Percona XtraDB Cluster in Docker

High Availability

Have you ever wondered if your application should be able to work in read-only mode? How important is that question?

MySQL seems to be the most popular database solution for web-based products. Most typical Internet application workloads consist of many reads, with usually few writes. There are exceptions of course – MMO games for instance – but often the number of reads is much bigger then writes. So when your database infrastructure looses its ability to accept writes, either because traditional MySQL replication topology lost its master or Galera cluster lost its quorum, why would you want to the application to declare total downtime? During this scenario, imagine all the users who are just browsing the application (not contributing content): they don’t care if the database cannot accept new data. People also prefer to have access to an application, even if it’s functionality is notably reduced, rather then see the 500 error pages. In some disaster scenarios it is a seriously time-consuming task to perform PITR or recover some valuable data: it is better to at least have the possibility of user read access to a recent backup.

My advice: design your application with the possible read-only partial outage in mind, and test how the application works in that mode during it’s development life cycle. I think it will pay off greatly and increase the perception of availability of your product. As an example, check out some of the big open source projects’ implementation of this concept, like MediaWiki or Drupal (and also some commercial products).

PXC

Having said that, I want to highlight a pretty new (and IMHO) important improvement in this regard, introduced in the Galera replication since PXC version 5.6.24. It was already mentioned by my colleague Stéphane in his blog post earlier this year.

Focus on data consistency

As you probably know, one of Galera’s key advantages is their great data consistency care and data-centric approach. No matter where you write in the cluster, all nodes must have the same data. This is important when you realize what happens when a data inconsistency is detected between the nodes. Inconsistent nodes, which cannot apply a writeset due to missing rows or duplicate unique key values for instance, will have to abort and perform an emergency shutdown. This happens in order to remove contaminated members from the cluster, and not spread the data “illness” further. If it does happen that the majority of nodes perform an emergency abort, the remaining minority may loose the cluster quorum and will stop serving further client’s requests. So the price for data consistency protection is availability.

Anti-split-brain

Sometimes a node or node cluster members loose connectivity to others, in a way that >50% of nodes can no longer communicate. Connectivity is lost all of a sudden, without a proper “goodbye” message from the “dissappeared” nodes. These nodes don’t know what the reason was for the lost connection – were the peers killed? or may be networks were split? In that situation, nodes declare a non-Primary cluster state and go into SQL-disabled mode. This is because a member of a cluster without a quorum (majority), hence not acting as Primary Component, is not trusted as it may have inconsistent or old data. Because of this state, it won’t allow the clients to access it.

This is for two reasons. First and unquestionably, we don’t want to allow writes when there is a risk of network split, where the other part of the cluster still forms the Primary Component and keeps operating. We may also want to disallow reads of a stall data, however, when there is a possibility that the other part of the infrastructure already has a lot of newer information.

In the standard MySQL replication process there are no such precautions – if replication is broken in master-master topology both masters can still accept writes, and they can also read anything from the slaves regardless of how much they may be lagging or if they are connected to their masters at all. In Galera though, even if too much lag in the applying queue is detected (similar to replication lag concept), the cluster will pause writes using a Flow Control mechanism. If replication is broken as described above, it will even stop the reads.

Dirty reads from Galera cluster

This behavior may seem too strict, especially if you just migrated from MySQL replication to PXC, and you just accept that database “slave” nodes can serve the read traffic even if they are separated from the “master.” Or if your application does not rely on writes but mostly on access to existing content. In that case, you can either enable the new wsrep_dirty_reads variable dynamically (per session only if needed), or setup your cluster to run this option by default by placing wsrep_dirty_reads = ON in the my.cnf (global values are acceptable in the config file is available since PXC 5.6.26).

The Galera topology below is something we very often see at customer sites, where WAN locations are configured to communicate via VPN:

I think this failure scenario is a perfect usage case for wsrep_dirty_reads – where none of the cluster parts are able to work at full functionality alone, but could successfully keep serving read queries to the clients.

So let’s quickly see how the cluster member behaves with the wsrep_dirty_reads option disabled and enabled (for the test I blocked network communication on port 4567):

percona3 mysql> show status like 'wsrep_cluster_status';
+----------------------+-------------+
| Variable_name        | Value       |
+----------------------+-------------+
| wsrep_cluster_status | non-Primary |
+----------------------+-------------+
1 row in set (0.00 sec)
percona3 mysql> show variables like '%dirty_reads';
+-------------------+-------+
| Variable_name     | Value |
+-------------------+-------+
| wsrep_dirty_reads | OFF   |
+-------------------+-------+
1 row in set (0.01 sec)
percona3 mysql>  select * from test.g1;
ERROR 1047 (08S01): WSREP has not yet prepared node for application use

And when enabled:

percona2 mysql> show status like 'wsrep_cluster_status';
+----------------------+-------------+
| Variable_name        | Value       |
+----------------------+-------------+
| wsrep_cluster_status | non-Primary |
+----------------------+-------------+
1 row in set (0.00 sec)
percona2 mysql> show variables like '%dirty_reads';
+-------------------+-------+
| Variable_name     | Value |
+-------------------+-------+
| wsrep_dirty_reads | ON    |
+-------------------+-------+
1 row in set (0.00 sec)
percona2 mysql> select * from test.g1;
+----+-------+
| id | a     |
+----+-------+
|  1 | dasda |
|  2 | dasda |
+----+-------+
2 rows in set (0.00 sec)
percona2 mysql> insert into test.g1 set a="bb";
ERROR 1047 (08S01): WSREP has not yet prepared node for application use

MySQL

In traditional replication, you are probably using the slaves for reads anyway. So if the master crashes, and for some reason a failover toolkit like MHA or PRM is not configured or also fails, in order to keep the application working you should direct new connections meant for the master to one of the slaves. If you use a loadbalancer, maybe just have the slaves as backups for the master in the write pool. This may help to achieve a better user experience during the downtime, where everyone can at least use existing information. As noted above, however, the application must be prepared to work that way.

There are caveats to this implementation, as the “read_only” mode that is usually used on slaves is not 100% read-only. This is due to the exception for “super” users. In this case, the new super_read_only variable comes to the rescue (available in Percona Server 5.6) as well as stock MySQL 5.7. With this feature, there is no risk that after pointing database connections to one of the slaves, some special users will change the data.

If a disaster is severe enough, it may be necessary to recover data from a huge SQL dump, and it’s often hard to find enough spare servers to serve the traffic with an old binary snapshot. It’s worth noting that InnoDB has a special read-only mode, meant to be used in a read-only medium, that is lightweight compared to full InnoDB mode.

Useful links

If you are looking for more information about Galera/PXC availability problems and recovery tips, these earlier blog posts may be interesting:
Percona XtraDB Cluster (PXC): How many nodes do you need?
Percona XtraDB Cluster: Quorum and Availability of the cluster
Galera replication – how to recover a PXC cluster

↧

Galera Error Failed to Report Last Committed (Interrupted System Call)

May 27, 2016, 3:00 pm

≫ Next: Galera warning “last inactive check”

≪ Previous: Better high availability: MySQL and Percona XtraDB Cluster with good application design

MySQL-Got-an-error-reading-communication-packet-errors

In this blog, we’ll discuss the ramifications of the Galera Error Failed to Report Last Committed (Interrupted System Call).

I have recently seen this error with Percona XtraDB Cluster (or Galera):

[Warning] WSREP: Failed to report last committed 549684236, -4 (Interrupted system call)

It was posted in launchpad as a bug in 2013: https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1434646

My colleague Przemek replied, and explained it as:

Reporting the last committed transaction is just a part of the certification index purge process. In case it fails for some reason (it occasionally does), the cert index purge may be a little delayed. But it does not mean the transaction was not applied successfully. This is a warning after all.

If we look up this error in the source code, we realize it is reusing Linux system errors. Specifically:

#define EINTR 4 /* Interrupted system call */

As there isn’t much documentation regarding this error, and internet searches did not bring up useful information, my colleague David Bennett and I delved into the source code (as we do on occasion).

If we look in the Galera source code gcs_sm.hpp we see:

289 * @retval -EINTR - was interrupted by another thread

We also see:

317 /* was interrupted, will be handled by someone else */

This means that the thread was interrupted, but the server will retry on another thread. As it is just a warning, it isn’t anything to be too concerned about – unless they begin to pile up (which could be a sign of concurrency issues).

The specific warning is thrown from galera_service_thd.cpp here:

58 if (gu_unlikely(ret < 0)) 59 { 60 log_warn << "Failed to report last committed " 61 << data.last_committed_ << ", " << ret 62 << " (" << strerror (-ret) << ')'; 63 // @todo: figure out what to do in this case 64 }

This warning could be handled better so as to not bloody the logs, or sound cryptic enough to concern administrators.

↧

Galera warning “last inactive check”

June 2, 2016, 11:51 am

≫ Next: Percona XtraDB Cluster on Ceph

≪ Previous: Galera Error Failed to Report Last Committed (Interrupted System Call)

In this post, we’ll discuss the Galera warning “last inactive check” and what it means.

Problem

I’ve been working with Percona XtraDB Cluster quite a bit recently, and have been investigating various warnings. I came across this one today:

[Warning] WSREP: last inactive check more than PT1.5S ago (PT1.51811S), skipping check

This warning is related to the evs.inactive_check_period option. This option controls the poll period for the group communication response time. If a node is delayed, it is added to a delay list and it can lead to the cluster evicting the node.

Possible Cause

While some troubleshooting tips seem to associate the warning with VMWare snapshots, this isn’t the case here, as we see the warning on a physical machine.

I checked for backups or desynced nodes, and this also wasn’t the case. The warning was not accompanied by any errors or other information, so there was nothing critical happening.

In the troubleshooting link above, Galera developers said:

This can be seen on bare metal as well — with poorly configured mysqld, O/S, or simply being overloaded. All it means is that this thread could not get CPU time for 7.1 seconds. You can imagine that access to resources in virtual machines is even harder (especially I/O) than on bare metal, so you will see this in virtual machines more often.

This is not a Galera specific issue (it just reports being stuck, other mysqld threads are equally stuck) so there is no configuration options for that. You simply must make sure that your system and mysqld are properly configured, that there is enough RAM (buffer pool not over provisioned), that there is swap, that there are proper I/O drivers installed on guest and so on.

Basically, Galera runs in virtual machines as well as the virtual machines approximates bare metal.

It could also be an indication of unstable network or just higher average network latency than expected by the default configuration. In addition to checking network, do check I/O, swap and memory when you do see this warning.

Our graphs and counters otherwise look healthy. If this is the case, this is most likely nothing to worry about.

It is also a good idea to ensure your nodes are desynced before backup. Look for spikes in your workload. A further option to check for is that swappiness is set to 1 on modern kernels.

If all of this looks good, ensure the servers are all talking to the same NTP server, have the same time zone and the times and dates are in sync. While this warning could be a sign of an overloaded system, if everything else looks good this warning isn’t something to worry about.

Source

The warning comes from evs_proto.cpp in the Galera code:

if (last_inactive_check_ + inactive_check_period_*3 < now) { log_warn << "last inactive check more than " << inactive_check_period_*3 << " ago (" << (now - last_inactive_check_) << "), skipping check"; last_inactive_check_ = now; return; }

Since the default for inactive_check_period is one second according to the Galera documentation, if it is now later than three seconds after the last check, it skips the rest of the above routine and adds the node to the delay list and does some other logic. The reason it does this is that it doesn’t want to rely on stale counters before making decisions. The message is really just letting you know that.

In Percona XtraDB Cluster, this setting defaults to 0.5s. This warning simply could be that your inactive_check_period is too low, and the delay is not high enough to add the node to the delay list. So you could consider increasing evs.inactive_check_period to resolve the warnings. (Apparently in Galera, it may also now be 0.5s but documentation is stale.)

Possible Solution

To find a sane value my colleague David Bennett came up with this command line, which gives you an idea of when your check warnings are happening:

$ cat mysqld.log | grep 'last inactive check more than' | perl -ne 'm/(PT(.*)S)/; print $1."n"' | sort -n | uniq -c 1 1.55228 1 1.5523 1 1.55257 1 1.55345 1 1.55363 1 1.5543 1 1.55436 1 1.55483 1 1.5552 1 1.55582
Therefore, in this case, it may be a good idea to set inactive_check_period at 1 or 1.5 to make the warnings go away.

Conclusion

Each node in the cluster keeps its own local copy of how it sees the topology of the entire cluster. check_inactive is a node event that is triggered every inactive_check_period seconds to help the node update its view of the whole cluster, and ensure it is accurate. Service messages can be broadcast to the cluster informing nodes of changes to the topology. For example, if a cluster node is going down it will broadcast a service message telling each node in the cluster to remove it. The action is queued but the actual view of the cluster is updated with check_inactive. This is why it adds nodes to its local copy of inactive, suspect and delayed nodes.

If a node thinks it might be looking at stale data, it doesn’t make these decisions and waits until the next time for a fresh queue. Unfortunately, if inactive_check_period is too low, it will keep giving you warnings.

↧

Percona XtraDB Cluster on Ceph

August 4, 2016, 3:31 pm

≫ Next: Percona XtraDB Cluster 5.7.12 RC1 is now available

≪ Previous: Galera warning “last inactive check”

This post discusses how XtraDB Cluster and Ceph are a good match, and how their combination allows for faster SST and a smaller disk footprint.

My last post was an introduction to Red Hat’s Ceph. As interesting and useful as it was, it wasn’t a practical example. Like most of the readers, I learn about and see the possibilities of technologies by burning my fingers on them. This post dives into a real and novel Ceph use case: handling of the Percona XtraDB Cluster SST operation using Ceph snapshots.

If you are familiar with Percona XtraDB Cluster, you know that a full state snapshot transfer (SST) is required to provision a new cluster node. Similarly, SST can also be triggered when a cluster node happens to have a corrupted dataset. Those SST operations consist essentially of a full copy of the dataset sent over the network. The most common SST methods are Xtrabackup and rsync. Both of these methods imply a significant impact and load on the donor while the SST operation is in progress.

For example, the whole dataset will need to be read from the storage and sent over the network, an operation that requires a lot of IO operations and CPU time. Furthermore, with the rsync SST method, the donor is under a read lock for the whole duration of the SST. Consequently, it can take no write operations. Such constraints on SST operations are often the main motivations beyond the reluctance of using Percona XtraDB cluster with large datasets.

So, what could we do to speed up SST? In this post, I will describe a method of performing SST operations when the data is not local to the nodes. You could easily modify the solution I am proposing for any non-local data source technology that supports snapshots/clones, and has an accessible management API. Off the top of my head (other than Ceph) I see AWS EBS and many SAN-based storage solutions as good fits.

The challenges of clone-based SST

If we could use snapshots and clones, what would be the logical steps for an SST? Let’s have a look at the following list:

New node starts (joiner) and unmounts its current MySQL datadir
The joiner and asks for an SST
The donor creates a consistent snapshot of its MySQL datadir with the Galera position
The donor sends to the joiner the name of the snapshot to use
The joiner creates a clone of the snapshot name provided by the donor
The joiner mounts the snapshot clone as the MySQL datadir and adjusts ownership
The joiner initializes MySQL on the mounted clone

As we can see, all these steps are fairly simple, but hide some challenges for an SST method base on cloning. The first challenge is the need to mount the snapshot clone. Mounting a block device requires root privileges – and SST scripts normally run under the MySQL user. The second challenge I encountered wasn’t expected. MySQL opens the datadir and some files in it before the SST happens. Consequently, those files are then kept opened in the underlying mount point, a situation that is far from ideal. Fortunately, there are solutions to both of these challenges as we will see below.

SST script

So, let’s start with the SST script. The script is available in my Github at:

https://github.com/y-trudeau/ceph-related-tools/raw/master/wsrep-sst/wsrep_sst_ceph

You should install the script in the /usr/bin directory, along with the other user scripts. Once installed, I recommend:

chown root.root /usr/bin/wsrep_sst_ceph
chmod 755 /usr/bin/wsrep_sst_ceph

The script has a few parameters that can be defined in the [sst] section of the my.cnf file.

cephlocalpool: The Ceph pool where this node should create the clone. It can be a different pool from the one of the original dataset. For example, it could have a replication factor of 1 (no replication) for a read scaling node. The default value is: mysqlpool
cephmountpoint: What mount point to use. It defaults to the MySQL datadir as provided to the SST script.
cephmountoptions: The options used to mount the filesystem. The default value is: rw,noatime
cephkeyring: The Ceph keyring file to authenticate against the Ceph cluster with cephx. The user under which MySQL is running must be able to read the file. The default value is: /etc/ceph/ceph.client.admin.keyring
cephcleanup: Whether or not the script should cleanup the snapshots and clones that are no longer is used. Enable = 1, Disable = 0. The default value is: 0

Root privileges

In order to allow the SST script to perform privileged operations, I added an extra SST role: “mount”. The SST script on the joiner will call itself back with sudo and will pass “mount” for the role parameter. To allow the elevation of privileges, the follow line must be added to the /etc/sudoers file:

mysql ALL=NOPASSWD: /usr/bin/wsrep_sst_ceph

Files opened by MySQL before the SST

Upon startup, MySQL opens files at two places in the code before the SST completes. The first one is in the function

mysqld_main

, which sets the current working directory to the datadir (an empty directory at that point). After the SST, a block device is mounted on the datadir. The issue is that MySQL tries to find the files in the empty mount point directory. I wrote a simple patch, presented below, and issued a pull request:

diff --git a/sql/mysqld.cc b/sql/mysqld.cc
index 90760ba..bd9fa38 100644
--- a/sql/mysqld.cc
+++ b/sql/mysqld.cc
@@ -5362,6 +5362,13 @@ a file name for --log-bin-index option", opt_binlog_index_name);
       }
     }
   }
+
+  /*
+   * Forcing a new setwd in case the SST mounted the datadir
+   */
+  if (my_setwd(mysql_real_data_home,MYF(MY_WME)) && !opt_help)
+    unireg_abort(1);        /* purecov: inspected */
+
   if (opt_bin_log)
   {
     /*

With this patch, I added a new

my_setwd

call right after the SST completed. The Percona engineering team approved the patch, and it should be added to the upcoming release of Percona XtraDB Cluster.

The Galera library is the other source of opened files before the SST. Here, the fix is just in the configuration. You must define the

base_dir

Galera provider option outside of the datadir. For example, if you use /var/lib/mysql as datadir and cephmountpoint, then you should use:

wsrep_provider_options="base_dir=/var/lib/galera"

Of course, if you have other provider options, don’t forget to add them there.

Walkthrough

So, what are the steps required to use Ceph with Percona XtraDB Cluster? (I assume that you have a working Ceph cluster.)

1. Join the Ceph cluster

The first thing you need is a working Ceph cluster with the needed CephX credentials. While the setup of a Ceph cluster is beyond the scope of this post, we will address it in a subsequent post. For now, we’ll focus on the client side.

You need to install the Ceph client packages on each node. On my test servers using Ubuntu 14.04, I did:

wget -q -O- 'https://download.ceph.com/keys/release.asc' | sudo apt-key add -
sudo apt-add-repository 'deb http://download.ceph.com/debian-infernalis/ trusty main'
apt-get update
apt-get install ceph

These commands also installed all the dependencies. Next, I copied the Ceph cluster configuration file /etc/ceph/ceph.conf:

[global]
fsid = 87671417-61e4-442b-8511-12659278700f
mon_initial_members = odroid1, odroid2
mon_host = 10.2.2.100, 10.2.2.20, 10.2.2.21
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
osd_journal = /var/lib/ceph/osd/journal
osd_journal_size = 128
osd_pool_default_size = 2

and the authentication file /etc/ceph/ceph.client.admin.keyring from another node. I made sure these files were readable by all. You can define more refined privileges for a production system with CephX, the security layer of Ceph.

Once everything is in place, you can test if it is working with this command:

root@PXC3:~# ceph -s
    cluster 87671417-61e4-442b-8511-12659278700f
     health HEALTH_OK
     monmap e2: 3 mons at {odroid1=10.2.2.20:6789/0,odroid2=10.2.2.21:6789/0,serveur-famille=10.2.2.100:6789/0}
            election epoch 474, quorum 0,1,2 odroid1,odroid2,serveur-famille
     mdsmap e204: 1/1/1 up {0=odroid3=up:active}
     osdmap e995: 4 osds: 4 up, 4 in
      pgmap v275501: 1352 pgs, 5 pools, 321 GB data, 165 kobjects
            643 GB used, 6318 GB / 7334 GB avail
                1352 active+clean
  client io 16491 B/s rd, 2425 B/s wr, 1 op/s

Which gives the current state of the Ceph cluster.

2. Create the Ceph pool

Before we can use Ceph, we need to create a first RBD image, put a filesystem on it and mount it for MySQL on the bootstrap node. We need at least one Ceph pool since the RBD images are stored in a Ceph pool. We create a Ceph pool with the command:

ceph osd pool create mysqlpool 512 512 replicated

Here, we have defined the pool mysqlpool with 512 placement groups. On a larger Ceph cluster, you might need to use more placement groups (again, a topic beyond the scope of this post). The pool we just created is replicated. Each object in the pool will have two copies as defined by the osd_pool_default_size parameter in the ceph.conf file. If needed, you can modify the size of a pool and its replication factor at any moment after the pool is created.

3. Create the first RBD image

Now that we have a pool, we can create a first RBD image:

root@PXC1:~# rbd -p mysqlpool create PXC --size 10240 --image-format 2

and “map” the RBD image to a host block device:

root@PXC1:~# rbd -p mysqlpool map PXC
/dev/rbd1

The commands return the local RBD block device that corresponds to the RBD image. The other steps are not specific to RBD images, we need to create a filesystem and prepare the mount points.

The rest of the steps are not specific to RBD images. We need to create a filesystem and prepare the mount points:

mkfs.xfs /dev/rbd1
mount /dev/rbd1 /var/lib/mysql -o rw,noatime,nouuid
chown mysql.mysql /var/lib/mysql
mysql_install_db --datadir=/var/lib/mysql --user=mysql
mkdir /var/lib/galera
chown mysql.mysql /var/lib/galera

You need to mount the RBD device and run the

mysql_install_db

tool only on the bootstrap node. You need to create the directories /var/lib/mysql and /var/lib/galera on the other nodes and adjust the permissions similarly.

4. Modify the my.cnf files

You will need to set or adjust the specific

wsrep_sst_ceph

settings in the my.cnf file of all the servers. Here are the relevant lines from the my.cnf file of one of my cluster node:

[mysqld]
wsrep_provider=/usr/lib/libgalera_smm.so
wsrep_provider_options="base_dir=/var/lib/galera"
wsrep_cluster_address=gcomm://10.0.5.120,10.0.5.47,10.0.5.48
wsrep_node_address=10.0.5.48
wsrep_sst_method=ceph
wsrep_cluster_name=ceph_cluster
[sst]
cephlocalpool=mysqlpool
cephmountoptions=rw,noatime,nodiratime,nouuid
cephkeyring=/etc/ceph/ceph.client.admin.keyring
cephcleanup=1

At this point, we can bootstrap the cluster on the node where we mounted the initial RBD image:

/etc/init.d/mysql bootstrap-pxc

5. Start the other XtraDB Cluster nodes

The first node does not perform an SST, so nothing exciting so far. With the patched version of MySQL (the above patch), starting MySQL on a second node triggers a Ceph SST operation. In my test environment, the SST take about five seconds to complete on low-powered VMs. Interestingly, the duration is not directly related to the dataset size. Because of this, a much larger dataset, on a quiet database, should take about the exact same time. A very busy database may need more time, since an SST requires a “flush tables with read lock” at some point.

So, after their respective Ceph SST, the other two nodes have:

root@PXC2:~# mount | grep mysql
/dev/rbd1 on /var/lib/mysql type xfs (rw,noatime,nodiratime,nouuid)
root@PXC2:~# rbd showmapped
id pool      image           snap device
1  mysqlpool PXC2-1463776424 -    /dev/rbd1
root@PXC3:~# mount | grep mysql
/dev/rbd1 on /var/lib/mysql type xfs (rw,noatime,nodiratime,nouuid)
root@PXC3:~# rbd showmapped
id pool      image           snap device
1  mysqlpool PXC3-1464118729 -    /dev/rbd1

The original RBD image now has two snapshots that are mapped to the clones mounted by other two nodes:

root@PXC3:~# rbd -p mysqlpool ls
PXC
PXC2-1463776424
PXC3-1464118729
root@PXC3:~# rbd -p mysqlpool info PXC2-1463776424
rbd image 'PXC2-1463776424':
        size 10240 MB in 2560 objects
        order 22 (4096 kB objects)
        block_name_prefix: rbd_data.108b4246146651
        format: 2
        features: layering
        flags:
        parent: mysqlpool/PXC@1463776423
        overlap: 10240 MB

Discussion

Apart from allowing faster SST, what other benefits do we get from using Ceph with Percona XtraDB Cluster?

The first benefit is the inherent data duplication over the network removes the need for local data replication. Thus, instead of using raid-10 or raid-5 with an array of disks, we could use a simple raid-0 stripe set if the data is already replicated to more than one server.

The second benefit is a bit less obvious: you don’t need as much storage. Why? A Ceph clone only stores the delta from its original snapshot. So, for large, read intensive datasets, the disk space savings can be very significant. Of course, over time, the clone will drift away from its parent snapshot and will use more and more space. When we determine that a Ceph clone uses too much disk space, we can simply refresh the clone by restarting MySQL and forcing a full SST. The SST script will automatically drop the old clone and snapshot when the cephcleanup option is set, and it will create a new fresh clone. You can easily evaluate how much space is consumed by the clone using the following commands:

root@PXC2:~# rbd -p mysqlpool du PXC2-1463776424
warning: fast-diff map is not enabled for PXC2-1463776424. operation may be slow.
NAME            PROVISIONED USED
PXC2-1463776424      10240M 164M

Also, nothing prevents you using a different configuration of Ceph pools in the same XtraDB cluster. Therefore a Ceph clone can use a different pool than its parent snapshot. That’s the whole purpose of the cephlocalpool parameter. Strictly speaking, you only need one node to use a replicated pool, as the other nodes could run on clones that are stored data in a non-replicated pool (saving a lot of storage space). Furthermore, we can define the OSD affinity of the non-replicated pool in a way that it stores data on the host where it is used, reducing the cross node network latency.

Using Ceph for XtraDB Cluster SST operation demonstrates one of the array of possibilities offered to MySQL by Ceph. We continue to work with the Red Hat team and Red Hat Ceph Storage architects to find new and useful ways of addressing database issues in the Ceph environment. There are many more posts to come, so stay tuned!

DISCLAIMER: The

wsrep_sst_ceph

script isn’t officially supported by Percona.

↧

Percona XtraDB Cluster 5.7.12 RC1 is now available

August 11, 2016, 4:01 pm

≫ Next: Percona XtraDB Cluster 5.7.14-26.17 GA is now available

≪ Previous: Percona XtraDB Cluster on Ceph

Percona XtraDB Cluster Reference Architecture

Percona announces the first release candidate (RC1) in the Percona XtraDB Cluster 5.7 series on August 9, 2016. Binaries are available from the downloads area or our software repositories.

Percona XtraDB Cluster 5.7.12-5rc1-26.16 is based on the following:

This release includes all changes from upstream releases and the following:

New Features

PXC Strict Mode: Use the pxc_strict_mode variable in the configuration file or the –pxc-strict-mode option during mysqld startup.
Galera instruments exposed in Performance Schema: This includes mutexes, condition variables, file instances, and threads.

Bug Fixes

Fixed error messages.
Fixed the failure of SST via mysqldump with gtid_mode=ON.
Added check for TOI that ensures node readiness to process DDL+DML before starting the execution.
Removed protection against repeated calls of wsrep->pause() on the same node to allow parallel RSU operation.
Changed wsrep_row_upd_check_foreign_constraints to ensure that fk-reference-tableis open before marking it open.
Fixed error when running SHOW STATUS during group state update.
Corrected the return code of sst_flush_tables() function to return a non-negative error code and thus pass assertion.
Fixed memory leak and stale pointer due to stats not freeing when toggling the wsrep_providervariable.
Fixed failure of ROLLBACK to register wsrep_handler
Fixed failure of symmetric encryption during SST.

Other Changes

Added support for sending the keyring when performing encrypted SST.
Changed the code of THD_PROC_INFO to reflect what the thread is currently doing.
Using XtraBackup as the SST method now requires Percona XtraBackup 2.4.4 or later.
Improved rollback process to ensure that when a transaction is rolled back, any statements open by the transaction are also rolled back.
Removed the sst_special_dirs variable.
Disabled switching of slave_preserve_commit_order to ON when running PXC in cluster mode, as it conflicts with existing multi-master commit ordering resolution algorithm in Galera.
Based the default my.cnf on Percona Server 5.7 configuration with Galera/wsrep settings from PXC 5.6.
Other low-level fixes and improvements for better stability.

Help us improve our software quality by reporting any bugs you encounter using our bug tracking system. As always, thanks for your continued support of Percona!

↧

Percona XtraDB Cluster 5.7.14-26.17 GA is now available

September 29, 2016, 7:05 am

≫ Next: All You Need to Know About GCache (Galera-Cache)

≪ Previous: Percona XtraDB Cluster 5.7.12 RC1 is now available

This Percona XtraDB Cluster 5.7.14-26.17 GA release is dedicated to the memory of Federico Goncalvez, our colleague with Percona’s Uruguayan team until his tragic death on September 6, 2016.

Fede, we miss you.

Percona announces the first general availability (GA) release in the Percona XtraDB Cluster 5.7 series on September 29, 2016. Binaries are available from the downloads area or our software repositories.

The full release notes are available here.

Percona XtraDB Cluster 5.7.14-26.17 GA is based on the following:

For information about the changes and new features introduced in Percona Server 5.7, see Changed in Percona Server 5.7.

Percona XtraDB Cluster 5.7.14-26.17 GA New Features

This is a list of the most important features introduced in Percona XtraDB Cluster 5.7 compared to version 5.6:

PXC Strict Mode saves your workload from experimental and unsupported features.
Support for monitoring Galera Library instruments and other wsrep instruments as part of Performance Schema.
Support for encrypted tablespaces in Multi-Master Topology, which enables Percona XtraDB Cluster to wire encrypted tablespace to new booting node.
Compatibility with ProxySQL, including a quick configuration script.
Support for monitoring Percona XtraDB Cluster nodes using Percona Monitoring and Management
More stable and robust operation with MySQL and Percona Server version 5.7.14, as well as Galera 3.17 compatibility. Includes all upstream bug fixes, improved logging and more.
Simplified packaging for Percona XtraDB Cluster to a single package that installs everything it needs, including the Galera library.
Support for latest Percona XtraBackup with enhanced security checks.

Bug Fixes

Fixed crash when a local transaction (such as EXPLAIN or SHOW) is interrupted by a replicated transaction with higher priority (like ALTER that changes table structure and can thus affect the result of the local transaction).
Fixed DONOR node getting stuck in Joined state after successful SST.
Fixed error message when altering non-existent table with pxc-strict-mode enabled.
Fixed path to the directory in percona-xtradb-cluster-shared.conf.
Fixed setting of seqno in grastate.dat to -1 on clean shutdown.
Fixed failure of asynchronous TOI actions (like DROP) for non-primary nodes.
Fixed replacing of my.cnf during upgrade from 5.6 to 5.7.

Security Fixes

CVE-2016-6662
CVE-2016-6663
CVE-2016-6664

For more information, see this blog post.

Other Improvements

Added support of defaults-group-suffix for SST scripts.

Help us improve our software quality by reporting any bugs you encounter using our bug tracking system. As always, thanks for your continued support of Percona!

↧

All You Need to Know About GCache (Galera-Cache)

November 16, 2016, 5:21 pm

≫ Next: Galera Cache (gcache) is finally recoverable on restart

≪ Previous: Percona XtraDB Cluster 5.7.14-26.17 GA is now available

This blog discusses some important aspects of GCache.

Why do we need GCache?

Percona XtraDB Cluster is a multi-master topology, where a transaction executed on one node is replicated on another node(s) of the cluster. This transaction is then copied over from the group channel to Galera-Cache followed by apply action.

The cache can be discarded immediately once the transaction is applied, but retaining it can help promote a node as a DONOR node serving write-sets for a newly booted node.

So in short, GCache acts as a temporary storage for replicated transactions.

How is GCache managed?

Naturally, the first choice to cache these write-sets is to use memory allocated pool, which is governed by gcache.mem_store. However, this is deprecated and buggy and shouldn’t be used.

Next on the list is on-disk files. Galera has two types of on-disk files to manage write-sets:

RingBuffer File:
- A circular file (aka RingBuffer file). As the name suggests, this file is re-usable in a circular queue fashion, and is pre-created when the server starts. The size of this file is preconfigured and can’t be changed dynamically, so selecting a proper size for this file is important.
- The user can set the size of this file using gcache.size. (There are multiple blogs about how to estimate size of the Galera Cache, which is generally linked to downtime. If properly planned, the next booting node will find all the missing write-sets in the cache, thereby avoiding need for SST.)
- Write-sets are appended to this file and, when needed, the file is re-cycled for use.
On-demand page store:
- If the transaction write-set is large enough not to fit in a RingBuffer File (actually large enough not to fit in half of the RingBuffer file) then an independent page (physical disk file) is allocated to cache the write-sets.
- Again there are two types of pages:
  - Page with standard size: As defined by gcache.page_size (default=128M).
  - Page with non-standard page size: If the transaction is large enough not to fit into a standard page, then a non-standard page is created for the transaction. Let’s say
```
gcache.page_size=1M
```
    and transaction
```
write_set = 1.5M
```
    , then a separate page (in turn on-disk file) will be created with a size of 1.5M.

How long are on demand pages retained? This is controlled using following two variables:

gcache.keep_pages_size
- ```
keep_pages_size
```
  defines total size of allocated pages to keep. For example, if
```
keep_pages_size = 10M
```
  then N pages that add up to 10M can be retained. If N pages add to more than 10M, then pages are removed from the start of the queue until the size falls below set threshold. A size of 0 means don’t retain any page.
gcache.keep_pages_count (PXC specific)
- But before pages are actually removed, a second check is done based on
```
page_count
```
  . Let’s say
```
keep_page_count = N+M
```
  , then even though N pages adds up to 10M, they will be retained as the
```
page_count
```
  threshold is not yet hit. (The exception to this is non-standard pages at the start of the queue.)

So in short, both condition must be satisfied. The recommendation is to use whichever condition is applicable in the user environment.

Where are GCache files located?

The default location is the data directory, but this can be changed by setting gcache.dir. Given the temporary nature of the file, and iterative read/write cycle, it may be wise to place these files in a faster IO disk. Also, the default name of the file is gcache.cache. This is configurable by setting gcache.name.

What if one of the node is DESYNCED and PAUSED?

If a node desyncs, it will continue to received write-sets and apply them, so there is no major change in gcache handling.

If the node is desynced and paused, that means the node can’t apply write-sets and needs to keep caching them. This will, of course, affect the desynced/paused node and the node will continue to create on-demand page store. Since one of the cluster nodes can’t proceed, it will not emit a “last committed” message. In turn, other nodes in the cluster (that can purge the entry) will continue to retain the write-sets, even if these nodes are not desynced and paused.

↧

Galera Cache (gcache) is finally recoverable on restart

November 30, 2016, 2:38 pm

≫ Next: Percona Software News and Roadmap Update with CEO Peter Zaitsev: Q1 2017

≪ Previous: All You Need to Know About GCache (Galera-Cache)

This post describes how to recover Galera Cache (or gcache) on restart.

Recently Codership introduced (with Galera 3.19) a very important and long awaited feature. Now users can recover Galera cache on restart.

Need

If you gracefully shutdown cluster nodes one after another, with some lag time between nodes, then the last node to shutdown holds the latest data. Next time you restart the cluster, the last node shutdown will be the first one to boot. Any followup nodes that join the cluster after the first node will demand an SST.

Why SST, when these nodes already have data and only few write-sets are missing? The DONOR node caches missing write-sets in Galera cache, but on restart this cache is wiped clean and restarted fresh. So the DONOR node doesn’t have a Galera cache to donate missing write-sets.

This painful set up made it necessary for users to think and plan before gracefully taking down the cluster. With the introduction of this new feature, the user can retain the Galera cache.

How does this help ?

On restart, the node will revive the galera-cache. This means the node can act as a DONOR and service missing write-sets (facilitating IST, instead of using SST). This option to retain the galera-cache is controlled by an option named

gcache.recover=yes/no

. The default is NO (Galera cache is not retained). The user can set this option for all nodes, or selective nodes, based on disk usage.

gcache.recover in action

The example below demonstrates how to use this option:

Let’s say the user has a three node cluster (n1, n2, n3), with all in sync.
The user gracefully shutdown n2 and n3.
n1 is still up and running, and processes some workload, so now n1 has latest data.
n1 is eventually shutdown.
Now the user decides to restart the cluster. Obviously, the user needs to start n1 first, followed by n2/n3.
n1 boots up, forming an new cluster.
n2 boots up, joins the cluster, finds there are missing write-sets and demands IST but given that n1 doesn’t have a gcache, it falls back to SST.

n2 (JOINER node log):

2016-11-18 13:11:06 3277 [Note] WSREP: State transfer required:
 Group state: 839028c7-ad61-11e6-9055-fe766a1886c3:4680
 Local state: 839028c7-ad61-11e6-9055-fe766a1886c3:3893

n1 (DONOR node log),

gcache.recover=no

2016-11-18 13:11:06 3245 [Note] WSREP: IST request: 839028c7-ad61-11e6-9055-fe766a1886c3:3893-4680|tcp://192.168.1.3:5031
2016-11-18 13:11:06 3245 [Note] WSREP: IST first seqno 3894 not found from cache, falling back to SST

Now let’s re-execute this scenario with

gcache.recover=yes

n2 (JOINER node log):

2016-11-18 13:24:38 4603 [Note] WSREP: State transfer required:
 Group state: ee8ef398-ad63-11e6-92ed-d6c0646c9f13:1495
 Local state: ee8ef398-ad63-11e6-92ed-d6c0646c9f13:769
....
2016-11-18 13:24:41 4603 [Note] WSREP: Receiving IST: 726 writesets, seqnos 769-1495
....
2016-11-18 13:24:49 4603 [Note] WSREP: IST received: ee8ef398-ad63-11e6-92ed-d6c0646c9f13:1495

n1 (DONOR node log):

2016-11-18 13:24:38 4573 [Note] WSREP: IST request: ee8ef398-ad63-11e6-92ed-d6c0646c9f13:769-1495|tcp://192.168.1.3:5031
2016-11-18 13:24:38 4573 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.

You can also validate this by checking the lowest write-set available in gcache on the DONOR node.

mysql> show status like 'wsrep_local_cached_downto';
+---------------------------+-------+
| Variable_name | Value |
+---------------------------+-------+
| wsrep_local_cached_downto | 1 |
+---------------------------+-------+
1 row in set (0.00 sec)

So as you can see,

gcache.recover

could restore the cache on restart and help service IST over SST. This is a major resource saver for most of those graceful shutdowns.

gcache revive doesn’t work if . . .

If gcache pages are involved. Gcache pages are still removed on shutdown, and the gcache write-set until that point also gets cleared.

Again let’s see and example:

Let’s assume the same configuration and workflow as mentioned above. We will just change the workload pattern.
n1, n2, n3 are in sync and an average-size workload is executed, such that the write-set fits in the gcache. (seqno=1-x)
n2 and n3 are shutdown.
n1 continues to operate and executes some average size workload followed by a huge transaction that results in the creation of a gcache page. (1-x-a-b-c-h) [h represent transaction seqno]
Now n1 is shutdown. During shutdown, gcache pages are purged (irrespective of the
```
keep_page_sizes
```
setting).
The purge ensures that all the write-sets that has seqno smaller than
```
gcache-page-residing
```
write-set are purged, too. This effectively means (1-h) everything is removed, including (a,b,c).
On restart, even though n1 can revive the gcache it can’t revive anything, as all the write-sets are purged.
When n2 boots up, it requests IST, but n1 can’t service the missing write-set (a,b,c,h). This causes SST to take place.

Summing it up

Needless to say,

gcache.recover

is a much needed feature, given it saves SST pain. (Thanks Codership.) It would be good to see if the feature can be optimized to work with gcache pages.

And yes, Percona XtraDB Cluster inherits this feature in its upcoming release.

↧

Percona Software News and Roadmap Update with CEO Peter Zaitsev: Q1 2017

January 27, 2017, 2:06 pm

≫ Next: Installing Percona Monitoring and Management (PMM) for the First Time

≪ Previous: Galera Cache (gcache) is finally recoverable on restart

This blog post is a summary of the webinar Percona Software News and Roadmap Update – Q1 2017 given by Peter Zaitsev on January 12, 2017.

Over the last few months, I’ve had the opportunity to meet and talk with many of Percona’s customers. I love these meetings, and I always get a bunch of questions about what we’re doing, what our plans are and what releases are coming.

I’m pleased to say there is a great deal going on at Percona, and I thought giving a quick talk about our current software and services, along with our plans, would provide a simple reference for many of these questions.

A full recording of this webinar, along with the presentation slide deck, can be found here.

Percona Solutions and Services

Let me start by laying out Percona’s company purpose:

To champion unbiased open source database solutions.

What does this mean? It means that we write software to offer you better solutions, and we use the best of what software and technology exist in the open source community.

Percona stands by a set of principles that we feel define us as a company, and are a promise to our customers:

100% free and open source software
Focused on finding solution to maximize your success
Open source database strategy consulting and implementation
No vendor lock-in required

We offer trusted and unbiased expert solutions, support and resource in a broad software ecosystem, including:

MySQL
Percona Server
MariaDB
Percona XtraDB Cluster
Galera Cluster for MySQL
MariaDB Galera Cluster
MongoDB
Percona Server for MongoDB
Amazon RDS for MySQL/MariaDB/Aurora
Google CloudSQL

We also have specialization options for PaaS, IaaS, and SaaS solutions like Amazon Web Services, OpenStack, Google Cloud Platform, OpenShift, Ceph, Docker and Kubernetes.

Percona’s immediate business focus includes building long-term partnership relationships through support and managed services.

The next few sections detail our current service offerings, with some outlook on our plans.

98% Customer Satisfaction Rating

Over the last six months, Percona has consistently maintained a 98% Customer Satisfaction Rating!

Customer Success Team

Our expanded Customer Success Team is here to ensure you’re getting most out of your Percona Services Subscription.

Managed Services Best Practices

Unification based on best practices
Organization changes to offer more personal service
Increased automation

Ongoing Services

Consulting and Training. Our consulting and training services are available to assist you with whatever project or staff needs you have.

Onsite and remote
4 hours to full time (weeks or months)
Project and staff augmentation

Advanced HA Included with Enterprise and Premier Support. Starting this past Spring, we included advance high availability (HA) support as part of our Enterprise and Premier support tiers. This advanced support includes coverage for:

Percona XtraDB Cluster
MariaDB Galera Cluster
Galera Cluster for MySQL
Upcoming MySQL group replication
Upcoming MySQL Innodb Cluster

Enterprise Wide Support Agreements. Our new Enterprise Wide Support option allows you to buy per-environment support coverage that covers all of the servers in your environment, rather than on a per-server basis. This method of support can save you money, because it:

Covers both “MySQL” and “MongoDB”
Means you don’t have to count servers
Provides highly customized coverage

Simplified Support Pricing. Get easy to understand support pricing quickly.

To discuss how Percona Support can help your business, please call us at +1-888-316-9775 (USA),
+44 203 608 6727 (Europe), or have us contact you.

New Percona Online Store – Easy to Buy, Pay Monthly

Purchase some of Percona’s services online
Fast and easy credit card payments
Recurring monthly payments
https://www.percona.com/store

Percona Software

Below are the latest and upcoming features in Percona’s software. All of Percona’s software adheres to the following principles:

100% free and open source
No restricted “Enterprise” version
No “open core”
No BS-licensing (BSL)

Percona Server for MySQL 5.7

Overview

100% Compatible with MySQL 5.7 Community Edition
100% Free and Open Source
Includes Alternatives to Many MySQL Enterprise Features
Includes TokuDB Storage Engine
Focus on Performance and Operational Visibility

Latest Improvements

Improved space allocation for TokuDB: http://bit.ly/2ddVsTO
Directory per database layout for TokuDB: http://bit.ly/2iaYBZq
Column compression (5.6 only)
Security updates

Features about to be Released

Integration of TokuDB and Performance Schema
MyRocks integration in Percona Server
MySQL Group Replication
Starting to look towards MySQL 8

Percona XtraBackup 2.4

Overview

#1 open source binary hot backup solution for MySQL
Alternative to MySQL Enterprise backup
Parallel backups, incremental backups, streaming, encryption
Supports MySQL, MariaDB, Percona Server

New Features

Support SHA256 passwords and secure connection to server
Improved Security (CVE-2016-6225)
Wrong passphrase detection

Percona Toolkit

Overview

“Swiss Army Knife” of tools
Helps DBAs be more efficient
Helps DBAs make fewer mistakes
Supports MySQL, MariaDB, Percona Server, Amazon RDS MySQL

New Features

Improved fingerprinting in pt-query-digest
Pause file for pt-online-schema-change
Provide information about transparent huge pages

Coming Soon

Working towards Percona Toolkit 3.0 release
Comprehensive support for MongoDB
New tools are now implemented in Go

Percona Server for MongoDB 3.2

Overview

100% compatible with MongoDB 3.2 Community Edition
100% open source
Alternatives for many MongoDB Enterprise features
MongoRocks (RocksDB) storage engine
Percona Memory Engine

New

Percona Server for MongoDB 3.2 – GA
Support for MongoRocks storage engine
PerconaFT storage engine depreciated
Implemented Percona Memory Engine

Coming Soon

Percona Server for MongoDB 3.4
Fully compatible with MongoDB 3.4 Community Edition
Updated RocksDB storage engine
Universal hot backup for WiredTiger and MongoRocks
Profiling rate limiting (query sampling)

Percona Memory Engine for MongoDB

Benchmarks

Percona Memory Engine for MongoDB® is a 100 percent open source in-memory storage engine for Percona Server for MongoDB.

Based on the in-memory storage engine used in MongoDB Enterprise Edition, WiredTiger, Percona Memory Engine for MongoDB delivers extremely high performance and reduced costs for a variety of use cases, including application cache, sophisticated data manipulation, session management and more.

Below are some benchmarks that we ran to demonstrate Percona Memory Engine’s performance.

Percona Software News and Roadmap Update

WiredTiger vs MongoRocks – write intensive

Percona XtraDB Cluster 5.7

Overview

Based on Percona Server 5.7
Easiest way to bring HA in your MySQL environment
Designed to work well in the cloud
Multi-master replication with no conflicts
Automatic node provisioning for auto-scaling and self-healing

Goals

Brought PXC development in-house to server our customers better
Provide complete clustering solution, not set of LEGO pieces
Improve usability and ease of use
Focus on quality

Highlights

Integrated cluster-aware load balancer with ProxySQL
Instrumentation with Performance Schema
Support for data at rest encryption (InnoDB tablespace encryption)
Your data is safe by default with “strict mode” – prevents using features that do not work correctly
Integration with Percona Monitoring and Management (PMM)

New in Percona XtraDB Cluster 5.7

One option to secure all network communication: pxc-encrypt-cluster-traffic
Zero downtime maintenance with ProxySQL and Maintenance Mode

Percona Monitoring and Management

Overview

Comprehensive database-focused monitoring
100% open source, roll-your-own solution
Easy to install and use
Supports MySQL and MongoDB
Version 1.0 focuses on trending and query analyses
Management features to come

Examples of PMM Screens

What queries are causing the load?

Why are they causing this load?

How to fix them:

System information:

What happens on OS and hardware level:

As well as the database level:

New in Percona Monitoring and Management

Continuing to improve and expand dashboards with every release
Includes Grafana 4.0 (with basic Alerting)
SSL support for server-agent communications
Supports authentication for server-agent communication
Added support for Amazon RDS
Reduced space consumption by using advanced compression

Coming Soon

PMM server available as AMI and Virtual Appliance image
Better MongoDB dashboards
Continuing work on dashboards Improvement
Query analytics application refinements
Short term continuing focus on monitoring functionality

Check out the Demo

http://pmmdemo.percona.com

Percona Live Open Source Database Conference 2017 is right around the corner!

The Percona Live Open Source Database Conference is the premier event for the diverse and active open source database community, as well as businesses that develop and use open source database software. The conferences have a technical focus with an emphasis on the core topics of MySQL, MongoDB, PostgreSQL and other open source databases. Tackling subjects such as analytics, architecture and design, security, operations, scalability and performance, Percona Live provides in-depth discussions for your high-availability, IoT, cloud, big data and other changing business needs. This conference is an opportunity to network with peers and technology professionals by bringing together accomplished DBA’s, system architects and developers from around the world to share their knowledge and experience – all to help you learn how to tackle your open source database challenges in a whole new way

This conference has something for everyone!

April 24-27, 2017
Santa Clara, CA, USA
MySQL, MongoDB, PostgreSQL, other open source databases
https://www.percona.com/live/17/

Sponsorship opportunities are available as well. Become a Percona Live Sponsor, download the prospectus here.

↧

Installing Percona Monitoring and Management (PMM) for the First Time

February 24, 2017, 1:32 pm

≫ Next: Percona XtraDB Cluster 5.6.35-26.20 is now available

≪ Previous: Percona Software News and Roadmap Update with CEO Peter Zaitsev: Q1 2017

This post is another in the series on Percona’s MongoDB 3.4 bundle release. This post is meant to walk a prospective user through the benefits of Percona Monitoring and Management (PMM), how it’s architected and the simple install process. By the end of this post, you should have a good idea of what PMM is, where it can add value in your environment and how you can get PMM going quickly.

Percona Monitoring and Management (PMM) is Percona’s open-source tool for monitoring and alerting on database performance and the components that contribute to it. PMM monitors MySQL (Percona Server and MySQL CE), Amazon RDS/Aurora, MongoDB (Percona Server and MongoDB CE), Percona XtraDB/Galera Cluster, ProxySQL, and Linux.

What is it?

Percona Monitoring and Management is an amalgamation of exciting, best in class, open-source tools and Percona “engineering wizardry,” designed to make it easier to monitor and manage your environment. The real value to our users is the amount of time we’ve spent integrating the tools, plus the pre-built dashboards we’ve constructed that leverage the ten years of performance optimization experience. What you get is a tool that is ready to go out of the box, and installs in minutes. If you’re still not convinced, like ALL Percona software it’s completely FREE!

Sound good? I can hear you nodding your head. Let’s take a quick look at the architecture.

What’s it made of?

PMM, at a high-level, is made up of two basic components: the client and the server. The PMM Client is installed on the database servers themselves and is used to collect metrics. The client contains technology specific exporters (which collect and export data), and an “admin interface” (which makes the management of the PMM platform very simple). The PMM server is a “pre-integrated unit” (Docker, VM or AWS AMI) that contains four components that gather the metrics from the exporters on the PMM client(s). The PMM server contains Consul, Grafana, Prometheus and a Query Analytics Engine that Percona has developed. Here is a graphic from the architecture section of our documentation. In order to keep this post to a manageable length, please refer to that page if you’d like a more “in-depth” explanation.

How do I use it?

PMM is very easy to access once it has been installed (more on the install process below). You will simply open up the web browser of your choice and connect to the PMM Landing Page by typing

http://<ip_address_of _PMM_server>

. That takes you to the PMM landing page, where you can access all of PMM’s tools. If you’d like to get a look into the user experience, we’ve set up a great demo site so you can easily test it out.

Where should I use it?

There’s a good chance that you already have a monitoring/alerting platform for your production workloads. If not, you should set one up immediately and start analyzing trends in your environment. If you’re confident in your production monitoring solution, there is still a use for PMM in an often overlooked area: development and testing.

When speaking with users, we often hear that their development and test environments run their most demanding workloads. This is often due to stress testing and benchmarking. The goal of these workloads is usually to break something. This allows you to set expectations for normal, and thus abnormal, behavior in your production environment. Once you have a good idea of what’s “normal” and the critical factors involved, you can alert around those parameters to identify “abnormal” patterns before they cause user issues in production. The reason that monitoring is critical in your dev/test environment(s) is that you want to easily spot inflection points in your workload, which signal impending disaster. Dashboards are the easiest way for humans to consume and analyze this data.

Are you sold? Let’s get to the easiest part: installation.

How do you install it?

PMM is very easy to install and configure for two main reasons. The first is that the components (mentioned above) take some time to install, so we spent the time to integrate everything and ship it as a unit: one server install and a client install per host. The second is that we’re targeting customers looking to monitor MySQL and MongoDB installations for high-availability and performance. The fact that it’s a targeted solution makes pre-configuring it to monitor for best practices much easier. I believe we’ve all seen a particular solution that tries to do a little of everything, and thus actually does no particular thing well. This is the type of tool that we DO NOT want PMM to be. Now, onto the installation procedure.

There are four basic steps to get PMM monitoring your infrastructure. I do not want to recreate the Deployment Guide in order to maintain the future relevancy of this post. However, I’ll link to the relevant sections of the documentation so you can cut to the chase. Also, underneath each step, I’ll list some key takeaways that will save you time now and in the future.

Install the integrated PMM server in the flavor of your choice (Docker, VM or AWS AMI)
1. Percona recommends Docker to deploy PMM server as of v1.1
  1. As of right now, using Docker will make the PMM server upgrade experience seamless.
  2. Using the default version of Docker from your package manager may cause unexpected behavior. We recommend using the latest stable version from Docker’s repositories (instructions from Docker).
2. PMM server AMI and VM are “experimental” in PMM v1.1
3. When you open the “Metrics Monitor” for the first time, it will ask for credentials (user: admin pwd: admin).
Install the PMM client on every database instance that you want to monitor.
1. Install with your package manager for easier upgrades when a new version of PMM is released.
Connect the PMM client to the PMM Server.
1. Think of this step as sending configuration information from the client to the server. This means you are telling the client the address of the PMM server, not the other way around.
Start data collection services on the PMM client.
1. Collection services are enabled per database technology (MySQL, MongoDB, ProxySQL, etc.) on each database host.
2. Make sure to set permissions for PMM client to monitor the database: Cannot connect to MySQL: Error 1045: Access denied for user ‘jon’@’localhost’ (using password: NO)
  1. Setting proper credentials uses this syntax sudo pmm-admin add <service_type> –user xxxx –password xxxx
3. There’s good information about PMM client options in the “Managing PMM Client” section of the documentation for advanced configurations/troubleshooting.

What’s next?

That’s really up to you, and what makes sense for your needs. However, here are a few suggestions to get the most out of PMM.

Set up alerting in Grafana on the PMM server. This is still an experimental function in Grafana, but it works. I’d start with Barrett Chambers’ post on setting up email alerting, and refine it with Peter Zaitsev’s post.
Set up more hosts to test the full functionality of PMM. We have completely free, high-performance versions of MySQL, MongoDB, Percona XtraDB Cluster (PXC) and ProxySQL (for MySQL proxy/load balancing).
Start load testing the database with benchmarking tools to build your troubleshooting skills. Try to break something to learn what troubling trends look like. When you find them, set up alerts to give you enough time to fix them.

↧

Innodb Fulltext Indexes

Minimal replication images

Durable Memcache Cluster

Async replication GTID Integration

Conclusion

Overview

The basic upgrade flow

Why can’t I write to the 5.6 nodes?

Some alternatives

Does writing to the 5.6 nodes REALLY break things?

Using Async replication to upgrade

Just take the outage

Conclusion

The Test

Adding parallelism

What about Percona XtraDB Cluster?

TL;DR

Monitoring the cluster

Summary

Q: Is the read,write speed reduced in Galera compared to normal MySQL?

Q: Does state transfers affect a continuing transaction within the nodes?

Q: Perhaps not the correct webinar for this question, but I was also expecting to hear something about using PXC in combination with OpenStack Trove. If you’ve got time, could you tell something about that?

Q: For Loadbalancing using the Java Mysql driver, would you suggest HA proxy or the loadbalancing connection in the java driver. Also how does things work in case of persistent connections and connection pools like dbcp?

Q: Are there any manual solutions to avoid deadlocks within a high write context to force a query to execute on all nodes?

High Availability

PXC

Focus on data consistency

Anti-split-brain

Dirty reads from Galera cluster

MySQL

Useful links

Problem

Possible Cause

Source

Possible Solution

Conclusion

The challenges of clone-based SST

SST script

Root privileges

Files opened by MySQL before the SST

Walkthrough

1. Join the Ceph cluster

2. Create the Ceph pool

3. Create the first RBD image

4. Modify the my.cnf files

5. Start the other XtraDB Cluster nodes

Discussion

DISCLAIMER: The wsrep_sst_ceph script isn’t officially supported by Percona.

Percona XtraDB Cluster 5.7.14-26.17 GA New Features

Bug Fixes

Security Fixes

Other Improvements

Why do we need GCache?

How is GCache managed?

Where are GCache files located?

What if one of the node is DESYNCED and PAUSED?

Need

How does this help ?

gcache.recover in action

gcache revive doesn’t work if . . .

Summing it up

Percona Solutions and Services

Percona Software

Percona Live Open Source Database Conference 2017 is right around the corner!

What is it?

What’s it made of?

How do I use it?

Where should I use it?

How do you install it?

What’s next?

DISCLAIMER: The
wsrep_sst_ceph
script isn’t officially supported by Percona.