mk-table-sync.1p

Language: en

Version: 2009-05-04 (debian - 07/07/09)

Section: 1 (User Commands)

NAME

mk-table-sync - Synchronize MySQL tables efficiently.

SYNOPSIS

It is a good idea to back up your data and run "mk-table-sync" with ``--test'' first, to see what will happen. If you want to see which rows are different without changing the tables, use ``--print'' instead of ``--execute''.

To sync db.tbl1 from host1 to host2:

    mk-table-sync --execute u=user,p=pass,h=host1,D=db,t=tbl host2

Sync all tables in host1 to host2 and host3:

    mk-table-sync --execute host1 host2 host3

Resolve differences mk-table-checksum found on this master's slaves:

    mk-table-sync --execute --replicate test.checksum master1

Sync this slave to its replication master:

    mk-table-sync --execute --synctomaster slave1

Sync this slave to its replication master, resolving differences mk-table-checksum found:

    mk-table-sync --execute --synctomaster --replicate test.checksum slave1

Sync server2 in a master-master replication configuration, where server2's copy of db1.tbl1 is known or suspected to be incorrect:

    mk-table-sync --execute --synctomaster h=server2,D=db1,t=tbl1

Note that in the master-master configuration, the following will NOT do what you want, because it will make changes directly on server2, which will then flow through replication and change server1's data:

    # Don't do this in a master-master setup!
    mk-table-sync --execute h=server1,D=db1,t=tbl1 h=server2

DESCRIPTION

WARNING: this tool is unfinished and could perform slowly. The Chunk algorithm is great when it can be used, and so is Nibble, but otherwise GroupBy is the default choice and it may not perform very well. Please run with ``--test'' before subjecting your servers to this tool, and make backups of your data!

This tool is designed to do one-way synchronization of data (two-way sync is planned for the future). It finds differences efficiently with one of several algorithms (see ``ALGORITHMS''). It makes changes on the destination table(s) so they match the source.

It does not synchronize table structures, indexes, or any other schema changes. It synchronizes only data.

It can operate through replication by comparing a slave with its master and making changes on the master. These changes will flow through replication and correct any differences found on the slave.

It accepts a list of DSNs (see the ``--help'' output) to tell it where and how to connect.

There are many ways to invoke it. The following is the abbreviated logic:

    if DSN has a t part, sync only that table:
       if 1 DSN:
          if --synctomaster:
             The DSN is a slave.  Connect to its master and sync.
       if more than 1 DSN:
          The first DSN is the source.  Sync each DSN in turn.
    else if --replicate:
       if --synctomaster:
          The DSN is a slave.  Connect to its master, find records
          of differences, and fix.
       else:
          The DSN is the master.  Find slaves and connect to each,
          find records of differences, and fix.
    else:
       if only 1 DSN and --synctomaster:
          The DSN is a slave.  Connect to its master, find tables and
          filter with --databases etc, and sync each table to the master.
       else:
          find tables, filtering with --databases etc, and sync each
          DSN to the first.

If you're confused about how the DSNs are interpreted, give the ``--explainhosts'' option and it will tell you.

REPLICATION SAFETY

Synchronizing a replication master and slave safely is a non-trivial problem, in general. There are all sorts of issues to think about, such as other processes changing data, trying to change data on the slave, whether the destination and source are a master-master pair, and much more.

In general, the safe way to do it is to change the data on the master, and let the changes flow through replication to the slave like any other changes. However, this works only if it's possible to REPLACE into the table on the master. REPLACE works only if there's a unique index on the table (otherwise it just acts like an ordinary INSERT).

If your table has unique keys, you should use the ``--synctomaster'' and/or ``--replicate'' options to sync a slave to its master. This will generally do the right thing. When there is no unique key on the table, there is no choice but to change the data on the slave, and mk-table-sync will detect that you're trying to do so. It will complain and die unless you specify the ``--skipslavecheck'' option.

If you're syncing a table without a primary or unique key on a master-master pair, you must change the data on the destination server. Therefore, you need to use the ``--skipbinlog'' option for safety. If you don't, the changes you make on the destination server will replicate back to the source server and change the data there!

The generally safe thing to do on a master-master pair is to use the ``--synctomaster'' option so you don't change the data on the destination server. You will also need to use the ``--skipslavecheck'' option to keep mk-table-sync from complaining that it is changing data on a slave.

ALGORITHMS

This tool has a generic data-syncing framework, within which it is possible to use any number of different algorithms to actually find differences. It chooses the best algorithm automatically. While I plan to add more algorithms in the future, the following are implemented now:
Chunk
Finds an index whose first column is numeric (including date and time types), and divides the column's range of values into chunks of approximately ``--chunksize'' rows. Syncs a chunk at a time by checksumming the entire chunk. If the chunk differs on the source and destination, checksums each chunk's rows individually to find the rows that differ.

It is efficient when the column has sufficient cardinality to make the chunks end up about the right size.

The initial per-chunk checksum is quite small and results in minimal network traffic and memory consumption. If a chunk's rows must be examined, only the primary key columns and a checksum are sent over the network, not the entire row. If a row is found to be different, the entire row will be fetched, but not before.
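The chunking step can be sketched in a few lines. The following is an illustrative Python sketch, not the tool's Perl internals: it splits a numeric key range into pieces of roughly ``--chunksize'' rows (assuming uniformly distributed values, the same assumption Chunk makes) and compares a per-chunk checksum, so only differing chunks need row-level inspection.

```python
import zlib

def chunk_boundaries(col_min, col_max, total_rows, chunk_size):
    """Divide a numeric column's range into pieces of roughly
    chunk_size rows, assuming values are evenly distributed."""
    n_chunks = max(1, total_rows // chunk_size)
    step = (col_max - col_min + 1) / n_chunks
    edges = [col_min + round(i * step) for i in range(n_chunks)]
    edges.append(col_max + 1)           # exclusive upper bound of last chunk
    return list(zip(edges, edges[1:]))  # half-open (low, high) ranges

def chunk_checksum(rows, low, high):
    """Checksum every row whose key falls in [low, high)."""
    data = "".join(repr(r) for r in rows if low <= r[0] < high)
    return zlib.crc32(data.encode())

# Source and destination differ in one row (id 1500).
src = [(i, "v%d" % i) for i in range(1, 3001)]
dst = [(i, "v%d" % i) if i != 1500 else (i, "CHANGED") for i in range(1, 3001)]

differing = [(lo, hi) for lo, hi in chunk_boundaries(1, 3000, 3000, 1000)
             if chunk_checksum(src, lo, hi) != chunk_checksum(dst, lo, hi)]
print(differing)  # only the chunk containing id 1500 needs row-level checks
```

Only the middle chunk's checksum disagrees, so only its rows are examined individually.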

Nibble
Finds an index and ascends the index in fixed-size nibbles of ``--chunksize'' rows, using a non-backtracking algorithm (see mk-archiver for more on this algorithm). It is very similar to ``Chunk'', but instead of pre-calculating the boundaries of each piece of the table based on index cardinality, it uses "LIMIT" to define each nibble's upper limit, and the previous nibble's upper limit to define the lower limit.

It works in steps: one query finds the row that will define the next nibble's upper boundary, and the next query checksums the entire nibble. If the nibble differs between the source and destination, it examines the nibble row-by-row, just as ``Chunk'' does.
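The nibble walk can be sketched as follows (illustrative Python, not the tool's code): each step plays the role of a LIMIT query that finds the next upper boundary, with the previous nibble's upper boundary serving as the lower bound.

```python
def nibbles(keys, nibble_size):
    """Walk a sorted key list in fixed-size nibbles. Each step mimics a
    query like 'SELECT key WHERE key > ? ORDER BY key LIMIT n': the
    previous upper boundary defines the lower bound, and the LIMIT
    finds the next upper boundary -- no cardinality estimate needed."""
    lower = None
    while True:
        remaining = [k for k in keys if lower is None or k > lower]
        if not remaining:
            return
        nibble = remaining[:nibble_size]
        yield (lower, nibble[-1])  # checksum the rows in (lower, upper]
        lower = nibble[-1]

print(list(nibbles(list(range(1, 11)), 4)))  # [(None, 4), (4, 8), (8, 10)]
```

Note how the last nibble simply shrinks when fewer than `nibble_size` rows remain, so uneven index distribution never breaks the walk.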

GroupBy
Selects the entire table grouped by all columns, with a COUNT(*) column added. Compares all columns, and if they're the same, compares the COUNT(*) column's value to determine how many rows to insert into or delete from the destination. Works on tables with no primary key or unique index.
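The same idea can be sketched as a multiset difference (illustrative Python; the real tool does the grouping inside MySQL):

```python
from collections import Counter

def groupby_diff(src_rows, dst_rows):
    """Compare two tables the way GroupBy does: group by all columns
    with a COUNT(*), then diff the counts to decide how many copies of
    each row to insert into or delete from the destination."""
    src, dst = Counter(src_rows), Counter(dst_rows)
    to_insert = dict(src - dst)  # rows missing (or short) on the destination
    to_delete = dict(dst - src)  # rows extra on the destination
    return to_insert, to_delete

# Works even with duplicate rows and no unique key.
src = [("a", 1), ("a", 1), ("b", 2)]
dst = [("a", 1), ("c", 3)]
print(groupby_diff(src, dst))
```

Because only row-value/count pairs are compared, no index is required at all, which is exactly why GroupBy is the fallback for keyless tables.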
Stream
Selects the entire table in one big stream and compares all columns. This is much less efficient than the other algorithms, but it works when there is no suitable index for them to use.
Future Plans
Possibilities for future algorithms are TempTable (what I originally called bottom-up in earlier versions of this tool), DrillDown (what I originally called top-down), and GroupByPrefix (similar to how SQLyog Job Agent works). Each algorithm has strengths and weaknesses. If you'd like to implement your favorite technique for finding differences between two sources of data on possibly different servers, I'm willing to help. The algorithms adhere to a simple interface that makes it pretty easy to write your own.

DOWNLOADING

You can download Maatkit from Google Code at <http://code.google.com/p/maatkit/>, or you can get any of the tools easily with a command like the following:
    wget http://www.maatkit.org/get/toolname
    or
    wget http://www.maatkit.org/trunk/toolname

Replace "toolname" with the name (or a fragment of the name) of any of the Maatkit tools. Once downloaded, they're ready to run; no installation is needed. The first URL gets the latest released version of the tool, and the second gets the latest trunk code from Subversion.

OPTIONS

Specify at least one of --print, --execute, or --test.

--where and --replicate are mutually exclusive.

--algorithm
short form: -a; type: string

The algorithm to use when comparing the tables.

This is a suggestion. The tool will auto-detect the best algorithm, and if your chosen algorithm can't be used, will use the best available one instead. See ``ALGORITHMS''.

--askpass
Prompt for a password when connecting to MySQL.
--bufferresults
Fetch all rows from MySQL before comparing.

This is disabled by default. If enabled, all rows will be fetched into memory for comparing. This may result in the results ``cursor'' being held open for a shorter time on the server, but if the tables are large, it could take a long time anyway, and eat all your memory. For most non-trivial data sizes, you want to leave this disabled.

--bufferinmysql
Instruct MySQL to buffer queries in its memory.

This option adds the "SQL_BUFFER_RESULT" option to the comparison queries. This causes MySQL to execute the queries and place them in a temporary table internally before sending the results back to mk-table-sync. The advantage of this strategy is that mk-table-sync can fetch rows as desired without using a lot of memory inside the Perl process, while releasing locks on the MySQL table (to reduce contention with other queries). The disadvantage is that it uses more memory on the MySQL server instead.

You probably want to enable ``--bufferresults'' too, because buffering into a temp table and then fetching it all into Perl's memory is probably a silly thing to do. This option is most useful for the GroupBy and Stream algorithms, which may fetch a lot of data from the server.

--chunksize
type: string; default: 1000

Number of rows or data size per chunk.

The size of each chunk of rows for the ``Chunk'' and ``Nibble'' algorithms. The size can be either a number of rows, or a data size. Data sizes are specified with a suffix of k=kibibytes, M=mebibytes, G=gibibytes. Data sizes are converted to a number of rows by dividing by the average row length.
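The suffix-to-rows conversion described above can be sketched as follows (illustrative Python; the exact parsing rules, such as case-sensitivity of the suffixes, are an assumption):

```python
def chunksize_to_rows(value, avg_row_len):
    """Convert a --chunksize value to a row count: plain numbers are
    rows; k/M/G suffixes are data sizes, divided by the average row
    length to yield rows, as the option description states."""
    factors = {"k": 1024, "M": 1024 ** 2, "G": 1024 ** 3}
    if value[-1] in factors:
        return max(1, int(value[:-1]) * factors[value[-1]] // avg_row_len)
    return int(value)

print(chunksize_to_rows("1000", 200))  # plain number: 1000 rows
print(chunksize_to_rows("1M", 512))    # 1 MiB / 512-byte rows = 2048 rows
```

So a table with short rows gets larger chunks for the same data-size setting, keeping per-chunk network and memory cost roughly constant.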

--columns
short form: -c; type: array

Compare this comma-separated list of columns.

--databases
short form: -d; type: hash

Sync only this comma-separated list of databases.

A common request is to sync tables from one database with tables from another database on the same or different server. This is not yet possible. --databases will not do it, and you can't do it with the D part of the DSN either, because in the absence of a table name it assumes the whole server should be synced and the D part controls only the connection's default database.

--engine
short form: -e; type: hash

Sync only this comma-separated list of storage engines.

--execute
short form: -x

Execute queries to make the tables have identical data.

--explainhosts
Print connection information and exit.

Print out a list of hosts to which mk-table-sync will connect, with all the various connection options, and exit.

--function
short form: -f; type: string

Which hash function you'd like to use for checksums.

The default is "CRC32". Other good choices include "MD5" and "SHA1". If you have installed the "FNV_64" user-defined function, "mk-table-sync" will detect it and prefer to use it, because it is much faster than the built-ins. You can also use MURMUR_HASH if you've installed that user-defined function. Both of these are distributed with Maatkit. See mk-table-checksum for more information and benchmarks.
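To illustrate why per-row checksums keep network traffic low, here is a Python sketch (the tool computes these hashes inside MySQL, not in the client, and the row-serialization format here is an assumption):

```python
import hashlib
import zlib

def row_digest(row, func="CRC32"):
    """Compute a per-row checksum like the ones the tool compares.
    CRC32 is the default; MD5 and SHA1 are alternatives."""
    data = ",".join(str(c) for c in row).encode()
    if func == "CRC32":
        return zlib.crc32(data)
    if func == "MD5":
        return hashlib.md5(data).hexdigest()
    return hashlib.sha1(data).hexdigest()

# Only (primary key, checksum) pairs cross the network; full rows are
# fetched solely for keys whose checksums disagree.
src = {1: ("alice", 30), 2: ("bob", 40)}
dst = {1: ("alice", 30), 2: ("bob", 41)}
differ = [pk for pk in src if row_digest(src[pk]) != row_digest(dst.get(pk, ()))]
print(differ)  # [2]
```

A faster hash (such as FNV_64) reduces CPU load on the server, since the hash runs once per row per comparison.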

--ignore-columns
type: Hash

Ignore this comma-separated list of column names in comparisons.

This option causes columns not to be compared. However, if a row is determined to differ between tables, all columns in that row will be synced, regardless. (It is not currently possible to exclude columns from the sync process itself, only from the comparison.)

--ignoredb
short form: -g; type: Hash

Ignore this comma-separated list of databases.

--ignoreengine
short form: -E; type: Hash; default: FEDERATED,MRG_MyISAM

Ignore this comma-separated list of storage engines.

--ignoretbl
short form: -n; type: Hash

Ignore this comma-separated list of tables.

Table names may be qualified with the database name.

--ignore-triggers
Write to tables on which triggers are defined.

Before syncing tables, "mk-table-sync" checks all destination tables on all destination hosts (the tables that will be made to match the source host's tables). If any destination table has triggers defined on it, "mk-table-sync" will die unless --ignore-triggers is given.

--lock
short form: -k; type: int

Lock tables: 0=none, 1=per sync cycle, 2=per table, or 3=globally.

This uses "LOCK TABLES". This can help prevent tables being changed while you're examining them. The possible values are as follows:

   VALUE  MEANING
   =====  =======================================================
   0      Never lock tables.
   1      Lock and unlock one time per sync cycle (as implemented
          by the syncing algorithm).  This is the most granular
          level of locking available.  For example, the Chunk
          algorithm will lock each chunk of C<N> rows, and then
          unlock them if they are the same on the source and the
          destination, before moving on to the next chunk.
   2      Lock and unlock before and after each table.
   3      Lock and unlock once for every server (DSN) synced, with
          C<FLUSH TABLES WITH READ LOCK>.

A replication slave is never locked if ``--replicate'' or ``--synctomaster'' is specified, since in theory locking the table on the master should prevent any changes from taking place. (You are not changing data on your slave, right?) If ``--wait'' is given, the master (source) is locked and then the tool waits for the slave to catch up to the master before continuing.

If --transaction is specified, "LOCK TABLES" is not used. Instead, lock and unlock are implemented by beginning and committing transactions. The exception is if --lock is 3.

If --notransaction is specified, then "LOCK TABLES" is used for any value of --lock. See ``--[no]transaction''.

--lock-and-rename
Lock the source and destination table, sync, then swap names. This is useful as a less-blocking ALTER TABLE, once the tables are reasonably in sync with each other (which you may choose to accomplish via any number of means, including dump and reload or even something like mk-archiver). It requires exactly two DSNs and assumes they are on the same server, so it does no waiting for replication or the like. Tables are locked with LOCK TABLES.
--nouseindex
Do not add USE INDEX hint to SQL statements for Chunk algorithm.

By default "mk-table-sync" adds a USE INDEX hint to each SQL statement for the Chunk algorithm to coerce MySQL into using the index for the column by which a table will be chunked. This option causes "mk-table-sync" to omit the USE INDEX hint.

The Nibble algorithm always uses a FORCE/USE INDEX hint. The GroupBy and Stream algorithms never use an index hint.

--print
short form: -p

Print queries that will resolve differences.

If you don't trust "mk-table-sync", or just want to see what it will do, this is a good way to be safe. These queries are valid SQL and you can run them yourself if you want to sync the tables manually.

--replace
short form: -r

Write all "INSERT" and "UPDATE" statements as "REPLACE".

This is automatically switched on as needed when there are unique index violations.

--replicate
short form: -R; type: string

Sync tables listed as different in this table.

Specifies that "mk-table-sync" should examine the specified table to find data that differs. The table is exactly the same as the argument of the same name to mk-table-checksum. That is, it contains records of which tables (and ranges of values) differ between the master and slave.

For each table and range of values that shows differences between the master and slave, "mk-table-sync" will sync that table, with the appropriate "WHERE" clause, to its master.

This automatically sets ``--wait'' to 60 and causes changes to be made on the master instead of the slave.

If ``--synctomaster'' is specified, the tool will assume the server you specified is the slave, and connect to the master as usual to sync.

Otherwise, it will try to use "SHOW PROCESSLIST" to find slaves of the server you specified. If it is unable to find any slaves via "SHOW PROCESSLIST", it will inspect "SHOW SLAVE HOSTS" instead. You must configure each slave's "report-host", "report-port" and other options for this to work right. After finding slaves, it will inspect the specified table on each slave to find data that needs to be synced, and sync it.

The tool examines the master's copy of the table first, assuming that the master is potentially a slave as well. Any table that shows differences there will NOT be synced on the slave(s). For example, suppose your replication is set up as A->B, B->C, B->D. Suppose you use this argument and specify server B. The tool will examine server B's copy of the table. If it looks like server B's data in table "test.tbl1" is different from server A's copy, the tool will not sync that table on servers C and D.

--setvars
type: string; default: wait_timeout=10000 /*!40001 ,QUERY_CACHE_TYPE=0 */

Set these MySQL variables.

Specify any variables you want to be set immediately after connecting to MySQL. These will be included in a "SET" command.

--skipbinlog
Do not log to the binary log ("SET SQL_LOG_BIN=0").
--skipforeignkey
short form: -K

Disable foreign key checks ("SET FOREIGN_KEY_CHECKS=0").

--skipslavecheck
Don't check whether the destination server is a slave.

If the destination server is a slave, it's generally unsafe to make changes on it. However, sometimes you have to; ``--replace'' won't work unless there's a unique index, for example, so you can't make changes on the master in that scenario. By default mk-table-sync will complain if you try to change data on a slave. This option will bypass that check. Use it at your own risk.

--skipuniquekey
Disable unique index checks ("SET UNIQUE_CHECKS=0").
--synctomaster
short form: -s

Treat the DSN as a slave and sync it to its master.

Treat the server you specified as a slave. Inspect "SHOW SLAVE STATUS", connect to the server's master, and treat the master as the source and the slave as the destination. Causes changes to be made on the master. Sets ``--wait'' to 60 by default, sets ``--lock'' to 1 by default, and disables ``--[no]transaction'' by default. See also ``--replicate'', which changes this option's behavior.

--tables
short form: -t; type: hash

Sync only this comma-separated list of tables.

Table names may be qualified with the database name.

--test
Analyze, decide the sync algorithm to use, print and exit.

Implies ``--verbose'' so you can see the results. The results are in the same output format that you'll see from actually running the tool, but there will be zeros for rows affected. This is because the tool actually executes, but stops before it compares any data and just returns zeros. The zeros do not mean there are no changes to be made.

--timeoutok
Keep going if ``--wait'' fails.

If you specify ``--wait'' and the slave doesn't catch up to the master's position before the wait times out, the default behavior is to abort. This option makes the tool keep going anyway. Warning: if you are trying to get a consistent comparison between the two servers, you probably don't want to keep going after a timeout.

--[no]transaction
Use transactions instead of "LOCK TABLES".

The granularity of beginning and committing transactions is controlled by ``--lock''. This is enabled by default, but since ``--lock'' is disabled by default, it has no effect.

Most options that enable locking also disable transactions by default, so if you want to use transactional locking (via "LOCK IN SHARE MODE" and "FOR UPDATE"), you must specify --transaction explicitly.

If you don't specify --transaction explicitly, "mk-table-sync" will decide on a per-table basis whether to use transactions or table locks. It currently uses transactions on InnoDB tables, and table locks on all others.

If --notransaction is specified, then "mk-table-sync" will not use transactions at all (not even for InnoDB tables) and locking is controlled by ``--lock''.

--trim
Apply "TRIM()" to "VARCHAR" columns in "BIT_XOR" and "ACCUM" modes. Helps when comparing MySQL 4.1 to MySQL 5.0 and later.

This is useful when you don't care about trailing-space differences between MySQL versions. MySQL 5.0 and later retain trailing spaces in "VARCHAR" columns, while earlier versions remove them.

--[no]utf8
default: yes

Enable UTF-8 options in Perl and MySQL.

This option is deprecated. Pass the "A" option in a DSN instead. For backwards compatibility, this option adds "A=utf8" to all DSNs.

Enables character set settings in Perl and MySQL. If the value is "utf8", sets Perl's binmode on STDOUT to utf8, passes the "mysql_enable_utf8" option to DBD::mysql, and runs "SET NAMES UTF8" after connecting to MySQL. Any other value sets binmode on STDOUT without the utf8 layer, and runs "SET NAMES" after connecting to MySQL.

--verbose
short form: -v

Print results of sync operations.

See ``OUTPUT'' for more details about the output.

--wait
short form: -w; type: time

How long to wait for slaves to catch up to their master.

Make the master wait for the slave to catch up in replication before comparing the tables. The value is the number of seconds to wait before timing out (see also ``--timeoutok''). Sets ``--lock'' to 1 and ``--[no]transaction'' to 0 by default. If you see an error such as the following:

   MASTER_POS_WAIT returned -1

This means the timeout was exceeded; increase the value of ``--wait''.

The default value of this option is influenced by other options. To see what value is in effect, run with --help.

--where
short form: -W; type: string

"WHERE" clause to restrict syncing to part of the table.

EXIT STATUS

Exit status is as follows:
    STATUS  MEANING
    ======  =======================================================
    0       Success.
    1       Internal error.
    2       At least one table differed on the destination.
    3       Combination of 1 and 2.

OUTPUT

If you specify the ``--verbose'' option, you'll see information about the differences between the tables. There is one row per table. Each server is printed separately. For example,
    # Syncing D=test,t=test2
    # DELETE REPLACE INSERT UPDATE ALGORITHM DATABASE.TABLE
    #      0       0      2      0 Chunk     test.test1

This table required 2 "INSERT" statements to synchronize.

There are cases where no combination of "INSERT", "UPDATE" or "DELETE" statements can resolve differences without violating some unique key. For example, suppose there's a primary key on column a and a unique key on column b. Then there is no way to sync these two tables with straightforward UPDATE statements:

  +---+---+  +---+---+
  | a | b |  | a | b |
  +---+---+  +---+---+
  | 1 | 2 |  | 1 | 1 |
  | 2 | 1 |  | 2 | 2 |
  +---+---+  +---+---+

The tool rewrites queries to "DELETE" and "REPLACE" in this case. This is automatically handled after the first index violation, so you don't have to worry about it.
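The rewrite can be illustrated with a small simulation of REPLACE semantics (illustrative Python; `replace_into` is a hypothetical helper mimicking MySQL's delete-then-insert behavior on key collisions):

```python
def replace_into(table, row):
    """Mimic MySQL REPLACE on a table with PRIMARY KEY (a) and a
    UNIQUE key on b: delete any row colliding on either key, then
    insert the new row."""
    a, b = row
    for k in [k for k, v in table.items() if k == a or v == b]:
        del table[k]
    table[a] = b

src = {1: 2, 2: 1}  # a -> b on the source
dst = {1: 1, 2: 2}  # same primary keys, but the unique b values are swapped

# Either single-row UPDATE would collide on the unique key b, so the
# sync is rewritten as REPLACE statements instead.
for a, b in src.items():
    replace_into(dst, (a, b))
print(dst == src)  # True
```

The first REPLACE clears both colliding rows, and the second restores the one that was displaced, which is why the statement order never deadlocks on the unique key.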

ENVIRONMENT

The environment variable MKDEBUG enables verbose debugging output in all of the Maatkit tools:
    MKDEBUG=1 mk-....

HISTORY AND ACKNOWLEDGEMENTS

My work is based in part on Giuseppe Maxia's work on distributed databases, <http://www.sysadminmag.com/articles/2004/0408/> and code derived from that article. There is more explanation, and a link to the code, at <http://www.perlmonks.org/?node_id=381053>.

Another programmer extended Maxia's work even further. Fabien Coelho changed and generalized Maxia's technique, introducing symmetry and avoiding some problems that might have caused too-frequent checksum collisions. This work grew into pg_comparator, <http://www.coelho.net/pg_comparator/>. Coelho also explained the technique further in a paper titled ``Remote Comparison of Database Tables'' (<http://cri.ensmp.fr/classement/doc/A-375.pdf>).

This existing literature mostly addressed how to find the differences between the tables, not how to resolve them once found. I needed a tool that would not only find them efficiently, but would then resolve them. I first began thinking about how to improve the technique further with my article <http://www.xaprb.com/blog/2007/03/05/an-algorithm-to-find-and-resolve-data-differences-between-mysql-tables/>, where I discussed a number of problems with the Maxia/Coelho ``bottom-up'' algorithm. After writing that article, I began to write this tool. I wanted to actually implement their algorithm with some improvements so I was sure I understood it completely. I discovered it is not what I thought it was, and is considerably more complex than it appeared to me at first. Fabien Coelho was kind enough to address some questions over email.

The first versions of this tool implemented a version of the Coelho/Maxia algorithm, which I called ``bottom-up'', and my own, which I called ``top-down.'' Those algorithms are considerably more complex than the current algorithms and I have removed them from this tool, and may add them back later. The improvements to the bottom-up algorithm are my original work, as is the top-down algorithm. The techniques to actually resolve the differences are also my own work.

Another tool that can synchronize tables is the SQLyog Job Agent from Webyog. Thanks to Rohit Nadhani, SJA's author, for the conversations about the general techniques. There is a comparison of mk-table-sync and SJA at <http://www.xaprb.com/blog/2007/04/05/mysql-table-sync-vs-sqlyog-job-agent/>.

Thanks to the following people and organizations for helping in many ways:

The Rimm-Kaufman Group <http://www.rimmkaufman.com/>, MySQL AB <http://www.mysql.com/>, Blue Ridge InternetWorks <http://www.briworks.com/>, Percona <http://www.percona.com/>, Fabien Coelho, Giuseppe Maxia and others at MySQL AB, Kristian Koehntopp (MySQL AB), Rohit Nadhani (Webyog), the helpful monks at Perlmonks, and others too numerous to mention.

SYSTEM REQUIREMENTS

You need Perl, DBI, DBD::mysql, and some core packages that ought to be installed in any reasonably new version of Perl.

BUGS

Please use Google Code Issues and Groups to report bugs or request support: <http://code.google.com/p/maatkit/>.

Please include the complete command-line used to reproduce the problem you are seeing, the version of all MySQL servers involved, the complete output of the tool when run with ``--version'', and if possible, debugging output produced by running with the "MKDEBUG=1" environment variable.

COPYRIGHT, LICENSE AND WARRANTY

This program is copyright 2007-2009 Baron Schwartz. Feedback and improvements are welcome.

THIS PROGRAM IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, version 2; OR the Perl Artistic License. On UNIX and similar systems, you can issue `man perlgpl' or `man perlartistic' to read these licenses.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA.

AUTHOR

Baron Schwartz.

VERSION

This manual page documents Ver 1.0.15 Distrib 3519 $Revision: 3515 $.