mk-parallel-dump.1p

Language: en

Version: 2009-05-04 (debian - 07/07/09)

Section: 1 (User commands)

NAME

mk-parallel-dump - Dump sets of MySQL tables in parallel.

SYNOPSIS

   mk-parallel-dump
   mk-parallel-dump --tab --basedir /path/to/backups/
   mk-parallel-dump --sets order,profile,session --settable meta.backupset
 
 

Do not rely on mk-parallel-dump for your backups unless you have tested it. You have been warned.

DESCRIPTION

mk-parallel-dump connects to a MySQL server, finds database and table names, and dumps them in parallel for speed. It can be used in several pre-packaged ways, or as a generic wrapper to call some program in parallel, passing it parameters for each table. It supports backup sets and dumping only tables that have changed since the last dump.

To dump all tables to gzipped files in the current directory, each database with its own directory, with a global read lock, flushing and recording binary log positions, each table in a single file:

   mk-parallel-dump
 
 

To dump tables elsewhere:

   mk-parallel-dump --basedir /path/to/elsewhere
 
 

To dump to tab-separated files with "SELECT INTO OUTFILE", each table with separate data and SQL files:

   mk-parallel-dump --tab
 
 

To dump one or more backup sets (see ``BACKUP SETS''):

   mk-parallel-dump --sets set1,set2,set3 --settable meta.backupset
 
 

To ``write your own command line,'' use "--" to indicate where the arguments for mk-parallel-dump stop and where the arguments for "mysqldump" (or any other program) begin. The following example shows "mysqldump", and aside from simpler options to "mysqldump", is basically what happens when you specify no arguments at all:

   mk-parallel-dump -- mysqldump --skip-lock-tables '%D' '%N' \
      \| gzip --fast -c - \> '%D.%N.gz'
 
 

The "%" modifiers are macros (see ``MACROS''). The "--skip-lock-tables" argument is very important in that last example, because otherwise both mk-parallel-dump and "mysqldump" will lock tables, so "mysqldump" will hang, waiting for the locks. Notice the shell metacharacters "|" and ">" are escaped so the shell won't interpret them, and they'll get passed through to the generated command-line.

There's no reason you can't use mk-parallel-dump to do other tasks in parallel, such as "OPTIMIZE TABLE":

   mk-parallel-dump --noflushlock -- mysqlcheck --optimize '%D' '%N'
 
 

When you use built-in defaults, mk-parallel-dump passes these arguments on to every forked copy of "mysqldump": ``--defaults-file'', ``--host'', ``--port'', ``--socket'', ``--user'', ``--password''. If you write your own command line, you must specify them yourself.
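
For example, here is a sketch of a hand-written command line that repeats the connection options for the forked processes (the host and user names are placeholders):

   mk-parallel-dump --host db1 --user backup -- \
      mysqldump --host db1 --user backup --skip-lock-tables '%D' '%N' \
      \| gzip --fast -c - \> '%D.%N.gz'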

If you specify the ``--tab'' option, mk-parallel-dump creates separate files that hold views and triggers, so they can be restored correctly (this is not currently possible with the "mysqldump" from MySQL AB, which will restore triggers before restoring data). Otherwise it does not back up your entire database; it dumps tables and data only. It does not dump view definitions or stored routines. However, if you dump the "mysql" database, you'll be dumping the stored routines anyway.

Exit status is 0 if everything went well, 1 if any chunks failed, and any other value indicates an internal error.

mk-parallel-dump doesn't clean out any destination directories before dumping into them. You can move away the old destination, then remove it after a successful dump, with a shell script like the following:

    #!/bin/sh
    # Count existing old dump sets and move the current one aside
    CNT=`ls | grep -c old`
    if [ -d default ]; then mv default default.old.$CNT; fi
    mk-parallel-dump
    if [ $? != 0 ]
    then
       echo "There were errors, not purging old sets."
    else
       echo "No errors during dump, purging old sets."
       rm -rf default.old.*
    fi
 
 

mk-parallel-dump checks whether a file has already been created before dumping, and if so, it skips the table or chunk that would have created the file. This makes dumps resumable. If you don't want this behavior and instead want a full dump, move the old files away.
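
For example, to force a complete fresh dump instead of resuming, set the old files aside first (a sketch, assuming the default "default" set directory):

   mv default default.old.$(date +%Y%m%d)
   mk-parallel-dump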

BACKUP SETS

Backup sets are groups of logically related tables you want to back up together. You specify a set by inserting the table names into a table on the MySQL server from which you're dumping, then naming the set in the ``--sets'' option. mk-parallel-dump always works a set at a time; if you don't specify a set, it auto-discovers tables, filters them with the various command-line options (``--databases'', etc.) and treats them as the default set.

The table that stores backup sets should have at least these columns: setname, priority, db, tbl. The following is a suggested table structure:

   CREATE TABLE backupset (
     setname  CHAR(10)  NOT NULL,
     priority INT       NOT NULL DEFAULT 0,
     db       CHAR(64)  NOT NULL,
     tbl      CHAR(64)  NOT NULL,
     ts       TIMESTAMP NOT NULL,
     PRIMARY KEY(setname, db, tbl),
     KEY(setname, priority, db, tbl)
   );
 
 

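For example, here is a sketch of defining a two-table set named "order" and dumping it; the database and table names are placeholders:

   mysql -e "INSERT INTO meta.backupset (setname, priority, db, tbl)
      VALUES ('order', 0, 'shop', 'orders'),
             ('order', 1, 'shop', 'order_lines')"
   mk-parallel-dump --sets order --settable meta.backupset
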
Entries are ordered by priority, db, and tbl. Priority 0 tables are dumped first, not last. If it looks like tables are dumped in the wrong order, it's probably because they're being dumped asynchronously. The output is printed when the dump finishes, not when it starts.

If you specify ``--age'', mk-parallel-dump expects the "ts" column to exist, and will update the column to the current date and time when it successfully dumps a table.

Don't use "default" as a set name. It is used when you don't specify any sets and when you want all tables not explicitly assigned to a set to be dumped (see ``--[no]defaultset'').

Set names may contain only lowercase letters, numbers, and underscores.

CHUNKS

mk-parallel-dump can break your tables into chunks when dumping, putting approximately the amount of data you specify into each chunk. This avoids enormous files, which not only take a long time to restore but can also mean a lot of extra work for transactional storage engines like InnoDB: a huge file can create a huge rollback segment in your tablespace.

To dump in chunks, specify the ``--chunksize'' option. This option is an integer with an optional suffix. Without the suffix, it's the number of rows you want in each chunk. With the suffix, it's the approximate size of the data.

mk-parallel-dump tries to use index statistics to calculate where the boundaries between chunks should be. If the values are not evenly distributed, some chunks can have a lot of rows, and others may have very few or even none. Some chunks can exceed the size you want.

When you specify the size with a suffix, the allowed suffixes are k, M and G, for kibibytes, mebibytes, and gibibytes, respectively. mk-parallel-dump doesn't know anything about data size. It asks MySQL (via "SHOW TABLE STATUS") how long an average row is in the table, and converts your option to a number of rows.
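
For example, a sketch of the conversion (the database, table, and average row length are hypothetical):

   mysql -e "SHOW TABLE STATUS IN shop LIKE 'orders'\G" | grep Avg_row_length
   # Avg_row_length: 512
   # --chunksize 64M  =>  67108864 bytes / 512 bytes per row = 131072 rows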

Not all tables can be broken into chunks. mk-parallel-dump looks for an index whose leading column is numeric (integers, real numbers, and date and time types). It prefers the primary key if its first column is chunk-able. Otherwise it chooses the first chunk-able column in the table.
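
As an illustration, the chunks end up as ranges on the chosen column, conceptually like the following (a sketch; the column name and boundary values are hypothetical):

   # chunk 0:  `id` < 131072
   # chunk 1:  `id` >= 131072 AND `id` < 262144
   # chunk 2:  `id` >= 262144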

Generating a series of "WHERE" clauses to divide a table into evenly-sized chunks is difficult. If you have any ideas on how to improve the algorithm, please write to the author (see ``BUGS'').

MACROS

mk-parallel-dump can insert "%" variables into arguments. The available macros are as follows:
   MACRO  MEANING
   =====  =================
   %S     The backup set
   %D     The database name
   %N     The table name
   %C     The chunk number
   %W     The WHERE clause
 
 

You can place a number between the "%" and the letter; the macro's replacement value is then padded to that many characters with leading zeroes (in practice, this is only useful for %C).
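
For example, here is a sketch of a custom command line that writes each chunk to its own file, using %W for the chunk's WHERE clause and %6C for a six-digit, zero-padded chunk number (the file-naming scheme is just one possibility):

   mk-parallel-dump --chunksize 500000 -- mysqldump --skip-lock-tables \
      --where '%W' '%D' '%N' \| gzip --fast -c - \> '%D.%N.%6C.sql.gz'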

OUTPUT

Output depends on verbosity. When ``--test'' is given, output includes commands that would be executed.

When ``--verbose'' is 0, there is normally no output unless there's an error.

When ``--verbose'' is 1, there is one line of output for each backup set, showing the set, how many tables and chunks were dumped with what status, how much time elapsed, and how much time the parallel dump jobs added up to. A final line shows sums for all sets, unless there is only one set.

When ``--verbose'' is 2, there is also one line of output for each table. Each line is printed when a forked ``child'' process ends and is removed from the list of children. The output shows the backup set, database, table, seconds spent dumping, the exit status of the forked dump process, and number of current processes (including the one just reaped; so this typically shows ``how many are running in parallel''). A status of 0 indicates success:

   SET     DATABASE TABLE         TIME STATUS THREADS
   default mysql    db               0      0       4
   default mysql    columns_priv     0      0       4
   default mysql    help_category    0      0       3
 
 

SPEED OF PARALLEL DUMPS

How much faster is it to dump in parallel? That depends on your hardware and data. You may be able to dump files twice as fast, or faster if you have lots of disks and CPUs. Here are some user-contributed figures.

The following table is for a 3.6GHz Xeon machine with 4 processors and a RAID-10 array of 15k disks, directly attached to the server via Fibre Channel. Most of the space is in one huge table that wasn't dumped in parallel:

   COMMAND                      SIZE  TIME
   --------------------------  -----  ----
   mk-parallel-dump            1.4GB   269
   mysqldump                   1.4GB   345
 
 

On the same machine, in a database with lots of roughly equal-sized tables:

   COMMAND                      SIZE  TIME
   --------------------------  -----  ----
   mk-parallel-dump            117MB     7
   mysqldump                   117MB    37
 
 

It doesn't always work that well. A dual 2.80GHz Xeon server with a RAID-5 array of three 7200RPM SATA disk drives running MySQL 5.0.38 on GNU/Linux achieved the following dump times:

   COMMAND                      SIZE  TIME
   --------------------------  -----  ----
   mk-parallel-dump            3.0GB  2596
   mysqldump | gzip --fast     3.0GB  3195
 
 

While dumping with two threads in parallel, this machine averaged 74% CPU utilization and 12% I/O wait. It doesn't have enough disks and CPUs to do that many things at once, so it's not going to speed up much.

Dumping lots of tiny tables by forking off lots of "mysqldump" processes isn't usually much faster, because of the overhead of starting "mysqldump", connecting, inspecting the table, and dumping it. Note that tab-separated dumps are typically much faster and don't suffer as much from the effects of many tiny tables, because they're not done via "mysqldump".

See also <http://www.paragon-cs.com/wordpress/?p=52> for a test of parallel dumping and restoring.

DOWNLOADING

You can download Maatkit from Google Code at <http://code.google.com/p/maatkit/>, or you can get any of the tools easily with a command like the following:
    wget http://www.maatkit.org/get/toolname
    or
    wget http://www.maatkit.org/trunk/toolname
 
 

Where "toolname" can be replaced with the name (or fragment of a name) of any of the Maatkit tools. Once downloaded, they're ready to run; no installation is needed. The first URL gets the latest released version of the tool, and the second gets the latest trunk code from Subversion.

OPTIONS

``--locktables'' and ``--[no]flushlock'' are mutually exclusive.

``--sets'' and ``--setperdb'' are mutually exclusive.

``--losslessfp'' requires ``--tab''.

--age
type: time

Dump only tables modified since this long ago, or not dumped since this long ago.

Specifies the age threshold mk-parallel-dump uses when deciding whether to dump a table.

When ``--sets'' is not specified, mk-parallel-dump uses "SHOW TABLE STATUS" instead of "SHOW TABLES" to get the list of tables in each database, and compares the interval to the "Update_time" column in the output. If "Update_time" is not "NULL" and is older than the specified interval, the table is not dumped. Thus, it means ``dump tables that have changed within the last X amount of time'' (presumably since the last regular backup). A table whose storage engine doesn't report "Update_time", such as InnoDB, is always dumped.

When ``--sets'' is specified, the ``--settable'' table determines when a table was last dumped, and the meaning of "--age" reverses; it becomes ``dump tables not dumped in X amount of time.''
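
For example, a sketch of both behaviors (the set name is a placeholder, and the "1d" suffix syntax is an assumption worth checking against your version):

   # Without --sets: dump only tables whose Update_time is within the last day
   mk-parallel-dump --age 1d
   # With --sets: dump set members not dumped within the last day
   mk-parallel-dump --sets nightly --settable meta.backupset --age 1d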

--askpass
Prompt for a password when connecting to MySQL.
--basedir
type: string

Base directory for creating files.

The directory in which files will be stored. If you use pre-canned options, such as ``--tab'', mk-parallel-dump knows what the eventual filenames will be, and can place all the files in this directory. It will also create any parent directories that don't exist (see also ``--umask'').

The default is the current working directory.

If you write your own command line, mk-parallel-dump cannot know which arguments in the command line are filenames, and thus doesn't know the eventual destination of the dump files. It does not try to create parent directories in this case.
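
For example, a dated destination directory (a sketch; the path is a placeholder):

   mk-parallel-dump --tab --basedir /backups/mysql/$(date +%Y-%m-%d)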

--[no]biggestfirst
default: yes

Process tables largest first.

This strategy gives better parallelization. Suppose there are 8 threads and the last table is huge. We will finish everything else and then be running single-threaded while that one finishes. If that one runs first, then we will have the max number of threads running at a time for as long as possible.

Ignored when ``--settable'' is given.

--[no]binlogpos
short form: -b; default: yes

Dump the master/slave position.

Dump binary log positions from both "SHOW MASTER STATUS" and "SHOW SLAVE STATUS", whichever can be retrieved from the server. The data is dumped to a file named 00_master_data.sql. This is done for each backup set.

The file also contains details of each table dumped, including the WHERE clauses used to dump it in chunks.

--charset
short form: -A; type: string

Default character set.

Enables character set settings in Perl and MySQL. If the value is "utf8", sets Perl's binmode on STDOUT to utf8, passes the "mysql_enable_utf8" option to DBD::mysql, and runs "SET NAMES UTF8" after connecting to MySQL. Any other value sets binmode on STDOUT without the utf8 layer, and runs "SET NAMES" after connecting to MySQL.

--chunksize
short form: -C; type: string

Number of rows or data size to dump per file.

Specifies that the table should be dumped in segments of approximately the size given. The syntax is either a plain integer, which is interpreted as a number of rows per chunk, or an integer with a suffix of G, M, or k, which is interpreted as the size of the data to be dumped in each chunk. See ``CHUNKS'' for more details.

--csv
Do ``--tab'' dump in CSV format (implies --tab).

Changes ``--tab'' options so the dump file is in comma-separated values (CSV) format. The SELECT INTO OUTFILE statement looks like the following, and can be re-loaded with the same options:

    SELECT * INTO OUTFILE %D.%N.%6C.txt
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    LINES TERMINATED BY '\n' FROM %D.%N;
 
 
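As a sketch, the matching reload looks like the following; the path, database, and table are placeholders, and the file name assumes the %D.%N.%6C.txt pattern shown above:

    mysql -e "LOAD DATA INFILE '/backups/shop.orders.000000.txt'
       INTO TABLE shop.orders
       FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
       LINES TERMINATED BY '\n'"
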
--databases
short form: -d; type: hash

Dump only this comma-separated list of databases.

--dbregex
type: string

Dump only databases whose names match this Perl regex.

--defaults-file
short form: -F; type: string

Only read mysql options from the given file. You must give an absolute pathname.

--[no]defaultset
When ``--sets'' is given, dump tables not in any set.

The result will be a "default" set consisting of tables not explicitly included in any set.

--[no]flushlock
short form: -k

Use "FLUSH TABLES WITH READ LOCK".

This is enabled by default, unless you're dumping sets (see ``--sets''). This lock is taken once, at the beginning of the whole process, and is never released. If you want to lock only the tables you're dumping, use ``--locktables''.

--flushlog
Execute "FLUSH LOGS" when getting binlog positions.

This is done for each backup set. This option is NOT enabled by default because it causes the MySQL server to rotate its error log, potentially overwriting error messages.

--[no]gzip
Compress files with gzip.

This is enabled by default unless your platform is Win32. By default, standard SQL dumps are piped to gzip's "STDIN" and the result is redirected to the destination file. If this option is disabled, "mysqldump"'s "--result-file" parameter is used instead to direct the dump to the destination file. When using ``--tab'', this option causes gzip to be called separately on each resulting file after it is dumped (because "SELECT INTO OUTFILE" cannot be directed to a pipe).

--host
short form: -h; type: string

Connect to host.

--ignoredb
short form: -g; type: Hash

Ignore this comma-separated list of databases.

--ignoreengine
short form: -E; type: Hash; default: FEDERATED,MRG_MyISAM

Dump no data for this comma-separated list of storage engines.

The schema file will be dumped as usual. This prevents dumping data for Federated tables and Merge tables.

--ignoretbl
short form: -n; type: Hash

Ignore this comma-separated list of table names.

Table names may be qualified with the database name.

--locktables
Use "LOCK TABLES" (disables ``--[no]flushlock'').

Disables ``--[no]flushlock'' (unless it was explicitly set) and locks tables with "LOCK TABLES READ". Enabled by default when ``--sets'' is specified. The lock is taken and released with every set of tables dumped.

--losslessfp
short form: -L

Dump float types with extra precision for lossless restore (requires ``--tab'').

Wraps these types with a call to "FORMAT()" with 17 digits of precision. According to the comments in Google's patches, this will give lossless dumping and reloading in most cases. (I shamelessly stole this technique from them. I don't know enough about floating-point math to have an opinion).

This works only with ``--tab''.

--numthread
short form: -m; type: int

Number of threads to dump concurrently.

Specifies the number of parallel processes to run. The base default is 2 (this is mk-parallel-dump, after all; 1 is not parallel). On GNU/Linux machines, the default is the number of times 'processor' appears in /proc/cpuinfo; on Windows, it is read from the environment. In any case, the default is at least 2, even when there's only a single processor.

--password
short form: -p; type: string

Password to use when connecting.

--port
short form: -P; type: int

Port number to use for connection.

--progress
Display progress messages.

Progress is displayed each time a table or chunk of a table finishes dumping. Progress is calculated by measuring the data size of each table, and assuming all bytes are created equal. The output is the completed and total bytes, the percent completed, estimated time remaining, and estimated completion time.

--quiet
short form: -q

Quiet output; disables ``--verbose''.

--setperdb
Dump each database as a separate set.

Each set is named the same as the database. Implies ``--locktables''.

--sets
type: array

Dump this comma-separated list of backup sets.

Requires ``--settable''. See ``BACKUP SETS''. The special "default" set is reserved; don't use it as a set name.

--settable
type: string

The database.table in which backup sets are kept.

As mentioned earlier, you can manually specify the ordering of tables in a backup set. Therefore ``--[no]biggestfirst'' has no effect when this option is given.

--setvars
type: string; default: wait_timeout=10000

Set these MySQL variables.

Specify any variables you want to be set immediately after connecting to MySQL. These will be included in a "SET" command.
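
For example (a sketch; whether multiple comma-separated variables are accepted is an assumption to verify against your version):

   mk-parallel-dump --setvars wait_timeout=10000,sort_buffer_size=1048576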

--socket
short form: -S; type: string

Socket file to use for connection.

--stopslave
Issue "STOP SLAVE" on server before dumping data.

This ensures that the data is not changing during the dump. Issues "START SLAVE" after the dump is complete.

If the slave is not running, throws an error and exits. This is to prevent possibly bad things from happening if the slave is not running because of a problem, or because someone intentionally stopped the slave for maintenance or some other purpose.

--tab
short form: -T

Dump tab-separated (sets ``--umask'' 0).

Dump via "SELECT INTO OUTFILE", which is similar to what "mysqldump" does with the "--tab" option, but you're not constrained to a single database at a time.

Before you use this option, make sure you know what "SELECT INTO OUTFILE" does! I recommend using it only if you're running mk-parallel-dump on the same machine as the MySQL server, but there is no protection if you don't.

The files will be gzipped after dumping if ``--[no]gzip'' is enabled. This option sets ``--umask'' to zero so auto-created directories are writable by the MySQL server.

Triggers are dumped into ".trg" files, and views are postponed until the end of the dump, then dumped all together into the "00_views.sql" file. This ordering lets data be restored before triggers, which is important for restoring it accurately, and it lets views be restored together at the end, since interdependencies between views and tables may otherwise prevent correct restoration.

--tables
short form: -t; type: hash

Dump only this comma-separated list of table names.

Table names may be qualified with the database name.

--tblregex
type: string

Dump only tables whose names match this Perl regex.

--test
Print commands instead of executing them.
--umask
type: string

Set the program's "umask" to this octal value.

This is useful when you want created files and directories to be readable or writable by other users (for example, the MySQL server itself).

--user
short form: -u; type: string

User for login if not current user.

--verbose
short form: -v; cumulative: yes; default: 1

Be verbose; can specify multiple times.

See ``OUTPUT''.

--wait
short form: -w; type: time; default: 5m

Wait limit when the server is down.

If the MySQL server crashes during dumping, waits until the server comes back and then continues with the rest of the tables. "mk-parallel-dump" will check the server every second until this time is exhausted, at which point it will give up and exit.

This implements Peter Zaitsev's ``safe dump'' request: sometimes a dump on a server that has corrupt data will kill the server. mk-parallel-dump will wait for the server to restart, then keep going. It's hard to say which table killed the server, so tables that were being dumped when the crash happened are not retried. No additional locks are taken after the server restarts; it's assumed this behavior is useful only on a server you're not trying to dump while it's in production.

ENVIRONMENT

The environment variable "MKDEBUG" enables verbose debugging output in all of the Maatkit tools:
    MKDEBUG=1 mk-....
 
 

SYSTEM REQUIREMENTS

You need Perl, DBI, DBD::mysql, and some core packages that ought to be installed in any reasonably new version of Perl.

This program works best on GNU/Linux. Filename quoting might not work well on Microsoft Windows if you have spaces or funny characters in your database or table names.

BUGS

Please use Google Code Issues and Groups to report bugs or request support: <http://code.google.com/p/maatkit/>.

Please include the complete command-line used to reproduce the problem you are seeing, the version of all MySQL servers involved, the complete output of the tool when run with ``--version'', and if possible, debugging output produced by running with the "MKDEBUG=1" environment variable.

COPYRIGHT, LICENSE AND WARRANTY

This program is copyright 2007-2009 Baron Schwartz. Feedback and improvements are welcome.

THIS PROGRAM IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, version 2; OR the Perl Artistic License. On UNIX and similar systems, you can issue `man perlgpl' or `man perlartistic' to read these licenses.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA.

AUTHOR

Baron Schwartz.

SEE ALSO

See also mk-parallel-restore.

VERSION

This manual page documents Ver 1.0.15 Distrib 3519 $Revision: 3502 $.