
Doc - Fast Cluster Restore Procedure (#7756)

This commit is contained in:
sleto-it 2019-02-27 16:29:51 +01:00 committed by GitHub
parent ef03234331
commit 37c5c1239d
6 changed files with 395 additions and 65 deletions


@ -5,6 +5,12 @@ Backup and restore can be done via the tools
[_arangodump_](../Programs/Arangodump/README.md) and
[_arangorestore_](../Programs/Arangorestore/README.md).
{% hint 'tip' %}
In order to speed up the _arangorestore_ performance in a Cluster environment,
the [Fast Cluster Restore](../Programs/Arangorestore/FastClusterRestore.md)
procedure is recommended.
{% endhint %}
Performing frequent backups is important and a recommended best practice that
allows you to recover your data in case unexpected problems occur.
Hardware failures, system crashes, or users mistakenly deleting data can always


@ -18,12 +18,12 @@ _arangodump_ will by default connect to the *_system* database using the default
endpoint. If you want to connect to a different database or a different endpoint,
or use authentication, you can use the following command-line options:
- `--server.database <string>`: name of the database to connect to
- `--server.endpoint <string>`: endpoint to connect to
- `--server.username <string>`: username
- `--server.password <string>`: password to use (omit this and you'll be prompted for the
password)
- `--server.authentication <bool>`: whether or not to use authentication
Here's an example of dumping data from a non-standard endpoint, using a dedicated
[database name](../../Appendix/Glossary.md#database-name):
@ -39,9 +39,9 @@ By default, _arangodump_ will dump both structural information and documents fro
non-system collections. To adjust this, there are the following command-line
arguments:
- `--dump-data <bool>`: set to *true* to include documents in the dump. Set to *false*
to exclude documents. The default value is *true*.
- `--include-system-collections <bool>`: whether or not to include system collections
in the dump. The default value is *false*. **Set to _true_ if you are using named
graphs that you are interested in restoring.**
@ -69,9 +69,10 @@ Cluster Backup
--------------
Starting with Version 2.1 of ArangoDB, the *arangodump* tool also
supports sharding and can be used to back up data from a Cluster.
Simply point it to one of the _Coordinators_ and it
will behave exactly as described above, working on sharded collections
in the Cluster.
Please see the [Limitations](Limitations.md).


@ -18,32 +18,32 @@ _arangorestore_ can be invoked from the command-line as follows:
arangorestore --input-directory "dump"
This will connect to an ArangoDB server and reload structural information and
documents found in the input directory *dump*. Please note that the input directory
must have been created by running *arangodump* before.
_arangorestore_ will by default connect to the *_system* database using the default
endpoint. If you want to connect to a different database or a different endpoint,
or use authentication, you can use the following command-line options:
- `--server.database <string>`: name of the database to connect to
- `--server.endpoint <string>`: endpoint to connect to
- `--server.username <string>`: username
- `--server.password <string>`: password to use (omit this and you'll be prompted for the
password)
- `--server.authentication <bool>`: whether or not to use authentication
Since version 2.6 _arangorestore_ provides the option *--create-database*. Setting this
option to *true* will create the target database if it does not exist. When creating the
target database, the username and password passed to _arangorestore_ (in options
*--server.username* and *--server.password*) will be used to create an initial user for the
new database.
The option `--force-same-database` allows restricting arangorestore operations to a
database with the same name as in the source dump's "dump.json" file. It can thus be used
to prevent restoring data into a "wrong" database by accident.
For example, if a dump was taken from database `a`, and the restore is attempted into
database `b`, then with the `--force-same-database` option set to `true`, arangorestore
will abort instantly.
@ -55,7 +55,7 @@ Here's an example of reloading data to a non-standard endpoint, using a dedicate
arangorestore --server.endpoint tcp://192.168.173.13:8531 --server.username backup --server.database mydb --input-directory "dump"
To create the target database when restoring, use a command like this:
arangorestore --server.username backup --server.database newdb --create-database true --input-directory "dump"
_arangorestore_ will print out its progress while running, and will end with a line
@ -64,25 +64,25 @@ showing some aggregate statistics:
Processed 2 collection(s), read 2256 byte(s) from datafiles, sent 2 batch(es)
By default, _arangorestore_ will re-create all non-system collections found in the input
directory and load data into them. If the target database already contains collections
which are also present in the input directory, the existing collections in the database
will be dropped and re-created with the data found in the input directory.
The following parameters are available to adjust this behavior:
- `--create-collection <bool>`: set to *true* to create collections in the target
database. If the target database already contains a collection with the same name,
it will be dropped first and then re-created with the properties found in the input
directory. Set to *false* to keep existing collections in the target database. If
set to *false* and _arangorestore_ encounters a collection that is present in the
input directory but not in the target database, it will abort. The default value is *true*.
- `--import-data <bool>`: set to *true* to load document data into the collections in
the target database. Set to *false* to not load any document data. The default value
is *true*.
- `--include-system-collections <bool>`: whether or not to include system collections
when re-creating collections or reloading data. The default value is *false*.
For example, to (re-)create all non-system collections and load document data into them, use:
arangorestore --create-collection true --import-data true --input-directory "dump"
@ -91,7 +91,7 @@ This will drop potentially existing collections in the target database that are
in the input directory.
To include system collections too, use *--include-system-collections true*:
arangorestore --create-collection true --import-data true --include-system-collections true --input-directory "dump"
To (re-)create all non-system collections without loading document data, use:
@ -107,20 +107,29 @@ To just load document data into all non-system collections, use:
To restrict reloading to just specific collections, there is the *--collection* option.
It can be specified multiple times if required:
arangorestore --collection myusers --collection myvalues --input-directory "dump"
Collections will be processed in alphabetical order by _arangorestore_, with all document
collections being processed before all [edge collections](../../Appendix/Glossary.md#edge-collection).
This also holds when multiple threads are in use (from v3.4.0 on).
Note, however, that when restoring an edge collection, no internal checks are made to
validate whether the documents that the edges connect exist. As a consequence, when restoring
individual collections which are part of a graph, you are not required to restore them in a
specific order.
{% hint 'warning' %}
When restoring only a subset of the collections of your database, and graphs are in use, you
need to make sure you are restoring all the needed collections (the ones that are part of the
graph), as otherwise you might end up with edges pointing to non-existing documents.
{% endhint %}
To restrict reloading to specific views, there is the *--view* option.
Should you specify the *--collection* parameter, views will not be restored _unless_ you
explicitly specify them via the *--view* option.
arangorestore --collection myusers --view myview --input-directory "dump"
In the case of an ArangoSearch View, you must make sure that the linked collections are
either also restored or already present on the server.
@ -132,8 +141,8 @@ See [Arangodump](../Arangodump/Examples.md#encryption) for details.
Reloading Data into a different Collection
------------------------------------------
_arangorestore_ will restore document and edge data with the exact same *_key*, *_rev*, *_from*
and *_to* values as found in the input directory.
With some creativity you can also use _arangodump_ and _arangorestore_ to transfer data from one
collection into another (either on the same server or not). For example, to copy data from
@ -142,44 +151,49 @@ you can start with the following command:
arangodump --collection myvalues --server.database mydb --output-directory "dump"
This will create two files, *myvalues.structure.json* and *myvalues.data.json*, in the output
directory. To load data from the datafile into an existing collection *mycopyvalues* in database
*mycopy*, rename the files to *mycopyvalues.structure.json* and *mycopyvalues.data.json*.
After that, run the following command:
arangorestore --collection mycopyvalues --server.database mycopy --input-directory "dump"
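The renaming step above can be sketched as a shell session; the two files here are empty
stand-ins for a real dump of collection *myvalues*:

```
mkdir -p dump
# Stand-ins for the files arangodump would have produced:
touch dump/myvalues.structure.json dump/myvalues.data.json
# Rename both files to the name of the target collection:
mv dump/myvalues.structure.json dump/mycopyvalues.structure.json
mv dump/myvalues.data.json dump/mycopyvalues.data.json
ls dump
```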
Restoring in a Cluster
----------------------
From v2.1 on, the *arangorestore* tool supports sharding and can be
used to restore data into a Cluster. Simply point it to one of the
_Coordinators_ in your Cluster and it will work as usual but on sharded
collections in the Cluster.
If *arangorestore* is asked to restore a collection, it will use the same
number of shards, replication factor and shard keys as when the collection
was dumped. The distribution of the shards to the servers will also be the
same as at the time of the dump, provided that the number of _DBServers_ in
the cluster dumped from is identical to the number of DBServers in the
to-be-restored-to cluster.
To modify the number of _shards_ or the _replication factor_ for all or just
some collections, *arangorestore* provides the options `--number-of-shards`
and `--replication-factor` (starting from v3.3.22 and v3.4.2). These options
can be specified multiple times as well, in order to override the settings
for dedicated collections, e.g.
arangorestore --number-of-shards 2 --number-of-shards mycollection=3 --number-of-shards test=4
The above will restore all collections except "mycollection" and "test" with
2 shards. "mycollection" will have 3 shards when restored, and "test" will
have 4. It is possible to omit the default value and only use
collection-specific overrides. In this case, the number of shards for any
collections not overridden will be determined by looking into the
"numberOfShards" values contained in the dump.
The `--replication-factor` option works in the same way, e.g.
arangorestore --replication-factor 2 --replication-factor mycollection=1
will set the replication factor to 2 for all collections but "mycollection", which will get a
replication factor of just 1.
If a collection was dumped from a single instance and is then restored into
a cluster, the sharding will be done by the `_key` attribute by default. One can
@ -190,6 +204,29 @@ If you restore a collection that was dumped from a cluster into a single
ArangoDB instance, the number of shards, replication factor and shard keys will silently
be ignored.
### Factors affecting speed of arangorestore in a Cluster
The following factors affect the speed of _arangorestore_ in a Cluster:
- **Replication Factor**: the higher the _replication factor_, the more
time the restore will take. To speed up the restore you can restore
using a _replication factor_ of 1 and then increase it again
after the restore. This will reduce the number of network hops needed
during the restore.
- **Restore Parallelization**: if the collections are not restored in
parallel, the restore speed is highly affected. A parallel restore can
be done from v3.4.0 by using the `--threads` option of _arangorestore_.
Before v3.4.0 it is possible to achieve parallelization by restoring
on multiple _Coordinators_ at the same time. Depending on your specific
case, parallelizing on multiple _Coordinators_ can still be useful even
when the `--threads` option is in use (from v3.4.0).
{% hint 'tip' %}
Please refer to the [Fast Cluster Restore](FastClusterRestore.md) page
for further operational details on how to take the two factors described
above into account when restoring using _arangorestore_.
{% endhint %}
### Restoring collections with sharding prototypes
*arangorestore* will yield an error when trying to restore a


@ -0,0 +1,279 @@
Fast Cluster Restore
====================
The _Fast Cluster Restore_ procedure documented in this page is recommended
to speed up the performance of [_arangorestore_](../Arangorestore/README.md)
in a Cluster environment.
It is assumed that a Cluster environment is running and a _logical_ backup
with [_arangodump_](../Arangodump/README.md) has already been taken.
{% hint 'info' %}
The procedure described in this page is particularly useful for ArangoDB
version 3.3, but can be used in 3.4 and later versions as well. Note that
from v3.4, _arangorestore_ includes the option `--threads`, which already is a
good first step towards restore parallelization and its speed benefit.
However, the procedure below allows for even further parallelization (making
use of different _Coordinators_), and the part regarding temporarily setting
_replication factor_ to 1 is still useful in 3.4 and later versions.
{% endhint %}
The speed improvement obtained by the procedure below is achieved by:
1. Restoring into a Cluster that has _replication factor_ 1, thus reducing the
number of network hops needed during the restore operation (the _replication factor_
is reverted to its initial value at the end of the procedure - steps #2, #3 and #6).
2. Restoring multiple collections in parallel on different _Coordinators_
(steps #4 and #5).
{% hint 'info' %}
Please refer to
[this](Examples.md#factors-affecting-speed-of-arangorestore-in-a-cluster)
section for further context on the factors affecting restore speed when restoring
using _arangorestore_ in a Cluster.
{% endhint %}
Step 1: Copy the _dump_ directory to all _Coordinators_
-------------------------------------------------------
The first step is to copy the directory that contains the _dump_ to all machines
where _Coordinators_ are running.
{% hint 'tip' %}
This step is not strictly required as the backup can be restored over the
network. However, if the restore is executed locally the restore speed is
significantly improved.
{% endhint %}
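Assuming shell access to the machines where the _Coordinators_ run, a plain `scp` is one
way to perform the copy (user, host, and path below are placeholders for your own setup):

```
# Copy the local dump directory to the first Coordinator machine;
# repeat for every other Coordinator.
scp -r dump/ <user>@<ip-of-coordinator1>:/path/to/dump
```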
Step 2: Restore collection structures
-------------------------------------
The collection structures have to be restored from exactly one _Coordinator_ (any
_Coordinator_ can be used) with a command similar to the following one. Please add
any additional options needed for your specific use case, e.g. `--create-database`
if the database where you want to restore does not exist yet:
```
arangorestore \
  --server.endpoint <endpoint-of-a-coordinator> \
  --server.database <database-name> \
  --server.password <password> \
  --import-data false \
  --input-directory <dump-directory>
```
{% hint 'info' %}
If you are using v3.3.22 or higher, or v3.4.2 or higher, please also add the
option `--replication-factor 1` to the command above.
{% endhint %}
The option `--import-data false` tells _arangorestore_ to restore only the
collection structure and no data.
Step 3: Set _Replication Factor_ to 1
--------------------------------------
{% hint 'info' %}
This step is **not** needed if you are using v3.3.22 or higher, or v3.4.2 or higher,
and you have used the option `--replication-factor 1` in the previous step.
{% endhint %}
To speed up restore, it is possible to set the _replication factor_ to 1 before
importing any data. Run the following command from exactly one _Coordinator_ (any
_Coordinator_ can be used):
```
echo 'db._collections().filter(function(c) { return c.name()[0] !== "_"; })
.forEach(function(c) { print("collection:", c.name(), "replicationFactor:",
c.properties().replicationFactor); c.properties({ replicationFactor: 1 }); });' \
| arangosh \
  --server.endpoint <endpoint-of-a-coordinator> \
  --server.database <database-name> \
  --server.username <user-name> \
  --server.password <password>
```
Step 4: Create parallel restore scripts
---------------------------------------
Now that the Cluster is prepared, the `parallelRestore` script will be used.
Please create the `parallelRestore` script below on any of your _Coordinators_.
When executed (see below for further details), this script will create other scripts
that can then be copied to and executed on each _Coordinator_.
```
#!/bin/sh
#
# Version: 0.3
#
# Release Notes:
# - v0.3: fixed a bug that was happening when the collection name included an underscore
# - v0.2: compatibility with version 3.4: now each coordinator_<number-of-coordinator>.sh
# includes a single restore command (instead of one for each collection)
# which allows making use of the --threads option in v3.4.0 and later
# - v0.1: initial version
if test -z "$ARANGOSH" ; then
export ARANGOSH=arangosh
fi
cat > /tmp/parallelRestore$$.js <<'EOF'
var fs = require("fs");
var print = require("internal").print;
var exit = require("internal").exit;
var arangorestore = "arangorestore";
var env = require("internal").env;
if (env.hasOwnProperty("ARANGORESTORE")) {
arangorestore = env["ARANGORESTORE"];
}
// Check ARGUMENTS: dumpDir coordinator1 coordinator2 ...
if (ARGUMENTS.length < 2) {
print("Need at least two arguments DUMPDIR and COORDINATOR_ENDPOINTS!");
exit(1);
}
var dumpDir = ARGUMENTS[0];
var coordinators = ARGUMENTS[1].split(",");
var otherArgs = ARGUMENTS.slice(2);
// Quickly check the dump dir:
var files = fs.list(dumpDir).filter(f => !fs.isDirectory(f));
var found = files.indexOf("ENCRYPTION");
if (found === -1) {
print("This directory does not have an ENCRYPTION entry.");
exit(2);
}
// Remove ENCRYPTION entry:
files = files.slice(0, found).concat(files.slice(found+1));
for (let i = 0; i < files.length; ++i) {
if (files[i].slice(-5) !== ".json") {
print("This directory has files which do not end in '.json'!");
exit(3);
}
}
files = files.map(function(f) {
var fullName = fs.join(dumpDir, f);
var collName = "";
if (f.slice(-10) === ".data.json") {
var pos;
if (f.slice(0, 1) === "_") { // system collection
pos = f.slice(1).indexOf("_") + 1;
collName = "_" + f.slice(1, pos);
} else {
pos = f.lastIndexOf("_");
collName = f.slice(0, pos);
}
}
return {name: fullName, collName, size: fs.size(fullName)};
});
files = files.sort(function(a, b) { return b.size - a.size; });
var dataFiles = [];
for (let i = 0; i < files.length; ++i) {
if (files[i].name.slice(-10) === ".data.json") {
dataFiles.push(i);
}
}
// Produce the scripts, one for each coordinator:
var scripts = [];
var collections = [];
for (let i = 0; i < coordinators.length; ++i) {
scripts.push([]);
}
var cnum = 0;
for (let i = 0; i < dataFiles.length; ++i) {
var f = files[dataFiles[i]];
if (typeof collections[cnum] == 'undefined') {
collections[cnum] = (`--collection ${f.collName}`);
} else {
collections[cnum] += (` --collection ${f.collName}`);
}
cnum += 1;
if (cnum >= coordinators.length) {
cnum = 0;
}
}
for (let i = 0; i < coordinators.length; ++i) {
scripts[i].push(`${arangorestore} --input-directory ${dumpDir} --server.endpoint ${coordinators[i]} ` + collections[i] + ' ' + otherArgs.join(" "));
}
for (let i = 0; i < coordinators.length; ++i) {
let f = "coordinator_" + i + ".sh";
print("Writing file", f, "...");
fs.writeFileSync(f, scripts[i].join("\n"));
}
EOF
${ARANGOSH} --javascript.execute /tmp/parallelRestore$$.js -- "$@"
rm /tmp/parallelRestore$$.js
```
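The round-robin assignment the script performs (data file *i* goes to coordinator
*i* mod *N*) can be illustrated with a toy shell loop; the five collection names
here are made up:

```
# Five data files distributed over three coordinator scripts, round-robin:
i=0
for f in users orders products logs sessions; do
  echo "coordinator_$((i % 3)).sh <- $f"
  i=$((i + 1))
done
```

The real script additionally sorts the data files by size, largest first, before
distributing them, so the load per _Coordinator_ ends up roughly balanced.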
To run this script, all _Coordinator_ endpoints of the Cluster have to be
provided. The script accepts all options of the tool _arangorestore_.
The command below can for instance be used on a Cluster with three
_Coordinators_:
```
./parallelRestore <dump-directory> \
  tcp://<ip-of-coordinator1>:<port-of-coordinator1>,tcp://<ip-of-coordinator2>:<port-of-coordinator2>,tcp://<ip-of-coordinator3>:<port-of-coordinator3> \
  --server.username <username> \
  --server.password <password> \
  --server.database <database_name> \
  --create-collection false
```
Note that the _Coordinator_ endpoints have to be passed as a single
comma-separated argument.
**Notes:**
- The option `--create-collection false` is passed since the collection
structures were created already in the previous step.
- Starting from v3.4.0 the _arangorestore_ option *--threads N* can be
passed to the command above, where _N_ is an integer, to further parallelize
the restore (default is `--threads 2`).
The above command will create three scripts, one per listed _Coordinator_.
The resulting scripts are named `coordinator_<number-of-coordinator>.sh` (e.g.
`coordinator_0.sh`, `coordinator_1.sh`, `coordinator_2.sh`).
Step 5: Execute parallel restore scripts
----------------------------------------
The `coordinator_<number-of-coordinator>.sh` scripts that were created in the
previous step now have to be executed on each machine where a _Coordinator_
is running. This will start a parallel restore of the dump.
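As a dry run on a single machine, the invocation looks as follows; here a harmless
stand-in `coordinator_0.sh` is generated first, whereas on a real Cluster the file
produced in Step 4 already contains the _arangorestore_ command:

```
# Stand-in for the generated script (real one holds an arangorestore command):
printf 'echo restoring on coordinator 0\n' > coordinator_0.sh
# Execute it as you would on the Coordinator machine:
sh coordinator_0.sh
```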
Step 6: Revert to the initial _Replication Factor_
--------------------------------------------------
Once the _arangorestore_ process on every _Coordinator_ is completed, the
_replication factor_ has to be set to its initial value.
Run the following command from exactly one _Coordinator_ (any _Coordinator_ can be
used). Please adjust the `replicationFactor` value to your specific case (2 in the
example below):
```
echo 'db._collections().filter(function(c) { return c.name()[0] !== "_"; })
.forEach(function(c) { print("collection:", c.name(), "replicationFactor:",
c.properties().replicationFactor); c.properties({ replicationFactor: 2 }); });' \
| arangosh \
  --server.endpoint <endpoint-of-a-coordinator> \
  --server.database <database-name> \
  --server.username <user-name> \
  --server.password <password>
```


@ -11,4 +11,10 @@ If you want to import data in formats like JSON or CSV, see
_Arangorestore_ can restore selected collections or all collections of a backup,
optionally including _system_ collections. One can restore the structure, i.e.
the collections with their configuration with or without data.
Views can also be dumped or restored (either all of them or selectively).
{% hint 'tip' %}
In order to speed up the _arangorestore_ performance in a Cluster environment,
the [Fast Cluster Restore](FastClusterRestore.md)
procedure is recommended.
{% endhint %}


@ -80,6 +80,7 @@
* [Limitations](Programs/Arangodump/Limitations.md)
* [Arangorestore](Programs/Arangorestore/README.md)
* [Examples](Programs/Arangorestore/Examples.md)
* [Fast Cluster Restore](Programs/Arangorestore/FastClusterRestore.md)
* [Options](Programs/Arangorestore/Options.md)
* [Arangoimport](Programs/Arangoimport/README.md)
* [Examples JSON](Programs/Arangoimport/ExamplesJson.md)