arangodb/Documentation/Books/Users/Replication/Components.mdpp

!CHAPTER Components

The replication architecture in ArangoDB before version 2.2 consisted of two main
components, which could be used together or in isolation: the *replication logger*
and the *replication applier*. Since ArangoDB 2.2, the replication logger has no
special purpose anymore and is available for downwards-compatibility only.

The replication applier can be administered via the command line or a REST API
(see [HTTP Interface for Replication](../HttpReplications/README.md)).

As replication is configured on a per-database level and there can be multiple
databases inside one ArangoDB instance, there can be multiple replication appliers
in one ArangoDB instance.

!SUBSECTION Replication Logger

!SUBSUBSECTION Purpose

The purpose of the replication logger in ArangoDB before version 2.2 was to log all
changes that modify the state of data in a specific database on a master server.
This included document insertions, updates, and deletions as well as creating, dropping,
renaming and changing collections and indexes.

When the replication logger was used, it logged all these write operations for the
database in its own event log, which was a system collection named *_replication*.

Reading the events sequentially from the *_replication* collection provides a list of
all write operations carried out in the master database. Replication clients could then
request these events from the logger and apply them on their own.

Starting with ArangoDB 2.2, the *_replication* system collection is not used anymore.
Instead, the server will write all data-modification operations into its write-ahead log.
The write-ahead log can be queried by clients, so the need for an extra event log is
gone.

!SUBSUBSECTION Checking the state

Starting with version 2.2, ArangoDB will log all data-modification operations in its
write-ahead log automatically. There is no need to explicitly start or configure the
replication logger on the master.

To query the current state of the logger, use the *state* command:

    require("org/arangodb/replication").logger.state();

The result might look like this:

```js
{
  "state" : {
    "running" : true,
      "lastLogTick" : "133322013",
      "totalEvents" : 16,
      "time" : "2014-07-06T12:58:11Z"
  },
  "server" : {
    "version" : "2.2.0-devel",
    "serverId" : "40897075811372"
  },
  "clients" : {
  }
}
```

In old versions of ArangoDB, the *running* attribute indicated whether the logger
was currently enabled and logged any events. In ArangoDB 2.2 and higher, this attribute
will always have a value of *true*, as the replication logger cannot be turned off.

The *totalEvents* attribute indicates how many log events have been logged since the start
of the ArangoDB server. Finally, the *lastLogTick* value indicates the id of the last
operation that was written to the server's write-ahead log. It can be used to determine whether new
operations were logged, and is also used by the replication applier for incremental
fetching of data.

**Note**: The replication logger state can also be queried via the [HTTP API](../HttpReplications/README.md).

To query which data ranges are still available for replication clients to fetch,
the logger provides the *firstTick* and *tickRanges* functions:

    require("org/arangodb/replication").logger.firstTick();

This will return the minimum tick value that the server can provide to replication
clients via its replication APIS. The *tickRanges* function returns the minimum and
maximum tick values per logfile:

    require("org/arangodb/replication").logger.tickRanges();


!SUBSUBSECTION Configuration

Since ArangoDB 2.2, no special configuration is necessary for the replication logger.
All operations are written to a server's write-ahead log, and the write-ahead log is
used in replication, too.


!SUBSECTION Replication Applier

!SUBSUBSECTION Purpose

The purpose of the replication applier is to read data from a master database's event log,
and apply them locally. The applier will check the master database for new operations periodically.
It will perform an incremental synchronization, i.e. only asking the master for operations
that occurred after the last synchronization.

The replication applier does not get notified by the master database when there are "new"
operations available, but instead uses the pull principle. It might thus take some time (the
so-called *replication lag*) before an operation from the master database gets shipped to and
applied in a slave database.

The replication applier of a database is run in a separate thread. It may encounter problems
when an operation from the master cannot be applied safely, or when the connection to the master
database goes down (network outage, master database is down or unavailable etc.). In this case,
the database's replication applier thread might terminate itself. It is then up to the
administrator to fix the problem and restart the database's replication applier.

If the replication applier cannot connect to the master database, or the communication fails at
some point during the synchronization, the replication applier will try to reconnect to
the master database. It will give up reconnecting only after a configurable amount of connection
attempts.

The replication applier state is queryable at any time by using the *state* command of the
applier. This will return the state of the applier of the current database:

```js
require("org/arangodb/replication").applier.state();
```

The result might look like this:

```js
{
  "state" : {
    "running" : true,
    "lastAppliedContinuousTick" : "152786205",
    "lastProcessedContinuousTick" : "152786205",
    "lastAvailableContinuousTick" : "152786205",
    "progress" : {
      "time" : "2014-07-06T13:04:57Z",
      "message" : "fetching master log from offset 152786205",
      "failedConnects" : 0
    },
    "totalRequests" : 38,
    "totalFailedConnects" : 0,
    "totalEvents" : 1,
    "lastError" : {
      "errorNum" : 0
    },
    "time" : "2014-07-06T13:04:57Z"
  },
  "server" : {
    "version" : "2.2.0-devel",
    "serverId" : "210189384542896"
  },
  "endpoint" : "tcp://master.example.org:8529",
  "database" : "_system"
}
```

The *running* attribute indicates whether the replication applier of the current database
is currently running and polling the server at *endpoint* for new events.

The *progress.failedConnects* attribute shows how many failed connection attempts the replication
applier currently has encountered in a row. In contrast, the *totalFailedConnects* attribute
indicates how many failed connection attempts the applier has made in total. The
*totalRequests* attribute shows how many requests the applier has sent to the master database
in total. The *totalEvents* attribute shows how many log events the applier has read from the
master.

The *progress.message* sub-attribute provides a brief hint of what the applier currently does
(if it is running). The *lastError* attribute also has an optional *errorMessage* sub-attribute,
showing the latest error message. The *errorNum* sub-attribute of the *lastError* attribute can be
used by clients to programmatically check for errors. It should be *0* if there is no error, and
it should be non-zero if the applier terminated itself due to a problem.

Here is an example of the state after the replication applier terminated itself due to
(repeated) connection problems:

```js
{
  "state" : {
    "running" : false,
    "progress" : {
      "time" : "2014-07-06T13:14:37Z",
      "message" : "applier stopped",
      "failedConnects" : 6
    },
    "totalRequests" : 79,
    "totalFailedConnects" : 11,
    "totalEvents" : 0,
    "lastError" : {
      "time" : "2014-07-06T13:09:41Z",
      "errorMessage" : "could not connect to master at tcp://master.example.org:8529: Could not connect to 'tcp:/...",
      "errorNum" : 1400
    },
    ...
  }
}
```

**Note**: the state of a database's replication applier is queryable via the HTTP API, too.
Please refer to [HTTP Interface for Replication](../HttpReplications/README.md) for more details.

!SUBSUBSECTION Starting and Stopping

To start and stop the applier in the current database, the *start* and *stop* commands can
be used:

```js
require("org/arangodb/replication").applier.start(<tick>);
require("org/arangodb/replication").applier.stop();
```

**Note**: Starting a replication applier without setting up an initial configuration will
fail. The replication applier will look for its configuration in a file named
*REPLICATION-APPLIER-CONFIG* in the current database's directory. If the file is not present,
ArangoDB will use some default configuration, but it cannot guess the endpoint (the address
of the master database) the applier should connect to. Thus starting the applier without
configuration will fail.

Note that at the first time you start the applier, you should pass the value returned in the
*lastLogTick* attribute of the initial sync operation.

**Note**: Starting a database's replication applier via the *start* command will not necessarily
start the applier on the next and following ArangoDB server restarts. Additionally, stopping a
database's replication applier manually will not necessarily prevent the applier from being
started again on the next server start. All of this is configurable separately (hang on reading).

**Note**: when stopping and restarting the replication applier of database, it will resume where
it last stopped. This is sensible because replication log events should be applied incrementally.
If the replication applier of a database has never been started before, it needs some *tick* value
from the master's log from which to start fetching events.

There is one caveat to consider when stopping a replication on the slave: if there are still
ongoing replicated transactions that are neither committed or aborted, stopping the replication
applier will cause these operations to be lost for the slave. If these transactions commit on the
master later and the replication is resumed, the slave will not be able to commit these transactions,
too. Thus stopping the replication applier on the slave manually should only be done if there
is certainty that there are no ongoing transactions on the master.


!SUBSUBSECTION Configuration

To configure the replication applier of a specific database, use the *properties* command. Using
it without any arguments will return the current configuration:

```js
require("org/arangodb/replication").applier.properties();
```

The result might look like this:

```js
{
  "requestTimeout" : 300,
  "connectTimeout" : 10,
  "ignoreErrors" : 0,
  "maxConnectRetries" : 10,
  "chunkSize" : 0,
  "autoStart" : false,
  "adaptivePolling" : true,
  "includeSystem" : true,
  "verbose" : false
}
```

**Note**: There is no *endpoint* attribute configured yet. The *endpoint* attribute is required
for the replication applier to be startable. You may also want to configure a username and password
for the connection via the *username* and *password* attributes.

```js
require("org/arangodb/replication").applier.properties({
  endpoint: "tcp://master.domain.org:8529",
  username:  "root",
  password: "secret",
  verbose: false
});
```

This will re-configure the replication applier for the current database. The configuration will be
used from the next start of the replication applier. The replication applier cannot be re-configured
while it is running. It must be stopped first to be re-configured.

To make the replication applier of the current database start automatically when the ArangoDB server
starts, use the *autoStart* attribute.

Setting the *adaptivePolling* attribute to *true* will make the replication applier poll the
master database for changes with a variable frequency. The replication applier will then lower the
frequency when the master is idle, and increase it when the master can provide new events).
Otherwise the replication applier will poll the master database for changes with a constant frequency.

To set a timeout for connection and following request attempts, use the *connectTimeout* and
*requestTimeout* values. The *maxConnectRetries* attribute configures after how many failed
connection attempts in a row the replication applier will give up and turn itself off.
You may want to set this to a high value so that temporary network outages do not lead to the
replication applier stopping itself.

The *chunkSize* attribute can be used to control the approximate maximum size of a master's
response (in bytes). Setting it to a low value may make the master respond faster (less data is
assembled before the master sends the response), but may require more request-response roundtrips.
Set it to *0* to use ArangoDB's built-in default value.

The *includeSystem* attribute controls whether changes to system collections (such as *_graphs* or
*_users*) should be applied. If set to *true*, changes in these collections will be replicated,
otherwise, they will not be replicated. It is often not necessary to replicate data from system
collections, especially because it may lead to confusion on the slave because the slave needs to
have its own system collections in order to start and keep operational.

The *verbose* attribute controls the verbosity of the replication logger. Setting it to `true` will
make the replication applier write a line to the log for every operation it performs. This should
only be used for diagnosing replication problems.

The following example will set most of the discussed properties for the current database's applier:

```js
require("org/arangodb/replication").applier.properties({
  endpoint: "tcp://master.domain.org:8529",
  username:  "root",
  password: "secret",
  adaptivePolling: true,
  connectTimeout: 15,
  maxConnectRetries: 100,
  chunkSize: 262144,
  autoStart: true,
  includeSystem: true
});
```

After the applier is now fully configured, it could theoretically be started. However, we
may first need an initial synchronization of all collections and their data from the master before
we start the replication applier.

The only safe method for doing a full synchronization (or re-synchronization) is thus to

* stop the replication applier on the slave (if currently running)
* perform an initial full sync with the master database
* note the master database's *lastLogTick* value and
* start the continuous replication applier on the slave using this tick value.

The initial synchronization for the current database is executed with the *sync* command:

```js
require("org/arangodb/replication").sync({
  endpoint: "tcp://master.domain.org:8529",
  username: "root",
  password: "secret,
  includeSystem: true
});
```

The *includeSystem* option controls whether data from system collections (such as *_graphs* and
*_users*) shall be synchronized.

The initial synchronization can optionally be configured to include or exclude specific
collections using the *restrictType* and *restrictCollection* parameters.

The following command only synchronizes collection *foo* and *bar*:

```js
require("org/arangodb/replication").sync({
  endpoint: "tcp://master.domain.org:8529",
  username: "root",
  password: "secret,
  restrictType: "include",
  restrictCollections: [ "foo", "bar" ]
});
```

Using a *restrictType* of *exclude*, all collections but the specified will be synchronized.


**Warning**: *sync* will do a full synchronization of the collections in the current database with
collections present in the master database.
Any local instances of the collections and all their data are removed! Only execute this
command if you are sure you want to remove the local data!

As *sync* does a full synchronization, it might take a while to execute.
When *sync* completes successfully, it returns an array of collections it has synchronized in its
*collections* attribute. It will also return the master database's last log tick value
at the time the *sync* was started on the master. The tick value is contained in the *lastLogTick*
attribute of the *sync* command:

```js
{
  "lastLogTick" : "231848833079705",
  "collections" : [ ... ]
}
```
Now you can start the continuous synchronization for the current database on the slave
with the command

```js
require("org/arangodb/replication").applier.start("231848833079705");
```

**Note**: The tick values should be treated as strings. Using numeric data types for tick
values is unsafe because they might exceed the 32 bit value and the IEEE754 double accuracy
ranges.