Master/Slave Architecture
=========================

Introduction
------------

In a _Master/Slave_ setup one or more ArangoDB _slaves_ asynchronously replicate
from a _master_.

The _master_ is the ArangoDB instance to which all data-modification operations should
be directed. The _slave_ is the ArangoDB instance that replicates the data from
the master.

Components
----------

### Replication Logger

**Purpose**

The _replication logger_ will write all data-modification operations into the
_write-ahead log_. This log may then be read by clients to replay any data
modification on a different server.

**Checking the state**

To query the current state of the _logger_, use the *state* command:

    require("@arangodb/replication").logger.state();

The result might look like this:
```js
{
  "state" : {
    "running" : true,
    "lastLogTick" : "2064735086",
    "lastUncommittedLogTick" : "2064735086",
    "totalEvents" : 2064735086,
    "time" : "2019-03-01T11:38:39Z"
  },
  "server" : {
    "version" : "3.4.4",
    "serverId" : "135694526467864",
    "engine" : "rocksdb"
  },
  "clients" : [
    {
      "serverId" : "46402312160836",
      "time" : "2019-03-01T11:38:39Z",
      "expires" : "2019-03-01T13:38:39Z",
      "lastServedTick" : "2064459411"
    },
    {
      "serverId" : "260321896124903",
      "time" : "2019-03-01T11:29:45Z",
      "expires" : "2019-03-01T13:29:45Z",
      "lastServedTick" : "2002717896"
    }
  ]
}
```
The *running* attribute will always be true. In earlier versions of ArangoDB the
replication was optional and this could have been *false*.

The *totalEvents* attribute indicates how many log events have been logged since
the start of the ArangoDB server. The *lastLogTick* value indicates the _id_ of the
last committed operation that was written to the server's _write-ahead log_.
It can be used to determine whether new operations were logged, and is also used
by the _replication applier_ for incremental fetching of data. The *lastUncommittedLogTick*
value contains the _id_ of the last uncommitted operation that was written to the
server's WAL. For the RocksDB storage engine, *lastLogTick* and *lastUncommittedLogTick*
are identical, as the WAL only contains committed operations.
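
For example, a client can remember a *lastLogTick* value and compare it against the
current one later to find out whether new operations have been logged in the meantime.
A minimal arangosh sketch (variable names are illustrative):

```js
var replication = require("@arangodb/replication");

// remember the current tick before performing some writes
var before = replication.logger.state().state.lastLogTick;

// ... data-modification operations happen here ...

// fetch the tick again and compare
var after = replication.logger.state().state.lastLogTick;
if (after !== before) {
  print("new operations were written to the WAL since tick " + before);
}
```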
The *clients* attribute reveals which clients (slaves) have connected to the
master recently, and up to which tick value they have caught up with the replication.

**Note**: The replication logger state can also be queried via the
[HTTP API](../../../../HTTP/Replications/index.html).
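
For instance, the logger state can also be fetched from within arangosh via a plain
HTTP request; the path below is an assumption based on the linked API documentation:

```js
// GET the logger state via the HTTP replication API (path assumed from the linked docs)
var result = arango.GET("/_api/replication/logger-state");
print(result.state.lastLogTick);
```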
To query which data ranges are still available for replication clients to fetch,
the logger provides the *firstTick* and *tickRanges* functions:

    require("@arangodb/replication").logger.firstTick();

This will return the minimum tick value that the server can provide to replication
clients via its replication APIs. The *tickRanges* function returns the minimum
and maximum tick values per logfile:

    require("@arangodb/replication").logger.tickRanges();

### Replication Applier

**Purpose**

The purpose of the _replication applier_ is to read data from a master database's
event log and apply it locally. The _applier_ will check the master database
for new operations periodically. It will perform an incremental synchronization,
i.e. only asking the master for operations that occurred after the last synchronization.

The _replication applier_ does not get notified by the master database when there
are "new" operations available, but instead uses the pull principle. It might thus
take some time (the so-called *replication lag*) before an operation from the master
database gets shipped to, and applied in, a slave database.
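
One rough way to gauge the current replication lag on a slave is to inspect the tick
values reported by the applier's *state* command (described below). A sketch, using the
field names from the state output shown further down:

```js
var replication = require("@arangodb/replication");
var state = replication.applier.state().state;

// lastAvailableContinuousTick is the newest tick the master has to offer,
// lastAppliedContinuousTick is what the slave has already applied;
// ticksBehind summarizes the difference
print("applied: " + state.lastAppliedContinuousTick +
      ", available: " + state.lastAvailableContinuousTick +
      ", ticks behind: " + state.ticksBehind);
```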
The _replication applier_ of a database is run in a separate thread. It may encounter
problems when an operation from the master cannot be applied safely, or when the
connection to the master database goes down (network outage, master database is
down or unavailable etc.). In this case, the database's _replication applier_ thread
might terminate itself. It is then up to the administrator to fix the problem and
restart the database's _replication applier_.

If the _replication applier_ cannot connect to the master database, or the
communication fails at some point during the synchronization, the _replication applier_
will try to reconnect to the master database. It will give up reconnecting only
after a configurable number of connection attempts.
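
The endpoint, the credentials and the number of connection attempts are part of the
applier configuration. A minimal sketch of configuring and starting the applier for
the current database (the property values are illustrative, not a recommendation):

```js
var replication = require("@arangodb/replication");

// configure the applier of the current database (values are examples only)
replication.applier.properties({
  endpoint: "tcp://master.example.org:8529",
  username: "replicator",        // a user that may read from the master
  password: "secret",
  maxConnectRetries: 10,         // give up reconnecting after this many failed attempts
  autoStart: true                // start the applier automatically at server startup
});

// start the applier manually
replication.applier.start();
```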
The _replication applier_ state is queryable at any time by using the *state* command
of the _applier_. This will return the state of the _applier_ of the current database:

```js
require("@arangodb/replication").applier.state();
```

The result might look like this:
```js
{
  "state" : {
    "started" : "2019-03-01T11:36:33Z",
    "running" : true,
    "phase" : "running",
    "lastAppliedContinuousTick" : "2050724544",
    "lastProcessedContinuousTick" : "2050724544",
    "lastAvailableContinuousTick" : "2050724546",
    "safeResumeTick" : "2050694546",
    "ticksBehind" : 2,
    "progress" : {
      "time" : "2019-03-01T11:36:33Z",
      "message" : "fetching master log from tick 2050694546, last scanned tick 2050664547, first regular tick 2050544543, barrier: 0, open transactions: 1, chunk size 6291456",
      "failedConnects" : 0
    },
    "totalRequests" : 2,
    "totalFailedConnects" : 0,
    "totalEvents" : 50010,
    "totalDocuments" : 50000,
    "totalRemovals" : 0,
    "totalResyncs" : 0,
    "totalOperationsExcluded" : 0,
    "totalApplyTime" : 1.1071290969848633,
    "averageApplyTime" : 1.1071290969848633,
    "totalFetchTime" : 0.2129514217376709,
    "averageFetchTime" : 0.10647571086883545,
    "lastError" : {
      "errorNum" : 0
    },
    "time" : "2019-03-01T11:36:34Z"
  },
  "server" : {
    "version" : "3.4.4",
    "serverId" : "46402312160836"
  },
  "endpoint" : "tcp://master.example.org",
  "database" : "test"
}
```
The *running* attribute indicates whether the _replication applier_ of the current
database is currently running and polling the master at *endpoint* for new events.

The *started* attribute shows at what date and time the applier was started (if at all).

The *progress.failedConnects* attribute shows how many failed connection attempts
the _replication applier_ has currently encountered in a row. In contrast, the
*totalFailedConnects* attribute indicates how many failed connection attempts the
_applier_ has made in total. The *totalRequests* attribute shows how many requests
the _applier_ has sent to the master database in total.

The *totalEvents* attribute shows how many log events the _applier_ has read from the
master. The *totalDocuments* and *totalRemovals* attributes indicate how many document
operations the slave has applied locally.

The attributes *totalApplyTime* and *totalFetchTime* show the total time the applier
spent applying data batches locally, and the total time the applier waited on
data-fetching requests to the master, respectively.
The *averageApplyTime* and *averageFetchTime* attributes show the average times clocked
for these operations. Note that the average times will be greatly influenced by the
chunk size used in the applier configuration (bigger chunk sizes mean fewer requests
from the slave to the master, but the batches will include more data and take more
time to create and apply).
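
The chunk size can be adjusted through the applier configuration. A sketch, assuming
the applier is otherwise already configured (its properties cannot be changed while it
is running, hence the stop/start around the change):

```js
var replication = require("@arangodb/replication");

// stop the applier before changing its configuration
replication.applier.stop();

// request bigger batches from the master (value in bytes, illustrative)
replication.applier.properties({
  chunkSize: 16 * 1024 * 1024
});

replication.applier.start();
```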
The *progress.message* sub-attribute provides a brief hint of what the _applier_
currently does (if it is running). The *lastError* attribute also has an optional
*errorMessage* sub-attribute, showing the latest error message. The *errorNum*
sub-attribute of the *lastError* attribute can be used by clients to programmatically
check for errors. It should be *0* if there is no error, and it should be non-zero
if the _applier_ terminated itself due to a problem.
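
For example, a monitoring script could check for errors like this, using the fields
shown in the examples above:

```js
var replication = require("@arangodb/replication");
var state = replication.applier.state().state;

if (state.lastError.errorNum !== 0) {
  // the applier stopped due to a problem; errorMessage, if present, has the details
  print("applier error " + state.lastError.errorNum + ": " + state.lastError.errorMessage);
}
```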
Below is an example of the state after the _replication applier_ terminated itself
due to (repeated) connection problems:

```js
{
  "state" : {
    "started" : "2019-03-01T11:51:18Z",
    "running" : false,
    "phase" : "inactive",
    "lastAppliedContinuousTick" : "2101606350",
    "lastProcessedContinuousTick" : "2101606370",
    "lastAvailableContinuousTick" : "2101606370",
    "safeResumeTick" : "2101606350",
    "progress" : {
      "time" : "2019-03-01T11:52:45Z",
      "message" : "applier shut down",
      "failedConnects" : 6
    },
    "totalRequests" : 19,
    "totalFailedConnects" : 6,
    "totalEvents" : 0,
    "totalDocuments" : 0,
    "totalRemovals" : 0,
    "totalResyncs" : 0,
    "totalOperationsExcluded" : 0,
    "totalApplyTime" : 0,
    "averageApplyTime" : 0,
    "totalFetchTime" : 0.03386974334716797,
    "averageFetchTime" : 0.0028224786122639975,
    "lastError" : {
      "errorNum" : 1400,
      "time" : "2019-03-01T11:52:45Z",
      "errorMessage" : "could not connect to master at tcp://127.0.0.1:8529 for URL /_api/wal/tail?chunkSize=6291456&barrier=0&from=2101606369&lastScanned=2101606370&serverId=46402312160836&includeSystem=true&includeFoxxQueues=false: Could not connect to 'http+tcp://127.0.0.1:852..."
    },
    "time" : "2019-03-01T11:52:56Z"
  },
  "server" : {
    "version" : "3.4.4",
    "serverId" : "46402312160836"
  },
  "endpoint" : "tcp://master.example.org",
  "database" : "test"
}
```
**Note**: The state of a database's replication applier is queryable via the HTTP
API, too. Please refer to [HTTP Interface for Replication](../../../../HTTP/Replications/index.html)
for more details.
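
For example, from within arangosh the applier state of the current database can be
retrieved over HTTP as follows; the path is an assumption based on the linked API
documentation:

```js
// GET the applier state via the HTTP replication API (path assumed from the linked docs)
var applierState = arango.GET("/_api/replication/applier-state");
print(applierState.state.running);
```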