
Active Failover Architecture

An Active Failover setup is defined as:

One ArangoDB Single-Server instance, called the Leader, which clients can read from and write to
One or more ArangoDB Single-Server instances, called Followers, which are passive (clients can neither read from nor write to them) and asynchronously replicate data from the Leader
At least one Agency acting as a "witness" to determine which server becomes the Leader in a failure situation

Note: even though it is technically possible to start more than one Follower, only one Follower is currently officially supported. This limitation may be removed in future releases.

The advantage of Active Failover compared to a traditional Master-Slave setup is that there is an active third party, the Agency, which observes and supervises all involved server processes. Follower instances can rely on the Agency to determine the correct Leader server.

The Active Failover setup is made resilient by the fact that all official ArangoDB drivers can automatically determine the correct Leader server and redirect requests appropriately. Furthermore, Foxx services also fail over automatically: should the Leader instance (which is also the Foxxmaster) fail, the newly elected Leader reinstalls all Foxx services and resumes executing queued Foxx tasks. Database users that were created on the Leader are also valid on the newly elected Leader, provided they had already been replicated to it before the failover.

Consider the case of two arangod instances. The two servers are connected via server-wide (global) asynchronous replication. One of the servers is elected Leader, and the other is automatically made a Follower. At startup, the two servers race for the leadership position. This happens through the Agency's locking mechanism, which means that the Agency needs to be available when the servers start. You can control which server becomes Leader by starting it before the other server instances.
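The startup race can be pictured as a compare-and-set on a single lock: whichever server claims it first becomes Leader and the others become Followers. The Python sketch below uses a hypothetical, in-memory stand-in for the Agency; `HypotheticalAgencyLock` and `start_server` are illustrative names, not part of any ArangoDB API, and the real Agency is a replicated, fault-tolerant store rather than a local mutex.

```python
import threading
from typing import Optional


class HypotheticalAgencyLock:
    """In-memory stand-in for the Agency's leadership lock (illustrative only)."""

    def __init__(self) -> None:
        self._mutex = threading.Lock()
        self._leader: Optional[str] = None

    def try_acquire(self, server_id: str) -> bool:
        # Atomic compare-and-set: succeeds only if nobody else holds the lock.
        with self._mutex:
            if self._leader is None:
                self._leader = server_id
                return True
            return self._leader == server_id


def start_server(agency: HypotheticalAgencyLock, server_id: str) -> str:
    # Every server races for the lock at startup; the winner becomes Leader.
    return "Leader" if agency.try_acquire(server_id) else "Follower"


agency = HypotheticalAgencyLock()
print(start_server(agency, "server-a"))  # Leader (started first)
print(start_server(agency, "server-b"))  # Follower
```

Starting `server-a` first is what makes it the Leader here, mirroring the advice to start the preferred server earlier.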

The Follower automatically starts replicating from the Leader for all available databases, using the server-level replication introduced in ArangoDB 3.3.
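Because this replication is asynchronous, a Follower applies the Leader's changes some time after they were committed and may lag behind. The sketch below models that with a hypothetical tick-ordered change log: the Follower remembers the last tick it applied and, on each fetch, applies only newer entries, across all databases. `HypotheticalFollower`, `LogEntry` and the log layout are assumptions for illustration and do not reproduce ArangoDB's actual replication protocol.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical log entry: (tick, database, key, value); ticks increase monotonically.
LogEntry = Tuple[int, str, str, str]


@dataclass
class HypotheticalFollower:
    last_applied_tick: int = 0
    data: Dict[str, Dict[str, str]] = field(default_factory=dict)

    def apply_batch(self, leader_log: List[LogEntry]) -> int:
        """Apply every entry newer than last_applied_tick; return how many were applied."""
        applied = 0
        for tick, database, key, value in leader_log:
            if tick <= self.last_applied_tick:
                continue  # already replicated earlier
            self.data.setdefault(database, {})[key] = value
            self.last_applied_tick = tick
            applied += 1
        return applied


leader_log = [(1, "_system", "user:alice", "created"), (2, "shop", "order:1", "paid")]
follower = HypotheticalFollower()
follower.apply_batch(leader_log[:1])  # only tick 1 has been fetched so far
print(follower.data)  # {'_system': {'user:alice': 'created'}}
follower.apply_batch(leader_log)      # tick 1 is skipped, tick 2 is applied
print(follower.last_applied_tick)  # 2
```

Anything the Leader committed after the Follower's last fetch is missing on the Follower, which is why data such as newly created database users only survives a failover once it has been replicated.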

When the Leader goes down, the Agency (which is also started in this mode) detects this automatically. It then makes the previous Follower stop its replication and promotes it to be the new Leader.
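The Agency's role here resembles a heartbeat-based supervisor: if the Leader stops reporting within a timeout, a live Follower is promoted. The Python sketch below assumes exactly that model; `HypotheticalSupervisor`, the heartbeat calls and the `timeout_s` parameter are illustrative and do not describe how the Agency actually detects failures or which timeouts it uses.

```python
from typing import Dict, List, Optional


class HypotheticalSupervisor:
    """Toy heartbeat supervisor standing in for the Agency (illustrative only)."""

    def __init__(self, leader: str, followers: List[str], timeout_s: float) -> None:
        self.leader = leader
        self.followers = list(followers)
        self.timeout_s = timeout_s
        self.last_heartbeat: Dict[str, float] = {}

    def heartbeat(self, server_id: str, now: float) -> None:
        # Each server periodically reports that it is alive.
        self.last_heartbeat[server_id] = now

    def _alive(self, server_id: str, now: float) -> bool:
        seen = self.last_heartbeat.get(server_id)
        return seen is not None and now - seen <= self.timeout_s

    def check(self, now: float) -> Optional[str]:
        """Promote a live Follower if the Leader has gone silent; return the new Leader or None."""
        if self._alive(self.leader, now):
            return None
        for candidate in self.followers:
            if self._alive(candidate, now):
                # In the real setup the promoted Follower also stops its replication.
                self.followers.remove(candidate)
                self.leader = candidate
                return candidate
        return None  # no live Follower available to promote


sup = HypotheticalSupervisor("server-a", ["server-b"], timeout_s=5.0)
sup.heartbeat("server-a", now=0.0)
sup.heartbeat("server-b", now=0.0)
print(sup.check(now=1.0))  # None: the Leader is still alive
sup.heartbeat("server-b", now=8.0)
print(sup.check(now=8.0))  # server-b: the Leader has been silent for 8 s > 5 s
```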

The Follower denies all read and write requests from client applications. Only the replication itself is allowed to access the Follower's data until the Follower becomes the new Leader (should a failover happen).
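A minimal sketch of that gatekeeping, assuming each server knows its own role and the current Leader's address: client requests to a Follower are refused with HTTP 503 and an `X-Arango-Endpoint` header pointing at the Leader, while the Follower's own replication writes and the `/_api/cluster/endpoints` API (documented as available on Followers) are let through. `hypothetical_gate` and its parameters are invented for illustration; a real server makes these distinctions internally.

```python
from typing import Dict, Tuple

# Client-facing paths a Follower still answers; the endpoints API is documented as available on Followers.
FOLLOWER_ALLOWED_PATHS = {"/_api/cluster/endpoints"}


def hypothetical_gate(role: str, path: str, from_replication: bool,
                      leader_endpoint: str) -> Tuple[int, Dict[str, str]]:
    """Decide whether a request may proceed; returns (HTTP status, response headers)."""
    if role == "Leader":
        return 200, {}
    if from_replication or path in FOLLOWER_ALLOWED_PATHS:
        # The Follower's own replication and the endpoints API are allowed.
        return 200, {}
    # Any other client read or write is refused and pointed at the Leader.
    return 503, {"X-Arango-Endpoint": leader_endpoint}


print(hypothetical_gate("Follower", "/_api/document/users/alice", False, "http://[::1]:8531"))
# (503, {'X-Arango-Endpoint': 'http://[::1]:8531'})
```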

When a client sends a request to read or write data to a Follower, the Follower always responds with HTTP 503 (Service Unavailable) and provides the address of the current Leader in the X-Arango-Endpoint header. Client applications and drivers can use this information to make a follow-up request to the proper Leader:

HTTP/1.1 503 Service Unavailable
X-Arango-Endpoint: http://[::1]:8531
....

Client applications can also find out which servers are currently the Leader and the Followers by calling the /_api/cluster/endpoints REST API. This API is accessible on the Leader and the Followers alike.
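A client could use that API to refresh its view of the setup. The parser below is a sketch that assumes the response body carries an `endpoints` array of objects with an `endpoint` string and that the current Leader is listed first; check both assumptions against the REST API reference for your ArangoDB version. The function name `split_endpoints` and the sample body are hypothetical.

```python
import json
from typing import List, Tuple


def split_endpoints(body: str) -> Tuple[str, List[str]]:
    """Return (leader, followers) from a /_api/cluster/endpoints JSON body.

    Assumes an "endpoints" array of {"endpoint": ...} objects with the
    current Leader first.
    """
    payload = json.loads(body)
    endpoints = [entry["endpoint"] for entry in payload.get("endpoints", [])]
    if not endpoints:
        raise ValueError("response lists no endpoints")
    return endpoints[0], endpoints[1:]


sample = ('{"error": false, "code": 200, "endpoints": '
          '[{"endpoint": "tcp://[::1]:8531"}, {"endpoint": "tcp://[::1]:8530"}]}')
print(split_endpoints(sample))  # ('tcp://[::1]:8531', ['tcp://[::1]:8530'])
```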

The ArangoDB Starter tool supports starting two servers with asynchronous replication and failover out of the box.

The arangojs driver for JavaScript, the Go driver, the Java driver and the PHP driver support active failover: when the currently accessed server endpoint responds with HTTP 503, they redirect the request to the current Leader.
