Architecture
------------

The cluster architecture of ArangoDB is a CP master/master model with no
single point of failure. With "CP" we mean that in the presence of a
network partition, the database prefers internal consistency over
availability. With "master/master" we mean that clients can send their
requests to an arbitrary node, and experience the same view on the
database regardless. "No single point of failure" means that the cluster
can continue to serve requests, even if one machine fails completely.

In this way, ArangoDB has been designed as a distributed multi-model
database. This section gives a short outline of the cluster architecture and
how the above features and capabilities are achieved.

### Structure of an ArangoDB cluster

An ArangoDB cluster consists of a number of ArangoDB instances
which talk to each other over the network. They play different roles,
which will be explained in detail below. The current configuration
of the cluster is held in the "Agency", which is a highly-available
resilient key/value store based on an odd number of ArangoDB instances
running the [Raft Consensus Protocol](https://raft.github.io/).

For the various instances in an ArangoDB cluster there are 4 distinct
roles: Agents, Coordinators, Primary and Secondary DBservers. In the
following sections we will shed light on each of them. Note that the
tasks for all roles run the same binary from the same Docker image.

#### Agents

One or multiple Agents form the Agency in an ArangoDB cluster. The
Agency is the central place to store the configuration in a cluster. It
performs leader elections and provides other synchronization services for
the whole cluster. Without the Agency none of the other components can
operate.

While generally invisible to the outside, it is the heart of the
cluster. As such, fault tolerance is of course a must-have for the
Agency. To achieve that, the Agents use the [Raft Consensus
Algorithm](https://raft.github.io/). The algorithm formally guarantees
conflict-free configuration management within the ArangoDB cluster.

At its core the Agency manages a big configuration tree. It supports
transactional read and write operations on this tree, and other servers
can subscribe to HTTP callbacks for all changes to the tree.

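To make the tree operations concrete, here is a minimal sketch of a transactional
write followed by a read against the Agency's HTTP API. The agent address
`http://localhost:8531` and the key path `/example/setting` are assumptions for
illustration, the exact request format may vary between ArangoDB versions, and
normally only cluster-internal components talk to the Agency directly.

```python
import json
import requests

AGENT = "http://localhost:8531"  # assumed address of one Agency instance

# Write transaction: a list of transactions, each mapping paths to operations.
write_body = [[{"/example/setting": {"op": "set", "new": 42}}]]
w = requests.post(f"{AGENT}/_api/agency/write", data=json.dumps(write_body))
w.raise_for_status()

# Read transaction: a list of lists of paths to be fetched atomically.
read_body = [["/example/setting"]]
r = requests.post(f"{AGENT}/_api/agency/read", data=json.dumps(read_body))
print(r.json())  # e.g. [{"example": {"setting": 42}}]
```
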
#### Coordinators

Coordinators should be accessible from the outside. These are the ones
the clients talk to. They will coordinate cluster tasks like
executing queries and running Foxx services. They know where the
data is stored and will optimize where to run user supplied queries or
parts thereof. Coordinators are stateless and can thus easily be shut down
and restarted as needed.

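For illustration, the sketch below sends an AQL query to a coordinator over the
HTTP cursor API. The coordinator address, the database, and the collection name
are assumptions; any coordinator in the cluster could answer the same request.

```python
import requests

COORDINATOR = "http://localhost:8529"  # assumed coordinator address

query = {
    "query": "FOR doc IN @@col FILTER doc.age > @min RETURN doc",
    "bindVars": {"@col": "users", "min": 21},
    "batchSize": 100,
}

# The coordinator plans the query, fans it out to the DBservers that hold
# the relevant shards, and returns the merged result to the client.
resp = requests.post(f"{COORDINATOR}/_db/_system/_api/cursor", json=query)
resp.raise_for_status()
body = resp.json()
print(body["result"])   # first batch of documents
print(body["hasMore"])  # True if more batches can be fetched via the cursor id
```
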
#### Primary DBservers

Primary DBservers are the ones where the data is actually hosted. They
host shards of data and, using synchronous replication, a primary may be
either leader or follower for a shard.

They should not be accessed from the outside but only indirectly through
the coordinators. They may also execute queries in part or as a whole when
asked by a coordinator.

#### Secondaries

Secondary DBservers are asynchronous replicas of primaries. If one is
using only synchronous replication, one does not need secondaries at all.
For each primary, there can be one or more secondaries. Since the
replication works asynchronously (eventual consistency), the replication
does not impede the performance of the primaries. On the other hand,
their replica of the data can be slightly out of date. The secondaries
are perfectly suitable for backups as they don't interfere with the
normal cluster operation.

#### Cluster ID

Every non-Agency ArangoDB instance in a cluster is assigned a unique
ID during its startup. Using its ID a node is identifiable
throughout the cluster. All cluster operations will communicate
via this ID.

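As a small illustration, a node's ID can be retrieved over HTTP. The sketch
below is an assumption-laden example: it assumes a locally reachable instance
and the `/_admin/server/id` route, and the ID format shown in the comment is
illustrative only.

```python
import requests

# Assumed endpoint of a coordinator; other cluster roles answer the same route.
resp = requests.get("http://localhost:8529/_admin/server/id")
resp.raise_for_status()
print(resp.json())  # e.g. {"id": "CRDN-..."} -- the exact format is illustrative
```
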
### Sharding

Using the roles outlined above an ArangoDB cluster is able to distribute
data in so-called shards across multiple primaries. From the outside
this process is fully transparent and as such we achieve the goals of
what other systems call "master-master replication". In an ArangoDB
cluster you talk to any coordinator and whenever you read or write data,
it will automatically figure out where the data is stored (read) or to
be stored (write). The information about the shards is shared across the
coordinators using the Agency.

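As an example of how sharding is set up, the sketch below creates a collection
whose data is spread over several shards by talking to a coordinator. The
address, collection name, and shard count are assumptions chosen for
illustration; `numberOfShards` and `shardKeys` are the collection properties
involved.

```python
import requests

COORDINATOR = "http://localhost:8529"  # assumed coordinator address

payload = {
    "name": "users",
    "numberOfShards": 5,       # distribute the collection over 5 shards
    "shardKeys": ["country"],  # documents are mapped to shards by this attribute
}

resp = requests.post(f"{COORDINATOR}/_db/_system/_api/collection", json=payload)
resp.raise_for_status()
print(resp.json())  # the shard layout is recorded in the Agency
```
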
Also see [Sharding](../Administration/Sharding/README.md) in the
Administration chapter.

### Many sensible configurations

This architecture is very flexible and thus allows many configurations,
which are suitable for different usage scenarios:

1. The default configuration is to run exactly one coordinator and
   one primary DBserver on each machine. This achieves the classical
   master/master setup, since there is a perfect symmetry between the
   different nodes: clients can equally well talk to any one of the
   coordinators, and all expose the same view of the data store.
2. One can deploy more coordinators than DBservers. This is a sensible
   approach if one needs a lot of CPU power for the Foxx services,
   because they run on the coordinators.
3. One can deploy more DBservers than coordinators if more data capacity
   is needed and the query performance is the lesser bottleneck.
4. One can deploy a coordinator on each machine where an application
   server (e.g. a node.js server) runs, and the Agents and DBservers
   on a separate set of machines elsewhere. This avoids a network hop
   between the application server and the database and thus decreases
   latency. Essentially, this moves some of the database distribution
   logic to the machine where the client runs.

These four shall suffice for now. The important piece of information here
is that the coordinator layer can be scaled and deployed independently
from the DBserver layer.

### Replication

ArangoDB offers two ways of data replication within a cluster, synchronous
and asynchronous. In this section we explain some details and highlight
the advantages and disadvantages of each.

#### Synchronous replication with automatic fail-over

Synchronous replication works on a per-shard basis. One configures, for
each collection, how many copies of each shard are kept in the cluster.
At any given time, one of the copies is declared to be the "leader" and
all other replicas are "followers". Write operations for this shard
are always sent to the DBserver which happens to hold the leader copy,
which in turn replicates the changes to all followers before the operation
is considered to be done and reported back to the coordinator.
Read operations are all served by the server holding the leader copy;
this makes it possible to provide snapshot semantics for complex transactions.

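The number of copies per shard is controlled by the `replicationFactor`
property of a collection. The following sketch creates a collection whose
shards are kept in two synchronously replicated copies; the coordinator
address and the collection name are assumptions.

```python
import requests

COORDINATOR = "http://localhost:8529"  # assumed coordinator address

payload = {
    "name": "orders",
    "numberOfShards": 4,
    "replicationFactor": 2,  # one leader plus one in-sync follower per shard
}

resp = requests.post(f"{COORDINATOR}/_db/_system/_api/collection", json=payload)
resp.raise_for_status()
print(resp.json()["name"], "created with synchronous replication")
```
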
If a DBserver fails that holds a follower copy of a shard, then the leader
can no longer synchronize its changes to that follower. After a short timeout
(3 seconds), the leader gives up on the follower, declares it to be
out of sync, and continues service without the follower. When the server
with the follower copy comes back, it automatically resynchronizes its
data with the leader and synchronous replication is restored.

If a DBserver fails that holds a leader copy of a shard, then the leader
can no longer serve any requests. It will no longer send a heartbeat to
the Agency. Therefore, a supervision process running in the Raft leader
of the Agency can take the necessary action (after 15 seconds of missing
heartbeats), namely to promote one of the servers that hold in-sync
replicas of the shard to leader for that shard. This involves a
reconfiguration in the Agency and means that coordinators
now contact a different DBserver for requests to this shard. Service
resumes. The other surviving replicas automatically resynchronize their
data with the new leader. When the DBserver with the original leader
copy comes back, it notices that it now holds a follower replica,
resynchronizes its data with the new leader, and order is restored.

All shard data synchronizations are done in an incremental way, such that
resynchronizations are quick. This technology makes it possible to move shards
(follower and leader ones) between DBservers without service interruptions.
Therefore, an ArangoDB cluster can move all the data on a specific DBserver
to other DBservers and then shut down that server in a controlled way.
This makes it possible to scale down an ArangoDB cluster without service
interruption, loss of fault tolerance or data loss. Furthermore, one can
re-balance the distribution of the shards, either manually or automatically.

All these operations can be triggered via a REST/JSON API or via the
graphical web UI. All fail-over operations are completely handled within
the ArangoDB cluster.

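As one example of such an operation, a shard can be moved from one DBserver to
another through the cluster administration REST API. Treat the sketch below as
a rough, assumption-heavy illustration: the `/_admin/cluster/moveShard` route,
its payload fields, and all identifiers used here are placeholders that may
differ between ArangoDB versions, so check the HTTP API reference of your
release.

```python
import requests

COORDINATOR = "http://localhost:8529"  # assumed coordinator address

# All identifiers below are placeholders for illustration only.
move_request = {
    "database": "_system",
    "collection": "orders",
    "shard": "s100001",
    "fromServer": "DBServer0001",
    "toServer": "DBServer0002",
}

resp = requests.post(f"{COORDINATOR}/_admin/cluster/moveShard", json=move_request)
print(resp.status_code, resp.json())
```
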
Obviously, synchronous replication involves a certain increased latency for
write operations, simply because there is one more network hop within the
cluster for every request. Therefore the user can set the replication factor
to 1, which means that only one copy of each shard is kept, thereby
switching off synchronous replication. This is a suitable setting for
less important or easily recoverable data for which low latency write
operations matter.

#### Asynchronous replication with automatic fail-over

Asynchronous replication works differently, in that it is organized
using primary and secondary DBservers. Each secondary server replicates
all the data held on a primary by polling in an asynchronous way. This
process has very little impact on the performance of the primary. The
disadvantage is that there is a delay between the confirmation of a
write operation that is sent to the client and the actual replication of
the data. If the primary server fails during this delay, then committed
and confirmed data can be lost.

Nevertheless, we also offer automatic fail-over with this setup. Contrary
to the synchronous case, here the fail-over management is done from outside
the ArangoDB cluster. In a future version we might move this management
into the supervision process in the Agency, but as of now, the management
is done via the Mesos framework scheduler for ArangoDB (see below).

The granularity of the replication is a whole ArangoDB instance with
all data that resides on that instance, which means that
you need twice as many instances as without asynchronous replication.
Synchronous replication is more flexible in that respect: you can have
smaller and larger instances, and if one fails, the data can be rebalanced
across the remaining ones.

### Microservices and zero administration

The design and capabilities of ArangoDB are geared towards usage in
modern microservice architectures of applications. With the
[Foxx services](../Foxx/README.md) it is very easy to deploy a data-centric
microservice within an ArangoDB cluster.

In addition, one can deploy multiple instances of ArangoDB within the
same project. One part of the project might need a scalable document
store, another might need a graph database, and yet another might need
the full power of a multi-model database actually mixing the various
data models. There are enormous efficiency benefits to be reaped by
being able to use a single technology for various roles in a project.

To simplify the life of the DevOps teams in such a scenario we try as much
as possible to use a zero-administration approach for ArangoDB. A running
ArangoDB cluster is resilient against failures and essentially repairs
itself in case of temporary failures. See the next section for further
capabilities in this direction.

### Apache Mesos integration

For the distributed setup, we use the Apache Mesos infrastructure by default.
ArangoDB is a fully certified package for DC/OS and can thus
be deployed essentially with a few mouse clicks or a single command, once
you have an existing DC/OS cluster. But even on a plain Apache Mesos cluster
one can deploy ArangoDB via Marathon with a single API call and some JSON
configuration.

The advantage of this approach is that we can not only implement the
initial deployment, but also the later management of automatic
replacement of failed instances and the scaling of the ArangoDB cluster
(triggered manually or even automatically). Since all manipulations are
done either via the graphical web UI or via JSON/REST calls, one can even
implement auto-scaling very easily.

A DC/OS cluster is a very natural environment to deploy microservice
architectures, since it is so convenient to deploy various services,
including potentially multiple ArangoDB cluster instances within the
same DC/OS cluster. The built-in service discovery makes it extremely
simple to connect the various microservices and Mesos automatically
takes care of the distribution and deployment of the various tasks.

See the [Deployment](../Deployment/README.md) chapter and its subsections
for instructions.

It is possible to deploy an ArangoDB cluster by simply launching a bunch of
Docker containers with the right command line options to link them up,
or even on a single machine starting multiple ArangoDB processes. In that
case, synchronous replication will work within the deployed ArangoDB cluster,
and automatic fail-over in the sense that the duties of a failed server will
automatically be assigned to another, surviving one. However, since the
ArangoDB cluster cannot within itself launch additional instances, replacement
of failed nodes is not automatic and scaling up and down has to be managed
manually. This is why we do not recommend this setup for production
deployment.