
Merge branch 'devel' of github.com:arangodb/arangodb into devel

This commit is contained in:
hkernbach 2016-06-22 00:54:43 +02:00
commit 3d7a507a7c
3 changed files with 75 additions and 25 deletions


An ArangoDB cluster consists of a number of ArangoDB instances
which talk to each other over the network. They play different roles,
which will be explained in detail below. The current configuration
of the cluster is held in the "Agency", which is a highly-available
resilient key/value store based on an odd number of ArangoDB instances
running [Raft Consensus Protocol](https://raft.github.io/).
For the various instances in an ArangoDB cluster there are 4 distinct
roles: Agents, Coordinators, Primary and Secondary DBservers. In the
following sections we will shed light on each of them. Note that the
tasks for all roles run the same binary from the same Docker image.
!SUBSUBSECTION Agents
asked by a coordinator.
!SUBSUBSECTION Secondaries
Secondary DBservers are asynchronous replicas of primaries. If one is
using only synchronous replication, one does not need secondaries at all.
For each primary, there can be one or more secondaries. Since the
replication works asynchronously (eventual consistency), the replication
does not impede the performance of the primaries. On the other hand,
their replica of the data can be slightly out of date. The secondaries
are perfectly suitable for backups as they don't interfere with the
normal cluster operation.
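The asynchronous replication described above can be sketched in a few lines. This is an illustrative in-memory model, not ArangoDB's actual replication protocol, and all names in it are made up:

```javascript
// Illustrative model of asynchronous replication: the primary
// acknowledges writes immediately and ships them to the secondary
// later, so the secondary may briefly lag behind (eventual
// consistency) without slowing the primary down.
const primary = new Map();
const secondary = new Map();
const replicationQueue = [];

function writeToPrimary(key, value) {
  primary.set(key, value);             // acknowledged right away
  replicationQueue.push([key, value]); // shipped to the secondary later
}

function replicateOnce() {
  // imagine this running periodically on the secondary
  while (replicationQueue.length > 0) {
    const [key, value] = replicationQueue.shift();
    secondary.set(key, value);
  }
}

writeToPrimary("doc1", { value: 1 });
console.log(secondary.has("doc1")); // false: the secondary is out of date
replicateOnce();
console.log(secondary.has("doc1")); // true: now caught up
```

A backup taken from the secondary in this model reads a consistent, slightly older state without touching the primary at all.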
!SUBSUBSECTION Cluster ID
Every non-Agency ArangoDB instance in a cluster is assigned a unique
ID during its startup. Using its ID a node is identifiable
throughout the cluster. All cluster operations will communicate
via this ID.
!SUBSECTION Sharding
modern microservice architectures of applications. With the
[Foxx services](../Foxx/README.md) it is very easy to deploy a data
centric microservice within an ArangoDB cluster.
In addition, one can deploy multiple instances of ArangoDB within the
same project. One part of the project might need a scalable document
store, another might need a graph database, and yet another might need
the full power of a multi-model database actually mixing the various
capabilities in this direction.
!SUBSECTION Apache Mesos integration
For the distributed setup, we use the Apache Mesos infrastructure by default.
ArangoDB is a fully certified package for DC/OS and can thus
be deployed essentially with a few mouse clicks or a single command, once
you have an existing DC/OS cluster. But even on a plain Apache Mesos cluster
one can deploy ArangoDB via Marathon with a single API call and some JSON
even further you can install a reverse proxy like haproxy or nginx in
front of the coordinators (that will also allow easy access from the
application).
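As a sketch of such a setup, an haproxy configuration balancing over three coordinators could look like the following (the coordinator addresses are hypothetical; 8529 is ArangoDB's default port):

```
# Illustrative haproxy fragment (hypothetical coordinator addresses):
# clients connect to the proxy, which round-robins over the coordinators.
frontend arangodb_in
    bind *:8529
    mode http
    default_backend coordinators

backend coordinators
    mode http
    balance roundrobin
    server coord1 10.0.0.1:8529 check
    server coord2 10.0.0.2:8529 check
    server coord3 10.0.0.3:8529 check
```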
Authentication in the cluster will be added soon after the initial 3.0
release.


primary key and all these operations scale linearly. If the sharding is
done using different shard keys, then a lookup of a single key involves
asking all shards and thus does not scale linearly.
!SUBSECTION Document store
For the document store case even in the presence of secondary indexes
essentially the same arguments apply, since an index for a sharded
single document operations still scale linearly with the size of the
cluster, unless a special sharding configuration makes lookups or
write operations more expensive.
For a deeper analysis of this topic see
[this blog post](https://mesosphere.com/blog/2015/11/30/arangodb-benchmark-dcos/)
in which good linear scalability of ArangoDB for single document operations
is demonstrated.
!SUBSECTION Complex queries and joins
The AQL query language allows complex queries, using multiple
collections, secondary indexes as well as joins. In particular with
the latter, scaling can be a challenge, since if the data to be
joined resides on different machines, a lot of communication
has to happen. The AQL query execution engine organizes a data
pipeline across the cluster to put together the results in the
most efficient way. The query optimizer is aware of the cluster
structure and knows what data is where and how it is indexed.
Therefore, it can arrive at an informed decision about what parts
of the query ought to run where in the cluster.
Nevertheless, for certain complicated joins, there are limits as
to what can be achieved. A very important case that can be
optimized relatively easily is if one of the collections involved
in the join is small enough that it is possible to
replicate its data on all machines. We call such a collection a
"satellite collection". Due to the replication a join involving
such a collection can be executed locally without too much
communication overhead.
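The effect can be illustrated with plain in-memory data. This is a conceptual model, not ArangoDB's query executor, and the collection names are made up:

```javascript
// Illustrative model of a join against a satellite collection: the
// large collection is split into shards, while the small collection is
// fully replicated next to every shard, so each shard joins locally.

// Shards of the large collection, conceptually on different machines:
const orderShards = [
  [{ order: 1, customer: "c1" }, { order: 2, customer: "c2" }],
  [{ order: 3, customer: "c1" }],
];

// The small "satellite" collection, replicated on every machine:
const customers = new Map([["c1", "Alice"], ["c2", "Bob"]]);

// Each shard resolves the join against its local replica; no rows of
// the large collection cross the network for the join itself.
const joined = orderShards.flatMap((shard) =>
  shard.map((o) => ({ order: o.order, customerName: customers.get(o.customer) }))
);
console.log(joined.length); // 3
```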
!SUBSECTION Graph database
Graph databases are particularly good at queries on graphs that involve
paths in the graph whose length is not known a priori. Finding
the shortest path between two vertices, or finding all paths
that match a certain pattern starting at a given vertex, are
examples of such queries.
However, if the vertices and edges along the occurring paths are
distributed across the cluster, then a lot of communication is
necessary between nodes, and performance suffers. To achieve good
performance at scale, it is therefore necessary to get the
distribution of the graph data across the shards in the cluster
right. Most of the time, the application developers and users of
ArangoDB know best how their graphs are structured. Therefore,
ArangoDB allows users to specify according to which attributes
the graph data is sharded. A useful first step is usually to make
sure that the edges originating at a vertex reside on the same
cluster node as the vertex.


resilience by means of replication and automatic failover. Furthermore,
one can build systems that scale their capacity dynamically up and down
automatically according to demand.
One can also scale ArangoDB vertically, that is, by using
ever larger servers. There is no built-in limitation in ArangoDB;
for example, the server will automatically use more threads if
more CPUs are present.
However, scaling vertically has the disadvantage that the
costs grow faster than linearly with the size of the server, and
none of the resilience and dynamic scaling capabilities can be achieved
in this way.