
Merge branch 'devel' of github.com:arangodb/arangodb into devel

This commit is contained in:
hkernbach 2016-06-22 00:54:43 +02:00
commit 3d7a507a7c
3 changed files with 75 additions and 25 deletions


An ArangoDB cluster consists of a number of ArangoDB instances
which talk to each other over the network. They play different roles,
which will be explained in detail below. The current configuration
of the cluster is held in the "Agency", which is a highly-available
resilient key/value store based on an odd number of ArangoDB instances
running [Raft Consensus Protocol](https://raft.github.io/).
For the various instances in an ArangoDB cluster there are 4 distinct
roles: Agents, Coordinators, Primary and Secondary DBservers. In the
following sections we will shed light on each of them. Note that the
tasks for all roles run the same binary from the same Docker image.
!SUBSUBSECTION Agents
asked by a coordinator.
!SUBSUBSECTION Secondaries
Secondary DBservers are asynchronous replicas of primaries. If one is
using only synchronous replication, one does not need secondaries at all.
For each primary, there can be one or more secondaries. Since the
replication works asynchronously (eventual consistency), the replication
does not impede the performance of the primaries. On the other hand,
their replica of the data can be slightly out of date. The secondaries
are perfectly suitable for backups as they don't interfere with the
normal cluster operation.
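The asynchronous replication described above can be sketched in a few lines. This is an illustrative in-memory model, not ArangoDB's actual replication protocol, and all names in it are made up:

```javascript
// Illustrative model of asynchronous replication: the primary
// acknowledges writes immediately and ships them to the secondary
// later, so the secondary may briefly lag behind (eventual
// consistency) without slowing the primary down.
const primary = new Map();
const secondary = new Map();
const replicationQueue = [];

function writeToPrimary(key, value) {
  primary.set(key, value);             // acknowledged right away
  replicationQueue.push([key, value]); // shipped to the secondary later
}

function replicateOnce() {
  // imagine this running periodically on the secondary
  while (replicationQueue.length > 0) {
    const [key, value] = replicationQueue.shift();
    secondary.set(key, value);
  }
}

writeToPrimary("doc1", { value: 1 });
console.log(secondary.has("doc1")); // false: the secondary is out of date
replicateOnce();
console.log(secondary.has("doc1")); // true: now caught up
```

A backup taken from the secondary in this model reads a consistent, slightly older state without touching the primary at all.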
!SUBSUBSECTION Cluster ID
Every non-Agency ArangoDB instance in a cluster is assigned a unique
ID during its startup. Using its ID a node is identifiable
throughout the cluster. All cluster operations will communicate
via this ID.
!SUBSECTION Sharding
modern microservice architectures of applications. With the
[Foxx services](../Foxx/README.md) it is very easy to deploy a data
centric microservice within an ArangoDB cluster.
In addition, one can deploy multiple instances of ArangoDB within the
same project. One part of the project might need a scalable document
store, another might need a graph database, and yet another might need
the full power of a multi-model database actually mixing the various
capabilities in this direction.
!SUBSECTION Apache Mesos integration
For the distributed setup, we use the Apache Mesos infrastructure by default.
ArangoDB is a fully certified package for DC/OS and can thus
be deployed essentially with a few mouse clicks or a single command, once
you have an existing DC/OS cluster. But even on a plain Apache Mesos cluster
one can deploy ArangoDB via Marathon with a single API call and some JSON
even further you can install a reverse proxy like haproxy or nginx in
front of the coordinators (that will also allow easy access from the
application).
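As a sketch of such a setup, an haproxy configuration balancing over three coordinators could look like the following (the coordinator addresses are hypothetical; 8529 is ArangoDB's default port):

```
# Illustrative haproxy fragment (hypothetical coordinator addresses):
# clients connect to the proxy, which round-robins over the coordinators.
frontend arangodb_in
    bind *:8529
    mode http
    default_backend coordinators

backend coordinators
    mode http
    balance roundrobin
    server coord1 10.0.0.1:8529 check
    server coord2 10.0.0.2:8529 check
    server coord3 10.0.0.3:8529 check
```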
Authentication in the cluster will be added soon after the initial 3.0
release.


primary key and all these operations scale linearly. If the sharding is
done using different shard keys, then a lookup of a single key involves
asking all shards and thus does not scale linearly.
!SUBSECTION Document store
For the document store case even in the presence of secondary indexes
essentially the same arguments apply, since an index for a sharded
single document operations still scale linearly with the size of the
cluster, unless a special sharding configuration makes lookups or
write operations more expensive.
For a deeper analysis of this topic see
[this blog post](https://mesosphere.com/blog/2015/11/30/arangodb-benchmark-dcos/)
in which good linear scalability of ArangoDB for single document operations
is demonstrated.
!SUBSECTION Complex queries and joins
The AQL query language allows complex queries, using multiple
collections, secondary indexes as well as joins. In particular with
the latter, scaling can be a challenge, since if the data to be
joined resides on different machines, a lot of communication
has to happen. The AQL query execution engine organizes a data
pipeline across the cluster to put together the results in the
most efficient way. The query optimizer is aware of the cluster
structure and knows what data is where and how it is indexed.
Therefore, it can arrive at an informed decision about what parts
of the query ought to run where in the cluster.
Nevertheless, for certain complicated joins, there are limits as
to what can be achieved. A very important case that can be
optimized relatively easily is if one of the collections involved
in the join is small enough that it is possible to
replicate its data on all machines. We call such a collection a
"satellite collection". Due to the replication a join involving
such a collection can be executed locally without too much
communication overhead.
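The effect can be illustrated with plain in-memory data. This is a conceptual model, not ArangoDB's query executor, and the collection names are made up:

```javascript
// Illustrative model of a join against a satellite collection: the
// large collection is split into shards, while the small collection is
// fully replicated next to every shard, so each shard joins locally.

// Shards of the large collection, conceptually on different machines:
const orderShards = [
  [{ order: 1, customer: "c1" }, { order: 2, customer: "c2" }],
  [{ order: 3, customer: "c1" }],
];

// The small "satellite" collection, replicated on every machine:
const customers = new Map([["c1", "Alice"], ["c2", "Bob"]]);

// Each shard resolves the join against its local replica; no rows of
// the large collection cross the network for the join itself.
const joined = orderShards.flatMap((shard) =>
  shard.map((o) => ({ order: o.order, customerName: customers.get(o.customer) }))
);
console.log(joined.length); // 3
```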
!SUBSECTION Graph database
Graph databases are particularly good at queries on graphs that involve
paths in the graph whose length is not known a priori. Finding
the shortest path between two vertices, or finding all paths
that match a certain pattern starting at a given vertex, are
examples of such queries.
However, if the vertices and edges along the occurring paths are
distributed across the cluster, then a lot of communication is
necessary between nodes, and performance suffers. To achieve good
performance at scale, it is therefore necessary to get the
distribution of the graph data across the shards in the cluster
right. Most of the time, the application developers and users of
ArangoDB know best how their graphs are structured. Therefore,
ArangoDB allows users to specify according to which attributes
the graph data is sharded. A useful first step is usually to make
sure that the edges originating at a vertex reside on the same
cluster node as the vertex.


resilience by means of replication and automatic failover. Furthermore,
one can build systems that scale their capacity dynamically up and down
automatically according to demand.
One can also scale ArangoDB vertically, that is, by using
ever larger servers. There is no built-in limitation in ArangoDB;
for example, the server will automatically use more threads if
more CPUs are present.
However, scaling vertically has the disadvantage that the
costs grow faster than linearly with the size of the server, and
none of the resilience and dynamic scaling capabilities can be achieved
in this way.