Merge branch 'devel' of github.com:arangodb/arangodb into devel

2016-06-22 00:54:43 +02:00 · 2016-06-22 00:54:43 +02:00 · 3d7a507a7c
parent 95e9062c49 7e32938277
commit 3d7a507a7c
3 changed files with 75 additions and 25 deletions
--- a/Documentation/Books/Manual/Scalability/Architecture.mdpp
+++ b/Documentation/Books/Manual/Scalability/Architecture.mdpp
@ -18,18 +18,13 @@ An ArangoDB cluster consists of a number of ArangoDB instances
 which talk to each other over the network. They play different roles,
 which will be explained in detail below. The current configuration 
 of the cluster is held in the "Agency", which is a highly-available
-resilient key/value store based on an odd number of ArangoDB instances.
+resilient key/value store based on an odd number of ArangoDB instances
-
+running [Raft Consensus Protocol](https://raft.github.io/).
 !SUBSUBSECTION Cluster ID
 Every non-Agency ArangoDB instance in a cluster is assigned a unique
 ID during its startup. Using its ID a node is identifiable
 throughout the cluster. All cluster operations will communicate
 via this ID.
 For the various instances in an ArangoDB cluster there are 4 distinct
 roles: Agents, Coordinators, Primary and Secondary DBservers. In the
-following sections we will shed light on each of them.
+following sections we will shed light on each of them. Note that the
 tasks for all roles run the same binary from the same Docker image.
 !SUBSUBSECTION Agents
@ -70,13 +65,21 @@ asked by a coordinator.
 !SUBSUBSECTION Secondaries
-Secondary DBservers are asynchronous replicas of primaries. For each
+Secondary DBservers are asynchronous replicas of primaries. If one is
-primary, there can be one ore more secondaries. Since the replication
+using only synchronous replication, one does not need secondaries at all.
-works asynchronously (eventual consistency), the replication does not
+For each primary, there can be one or more secondaries. Since the
-impede the performance of the primaries. On the other hand, their
+replication works asynchronously (eventual consistency), the replication
-replica of the data can be slightly out of date. The secondaries are
+does not impede the performance of the primaries. On the other hand,
-perfectly suitable for backups as they don't interfere with the normal
+their replica of the data can be slightly out of date. The secondaries
-cluster operation.
+are perfectly suitable for backups as they don't interfere with the
 normal cluster operation.
 !SUBSUBSECTION Cluster ID
 Every non-Agency ArangoDB instance in a cluster is assigned a unique
 ID during its startup. Using its ID a node is identifiable
 throughout the cluster. All cluster operations will communicate
 via this ID.
 !SUBSECTION Sharding
@ -205,7 +208,7 @@ modern microservice architectures of applications. With the
 [Foxx services](../Foxx/README.md) it is very easy to deploy a data
 centric microservice within an ArangoDB cluster.
-Alternatively, one can deploy multiple instances of ArangoDB within the
+In addition, one can deploy multiple instances of ArangoDB within the
 same project. One part of the project might need a scalable document
 store, another might need a graph database, and yet another might need
 the full power of a multi-model database actually mixing the various
@ -221,7 +224,7 @@ capabilities in this direction.
 !SUBSECTION Apache Mesos integration
 For the distributed setup, we use the Apache Mesos infrastructure by default.
-ArangoDB is a fully certified package for the Mesosphere DC/OS and can thus
+ArangoDB is a fully certified package for DC/OS and can thus
 be deployed essentially with a few mouse clicks or a single command, once
 you have an existing DC/OS cluster. But even on a plain Apache Mesos cluster
 one can deploy ArangoDB via Marathon with a single API call and some JSON 
@ -268,3 +271,5 @@ even further you can install a reverse proxy like haproxy or nginx in
 front of the coordinators (that will also allow easy access from the
 application).
 Authentication in the cluster will be added soon after the initial 3.0
 release.
--- a/Documentation/Books/Manual/Scalability/DataModels.mdpp
+++ b/Documentation/Books/Manual/Scalability/DataModels.mdpp
@ -17,7 +17,7 @@ primary key and all these operations scale linearly. If the sharding is
 done using different shard keys, then a lookup of a single key involves
 asking all shards and thus does not scale linearly.
-!SUBSECTION document store
+!SUBSECTION Document store
 For the document store case even in the presence of secondary indexes
 essentially the same arguments apply, since an index for a sharded
@ -26,10 +26,51 @@ single document operations still scale linearly with the size of the
 cluster, unless a special sharding configuration makes lookups or
 write operations more expensive.
-!SUBSECTION complex queries and joins
+For a deeper analysis of this topic see 
 [this blog post](https://mesosphere.com/blog/2015/11/30/arangodb-benchmark-dcos/)
 in which good linear scalability of ArangoDB for single document operations
 is demonstrated.
 TODO
-!SUBSECTION graph database
+!SUBSECTION Complex queries and joins
-TODO
+The AQL query language allows complex queries, using multiple
 collections, secondary indexes as well as joins. In particular with
 the latter, scaling can be a challenge, since if the data to be
 joined resides on different machines, a lot of communication
 has to happen. The AQL query execution engine organises a data
 pipeline across the cluster to put together the results in the
 most efficient way. The query optimizer is aware of the cluster
 structure and knows what data is where and how it is indexed.
 Therefore, it can arrive at an informed decision about what parts
 of the query ought to run where in the cluster.
 Nevertheless, for certain complicated joins, there are limits as
 to what can be achieved. A very important case that can be
 optimized relatively easily is if one of the collections involved
 in the join is small enough such that it is possible to
 replicated its data on all machines. We call such a collection a
 "satellite collection". Due to the replication a join involving
 such a collection can be executed locally without too much
 communication overhead.
 !SUBSECTION Graph database
 Graph databases are particularly good at queries on graphs that involve
 paths in the graph of an a priori unknown length. For example, finding
 the shortest path between two vertices in a graph, or finding all
 paths that match a certain pattern starting at a given vertex are such
 example.
 However, if the vertices and edges along the occurring paths are
 distributed across the cluster, then a lot of communication is
 necessary between nodes, and performance suffers. To achieve good
 performance at scale, it is therefore necessary, to get the
 distribution of the graph data across the shards in the cluster
 right. Most of the time, the application developers and users of
 ArangoDB know best, how their graphs a structured. Therefore, 
 ArangoDB allows users to specify, according to which attributes
 the graph data is sharded. A useful first step is usually to make
 sure that the edges originating at a vertex reside on the same
 cluster node as the vertex.
--- a/Documentation/Books/Manual/Scalability/README.mdpp
+++ b/Documentation/Books/Manual/Scalability/README.mdpp
@ -8,8 +8,12 @@ resilience by means of replication and automatic failover. Furthermore,
 one can build systems that scale their capacity dynamically up and down 
 automatically according to demand.
-Obviously, one can also scale ArangoDB vertically, that is, by using
+One can also scale ArangoDB vertically, that is, by using
-ever larger servers. However, this has the disadvantage that the
+ever larger servers. There is no builtin limitation in ArangoDB,
 for example, the server will automatically use more threads if
 more CPUs are present. 
 However, scaling vertically has the disadvantage that the
 costs grow faster than linear with the size of the server, and
 none of the resilience and dynamical capabilities can be achieved 
 in this way.