It allows you to define a graph that is spread across several edge and document collections.
This allows you to structure your models in line with your domain and group them logically in collections, giving you the power to query them in the same graph queries.
There is no need to include the referenced collections within the query; this module will handle it for you.
New to ArangoDB? Take [Free ArangoDB Graph course](https://www.arangodb.com/arangodb-graph-course) for freshers.
Distributed graph processing enables you to do online analytical processing
directly on graphs stored in ArangoDB. This is intended to help you gain analytical insights
on your data, without having to use external processing systems. Examples of algorithms
to execute are PageRank, Vertex Centrality, Vertex Closeness, Connected Components, and Community Detection.
This system is not useful for typical online queries, where you just do work on a small set of vertices.
These kinds of tasks are better suited for AQL.
The processing system inside ArangoDB is based on:
[Pregel: A System for Large-Scale Graph Processing](http://www.dcs.bbk.ac.uk/~dell/teaching/cc/paper/sigmod10/p135-malewicz.pdf) – Malewicz et al. (Google), 2010.
This concept enables us to perform distributed graph processing, without the need for distributed global locking.
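As a quick impression of what starting such a job looks like from arangosh, here is a minimal sketch; the graph name is a placeholder, PageRank stands in for any of the algorithms listed above, and the parameters are illustrative:

```javascript
var pregel = require("@arangodb/pregel");
// start a PageRank run on a (placeholder) named graph, limit it to 100 global
// supersteps, and write each vertex's score to the "rank" attribute
var handle = pregel.start("pagerank", "graphname", {maxGSS: 100, resultField: "rank"});
```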
### Prerequisites
In a cluster deployment, your collections need to be sharded in a specific way to
ensure correct results. For more information see the next section.
To enable iterative graph processing for your data, you will need to ensure
that your vertex and edge collections are sharded in a specific way.
The Pregel computing model requires all edges to be present on the DB Server where
the vertex document identified by the `_from` value is located.
This means the vertex collections need to be sharded by `_key` and the edge collection
will need to be sharded by an attribute which always contains the `_key` of the vertex.
Additionally you will need to specify the key `distributeShardsLike` and an **equal number of shards** on both collections.
Only if these requirements are met can ArangoDB place the edges and vertices correctly.
For example you might create your collections like this:
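A minimal arangosh sketch of such a setup; the collection names `vertices` and `alg_edges`, the shard counts, and the `vertex` shard key attribute are placeholder choices to adapt to your data:

```javascript
// arangosh: create a vertex collection sharded by _key and an edge collection
// sharded by an attribute ("vertex") that holds the _key of the _from vertex,
// with distributeShardsLike so both collections share the same shard distribution
var db = require("@arangodb").db;

db._create("vertices", {
  numberOfShards: 8,
  shardKeys: ["_key"]
});

db._createEdgeCollection("alg_edges", {
  numberOfShards: 8,
  shardKeys: ["vertex"],
  distributeShardsLike: "vertices"
});
```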
### Connected Components

There are two algorithms to find connected components in a graph. To find weakly connected components (WCC)
you can use the algorithm named "connectedcomponents", to find strongly connected components (SCC) the algorithm
named "scc". Both algorithms will assign a component ID to each vertex.
A weakly connected component means that there exists a path between every pair of vertices in that component.
WCC is a very simple and fast algorithm, which will only work correctly on undirected graphs.
Your results on directed graphs may vary, depending on how connected your components are.
In the case of SCC, a component means that every vertex is reachable from every other vertex in the same component.
The algorithm is more complex than the WCC algorithm and requires more RAM, because each vertex needs to store much more state.
Consider using WCC if you think your data may be suitable for it.
```javascript
var pregel = require("@arangodb/pregel");
// weakly connected components
pregel.start("connectedcomponents", "graphname")
// strongly connected components
pregel.start("scc", "graphname")
```
### Hyperlink-Induced Topic Search (HITS)
HITS is a link analysis algorithm that rates Web pages, developed by Jon Kleinberg (the algorithm is also known as hubs and authorities).
The idea behind hubs and authorities comes from the typical structure of the web: certain websites, known as hubs, serve as large directories that are not actually
authoritative on the information that they hold. These hubs are used as compilations of a broad catalog of information that leads users directly to other authoritative webpages.
The algorithm assigns each vertex two scores: the authority score and the hub score. The authority score rates how many good hubs point to a particular
vertex (or webpage), the hub score rates how good (authoritative) the vertices pointed to are.
Our version of the algorithm converges after a certain amount of time. The parameter *threshold* can be used to set a limit for the convergence
(measured as the maximum absolute difference of the hub and authority scores between the current and last iteration).
When you specify the result field name, the hub score will be stored in `<result field>_hub` and the authority score in
`<result field>_auth`.
The algorithm can be executed like this:
```javascript
var pregel = require("@arangodb/pregel");
var handle = pregel.start("hits", "yourgraph", {threshold:0.00001, resultField: "score"});
```
### Vertex Centrality
There are various definitions for centrality, the simplest one being the vertex degree.
These definitions were not designed with scalability in mind. It is probably impossible to discover an efficient algorithm which computes them in a distributed way.
Fortunately there are scalable substitutions available, which should be equally usable for most use cases.

#### Effective Closeness
A common definition of centrality is the **closeness centrality** (or closeness).
This should work well on large graphs and on smaller ones as well. The memory requirements should be **O(n * d)** where *n* is the number of vertices and *d* is the diameter of your graph.
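A minimal sketch of starting such a run, assuming the algorithm is registered under the name "effectivecloseness" and using a placeholder graph name:

```javascript
var pregel = require("@arangodb/pregel");
// compute effective closeness and store the score in the "closeness" attribute of each vertex
var handle = pregel.start("effectivecloseness", "yourgraph", {resultField: "closeness"});
```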
### Community Detection

Graphs based on real world networks often have a community structure. This means it is possible to find groups of vertices such that each vertex group is internally more densely connected than with the rest of the graph.
Social networks include community groups (the origin of the term, in fact) based on common location, interests, occupation, etc.
#### Label Propagation
*Label Propagation* can be used to implement community detection on large graphs. The idea is that each
vertex should be in the community that most of its neighbors are in. We iteratively determine this by first
assigning random community IDs. Then, in each iteration, a vertex sends its current community ID to all of its neighbor vertices.
Afterwards each vertex adopts the community ID it received most frequently during the iteration.
The algorithm runs until it converges or the maximum number of iterations is reached.
The default bound is 500 iterations, which is likely too large for your application.
The algorithm should work best on undirected graphs; results on directed graphs might vary depending on the density of your graph.
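A minimal sketch of starting a label propagation run with an explicit iteration bound; the graph name and the chosen bound are placeholders:

```javascript
var pregel = require("@arangodb/pregel");
// run label propagation for at most 100 iterations (global supersteps) and
// store the detected community ID in the "community" attribute of each vertex
var handle = pregel.start("labelpropagation", "yourgraph", {maxGSS: 100, resultField: "community"});
```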
#### Speaker-Listener Label Propagation

The [Speaker-listener Label Propagation](https://arxiv.org/pdf/1109.5720.pdf) (SLPA) can be used to implement community detection.
It works similarly to the label propagation algorithm,
but now every node additionally accumulates a memory of observed labels (instead of forgetting all but one label).
Before the algorithm runs, every vertex is initialized with a unique ID (the initial community label).
During the run, three steps are executed for each vertex:
1. The current vertex is the listener; all other vertices are speakers.
2. Each speaker sends out a label from memory; we send out a random label with a probability
   proportional to the number of times the vertex observed the label
3. The listener remembers one of the labels; we always choose the most frequently observed label (see the sketch below)
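A minimal sketch of how an SLPA run might be started; the algorithm name "slpa", the graph name, and the parameters below are assumptions to adapt to your setup:

```javascript
var pregel = require("@arangodb/pregel");
// run SLPA for a bounded number of iterations and write the resulting
// community label to the "community" attribute of each vertex
var handle = pregel.start("slpa", "yourgraph", {maxGSS: 100, resultField: "community"});
```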