mirror of https://gitee.com/bigwinds/arangodb
304 lines
15 KiB
Plaintext
304 lines
15 KiB
Plaintext
Distributed Iterative Graph Processing (Pregel)
|
||
======
|
||
|
||
Distributed graph processing enables you to do online analytical processing
|
||
directly on graphs stored into arangodb. This is intended to help you gain analytical insights
|
||
on your data, without having to use external processing sytems. Examples of algorithms
|
||
to execute are PageRank, Vertex Centrality, Vertex Closeness, Connected Components, Community Detection.
|
||
This system is not useful for typical online queries, where you just do work on a small set of vertices.
|
||
These kind of tasks are better suited for AQL.
|
||
|
||
The processing system inside ArangoDB is based on: [Pregel: A System for Large-Scale Graph Processing](http://www.dcs.bbk.ac.uk/~dell/teaching/cc/paper/sigmod10/p135-malewicz.pdf) – Malewicz et al. (Google) 2010
|
||
This concept enables us to perform distributed graph processing, without the need for distributed global locking.
|
||
|
||
# Prerequisites
|
||
|
||
If you are running a single ArangoDB instance in single-server mode, there are no requirements regarding the modeling
|
||
of your data. All you need is at least one vertex collection and one edge collection. Note that the performance may be
|
||
better, if the number of your shards / collections matches the number of CPU cores.
|
||
|
||
When you use ArangoDB Community edition in cluster mode, you might need to model your collections in a certain way to
|
||
ensure correct results. For more information see the next section.
|
||
|
||
## Requirements for Collections in a Cluster (Non Smart Graph)
|
||
|
||
To enable iterative graph processing for your data, you will need to ensure
|
||
that your vertex and edge collections are sharded in a specific way.
|
||
|
||
The pregel computing model requires all edges to be present on the DB Server where
|
||
the vertex document identified by the `_from` value is located.
|
||
This means the vertex collections need to be sharded by '_key' and the edge collection
|
||
will need to be sharded after an attribute which always contains the '_key' of the vertex.
|
||
|
||
Our implementation currently requires every edge collection to be sharded after a "vertex" attributes,
|
||
additionally you will need to specify the key `distributeShardsLike` and an **equal** number of shards on every collection.
|
||
Only if these requirements are met can ArangoDB place the edges and vertices correctly.
|
||
|
||
For example you might create your collections like this:
|
||
```javascript
|
||
// Create main vertex collection:
|
||
db._create("vertices", {
|
||
shardKeys:['_key'],
|
||
numberOfShards: 8
|
||
});
|
||
|
||
// Optionally create arbitrary additional vertex collections
|
||
db._create("additonal", {
|
||
distributeShardsLike:"vertices",
|
||
numberOfShards:8
|
||
});
|
||
|
||
// Create (one or more) edge-collections:
|
||
db._createEdgeCollection("edges", {
|
||
shardKeys:['vertex'],
|
||
distributeShardsLike:"vertices",
|
||
numberOfShards:8
|
||
});
|
||
```
|
||
|
||
You will need to ensure that edge documents contain the proper values in their sharding attribute.
|
||
For a vertex document with the following content ```{_key:"A", value:0}```
|
||
the corresponding edge documents would have look like this:
|
||
|
||
```
|
||
{_from:"vertices/A", _to: "vertices/B", vertex:"A"}
|
||
{_from:"vertices/A", _to: "vertices/C", vertex:"A"}
|
||
{_from:"vertices/A", _to: "vertices/D", vertex:"A"}
|
||
...
|
||
```
|
||
|
||
This will ensure that outgoing edge documents will be placed on the same DBServer as the vertex.
|
||
Without the correct placement of the edges, the pregel graph processing system will not work correctly, because
|
||
edges will not load correctly.
|
||
|
||
### Arangosh API
|
||
|
||
#### Starting an Algorithm Execution
|
||
|
||
The pregel API is accessible through the `@arangodb/pregel` package.
|
||
To start an execution you need to specify the **algorithm** name and the vertex and edge collections.
|
||
Alternatively you can specify a named graph. Additionally you can specify custom parameters which
|
||
vary for each algorithm.
|
||
The `start` method will always a unique ID which can be used to interact with the algorithm and later on.
|
||
|
||
The below version of the `start` method can be used for named graphs:
|
||
```javascript
|
||
var pregel = require("@arangodb/pregel");
|
||
var params = {};
|
||
var execution = pregel.start("<algorithm>", "<yourgraph>", params);
|
||
```
|
||
Params needs to be an object, the valid keys are mentioned below in the section [Algorithms]()
|
||
|
||
|
||
Alternatively you might want to specify the vertex and edge collections directly. The call-syntax of the `start``
|
||
method changes in this case. The second argument must be an object with the keys `vertexCollections`and `edgeCollections`.
|
||
```javascript
|
||
var pregel = require("@arangodb/pregel");
|
||
var params = {};
|
||
var execution = pregel.start("<algorithm>", {vertexCollections:["vertices"], edgeCollections:["edges"]}, {});
|
||
```
|
||
The last argument is still the parameter object. See below for a list of algorithms and parameters.
|
||
|
||
|
||
#### Status of an Algorithm Execution
|
||
|
||
The code returned by the `pregel.start(...)` method can be used to
|
||
track the status of your algorithm.
|
||
|
||
```javascript
|
||
var execution = pregel.start("sssp", "demograph", {source: "vertices/V"});
|
||
var status = pregel.status(execution);
|
||
```
|
||
|
||
The result will tell you the current status of the algorithm execution.
|
||
It will tell you the current `state` of the execution, the current global superstep, the runtime, the global aggregator values as
|
||
well as the number of send and received messages.
|
||
|
||
Valid values for the `state` field include:
|
||
- "running" algorithm is still running
|
||
- "done": The execution is done, the result might not be written back into the collection yet.
|
||
- "canceled": The execution was permanently canceled, either by the user or by an error.
|
||
- "in error": The exeuction is in an error state. This can be caused by primary DBServers beeing not reachable or beeing non responsive.
|
||
The execution might recover later, or switch to "canceled" if it was not able to recover successfuly
|
||
- "recovering": The execution is actively recovering, will switch back to "running" if the recovery was successful
|
||
|
||
The object returned by the `status` method might for example look something like this:
|
||
|
||
```javascript
|
||
{
|
||
"state" : "running",
|
||
"gss" : 12,
|
||
"totalRuntime" : 123.23,
|
||
"aggregators" : {
|
||
"converged" : false,
|
||
"max" : true,
|
||
"phase" : 2
|
||
},
|
||
"sendCount" : 3240364978,
|
||
"receivedCount" : 3240364975
|
||
}
|
||
```
|
||
|
||
|
||
#### Canceling an Execution / Discarding results
|
||
|
||
To cancel an execution which is still runnning, and discard any intermediare results you can use the `cancel` method.
|
||
This will immediatly free all memory taken up by the execution, and will make you lose all intermediary data.
|
||
|
||
You might get inconsistent results if you cancel an execution while it is already in it's `done` state. The data is written
|
||
multi-threaded into all collection shards at once, this means there are multiple transactions simultaniously. A transaction might
|
||
already be commited when you cancel the execution job, therefore you might see the result in your collection. This does not apply
|
||
if you configured the execution to not write data into the collection.
|
||
|
||
```javascript
|
||
// start a single source shortest path job
|
||
var execution = pregel.start("sssp", "demograph", {source: "vertices/V"});
|
||
pregel.cancel(execution);
|
||
```
|
||
|
||
## AQL integration
|
||
|
||
ArangoDB supports retrieving temporary pregel results through the ArangoDB query language (AQL).
|
||
When our graph processing subsystem finishes executing an algorithm, the result can either be written back into the
|
||
database or kept in memory. In both cases the result can be queried via AQL. If the data was not written to the database
|
||
store it is only held temporarily, until the user calls the `cancel` methodFor example a user might want to query
|
||
only nodes with the most rank from the result set of a PageRank execution.
|
||
|
||
FOR v IN PREGEL_RESULT(<handle>)
|
||
FILTER v.value >= 0.01
|
||
RETURN v._key
|
||
|
||
## Available Algorithms
|
||
|
||
There are a number of general parameters which apply to almost all algorithms:
|
||
* `store`: Is per default *true*, the pregel engine will write results back to the database.
|
||
if the value is *false* then you can query the results via AQLk, see AQL integration.
|
||
* `maxGSS`: Maximum number of global iterations for this algorithm
|
||
* `parallelism`: Number of parellel threads to use per worker. Does not influence the number of threads used to load
|
||
or store data from the database (this depends on the number of shards).
|
||
* `async`: Algorithms wich support async mode, will run without synchronized global iterations,
|
||
might lead to performance increases if you have load imbalances.
|
||
* `resultField`: Most algorithms will write the result into this field
|
||
|
||
#### Page Rank
|
||
|
||
PageRank is a well known algorithm to rank documents in a graph. The algorithm will run until
|
||
the execution converges. Specify a custom threshold with the parameter `threshold`, to run for a fixed
|
||
number of iterations use the `maxGSS` parameter.
|
||
|
||
```javascript
|
||
var pregel = require("@arangodb/pregel");
|
||
pregel.start("pagerank", "graphname", {maxGSS: 100, threshold:0.00000001})
|
||
```
|
||
|
||
#### Single-Source Shortest Path
|
||
|
||
Calculates the distance of each vertex to a certain shortest path. The algorithm will run until it converges,
|
||
the iterations are bound by the diameter (the longest shortest path) of your graph.
|
||
|
||
```javascript
|
||
var pregel = require("@arangodb/pregel");
|
||
pregel.start("sssp", "graphname", {source:"vertices/1337"})
|
||
```
|
||
|
||
#### Hyperlink-Induced Topic Search (HITS)
|
||
|
||
HITS is a link analysis algorithm that rates Web pages, developed by Jon Kleinberg (The algorithm is also known as hubs and authorities).
|
||
|
||
|
||
The idea behind Hubs and Authorities comes from the typical structure of the web: Certain websites known as hubs, serve as large directories that are not actually
|
||
authoritative on the information that they hold. These hubs are used as compilations of a broad catalog of information that leads users direct to other authoritative webpages.
|
||
The algorithm assigns each vertex two scores: The authority-score and the hub-score. The authority score rates how many good hubs point to a particular
|
||
vertex (or webpage), the hub score rates how good (authoritative) the vertices pointed to are. For more see https://en.wikipedia.org/wiki/HITS_algorithm
|
||
|
||
Our version of the algorithm
|
||
converges after a certain amount of time. The parameter *threshold* can be used to set a limit for the convergence (measured as maximum absolute difference of the hub and
|
||
authority scores between the current and last iteration)
|
||
When you specify the result field name, the hub score will be stored in "<result field>_hub" and the authority score in
|
||
"<result field>_auth".
|
||
The algorithm can be executed like this:
|
||
|
||
|
||
```javascript
|
||
var pregel = require("@arangodb/pregel");
|
||
var handle = pregel.start("hits", "yourgraph", {threshold:0.00001, resultField: "score"});
|
||
```
|
||
|
||
#### Vertex Centrality
|
||
|
||
Centrality measures help identify the most important vertices in a graph. They can be used in a wide range of applications:
|
||
For example they can be used to identify *influencers* in social networks, or *middle-men* in terrorist networks.
|
||
There are various definitions for centrality, the simplest one being the vertex degree.
|
||
|
||

|
||
|
||
|
||
##### Effective Closeness
|
||
|
||
A common definitions of centrality is the **closeness centrality** (or closeness).
|
||
The closeness of a vertex in a graph is the inverse average length of the shortest path between the vertex
|
||
and all other vertices. For vertices *x*, *y* and shortest distance *d(y,x)* it is defined as
|
||
|
||

|
||
|
||
Effective Closeness approximates the closeness measure. The algorithm works by iteratively estimating the number
|
||
of shortest paths passing through each vertex. The score will approximates the the real closeness score, since
|
||
it is not possible to actually count all shortest paths due to the horrendous O(n^2 * d) memory requirements.
|
||
The algorithm is from the paper *Centralities in Large Networks: Algorithms and Observations (U Kang et.al. 2011)*
|
||
|
||
ArangoDBs implementation approximates the number of shortest path in each iteration by using a HyperLogLog counter with 64 buckets.
|
||
This should work well on large graphs and on smaller ones as well. The memory requirements should be **O(n * d)** where
|
||
*n* is the number of vertices and *d* the diameter of your graph. Each vertex will store a counter for each iteration of the
|
||
algorithm. The algorithm can be used like this
|
||
|
||
```javascript
|
||
var pregel = require("@arangodb/pregel");
|
||
var handle = pregel.start("effectivecloseness", "yourgraph", resultField: "closeness");
|
||
```
|
||
|
||
##### LineRank
|
||
|
||
Another common measure is the [betweenness* centrality](https://en.wikipedia.org/wiki/Betweenness_centrality):
|
||
It measures the number of times a vertex is part of shortest paths between any pairs of vertices.
|
||
For a vertex *v* betweenness is defined as
|
||
|
||

|
||
|
||
Where the σ represents the number of shortest paths between *x* and *y*, and σ(v) represents the
|
||
number of paths also passing through a vertex *v*. By intuition a vertex with higher betweeness centrality will have more information
|
||
passing through it.
|
||
|
||
Unfortunately these definitions were not designed with scalability in mind. It is probably impossible to compute them
|
||
efficiently and accurately. Fortunately there are scalable substitutions proposed by **U Kang et.al. 2011**.
|
||
|
||
|
||
**LineRank** approximates the random walk betweenness of every vertex in a graph. This is the probability that someone starting on
|
||
an arbitary vertex, will visit this node when he randomly chooses edges to visit.
|
||
The algoruthm essentially builds a line graph out of your graph (switches the vertices and edges), and then computes a score similar to PageRank.
|
||
This can be considered a scalable equivalent to vertex betweeness, which can be executed distributedly in ArangoDB.
|
||
The algorithm is from the paper *Centralities in Large Networks: Algorithms and Observations (U Kang et.al. 2011)*
|
||
|
||
```javascript
|
||
var pregel = require("@arangodb/pregel");
|
||
var handle = pregel.start("linerank", "yourgraph", "resultField": "rank");
|
||
```
|
||
|
||
|
||
#### Community Detection: Label Propagation
|
||
|
||
|
||
*Label Propagation* can be used to implement community detection on large graphs. The idea is that each
|
||
vertex should be in the community that most of his neighbours are in. We iteratively detemine this by first
|
||
assigning random Community ID's. Then each itertation, a vertex will send it's current community ID to all his neighbor vertices.
|
||
Then each vertex adopts the community ID he received most frequently during the iteration.
|
||
|
||
The algorithm runs until it converges,
|
||
which likely never really happens on large graphs. Therefore you need to specify a maximum iteration bound which suits you.
|
||
The default bound is 500 iterations, which is likely too large for your application.
|
||
Should work best on undirected graphs, results on directed graphs might vary depending on the density of your graph.
|
||
|
||
```javascript
|
||
var pregel = require("@arangodb/pregel");
|
||
var handle = pregel.start("labelpropagation", "yourgraph", {maxGSS:100, resultField: "community"});
|
||
```
|