1
0
Fork 0
arangodb/Documentation/Books/Users/Aql/Optimizer.mdpp

309 lines
13 KiB
Plaintext

!CHAPTER The AQL query optimizer
AQL queres are sent through an optimizer before execution. The task of the optimizer is
to create an initial execution plan for the query, look for optimization opportunities and
apply them. As a result, the optimizer might produce multiple execution plans for a
single query. It will then calculate the costs for all plans and pick the plan with the
lowest total cost. This resulting plan is considered to be the *optimal plan*, which is
then executed.
The optimizer is designed to only perform optimization if they are *safe*, in the
meaning that an optimization does not modify the result of a query.
!SUBSECTION Execution plans
The `explain` command can be used to query the optimal executed plan or even all plans
the optimizer has generated. Additionally, `explain` can reveal some more information
about the optimizer's view of the query.
Here's an example that shows the execution plan for a simple query, using the `explain`
method of `ArangoStatement`:
```
arangosh> query = "FOR i IN test FILTER i.value > 97 SORT i.value RETURN i.value";
arangosh> db._create("test");
arangosh> for (i = 0; i < 100; ++i) { db.test.save({ value: i }); }
arangosh> db.test.ensureSkiplist("value");
arangosh> stmt = db._createStatement(query);
arangosh> stmt.explain();
{
"plan" : {
...
},
"warnings" : [
...
]
}
```
The result details will be very verbose so they are not shown here in full. Instead,
let's take a closer look at the results step by step.
!SUBSUBSECTION Execution nodes
In general, an execution plan can be considered to be a pipeline as processing steps.
Each processing step is carried out by a so-called *execution node*
The `nodes` attribute of the `explain` result contains a these execution nodes in
the execution plan. The output is still very verbose, so here's a shorted form of it:
```
arangosh> stmt.explain().plan.nodes.map(function (node) { return node.type; });
[
"SingletonNode",
"IndexRangeNode",
"CalculationNode",
"FilterNode",
"CalculationNode",
"ReturnNode"
]
```
When a plan is executed, the query execution engine will start with the node at
the bottom of the list (i.e. the *ReturnNode*).
The *ReturnNode*'s purpose is to return data to the caller. It does not produce
data itself, so it will ask the node above itself, that is the *CalculationNode*.
*CalculationNode*s are responsible for evaluating arbitrary expressions. In our
example query, the *CalculationNode* will evaluate the value of `i.value`, which
is needed by the *ReturnNode*. The calculation will be applied for all data the
*CalculationNode* gets from the node above it, i.e. the *FilterNode*.
*FilterNode*s will only let certain documents pass. Normally, filters are based on
evaluationg an expression, and so it is in the example case. The filter expression
result is calculated in the *CalculationNode* above the *FilterNode*.
Finally, all of this needs to be done for documents of collection `test`. This is
where the *IndexRangeNode* comes into play. It will use an index (thus the name)
to find certain documents in the collection and ship it down the pipeline. The
*IndexRangeNode* itself has a *SingletonNode* as its input. The sole purpose of a
*SingletonNode* node is to provide a single empty document as input for other
processing steps. It is always the end of the pipeline.
Here's a summary:
* SingletonNode: produces empty document as input for other processing steps.
* IndexRangeNode: iterates over the index on attribute `value` in collection `test`
* CalculationNode: calculates condition value `i.value > 97`
* FilterNode: only lets documents pass that satisfy condition `i.value > 97`
* CalculationNode: calculates return value `i.value`
* ReturnNode: returns data to the caller
!SUBSUBSECTION Optimizer rules
Note that in the example, the optimizer has optimized the `SORT` statement away.
It can do it safely because there is a sorted index on `i.value`, which it has
picked in the *IndexRangeNode*. As the index values are iterated in sorted order
anyway, the extra `SORT` would be redundant and was removed.
Additionally, the optimizer has done more work to generate an execution plan that
avoid as much expensive operations as possible. Here is a list of optimizer rules
that were applied to the plan:
arangosh> stmt.explain().plan.rules;
[
"move-calculations-up",
"move-filters-up",
"remove-redundant-calculations",
"remove-unnecessary-calculations",
"move-calculations-up-2",
"move-filters-up-2",
"use-index-range",
"use-index-for-sort"
]
Here's what the rules mean in context of this query:
* `move-calculations-up`: moves a *CalculationNode* as far up in the processing pipeline
as possible
* `move-filters-up`: moves a *FilterNode* as far up in the processing pipeline as
possible
* `remove-redundant-calculations`: replaces references to variables with references to
other variables that contain the exact same result. In the example query, `i.value`
is calculated multiple times, but each calculation inside a loop iteration would
produce the same value. Therefore, the expression result is shared by several nodes.
* `remove-unnecessary-calculations`: removes *CalculationNode*s whose result values are
not used in the query. In the example this is due to the `remove-redundant-calculations`
rule having made some calculations unnecessary.
* `use-index-range`: use an index to iterate over a collection instead of performing a
full collection scan. In the example case this makes sense, as the index can be
used for filtering and sorting.
* `use-index-for-sort`: removes a `SORT` operation if it is already satisfied by
traversing over a sorted index
Note that some rules may appear multiple times in the list, with number suffixes.
This is due to the same rule being applied multiple times, at different positions
in the optimizer pipeline.
!SUBSUBSECTION Collections used in a query
The list of collections used in a plan (and query) is contained in the `collections`
attribute of a plan:
```
arangosh> stmt.explain().plan.collections
[
{
"name" : "test",
"type" : "read"
}
]
```
The `name` attribute contains the name of the `collection`, and `type` is the
access type, which can be either `read` or `write`.
!SUBSUBSECTION Variables used in a query
The optimizer will also return a list of variables used in a plan (and query). This
list will contain auxilliary variables created by the optimizer itself. This list
can be ignored by end users in most cases.
!SUBSUBSECTION Cost of a query
For each plan the optimizer generates, it will calculate a total cost. The plan
with the lowest total cost is considered to be the optimal plan. Costs are
estimates only, as the actual execution costs are unknown to the optimizer.
Costs are calculated based on heuristics that are hard-coded into execution nodes.
Cost values do not have any unit.
!SUBSECTION Retrieving all execution plans
To retrieve not just the optimal plan but a list of all plans the optimizer has
generated, set the option `allPlans` to `true`:
This will return a list of all plans in the `plans` attribute instead of in the
`plan` attribute:
```
arangosh> stmt.explain({ allPlans: true });
{
"plans" : [
...
],
"warnings" : [
...
]
}
```
!SUBSECTION Warnings
For some queries, the optimizer might produce warnings. These will be returned in
the `warnings` attribute of the `explain` result:
```
arangosh> stmt = db._createStatement("FOR i IN 1..10 RETURN 1 / 0")
arangosh> stmt.explain().warnings;
[
{
"code" : 1562,
"message" : "division by zero"
}
]
```
There is an upper bound on the number of variables a query might produce. If that
bound is reached, no further warnings will be returned.
!SUBSECTION List of execution nodes
The following execution node types will appear in the output of `explain`:
* *SingletonNode*: the purpose of a *SingletonNode* is to produce an empty document
that is used as input for other processing steps. Each execution plan will contain
exactly one *SingletonNode* as its top node.
* *EnumerateCollectionNode*: enumeration over documents of a collection (given in
its *collection* attribute) without using an index.
* *IndexRangeNode*: enumeration over a specific index (given in its *index* attribute)
of a collection. The index range is specified in the *ranges* attribute of the node.
* *EnumerateListNode*: enumeration over a list of (non-collection) values.
* *FilterNode*: only lets values pass that satisfy a fill condition. Will appear once
per *FILTER* statement.
* *LimitNode*: limits the number of results passed to other processing steps. Will
appear once per *LIMIT* statement.
* *CalculationNode*: evaluates an expression. The expression result may be used by
other nodes, e.g. *FilterNode*, *EnumerateListNode*, *SortNode* etc.
* *SubqueryNode*: executes a sub-query.
* *SortNode*: performs a sort of its input values.
* *AggregateNode*: aggregates its input and produces new output variables. This will
appear once per *COLLECT* statement.
* *ReturnNode*: returns data to the caller. Will appear in each read-only query at
least once. Sub-queries will also contain *ReturnNode*s.
* *InsertNode*: inserts documents into a collection (given in its *collection*
attribute). Will appear exactly once in a query that contains an *INSERT* statement.
* *RemoveNode*: removes documents from a collection (given in its *collection*
attribute). Will appear exactly once in a query that contains a *REMOVE* statement.
* *ReplacesNode*: replaces documents in a collection (given in its *collection*
attribute). Will appear exactly once in a query that contains a *REPLACE* statement.
* *UpdateNode*: updates documents in a collection (given in its *collection*
attribute). Will appear exactly once in a query that contains an *UPDATE* statement.
* *NoResultsNode*: will be inserted if *FILTER* statements turn out to be never
satisfyable. The *NoResultsNode* will pass an empty result set into the processing
pipeline.
For queries in the cluster, the following nodes may appear in execution plans:
* *ScatterNode*: used on a coordinator to fan-out data to one or multiple shards.
* *GatherNode*: used on a coordinator to aggregate results from one or many shards
into a combined stream of results.
* *DistributeNode*: used on a coordinator to fan-out data to one or multiple shards,
taking into account a collection's shard key.
* *RemoteNode*: a *RemoteNode* will perfom communication with another ArangoDB
instances in the cluster. For example, the cluster coordinator will need to
communicate with other servers to fetch the actual data from the shards. It
will do so via *RemoteNode*s. The data servers themselves might again pull
further data from the coordinator, and thus might also employ *RemoteNode*s.
!SUBSECTION List of optimizer rules
The following optimizer rules may appear in the `rules` attribute of a plan:
* `move-calculations-up`: will appear if a *CalculationNode* was moved up in a plan.
The intention of this rule is to move calculations up in the processing pipeline
as far as possible (ideally out of enumerations) so they are executed less often.
* `move-filters-up`: will appear if a *FilterNode* was moved up in a plan. The
intention of this rule is to move filters up in the processing pipeline as far
as possible (ideally out of enumerations) so they are executed less often.
* `remove-unnecessary-filters`: will appear if a *FilterNode* was removed or replaced.
*FilterNode*s whose filter condition will always evaluate to *true* will be
removed from the plan, whereas *FilterNode* that will never let any results pass
will be replaced with a *NoResultsNode*.
* `remove-redundant-calculations`: will appear if redundant calculations (expressions
with the exact same result) are found in the query. The optimizer rule will then
replace references to the redundant expressions with a single reference, allowing
other optimizer rules to remove the then-unneeded *CalculationNode*s.
* `remove-unnecessary-calculations`: will appear if *CalculationNode*s were removed
from the query. The rule will removed all calculations whose result is not
referenced in the query (note that this may be a consequence of applying other
optimizations).
* `remove-redundant-sorts`: will appear if multiple *SORT* statements can be merged
into fewer sorts.
* `interchange-adjacent-enumerations`: will appear if a query contains multiple
*FOR* statements whose order was permuted. Permutation of *FOR* statements is
performed because it may enable further optimizations by other rules.
* `use-index-range`: will appear if an index can be used to iterate over a collection.
As a consequence, an *EnumerateCollectionNode* will have been replaced with an
*IndexRangeNode* in the plan.
* `use-index-for-sort`: will appear if an index can be used to avoid a *SORT*
operation. If the rule was applied, a *SortNode* will have been removed from the
plan.
The following optimizer rules may appear in the `rules` attribute of cluster plans:
* `scatter-in-cluster`: to be documented soon
* `distribute-in-cluster`: to be documented soon
* `distribute-filtercalc-to-cluster`: to be documented soon
* `distribute-sort-to-cluster`: to be documented soon
* `remove-unnecessary-remote-scatter`: to be documented soon
* `undistribute-remove-after-enum-coll`: to be documented soon
Note that some rules may appear multiple times in the list, with number suffixes.
This is due to the same rule being applied multiple times, at different positions
in the optimizer pipeline.