mirror of https://gitee.com/bigwinds/arangodb
391 lines
20 KiB
Plaintext
391 lines
20 KiB
Plaintext
!CHAPTER The AQL query optimizer
|
|
|
|
AQL queries are sent through an optimizer before execution. The task of the optimizer is
|
|
to create an initial execution plan for the query, look for optimization opportunities and
|
|
apply them. As a result, the optimizer might produce multiple execution plans for a
|
|
single query. It will then calculate the costs for all plans and pick the plan with the
|
|
lowest total cost. This resulting plan is considered to be the *optimal plan*, which is
|
|
then executed.
|
|
|
|
The optimizer is designed to only perform optimization if they are *safe*, in the
|
|
meaning that an optimization does not modify the result of a query.
|
|
|
|
|
|
!SUBSECTION Execution plans
|
|
|
|
The `explain` command can be used to query the optimal executed plan or even all plans
|
|
the optimizer has generated. Additionally, `explain` can reveal some more information
|
|
about the optimizer's view of the query.
|
|
|
|
Here's an example that shows the execution plan for a simple query, using the `explain`
|
|
method of `ArangoStatement`:
|
|
|
|
@startDocuBlockInline AQLEXP_01_explainCreate
|
|
@EXAMPLE_ARANGOSH_OUTPUT{AQLEXP_01_explainCreate}
|
|
~addIgnoreCollection("test")
|
|
db._create("test");
|
|
for (i = 0; i < 100; ++i) { db.test.save({ value: i }); }
|
|
db.test.ensureSkiplist("value");
|
|
stmt = db._createStatement("FOR i IN test FILTER i.value > 97 SORT i.value RETURN i.value");
|
|
stmt.explain();
|
|
@END_EXAMPLE_ARANGOSH_OUTPUT
|
|
@endDocuBlock AQLEXP_01_explainCreate
|
|
|
|
The result details will be very verbose so they are not shown here in full. Instead,
|
|
let's take a closer look at the results step by step.
|
|
|
|
!SUBSUBSECTION Execution nodes
|
|
|
|
In general, an execution plan can be considered to be a pipeline of processing steps.
|
|
Each processing step is carried out by a so-called *execution node*
|
|
|
|
The `nodes` attribute of the `explain` result contains these *execution nodes* in
|
|
the *execution plan*. The output is still very verbose, so here's a shorted form of it:
|
|
|
|
@startDocuBlockInline AQLEXP_02_explainOverview
|
|
@EXAMPLE_ARANGOSH_OUTPUT{AQLEXP_02_explainOverview}
|
|
~var stmt = db._createStatement("FOR i IN test FILTER i.value > 97 SORT i.value RETURN i.value");
|
|
stmt.explain().plan.nodes.map(function (node) { return node.type; });
|
|
@END_EXAMPLE_ARANGOSH_OUTPUT
|
|
@endDocuBlock AQLEXP_02_explainOverview
|
|
|
|
*Note that the list of nodes might slightly change in future versions of ArangoDB if
|
|
new execution node types get added or the optimizer will create somewhat more
|
|
optimized plans).*
|
|
|
|
When a plan is executed, the query execution engine will start with the node at
|
|
the bottom of the list (i.e. the *ReturnNode*).
|
|
|
|
The *ReturnNode*'s purpose is to return data to the caller. It does not produce
|
|
data itself, so it will ask the node above itself, this is the *CalculationNode*
|
|
in our example.
|
|
*CalculationNode*s are responsible for evaluating arbitrary expressions. In our
|
|
example query, the *CalculationNode* will evaluate the value of `i.value`, which
|
|
is needed by the *ReturnNode*. The calculation will be applied for all data the
|
|
*CalculationNode* gets from the node above it, in our example the *FilterNode*.
|
|
|
|
*FilterNode*s will only let certain documents pass. Normally, filters are based on
|
|
the evaluation of an expression. The filters expression result (`i.value > 97`)
|
|
is calculated in the *CalculationNode* above the *FilterNode*.
|
|
|
|
Finally, all of this needs to be done for documents of collection `test`. This is
|
|
where the *IndexRangeNode* enters the game. It will use an index (thus its name)
|
|
to find certain documents in the collection and ship it down the pipeline in the
|
|
order required by `SORT i.value`. The *IndexRangeNode* itself has a *SingletonNode*
|
|
as its input. The sole purpose of a *SingletonNode* node is to provide a single empty
|
|
document as input for other processing steps. It is always the end of the pipeline.
|
|
|
|
Here's a summary:
|
|
* SingletonNode: produces empty document as input for other processing steps.
|
|
* IndexRangeNode: iterates over the index on attribute `value` in collection `test`
|
|
in the order required by `SORT i.value`.
|
|
* CalculationNode: evaluates the result of the calculation `i.value > 97` to `true` or `false`
|
|
* FilterNode: only lets documents pass where above calculation returned `true`
|
|
* CalculationNode: calculates return value `i.value`
|
|
* ReturnNode: returns data to the caller
|
|
|
|
|
|
!SUBSUBSECTION Optimizer rules
|
|
|
|
Note that in the example, the optimizer has optimized the `SORT` statement away.
|
|
It can do it safely because there is a sorted index on `i.value`, which it has
|
|
picked in the *IndexRangeNode*. As the index values are iterated in sorted order
|
|
anyway, the extra *SortNode* would be redundant and was removed.
|
|
|
|
Additionally, the optimizer has done more work to generate an execution plan that
|
|
avoids as much expensive operations as possible. Here is the list of optimizer rules
|
|
that were applied to the plan:
|
|
|
|
@startDocuBlockInline AQLEXP_03_explainRules
|
|
@EXAMPLE_ARANGOSH_OUTPUT{AQLEXP_03_explainRules}
|
|
~var stmt = db._createStatement("FOR i IN test FILTER i.value > 97 SORT i.value RETURN i.value");
|
|
stmt.explain().plan.rules;
|
|
@END_EXAMPLE_ARANGOSH_OUTPUT
|
|
@endDocuBlock AQLEXP_03_explainRules
|
|
|
|
Here is the meaning of these rules in context of this query:
|
|
* `move-calculations-up`: moves a *CalculationNode* as far up in the processing pipeline
|
|
as possible
|
|
* `move-filters-up`: moves a *FilterNode* as far up in the processing pipeline as
|
|
possible
|
|
* `remove-redundant-calculations`: replaces references to variables with references to
|
|
other variables that contain the exact same result. In the example query, `i.value`
|
|
is calculated multiple times, but each calculation inside a loop iteration would
|
|
produce the same value. Therefore, the expression result is shared by several nodes.
|
|
* `remove-unnecessary-calculations`: removes *CalculationNode*s whose result values are
|
|
not used in the query. In the example this happens due to the `remove-redundant-calculations`
|
|
rule having made some calculations unnecessary.
|
|
* `use-index-range`: use an index to iterate over a collection instead of performing a
|
|
full collection scan. In the example case this makes sense, as the index can be
|
|
used for filtering and sorting.
|
|
* `use-index-for-sort`: removes a `SORT` operation if it is already satisfied by
|
|
traversing over a sorted index
|
|
|
|
Note that some rules may appear multiple times in the list, with number suffixes.
|
|
This is due to the same rule being applied multiple times, at different positions
|
|
in the optimizer pipeline.
|
|
|
|
|
|
!SUBSUBSECTION Collections used in a query
|
|
|
|
The list of collections used in a plan (and query) is contained in the `collections`
|
|
attribute of a plan:
|
|
|
|
@startDocuBlockInline AQLEXP_04_explainCollections
|
|
@EXAMPLE_ARANGOSH_OUTPUT{AQLEXP_04_explainCollections}
|
|
~var stmt = db._createStatement("FOR i IN test FILTER i.value > 97 SORT i.value RETURN i.value");
|
|
stmt.explain().plan.collections
|
|
@END_EXAMPLE_ARANGOSH_OUTPUT
|
|
@endDocuBlock AQLEXP_04_explainCollections
|
|
|
|
The `name` attribute contains the name of the `collection`, and `type` is the
|
|
access type, which can be either `read` or `write`.
|
|
|
|
|
|
!SUBSUBSECTION Variables used in a query
|
|
|
|
The optimizer will also return a list of variables used in a plan (and query). This
|
|
list will contain auxiliary variables created by the optimizer itself. This list
|
|
can be ignored by end users in most cases.
|
|
|
|
|
|
!SUBSUBSECTION Cost of a query
|
|
|
|
For each plan the optimizer generates, it will calculate the total cost. The plan
|
|
with the lowest total cost is considered to be the optimal plan. Costs are
|
|
estimates only, as the actual execution costs are unknown to the optimizer.
|
|
Costs are calculated based on heuristics that are hard-coded into execution nodes.
|
|
Cost values do not have any unit.
|
|
|
|
|
|
!SUBSECTION Retrieving all execution plans
|
|
|
|
To retrieve not just the optimal plan but a list of all plans the optimizer has
|
|
generated, set the option `allPlans` to `true`:
|
|
|
|
This will return a list of all plans in the `plans` attribute instead of in the
|
|
`plan` attribute:
|
|
|
|
@startDocuBlockInline AQLEXP_05_explainAllPlans
|
|
@EXAMPLE_ARANGOSH_OUTPUT{AQLEXP_05_explainAllPlans}
|
|
~var stmt = db._createStatement("FOR i IN test FILTER i.value > 97 SORT i.value RETURN i.value");
|
|
stmt.explain({ allPlans: true });
|
|
@END_EXAMPLE_ARANGOSH_OUTPUT
|
|
@endDocuBlock AQLEXP_05_explainAllPlans
|
|
|
|
!SUBSECTION Retrieving the plan as it was generated by the parser / lexer
|
|
|
|
To retrieve the plan which closely matches your query, you may turn off most
|
|
optimization rules (i.e. cluster rules cannot be disabled if you're running
|
|
the explain on a cluster coordinator) set the option `rules` to `-all`:
|
|
|
|
This will return an unoptimized plan in the `plan`:
|
|
|
|
@startDocuBlockInline AQLEXP_06_explainUnoptimizedPlans
|
|
@EXAMPLE_ARANGOSH_OUTPUT{AQLEXP_06_explainUnoptimizedPlans}
|
|
~var stmt = db._createStatement("FOR i IN test FILTER i.value > 97 SORT i.value RETURN i.value");
|
|
stmt.explain({ optimizer: { rules: [ "-all" ] } });
|
|
@END_EXAMPLE_ARANGOSH_OUTPUT
|
|
@endDocuBlock AQLEXP_06_explainUnoptimizedPlans
|
|
|
|
Note that some optimizations are already done at parse time (i.e. evaluate simple constant
|
|
calculation as `1 + 1`)
|
|
|
|
|
|
!SUBSECTION Turning specific optimizer rules off
|
|
|
|
Optimizer rules can also be turned on or off individually, using the `rules` attribute.
|
|
This can be used to enable or disable one or multiple rules. Rules that shall be enabled
|
|
need to be prefixed with a `+`, rules to be disabled should be prefixed with a `-`. The
|
|
pseudo-rule `all` matches all rules.
|
|
|
|
Rules specified in `rules` are evaluated from left to right, so the following works to
|
|
turn on just the one specific rule:
|
|
|
|
@startDocuBlockInline AQLEXP_07_explainSingleRulePlans
|
|
@EXAMPLE_ARANGOSH_OUTPUT{AQLEXP_07_explainSingleRulePlans}
|
|
~var stmt = db._createStatement("FOR i IN test FILTER i.value > 97 SORT i.value RETURN i.value");
|
|
stmt.explain({ optimizer: { rules: [ "-all", "+use-index-range" ] } });
|
|
@END_EXAMPLE_ARANGOSH_OUTPUT
|
|
@endDocuBlock AQLEXP_07_explainSingleRulePlans
|
|
|
|
By default, all rules are turned on. To turn off just a few specific rules, use something
|
|
like this:
|
|
|
|
@startDocuBlockInline AQLEXP_08_explainDisableSingleRulePlans
|
|
@EXAMPLE_ARANGOSH_OUTPUT{AQLEXP_08_explainDisableSingleRulePlans}
|
|
~var stmt = db._createStatement("FOR i IN test FILTER i.value > 97 SORT i.value RETURN i.value");
|
|
stmt.explain({ optimizer: { rules: [ "-use-index-range", "-use-index-for-sort" ] } });
|
|
@END_EXAMPLE_ARANGOSH_OUTPUT
|
|
@endDocuBlock AQLEXP_08_explainDisableSingleRulePlans
|
|
|
|
The maximum number of plans created by the optimizer can also be limited using the
|
|
`maxNumberOfPlans` attribute:
|
|
|
|
@startDocuBlockInline AQLEXP_09_explainMaxNumberOfPlans
|
|
@EXAMPLE_ARANGOSH_OUTPUT{AQLEXP_09_explainMaxNumberOfPlans}
|
|
~var stmt = db._createStatement("FOR i IN test FILTER i.value > 97 SORT i.value RETURN i.value");
|
|
stmt.explain({ maxNumberOfPlans: 1 });
|
|
@END_EXAMPLE_ARANGOSH_OUTPUT
|
|
@endDocuBlock AQLEXP_09_explainMaxNumberOfPlans
|
|
|
|
!SUBSECTION Optimizer statistics
|
|
|
|
The optimizer will return statistics as a part of an `explain` result.
|
|
|
|
The following attributes will be returned in the `stats` attribute of an `explain` result:
|
|
|
|
- `plansCreated`: total number of plans created by the optimizer
|
|
- `rulesExecuted`: number of rules executed (note: an executed rule does not
|
|
indicate a plan was actually modified by a rule)
|
|
- `rulesSkipped`: number of rules skipped by the optimizer
|
|
|
|
|
|
!SUBSECTION Warnings
|
|
|
|
For some queries, the optimizer may produce warnings. These will be returned in
|
|
the `warnings` attribute of the `explain` result:
|
|
|
|
@startDocuBlockInline AQLEXP_10_explainWarn
|
|
@EXAMPLE_ARANGOSH_OUTPUT{AQLEXP_10_explainWarn}
|
|
var stmt = db._createStatement("FOR i IN 1..10 RETURN 1 / 0")
|
|
stmt.explain().warnings;
|
|
~db._drop("test")
|
|
~removeIgnoreCollection("test")
|
|
@END_EXAMPLE_ARANGOSH_OUTPUT
|
|
@endDocuBlock AQLEXP_10_explainWarn
|
|
|
|
There is an upper bound on the number of warning a query may produce. If that
|
|
bound is reached, no further warnings will be returned.
|
|
|
|
|
|
!SUBSECTION List of execution nodes
|
|
|
|
The following execution node types will appear in the output of `explain`:
|
|
|
|
* *SingletonNode*: the purpose of a *SingletonNode* is to produce an empty document
|
|
that is used as input for other processing steps. Each execution plan will contain
|
|
exactly one *SingletonNode* as its top node.
|
|
* *EnumerateCollectionNode*: enumeration over documents of a collection (given in
|
|
its *collection* attribute) without using an index.
|
|
* *IndexRangeNode*: enumeration over a specific index (given in its *index* attribute)
|
|
of a collection. The index range is specified in the *ranges* attribute of the node.
|
|
* *EnumerateListNode*: enumeration over a list of (non-collection) values.
|
|
* *FilterNode*: only lets values pass that satisfy a filter condition. Will appear once
|
|
per *FILTER* statement.
|
|
* *LimitNode*: limits the number of results passed to other processing steps. Will
|
|
appear once per *LIMIT* statement.
|
|
* *CalculationNode*: evaluates an expression. The expression result may be used by
|
|
other nodes, e.g. *FilterNode*, *EnumerateListNode*, *SortNode* etc.
|
|
* *SubqueryNode*: executes a subquery.
|
|
* *SortNode*: performs a sort of its input values.
|
|
* *AggregateNode*: aggregates its input and produces new output variables. This will
|
|
appear once per *COLLECT* statement.
|
|
* *ReturnNode*: returns data to the caller. Will appear in each read-only query at
|
|
least once. Subqueries will also contain *ReturnNode*s.
|
|
* *InsertNode*: inserts documents into a collection (given in its *collection*
|
|
attribute). Will appear exactly once in a query that contains an *INSERT* statement.
|
|
* *RemoveNode*: removes documents from a collection (given in its *collection*
|
|
attribute). Will appear exactly once in a query that contains a *REMOVE* statement.
|
|
* *ReplacesNode*: replaces documents in a collection (given in its *collection*
|
|
attribute). Will appear exactly once in a query that contains a *REPLACE* statement.
|
|
* *UpdateNode*: updates documents in a collection (given in its *collection*
|
|
attribute). Will appear exactly once in a query that contains an *UPDATE* statement.
|
|
* *NoResultsNode*: will be inserted if *FILTER* statements turn out to be never
|
|
satisfiable. The *NoResultsNode* will pass an empty result set into the processing
|
|
pipeline.
|
|
|
|
For queries in the cluster, the following nodes may appear in execution plans:
|
|
|
|
* *ScatterNode*: used on a coordinator to fan-out data to one or multiple shards.
|
|
* *GatherNode*: used on a coordinator to aggregate results from one or many shards
|
|
into a combined stream of results.
|
|
* *DistributeNode*: used on a coordinator to fan-out data to one or multiple shards,
|
|
taking into account a collection's shard key.
|
|
* *RemoteNode*: a *RemoteNode* will perform communication with another ArangoDB
|
|
instances in the cluster. For example, the cluster coordinator will need to
|
|
communicate with other servers to fetch the actual data from the shards. It
|
|
will do so via *RemoteNode*s. The data servers themselves might again pull
|
|
further data from the coordinator, and thus might also employ *RemoteNode*s.
|
|
So, all of the above cluster relevant nodes will be accompanied by a *RemoteNode*.
|
|
|
|
|
|
!SUBSECTION List of optimizer rules
|
|
|
|
The following optimizer rules may appear in the `rules` attribute of a plan:
|
|
|
|
* `move-calculations-up`: will appear if a *CalculationNode* was moved up in a plan.
|
|
The intention of this rule is to move calculations up in the processing pipeline
|
|
as far as possible (ideally out of enumerations) so they are not executed in loops
|
|
if not required. It is also quite common that this rule enables further optimizations
|
|
to kick in.
|
|
* `move-filters-up`: will appear if a *FilterNode* was moved up in a plan. The
|
|
intention of this rule is to move filters up in the processing pipeline as far
|
|
as possible (ideally out of inner loops) so they filter results as early as possible.
|
|
* `remove-unnecessary-filters`: will appear if a *FilterNode* was removed or replaced.
|
|
*FilterNode*s whose filter condition will always evaluate to *true* will be
|
|
removed from the plan, whereas *FilterNode* that will never let any results pass
|
|
will be replaced with a *NoResultsNode*.
|
|
* `remove-redundant-calculations`: will appear if redundant calculations (expressions
|
|
with the exact same result) were found in the query. The optimizer rule will then
|
|
replace references to the redundant expressions with a single reference, allowing
|
|
other optimizer rules to remove the then-unneeded *CalculationNode*s.
|
|
* `remove-unnecessary-calculations`: will appear if *CalculationNode*s were removed
|
|
from the query. The rule will removed all calculations whose result is not
|
|
referenced in the query (note that this may be a consequence of applying other
|
|
optimizations).
|
|
* `remove-redundant-sorts`: will appear if multiple *SORT* statements can be merged
|
|
into fewer sorts.
|
|
* `interchange-adjacent-enumerations`: will appear if a query contains multiple
|
|
*FOR* statements whose order were permuted. Permutation of *FOR* statements is
|
|
performed because it may enable further optimizations by other rules.
|
|
* `remove-sort-rand`: will appear when a *SORT RAND()* expression is removed by
|
|
moving the random iteration into an *EnumerateCollectionNode*.
|
|
* `remove-collect-into`: will appear if an *INTO* clause was removed from a *COLLECT*
|
|
statement because the result of *INTO* is not used.
|
|
* `propagate-constant-attributes`: will appear when a constant value was inserted
|
|
into a filter condition, replacing a dynamic attribute value.
|
|
* `replace-or-with-in`: will appear if multiple *OR*-combined equality conditions
|
|
on the same variable or attribute were replaced with an *IN* condition.
|
|
* `remove-redundant-or`: will appear if multiple *OR* conditions for the same variable
|
|
or attribute were combined into a single condition.
|
|
* `use-index-range`: will appear if an index can be used to iterate over a collection.
|
|
As a consequence, an *EnumerateCollectionNode* was replaced with an
|
|
*IndexRangeNode* in the plan.
|
|
* `remove-filters-covered-by-index`: will appear if a *FilterNode* was removed or replaced
|
|
because the filter condition is already covered by an *IndexRangeNode*.
|
|
* `use-index-for-sort`: will appear if an index can be used to avoid a *SORT*
|
|
operation. If the rule was applied, a *SortNode* was removed from the plan.
|
|
* `move-calculations-down`: will appear if a *CalculationNode* was moved down in a plan.
|
|
The intention of this rule is to move calculations down in the processing pipeline
|
|
as far as possible (below *FILTER*, *LIMIT* and *SUBQUERY* nodes) so they are executed
|
|
as late as possible and not before their results are required.
|
|
* `patch-update-statements`: will appear if an *UpdateNode* was patched to not buffer
|
|
its input completely, but to process it in smaller batches. The rule will fire for an
|
|
*UPDATE* query that is fed by a full collection scan, and that does not use any other
|
|
indexes and subqueries.
|
|
|
|
The following optimizer rules may appear in the `rules` attribute of cluster plans:
|
|
|
|
* `distribute-in-cluster`: will appear when query parts get distributed in a cluster.
|
|
This is not an optimization rule, and it cannot be turned off.
|
|
* `scatter-in-cluster`: will appear when scatter, gather, and remote nodes are inserted
|
|
into a distributed query. This is not an optimization rule, and it cannot be turned off.
|
|
* `distribute-filtercalc-to-cluster`: will appear when filters are moved up in a
|
|
distributed execution plan. Filters are moved as far up in the plan as possible to
|
|
make result sets as small as possible as early as possible.
|
|
* `distribute-sort-to-cluster`: will appear if sorts are moved up in a distributed query.
|
|
Sorts are moved as far up in the plan as possible to make result sets as small as possible
|
|
as early as possible.
|
|
* `remove-unnecessary-remote-scatter`: will appear if a RemoteNode is followed by a
|
|
ScatterNode, and the ScatterNode is only followed by calculations or the SingletonNode.
|
|
In this case, there is no need to distribute the calculation, and it will be handled
|
|
centrally.
|
|
* `undistribute-remove-after-enum-coll`: will appear if a RemoveNode can be pushed into
|
|
the same query part that enumerates over the documents of a collection. This saves
|
|
inter-cluster roundtrips between the EnumerateCollectionNode and the RemoveNode.
|
|
|
|
Note that some rules may appear multiple times in the list, with number suffixes.
|
|
This is due to the same rule being applied multiple times, at different positions
|
|
in the optimizer pipeline.
|