!CHAPTER COLLECT

The *COLLECT* keyword can be used to group an array by one or multiple group
criteria.

The *COLLECT* statement will eliminate all local variables in the current
scope. After *COLLECT* only the variables introduced by *COLLECT* itself are
available.

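As a minimal sketch of this scoping rule, the following query iterates over a
small inline array (hypothetical sample data, used here instead of a collection).
After the *COLLECT*, neither *u* nor *name* can be referenced any more; only
*city* is available:

```
FOR u IN [
  { "name" : "Anna", "city" : "Berlin" },
  { "name" : "Ben", "city" : "Paris" }
]
  LET name = u.name
  COLLECT city = u.city
  /* u and name are out of scope from here on; only city can be used */
  RETURN city
```

Replacing the final line with *RETURN name* should make the query fail, because
*name* no longer exists after the *COLLECT*.
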
The general syntaxes for *COLLECT* are:

```
COLLECT variable-name = expression options
COLLECT variable-name = expression INTO groups-variable options
COLLECT variable-name = expression INTO groups-variable = projection-expression options
COLLECT variable-name = expression INTO groups-variable KEEP keep-variable options
COLLECT variable-name = expression WITH COUNT INTO count-variable options
COLLECT variable-name = expression AGGREGATE variable-name = aggregate-expression options
COLLECT AGGREGATE variable-name = aggregate-expression options
COLLECT WITH COUNT INTO count-variable options
```

!SUBSECTION Grouping syntaxes

The first syntax form of *COLLECT* only groups the result by the defined group
criteria specified in *expression*. In order to further process the results
produced by *COLLECT*, a new variable (specified by *variable-name*) is introduced.
This variable contains the group value.

Here's an example query that finds the distinct values in *u.city* and makes
them available in variable *city*:

```
FOR u IN users
  COLLECT city = u.city
  RETURN {
    "city" : city
  }
```

The second form does the same as the first form, but additionally introduces a
variable (specified by *groups-variable*) that contains all elements that fell into the
group. This works as follows: the *groups-variable* variable is an array containing
as many elements as there are in the group. Each member of that array is
a JSON object in which the value of every variable that is defined in the
AQL query is bound to the corresponding attribute. Note that this considers
all variables that are defined before the *COLLECT* statement, but not those on
the top level (outside of any *FOR*), unless the *COLLECT* statement is itself
on the top level, in which case all variables are taken. Also note
that the optimizer may move *LET* statements out of *FOR*
statements to improve performance.

```
FOR u IN users
  COLLECT city = u.city INTO groups
  RETURN {
    "city" : city,
    "usersInCity" : groups
  }
```

In the above example, the array *users* will be grouped by the attribute
*city*. The result is a new array of documents, with one element per distinct
*u.city* value. The elements from the original array (here: *users*) per city are
made available in the variable *groups*. This is due to the *INTO* clause.

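To make the shape of the *groups-variable* more concrete, here is a small
sketch that uses an inline array of hypothetical sample documents instead of the
*users* collection. Because *u* is the only variable defined before the
*COLLECT*, each member of *groups* is expected to be an object with a single
attribute *u* that holds the original document:

```
FOR u IN [
  { "name" : "Anna", "city" : "Berlin" },
  { "name" : "Ben", "city" : "Paris" },
  { "name" : "Cleo", "city" : "Berlin" }
]
  COLLECT city = u.city INTO groups
  /* each member of groups looks like { "u" : { "name" : ..., "city" : ... } } */
  RETURN {
    "city" : city,
    "userNames" : groups[*].u.name
  }
```

The original documents are therefore reached via *groups[\*].u*, not via
*groups[\*]* directly.
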
*COLLECT* also allows specifying multiple group criteria. Individual group
criteria can be separated by commas:

```
FOR u IN users
  COLLECT country = u.country, city = u.city INTO groups
  RETURN {
    "country" : country,
    "city" : city,
    "usersInCity" : groups
  }
```

In the above example, the array *users* is grouped by country first and then
by city, and for each distinct combination of country and city, the users
will be returned.


!SUBSECTION Discarding obsolete variables

The third form of *COLLECT* allows rewriting the contents of the *groups-variable*
using an arbitrary *projection-expression*:

```
FOR u IN users
  COLLECT country = u.country, city = u.city INTO groups = u.name
  RETURN {
    "country" : country,
    "city" : city,
    "userNames" : groups
  }
```

In the above example, the *projection-expression* is just *u.name*. Therefore,
only this attribute is copied into the *groups-variable* for each document.
This is usually much more efficient than copying all variables from the scope into
the *groups-variable*, as would happen without a *projection-expression*.

The expression following *INTO* can also be used for arbitrary computations:

```
FOR u IN users
  COLLECT country = u.country, city = u.city INTO groups = {
    "name" : u.name,
    "isActive" : u.status == "active"
  }
  RETURN {
    "country" : country,
    "city" : city,
    "usersInCity" : groups
  }
```

*COLLECT* also provides an optional *KEEP* clause that can be used to control
which variables will be copied into the variable created by *INTO*. If no
*KEEP* clause is specified, all variables from the scope will be copied as
sub-attributes into the *groups-variable*.
This is safe but can have a negative impact on performance if there
are many variables in scope or the variables contain massive amounts of data.

The following example limits the variables that are copied into the *groups-variable*
to just *name*. The variables *u* and *someCalculation*, which are also present in the scope,
will not be copied into the *groups-variable* because they are not listed in the *KEEP* clause:

```
FOR u IN users
  LET name = u.name
  LET someCalculation = u.value1 + u.value2
  COLLECT city = u.city INTO groups KEEP name
  RETURN {
    "city" : city,
    "userNames" : groups[*].name
  }
```

*KEEP* is only valid in combination with *INTO*. Only valid variable names can
be used in the *KEEP* clause. *KEEP* supports the specification of multiple
variable names.


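As an illustration of *KEEP* with more than one variable, the following sketch
keeps both *name* and *someCalculation* but still drops *u*. It assumes that
multiple names are separated by commas, and it uses an inline array of
hypothetical sample documents so that it does not depend on a collection:

```
FOR u IN [
  { "name" : "Anna", "city" : "Berlin", "value1" : 1, "value2" : 2 },
  { "name" : "Ben", "city" : "Paris", "value1" : 3, "value2" : 4 }
]
  LET name = u.name
  LET someCalculation = u.value1 + u.value2
  COLLECT city = u.city INTO groups KEEP name, someCalculation
  RETURN {
    "city" : city,
    "userNames" : groups[*].name,
    "calculations" : groups[*].someCalculation
  }
```
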
!SUBSECTION Group length calculation

*COLLECT* also provides a special *WITH COUNT* clause that can be used to
determine the number of group members efficiently.

The simplest form just returns the number of items that made it into the
*COLLECT*:

```
FOR u IN users
  COLLECT WITH COUNT INTO length
  RETURN length
```

The above is equivalent to, but more efficient than:

```
RETURN LENGTH(
  FOR u IN users
    RETURN 1
)
```

The *WITH COUNT* clause can also be used to efficiently count the number
of items in each group:

```
FOR u IN users
  COLLECT age = u.age WITH COUNT INTO length
  RETURN {
    "age" : age,
    "count" : length
  }
```

Note: the *WITH COUNT* clause can only be used together with an *INTO* clause.


!SUBSECTION Aggregation

A `COLLECT` statement can be used to perform aggregation of data per group. To
only determine group lengths, the `WITH COUNT INTO` variant of `COLLECT` can be
used as described before.

For other aggregations, it is possible to run aggregate functions on the `COLLECT`
results:

```
FOR u IN users
  COLLECT ageGroup = FLOOR(u.age / 5) * 5 INTO g
  RETURN {
    "ageGroup" : ageGroup,
    "minAge" : MIN(g[*].u.age),
    "maxAge" : MAX(g[*].u.age)
  }
```

The above however requires storing all group values during the collect operation for
all groups, which can be inefficient.

The special `AGGREGATE` variant of `COLLECT` allows building the aggregate values
incrementally during the collect operation, and is therefore often more efficient.

With the `AGGREGATE` variant the above query becomes:

```
FOR u IN users
  COLLECT ageGroup = FLOOR(u.age / 5) * 5
  AGGREGATE minAge = MIN(u.age), maxAge = MAX(u.age)
  RETURN {
    ageGroup,
    minAge,
    maxAge
  }
```

The `AGGREGATE` keyword can only be used after the `COLLECT` keyword. If used, it
must directly follow the declaration of the grouping keys. If no grouping keys
are used, it must follow the `COLLECT` keyword directly:

```
FOR u IN users
  COLLECT AGGREGATE minAge = MIN(u.age), maxAge = MAX(u.age)
  RETURN {
    minAge,
    maxAge
  }
```

Only specific expressions are allowed on the right-hand side of each `AGGREGATE`
assignment:

- on the top level, an aggregate expression must be a call to one of the supported
  aggregation functions `LENGTH`, `MIN`, `MAX`, `SUM`, `AVERAGE`, `STDDEV_POPULATION`,
  `STDDEV_SAMPLE`, `VARIANCE_POPULATION`, or `VARIANCE_SAMPLE`

- an aggregate expression must not refer to variables introduced by the `COLLECT` itself


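The two rules above can be sketched with a small query over an inline array of
hypothetical sample documents. The two assignments that are used satisfy both
rules; the commented-out variants show what the rules are understood to forbid:

```
FOR u IN [
  { "city" : "Berlin", "age" : 20 },
  { "city" : "Berlin", "age" : 30 },
  { "city" : "Paris", "age" : 40 }
]
  COLLECT city = u.city
  AGGREGATE avgAge = AVERAGE(u.age), totalAge = SUM(u.age)
  /* not allowed: AGGREGATE x = SUM(u.age) + 1 (top level is not an aggregation function call) */
  /* not allowed: AGGREGATE x = MAX(city) (refers to a variable introduced by this COLLECT) */
  RETURN {
    "city" : city,
    "avgAge" : avgAge,
    "totalAge" : totalAge
  }
```

For this sample data, the group for Berlin should have an *avgAge* of 25 and a
*totalAge* of 50, and the group for Paris should have 40 for both values.
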
!SUBSECTION COLLECT variants

Since ArangoDB 2.6, there are two variants of *COLLECT* that the optimizer can
choose from: the *sorted* variant and the *hash* variant. The *hash* variant only becomes a
candidate for *COLLECT* statements that do not use an *INTO* clause.

The optimizer will always generate a plan that employs the *sorted* method. The *sorted* method
requires its input to be sorted by the group criteria specified in the *COLLECT* clause.
To ensure correctness of the result, the AQL optimizer will automatically insert a *SORT*
statement into the query in front of the *COLLECT* statement. The optimizer may be able to
optimize away that *SORT* statement later if a sorted index is present on the group criteria.

In case a *COLLECT* qualifies for using the *hash* variant, the optimizer will create an extra
plan for it at the beginning of the planning phase. In this plan, no extra *SORT* statement will be
added in front of the *COLLECT*. This is because the *hash* variant of *COLLECT* does not require
sorted input. Instead, a *SORT* statement will be added after the *COLLECT* to sort its output.
This *SORT* statement may be optimized away again in later stages.
If the sort order of the *COLLECT* is irrelevant to the user, adding the extra instruction *SORT null*
after the *COLLECT* will allow the optimizer to remove the sorts altogether:

```
FOR u IN users
  COLLECT age = u.age
  SORT null  /* note: will be optimized away */
  RETURN age
```

Which *COLLECT* variant is used by the optimizer depends on the optimizer's cost estimations. The
created plans with the different *COLLECT* variants will be shipped through the regular optimization
pipeline. In the end, the optimizer will pick the plan with the lowest estimated total cost as usual.

In general, the *sorted* variant of *COLLECT* should be preferred in cases when there is a sorted index
present on the group criteria. In this case the optimizer can eliminate the *SORT* statement in front
of the *COLLECT*, so that no *SORT* will be left.

If there is no sorted index available on the group criteria, the up-front sort required by the *sorted*
variant can be expensive. In this case it is likely that the optimizer will prefer the *hash* variant
of *COLLECT*, which does not require its input to be sorted.

Which variant of *COLLECT* was actually used can be figured out by looking into the execution plan of
a query, specifically the *AggregateNode* and its *aggregationOptions* attribute.


!SUBSECTION Setting COLLECT options
|
|
|
|
*options* can be used in a *COLLECT* statement to inform the optimizer about the preferred *COLLECT*
|
|
method. When specifying the following appendix to a *COLLECT* statement, the optimizer will always use
|
|
the *sorted* variant of *COLLECT* and not even create a plan using the *hash* variant:
|
|
|
|
```
|
|
OPTIONS { method: "sorted" }
|
|
```
|
|
|
|
Note that specifying *hash* as method will not make the optimizer use the *hash* variant. This is
|
|
because the *hash* variant is not eligible for all queries. Instead, if no options or any other method
|
|
than *sorted* are specified in *OPTIONS*, the optimizer will use its regular cost estimations.
|
|
|
|
|
|
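As a sketch of where the *OPTIONS* part goes in a complete query, following the
general syntax `COLLECT variable-name = expression options`, here is a query
over an inline array of hypothetical sample documents:

```
FOR u IN [
  { "age" : 31 },
  { "age" : 25 },
  { "age" : 31 }
]
  COLLECT age = u.age OPTIONS { method: "sorted" }
  RETURN age
```

With this hint the execution plan should only contain the *sorted* variant,
which can be checked by inspecting the plan as described in the previous section.
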
!SUBSECTION COLLECT vs. RETURN DISTINCT

In order to make a result set unique, one can either use *COLLECT* or *RETURN DISTINCT*. Behind the
scenes, both variants will work by creating an *AggregateNode*. For both variants, the optimizer
may try the sorted and the hashed variant of *COLLECT*. The difference is therefore mainly syntactical,
with *RETURN DISTINCT* saving a bit of typing when compared to an equivalent *COLLECT*:

```
FOR u IN users
  RETURN DISTINCT u.age
```

```
FOR u IN users
  COLLECT age = u.age
  RETURN age
```

However, *COLLECT* is vastly more flexible than *RETURN DISTINCT*. Additionally, the order of results is
undefined for a *RETURN DISTINCT*, whereas for a *COLLECT* the results will be sorted.