---
layout: default
description: The COLLECT keyword can be used to group an array by one or multiple grouping criteria
title: Grouping and aggregation with COLLECT in AQL
---
COLLECT
=======

The `COLLECT` keyword can be used to group an array by one or multiple group
criteria. 

The `COLLECT` statement will eliminate all local variables in the current
scope. After `COLLECT` only the variables introduced by `COLLECT` itself are
available.
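
As a minimal sketch of this scoping rule, assuming a collection named *users*
whose documents have a *city* attribute, the loop variable *u* can no longer be
referenced once the `COLLECT` has run:

```
FOR u IN users
  COLLECT city = u.city
  /* only "city" is in scope from here on; referring to "u"
     (for example RETURN u.name) would no longer work */
  RETURN city
```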

There are several syntax variants for `COLLECT` operations:

```
COLLECT variableName = expression
COLLECT variableName = expression INTO groupsVariable
COLLECT variableName = expression INTO groupsVariable = projectionExpression
COLLECT variableName = expression INTO groupsVariable KEEP keepVariable
COLLECT variableName = expression WITH COUNT INTO countVariable
COLLECT variableName = expression AGGREGATE variableName = aggregateExpression
COLLECT variableName = expression AGGREGATE variableName = aggregateExpression INTO groupsVariable
COLLECT AGGREGATE variableName = aggregateExpression
COLLECT AGGREGATE variableName = aggregateExpression INTO groupsVariable
COLLECT WITH COUNT INTO countVariable
```

All variants can optionally end with an `OPTIONS { … }` clause.

Grouping syntaxes
-----------------

The first syntax form of `COLLECT` only groups the result by the defined group 
criteria specified in *expression*. In order to further process the results 
produced by `COLLECT`, a new variable (specified by *variableName*) is introduced. 
This variable contains the group value.

Here's an example query that finds the distinct values in *u.city* and makes
them available in the variable *city*:

```
FOR u IN users
  COLLECT city = u.city
  RETURN { 
    "city" : city 
  }
```

The second form does the same as the first form, but additionally introduces a
variable (specified by *groupsVariable*) that contains all elements that fell into the
group. The *groupsVariable* is an array with one member per element of the
group. Each member is a JSON object in which every variable defined in the
AQL query is bound to the attribute of the same name. This covers
all variables that are defined before the `COLLECT` statement, but not those on
the top level (outside of any `FOR`), unless the `COLLECT` statement is itself
on the top level, in which case all variables are taken. Also note
that the optimizer may move `LET` statements out of `FOR`
statements to improve performance.

```
FOR u IN users
  COLLECT city = u.city INTO groups
  RETURN { 
    "city" : city, 
    "usersInCity" : groups 
  }
```

In the above example, the array *users* will be grouped by the attribute
*city*. The result is a new array of documents, with one element per distinct
*u.city* value. The elements from the original array (here: *users*) per city are
made available in the variable *groups*. This is due to the `INTO` clause.
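
To make the shape of *groups* more concrete, the following sketch adds two
extra variables. Going by the rules described above, each member of *groups*
should carry the attributes *u* and *name*, while the top-level variable
*minAge* is left out. The variable names *minAge* and *name* and the *age*
attribute are assumptions made for illustration:

```
LET minAge = 18                /* top level: not copied into groups */
FOR u IN users
  FILTER u.age >= minAge
  LET name = u.name            /* inside the FOR: copied into groups */
  COLLECT city = u.city INTO groups
  RETURN { 
    "city" : city, 
    "names" : groups[*].name, 
    "ages" : groups[*].u.age 
  }
```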

`COLLECT` also allows specifying multiple group criteria. Individual group
criteria can be separated by commas:

```
FOR u IN users
  COLLECT country = u.country, city = u.city INTO groups
  RETURN { 
    "country" : country, 
    "city" : city, 
    "usersInCity" : groups 
  }
```

In the above example, the array *users* is grouped by country first and then
by city, and for each distinct combination of country and city, the users
will be returned.


Discarding obsolete variables
-----------------------------

The third form of `COLLECT` allows rewriting the contents of the *groupsVariable* 
using an arbitrary *projectionExpression*:

```
FOR u IN users
  COLLECT country = u.country, city = u.city INTO groups = u.name
  RETURN { 
    "country" : country, 
    "city" : city, 
    "userNames" : groups 
  }
```

In the above example, the *projectionExpression* is *u.name*. Therefore,
only this attribute is copied into the *groupsVariable* for each document. 
This is usually much more efficient than copying all variables from the scope into 
the *groupsVariable*, as would happen without a *projectionExpression*.

The expression following `INTO` can also be used for arbitrary computations:

```
FOR u IN users
  COLLECT country = u.country, city = u.city INTO groups = { 
    "name" : u.name, 
    "isActive" : u.status == "active"
  }
  RETURN { 
    "country" : country, 
    "city" : city, 
    "usersInCity" : groups 
  }
```

`COLLECT` also provides an optional `KEEP` clause that can be used to control
which variables will be copied into the variable created by `INTO`. If no 
`KEEP` clause is specified, all variables from the scope will be copied as 
sub-attributes into the *groupsVariable*. 
This is safe but can have a negative impact on performance if there 
are many variables in scope or the variables contain massive amounts of data. 

The following example limits the variables that are copied into the *groupsVariable*
to just *name*. The variables *u* and *someCalculation*, which are also present in the
scope, will not be copied into *groupsVariable* because they are not listed in the
`KEEP` clause:

```
FOR u IN users
  LET name = u.name
  LET someCalculation = u.value1 + u.value2
  COLLECT city = u.city INTO groups KEEP name 
  RETURN { 
    "city" : city, 
    "userNames" : groups[*].name 
  }
```

`KEEP` is only valid in combination with `INTO`. Only valid variable names can
be used in the `KEEP` clause. `KEEP` supports the specification of multiple 
variable names.
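
A sketch of a `KEEP` clause with more than one variable, separated by commas.
The *status* variable is hypothetical and only serves to show a second entry:

```
FOR u IN users
  LET name = u.name
  LET status = u.status
  COLLECT city = u.city INTO groups KEEP name, status
  RETURN { 
    "city" : city, 
    "userNames" : groups[*].name, 
    "userStatuses" : groups[*].status 
  }
```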


Group length calculation
------------------------

`COLLECT` also provides a special *WITH COUNT* clause that can be used to 
determine the number of group members efficiently.

The simplest form just returns the number of items that made it into the
`COLLECT`:

```
FOR u IN users
  COLLECT WITH COUNT INTO length
  RETURN length
```

The above is equivalent to, but less efficient than:

```
RETURN LENGTH(users)
```

The *WITH COUNT* clause can also be used to efficiently count the number
of items in each group:

```
FOR u IN users
  COLLECT age = u.age WITH COUNT INTO length
  RETURN { 
    "age" : age, 
    "count" : length 
  }
```

Note: the *WITH COUNT* clause can only be used together with an `INTO` clause.
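
The count variable behaves like any other variable after the `COLLECT`, so it
can be used for sorting and limiting. As a sketch, the following query would
return the three most common ages (the limit of 3 is arbitrary):

```
FOR u IN users
  COLLECT age = u.age WITH COUNT INTO length
  SORT length DESC
  LIMIT 3
  RETURN { 
    "age" : age, 
    "count" : length 
  }
```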


Aggregation
-----------

A `COLLECT` statement can be used to perform aggregation of data per group. To
only determine group lengths, the `WITH COUNT INTO` variant of `COLLECT` can be
used as described before.

For other aggregations, it is possible to run aggregate functions on the `COLLECT`
results:

```
FOR u IN users
  COLLECT ageGroup = FLOOR(u.age / 5) * 5 INTO g
  RETURN { 
    "ageGroup" : ageGroup,
    "minAge" : MIN(g[*].u.age),
    "maxAge" : MAX(g[*].u.age)
  }
```

The above, however, requires storing all group values for all groups during the 
collect operation, which can be inefficient. 

The special `AGGREGATE` variant of `COLLECT` allows building the aggregate values 
incrementally during the collect operation, and is therefore often more efficient.

With the `AGGREGATE` variant the above query becomes:

```
FOR u IN users
  COLLECT ageGroup = FLOOR(u.age / 5) * 5 
  AGGREGATE minAge = MIN(u.age), maxAge = MAX(u.age)
  RETURN {
    ageGroup, 
    minAge, 
    maxAge 
  }
```
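
The syntax overview also lists a form that combines `AGGREGATE` with `INTO`.
A sketch of how that might look, assuming the group members are needed
alongside the aggregate values (the *names* attribute is chosen for
illustration):

```
FOR u IN users
  COLLECT ageGroup = FLOOR(u.age / 5) * 5 
  AGGREGATE minAge = MIN(u.age), maxAge = MAX(u.age) INTO g
  RETURN {
    ageGroup, 
    minAge, 
    maxAge, 
    "names" : g[*].u.name 
  }
```

Keeping the group members brings back the cost of storing them, so this form
is probably best reserved for cases where both the aggregates and the members
are required.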

The `AGGREGATE` keyword can only be used after the `COLLECT` keyword. If used, it 
must directly follow the declaration of the grouping keys. If no grouping keys 
are used, it must follow the `COLLECT` keyword directly:


```
FOR u IN users
  COLLECT AGGREGATE minAge = MIN(u.age), maxAge = MAX(u.age)
  RETURN {
    minAge, 
    maxAge 
  }
```
      
Only specific expressions are allowed on the right-hand side of each `AGGREGATE` 
assignment:

- on the top level, an aggregate expression must be a call to one of the supported 
  aggregation functions `LENGTH`, `MIN`, `MAX`, `SUM`, `AVERAGE`, `STDDEV_POPULATION`, 
  `STDDEV_SAMPLE`, `VARIANCE_POPULATION`, `VARIANCE_SAMPLE`, `UNIQUE`, `SORTED_UNIQUE` 
  or `COUNT_DISTINCT`. The following aliases are allowed too: `COUNT` (for `LENGTH`),
  `AVG` (for `AVERAGE`), `STDDEV` (for `STDDEV_POPULATION`), `VARIANCE` (for `VARIANCE_POPULATION`),
  `COUNT_UNIQUE` (for `COUNT_DISTINCT`).

- an aggregate expression must not refer to variables introduced by the `COLLECT` itself
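
As an illustration of these rules, the sketch below uses several of the listed
functions, including the `AVG` alias. Because the top level of each assignment
must be a single aggregation function call, a derived value such as the age
range is computed after the `COLLECT` rather than inside `AGGREGATE`. The
variable names are chosen for illustration only:

```
FOR u IN users
  COLLECT city = u.city
  AGGREGATE minAge = MIN(u.age), 
            maxAge = MAX(u.age), 
            avgAge = AVG(u.age), 
            distinctAges = COUNT_DISTINCT(u.age)
  /* not allowed: AGGREGATE ageRange = MAX(u.age) - MIN(u.age) */
  LET ageRange = maxAge - minAge
  RETURN { 
    city, 
    avgAge, 
    distinctAges, 
    ageRange 
  }
```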


COLLECT variants
----------------

Since ArangoDB 2.6, there are two variants of `COLLECT` that the optimizer can
choose from: the *sorted* variant and the *hash* variant. The *hash* variant only becomes a
candidate for `COLLECT` statements that do not use an `INTO` clause.

The optimizer will always generate a plan that employs the *sorted* method. The *sorted* method 
requires its input to be sorted by the group criteria specified in the `COLLECT` clause. 
To ensure correctness of the result, the AQL optimizer will automatically insert a `SORT` 
statement into the query in front of the `COLLECT` statement. The optimizer may be able to 
optimize away that `SORT` statement later if a sorted index is present on the group criteria. 

In case a `COLLECT` statement qualifies for using the *hash* variant, the optimizer will create an extra 
plan for it at the beginning of the planning phase. In this plan, no extra `SORT` statement will be
added in front of the `COLLECT`. This is because the *hash* variant of `COLLECT` does not require
sorted input. Instead, a `SORT` statement will be added after the `COLLECT` to sort its output. 
This `SORT` statement may be optimized away again in later stages. 
If the sort order of the `COLLECT` result is irrelevant to the user, adding the extra instruction `SORT null` 
after the `COLLECT` will allow the optimizer to remove the sorts altogether:

```
FOR u IN users
  COLLECT age = u.age
  SORT null  /* note: will be optimized away */
  RETURN age
```
  
Which `COLLECT` variant is used by the optimizer depends on the optimizer's cost estimations. The 
created plans with the different `COLLECT` variants will be shipped through the regular optimization 
pipeline. In the end, the optimizer will pick the plan with the lowest estimated total cost as usual. 

In general, the *sorted* variant of `COLLECT` should be preferred in cases when there is a sorted index
present on the group criteria. In this case the optimizer can eliminate the `SORT` statement in front
of the `COLLECT`, so that no `SORT` will be left. 

If there is no sorted index available on the group criteria, the up-front sort required by the *sorted* 
variant can be expensive. In this case it is likely that the optimizer will prefer the *hash* variant
of `COLLECT`, which does not require its input to be sorted. 

Which variant of `COLLECT` was actually used can be figured out by looking into the execution plan of
a query, specifically the *AggregateNode* and its *aggregationOptions* attribute.


Setting COLLECT options
-----------------------

`OPTIONS` can be used in a `COLLECT` statement to inform the optimizer about the preferred `COLLECT`
method. When the following clause is appended to a `COLLECT` statement, the optimizer will always use
the *sorted* variant of `COLLECT` and not even create a plan using the *hash* variant:

```
OPTIONS { method: "sorted" }
```
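
In a complete query, the clause goes at the end of the `COLLECT` statement.
A minimal sketch, reusing the *users* example from the earlier sections:

```
FOR u IN users
  COLLECT age = u.age OPTIONS { method: "sorted" }
  RETURN age
```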

It is also possible to specify *hash* as the preferred method. In this case the optimizer will create
a plan using the *hash* method only if the `COLLECT` statement qualifies (not all `COLLECT` statements
can use the *hash* method). If the `COLLECT` statement qualifies, there will be only one plan, and it
uses the *hash* method. If it does not qualify, the optimizer will use the *sorted* method.
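
Requesting the *hash* method looks the same. The `COLLECT` in this sketch has
no `INTO` clause, so it should qualify; the trailing `SORT null` is optional
and only signals that the output order does not matter:

```
FOR u IN users
  COLLECT age = u.age OPTIONS { method: "hash" }
  SORT null
  RETURN age
```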

If no method is specified, then the optimizer will create a plan that uses the *sorted* method, and
an additional plan using the *hash* method if the `COLLECT` statement qualifies for it.


COLLECT vs. RETURN DISTINCT
---------------------------

In order to make a result set unique, one can either use `COLLECT` or `RETURN DISTINCT`. Behind the
scenes, both variants will work by creating an *AggregateNode*. For both variants, the optimizer
may try the sorted and the hashed variant of `COLLECT`. The difference is therefore mainly syntactical,
with `RETURN DISTINCT` saving a bit of typing when compared to an equivalent `COLLECT`:

```
FOR u IN users
  RETURN DISTINCT u.age
```

```
FOR u IN users
  COLLECT age = u.age
  RETURN age
```

However, `COLLECT` is vastly more flexible than `RETURN DISTINCT`. Aside from
its sophisticated grouping and aggregation capabilities, `COLLECT` also allows
you to place a `LIMIT` operation before `RETURN` to potentially stop the
`COLLECT` operation early.
Additionally, `COLLECT` supports [options](#setting-collect-options).
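
A sketch of the `LIMIT` case mentioned above, which returns at most three
distinct ages (the limit value is arbitrary):

```
FOR u IN users
  COLLECT age = u.age
  LIMIT 3
  RETURN age
```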

`RETURN DISTINCT` does not change the order of results, whereas `COLLECT` sorts
them (regardless of the method, _sorted_ or _hash_) unless explicitly disabled
by the user with a subsequent `SORT null`
(see [COLLECT variants](#collect-variants)).