arangodb/Documentation/Books/Users/AqlExamples/Grouping.mdpp

!CHAPTER Grouping

To group results by arbitrary criteria, AQL provides the *COLLECT* keyword.
*COLLECT* will perform a grouping, but no aggregation. Aggregation can still be
added in the query if required.

!SUBSECTION Ensuring uniqueness

*COLLECT* can be used to make a result set unique. The following query will return each distinct
`age` attribute value only once:

```
FOR u IN users
  COLLECT age = u.age
  RETURN age
```

This is grouping without tracking the group values, but just the group criterion (*age*) value.

Grouping can also be done on multiple levels using *COLLECT*:

```
FOR u IN users
  COLLECT status = u.status, age = u.age
  RETURN { status, age }
```


Alternatively *RETURN DISTINCT* can be used to make a result set unique. *RETURN DISTINCT* supports a
single criterion only:

```
FOR u IN users
  RETURN DISTINCT u.age
```

Note: the order of results is undefined for *RETURN DISTINCT*.

!SUBSECTION Fetching group values

To group users by age, and return the names of the users with the highest ages,
we'll issue a query like this:

```
FOR u IN users
  FILTER u.active == true
  COLLECT age = u.age INTO usersByAge
  SORT age DESC LIMIT 0, 5
  RETURN {
    "age" : age,
    "users" : usersByAge[*].u.name
  }

[
  {
    "age" : 37,
    "users" : [
      "John",
      "Sophia"
    ]
  },
  {
    "age" : 36,
    "users" : [
      "Fred",
      "Emma"
    ]
  },
  {
    "age" : 34,
    "users" : [
      "Madison"
    ]
  },
  {
    "age" : 33,
    "users" : [
      "Chloe",
      "Michael"
    ]
  },
  {
    "age" : 32,
    "users" : [
      "Alexander"
    ]
  }
]
```

The query will put all users together by their *age* attribute. There will be one
result document per distinct *age* value (let aside the *LIMIT*). For each group,
we have access to the matching document via the *usersByAge* variable introduced in
the *COLLECT* statement.

!SUBSECTION Variable Expansion

The *usersByAge* variable contains the full documents found, and as we're only
interested in user names, we'll use the expansion operator <i>[\*]</i> to extract just the
*name* attribute of all user documents in each group.

The <i>[\*]</i> expansion operator is just a handy short-cut. Instead of <i>usersByAge[\*].u.name</i>
we could also write:

```
FOR temp IN usersByAge
  RETURN temp.u.name
```

!SUBSECTION Grouping by multiple criteria

To group by multiple criteria, we'll use multiple arguments in the *COLLECT* clause.
For example, to group users by *ageGroup* (a derived value we need to calculate first)
and then by *gender*, we'll do:

```
FOR u IN users
  FILTER u.active == true
  COLLECT ageGroup = FLOOR(u.age / 5) * 5,
          gender = u.gender INTO group
  SORT ageGroup DESC
  RETURN {
    "ageGroup" : ageGroup,
    "gender" : gender
  }

[
  {
    "ageGroup" : 35,
    "gender" : "f"
  },
  {
    "ageGroup" : 35,
    "gender" : "m"
  },
  {
    "ageGroup" : 30,
    "gender" : "f"
  },
  {
    "ageGroup" : 30,
    "gender" : "m"
  },
  {
    "ageGroup" : 25,
    "gender" : "f"
  },
  {
    "ageGroup" : 25,
    "gender" : "m"
  }
]
```

!SUBSECTION Aggregation

So far we only grouped data without aggregation. Adding aggregation is simple in AQL,
as all that needs to be done is to run an aggregate function on the array created by
the *INTO* clause of a *COLLECT* statement:

```
FOR u IN users
  FILTER u.active == true
  COLLECT ageGroup = FLOOR(u.age / 5) * 5,
          gender = u.gender INTO group
  SORT ageGroup DESC
  RETURN {
    "ageGroup" : ageGroup,
    "gender" : gender,
    "numUsers" : LENGTH(group)
  }

[
  {
    "ageGroup" : 35,
    "gender" : "f",
    "numUsers" : 2
  },
  {
    "ageGroup" : 35,
    "gender" : "m",
    "numUsers" : 2
  },
  {
    "ageGroup" : 30,
    "gender" : "f",
    "numUsers" : 4
  },
  {
    "ageGroup" : 30,
    "gender" : "m",
    "numUsers" : 4
  },
  {
    "ageGroup" : 25,
    "gender" : "f",
    "numUsers" : 2
  },
  {
    "ageGroup" : 25,
    "gender" : "m",
    "numUsers" : 2
  }
]
```

We have used the function *LENGTH* here (it returns the length of a array). This is the
equivalent to SQL's `SELECT g, COUNT(*) FROM ... GROUP BY g`.
In addition to *LENGTH* AQL also provides *MAX*, *MIN*, *SUM* and *AVERAGE* as
basic aggregation functions.

In AQL all aggregation functions can be run on arrays only. If an aggregation function
is run on anything that is not an array, an error will occur and the query will fail.

!SUBSECTION Post-filtering aggregated data

To filter on the results of a grouping or aggregation operation (i.e. something
similar to *HAVING* in SQL), simply add another *FILTER* clause after the *COLLECT*
statement.

For example, to get the 3 *ageGroup*s with the most users in them:

```
FOR u IN users
  FILTER u.active == true
  COLLECT ageGroup = FLOOR(u.age / 5) * 5 INTO group
  LET numUsers = LENGTH(group)
  FILTER numUsers > 2 // group must contain at least 3 users in order to qualify
  SORT numUsers DESC
  LIMIT 0, 3
  RETURN {
    "ageGroup" : ageGroup,
    "numUsers" : numUsers,
    "users" : group[*].u.name
  }

[
  {
    "ageGroup" : 30,
    "numUsers" : 8,
    "users" : [
      "Abigail",
      "Madison",
      "Anthony",
      "Alexander",
      "Isabella",
      "Chloe",
      "Daniel",
      "Michael"
    ]
  },
  {
    "ageGroup" : 25,
    "numUsers" : 4,
    "users" : [
      "Mary",
      "Mariah",
      "Jim",
      "Diego"
    ]
  },
  {
    "ageGroup" : 35,
    "numUsers" : 4,
    "users" : [
      "Fred",
      "John",
      "Emma",
      "Sophia"
    ]
  }
]
```

To increase readability, the repeated expression *LENGTH(group)* was put into a variable
*numUsers*. The *FILTER* on *numUsers* is the equivalent an SQL *HAVING* clause.