arangodb/Documentation/Books/Manual/Analyzers
Vasiliy 1a22d1360c issue 526.9.1: implement swagger interface, add documentation (#8730)
2019-04-16 18:54:30 +03:00

Analyzers powered by IResearch

Background

Value "analysis" is the process of breaking up a given value into a set of sub-values that are internally tied together by metadata. This metadata influences both the search and sort stages so that they return the most appropriate match for the specified conditions, similar to queries to web search engines.

In plain terms, this means a user can, for example:

  • request documents where the body attribute best matches a quick brown fox
  • request documents where the dna attribute best matches a DNA sub sequence
  • request documents where the name attribute best matches gender
  • etc. (via custom analyzers)

What are Analyzers

Analyzers are helpers that allow a user to parse and transform an arbitrary value into zero or more resulting values. Currently only string values are supported. The parsing and transformation applied are directed by the analyzer type and the analyzer properties.

The Analyzer implementations themselves are provided by the underlying IResearch library. Their most common use case, filter condition matching, is therefore with ArangoSearch Views. However, Analyzers can also be used as standalone helpers via the TOKENS(...) function, allowing a user to leverage the value transformation power of the Analyzer in any context where an AQL function can be used.

A user-visible Analyzer is simply an alias for an underlying implementation type + configuration properties and a set of features. The features dictate what term matching capabilities are available and as such are only applicable in the context of ArangoSearch Views.

These three configuration attributes are given a simple name that can be used to reference the Analyzer. An analyzer definition is thus composed of the following attributes:

  • name: the analyzer name
  • type: the analyzer type
  • properties: the properties used to configure the specified type
  • features: the set of features to set on the analyzer generated fields

The valid values for type are the available Analyzer types.

The valid values for the properties depend on which type is used. For example, for the text type the properties may simply be an object with the value "locale": "en", whereas for the "delimited" type the properties may simply be the delimiter ,.

The valid values for the features depend on both the capabilities of the underlying type and the query filtering and sorting functions that the result can be used with. For example, the text type will produce frequency + norm + position, and the PHRASE(...) function requires frequency + position to be available.

Currently the following features are supported:

  • frequency: how often a term is seen, required for PHRASE(...)
  • norm: the field normalization factor
  • position: sequentially increasing term position, required for PHRASE(...); if present, the frequency feature is also required
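
The dependency between features can be checked before a definition is created. The following is a minimal sketch in plain JavaScript. The names `KNOWN_FEATURES` and `validateFeatures` are hypothetical helpers, not part of the ArangoDB API, and the only rules encoded are the ones stated in the list above.

```javascript
// Feature names listed in this document.
const KNOWN_FEATURES = ["frequency", "norm", "position"];

// Hypothetical helper: returns a list of problems found in a features array.
// An empty result means the array is consistent with the rules above.
function validateFeatures(features) {
  const problems = [];
  for (const f of features) {
    if (!KNOWN_FEATURES.includes(f)) {
      problems.push("unknown feature: " + f);
    }
  }
  // "position" is only usable together with "frequency".
  if (features.includes("position") && !features.includes("frequency")) {
    problems.push("position requires frequency");
  }
  return problems;
}
```

An empty features array passes the check, which matches the case where no term matching capabilities are requested.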

Analyzer usage

For Analyzer usage in the context of ArangoSearch Views please see the section ArangoSearch Views.

The value transformation capabilities of a given analyzer can be invoked via the TOKENS(...) function, for example to:

  • break up a string of words into individual words, while also optionally filtering out stopwords, applying case conversion and extracting word stems
  • parse CSV/TSV or other delimiter encoded string values into individual fields

The signature of the TOKENS(...) function is:

TOKENS(<value-to-parse>, <analyzer-name-to-apply>)

It currently accepts any string value and an analyzer name, and produces an array of zero or more tokens generated by the specified analyzer transformation.
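
As an illustration, a TOKENS(...) call can be packaged as an AQL query with bind parameters. This is only a sketch: `tokensQuery` is a hypothetical helper, and the analyzer name `text_en` is an assumption that must refer to an analyzer that actually exists on the server.

```javascript
// Hypothetical helper: packages a TOKENS(...) call as an AQL query plus bind
// parameters, so that neither the value nor the analyzer name is spliced
// into the query string.
function tokensQuery(value, analyzerName) {
  if (typeof value !== "string") {
    throw new TypeError("TOKENS(...) currently accepts string values only");
  }
  return {
    query: "RETURN TOKENS(@value, @analyzer)",
    bindVars: { value: value, analyzer: analyzerName }
  };
}

// "text_en" is an assumed analyzer name, used for illustration only.
const q = tokensQuery("a quick brown fox", "text_en");
// In arangosh this could be run as: db._query(q.query, q.bindVars).toArray()
```

The tokens that come back depend entirely on the analyzer's type and properties, so no expected output is shown here.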

Analyzer management

The following operations are exposed via JavaScript and REST APIs for analyzer management:

  • create: creation of a new analyzer definition
  • get: retrieve an existing analyzer definition
  • list: retrieve a listing of all available analyzer definitions
  • remove: remove an analyzer definition

JavaScript

The JavaScript API is accessible via the @arangodb/analyzers module from both server-side and client-side code, e.g.

var analyzers = require("@arangodb/analyzers");

The create operation is accessible via:

analyzers.save(<name>, <type>[, <properties>[, <features>]])

… where properties can be given as a string, an object or a null value, and features is an array of string-encoded feature names.

The get operation is accessible via:

analyzers.analyzer(<name>)

The list operation is accessible via:

analyzers.toArray()

The remove operation is accessible via:

analyzers.remove(<name> [, <force>])

Additionally, individual analyzer instances expose getter accessors for the aforementioned definition attributes:

analyzer.name()
analyzer.type()
analyzer.properties()
analyzer.features()
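
The four operations and the getter accessors can be combined into a single round trip. The sketch below uses only the calls listed above; `analyzerRoundTrip` is a hypothetical function name, and the module is passed in as a parameter so that the function can also be exercised against a stand-in object. The type "text" with "locale": "en" follows the example given earlier in this document.

```javascript
// Sketch of a create / get / list / remove round trip.
// `analyzers` is expected to be the object returned by
// require("@arangodb/analyzers").
function analyzerRoundTrip(analyzers, name) {
  // create: type "text" with a locale, plus the features PHRASE(...) needs
  analyzers.save(name, "text", { locale: "en" }, ["frequency", "norm", "position"]);

  // get: read the definition back through the getter accessors
  const a = analyzers.analyzer(name);
  const definition = {
    name: a.name(),
    type: a.type(),
    properties: a.properties(),
    features: a.features()
  };

  // list: count the definitions currently available
  const total = analyzers.toArray().length;

  // remove: drop the definition again
  analyzers.remove(name);

  return { definition: definition, total: total };
}
```

In arangosh this might be invoked as `analyzerRoundTrip(require("@arangodb/analyzers"), "demo")`, where "demo" is an arbitrary example name.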

RESTful API

The create operation is accessible via the POST method on the URL:

/_api/analyzer

The Analyzer configuration is passed in the request body as an object with the attributes:

  • name: string (required)
  • type: string (required)
  • properties: string or object or null (optional) default: null
  • features: array of strings (optional) default: empty array
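
A request for the create operation could be assembled as follows. This is a minimal sketch that only builds a description of the request and does not send it; `createAnalyzerRequest` and the shape of the returned object are hypothetical, while the method, URL, attribute names and defaults are the ones listed above.

```javascript
// Hypothetical helper: builds a description of the create request.
// Optional attributes fall back to the defaults listed above
// (properties: null, features: empty array).
function createAnalyzerRequest(name, type, properties = null, features = []) {
  if (typeof name !== "string" || typeof type !== "string") {
    throw new TypeError("name and type are required strings");
  }
  return {
    method: "POST",
    path: "/_api/analyzer",
    body: { name: name, type: type, properties: properties, features: features }
  };
}

// Example: a "text" analyzer with the locale used earlier in this document.
const createReq = createAnalyzerRequest("demo", "text", { locale: "en" }, ["frequency", "position"]);
// JSON.stringify(createReq.body) yields the payload for the request body.
```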

The get operation is accessible via the GET method on the URL:

/_api/analyzer/{analyzer-name}

A successful result will be an object with the fields:

  • name
  • type
  • properties
  • features

The list operation is accessible via the GET method on the URL:

/_api/analyzer

A successful result will be an array of objects with the fields:

  • name
  • type
  • properties
  • features

The remove operation is accessible via the DELETE method on the URL:

/_api/analyzer/{analyzer-name}[?force=true]
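
The get, list and remove URLs can be derived from the analyzer name in the same way. The helpers below are hypothetical and only build request descriptions; percent-encoding the name with `encodeURIComponent` is a precaution chosen for this sketch, not something the text above prescribes.

```javascript
const ANALYZER_API = "/_api/analyzer";

// GET /_api/analyzer/{analyzer-name}
function getAnalyzerRequest(name) {
  return { method: "GET", path: ANALYZER_API + "/" + encodeURIComponent(name) };
}

// GET /_api/analyzer
function listAnalyzersRequest() {
  return { method: "GET", path: ANALYZER_API };
}

// DELETE /_api/analyzer/{analyzer-name}[?force=true]
function removeAnalyzerRequest(name, force) {
  const suffix = force ? "?force=true" : "";
  return {
    method: "DELETE",
    path: ANALYZER_API + "/" + encodeURIComponent(name) + suffix
  };
}
```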

See also Analyzers in the HTTP book, which includes a list of available Analyzer Types.