Thomas Schmidts 4bf4a9bb26 Changed README.html to index.html in the documentation		2015-06-09 09:40:12 +02:00

!CHAPTER Arangoimp

This manual describes the ArangoDB importer _arangoimp_, which can be used for
bulk imports.

The most convenient method to import a lot of data into ArangoDB is to use the
*arangoimp* command-line tool. It allows you to import data records from a file
into an existing database collection.

It is possible to import [document keys](../Glossary/index.html#document_key) with the documents using the *_key*
attribute. When importing into an [edge collection](../Glossary/index.html#edge_collection), it is mandatory that all
imported documents have the *_from* and *_to* attributes, and that they contain
valid references.

Let's assume for the following examples you want to import user data into an 
existing collection named "users" on the server.


!SECTION Importing Data into an ArangoDB Database

!SUBSECTION Importing JSON-encoded Data

Let's further assume the data to import is encoded in JSON. We'll use these
example user records:

```js
{ "name" : { "first" : "John", "last" : "Connor" }, "active" : true, "age" : 25, "likes" : [ "swimming"] }
{ "name" : { "first" : "Jim", "last" : "O'Brady" }, "age" : 19, "likes" : [ "hiking", "singing" ] }
{ "name" : { "first" : "Lisa", "last" : "Jones" }, "dob" : "1981-04-09", "likes" : [ "running" ] }
```

To import these records, all you need to do is to put them into a file (with one
line for each record to import) and run the following command:

    > arangoimp --file "data.json" --type json --collection "users"

This will transfer the data to the server, import the records, and print a
status summary. To show the intermediate progress during the import process, the
option *--progress* can be added. This option will show the percentage of the
input file that has been sent to the server. This will only be useful for big
import files. 

    > arangoimp --file "data.json" --type json --collection users --progress true

It is also possible to use the output of another command as an input for arangoimp.
For example, the following shell command can be used to pipe data from the `cat`
process to arangoimp:

    > cat data.json | arangoimp --file - --type json --collection users

Note that you have to use `--file -` if you want to use another command as input 
for arangoimp. No progress can be reported for such imports as the size of the input
will be unknown to arangoimp.
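
When the records come from a program rather than from an existing file, they first have to be serialized with one document per line. The following is a minimal Python sketch of that step. The helper name `to_json_lines` is an assumption made for illustration and is not part of arangoimp.

```python
import json


def to_json_lines(records):
    """Serialize each record as one compact JSON document per line."""
    # json.dumps emits no raw line breaks unless an indent is requested,
    # so every record is guaranteed to stay on a single line.
    return "".join(json.dumps(record) + "\n" for record in records)


users = [
    {"name": {"first": "John", "last": "Connor"}, "active": True, "age": 25},
    {"name": {"first": "Jim", "last": "O'Brady"}, "age": 19},
]
payload = to_json_lines(users)
```

The resulting text can be written to a file such as `data.json`, or printed to standard output and piped into arangoimp with `--file -`.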

By default, the endpoint *tcp://127.0.0.1:8529* will be used.  If you want to
specify a different endpoint, you can use the *--server.endpoint* option. You
probably want to specify a database user and password as well.  You can do so by
using the options *--server.username* and *--server.password*.  If you do not
specify a password, you will be prompted for one.

    > arangoimp --server.endpoint tcp://127.0.0.1:8529 --server.username root --file "data.json" --type json --collection "users"

Note that the collection (*users* in this case) must already exist or the import
will fail. If you want to create a new collection with the import data, you need
to specify the *--create-collection* option. Note that the *--create-collection*
flag can only create a document collection, not an edge collection.

    > arangoimp --file "data.json" --type json --collection "users" --create-collection true

When importing data into an existing collection it is often convenient to first
remove all data from the collection and then start the import. This can be achieved
by passing the *--overwrite* parameter to _arangoimp_. If it is set to *true*,
any existing data in the collection will be removed prior to the import. Note
that any existing index definitions for the collection will be preserved even if 
*--overwrite* is set to true.
    
    > arangoimp --file "data.json" --type json --collection "users" --overwrite true

As the import file already contains the data in JSON format, attribute names and
data types are fully preserved. As the example data shows, records do not need
to share the same attribute names or types. Records can be heterogeneous.

Please note that by default, _arangoimp_ will import data into the specified 
collection in the default database (*_system*). To specify a different database, 
use the *--server.database* option when invoking _arangoimp_. 
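
If arangoimp is invoked from a script, the options discussed above can be assembled programmatically. The sketch below only builds the argument list; it uses no flags other than those named in this section. The function name `build_arangoimp_args` and the database name `mydb` are hypothetical.

```python
def build_arangoimp_args(file, collection, file_type="json", options=None):
    """Return an argument list for arangoimp without running it."""
    args = ["arangoimp", "--file", file, "--type", file_type,
            "--collection", collection]
    for name, value in (options or {}).items():
        # The examples in this manual pass booleans as the words true/false.
        if isinstance(value, bool):
            value = "true" if value else "false"
        args += ["--" + name, str(value)]
    return args


args = build_arangoimp_args(
    "data.json", "users",
    options={"server.database": "mydb", "overwrite": True},
)
```

The list can then be handed to `subprocess.run(args, check=True)`, which avoids shell quoting issues with file or collection names.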


!SUBSECTION JSON input file formats

**Note**: *arangoimp* supports two formats when importing JSON data from 
a file. The first format (also used above) requires the input file to contain one 
complete JSON document in each line, e.g.

```js
{ "_key": "one", "value": 1 }
{ "_key": "two", "value": 2 }
{ "_key": "foo", "value": "bar" }
...
```

The above format can be imported sequentially by _arangoimp_. It will read data
from the input file in chunks and send it in batches to the server. Each batch
will be about as big as specified in the command-line parameter *--batch-size*.

An alternative is to put one big JSON document into the input file like this:

```js
[
  { "_key": "one", "value": 1 },
  { "_key": "two", "value": 2 },
  { "_key": "foo", "value": "bar" },
  ...
]
```

This format allows line breaks within the input file as required. The downside 
is that the whole input file will need to be read by _arangoimp_ before it can
send the first batch. This might be a problem if the input file is big. By
default, _arangoimp_ will allow importing such files up to a size of about 16 MB.

If you want to allow your _arangoimp_ instance to use more memory, you may want
to increase the maximum file size by specifying the command-line option
*--batch-size*. For example, to set the batch size to 32 MB, use the following
command:

    > arangoimp --file "data.json" --type json --collection "users" --batch-size 33554432

Please also note that you may need to increase the value of *--batch-size* if
a single document inside the input file is bigger than the value of *--batch-size*.
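
One way to avoid the size limit for array-style files is to convert them to the one-document-per-line format once, and to check how large the biggest single document is. The following Python sketch illustrates this under the assumption that the array file still fits into memory for the one-time conversion. Both helper names are hypothetical.

```python
import json


def array_to_lines(array_text):
    """Turn a JSON array of documents into a list of one-line JSON documents."""
    documents = json.loads(array_text)
    if not isinstance(documents, list):
        raise ValueError("expected a top-level JSON array")
    return [json.dumps(document) for document in documents]


def largest_document_bytes(lines):
    """Size in bytes of the biggest single document (0 for no documents)."""
    return max((len(line.encode("utf-8")) for line in lines), default=0)


lines = array_to_lines('[{"_key": "one", "value": 1},\n {"_key": "two", "value": 2}]')
```

If `largest_document_bytes(lines)` exceeds the batch size in use, a larger *--batch-size* is needed, as described above.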


!SUBSECTION Importing CSV Data

_arangoimp_ can also import data from CSV files. This comes in handy when the
data at hand is already in CSV format and you don't want to spend time
converting it to JSON for the import.

To import data from a CSV file, make sure your file contains the attribute names
in the first row. All the following lines in the file will be interpreted as
data records and will be imported.

The CSV import requires the data to have a homogeneous structure. All records
must have exactly as many columns as there are headers.

The cell values can have different data types though. If a cell does not have
any value, it can be left empty in the file. Such values will not be imported,
so the corresponding attributes will be absent from the created document. Values
enclosed in quotes will be imported as strings, so to import numeric values,
boolean values or the null value, don't enclose the value in quotes in your file.

We'll be using the following data for the CSV import:

```
"first","last","age","active","dob"
"John","Connor",25,true,
"Jim","O'Brady",19,,
"Lisa","Jones",,,"1981-04-09"
Hans,dos Santos,0123,,
Wayne,Brewer,,false,
```

The command line to execute the import is:

    > arangoimp --file "data.csv" --type csv --collection "users"

The above data will be imported into 5 documents which will look as follows:

```js
{ "first" : "John", "last" : "Connor", "active" : true, "age" : 25 } 
{ "first" : "Jim", "last" : "O'Brady", "age" : 19 }
{ "first" : "Lisa", "last" : "Jones", "dob" : "1981-04-09" } 
{ "first" : "Hans", "last" : "dos Santos", "age" : 123 } 
{ "first" : "Wayne", "last" : "Brewer", "active" : false } 
```

As can be seen, values left completely empty in the input file will be treated
as absent. Numeric values not enclosed in quotes will be treated as numbers.
Note that leading zeros in numeric values will be removed. To import numbers
with leading zeros, please use strings.
The literals *true* and *false* will be treated as booleans if they are not
enclosed in quotes. Other values not enclosed in quotes will be treated as
strings.
Any values enclosed in quotes will be treated as strings, too.
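
The typing rules above can be modelled in a few lines of Python. This is only a sketch of the behavior as described in this section, not arangoimp's actual parser: it handles single-line records only, and the exact number syntax as well as the mapping of an unquoted `null` to a null value are assumptions. All names are hypothetical.

```python
import re

MISSING = object()  # marker for cells that were left completely empty
INTEGER = re.compile(r"-?[0-9]+")
DECIMAL = re.compile(r"-?[0-9]+(\.[0-9]+)?([eE][+-]?[0-9]+)?")


def split_csv_line(line, separator=",", quote='"'):
    """Split one line into (text, was_quoted) pairs, unescaping doubled quotes."""
    fields, text, quoted, in_quotes = [], [], False, False
    i = 0
    while i < len(line):
        ch = line[i]
        if in_quotes:
            if ch == quote and line[i + 1:i + 2] == quote:
                text.append(quote)  # a doubled quote is a literal quote
                i += 1
            elif ch == quote:
                in_quotes = False
            else:
                text.append(ch)
        elif ch == quote and not text:
            in_quotes = quoted = True
        elif ch == separator:
            fields.append(("".join(text), quoted))
            text, quoted = [], False
        else:
            text.append(ch)
        i += 1
    fields.append(("".join(text), quoted))
    return fields


def interpret_cell(text, quoted):
    """Apply the typing rules described above to a single cell."""
    if quoted:
        return text
    if text == "":
        return MISSING
    if text in ("true", "false"):
        return text == "true"
    if text == "null":
        return None
    if INTEGER.fullmatch(text):
        return int(text)  # leading zeros are dropped here
    if DECIMAL.fullmatch(text):
        return float(text)
    return text


def row_to_document(header_line, data_line):
    """Combine a header line and a data line into one document."""
    names = [text for text, _ in split_csv_line(header_line)]
    cells = split_csv_line(data_line)
    if len(cells) != len(names):
        raise ValueError("row does not have one value per header")
    document = {}
    for name, (text, quoted) in zip(names, cells):
        value = interpret_cell(text, quoted)
        if value is not MISSING:
            document[name] = value
    return document
```

Applied to the example file above, `row_to_document` yields the same attributes as the five documents listed, including `"age": 123` for the row containing `0123`.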

String values containing the quote character or the separator must be enclosed
with quote characters. Within a string, the quote character itself must be
escaped with another quote character (or with a backslash if the *--backslash-escape*
option is used).

Note that the quote and separator characters can be adjusted via the
*--quote* and *--separator* arguments when invoking _arangoimp_. The quote
character defaults to the double quote (*"*). To use a literal quote in a
string, double the quote character. To escape quote characters with a
backslash instead, set the option *--backslash-escape* to *true*.

The importer supports Windows (CRLF) and Unix (LF) line breaks. Line breaks might
also occur inside values that are enclosed with the quote character.

Here's an example of using literal quotes and newlines inside values:

```
"name","password"
"Foo","r4ndom""123!"
"Bar","wow!
this is a
multine password!"
"Bartholomew ""Bart"" Simpson","Milhouse"
```

Extra whitespace at the end of each line will be ignored. Whitespace at the 
start of lines or between field values will not be ignored, so please make sure 
that there is no extra whitespace in front of values or between them.
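
A file that follows these quoting rules can be produced with Python's standard `csv` module, which doubles embedded quote characters by default. The sketch below quotes every value, so it is only suitable when all values should be imported as strings. The helper name `to_quoted_csv` is hypothetical.

```python
import csv
import io


def to_quoted_csv(header, rows):
    """Render rows as CSV text with every value enclosed in double quotes."""
    buffer = io.StringIO()
    # QUOTE_ALL quotes each field; embedded quotes are doubled by default,
    # and line breaks inside values are kept within the quotes.
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


text = to_quoted_csv(
    ["name", "password"],
    [["Foo", 'r4ndom"123!'], ['Bartholomew "Bart" Simpson', "Milhouse"]],
)
```

Because no whitespace is written between fields, the output also satisfies the whitespace requirement mentioned above.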


!SUBSECTION Importing TSV Data

You may also import tab-separated values (TSV) from a file. This format is very
simple: every line in the file represents a data record. There is no quoting or
escaping. That also means that the separator character (which defaults to the
tab character) must not be used anywhere in the actual data.

As with CSV, the first line in the TSV file must contain the attribute names,
and all lines must have an identical number of values.

If a different separator character or string should be used, it can be specified
with the *--separator* argument. 

An example command line to execute the TSV import is:

    > arangoimp --file "data.tsv" --type tsv --collection "users" 
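
Since TSV has no quoting or escaping, it can help to verify the data before writing the file. The following Python sketch rejects values that contain the separator or a line break and rows whose length differs from the header. The helper name `to_tsv` is hypothetical, and the sketch treats all values as plain text.

```python
def to_tsv(header, rows, separator="\t"):
    """Render rows as TSV text, refusing values the format cannot represent."""
    lines = []
    for row in [list(header)] + [list(row) for row in rows]:
        if len(row) != len(header):
            raise ValueError("every line needs as many values as the header")
        cells = [str(cell) for cell in row]
        for cell in cells:
            # There is no escaping, so these characters would corrupt the file.
            if separator in cell or "\n" in cell or "\r" in cell:
                raise ValueError("value contains separator or line break: %r" % cell)
        lines.append(separator.join(cells))
    return "\n".join(lines) + "\n"


tsv_text = to_tsv(["first", "last"], [["John", "Connor"], ["Jim", "O'Brady"]])
```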


!SUBSECTION Importing into an Edge Collection

_arangoimp_ can also be used to import data into an existing edge collection.
For each edge to import, the import data must contain at least the *_from* and
*_to* attributes. These indicate which two documents the edge connects.
Both attributes must be set for all records and must point to valid document
ids in existing collections.

*Example*

```js
{ "_from" : "users/1234", "_to" : "users/4321", "desc" : "1234 is connected to 4321" }
```

**Note**: The edge collection must already exist when the import is started. Using 
the *--create-collection* flag will not work because arangoimp will always try to 
create a regular document collection if the target collection does not exist.
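
Edge records can be checked for the required attributes before the import is started. The Python sketch below performs a purely syntactic check, assuming document ids have the `collection/key` form used in the example above. It cannot tell whether the referenced collections or documents actually exist. The helper name `check_edge` is hypothetical.

```python
def check_edge(record):
    """Return a list of problems found in the _from and _to attributes."""
    problems = []
    for attribute in ("_from", "_to"):
        value = record.get(attribute)
        if not isinstance(value, str):
            problems.append(attribute + " is missing or not a string")
            continue
        # Assumed id layout: collection name, a slash, then the document key.
        collection, slash, key = value.partition("/")
        if not (slash and collection and key):
            problems.append(attribute + " is not of the form collection/key")
    return problems


edge = {"_from": "users/1234", "_to": "users/4321", "desc": "1234 is connected to 4321"}
```

An empty list means that both attributes are present and well-formed.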


!SUBSECTION Updating existing documents

By default, arangoimp will try to insert all documents from the import file into the
specified collection. In case the import file contains documents that are already present
in the target collection (matching is done via the *_key* attributes), then a default
arangoimp run will not import these documents and complain about unique key constraint
violations.

However, arangoimp can be used to update or replace existing documents in case they
already exist in the target collection. It provides the command-line option *--on-duplicate*
to control the behavior in case a document is already present in the database.

The default value of *--on-duplicate* is *error*. This means that when the import file
contains a document that is present in the target collection already, then trying to
re-insert a document with the same *_key* value is considered an error, and the document in
the database will not be modified.

Other possible values for *--on-duplicate* are:

- *update*: each document present in the import file that is also present in the target
  collection already will be updated by arangoimp. *update* will perform a partial update
  of the existing document, modifying only the attributes that are present in the import 
  file and leaving all other attributes untouched.
  
  The values of system attributes *_id*, *_key*, *_rev*, *_from* and *_to* cannot be
  updated or replaced in existing documents. 

- *replace*: each document present in the import file that is also present in the target
  collection already will be replaced by arangoimp. *replace* will replace the existing
  document entirely, resulting in a document with only the attributes specified in the import
  file. 
  
  The values of system attributes *_id*, *_key*, *_rev*, *_from* and *_to* cannot be
  updated or replaced in existing documents. 

- *ignore*: each document present in the import file that is also present in the target
  collection already will be ignored and not modified in the target collection. 

When *--on-duplicate* is set to either *update* or *replace*, arangoimp will return the
number of documents updated/replaced in the *updated* return value. When set to another
value, the value of *updated* will always be zero. When *--on-duplicate* is set to *ignore*,
arangoimp will return the number of ignored documents in the *ignored* return value.
When set to another value, *ignored* will always be zero.

It is possible to perform a combination of inserts and updates/replaces with a single 
arangoimp run. When *--on-duplicate* is set to *update* or *replace*, all documents present
in the import file will be inserted into the target collection provided they are valid
and do not already exist with the specified *_key*. Documents that are already present
in the target collection (identified by *_key* attribute) will instead be updated/replaced.
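
The difference between the *--on-duplicate* values can be illustrated with a small in-memory model. This Python sketch only mirrors the behavior described above and is not how the server implements it. It assumes every document carries a *_key*, models the collection as a dictionary keyed by *_key*, and performs a shallow update, so the treatment of nested objects is not covered. The function name `apply_import` is hypothetical.

```python
def apply_import(collection, documents, on_duplicate="error"):
    """Apply documents to a dict keyed by _key and return the counters."""
    if on_duplicate not in ("error", "update", "replace", "ignore"):
        raise ValueError("unknown on_duplicate value: " + on_duplicate)
    result = {"created": 0, "errors": 0, "updated": 0, "ignored": 0}
    for document in documents:
        key = document["_key"]
        if key not in collection:
            collection[key] = dict(document)    # plain insert
            result["created"] += 1
        elif on_duplicate == "update":
            collection[key].update(document)    # keep attributes not in the file
            result["updated"] += 1
        elif on_duplicate == "replace":
            collection[key] = dict(document)    # drop attributes not in the file
            result["updated"] += 1
        elif on_duplicate == "ignore":
            result["ignored"] += 1
        else:
            result["errors"] += 1               # existing document is untouched
    return result


stored = {"one": {"_key": "one", "value": 1, "extra": "x"}}
incoming = [{"_key": "one", "value": 2}, {"_key": "two", "value": 3}]
summary = apply_import(stored, incoming, on_duplicate="update")
```

With *update*, the document `one` keeps its `extra` attribute. With *replace*, it would end up with only `_key` and `value`.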


!SUBSECTION Arangoimp result output

An _arangoimp_ import run will print out the final results on the command line.
It will show the

* number of documents created (*created*)
* number of documents updated/replaced (*updated/replaced*, only non-zero if 
  *--on-duplicate* was set to *update* or *replace*, see above) 
* number of warnings or errors that occurred on the server side (*warnings/errors*)
* number of ignored documents (*ignored*, only non-zero if *--on-duplicate* was set to *ignore*)

*Example*

```
created:          2
warnings/errors:  0
updated/replaced: 0
ignored:          0
```

For CSV and TSV imports, the total number of input file lines read will also be printed
(*lines read*).

_arangoimp_ will also print out details about warnings and errors that happened on the 
server-side (if any).
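
A script that wraps arangoimp may want to read these counters. The Python sketch below assumes the summary lines look exactly like the example above (`name: number`); the real output may contain additional lines, which the sketch skips if they do not match. The helper name `parse_summary` is hypothetical.

```python
def parse_summary(output):
    """Collect 'name: number' lines from the printed summary into a dict."""
    counters = {}
    for line in output.splitlines():
        name, colon, value = line.partition(":")
        value = value.strip()
        # Only keep lines whose value is a plain non-negative integer.
        if colon and value.isdecimal():
            counters[name.strip()] = int(value)
    return counters


example = (
    "created:          2\n"
    "warnings/errors:  0\n"
    "updated/replaced: 0\n"
    "ignored:          0\n"
)
counters = parse_summary(example)
```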


!SUBSECTION Attribute Naming and Special Attributes

Attributes whose names start with an underscore are treated in a special way by 
ArangoDB:

- *_key*: this optional attribute contains the document's key. If specified, the
  value must be formally valid (e.g. must be a string and conform to the naming
  conventions). Additionally, the key value must be unique within the collection
  the import is run for.
- *_from*: when importing into an edge collection, this attribute contains the id
  of one of the documents connected by the edge. The value of *_from* must be a
  syntactically valid document id and the referred collection must exist.
- *_to*: when importing into an edge collection, this attribute contains the id
  of the other document connected by the edge. The value of *_to* must be a
  syntactically valid document id and the referred collection must exist.
- *_rev*: this attribute contains the revision number of a document. However, the
  revision numbers are managed by ArangoDB and cannot be specified on import. Thus 
  any value in this attribute is ignored on import.

If you import values into *_key*, you should make sure they are valid and unique.

When importing data into an edge collection, you should make sure that all import
documents contain *_from* and *_to* and that their values point to existing documents.