mirror of https://gitee.com/bigwinds/arangodb
201 lines
7.7 KiB
Markdown
201 lines
7.7 KiB
Markdown
Importing data
|
|
==============
|
|
|
|
Problem
|
|
-------
|
|
|
|
I want to import data from a file into ArangoDB.
|
|
|
|
Solution
|
|
--------
|
|
|
|
ArangoDB comes with a command-line tool utility named `arangoimp`. This utility can be
|
|
used for importing JSON-encoded, CSV, and tab-separated files into ArangoDB.
|
|
|
|
`arangoimp` needs to be invoked from the command-line once for each import file.
|
|
The target collection can already exist or can be created by the import run.
|
|
|
|
### Importing JSON-encoded data
|
|
|
|
#### Input formats
|
|
|
|
There are two supported input formats for importing JSON-encoded data into ArangoDB:
|
|
|
|
- **line-by-line format**: This format expects each line in the input file to be a valid
|
|
JSON objects. No line breaks must occur within each single JSON object
|
|
|
|
- **array format**: Expects a file containing a single array of JSON objects. Whitespace is
|
|
allowed for formatting inside the JSON array and the JSON objects
|
|
|
|
Here's an example for the **line-by-line format** looks like this:
|
|
|
|
```js
|
|
{"author":"Frank Celler","time":"2011-10-26 08:42:49 +0200","sha":"c413859392a45873936cbe40797970f8eed93ff9","message":"first commit","user":"f.celler"}
|
|
{"author":"Frank Celler","time":"2011-10-26 21:32:36 +0200","sha":"10bb77b8cc839201ff59a778f0c740994083c96e","message":"initial release","user":"f.celler"}
|
|
...
|
|
```
|
|
|
|
Here's an example for the same data in **array format**:
|
|
|
|
```js
|
|
[
|
|
{
|
|
"author": "Frank Celler",
|
|
"time": "2011-10-26 08:42:49 +0200",
|
|
"sha": "c413859392a45873936cbe40797970f8eed93ff9",
|
|
"message": "first commit",
|
|
"user": "f.celler"
|
|
},
|
|
{
|
|
"author": "Frank Celler",
|
|
"time": "2011-10-26 21:32:36 +0200",
|
|
"sha": "10bb77b8cc839201ff59a778f0c740994083c96e",
|
|
"message": "initial release",
|
|
"user": "f.celler"
|
|
},
|
|
...
|
|
]
|
|
```
|
|
|
|
#### Importing JSON data in line-by-line format
|
|
|
|
An example data file in **line-by-line format** can be downloaded
|
|
[here](http://jsteemann.github.io/downloads/code/git-commits-single-line.json). The example
|
|
file contains all the commits to the ArangoDB repository as shown by `git log --reverse`.
|
|
|
|
The following commands will import the data from the file into a collection named `commits`:
|
|
|
|
```bash
|
|
# download file
|
|
wget http://jsteemann.github.io/downloads/code/git-commits-single-line.json
|
|
|
|
# actually import data
|
|
arangoimp --file git-commits-single-line.json --collection commits --create-collection true
|
|
```
|
|
|
|
Note that no file type has been specified when `arangoimp` was invoked. This is because `json`
|
|
is its default input format.
|
|
|
|
The other parameters used have the following meanings:
|
|
|
|
- `file`: input filename
|
|
- `collection`: name of the target collection
|
|
- `create-collection`: whether or not the collection should be created if it does not exist
|
|
|
|
The result of the import printed by `arangoimp` should be:
|
|
|
|
```
|
|
created: 20039
|
|
warnings/errors: 0
|
|
total: 20039
|
|
```
|
|
|
|
The collection `commits` should now contain the example commit data as present in the input file.
|
|
|
|
#### Importing JSON data in array format
|
|
|
|
An example input file for the **array format** can be found [here](http://jsteemann.github.io/downloads/code/git-commits-array.json).
|
|
|
|
The command for importing JSON data in **array format** is similar to what we've done before:
|
|
|
|
```bash
|
|
# download file
|
|
wget http://jsteemann.github.io/downloads/code/git-commits-array.json
|
|
|
|
# actually import data
|
|
arangoimp --file git-commits-array.json --collection commits --create-collection true
|
|
```
|
|
|
|
Though the import command is the same (except the filename), there is a notable difference between the
|
|
two JSON formats: for the **array format**, `arangoimp` will read and parse the JSON in its entirety
|
|
before it sends any data to the ArangoDB server. That means the whole input file must fit into
|
|
`arangoimp`'s buffer. By default, `arangoimp` will allocate a 16 MiB internal buffer, and input files bigger
|
|
than that will be rejected with the following message:
|
|
|
|
```
|
|
import file is too big. please increase the value of --batch-size (currently 16777216).
|
|
```
|
|
|
|
So for JSON input files in **array format** it might be necessary to increase the value of `--batch-size`
|
|
in order to have the file imported. Alternatively, the input file can be converted to **line-by-line format**
|
|
manually.
|
|
|
|
|
|
### Importing CSV data
|
|
|
|
Data can also be imported from a CSV file. An example file can be found [here](http://jsteemann.github.io/downloads/code/git-commits.csv).
|
|
|
|
The `--type` parameter for the import command must now be set to `csv`:
|
|
|
|
```bash
|
|
# download file
|
|
wget http://jsteemann.github.io/downloads/code/git-commits.csv
|
|
|
|
# actually import data
|
|
arangoimp --file git-commits.csv --type csv --collection commits --create-collection true
|
|
```
|
|
|
|
For the CSV import, the first line in the input file has a special meaning: every value listed in the
|
|
first line will be treated as an attribute name for the values in all following lines. All following
|
|
lines should also have the same number of "columns".
|
|
|
|
"columns" inside the CSV input file can be left empty though. If a "column" is left empty in a line,
|
|
then this value will be omitted for the import so the respective attribute will not be set in the imported
|
|
document. Note that values from the input file that are enclosed in double quotes will always be imported as
|
|
strings. To import numeric values, boolean values or the `null` value, don't enclose these values in quotes in
|
|
the input file. Note that leading zeros in numeric values will be removed. Importing numbers with leading
|
|
zeros will only work when putting the numbers into strings.
|
|
|
|
Here is an example CSV file:
|
|
|
|
```plain
|
|
"author","time","sha","message"
|
|
"Frank Celler","2011-10-26 08:42:49 +0200","c413859392a45873936cbe40797970f8eed93ff9","first commit"
|
|
"Frank Celler","2011-10-26 21:32:36 +0200","10bb77b8cc839201ff59a778f0c740994083c96e","initial release"
|
|
...
|
|
```
|
|
|
|
`arangoimp` supports Windows (CRLF) and Unix (LF) line breaks. Line breaks might also occur inside values
|
|
that are enclosed with the quote character.
|
|
|
|
The default separator for CSV files is the comma. It can be changed using the `--separator` parameter
|
|
when invoking `arangoimp`. The quote character defaults to the double quote (**"**). To use a literal double
|
|
quote inside a "column" in the import data, use two double quotes. To change the quote character, use the
|
|
`--quote` parameter. To use a backslash for escaping quote characters, please set the option `--backslash-escape`
|
|
to `true`.
|
|
|
|
|
|
### Changing the database and server endpoint
|
|
|
|
By default, `arangoimp` will connect to the default database on `127.0.0.1:8529` with a user named
|
|
`root`. To change this, use the following parameters:
|
|
|
|
- `server.database`: name of the database to use when importing (default: `_system`)
|
|
- `server.endpoint`: address of the ArangoDB server (default: `tcp://127.0.0.1:8529`)
|
|
|
|
|
|
### Using authentication
|
|
|
|
`arangoimp` will by default send an username `root` and an empty password to the ArangoDB
|
|
server. This is ArangoDB's default configuration, and it should be changed. To make `arangoimp`
|
|
use a different username or password, the following command-line arguments can be used:
|
|
|
|
- `server.username`: username, used if authentication is enabled on server
|
|
- `server.password`: password for user, used if authentication is enabled on server
|
|
|
|
The password argument can also be omitted in order to avoid having it saved in the shell's
|
|
command-line history. When specifying a username but omitting the password parameter,
|
|
`arangoimp` will prompt for a password.
|
|
|
|
|
|
### Additional parameters
|
|
|
|
By default, `arangoimp` will import data into the specified collection but will not touch
|
|
existing data. Often it is convenient to first remove all data from a collection and then run
|
|
the import. `arangoimp` supports this with the optional `--overwrite` flag. When setting it to
|
|
`true`, all documents in the collection will be removed prior to the import.
|
|
|
|
**Author:** [Jan Steemann](https://github.com/jsteemann)
|
|
|
|
**Tags**: #arangoimp #import
|