---
layout: default
description: Repair Jobs
---
Repair Jobs
===========
distributeShardsLike
--------------------
Before versions 3.2.12 and 3.3.4, a bug in collection creation could violate
the property that a collection's shards are distributed across the DBServers
exactly like those of the prototype collection named in its
`distributeShardsLike` setting.

**Please read everything carefully before using this API!**

There is a job that can restore this property safely. However, while the
job is running,

- the `replicationFactor` *must not be changed* for any affected collection or
  prototype collection (i.e. a collection set in `distributeShardsLike`, including
  [SmartGraphs](../graphs-smart-graphs.html)),
- shards of those prototype collections *should not be moved*,
- and shutdown of DBServers *should be avoided*.

Also, only one repair job should run at any given time.
Failure to meet these requirements will in most cases cause the job to abort,
but it can still be restarted safely. However, changing the `replicationFactor`
during repairs may leave the affected collection in a state that cannot be
repaired without manual intervention!

Shutting down the coordinator which executes the job will abort it, but it can
safely be restarted on another coordinator. However, there may still be a shard
move ongoing even after the job stopped. If the job is started again before the
move is finished, repairing the affected collection will fail, but the repair
can be restarted safely.
If there is any affected collection whose `replicationFactor` is equal to
the total number of DBServers, the repairs might abort. In this case, it is
necessary to reduce the `replicationFactor` by one (or add a DBServer). The
job will not do that automatically.
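
The check described above is simple enough to run before starting a repair. The following is a minimal sketch, not part of ArangoDB: the function name and its inputs (a mapping from collection name to `replicationFactor`, and the DBServer count) are hypothetical, and how you obtain those values is left open.

```python
def collections_at_risk(replication_factors, num_db_servers):
    """Return the collections that leave no spare DBServer for shard moves.

    replication_factors: dict mapping "<db>/<collection>" to its replicationFactor
    num_db_servers: total number of DBServers in the cluster
    Both inputs are assumed to be gathered by the caller.
    """
    return sorted(
        name
        for name, factor in replication_factors.items()
        # Equal to the number of DBServers is the case named in the text;
        # ">=" also catches an unexpectedly larger value.
        if factor >= num_db_servers
    )
```

For any collection this returns, reduce the `replicationFactor` by one or add a DBServer before starting the repairs.
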
Generally, the job will abort if any of its assumptions fail, at the start
or during the repairs. It can be started again and will resume from the
current state.
### Testing with `GET /_admin/repair/distributeShardsLike`
Using `GET` will **not** trigger any repairs, but only calculate and return
the operations necessary to repair the cluster. This way, you can also
check if there is something to repair.
```
$ wget -qSO - http://localhost:8529/_admin/repair/distributeShardsLike | jq .
HTTP/1.1 200 OK
X-Content-Type-Options: nosniff
Server: ArangoDB
Connection: Keep-Alive
Content-Type: application/json; charset=utf-8
Content-Length: 53
{
"error": false,
"code": 200,
"message": "Nothing to do."
}
```
In the example above, all collections with `distributeShardsLike` have their
shards distributed correctly. The response if something is broken looks like
this:
```json
{
"error": false,
"code": 200,
"collections": {
"_system/someCollection": {
"PlannedOperations": [
{
"BeginRepairsOperation": {
"database": "_system",
"collection": "someCollection",
"distributeShardsLike": "aPrototypeCollection",
"renameDistributeShardsLike": true,
"replicationFactor": 4
}
},
{
"MoveShardOperation": {
"database": "_system",
"collection": "someCollection",
"shard": "s2000109",
"from": "PRMR-6b8c84be-1e80-4085-9065-177c6e31a702",
"to": "PRMR-d3e62c96-c3f7-4766-bac6-f3bf8026f59a",
"isLeader": false
}
},
{
"MoveShardOperation": {
"database": "_system",
"collection": "someCollection",
"shard": "s2000109",
"from": "PRMR-ee3d7af6-1fbf-4ab7-bfd1-56d0a1c1c9b9",
"to": "PRMR-6b8c84be-1e80-4085-9065-177c6e31a702",
"isLeader": true
}
},
{
"FixServerOrderOperation": {
"database": "_system",
"collection": "someCollection",
"distributeShardsLike": "aPrototypeCollection",
"shard": "s2000109",
"distributeShardsLikeShard": "s2000092",
"leader": "PRMR-6b8c84be-1e80-4085-9065-177c6e31a702",
"followers": [
"PRMR-99c2ac17-f417-4710-82aa-8350417dd089",
"PRMR-3b0b85de-882b-4eb2-bbf2-ef1018bdc81e",
"PRMR-d3e62c96-c3f7-4766-bac6-f3bf8026f59a"
],
"distributeShardsLikeFollowers": [
"PRMR-d3e62c96-c3f7-4766-bac6-f3bf8026f59a",
"PRMR-99c2ac17-f417-4710-82aa-8350417dd089",
"PRMR-3b0b85de-882b-4eb2-bbf2-ef1018bdc81e"
]
}
},
{
"FinishRepairsOperation": {
"database": "_system",
"collection": "someCollection",
"distributeShardsLike": "aPrototypeCollection",
"shards": [
{
"shard": "s2000109",
"protoShard": "s2000092",
"dbServers": [
"PRMR-6b8c84be-1e80-4085-9065-177c6e31a702",
"PRMR-d3e62c96-c3f7-4766-bac6-f3bf8026f59a",
"PRMR-99c2ac17-f417-4710-82aa-8350417dd089",
"PRMR-3b0b85de-882b-4eb2-bbf2-ef1018bdc81e"
]
},
{
"shard": "s2000110",
"protoShard": "s2000093",
"dbServers": [
"PRMR-d3e62c96-c3f7-4766-bac6-f3bf8026f59a",
"PRMR-ee3d7af6-1fbf-4ab7-bfd1-56d0a1c1c9b9",
"PRMR-6b8c84be-1e80-4085-9065-177c6e31a702",
"PRMR-99c2ac17-f417-4710-82aa-8350417dd089"
]
},
[...]
]
}
}
],
"error": false
}
}
}
```
If something is to be repaired, the response will have the property
`collections` with an entry `<db>/<collection>` for each collection which
has to be repaired. Each collection also has a separate `error` property
which will be `true` if and only if an error occurred for this collection
(and `false` otherwise). If `error` is `true`, the properties `errorNum` and
`errorMessage` will also be set, and in some cases also `errorDetails`
with additional information on how to handle a specific error.
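
A response of this shape can be condensed into a per-collection overview. The sketch below assumes the response has already been parsed into a Python dictionary (for example with `json.loads`) and relies only on the properties described above. The function name `summarize_repair_plan` is hypothetical.

```python
from collections import Counter


def summarize_repair_plan(response):
    """Summarize a parsed distributeShardsLike response per collection.

    Returns an empty dict when there is no `collections` property,
    i.e. when there is nothing to repair.
    """
    summary = {}
    for name, entry in response.get("collections", {}).items():
        if entry.get("error"):
            # Failed collections carry errorNum/errorMessage and
            # sometimes errorDetails.
            summary[name] = {
                "error": True,
                "errorNum": entry.get("errorNum"),
                "errorMessage": entry.get("errorMessage"),
                "errorDetails": entry.get("errorDetails"),
            }
            continue
        # Each planned operation is an object with a single key that
        # names the operation type, e.g. "MoveShardOperation".
        operation_types = Counter(
            operation_type
            for operation in entry.get("PlannedOperations", [])
            for operation_type in operation
        )
        summary[name] = {"error": False, "operations": dict(operation_types)}
    return summary
```

Applied to the example response above, the entry for `_system/someCollection` would count two `MoveShardOperation` entries and one each of the other operation types.
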
### Repairing with `POST /_admin/repair/distributeShardsLike`
As this job may have to move a lot of data around, it can take a while
depending on the size of the affected collections. It should therefore *not
be called synchronously*, but only via
[Async Results](../http/async-results-management.html): i.e., set the
header `x-arango-async: store` to put the job into the background and get
its results later. Otherwise the request will most probably result in a
timeout and the response will be lost! The job will still continue unless
the coordinator is stopped, but there is no way to find out whether it is
still running, or to get success or error information afterwards.

Starting the job in the background can be done like so:
```
$ wget --method=POST --header='x-arango-async: store' -qSO - http://localhost:8529/_admin/repair/distributeShardsLike
HTTP/1.1 202 Accepted
X-Content-Type-Options: nosniff
X-Arango-Async-Id: 152223973119118
Server: ArangoDB
Connection: Keep-Alive
Content-Type: text/plain; charset=utf-8
Content-Length: 0
```
This line is of notable importance:
```
X-Arango-Async-Id: 152223973119118
```
as it contains the job id which can be used to fetch the state and results
of the job later. `GET`ting `/_api/job/pending` and `/_api/job/done` will list
job ids of jobs that are pending or done, respectively.
The same asynchronous approach also works with the `GET` method for testing.
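
The same request can be issued from a script. The following is a minimal sketch using only the Python standard library. The helper names are hypothetical, the URL path follows the `wget` example above, and it assumes an unauthenticated coordinator as in that example.

```python
import urllib.request


def build_async_repair_request(base_url):
    """Build the POST request that starts the repair job in the background."""
    url = base_url.rstrip("/") + "/_admin/repair/distributeShardsLike"
    return urllib.request.Request(
        url,
        method="POST",
        # "store" keeps the result on the server until it is fetched.
        headers={"x-arango-async": "store"},
    )


def start_repair_job(base_url):
    """Send the request and return the job id from X-Arango-Async-Id."""
    with urllib.request.urlopen(build_async_repair_request(base_url)) as response:
        job_id = response.headers.get("X-Arango-Async-Id")
        if response.status != 202 or not job_id:
            raise RuntimeError(f"job was not accepted (HTTP {response.status})")
        return job_id
```

Calling `start_repair_job("http://localhost:8529")` would return the job id as a string. Store it somewhere durable, since it is needed to fetch the result later.
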
The job API must be used to fetch the state and results. It will return
a `204` while the job is running. The actual response will be returned
only once. After that, the job is deleted and the API will return a `404`.
It is therefore recommended to write the response directly to a file for
later inspection. Fetching the result is done by calling `/_api/job/<job-id>`
via `PUT`:
```
$ wget --method=PUT -qSO - http://localhost:8529/_api/job/152223973119118 | jq .
HTTP/1.1 200 OK
X-Content-Type-Options: nosniff
X-Arango-Async-Id: 152223973119118
Server: ArangoDB
Connection: Keep-Alive
Content-Type: application/json; charset=utf-8
Content-Length: 53
{
"error": false,
"code": 200,
"message": "Nothing to do."
}
```
The final response will look like the response of the `GET` call.
If an error occurred, the response should contain details on how to proceed.
If in doubt, ask us on Slack:
[arangodb.com/community/](https://arangodb.com/community/){:target="_blank"}
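
The polling procedure described above (a `204` while the job is running, the result exactly once, a `404` afterwards) can be sketched as follows. This is an illustration using only the Python standard library, not an official client. The function names, the polling interval and the output path are assumptions.

```python
import time
import urllib.error
import urllib.request


def fetch_job(base_url, job_id):
    """PUT /_api/job/<job_id> and return (status code, raw body)."""
    url = f"{base_url.rstrip('/')}/_api/job/{job_id}"
    request = urllib.request.Request(url, method="PUT")
    try:
        with urllib.request.urlopen(request) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as error:
        # Error statuses raise in urllib; keep the body, it may hold details.
        return error.code, error.read()


def wait_for_job(base_url, job_id, out_path, interval=10.0,
                 fetch=fetch_job, sleep=time.sleep):
    """Poll until the job has finished and write its response to out_path."""
    while True:
        status, body = fetch(base_url, job_id)
        if status == 204:
            # Job is still running.
            sleep(interval)
            continue
        if status == 404:
            # Unknown job id, or the result was already fetched and deleted.
            raise LookupError(f"job {job_id} not found")
        # The result is delivered only once, so persist it immediately.
        with open(out_path, "wb") as handle:
            handle.write(body)
        return status
```

For example, `wait_for_job("http://localhost:8529", "152223973119118", "repair-result.json")` would block until the job from the example above has finished. The `fetch` and `sleep` parameters exist only so the loop can be exercised without a running cluster.
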