32 Commits

Author SHA1 Message Date
Lars Maier a1bae63cf1 [3.4] Verbose Abort Reason (#8878)
* Added reason to job abort method.

* Additional abort that is not in devel.
2019-05-01 13:54:47 +02:00
Max Neunhöffer 54f84cab92 Performance tuning for many shards. (#8577) 2019-03-29 21:34:45 +01:00
Max Neunhöffer 1365eebfac Make AddFollower and RemoveFollower less aggressive. (#8477)
* Make AddFollower and RemoveFollower less aggressive.
* Adjust comment
* Early exit in count loop.
* Adjust comment in 2nd place.
* CHANGELOG.
2019-03-21 15:27:22 +01:00
Max Neunhöffer 46e479376d Further supervision fixes. (#8259)
* Do not schedule Coordinators in Plan.

* Finish failed server when server is no longer in health.

* Fix removeServer checks.

Check that server is no longer in use before removing it. Give 60s
waiting time for condition to be met. Also observe agency lock.

* Finish FailedFollower job if server no longer follower.

This can happen because RemoveFollower was faster.

* Only use GOOD servers as replacement followers.

* Fix AddFollower for satellite collections.

* Fix RemoveServer for satellite collections.

* MoveShard handles moves from leader to followers

* Prepare CleanoutServer and FailedServer for satellite collections.

* More sorting out of AddFollower and RemoveFollower.

* Fix RemoveFollower job w.r.t. choice of follower to remove.

* Fix message.

* kill your own sub jobs, please

* Added preconditions to payloads for supervision's job finishers

* Improve logging.

* Add agency diagnostics to failed move shard test, start.

* Add coordinator agency diagnostics.

* Remove warning.

* Add changelog entry.

* Add agency diagnostics if things go sour with move shard.

* Add agency diags when things go wrong 2.

* API /_api/agency/state: back to old format.

* Fix Windows compilation.

* handle aborts in supervision and wait for the last Raft log to be committed

* tests compiling, 2 failing for valid reasons

* Correctly report TRI_ERROR_CLUSTER_CONNECTION_LOST as 503.

* FailedLeader/FailedFollower cannot continue, when aborting blocks
2019-03-04 11:43:35 +01:00
Max Neunhöffer b87f362f27 The big supervision fix. (#8243)
* Updated CleanoutServerTests. Exclude servers in ToBeCleanedServers. Allow bad servers as new follower.
* Prefer good servers.
* Removed copy, sort and binary_search for a list of ~10 elements.
* Fix move shard bug with compare.
* MoveShard fixes, expansion of doForAllShards
* Count only GOOD servers in actualReplicationFactor.
* Make RemoveFollower remove broken servers.
* Precondition on Plan Version for updating Current as leader.
* CleanupServer to evict server from ToBeCleaned, when aborting
* cleanoutserver with payload in finish
* Use static string for ToBeCleanedOut.
* Fixed typo in log message.
* Change warning level. If a MoveShard job is aborted and we can no longer roll back, then we issue a WARNING rather than a DEBUG log message.
* Another typo and log level.
* Start to fix unit tests.
* Does not make sense for AddFollowerTest to have a FAILED leader.
* Only count GOOD followers in AddFollower.
* Fix AddFollowerTest.
* Report precondition failed in MoveShard follower case.
* Add CHANGELOG.
2019-02-25 08:12:18 -05:00
Frank Celler 9477af198b big reformat 2018-12-26 00:57:05 +01:00
Kaveh Vahedipour 28754cbf15 Feature/schmutz plus plus (#5972)
- Schmutz now called "Maintenance" and completely implemented in C++
- Fix index locking bug in mmfiles
- Fix a bug in mmfiles with silent option and repsert
- Slightly increase supervision okperiod and graceperiod
2018-08-24 12:15:35 +02:00
Matthew Von-Maszewski c0c149cf5b Create non-throwing wrappers for Node access in Agency (#4598)
* safety checkin of Node throw reduction.
* final round of Node throw protection. Common accessors now protected to force code to hasAsXXX() functions.
2018-04-17 10:21:14 +02:00
Simon 68442dae5a Fixing agency prefix in Agency/Job.cpp (#5039)
* Fixing some test issues and fixing the agency prefix in Agency/Job.cpp
* Making logic consistent in failed-leader / follower job
* reverting condition back to == GOOD
2018-04-09 16:21:24 +02:00
Simon Grätzer 7c31960cf2 Feature/async failover (#3451) 2017-10-18 23:59:29 +02:00
m0ppers bb1d303473 Cmake 5.0 complains about unused lambda captures (#3390) 2017-10-13 12:20:48 +02:00
Andreas Streichardt 439203dc3b Better logging 2017-05-11 12:20:15 +02:00
Max Neunhoeffer 09ff77cce2 Make Windows VS compiler a bit happier. 2017-04-28 17:18:37 +02:00
Kaveh Vahedipour 1f81ce28b0 merge in cpp & js from 3.1.18 yet to do tests 2017-04-21 15:41:05 +02:00
Kaveh Vahedipour 4cc830b0df merge from 3.1 2017-02-20 20:05:52 +01:00
jsteemann b3ac54d065 remove global namespace include 2017-02-13 13:03:33 +01:00
Kaveh Vahedipour 76e5dec3d7 agent with less traffic 2017-02-10 17:03:15 +01:00
Kaveh Vahedipour 3f3633bd2c supervision to proper preconditioning of jobs on plan 2017-01-27 15:29:22 +01:00
Kaveh Vahedipour ab22ffa8ee shard jobs should check for the plan to be the same as expected 2017-01-27 11:27:45 +01:00
Kaveh Vahedipour c803d52f51 startLocalCluster handles port offset so that multiple clusters can be started on same machine 2017-01-27 09:33:42 +01:00
Kaveh Vahedipour 2b9c018817 fixed resilience 2016-12-09 16:35:32 +01:00
Kaveh Vahedipour eddecc0a4c clones method in Jobs more useful 2016-12-09 09:29:00 +01:00
Kaveh Vahedipour c6ef45b64d AddFollower to handle multiple followers at the same time 2016-12-08 15:12:05 +01:00
Kaveh Vahedipour b930b23fc2 AddFollower jobs for newly arrived db server to satisfy replication factors 2016-12-07 16:20:47 +01:00
Frank Celler e4ba82e8e9 rewrite of AgencyComm 2016-10-23 00:46:30 +02:00
jsteemann 34f7e27d6c Merge branch 'devel' of https://github.com/arangodb/arangodb into generic-col-types 2016-09-08 09:27:53 +02:00
Frank Celler 52b1541f46 silenced warning in maintainer-mode 2016-09-08 08:41:58 +02:00
jsteemann 8ef63acf55 Merge branch 'devel' of https://github.com/arangodb/arangodb into generic-col-types 2016-09-07 15:24:51 +02:00
Kaveh Vahedipour beb46cc1a0 cppcheck warnings 2016-09-07 15:11:10 +02:00
jsteemann c14c6ab025 removed unused variables 2016-09-07 08:56:48 +02:00
Frank Celler 5a14ab5a12 silence warning 2016-09-06 23:20:56 +02:00
Andreas Streichardt 6396ac4dc7 Implement removeServer job 2016-09-06 16:49:25 +02:00