44 Commits

Author SHA1 Message Date
Lars Maier 6b04e3de03 Ported ResignLeadership to 3.4 (#9669)
* Ported ResignLeadership to 3.5

* Added http route.
2019-08-09 16:41:13 +02:00
Jan 3cedbe4a67
replace potentially unsafe binary comparisons with logical ones (#9380) 2019-07-04 14:56:38 +02:00
Max Neunhöffer e53966c843
Try to fix agency problems with snapshots. (#8947)
* Try to fix agency problems with snapshots.
* Abort MoveShards jobs that have the failed server as fromServer.
* Report aborts.
2019-05-16 14:41:39 +02:00
Max Neunhöffer 0acb19b18a
Better logic for AddFollower/RemoveFollower scheduling. (#8655)
* Fail a MoveShard job to a FAILED server.
* Better logic for AddFollower/RemoveFollower scheduling.
* Abort MoveShard (leader) in case of a FAILED server in Plan.
* Wait for statistics collections before doing stuff in tests.

cleanOutServer, moveShard, failover and the like.

* Abort MoveShard for follower if FAILED server in Plan.
* Take resigned servers into account when checking for health.
* CHANGELOG.
2019-04-02 23:34:27 +02:00
Max Neunhöffer 54f84cab92 Performance tuning for many shards. (#8577) 2019-03-29 21:34:45 +01:00
Max Neunhöffer 1365eebfac
Make AddFollower and RemoveFollower less aggressive. (#8477)
* Make AddFollower and RemoveFollower less aggressive.
* Adjust comment
* Early exit in count loop.
* Adjust comment in 2nd place.
* CHANGELOG.
2019-03-21 15:27:22 +01:00
Kaveh Vahedipour ab3206486d [3.4] job must not copy snapshots (#8406)
* job must not copy snapshots
* Node correct empty children
* checked all hasAsChildren sites
* No copy in operator() for node.
* Don't spam log.
* const operator too
* full path to missing key in agency
* the key is missing
* Another info level to DEBUG from INFO.
* Increase timeouts of MoveShard and CleanOutServer agency jobs.
* CHANGELOG.
2019-03-20 17:03:19 +01:00
Max Neunhöffer 46e479376d
Further supervision fixes. (#8259)
* Do not schedule Coordinators in Plan.

* Finish failed server when server is no longer in health.

* Fix removeServer checks.

Check that server is no longer in use before removing it. Give 60s
waiting time for condition to be met. Also observe agency lock.

* Finish FailedFollower job if server no longer follower.

This can happen because RemoveFollower was faster.

* Only use GOOD servers as replacement followers.

* Fix AddFollower for satellite collections.

* Fix RemoveServer for satellite collections.

* MoveShard handles moves from leader to followers

* Prepare CleanoutServer and FailedServer for satellite collections.

* More sorting out of AddFollower and RemoveFollower.

* Fix RemoveFollower job w.r.t. choice of follower to remove.

* Fix message.

* kill your own sub jobs, please

* Added preconditions to payloads for supervision's job finishers

* Improve logging.

* Add agency diagnostics to failed move shard test, start.

* Add coordinator agency diagnostics.

* Remove warning.

* Add changelog entry.

* Add agency diagnostics if things go sour with move shard.

* Add agency diags when things go wrong 2.

* API /_api/agency/state: back to old format.

* Fix Windows compilation.

* handle aborts in supervision and wait for the last Raft log to be committed

* tests compiling, 2 failing for valid reasons

* Correctly report TRI_ERROR_CLUSTER_CONNECTION_LOST as 503.

* FailedLeader/FailedFollower cannot continue when aborting blocks
2019-03-04 11:43:35 +01:00
Max Neunhöffer b87f362f27
The big supervision fix. (#8243)
* Updated CleanoutServerTests. Exclude servers in ToBeCleanedServers. Allow bad servers as new follower.
* Prefer good servers.
* Removed copy, sort and binary_search for a list of ~10 elements.
* Fix move shard bug with compare.
* MoveShard fixes, expansion of doForAllShards
* Count only GOOD servers in actualReplicationFactor.
* Make RemoveFollower remove broken servers.
* Precondition on Plan Version for updating Current as leader.
* CleanupServer to evict server from ToBeCleaned, when aborting
* cleanoutserver with payload in finish
* Use static string for ToBeCleanedOut.
* Fixed typo in log message.
* Change warning level. If a MoveShard job is aborted and we can no longer roll back, then we issue a WARNING rather than a DEBUG log message.
* Another typo and log level.
* Start to fix unit tests.
* Does not make sense for AddFollowerTest to have a FAILED leader.
* Only count GOOD followers in AddFollower.
* Fix AddFollowerTest.
* Report precondition failed in MoveShard follower case.
* Add CHANGELOG.
2019-02-25 08:12:18 -05:00
Frank Celler 9477af198b big reformat 2018-12-26 00:57:05 +01:00
jsteemann 44c7b1b476 remove tabstops 2018-07-16 15:00:12 +02:00
Simon 45fbed497b Supervision Job for Active Failover (#5066) 2018-04-23 12:49:41 +02:00
Matthew Von-Maszewski c0c149cf5b Create non-throwing wrappers for Node access in Agency (#4598)
* safety checkin of Node throw reduction.
* final round of Node throw protection.  Common accessors now protected to force code to hasAsXXX() functions.
2018-04-17 10:21:14 +02:00
Simon 68442dae5a Fixing agency prefix in Agency/Job.cpp (#5039)
* Fixing some test issues and fixing the agency prefix in Agency/Job.cpp
* Making logic consistent in failed leader / follower job
* reverting condition back to == GOOD
2018-04-09 16:21:24 +02:00
Tobias Gödderz 4f6847b1b8 Bug fix/supervision bug distributeshardslike and virtual collections (#4759) 2018-03-07 09:54:39 +01:00
Michael Hackstein 76e7461aa9
Revert "bug fix for jobs looking at distributeShardsLike and virtual collections (#4665)" (#4758)
This reverts commit 3c35cd32dd.
2018-03-05 17:48:29 +01:00
Kaveh Vahedipour 3c35cd32dd bug fix for jobs looking at distributeShardsLike and virtual collections (#4665) 2018-03-05 17:37:07 +01:00
Matthew Von-Maszewski e566150b2e There is a start-up race condition where collection could be in plan but not current. A server shutdown during this period locks system. (#4478) 2018-02-19 09:14:24 +01:00
Kaveh Vahedipour 42f543fd10 constituent correctly persisting _votedFor and _term (#4248) 2018-01-16 09:47:25 +01:00
Kaveh Vahedipour 7b80deb5cc Fixed object assignment operator for agency's key value store (#3701)
* Fixed object assignment operator for agency's key value store
* Node's toJson is now actually toJson. getString should be used for string extractions
* adjust agency's documentation (clarify precondition)
2017-11-17 15:49:40 +01:00
Kaveh Vahedipour 00650e6a3f Bug fix/agency mt fixes (#3158)
* added debugging methods

* try to fix invalid access in case of error

* remove unused members

* bugfixes and comments

* all agency fixes in

* merge bug

* partially unguarded Agent::lead fixed

* all agency fixes in

* added nrBlocked to thread startup eval

* added nrBlocked to thread startup eval

* recombination of cases in State::get

* some maps replaced with unordered_maps

* optimized maps some
2017-08-30 10:43:51 +02:00
Jan 47e29e6e1f Bug fix/issues 1806 (#3069)
* fix buffer overruns in linenoise for long input lines

* don't make historian repeatedly print the same error messages that nothing can be done about

* make the implementations of the logging operator<<s not throw exceptions, so that logging does not throw exceptions as an unintended side effect

* update CHANGELOG

* improve error message

* don't copy strings, but pass them by const reference
2017-08-18 22:58:09 +02:00
Kaveh Vahedipour fd90318fd8 correct-funny-fail-rotation-after-compaction-bugfix (#2774) 2017-07-12 22:39:23 +02:00
Andreas Streichardt f2670f8040 Extract compareServerList and make it reusable 2017-05-24 14:13:51 +02:00
Andreas Streichardt 8558cb85c9 warning on windows 2017-05-11 13:41:20 +02:00
Kaveh Vahedipour e7797d292e fixed shard ordering in Job::clones with consequences for unit tests 2017-04-27 13:37:47 +02:00
Kaveh Vahedipour 262bb4faac avoid warnings for time being 2017-04-24 16:49:26 +02:00
Kaveh Vahedipour ccc388a940 more distributeShardsLike code merged from 3.1 2017-04-24 15:13:40 +02:00
Kaveh Vahedipour c099c6daa9 more distributeShardsLike code merged from 3.1 2017-04-24 15:12:38 +02:00
Andreas Streichardt 7322e3bff3 Allow seeding of randomgenerator for tests 2017-04-21 18:08:49 +02:00
Kaveh Vahedipour 1f81ce28b0 merge in cpp & js from 3.1.18 yet to do tests 2017-04-21 15:41:05 +02:00
Kaveh Vahedipour f3cb1307a5 3.1 fixes backported to devel 2017-02-03 10:48:25 +01:00
jsteemann fa917937c4 do not use namespaces in header files 2017-02-01 13:41:31 +01:00
Max Neunhoeffer 7e4f45ec5c Fix server list comparison. 2017-01-19 14:20:00 +01:00
Kaveh Vahedipour 8251cd46e1 cannot depend on Slice.getDouble 2016-12-15 15:23:45 +01:00
Kaveh Vahedipour 2b9c018817 fixed resilience 2016-12-09 16:35:32 +01:00
Kaveh Vahedipour eddecc0a4c clones method in Jobs more useful 2016-12-09 09:29:00 +01:00
Kaveh Vahedipour b930b23fc2 AddFollower jobs for newly arrived db server to satisfy replication factors 2016-12-07 16:20:47 +01:00
Kaveh Vahedipour 3a1a9c898c correct handling of distributeShardsLike in FailedFollower 2016-12-05 15:44:53 +01:00
jsteemann 9d9b4871ba fixes for Visual Studio 2016-10-31 12:16:39 +01:00
Kaveh Vahedipour 72bf15c118 Fixed moveShard to do distributeShardsLike in start instead of create 2016-10-06 15:32:41 +02:00
Kaveh Vahedipour ce8c1a0cac revisiting all supervision jobs 2016-10-05 17:16:02 +02:00
Kaveh Vahedipour e419a52369 Implementations out of Job header 2016-10-05 15:28:26 +02:00
Kaveh Vahedipour 138d3f304e Implementations out of Job header 2016-10-05 15:26:57 +02:00