Commit Graph

75 Commits

Author SHA1 Message Date
Max Neunhöffer 54f84cab92 Performance tuning for many shards. (#8577) 2019-03-29 21:34:45 +01:00
Max Neunhöffer 46e479376d Further supervision fixes. (#8259)
* Do not schedule Coordinators in Plan.

* Finish failed server when server is no longer in health.

* Fix removeServer checks.

Check that server is no longer in use before removing it. Give 60s
waiting time for condition to be met. Also observe agency lock.

* Finish FailedFollower job if server no longer follower.

This can happen because RemoveFollower was faster.

* Only use GOOD servers as replacement followers.

* Fix AddFollower for satellite collections.

* Fix RemoveServer for satellite collections.

* MoveShard handles moves from leader to followers

* Prepare CleanoutServer and FailedServer for satellite collections.

* More sorting out of AddFollower and RemoveFollower.

* Fix RemoveFollower job w.r.t. choice of follower to remove.

* Fix message.

* kill your own sub jobs, please

* Added preconditions to payloads for supervision's job finishers

* Improve logging.

* Add agency diagnostics to failed move shard test, start.

* Add coordinator agency diagnostics.

* Remove warning.

* Add changelog entry.

* Add agency diagnostics if things go sour with move shard.

* Add agency diags when things go wrong 2.

* API /_api/agency/state: back to old format.

* Fix Windows compilation.

* handle aborts in supervision and wait for the last Raft log to be committed

* tests compiling, 2 failing for valid reasons

* Correctly report TRI_ERROR_CLUSTER_CONNECTION_LOST as 503.

* FailedLeader/FailedFollower cannot continue when aborting blocks
2019-03-04 11:43:35 +01:00
Frank Celler 9477af198b big reformat 2018-12-26 00:57:05 +01:00
Kaveh Vahedipour 860fa21219 Bug fix 3.4/index readiness (#6716)
* backport of test data generation for maintenance from devel
* 3.4 working
* fixing index use in cluster while still being built
* fixed broken views
* correct 200 for ensureIndex
* merge with 3.4
* agency comm to handle replace in array
* supervision changes
* cluster info's ensureIndex
* 3.4 ready
* timeout
* missing files from origin
* neunhoef complaints
* bogus entry
* no need to wait for current once again
* no longer necessary. done in IndexFactory now
* correct comments
* left overs
* dead code revived
* Move CHANGELOG entry to the right place.
2018-11-21 14:41:36 +01:00
Lars Maier 0e9aa10c2a Feature 3.4/cleanup lost collections (#6627)
* Working draft: clean lost collections in supervision.
* Added early exit as in spec.
* Finished test. Fixed logging.
2018-09-27 10:35:39 +02:00
Kaveh Vahedipour 28754cbf15 Feature/schmutz plus plus (#5972)
- Schmutz now called "Maintenance" and completely implemented in C++
 - Fix index locking bug in mmfiles
 - Fix a bug in mmfiles with silent option and repsert
 - Slightly increase supervision okperiod and graceperiod
2018-08-24 12:15:35 +02:00
Matthew Von-Maszewski a84f7805ad Feature/mv thread death logging (#5111)
* Initial low level interface for thread crash reporting (and management).
* Add a member version of isClusterRole()
* isolate heartbeat thread creation to new StartHeartbeatThread().  create heartbeat thread even if not a cluster or if an agent.
* update runDBServer() and runCoordinator() to shutdown more quickly by polling isStopping() at additional locations.
* copying updates from different branch / PR
* basic thread crash logging.  Not yet tied into Agency arangod or have any specific threads posting crashes
* make Supervision thread a CriticalThread
* sandwich CriticalThread between Thread and other classes to create long term, repeating thread crash reporting.
* restore code lost upon branch update relating to new startHeartbeatThread() function
* add CriticalThread.cpp to build
* add new runAgentServer() function to loop for Agents.  Make Heartbeat thread derive from CriticalThread.
* remove debug line
2018-04-23 15:50:14 +02:00
Kaveh Vahedipour 3d043b35a3 Feature/supervsion maintenance mode (#5108)
* Supervision goes to Maintenance mode, when /arango/Supervision/Maintenance exists
* coordinator route stands
* stop updates in transient, when supervision off
2018-04-20 13:23:22 +02:00
Kaveh Vahedipour 7f9786eb27 builder fixed for agency transaction. worked only for a single server. (#4436) 2018-02-06 23:14:53 +01:00
Kaveh Vahedipour 255d90d26a cherry pick from 3.2 pull request for bug-fix/supervision-thread-exists-on-pre3.2-agency (#3709)
This is the HealthRecord upgrade patch.
2017-11-17 10:14:14 +01:00
Jan bef52d7dc3 Bug fix/cleanup after cppcheck (#3639) 2017-11-10 13:53:28 +01:00
Kaveh Vahedipour 627f344266 fixed a bug, where when servers failed, when also agency leadership c… (#3189)
* fixed a bug, where when servers failed, when also agency leadership changes

* redid entire design of checkDBServers/checkCoordinators.

* comparison in supervision must be between oldPersisted and newHealth

* UI stuff

* UI stuff

* FailedServer test needed adjustment

* Hopefully final round

* fixed supervision failure detection

* FailedServer tests back to origin devel

* oldNot documented among preconditions in Agency HTTP API docs

* changed only look for status updated

* non action line in api-cluster
2017-09-07 16:10:23 +02:00
Kaveh Vahedipour 00650e6a3f Bug fix/agency mt fixes (#3158)
* added debugging methods

* try to fix invalid access in case of error

* remove unused members

* bugfixes and comments

* all agency fixes in

* merge bug

* partially unguarded Agent::lead fixed

* all agency fixes in

* added nrBlocked to thread startup eval

* added nrBlocked to thread startup eval

* recombination of cases in State::get

* some maps replaced with unordered_maps

* optimized maps some
2017-08-30 10:43:51 +02:00
Andreas Streichardt fe59502848 Fix server health 2017-05-11 12:20:15 +02:00
Kaveh Vahedipour 68efba18e8 keep agencyPrefix, when non set 2017-04-26 15:32:26 +02:00
Kaveh Vahedipour 1f81ce28b0 merge in cpp & js from 3.1.18 yet to do tests 2017-04-21 15:41:05 +02:00
Kaveh Vahedipour 8d66d69f83 supervision handles coordinator demise correctly 2017-02-07 11:29:37 +01:00
Kaveh Vahedipour aaee2f9e61 transient heartbeats 2017-01-18 13:43:33 +01:00
Kaveh Vahedipour 55985ed5de missing prototypes 2017-01-09 10:38:34 +01:00
jsteemann 7359ac44b2 more style cleanup 2017-01-05 10:52:03 +01:00
Kaveh Vahedipour 12e54902df agency's supervision must wait grace period after becoming leader before acting on db server failure 2016-12-21 11:17:41 +01:00
Max Neunhoeffer 985ccaeb70 Get rid of Supervision::wakeUp(). 2016-12-20 10:19:24 +01:00
Kaveh Vahedipour 51b279346b redirects to myself should be history 2016-12-06 17:10:15 +01:00
Andreas Streichardt 63a173f002 Delete all shard move jobs when server is healthy again 2016-11-22 14:13:09 +01:00
Kaveh Vahedipour 9a6f605f2f fixed small double / long conversion 2016-10-31 17:00:55 +01:00
Kaveh Vahedipour f8235b9c63 agency locks code review 2016-10-25 15:07:57 +02:00
Max Neunhoeffer 3a76784af4 Protect memory accesses to _snapshot in Supervision. 2016-10-12 10:23:21 +00:00
Kaveh Vahedipour 1f4abf3c36 upgrade 3.0 agency to 3.1 2016-10-06 17:04:29 +02:00
jsteemann f5a595f464 Merge branch 'devel' of https://github.com/arangodb/arangodb into generic-col-types 2016-09-07 08:52:07 +02:00
Andreas Streichardt 6396ac4dc7 Implement removeServer job 2016-09-06 16:49:25 +02:00
jsteemann 6ddf8bab54 Merge branch 'devel' of https://github.com/arangodb/arangodb into generic-col-types 2016-09-06 11:22:14 +02:00
Kaveh Vahedipour 85ea1d5ff9 clang-format 2016-09-06 10:01:33 +02:00
Andreas Streichardt f9fea70c3e re-add method 2016-09-05 15:50:41 +02:00
Kaveh Vahedipour 9808a55a33 some cleaning up 2016-09-05 15:12:46 +02:00
jsteemann c6efe26198 cppcheck 2016-08-25 14:04:23 +02:00
Andreas Streichardt 89ebeefbb9 Proper shutdown 2016-08-24 13:51:23 +02:00
Andreas Streichardt 47a0f8602a Better shutdown handling 2016-08-23 12:51:38 +02:00
Andreas Streichardt 03b9d97e2f Implement proper cluster shutdown 2016-08-18 11:23:23 +02:00
Andreas Streichardt 3f412debf0 Revert futile attempts to implement client resilience tests 2016-08-17 18:12:40 +02:00
Andreas Streichardt 70af1e3647 Implement proper cluster shutdown 2016-08-17 17:25:39 +02:00
Andreas Streichardt 526c8f42c2 Fix foxx issues in cluster
Bootstrap will now be done on the bootstrap coordinator.

queues will now be executed by the "foxxmaster"
2016-07-29 16:06:31 +02:00
jsteemann f21561b25f use nullptr, don't include Thread.h when unnecessary 2016-06-15 19:21:53 +02:00
Kaveh Vahedipour beba4887a3 shrink cluster in supervision 2016-06-10 18:10:37 +02:00
Kaveh Vahedipour 00d6111a3e server health for aardvark 2016-06-03 14:27:04 +02:00
Kaveh Vahedipour 427453bcc7 server health for aardvark 2016-06-03 12:19:39 +02:00
Kaveh Vahedipour 9957270df6 hunting down exceptions in agency supervision 2016-05-31 21:42:41 +02:00
Max Neunhoeffer b600ddbeb4 Fix getUniqueIds and updateAgencyPrefix in Supervision.
This prevents some race conditions at cluster startup that crashed the
agency.
2016-05-31 12:38:17 -06:00
Kaveh Vahedipour 7b440f94dc Moving Job classes out of Supervision 2016-05-31 16:28:54 +02:00
Kaveh Vahedipour bad7a6a35a leader fail seems good 2016-05-31 15:21:42 +02:00
Kaveh Vahedipour 68478f530d visual studio warning 2016-05-30 15:47:08 +02:00