Commit Graph

48 Commits

Author SHA1 Message Date
Kaveh Vahedipour 59b78a1ec4 [3.4] back port of agency call back cleanup (#9723)
* back port of agency call back cleanup
* storecallback missing
* revert callback bodies to API specification
* array needs to be inside so that multiple unobserves to the same key are possible
2019-08-23 10:12:11 +02:00
Jan 3219e63381 less copying in ClusterInfo::loadPlan() (#9650) 2019-08-08 10:04:36 +02:00
Lars Maier 1c13296f86 [3.4] ClientID Agency Transaction (#8651)
* Changed clientId to format <serverid>:<uuid>.
* Changed behavior if id is not known.
2019-04-30 10:35:34 +02:00
Max Neunhöffer 46e479376d Further supervision fixes. (#8259)
* Do not schedule Coordinators in Plan.

* Finish failed server when server is no longer in health.

* Fix removeServer checks.

Check that server is no longer in use before removing it. Give 60s
waiting time for condition to be met. Also observe agency lock.

* Finish FailedFollower job if server no longer follower.

This can happen because RemoveFollower was faster.

* Only use GOOD servers as replacement followers.

* Fix AddFollower for satellite collections.

* Fix RemoveServer for satellite collections.

* MoveShard handles moves from leader to followers

* Prepare CleanoutServer and FailedServer for satellite collections.

* More sorting out of AddFollower and RemoveFollower.

* Fix RemoveFollower job w.r.t. choice of follower to remove.

* Fix message.

* kill your own sub jobs, please

* Added preconditions to payloads for supervision's job finishers

* Improve logging.

* Add agency diagnostics to failed move shard test, start.

* Add coordinator agency diagnostics.

* Remove warning.

* Add changelog entry.

* Add agency diagnostics if things go sour with move shard.

* Add agency diags when things go wrong 2.

* API /_api/agency/state: back to old format.

* Fix Windows compilation.

* handle aborts in supervision and wait for the last Raft log to be committed

* tests compiling, 2 failing for valid reasons

* Correctly report TRI_ERROR_CLUSTER_CONNECTION_LOST as 503.

* FailedLeader/FailedFollower cannot continue when aborting blocks
2019-03-04 11:43:35 +01:00
Frank Celler 9477af198b big reformat 2018-12-26 00:57:05 +01:00
Kaveh Vahedipour 860fa21219 Bug fix 3.4/index readiness (#6716)
* backport of test data generation for maintenance from devel
* 3.4 working
* fixing index use in cluster while still being built
* fixed broken views
* correct 200 for ensureIndex
* merge with 3.4
* agency comm to handle replace in array
* supervision changes
* cluster info's ensureIndex
* 3.4 ready
* timeout
* missing files from origin
* neunhoef complaints
* bogus entry
* no need to wait for current once again
* no longer necessary. done in IndexFactory now
* correct comments
* left overs
* dead code revived
* Move CHANGELOG entry to the right place.
2018-11-21 14:41:36 +01:00
Simon 6eb9e38b08 Better agency pool update (#7036) 2018-10-24 16:23:10 +02:00
Kaveh Vahedipour 28754cbf15 Feature/schmutz plus plus (#5972)
- Schmutz now called "Maintenance" and completely implemented in C++
 - Fix index locking bug in mmfiles
 - Fix a bug in mmfiles with silent option and repsert
 - Slightly increase supervision okperiod and graceperiod
2018-08-24 12:15:35 +02:00
Tobias Gödderz 8c87f51429 Feature/fix inconsistent distribute shards like job (#4743) 2018-05-07 16:53:08 +02:00
Simon 35136a89c0 Fix some problems with active failover (#4540) 2018-02-09 15:11:53 +01:00
Jan b2b6c06cbf Feature/efficiency (#3736) 2018-01-05 16:51:31 +01:00
Kaveh Vahedipour 22e6a68747 Bug fix/integer overflow when calculating waits in constituent (#4050)
* integer overflow in Constituent could seize operation of Agency

* less likely integer overflow on double conversion

* less likely integer overflow on double conversion

* changed comparison to integer comparison as suggested by @neunhoef
2017-12-19 21:40:46 +01:00
Max Neunhöffer 74458d9d34 Add security check in AgencyComm::sendWithFailover. (#3838) 2017-12-06 10:50:40 +01:00
Kaveh Vahedipour f7b4150b64 no clientId anymore in send/sendWithFailOver SPIs (#3819) 2017-11-28 10:47:36 +01:00
Kaveh Vahedipour 27cd691bbf Bug fix/agencycomm validate methods broken (#3805) 2017-11-27 14:18:25 +01:00
Simon Grätzer ee8209943f Missing things for active / passive (#3578)
* Switching from ttl to supervision based failover mechanism

* Allowing canceling of ongoing actions

* refactored asyncjobmanager

* refactoring some code

* adding read-only flag

* catching some exceptions to reduce log pollution, removing unnecessary code, removing tests for _changeMode

* fixing "createsANewDatabaseWithAnInvalidUser"

* auth = off no longer makes everyone superuser

* Fixing cluster_sync and maybe resilience
2017-11-04 20:30:23 +01:00
Simon Grätzer fd3f9d99d9 Fixing webinterface access (#3464)
* intermediate commit

* Refactoring the ExecContext

* Fixing authentication

* Added start script

* some fixes

* fixed access to nullptr

* some c++

* fixed misleading message

* Made DatabaseGuard movable. Also adapted map insertions to _vocbase in Syncer classes, which failed to compile under older GCC versions

* added support for global flag to replication handler

* Started Refactoring in replication-static

* Fixing syncer code

* store applier configuration

* Static replication tests now test replication in a non system Database

* added flags to replication feature

* Adding some extra checks

* Fixing issue with rocksdb rest replication handler

* replication static now runs _system and otherdatabase replication tests.

* Fixing crash on startup

* Replication_sync now tests _system as well as other Database

* Fixing up heartbeat thread, adding global flag to rest handler

* Fixing wrong assert

* some cleanup, probably some tests are broken

* Made non-system db version of replication-ongoing tests

* fix determine-open-transaction

* Fixed ongoing tests. And added a test where we drop a database on slave while replication is still ongoing

* test fixes

* Activated ongoing other db tests. Also added a test that drops the DB on master, while the slave is still syncing.

* some better error reporting

* gradually switch to Result

* createCollection -> create

* re-activate using of collection ids for now

* enable auto-start

* Fixed create collection in replication ongoing test

* Added first draft of a test for global replication

* move to Result

* use system database for global applier

* improved error reporting

* fixed invalid URLs

* add test case filter

* load existing global applier configuration

* improve error reporting

* Added further tests for global replication

* Fixed global replication test, it now properly waits for replication. Times out after 10 seconds.

* Removed erroneous assertion

* improve error reporting

* intermediate commit

* Added a test-case for global replication where the Master already has some data and the slave is clean

* fix deletion of replication contexts

* Fixed JSLint

* compiling code

* fix typo

* do not fail for global applier when no database is configured

* intermediate commit

* syncer supports switch for 3.3 / 3.2

* fixed errors

* Fixing some replication bugs

* Fixing some assertions

* Fixed missing commit markers

* Fixing assertion on database drop

* Attempt to fix deadlock in applier and assertion

* Fixing some stupid things

* Support for collection parameter

* Accidentally turned off some tests

* Grrr

* Fixing wrong method call

* Fixed startscript

* Fixed assignment instead of equality check typo

* Added a test for interrupted replication. For now it just tests basics on _system database.

* Improved index tests on replication.

* properly initialize variable

* fixed some replication problems

* MMFiles wal access support

* fix replication issues

* Started mmfiles replication support

* fixing a bug

* Fixing an issue

* fixing some mmfiles stuff

* fix test

* reload users

* prevent pure virtual method call

* intermediate commit

* Making from exclusive

* do not call getMasterState if child syncer

* some reformatting

* Adding global support for handleCommandSync

* Fixing assertion

* removing some debug logs

* Changing return codes

* Fixing some issues in the rest handler

* Make replication less susceptible to errors

* remove some debug output

* return last log tick

* remove waits from tests

* fix two tests

* changing header for open-transactions call

* some fixes

* fix test

* invalidate cached databases

* merging request and execcontext

* try to fix assertion error

* renamed method

* fix compile warning

* small changes

* Always use execcontext

* Fixing an assert

* fix replication issues

* try to fix collection lookups

* try to fix master/slave start

* Changing comments in heartbeat thread

* fix wrong signature of READ_LOCKER_EVENTUAL

* log server role in testing mode

* Fixed authentication, removed execContext in favor of request context

* Adding cluster rest api

* Fixing cluster rest handler

* Fixing cluster callback

* Some refactoring

* Queue creation is not a single operation

* Allowed for leader redirects

* Setting start of batch

* Disabling 2.8 compat tests

* fix start/stop bugs

* jslint

* various little changes

* add flag for exposing jwt

* indentation

* cleanup

* Some changes to guid

* fixing tcp to http, vst

* changed endpoint header

* small fixes

* Reorder servers by health status

* Higher timeout

* Changing error messages

* update the fromTick when fetching multiple batches from the coordinator

* more debug info

* Reducing copy pasted code

* change uid generation

* reducing logspam

* more exceptions for redirects

* more exceptions

* attempt to fix uniqids in cluster

* centralize printing of HTTP errors in replication

* debug output

* fix messages for authentication

* cleanup

* removing --cluster.my-id, --cluster.my-local-info

* Added leadership race to bootstrap, determine foxxmaster on bootstrap, removing obsolete code

* improve error reporting in RestAqlHandler

* Changing heartbeat thread, fixing cluster_sync

* some more debug output

* added master

* attempt to make tests more deterministic

* added logging about indexes

* added some safety checks to the logger

* slightly better error messages

* fix location header for SSL

* fix error message

* try to make tests more deterministic

* change error code from TRI_ERROR_INTERNAL (which we want to avoid) to TRI_ERROR_FAILED

* Fixing broken webinterface access

* reverting groovy change

* Fixing read-only internal users

* Using superuser rights for dashboard now

* Adding mode field to _admin/server/role

* added mode TRYAGAIN

* remove inventory lock (does not seem necessary here)

* remove invalid assertion

* fixing agency bugs

* Removing debug output

* return proper errors in case of "method not allowed"

* Fixed up some info messages

* jslint
2017-10-20 18:06:59 +02:00
Simon Grätzer 7c31960cf2 Feature/async failover (#3451) 2017-10-18 23:59:29 +02:00
Kaveh Vahedipour 00650e6a3f Bug fix/agency mt fixes (#3158)
* added debugging methods

* try to fix invalid access in case of error

* remove unused members

* bugfixes and comments

* all agency fixes in

* merge bug

* partially unguarded Agent::lead fixed

* all agency fixes in

* added nrBlocked to thread startup eval

* added nrBlocked to thread startup eval

* recombination of cases in State::get

* some maps replaced with unordered_maps

* optimized maps some
2017-08-30 10:43:51 +02:00
Kaveh Vahedipour 1d1e0f5a50 Feature/cluster id and extended health (#3046)
* added unique id to cluster, added access to Health

* added agents to health api

* added agents to health api

* added agents to health api

* transaction information for api

* agents listed like other servers

* missing line through merge conflict
2017-08-18 11:13:23 +02:00
Max Neunhöffer 2f874249bb Bug fix/adjust agency comm timeouts (#2765)
* Take out 503 timeouts altogether.
* Overhaul of AgencyComm::sendWithFailover loop.
* Let performRequests optionally ignore 404 coll not found.
* Fix error message "database not found" when AgencyComm failed.
* Add log entries in Agency if locks are acquired too slowly.
* Reexecute the javascript cluster sync stuff even if there was no plan/current change...So failed sync jobs can retry later...
* Cover callbacks in Communicator by lock. This fixes https://github.com/arangodb/planning/issues/370
* Put in delay in waiting for leader in agency test.
* Schmutz logging to heartbeat topic.
* Add more lock time diagnostic in agent.
* Switch on agencycomm tracing in coordinator.
2017-07-13 00:44:28 +02:00
Andreas Streichardt 9472ab821c Fix rolling back of indices 2017-05-15 15:48:01 +02:00
Kaveh Vahedipour 1f81ce28b0 merge in cpp & js from 3.1.18 yet to do tests 2017-04-21 15:41:05 +02:00
Kaveh Vahedipour 4cc830b0df merge from 3.1 2017-02-20 20:05:52 +01:00
Andreas Streichardt 8de9941df5 Also remove from includes 2017-02-15 16:18:41 +01:00
Kaveh Vahedipour 7fbf9fb621 AgencyCallBacks registry and unregistry are more talkative than bool 2017-02-10 17:31:26 +01:00
Kaveh Vahedipour f45d775106 AgencyComm evaluates fully sent requests properly. 2017-01-24 09:14:28 +01:00
Kaveh Vahedipour 67cd7deaaa ClusterInfo enjoys clientIds 2017-01-19 14:51:29 +01:00
Kaveh Vahedipour 54ccffc0ee agencycommresult with clientids 2017-01-19 14:11:09 +01:00
Kaveh Vahedipour 3639e2ad5b inquire in agency interface adjusted 2017-01-19 11:33:01 +01:00
Kaveh Vahedipour aaee2f9e61 transient heartbeats 2017-01-18 13:43:33 +01:00
Andreas Streichardt 466f932701 First steps to low level replication debugging 2017-01-06 17:19:07 +01:00
Kaveh Vahedipour 9d5a5537ce remove deceased agents from AgencyComm 2017-01-02 17:12:00 +01:00
Kaveh Vahedipour e9f465d13b read/write/transact interface lifted up to js 2016-12-28 15:37:05 +01:00
Max Neunhoeffer a6998744f1 Repair failover loop of AgencyComm. 2016-12-23 13:18:32 +01:00
Kaveh Vahedipour f5e836697a heartbeat adds agents to agencycomm 2016-12-20 17:39:32 +01:00
Kaveh Vahedipour dd0146a54d Merge branch 'devel' of https://github.com/arangodb/arangodb into devel 2016-12-16 12:26:38 +01:00
Kaveh Vahedipour 0df8e4e2cd isWatch no longer needed after move to arangodb agency 2016-12-16 12:26:27 +01:00
jsteemann b4df6577c0 don't copy responses around 2016-12-16 11:00:18 +01:00
Kaveh Vahedipour 84fa31a39d agencycommanager ran into locks when ::redirect called ::failed 2016-12-14 17:27:46 +01:00
Kaveh Vahedipour a7f88840e7 Fixed redirect issues in AgencyComm 2016-12-14 12:12:00 +01:00
Kaveh Vahedipour 77c8c51865 FailedFollower and Windows build problems 2016-11-30 15:39:10 +01:00
Kaveh Vahedipour 3518fb1319 AgencyComm: validation defined in transactions 2016-11-28 16:09:55 +01:00
jsteemann e81a3c1ec1 fix issues found by cppcheck 2016-11-25 17:26:42 +01:00
Kaveh Vahedipour 4a95e82fa6 ShortName for servers in new ugly UUID world 2016-11-25 15:25:51 +01:00
Kaveh Vahedipour 41e1ba144f general transactions in agency comm 2016-11-25 09:24:41 +01:00
Frank Celler e4ba82e8e9 rewrite of AgencyComm 2016-10-23 00:46:30 +02:00
Frank Celler 959797c54f moved to Agency 2016-10-23 00:46:30 +02:00