1
0
Fork 0
Commit Graph

1166 Commits

Author SHA1 Message Date
Jan 5abf0c1185 Bug fix/fixes 1511 (#3711) 2017-11-16 14:18:51 +01:00
Max Neunhöffer 766ab7c8cf
Fix agency shutdown bug. (#3683)
* Fix agency shutdown bug.
* Remove precondition that was not needed in AgencyComm::removeValues.
* Fail fatally if threads do not shut down.
2017-11-14 16:33:46 +01:00
Jan e1ecc6b02c fix some threading issues (#3659) 2017-11-12 22:34:51 +01:00
Kaveh Vahedipour c9621ff230 Feature/new agency checks for preconditions (#3612) 2017-11-11 22:48:23 +01:00
Max Neunhöffer bff630b332 Handle leader resignation race with redirectRequst. (#3663) 2017-11-11 19:38:29 +01:00
Kaveh Vahedipour 7e816db51e Bug fix/agency restart enhancements (#3619)
* Removed unused active(...) method in Agent
* Inception's restart from persistence allows peer with empty active RAFT list to join
* Agency's UUID is persisted outside of the database comparable to coordinator and db server action.
* Publicized Methods to UUID stuff in ServerState
* Inception method documentation
* added --agency.disaster-recovery-id to allow for specification of known former agency id. this is a very dangerous option potentially.
* Delete a unused methods.
* separate _id and _recoveryId
* populating active list with entire pool
* Improve logging.
* reject gossip from unknown agent, if pool is complete
2017-11-10 23:40:26 +01:00
Jan bef52d7dc3
Bug fix/cleanup after cppcheck (#3639) 2017-11-10 13:53:28 +01:00
Max Neunhöffer 3c0ee6908b Bug fix/lead to agent (#3541) 2017-11-09 11:10:09 +01:00
Jan 98eecaae20 bug fix for agency precondition checks (#3579) 2017-11-06 23:55:41 +01:00
Simon Grätzer ee8209943f Missing things for active / passive (#3578)
* Switching from ttl to supervision based failover mechanism

* Allowing canceling of ongoing actions

* refactored asyncjobmanager

* refactoring some code

* adding read-only flag

* catching some exceptions to reduce log pollution, removing unnecessary code, removing tests for _changeMode

* fixing "createsANewDatabaseWithAnInvalidUser"

* auth = off does not longer make everyone superuser

* Fixing cluster_sync and maybe resilience
2017-11-04 20:30:23 +01:00
jsteemann a5c777e565 fix broken inquiry results in AgencyComm 2017-10-26 20:10:54 +02:00
Max Neunhöffer cb05d33e17 Term is a number not a string. (#3520) 2017-10-26 12:02:38 +02:00
Max Neunhöffer ee96c37237 Fix agency restart problems. (#3493)
* Fix agency restart problems (port from a 3.2 fix).

* Further fixes after Craneware rescue.
2017-10-25 18:05:58 +02:00
Michael Hackstein 15d9a4be5f Reactivated the failover of the FoxxMaster, it was not modified anymore after the current master dies (#3510) 2017-10-25 18:03:24 +02:00
Jan 720e6df82e Bug fix/fixes 1910 (#3471)
* properly initialize all properties

* use faster comparison

* properly detect and handle "method not allowed"

* code-style

* remove unused variable

* narrow variable scope

* handle non-existance of AuthenticationFeature

* remove dead code

* replace some C string handling with std::strings

* moved assertion to the correct place

* honor number of array members for IN operator

* slightly adjust error messages

* slighty adjust some error messages

* try to fix issue with lingering replication contexts on shutdown

* clean up heartbeat thread a little bit

* small fixes
2017-10-23 09:17:36 +02:00
Max Neunhöffer 67300f9d77 Add a hidden AGENCY_DUMP for agency emergency recovery. (#3474) 2017-10-21 00:24:32 +02:00
Simon Grätzer fd3f9d99d9 Fixing webinterface access (#3464)
* intermediate commit

* Refactoring the ExecContext

* Fixing authentication

* Added start script

* some fixes

* fixed access to nullptr

* some c++

* fixed misleading message

* Made DatabaseGuard movable. Also adapted map insertions to _vocbase in Syncer classes, which failed to compile under older GCC versions

* added support for global flag to replication handler

* Started Refactoring in replication-static

* Fixing syncer code

* store applier configuration

* Static replication tests now test replication in a non system Database

* added flags to replication feature

* Adding some extra checks

* Fixing issue with rocksdb rest replication handler

* replication static now runs _system and otherdatabase replication tests.

* Fixing crash on startup

* Replication_sync now tests _system as well as other Database

* Fixing up heartbeat thread, adding global flag to rest handler

* Fixing wrong assert

* some cleanup, probably some tests are broken

* Made non-system db version of replication-ongoing tests

* fix determine-open-transaction

* Fixed ongoing tests. And added a test where we drop a database on slave while replication is still ongoing

* test fixes

* Activated ongoing other db tests. Also added a test that drops the DB on master, while the slave is still syncing.

* some better error reporting

* gradually switch to Result

* createCollection -> create

* re-activate using of collection ids for now

* enable auto-start

* Fixed create collection in replication ongoing test

* Added first draft of a test for global replication

* move to Result

* use system database for global applier

* improved error reporting

* fixed invalid URLs

* add test case filter

* load existing global applier configuration

* improve error reporting

* Added further tests for global replication

* Fixed global replication test, it now properly waits for replication. Timeouts after 10 seconds.

* Removed erronious assertion

* improve error reporting

* intermediate commit

* Added a test-case for global replication where the Master already has some data and the slave is clean

* fix deletion of replication contexts

* Fixed JSLint

* compiling code

* fix typo

* do not fail for global applier when no database is configured

* intermediate commit

* syncer supports switch for 3.3 / 3.2

* fixed errors

* Fixing some replication bugs

* Fixing some assertions

* Fixed missing commit markers

* Fixing assertion on database drop

* Attempt to fix deadlock in applier and assertion

* Fixing some stupid things

* Support for collection parameter

* Acidentally turned off some tests

* Grrr

* Fixing wrong method call

* Fixed startscript

* Fixed assignmet instead of equality check typo

* Added a test far interrupted replication. For now it justs tests basics on _system database.

* Improved index tests on replication.

* properly initialize variable

* fixed some replication problems

* MMFiles wal access support

* fix replication issues

* Started mmfiles replication support

* fixing a bug

* Fixing an issue

* fixing some mmfiles stuff

* fix test

* reload users

* prevent pure virtual method call

* intermediate commit

* Making from exclusive

* do not call getMasterState if child syncer

* some reformatting

* Adding global support for handleCommandSync

* Fixing assertion

* removing some debug logs

* Changing return codes

* Fixing some issues in the rest handler

* Make replication less susceptible to errors

* remove some debug output

* return last log tick

* remove waits from tests

* fix two tests

* changing header for open-transactions call

* some fixes

* fix test

* invalidate cached databases

* merging request and execcontext

* try to fix assertion error

* renamed method

* fix compile warning

* small changes

* Always use execcontext

* Fixing an assert

* fix replication issues

* try to fix collection lookups

* try to fix master/slave start

* Changing comments in heartbeat thread

* fix wrong signature of READ_LOCKER_EVENTUAL

* log server role in testing mode

* Fixed authentication, removed execContext in favor of request context

* Adding cluster rest api

* Fixing cluster rest handler

* Fixing cluster callback

* Some refactoring

* Queue creation is not a single operation

* Allowed for leader redirects

* Setting start of batch

* Disabling 2.8 compat tests

* fix start/stop bugs

* jslint

* various little changes

* add flag for exposing jwt

* indentation

* cleanup

* Some changed to guid

* fixing tcp to http, vst

* changed endpoint header

* small fixes

* Reorder servers by health status

* Higher timeout

* Changing error messages

* update the fromTick when fetching multiple batches from the coordinator

* more debug info

* Reducing copy pasted code

* change uid generation

* reducing logspam

* more exceptions for redirects

* more exceptions

* attempt to fix uniqids in cluster

* centralize printing of HTTP errors in replication

* debug output

* fix messages for authentication

* cleanup

* removing --cluster.my-id, --cluster.my-local-info

* Added leadership race to bootstrap, determine foxxmaster on boostrap, removing obsolete code

* improve error reporting in RestAqlHandler

* Changing heartbeat thread, fixing cluster_sync

* some more debug output

* added master

* attempt to make tests more deterministic

* added logging about indexes

* added some safety checks to the logger

* slighty better error messages

* fix location header for SSL

* fix error message

* try to make tests more deterministic

* change error code from TRI_ERROR_INTERNAL (which we want to avoid) to TRI_ERROR_FAILED

* Fixing broken webinterface access

* reverting groovy change

* Fixing read-only internal users

* Using superuser rights for dashboard now

* Adding mode field to _admin/server/role

* added mode TRYAGAIN

* remove inventory lock (does not seem necessary here)

* remove invalid assertion

* fixing agency bugs

* Removing debug output

* return proper errors in case of "method not allowed"

* Fixed up some info messages

* jslint
2017-10-20 18:06:59 +02:00
Kaveh Vahedipour 428e163db9 Return the result of the inquiry (#3465) 2017-10-20 15:01:32 +02:00
Jan 7840d3f824 Bug fix/fixes 1810 (#3460)
* improve error reporting in RestAqlHandler

* added logging about indexes

* added some safety checks to the logger

* slighty better error messages

* fix location header for SSL

* fix error message

* try to make tests more deterministic

* change error code from TRI_ERROR_INTERNAL (which we want to avoid) to TRI_ERROR_FAILED
2017-10-19 11:28:01 +02:00
Simon Grätzer 7c31960cf2 Feature/async failover (#3451) 2017-10-18 23:59:29 +02:00
Kaveh Vahedipour 46333a762f Bug fix/agency restart after compaction and holes in log (#3413)
* State fixes holes in RAFT index range
* Avoid application of entries older than compaction index _cur and guard for unsigned overflow
2017-10-13 16:01:41 +02:00
m0ppers bb1d303473 Cmake 5.0 complains about unused lambda captures (#3390) 2017-10-13 12:20:48 +02:00
Max Neunhöffer 9a2385b941 Add host id detection and show in /_admin/cluster/Health. (#3389) 2017-10-11 12:42:44 +02:00
Max Neunhöffer d86f27bd19 Bug fix/agency leader timeouts (#3373)
* Send out empty heartbeats regardless of non-empty AppendEntriesRPC.
* Also improve logging:
  Note if a log in the empty heartbeat sending takes > 0.01 s.
  Clearly mark places where a leader resigns in logging.
  Log if no empty heartbeat is sent out.
* Make leader more tolerant w.r.t. incoming AppendEntriesRPC responses.
* Add debug logging for _lastAcked and challengeLeadership.
* Remove some unused code. Do not count ourselves in challengeLeadership.
* Removal of entire activation/deactivation mechanisms in agency
* TRI_microtime up to c++11
* added term to response to sendAppendEntries.
2017-10-06 10:11:51 +02:00
Max Neunhoeffer af3f977997
Revert "Send out empty heartbeats regardless of non-empty AppendEntriesRPC."
This reverts commit e974501446.
2017-10-02 15:02:15 +02:00
Max Neunhoeffer 2852f80b5a
Revert "Make leader more tolerant w.r.t. incoming AppendEntriesRPC responses."
This reverts commit 45d37edfb2.
2017-10-02 15:02:06 +02:00
Max Neunhoeffer 45d37edfb2
Make leader more tolerant w.r.t. incoming AppendEntriesRPC responses. 2017-10-02 15:01:11 +02:00
Max Neunhoeffer e974501446
Send out empty heartbeats regardless of non-empty AppendEntriesRPC.
Also improve logging:
  Note if a log in the empty heartbeat sending takes > 0.01 s.
  Clearly mark places where a leader resigns in logging.
  Log if no empty heartbeat is sent out.
2017-10-02 14:14:41 +02:00
Max Neunhöffer 47f367d3f0 Bug fix/agency compactor deadlock (#3335)
* Fix a deadlock between Agent thread and compactor thread.
* Improve comments in header.
* Organise clean shutdown of agency threads.
2017-09-28 12:20:57 +02:00
Max Neunhöffer 22e46978a6 Bug fix/sort out agency locks (#3306)
New locking concept in Agency. Ensure empty heartbeats can be sent, answered and processed without long locks. Adjust logging. Fix compaction bugs.
2017-09-27 15:22:30 +02:00
Kaveh Vahedipour 3700f75b0c State has to keep log for removeConflicts and acoording log all the way (#3249) 2017-09-16 12:20:47 +02:00
Jan 5165155ed1 Bug fix/fixes 0609 (#3227)
* do not use V8 variant of AQL functions in early optimization stage when a C++ variant is available

* additionally, simplify AQL function definitions and aliases

* warn when more than 90% of max mappings are in use

* added C++ variant of replication catchup

* added `--log.role` option

* updated CHANGELOG

* removed non-existing scheduler.threads option from config

* removed useless __FILE__, __LINE__ invocations

* updated CHANGELOG

* allow a priority V8 context

* remove TRI_CORE_MEM_ZONE

* try to fix Windows errors & warnings

* cleanup

* removed memory zones altogether

* exclude system collections from collection tests
2017-09-13 16:28:21 +02:00
Kaveh Vahedipour 627f344266 fixed a bug, where when servers failed, when also agency leadership c… (#3189)
* fixed a bug, where when servers failed, when also agency leadership changes

* redid entire design of checkDBServers/checkCoordinators.

* comparison in supervision must be between oldPersisted and newHealth

* UI stuff

* UI stuff

* FailedServer test needed adjustment

* Hopefully final round

* fixed supervision failure detection

* FailedServer tests back to origin devel

* oldNot documented among preconditions in Agency HTTP API docs

* changed only look for status updated

* non action line in api-cluster
2017-09-07 16:10:23 +02:00
Simon Grätzer ffc465433a No access collections Improvements (#3190)
* consolidated EdgeDocumentToken

* optimizing cluster traversal

* adding skip collection checks

* API cleanup

* copying AQLValue to avoid use-after-free bugs

* Fixing rocksdb SingleServerEdgeCursor

* Fixing a collection resolving issue
2017-09-07 14:55:07 +02:00
Jan 0abbc3a3c6 fix duplicate mutex (#3215) 2017-09-07 14:38:29 +02:00
Kaveh Vahedipour e808867ddc Bug fix/unordered map changes order in catch tests (#3175)
* order of free and free2 changed with use of unordered_multimap vs multimap

* fixing order independance for maps in catch tests fixed?

* missing bits and pieces
2017-08-31 15:58:48 +02:00
Kaveh Vahedipour 00650e6a3f Bug fix/agency mt fixes (#3158)
* added debugging methods

* try to fix invalid access in case of error

* remove unused members

* bugfixes and comments

* all agency fixes in

* merge bug

* partially unguarded Agent::lead fixed

* all agency fixes in

* added nrBlocked to thread startup eval

* added nrBlocked to thread startup eval

* recombination of cases in State::get

* some maps replaced with unordered_maps

* optimized maps some
2017-08-30 10:43:51 +02:00
Kaveh Vahedipour 4c94a1c8ab fixed a bug in create collection in cluster, where transaction result was not checked for success before access (#3137) 2017-08-28 14:58:26 +02:00
Jan 47e29e6e1f Bug fix/issues 1806 (#3069)
* fix buffer overruns in linenoise for long input lines

* don't make historian repeatedly print the same error messages that nothing can be done about

* make the implementations of the logging operator<<s not throw exceptions, so that logging does throw exceptions as an unintended side effect

* update CHANGELOG

* improve error message

* don't copy strings, but pass them by const reference
2017-08-18 22:58:09 +02:00
Kaveh Vahedipour 1d1e0f5a50 Feature/cluster id and extended health (#3046)
* added unique id to cluster, added access to Health

* added agents to health api

* added agents to health api

* added agents to health api

* transaction information for api

* agents listed like other servers

* missing line through merge conflict
2017-08-18 11:13:23 +02:00
m0ppers 930dd8aad2 MSVC is pendantic (but right) (#3047) 2017-08-17 21:25:34 +02:00
Frank Celler e446cff433 added result code in error message 2017-08-12 10:52:32 +02:00
Jan 49fa0bf4b4 used the required mutex in Store::clear to avoid races (#2957)
also added asserts for that the mutex is actually held everywhere where it is required
2017-08-05 17:15:51 +02:00
Jan ed9d15156e remove dependency on MMFiles features from non-MMFiles files (#2925) 2017-08-01 22:16:43 +02:00
Kaveh Vahedipour 2bbb2224e8 agency size 1 issues on windows and arschlinux's hyping of the bleeding edge 2017-07-14 12:07:14 +02:00
Andreas Streichardt f1a6f16135 Carrotfix to only stop inception thread if it has been started
should really be done in the thread lib but that is a bit risky so
shortly before the release
2017-07-14 10:20:49 +02:00
Max Neunhöffer bf168a7496 Fix a bug that the URL was overwritten by redirect in AgencyComm. (#2794) 2017-07-13 16:27:09 +02:00
Max Neunhöffer 2f874249bb Bug fix/adjust agency comm timeouts (#2765)
* Take out 503 timeouts altogether.
* Overhaul of AgencyComm::sendWithFailover loop.
* Let performRequests optionally ignore 404 coll not found.
* Fix error message "database not found" when AgencyComm failed.
* Add log entries in Agency if locks are acquired too slowly.
* Reexecute the javascript cluster sync stuff even if there was no plan/current change...So failed sync jobs can retry later...
* Cover callbacks in Communicator by lock. This fixes https://github.com/arangodb/planning/issues/370
* Put in delay in waiting for leader in agency test.
* Schmutz logging to heartbeat topic.
* Add more lock time diagnostic in agent.
* Switch on agencycomm tracing in coordinator.
2017-07-13 00:44:28 +02:00
Kaveh Vahedipour fd90318fd8 correct-funny-fail-rotation-after-compaction-bugfix (#2774) 2017-07-12 22:39:23 +02:00
Jan 49d2313c2c fix ub in agency compaction (#2736) 2017-07-09 20:45:00 +02:00