1184 Commits

Author SHA1 Message Date
Matthew Von-Maszewski 41d1bfce23 create independent executeLockedRead and executeLockedWrite to speed performance (#4178) 2017-12-29 13:36:48 +01:00
Kaveh Vahedipour d6ce7a1301 Agency read write locks ported from devel (#4175) 2017-12-28 11:28:11 +01:00
Matthew Von-Maszewski 392ddde251 Bug fix 3.3: Fix supervisor thread crash (#4165)
* port devel branch to 3.3 of supervisor thread death fix
2017-12-27 22:34:29 +01:00
Max Neunhöffer ef8fcd101c Port to 3.3 of various fixes around leadership preparation in agency. (#4150)
* Add logging for _earliestPackage in Agent.
* Really enforce the hidden option --server.maximal-threads if given.
* Switch off --log.force-direct in scripts/startStandAloneAgency.sh
* Lower the timeout for sending AppendEntriesRPC to 150s.
* Erase _earliestPackage when becoming a leader.
* Challenge leadership in agent main loop.
* Use steady_clock for _earliestPackage.
* Change _lastAcked and _leaderSince to steady_clock as well.
* time difference calculations based on old readSystemClock to steadyClockToDouble
* All system_clock transitioned to steady_clock in Agent. Remaining system_clock are user input / output or timestamps
* Inception system_clock to steady_clock
2017-12-27 16:47:16 +01:00
Matthew Von-Maszewski f35215ea51 Have twice seen coordinator go into long loop on shutdown. Added two tests for isStopping() to break the loops. (#4139) 2017-12-21 20:56:14 +01:00
Jan b7ee607312 Bug fix 3.3/integer overflow when calculating waits in constituent (#4090)
* integer overflow in Constituent could seize operation of Agency

* less likely integer overflow on double conversion

* less likely integer overflow on double conversion

* changed comparison to integer comparison as suggested by @neunhoef
2017-12-19 10:10:05 +01:00
Jan 9c5893e7a7 fix premature unlock (#3802) (#4027)
* fix some deadlocks found by evil lock manager (tm)

* fix duplicate lock

* fix indentation

* ensure proper lock dependencies

* fix lock acquisition

* removed useless comment

* do not lock twice

* create either a V8 transaction context or a standalone transaction context, depending on if we are called from within V8 or not

* AQL micro optimizations

* use explicit constructor

* only use V8DealerFeature's ConditionLocker for acquiring a free V8 context

entering and exiting the selected context is then done later on without having to hold the ConditionLocker

* remove some recursive locks

* Disable custom deadlock detection when Thread Sanitizer is enabled

* Changing ifdef's

* grr

* broke gcc

* Using atomic for ApplicationServer::_server

* fix premature unlock

* add some asserts

* honor collection locking in cluster

* yet one more lock fix

* removed assertion

* some more bugfixes

* Fixing assert

(cherry picked from commit 1155df173bfb67303077fbe04ee8d909517bfd21)
2017-12-13 18:46:14 +01:00
Jan 7af86685e3 when upgrading from 3.1 LastHeartBeatAcked could also have been missing, when the 3.1 cluster had not run for long enough (#3974) 2017-12-08 17:33:37 +01:00
Jan bec83181be Bug fix 3.3/add security check end with failover (#3911)
* Add security check in AgencyComm::sendWithFailover.

* some cleanup

* added some more tests

* add typeName() to AgencyCommTransaction to make the transaction type printable in debug messages

* improve debuggability
2017-12-07 10:33:59 +01:00
Jan ba729150bf backporting inquire fixes (#3920) 2017-12-07 10:27:41 +01:00
Kaveh Vahedipour f7b4150b64 no clientId anymore in send/sendWithFailOver SPIs (#3819) 2017-11-28 10:47:36 +01:00
Kaveh Vahedipour c300eee5f0 minor (#3813) 2017-11-27 18:22:13 +01:00
Kaveh Vahedipour 27cd691bbf Bug fix/agencycomm validate methods broken (#3805) 2017-11-27 14:18:25 +01:00
Kaveh Vahedipour 2beaef41ff Bug fix/agencycomm validate methods broken (#3784) 2017-11-24 10:31:07 +01:00
Simon Grätzer 987daca85b Handle invalid endpoints in AgencyComm (#3729) 2017-11-17 16:35:59 +01:00
Kaveh Vahedipour 7b80deb5cc Fixed object assignment operator for agency's key value store (#3701)
* Fixed object assignment operator for agency's key value store
* Node's toJson is now actually toJson. getString should be used for string extractions
* adjust agency's documentation (clarify precondition)
2017-11-17 15:49:40 +01:00
Kaveh Vahedipour 255d90d26a cherry pick from 3.2 pull request for bug-fix/supervision-thread-exists-on-pre3.2-agency (#3709)
This is the HealthRecord upgrade patch.
2017-11-17 10:14:14 +01:00
Jan b4f6ee9273 Feature/improved index api for unique constraints and replication (#3715) 2017-11-16 21:02:01 +01:00
Jan 5abf0c1185 Bug fix/fixes 1511 (#3711) 2017-11-16 14:18:51 +01:00
Max Neunhöffer 766ab7c8cf Fix agency shutdown bug. (#3683)
* Fix agency shutdown bug.
* Remove precondition that was not needed in AgencyComm::removeValues.
* Fail fatally if threads do not shut down.
2017-11-14 16:33:46 +01:00
Jan e1ecc6b02c fix some threading issues (#3659) 2017-11-12 22:34:51 +01:00
Kaveh Vahedipour c9621ff230 Feature/new agency checks for preconditions (#3612) 2017-11-11 22:48:23 +01:00
Max Neunhöffer bff630b332 Handle leader resignation race with redirectRequst. (#3663) 2017-11-11 19:38:29 +01:00
Kaveh Vahedipour 7e816db51e Bug fix/agency restart enhancements (#3619)
* Removed unused active(...) method in Agent
* Inception's restart from persistence allows peer with empty active RAFT list to join
* Agency's UUID is persisted outside of the database comparable to coordinator and db server action.
* Publicized Methods to UUID stuff in ServerState
* Inception method documentation
* added --agency.disaster-recovery-id to allow for specification of known former agency id. this is a very dangerous option potentially.
* Delete unused methods.
* separate _id and _recoveryId
* populating active list with entire pool
* Improve logging.
* reject gossip from unknown agent, if pool is complete
2017-11-10 23:40:26 +01:00
Jan bef52d7dc3 Bug fix/cleanup after cppcheck (#3639) 2017-11-10 13:53:28 +01:00
Max Neunhöffer 3c0ee6908b Bug fix/lead to agent (#3541) 2017-11-09 11:10:09 +01:00
Jan 98eecaae20 bug fix for agency precondition checks (#3579) 2017-11-06 23:55:41 +01:00
Simon Grätzer ee8209943f Missing things for active / passive (#3578)
* Switching from ttl to supervision based failover mechanism

* Allowing canceling of ongoing actions

* refactored asyncjobmanager

* refactoring some code

* adding read-only flag

* catching some exceptions to reduce log pollution, removing unnecessary code, removing tests for _changeMode

* fixing "createsANewDatabaseWithAnInvalidUser"

* auth = off no longer makes everyone superuser

* Fixing cluster_sync and maybe resilience
2017-11-04 20:30:23 +01:00
jsteemann a5c777e565 fix broken inquiry results in AgencyComm 2017-10-26 20:10:54 +02:00
Max Neunhöffer cb05d33e17 Term is a number not a string. (#3520) 2017-10-26 12:02:38 +02:00
Max Neunhöffer ee96c37237 Fix agency restart problems. (#3493)
* Fix agency restart problems (port from a 3.2 fix).

* Further fixes after Craneware rescue.
2017-10-25 18:05:58 +02:00
Michael Hackstein 15d9a4be5f Reactivated the failover of the FoxxMaster, it was not modified anymore after the current master dies (#3510) 2017-10-25 18:03:24 +02:00
Jan 720e6df82e Bug fix/fixes 1910 (#3471)
* properly initialize all properties

* use faster comparison

* properly detect and handle "method not allowed"

* code-style

* remove unused variable

* narrow variable scope

* handle non-existence of AuthenticationFeature

* remove dead code

* replace some C string handling with std::strings

* moved assertion to the correct place

* honor number of array members for IN operator

* slightly adjust error messages

* slightly adjust some error messages

* try to fix issue with lingering replication contexts on shutdown

* clean up heartbeat thread a little bit

* small fixes
2017-10-23 09:17:36 +02:00
Max Neunhöffer 67300f9d77 Add a hidden AGENCY_DUMP for agency emergency recovery. (#3474) 2017-10-21 00:24:32 +02:00
Simon Grätzer fd3f9d99d9 Fixing webinterface access (#3464)
* intermediate commit

* Refactoring the ExecContext

* Fixing authentication

* Added start script

* some fixes

* fixed access to nullptr

* some c++

* fixed misleading message

* Made DatabaseGuard movable. Also adapted map insertions to _vocbase in Syncer classes, which failed to compile under older GCC versions

* added support for global flag to replication handler

* Started Refactoring in replication-static

* Fixing syncer code

* store applier configuration

* Static replication tests now test replication in a non system Database

* added flags to replication feature

* Adding some extra checks

* Fixing issue with rocksdb rest replication handler

* replication static now runs _system and otherdatabase replication tests.

* Fixing crash on startup

* Replication_sync now tests _system as well as other Database

* Fixing up heartbeat thread, adding global flag to rest handler

* Fixing wrong assert

* some cleanup, probably some tests are broken

* Made non-system db version of replication-ongoing tests

* fix determine-open-transaction

* Fixed ongoing tests. And added a test where we drop a database on slave while replication is still ongoing

* test fixes

* Activated ongoing other db tests. Also added a test that drops the DB on master, while the slave is still syncing.

* some better error reporting

* gradually switch to Result

* createCollection -> create

* re-activate using of collection ids for now

* enable auto-start

* Fixed create collection in replication ongoing test

* Added first draft of a test for global replication

* move to Result

* use system database for global applier

* improved error reporting

* fixed invalid URLs

* add test case filter

* load existing global applier configuration

* improve error reporting

* Added further tests for global replication

* Fixed global replication test, it now properly waits for replication. Timeouts after 10 seconds.

* Removed erroneous assertion

* improve error reporting

* intermediate commit

* Added a test-case for global replication where the Master already has some data and the slave is clean

* fix deletion of replication contexts

* Fixed JSLint

* compiling code

* fix typo

* do not fail for global applier when no database is configured

* intermediate commit

* syncer supports switch for 3.3 / 3.2

* fixed errors

* Fixing some replication bugs

* Fixing some assertions

* Fixed missing commit markers

* Fixing assertion on database drop

* Attempt to fix deadlock in applier and assertion

* Fixing some stupid things

* Support for collection parameter

* Accidentally turned off some tests

* Grrr

* Fixing wrong method call

* Fixed startscript

* Fixed assignment instead of equality check typo

* Added a test for interrupted replication. For now it just tests basics on _system database.

* Improved index tests on replication.

* properly initialize variable

* fixed some replication problems

* MMFiles wal access support

* fix replication issues

* Started mmfiles replication support

* fixing a bug

* Fixing an issue

* fixing some mmfiles stuff

* fix test

* reload users

* prevent pure virtual method call

* intermediate commit

* Making from exclusive

* do not call getMasterState if child syncer

* some reformatting

* Adding global support for handleCommandSync

* Fixing assertion

* removing some debug logs

* Changing return codes

* Fixing some issues in the rest handler

* Make replication less susceptible to errors

* remove some debug output

* return last log tick

* remove waits from tests

* fix two tests

* changing header for open-transactions call

* some fixes

* fix test

* invalidate cached databases

* merging request and execcontext

* try to fix assertion error

* renamed method

* fix compile warning

* small changes

* Always use execcontext

* Fixing an assert

* fix replication issues

* try to fix collection lookups

* try to fix master/slave start

* Changing comments in heartbeat thread

* fix wrong signature of READ_LOCKER_EVENTUAL

* log server role in testing mode

* Fixed authentication, removed execContext in favor of request context

* Adding cluster rest api

* Fixing cluster rest handler

* Fixing cluster callback

* Some refactoring

* Queue creation is not a single operation

* Allowed for leader redirects

* Setting start of batch

* Disabling 2.8 compat tests

* fix start/stop bugs

* jslint

* various little changes

* add flag for exposing jwt

* indentation

* cleanup

* Some changes to guid

* fixing tcp to http, vst

* changed endpoint header

* small fixes

* Reorder servers by health status

* Higher timeout

* Changing error messages

* update the fromTick when fetching multiple batches from the coordinator

* more debug info

* Reducing copy pasted code

* change uid generation

* reducing logspam

* more exceptions for redirects

* more exceptions

* attempt to fix uniqids in cluster

* centralize printing of HTTP errors in replication

* debug output

* fix messages for authentication

* cleanup

* removing --cluster.my-id, --cluster.my-local-info

* Added leadership race to bootstrap, determine foxxmaster on bootstrap, removing obsolete code

* improve error reporting in RestAqlHandler

* Changing heartbeat thread, fixing cluster_sync

* some more debug output

* added master

* attempt to make tests more deterministic

* added logging about indexes

* added some safety checks to the logger

* slightly better error messages

* fix location header for SSL

* fix error message

* try to make tests more deterministic

* change error code from TRI_ERROR_INTERNAL (which we want to avoid) to TRI_ERROR_FAILED

* Fixing broken webinterface access

* reverting groovy change

* Fixing read-only internal users

* Using superuser rights for dashboard now

* Adding mode field to _admin/server/role

* added mode TRYAGAIN

* remove inventory lock (does not seem necessary here)

* remove invalid assertion

* fixing agency bugs

* Removing debug output

* return proper errors in case of "method not allowed"

* Fixed up some info messages

* jslint
2017-10-20 18:06:59 +02:00
Kaveh Vahedipour 428e163db9 Return the result of the inquiry (#3465) 2017-10-20 15:01:32 +02:00
Jan 7840d3f824 Bug fix/fixes 1810 (#3460)
* improve error reporting in RestAqlHandler

* added logging about indexes

* added some safety checks to the logger

* slightly better error messages

* fix location header for SSL

* fix error message

* try to make tests more deterministic

* change error code from TRI_ERROR_INTERNAL (which we want to avoid) to TRI_ERROR_FAILED
2017-10-19 11:28:01 +02:00
Simon Grätzer 7c31960cf2 Feature/async failover (#3451) 2017-10-18 23:59:29 +02:00
Kaveh Vahedipour 46333a762f Bug fix/agency restart after compaction and holes in log (#3413)
* State fixes holes in RAFT index range
* Avoid application of entries older than compaction index _cur and guard for unsigned overflow
2017-10-13 16:01:41 +02:00
m0ppers bb1d303473 Cmake 5.0 complains about unused lambda captures (#3390) 2017-10-13 12:20:48 +02:00
Max Neunhöffer 9a2385b941 Add host id detection and show in /_admin/cluster/Health. (#3389) 2017-10-11 12:42:44 +02:00
Max Neunhöffer d86f27bd19 Bug fix/agency leader timeouts (#3373)
* Send out empty heartbeats regardless of non-empty AppendEntriesRPC.
* Also improve logging:
  Note if a log in the empty heartbeat sending takes > 0.01 s.
  Clearly mark places where a leader resigns in logging.
  Log if no empty heartbeat is sent out.
* Make leader more tolerant w.r.t. incoming AppendEntriesRPC responses.
* Add debug logging for _lastAcked and challengeLeadership.
* Remove some unused code. Do not count ourselves in challengeLeadership.
* Removal of entire activation/deactivation mechanisms in agency
* TRI_microtime up to c++11
* added term to response to sendAppendEntries.
2017-10-06 10:11:51 +02:00
Max Neunhoeffer af3f977997 Revert "Send out empty heartbeats regardless of non-empty AppendEntriesRPC."
This reverts commit e974501446.
2017-10-02 15:02:15 +02:00
Max Neunhoeffer 2852f80b5a Revert "Make leader more tolerant w.r.t. incoming AppendEntriesRPC responses."
This reverts commit 45d37edfb2.
2017-10-02 15:02:06 +02:00
Max Neunhoeffer 45d37edfb2 Make leader more tolerant w.r.t. incoming AppendEntriesRPC responses. 2017-10-02 15:01:11 +02:00
Max Neunhoeffer e974501446 Send out empty heartbeats regardless of non-empty AppendEntriesRPC.
Also improve logging:
  Note if a log in the empty heartbeat sending takes > 0.01 s.
  Clearly mark places where a leader resigns in logging.
  Log if no empty heartbeat is sent out.
2017-10-02 14:14:41 +02:00
Max Neunhöffer 47f367d3f0 Bug fix/agency compactor deadlock (#3335)
* Fix a deadlock between Agent thread and compactor thread.
* Improve comments in header.
* Organise clean shutdown of agency threads.
2017-09-28 12:20:57 +02:00
Max Neunhöffer 22e46978a6 Bug fix/sort out agency locks (#3306)
New locking concept in Agency. Ensure empty heartbeats can be sent, answered and processed without long locks. Adjust logging. Fix compaction bugs.
2017-09-27 15:22:30 +02:00
Kaveh Vahedipour 3700f75b0c State has to keep log for removeConflicts and according log all the way (#3249) 2017-09-16 12:20:47 +02:00
Jan 5165155ed1 Bug fix/fixes 0609 (#3227)
* do not use V8 variant of AQL functions in early optimization stage when a C++ variant is available

* additionally, simplify AQL function definitions and aliases

* warn when more than 90% of max mappings are in use

* added C++ variant of replication catchup

* added `--log.role` option

* updated CHANGELOG

* removed non-existing scheduler.threads option from config

* removed useless __FILE__, __LINE__ invocations

* updated CHANGELOG

* allow a priority V8 context

* remove TRI_CORE_MEM_ZONE

* try to fix Windows errors & warnings

* cleanup

* removed memory zones altogether

* exclude system collections from collection tests
2017-09-13 16:28:21 +02:00