LLFIO  v2.00 late alpha
llfio_v2_xxx::algorithm::shared_fs_mutex::atomic_append Class Reference

Scalable many entity shared/exclusive file system based lock. More...

#include "atomic_append.hpp"

Inheritance diagram for llfio_v2_xxx::algorithm::shared_fs_mutex::atomic_append:
llfio_v2_xxx::algorithm::shared_fs_mutex::shared_fs_mutex

Public Types

using entity_type = shared_fs_mutex::entity_type
 The type of an entity id.
 
using entities_type = shared_fs_mutex::entities_type
 The type of a sequence of entities.
 

Public Member Functions

 atomic_append (const atomic_append &)=delete
 No copy construction.
 
atomic_append & operator= (const atomic_append &)=delete
 No copy assignment.
 
 atomic_append (atomic_append &&o) noexcept
 Move constructor.
 
atomic_append & operator= (atomic_append &&o) noexcept
 Move assign.
 
const file_handle & handle () const noexcept
 Return the handle to file being used for this lock.
 
virtual void unlock (entities_type entities, unsigned long long hint) noexcept final
 Unlock a previously locked sequence of entities.
 
entity_type entity_from_buffer (const char *buffer, size_t bytes, bool exclusive=true) noexcept
 Generates an entity id from a sequence of bytes.
 
template<typename T >
entity_type entity_from_string (const std::basic_string< T > &str, bool exclusive=true) noexcept
 Generates an entity id from a string.
 
entity_type random_entity (bool exclusive=true) noexcept
 Generates a cryptographically random entity id.
 
void fill_random_entities (span< entity_type > seq, bool exclusive=true) noexcept
 Fills a sequence of entity ids with cryptographic randomness. Much faster than calling random_entity() individually.
 
result< entities_guard > lock (entities_type entities, deadline d=deadline(), bool spin_not_sleep=false) noexcept
 Lock all of a sequence of entities for exclusive or shared access.
 
result< entities_guard > lock (entity_type entity, deadline d=deadline(), bool spin_not_sleep=false) noexcept
 Lock a single entity for exclusive or shared access.
 
result< entities_guard > try_lock (entities_type entities) noexcept
 Try to lock all of a sequence of entities for exclusive or shared access.
 
result< entities_guard > try_lock (entity_type entity) noexcept
 Try to lock a single entity for exclusive or shared access.
 

Static Public Member Functions

static result< atomic_append > fs_mutex_append (const path_handle &base, path_view lockfile, bool nfs_compatibility=false, bool skip_hashing=false) noexcept
 

Protected Member Functions

virtual result< void > _lock (entities_guard &out, deadline d, bool spin_not_sleep) noexcept final
 

Detailed Description

Scalable many entity shared/exclusive file system based lock.

Lock files and byte ranges scale poorly to the number of items being concurrently locked with typically an exponential drop off in performance as the number of items being concurrently locked rises. This file system algorithm solves this problem using IPC via a shared append-only lock file.

  • Compatible with networked file systems (including NFS, if the special nfs_compatibility flag is true; note that enabling it is not free of cost if you don't need NFS compatibility).
  • Nearly constant time to number of entities being locked.
  • Nearly constant time to number of processes concurrently using the lock (i.e. number of waiters).
  • Can sleep until a lock becomes free in a power-efficient manner.
  • Sudden power loss during use is recovered from.

Caveats:

  • Much slower than byte_ranges when there are few waiters or few entities.
  • Sudden process exit with locks held will deadlock all other users.
  • Maximum of twelve entities may be locked concurrently.
  • Wasteful of disk space if used on a non-extents based filing system (e.g. FAT32, ext3). It is best used in /tmp if possible (file_handle::temp_file()). If you really must use a non-extents based filing system, destroy and recreate the object instance periodically to force resetting the lock file's length to zero.
  • Similarly, older operating systems (e.g. Linux < 3.0) do not implement extent hole punching and therefore also see excessive disk space consumption. Note that at the time of writing, OS X doesn't implement hole punching at all.
  • If your OS doesn't have sane byte range locks (OS X, BSD, older Linuxes) and multiple objects in your process use the same lock file, misoperation will occur. Use lock_files instead.
Todo:

Implement hole punching once I port that code from LLFIO v1.

Decide on some resolution mechanism for sudden process exit.

There is a 1 out of 2^64-2 chance of unique id collision. It would be nice if we formally checked that our chosen unique id is actually unique.

Member Function Documentation

◆ _lock()

virtual result<void> llfio_v2_xxx::algorithm::shared_fs_mutex::atomic_append::_lock ( entities_guard &  out,
deadline  d,
bool  spin_not_sleep 
)
inline final protected virtual noexcept
Todo:
Read from header.last_known_good immediately if possible in order to avoid a duplicate read later

Implements llfio_v2_xxx::algorithm::shared_fs_mutex::shared_fs_mutex.

  {
    LLFIO_LOG_FUNCTION_CALL(this);
    atomic_append_detail::lock_request lock_request;
    if(out.entities.size() > sizeof(lock_request.entities) / sizeof(lock_request.entities[0]))
    {
      return errc::argument_list_too_long;
    }

    std::chrono::steady_clock::time_point began_steady;
    std::chrono::system_clock::time_point end_utc;
    if(d)
    {
      if((d).steady)
      {
        began_steady = std::chrono::steady_clock::now();
      }
      else
      {
        end_utc = (d).to_time_point();
      }
    }
    // Fire this if an error occurs
    auto disableunlock = undoer([&] { out.release(); });

    // Write my lock request immediately
    memset(&lock_request, 0, sizeof(lock_request));
    lock_request.unique_id = _unique_id;
    auto count = std::chrono::system_clock::now() - std::chrono::system_clock::from_time_t(_header.time_offset);
    lock_request.us_count = std::chrono::duration_cast<std::chrono::microseconds>(count).count();
    lock_request.items = out.entities.size();
    memcpy(lock_request.entities, out.entities.data(), sizeof(lock_request.entities[0]) * out.entities.size());
    if(!_skip_hashing)
    {
      lock_request.hash = QUICKCPPLIB_NAMESPACE::algorithm::hash::fast_hash::hash((reinterpret_cast<char *>(&lock_request)) + 16, sizeof(lock_request) - 16);
    }
    // My lock request will be the file's current length or higher
    OUTCOME_TRY(my_lock_request_offset, _h.maximum_extent());
    {
      OUTCOME_TRYV(_h.set_append_only(true));
      auto undo = undoer([this] { (void) _h.set_append_only(false); });
      file_handle::extent_guard append_guard;
      if(_nfs_compatibility)
      {
        auto lastbyte = static_cast<file_handle::extent_type>(-1);
        // Lock up to the beginning of the shadow lock space
        lastbyte &= ~(1ULL << 63U);
        OUTCOME_TRY(append_guard_, _h.lock(my_lock_request_offset, lastbyte, true));
        append_guard = std::move(append_guard_);
      }
      OUTCOME_TRYV(_h.write(0, {{reinterpret_cast<byte *>(&lock_request), sizeof(lock_request)}}));
    }

    // Find the record I just wrote
    alignas(64) byte _buffer[4096 + 2048];  // 6Kb cache line aligned buffer
    // Read onwards from length as reported before I wrote my lock request
    // until I find my lock request. This loop should never actually iterate
    // except under extreme load conditions.
    //! \todo Read from header.last_known_good immediately if possible in order
    //! to avoid a duplicate read later
    for(;;)
    {
      file_handle::buffer_type req{_buffer, sizeof(_buffer)};
      file_handle::io_result<file_handle::buffers_type> readoutcome = _h.read({req, my_lock_request_offset});
      // Should never happen :)
      if(readoutcome.has_error())
      {
        LLFIO_LOG_FATAL(this, "atomic_append::lock() saw an error when searching for just written data");
        std::terminate();
      }
      const atomic_append_detail::lock_request *record, *lastrecord;
      for(record = reinterpret_cast<const atomic_append_detail::lock_request *>(readoutcome.value()[0].data()), lastrecord = reinterpret_cast<const atomic_append_detail::lock_request *>(readoutcome.value()[0].data() + readoutcome.value()[0].size()); record < lastrecord && record->hash != lock_request.hash;
          ++record)
      {
        my_lock_request_offset += sizeof(atomic_append_detail::lock_request);
      }
      if(record->hash == lock_request.hash)
      {
        break;
      }
    }

    // extent_guard is now valid and will be unlocked on error
    out.hint = my_lock_request_offset;
    disableunlock.dismiss();

    // Lock my request for writing so others can sleep on me
    file_handle::extent_guard my_request_guard;
    if(!spin_not_sleep)
    {
      auto lock_offset = my_lock_request_offset;
      // Set the top bit to use the shadow lock space on Windows
      lock_offset |= (1ULL << 63U);
      OUTCOME_TRY(my_request_guard_, _h.lock(lock_offset, sizeof(lock_request), true));
      my_request_guard = std::move(my_request_guard_);
    }

    // Read every record preceding mine until header.first_known_good inclusive
    auto record_offset = my_lock_request_offset - sizeof(atomic_append_detail::lock_request);
    do
    {
    reload:
      // Refresh the header and load a snapshot of everything between record_offset
      // and first_known_good or -6Kb, whichever the sooner
      OUTCOME_TRYV(_read_header());
      // If there are no preceding records, we're done
      if(record_offset < _header.first_known_good)
      {
        break;
      }
      auto start_offset = record_offset;
      if(start_offset > sizeof(_buffer) - sizeof(atomic_append_detail::lock_request))
      {
        start_offset -= sizeof(_buffer) - sizeof(atomic_append_detail::lock_request);
      }
      else
      {
        start_offset = sizeof(atomic_append_detail::lock_request);
      }
      if(start_offset < _header.first_known_good)
      {
        start_offset = _header.first_known_good;
      }
      assert(record_offset >= start_offset);
      assert(record_offset - start_offset <= sizeof(_buffer));
      file_handle::buffer_type req{_buffer, (size_t)(record_offset - start_offset) + sizeof(atomic_append_detail::lock_request)};
      OUTCOME_TRY(batchread, _h.read({req, start_offset}));
      assert(batchread[0].size() == record_offset - start_offset + sizeof(atomic_append_detail::lock_request));
      const atomic_append_detail::lock_request *record = reinterpret_cast<atomic_append_detail::lock_request *>(batchread[0].data() + batchread[0].size() - sizeof(atomic_append_detail::lock_request));
      const atomic_append_detail::lock_request *firstrecord = reinterpret_cast<atomic_append_detail::lock_request *>(batchread[0].data());

      // Skip all completed lock requests or not mentioning any of my entities
      for(; record >= firstrecord; record_offset -= sizeof(atomic_append_detail::lock_request), --record)
      {
        // If a completed lock request, skip
        if(!record->hash && (record->unique_id == 0u))
        {
          continue;
        }
        // If record hash doesn't match contents it's a torn read, reload
        if(!_skip_hashing)
        {
          if(record->hash != QUICKCPPLIB_NAMESPACE::algorithm::hash::fast_hash::hash((reinterpret_cast<const char *>(record)) + 16, sizeof(atomic_append_detail::lock_request) - 16))
          {
            goto reload;
          }
        }

        // Does this record lock anything I am locking?
        for(const auto &entity : out.entities)
        {
          for(size_t n = 0; n < record->items; n++)
          {
            if(record->entities[n].value == entity.value)
            {
              // Is the lock I want exclusive or the lock he wants exclusive?
              // If so, need to block
              if((record->entities[n].exclusive != 0u) || (entity.exclusive != 0u))
              {
                goto beginwait;
              }
            }
          }
        }
      }
      // None of this batch of records has anything to do with my request, so keep going
      continue;

    beginwait:
      // Sleep until this record is freed using a shared lock
      // on the record in our way. Note there is a race here
      // between when the lock requester writes the lock
      // request and when he takes an exclusive lock on it,
      // so if our shared lock succeeds we need to immediately
      // unlock and retry based on the data.
      std::this_thread::yield();
      if(!spin_not_sleep)
      {
        deadline nd;
        if(d)
        {
          if((d).steady)
          {
            std::chrono::nanoseconds ns = std::chrono::duration_cast<std::chrono::nanoseconds>((began_steady + std::chrono::nanoseconds((d).nsecs)) - std::chrono::steady_clock::now());
            if(ns.count() < 0)
            {
              (nd).nsecs = 0;
            }
            else
            {
              (nd).nsecs = ns.count();
            }
          }
          else
          {
            (nd) = (d);
          }
        }
        auto lock_offset = record_offset;
        // Set the top bit to use the shadow lock space on Windows
        lock_offset |= (1ULL << 63U);
        OUTCOME_TRYV(_h.lock(lock_offset, sizeof(*record), false, nd));
      }
      // Make sure we haven't timed out during this wait
      if(d)
      {
        if((d).steady)
        {
          if(std::chrono::steady_clock::now() >= (began_steady + std::chrono::nanoseconds((d).nsecs)))
          {
            return errc::timed_out;
          }
        }
        else
        {
          if(std::chrono::system_clock::now() >= end_utc)
          {
            return errc::timed_out;
          }
        }
      }
    } while(record_offset >= _header.first_known_good);
    return success();
  }

◆ fs_mutex_append()

static result<atomic_append> llfio_v2_xxx::algorithm::shared_fs_mutex::atomic_append::fs_mutex_append ( const path_handle &  base,
path_view  lockfile,
bool  nfs_compatibility = false,
bool  skip_hashing = false 
)
inline static noexcept

Initialises a shared filing system mutex using the file at lockfile

Returns
An implementation of shared_fs_mutex using the atomic_append algorithm.
Parameters
  base               Optional base for the path to the file.
  lockfile           The path to the file to use for IPC.
  nfs_compatibility  Make this true if the lockfile could be accessed by NFS.
  skip_hashing       Some filing systems (typically the copy on write ones e.g. ZFS, btrfs) guarantee atomicity of updates and therefore torn writes are never observed by readers. For these, hashing can be safely disabled.
Todo:
fs_mutex_append needs to check if file still exists after lock is granted, awaiting path fetching.
  {
    LLFIO_LOG_FUNCTION_CALL(0);
    atomic_append_detail::header header;
    // Lock the entire header for exclusive access
    auto lockresult = ret.try_lock(0, sizeof(header), true);
    //! \todo fs_mutex_append needs to check if file still exists after lock is granted, awaiting path fetching.
    if(lockresult.has_error())
    {
      if(lockresult.error() != errc::timed_out)
      {
        return std::move(lockresult).error();
      }
      // Somebody else is also using this file
    }
    else
    {
      // I am the first person to be using this (stale?) file, so write a new header and truncate
      OUTCOME_TRYV(ret.truncate(sizeof(header)));
      memset(&header, 0, sizeof(header));
      header.time_offset = std::chrono::system_clock::to_time_t(std::chrono::system_clock::now());
      header.first_known_good = sizeof(header);
      header.first_after_hole_punch = sizeof(header);
      if(!skip_hashing)
      {
        header.hash = QUICKCPPLIB_NAMESPACE::algorithm::hash::fast_hash::hash((reinterpret_cast<char *>(&header)) + 16, sizeof(header) - 16);
      }
      OUTCOME_TRYV(ret.write(0, {{reinterpret_cast<byte *>(&header), sizeof(header)}}));
    }
    // Open a shared lock on last byte in header to prevent other users zomping the file
    OUTCOME_TRY(guard, ret.lock(sizeof(header) - 1, 1, false));
    // Unlock any exclusive lock I gained earlier now
    if(lockresult)
    {
      lockresult.value().unlock();
    }
    // The constructor will read and cache the header
    return atomic_append(std::move(ret), std::move(guard), nfs_compatibility, skip_hashing);
  }

The documentation for this class was generated from the following file: atomic_append.hpp