Pimsync v0.5.11 is out, with the following set of improvements.
Incremental sync
When watching a storage (e.g.: watching a filesystem for real-time changes), pimsync now performs a more focused and granular synchronisation instead of a full one. This avoids wasting network traffic and processing power on checking lots of items that we are certain have not changed.
This feature was made easier by multiple refactors leading up to it. In
particular, the “Plan” generated is simply a stream of operations, and much of
the code works on exactly this: Stream<Item = Result<Operation, PlanError>>.
The whole execution logic can handle either incremental plans or full plans the
same way thanks to this abstraction.
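As a rough illustration of why this abstraction helps, here is a minimal sketch in which a single executor consumes both kinds of plan. The `Operation` and `PlanError` variants, and the `full_plan`, `incremental_plan` and `execute` functions, are hypothetical stand-ins rather than pimsync's real definitions, and a plain `Iterator` stands in for `Stream` so that the example needs only the standard library.

```rust
// Hypothetical stand-ins: pimsync's real Operation and PlanError are not shown in this post.
#[derive(Debug, Clone, PartialEq)]
enum Operation {
    SyncCollection(String),
}

#[derive(Debug, PartialEq)]
enum PlanError {
    UnreadableCollection(String),
}

// std has no Stream type, so anything iterable plays that role in this sketch.
type PlanItem = Result<Operation, PlanError>;

/// A full plan: one operation for every known collection.
fn full_plan(collections: &[&str]) -> Vec<PlanItem> {
    collections
        .iter()
        .map(|name| Ok(Operation::SyncCollection(name.to_string())))
        .collect()
}

/// An incremental plan: only the collection that changed.
fn incremental_plan(changed: &str) -> Vec<PlanItem> {
    vec![Ok(Operation::SyncCollection(changed.to_string()))]
}

/// The executor never needs to know which kind of plan it was handed.
fn execute<P>(plan: P) -> (Vec<Operation>, Vec<PlanError>)
where
    P: IntoIterator<Item = PlanItem>,
{
    let mut applied = Vec::new();
    let mut errors = Vec::new();
    for step in plan {
        match step {
            // A real executor would perform the operation here.
            Ok(op) => applied.push(op),
            Err(e) => errors.push(e),
        }
    }
    (applied, errors)
}
```

Calling `execute(full_plan(&["work", "home"]))` and `execute(incremental_plan("work"))` goes through exactly the same code path; only the number of operations differs.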
When an interval tick is received (e.g.: every 5 minutes) or when unknown changes have occurred, a regular full sync is performed. When a single file changes, only that collection is synchronised, skipping all others.
The main change required for this was generating a plan based on specific change events. All the other plumbing had already landed in the refactors leading to this point, including splitting up the planning logic so that only portions of it can be used.
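The decision described above can be sketched roughly as follows. The `Event` and `Scope` types and the `scope_for` function are hypothetical, and the sketch assumes that a collection is the first directory below the storage root, which may not match every storage layout.

```rust
use std::path::{Path, PathBuf};

// Hypothetical event and scope types, for illustration only.
#[derive(Debug, PartialEq)]
enum Event {
    IntervalTick,
    UnknownChange,
    FileChanged(PathBuf),
}

#[derive(Debug, PartialEq)]
enum Scope {
    Full,
    Collection(String),
}

/// Picks how much to synchronise in response to an event.
fn scope_for(event: &Event, storage_root: &Path) -> Scope {
    match event {
        // Ticks and unknown changes fall back to a regular full sync.
        Event::IntervalTick | Event::UnknownChange => Scope::Full,
        Event::FileChanged(path) => {
            // Assumption: the collection is the first directory below the root.
            let collection = path.strip_prefix(storage_root).ok().and_then(|rel| {
                let mut parts = rel.components();
                let first = parts.next()?;
                parts.next()?; // require a file name below the collection
                Some(first.as_os_str().to_string_lossy().into_owned())
            });
            match collection {
                Some(name) => Scope::Collection(name),
                // Anything we cannot attribute to a collection gets a full sync.
                None => Scope::Full,
            }
        }
    }
}
```

Falling back to `Scope::Full` whenever the path cannot be attributed to a collection keeps the sketch conservative: an unrecognised change costs more work, but is never silently ignored.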
On item-level synchronisation
I’m somewhat hesitant to sync individual items (rather than keeping a collection as a minimal unit). When a new file is created locally, before uploading it, pimsync needs to ensure that no other item has the same UID (since UIDs must be unique), and this requires reading the entire remote collection. WebDAV servers SHOULD validate this themselves, but I fear that some other niche scenarios could lead to duplicate UIDs if pimsync doesn’t check first.
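To make the concern concrete, here is a minimal sketch of such a check. The `split_by_uid_clash` function is hypothetical, not pimsync's code. The point is that `remote_uids` must contain every UID in the remote collection, which is exactly why a single-item sync would still need to read the whole collection.

```rust
use std::collections::HashSet;

/// Splits new local UIDs into those that are safe to upload and those that
/// clash, either with the remote collection or with an earlier new item.
/// Assumption: `remote_uids` holds every UID in the remote collection.
fn split_by_uid_clash(
    remote_uids: &HashSet<String>,
    new_local: &[&str],
) -> (Vec<String>, Vec<String>) {
    let mut taken = remote_uids.clone();
    let mut safe = Vec::new();
    let mut clashing = Vec::new();
    for uid in new_local {
        // insert() returns false when the UID was already present.
        if taken.insert(uid.to_string()) {
            safe.push(uid.to_string());
        } else {
            clashing.push(uid.to_string());
        }
    }
    (safe, clashing)
}
```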
I need to ponder this further. It’s possible that item-level synchronisation is entirely safe, and if so, enabling that would be trivial at this point: all the scaffolding is in place.
kqueue for BSD
On Linux, pimsync watches the local filesystem using inotify, receiving notifications of file changes as they happen. These are quite granular, and indicate exactly which files changed.
A counterpart for BSD was missing, and is now implemented. The BSD backend uses kqueue, which yields events with less granularity. Monitoring a directory along with its N files requires opening 1+N file descriptors with kqueue, which is an absurd number of file descriptors. To work around this limitation, pimsync only watches the directory itself, receiving a notification whenever any change happens, but without knowing which file changed.
This means that for a single file change, we still sync its entire directory/collection (but not the whole storage). I’ve added a debounce to the events (which applies to both kqueue and inotify) to avoid synchronising the same collection multiple times in quick succession.
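One common way to debounce such events is to remember the most recent event per collection and act only once a quiet period has passed. The sketch below shows that idea. The `Debouncer` type and its window are hypothetical, and pimsync's actual debounce strategy may well differ.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical trailing-edge debouncer keyed by collection name.
struct Debouncer {
    window: Duration,
    last_seen: HashMap<String, Instant>,
}

impl Debouncer {
    fn new(window: Duration) -> Self {
        Debouncer { window, last_seen: HashMap::new() }
    }

    /// Records an event; repeated events simply push the deadline back.
    fn record(&mut self, collection: &str, now: Instant) {
        self.last_seen.insert(collection.to_string(), now);
    }

    /// Removes and returns (sorted) the collections that have been quiet
    /// for at least `window`.
    fn take_ready(&mut self, now: Instant) -> Vec<String> {
        let window = self.window;
        let mut ready: Vec<String> = self
            .last_seen
            .iter()
            .filter(|(_, seen)| now.saturating_duration_since(**seen) >= window)
            .map(|(name, _)| name.clone())
            .collect();
        for name in &ready {
            self.last_seen.remove(name);
        }
        ready.sort();
        ready
    }
}
```

With a 500 ms window, a burst of events for one collection results in a single entry from `take_ready`, returned only once that collection has been quiet for 500 ms.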
Exit codes
I’ve stabilised exit codes used by pimsync, so that scripted and automated usage
can properly detect exit conditions. Where feasible, they are based on the exit
codes from sysexits.h, and fully documented in the manual page. Briefly, they
are:
0: Success.
64/EX_USAGE: Usage error: bad arguments / flags.
74/EX_IOERR: Error writing to status database. These are typically fatal, and require manual review of what went wrong. They should never happen on a healthy setup.
78/EX_CONFIG: Configuration (file) error.
130: User aborted (i.e.: during manual conflict resolution).
3: Conflicts detected. This error only happens when running sync, and indicates that some conflicts need to be resolved manually via resolve-conflicts.
1: Other errors (e.g.: no network during sync, couldn’t execute conflict resolution command during resolve-conflicts, etc).
Along with exit code 3, pimsync now also states near the end of its output
when manual conflict resolution is required, so that this remains visible
even with high logging levels.
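For automated usage, a wrapper can branch on these codes. The following is a minimal sketch: the `Outcome` enum and the `classify` and `run_sync` functions are hypothetical names, and the `pimsync sync` invocation is assumed from the subcommand name mentioned above. Check the manual page for the exact command line.

```rust
use std::process::Command;

/// Hypothetical classification of the exit codes listed above.
#[derive(Debug, PartialEq)]
enum Outcome {
    Success,
    OtherError,
    Conflicts,
    Usage,
    StatusDatabaseError,
    ConfigError,
    UserAborted,
    Unrecognised(i32),
}

fn classify(code: i32) -> Outcome {
    match code {
        0 => Outcome::Success,
        1 => Outcome::OtherError,
        3 => Outcome::Conflicts,
        64 => Outcome::Usage,                // EX_USAGE
        74 => Outcome::StatusDatabaseError,  // EX_IOERR
        78 => Outcome::ConfigError,          // EX_CONFIG
        130 => Outcome::UserAborted,
        other => Outcome::Unrecognised(other),
    }
}

/// Runs a sync and classifies the result. The `pimsync sync` invocation is an
/// assumption; `None` means the process ended without an exit code
/// (e.g. it was killed by a signal).
#[allow(dead_code)]
fn run_sync() -> std::io::Result<Option<Outcome>> {
    let status = Command::new("pimsync").arg("sync").status()?;
    Ok(status.code().map(classify))
}
```

A cron job could, for instance, send a notification only for `Outcome::Conflicts` and `Outcome::StatusDatabaseError`, since both call for manual intervention.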
Skipping discovery
pimsync automatically discovers the server host, port and exact locations via the mechanism described in RFC 6764. For scenarios such as broken discovery configuration or bugs in the server implementation, this is problematic, since discovery fails and pimsync cannot run. For scenarios where the user is already providing the exact URL which contains the calendars, this is mildly wasteful, although not terrible.
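For context, RFC 6764 discovery starts from DNS SRV records (with an optional TXT record carrying a path) and falls back to a well-known URI. The sketch below only builds the names involved and performs no lookups. The `Service` type and both functions are hypothetical, not part of pimsync.

```rust
/// Hypothetical helper type; RFC 6764 covers both CalDAV and CardDAV.
#[derive(Clone, Copy)]
enum Service {
    CalDav,
    CardDav,
}

impl Service {
    fn label(self) -> &'static str {
        match self {
            Service::CalDav => "caldav",
            Service::CardDav => "carddav",
        }
    }
}

/// DNS name of the SRV record to query, e.g. `_caldavs._tcp.example.com`.
/// The trailing `s` in the service label selects the TLS variant.
fn srv_name(service: Service, tls: bool, domain: &str) -> String {
    let suffix = if tls { "s" } else { "" };
    format!("_{}{}._tcp.{}", service.label(), suffix, domain)
}

/// Well-known URI used as the bootstrapping context path.
fn well_known_url(service: Service, tls: bool, host: &str) -> String {
    let scheme = if tls { "https" } else { "http" };
    format!("{}://{}/.well-known/{}", scheme, host, service.label())
}
```

When the exact collection URL is already known, none of these requests are needed, which is the wasted work mentioned above.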
It is now possible to disable different portions of the discovery process with
the new discovery directive. The default remains the same: discovery full,
whereas discovery collections detects collections at the exact given URL,
skipping all other service discovery. The full details are in the manual page,
as usual.
In future, pimsync shall also cache this discovery data (based on DNS TTL and HTTP Cache-Control) to ensure that it never runs more frequently than necessary (currently it runs only once during start-up).
Smaller changes
Relative paths in the configuration file are now rejected. These can lead to unexpected data loss, due to the way storages and the status database correlate to each other. Running a sync with the same status database but different local storage could lead to deletion of remote data quite easily. I initially wanted to warn about this behaviour and implemented some “protections” for this, but they quickly devolved into heuristics with lots of potential caveats.
Typically, relative paths should never be necessary. At worst, a command to produce the paths can be used, or for complex automation, the configuration file can be generated programmatically. I think these are both theoretical cases rather than practical scenarios.
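A check of this kind is simple to express. The sketch below assumes Unix-style paths; `require_absolute` is a hypothetical name, and pimsync's real validation and error message may differ.

```rust
use std::path::{Path, PathBuf};

/// Rejects any configured path that is not absolute.
/// Hypothetical helper, for illustration only.
fn require_absolute(raw: &str) -> Result<PathBuf, String> {
    let path = Path::new(raw);
    if path.is_absolute() {
        Ok(path.to_path_buf())
    } else {
        Err(format!("relative paths are not allowed: {}", raw))
    }
}
```

Rejecting the path outright means the status database can never be paired with a different local storage merely because pimsync was started from another working directory.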
After manual conflict resolution, the status database is now immediately updated. This saves a bit of network traffic on the next sync, and prevents seeing a second conflict if either side is edited after the conflict resolution but before the next full sync.