‹ back home

vdirsyncer status update, November 2023

2023-11-27 #open-source #status-update #vdirsyncer

Designing the low level icalendar parser took longer than it should have taken. To be sincere, part of the problem was my trying to be too ambitious in its scope and growing beyond the strictly necessary requirements.

Requirements

The main goal of this parser (now called vparser) is to parse only the basic structure of icalendar and vcard files. The main requirements for vdirsyncer’s use case are:

Focusing on a this use case

I consider code re-usability important, and therefore wanted this parser to be reusable for other applications that need to parse icalendar data.

My original intent was for this implementation to be usable in other sorts of applications, including something like a calendar GUI application. However, such an application would have somewhat conflicting requirements, and trying to design a parser that would fit such a use case resulted in a design that was not ideal for vdirsyncer.

Designing a lower level library optimised for all usages turned out to be unrealistic. The design on which I have finally settled is optimal for vdirsyncer, and quite suitable for other low-level tools, for example:

However, its performance may not be optimal for something like a calendar UI.

Quick intro to content lines

As a quick reference, content lines look something like this:

DTSTART;TZID=America/New_York:19970902T090000

Note that a single content line may have zero, one or more parameters.

Designs considered

I already had quite a few designs in mind for my parser, and quite a few notes spread across different places. I’ve kept around and present here only a few of these (most others were really regurgitations of the same ideas presented below).

The following all share a same general approach: export a trait1 which consumers must implement. The parser will parse the entire input data. As it encounters elements on the input being parsed, it will call the appropriate functions on the trait handler implementation.

The general idea is inspired on Java’s SAX parsers. Generally, these handlers would implement a state pattern.

Version 1

trait ParseHandler {
    // TODO: maybe use an enum with variants KnownName and XName?
    fn name(&self, name: Cow<str>);

    fn param(&self, name: Cow<str>, value: &[Cow<str>]);

    fn value(&self, value: Cow<&str>);

    fn end_of_file(&self);

    /// Called when invalid data is found.
    ///
    /// Includes all input text until the next `\r\n`.
    fn invalid(&self, value: Cow<&str>);
}

A consumer of the parser would implement this trait and handle events as function calls. For the following example line:

BEGIN:VTODO

The following sequence would be invoked (this won’t strictly compile, and is simplified for readability):

handler.name("BEGIN");
handler.value("VTODO");

Pros:

Cons:

I didn’t consider this terrible, but it’s definitely not good either.

Version 2

I tried using an enum as a type handed over to the handler, and a single event per line:

#[non_exhaustive]
pub enum ContentLine<'input> {
    Begin {
        params: Vec<Cow<'input, &'input str>>,
        value: Cow<'input, &'input str>,
    },
    End {
        params: Vec<Cow<'input, &'input str>>,
        value: Cow<'input, &'input str>,
    },
    // ... TODO: maybe variants for each standard name too?
    Other {
        name: Cow<'input, &'input str>,
        params: Vec<Cow<'input, &'input str>>,
        value: Cow<'input, &'input str>,
    },
    Invalid(Cow<'input, &'input str>),
}

trait ParseHandler {
    fn content_line(&self, line: ContentLine);
}

Compared to the previous approach, usage is (mildly) easier for consumers that want to actually interpret the calendar component’s data, but doesn’t provide great improvements for the vdirsyncer use case.

Pros:

Cons:

Version 3

pub struct ContentLine<'input> {
    data: Cow<'input, str>,
}

impl<'input> ContentLine<'_> {
    pub fn name(&self) -> Cow<str> {
        // ...
    }
    pub fn params(&self) -> Cow<str> {
        // ...
    }
    pub fn value(&self) -> Cow<str> {
        // ...
    }
    pub fn raw(&self) -> Cow<str> {
        // ...
    }
}

trait ParseHandler {
    fn content_line(&self, line: ContentLine);
    fn invalid_line(&self, line: Cow<&str>);
}

This approach does away with the enum type used above, but keeps the idea of a single event per content line.

Pros:

Cons:

Overall, this is a minor improvement over previous approaches. Mostly the API provides a cleaner API (for all use cases).

Version 4

pub enum Name<'input> {
    Begin,
    End,
    // ... TODO: variants for each standard name
    Other(Cow<'input, &'input str>),
    Invalid(Cow<'input, &'input str>),
}

pub struct ContentLine<'input> {
    name: Name<'input>,
    data: &'input str,
}

// Implements impl Iterator<Item = &str>
pub struct Params {}

impl<'input> ContentLine<'input> {
    pub fn name(&'input self) -> &'input Name {
        &self.name
    }

    pub fn params(&self) -> Params {
        // ...
    }
    pub fn value(&self) -> Cow<str> {
        // ...
    }
}

trait ParseHandler {
    fn content_line(&self, line: ContentLine);
    fn invalid_line(&self, line: Cow<&str>);
}

This API is more convenient for situations which want to interpret the data inside calendar components. It also avoids parsing the name portion of each line twice.

Pros:

Cons:

Taking a step back

At this point, a few patterns where clear:

Current Version

After the above experiments, I set out to design something that would be a perfect fit for vdirsyncer and other low level tools. Higher level tools will not have the same level of consideration.

The resulting data type merely includes references (pointers) to the original data:

/// A valid content line.
///
/// May be folded via a CRLF immediately followed by a single linear
/// white-space character (i.e., SPACE or HTAB).
pub struct ContentLine<'input> {
    raw: &'input str,
    // Everything before the first colon or semicolon (whichever comes first).
    name: &'input str,
    // Everything after the unquoted semicolon and before the first unquoted colon.
    params: &'input str,
    // Everything after the first unquoted colon.
    value: &'input str,
}

trait ParseHandler {
    fn content_line(&self, line: ContentLine);
}

Pros:

Cons (sort of):

All the disadvantages here affect other type of tools, but not vdirsyncer specifically.

Accessors

The data in the structure above points to the original input data, which may contain folded lines. Accessing it is implemented via methods that unfold content on-demand:

impl<'input> ContentLine<'input> {
    fn raw(&self) -> &'input str {
        // ...
    }

    fn name(&self) -> Cow<'input, str> {
        // ...
    }

    fn params(&self) -> Cow<'input, str> {
        // ...
    }

    fn value(&self) -> Cow<'input, str> {
        // ...
    }

    fn re_folded(&self) -> Cow<'input, str> {
        // ...
    }
}

Some notes on this:

Inversion of control

An issue with this approach is that parsing can’t be stopped immediately. There’s no way to stop the parser immediately and ignore the rest of a file.

I addressed this by simply using a Parser type which can be used to iterates over lines:

struct Parser<'data> {
    data: &'data str,
    index: usize,
}

impl<'data> Parser<'data> {
    fn new(data: &'data str) -> Parser<'data> {
        // ...
    }
}

impl<'data> Iterator for Parser<'data> {
    type Item = ContentLine<'data>;

    fn next(&mut self) -> Option<Self::Item> {
        // ...
    }
}

This is also a more idiomatic Rust API, which actually turned out to be more pleasant to use.

Implementing the parser

Implementing the parser is relatively straightforward. Given that I want to handle invalid input gracefully, I only really need to concearn myself with a subset of the rules for icalendar content lines.

In general terms, I’ll parse each line until finding one of the following:

(these are the same rules as those in the comments in the sample code above)

In my head, this looked like a decision tree, and this is pretty much what the implementation looks like.

The parsing code is one huge function. I usually consider functions this long an indicator of poor code quality, but I find this to be an exception. The code’s indentation actually reflects depth in this decision tree that I mentioned. And the lack of fancy macros or “smart” helpers makes it straightforward to read and follow.

When writing this, I initially represented the main branches with a todo!() for each one. I and then filled in the missing parts, usually by writing tests that would reach this unreachable code and refining until those passed.

I also looked into criterion for performance benchmarks (I only used it for one specific improvement so far). It turns out to be quite simple to use. I intend to use this for a few pending improvements after the first working releases.2.

The rest of this month

It’s been a few weeks since the last vdirsyncer status update. The design work mentioned above took only a couple of days. I’m still not very happy with the time consumed on it. A less polished approach could have sufficed for an initial implementation, and this won’t be as flexible as I hoped it would be.

The implementation itself didn’t take long. I already had a pretty clear mental picture on what it would look like, so it was mostly a day or two of writing tests, writing code, and iterating on that.

I’m now finalising integrating this into the vdirsyncer codebase and dealing with (what I hope are) the final bits of the basics in conflict resolution.

All this aside, I’ve been enjoying some time off recently, and also a lot of time working around more networking issues than I had anticipated.

Until next month!


  1. If you are not familiar with Rust, think of a think of a Java interface or Python abstract class. ↩︎

  2. It is quite tempting to start diving into performance tweaks and improvements. But that is honestly terrible at this stage because it just postpones an actual funcional release. ↩︎

Have comments or want to discuss this topic?
Send an email to ~whynothugo/public-inbox@lists.sr.ht (mailing list etiquette)

— § —