Designing the low level icalendar parser took longer than it should have taken. To be sincere, part of the problem was my trying to be too ambitious in its scope and growing beyond the strictly necessary requirements.
Requirements
The main goal of this parser (now called vparser
) is to parse only the basic
structure of icalendar and vcard files. The main requirements for vdirsyncer’s
use case are:
- Extracting the
UID
from an entry (calendar component or vcard). - Normalise a component:
- Sort its properties into a deterministic order.
- Recognise and ignore specific properties (e.g.:
PRODID
).
- Avoid copying data in memory wherever possible. This will run once for every single item , so needs to be well-performant.
- Handle potentially invalid components. The general structure needs to be
reasonably valid:
- The syntax of content lines must be correct.
- The syntax of individual parameter types or values may be invalid.
Focusing on a this use case
I consider code re-usability important, and therefore wanted this parser to be reusable for other applications that need to parse icalendar data.
My original intent was for this implementation to be usable in other sorts of applications, including something like a calendar GUI application. However, such an application would have somewhat conflicting requirements, and trying to design a parser that would fit such a use case resulted in a design that was not ideal for vdirsyncer.
Designing a lower level library optimised for all usages turned out to be unrealistic. The design on which I have finally settled is optimal for vdirsyncer, and quite suitable for other low-level tools, for example:
icalendar-funnel
: a tool that shall read two distinct calendars and copy their items into a third (this is a special kind of one-way sync).vcard-merge
: a tool that shall merge two vcard entries into a single one.
However, its performance may not be optimal for something like a calendar UI.
Quick intro to content lines
As a quick reference, content lines look something like this:
DTSTART;TZID=America/New_York:19970902T090000
- The
name
isDTSTART
. - The only parameter here is:
TZID=America/New_York
.- The parameter’s name is
TZID
. - The parameter’s value is
America/New_York
- The parameter’s name is
- The value is
19970902T090000
Note that a single content line may have zero, one or more parameters.
Designs considered
I already had quite a few designs in mind for my parser, and quite a few notes spread across different places. I’ve kept around and present here only a few of these (most others were really regurgitations of the same ideas presented below).
The following all share a same general approach: export a trait1 which consumers must implement. The parser will parse the entire input data. As it encounters elements on the input being parsed, it will call the appropriate functions on the trait handler implementation.
The general idea is inspired on Java’s SAX parsers. Generally, these handlers would implement a state pattern.
Version 1
trait ParseHandler {
// TODO: maybe use an enum with variants KnownName and XName?
fn name(&self, name: Cow<str>);
fn param(&self, name: Cow<str>, value: &[Cow<str>]);
fn value(&self, value: Cow<&str>);
fn end_of_file(&self);
/// Called when invalid data is found.
///
/// Includes all input text until the next `\r\n`.
fn invalid(&self, value: Cow<&str>);
}
A consumer of the parser would implement this trait and handle events as function calls. For the following example line:
BEGIN:VTODO
The following sequence would be invoked (this won’t strictly compile, and is simplified for readability):
handler.name("BEGIN");
handler.value("VTODO");
Pros:
- Skipping a content lines is as simple as ignoring all events until a
corresponding
handle.name("END")
is received. - Using
Cow
allows passing a reference most of the time, but owned data in only when a content line is folded.
Cons:
- Using
Cow
forparams
implies creating copies of parameters even in situations which don’t care about them (which is every single scenario in the case ofvdirsyncer
). - Likewise, the array of parameters always needs to be allocated, even if
unused.
- As I review this, I realise that re-using the same
Vec
for each call is feasible, although I suspect that lifetimes would be quite unpleasant to deal with for consumers that make use of parameters.
- As I review this, I realise that re-using the same
- For usages that need to further interpret the calendar component data, this API is quite tedious to use.
- Requires additional logic to reconstruct content lines; which means
vdirsyncer
would be decomposing them and the re-composing them again, adding pointless complexity.
I didn’t consider this terrible, but it’s definitely not good either.
Version 2
I tried using an enum as a type handed over to the handler, and a single event per line:
#[non_exhaustive]
pub enum ContentLine<'input> {
Begin {
params: Vec<Cow<'input, &'input str>>,
value: Cow<'input, &'input str>,
},
End {
params: Vec<Cow<'input, &'input str>>,
value: Cow<'input, &'input str>,
},
// ... TODO: maybe variants for each standard name too?
Other {
name: Cow<'input, &'input str>,
params: Vec<Cow<'input, &'input str>>,
value: Cow<'input, &'input str>,
},
Invalid(Cow<'input, &'input str>),
}
trait ParseHandler {
fn content_line(&self, line: ContentLine);
}
Compared to the previous approach, usage is (mildly) easier for consumers that
want to actually interpret the calendar component’s data, but doesn’t provide
great improvements for the vdirsyncer
use case.
Pros:
- Skipping content lines as simple as the previous approach.
- Each line is a single event, which is simpler to handle for consumers that need to interpret the calendar component data.
Cons:
- Still parses parameters for lines that we don’t care, and allocates lots of
Vec
s that may never be used.- Again, maybe using the same
Vec
that merely gets repopulated for each line was feasible. The above type would have a&[Cow<str>]
instead ofVec<Cow<str>>
. However, this sounds like lifetimes could be even more complicated, given that the array has a shorter lifetime than the references (I’m not even 100% sure that this is allowed).
- Again, maybe using the same
- Requires different enum types for vcard and icalendar. Keeping all the
variants updated would be quite tedious, especially given that most won’t
even be used in
vdirsyncer
’s use case. - Also requires additional logic to reconstruct content lines as the previous
approach.
- On the other hand, given that this approach uses one event per line, it is possible to simply provide a reference to the original line.
Version 3
pub struct ContentLine<'input> {
data: Cow<'input, str>,
}
impl<'input> ContentLine<'_> {
pub fn name(&self) -> Cow<str> {
// ...
}
pub fn params(&self) -> Cow<str> {
// ...
}
pub fn value(&self) -> Cow<str> {
// ...
}
pub fn raw(&self) -> Cow<str> {
// ...
}
}
trait ParseHandler {
fn content_line(&self, line: ContentLine);
fn invalid_line(&self, line: Cow<&str>);
}
This approach does away with the enum type used above, but keeps the idea of a single event per content line.
Pros:
- Skipping content lines still simple as the previous approach.
- Doesn’t create
Vec
s with pointers to parameters that are never used. - Requires no extra complexity to access the original content line unfolded.
Cons:
- Requires parsing a lot of portions twice; once when extracting the content line, and again when accessing the name, value or parameters.
- Using
Cow
to hold data inContentLine
means that we copy around every line when unfolding, even if the unfolded portions are never used. - Requires extra complexity to re-create the original content line in its folded form.
Overall, this is a minor improvement over previous approaches. Mostly the API provides a cleaner API (for all use cases).
Version 4
pub enum Name<'input> {
Begin,
End,
// ... TODO: variants for each standard name
Other(Cow<'input, &'input str>),
Invalid(Cow<'input, &'input str>),
}
pub struct ContentLine<'input> {
name: Name<'input>,
data: &'input str,
}
// Implements impl Iterator<Item = &str>
pub struct Params {}
impl<'input> ContentLine<'input> {
pub fn name(&'input self) -> &'input Name {
&self.name
}
pub fn params(&self) -> Params {
// ...
}
pub fn value(&self) -> Cow<str> {
// ...
}
}
trait ParseHandler {
fn content_line(&self, line: ContentLine);
fn invalid_line(&self, line: Cow<&str>);
}
This API is more convenient for situations which want to interpret the data inside calendar components. It also avoids parsing the name portion of each line twice.
Pros:
- Skipping a content lines is as simple as ignoring all events until a matching
Name::End
is found. - Does not unfold and copy data that won’t be used.
- Accessing parameters has a much more ergonomic interface.
- The cost of accessing parameters is only paid when they are actually required.
- No
Cow
for data that won’t be used.
Cons:
- Still requires a second pass to access parameters.
- The enum type for the name might be tedious to maintain, and their value is
questionable.
- On the other hand: it is perfectly feasible to replace it with a
&'str
instead; another version of the handler that is omitted here did exactly that. We’ll see more of this idea below.
- On the other hand: it is perfectly feasible to replace it with a
Taking a step back
At this point, a few patterns where clear:
- Whenever I avoided unfolding and re-allocating data that vdirsyncer doesn’t
care about (mostly parameters), interpreting that data requires parsing that
same portion of data twice.
- Implying that I should prioritise the vdirsyncer use case, rather than try to make this too generic.
- Using a
Cow<str>
as a field in the struct type means unfolding and copying the entire line, including portions that would never be used. - Using enums for the line’s name doesn’t seem to provide a worthwhile advantage, and always brings in maintenance overhead.
Current Version
After the above experiments, I set out to design something that would be a perfect fit for vdirsyncer and other low level tools. Higher level tools will not have the same level of consideration.
The resulting data type merely includes references (pointers) to the original data:
/// A valid content line.
///
/// May be folded via a CRLF immediately followed by a single linear
/// white-space character (i.e., SPACE or HTAB).
pub struct ContentLine<'input> {
raw: &'input str,
// Everything before the first colon or semicolon (whichever comes first).
name: &'input str,
// Everything after the unquoted semicolon and before the first unquoted colon.
params: &'input str,
// Everything after the first unquoted colon.
value: &'input str,
}
trait ParseHandler {
fn content_line(&self, line: ContentLine);
}
Pros:
- Skipping a component is still as simple as ignoring all lines until a
matching
END:XXX
event. - Parses everything that vdirsyncer cares about in one pass.
- No eager unfolding and copying of data that won’t be used.
- There is no need to reconstruct the original line; a reference to it is passed.
Cons (sort of):
- Accessing individual parameters still needs a second pass.
- But vdirsyncer doesn’t actually access those at all.
- For usages that need to further interpret the calendar component data, this
still requires a second pass to access individual parameters.
- Again, vdirsyncer doesn’t need this.
All the disadvantages here affect other type of tools, but not vdirsyncer specifically.
Accessors
The data in the structure above points to the original input data, which may contain folded lines. Accessing it is implemented via methods that unfold content on-demand:
impl<'input> ContentLine<'input> {
fn raw(&self) -> &'input str {
// ...
}
fn name(&self) -> Cow<'input, str> {
// ...
}
fn params(&self) -> Cow<'input, str> {
// ...
}
fn value(&self) -> Cow<'input, str> {
// ...
}
fn re_folded(&self) -> Cow<'input, str> {
// ...
}
}
Some notes on this:
- vdirsyncer doesn’t even read parameters, so that won’t even be used.
- Names are seldom folded (due to being at the beginning of the line), so
name()
will usually return aCow::Borrowed
(e.g.: a reference).- Names are accessed for every line, so maybe I’ll escape them prematurely and avoid a second pass for them too.
- Accessing values is only done for a small minority of lines, and only those will be unfolded.
- The
re_folded
method can be used to normalise how lines are folded. I won’t be using this initially and want to evaluate whether it is ever of any real impact in real-life conflict resolution. - Explicit lifetimes here allow keeping refernces beyond the lifetime of the parser itself.
Inversion of control
An issue with this approach is that parsing can’t be stopped immediately. There’s no way to stop the parser immediately and ignore the rest of a file.
I addressed this by simply using a Parser
type which can be used to iterates
over lines:
struct Parser<'data> {
data: &'data str,
index: usize,
}
impl<'data> Parser<'data> {
fn new(data: &'data str) -> Parser<'data> {
// ...
}
}
impl<'data> Iterator for Parser<'data> {
type Item = ContentLine<'data>;
fn next(&mut self) -> Option<Self::Item> {
// ...
}
}
This is also a more idiomatic Rust API, which actually turned out to be more pleasant to use.
Implementing the parser
Implementing the parser is relatively straightforward. Given that I want to handle invalid input gracefully, I only really need to concearn myself with a subset of the rules for icalendar content lines.
In general terms, I’ll parse each line until finding one of the following:
- A semicolon: indicates that the name ends here and parameters.
- A colon: indicates that the name/parameters ends and the value begins.
- A
\r\n
:- When followed by whitespace indicates a continuation line.
- Otherwise, it indicates the end of this content line.
(these are the same rules as those in the comments in the sample code above)
In my head, this looked like a decision tree, and this is pretty much what the implementation looks like.
The parsing code is one huge function. I usually consider functions this long an indicator of poor code quality, but I find this to be an exception. The code’s indentation actually reflects depth in this decision tree that I mentioned. And the lack of fancy macros or “smart” helpers makes it straightforward to read and follow.
When writing this, I initially represented the main branches with a todo!()
for each one. I and then filled in the missing parts, usually by writing tests
that would reach this unreachable code and refining until those passed.
I also looked into criterion for performance benchmarks (I only used it for one specific improvement so far). It turns out to be quite simple to use. I intend to use this for a few pending improvements after the first working releases.2.
The rest of this month
It’s been a few weeks since the last vdirsyncer status update. The design work mentioned above took only a couple of days. I’m still not very happy with the time consumed on it. A less polished approach could have sufficed for an initial implementation, and this won’t be as flexible as I hoped it would be.
The implementation itself didn’t take long. I already had a pretty clear mental picture on what it would look like, so it was mostly a day or two of writing tests, writing code, and iterating on that.
I’m now finalising integrating this into the vdirsyncer codebase and dealing with (what I hope are) the final bits of the basics in conflict resolution.
All this aside, I’ve been enjoying some time off recently, and also a lot of time working around more networking issues than I had anticipated.
Until next month!