IMAP prescribes this exact algorithm, which has led to quite a bit of pain.
Most notably, all of the messages titled “Recipient unknown” land in the same thread. I’ve also seen complaints about subthreads, which occur, for example, when you reply off-list to a mailing-list mail. A lot of people expect that off-list, person-to-person message to start a new thread, but jwz’s algorithm says it’s part of the same thread.
Is there an alternative algorithm that you are aware of, one that avoids this specific problem (of ‘unknown’) by creating a new thread, and still does not rely on a database?
I am trying to think of something similar from the world of compiler AST management, but cannot come up with anything useful yet.
If by ‘unknown’ you mean messages without a message-id and the problems they create for threading, then that isn’t as much of a problem as it was, thanks to the overwhelming problem of spam.
Until around 2000 some software would send messages with all kinds of syntax errors, including having no message-id, and the maintainers would often be arrogant about that. That attitude faded away between 2000 and 2010, as the spam filters increasingly treated syntax errors as spam signs. Nowadays messages without a message-id are much rarer.
I think the optimal threading algorithm now is the first part of jwz’s, but without the subject-based merging, and instead with a mechanism to split threads whenever a reply’s address set is a strict subset of the set in the message to which it replies.
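To make that concrete, here is a rough Python sketch of how I read the idea. It is not jwz’s actual code and not what any particular client does: the `Message` class, its field names and the `thread` function are made up for illustration. It also simplifies jwz’s first part by linking only to ancestors that are actually present in the mailbox (no empty placeholder containers), and it assumes the address set is From + To + Cc, already normalised.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Optional


@dataclass
class Message:
    # Hypothetical, minimal view of a parsed mail; not a real library type.
    message_id: str
    in_reply_to: Optional[str] = None
    references: List[str] = field(default_factory=list)  # oldest first
    addresses: FrozenSet[str] = frozenset()  # assumed: From + To + Cc


def find_parent(msg: Message, by_id: Dict[str, Message]) -> Optional[str]:
    """Return the nearest ancestor that we actually have, or None."""
    candidates = []
    if msg.in_reply_to:
        candidates.append(msg.in_reply_to)
    # References lists ancestors oldest first, so walk it backwards.
    candidates.extend(reversed(msg.references))
    for candidate in candidates:
        if candidate in by_id and candidate != msg.message_id:
            return candidate
    return None


def thread(messages: List[Message]) -> Dict[str, List[str]]:
    """Group message-ids by thread root, using headers only (no subjects)."""
    by_id = {m.message_id: m for m in messages}
    parent: Dict[str, Optional[str]] = {}
    for m in messages:
        p = find_parent(m, by_id)
        # Split rule: a reply whose address set is a strict subset of its
        # parent's set starts a new thread. Empty sets never trigger it.
        if p is not None and m.addresses and m.addresses < by_id[p].addresses:
            p = None
        parent[m.message_id] = p

    def root(mid: str) -> str:
        seen = set()
        # The 'seen' set only guarantees termination on reference loops
        # in malformed mail; it does not try to repair them.
        while parent[mid] is not None and mid not in seen:
            seen.add(mid)
            mid = parent[mid]
        return mid

    threads: Dict[str, List[str]] = {}
    for m in messages:
        threads.setdefault(root(m.message_id), []).append(m.message_id)
    return threads
```

With this, an off-list reply that drops the list address becomes the root of its own thread, and further replies between the same two people stay under it because their address sets are equal rather than strictly smaller. Nothing here needs a database, only the headers of the messages at hand.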
Vaguely relevant blog post: Exchange’s threading is like jwz’s except with fixed-size IDs. A 176-bit ID for the thread, 44 bits for each message.
I think the optimal threading algorithm now is the first part of jwz’s, but without the subject-based merging, and instead with a mechanism to split threads whenever a reply’s address set is a strict subset of the set in the message to which it replies.
This was the approach I had in my mail client [0], but the heuristic you mention did not always seem helpful to me. This is why I decided to allow only message-id threading. I plan on adding thread editing in the UI, so that users can fix broken threading themselves.
[0] https://meli.delivery
Yes, I agree it’s not a clear win. You may be able to improve on it if you know, or can guess, why one or more addresses were removed from the set, and whether the main thread continues or not.
This reminded me of the pain I went through when I was asked to thread an email dataset, and extract its communication edges and vertices, after it had been exported to me via Outlook, at a time when not much information about its threading algorithm was available. I wrote two blog posts, and over the years people have commented on them, adding valuable information for anyone to use.