Chasing Linux Kernel Archives

Kernel development is truly impossible to keep track of. The main mailing
list alone is vast beyond belief. Then there are all the side lists and IRC
channels, not to mention all the corporate mailing lists dedicated to
kernel development that never see the light of day. In some ways, kernel
development has become fundamentally mysterious.

Once in a while, some lunatic decides to try to reach back into the past and
study as much of the corpus of kernel discussion as he or she can
find. One such person is Joey Pabalinas, who recently wanted to gather
everything together in Maildir format, so he could do searches, calculate
statistics, generate pseudo-hacker AI bots and whatnot.
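
As a rough illustration of the statistics such a corpus makes possible, here is a minimal sketch that counts messages per sender in a Maildir, using Python's standard `mailbox` module. The function name `count_senders` and the path passed to it are hypothetical; this is not Joey's tooling.

```python
import mailbox
from collections import Counter
from email.utils import parseaddr


def count_senders(maildir_path):
    """Return a Counter mapping lowercased sender addresses to message counts."""
    # create=False: raise an error instead of creating a missing Maildir.
    box = mailbox.Maildir(maildir_path, create=False)
    counts = Counter()
    for message in box:
        # parseaddr splits "Name <addr>" into (name, addr).
        _, address = parseaddr(str(message.get("From", "")))
        if address:
            counts[address.lower()] += 1
    return counts
```

Calling `count_senders("/path/to/archive").most_common(10)` would list the ten most prolific senders, assuming the archive at that (hypothetical) path is a valid Maildir.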

He couldn’t find any existing giant corpus, so he tried to create his own
by piecing together mail archived on various sites. It turned out to be
more than a million separate files, which was too much to host on GitLab
or on the other hosting site he had in mind. He asked the Linux kernel
mailing list for suggestions on better hosting options. He acknowledged,
though, “It’s possible I’m the only
weirdo who finds this kind of thing useful, but I figured I should share it
just in case I’m not.”

Joe Perches suggested plumbing an existing set of archives, which
go back decades. But Joey said he’d tried that, and he found it all but
impossible to convert those archives to the Maildir format he wanted.
Instead, he’d spent the previous several weeks scraping the
archive and scripting his own conversion routines.

Konstantin Ryabitsev remarked:

The maildir format is kind of terrible for
LKML, because having millions of messages in a single directory is very
hard on the underlying FS. If you break it up into multiple folders, then
it becomes difficult to search. This is the main reason why we have chosen
to go with the public-inbox format, which solves both of these problems and
allows for a very efficient archive updating and replication using git.
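
The trade-off Konstantin describes can be sketched in a few lines. The example below is an assumption-laden illustration, not public-inbox's actual layout: it spreads messages across subdirectories keyed by a SHA-1 hash of the Message-ID, so no single directory grows huge. All function names are hypothetical. Lookup by Message-ID stays cheap, but a text search now has to walk every shard, which is the difficulty he mentions.

```python
import hashlib
from pathlib import Path


def shard_path(root, message_id, depth=2):
    """Map a Message-ID to root/ab/cd/<full hex digest>."""
    digest = hashlib.sha1(message_id.encode("utf-8")).hexdigest()
    # Use the first `depth` byte pairs of the digest as directory names.
    parts = [digest[i * 2:i * 2 + 2] for i in range(depth)]
    return Path(root).joinpath(*parts, digest)


def store(root, message_id, raw_bytes):
    """Write one raw message into its shard, creating directories as needed."""
    path = shard_path(root, message_id)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(raw_bytes)
    return path


def find_by_message_id(root, message_id):
    """Direct lookup: the path is computed, not searched for."""
    path = shard_path(root, message_id)
    return path.read_bytes() if path.exists() else None


def search_text(root, needle):
    """Naive text search: has to open every file in every shard."""
    for path in Path(root).rglob("*"):
        if path.is_file() and needle in path.read_bytes():
            yield path
```

With millions of messages, `search_text` touches every file on every query, which is why a real archive would pair storage like this with a separate search index.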

Meanwhile, Jasper Spaans raised his eyebrows at Joey’s statement that he’d
gotten more than a million separate files by scraping the archive. Jasper
replied:

First of all, there are more than 3M messages stored in the
database, so I guess you’ve missed some messages or something is really
broken. Besides, unless you figured out how to get to the raw data, you’ve
just scraped a rendering which discards stuff like pgp signatures etc and
has very incomplete headers. Unless you don’t care for those of course.
