FlatPress Wiki

tools:db:mailboximporter

My workflow is a little convoluted, because I was going for more of a quick-n-dirty solution than a fully-featured one, but here goes.

I wanted to convert a bunch of mbox (email) data to FlatPress entries. I'm using Mozilla Thunderbird, but this would work with most mail readers (because variants of the mbox format are fairly standard).

I needed to parse out the date, subject, author, and text from the emails, and then feed that into FP somehow. Here's what I did:

1) Inside my mail reader, I put all the relevant emails into one folder (call it “FP_posts”).
2) From the filesystem, I made a working copy of that folder (a file called “FP_posts” in the TBird data folder).
3) I used uudeview (actually xdeview, the X11 version of it) to separate the attachments from the text.
3a) In xdeview, I turned on the “Handle Text Files” option, let it decode all the files, and then just grabbed the 0001.txt, 0002.txt, etc. that contained the individual emails.
4) I concatenated all the 0*.txt files into one big file (FP_posts.txt) that now contains just the text parts of the emails.
5) I ran this file through readmbox.pl, a Perl script that I wrote (code below), passing the file name as the only argument.
5a) This creates FP_posts.txt.csv (really delimited by characters other than commas; read the code) with just the fields that FP is interested in.
6) I uploaded the CSV to a new folder on the webserver: …/flatpress/mbox/FP_posts.txt.csv
7) I uploaded mboxtoflatpress.php, a new PHP script that I wrote (code below), to the main FP folder: …/flatpress/mboxtoflatpress.php
8) While logged in to FlatPress, go to the URL …/flatpress/mboxtoflatpress.php?file=mbox/FP_posts.txt (leave off the “.csv”; the script appends it).
8a) This prints the subject line of each message as it is created on the blog.
9) Go to Admin→Entries→Drafts and publish away!

The script posts each entry:
- as a draft, in case there are some you don't want going public immediately
- with the date/time that appears in the email
- with “author” set to the “From: ” entry from the email (this seems to grab just the full name rather than the email address, YMMV)
- with “subject” and “body” taken from the email (as you'd expect)
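
Since the author field is simply whatever follows “From:” in the email, the result depends on how that header happens to be written. If you would rather reduce it to the display name explicitly before posting, something like the following might work. It is only a sketch: display_name_from_header() is a made-up helper, not part of FlatPress or of the scripts on this page, the addresses are placeholders, and it assumes PHP 7 or later and simple headers of the form Name <address>.

```php
<?php
// Hypothetical helper (not a FlatPress function): pull a human-readable
// name out of a raw "From:" header value.
function display_name_from_header(string $from): string
{
    $from = trim($from);

    // Matches  "Full Name" <user@example.com>  and  Full Name <user@example.com>
    if (preg_match('/^"?([^"<]*?)"?\s*<[^>]*>$/', $from, $m) && trim($m[1]) !== '') {
        return trim($m[1]);
    }

    // No display name: fall back to the bare address without angle brackets.
    return trim($from, "<> \t");
}

// Placeholder headers, including the leading space readmbox.pl leaves in place.
echo display_name_from_header(' "Jimbo S. Harris" <jimbo@example.com>'), "\n";
echo display_name_from_header('<jimbo@example.com>'), "\n";
```

Encoded names (=?UTF-8?...?=) and comma-separated address lists are not handled here.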

Here's readmbox.pl:

#!/usr/bin/perl
# readmbox.pl - parse an mbox-style file into a CSV-style file
#
# Created 03.05.2009 by Jimbo S. Harris
#
# This file reads a source file off the command line
# and parses out the following fields from each email in the file:
#
# From: To: Date: Subject: and the message body
#
# it makes some attempt to handle multipart messages, but it's best if they're text/plain.
#
# The output is in the form:
#
# FROM|TO|SUBJECT|DATE|`BODY`|ø
#
# with one such entry in the output file for each recognized message in the input file.
#
# The output file has the same name as the input file, with ".csv" appended to it.
#
# The output file is designed to be read by mboxtoflatpress.php
#

# open input and output files
open( IN, "<", $ARGV[0] ) || die "could not open the input file " . $ARGV[0] . ": $!";
open( OUT, ">", $ARGV[0] . ".csv" ) || die "could not open the output file " . $ARGV[0] . ".csv: $!";

#variable declarations
$state = "header";
$body = "";
$boundary = "";
$hasboundary = 0;
$delim="|";
$text_delim="`";
$eol_delim="ø";
$printdots=1;
$count=0;
$save=1;
$|=1;

print $state . "\n";

# traverse the input file one line at a time
while( <IN> )
{
$line = $_;
# strip the line ending (LF or CRLF); unlike chop, this leaves an unterminated last line intact
$line =~ s/\r?\n$//;

# let the user know (on STDOUT) if a lot of data is passing through
$count++;
print " " . $count . " " if( $count % 1000 == 0 );

# read off the headers first
if( $state eq "header" )
{
print "." if( $printdots == 1 );
($junk, $from) = split( /:/, $line, 2 ) if( $line =~ /^From:/ );
($junk, $to) = split( /:/, $line, 2 ) if( $line =~ /^To:/ );
($junk, $subject) = split( /:/, $line, 2 ) if( $line =~ /^Subject:/ );
($junk, $date) = split( /:/, $line, 2 ) if( $line =~ /^Date:/ );

# this is an attempt to parse multipart content properly. It worked OK, but can be buggy.
# only a real boundary= parameter counts, so a header that merely contains the word "boundary" is ignored
if( $line =~ /boundary="?([^";\s]+)"?/i )
{
# body parts are delimited by "--" followed by the boundary string
$boundary = "--" . $1;
$hasboundary = 1;
}

# if you are done with the headers, switch to processing the message body
if( $hasboundary == 1 )
{
if( $line =~ /^\Q$boundary\E/ ) # if you're in the header of a multipart message, look for the boundary string (quoted, since it may contain regex metacharacters)
{
$state = "body";
print $count . "\n\n" . $state . " with boundary:\n\n (" . $boundary . ")\n\n\n";
$count = 0;
}
}
else
{
if( $line =~ /^$/ ) # if you're in the header in a single-part message, look for a blank line
{
$state = "body";
$boundary = "From ";
print $count . "\n" . $state;
$count = 0;
}
}
}

# process the message body
elsif( $state eq "body" )
{
print "." if( $printdots == 1 );

# the boundary represents a state transition
if( $line =~ /^\Q$boundary\E/ ) # \Q...\E: match the boundary literally, not as a regex
{
if( $hasboundary == 1 )
{
# if you're in a multipart message, and you're saving off the body,
# once you find the next boundary, just "dump" the rest of the message
# while looking for the "From " (the beginning of the following message)
$boundary = "From ";
$hasboundary = 0;
print $count . "\n\ndumping rest of multipart"; $printdots = 0;
$count = 0;
$save = 0;
}
else
{
# if you're just looking for the beginning of the next message ("From - ")
# and you've found it, then spit out the current message and start a new one.
print $count . "\n\ndone with message:\n\n (" . $subject . ") \nat line:\n\n (" . $_ . ")\n\n\n";
print OUT $from . $delim . $to . $delim . $subject . $delim . $date . $delim . $text_delim . $body . $text_delim . $delim . $eol_delim if( $from );
$state = "header";
$body = "";
$from = "";
print "\n" . $state; #$printdots = 1;
$count = 0;
$boundary = "";
$hasboundary = 0;
$save = 1;
}

}
# normal condition -- haven't reached EOM yet, just append to the saved message body.
elsif( $save == 1 )
{
print ".";
$body = $body . $line . "\r\n";
}
}
}
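# the last message has no following "From " line to trigger the write above, so flush it here
print OUT $from . $delim . $to . $delim . $subject . $delim . $date . $delim . $text_delim . $body . $text_delim . $delim . $eol_delim if( $from && $state eq "body" );
close( IN );
close( OUT );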
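
The record layout described in the header comment (FROM|TO|SUBJECT|DATE|`BODY`|ø) can be illustrated with a short PHP sketch that splits made-up sample data back into fields. This is not the importer itself: parse_records() is a hypothetical name, the sample messages are invented, and it assumes PHP 7 or later and that this file and the data use the same character encoding for ø. It splits with explode() and a limit of 5, so an empty field stays in its place and a “|” inside the body is left alone.

```php
<?php
// Sketch only: parse_records() is a hypothetical name, not a FlatPress function.
// Splits text in the FROM|TO|SUBJECT|DATE|`BODY`|ø layout into arrays.
function parse_records(string $data): array
{
    $entries = [];
    foreach (explode('ø', $data) as $record) {
        if (trim($record) === '') {
            continue; // nothing left after the final ø
        }
        // A limit of 5 keeps empty fields in place and leaves any "|"
        // inside the body alone.
        $fields = explode('|', $record, 5);
        if (count($fields) < 5) {
            continue; // incomplete record
        }
        list($from, $to, $subject, $date, $body) = $fields;
        // The last piece looks like `BODY`| : drop the closing "|", then the backticks.
        $body = preg_replace('/\|\s*$/', '', $body);
        $entries[] = [
            'from'    => trim($from),
            'to'      => trim($to),
            'subject' => trim($subject),
            'date'    => trim($date),
            'body'    => trim($body, '`'),
        ];
    }
    return $entries;
}

// Made-up sample: two messages, the first with an empty To: field.
$sample = "Alice||Hello| Sun, 01 Feb 2009 10:59:37 -0800|`First post.\r\n`|ø"
        . "Bob| blog@example.com|Second| Mon, 02 Feb 2009 08:00:00 -0800|`a | b\r\n`|ø";

foreach (parse_records($sample) as $entry) {
    echo $entry['subject'], ' from ', $entry['from'], "\n";
}
```

A “|” in the subject, or a “ø” anywhere in a message, would still break the layout; the format has no escaping.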

Here's mboxtoflatpress.php:

<?php
/* mboxtoflatpress.php - post a CSV to FlatPress
 
Created 3.6.2009 by Jimbo S. Harris
 
This script reads input files created by readmbox.pl
The files are in the form:
 
FROM|TO|SUBJECT|DATE|`BODY`|ø
 
with possibly multiple entries per file
 
use $debug to see what the parsed entries look like without posting them
*/
 
 
// FlatPress initialization routines
include 'defaults.php';
include INCLUDES_DIR .'includes.php';
 
system_init();
 
if(!user_loggedin()) die('Please log in');
// end FlatPress init
// map month abbreviations to their two-digit month numbers
 
$monthnum = array(
"Jan"=>"01",
"Feb"=>"02",
"Mar"=>"03",
"Apr"=>"04",
"May"=>"05",
"Jun"=>"06",
"Jul"=>"07",
"Aug"=>"08",
"Sep"=>"09",
"Oct"=>"10",
"Nov"=>"11",
"Dec"=>"12",
);
 
// Set $debug to 1 in order to see the CSV get parsed without actually posting anything.
$debug = 0;
 
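// Grab the filename off of the URL: .../mboxtoflatpress.php?file=mbox/FP_posts.txt
// ".csv" is always appended below, so leave it off in the URL.
$file = ( ( isset( $_GET['file'] ) ) ? $_GET['file'] : "./Astro" ) .'.csv'; // forces .csv extension
$mbox = file_get_contents( $file );
if( $mbox === false ) die( 'Could not read ' . htmlspecialchars( $file ) );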
 
// ø is the delimiter between messages
$emails = preg_split( "/ø/", $mbox, -1, PREG_SPLIT_NO_EMPTY );
 
foreach( $emails as $email )
{
// | is the delimiter between fields. See field order in the list()
// explode() with a limit of 5 keeps empty fields (e.g. a missing To:) in place
// and leaves any | inside the body untouched.
list( $from, $to, $subject, $date, $body ) = explode( "|", $email, 5 );
 
// ` is the delimiter surrounding the body
list( $body ) = preg_split( "/`/", $body, -1, PREG_SPLIT_NO_EMPTY );
 
// need to turn the date/time into a timestamp.
// Example: Sun, 01 Feb 2009 10:59:37 -0800
$datefields = preg_split( "/[ :]/", $date, -1, PREG_SPLIT_NO_EMPTY );
if( $debug == 1 )
{
echo "datefields:<BR>";
print_r( $datefields );
echo "<BR>";
}
// fields: weekday, day, month, year, hour, minute, second, UTC offset
// note: the UTC offset is discarded, so mktime() interprets the time in the server's time zone
list( $junk, $day, $month, $year, $hour, $minute, $second, $junk ) = $datefields;
$timestamp = mktime($hour,$minute,$second,$monthnum[$month],$day,$year);
echo "<BR>".$date."<BR>".$timestamp."<BR>".date("r", $timestamp)."<BR>";
 
 
// $entry is what gets posted to FlatPress
$entry = array( 'author' => $from, 'subject' => $subject, 'date' => $timestamp, 'flags' => "draft", 'content' => $body );
 
if( $debug )
{
echo "next entry:<BR><PRE>";
print_r( $entry );
echo "</PRE><BR><BR>";
}
else
{
echo "Loaded message with subject: " . $entry['subject'] . "<BR>";
//$id = entry_save($entry);
$id = draft_save($entry);
}
}
?>
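
The date handling above splits the “Date:” header by hand and throws away the UTC offset, so the time ends up interpreted in the server's time zone. If that matters for your posts, PHP's DateTime::createFromFormat() can parse the whole header including the offset. The following is a minimal sketch, assuming PHP 7 or later and headers shaped exactly like the example in the script (Sun, 01 Feb 2009 10:59:37 -0800); email_date_to_timestamp() is a hypothetical helper name, not a FlatPress function.

```php
<?php
// Hypothetical helper: convert an email "Date:" header to a Unix timestamp,
// honouring the UTC offset at the end of the header.
// Returns false when the header does not match the expected layout.
function email_date_to_timestamp(string $date)
{
    // D = weekday name, d = day, M = month name, Y = 4-digit year, O = offset such as -0800
    $parsed = DateTime::createFromFormat('D, d M Y H:i:s O', trim($date));
    return $parsed === false ? false : $parsed->getTimestamp();
}

$timestamp = email_date_to_timestamp(' Sun, 01 Feb 2009 10:59:37 -0800');
echo gmdate('c', $timestamp), "\n"; // 2009-02-01T18:59:37+00:00
```

Headers without the weekday, or with a trailing comment such as (PST), do not match this format string and come back as false, so check the return value before using it as the 'date' of an entry.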
tools/db/mailboximporter.txt · Last modified: 2019/01/12 17:53 (external edit)