FlatPress Wiki

Everything you need to know :)

tools:db:mailboximporter

tools:db:mailboximporter [2019/01/12 17:53] (current)
 +My workflow is a little convoluted, because I was going for more of a quick-n-dirty solution than a fully-featured one, but here goes.
  
 +I wanted to convert a bunch of mbox (email) data to FlatPress entries. I'm using Mozilla Thunderbird, but this should work with most mail readers, because variants of the mbox format are fairly standard.
 +
 +I needed to parse out the date, subject, author, and text from the emails, and then feed that into FP somehow. Here's what I did:
 +
 +1) Inside my mail reader, I put all the relevant emails into the same folder (call it "FP_posts").\\
 +2) From the filesystem, I made a working copy of that folder (a file called "FP_posts" in the Thunderbird data folder).\\
 +3) I used uudeview (actually xdeview, its X11 version) to separate the attachments from the text.\\
 +3a) In xdeview, I turned on the "Handle Text Files" option, let it decode all the files, and then grabbed just the 0001.txt, 0002.txt, etc. that contained the individual emails.\\
 +4) I concatenated all the 0*.txt files into one big file (FP_posts.txt) that now contains just the text parts of the emails.\\
 +5) I ran this file through readmbox.pl, a Perl script that I wrote (code below).\\
 +5a) This creates FP_posts.txt.csv with just the fields that FP is interested in (it is really delimited by characters other than commas; read the code).\\
 +6) I uploaded the CSV to a new folder on the webserver: .../flatpress/mbox/FP_posts.txt.csv\\
 +7) I uploaded mboxtoflatpress.php, a new PHP script that I wrote (code below), to the main FP folder: .../flatpress/mboxtoflatpress.php\\
 +8) While logged in to FlatPress, go to the URL .../flatpress/mboxtoflatpress.php?file=mbox/FP_posts.txt (the script appends ".csv" to the file parameter itself).\\
 +8a) This prints the subject line of each message as it is created on the blog.\\
 +9) Go to Admin->Entries->Drafts and publish away!\\
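
Step 4 can be done with any tool that joins files in name order. As a minimal sketch, assuming the decoded 0001.txt, 0002.txt, etc. all sit in one directory, a small PHP helper could do the job. The function name concat_parts and the paths in the example call are made up for illustration; they are not part of FlatPress or uudeview.

```php
<?php
// Hypothetical helper: join every 0*.txt file in $dir into $outFile, in name order.
// Returns the number of files that were joined.
function concat_parts( $dir, $outFile )
{
    $parts = glob( rtrim( $dir, '/' ) . '/0*.txt' );
    if( $parts === false ) $parts = array();
    sort( $parts ); // zero-padded names (0001.txt, 0002.txt, ...) sort in message order

    $out = fopen( $outFile, 'wb' );
    if( $out === false ) return 0;

    foreach( $parts as $part )
    {
        $text = file_get_contents( $part );
        // make sure one file's last line does not run into the next file's first line
        if( $text !== '' && substr( $text, -1 ) !== "\n" ) $text .= "\n";
        fwrite( $out, $text );
    }
    fclose( $out );
    return count( $parts );
}

// Example call (both paths are assumptions):
// concat_parts( '.', 'FP_posts.txt' );
```

The output name does not match the 0*.txt pattern, so running the helper twice in the same directory does not feed the result back into itself.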
 +
 +The script posts each entry:\\
 +- as a draft, in case there are some you don't want going public immediately.\\
 +- with the date/time that appears in the email (the timezone offset is ignored)\\
 +- with "author" set to the "From: " entry from the email (seems to grab just the "full name", rather than the email address, YMMV)\\
 +- with "subject" and "body" from the email (as you'd expect)\\
 +
 +Here's readmbox.pl:\\
 +<code>
 +#!/usr/bin/perl
 +# readmbox.pl - parse an mbox-style file into a CSV-style file
 +#
 +# Created 03.05.2009 by Jimbo S. Harris
 +#
 +# This script reads a source file named on the command line
 +# and parses out the following fields from each email in the file:
 +#
 +# From: To: Date: Subject: and the message body
 +#
 +# It makes some attempt to handle multipart messages, but it's best if they're text/plain.
 +#
 +# The output is in the form:
 +#
 +# FROM|TO|SUBJECT|DATE|`BODY`|ø
 +#
 +# with one such entry in the output file for each recognized message in the input file.
 +#
 +# The output file has the same name as the input file, with ".csv" appended to it.
 +#
 +# The output file is designed to be read by mboxtoflatpress.php
 +#
 +
 +# open input and output files
 +open( IN, "<", $ARGV[0] ) || die "could not open the input file " . $ARGV[0] . ": $!";
 +open( OUT, ">", $ARGV[0] . ".csv" ) || die "could not open the output file " . $ARGV[0] . ".csv: $!";
 +
 +# variable declarations
 +$state = "header";     # "header" or "body"
 +$body = "";
 +$boundary = "";        # multipart boundary, or "From " while scanning for the next message
 +$hasboundary = 0;
 +$delim = "|";          # between fields
 +$text_delim = "`";     # around the body
 +$eol_delim = "ø";      # between messages
 +$printdots = 1;
 +$count = 0;
 +$save = 1;             # 0 while discarding the rest of a multipart message
 +$| = 1;                # unbuffered progress output
 +
 +print $state . "\n";
 +
 +# traverse the input file one line at a time
 +while( <IN> )
 +{
 +    $line = $_;
 +    chomp $line;        # chomp (unlike chop) only removes a trailing newline
 +    $line =~ s/\r$//;   # also drop the CR of a CRLF line ending
 +
 +    # let the user know (on STDOUT) if a lot of data is passing through
 +    $count++;
 +    print " " . $count . " " if( $count % 1000 == 0 );
 +
 +    # read off the headers first
 +    if( $state eq "header" )
 +    {
 +        print "." if( $printdots == 1 );
 +        ($junk, $from) = split( /:/, $line, 2 ) if( $line =~ /^From:/ );
 +        ($junk, $to) = split( /:/, $line, 2 ) if( $line =~ /^To:/ );
 +        ($junk, $subject) = split( /:/, $line, 2 ) if( $line =~ /^Subject:/ );
 +        ($junk, $date) = split( /:/, $line, 2 ) if( $line =~ /^Date:/ );
 +
 +        # this is an attempt to parse multipart content properly. It worked OK, but can be buggy.
 +        if( $line =~ /boundary="?([^";]+)"?/ )
 +        {
 +            # each part of a multipart message starts with "--" followed by the boundary string
 +            $boundary = "--" . $1;
 +            $hasboundary = 1;
 +        }
 +
 +        # if you are done with the headers, switch to processing the message body
 +        if( $hasboundary == 1 )
 +        {
 +            # in the header of a multipart message, look for the boundary string
 +            # (\Q...\E keeps regex metacharacters in the boundary from being interpreted)
 +            if( $line =~ /^\Q$boundary\E/ )
 +            {
 +                $state = "body";
 +                print $count . "\n\n" . $state . " with boundary:\n\n (" . $boundary . ")\n\n\n";
 +                $count = 0;
 +            }
 +        }
 +        else
 +        {
 +            # in the header of a single-part message, look for a blank line
 +            if( $line =~ /^$/ )
 +            {
 +                $state = "body";
 +                $boundary = "From ";
 +                print $count . "\n" . $state;
 +                $count = 0;
 +            }
 +        }
 +    }
 +
 +    # process the message body
 +    elsif( $state eq "body" )
 +    {
 +        print "." if( $printdots == 1 );
 +
 +        # the boundary represents a state transition
 +        if( $line =~ /^\Q$boundary\E/ )
 +        {
 +            if( $hasboundary == 1 )
 +            {
 +                # if you're in a multipart message, and you're saving off the body,
 +                # once you find the next boundary, just "dump" the rest of the message
 +                # while looking for the "From " (the beginning of the following message)
 +                $boundary = "From ";
 +                $hasboundary = 0;
 +                print $count . "\n\ndumping rest of multipart"; $printdots = 0;
 +                $count = 0;
 +                $save = 0;
 +            }
 +            else
 +            {
 +                # if you're just looking for the beginning of the next message ("From ")
 +                # and you've found it, then spit out the current message and start a new one.
 +                print $count . "\n\ndone with message:\n\n (" . $subject . ") \nat line:\n\n (" . $line . ")\n\n\n";
 +                print OUT $from . $delim . $to . $delim . $subject . $delim . $date . $delim . $text_delim . $body . $text_delim . $delim . $eol_delim if( $from );
 +                $state = "header";
 +                $body = "";
 +                $from = "";
 +                print "\n" . $state;
 +                $count = 0;
 +                $boundary = "";
 +                $hasboundary = 0;
 +                $save = 1;
 +            }
 +        }
 +        # normal condition: haven't reached EOM yet, just append to the saved message body.
 +        elsif( $save == 1 )
 +        {
 +            print ".";
 +            $body = $body . $line . "\r\n";
 +        }
 +    }
 +}
 +
 +# the last message is not followed by another "From " line, so write it out here
 +print OUT $from . $delim . $to . $delim . $subject . $delim . $date . $delim . $text_delim . $body . $text_delim . $delim . $eol_delim if( $from && $state eq "body" );
 +
 +close( IN );
 +close( OUT );
 +</code>
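
To make the record layout concrete, here is a minimal PHP sketch that takes one record apart. The sample record and the function name parse_record are invented for illustration; real output from readmbox.pl may differ in whitespace, and this is not the code the importer itself uses.

```php
<?php
// Hypothetical helper: split one FROM|TO|SUBJECT|DATE|`BODY`| record into named fields.
// Returns null when the record has fewer than five fields.
function parse_record( $record )
{
    // Limit the split to 5 pieces so a | inside the body stays in the body.
    $fields = explode( '|', $record, 5 );
    if( count( $fields ) < 5 ) return null;

    list( $from, $to, $subject, $date, $body ) = $fields;

    // Remove the opening backtick, and the closing backtick plus the trailing |.
    $body = preg_replace( '/^`|`\|?\s*$/', '', $body );

    return array(
        'from'    => trim( $from ),
        'to'      => trim( $to ),
        'subject' => trim( $subject ),
        'date'    => trim( $date ),
        'body'    => $body,
    );
}

// Made-up sample record (the header values keep the space that follows the colon).
$sample = " Jane Doe| blog@example.com| Hello| Sun, 01 Feb 2009 10:59:37 -0800|`Line one\r\nA | inside the body\r\n`|";
print_r( parse_record( $sample ) );
```

A | in the From, To, Subject, or Date field still shifts the fields, because nothing in the format escapes the delimiters.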
 +
 +Here's mboxtoflatpress.php:\\
 +<code php>
 +<?php
 +/* mboxtoflatpress.php - post a CSV to FlatPress
 +
 +Created 3.6.2009 by Jimbo S. Harris
 +
 +This script reads input files created by readmbox.pl
 +The files are in the form:
 +
 +FROM|TO|SUBJECT|DATE|`BODY`|ø
 +
 +with possibly multiple entries per file
 +
 +use $debug to see what the parsed entries look like without posting them
 +*/
 +
 +
 +// FlatPress initialization routines
 +include 'defaults.php';
 +include INCLUDES_DIR . 'includes.php';
 +
 +system_init();
 +
 +if(!user_loggedin()) die('Please log in');
 +// end FlatPress init
 +
 +// map month names to their corresponding numbers
 +$monthnum = array(
 +"Jan"=>"01",
 +"Feb"=>"02",
 +"Mar"=>"03",
 +"Apr"=>"04",
 +"May"=>"05",
 +"Jun"=>"06",
 +"Jul"=>"07",
 +"Aug"=>"08",
 +"Sep"=>"09",
 +"Oct"=>"10",
 +"Nov"=>"11",
 +"Dec"=>"12",
 +);
 +
 +// Set $debug to 1 in order to see the CSV get parsed without actually posting anything.
 +$debug = 0;
 +
 +// Grab the filename off of the URL .../mboxtoflatpress.php?file=foo.txt
 +$file = ( ( isset( $_GET['file'] ) ) ? $_GET['file'] : "./Astro" ) . '.csv'; // forces .csv extension
 +$mbox = file_get_contents( $file );
 +
 +// ø is the delimiter between messages
 +$emails = preg_split( "/ø/", $mbox, -1, PREG_SPLIT_NO_EMPTY );
 +
 +foreach( $emails as $email )
 +{
 +    // | is the delimiter between fields. See field order in the list().
 +    // The limit of 5 keeps any | inside the body from being treated as a delimiter.
 +    $fields = preg_split( "/\|/", $email, 5 );
 +    if( count( $fields ) < 5 ) continue; // skip malformed or empty records
 +    list( $from, $to, $subject, $date, $body ) = $fields;
 +
 +    // ` is the delimiter surrounding the body: strip the opening one and the closing one (plus the trailing |)
 +    $body = preg_replace( '/^`|`\|?\s*$/', '', $body );
 +
 +    // the header values still carry the space that followed the colon
 +    $from = trim( $from );
 +    $subject = trim( $subject );
 +    $date = trim( $date );
 +
 +    // need to turn the date/time into a timestamp.
 +    // Example: Sun, 01 Feb 2009 10:59:37 -0800
 +    // Note: the timezone offset is discarded, so the time is read as server-local time.
 +    $datefields = preg_split( "/[ :]/", $date, -1, PREG_SPLIT_NO_EMPTY );
 +    if( $debug == 1 )
 +    {
 +        echo "datefields:<BR>";
 +        print_r( $datefields );
 +        echo "<BR>";
 +    }
 +
 +    // the timestamp is computed in debug mode too, because $entry below needs it
 +    list( $junk, $day, $month, $year, $hour, $minute, $second ) = $datefields;
 +    $timestamp = mktime( (int)$hour, (int)$minute, (int)$second, (int)$monthnum[$month], (int)$day, (int)$year );
 +    echo "<BR>".$date."<BR>".$timestamp."<BR>".date("r", $timestamp)."<BR>";
 +
 +    // $entry is what gets posted to FlatPress
 +    $entry = array( 'author' => $from, 'subject' => $subject, 'date' => $timestamp, 'flags' => "draft", 'content' => $body );
 +
 +    if( $debug )
 +    {
 +        echo "next entry:<BR><PRE>";
 +        print_r( $entry );
 +        echo "</PRE><BR><BR>";
 +    }
 +    else
 +    {
 +        echo "Loaded message with subject: " . htmlspecialchars( $entry['subject'] ) . "<BR>";
 +        //$id = entry_save($entry);
 +        $id = draft_save($entry);
 +    }
 +}
 +?>
 +</code>
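
The script rebuilds the timestamp by hand from the pieces of the Date: header and drops the timezone offset along the way. As an alternative sketch, PHP's built-in strtotime() understands this style of date and honours the offset. The helper name mail_date_to_timestamp is hypothetical, and this is a suggestion rather than what the script above does.

```php
<?php
// Hypothetical helper: convert an email Date: header value to a Unix timestamp.
// Returns false when PHP cannot parse the string.
function mail_date_to_timestamp( $date )
{
    // trim() removes the space that follows the colon in the header
    return strtotime( trim( $date ) );
}

$ts = mail_date_to_timestamp( " Sun, 01 Feb 2009 10:59:37 -0800" );
if( $ts === false )
{
    echo "could not parse the date\n";
}
else
{
    // show the same instant expressed in UTC
    echo gmdate( "r", $ts ) . "\n";
}
```

Checking for false before saving the entry avoids posting drafts with a bogus date; unusual Date: headers (for example with a trailing comment) may still need extra cleanup.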
tools/db/mailboximporter.txt · Last modified: 2019/01/12 17:53 (external edit)