tools:db:mailboximporter

Differences

This shows you the differences between two versions of the page.

tools:db:mailboximporter [2019/01/12 17:53] – external edit 127.0.0.1
tools:db:mailboximporter [2020/04/15 19:57] (current) – removed arvid
Line 1: Line 1:
-My workflow is a little convoluted, because I was going for more of a quick-n-dirty solution than a fully-featured one, but here goes. 
  
-I wanted to convert a bunch of mbox (email) data to FlatPress entries. I'm using Mozilla Thunderbird, but this would work with most mail readers (because variants of the mbox format are fairly standard).  
- 
-I needed to parse out the date, subject, author, and text from the emails, and then feed that into FP somehow. Here's what I did: 
- 
-1) Inside my mail reader, I put all the relevant emails into the same folder (call it "FP_posts").\\ 
-2) From the filesystem, I made a working copy of that folder (a file called "FP_posts" in the Thunderbird data folder).\\ 
-3) I used uudeview (actually xdeview, its X11 version) to separate the attachments from the text.\\ 
-3a) In xdeview, I turned on the "Handle Text Files" option, let it decode all the files, and then grabbed just the 0001.txt, 0002.txt, etc. that contained the individual emails.\\ 
-4) I concatenated all the 0*.txt files into one big file (FP_posts.txt) that contains only the text parts of the emails.\\ 
-5) I ran this file through readmbox.pl, a Perl script that I wrote (code below).\\ 
-5a) This creates FP_posts.txt.csv with just the fields that FP is interested in. Despite the extension, the fields are delimited by characters other than commas (read the code).\\ 
-6) I uploaded the CSV to a new folder on the webserver: .../flatpress/mbox/FP_posts.txt.csv\\ 
-7) I uploaded mboxtoflatpress.php, a PHP script that I wrote (code below), to the main FP folder: .../flatpress/mboxtoflatpress.php\\ 
-8) While logged in to FlatPress, go to the URL .../flatpress/mboxtoflatpress.php?file=mbox/FP_posts.txt (the script appends ".csv" to the name itself).\\ 
-8a) This prints the subject line of each message as it is created on the blog.\\ 
-9) Go to Admin->Entries->Drafts and publish away!\\ 
- 
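Step 4 above can also be scripted. The following is a minimal sketch in Python (not part of the original workflow, which used the shell and Perl); the function name `concatenate_text_parts` and its parameters are made up for illustration, and it assumes the decoded files sort correctly by name, as 0001.txt, 0002.txt, etc. do.

```python
import glob
import os


def concatenate_text_parts(source_dir, output_path, pattern="0*.txt"):
    """Join the decoded text files, in name order, into one big file.

    Returns the list of input paths that were used.
    """
    paths = sorted(glob.glob(os.path.join(source_dir, pattern)))
    with open(output_path, "wb") as out:
        for path in paths:
            with open(path, "rb") as part:
                data = part.read()
            out.write(data)
            # Make sure the next file starts on a fresh line, so that its
            # "From " separator is still found at the beginning of a line.
            if data and not data.endswith(b"\n"):
                out.write(b"\n")
    return paths
```

Calling `concatenate_text_parts(".", "FP_posts.txt")` in the folder holding the decoded files would produce the input for readmbox.pl. Working in bytes avoids guessing the character encoding of the emails.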
-The script posts each entry:\\ 
-- as drafts, in case there are some you don't want going public immediately.\\ 
-- with the date/time that appears in the email\\ 
-- with "author" set to the "From: " entry from the email (seems to grab just the "full name", rather than the email address, YMMV)\\ 
-- with "subject" and "body" from the email (as you'd expect)\\ 
- 
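If you would rather pick the author name out of the "From: " header explicitly instead of relying on whatever ends up displayed, the split between display name and address can be sketched as follows. This is an illustration in Python using the standard `email.utils.parseaddr`; the helper name `author_from_header` and the example address are hypothetical, and neither script below does this.

```python
from email.utils import parseaddr


def author_from_header(from_header):
    """Return the display name of a From: header value.

    Falls back to the bare address when the header has no display name.
    """
    name, address = parseaddr(from_header.strip())
    return name or address


if __name__ == "__main__":
    # Hypothetical header value, with the leading space readmbox.pl leaves in.
    print(author_from_header(" Jane Doe <jane@example.com>"))
```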
-Here's readmbox.pl:\\ 
-<code> 
-#!/usr/bin/perl 
-# readmbox.pl - parse an mbox-style file into a CSV-style file 
-# 
-# Created 03.05.2009 by Jimbo S. Harris 
-# 
-# This file reads a source file off the command line 
-# and parses out the following fields from each email in the file: 
-# 
-# From: To: Date: Subject: and the message body 
-# 
-# it makes some attempt to handle multipart messages, but it's best if they're text/plain. 
-# 
-# The output is in the form: 
-# 
-# FROM|TO|SUBJECT|DATE|`BODY`|ø 
-# 
-# with one such entry in the output file for each recognized message in the input file. 
-# 
-# The output file has the same name as the input file, with ".csv" appended to it. 
-# 
-# The output file is designed to be read by mboxtoflatpress.php 
-# 
- 
-# open input and output files 
-open( IN, $ARGV[0] ) || die "could not open the input file " . $ARGV[0] . ": $!"; 
-open( OUT, ">" . $ARGV[0] . ".csv" ) || die "could not open the output file " . $ARGV[0] . ".csv: $!"; 
- 
-#variable declarations 
-$state = "header"; 
-$body = ""; 
-$boundary = ""; 
-$hasboundary = 0; 
-$delim="|"; 
-$text_delim="`"; 
-$eol_delim="ø"; 
-$printdots=1; 
-$count=0; 
-$save=1; 
-$|=1; 
- 
-print $state . "\n"; 
- 
-# traverse the input file one line at a time 
-while( <IN> ) 
-{ 
-$line = $_; 
-chomp $line; # strip only the trailing newline; chop would eat a real character on an unterminated last line 
- 
-# let the user know (on STDOUT) if a lot of data is passing through 
-$count++; 
-print " " . $count . " " if( $count % 1000 == 0 ); 
- 
-# read off the headers first 
-if( $state eq "header" ) 
-{ 
-print "." if( $printdots == 1 ); 
-($junk, $from) = split( /:/, $line, 2 ) if( $line =~ /^From:/ ); 
-($junk, $to) = split( /:/, $line, 2 ) if( $line =~ /^To:/ ); 
-($junk, $subject) = split( /:/, $line, 2 ) if( $line =~ /^Subject:/ ); 
-($junk, $date) = split( /:/, $line, 2 ) if( $line =~ /^Date:/ ); 
- 
-# this is an attempt to parse multipart content properly. It worked OK, but can be buggy. 
-if( $line =~ /boundary/ ) 
-{ 
-#print "boundary line: (" . $_ . ")\n"; 
- 
-#$boundary = $_; 
- 
-#$boundary =~ s/boundary=\"\(.*\)$/\1/; 
-($junk, $boundary) = split( /=/, $line, 2 ) ; 
- 
-$boundary =~ s/\"//g; 
- 
-$boundary = "--" . $boundary; 
- 
-#print "boundary vars: (" . $boundary . ")\n"; 
-$hasboundary = 1; 
-#print "got boundary: " . $boundary . "\n"; 
-} 
- 
-# if you are done with the headers, switch to processing the message body 
-if( $hasboundary == 1 ) 
-{ 
-if( $line =~ /^\Q$boundary\E/ ) # if you're in the header of a multipart message, look for the boundary string (\Q..\E: match it literally, not as a regex) 
-{ 
-$state = "body"; 
-print $count . "\n\n" . $state . " with boundary:\n\n (" . $boundary . ")\n\n\n"; 
-$count = 0; 
-} 
-} 
-else 
-{ 
-if( $line =~ /^$/ ) # if you're in the header in a single-part message, look for a blank line 
-{ 
-$state = "body"; 
-$boundary = "From "; 
-print $count . "\n" . $state; 
-$count = 0; 
-} 
-} 
-} 
- 
-# process the message body 
-elsif( $state eq "body" ) 
-{ 
-print "." if( $printdots == 1 ); 
- 
-# the boundary represents a state transition 
-if( $line =~ /^\Q$boundary\E/ ) # \Q..\E: boundary strings may contain regex metacharacters 
-{ 
-if( $hasboundary == 1 ) 
-{ 
-# if you're in a multipart message, and you're saving off the body, 
-# once you find the next boundary, just "dump" the rest of the message 
-# while looking for the "From " (the beginning of the following message) 
-$boundary = "From "; 
-$hasboundary = 0; 
-print $count . "\n\ndumping rest of multipart"; $printdots = 0; 
-$count = 0; 
-$save = 0; 
-} 
-else 
-{ 
-# if you're just looking for the beginning of the next message ("From - ") 
-# and you've found it, then spit out the current message and start a new one. 
-print $count . "\n\ndone with message:\n\n (" . $subject . ") \nat line:\n\n (" . $_ . ")\n\n\n"; 
-print OUT $from . $delim . $to . $delim . $subject . $delim . $date . $delim . $text_delim . $body . $text_delim . $delim . $eol_delim if( $from ); 
-$state = "header"; 
-$body = ""; 
-$from = ""; 
-print "\n" . $state; #$printdots = 1; 
-$count = 0; 
-$boundary = ""; 
-$hasboundary = 0; 
-$save = 1; 
-} 
- 
-} 
-# normal condition -- haven't reached EOM yet, just append to the saved message body. 
-elsif( $save == 1 ) 
-{ 
-print "."; 
-$body = $body . $line . "\r\n"; 
-} 
-} 
-} 
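-# the last message has no following "From " line to trigger the output above, so write it out here 
-print OUT $from . $delim . $to . $delim . $subject . $delim . $date . $delim . $text_delim . $body . $text_delim . $delim . $eol_delim if( $from && $state eq "body" ); 
-close( IN ); 
-close( OUT ); 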
-</code> 
- 
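The record layout that readmbox.pl writes, FROM|TO|SUBJECT|DATE|`BODY`|ø, is easy to get subtly wrong when reading it back, so here is a small sketch of it in Python. The function names `write_record` and `read_records` are invented for this illustration. The reader splits on at most four "|" characters and takes the body between the first and last backtick, which is one possible way to tolerate "|" or backticks inside the body. A "ø" in the body, or a "|" in any header field, would still break the format.

```python
FIELD_DELIM = "|"
TEXT_DELIM = "`"
RECORD_DELIM = "ø"


def write_record(sender, to, subject, date, body):
    """Build one record in the same field order readmbox.pl uses."""
    fields = [sender, to, subject, date, TEXT_DELIM + body + TEXT_DELIM, ""]
    return FIELD_DELIM.join(fields) + RECORD_DELIM


def read_records(text):
    """Parse a string of concatenated records into a list of dicts."""
    records = []
    for raw in text.split(RECORD_DELIM):
        if not raw.strip():
            continue  # nothing after the final record delimiter
        parts = raw.split(FIELD_DELIM, 4)
        if len(parts) != 5:
            continue  # malformed record, skip it
        sender, to, subject, date, rest = parts
        # rest looks like `BODY`| so keep what lies between the outer backticks
        start = rest.find(TEXT_DELIM)
        end = rest.rfind(TEXT_DELIM)
        body = rest[start + 1:end] if start != -1 and end > start else ""
        records.append({"from": sender, "to": to, "subject": subject,
                        "date": date, "body": body})
    return records
```

Note that empty fields keep their position here, because the split does not discard empty pieces.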
-Here's mboxtoflatpress.php:\\ 
-<code php> 
-<?php 
-/* mboxtoflatpress.php - post a CSV to FlatPress 
- 
-Created 3.6.2009 by Jimbo S. Harris 
- 
-This script reads input files created by readmbox.pl 
-The files are in the form: 
- 
-FROM|TO|SUBJECT|DATE|`BODY`|ø 
- 
-with possibly multiple entries per file 
- 
-use $debug to see what the parsed entries look like without posting them 
-*/ 
- 
- 
-// FlatPress initialization routines 
-include 'defaults.php'; 
-include INCLUDES_DIR .'includes.php'; 
- 
-system_init(); 
- 
-if(!user_loggedin()) die('Please log in'); 
-// end FlatPress init 
-// put month names and corresponding numbers in an array 
- 
-$monthnum = array( 
-"Jan"=>"01", 
-"Feb"=>"02", 
-"Mar"=>"03", 
-"Apr"=>"04", 
-"May"=>"05", 
-"Jun"=>"06", 
-"Jul"=>"07", 
-"Aug"=>"08", 
-"Sep"=>"09", 
-"Oct"=>"10", 
-"Nov"=>"11", 
-"Dec"=>"12", 
-); 
- 
-// Set $debug to 1 in order to see the CSV get parsed without actually posting anything. 
-$debug = 0; 
- 
-// Grab the filename off of the URL: .../mboxtoflatpress.php?file=foo.txt reads foo.txt.csv 
-$file = ( ( isset( $_GET['file'] ) ) ? $_GET['file'] : "./Astro" ) .'.csv'; // always appends the .csv extension 
-$mbox = file_get_contents( $file ); 
- 
-// ø is the delimiter between messages 
-$emails = preg_split( "/ø/", $mbox, -1, PREG_SPLIT_NO_EMPTY ); 
- 
-foreach( $emails as $email ) 
-{ 
-// | is the delimiter between fields. See field order in the list() 
-// Split into at most 5 pieces so empty fields keep their position and a "|" inside the body is preserved. 
-list( $from, $to, $subject, $date, $body ) = array_pad( explode( "|", $email, 5 ), 5, "" ); 
- 
-// ` is the delimiter surrounding the body 
-list( $body ) = preg_split( "/`/", $body, -1, PREG_SPLIT_NO_EMPTY ); 
- 
-// need to turn the date/time into a timestamp. 
-// Example: Sun, 01 Feb 2009 10:59:37 -0800 
-// Note: the UTC offset is discarded, so mktime() interprets the time in the server's timezone. 
-// The timestamp is computed in debug mode too, because $entry below needs it either way. 
-$datefields = preg_split( "/[ :]/", $date, -1, PREG_SPLIT_NO_EMPTY ); 
-list( $junk, $day, $month, $year, $hour, $minute, $second ) = $datefields; 
-$timestamp = mktime( (int)$hour, (int)$minute, (int)$second, (int)$monthnum[$month], (int)$day, (int)$year ); 
-if( $debug == 1 ) 
-{ 
-echo "datefields:<BR>"; 
-print_r( $datefields ); 
-echo "<BR>"; 
-} 
-else 
-{ 
-echo "<BR>".$date."<BR>".$timestamp."<BR>".date("r", $timestamp)."<BR>"; 
-} 
- 
- 
-// $entry is what gets posted to FlatPress 
-$entry = array( 'author' => $from, 'subject' => $subject, 'date' => $timestamp, 'flags' => "draft", 'content' => $body ); 
- 
-if( $debug ) 
-{ 
-echo "next entry:<BR><PRE>"; 
-echo htmlspecialchars( print_r( $entry, true ) ); 
-echo "</PRE><BR><BR>"; 
-} 
-else 
-{ 
-echo "Loaded message with subject: " . htmlspecialchars( $entry['subject'] ) . "<BR>"; 
-//$id = entry_save($entry); 
-$id = draft_save($entry); 
-} 
-} 
-?> 
-</code> 
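One thing to be aware of in the date handling: the PHP script feeds the wall-clock fields of the "Date: " header to mktime() and drops the UTC offset, so the result depends on the server's timezone. The sketch below, in Python and using only the standard library, shows the difference between honoring and ignoring the offset. The two function names are made up for this illustration, and the "ignoring" variant treats the fields as UTC purely to make the comparison reproducible (the PHP script uses the server's local timezone instead).

```python
import calendar
from email.utils import parsedate, parsedate_to_datetime


def timestamp_honoring_offset(date_header):
    """Unix timestamp of the instant the Date: header describes."""
    return int(parsedate_to_datetime(date_header.strip()).timestamp())


def timestamp_ignoring_offset(date_header):
    """Unix timestamp of the wall-clock fields, read as if they were UTC."""
    fields = parsedate(date_header.strip())
    if fields is None:
        raise ValueError("unparseable date: %r" % date_header)
    return calendar.timegm(fields)


if __name__ == "__main__":
    example = "Sun, 01 Feb 2009 10:59:37 -0800"
    difference = timestamp_honoring_offset(example) - timestamp_ignoring_offset(example)
    print(difference // 3600, "hours apart")
```

For the example date from the script's comment, the two values differ by the eight hours of the -0800 offset. Whether that matters depends on how close to the original sending time you want the imported entries to appear.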
tools/db/mailboximporter.1547312012.txt.gz · Last modified: 2019/01/12 17:53 by 127.0.0.1
