Monday, July 6, 2009

Searching Plurks

How often do you find code in a blog post? I became a plurk fan a while back, but the one feature I missed (and apparently couldn't live without) was a way to "grep" my timeline. For non-UNIX geeks, "grep" is a command on UNIX for finding patterns in files. It stands for:

g/re/p

which, in the original UNIX editor "ed", meant: globally search for a regular expression and print the lines that match.
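For readers who have never used it, here is grep itself doing the same job on a plain file (the file name and contents below are invented for the demo):

```shell
# Put three lines in a scratch file, then ask grep for the ones
# containing "young":
printf 'alpha\nyoung at heart\nbeta\n' > /tmp/timeline.txt
grep young /tmp/timeline.txt
# → young at heart
```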

Here's a sample of how it works:


$ perl plurkgrep.pl young
fetching http://www.plurk.com/m
fetching http://www.plurk.com/m/p/178i6r
fetching http://www.plurk.com/m/p/178g4g
found young in
JeffYoung says On my way to pick up window unit from friend. A/C people won't be able to come until tomorrow at the earliest.
on 2009-07-07 at 00:34

Responses:
kxxn is glad you have that option!
JeffYoung says Me too!

http://www.plurk.com/m/p/178g4g



So, how can I get this goodness? First off, you have to be familiar with a command line.

Save the code below in a file, "plurkgrep.pl" or some such. On a UNIX box or a Mac you will have to be geeky enough to make it executable and put a header line on it with the location of your Perl binary, or call it as "perl plurkgrep.pl". You may also need to install "Term::ReadKey" from CPAN.

The first time you run it you will have to use the "-u user" option, and it will prompt you for a password. It will store the cookie from plurk.com in a file in your home directory, plurk.cookies, so you won't have to log in every time.

If there's enough interest I could make this into a web page, but it would require users to enter their plurk password and trust me not to misuse it. Let me know what you think!


use strict;
use warnings;
use LWP;
use Term::ReadKey;
use HTTP::Cookies;

my $sleep = 0;
my $quiet = 0;
my $word  = 0;
my $user  = '';
my $now   = time;
my $then  = $now - 14 * 24 * 3600;

my $cookie_jar = HTTP::Cookies->new(file => "$ENV{'HOME'}/plurk.cookies", autosave => 1);
my $browser = LWP::UserAgent->new;
$browser->cookie_jar($cookie_jar);
$browser->agent('Gecko/2009060214 Firefox/3.0.11'); # Pretend to be Firefox

# Parse flags; whatever is left over is the search term
while (@ARGV and $ARGV[0] =~ /^-/) {
    my $flag = shift;
    if ($flag eq '-u') {
        $user = shift;
    }
    elsif ($flag eq '-d') {
        my $days = shift;
        $then = $now - 24 * 3600 * $days;
    }
    elsif ($flag eq '-w') {
        $word = 1;
    }
    elsif ($flag eq '-q') {
        $quiet = 1;
    }
    else { # -h or an unrecognized flag
        usage();
    }
}
usage() if (!scalar(@ARGV));

my $term = join(' ', @ARGV);
$term = '\b' . $term . '\b' if ($word);
my $host = 'http://www.plurk.com';
my $url = "$host/m";
my $login_url = "$host/m/login";

my $html = fetch($url);
while ($html =~ m!<form action="/m/login" method="post">!) {
    # We got the login form back, so we need to log in; read the
    # password without echoing it
    print('Password: ');
    ReadMode('noecho');
    my $password = ReadLine(0);
    ReadMode('restore');
    print("\n");
    chomp($password);
    my $req = HTTP::Request->new(POST => $login_url);
    $req->content_type('application/x-www-form-urlencoded');
    $req->content('username=' . $user . '&password=' . $password);
    my $res = $browser->request($req);

    # Check the outcome of the response
    if (!$res->is_success) {
        die("POST $login_url failed with " . $res->status_line . "\n");
    }
    $html = $res->content;
}

# We now have the first (current) page of plurks
my $next_url = '';
my $time = $now;
while ($time >= $then) {
    my @plurk_html = split(/<div class="plurk">/, $html);
    for my $plurk (@plurk_html) {
        if ($plurk =~ m!<span class="meta">.*<a [^>]* href="([^"]*)">\d*\s*responses?</a>!m) {
            doGrep($1);
        }
    }
    # Find the link to the next (older) page
    if ($html =~ m!<div\s+class="pagination">\s*[^*]*<a\s+href="(\?offset=(\d+)[^"]*)">next!m) {
        $next_url = $1;
        $time = $2;
    }
    else {
        die("Can't find next page\n");
    }
    $html = timeline("?offset=$time");
}

# Fetch a single plurk (with its responses) and do the grep
sub doGrep
{
    my $plurk = shift;
    my $html = fetch("$host$plurk");
    # Trim everything except the plurk and its responses
    $html =~ s!<h3>Respond:</h3>.*!!m;
    $html =~ s!<h3><a href="#top">back to top</a>[\s\S]*!!m;
    $html =~ s!^[\s\S]*<div class="bigplurk">!!m;
    # Clean up formatting
    my $txt = $html;
    $txt =~ s!\s+<span class="qualifier [^"]*">([^\<]+)</span>\s+! $1 !gm;
    # Remove HTML tags
    $txt =~ s/\<[^\<]+\>//g;
    # Collapse whitespace for matching, but print the original layout
    (my $grep = $txt) =~ s/\s+/ /mg;
    if ($grep =~ m/$term/i) {
        print("found $term in\n$txt\n$host$plurk\n");
        exit(0); # Stop at the first match
    }
}

sub timeline
{
    my $offset = shift;
    return fetch("$url$offset");
}

# GET a page, throttled to one request per second so plurk.com
# doesn't object to the access rate
sub fetch
{
    my $page = shift;
    warn("fetching $page\n") unless ($quiet);
    sleep($sleep);
    $sleep = 1;
    my $req = HTTP::Request->new(GET => $page);
    my $res = $browser->request($req);
    if (!$res->is_success) {
        die("GET $page failed with " . $res->status_line . "\n");
    }
    return $res->content;
}

sub usage
{
    die <<_;
usage: $0 [-u user] [-d days] [-w] [-q] term(s)
    -u user supplies the user name (if you have to log in)
    -d days tells how many days back to search (default is 14)
    -w      searches for a whole word (so a search for "friend" would not
            match "friendly")
    -q      search quietly (see below)
    term(s) a word or phrase to match. Keep in mind you have the full
            power of Perl's regular expressions in searching.

$0 will print out the text of the first matching plurk (including replies),
and will also print out the URL of that plurk. Because plurk.com has
restrictions on how fast pages can be accessed, $0 pulls down 1
plurk/second. Because of this, it can appear to be stuck, so it prints out
each URL it fetches as it works, so you know it's working. If you don't like
that behavior, -q searches quietly, only printing the result.
_
}
