Introduction
The Perl module WWW::Mechanize::Shell is a brilliant tool for browsing websites at a very low level - think somewhere in between using telnet and using a command-line browser like lynx, links or w3m and you'll be close. WWW::Mechanize::Shell is more than that though: it lets you script a complete HTTP session so that it can be replayed later without any interaction, using WWW::Mechanize::Shell's parent Perl module WWW::Mechanize - great for automatically submitting HTML forms/POST data regularly via a cron job, for example.
In this article I'll cover installing WWW::Mechanize::Shell, walk through a typical WWW::Mechanize::Shell browsing session and look at some examples of how I use WWW::Mechanize::Shell to make things easier. Finally, the article ends with a real world example - using mechshell to automate logging into FreshPorts and updating a watch list.
Installing WWW::Mechanize::Shell
As the name suggests, WWW::Mechanize::Shell is a Perl module whose 'parent' is the WWW::Mechanize module written by Andy Lester (WWW::Mechanize::Shell itself is written by Max Maischein at the time of original writing). WWW::Mechanize does all the work in the background - WWW::Mechanize::Shell just makes it easy to interact with an HTTP session. WWW::Mechanize::Shell and all its dependencies can be installed from the ports tree:
CODE:
root@users /root# cd /usr/ports/www/p5-WWW-Mechanize-Shell/
root@users /home/munk/ports/www/p5-WWW-Mechanize-Shell# make install
root@users /home/munk/ports/www/p5-WWW-Mechanize-Shell# rehash
Getting Started Using WWW::Mechanize::Shell
Once installed, start up WWW::Mechanize::Shell using the following command line:
CODE:
root@users /root# perl -MWWW::Mechanize::Shell -eshell
To make things easier though I use a CSH shell alias which aliases 'mechshell' to the command above:
CODE:
root@users /root# grep mechshell .cshrc
alias mechshell perl -MWWW::Mechanize::Shell -eshell
Examples of WWW::Mechanize Usage
I usually use WWW::Mechanize when I want to manipulate data from websites that require a stateful HTTP session - i.e. a browsing session where there's more than one URL you have to visit to complete the 'session'. Usually these kinds of stateful sessions involve logging into the website first and then browsing to another page to obtain the data. The WWW::Mechanize Perl script then handles the data and returns any results on the command line.
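As a rough illustration of that pattern, here is a minimal sketch of a log-in-then-fetch session. It is not taken from any of my real scripts: the URLs, the 'username' and 'password' field names, the 'usage' markup and the helper names extract_usage and fetch_usage are all made up for the example, so adjust them to whatever the 'dump' command shows for your site.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# All URLs and form field names below are hypothetical placeholders.
my $LOGIN_URL = 'http://www.example.com/login';
my $DATA_URL  = 'http://www.example.com/account/usage';

# Pull the data we care about out of the returned HTML. The markup
# matched here is an assumption made for the sake of the example.
sub extract_usage {
    my ($html) = @_;
    return $html =~ m{<span id="usage">([^<]+)</span>} ? $1 : undef;
}

# Step through the session: log in, fetch the data page, return the data.
sub fetch_usage {
    my ($user, $pass) = @_;
    require WWW::Mechanize;    # loaded only when a session is actually run
    my $mech = WWW::Mechanize->new( autocheck => 1 );    # die on HTTP errors

    $mech->get($LOGIN_URL);
    # Select the form containing these fields, fill them in and submit it.
    $mech->submit_form(
        with_fields => { username => $user, password => $pass },
    );

    # Cookies set at login are kept by $mech and sent with this request.
    $mech->get($DATA_URL);
    return extract_usage( $mech->content );
}

# Usage: perl session_sketch.pl USER PASS
if ( @ARGV == 2 ) {
    my $usage = fetch_usage(@ARGV);
    print defined $usage ? "$usage\n" : "usage figure not found\n";
}
```

Keeping the HTML parsing in its own function means it can be tried out against a saved copy of the page without logging in each time.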
Some examples of scripts that I've used WWW::Mechanize with:
eclipse_flex_speed.pl
My ISP (Eclipse UK) used to allow you to 'flex' your internet speed from 256k up to 2Mbps. They ran an offer for a while where you could flex to the max for 3 months - unfortunately you could only flex for 12 hours at a time, which meant logging into the control panel every 12 hours, selecting the maximum speed and then submitting the form. PITA basically.
Instead I wrote eclipse_flex_speed.pl to automatically log in to the Eclipse control panel, 'click' the 2Mbps radio button and then submit the form so my speed got flexed automagically. I then added the script as a cron job to run every 12 hours, saving the hassle of doing it all manually!
aod_get.pl
The BBC website allows you to listen to streams of all BBC radio broadcasts for up to a week after they've been aired live. The problem is that the web interface you listen to the stream on in your web browser only allows you to skip 5 or 15 minutes ahead in time and doesn't allow you to go to specific times in the stream. To get around this you can obtain the URL of the RealPlayer stream and open it in a standalone RealPlayer - doing this you can go to any point in the stream easily. Trouble is, finding the URL of the stream isn't that easy and involves viewing the source HTML of the web UI and copy/pasting a partial URL.
I started to write a WWW::Mechanize script to automate the 'screen scraping' of all the available feeds from the BBC Audio On Demand site and list them on one single HTML page, linking the name of each feed to its RealPlayer feed URL. As it goes though, someone else - Dave Cross - already had the same idea and wrote a great script for scraping the BBC feeds automatically. I now run this in a cron job once a week.
torrentflux_ctl.pl
This is a script for starting and stopping all torrents under the control of the torrentflux web based bittorrent client. The script logs in as the torrent owner and then stops or starts all the torrents for that user - basically just does a GET of a URL that causes torrentflux to stop or start all torrents. Crude but effective.
Real world example - Automating the update of FreshPorts watch list
Below is a real world example of using WWW::Mechanize::Shell - automating the procedure of updating your watch list on FreshPorts.org. I've included comments as '# this is a comment' to help explain what each command is doing.
CODE:
# Start up mechshell - alias for 'perl -MWWW::Mechanize::Shell -eshell':
munk@users /home/munk# mechshell
# Request the URL http://www.freshports.org/login.php.
# Note the HTTP response '(200)' is displayed underneath
# to indicate the page was fetched successfully:
(no url)>get http://www.freshports.org/login.php
Retrieving http://www.freshports.org/login.php(200)
# Use the mechshell 'dump' command to dump the contents
# of all forms found on the login page:
http://www.freshports.org/login.php>dump
POST http://www.freshports.org/login.php?origin=%2F [l]
custom_settings=1 (hidden readonly)
LOGIN=1 (hidden readonly)
UserID= (text)
Password= (password)
submit=Login (submit)
<NONAME>=<UNDEF> (reset)
# There's just a single form on this page:
# - the form's 'ACTION' is set to submit the form using the POST method
# to the url http://www.freshports.org/login.php?origin=%2F
# The form contains the following form fields:
# - 2 hidden fields
# - 1 text field called 'UserID'
# - 1 password field called 'Password'
# - 1 submit button called 'submit' (labelled 'Login')
# - 1 unnamed reset button
# Fill in the 'UserID' and 'Password' fields:
http://www.freshports.org/login.php>value UserID munk
http://www.freshports.org/login.php>value Password xxxxxx
# And then submit the form. Note we can just use the mechshell 'submit'
# command here because there is only a single submit button on the form. If
# there were more than one submit button we would need to specify exactly which
# button to click:
http://www.freshports.org/login.php>submit
200
# Again note that the '200' response indicates the request was successful.
# Also note that the next mechshell prompt below has changed from
# 'http://www.freshports.org/login.php>' to just 'http://www.freshports.org/' -
# this indicates that the login script has probably redirected us to the
# freshports home page.
# Now we take a look to check that the login succeeded ok. To do this we use
# the mechshell 'content' command which effectively dumps the content of the
# returned page back at us in a pager.
# What we're looking for is the text 'Logged in as munk' which will indicate we
# logged in ok:
http://www.freshports.org/>content
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<HTML>
<HEAD>
-snip-
<td NOWRAP><FONT SIZE="-1">Logged in as munk</FONT><br><FONT SIZE="-1"><a href="/customize.php?origin=%2F" title="Customize your settings">Customize</a
></FONT><br><FONT SIZE="-1"><a href="/logout.php" title="Logout of the website">Logout</a></FONT><br><FONT SIZE="-1"><a href="/my-flagged-commits.php" title="Li
st of commits you have flagged">My Flagged Commits</a></FONT><br>
-snip-
# Now we're logged in ok we can continue to upload the mypkg_info.txt file,
# which holds the output of 'pkg_info -qoa' saved beforehand to
# /tmp/mypkg_info.txt.
# First browse to the pkg_upload.php page:
http://www.freshports.org/>get http://www.freshports.org/pkg_upload.php
Retrieving http://www.freshports.org/pkg_upload.php(200)
# Now use 'dump' to see a list of form fields on this page.
# Note that there are 2 submit buttons on this page:
http://www.freshports.org/pkg_upload.php>dump
POST http://www.freshports.org/pkg_upload.php (multipart/form-data)
pkg_info= (file)
staging=Staging (submit)
wlid=5393 (option) [*5393/main*]
replaceappend=replace (radio) [*replace/Replace list contents|append/Append to list (duplicates will be removed)]
upload=Upload (submit)
# We need to fill out the form here. Uploading files with mechshell is as
# simple as setting the file field to the path of the file to upload:
http://www.freshports.org/pkg_upload.php>value pkg_info /tmp/mypkg_info.txt
# Ok, now we're ready to submit the form.
# Note that because there are 2 submit buttons on this form, we must explicitly
# tell mechshell which button it is that we want to click on - to do that we use
# the 'click' command. Just using 'submit' here would possibly click on the
# 'staging' button which is not what we want - instead we use the command
# 'click upload' to indicate we want to click on the 'upload' button:
http://www.freshports.org/pkg_upload.php>click upload
(200)
# Success! It's a good idea now to just check that this worked by browsing in
# a web browser to your watch list and checking the new items were updated ok (of
# course you can do this in mechshell if you want but I'll leave that out here!).
# Finally, the really cool bit. The mechshell 'script' command will dump out
# the Perl code required to perform all of the above actions again - just copy
# it into a Perl script:
http://www.freshports.org/pkg_upload.php>script
#!perl -w
use strict;
use WWW::Mechanize;
use WWW::Mechanize::FormFiller;
use URI::URL;
-snip-
# Also, if you provide a filename as an argument to the 'script' command,
# mechshell will dump all the script commands to that filename:
http://www.freshports.org/pkg_upload.php>script /tmp/freshports_update.pl
# Finally, use 'quit' to exit the mechshell:
http://www.freshports.org/pkg_upload.php>quit
munk@users /home/munk#
Now all that remains is to open up /tmp/freshports_update.pl and tidy the script up so that it's more suitable for automated use via cron. For example, any 'dump' and 'content' commands can be taken out - these would only cause problems anyway if run from a non-interactive shell as used by cron.
We also need to add some code to have the script dump the contents of 'pkg_info -qoa' to a temporary file prior to uploading.
The completed 'quick and dirty' hack looks like this then:
CODE:
#!/usr/bin/perl -w
use strict;
use WWW::Mechanize;
use WWW::Mechanize::FormFiller;
use URI::URL;
# FreshPorts username/pass:
my $user="munk";
my $pass="xxxxx";
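# Temp location to store output from 'pkg_info -qoa'
# (the /tmp/freshports directory must already exist):
my $mypkg_info="/tmp/freshports/mypkg_info.txt";
# prepare file containing output from: pkg_info -qoa
system("pkg_info -qoa > $mypkg_info") == 0
    or die "pkg_info failed (exit status $?)\n";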
# Prepare WWW::Mechanize:
my $agent = WWW::Mechanize->new( autocheck => 1 );
my $formfiller = WWW::Mechanize::FormFiller->new();
$agent->env_proxy();
# Login to FreshPorts:
$agent->get('http://www.freshports.org/login.php');
$agent->form_number(1) if $agent->forms and scalar @{$agent->forms};
{ local $^W; $agent->current_form->value('UserID', $user); };
{ local $^W; $agent->current_form->value('Password', $pass); };
$agent->submit();
# Submit pkg_info details to FreshPorts pkg_upload page:
$agent->get('http://www.freshports.org/pkg_upload.php');
$agent->form_number(1) if $agent->forms and scalar @{$agent->forms};
{ local $^W; $agent->current_form->value('pkg_info', $mypkg_info); };
$agent->click('upload');
# Remove temporary file:
unlink $mypkg_info or warn "Could not remove $mypkg_info: $!\n";
After saving the script and making the file executable, an entry can then be added to cron to have the script automatically update the list of ports at FreshPorts once a week, or however often you require it to be updated - once a week is more than enough for me. Sorted! :)
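One thing the quick and dirty script doesn't do is the check made by eye in the interactive session, where the 'content' command was used to look for the text 'Logged in as munk'. A cron job can't eyeball anything, so here is a minimal sketch of how that check might be done in code. The helper name login_succeeded is my own invention for this example, and it assumes FreshPorts still prints that marker text after a successful login.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return 1 if the page content shows that $user is logged in, else 0.
# The marker text 'Logged in as <user>' is the one seen in the
# interactive mechshell session; \Q...\E stops any regex characters
# in the username from being treated as part of the pattern.
sub login_succeeded {
    my ($content, $user) = @_;
    return $content =~ /Logged in as \Q$user\E/ ? 1 : 0;
}

# Intended use in the cron script, straight after the login submit:
#   login_succeeded($agent->content, $user)
#       or die "FreshPorts login failed for $user\n";
```

Because cron mails any output from a job to its owner, dying with a message here means a failed login gets noticed rather than silently uploading nothing.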