Wednesday, August 15. 2007
Command Line Web Browsing With WWW::Mechanize::Shell
Introduction
The perl module WWW::Mechanize::Shell is a brilliant tool for browsing websites at a very low level - think somewhere in between using telnet and using a command line based browser like lynx, links or w3m and you'll be close. WWW::Mechanize::Shell is more than that though: it allows you to script a complete HTTP session so it can be replayed at a later date without any interaction, using WWW::Mechanize::Shell's parent perl module WWW::Mechanize - great for automatically submitting HTML forms/POST data regularly via a cron job, for example. In this article I'll cover installing WWW::Mechanize::Shell, walk through a typical WWW::Mechanize::Shell browsing session and look at some examples of how I use WWW::Mechanize to make things easier. Finally the article ends with a real world example - using mechshell to automate logging into FreshPorts and updating a watch list.
Installing WWW::Mechanize::Shell
As the name suggests, WWW::Mechanize::Shell is a perl module whose 'parent' is the WWW::Mechanize module written by Andy Lester (WWW::Mechanize::Shell itself is written by Max Maischein at time of original writing). WWW::Mechanize does all the work in the background - WWW::Mechanize::Shell just makes it easy to interact with an HTTP session.
WWW::Mechanize::Shell and all its dependencies can be installed from the ports tree:
CODE:
root@users /root# cd /usr/ports/www/p5-WWW-Mechanize-Shell/
root@users /home/munk/ports/www/p5-WWW-Mechanize-Shell# make install
root@users /home/munk/ports/www/p5-WWW-Mechanize-Shell# rehash
Getting Started Using WWW::Mechanize::Shell
Once installed, start up WWW::Mechanize::Shell using the following commandline:
CODE:
root@users /root# perl -MWWW::Mechanize::Shell -eshell
To make things easier though I use a CSH shell alias which aliases 'mechshell' to the command above:
CODE:
root@users /root# grep mechshell $cshrc
alias mechshell perl -MWWW::Mechanize::Shell -eshell
Examples of WWW::Mechanize Usage
I usually use WWW::Mechanize when I want to manipulate data from websites that require a stateful HTTP session - ie a browsing session where there's more than one URL you have to visit to complete the 'session'. Usually these kinds of stateful sessions involve logging into the website first, then browsing to another page to obtain the data, after which I have the WWW::Mechanize perl script handle the data and return any results on the commandline. Some examples of scripts that I've used WWW::Mechanize with:
eclipse_flex_speed.pl
My ISP (Eclipse UK) used to allow you to 'flex' your internet speed from 256k up to 2Mbps. They ran an offer for a while where you could flex to the max for 3 months - unfortunately you could only flex for 12 hours at a time, which meant logging into the control panel every 12hrs, selecting the maximum speed and then submitting the form. PITA basically. Instead I wrote eclipse_flex_speed.pl to automatically login to the Eclipse control panel, 'click' the 2Mbps radio button and then submit the form so my speed got flexed automagically. I then added the script as a cron job to autorun every 12 hours, saving the hassle of doing it all manually!
aod_get.pl
The BBC website allows you to listen to streams of all BBC radio broadcasts for up to a week after they've been aired live. The problem is that the web interface you listen to the stream on in your web browser only allows you to skip 5 or 15 minutes ahead in time and doesn't allow you to go to specific times in the stream. To get around this you can obtain the URL of the RealPlayer stream and open it in a standalone RealPlayer - doing this you can go to any point in the stream easily. Trouble is, finding the URL of the stream isn't that easy and involves viewing the source HTML of the web UI and copy/pasting a partial URL. I started to write a WWW::Mechanize script to automate the 'screen scraping' of all the available feeds from the BBC Audio On Demand site and list them on one single HTML page, linking the name of each feed to its RealPlayer feed URL. As it goes though, someone else - Dave Cross - already had the same idea and wrote a great script for scraping the BBC feeds automatically. I now run this in a cronjob once a week.
torrentflux_ctl.pl
This is a script for starting and stopping all torrents under the control of the torrentflux web based bittorrent client. The script logs in as the torrent owner and then stops or starts all the torrents for that user - basically it just does a GET of a URL that causes torrentflux to stop or start all torrents. Crude but effective.
Real world example - Automating the update of FreshPorts watch list
Below is a real world example usage of WWW::Mechanize::Shell - automating the procedure of updating your watch list on Freshports.org. I've included comments as '# this is a comment' to help explain what each command is doing.
CODE:
# Start up mechshell - alias for 'perl -MWWW::Mechanize::Shell -eshell':
munk@users /home/munk# mechshell
# Request the URL http://www.freshports.org/login.php.
# Note the HTTP response '(200)' is displayed underneath
# to indicate the page was fetched successfully:
(no url)>get http://www.freshports.org/login.php
Retrieving http://www.freshports.org/login.php(200)
# Use the mechshell 'dump' command to dump the contents
# of all forms found on the login page:
http://www.freshports.org/login.php>dump
POST http://www.freshports.org/login.php?origin=%2F [l]
  custom_settings=1 (hidden readonly)
  LOGIN=1 (hidden readonly)
  UserID= (text)
  Password= (password)
  submit=Login (submit)
  <NONAME>=<UNDEF> (reset)
# There's just a single form on this page:
# - the form's 'ACTION' is set to submit the form using the POST method
#   to the url http://www.freshports.org/login.php?origin=%2F
# The form contains the following form fields:
# - 2 hidden fields
# - 1 text field called 'UserID'
# - 1 password field called 'Password'
# - 1 submit button called 'submit' (labelled 'Login')
# - 1 unnamed reset button
# Fill in the 'UserID' and 'Password' fields:
http://www.freshports.org/login.php>value UserID munk
http://www.freshports.org/login.php>value Password xxxxxx
# And then submit the form. Note we can just use the mechshell 'submit'
# command here because there is only a single form with a single submit
# button on the page. If there were more than one we would need to specify
# which button exactly to click:
http://www.freshports.org/login.php>submit
200
# Again note that the '200' response indicates the request was successful.
# Also note that the next mechshell prompt below has changed from
# 'http://www.freshports.org/login.php>' to just 'http://www.freshports.org/>' -
# this indicates that the login script has probably redirected us to the
# freshports home page.
# Now we take a look to check that the login succeeded ok. To do this we use
# the mechshell 'content' command which effectively dumps the content of the
# returned page back at us in a pager.
# What we're looking for is the text 'Logged in as munk' which will indicate we
# logged in ok:
http://www.freshports.org/>content
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<HTML>
<HEAD>
-snip-
<td NOWRAP><FONT SIZE="-1">Logged in as munk</FONT><br><FONT SIZE="-1"><a href="/customize.php?origin=%2F" title="Customize your settings">Customize</a></FONT><br><FONT SIZE="-1"><a href="/logout.php" title="Logout of the website">Logout</a></FONT><br><FONT SIZE="-1"><a href="/my-flagged-commits.php" title="List of commits you have flagged">My Flagged Commits</a></FONT><br>
-snip-
# Now we're logged in ok we can continue to upload the mypkg_info.txt file
# (the output of 'pkg_info -qoa' saved to /tmp/mypkg_info.txt).
# First browse to the pkg_upload.php page:
http://www.freshports.org/>get http://www.freshports.org/pkg_upload.php
Retrieving http://www.freshports.org/pkg_upload.php(200)
# Now use 'dump' to see a list of form fields on this page.
# Note that there are 2 submit buttons on this page:
http://www.freshports.org/pkg_upload.php>dump
POST http://www.freshports.org/pkg_upload.php (multipart/form-data)
  pkg_info= (file)
  staging=Staging (submit)
  wlid=5393 (option) [*5393/main*]
  replaceappend=replace (radio) [*replace/Replace list contents|append/Append to list (duplicates will be removed)]
  upload=Upload (submit)
# We need to fill out the form here. Uploading files with mechshell is as
# simple as setting the 'file' type field to the path of the file:
http://www.freshports.org/pkg_upload.php>value pkg_info /tmp/mypkg_info.txt
# Ok, now we're ready to submit the form.
# Note that because there are 2 submit buttons on this form, we must explicitly
# tell mechshell which button it is that we want to click on - to do that we use
# the 'click' command.
# Just using 'submit' here would possibly click on the
# 'staging' button which is not what we want - instead we use the command
# 'click upload' to indicate we want to click on the 'upload' button:
http://www.freshports.org/pkg_upload.php>click upload
(200)
# Success! It's a good idea now to just check that this worked by browsing in
# a web browser to your watch list and checking the new items were updated ok (of
# course you can do this in mechshell if you want but I'll leave that out here!).
# Finally, the really cool bit. The mechshell 'script' command will dump out
# the perl code required to perform all of the above actions again if you copy
# it into a perl script:
http://www.freshports.org/pkg_upload.php>script
#!perl -w
use strict;
use WWW::Mechanize;
use WWW::Mechanize::FormFiller;
use URI::URL;
-snip-
# Also, if you provide a filename as an argument to the 'script' command,
# mechshell will dump all the script commands to that filename:
http://www.freshports.org/pkg_upload.php>script /tmp/freshports_update.pl
# Finally, use 'quit' to exit the mechshell:
http://www.freshports.org/pkg_upload.php>quit
munk@users /home/munk#
Now all that remains is to open up /tmp/freshports_update.pl and tidy the script up so that it's more suitable for automated use via cron. For example, any 'dump' and 'content' commands can be taken out - these would only cause problems anyway if run from a non-interactive shell as used by cron. We also need to add some code to have the script dump the output of 'pkg_info -qoa' to a temporary file prior to uploading.
The completed 'quick and dirty' hack looks like this then:
CODE:
#!/usr/bin/perl -w
use strict;
use WWW::Mechanize;
use WWW::Mechanize::FormFiller;
use URI::URL;

# FreshPorts username/pass:
my $user = "munk";
my $pass = "xxxxx";

# Temp location to store output from 'pkg_info -qoa'
# (the /tmp/freshports directory must already exist):
my $mypkg_info = "/tmp/freshports/mypkg_info.txt";

# Prepare file containing output from: pkg_info -qoa
system("pkg_info -qoa > $mypkg_info") == 0
    or die "pkg_info failed: $?";

# Prepare WWW::Mechanize:
my $agent = WWW::Mechanize->new( autocheck => 1 );
my $formfiller = WWW::Mechanize::FormFiller->new();
$agent->env_proxy();

# Login to FreshPorts:
$agent->get('http://www.freshports.org/login.php');
$agent->form_number(1) if $agent->forms and scalar @{$agent->forms};
{ local $^W; $agent->current_form->value('UserID', $user); };
{ local $^W; $agent->current_form->value('Password', $pass); };
$agent->submit();

# Submit pkg_info details to FreshPorts pkg_upload page:
$agent->get('http://www.freshports.org/pkg_upload.php');
$agent->form_number(1) if $agent->forms and scalar @{$agent->forms};
{ local $^W; $agent->current_form->value('pkg_info', $mypkg_info); };
$agent->click('upload');

# Remove temporary file:
unlink $mypkg_info;
After saving the script and making the file executable, an entry can then be added to cron to have the script auto update the list of ports at FreshPorts once a week - or however often you require it to be updated; once a week is more than enough for me. Sorted! :)
Saturday, November 18. 2006
Let root see all files with locate
The locate utility I knew from Linux was one of the first tools I reached for when I made the move to FreeBSD a few years back - knowing where files are is half the battle when you're trying to configure things and find documentation on how to do it. The trouble with locate though, as jdarnold mentions in his article 'Locate This!', is that if you build the locate database as 'root', you end up exposing everything to any user that runs the locate command. The other problem he mentions is that the locate db is only updated weekly on FreeBSD by default via the periodic system, which isn't really enough if you use your system regularly.
I remember thinking along the same lines a while back and after reading through the man pages the solution I found was to create two separate databases - one for root and one for regular users. The 'regular' db is updated on a weekly basis as per the default on FreeBSD via periodic, whereas the other 'root' locate db is built daily from a crontab so I can get the latest up to date details on which files are where. To get the root db built, first you need to create a crontab entry - I put this in /etc/crontab:
CODE:
39 2 * * * root env -i LOCATE_CONFIG=/root/locate/conf/locate.rc /usr/libexec/locate.updatedb > /dev/null 2>&1
This tells the locate.updatedb script to use a separate configuration file - /root/locate/conf/locate.rc - for building root's locate db. The content of /root/locate/conf/locate.rc looks like this:
CODE:
FCODES="/root/locate/db/locate.database.root"
which indicates that this db should be built in /root/locate/db/locate.database.root instead of the default location, /var/db/locate.database. You can safely run the command as root on the commandline to initialize your new db:
CODE:
root@users /root# env -i LOCATE_CONFIG=/root/locate/conf/locate.rc /usr/libexec/locate.updatedb
Once the database is built you can move on to test the new db works ok:
CODE:
root@users /root# locate -d /root/locate/db/locate.database.root .cshrc.root
/root/.cshrc.root
This file is only readable by root, so it seems to work ok. To make things easier, add a shell alias in root's .cshrc file aliasing 'locate' to the command 'locate -d /root/locate/db/locate.database.root':
CODE:
root@users /root# grep locate $cshrc
alias locate locate -d /root/locate/db/locate.database.root
With the "-d /root/locate/db/locate.database.root" switch, locate will use the db at /root/locate/db/locate.database.root instead of the default /var/db/locate.database and root will be able to use locate to find any files in the filesystem, not just those that are world readable.
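The directory and config file setup described above could be scripted along these lines. This is only a minimal sketch: the function name setup_root_locate and its base-directory argument are made up for illustration, and the chmod 700 is an extra precaution of mine rather than something from the steps above. Only the FCODES variable and the conf/ and db/ layout come from the article.

```shell
#!/bin/sh
# Hypothetical helper: create the directory layout and locate.rc for a
# private locate database under the given base directory
# (the article uses /root/locate).
setup_root_locate() {
    base="$1"
    mkdir -p "$base/conf" "$base/db" || return 1
    # Assumption: keep the whole tree private to its owner.
    chmod 700 "$base" "$base/conf" "$base/db" || return 1
    # FCODES tells locate.updatedb where to write the database.
    printf 'FCODES="%s/db/locate.database.root"\n' "$base" \
        > "$base/conf/locate.rc"
}
```

Calling 'setup_root_locate /root/locate' as root would produce the same locate.rc as shown above, after which the 'env -i LOCATE_CONFIG=...' command can be run to build the db.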
Finally, one way to update the regular locate db as root but without making it list files that aren't world readable is to perform the following:
CODE:
#!/bin/sh
# make sure db file exists:
touch /var/db/locate.database
# then change ownership to the nobody user:
chown nobody /var/db/locate.database
# make it writeable by nobody and readable by everyone else:
chmod 644 /var/db/locate.database
# then move on to update the db...
# first make sure we're in the / folder where the db update starts:
cd /
# then finally run the updatedb command as the 'nobody' user:
echo "/usr/libexec/locate.updatedb" | su - -fm nobody
This is basically what the 310.locate periodic script does and results in a locate db that contains only files that are readable by the 'nobody' user - essentially all 'world readable' files. Comparing the sizes of the root db against the nobody db:
CODE:
root@users /# ls -al /var/db/locate.database /root/locate/db/locate.database.root
-rw-r--r-- 1 root wheel 4070484 Nov 18 02:45 /root/locate/db/locate.database.root
-rw-r--r-- 1 nobody wheel 3280409 Nov 18 11:41 /var/db/locate.database
You can see the size difference there: not as many entries in nobody's db as in root's. Just to double check:
CODE:
root@users /root# locate ktrace.out
/root/bin/ktrace.out
/root/ktrace.out
/usr/local/etc/snort/ktrace.out
root@users /root# echo "locate ktrace.out" | su - -fm nobody
/usr/local/etc/snort/ktrace.out
So from that you can see that no user (the 'nobody' user included) can see the ktrace.out files located in /root - apart from root of course :) Sorted.
Sunday, November 5. 2006
Expand shell globs using 'ctrl-z'
Just noticed a semi useful feature of the CSH shell (shells in general? not tested it) whilst running 'rm -rf *' in a directory. Got a bit paranoid I was doing something silly (running the command as root), so hit 'ctrl-z' to suspend the process and the '*' part was shown expanded in the job control list:
CODE:
[11:21:24] root@users /usr/local/www/web/torrentflux.munk.me.uk# rm -rf *
^Z
[1] + Suspended rm -rf TF_BitTornado adodb downloads images language mods searchEngines themes
Yay. Useful tip #341 you'll probably never use but at some point in the future think mmm... now where did I read about that thing about this thing...
Wednesday, October 18. 2006
CSH Tips: Auto 'whereis' on the tcsh command line
Just read this entry about the 'whereis' command on Unix and it reminded me of another great shell tip for tcsh users (csh/tcsh on FreeBSD since they're the same thing!) - the shell can 'normalize' any command on the command line if you bind the normalize function to a keystroke - this allows you to easily see how a command would be expanded when it's executed. Somewhat esoteric without an example, but I use it so much I thought I'd post about it.
First off set a key binding for the normalize command - in the shell type in:
CODE:
bindkey "^W" normalize-command
or add it to ~/.cshrc to make it permanent. Obviously you can set it to whatever keybinding you want, I use ctrl-w. Now type in any command that you'd use on the commandline and then whilst the cursor is at the end of the command, hit the keystroke you bound to the normalize command - in our case above, ctrl-w. The command you entered should automatically get expanded to the absolute path of the command. For example if I type in:
CODE:
ls
and then hit 'ctrl-w' whilst the cursor is just after the 's', the result will look like:
CODE:
/bin/ls
Magic! Like I say I use this function quite a lot on the commandline, particularly when I want to see how a command I enter will get expanded - the normalize function works on aliases as well as just plain commands, so it's quicker to type 'portupgrade^w' than to type in 'alias portupgrade' to see how I've got my portupgrade alias set up. It's also great for quickly editing system executables from the commandline without having to remember where the file/script is or cut/paste the results from 'whereis' or 'locate' etc.
Tuesday, October 10. 2006
Shell Tip: Review your most commonly used command lines
Just got through reading an interesting article on how to review your most commonly used Unix commands. The idea is to sort the most commonly used commands numerically with a view to maybe shortening the most common ones using aliases, similar in a way to the time saving article mentioned here a while ago.
(The lifehacker article is actually just picking up on the original article on IBM's site entitled Unix productivity tips, which is a good read for anyone wanting to improve their efficiency on the shell command line.) Note for tcsh on FreeBSD, the command you probably want to use is this:
CODE:
history | tail -1000 | awk '{print $3}' | sort | uniq -c | sort -r
which outputs this kind of list for my shell history (listing top 10 used commands using '| head -10'):
CODE:
; history | tail -1000 | awk '{print $3}' | sort | uniq -c | sort -r | head -10
 212 sc
 158 m
  96 fg
  75 s
  68 cd
  56 vi
  37 ls
  26 grep
  19 man
  18 alias
I'm pretty happy with that, most of the commands are either 1 or 2 character aliases at least :)
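The pipeline above can be wrapped up as a small sh function. This is just a sketch, assuming history lines look like tcsh's 'number time command ...' so that the command is field 3, as in the awk call above. The name top_commands is made up for illustration, and I've used 'sort -rn' so the counts are compared numerically rather than relying on the padding that uniq adds.

```shell
#!/bin/sh
# top_commands [N]: read history-style lines on stdin and print the
# N most frequently used commands with their counts (default 10).
top_commands() {
    # Field 3 is the command name when each line is 'number time command ...'.
    awk '{print $3}' | sort | uniq -c | sort -rn | head -n "${1:-10}"
}
```

One way to use it from tcsh would be to save the history to a file first, e.g. 'history | tail -1000 > /tmp/hist.txt', and then feed that file to the function from sh with 'top_commands 10 < /tmp/hist.txt' (the file name is only an example).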

