Wednesday, August 15. 2007
Command Line Web Browsing With WWW::Mechanize::Shell
Introduction
The perl module WWW::Mechanize::Shell is a brilliant tool for browsing websites at a very low level - think somewhere in between using telnet and using a command line based browser like lynx, links or w3m and you'll be close. WWW::Mechanize::Shell is more than that though: it allows you to script a complete HTTP session so it can be replayed at a later date without any interaction, using WWW::Mechanize::Shell's parent perl module WWW::Mechanize - great for automatically submitting HTML forms/POST data regularly via a cron job, for example.

In this article I'll talk about installing WWW::Mechanize::Shell, walk through a typical WWW::Mechanize::Shell browsing session and look at some examples of how I use WWW::Mechanize to make things easier. Finally the article ends with a real world example - using mechshell to automate logging into FreshPorts and updating a watch list.

Installing WWW::Mechanize::Shell

As the name suggests, WWW::Mechanize::Shell is a perl module whose 'parent' is the WWW::Mechanize module written by Andy Lester (WWW::Mechanize::Shell itself is written by Max Maischein at time of original writing). WWW::Mechanize does all the work in the background - WWW::Mechanize::Shell just makes it easy to interact in an HTTP session.

WWW::Mechanize::Shell and all its dependencies can be installed from the ports tree:

CODE:
    root@users /root# cd /usr/ports/www/p5-WWW-Mechanize-Shell/
    root@users /home/munk/ports/www/p5-WWW-Mechanize-Shell# make install
    root@users /home/munk/ports/www/p5-WWW-Mechanize-Shell# rehash

Getting Started Using WWW::Mechanize::Shell

Once installed, start up WWW::Mechanize::Shell using the following commandline:

CODE:
    root@users /root# perl -MWWW::Mechanize::Shell -eshell

To make things easier though I use a CSH shell alias which aliases 'mechshell' to the command above:

CODE:
    root@users /root# grep mechshell $cshrc
    alias mechshell perl -MWWW::Mechanize::Shell -eshell

Examples of WWW::Mechanize Usage

I usually use WWW::Mechanize when I want to manipulate data from websites that require a stateful HTTP session - i.e. a browsing session where there's more than one URL you have to visit to complete the 'session'. Usually these kinds of stateful sessions involve logging into the website first, then browsing to another page to obtain the data; I then have the WWW::Mechanize perl script handle the data and return any results on the commandline.

Some examples of scripts that I've used WWW::Mechanize with:

eclipse_flex_speed.pl

My ISP (Eclipse UK) used to allow you to 'flex' your internet speed from 256k up to 2Mbps. They ran an offer for a while where you could flex to the max for 3 months - unfortunately you could only flex for 12 hours at a time, which meant logging into the control panel every 12hrs, selecting the maximum speed and then submitting the form. PITA basically. Instead I wrote eclipse_flex_speed.pl to automatically login to the Eclipse control panel, 'click' the 2Mbps radio button and then submit the form so my speed got flexed automagically. I then added the script as a cron job to autorun every 12 hours, saving the hassle of doing it all manually!

aod_get.pl

The BBC website allows you to listen to streams of all BBC radio broadcasts for up to a week after they've been aired live. The problem is that the web interface you listen to the stream on in your web browser only allows you to skip 5 or 15 minutes ahead in time and doesn't allow you to go to specific times in the stream. To get around this you can obtain the URL of the RealPlayer stream and open it in a standalone RealPlayer - doing this you can go to any point in the stream easily. Trouble is, finding the URL of the stream isn't that easy and involves viewing the source HTML of the web UI and copy/pasting a partial URL. I started to write a WWW::Mechanize script to automate the 'screen scraping' of all the available feeds from the BBC Audio On Demand site, listing them on one single HTML page that links the name of each feed to its RealPlayer feed URL. As it goes though, someone else - Dave Cross - already had the same idea and wrote a great script for scraping the BBC feeds automatically. I now run this in a cronjob once a week.

torrentflux_ctl.pl

This is a script for starting and stopping all torrents under the control of the torrentflux web based bittorrent client. The script logs in as the torrent owner and then stops or starts all the torrents for that user - basically it just does a GET of a URL that causes torrentflux to stop or start all torrents. Crude but effective.

Real world example - Automating the update of FreshPorts watch list

Below is a real world example usage of WWW::Mechanize::Shell - automating the procedure of updating your watch list on Freshports.org. I've included comments as '# this is a comment' to help explain what each command is doing.

CODE:
    # Start up mechshell - alias for 'perl -MWWW::Mechanize::Shell -eshell':
    munk@users /home/munk# mechshell

    # Request the URL http://www.freshports.org/login.php.
    # Note the HTTP response '(200)' is displayed underneath
    # to indicate the page was fetched successfully:
    (no url)>get http://www.freshports.org/login.php
    Retrieving http://www.freshports.org/login.php(200)

    # Use the mechshell 'dump' command to dump the contents
    # of all forms found on the login page:
    http://www.freshports.org/login.php>dump
    POST http://www.freshports.org/login.php?origin=%2F [l]
      custom_settings=1 (hidden readonly)
      LOGIN=1 (hidden readonly)
      UserID= (text)
      Password= (password)
      submit=Login (submit)
      <NONAME>=<UNDEF> (reset)

    # There's just a single form on this page:
    # - the form's 'ACTION' is set to submit the form using the POST method
    #   to the url http://www.freshports.org/login.php?origin=%2F
    # The form contains the following form fields:
    # - 2 hidden fields
    # - 1 text field called 'UserID'
    # - 1 password field called 'Password'
    # - 1 submit field called 'submit' (labelled 'Login')
    # - 1 unnamed reset field

    # Fill in the 'UserID' and 'Password' fields:
    http://www.freshports.org/login.php>value UserID munk
    http://www.freshports.org/login.php>value Password xxxxxx

    # And then submit the form. Note we can just use the mechshell 'submit'
    # command here because the page has a single form with a single submit
    # button. If there were more than one submit button we would need to
    # specify exactly which button to click:
    http://www.freshports.org/login.php>submit
    200

    # Again note that the '200' response indicates the request was successful.
    # Also note that the next mechshell prompt below has changed from
    # 'http://www.freshports.org/login.php>' to just 'http://www.freshports.org/' -
    # this indicates that the login script has probably redirected us to the
    # freshports home page.

    # Now we take a look to check that the login succeeded ok. To do this we use
    # the mechshell 'content' command which effectively dumps the content of the
    # returned page back at us in a pager.
    # What we're looking for is the text 'Logged in as munk' which will indicate we
    # logged in ok:
    http://www.freshports.org/>content
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
    <HTML>
    <HEAD>
    -snip-
    <td NOWRAP><FONT SIZE="-1">Logged in as munk</FONT><br>
    <FONT SIZE="-1"><a href="/customize.php?origin=%2F" title="Customize your settings">Customize</a></FONT><br>
    <FONT SIZE="-1"><a href="/logout.php" title="Logout of the website">Logout</a></FONT><br>
    <FONT SIZE="-1"><a href="/my-flagged-commits.php" title="List of commits you have flagged">My Flagged Commits</a></FONT><br>
    -snip-

    # Now we're logged in ok we can continue to upload the mypkg_info.txt file
    # (the output of 'pkg_info -qoa', saved to /tmp/mypkg_info.txt beforehand).
    # First browse to the pkg_upload.php page:
    http://www.freshports.org/>get http://www.freshports.org/pkg_upload.php
    Retrieving http://www.freshports.org/pkg_upload.php(200)

    # Now use 'dump' to see a list of form fields on this page.
    # Note that there are 2 submit buttons on this page:
    http://www.freshports.org/pkg_upload.php>dump
    POST http://www.freshports.org/pkg_upload.php (multipart/form-data)
      pkg_info= (file)
      staging=Staging (submit)
      wlid=5393 (option) [*5393/main*]
      replaceappend=replace (radio) [*replace/Replace list contents|append/Append to list (duplicates will be removed)]
      upload=Upload (submit)

    # We need to fill out the form here. Uploading files with mechshell is as
    # simple as setting the 'file' type field to the path of the file:
    http://www.freshports.org/pkg_upload.php>value pkg_info /tmp/mypkg_info.txt

    # Ok, now we're ready to submit the form.
    # Note that because there are 2 submit buttons on this form, we must explicitly
    # tell mechshell which button it is that we want to click on - to do that we use
    # the 'click' command.
    # Just using 'submit' here would possibly click on the
    # 'staging' button which is not what we want - instead we use the command
    # 'click upload' to indicate we want to click on the 'upload' button:
    http://www.freshports.org/pkg_upload.php>click upload
    (200)

    # Success! It's a good idea now to just check that this worked by browsing in
    # a web browser to your watch list and checking the new items were updated ok (of
    # course you can do this in mechshell if you want but I'll leave that out here!).

    # Now for the really cool bit. The mechshell 'script' command will dump out
    # the perl code required to perform all of the above actions again if you copy
    # them into a perl script:
    http://www.freshports.org/pkg_upload.php>script
    #!perl -w
    use strict;
    use WWW::Mechanize;
    use WWW::Mechanize::FormFiller;
    use URI::URL;
    -snip-

    # Also, if you provide a filename as an argument to the 'script' command,
    # mechshell will dump all the script commands to that filename:
    http://www.freshports.org/pkg_upload.php>script /tmp/freshports_update.pl

    # Finally, use 'quit' to exit the mechshell:
    http://www.freshports.org/pkg_upload.php>quit
    munk@users /home/munk#

Now all that remains is to open up /tmp/freshports_update.pl and tidy the script up so that it's more suitable for automated use via cron. For example, any 'dump' and 'content' commands can be taken out - these would only cause problems anyway if run from a non-interactive shell as used by cron. We also need to add some code to have the script dump the output of 'pkg_info -qoa' to a temporary file prior to uploading.

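The 'dump the package list to a temporary file' step can be tried out on its own before it goes anywhere near cron. Below is a rough sh sketch of the idea, not the code from the finished script: dump_pkg_list is a made-up helper name, and it takes the listing command as its arguments (with a printf standing in for 'pkg_info -qoa') so the sketch also runs on a box that has no pkg_info.

```shell
#!/bin/sh
# Sketch: write a package listing to a private temp file and clean up on exit.
# dump_pkg_list is a hypothetical helper, not part of mechshell or FreshPorts.

dump_pkg_list() {
    # $1 = output file; remaining arguments = the command that prints the list
    out=$1
    shift
    "$@" > "$out" || return 1
    # Treat an empty listing as a failure so nothing bogus gets uploaded
    [ -s "$out" ]
}

tmpdir=$(mktemp -d) || exit 1
trap 'rm -rf "$tmpdir"' EXIT
list="$tmpdir/mypkg_info.txt"

# On the real box this would be: dump_pkg_list "$list" pkg_info -qoa
if dump_pkg_list "$list" printf '%s\n' www/p5-WWW-Mechanize-Shell devel/p5-PathTools; then
    echo "package list written to $list"
else
    echo "could not build package list" >&2
fi
```

Using mktemp for the directory avoids relying on a fixed path under /tmp already existing, and the trap removes it even if a later step fails.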
The completed 'quick and dirty' hack looks like this then:

CODE:
    #!/usr/bin/perl -w
    use strict;
    use WWW::Mechanize;
    use WWW::Mechanize::FormFiller;
    use URI::URL;

    # FreshPorts username/pass:
    my $user="munk";
    my $pass="xxxxx";

    # Temp location to store output from 'pkg_info -qoa'
    # (the /tmp/freshports directory must already exist):
    my $mypkg_info="/tmp/freshports/mypkg_info.txt";

    # Prepare file containing output from: pkg_info -qoa
    system("pkg_info -qoa > $mypkg_info") == 0
        or die "pkg_info failed: $?";

    # Prepare WWW::Mechanize:
    my $agent = WWW::Mechanize->new( autocheck => 1 );
    my $formfiller = WWW::Mechanize::FormFiller->new();
    $agent->env_proxy();

    # Login to FreshPorts:
    $agent->get('http://www.freshports.org/login.php');
    $agent->form_number(1) if $agent->forms and scalar @{$agent->forms};
    { local $^W; $agent->current_form->value('UserID', $user); };
    { local $^W; $agent->current_form->value('Password', $pass); };
    $agent->submit();

    # Submit pkg_info details to FreshPorts pkg_upload page:
    $agent->get('http://www.freshports.org/pkg_upload.php');
    $agent->form_number(1) if $agent->forms and scalar @{$agent->forms};
    { local $^W; $agent->current_form->value('pkg_info', $mypkg_info); };
    $agent->click('upload');

    # Remove temporary file:
    unlink $mypkg_info;

After saving the script and making the file executable, an entry can then be added to cron to have the script auto update the list of ports at FreshPorts once a week - or however often you require it to be updated; once a week is more than enough for me. Sorted! :)

Sunday, December 24. 2006
Portupgrade fails to upgrade dependencies
When using portupgrade to upgrade ports recursively, occasionally you get a problem where portupgrade fails to upgrade a dependency of a port that's being upgraded. This seems to happen most often with perl ports - p5-* ports - probably because perl modules are so modular in design that these ports pull in more dependencies than most.
An example is when I just went to run the weekly 'portupgrade -arR' and whilst upgrading p5-PathTools-3.21, portupgrade found that another port - p5-Scalar-List-Utils-1.18 - also needed upgrading. Unfortunately the upgrade of that port failed with the following error:

CODE:
    ===> Checking if lang/p5-Scalar-List-Utils already installed
    ===> p5-Scalar-List-Utils-1.18,1 is already installed
          You may wish to ``make deinstall'' and install this port again
          by ``make reinstall'' to upgrade it properly.
          If you really wish to overwrite the old port of lang/p5-Scalar-List-Utils
          without deleting it first, set the variable "FORCE_PKG_REGISTER"
          in your environment or the "make install" command line.
    *** Error code 1

    Stop in /home/munk/ports/lang/p5-Scalar-List-Utils.
    *** Error code 1

    Stop in /home/munk/ports/devel/p5-PathTools.

The problem is that any already installed dependencies - regardless of whether they need upgrading or not - are seen as being installed already, so the install step refuses to overwrite them. Hence if the port you're trying to upgrade has a dependency that *also* needs upgrading, the upgrade will fail. Adding '-f' onto the portupgrade line makes no difference.

One solution is to set the environment variable 'FORCE_PKG_REGISTER':

CODE:
    setenv FORCE_PKG_REGISTER 1

and then run the portupgrade command again. The installed status of the dependencies is then effectively ignored and the port dependencies are forcibly installed.

Not sure why portupgrade doesn't upgrade any dependencies automatically - I would have thought any dependencies of a port would be automatically checked for upgrades and upgraded if necessary. Maybe I'm missing something.

Monday, September 4. 2006
Solving permission problems with parsepath.pl
parsepath.pl is a brilliant perl script by Jeremy Mates for fixing permissions problems on Unix based platforms. Probably the most common type of permission problem from a sysadmin/webmaster's viewpoint is uploading a file to a directory in a website's document root folder and then trying to access the file or script in a web browser, only to get the dreaded 403 error message:
Forbidden

Most of the time the solution is very simple: just change the permissions on 'test.php' to make sure the user the webserver runs as can read the file correctly - the simplest and most common method being to change the mode of the file to '755':

CODE:
    chmod 755 test.php

Unfortunately sometimes it's not that easy and many times you see users asking 'I'm getting 'access denied' errors even though I've changed the perms to 755'. The problem is that one of the directories above 'test.php' has permissions set so that the webserver can't get to the file. Now that's where the headache comes in :)

However, parsepath.pl can take the headache out of fixing permissions problems. Say you have a website document root directory tree /usr/local/www/web/www.munk.me.uk/foo/bar and you upload a web script 'test.php' into that directory. You try and access the file in a web browser but get the 403 permission denied error above. First off you check the permissions on the file itself:

CODE:
    [23:58:17] root@users /usr/local/www/web/www.munk.me.uk/foo/bar# ; ls -l
    total 0
    -rwxr-xr-x 1 www www 0 Sep 4 23:39 test.php

That looks ok: with permissions 755 and the owner/group set to 'www', the webserver user 'www' should be able to read the file. So in this case the problem must be with the permissions on one of the parent directories.

The old method of working out the perms would be either to trawl one by one through each parent directory checking its perms, or to change the permissions recursively on the document root folder so all subfolders have the read bit set for the webserver user/group. With parsepath.pl things are a lot simpler though - just run the following command:

CODE:
    [0:03:21] root@users /usr/local/www/web/www.munk.me.uk/foo/bar# parsepath.pl user=www +r test.php
    ! group=www +rx fails: d 0700 root:www /usr/local/www/web/www.munk.me.uk/foo
    ! unix-other +rx fails: d 0750 root:wheel /usr/local/www/web/www.munk.me.uk/foo/bar

With this command parsepath.pl works through each directory in the path leading to the file you feed it on the commandline and tells you the permissions problems - if any - for the user 'www' (the user=www argument) to read (the +r argument) the file 'test.php'. In the output, we're told that reading test.php as the user www fails on two counts:

CODE:
    # the group bit on the folder 'foo' doesn't have the +rx flag set:
    ! group=www +rx fails: d 0700 root:www /usr/local/www/web/www.munk.me.uk/foo
    # the other bit on the folder 'bar' doesn't have the +rx flag set:
    ! unix-other +rx fails: d 0750 root:wheel /usr/local/www/web/www.munk.me.uk/foo/bar

With this information it's easy enough to go in and make the changes necessary to fix the problem: 'chmod g+rx foo' sorts out the group bit on 'foo', and since 'bar' belongs to the group 'wheel' rather than 'www' it needs either 'chmod o+rx foo/bar' or a 'chgrp www foo/bar'.

There are other ways of invoking parsepath.pl though. Running it with just a file/path as an argument, it'll tell you the permissions on each directory along that path:

CODE:
    [0:10:33] root@users /usr/local/www/web/www.munk.me.uk/foo/bar# > parsepath.pl /usr/local/www/web/www.munk.me.uk/foo/bar/test.php
    % /usr/local/www/web/www.munk.me.uk/foo/bar/test.php
    d 0755 root:wheel /
    d 0755 root:wheel /usr
    d 0755 root:wheel /usr/local
    d 0755 root:wheel /usr/local/www
    d 0770 www:wheel /usr/local/www/web
    d 0750 www:www /usr/local/www/web/www.munk.me.uk
    d 0700 root:www /usr/local/www/web/www.munk.me.uk/foo
    d 0750 root:wheel /usr/local/www/web/www.munk.me.uk/foo/bar
    f 0755 root:www /usr/local/www/web/www.munk.me.uk/foo/bar/test.php

which is better for seeing the whole tree in one go.

No permissions were harmed in the making of this article! I'll include the parsepath.pl script in the extended article just in case the original ever gets lost - big credit of course goes to the author of the script, Jeremy Mates.
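If the script isn't to hand, the plain 'show me the perms on every directory down to the file' view can be roughly approximated in sh. This is only a sketch of the idea and not how parsepath.pl works inside: walk_path is a name made up for this example, it only lists permissions via 'ls -ld' and makes no attempt to test them against a particular user the way 'user=www +r' does.

```shell
#!/bin/sh
# Sketch: show the permissions of every directory from / down to a file.
# walk_path is a hypothetical name; parsepath.pl itself is perl and does far more.

walk_path() {
    # $1 = absolute path to a file or directory
    case $1 in
        /*) ;;
        *) echo "walk_path: need an absolute path" >&2; return 1 ;;
    esac
    # Print the parents first (recursively), then this component
    if [ "$1" != "/" ]; then
        walk_path "$(dirname "$1")" || return 1
    fi
    ls -ld "$1"
}

# Demo on a throwaway tree shaped like the one in the article
demo=$(mktemp -d) || exit 1
trap 'rm -rf "$demo"' EXIT
mkdir -p "$demo/foo/bar"
: > "$demo/foo/bar/test.php"
chmod 700 "$demo/foo"
walk_path "$demo/foo/bar/test.php"
```

Reading down the output, the 'drwx------' line for foo is the sort of thing to look for. Working out which user actually gets blocked at each step is the part Jeremy's script does for you.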
His site is actually very interesting from a sysadmin's point of view, containing lots of interesting admin scripts and thoughts on system administration in general - I spent quite a while grazing through his stuff there - cheers Jeremy.

Continue reading "Solving permission problems with parsepath.pl"

Thursday, August 31. 2006
Serendipity Spam Statistics
I just downloaded this great looking spam statistics plugin for Serendipity from Andreas. Unfortunately after installing it it didn't seem to work, so I got stuck in to see what was up.
Turns out it only works when the spamblock plugin logs to the database, so I'll either look into making it work with log files or maybe think about adding something to the admin stats plugin if that's possible. Or do neither, since it's not uber important to me given I get a raft of info on the spam stats each night via a cron job.

I have a cron job that checks various spam related things on a daily basis - checking for referer spam, quarantined files uploaded via PHP, mod_security log entries that need attention and finally checking for serendipity / weblog spam. The situation with weblog spam had gotten so bad on the old domain munk.nu that I even ended up creating a script to convert spamblock log entries into firewall rules for ipf. I'm not kidding, at least 100 trackback spam entries per day through June and July - for the year 2006 so far there are nearly 9000 unique IPs dropping new trackback spam. What's annoying too is that even after adding offending IPs to my firewall block list, each and every new day there would be another 100 new unique IP addresses spamming the blog. No doubt this is a botnet - 100 new zombies found per day sounds like a professional organisation. Ho hum.

Anyway I'll add the 'log2ipf.pl' perl script in the extended part of this article. It's a perl script that's little more than an extended 'grep | sed' which searches for text in a file and then reports how many results it found for each item. In the default case using just 'log2ipf.pl somefile.log' it searches for:

CODE:
    "s9y"=>qr/.*\[REJECTED: (?:No API-created comments|Trackback URL invalid|Filtered by Akismet\.com).*, IP (.*?)].*/,

In this case it reports a list of IP addresses and how many times each IP address was 'caught' trying to spam - but it could be modified to do anything.

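To make the 'extract, then count' idea concrete, here is the same thing sketched in plain sh for the s9y case. It's an illustration only, not log2ipf.pl: count_rejected_ips is a made-up name, and the sample log lines are invented to fit the shape that regex expects rather than copied from a real spamblock log. Note the three rejection reasons are grouped with ( | ) so they're treated as alternatives.

```shell
#!/bin/sh
# Sketch: count rejected-comment IPs in a spamblock-style log.
# count_rejected_ips is a hypothetical name and the sample lines are made up.

count_rejected_ips() {
    # $1 = log file. Prints "ip: count", least frequent first.
    sed -n -E 's/.*\[REJECTED: (No API-created comments|Trackback URL invalid|Filtered by Akismet\.com).*, IP ([^]]*)].*/\2/p' "$1" \
        | sort | uniq -c | sort -n \
        | awk '{ print $2 ": " $1 }'
}

log=$(mktemp) || exit 1
trap 'rm -f "$log"' EXIT
cat > "$log" <<'EOF'
2006-08-30 [REJECTED: Trackback URL invalid, IP 192.0.2.10] spam one
2006-08-30 [REJECTED: Filtered by Akismet.com, IP 192.0.2.10] spam two
2006-08-30 [REJECTED: No API-created comments, IP 198.51.100.7] spam three
2006-08-30 [APPROVED: ok, IP 203.0.113.5] real comment
EOF

count_rejected_ips "$log"
```

On the sample lines this prints '198.51.100.7: 1' followed by '192.0.2.10: 2'; the approved comment is skipped because it doesn't match any of the three reasons.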
For example I have another 'filter' set up to see how many people use a google search to find pics on my server by searching for the term 'picasa.ini':

CODE:
    "picasa" =>qr/^.*?\s+(.*?)\s+.*%22index\+of%22\+%2F\+picasa\.ini.*/

so I can feed apache logfiles to log2ipf.pl using this commandline:

CODE:
    ; log2ipf.pl -l picasa /var/log/httpd/all/2006/07/*/*
    24.242.97.20: 1
    67.141.28.129: 1

telling me there were just 2 such searches during July 2006 (woo). I seem to remember that search returning more than that at the time I wrote the filter though lol. You get the idea anyway.

To add a new 'filter', the best thing to do is grab a sample logfile line you want to produce a result, then customize the script's %re variable to include your custom filter. For example, say you wanted to search for auth log failures for SSH (this is actually done for you by the periodic utility on FreeBSD if you set it up in /etc/periodic.conf, but that's another article!) - you could write something like this for the %re filter:

CODE:
    my %re=(
        "s9y"=>qr/.*\[REJECTED: (?:No API-created comments|Trackback URL invalid|Filtered by Akismet\.com).*, IP (.*?)].*/,
        # Example of logfile line we want to catch:
        # Aug 26 14:57:35 users sshd[30136]: Failed password for root from 211.48.62.102 port 50706 ssh2
        "ssh" =>qr/.*Failed password for .* from (.*?) .*/,
        "picasa" =>qr/^.*?\s+(.*?)\s+.*%22index\+of%22\+%2F\+picasa\.ini.*/
    );

which would result in:

CODE:
    ; log2ipf.pl -l ssh /var/log/auth.log
    168.126.71.148: 1
    210.34.14.53: 3
    84.10.149.105: 3
    211.48.62.102: 3
    220.231.54.232: 3
    195.10.193.4: 5
    213.179.181.26: 11

As I say you can do the equivalent with grep, sed, sort and uniq on the commandline:

CODE:
    ; grep "Failed password for" /var/log/auth.log | sed -e 's/.*Failed password for .* from \([^ ]*\).*/\1/' \
    | sort | uniq -c | sort -n
       1 168.126.71.148
       3 210.34.14.53
       3 211.48.62.102
       3 220.231.54.232
       3 84.10.149.105
       5 195.10.193.4
      11 213.179.181.26

But for a very large file the timing differences between this method and the perl script are massive.

Anyhoo this is turning into a crazy long entry so I'll turn it in. The script log2ipf.pl - should rename that really since it's got little to do with ipf! - is in the extended article below if anyone's interested.

Continue reading "Serendipity Spam Statistics"

Saturday, January 1. 2005
Awstats Updates and Broken Icons
AWStats was recently updated from version 6.1 to 6.2. Of itself this was no problem - checking the changelog, there is nothing significant that breaks anything. Unfortunately though, the AWStats FreeBSD port was modified by the port maintainer to correct a problem that already existed - namely that some of the tools that ship with AWStats don't work because they can't find the files that they need - icon files for example.
The port maintainer modified the port so that the icon files for AWStats are now placed in /usr/local/www/awstats/icons/ - the original default location was /usr/local/www/icons/. This took me a little while to figure out; perhaps it's the New Year's hangover, but a more detailed note in the /usr/ports/UPDATING file for current users of awstats would have been good. Ho hum, I suppose it's at least something that anything was added at all to UPDATING.

I'm all for tidying up directory structures but in this case it seems that a 'fix' for something that was trivially broken has actually resulted in breaking something that was working fine - as the saying goes, if it ain't borked don't fix it.

The short and tall of it is that to fix the problem that this 'fix' created, a 'DirIcons' directive needs to be added to each awstats config file:

CODE:
    DirIcons="/awstatsicons"

and then an alias line needs to be added to the httpd.conf file:

CODE:
    Alias /awstatsicons "/usr/local/www/awstats/icons/"

otherwise the awstats pages look odd because the icon files can't be found.

For posterity, the extended article below has a GNATS message I drafted in reply to this FreeBSD Problem Report regarding the recent awstats update.

Continue reading "Awstats Updates and Broken Icons"

