Malcolm Farmer/Perl script to find new topics


I don't know how useful this'll be, but here's the script that I use to find the New Topics. It's pretty simple, much simpler than what I started with.

The script assumes that its data files are in the directory /home/scripts; change these references as required. The file existing.pages holds the full list of pages from the last time the script ran; to generate your own, do "touch existing.pages" and run the script. The first time round, every page is new, and all their names are appended to existing.pages, together with the date that the script ran; subsequent runs append only the topics that are new since the last run.
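In shell terms, that bootstrap step is just the following sketch (SCRIPTS_DIR is an illustrative variable for testing purposes only; the Perl script itself hard-codes /home/scripts):

```shell
# Seed an empty existing.pages so the first run treats every page as new.
dir="${SCRIPTS_DIR:-/home/scripts}"
mkdir -p "$dir"
touch "$dir/existing.pages"
```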

New pages are listed on STDOUT so I can see that the script did something; the actual usable output goes to the file called new.topics, in a form suitable for a quick cut and paste to the New Topics page.

Alan Millar sent me his script for autoposting pages, so my next step is to automatically post the results as a cron job....


#!/usr/bin/perl 
use LWP;
# first we read yesterday's list of all pages
open(INFILE, "/home/scripts/existing.pages") or die "can't read existing.pages: $!";
while (<INFILE>)
{
   $temp=$_;
   chomp($temp);
   next if ($temp=~/^####/); #ignore lines with the date
   $existing{$temp}++;       #make a hash of the pagenames
}
close (INFILE);
# now retrieve today's list of all pages
# tell anyone browsing the log files what you're up to
$queryname="running a script to find New Topics. Queries to: farmermj\@XXX.XX.XXXX.xxxx";  

$browser = LWP::UserAgent->new();
$browser->agent($queryname);
$url = "http://www.wikipedia.com/wiki.cgi?action=index";
$webdoc=$browser->request(HTTP::Request->new(GET => $url));
if ($webdoc->is_success)   #...then it's loaded the page OK
{
 print STDOUT "Page loaded OK\n";
 open (OUTFILE,">/home/scripts/new.topics") or die "can't write new.topics: $!";
 open (NEWTOPIC,">>/home/scripts/existing.pages") or die "can't append to existing.pages: $!";
 $now="#### ".`date`;
 print NEWTOPIC $now;                           # log the date
 @listing=split(/\n/,$webdoc->content);
 $lines=$#listing;                      # index of the last line
 for ($i=0; $i <= $lines ; $i++)        # <= so the last line isn't skipped
 {        
  if ($listing[$i] =~ /\/wiki\//)      # find a page record
  {                                    # extract the pagename 
    ($dummy,$pagename)=split(/">/, $listing[$i]);
    $pagename=~s/<\/a><br>//;
    unless ($existing{$pagename})      # skip pages we've already got
    { 
      print $pagename,"\n";
      print OUTFILE "- ",$pagename," -\n";
      print NEWTOPIC $pagename,"\n";                
    }            
  }
}  
close (OUTFILE);
close (NEWTOPIC);
}
else 
{
 print STDOUT "Couldn't get it\n"; 
}
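For anyone puzzling over the split and substitution in the loop: the index page's lines are assumed (from the script's regexes, not checked against the live site) to look like the sample below, and the extraction boils down to one substitution:

```shell
# A hypothetical line from action=index, in the form the script's regexes expect:
line='<a href="/wiki/WikiPedia">WikiPedia</a><br>'
# Split on '">' and strip the trailing tag, as the Perl does:
printf '%s\n' "$line" | sed -n 's|.*">\(.*\)</a><br>|\1|p'
```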



Other approaches might be to use wget and run diff on the result from the previous day, but this script should be a bit more portable to non-Unix systems.
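For comparison, the wget-and-diff route might look like this sketch. The URL is copied from the script above; comm(1) needs sorted input; and the two listing files here are faked so the comparison step can be seen without a network fetch:

```shell
# Successive fetches would produce the two listings, e.g.:
#   wget -q -O today.html 'http://www.wikipedia.com/wiki.cgi?action=index'
# Fake two sorted page lists to show the comparison step:
printf 'Apple\nBanana\n' | sort > yesterday.pages
printf 'Apple\nBanana\nCherry\n' | sort > today.pages
comm -13 yesterday.pages today.pages   # lines only in today's list = new topics
```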

The above script is pretty trivial, but I hereby make the standard declaration that it is released under the GPL.


I'll implement a similar function in the PHP script right away ;) --Magnus Manske


Hey, no fair! You're doing it at the backend! But I'll be happy to retire from running this once the switch to the PHP wikipedia goes online. It's odd such a function wasn't in the wiki software from the start; or is it just that Wikipedia is the first wiki to get so many edits per day that it needs such a function to sift through them? --Malcolm Farmer