Posted by Frank Hale on 1999-05-04
I have a little Perl script which strips all the links out of a webpage
and builds a separate page with those links. I can already parse out the
http:// pattern very nicely. What I want to do now is write a pattern
that will get the link description too.
Example:
<a href="http://www.somewhere.com/">This is a cool page</a>
I want to grab the part which says "This is a cool page" from every
<a href=""></a> tag in a webpage.
Here is the code which pulls out the URLs and builds a separate webpage
with those URLs as links.
#!/usr/bin/perl
#
# geturl.pl [infile] [outfile]
#
# Purpose: Strips out URLs from the input webpage and generates a
# new webpage with those URLs as links.
#
# Frank Hale
# frankhale@xxxxxxxx.xxx.xxx
# 2 May 1999
##################################################################
my @commandlineparms;
my @links;
&main;
sub main {
if ($ARGV[0] and $ARGV[1]) {
$commandlineparms[0] = $ARGV[0];
$commandlineparms[1] = $ARGV[1];
} else {
print "usage: geturl.pl [infile] [outfile]\n";
exit 1; # nothing to parse without both filenames
}
&parse_file;
}
sub parse_file {
my @in; # the input file
my $url;
open (IN, "<".$commandlineparms[0])
or die "Cannot open $commandlineparms[0]: $!\n";
while(<IN>)
{
s/^\s+//; # Strip leading whitespace
s/\t//g; # Strip out tabs
s/\n//g; # Strip newlines
push(@in, $_);
}
close(IN);
# Parse each line
foreach my $line (@in)
{
# Test every alternative against $line; a bare /.../ matches $_ instead
if ($line =~ /(http:\/\/[0-9A-Za-z._&?=%\/~-]+)/ or
$line =~ /(www[A-Za-z.-]+)/ or
$line =~ /(ftp[A-Za-z.-]+)/)
{
$url = "<A HREF=\"$1\">$1</A>\n";
push (@links, $url);
}
}
&generate_webpage;
}
sub generate_webpage {
open (OUT, ">$commandlineparms[1]")
or die "Cannot open $commandlineparms[1]: $!\n";
print OUT "<HTML><TITLE>LINKS</TITLE><BODY>@links</BODY></HTML>";
close (OUT);
}
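For the description part, here is a rough sketch of the kind of pattern I have in mind. It is only a starting point, not a tested solution: extract_links is a made-up helper name that is not part of geturl.pl, and the pattern assumes the href value is quoted and that anchors are not nested. It captures the URL and the text between the opening and closing tags, then strips any inner tags from the text.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# extract_links is a hypothetical helper, not part of geturl.pl.
# It returns a list of [url, description] pairs found in $html.
sub extract_links {
    my ($html) = @_;
    my @found;
    # /g walks through every anchor, /i ignores tag case,
    # /s lets .*? cross newlines inside the link text.
    while ($html =~ /<a\s[^>]*?href\s*=\s*(["'])(.*?)\1[^>]*>(.*?)<\/a>/gis) {
        my ($url, $text) = ($2, $3);   # copy before the next regex resets them
        $text =~ s/<[^>]+>//g;         # drop inner tags such as <b> or <img>
        $text =~ s/\s+/ /g;            # collapse runs of whitespace
        $text =~ s/^ | $//g;           # trim one leading/trailing space
        push @found, [$url, $text];
    }
    return @found;
}

my $sample = '<a href="http://www.somewhere.com/">This is a cool page</a>';
for my $link (extract_links($sample)) {
    print "$link->[0] => $link->[1]\n";
}
```

Because the pattern needs the whole anchor in one string, the file would have to be read into a single scalar rather than matched line by line. A regex like this will probably still trip over unquoted attributes or broken markup, so a real parser such as HTML::TokeParser may be the more robust route.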
--
From: Frank Hale
Email: frankhale@xxxxxxxx.xxx.xxx
ICQ: 7205161
Website: http://www.franksstuff.com (DOWN FOR A WHILE)