Coder's Guild Mailing List

Perl pattern matching

Posted by Frank Hale on 1999-05-04

I have a little Perl script which strips all links out of webpages and
builds a separate page with those links. I can parse out the http://
pattern very nicely. What I want to do now is make a pattern that will
get the description too.

example:
<a href="http://www.somewhere.com/">This is a cool page</a>
                                    
I want to grab the part which says "This is a cool page" from all the
<a href=""></a> tags in a webpage.

Here is the code which pulls out the URLs and builds a separate webpage
with those URLs as links.

#!/usr/bin/perl
#
# geturl.pl [infile] [outfile]
#
# Purpose: Strips out URLs from the input webpage and generates a
# new webpage with those URLs as links.
#  
# Frank Hale
# frankhale@xxxxxxxx.xxx.xxx
# 2 May 1999
##################################################################
my @commandlineparms;
my @links;

&main;

sub main {

    if ($ARGV[0] and $ARGV[1]) {
        $commandlineparms[0] = $ARGV[0]; 
        $commandlineparms[1] = $ARGV[1];        
        
    } else {
        print "usage: geturl.pl [infile] [outfile]\n";
        exit 1;    # nothing to parse without both filenames
    }

    &parse_file;
}

sub parse_file {

    my @in;  # the input file
    my $url; 

    open (IN, "<".$commandlineparms[0])
        or die "Cannot open $commandlineparms[0]: $!\n";
    while(<IN>) 
    {
        s/^\s+//g;    # Strip leading whitespace
        s/\t//g;      # Strip out tabs
        s/\n//g;      # Strip newlines

        push(@in, $_);
    }
    close(IN);
        
    # Parse each line
    foreach my $line (@in)
    {   
        # Every alternative must bind to $line; a bare /.../ tests $_ instead
        if ($line =~ /(http:\/\/[0-9A-Za-z._&?=%\/~-]+)/ or
            $line =~ /(www[A-Za-z.-]+)/ or
            $line =~ /(ftp[A-Za-z.-]+)/)
        {   
            $url = "<A HREF=\"$1\">$1</A>\n";
            
            push (@links, $url);
        }
    }
    
    &generate_webpage;
}

sub generate_webpage {
    
    open (OUT, ">$commandlineparms[1]")
        or die "Cannot write $commandlineparms[1]: $!\n";
    print OUT "<HTML><HEAD><TITLE>LINKS</TITLE></HEAD><BODY>@links</BODY></HTML>";
    close (OUT);
}
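
For the description part, one possible starting point is a pattern that
captures both the href value and everything up to the closing </a>. This is
only a rough sketch under several assumptions: the whole page is held in one
string rather than read line by line as geturl.pl does, href values are
double-quoted, and anchors are not nested. The extract_links helper is a
made-up name for illustration and is not part of geturl.pl.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# extract_links is a hypothetical helper, not part of geturl.pl.
# It returns a list of [url, description] pairs found in $html.
sub extract_links {
    my ($html) = @_;
    my @found;

    # i: match <A HREF=...> as well as <a href=...>
    # s: let . match newlines, so link text may span several lines
    # g: find every anchor in the string, not just the first
    while ($html =~ /<a\s[^>]*?href\s*=\s*"([^"]*)"[^>]*>(.*?)<\/a>/gis) {
        my ($url, $text) = ($1, $2);
        $text =~ s/<[^>]+>//g;    # drop tags nested inside the link text
        $text =~ s/\s+/ /g;       # collapse runs of whitespace
        $text =~ s/^ | $//g;      # trim the ends
        push @found, [$url, $text];
    }
    return @found;
}

my $sample = '<a href="http://www.somewhere.com/">This is a cool page</a>';
for my $link (extract_links($sample)) {
    print "$link->[0] => $link->[1]\n";
}
```

The (.*?) is non-greedy so that each match stops at the first </a> instead of
running on to the last one on the page. A regular expression like this will
still miss unquoted or single-quoted href values; for pages that are not under
my control a real HTML parser would probably be the safer route.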

-- 
From:      Frank Hale
Email:     frankhale@xxxxxxxx.xxx.xxx
ICQ:       7205161
Website:   http://www.franksstuff.com (DOWN FOR A WHILE)