HTML::LinkExtractor.3pm

Language: en

Version: 2005-01-07 (ubuntu - 24/10/10)

Section: 3 (Library functions)

NAME

HTML::LinkExtractor - Extract links from an HTML document

DESCRIPTION

HTML::LinkExtractor is used for extracting links from HTML. It is very similar to HTML::LinkExtor, except that besides getting the URL, you also get the link-text.

Example (please run the examples):

     use HTML::LinkExtractor;
     use Data::Dumper;
 
     my $input = q{If <a href="http://perl.com/"> I am a LINK!!! </a>};
     my $LX = HTML::LinkExtractor->new();
 
     $LX->parse(\$input);
 
     print Dumper($LX->links);
     __END__
     # the above example will yield
     $VAR1 = [
               {
                 '_TEXT' => '<a href="http://perl.com/"> I am a LINK!!! </a>',
                 'href' => bless(do{\(my $o = 'http://perl.com/')}, 'URI::http'),
                 'tag' => 'a'
               }
             ];
 
 

"HTML::LinkExtractor" will also correctly extract nested link-type tags.

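As a rough illustration of the nested case, the sketch below parses an anchor that wraps an image and collects the tag name of every extracted link. The sample markup and the file name camel.png are made up for this example, and no particular order of the entries is assumed.

```perl
use strict;
use warnings;
use HTML::LinkExtractor;

# Hypothetical input: an <img> nested inside an <a>
my $input = q{<a href="http://perl.com/"><img src="camel.png"> Perl </a>};

my $LX = HTML::LinkExtractor->new();
$LX->parse(\$input);

# Collect the tag name of every link hashref that was extracted
my @tags = map { $_->{tag} } @{ $LX->links };
print "found: @tags\n";
```
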
SYNOPSIS

     ## the demo
     perl LinkExtractor.pm
     perl LinkExtractor.pm file.html otherfile.html
 
     ## or if the module is installed, but you don't know where
 
     perl -MHTML::LinkExtractor -e" system $^X, $INC{q{HTML/LinkExtractor.pm}} "
     perl -MHTML::LinkExtractor -e' system $^X, $INC{q{HTML/LinkExtractor.pm}} '
 
     ## or
 
     use HTML::LinkExtractor;
     use LWP::Simple qw( get );
 
     my $base = 'http://search.cpan.org';
     my $html = get($base.'/recent');
     my $LX = HTML::LinkExtractor->new();
 
     $LX->parse(\$html);
 
     print qq{<base href="$base">\n};
 
     for my $Link ( @{ $LX->links } ) {
         ## new modules are linked by /author/NAME/Dist
         if( $$Link{href} =~ m{^/author/\w+} ) {
             print $$Link{_TEXT}."\n";
         }
     }
 
     undef $LX;
     __END__
 
     ## or
 
     use HTML::LinkExtractor;
     use Data::Dumper;
 
     my $input = q{If <a href="http://perl.com/"> I am a LINK!!! </a>};
     my $LX = HTML::LinkExtractor->new(
         sub {
             print Data::Dumper::Dumper(@_);
         },
         'http://perlFox.org/',
     );
 
     $LX->parse(\$input);
     $LX->strip(1);
     $LX->parse(\$input);
     __END__
 
     #### Calculate the total size of a web page:
     #### adds up the sizes of all the images, stylesheets, and so on

     use strict;
     use LWP::Simple;    # exports get() and head()
     use HTML::LinkExtractor;

     my $url   = shift || 'http://www.google.com';
     my $html  = get($url);
     die "Could not fetch $url\n" unless defined $html;
     my $Total = length $html;

     print "initial size $Total\n";

     my $LX = HTML::LinkExtractor->new(
         sub {
             my( $X, $tag ) = @_;

             # skip the tags for which '_TEXT' is provided (such as <a>)
             unless( grep { $_ eq $tag->{tag} } @HTML::LinkExtractor::TAGS_IN_NEED ) {
                 print "$$tag{tag}\n";

                 # check every attribute of this tag that may hold a URL
                 for my $urlAttr ( @{ $HTML::LinkExtractor::TAGS{ $$tag{tag} } || [] } ) {
                     if( exists $$tag{$urlAttr} ) {
                         # in list context, head() returns
                         # ($content_type, $document_length, ...)
                         my $size = (head( $$tag{$urlAttr} ))[1];
                         $Total += $size if $size;
                         print "adding $size\n" if $size;
                     }
                 }
             }
         },
         $url,
         0
     );

     $LX->parse(\$html);

     print "The total size of \n$url\n is $Total bytes\n";
     __END__
 
 

METHODS

$LX->new([\&callback, [$baseUrl, [1]]])

Accepts 3 arguments, all of which are optional. If for example you want to pass a $baseUrl, but don't want to have a callback invoked, just put "undef" in place of a subref.

This is the only class method.

1.
a callback (a sub reference, as in "sub{}" or "\&sub") which is called each time a new LINK is encountered (for @HTML::LinkExtractor::TAGS_IN_NEED this means after the closing tag is encountered)

The callback receives an object reference($LX) and a link hashref.

2.
a base URL (passed to URI->new, so it's up to you to make sure it's valid), which is used to convert all relative URIs to absolute ones.
     $ALinkP{href} = URI->new_abs( $ALink{href}, $base );
 
 
3.
A "boolean" (just stick with 1). See the example in "DESCRIPTION". Normally, you'd get back _TEXT that looks like
     '_TEXT' => '<a href="http://perl.com/"> I am a LINK!!! </a>',
 
 

If you turn this option on, you'll get the following instead

     '_TEXT' => ' I am a LINK!!! ',
 
 

The private utility function "_stripHTML" does this by using HTML::TokeParser's get_trimmed_text method.

You can turn this feature on and off by using "$LX->strip(undef || 0 || 1)"
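
The three constructor arguments can be combined as in the following sketch, which passes "undef" in place of a callback, supplies a base URL, and turns the strip option on. The page fragment and the example.com base URL are invented for illustration, and the exact whitespace left in _TEXT is not assumed.

```perl
use strict;
use warnings;
use HTML::LinkExtractor;

# Hypothetical page fragment with a relative link
my $html = q{<p><a href="docs/index.html"> the docs </a></p>};

# No callback (undef), a base URL, and the strip option turned on
my $LX = HTML::LinkExtractor->new( undef, 'http://example.com/base/', 1 );
$LX->parse(\$html);

for my $link ( @{ $LX->links } ) {
    # href should now be absolute; _TEXT should no longer contain markup
    print "$link->{tag}  $link->{href}  [$link->{_TEXT}]\n";
}
```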

$LX->parse( $filename || *FILEHANDLE || \$FileContent )

Each time you call "parse", you should pass it a $filename, a *FILEHANDLE, or a "\$FileContent" (a reference to a string holding the HTML).

Each time you call "parse" a new "HTML::TokeParser" object is created and stored in "$this->{_tp}".

You shouldn't need to mess with the TokeParser object.
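
A small sketch of two of the accepted input types: a reference to a string and an open filehandle. The in-memory filehandle is only a convenience to keep the example self-contained; a handle opened on a real file should behave the same way.

```perl
use strict;
use warnings;
use HTML::LinkExtractor;

my $html = q{<a href="http://perl.com/"> I am a LINK!!! </a>};
my $LX   = HTML::LinkExtractor->new();

# 1. A reference to a string holding the document
$LX->parse(\$html);
my $from_string = scalar @{ $LX->links };

# 2. An open filehandle (here an in-memory one)
open my $fh, '<', \$html or die "Cannot open in-memory handle: $!";
$LX->parse($fh);
close $fh;
my $from_handle = scalar @{ $LX->links };

print "links after string parse: $from_string\n";
print "links after handle parse: $from_handle\n";
```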

$LX->links()

Only after you call "parse" will this method return anything. It returns a reference to an array of hashes, which looks like this (Data::Dumper output):
     $VAR1 = [ { tag => 'img', src => 'image.png' }, ];
 
 

Please note that if you provide a callback this array will be empty.
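
To make the difference concrete, here is a minimal sketch in which a callback gathers the links into its own array. The variable names (@seen, $extractor, $link) are chosen for this example only.

```perl
use strict;
use warnings;
use HTML::LinkExtractor;

my $input = q{<a href="http://perl.com/"> I am a LINK!!! </a>};

# With a callback, each link hashref is handed to the sub as it is found
my @seen;
my $LX = HTML::LinkExtractor->new(
    sub {
        my ( $extractor, $link ) = @_;
        push @seen, $link;
    },
);
$LX->parse(\$input);

# links() is expected to hold nothing, because the callback took the links
printf "callback saw %d link(s); links() holds %d\n",
    scalar @seen, scalar @{ $LX->links || [] };
```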

$LX->strip( [ 0 || 1 ])

If you pass in "undef" (or nothing), returns the state of the option. Passing in a true or false value sets the option.

If you wanna know what the option does see "$LX->new([\&callback, [$baseUrl, [1]]])"
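
A short sketch of strip() used as both getter and setter. It only checks whether the returned state is true or false, without assuming the exact value that comes back.

```perl
use strict;
use warnings;
use HTML::LinkExtractor;

my $LX = HTML::LinkExtractor->new();

# Called without an argument, strip() reports the current state
my $initial = $LX->strip() ? 'on' : 'off';

# Passing a true or false value sets the option
$LX->strip(1);
my $after_on = $LX->strip() ? 'on' : 'off';

$LX->strip(0);
my $after_off = $LX->strip() ? 'on' : 'off';

print "initial: $initial, after strip(1): $after_on, after strip(0): $after_off\n";
```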

Take a look at %HTML::LinkExtractor::TAGS to see what I consider to be link-type tags.

Take a look at @HTML::LinkExtractor::VALID_URL_ATTRIBUTES to see all the possible tag attributes which can contain URI's (the links!!)

Take a look at @HTML::LinkExtractor::TAGS_IN_NEED to see the tags for which the '_TEXT' attribute is provided, like "<a href="#"> TEST </a>"
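
One way to take that look is to print the three package variables, as in this sketch. It assumes only what is described here: the values of %TAGS are array references of attribute names, with some special cases having no list at all.

```perl
use strict;
use warnings;
use HTML::LinkExtractor;

# Each key of %TAGS is a tag name; the value lists its URL-bearing attributes
# (it may be undefined for special cases, hence the "|| []")
for my $tag ( sort keys %HTML::LinkExtractor::TAGS ) {
    my @attrs = @{ $HTML::LinkExtractor::TAGS{$tag} || [] };
    print "$tag: @attrs\n";
}

print "URL attributes: @HTML::LinkExtractor::VALID_URL_ATTRIBUTES\n";
print "tags with _TEXT: @HTML::LinkExtractor::TAGS_IN_NEED\n";
```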

How can that be?!?!

I took a look at %HTML::Tagset::linkElements and the following URLs
     http://www.blooberry.com/indexdot/html/tagindex/all.htm
 
     http://www.blooberry.com/indexdot/html/tagpages/a/a-hyperlink.htm
     http://www.blooberry.com/indexdot/html/tagpages/a/applet.htm
     http://www.blooberry.com/indexdot/html/tagpages/a/area.htm
 
     http://www.blooberry.com/indexdot/html/tagpages/b/base.htm
     http://www.blooberry.com/indexdot/html/tagpages/b/bgsound.htm
 
     http://www.blooberry.com/indexdot/html/tagpages/d/del.htm
     http://www.blooberry.com/indexdot/html/tagpages/d/div.htm
 
     http://www.blooberry.com/indexdot/html/tagpages/e/embed.htm
     http://www.blooberry.com/indexdot/html/tagpages/f/frame.htm
 
     http://www.blooberry.com/indexdot/html/tagpages/i/ins.htm
     http://www.blooberry.com/indexdot/html/tagpages/i/image.htm
     http://www.blooberry.com/indexdot/html/tagpages/i/iframe.htm
     http://www.blooberry.com/indexdot/html/tagpages/i/ilayer.htm
     http://www.blooberry.com/indexdot/html/tagpages/i/inputimage.htm
 
     http://www.blooberry.com/indexdot/html/tagpages/l/layer.htm
     http://www.blooberry.com/indexdot/html/tagpages/l/link.htm
 
     http://www.blooberry.com/indexdot/html/tagpages/o/object.htm
 
     http://www.blooberry.com/indexdot/html/tagpages/q/q.htm
 
     http://www.blooberry.com/indexdot/html/tagpages/s/script.htm
     http://www.blooberry.com/indexdot/html/tagpages/s/sound.htm
 
     And the special cases 
 
     <!DOCTYPE HTML SYSTEM "http://www.w3.org/DTD/HTML4-strict.dtd">
     http://www.blooberry.com/indexdot/html/tagpages/d/doctype.htm
     '!doctype' is really a declaration rather than a tag, but is still listed
     in %TAGS with 'url' as the attribute
 
     and
 
     <meta HTTP-EQUIV="Refresh" CONTENT="5; URL=http://www.foo.com/foo.html">
     http://www.blooberry.com/indexdot/html/tagpages/m/meta.htm
     If there is a valid url, 'url' is set as the attribute.
     The meta tag has no 'attributes' listed in %TAGS.
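
The meta refresh case can be tried with a sketch like the following, which feeds the tag shown above to the parser and prints any 'url' entry that comes back. It relies only on the behaviour described here and does not assume which other keys the link hash holds.

```perl
use strict;
use warnings;
use HTML::LinkExtractor;

# The meta refresh example from above
my $input = q{<meta HTTP-EQUIV="Refresh" CONTENT="5; URL=http://www.foo.com/foo.html">};

my $LX = HTML::LinkExtractor->new();
$LX->parse(\$input);

# 'url' is only expected when the content attribute held a valid URL
my @targets = map { "$_->{url}" } grep { defined $_->{url} } @{ $LX->links };
print "meta refresh target: $_\n" for @targets;
```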
 
 

SEE ALSO

HTML::LinkExtor, HTML::TokeParser, HTML::Tagset.

AUTHOR

D.H (PodMaster)

Please use http://rt.cpan.org/ to report bugs.

Just go to http://rt.cpan.org/NoAuth/Bugs.html?Dist=HTML-Scrubber to see a bug list and/or report new ones.

LICENSE

Copyright (c) 2003, 2004 by D.H. (PodMaster). All rights reserved.

This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself. The LICENSE file contains the full text of the license.