
2 out of 2 found this review helpful:

HTML-TableExtract (2.10)

Useful, but has flaws.

This module is very handy for getting the entries out of tables quickly. However, it has some flaws. For example, it's not possible to get the attributes of the <td> and other tags which form the table, so if you need to extract only the elements which have a certain name or class, you're stuck.

There is a way around the problem but it's complicated.

The other big problem with this module is that it's broken on Cygwin and Windows.

BKB - 2008-06-23 05:45:17

1 out of 1 found this review helpful:

HTML-TableExtract (2.10) *****

Excellent module, in many cases much easier than Template::Extract and HTML::TreeBuilder for extracting data from web pages, and one doesn't even have to look into the HTML source being processed.

My only complaint is the encoding problem. When dealing with pages in non-ASCII, non-UTF-8 encodings like GB2312, it just refuses to match headers. I have to convert the HTML input to UTF-8 manually all the time. I think it may be a problem on the HTML::Parser side... So UTF-8 is always my best friend. :)
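A minimal sketch of the manual conversion described above, assuming the page bytes are EUC-CN (the usual byte encoding for GB2312 pages) and decoding them into Perl characters with the core Encode module before parsing. The sample table, its header names, and the variable names are hypothetical stand-ins, not taken from any real page:

```perl
use strict;
use warnings;
use utf8;                          # this source contains non-ASCII literals
use Encode qw(decode encode);
use HTML::TableExtract;

# Hypothetical sample data standing in for bytes fetched from a GB2312 page.
my $sample = '<table><tr><th>名称</th><th>数量</th></tr>'
           . '<tr><td>苹果</td><td>3</td></tr></table>';
my $raw = encode('euc-cn', $sample);

# Decode the octets into Perl characters before handing them to the parser,
# so the header strings and the page text are compared as the same kind of data.
my $html = decode('euc-cn', $raw);

my $te = HTML::TableExtract->new(headers => ['名称', '数量']);
$te->parse($html);
$te->eof;

binmode STDOUT, ':encoding(UTF-8)';
for my $ts ($te->tables) {
    for my $row ($ts->rows) {
        print join("\t", map { defined $_ ? $_ : '' } @$row), "\n";
    }
}
```

For a real page, the encoding name would come from the HTTP headers or the page's meta tag rather than being hard-coded.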

Agent Zhang - 2007-01-17 07:15:53

6 out of 6 found this review helpful:

HTML-TableExtract (2.06) *****

This module helped me create a parser that I was struggling to build any other way. The headers feature is *very* handy and provides great basic functionality. If you need to go beyond this, be prepared to spend a bit of time understanding how things work; I found Matt's examples (http://www.mojotoad.com/sisk/projects/HTML-TableExtract/tables.html) to be helpful (and necessary). I give this module a high overall score because its great functionality trumps everything else. It would have been even better if it were more intuitive (granted, this is highly subjective), or if the off-line examples were referenced in the POD. Kudos!
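For readers who have not seen the headers feature, here is a minimal sketch of how it is typically used with the 2.x interface. The sample HTML and its column names are made up for illustration:

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical sample page; the column names are invented for this sketch.
my $html = <<'HTML';
<table>
  <tr><th>Date</th><th>Item</th><th>Price</th></tr>
  <tr><td>2006-01-05</td><td>Widget</td><td>9.99</td></tr>
  <tr><td>2006-01-06</td><td>Gadget</td><td>4.50</td></tr>
</table>
HTML

# Select the table, and the columns within it, by header text alone.
my $te = HTML::TableExtract->new(headers => ['Item', 'Price']);
$te->parse($html);
$te->eof;

# Walk every matching table and print the selected columns of each row.
for my $ts ($te->tables) {
    for my $row ($ts->rows) {
        print join(' | ', map { defined $_ ? $_ : '' } @$row), "\n";
    }
}
```

Because the table is picked out by its headers, no depth or count needs to be known in advance, which is what makes the feature convenient for pages whose layout is unfamiliar.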

Graham Stead - 2006-01-06 06:40:42

HTML::TableExtract / HTML-TableExtract (1.07) ****

A must-have module for getting information out of any table-organized HTML page (you'll be surprised how many web pages this is actually true for).

The learning curve might be a little steep, but this is only due to its power, and the fact that extracting information out of nested tables is a daunting task.

Two small tips for getting your information:

1. Don't know where your table's at?
Construct the TableExtract without depth and count, and loop using:
    # $table is the HTML::TableExtract object, after parsing
    foreach my $ts ($table->table_states) {
        warn "DEBUG: Table found at ", join(',', $ts->coords) if $DEBUG > 2;
    }

2. Replace IMG tags with their ALT attribute:
Subclass TableExtract, overriding the start method:
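    package TableParser;

    use base qw(HTML::TableExtract);

    sub start {
        my $self = shift;

        if ($self->{_in_a_table} and $_[0] eq 'img') {
            # $_[1] is the attribute hash ref; fall back to an empty list
            my %attrs = ref $_[1] ? %{$_[1]} : ();
            # Feed the ALT text to the parser as if it were cell text
            $self->text($attrs{alt}) if defined $attrs{alt};
        } else {
            $self->SUPER::start(@_);
        }
    }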

Asgeir Nilsen - 2003-08-16 13:00:26