problems scraping <td>
|
|
I’m having problems getting the following code snippet to scrape one particular page.
The line I want to scrape looks like this:
<TD WIDTH="609" BGCOLOR="#FFFFFF" COLSPAN="8"><IMG SRC="/comics/luann/archive/images/luann2008034074915.gif" ALT="Today's Comic" BORDER="0">
However, the line I get results for looks like this:
<td><A HREF="/webmail/SendAStrip?AppName=SendAStrip&ComicName=/comics/luann/&Attachments=/comics/luann/archive/images/luann2008034074915.gif&EmailDate=March-15-2008"><IMG SRC="/images/icon_email_this_comic_to_a_friend.gif" WIDTH="24" HEIGHT="24" ALT="E-mail This Comic to a Friend" BORDER="0"></A></td>
This line is about ten lines later in the HTML and does not have WIDTH="609" in it, so I’m not sure why it matches.
Here is the output from the logger:
[ACTION] fetching document: http://www.comics.com/comics/luann
http://www.comics.com/comics/luann
The string
<item>
<item>/images/icon_email_this_comic_to_a_friend.gif</item>
</item>
That's it
[ACTION] Evaluating top with //td[@width='609']
[INFO] Extraction finished succesfully!
I’m really new to Scrubyt and built this using examples I’ve found. I suspect I’m missing some fundamental understanding, since it works with other web pages.
Thanks for any help.
-edh
|
|
The page’s markup is broken: the td you were targeting is never properly closed. Even my script fails somewhere; I can’t select results with something like "//img12" (this is our image). By the way, you can build a batch script to extract the data like this:
ruby comics.com.rb | grep /archive/images | sed s/\<\*item\>/www.comics.com/
/usr/lib/ruby/gems/1.8/gems/scrubyt-0.4.03/lib/scrubyt/core/scraping/filters/text_filter.rb:25: warning: don't put space before argument parentheses
[MODE] Learning
[ACTION] fetching document: http://www.comics.com/comics/luann/
[ACTION] Evaluating comic with //img
[INFO] Extraction finished succesfully!
www.comics.com/comics/luann/archive/images/luann2006112780430.gif</item>
I don’t remember how to delete that </item> because / conflicts with sed.
Code:
require 'rubygems'
require 'scrubyt'
Scrubyt.logger = Scrubyt::Logger.new
comic_data = Scrubyt::Extractor.define do
  fetch 'http://www.comics.com/comics/luann/'
  comic "//img" do
    item 'src', :type => :attribute
  end.ensure_absence_of_attribute("height")
end
comic_data.to_flat_xml.write($stdout, 1)
Output:
<item>
<item>/mycomics/images/clear_dot.gif</item>
</item>
<item>
<item>/mycomics/images/new_nav/top_button_spacer.gif</item>
</item>
<item>
<item>/mycomics/images/nav/new_nav/icon_privacypolicy.gif</item>
</item>
<item>
<item>/mycomics/images/nav/new_nav/icon_privacypolicy.gif</item>
</item>
<item>
<item>http://oascentral.comics.com/RealMedia/ads/adstream_nx.ads/www.comics.com/comics/luann@x41</item>
</item>
<item>
<item>/mycomics/images/new_nav/free_button_home.gif</item>
</item>
<item>
<item>/images/clear_dot.gif</item>
</item>
<item>
<item>/images/clear_dot.gif</item>
</item>
<item>
<item>/images/clear_dot.gif</item>
</item>
<item>
<item>/images/clear_dot.gif</item>
</item>
<item>
<item>/comics/luann/images/luann_musical.gif</item>
</item>
<item>
<item>/comics/luann/archive/images/luann2006112780430.gif</item>
</item>
<item>
<item>http://oascentral.comics.com/RealMedia/ads/adstream_nx.ads/www.comics.com/comics/luann@x82</item>
</item>
<item>
<item>//st.sageanalyst.net/NS?ci=734&di=d001&pg=comics&ai=</item>
</item>
<item>
<item>http://oascentral.comics.com/RealMedia/ads/adstream_nx.ads/www.comics.com/comics/luann@x81</item>
</item>
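On the leftover </item>: sed accepts a delimiter other than /, so something like s|</item>|| should avoid the conflict. The filtering could also be done in Ruby instead of grep and sed. Below is only a rough sketch, not anything Scrubyt provides: comic_urls and BASE_URL are made-up names, the site root is an assumption, and it works on the flat XML text as printed above.

```ruby
require 'uri'

# Assumed site root; adjust if the comic is served from elsewhere.
BASE_URL = 'http://www.comics.com'

# Hypothetical helper (not part of Scrubyt): takes the flat XML text,
# keeps only the inner <item> values that point at the comic archive,
# and turns them into absolute URLs.
def comic_urls(xml_text, base = BASE_URL)
  xml_text.scan(%r{<item>([^<]+)</item>})   # inner <item> elements only
          .flatten
          .select { |path| path.include?('/archive/images/') }
          .map { |path| URI.join(base, path).to_s }
end

# A small excerpt of the output above, used as sample input.
sample = <<~XML
  <item>
    <item>/images/clear_dot.gif</item>
  </item>
  <item>
    <item>/comics/luann/archive/images/luann2006112780430.gif</item>
  </item>
XML

puts comic_urls(sample)
```

In the real script the text would come from the extractor output rather than a sample string; matching on /archive/images/ is the same heuristic as the grep in the batch command.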
|