Scraping the content of td
Hi All, I am a newbie at scraping and have recently been using scrubyt. I had a problem fetching some text that is inside a table's td tag. I went through many references and googled a lot, but in vain. The page in question is "http://www.sec.gov/Archives/edgar/data/1350102/000101968708000866/0001019687-08-000866-index.htm", and from the page content I need to fetch the date-time next to the "Accepted:" text. Looking at the source, there are many tables and many tds. How do I do this? Any help greatly appreciated… thanks, Venkat Bagam
The site's markup is very well written, so it is easy to scrape. Here it is:

    require 'rubygems'
    require 'scrubyt'

    Scrubyt.logger = Scrubyt::Logger.new

    data = Scrubyt::Extractor.define do
      fetch 'http://www.sec.gov/Archives/edgar/data/1350102/000101968708000866/0001019687-08-000866-index.htm'

      # Outer pattern: every top-aligned table cell
      top "//td[@valign='top']" do
        # Inner pattern: the third <b class="blue"> in a cell
        data "//b[@class='blue'][3]"
      end
    end

    data.to_xml.write($stdout, 1)

I use scrubyt 4.03 (you can search for it on the forum). I also use Firebug's Inspect Element to find the HTML structure.
[MODE] Learning
[ACTION] fetching document: http://www.sec.gov/Archives/edgar/data/1350102/000101968708000866/0001019687-08-000866-index.htm
[ACTION] Evaluating top with //td[@valign='top']
[ACTION] Evaluating data with //b[@class='blue'][3]
[ACTION] Evaluating data with //b[@class='blue'][3]
[ACTION] Evaluating data with //b[@class='blue'][3]
[ACTION] Evaluating data with //b[@class='blue'][3]
[ACTION] Evaluating data with //b[@class='blue'][3]
[ACTION] Evaluating data with //b[@class='blue'][3]
[ACTION] Evaluating data with //b[@class='blue'][3]
[ACTION] Evaluating data with //b[@class='blue'][3]
[ACTION] Evaluating data with //b[@class='blue'][3]
[INFO] Extraction finished succesfully!
<root>
<top>
<data>2008-02-29 21:59:45</data>
</top>
</root>
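
For anyone curious why the inner pattern is evaluated once per cell in the log above yet only one value comes back, here is a minimal sketch of the same nested selection using REXML from Ruby's standard distribution. The SAMPLE markup is a hypothetical, simplified stand-in for the filing page (the real page is larger and may be structured differently), and the placeholder values in it are made up; only the date is taken from the output above.

```ruby
require 'rexml/document'

# Hypothetical stand-in for the filing index page, NOT the real sec.gov markup.
SAMPLE = <<~HTML
  <table>
    <tr>
      <td valign="top">
        <b class="blue">placeholder one</b>
      </td>
      <td valign="top">
        First: <b class="blue">placeholder two</b>
        Second: <b class="blue">placeholder three</b>
        Accepted: <b class="blue">2008-02-29 21:59:45</b>
      </td>
    </tr>
  </table>
HTML

doc = REXML::Document.new(SAMPLE)

# Outer pattern: every <td valign="top"> in the document.
cells = REXML::XPath.match(doc, "//td[@valign='top']")

# Inner pattern: the third <b class="blue"> child of each cell.
# Cells with fewer than three such children yield nil and are dropped.
matches = cells.map { |td| REXML::XPath.match(td, "b[@class='blue']")[2] }.compact

accepted = matches.first && matches.first.text
puts accepted
```

In XPath, a position such as [3] after //b[@class='blue'] is counted among the children of each parent separately, so only a cell that holds at least three blue b elements produces a match. That would be consistent with the pattern being tried on every cell while a single data element ends up in the result.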
Hi Antel, first of all, thanks for the quick reply. That worked pretty well. I have implemented the same technique in my app using Hpricot and Mechanize. I hope nobody minds a little discussion of Hpricot and Mechanize, because Scrubyt itself is built on top of them. This is how I did it:

    require 'rubygems'
    require 'mechanize'

    # WWW::Mechanize is the class name in older releases; from Mechanize 1.0 on it is just Mechanize
    agent = WWW::Mechanize.new
    page = agent.get('http://www.sec.gov/Archives/edgar/data/1350102/000101968708000866/0001019687-08-000866-index.htm')