Scraping the content of td


 
my ruby on y... 4 posts

Hi all, I am a newbie at scraping and have recently been using scrubyt. I have a problem fetching some text that sits in a table's td tag. I went through many references and googled a lot, but in vain. The page in question is "http://www.sec.gov/Archives/edgar/data/1350102/000101968708000866/0001019687-08-000866-index.htm", and from its content I need to fetch the date and time shown next to the "Accepted:" text. Looking at the source, there are many tables and many tds. How do I do this? Any help is greatly appreciated…

Thanks, Venkat Bagam

 
Antel 62 posts

The website's code is very well written, so it is easy to scrape. Here it is:

require 'rubygems'
require 'scrubyt'

Scrubyt.logger = Scrubyt::Logger.new
data = Scrubyt::Extractor.define do

  fetch 'http://www.sec.gov/Archives/edgar/data/1350102/000101968708000866/0001019687-08-000866-index.htm'
  top "//td[@valign='top']" do
    data "//b[@class='blue'][3]"
  end
end

data.to_xml.write($stdout,1)

I use scrubyt 4.03 (search for it on the forum). I also use Firebug's "Inspect Element" to find the HTML structure.

[MODE] Learning
[ACTION] fetching document: http://www.sec.gov/Archives/edgar/data/1350102/000101968708000866/0001019687-08-000866-index.htm
[ACTION] Evaluating top with //td[@valign='top']
[ACTION] Evaluating data with //b[@class='blue'][3]
[ACTION] Evaluating data with //b[@class='blue'][3]
[ACTION] Evaluating data with //b[@class='blue'][3]
[ACTION] Evaluating data with //b[@class='blue'][3]
[ACTION] Evaluating data with //b[@class='blue'][3]
[ACTION] Evaluating data with //b[@class='blue'][3]
[ACTION] Evaluating data with //b[@class='blue'][3]
[ACTION] Evaluating data with //b[@class='blue'][3]
[ACTION] Evaluating data with //b[@class='blue'][3]
[INFO] Extraction finished succesfully!
  <root>
    <top>
      <data>2008-02-29 21:59:45</data>
    </top>
  </root>
 
my ruby on y... 4 posts

Hi Antel, first of all, thanks for the quick reply. That worked pretty well. I have implemented the same technique in my app using Hpricot and Mechanize. I hope you people don't mind a little discussion of Hpricot and Mechanize, because scrubyt itself is built on top of them. This is how I did it:

require 'rubygems'
require 'mechanize'

agent = WWW::Mechanize.new
page = agent.get('http://www.sec.gov/Archives/edgar/data/1350102/000101968708000866/0001019687-08-000866-index.htm')
# You can use Hpricot(open(url)) to fetch the page, but I have some open-uri issues…
doc = Hpricot(page.content)
time = doc.search("//b[@class='blue']")[5].inner_html
# [5] is the index of the element you want
# .inner_html of the element, and that's all…
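
A fixed position such as [5] or [3] stops working if the page gains or loses a bold field. One alternative is to look for the "Accepted:" text and take the b element that follows it. Below is a minimal sketch of that idea using REXML from the Ruby standard library, not Hpricot or scrubyt. The SAMPLE markup and the accepted_time helper are hypothetical and only stand in for the real page, whose structure may differ.

```ruby
require 'rexml/document'

# Hypothetical, simplified stand-in for the filing index page.
# This is NOT the real SEC markup; it only mimics "label, then bold value".
SAMPLE = <<-XML
<table>
  <tr>
    <td valign="top">
      Filing Date: <b class="blue">2008-02-29</b>
      Accepted: <b class="blue">2008-02-29 21:59:45</b>
    </td>
  </tr>
</table>
XML

# Hypothetical helper: returns the text of the first <b> that follows
# an "Accepted:" text node inside a td[@valign='top'], or nil if none.
def accepted_time(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, "//td[@valign='top']").each do |td|
    nodes = td.children
    nodes.each_with_index do |node, i|
      # Only plain text nodes can carry the label.
      next unless node.is_a?(REXML::Text) && node.value.include?('Accepted:')
      # Take the next <b> sibling after the label.
      following = nodes.drop(i + 1).find do |n|
        n.is_a?(REXML::Element) && n.name == 'b'
      end
      return following.text.strip if following
    end
  end
  nil
end

puts accepted_time(SAMPLE)
```

REXML only parses well-formed XML, so on real-world HTML the same loop would need an HTML-tolerant parser; the label-based lookup itself stays the same.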