Recent Posts

1432 posts found

Aug 2, 2008
Avatar iclou 5 posts

Topic: Problems, Suggestions, Bugs and Other Insects / HTML vs DOM parsing

Hi all

I think scrubyt is a great tool. For 3 days I have been trying to parse a table that is generated by JavaScript. I'd like to get the indices from the following URL: http://www.nyse.com/nonflash/market_module.html

The code below does not work – I am desperate…any help/hint is very much appreciated

Best regards, Mike

require 'rubygems'
require 'scrubyt'
require 'firewatir'

Scrubyt.logger = Scrubyt::Logger.new

ff = FireWatir::Firefox.new()

top_url = 'http://www.nyse.com/nonflash/market_module.html'

url_data = Scrubyt::Extractor.define(:agent => :firefox) do
  fetch(top_url)
  extract "/html/body/div/table[3]/tbody/tr/td/table[1]/tbody/tr[2]/td[1]" do
    record do
      text "/a/strong"
    end
  end
end

url_data.to_xml.write($stdout, 1)

 
Aug 2, 2008
Avatar iclou 5 posts

Topic: Problems, Suggestions, Bugs and Other Insects / HTML vs DOM parsing

Hi all

How do I know whether the extractor parses the HTML source or the browser's DOM? I use FireWatir and would like to parse the browser's DOM. How do I specify that?

E.g. the following example goes to the HTML source. What needs to be changed so that it parses the DOM?

fetch 'http://finance.yahoo.com/'

stockinfo "/html/body/div/div/div/div/div/div/table/tbody/tr" do
  symbol "/td[1]/a[1]"
  value "/td[3]/span[1]/b[1]"
end

Thanks for your help Mike

 
Aug 2, 2008
Avatar iclou 5 posts

Topic: Problems, Suggestions, Bugs and Other Insects / XPath followed by table

Let me phrase my question differently.

Is it possible to apply the text pattern after an XPath query?

Thanks, Mike

 
Aug 1, 2008
Avatar jlb180 1 post

Topic: Problems, Suggestions, Bugs and Other Insects / Steps for Installing on Vista, w/ Active Record

I’ve been experiencing the issue with RubyInline conflicts when Active Record is used. For some reason, uninstalling RubyInline 3.7.0 does not work on my system. I’ve read in other posts that the jruby solution is now defunct and the only thing that actually does work currently on Windows is using scrubyt 0.2.6.

Is this true? Have people used 0.3.4 successfully on Windows recently? If so, what are the steps for installation?

I’m from a non-profit that is assembling a volunteer-based scraping framework for collecting information from Government of Canada websites. The scrubyt pattern extractors from 0.3.4 are attractive for use on a large scale, however a requirement is that the framework must be able to be installed reliably on a number of volunteer machines, many of which will be running Windows.

The less time wasted, the better. Please advise.

Jennifer

 
Aug 1, 2008
Avatar iclou 5 posts

Topic: Problems, Suggestions, Bugs and Other Insects / XPath followed by table

Hi all

I have an XPath expression which points to a table. I need to catch the data from the table.

How can I extract this data? The following does not work.

url_data = Scrubyt::Extractor.define(:agent => :firefox) do
  data "/html/body/div/table[3]/tbody/tr/td/table[1]" do
    detail do
      item "name"
      value "value"
    end
  end
end

Thanks for any tips. Mike

 
Jul 31, 2008
Avatar iclou 5 posts

Topic: Problems, Suggestions, Bugs and Other Insects / fireWatir and parsing DOM, possible at all?

Hi all

scRubyt is such a great tool, congratulations!! I need to parse a page which is mainly generated by JavaScript, so I use FireWatir.

Is it possible then to parse the DOM tree? I use the following snippet:

ff = FireWatir::Firefox.new()

top_url = 'http://www….'

ff.goto(top_url)

url_data = Scrubyt::Extractor.define(:agent => :firefox) do
  data do
    detail "/html/body/div/table[3]/tbody/tr/td/table[1]/tbody/tr[2]/td[2]"
  end
end

I get the following error message:

/var/lib/gems/1.8/gems/scrubyt-0.3.4/lib/scrubyt/core/navigation/fetch_action.rb:120:in `get_hpricot_doc': uninitialized class variable @@hpricot_doc in Scrubyt::FetchAction (NameError)

Any help is very much appreciated. Thanks, Mike

 
Jul 31, 2008
Avatar Chewi 8 posts

Topic: Problems, Suggestions, Bugs and Other Insects / Problem with a very simple example

Heh just read that the RubyInline stuff is being dropped anyway. Bugger.

 
Jul 30, 2008
Avatar Chewi 8 posts

Topic: Problems, Suggestions, Bugs and Other Insects / Problem with a very simple example

I’ve fixed it but I’m really surprised that the problem wasn’t fixed already or at least reported to RubyInline before. After realising that it was caching the generated code (doh!), it was quite simple to fix.

Issue: http://rubyforge.org/tracker/index.php?func=detail&aid=21396&group_id=440&atid=1776

Patch: http://rubyforge.org/tracker/download.php/440/1776/21396/3945/patch

scrubyt depends on RubyInline 3.6.3 specifically. It will have to move up to the latest version once this fix is applied but from what little I’ve seen, it at least works with 3.6.7 anyway. Maybe 3.7.0 already works.

 
Jul 30, 2008
Avatar bubfranks 1 post

Topic: Problems, Suggestions, Bugs and Other Insects / RubyInline version conflict

Hello, I'm trying to install on Cygwin. ParseTreeReloaded and ruby2ruby install fine, but then installing scrubyt dies with:

ERROR: Error installing scrubyt: scrubyt requires RubyInline (= 3.6.3, runtime)

Sounds like the same problem.

bump

 
Jul 30, 2008
Avatar danieli 1 post

Topic: Usage HOWTOs / Multiple XML export

Hello to everybody!

First of all, I know this is a silly question. I'm a sysadmin without much experience in Ruby programming, but I'm used to Reading Tons of Fine Manuals, and tons of messages on mailing lists and forums. I'm almost always lucky and find the solution myself, but not this time, so here I am:

I want to write an extractor which crawls a large site and saves one XML file for each matching page it finds. Is there a way to save XML during a block declaration? Do I save the results in an array and then loop through it, creating many small XML files? Or shall I write a recursive function? Maybe the .to_hash method is the right answer? The goal is to provide a form of throttling when importing the XML files into the db…

Please don't call me a newbie…well…yes, I'm a Ruby/scrubyt newbie. I love to have clear ideas before starting to code. Moreover, I have read almost every post here in the forum and I see you are very, very kind and patient.

Thank you in advance!

Dani
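
One possible shape for the "array, then many small XMLs" idea from the post above, as a rough sketch only. It assumes the scraped records are already available as an array of hashes (for example from a to_hash-style export, which is an assumption here and not a confirmed scrubyt call). The helper name write_xml_batches, the element names and the file naming are all hypothetical; only plain Ruby and REXML are used.

```ruby
require 'rexml/document'
require 'tmpdir'

# Hypothetical helper: writes `records` (an array of hashes) into several
# small XML files holding at most `batch_size` records each.
# Hash keys become element names, so they must be valid XML names.
# Returns the list of file paths that were written.
def write_xml_batches(records, dir, batch_size: 100)
  records.each_slice(batch_size).each_with_index.map do |batch, i|
    doc = REXML::Document.new
    root = doc.add_element('records')
    batch.each do |record|
      node = root.add_element('record')
      record.each { |key, value| node.add_element(key.to_s).text = value.to_s }
    end
    path = File.join(dir, format('batch_%04d.xml', i + 1))
    File.open(path, 'w') { |f| doc.write(f) }
    path
  end
end

# Usage with made-up data: 5 records, 2 per file, gives 3 files.
Dir.mktmpdir do |dir|
  sample = (1..5).map { |n| { title: "Page #{n}", url: "http://example.com/#{n}" } }
  puts write_xml_batches(sample, dir, batch_size: 2).map { |p| File.basename(p) }
end
```

Small files like these can then be imported one at a time, which gives the throttling on the database side that the post asks about.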

 
Jul 30, 2008
Avatar Chewi 8 posts

Topic: Problems, Suggestions, Bugs and Other Insects / Problem with a very simple example

This is STILL happening. I remember it happening way back when I first tried scrubyt ages ago. 64-bit installations are becoming quite common now, isn’t it about time someone sorted this out?

 
Jul 28, 2008
Avatar Joe 1 post

Topic: Problems, Suggestions, Bugs and Other Insects / RubyInline version conflict

Hi Peter,

I'm looking forward to Scrubyt v0.4.0 as well. I had to force install Scrubyt due to the RubyInline issue and I still get an error when running a scrubyt script:

ERROR: RubyGem version error: RubyInline(3.7.0 not = 3.6.3)

So forge ahead with the new Scrubyt. As for anyone else, I would appreciate any tips on replacing RubyInline version 3.7.0 with 3.6.3.

 
Jul 25, 2008
Avatar HarryLi 2 posts

Topic: Usage HOWTOs / Hi,how to deal with Non-unicode site?

Thank you so much, Autch. Your code is pretty straightforward and it solved my problem.

 
Jul 24, 2008
Avatar Paul 1 post

Topic: Usage HOWTOs / Throttle scraping?

I’m using scRUBYt to scrape forums for a research-related search application I am creating. Many of the sites are large and I don’t want to annoy anybody by bombarding them with requests over the several hours it would take to scrape at full speed – is there any way to throttle the rate at which scRUBYt scrapes?
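
A generic way to slow things down, sketched in plain Ruby and independent of scRUBYt itself: drive the scrape one URL at a time and enforce a minimum gap between requests. Whether scRUBYt 0.3.4 has a built-in throttle option is not confirmed here. The helper name each_throttled and its parameters are hypothetical, and the URLs are placeholders.

```ruby
# Hypothetical helper: yields each URL in turn and makes sure that
# consecutive yields start at least `min_interval` seconds apart.
# `sleeper` is injectable so the waiting can be stubbed out in tests.
def each_throttled(urls, min_interval: 2.0, sleeper: Kernel.method(:sleep))
  last_started = nil
  urls.each do |url|
    if last_started
      elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - last_started
      wait = min_interval - elapsed
      sleeper.call(wait) if wait > 0
    end
    last_started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield url
  end
end

# Usage: the block is where a single-page extractor run would go.
pages = ['http://example.com/forum?page=1', 'http://example.com/forum?page=2']
each_throttled(pages, min_interval: 1.0) do |url|
  puts "would scrape #{url}"
end
```

Because the interval is measured from the start of the previous request, a slow page load counts towards the waiting time instead of being added on top of it.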

 
Jul 24, 2008
Avatar rstehwien 1 post

Topic: Usage HOWTOs / Filtering Links with a Pattern

I'm trying to learn scRUBYt by pulling out all the WizKids miniature images. How can I filter the links returned below down to only those whose href contains `releaseid=`? For example, I want <a href="figuregallery.asp?releaseid=11"> as well as <a href="figuregallery.asp?releaseid=99">.
web_data = Scrubyt::Extractor.define do
  fetch 'http://www.wizkidsgames.com/heroclix/dc/figuregallery.asp'

  release_links "//td"  do
    link "//a" do
      url   "href", :type => :attribute
    end
  end

  # 1. on first page find lines like this for each release:
  #  <td align="center"><font class="body"><a href="figuregallery.asp?releaseid=11"><img alt="Hypertime" src="/images/releases/release_Hypertime.gif" border="0"></a><br>Hypertime</font></td>
  # 2. on those links find lines like this for each character
  #  <tr><td class="tdbody"><a href="figuregallery.asp?unitid=2414">Aquaman</a></td><td class="tdbody">Rookie</td><td class="tdbody">Hypertime</td>
  # 3. the desired character info looks like
  #  <td colspan="2" align="center"><img src="/images/figures/Rotating/HDHT/HDHT_052_rot01.jpg" border="0" name="imgBase" id="Img1">
  #  <tr><td class="tdheader">Name</td><td class="tdbody">Aquaman</td></tr>
  #  <tr><td class="tdheader">Collector's Number</td><td class="tdbody">052</td></tr>
  # 4. Loop to 2 for each page navigation link that looks like
  #   <a href="figuregallery.asp?action=showsearchresults&Output=0&Flight=0&Aquatic=0&DialType=0&Retired=0&Strength=0&AttackQty=0&GamePlayTips=&DialCountComp=0&DialCountVal=0&ClickCountComp=0&ClickCountVal=0&StatType1=0&StatType2=0&StatType3=0&StatType4=0&StatType1Comp=0&StatType2Comp=0&StatType3Comp=0&StatType4Comp=0&StatType1Val=&StatType2Val=&StatType3Val=&StatType4Val=&PointValComp=0&PointVal=0&RangeComp=0&RangeVal=0&Rarity=&UAType1=0&UAType2=0&UAType1Comp=0&UAType2Comp=0&UAType1Val=&UAType2Val=&FrontArcComp=0&FrontArcVal=0&RearArcComp=0&RearArcVal=0&Ability1=0&Ability2=0&Ability3=0&Ability1Comp=0&Ability2Comp=0&Ability3Comp=0&USLID1=0&USLID2=0&USLID1Comp=0&USLID2Comp=0&USLID1Val=&USLID2Val=&keyword=&searchtype=0&sort=0&factionid=0&releaseid=11&p=2"> 2</a>

end

web_data.to_xml.write($stdout, 1)
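
If the filtering cannot be done inside the extractor, one fallback is to post-process the extracted href values in plain Ruby. The sketch below assumes the hrefs have already been collected into an array of strings; the constant RELEASE_LINK and the helper release_links are hypothetical names. The pattern is deliberately stricter than "contains releaseid=", so the long page-navigation URLs from step 4, which also carry releaseid, are left out.

```ruby
# Matches only the plain release links, where "releaseid=" is the first
# and only query parameter.
RELEASE_LINK = /\Afiguregallery\.asp\?releaseid=\d+\z/

# Hypothetical helper: returns the hrefs that are release-gallery links.
def release_links(hrefs)
  hrefs.grep(RELEASE_LINK)
end

# Sample hrefs shaped like the ones quoted in the post.
hrefs = [
  'figuregallery.asp?releaseid=11',
  'figuregallery.asp?releaseid=99',
  'figuregallery.asp?unitid=2414',
  'figuregallery.asp?action=showsearchresults&releaseid=11&p=2'
]

p release_links(hrefs)
# prints ["figuregallery.asp?releaseid=11", "figuregallery.asp?releaseid=99"]
```

In XPath 1.0 terms the looser version of this filter would be //a[contains(@href, 'releaseid=')]; whether scRUBYt accepts that predicate in a pattern is an assumption that would need testing.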
 
Jul 24, 2008
Avatar autch 1 post

Topic: Usage HOWTOs / Hi,how to deal with Non-unicode site?

Please try this article

http://d.hatena.ne.jp/autch/20080724#1216883135

 
Jul 23, 2008
Avatar aniruddh 4 posts

Topic: Problems, Suggestions, Bugs and Other Insects / Clicked page is not returned by click_by_xpath

Hi, I really need help with this issue. I have a lot of examples that are behaving in the same way. Please reply.

Thanks.

 
Jul 19, 2008
Avatar kioo 1 post

Topic: Usage HOWTOs / Can't get next_page with symbol to work

Hello scRUBYs,

I have a small scrubyt-task to extract movie information from the page film.at, as you can see in this pastie: http://pastie.org/236752

Everything seems to work fine, except that it doesn't turn pages. The problem is that the "next page" link is called "weiter >>", where the arrows are actually the symbol ». The wrapper itself works: I tested it on every page by specifying each one manually with "fetch" and doing a test run.

Anyway, I tried these approaches:
next_page "a[weiter]", :limit => 3        # searching for anchor with text "weiter" 
next_page "weiter &raquo;", :limit => 3   # vanilla approach
next_page "weiter »", :limit => 3         # actual arrow symbol

Could someone point me in the right direction?

 
Jul 16, 2008
Avatar aniruddh 4 posts

Topic: Usage HOWTOs / Firescrubyt Tutorial

Thanks Scrubber

 
Jul 15, 2008
Avatar usul 2 posts

Topic: Usage HOWTOs / Clicking Button

Sorry, I can't paste HTML code here. Please look at the code by clicking on "show page source" and jumping to the part where the words "Start Review" are.

 
Jul 15, 2008
Avatar usul 2 posts

Topic: Usage HOWTOs / Clicking Button

Hi, I'm trying to click a button on a web site. The button isn't inside any form, so I don't know how to use the submit function. The source code of the web site looks like:

This review is in draft.

Click the Start Review button to notify reviewers.

I've tried click_link 'Start Review', click_link 'startReview', submit 'startReview', 'Start Review' and submit 'Start Review', but none of these calls works. Please help.

 
Jul 15, 2008
Avatar aniruddh 4 posts

Topic: Problems, Suggestions, Bugs and Other Insects / Clicked page is not returned by click_by_xpath

Hi,

I was just trying to do stuff with FireScrubyt, and I ran into a problem. Following is the program:

require 'rubygems'
require 'scrubyt'

data = Scrubyt::Extractor.define :agent => :firefox do
  fetch "http://www.wamu.com/store_locator/default.asp"
  fill_textfield "_ctl0:PageBody:_ctl0:txtZipCode", "02199"
  click_by_xpath "/html/body/div[@id='BodyContainer1']/div/form[@id='aspnetForm']/div[@id='Formfooter1']/div/div[@id='formnavigationright']/input"
  rec "/"
end

puts data.to_xml

When I run this program, the source of the fetched page is returned, not that of the clicked one. However, the click clearly happens in Firefox. What I want is the result page after submitting the query.

Please tell me what’s wrong with my approach.

Thanks in advance.

 
Jul 14, 2008
Avatar crasch 7 posts

Topic: Usage HOWTOs / Multiple constraints?

Hmmm…formatting seems a little whacked. Trying again:

Looking at the testcases, it looks like the constraints can be stacked. Here’s how I ultimately solved the problem I was working on:

sample code

Note the symbols_table extractor has two constraints: ensure_presence_of_pattern(‘price’) and select_indices(:all_but_last). I don’t know how deep you can stack the constraints, or if certain constraints can’t be stacked on each other, but the above worked for me.

Chris

 
Jul 14, 2008
Avatar crasch 7 posts

Topic: Usage HOWTOs / Multiple constraints?

Looking at the testcases, it looks like the constraints can be stacked. Here’s how I ultimately solved the problem I was working on:

def get_prices_all_funds()
  # Get a list of all the symbols (and their prices) in all the funds
  symbol_data = Scrubyt::Extractor.define do
    fetch 'https://www.website.com'
    fill_textfield 'loginID', 'username'
    fill_textfield 'password', 'password'
    submit
    click_link 'Model Funds'
    click_link 'Model Fund #1'
    click_link 'View Holdings'
    select_option('query.key','key')
    submit('view_portal','GO')
    symbols_table "//table[@width='100%']" do
      symbol_row "//tr" do
        symbol "//td[1]" do
        end.ensure_absence_of_attribute('class' => 'ftTxt10')
        shares "//td[3]" do
        end.ensure_absence_of_attribute('class' => 'ftTxt10')
        price "//td[4]" do
        end.ensure_absence_of_attribute('class' => 'ftTxt10')
        total_value "//td[5]" do
        end.ensure_absence_of_attribute('class' => 'ftTxt10')
      end
    end.ensure_presence_of_pattern('price').select_indices(:all_but_last)
  end
  return symbol_data
end

Note the symbols_table extractor has two constraints: ensure_presence_of_pattern(‘price’) and select_indices(:all_but_last). I don’t know how deep you can stack the constraints, or if certain constraints can’t be stacked on each other, but the above worked for me.

Chris

 
Jul 14, 2008
Avatar crasch 7 posts

Topic: Problems, Suggestions, Bugs and Other Insects / to_csv missing?

Thanks! Yeah, I went looking for the to_csv function in the code, and it looks like the entire result_dumper.rb file (where to_csv was defined) is no longer used.
