Recent Posts
|
Aug 2, 2008
|
Topic: Problems, Suggestions, Bugs and Other Insects / HTML vs DOM parsing
Hi all,
I think scrubyt is a great tool. For 3 days I have been trying to parse a table that is generated by JavaScript. I'd like to get the indices from the following URL: http://www.nyse.com/nonflash/market_module.html
The code below does not work. I am desperate; any help or hint is very much appreciated.
Best regards, Mike
require 'rubygems'
require 'scrubyt'
require 'firewatir'
Scrubyt.logger = Scrubyt::Logger.new
ff = FireWatir::Firefox.new()
top_url = 'http://www.nyse.com/nonflash/market_module.html'
url_data = Scrubyt::Extractor.define(:agent => :firefox) do
end
url_data.to_xml.write($stdout, 1) |
|
Aug 2, 2008
|
Topic: Problems, Suggestions, Bugs and Other Insects / HTML vs DOM parsing
Hi all,
How do I know whether the extractor parses the HTML source or the browser's DOM? I use FireWatir and I would like to parse the browser's DOM. How do I specify that? For example, the following goes to the HTML source; what needs to change so that it parses the DOM?
fetch 'http://finance.yahoo.com/'
stockinfo "/html/body/div/div/div/div/div/div/table/tbody/tr" do
end
Thanks for your help, Mike |
|
Aug 2, 2008
|
Topic: Problems, Suggestions, Bugs and Other Insects / XPath followed by table Let me phrase my question differently. Is it possible to apply the text pattern after an XPath query? Thanks, Mike |
|
Aug 1, 2008
|
Topic: Problems, Suggestions, Bugs and Other Insects / Steps for Installing on Vista, w/ Active Record I’ve been experiencing the issue with RubyInline conflicts when Active Record is used. For some reason, uninstalling RubyInline 3.7.0 does not work on my system. I’ve read in other posts that the jruby solution is now defunct and the only thing that actually does work currently on Windows is using scrubyt 0.2.6. Is this true? Have people used 0.3.4 successfully on Windows recently? If so, what are the steps for installation? I’m from a non-profit that is assembling a volunteer-based scraping framework for collecting information from Government of Canada websites. The scrubyt pattern extractors from 0.3.4 are attractive for use on a large scale, however a requirement is that the framework must be able to be installed reliably on a number of volunteer machines, many of which will be running Windows. The less time wasted, the better. Please advise. Jennifer |
|
Aug 1, 2008
|
Topic: Problems, Suggestions, Bugs and Other Insects / XPath followed by table
Hi all,
I have an XPath expression that points to a table, and I need to extract the data from that table. How can I do this? The following does not work:
url_data = Scrubyt::Extractor.define(:agent => :firefox) do
end
Thanks for any tips. Mike |
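The general idea of "XPath first, then the table contents" can be sketched outside scRUBYt. The snippet below is only an illustration using Ruby's REXML library, not scRUBYt's own API; the markup, the `indices` id and the values are made up, and REXML needs well-formed XML, which real-world HTML often is not.

```ruby
require 'rexml/document'

# Made-up, well-formed markup standing in for the fetched page.
page = <<~XHTML
  <html><body>
    <table id="indices">
      <tr><td>Index A</td><td>100.5</td></tr>
      <tr><td>Index B</td><td>200.25</td></tr>
    </table>
  </body></html>
XHTML

doc = REXML::Document.new(page)

# Step 1: the XPath expression that points at the table.
table = REXML::XPath.first(doc, "//table[@id='indices']")

# Step 2: relative XPaths from that node pick out the rows, then the cells.
rows = REXML::XPath.match(table, 'tr').map do |tr|
  REXML::XPath.match(tr, 'td').map { |td| td.text.strip }
end

rows.each { |name, value| puts "#{name}: #{value}" }
```

Each row ends up as an array of cell strings, so any further text matching can be applied to plain Ruby strings after the XPath step.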
|
Jul 31, 2008
|
Topic: Problems, Suggestions, Bugs and Other Insects / fireWatir and parsing DOM, possible at all?
Hi all,
scRUBYt is such a great tool, congratulations! I need to parse a page that is mainly generated by JavaScript, so I use FireWatir. Is it then possible to parse the DOM tree? I use the following snippet:
ff = FireWatir::Firefox.new()
top_url = 'http://www….'
ff.goto(top_url)
url_data = Scrubyt::Extractor.define(:agent => :firefox) do
end
I get the following error message:
/var/lib/gems/1.8/gems/scrubyt-0.3.4/lib/scrubyt/core/navigation/fetch_action.rb:120:in `get_hpricot_doc': uninitialized class variable @@hpricot_doc in Scrubyt::FetchAction (NameError)
Any help is very much appreciated. Thanks, Mike |
|
Jul 31, 2008
|
Topic: Problems, Suggestions, Bugs and Other Insects / Problem with a very simple example Heh just read that the RubyInline stuff is being dropped anyway. Bugger. |
|
Jul 30, 2008
|
Topic: Problems, Suggestions, Bugs and Other Insects / Problem with a very simple example I’ve fixed it but I’m really surprised that the problem wasn’t fixed already or at least reported to RubyInline before. After realising that it was caching the generated code (doh!), it was quite simple to fix. Issue: http://rubyforge.org/tracker/index.php?func=detail&aid=21396&group_id=440&atid=1776 Patch: http://rubyforge.org/tracker/download.php/440/1776/21396/3945/patch scrubyt depends on RubyInline 3.6.3 specifically. It will have to move up to the latest version once this fix is applied but from what little I’ve seen, it at least works with 3.6.7 anyway. Maybe 3.7.0 already works. |
|
Jul 30, 2008
|
Topic: Problems, Suggestions, Bugs and Other Insects / RubyInline version conflict Hello, I'm trying to install on Cygwin; parsetreereloaded and ruby2ruby install fine. Then installing scrubyt dies with: ERROR: Error installing scrubyt: scrubyt requires RubyInline (= 3.6.3, runtime) Sounds like the same problem. bump |
|
Jul 30, 2008
|
Topic: Usage HOWTOs / Multiple XML export Hello everybody! First of all, I know this is probably a silly question. I'm a sysadmin without much experience in Ruby programming, but I'm used to Reading Tons of Fine Manuals, and tons of messages on mailing lists and forums. I'm almost always lucky and find the solution by myself, but not this time, so here I am: I want to write an extractor that crawls a large site and saves one XML file for each matching page it finds. Is there a way to save XML from inside a block declaration? Should I save the results in an array and then loop through it, creating many small XML files? Or should I write a recursive function? Maybe the .to_hash method is the right answer? The goal is to provide a form of throttling when importing the XML files into the database. Please don't call me a newbie… well… yes, I am a Ruby/scrubyt newbie. I like to have clear ideas before starting to code. Moreover, I have read almost every post here in the forum and I see you are very, very kind and patient. Thank you in advance! Dani |
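One of the options raised above (collect the results in an array, then loop over it and write many small XML files) can be sketched in plain Ruby. This is a minimal sketch under assumptions: the records are flat hashes such as a `to_hash`-style export might give, and the helper names (`xml_escape`, `record_to_xml`, `export_records`), the file naming and the sample data are all made up for illustration.

```ruby
require 'tmpdir'

# The five characters that are special in XML text and their replacements.
XML_ESCAPES = {
  '&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;', "'" => '&apos;'
}.freeze

def xml_escape(text)
  text.to_s.gsub(/[&<>"']/, XML_ESCAPES)
end

# Turn one flat hash into a small XML document (hypothetical layout).
def record_to_xml(record)
  fields = record.map { |key, value| "  <#{key}>#{xml_escape(value)}</#{key}>" }
  "<record>\n#{fields.join("\n")}\n</record>\n"
end

# Write one file per record, pausing between writes so that a downstream
# importer is not flooded. Returns the list of paths written.
def export_records(records, dir, pause_seconds)
  records.each_with_index.map do |record, index|
    path = File.join(dir, format('record_%04d.xml', index + 1))
    File.write(path, record_to_xml(record))
    sleep(pause_seconds) if index < records.size - 1
    path
  end
end

# Made-up sample data standing in for the extractor's results.
records = [
  { title: 'First page', url: 'http://example.com/1' },
  { title: 'Tom & Jerry', url: 'http://example.com/2' }
]

out_dir = Dir.mktmpdir('xml_export')
paths = export_records(records, out_dir, 0.1)
puts paths
```

Keeping the pause in the export loop rather than in the crawl means the import rate can be tuned without touching the extractor.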
|
Jul 30, 2008
|
Topic: Problems, Suggestions, Bugs and Other Insects / Problem with a very simple example This is STILL happening. I remember it happening way back when I first tried scrubyt ages ago. 64-bit installations are becoming quite common now, isn’t it about time someone sorted this out? |
|
Jul 28, 2008
|
Topic: Problems, Suggestions, Bugs and Other Insects / RubyInline version conflict Hi Peter, I'm looking forward to scrubyt v0.4.0 as well. I had to force-install scrubyt because of the RubyInline issue, and I still get an error when running a scrubyt script: ERROR: RubyGem version error: RubyInline(3.7.0 not = 3.6.3) So forge ahead with the new scrubyt. As for anyone else, I would appreciate any tips on replacing RubyInline version 3.7.0 with 3.6.3. |
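The error above is RubyGems comparing the installed RubyInline version against the exact version scrubyt asks for. A minimal sketch of that comparison, using only the standard `Gem::Requirement` and `Gem::Version` classes; the version strings are the ones quoted in this thread and the variable names are illustrative.

```ruby
require 'rubygems'

# The requirement quoted in the error message: exactly RubyInline 3.6.3.
requirement = Gem::Requirement.new('= 3.6.3')

# The two versions mentioned in this thread.
installed = Gem::Version.new('3.7.0')
wanted    = Gem::Version.new('3.6.3')

installed_ok = requirement.satisfied_by?(installed)
wanted_ok    = requirement.satisfied_by?(wanted)

puts "3.7.0 satisfies '= 3.6.3'? #{installed_ok}"
puts "3.6.3 satisfies '= 3.6.3'? #{wanted_ok}"
```

If both versions are installed side by side, calling `gem 'RubyInline', '= 3.6.3'` before `require 'scrubyt'` asks RubyGems to activate the older one. Whether that is enough for scrubyt 0.3.4 on a given system is not something this sketch verifies.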
|
Jul 25, 2008
|
Topic: Usage HOWTOs / Hi,how to deal with Non-unicode site? Thank you so much, Autch. Your code is pretty straightforward, and it solved my problem. |
|
Jul 24, 2008
|
Topic: Usage HOWTOs / Throttle scraping? I’m using scRUBYt to scrape forums for a research-related search application I am creating. Many of the sites are large and I don’t want to annoy anybody by bombarding them with requests over the several hours it would take to scrape at full speed – is there any way to throttle the rate at which scRUBYt scrapes? |
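A simple way to throttle, independent of any scRUBYt feature, is to enforce a minimum delay between page fetches in the driving Ruby script. The sketch below assumes the scrape can be split into one extractor run per URL; the `Throttle` class is a hypothetical helper, not part of scRUBYt, and the URLs and the 0.2 second interval are placeholders.

```ruby
# Hypothetical helper, not part of scRUBYt: enforces a minimum delay
# between consecutive calls to #wait.
class Throttle
  def initialize(min_interval_seconds)
    @min_interval = min_interval_seconds
    @last_call = nil
  end

  # Sleeps just long enough that calls are at least @min_interval apart.
  # Returns the number of seconds it asked to sleep.
  def wait
    now = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    slept = 0.0
    if @last_call
      remaining = @min_interval - (now - @last_call)
      if remaining > 0
        sleep(remaining)
        slept = remaining
      end
    end
    @last_call = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    slept
  end
end

# 0.2 s keeps this demo quick; a polite crawler would use several seconds.
throttle = Throttle.new(0.2)
urls = %w[http://example.com/page1 http://example.com/page2 http://example.com/page3]

started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
urls.each do |url|
  throttle.wait
  # A real script would run its extractor for `url` here.
  puts "would fetch #{url}"
end
elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
puts format('elapsed: %.2f s', elapsed)
```

Because the delay is measured from the previous fetch, slow pages do not add extra waiting on top of the time they already took.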
|
Jul 24, 2008
|
Topic: Usage HOWTOs / Filtering Links with a Pattern I'm trying to learn scRUBYt by pulling out all the WizKids miniature images. How can I filter the links returned below down to only those whose href contains `releaseid=`? For example, I want <a href="figuregallery.asp?releaseid=11"> as well as <a href="figuregallery.asp?releaseid=99">.
web_data = Scrubyt::Extractor.define do
  fetch 'http://www.wizkidsgames.com/heroclix/dc/figuregallery.asp'
  release_links "//td" do
    link "//a" do
      url "href", :type => :attribute
    end
  end
# 1. on first page find lines like this for each release:
# <td align="center"><font class="body"><a href="figuregallery.asp?releaseid=11"><img alt="Hypertime" src="/images/releases/release_Hypertime.gif" border="0"></a><br>Hypertime</font></td>
# 2. on those links find lines like this for each character
# <tr><td class="tdbody"><a href="figuregallery.asp?unitid=2414">Aquaman</a></td><td class="tdbody">Rookie</td><td class="tdbody">Hypertime</td>
# 3. the desired character info looks like
# <td colspan="2" align="center"><img src="/images/figures/Rotating/HDHT/HDHT_052_rot01.jpg" border="0" name="imgBase" id="Img1">
# <tr><td class="tdheader">Name</td><td class="tdbody">Aquaman</td></tr>
# <tr><td class="tdheader">Collector's Number</td><td class="tdbody">052</td></tr>
# 4. Loop to 2 for each page navigation link that looks like
# <a href="figuregallery.asp?action=showsearchresults&Output=0&Flight=0&Aquatic=0&DialType=0&Retired=0&Strength=0&AttackQty=0&GamePlayTips=&DialCountComp=0&DialCountVal=0&ClickCountComp=0&ClickCountVal=0&StatType1=0&StatType2=0&StatType3=0&StatType4=0&StatType1Comp=0&StatType2Comp=0&StatType3Comp=0&StatType4Comp=0&StatType1Val=&StatType2Val=&StatType3Val=&StatType4Val=&PointValComp=0&PointVal=0&RangeComp=0&RangeVal=0&Rarity=&UAType1=0&UAType2=0&UAType1Comp=0&UAType2Comp=0&UAType1Val=&UAType2Val=&FrontArcComp=0&FrontArcVal=0&RearArcComp=0&RearArcVal=0&Ability1=0&Ability2=0&Ability3=0&Ability1Comp=0&Ability2Comp=0&Ability3Comp=0&USLID1=0&USLID2=0&USLID1Comp=0&USLID2Comp=0&USLID1Val=&USLID2Val=&keyword=&searchtype=0&sort=0&factionid=0&releaseid=11&p=2"> 2</a>
end
web_data.to_xml.write($stdout, 1)
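One possible approach, sketched in plain Ruby rather than with any scRUBYt filter, is to post-process the extracted href values with a regular expression. The hrefs below are sample values: the `releaseid` and `unitid` links are taken from the comments above, `index.asp` is made up, and the constant name `RELEASE_LINK` is illustrative.

```ruby
# Sample hrefs standing in for the values the extractor returns.
hrefs = [
  'figuregallery.asp?releaseid=11',
  'figuregallery.asp?releaseid=99',
  'figuregallery.asp?unitid=2414',
  'index.asp'
]

# Matches releaseid as a query parameter and captures its numeric value.
RELEASE_LINK = /[?&]releaseid=(\d+)/

release_links = hrefs.select { |href| href =~ RELEASE_LINK }
release_ids   = release_links.map { |href| href[RELEASE_LINK, 1].to_i }

puts release_links
puts release_ids.inspect
```

Note that the page navigation links from step 4 also carry `releaseid=11`, so this pattern would keep them too; adding a check that the href has no `p=` parameter is one way to tell the two apart.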
|
|
Jul 24, 2008
|
Topic: Usage HOWTOs / Hi,how to deal with Non-unicode site? Please try this article |
|
Jul 23, 2008
|
Topic: Problems, Suggestions, Bugs and Other Insects / Clicked page is not returned by click_by_xpath Hi, I really need help with this issue. I have a lot of examples that are behaving in the same way. Please reply. Thanks. |
|
Jul 19, 2008
|
Topic: Usage HOWTOs / Can't get next_page with symbol to work
Hello scRUBYs,
I have a small scrubyt task to extract movie information from the site film.at, as you can see in this pastie: http://pastie.org/236752
Everything seems to work fine, except that it doesn't turn pages. The problem is that the "next page" link is called "weiter >>", where the arrows are the symbol ». The wrapper itself works: I tested it on every page by specifying each one manually with "fetch" and doing a test run. Anyway, I tried these approaches:
next_page "a[weiter]", :limit => 3 # searching for anchor with text "weiter"
next_page "weiter »", :limit => 3 # vanilla approach
next_page "weiter »", :limit => 3 # actual arrow symbol
Could someone point me in the right direction? |
|
Jul 16, 2008
|
Topic: Usage HOWTOs / Firescrubyt Tutorial Thanks Scrubber |
|
Jul 15, 2008
|
Topic: Usage HOWTOs / Clicking Button Sorry, I can't paste HTML code here. Please view the page source and jump to the part containing the words "Start Review". |
|
Jul 15, 2008
|
Topic: Usage HOWTOs / Clicking Button Hi, I'm trying to click a button on a web site. The button isn't inside any form, so I don't know how to use the submit function. The relevant part of the page looks like:
This review is in draft.
Click the Start Review button to notify reviewers.
I've tried click_link 'Start Review', click_link 'startReview', submit 'startReview', 'Start Review' and submit 'Start Review', but none of these calls works. Please help |
|
Jul 15, 2008
|
Topic: Problems, Suggestions, Bugs and Other Insects / Clicked page is not returned by click_by_xpath
Hi, I was just trying out FireScrubyt and ran into a problem. Here is the program:
require 'rubygems'
require 'scrubyt'
data = Scrubyt::Extractor.define :agent => :firefox do
end
puts data.to_xml
When I run this program, the code of the fetched page is returned, not that of the clicked one, although the click clearly happens in Firefox. What I want is the result page after submitting the query. Please tell me what's wrong with my approach. Thanks in advance. |
|
Jul 14, 2008
|
Topic: Usage HOWTOs / Multiple constraints? Hmmm…formatting seems a little whacked. Trying again: Looking at the testcases, it looks like the constraints can be stacked. Here’s how I ultimately solved the problem I was working on: Note the symbols_table extractor has two constraints: ensure_presence_of_pattern(‘price’) and select_indices(:all_but_last). I don’t know how deep you can stack the constraints, or if certain constraints can’t be stacked on each other, but the above worked for me. Chris |
|
Jul 14, 2008
|
Topic: Usage HOWTOs / Multiple constraints? Looking at the test cases, it looks like the constraints can be stacked. Here's how I ultimately solved the problem I was working on:
def get_prices_all_funds()
  # Get a list of all the symbols (and their prices) in all the funds
  symbol_data = Scrubyt::Extractor.define do
    fetch 'https://www.website.com'
    fill_textfield 'loginID', 'username'
    fill_textfield 'password', 'password'
    submit
    click_link 'Model Funds'
    click_link 'Model Fund #1'
    click_link 'View Holdings'
    select_option('query.key','key')
    submit('view_portal','GO')
    symbols_table "//table[@width='100%']" do
      symbol_row "//tr" do
      end
Note the symbols_table extractor has two constraints: ensure_presence_of_pattern(‘price’) and select_indices(:all_but_last). I don’t know how deep you can stack the constraints, or if certain constraints can’t be stacked on each other, but the above worked for me. Chris |
|
Jul 14, 2008
|
Topic: Problems, Suggestions, Bugs and Other Insects / to_csv missing? Thanks! Yeah, I went looking for the to_csv function in the code, and it looks like the entire result_dumper.rb file (where to_csv was defined) is no longer used. |