Recent Posts by HarryLi

2 posts found

Jul 25, 2008
HarryLi, 2 posts

Topic: Usage HOWTOs / Hi,how to deal with Non-unicode site?

Thank you so much, Autch. Your code is pretty straightforward, and it solved my problem.

 
Jul 12, 2008
HarryLi, 2 posts

Topic: Usage HOWTOs / Hi,how to deal with Non-unicode site?

When I searched for this, I found only one solution, written for a Japanese-language site:

#!/usr/bin/ruby -Ke

$KCODE = 'e'

require 'kconv'
require 'rubygems'
require 'scrubyt'
require 'nkf'

NEWSFLASH = "http://www.sankei.co.jp/flash/flash.htm"

sankei = Scrubyt::Extractor.define do
  mechanize = WWW::Mechanize.new
  mechanize_doc = mechanize.get(NEWSFLASH)

  # It seems the page can be handled once it is converted to UTF-8
  mechanize_doc.body = NKF.nkf('--utf8', mechanize_doc.body)
  # Unless navigate_actions.rb is patched, the arguments do not match here
  fetch(NEWSFLASH, mechanize_doc)
  # When trying this out, substitute the latest article
  record do
    title "カーコリアン氏が買収提案".toutf8
    time '13:24'
    abstract ("経営不振のクライスラーを45億ドルで。" + "ダイムラーが交渉を進めていることを認めた。").toutf8
  end.ensure_presence_of_pattern('abstract')

end

sankei.to_xml.write($stdout, 1)

Scrubyt::ResultDumper.print_statistics(sankei)

So I want to do something similar with a Chinese search engine, like this:

require 'rubygems'
require 'scrubyt'
require 'iconv'

baidu_data = Scrubyt::Extractor.define do
  mechanize = WWW::Mechanize.new()
  mechanize_doc = mechanize.get("http://www.baidu.com")
  mechanize_doc.body = Iconv.iconv("UTF-8//IGNORE", "GB2312//IGNORE", mechanize_doc.body)

  fetch("http://www.baidu.com", mechanize_doc)
  fill_textfield 'wd', "ruby"
  submit
end
result "Ruby_百度百科"

puts baidu_data.to_xml

It returns an error:

/usr/lib/ruby/gems/1.8/gems/scrubyt-0.3.4/lib/scrubyt/core/navigation/fetch_action.rb:29:in `fetch': undefined method `[]' for #<WWW::Mechanize::Page:0xb76d281c> (NoMethodError)

What I want is a general way to deal with sites that are neither Unicode nor English. I think the basic considerations are:

1. Decide the encoding of my program. Since I use the NetBeans editor and have set it to use Unicode, my program is encoded in Unicode.

2. Because of 1, I must convert every target website to Unicode. For example, www.baidu.com is encoded in GB2312, so I need mechanize_doc.body = Iconv.iconv("UTF-8//IGNORE", "GB2312//IGNORE", mechanize_doc.body) to convert it to Unicode.

3. Use fetch(url, mechanize_doc) to do the fetch. Actually, I don't quite understand what the two-argument form means. I checked the docs and am only guessing at its meaning.
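The conversion in step 2 can be tried on its own, without scRUBYt! or Mechanize. The following is only a minimal sketch under some assumptions: it uses String#encode from Ruby 1.9 or later in place of the Ruby 1.8 Iconv call above, the helper name to_utf8 is hypothetical, and the sample bytes stand in for a page body that would normally come from the network.

```ruby
# Hypothetical helper (not part of scRUBYt! or Mechanize): convert raw bytes
# in a legacy encoding to UTF-8, dropping anything that cannot be converted,
# which is roughly what the //IGNORE suffix asks Iconv to do.
def to_utf8(raw_bytes, source_encoding = "GB2312")
  raw_bytes.encode("UTF-8", source_encoding,
                   invalid: :replace, undef: :replace, replace: "")
end

# Sample input: the GB2312 bytes for the two characters of "Baidu" in Chinese.
# In a real scraper this would be the downloaded page body instead.
gb2312_bytes = "\xB0\xD9\xB6\xC8".b

utf8_text = to_utf8(gb2312_bytes)
puts utf8_text
puts utf8_text.encoding
```

The result is a single String tagged as UTF-8, so it can be compared directly with UTF-8 literals in a program saved as Unicode, as in step 1.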

Is this the basic way to deal with a non-Unicode site? And what is wrong with my code that causes that error?

TIA!