Recent Posts by HarryLi
|
Jul 25, 2008
|
Topic: Usage HOWTOs / Hi, how to deal with Non-unicode site?
Thank you so much, Autch. Your code is pretty straightforward, and it solved my problem. |
|
Jul 12, 2008
|
Topic: Usage HOWTOs / Hi, how to deal with Non-unicode site?
When I searched into this, I found only one solution, for a Japanese-language site:

    #!/usr/bin/ruby -Ke
    $KCODE = 'e'
    require 'kconv'
    require 'rubygems'
    require 'scrubyt'
    require 'nkf'
    NEWSFLASH = "http://www.sankei.co.jp/flash/flash.htm"
    sankei = Scrubyt::Extractor.define do
      mechanize = WWW::Mechanize.new
      mechanize_doc = mechanize.get(NEWSFLASH)

    end
    sankei.to_xml.write($stdout, 1)
    Scrubyt::ResultDumper.print_statistics(sankei)

So I just want to do something similar with a Chinese search engine, like this:

    require 'rubygems'
    require 'scrubyt'
    require 'iconv'
    baidu_data = Scrubyt::Extractor.define do
      mechanize = WWW::Mechanize.new()
      mechanize_doc = mechanize.get("http://www.baidu.com")
      mechanize_doc.body = Iconv.iconv("UTF-8//IGNORE", "GB2312//IGNORE", mechanize_doc.body)

    puts baidu_data.to_xml

It returns an error:

    /usr/lib/ruby/gems/1.8/gems/scrubyt-0.3.4/lib/scrubyt/core/navigation/fetch_action.rb:29:in `fetch': undefined method `[]' for #<WWW::Mechanize::Page:0xb76d281c> (NoMethodError)

What I want is a general way to deal with non-Unicode or non-English sites. I think the basic considerations are:

1. Decide the encoding of my program. Since I use the NetBeans editor and set it to use Unicode, my program is encoded in Unicode.
2. Because of 1, I must convert every target website to Unicode. For example, www.baidu.com is encoded in GB2312, so I need mechanize_doc.body=Iconv.iconv("UTF-8//IGNORE","GB2312//IGNORE", mechanize_doc.body) to change it to Unicode.
3. Use fetch(url, mechanize_doc) to do the fetch. Actually, I don't quite understand what the two-argument form means; I checked the docs and am just guessing at its meaning.

Is this the basic way to deal with a non-Unicode site? And what's wrong with my code, and what causes that error? TIA! |
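As a rough illustration of step 2 above (converting a GB2312 page body to Unicode), here is a minimal sketch for Ruby 1.9 or later, where String#encode stands in for the Iconv call used in the post. The assumptions: to_utf8 is a hypothetical helper name, the sample bytes are hand-written rather than fetched from www.baidu.com, and GBK is used as the source label because it is a superset of GB2312. It does not involve scrubyt or Mechanize at all, so it says nothing about the fetch error. One related point: Iconv.iconv returns an Array of strings, while Iconv.conv returns a single String, which may matter when the result is assigned to a page body.

```ruby
# Sketch only: convert legacy-encoded bytes to UTF-8 with String#encode.
# to_utf8 is a hypothetical helper, not part of scrubyt or Mechanize.
def to_utf8(raw_bytes, source_encoding = "GBK")
  # Tag the raw bytes with their real encoding, then transcode to UTF-8.
  raw_bytes.dup.force_encoding(source_encoding).encode(
    "UTF-8",
    invalid: :replace, # swap malformed byte sequences for the replacement
    undef:   :replace, # swap characters that have no UTF-8 mapping
    replace: "?"
  )
end

# Hand-written sample: the two characters U+4F60 U+597D encoded as GB2312/GBK.
gb_bytes  = "\xC4\xE3\xBA\xC3".b
utf8_text = to_utf8(gb_bytes)

puts utf8_text
puts utf8_text.encoding
```

The invalid/undef options play roughly the role of the //IGNORE suffix in the Iconv call, except that bad bytes become "?" instead of being dropped silently.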