Caching Scraped Webpages
How would you go about saving the text of entire webpages for retrieval or review later? I'm interested in something like the Google cache, so that I can at least retrieve the text of a webpage if it goes offline. The problem I've run into so far is that the text of an average webpage is too large to fit into a database row in something like MySQL. Does anyone have suggestions, implementation ideas, or further reading?
Hmmm… did you try the download pattern? Something like this should work:
html "/html" do
  page :type => :download
end
However, I think that if you need whole webpages (or whole sites), you are better off with wget or curl. scRUBYt! really shines when you are sifting specific data out of pages, and although it can do the task above, wget and curl have more options for this kind of job.
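On the storage side of the question, one way around the row-size problem is to keep each page as a compressed file on disk instead of in a database row. What follows is only a minimal sketch using the Ruby standard library, not anything scRUBYt! provides: the `PageCache` class, its method names, and the one-file-per-URL layout are all assumptions made up for illustration.

```ruby
require "digest"
require "fileutils"
require "net/http"
require "uri"
require "zlib"

# Hypothetical helper (not part of scRUBYt!): stores one gzip-compressed
# file per URL under cache_dir, named by the SHA-256 hash of the URL.
class PageCache
  def initialize(cache_dir)
    @cache_dir = cache_dir
    FileUtils.mkdir_p(@cache_dir)
  end

  # Save a page body so it can be retrieved later, even if the site goes offline.
  def store(url, body)
    Zlib::GzipWriter.open(path_for(url)) { |gz| gz.write(body) }
  end

  # Return the saved body, or nil if this URL was never cached.
  def fetch(url)
    path = path_for(url)
    return nil unless File.exist?(path)
    Zlib::GzipReader.open(path) { |gz| gz.read }
  end

  # Download a page with a plain GET and cache it. Redirects are not
  # followed and errors are not handled in this sketch.
  def download(url)
    body = Net::HTTP.get(URI(url))
    store(url, body)
    body
  end

  private

  # Hashing the URL gives a fixed-length, filesystem-safe file name.
  def path_for(url)
    File.join(@cache_dir, Digest::SHA256.hexdigest(url) + ".html.gz")
  end
end
```

You would call something like `PageCache.new("cache").download(url)` when scraping and `fetch(url)` later to read the saved copy back. If you still want a database, it could hold just the URL and the file path, which keeps the rows small.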