Caching Scraped Webpages


 
activefx (4 posts)

How would you go about saving the text of entire webpages for retrieval or review later? I’m interested in something almost like the Google cache, so I can at least retrieve the text of a webpage if it goes offline. The problem I’ve encountered so far is that the text from an average webpage is too large to fit into a database row in something like MySQL.

Anyone have any suggestions / implementation ideas / further reading?

 
rubyminer (9 posts)

Hmmm… did you try the download pattern? Something like this should work:

  html "/html" do
    page :type => :download 
  end

However, I think if you need whole webpages (sites?), you are better off with wget/curl. scRUBYt! really shines when you are sifting specific data out of pages, and though it can do the above task, wget/curl have more options for this type of stuff.
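
If the database row size is the sticking point, one way around it is to keep the pages on disk and look them up by URL. Below is a minimal sketch of that idea in plain Ruby (standard library only, not scRUBYt!). The directory name `page_cache` and the helper names `cache_path`, `store_page`, `cached_page`, and `fetch_and_cache` are made up for illustration, and gzip is just one reasonable choice for keeping the files small.

```ruby
require "digest/sha1"
require "fileutils"
require "net/http"
require "uri"
require "zlib"

# Hypothetical cache directory; change it to suit your setup.
CACHE_DIR = "page_cache"

# Map a URL to a stable file name inside the cache directory.
def cache_path(url, dir = CACHE_DIR)
  File.join(dir, Digest::SHA1.hexdigest(url) + ".html.gz")
end

# Write the page body to disk, gzip-compressed, and return the file path.
def store_page(url, body, dir = CACHE_DIR)
  FileUtils.mkdir_p(dir)
  path = cache_path(url, dir)
  Zlib::GzipWriter.open(path) { |gz| gz.write(body) }
  path
end

# Return the cached body for a URL, or nil if it was never stored.
def cached_page(url, dir = CACHE_DIR)
  path = cache_path(url, dir)
  return nil unless File.exist?(path)
  Zlib::GzipReader.open(path) { |gz| gz.read }
end

# Fetch a page over HTTP and cache it.
# Note: Net::HTTP.get does not follow redirects or handle errors.
def fetch_and_cache(url, dir = CACHE_DIR)
  body = Net::HTTP.get(URI(url))
  store_page(url, body, dir)
  body
end
```

Calling `fetch_and_cache("http://example.com/")` once and `cached_page("http://example.com/")` later would give back the saved HTML even if the site has gone offline. If you still want the database in the loop, storing only the URL and the file path in a row keeps the row small.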