Flushing results
Greetings, forum. Is there any way to flush the data during an extraction to free up memory? I’m thinking of larger crawls. For instance, I’ll be crawling a site that may have some 8,000 detail pages, and it would be nice to be able to flush the results of each detail page after every extraction, or perhaps after every next_page. Is this something that can already be done and I’ve missed it, or is it a feature request? I suppose I could always split the work into two extractions, one to harvest the detail page links and one for the detail pages themselves, but I’d prefer to wrap it all in one, since scRUBYt allows for crawling. Cheers, J
I have crawled whole sites, ending up with tens of thousands of records in the DB, and never had any problems with this. I think it’s a question of splitting up the extractor: I usually have a crawling extractor (which visits each of the pages I want to scrape in turn) and a page extractor (which scrapes the concrete page and writes the results to the DB). This way, only the current page extractor’s data (and a few things from the crawling extractor) are kept in memory. If anybody ever runs into a real-life problem because of this, let me know. I have not so far, and as I said, I have already turned huge sites into DBs.
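To make the split concrete, here is a rough sketch of the pattern in plain Ruby. It deliberately does not use the scRUBYt API: `each_detail_url`, `scrape_detail_page` and `crawl_and_store` are hypothetical stand-ins for the crawling extractor, the page extractor and the driver loop, the example.com URLs are made up, and a line-per-record JSON sink stands in for the DB write. The point is only that each record is written out as soon as it is scraped, so nothing accumulates.

```ruby
require "json"
require "stringio"

# Hypothetical stand-in for the crawling extractor: yields one detail-page
# URL at a time instead of building up a full list of results.
def each_detail_url(page_count)
  return enum_for(:each_detail_url, page_count) unless block_given?

  page_count.times do |i|
    yield "http://example.com/detail/#{i + 1}"
  end
end

# Hypothetical stand-in for the page extractor: scrapes a single page and
# returns only that page's record.
def scrape_detail_page(url)
  { "url" => url, "title" => "Item #{url.split('/').last}" }
end

# Writes each record to the sink as soon as it is scraped, so only the
# current page's record is referenced at any time.
def crawl_and_store(page_count, sink)
  stored = 0
  each_detail_url(page_count) do |url|
    record = scrape_detail_page(url)
    sink.puts(JSON.generate(record)) # stand-in for a DB insert
    stored += 1
  end
  stored
end

sink = StringIO.new
count = crawl_and_store(8_000, sink)
puts "stored #{count} records"
```

In a real crawl the sink would be a DB connection or a file opened for appending, and the two stand-in methods would wrap whatever extractors you already have.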
Thanks, scrubber. That’s basically the same conclusion we came to here. Cheers