Scraping article details from PubMed
Hi, I’m new to Scrubyt (great tool, by the way; it’s amazing how easy and simple it is), and I’m trying to use it to scrape PubMed. I want to enter a journal ID, find all the articles for that journal, then page through the results and scrape the details of each article. I’ve figured out how to do this for the first page of results, but I can’t get the next_page functionality to work. Here is my extractor for the first page (right now it only scrapes the title from the detail page; I can easily add more fields on my own later):
require 'rubygems'
require 'scrubyt'

Scrubyt.logger = Scrubyt::Logger.new

pubmed_data = Scrubyt::Extractor.define do
  fetch("http://www.ncbi.nlm.nih.gov/PubMed/")
  fill_textfield("EntrezSystem2.PEntrez.Pubmed.SearchBar.Term", "0021-9258")
  submit

  article_link("/html/body/form/div/div/div/div/div/div/a", { :generalize => true }) do
    article_detail({ :type => :detail_page }) do
      article("/html/body/form/div/div/div/div/div/div/dl/dd", { :generalize => true }) do
        title("/h2[1]")
      end
    end
  end.ensure_presence_of_attribute({"class"=>"authors"})
end

pubmed_data.to_xml.write($stdout, 1)
I’ve tried adding something like next_page "Next" or next_page "a[Next]", but neither seems to work. I’d greatly appreciate any help with this. Thanks!
Remington, I realize this is not going to answer your question, but why don’t you use EUtils (NCBI’s E-utilities) and parse the text or XML it returns instead of screen scraping? I have done this in the past with great success.
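For what it’s worth, here is a minimal sketch of what the EUtils route might look like in Ruby, using only the standard library. It is not taken from this thread: the helper names (esearch_url, efetch_url, parse_search, parse_titles, get_xml, article_titles) are made up for illustration, the bare ISSN is used as the search term just as in the extractor above (a field tag may be needed to restrict it to the journal field), and the pause between requests is an assumption based on NCBI’s usual guidance of about three requests per second without an API key. Paging is done with the retstart and retmax parameters rather than by following a "Next" link.

```ruby
require 'net/http'
require 'uri'
require 'rexml/document'

# Base URL of the NCBI E-utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

# Build an ESearch URL; retstart/retmax select one "page" of PubMed IDs.
def esearch_url(term, retstart: 0, retmax: 20)
  query = URI.encode_www_form(db: "pubmed", term: term,
                              retstart: retstart, retmax: retmax)
  "#{EUTILS_BASE}/esearch.fcgi?#{query}"
end

# Build an EFetch URL that returns full records for the given IDs as XML.
def efetch_url(ids)
  query = URI.encode_www_form(db: "pubmed", id: ids.join(","), retmode: "xml")
  "#{EUTILS_BASE}/efetch.fcgi?#{query}"
end

# Extract the total hit count and this page's IDs from an ESearch response.
def parse_search(xml)
  doc = REXML::Document.new(xml)
  count = REXML::XPath.first(doc, "/eSearchResult/Count").text.to_i
  ids = REXML::XPath.match(doc, "/eSearchResult/IdList/Id").map(&:text)
  [count, ids]
end

# Extract article titles from an EFetch response.
# Note: .text returns only the leading text node, so a title containing
# inline markup (e.g. italics) would come back truncated in this sketch.
def parse_titles(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, "//PubmedArticle//ArticleTitle").map(&:text)
end

# Fetch a URL, then pause briefly (assumed rate limit: ~3 requests/second).
def get_xml(url)
  body = Net::HTTP.get(URI(url))
  sleep 0.4
  body
end

# Walk through the result pages and collect titles, up to max_records.
def article_titles(term, page_size: 20, max_records: 100)
  titles = []
  retstart = 0
  loop do
    count, ids = parse_search(get_xml(esearch_url(term, retstart: retstart,
                                                  retmax: page_size)))
    break if ids.empty?
    titles.concat(parse_titles(get_xml(efetch_url(ids))))
    retstart += ids.size
    break if retstart >= count || titles.size >= max_records
  end
  titles.first(max_records)
end
```

Calling article_titles("0021-9258") would then return an array of title strings; pulling out authors, abstracts, and so on should only be a matter of adding more XPath lookups against the same EFetch XML.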