Scraping article details from Pubmed

 
Remington (3 posts)

Hi, I’m new to Scrubyt (great tool, by the way; it’s amazing how easy and simple it is), and I’m trying to use it to scrape PubMed. I want to enter a journal id, find all the articles for that journal, then page through the results and scrape the details of each article. I’ve figured out how to do this for the first page of results, but I can’t get the next_page functionality to work. Here is my extractor for the first page (right now it only scrapes the title from the detail page; I can easily add more fields later on my own):

require 'rubygems'
require 'scrubyt'

# Turn on Scrubyt's logging while the extractor runs
Scrubyt.logger = Scrubyt::Logger.new
pubmed_data = Scrubyt::Extractor.define do
  # Open PubMed and search for the journal id
  fetch("http://www.ncbi.nlm.nih.gov/PubMed/")
  fill_textfield("EntrezSystem2.PEntrez.Pubmed.SearchBar.Term", "0021-9258")
  submit

  # Match the article links on the results page
  article_link("/html/body/form/div/div/div/div/div/div/a", { :generalize => true }) do
    # Follow each link and scrape its detail page
    article_detail({ :type => :detail_page }) do
      article("/html/body/form/div/div/div/div/div/div/dl/dd", { :generalize => true }) do
        title("/h2[1]")
      end
    end
  end.ensure_presence_of_attribute({"class"=>"authors"})

end

pubmed_data.to_xml.write($stdout, 1)

I’ve tried adding something like next_page "Next" or next_page "a[Next]", but neither seems to work. I’d greatly appreciate any help with this. Thanks!

 
Ryan S (1 post)

Remington, I realize this is not going to answer your question, but why don’t you use NCBI’s E-utilities (EUtils) and parse the text or XML instead of screen scraping? I have done this in the past with great success.
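
A minimal sketch of the EUtils route, in case it helps: ESearch returns article ids one page at a time (via retstart and retmax), and EFetch returns the article XML for those ids. The helper names here (eutils_url, each_title_page, and so on) are made up for illustration, and the [Journal] field tag on the search term is an assumption about how to restrict the search to a journal id. Check both against the E-utilities documentation before relying on them.

```ruby
require 'net/http'
require 'uri'
require 'rexml/document'

# Build a URL for one E-utilities tool ("esearch" or "efetch").
def eutils_url(tool, params)
  "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/#{tool}.fcgi?#{URI.encode_www_form(params)}"
end

# ESearch: one page of PubMed ids matching the term.
def esearch_url(term, retstart: 0, retmax: 20)
  eutils_url('esearch', db: 'pubmed', term: term, retstart: retstart, retmax: retmax)
end

# EFetch: full records for the given ids, as XML.
def efetch_url(ids)
  eutils_url('efetch', db: 'pubmed', id: ids.join(','), retmode: 'xml')
end

# Pull the ids out of an ESearch response.
def parse_ids(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, '//IdList/Id').map(&:text)
end

# Total number of matches reported by ESearch (0 if the element is missing).
def parse_count(xml)
  node = REXML::XPath.first(REXML::Document.new(xml), '/eSearchResult/Count')
  node ? node.text.to_i : 0
end

# Pull the article titles out of an EFetch response.
# Element#text returns only the first text node, so titles that contain
# inline markup would need more careful handling.
def parse_titles(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, '//PubmedArticle/MedlineCitation/Article/ArticleTitle')
              .map { |t| t.text.to_s.strip }
end

# Plain HTTP GET that returns the response body as a string.
def http_get(url)
  Net::HTTP.get(URI(url))
end

# Yield the titles for each page of results until every match has been seen.
# `pause` spaces out requests to stay polite to NCBI; `fetcher` is injectable
# so the paging logic can be exercised without network access.
def each_title_page(term, page_size: 20, pause: 0.4, fetcher: method(:http_get))
  retstart = 0
  loop do
    search_xml = fetcher.call(esearch_url(term, retstart: retstart, retmax: page_size))
    ids = parse_ids(search_xml)
    break if ids.empty?

    yield parse_titles(fetcher.call(efetch_url(ids)))

    retstart += ids.size
    break if retstart >= parse_count(search_xml)

    sleep(pause)
  end
end
```

Calling each_title_page("0021-9258[Journal]") { |titles| puts titles } would then print the titles page by page. Advancing retstart takes the place of next_page here, and the fetcher argument exists only so the paging logic can be tried without hitting the network.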