Monday, 15 September 2014

A good web data extraction/screen scraper program?

I need to capture product data from a site on a regular basis and wondered if anyone knows of a good software program. I've trialed Mozenda, but it's a monthly subscription and pricey in the long term. Obviously something that's free would be best, but I don't mind paying either. I just need a decent program that's reliable and doesn't require much programming knowledge.

12 Answers

You can try ScraperWiki.com if you know Python.

I've experimented with Screen-Scraper and found it easy to use. The application comes in multiple versions: basic (which is free), professional, and enterprise. Also, multiple platforms are supported.

Hire a programmer to do it so that there is only a one-off cost. I often see similar projects on freelancing websites like Elance and oDesk.

I really like iMacros. You can give it a test drive with the totally free Firefox extension to see if it meets your needs (there are also IE versions). There are also fuller-featured application and "server" versions that add more features and the ability to do things in an unattended manner.

Here are some other alternatives to consider:
  •     License the data from the provider. Call 'em up and ask 'em.
  •     Use Amazon Mechanical Turk to get humans to copy and paste and format it for ya. They are cheap.
  •     For automation, it depends on how complicated the HTML is and how often it changes. You could use Excel's Web Data Import if it's really simple.
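
For the really simple case in the last bullet, where the data sits in a plain HTML table, a few lines of Python may be enough. This is only a minimal sketch using the standard library. The sample page, the TableExtractor class and the extract_rows helper are made up for illustration; a real site would have to be downloaded first and will likely have messier markup.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # row currently being read, or None
        self._cell = None   # text pieces of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_rows(html):
    """Return the table rows found in an HTML string."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up sample page standing in for a real product listing.
SAMPLE_HTML = """
<table>
  <tr><th>Product</th><th>Price</th></tr>
  <tr><td>Widget</td><td>9.99</td></tr>
  <tr><td>Gadget</td><td>24.50</td></tr>
</table>
"""

rows = extract_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

If the HTML is more complicated or changes often, a hand-written parser like this gets brittle quickly, which is where the dedicated tools in this thread come in.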

You can use irobot from IRobotSoft, which is totally free and provides more functionality than paid alternatives. Watch the demos at http://irobotsoft.com/help/ to see how simple it is.

Questions on their forum were answered very quickly.

Scrape.it is free and open source, available on GitHub.

I would definitely suggest looking at YQL from Yahoo (http://developer.yahoo.com/yql/)

It uses markup to define the structure of the webpage, then lets you run queries against it to extract data. It's a pretty neat idea, with lots of actively maintained markup structures for scraping popular sites.

http://trrdrr-scrapper.rhcloud.com is a web-based web scraper. It currently has limited features, but it is good for scraping a list of data (for example, the list of questions and their authors on stackoverflow.com).

I would like to add features such as pagination, a scheduler, regex support, scraping by HTML class or id ...

scrape.ly lets you scrape websites by writing a simple URL.

For example, to scrape all the questions from stackoverflow you would type the following into your browser address bar:

http://scrape.ly/s/{http://stackoverflow.com/}{Printing the data and placement of tree elements}*
{'ask':'//*[@id="question"]/table/tbody/tr[1]/td[2]/div/div[1]/p[1]','username':'user3011391'}

What the URL does:
  •     Go to stackoverflow.com
  •     Get all the links like the example provided ("Printing the data...")
  •     Extract the question text into 'ask' column and asker's username into 'username'
  •     Download the extracted data as a .csv file from http://scrape.ly/download/fMxj2x.csv
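
The extract-and-export part of those steps can be sketched in Python. This is not how scrape.ly itself works, just a minimal standard-library illustration of the same idea: a mapping from column names to paths is applied to each page and the results are written out as CSV. The pages, the paths in FIELDS and the extract and to_csv helpers are all made up; fetching the site and following its links is left out, and the sample pages are well-formed on purpose because ElementTree cannot parse typical real-world HTML.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical column-to-path mapping, in the spirit of the
# {'ask': ..., 'username': ...} part of the URL above.
FIELDS = {
    "ask": ".//div[@id='question']/p",
    "username": ".//span[@class='user']",
}

# Made-up, well-formed pages standing in for the linked question pages.
PAGES = [
    "<html><body><div id='question'><p>How do I print a tree?</p></div>"
    "<span class='user'>alice</span></body></html>",
    "<html><body><div id='question'><p>Where do elements go?</p></div>"
    "<span class='user'>bob</span></body></html>",
]


def extract(page, fields):
    """Apply each path to one page and return a {column: text} record."""
    root = ET.fromstring(page)
    record = {}
    for column, path in fields.items():
        node = root.find(path)
        # A missing element becomes an empty cell rather than an error.
        record[column] = "".join(node.itertext()).strip() if node is not None else ""
    return record


def to_csv(records, columns):
    """Serialise the records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


records = [extract(page, FIELDS) for page in PAGES]
csv_text = to_csv(records, list(FIELDS))
print(csv_text)
```

For real pages you would swap in an HTML-tolerant parser with fuller XPath support; the mapping-then-export structure stays the same.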

You could try the GrabzIt Screen Scraper tool. It has a wizard, but for more advanced scraping you can use the built-in JavaScript instruction set.

Have a look at Visual Web Ripper. It costs some money, but I think it's worth it. http://www.visualwebripper.com/ProductInformation/Features.aspx

Source: http://stackoverflow.com/questions/2334164/a-good-web-data-extraction-screen-scraper-program
