Monday, 15 September 2014

A good web data extraction/screen scraper program?

I need to capture product data from a site on a regular basis and wondered if anyone knows of a good software program. I've trialed Mozenda, but it's a monthly subscription and pricey in the long term. Obviously something that's free would be best, but I don't mind paying either. I just need a decent program that's reliable and doesn't require much programming knowledge.

12 Answers

You can try ScraperWiki.com if you know Python.

I've experimented with Screen-Scraper and found it easy to use. The application comes in multiple versions: basic (which is free), professional, and enterprise. Also, multiple platforms are supported.

Hire a programmer to do it so that there is only a one-off cost. I often see similar projects on freelancing websites like Elance and oDesk.

I really like iMacros. You can give it a test drive with the totally free Firefox extension to see if it meets your needs (there are also IE versions). There are also fuller-featured application and "server" versions that add more features and the ability to run things unattended.

Here are some other alternatives to consider:
  •     License the data from the provider. Call em up and ask 'em.
  •     Use Amazon Mechanical Turk to get humans to copy and paste and format it for ya. They are cheap.
  •     For automation, it depends on how complicated the HTML is and how often it changes. You could use Excel's Web Data Import if it's really simple.

You can use irobot from IRobotSoft, which is totally free and provides more functionality than other paid software. Watch the demos at http://irobotsoft.com/help/ to see how simple it is.

Questions on their forum were answered very quickly.

Scrape.it is free and open source, available on github.

I would definitely suggest looking at YQL from Yahoo (http://developer.yahoo.com/yql/)

It uses markup to define the structure of the webpage, then lets you run queries against it to extract data. It's a pretty neat idea, with lots of actively maintained markup structures for scraping popular sites.

http://trrdrr-scrapper.rhcloud.com is a web-based web scraper. It currently has limited features, but it is good for scraping a list of data (for example, the list of questions and their authors on stackoverflow.com).

I would like to add features such as pagination, a scheduler, regex support, scraping by HTML class or id ...

scrape.ly lets you scrape websites by writing a simple URL.

For example, to scrape all the questions from stackoverflow, you would write the following into your browser address bar:

http://scrape.ly/s/{http://stackoverflow.com/}{Printing the data and placement of tree elements}*
{'ask':'//*[@id="question"]/table/tbody/tr[1]/td[2]/div/div[1]/p[1]','username':'user3011391'}

What the URL does:
  •     Go to stackoverflow.com
  •     Get all the links like the example provided ("Printing the data...")
  •     Extract the question text into 'ask' column and asker's username into 'username'
  •     Download the extracted data as a .csv file from http://scrape.ly/download/fMxj2x.csv

You could try the GrabzIt Screen Scraper tool. It has a wizard, but for more advanced scraping you can use the inbuilt JavaScript instruction set.

Have a look at Visual Web Ripper. It costs some money, but I think it's worth it. http://www.visualwebripper.com/ProductInformation/Features.aspx

Source:http://stackoverflow.com/questions/2334164/a-good-web-data-extraction-screen-scraper-program

Data From Web Scraping Using Node.JS Request Is Different From Data Shown In The Browser

Right now, I am doing some simple web scraping, for example getting the current train arrival/departure information for one railway station. Here is an example link: http://www.thetrainline.com/Live/arrivals/chester. From this link you can see the trains currently arriving at Chester station.

I am using the node.js request module to do some simple web scraping:

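var express = require('express');
var app = express();

// GET /railway/arrival?city=<station>; falls back to a default station.
app.get('/railway/arrival', function (req, res, next) {
    var city = req.query["city"];
    console.log("/railway/arrival/  " + city);
    // typeof returns the string "undefined", so compare against that string
    if (typeof city === "undefined") {
        console.log("city is undefined, using the default station");
        city = "liverpool-james-street";
    }
    getRailwayArrival(city, function (err, data) {
        if (err) { return next(err); }
        res.send(data);
    });
});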

var request = require('request');
var cheerio = require('cheerio');
require('colors'); // assumed source of the .green string helper used in the log line below

// Removes line breaks and tabs from scraped text.
function clean(text) {
    return text.replace(/(\r\n|\n|\r|\t)/gm, "");
}

function getRailwayArrival(station, callback) {
    request({
        uri: "http://www.thetrainline.com/Live/arrivals/" + station
    }, function (error, response, body) {
        if (error) { return callback(error); }
        var $ = cheerio.load(body);

        var a = [];
        $(".results-contents li a").each(function () {
            var link = $(this);
            var due = clean(link.find('.due').text());
            var destination = clean(link.find('.destination').text());
            var on_time = clean(link.find('.on-time-yes .on-time').text());
            // .text() returns "" (never undefined) when nothing matches
            var on_time_no = on_time === "" ? clean(link.find('.on-time-no').text()) : "";
            var platform = clean(link.find('.platform').text());

            a.push({ due: due, destination: destination, on_time: on_time, platform: platform });
            console.log("arrival  ".green + due + "  " + destination + "  " + on_time + "  " + platform + "  " + on_time_no);
        });
        console.log("get station data  " + a.length + "   " + $(".updated-time").text());
        callback(null, a);
    });
}

The code works and gives me a list of data. However, the data differ from what I see in the browser, even though both come from the same URL. I don't know why. Is it because their server can distinguish requests sent from a server from requests sent from a browser, and sends the wrong data when the request comes from a server? How can I overcome this problem?

Thanks in advance.

2 Answers

They must be storing a session per click event. That means that when you visit the page for the first time, it stores a session and validates that session for the next action you perform. Say you select a value from a drop-down list: that click generates a new session value, which loads the data for the value you selected. When you then click "show list", the previous session value is validated and you get accurate data.

Now, if you do not capture that session value programmatically and pass it along with the request, you will get the default data or nothing at all. So it is challenging to catch that data. Use Firebug for help.
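
If the site does tie its data to a session cookie, the idea is to let one client object remember cookies between requests. Here is a minimal sketch in Python using only the standard library. The helper name make_session_opener and the User-Agent string are made up for illustration, and whether thetrainline.com actually works this way is an assumption you would need to check in Firebug.

```python
import http.cookiejar
import urllib.request


def make_session_opener(user_agent="Mozilla/5.0 (compatible; example-scraper)"):
    """Return (opener, jar): an opener that stores and resends cookies.

    The name and the default User-Agent are illustrative assumptions.
    """
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    # Sent with every request made through this opener.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar


opener, jar = make_session_opener()

# Usage (needs network access, so it is left commented out):
# first = opener.open("http://www.thetrainline.com/Live/arrivals/chester")
# Any Set-Cookie headers from that response are now in `jar`,
# and are sent back automatically on the next opener.open(...) call.
```

Every request made through the same opener shares the jar, which is the programmatic equivalent of staying in one browser tab.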

Another issue could be that the content is generated by JavaScript running on the client. jsdom is a module that can provide such content, but it is not as lightweight.

Cheerio does not execute these scripts, so the content may not be visible (as you're experiencing). An article I read a while back led me to the same discovery; just open the article and search for "jsdom is more powerful" for a quick answer.
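
To see why a parser that does not run scripts comes back empty, here is a small sketch in Python (standard library only). The page below is invented for illustration: the list is empty in the markup and a script fills it in. A static parser only ever sees the markup.

```python
from html.parser import HTMLParser

# Hypothetical page: the <ul> is empty until the script runs in a browser.
PAGE = """
<ul class="results-contents"></ul>
<script>
  document.querySelector('.results-contents').innerHTML = '<li>10:15 Chester</li>';
</script>
"""


class ListItemCounter(HTMLParser):
    """Counts <li> start tags present in the raw markup."""

    def __init__(self):
        super().__init__()
        self.li_count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.li_count += 1


counter = ListItemCounter()
counter.feed(PAGE)
counter.close()
print(counter.li_count)  # the <li> inside the script is plain text, not a tag
```

A browser (or jsdom) would run the script and end up with one list item; the static parser finds none, which matches the symptom in the question.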

Source:http://stackoverflow.com/questions/15785360/data-from-web-scraping-using-node-js-request-is-different-from-data-shown-in-the?rq=1

XPath tips from the web scraping trenches

In the context of web scraping, XPath is a nice tool to have in your belt, as it allows you to write specifications of document locations more flexibly than CSS selectors. In case you’re looking for a tutorial, here is an XPath tutorial with nice examples.

In this post, we’ll show you some tips we found valuable when using XPath in the trenches, using the Scrapy Selector API for our examples.

Avoid using contains(.//text(), ‘search text’) in your XPath conditions. Use contains(., ‘search text’) instead.

Here is why: the expression .//text() yields a collection of text nodes, that is, a node-set. When a node-set is converted to a string, which happens when it is passed as an argument to a string function like contains() or starts-with(), the result is the text of the first node only.

>>> from scrapy import Selector
>>> sel = Selector(text='<a href="#">Click here to go to the <strong>Next Page</strong></a>')
>>> xp = lambda x: sel.xpath(x).extract() # let's type this only once
>>> xp('//a//text()') # take a peek at the node-set
[u'Click here to go to the ', u'Next Page']
>>> xp('string(//a//text())')  # convert it to a string
[u'Click here to go to the ']

A node converted to a string, however, puts together its own text plus that of all its descendants:

>>> xp('//a[1]') # selects the first a node
[u'<a href="#">Click here to go to the <strong>Next Page</strong></a>']
>>> xp('string(//a[1])') # converts it to string
[u'Click here to go to the Next Page']

So, in general:

GOOD:

>>> xp("//a[contains(., 'Next Page')]")
[u'<a href="#">Click here to go to the <strong>Next Page</strong></a>']

BAD:

>>> xp("//a[contains(.//text(), 'Next Page')]")
[]

GOOD:

>>> xp("substring-after(//a, 'Next ')")
[u'Page']

BAD:

>>> xp("substring-after(//a//text(), 'Next ')")
[u'']
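
The same distinction can be reproduced without Scrapy. The sketch below uses Python's standard xml.etree.ElementTree as a stand-in (an assumption made for illustration; it is not the Scrapy API): .text plays the role of the first text node, and joining itertext() plays the role of the node's string value.

```python
import xml.etree.ElementTree as ET

a = ET.fromstring(
    '<a href="#">Click here to go to the <strong>Next Page</strong></a>'
)

# Roughly what string(.//text()) sees: only the first text node.
first_text_node = a.text

# Roughly what string(.) sees: the node's text plus all descendant text.
string_value = "".join(a.itertext())

print("Next Page" in first_text_node)  # False
print("Next Page" in string_value)     # True
```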

Beware of the difference between //node[1] and (//node)[1]

//node[1] selects all the nodes occurring first under their respective parents.

(//node)[1] selects all the nodes in the document, and then gets only the first of them.

>>> from scrapy import Selector
>>> sel=Selector(text="""
...     <ul class="list">
...         <li>1</li>
...         <li>2</li>
...         <li>3</li>
...     </ul>
...     <ul class="list">
...         <li>4</li>
...         <li>5</li>
...         <li>6</li>
...     </ul>""")
>>> xp = lambda x: sel.xpath(x).extract()
>>> xp("//li[1]") # get all first LI elements under whatever parent they have
[u'<li>1</li>', u'<li>4</li>']
>>> xp("(//li)[1]") # get the first LI element in the whole document
[u'<li>1</li>']
>>> xp("//ul/li[1]")  # get all first LI elements under an UL parent
[u'<li>1</li>', u'<li>4</li>']
>>> xp("(//ul/li)[1]") # get the first LI element under an UL parent in the document
[u'<li>1</li>']

Also,

//a[starts-with(@href, '#')][1] gets a collection of the local anchors that occur first under their respective parents.

(//a[starts-with(@href, '#')])[1] gets the first local anchor in the document.
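
For readers without Scrapy at hand, the difference can also be seen with Python's standard xml.etree.ElementTree. This is only a sketch: ElementTree supports a limited XPath subset and has no parenthesised expressions, so the (//li)[1] case is emulated here by indexing the result list in Python.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("""
<div>
  <ul class="list"><li>1</li><li>2</li><li>3</li></ul>
  <ul class="list"><li>4</li><li>5</li><li>6</li></ul>
</div>""")

# Like //li[1]: the position is counted under each parent separately.
first_per_parent = [li.text for li in root.findall(".//li[1]")]

# Like (//li)[1]: collect every li in document order, then take the first.
first_in_document = root.findall(".//li")[0].text

print(first_per_parent)   # ['1', '4']
print(first_in_document)  # 1
```
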

When selecting by class, be as specific as necessary

If you want to select elements by a CSS class, the XPath way to do that is the rather verbose:

*[contains(concat(' ', normalize-space(@class), ' '), ' someclass ')]

Let’s cook up some examples:

>>> sel = Selector(text='<p class="content-author">Someone</p><p class="content text-wrap">Some content</p>')
>>> xp = lambda x: sel.xpath(x).extract()

BAD: doesn’t work because there are multiple classes in the attribute

>>> xp("//*[@class='content']")
[]

BAD: gets more than we want

>>> xp("//*[contains(@class,'content')]")
[u'<p class="content-author">Someone</p>', u'<p class="content text-wrap">Some content</p>']

GOOD:

>>> xp("//*[contains(concat(' ', normalize-space(@class), ' '), ' content ')]")
[u'<p class="content text-wrap">Some content</p>']

And many times, you can just use a CSS selector instead, and even combine the two of them if needed:

ALSO GOOD:

>>> sel.css(".content").extract()
[u'<p class="content text-wrap">Some content</p>']
>>> sel.css('.content').xpath('@class').extract()
[u'content text-wrap']
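
To make the concat/normalize-space trick less magical, here is the same logic written out in plain Python. The function name has_class is made up for illustration: it collapses whitespace, pads the attribute value with a space on each side, and looks for the class name surrounded by spaces.

```python
def has_class(class_attr, name):
    """Mimic contains(concat(' ', normalize-space(@class), ' '), ' name ')."""
    normalized = " ".join(class_attr.split())  # like normalize-space()
    return f" {name} " in f" {normalized} "


print("content" in "content-author")               # True: naive contains() overmatches
print(has_class("content-author", "content"))      # False
print(has_class("content text-wrap", "content"))   # True
print(has_class("  text-wrap\tcontent ", "content"))  # True
```

The padding is what stops "content" from matching inside "content-author" while still matching at the start or end of the attribute.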

Learn to use all the different axes

It is handy to know how to use the axes; you can follow the examples given in the tutorial to quickly review them.

In particular, note that following and following-sibling are not the same thing; this is a common source of confusion. The same goes for preceding and preceding-sibling, and for ancestor and parent.
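
As a rough illustration of how following and following-sibling differ, the sketch below emulates both axes for element nodes only, using Python's standard xml.etree.ElementTree (which does not support these axes itself). The helper names and the sample document are made up for this example.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("<root><a><b><c/></b><d/></a><e><f/></e></root>")
ctx = doc.find(".//b")  # context node


def following_sibling(root, node):
    """Elements after `node` that share its parent."""
    for parent in root.iter():
        children = list(parent)
        for i, child in enumerate(children):
            if child is node:
                return children[i + 1:]
    return []  # node is the root or not in this tree


def following(root, node):
    """Elements after `node` in document order, excluding its descendants."""
    order = list(root.iter())               # document order
    inside = {id(n) for n in node.iter()}   # node and its descendants
    start = next(i for i, n in enumerate(order) if n is node)
    return [n for n in order[start + 1:] if id(n) not in inside]


print([n.tag for n in following_sibling(doc, ctx)])  # ['d']
print([n.tag for n in following(doc, ctx)])          # ['d', 'e', 'f']
```

The sibling axis stops at the parent's boundary, while following keeps going through the rest of the document.
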

Useful trick to get text content

Here is another XPath trick that you may use to get the interesting text contents:

//*[not(self::script or self::style)]/text()[normalize-space(.)]

This excludes the content from script and style tags and also skips whitespace-only text nodes.
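
What the expression filters out can be sketched in plain Python with the standard html.parser module. This is an approximation written for illustration, not what Scrapy does internally; the class and function names are made up.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collects text nodes, skipping script/style and whitespace-only text."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # data.strip() is empty for whitespace-only text, like normalize-space(.)
        if not self.skip_depth and data.strip():
            self.chunks.append(data)


def visible_text(html):
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return parser.chunks


html = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><p>Hello</p>\n  <script>var x = 1;</script><p>World</p></body></html>"
)
print(visible_text(html))  # ['Hello', 'World']
```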

Source:http://blog.scrapinghub.com/

The PromptCloud Advantage- Web Scraping with an Edge

The global market is now more aware of its data scraping needs. And so with the demand, the list of suppliers has grown too. This post is dedicated to bringing out the PromptCloud Advantage among such providers.

1. The know-how- Crawling the web, as mundane as it may sound, is a fairly complex task. No one is to blame for overlooking the complexity, as these things surface only after you’ve tried it yourself and delved into the nitty-gritty. The design decisions you make sit at the core of what you build and eventually monetize. And the long-term effects of such architectural choices are as pleasing when you’ve done it right as they are disturbing when you have not been far-sighted.

Although the expertise to build the tech stack for such large-scale data acquisition (distributing your clusters and putting thought into their geographical locations, maintaining queues, databases and backups) comes from ‘been there, done that’, we have been lucky to have had the tech advantage instilled in us since inception. Not that we got it right the first time, but our systems have evolved with technologies, improving each day. Now that we have been in this business for the last 56 months, it does feel like a long journey for our stack and yes, we do know better :)

2. SLAs- SLAs are what bolster the data itself. PromptCloud’s key SLAs are scale and quality, without compromising data coverage or the politeness policies on your sources. Since we perform focused crawls, there’s no dilution of data, and you can consume it all or ask us to index it so that you can search it using logical combinations in queries. For your reference, here’s a list of all SLAs to review while picking your data service provider.

3. The Experience- There are many scraping tools and crawling services in the market that might just serve the need. What PromptCloud provides is a data acquisition experience, and we go as many extra miles as you’d like us to. By leveraging our DaaS platform, we make sure you get what you need from the time you start researching data providers through to importing the data feeds into your database. We hear your requirements in detail and make sure we’ve got them right by sharing samples and going through multiple iterations of reprocessing the data to match your needs while you battle internally on freezing your requirements. But what’s more magical is the way all these feeds get delivered to you at the intervals you requested, programmatically.

It might be evident that the SLAs and the know-how fuse to provide the experience, but it’s the additional human touch that actually sustains it. We make sure you’re at peace while our systems handle the roadblocks and sort out the messiness on the web.

Source:http://promptcloud.com/blog/the-promptcloud-advantage-web-scraping/