Tuesday, 30 June 2015

Data Scraping - What Are Hand-Scraped Hardwood Floors and What Are the Benefits?

If you love the look of hardwood flooring with lots of character, then you may want to check out hand-scraped hardwood flooring. Hand-scraped wood provides a warm vintage look, providing the floor instant character. These types of scraped hardwoods are suitable for living rooms, dining rooms, hallways and bedrooms. But what exactly is hand-scraped hardwood flooring?

Well, it is literally what you think it is. Hand-scraped hardwood flooring is created by hand using specialized wood working tools to make each board unique and giving an overall "old worn" appearance.

At Innovation Builders we offer solid wood floors finished on site with an actual hand-scraping technique followed by stain and sealer. Solid wood floors are installed by an expert team of technicians who work each board with skilled craftsman-like attention to detail. Following the scraping procedure the floor is stained by hand with a customer selected stain color, and then protected with multiple coats of sealing and finishing polyurethane. This finishing process of staining, sealing and coating the wood floors contributes to providing the look and durability of an old reclaimed wood floor, but with today's tough, urethane finishes.

There are many, many benefits to hand-scraped wood flooring. Overall, these floors are extremely durable and hard wearing, providing years of trouble-free use. These wood floors remain looking newer for longer because the texture that the process provides hides the typical dents, dings and scratches that other floors can't hide so easily. That's great news for households with kids, dogs, and cats.

These types of wood flooring have another unique advantage as well. When you do scratch these floors during their lifetime, the scratches are easily repaired. As long as the scratch isn't too deep you can make them practically disappear without ever having to hire a professional. It's simple to hide the scratch by using a color-matched stain marker or repair kit that is readily available through local flooring distributors. These features make hand-scraped hardwood flooring a lot more durable and hassle-free to maintain than other types of wood flooring.

The expert processes utilized in the creation of these floors provides a custom look of worn wood with deep color and subtle highlights. When the light hits the wood at different times during the day, it provides an understated but powerful effect of depth and beauty. They instantly offer your rooms a rustic look full of character, allowing your home to become a warm and inviting environment. The rustic look of this wood provides a texture, style and rustic appeal that cannot be matched by any other type of flooring.

Hand-Scraped Hardwood Flooring is a floor that says welcome and adds a touch of elegance to any home. If you are looking to buy a new home and you haven't had the opportunity to see or feel hand scraped hardwoods, stop in any of the model homes at Innovation Builders in Keller, North Richland Hills or Grand Prairie, Texas and check it out!

Source: http://ezinearticles.com/?What-Are-Hand-Scraped-Hardwood-Floors-and-What-Are-the-Benefits?&id=6026646

Tuesday, 23 June 2015

Web scraping in under 60 seconds: the magic of import.io

Import.io is a very powerful and easy-to-use tool for data extraction that has the aim of getting data from any website in a structured way. It is meant for non-programmers that need data (and for programmers who don’t want to overcomplicate their lives).

I almost forgot!! Apart from everything, it is also a free tool (o_O)

The purpose of this post is to teach you how to scrape a website and make a dataset and/or API in under 60 seconds. Are you ready?

It’s very simple. You just have to go to http://magic.import.io; post the URL of the site you want to scrape, and push the “GET DATA” button. Yes! It is that simple! No plugins, downloads, previous knowledge or registration are necessary. You can do this from any browser; it even works on tablets and smartphones.

For example: if we want to have a table with the information on all items related to Chewbacca on MercadoLibre (a Latin American version of eBay), we just need to go to that site and make a search – then copy and paste the link (http://listado.mercadolibre.com.mx/chewbacca) on Import.io, and push the “GET DATA” button.

You’ll notice that now you have all the information on a table, and all you need to do is remove the columns you don’t need. To do this, just place the mouse pointer on top of the column you want to delete, and an “X” will appear.

Good news for those of us who are a bit more technically-oriented! There is a button that says “GET API” and this one is good to, well, generate an API that will update the data on each request. For this you need to create an account (which is also free of cost).

As you saw, we can scrape any website in under 60 seconds, even if it includes tons of results pages. This truly is magic, no? For more complex things that require logins, entering subwebs, automatized searches, et cetera, there is downloadable import.io software… But I’ll explain that in a different post.

Source: http://schoolofdata.org/2014/12/09/web-scraping-in-under-60-seconds-the-magic-of-import-io/

Thursday, 18 June 2015

Web Scraping Services : Making Modern File Formats More Accessible

Data scraping is the process of automatically sorting through information contained on the internet inside html, PDF or other documents and collecting relevant information to into databases and spreadsheets for later retrieval. On most websites, the text is easily and accessibly written in the source code but an increasing number of businesses are using Adobe PDF format (Portable Document Format: A format which can be viewed by the free Adobe Acrobat software on almost any operating system. See below for a link.). The advantage of PDF format is that the document looks exactly the same no matter which computer you view it from making it ideal for business forms, specification sheets, etc.; the disadvantage is that the text is converted into an image from which you often cannot easily copy and paste. PDF Scraping is the process of data scraping information contained in PDF files. To PDF scrape a PDF document, you must employ a more diverse set of tools.

There are two main types of PDF files: those built from a text file and those built from an image (likely scanned in). Adobe's own software is capable of PDF scraping from text-based PDF files but special tools are needed for PDF scraping text from image-based PDF files. The primary tool for PDF scraping is the OCR program. OCR, or Optical Character Recognition, programs scan a document for small pictures that they can separate into letters. These pictures are then compared to actual letters and if matches are found, the letters are copied into a file. OCR programs can perform PDF scraping of image-based PDF files quite accurately but they are not perfect.

Once the OCR program or Adobe program has finished PDF scraping a document, you can search through the data to find the parts you are most interested in. This information can then be stored into your favorite database or spreadsheet program. Some PDF scraping programs can sort the data into databases and/or spreadsheets automatically making your job that much easier.

Quite often you will not find a PDF scraping program that will obtain exactly the data you want without customization. Surprisingly a search on Google only turned up one business, that will create a customized PDF scraping utility for your project. A handful of off the shelf utilities claim to be customizable, but seem to require a bit of programming knowledge and time commitment to use effectively. Obtaining the data yourself with one of these tools may be possible but will likely prove quite tedious and time consuming. It may be advisable to contract a company that specializes in PDF scraping to do it for you quickly and professionally.

Let's explore some real world examples of the uses of PDF scraping technology. A group at Cornell University wanted to improve a database of technical documents in PDF format by taking the old PDF file where the links and references were just images of text and changing the links and references into working clickable links thus making the database easy to navigate and cross-reference. They employed a PDF scraping utility to deconstruct the PDF files and figure out where the links were. They then could create a simple script to re-create the PDF files with working links replacing the old text image.

A computer hardware vendor wanted to display specifications data for his hardware on his website. He hired a company to perform PDF scraping of the hardware documentation on the manufacturers' website and save the PDF scraped data into a database he could use to update his webpage automatically.

PDF Scraping is just collecting information that is available on the public internet. PDF Scraping does not violate copyright laws.

PDF Scraping is a great new technology that can significantly reduce your workload if it involves retrieving information from PDF files. Applications exist that can help you with smaller, easier PDF Scraping projects but companies exist that will create custom applications for larger or more intricate PDF Scraping jobs.

Source: http://ezinearticles.com/?PDF-Scraping:-Making-Modern-File-Formats-More-Accessible&id=193321

Saturday, 6 June 2015

Twitter Scraper Python Library

I wanted to save the tweets from Transparency Camp. This prompted me to turn Anna‘s basic Twitter scraper into a library. Here’s how you use it.

Import it. (It only works on ScraperWiki, unfortunately.)

from scraperwiki import swimport

search = swimport('twitter_search').search

Then search for terms.

search(['picnic #tcamp12', 'from:TCampDC', '@TCampDC', '#tcamp12', '#viphack'])

A separate search will be run on each of these phrases. That’s it.

A more complete search

Searching for #tcamp12 and #viphack didn’t get me all of the tweets because I waited like a week to do this. In order to get a more complete list of the tweets, I looked at the tweets returned from that first search; I searched for tweets referencing the users who had tweeted those tweets.

from scraperwiki.sqlite import save, select

from time import sleep

# Search by user to get some more

users = [row['from_user'] + ' tcamp12' for row in \

select('distinct from_user from swdata where from_user where user > "%s"' \

% get_var('previous_from_user', ''))]

for user in users:

    search([user], num_pages = 2)

    save_var('previous_from_user', user)

    sleep(2)

By default, the search function retrieves 15 pages of results, which is the maximum. In order to save some time, I limited this second phase of searching to two pages, or 200 results; I doubted that there would be more than 200 relevant results mentioning a particular user.

The full script also counts how many tweets were made by each user.

Library

Remember, this is a library, so you can easily reuse it in your own scripts, like Max Richman did.

Source: https://scraperwiki.wordpress.com/2012/07/04/twitter-scraper-python-library/

Sunday, 31 May 2015

Data Scraping Services - Things to take care while doing Web Scraping!!!

In the present day and age, web scraping word becomes most popular in data science. Basically web scraping is extracting the information from the websites using pre-written programs and web scraping scripts. Many organizations have successfully used web site scraping to build relevant and useful database that they use on a daily basis to enhance their business interests. This is the age of the Big Data and web scraping is one of the trending techniques in the data science.

Throughout my journey of learning web scraping and implementing many successful scraping projects, I have come across some great experiences we can learn from.  In this post, I’m going to discuss some of the approaches to take and approaches to avoid while executing web scraping.

User Proxies: Anonymously scraping data from websites

One should not scrape website with a single IP Address. Because when you repeatedly request the web page for web scraping, there is a chance that the remote web server might block your IP address preventing further request to the web page. To overcome this situation, one should scrape websites with the help of proxy servers (anonymous scraping). This will minimize the risk of getting trapped and blacklisted by a website. Use of Proxies to hide your identity (network details) to remote web servers while scraping data. You may also use a VPN instead of proxies to anonymously scrape websites.

Take maximum data and store it.

Do not follow “process the web page as it comes from the remote server”. Instead take all the information and store it to disk. This approach will be useful when your scraping algorithm breaks in the middle. In this case you don’t have to start scraping again. Never download the same content more than once as you are just wasting bandwidth. Try and download all content to disk in one go and then do the processing.

Follow strict rules in parsing:

Check various rules while parsing the information from the web site. For example if you expect a value to be a date then check that it’s really a date. This may greatly improve the quality of information. When you get unexpected data, then the algorithm need to be changed accordingly.

Respect Robots.txt

Robots.txt specifies the set of rules that should be followed by web crawlers and robots. I strongly advise you to consider and adjust your crawler to fully respect robots.txt. Robots.txt contains instructions on the exact pages that you are allowed to crawl, user-agent, and the requisite intervals between page requests. Following to these instructions minimizes the chance of getting blacklisted and banned from website owner.

Use XPath Smartly

XPath is a nice option to select elements of the HTML document more flexibly than CSS Selectors.  Be careful about HTML structure change through page to page so one xpath you made may be failed to extract data on another page due to changes in HTML structure.

Obey Website TOC:

Some websites make it absolutely apparent in their terms and conditions that they are particularly against to web scraping activities on their content. This can make you vulnerable against possible ethical and legal implications.

Test sample scrape and verify the data with actual scrape

Once you are done with web scraping project set up, you need to test it for sometimes. Check the extracted data. If something is not good, find out the cause and make changes accordingly and finally come to a perfect web scraping project.

Source: http://webdata-scraping.com/things-take-care-web-scraping/

Wednesday, 27 May 2015

Web Scraping Services : What are the ethics of web scraping?

Someone recently asked: "Is web scraping an ethical concept?" I believe that web scraping is absolutely an ethical concept. Web scraping (or screen scraping) is a mechanism to have a computer read a website. There is absolutely no technical difference between an automated computer viewing a website and a human-driven computer viewing a website. Furthermore, if done correctly, scraping can provide many benefits to all involved.

There are a bunch of great uses for web scraping. First, services like Instapaper, which allow saving content for reading on the go, use screen scraping to save a copy of the website to your phone. Second, services like Mint.com, an app which tells you where and how you are spending your money, uses screen scraping to access your bank's website (all with your permission). This is useful because banks do not provide many ways for programmers to access your financial data, even if you want them to. By getting access to your data, programmers can provide really interesting visualizations and insight into your spending habits, which can help you save money.

That said, web scraping can veer into unethical territory. This can take the form of reading websites much quicker than a human could, which can cause difficulty for the servers to handle it. This can cause degraded performance in the website. Malicious hackers use this tactic in what’s known as a "Denial of Service" attack.

Another aspect of unethical web scraping comes in what you do with that data. Some people will scrape the contents of a website and post it as their own, in effect stealing this content. This is a big no-no for the same reasons that taking someone else's book and putting your name on it is a bad idea. Intellectual property, copyright and trademark laws still apply on the internet and your legal recourse is much the same. People engaging in web scraping should make every effort to comply with the stated terms of service for a website. Even when in compliance with those terms, you should take special care in ensuring your activity doesn't affect other users of a website.

One of the downsides to screen scraping is it can be a brittle process. Minor changes to the backing website can often leave a scraper completely broken. Herein lies the mechanism for prevention: making changes to the structure of the code of your website can wreak havoc on a screen scraper's ability to extract information. Periodically making changes that are invisible to the user but affect the content of the code being returned is the most effective mechanism to thwart screen scrapers. That said, this is only a set-back. Authors of screen scrapers can always update them and, as there is no technical difference between a computer-backed browser and a human-backed browser, there's no way to 100% prevent access.

Going forward, I expect screen scraping to increase. One of the main reasons for screen scraping is that the underlying website doesn't have a way for programmers to get access to the data they want. As the number of programmers (and the need for programmers) increases over time, so too will the need for data sources. It is unreasonable to expect every company to dedicate the resources to build a programmer-friendly access point. Screen scraping puts the onus of data extraction on the programmer, not the company with the data, which can work out well for all involved.

Source: https://quickleft.com/blog/is-web-scraping-ethical/

Monday, 25 May 2015

Screen Scraping with BeautifulSoup and lxml

Please enjoy this — a free Chapter of the Python network programming book that I revised for Apress in 2010!

I completely rewrote this chapter for the book's second edition, to feature two powerful libraries that have appeared since the book first came out. I show how to screen-scrape a real-life web page using both BeautifulSoup and also the powerful lxml library (their web sites are here and here).

I chose this chapter for release because screen scraping is often the first network task that a novice Python programmer tackles. Because this material is oriented towards beginners, it explains the entire process — from fetching web pages, to understanding HTML, to querying for specific elements in the document.

Program listings are available for this chapter in both Python 2 and also in Python 3. Let me know if you have any questions!

Most web sites are designed first and foremost for human eyes. While well-designed sites offer formal APIs by which you can construct Google maps, upload Flickr photos, or browse YouTube videos, many sites offer nothing but HTML pages formatted for humans. If you need a program to be able to fetch its data, then you will need the ability to dive into densely formatted markup and retrieve the information you need—a process known affectionately as screen scraping.

In one's haste to grab information from a web page sitting open in your browser in front of you, it can be easy for even experienced programmers to forget to check whether an API is provided for data that they need. So try to take a few minutes investigating the site in which you are interested to see if some more formal programming interface is offered to their services. Even an RSS feed can sometimes be easier to parse than a list of items on a full web page.

Also be careful to check for a “terms of service” document on each site. YouTube, for example, offers an API and, in return, disallows programs from trying to parse their web pages. Sites usually do this for very important reasons related to performance and usage patterns, so I recommend always obeying the terms of service and simply going elsewhere for your data if they prove too restrictive.

Regardless of whether terms of service exist, always try to be polite when hitting public web sites. Cache pages or data that you will need for several minutes or hours, rather than hitting their site needlessly over and over again. When developing your screen-scraping algorithm, test against a copy of their web page that you save to disk, instead of doing an HTTP round-trip with every test. And always be aware that excessive use can result in your IP being temporarily or permanently blocked from a site if its owners are sensitive to automated sources of load.

Fetching Web Pages

Before you can parse an HTML-formatted web page, you of course have to acquire some. Chapter 9 provides the kind of thorough introduction to the HTTP protocol that can help you figure out how to fetch information even from sites that require passwords or cookies. But, in brief, here are some options for downloading content.

From the Future

If you need a simple way to fetch web pages before scraping them, try Kenneth Reitz's requests library!

The library was not released until after the book was published, but has already taken the Python world by storm. The simplicity and convenience of its API has made it the tool of choice for making web requests from Python.

    You can use urllib2, or the even lower-level httplib, to construct an HTTP request that will return a web page. For each form that has to be filled out, you will have to build a dictionary representing the field names and data values inside; unlike a real web browser, these libraries will give you no help in submitting forms.

    You can to install mechanize and write a program that fills out and submits web forms much as you would do when sitting in front of a web browser. The downside is that, to benefit from this automation, you will need to download the page containing the form HTML before you can then submit it—possibly doubling the number of web requests you perform!

    If you need to download and parse entire web sites, take a look at the Scrapy project, hosted at scrapy.org, which provides a framework for implementing your own web spiders. With the tools it provides, you can write programs that follow links to every page on a web site, tabulating the data you want extracted from each page.

    When web pages wind up being incomplete because they use dynamic JavaScript to load data that you need, you can use the QtWebKit module of the PyQt4 library to load a page, let the JavaScript run, and then save or parse the resulting complete HTML page.

    Finally, if you really need a browser to load the site, both the Selenium and Windmill test platforms provide a way to drive a standard web browser from inside a Python program. You can start the browser up, direct it to the page of interest, fill out and submit forms, do whatever else is necessary to bring up the data you need, and then pull the resulting information directly from the DOM elements that hold them.

These last two options both require third-party components or Python modules that are built against large libraries, and so we will not cover them here, in favor of techniques that require only pure Python.

For our examples in this chapter, we will use the site of the United States National Weather Service, which lives at www.weather.gov.

Among the better features of the United States government is its having long ago decreed that all publications produced by their agencies are public domain. This means, happily, that I can pull all sorts of data from their web site and not worry about the fact that copies of the data are working their way into this book.

Of course, web sites change, so the online source code for this chapter includes the downloaded web page on which the scripts in this chapter are designed to work. That way, even if their site undergoes a major redesign, you will still be able to try out the code examples in the future. And, anyway—as I recommended previously—you should be kind to web sites by always developing your scraping code against a downloaded copy of a web page to help reduce their load.

Downloading Pages Through Form Submission

The task of grabbing information from a web site usually starts by reading it carefully with a web browser and finding a route to the information you need. Figure 10–1 shows the site of the National Weather Service; for our first example, we will write a program that takes a city and state as arguments and prints out the current conditions, temperature, and humidity. If you will explore the site a bit, you will find that city-specific forecasts can be visited by typing the city name into the small “Local forecast” form in the left margin.

Figure 10–1. The National Weather Service web site

(click to enlarge)

When using the urllib2 module from the Standard Library, you will have to read the web page HTML manually to find the form. You can use the View Source command in your browser, search for the words “Local forecast,” and find the following form in the middle of the sea of HTML:

<form method="post" action="http://forecast.weather.gov/zipcity.php" ...>

  ...

  <input type="text" id="zipcity" name="inputstring" size="9"

    value="City, St" onfocus="this.value='';" />

  <input type="submit" name="Go2" value="Go" />

</form>

The only important elements here are the <form> itself and the <input> fields inside; everything else is just decoration intended to help human readers.

This form does a POST to a particular URL with, it appears, just one parameter: an inputstring giving the city name and state. Listing 10–1 shows a simple Python program that uses only the Standard Library to perform this interaction, and saves the result to phoenix.html.

Listing 10–1. Submitting a Form with “urllib2”

#!/usr/bin/env python

# Foundations of Python Network Programming - Chapter 10 - fetch_urllib2.py

# Submitting a form and retrieving a page with urllib2

import urllib, urllib2

data = urllib.urlencode({'inputstring': 'Phoenix, AZ'})

info = urllib2.urlopen('http://forecast.weather.gov/zipcity.php', data)

content = info.read()

open('phoenix.html', 'w').write(content)

On the one hand, urllib2 makes this interaction very convenient; we are able to download a forecast page using only a few lines of code. But, on the other hand, we had to read and understand the form ourselves instead of relying on an actual HTML parser to read it. The approach encouraged by mechanize is quite different: you need only the address of the opening page to get started, and the library itself will take responsibility for exploring the HTML and letting you know what forms are present. Here are the forms that it finds on this particular page:

>>> import mechanize

>>> br = mechanize.Browser()

>>> response = br.open('http://www.weather.gov/')

>>> for form in br.forms():

...     print '%r %r %s' % (form.name, form.attrs.get('id'), form.action)

...     for control in form.controls:

...         print '   ', control.type, control.name, repr(control.value)

None None http://search.usa.gov/search

    hidden v:project 'firstgov'

    text query ''

    radio affiliate ['nws.noaa.gov']

    submit None 'Go'

None None http://forecast.weather.gov/zipcity.php

    text inputstring 'City, St'

    submit Go2 'Go'

'jump' 'jump' http://www.weather.gov/

    select menu ['http://www.weather.gov/alerts-beta/']

    button None None

Here, mechanize has helped us avoid reading any HTML at all. Of course, pages with very obscure form names and fields might make it very difficult to look at a list of forms like this and decide which is the form we see on the page that we want to submit; in those cases, inspecting the HTML ourselves can be helpful, or—if you use Google Chrome, or Firefox with Firebug installed—right-clicking the form and selecting “Inspect Element” to jump right to its element in the document tree.

Once we have determined that we need the zipcity.php form, we can write a program like that shown in Listing 10–2. You can see that at no point does it build a set of form fields manually itself, as was necessary in our previous listing. Instead, it simply loads the front page, sets the one field value that we care about, and then presses the form's submit button. Note that since this HTML form did not specify a name, we had to create our own filter function—the lambda function in the listing—to choose which of the three forms we wanted.

Listing 10–2. Submitting a Form with mechanize

#!/usr/bin/env python

# Foundations of Python Network Programming - Chapter 10 - fetch_mechanize.py

# Submitting a form and retrieving a page with mechanize

import mechanize

br = mechanize.Browser()

br.open('http://www.weather.gov/')

br.select_form(predicate=lambda(form): 'zipcity' in form.action)

br['inputstring'] = 'Phoenix, AZ'

response = br.submit()

content = response.read()


open('phoenix.html', 'w').write(content)

Many mechanize users instead choose to select forms by the order in which they appear in the page—in which case we could have called select_form(nr=1). But I prefer not to rely on the order, since the real identity of a form is inherent in the action that it performs, not its location on a page.

You will see immediately the problem with using mechanize for this kind of simple task: whereas Listing 10–1 was able to fetch the page we wanted with a single HTTP request, Listing 10–2 requires two round-trips to the web site to do the same task. For this reason, I avoid using mechanize for simple form submission. Instead, I keep it in reserve for the task at which it really shines: logging on to web sites like banks, which set cookies when you first arrive at their front page and require those cookies to be present as you log in and browse your accounts. Since these web sessions require a visit to the front page anyway, no extra round-trips are incurred by using mechanize.

The Structure of Web Pages

There is a veritable glut of online guides and published books on the subject of HTML, but a few notes about the format would seem to be appropriate here for users who might be encountering the format for the first time.

The Hypertext Markup Language (HTML) is one of many markup dialects built atop the Standard Generalized Markup Language (SGML), which bequeathed to the world the idea of using thousands of angle brackets to mark up plain text. Inserting bold and italics into a format like HTML is as simple as typing eight angle brackets:

The <b>very</b> strange book <i>Tristram Shandy</i>.

In the terminology of SGML, the strings <b> and </b> are each tags—they are, in fact, an opening and a closing tag—and together they create an element that contains the text very inside it. Elements can contain text as well as other elements, and can define a series of key/value attribute pairs that give more information about the element:

<p content="personal">I am reading <i document="play">Hamlet</i>.</p>

There is a whole subfamily of markup languages based on the simpler Extensible Markup Language (XML), which takes SGML and removes most of its special cases and features to produce documents that can be generated and parsed without knowing their structure ahead of time. The problem with SGML languages in this regard—and HTML is one particular example—is that they expect parsers to know the rules about which elements can be nested inside which other elements, and this leads to constructions like this unordered list <ul>, inside which are several list items <li>:

<ul><li>First<li>Second<li>Third<li>Fourth</ul>

At first this might look like a series of <li> elements that are more and more deeply nested, so that the final word here is four list elements deep. But since HTML in fact says that <li> elements cannot nest, an HTML parser will understand the foregoing snippet to be equivalent to this more explicit XML string:

<ul><li>First</li><li>Second</li><li>Third</li><li>Fourth</li></ul>

And beyond this implicit understanding of HTML that a parser must possess are the twin problems that, first, various browsers over the years have varied wildly in how well they can reconstruct the document structure when given very concise or even deeply broken HTML; and, second, most web page authors judge the quality of their HTML by whether their browser of choice renders it correctly. This has resulted not only in a World Wide Web that is full of sites with invalid and broken HTML markup, but also in the fact that the permissiveness built into browsers has encouraged different flavors of broken HTML among their different user groups.

If HTML is a new concept to you, you can find abundant resources online. Here are a few documents that have been longstanding resources in helping programmers learn the format:

    http://www.w3.org/MarkUp/Guide/

    http://www.w3.org/MarkUp/Guide/Advanced.html

    http://www.w3.org/MarkUp/Guide/Style

The brief Bare Bones Guide, and the long and verbose HTML standard itself, are good resources to have when trying to remember an element name or the name of a particular attribute value:

    http://werbach.com/barebones/barebones.html

    http://www.w3.org/TR/REC-html40/

When building your own web pages, try to install a real HTML validator in your editor, IDE, or build process, or test your web site once it is online by submitting it to

    http://validator.w3.org/

You might also want to consider using the tidy tool, which can also be integrated into an editor or build process:

    http://tidy.sourceforge.net/

We will now turn to that weather forecast for Phoenix, Arizona, that we downloaded earlier using our scripts (note that we will avoid creating extra traffic for the NWS by running our experiments against this local file), and we will learn how to extract actual data from HTML.

Three Axes

Parsing HTML with Python requires three choices:

    The parser you will use to digest the HTML, and try to make sense of its tangle of opening and closing tags

    The API by which your Python program will access the tree of concentric elements that the parser built from its analysis of the HTML page

    What kinds of selectors you will be able to write to jump directly to the part of the page that interests you, instead of having to step into the hierarchy one element at a time

The issue of selectors is a very important one, because a well-written selector can unambiguously identify an HTML element that interests you without your having to touch any of the elements above it in the document tree. This can insulate your program from larger design changes that might be made to a web site; as long as the element you are selecting retains the same ID, name, or whatever other property you select it with, your program will still find it even if after the redesign it is several levels deeper in the document.

I should pause for a second to explain terms like “deeper,” and I think the concept will be clearest if we reconsider the unordered list that was quoted in the previous section. An experienced web developer looking at that list rearranges it in her head, so that this is what it looks like:

<ul>

  <li>First</li>

  <li>Second</li>

  <li>Third</li>

  <li>Fourth</li>

</ul>

Here the <ul> element is said to be a “parent” element of the individual list items, which “wraps” them and which is one level “above” them in the whole document. The <li> elements are “siblings” of one another; each is a “child” of the <ul> element that “contains” them, and they sit “below” their parent in the larger document tree. This kind of spatial thinking winds up being very important for working your way into a document through an API.

In brief, here are your choices along each of the three axes that were just listed:

    The most powerful, flexible, and fastest parser at the moment appears to be the HTMLParser that comes with lxml; the next most powerful is the longtime favorite BeautifulSoup (I see that its author has, in his words, “abandoned” the new 3.1 version because it is weaker when given broken HTML, and recommends using the 3.0 series until he has time to release 3.2); and coming in dead last are the parsing classes included with the Python Standard Library, which no one seems to use for serious screen scraping.

    The best API for manipulating a tree of HTML elements is ElementTree, which has been brought into the Standard Library for use with the Standard Library parsers, and is also the API supported by lxml; BeautifulSoup supports an API peculiar to itself; and a pair of ancient, ugly, event-based interfaces to HTML still exist in the Python Standard Library.

    The lxml library supports two of the major industry-standard selectors: CSS selectors and XPath query language; BeautifulSoup has a selector system all its own, but one that is very powerful and has powered countless web-scraping programs over the years.

Given the foregoing range of options, I recommend using lxml when doing so is at all possible—installation requires compiling a C extension so that it can accelerate its parsing using libxml2—and using BeautifulSoup if you are on a machine where you can install only pure Python. Note that lxml is available as a pre-compiled package named python-lxml on Ubuntu machines, and that the best approach to installation is often this command line:

STATIC_DEPS=true pip install lxml

And if you consult the lxml documentation, you will find that it can optionally use the BeautifulSoup parser to build its own ElementTree-compliant trees of elements. This leaves very little reason to use BeautifulSoup by itself unless its selectors happen to be a perfect fit for your problem; we will discuss them later in this chapter.

But the state of the art may advance over the years, so be sure to consult its own documentation as well as recent blogs or Stack Overflow questions if you are having problems getting it to compile.

From the Future

The BeautifulSoup project has recovered! While the text below — written in late 2010 — has to carefully avoid the broken 3.2 release in favor of 3.0, BeautifulSoup has now released a rewrite named beautifulsoup4 on the Python Package Index that works with both Python 2 and 3. Once installed, simply import it like this:

from bs4 import BeautifulSoup

I just ran a test, and it reads the malformed phoenix.html page perfectly.

Diving into an HTML Document

The tree of objects that a parser creates from an HTML file is often called a Document Object Model, or DOM, even though this is officially the name of one particular API defined by the standards bodies and implemented by browsers for the use of JavaScript running on a web page.

The task we have set for ourselves, you will recall, is to find the current conditions, temperature, and humidity in the phoenix.html page that we have downloaded. Here is an excerpt: Listing 10–3, which focuses on the pane that we are interested in.

Listing 10–3. Excerpt from the Phoenix Forecast Page

<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN"><html><head>

<title>7-Day Forecast for Latitude 33.45&deg;N and Longitude 112.07&deg;W (Elev. 1132 ft)</title><link rel="STYLESHEET" type="text/css" href="fonts/main.css">

...

<table cellspacing="0" cellspacing="0" border="0" width="100%"><tr align="center"><td><table width='100%' border='0'>

<tr>

<td align ='center'>

<span class='blue1'>Phoenix, Phoenix Sky Harbor International Airport</span><br>

Last Update on 29 Oct 7:51 MST<br><br>

</td>
</tr>
<tr>

<td colspan='2'>

<table cellspacing='0' cellpadding='0' border='0' align='left'>

<tr>

<td class='big' width='120' align='center'>

<font size='3' color='000066'>

A Few Clouds<br>

<br>71&deg;F<br>(22&deg;C)</td>

</font><td rowspan='2' width='200'><table cellspacing='0' cellpadding='2' border='0' width='100%'>

<tr bgcolor='#b0c4de'>

<td><b>Humidity</b>:</td>

<td align='right'>30 %</td>

</tr>

<tr bgcolor='#ffefd5'>

<td><b>Wind Speed</b>:</td><td align='right'>SE 5 MPH<br>

</td>
</tr>

<tr bgcolor='#b0c4de'>

<td><b>Barometer</b>:</td><td align='right' nowrap>30.05 in (1015.90 mb)</td></tr>

<tr bgcolor='#ffefd5'>

<td><b>Dewpoint</b>:</td><td align='right'>38&deg;F (3&deg;C)</td>

</tr>
</tr>

<tr bgcolor='#ffefd5'>

<td><b>Visibility</b>:</td><td align='right'>10.00 Miles</td>

</tr>

<tr><td nowrap><b><a href='http://www.wrh.noaa.gov/total_forecast/other_obs.php?wfo=psr&zone=AZZ023' class='link'>More Local Wx:</a></b> </td>

<td nowrap align='right'><b><a href='http://www.wrh.noaa.gov/mesowest/getobext.php?wfo=psr&sid=KPHX&num=72' class='link'>3 Day History:</a></b> </td></tr>

</table>

...

There are two approaches to narrowing your attention to the specific area of the document in which you are interested. You can either search the HTML for a word or phrase close to the data that you want, or, as we mentioned previously, use Google Chrome or Firefox with Firebug to “Inspect Element” and see the element you want embedded in an attractive diagram of the document tree. Figure 10–2 shows Google Chrome with its Developer Tools pane open following an Inspect Element command: my mouse is poised over the <font> element that was brought up in its document tree, and the element itself is highlighted in blue on the web page itself.

Image

Figure 10–2. Examining Document Elements in the Browser

(click to enlarge)

Note that Google Chrome does have an annoying habit of filling in “conceptual” tags that are not actually present in the source code, like the <tbody> tags that you can see in every one of the tables shown here. For that reason, I look at the actual HTML source before writing my Python code; I mainly use Chrome to help me find the right places in the HTML.

We will want to grab the text “A Few Clouds” as well as the temperature before turning our attention to the table that sits to this element's right, which contains the humidity.

A properly indented version of the HTML page that you are scraping is good to have at your elbow while writing code. I have included phoenix-tidied.html with the source code bundle for this chapter so that you can take a look at how much easier it is to read!

You can see that the element displaying the current conditions in Phoenix sits very deep within the document hierarchy. Deep nesting is a very common feature of complicated page designs, and that is why simply walking a document object model can be a very verbose way to select part of a document—and, of course, a brittle one, because it will be sensitive to changes in any of the target element's parent. This will break your screen-scraping program not only if the target web site does a redesign, but also simply because changes in the time of day or the need for the site to host different kinds of ads can change the layout subtly and ruin your selector logic.

To see how direct document-object manipulation would work in this case, we can load the raw page directly into both the lxml and BeautifulSoup systems.

>>> import lxml.etree
>>> parser = lxml.etree.HTMLParser(encoding='utf-8')
>>> tree = lxml.etree.parse('phoenix.html', parser)

The need for a separate parser object here is because, as you might guess from its name, lxml is natively targeted at XML files.

>>> from BeautifulSoup import BeautifulSoup

>>> soup = BeautifulSoup(open('phoenix.html'))

Traceback (most recent call last):
  ...
HTMLParseError: malformed start tag, at line 96, column 720

What on earth? Well, look, the National Weather Service does not check or tidy its HTML! I might have chosen a different example for this book if I had known, but since this is a good illustration of the way the real world works, let's press on. Jumping to line 96, column 720 of phoenix.html, we see that there does indeed appear to be some broken HTML:

<a href="http://www.weather.gov"<u>www.weather.gov</u></a>

You can see that the <u> tag starts before a closing angle bracket has been encountered for the <a> tag. But why should BeautifulSoup care? I wonder what version I have installed.

>>> BeautifulSoup.__version__

'3.1.0'

Well, drat. I typed too quickly and was not careful to specify a working version when I ran pip to install BeautifulSoup into my virtual environment. Let's try again:

$ pip install BeautifulSoup==3.0.8.1

And now the broken document parses successfully:

>>> from BeautifulSoup import BeautifulSoup

>>> soup = BeautifulSoup(open('phoenix.html'))

That is much better!

Now, if we were to take the approach of starting at the top of the document and digging ever deeper until we find the node that we are interested in, we are going to have to generate some very verbose code. Here is the approach we would have to take with lxml:
>>> fonttag = tree.find('body').find('div').findall('table')[3] \

...     .findall('tr')[1].find('td').findall('table')[1].find('tr') \

...     .findall('td')[1].findall('table')[1].find('tr').find('td') \

...     .find('table').findall('tr')[1].find('td').find('table') \

...     .find('tr').find('td').find('font')

>>> fonttag.text

'\nA Few Clouds'

An attractive syntactic convention lets BeautifulSoup handle some of these steps more beautifully:

>>> fonttag = soup.body.div('table', recursive=False)[3] \

...     ('tr', recursive=False)[1].td('table', recursive=False)[1].tr \

...     ('td', recursive=False)[1]('table', recursive=False)[1].tr.td \

...     .table('tr', recursive=False)[1].td.table \

...     .tr.td.font

>>> fonttag.text

u'A Few Clouds71&deg;F(22&deg;C)'

BeautifulSoup lets you choose the first child element with a given tag by simply selecting the attribute .tagname, and lets you receive a list of child elements with a given tag name by calling an element like a function—you can also explicitly call the method findAll()—with the tag name and a recursive option telling it to pay attention just to the children of an element; by default, this option is set to True, and BeautifulSoup will run off and find all elements with that tag in the entire sub-tree beneath an element!

Anyway, two lessons should be evident from the foregoing exploration.

First, both lxml and BeautifulSoup provide attractive ways to quickly grab a child element based on its tag name and position in the document.

Second, we clearly should not be using such primitive navigation to try descending into a real-world web page! I have no idea how code like the expressions just shown can easily be debugged or maintained; they would probably have to be re-built from the ground up if anything went wrong with them—they are a painful example of write-once code.

And that is why selectors that each screen-scraping library supports are so critically important: they are how you can ignore the many layers of elements that might surround a particular target, and dive right in to the piece of information you need.

Figuring out how HTML elements are grouped, by the way, is much easier if you either view HTML with an editor that prints it as a tree, or if you run it through a tool like HTML tidy from W3C that can indent each tag to show you which ones are inside which other ones:
$ tidy phoenix.html > phoenix-tidied.html

You can also use either of these libraries to try tidying the code, with a call like one of these:

lxml.html.tostring(html)

soup.prettify()

See each library's documentation for more details on using these calls.

Selectors

A selector is a pattern that is crafted to match document elements on which your program wants to operate. There are several popular flavors of selector, and we will look at each of them as possible techniques for finding the current-conditions <font> tag in the National Weather Service page for Phoenix. We will look at three:

•    People who are deeply XML-centric prefer XPath expressions, which are a companion technology to XML itself and let you match elements based on their ancestors, their own identity, and textual matches against their attributes and text content. They are very powerful as well as quite general.

•    If you are a web developer, then you probably link to CSS selectors as the most natural choice for examining HTML. These are the same patterns used in Cascading Style Sheets documents to describe the set of elements to which each set of styles should be applied.

•    Both lxml and BeautifulSoup, as we have seen, provide a smattering of their own methods for finding document elements.

Here are standards and descriptions for each of the selector styles just described— first, XPath:

    http://www.w3.org/TR/xpath/

    http://codespeak.net/lxml/tutorial.html#using-xpath-to-find-text

    http://codespeak.net/lxml/xpathxslt.html

And here are some CSS selector resources:

    http://www.w3.org/TR/CSS2/selector.html

    http://codespeak.net/lxml/cssselect.html

And, finally, here are links to documentation that looks at selector methods peculiar to lxml and BeautifulSoup:

    http://codespeak.net/lxml/tutorial.html#elementpath

  http://www.crummy.com/software/BeautifulSoup/documentation.html#Searching the Parse Tree

The National Weather Service has not been kind to us in constructing this web page. The area that contains the current conditions seems to be constructed entirely of generic untagged elements; none of them have id or class values like currentConditions or temperature that might help guide us to them.

Well, what are the features of the elements that contain the current weather conditions in Listing 10–3? The first thing I notice is that the enclosing <td> element has the class "big". Looking at the page visually, I see that nothing else seems to be of exactly that font size; could it be so simple as to search the document for every <td> with this CSS class? Let us try, using a CSS selector to begin with:

>>> from lxml.cssselect import CSSSelector

>>> sel = CSSSelector('td.big')

>>> sel(tree)

[<Element td at b72ec0a4>]

Perfect! It is also easy to grab elements with a particular class attribute using the peculiar syntax of BeautifulSoup:

>>> soup.find('td', 'big')

<td class="big" width="120" align="center">

<font size="3" color="000066">

A Few Clouds<br />

<br />71&deg;F<br />(22&deg;C)</font></td>

Writing an XPath selector that can find CSS classes is a bit difficult since the class="" attribute contains space-separated values and we do not know, in general, whether the class will be listed first, last, or in the middle.

>>> tree.xpath(".//td[contains(concat(' ', normalize-space(@class), ' '), ' big ')]")

[<Element td at a567fcc>]

This is a common trick when using XPath against HTML: by prepending and appending spaces to the class attribute, the selector assures that it can look for the target class name with spaces around it and find a match regardless of where in the list of classes the name falls.

Selectors, then, can make it simple, elegant, and also quite fast to find elements deep within a document that interest us. And if they break because the document is redesigned or because of a corner case we did not anticipate, they tend to break in obvious ways, unlike the tedious and deep procedure of walking the document tree that we attempted first.

Once you have zeroed in on the part of the document that interests you, it is generally a very simple matter to use the ElementTree or the old BeautifulSoup API to get the text or attribute values you need. Compare the following code to the actual tree shown in Listing 10–3:

>>> td = sel(tree)[0]

>>> td.find('font').text

'\nA Few Clouds'

>>> td.find('font').findall('br')[1].tail

u'71°F'

If you are annoyed that the first string did not return as a Unicode object, you will have to blame the ElementTree standard; the glitch has been corrected in Python 3! Note that ElementTree thinks of text strings in an HTML file not as entities of their own, but as either the .text of its parent element or the .tail of the previous element. This can take a bit of getting used to, and works like this:

<p>

  My favorite play is    # the <p> element's .text

  <i>

    Hamlet                 # the <i> element's .text

  </i>

  which is not really      # the <i> element's .tail

  <b>

    Danish                 # the <b> element's .text

  </b>

  but English.             # the <b> element's .tail

</p>


This can be confusing because you would think of the three words favorite and really and English as being at the same “level” of the document—as all being children of the <p> element somehow—but lxml considers only the first word to be part of the text attached to the <p> element, and considers the other two to belong to the tail texts of the inner <i> and <b> elements. This arrangement can require a bit of contortion if you ever want to move elements without disturbing the text around them, but leads to rather clean code otherwise, if the programmer can keep a clear picture of it in her mind.

BeautifulSoup, by contrast, considers the snippets of text and the <br> elements inside the <font> tag to all be children sitting at the same level of its hierarchy. Strings of text, in other words, are treated as phantom elements. This means that we can simply grab our text snippets by choosing the right child nodes:

>>> td = soup.find('td', 'big')

>>> td.font.contents[0]

u'\nA Few Clouds'

>>> td.font.contents[4]

u'71&deg;F'

Through a similar operation, we can direct either lxml or BeautifulSoup to the humidity datum. Since the word Humidity: will always occur literally in the document next to the numeric value, this search can be driven by a meaningful term rather than by something as random as the big CSS tag. See Listing 10–4 for a complete screen-scraping routine that does the same operation first with lxml and then with BeautifulSoup.

This complete program, which hits the National Weather Service web page for each request, takes the city name on the command line:

$ python weather.py Springfield, IL

Condition:

Traceback (most recent call last):

  ..

AttributeError: 'NoneType' object has no attribute 'text'

And here you can see, superbly illustrated, why screen scraping is always an approach of last resort and should always be avoided if you can possibly get your hands on the data some other way: because presentation markup is typically designed for one thing—human readability in browsers—and can vary in crazy ways depending on what it is displaying.

What is the problem here? A short investigation suggests that the NWS page includes only a <font> element inside of the <tr> if—and this is just a guess of mine, based on a few examples—the description of the current conditions is several words long and thus happens to contain a space. The conditions in Phoenix as I have written this chapter are “A Few Clouds,” so the foregoing code has worked just fine; but in Springfield, the weather is “Fair” and therefore does not need a <font> wrapper around it, apparently.

Listing 10–4. Completed Weather Scraper

#!/usr/bin/env python

# Foundations of Python Network Programming - Chapter 10 - weather.py

# Fetch the weather forecast from the National Weather Service.

import sys, urllib, urllib2

import lxml.etree

from lxml.cssselect import CSSSelector

from BeautifulSoup import BeautifulSoup

if len(sys.argv) < 2:

    print >>sys.stderr, 'usage: weather.py CITY, STATE'

    exit(2)

data = urllib.urlencode({'inputstring': ' '.join(sys.argv[1:])})

info = urllib2.urlopen('http://forecast.weather.gov/zipcity.php', data)

content = info.read()

# Solution #1

parser = lxml.etree.HTMLParser(encoding='utf-8')

tree = lxml.etree.fromstring(content, parser)

big = CSSSelector('td.big')(tree)[0]

if big.find('font') is not None:

    big = big.find('font')

print 'Condition:', big.text.strip()

print 'Temperature:', big.findall('br')[1].tail

tr = tree.xpath('.//td[b="Humidity"]')[0].getparent()

print 'Humidity:', tr.findall('td')[1].text

print

# Solution #2

soup = BeautifulSoup(content)  # doctest: +SKIP

big = soup.find('td', 'big')

if big.font is not None:

    big = big.font

print 'Condition:', big.contents[0].string.strip()

temp = big.contents[3].string or big.contents[4].string  # can be either

print 'Temperature:', temp.replace('°', ' ')

tr = soup.find('b', text='Humidity').parent.parent.parent

print 'Humidity:', tr('td')[1].string

print

If you look at the final form of Listing 10–4, you will see a few other tweaks that I made as I noticed changes in format with different cities. It now seems to work against a reasonable selection of locations; again, note that it gives the same report twice, generated once with lxml and once with BeautifulSoup:

$ python weather.py Springfield, IL

Condition: Fair

Temperature: 54 °F

Humidity: 28 %</code>

<code>Condition: Fair

Temperature: 54  F

Humidity: 28 %

$ python weather.py Grand Canyon, AZ

Condition: Fair

Temperature: 67°F

Humidity: 28 %

Condition: Fair

Temperature: 67 F

Humidity: 28 %

You will note that some cities have spaces between the temperature and the F, and others do not. No, I have no idea why. But if you were to parse these values to compare them, you would have to learn every possible variant and your parser would have to take them into account.

I leave it as an exercise to the reader to determine why the web page currently displays the word “NULL”—you can even see it in the browser—for the temperature in Elk City, Oklahoma. Maybe that location is too forlorn to even deserve a reading? In any case, it is yet another special case that you would have to treat sanely if you were actually trying to repackage this HTML page for access from an API:

$ python weather.py Elk City, OK

Condition: Fair and Breezy

Temperature: NULL

Humidity: NA

Condition: Fair and Breezy

Temperature: NULL

Humidity: NA

I also leave as an exercise to the reader the task of parsing the error page that comes up if a city cannot be found, or if the Weather Service finds it ambiguous and prints a list of more specific choices!

Summary

Although the Python Standard Library has several modules related to SGML and, more specifically, to HTML parsing, there are two premier screen-scraping technologies in use today: the fast and powerful lxml library that supports the standard Python “ElementTree” API for accessing trees of elements, and the quirky BeautifulSoup library that has powerful API conventions all its own for querying and traversing a document.

If you use BeautifulSoup before 3.2 comes out, be sure to download the most recent 3.0 version; the 3.1 series, which unfortunately will install by default, is broken and chokes easily on HTML glitches.

Screen scraping is, at bottom, a complete mess. Web pages vary in unpredictable ways even if you are browsing just one kind of object on the site—like cities at the National Weather Service, for example.

To prepare to screen scrape, download a copy of the page, and use HTML tidy, or else your screen-scraping library of choice, to create a copy of the file that your eyes can more easily read. Always run your program against the ugly original copy, however, lest HTML tidy fixes something in the markup that your program will need to repair!

Once you find the data you want in the web page, look around at the nearby elements for tags, classes, and text that are unique to that spot on the screen. Then, construct a Python command using your scraping library that looks for the pattern you have discovered and retrieves the element in question. By looking at its children, parents, or enclosed text, you should be able to pull out the data that you need from the web page intact.

When you have a basic script working, continue testing it; you will probably find many edge cases that have to be handled correctly before it becomes generally useful. Remember: when possible, always use true APIs, and treat screen scraping as a technique of last resort!

Source: http://rhodesmill.org/brandon/chapters/screen-scraping/


Saturday, 23 May 2015

Web scraping using Python without using large frameworks like Scrapy

scrapy-big-logoIf you need publicly available data from scraping the Internet, before creating a webscraper, it is best to check if this data is already available from public data sources or APIs. Check the site’s FAQ section or Google for their API endpoints and public data.

Even if their API endpoints are available you have to create some parser for fetching and structuring the data according to your needs.

Scrapy is a well established framework for scraping, but it is also a very heavy framework. For smaller jobs, it may be overkill and for extremely large jobs it is very slow.

So if you would like to roll up your sleeves and build your own scraper, continue reading.

Here are some basic steps performed by most webspiders:

1) Start with a URL and use a HTTP GET or PUT request to access the URL
2) Fetch all the contents in it and parse the data
3) Store the data in any database or put it into any data warehouse
4) Enqueue all the URLs in a page
5) Use the URLs in queue and repeat from process 1
Here are the 3 major modules in every web crawler:
1) Request/Response handler.
2) Data parsing/data cleansing/data munging process.
3) Data serialization/data pipelines.

Lets look at each of these modules and see what they do and how to use them.

Request/Response handler

Request/response handlers are managers who make http requests to a url or a group of urls, and fetch the response objects as html contents and pass this data to the next module. If you use Python for performing request/response url-opening process libraries such as the following are most commonly used

1) urllib(20.5. urllib – Open arbitrary resources by URL – Python v2.7.8 documentation) -Basic python library yet high-level interface for fetching data across the World Wide Web.

2) urllib2(20.6. urllib2 – extensible library for opening URLs – Python v2.7.8 documentation) – extensible library of urllib, which would handle basic http requests, digest authentication, redirections, cookies and more.

3) requests(Requests: HTTP for Humans) – Much advanced request library

which is built on top of basic request handling libraries.

Data parsing/data cleansing/data munging process

This is the module where the fetched data is processed and cleaned. Unstructured data is transformed into structured during this processing. Usually  a set of Regular Expressions (regexes) which perform pattern matching and text processing tasks on the html data are used for this processing.

In addition to regexes, basic string manipulation and search methods are also used to perform this cleaning and transformation. You must have a thorough knowledge of regular expressions and so that you could design the regex patterns.

Data serialization/data pipelines

Once you get the cleaned data from the parsing and cleaning module, the data serialization module will be used to serialize the data according to the data models that you require. This is the final module that will output data in a standard format that can be stored in databases, JSON/CSV files or passed to any data warehouses for storage. These tasks are usually performed by libraries listed below

1) pickle (pickle – Python object serialization) –  This module implements a fundamental, but powerful algorithm for serializing and de-serializing a Python object structure

2) JSON (JSON encoder and decoder)

3) CSV (https://docs.python.org/2/library/csv.html)

4) Basic database interface libraries like pymongo (Tutorial – PyMongo),mysqldb ( on python.org), sqlite3(sqlite3 – DB-API interface for SQLite databases)

And many more such libraries based on the format and database/data storage.

Basic spider rules

The rules to follow while building a spider are to be nice to the sites you are scraping and follow the rules in the site’s spider policies outlined in the site’s robots.txt.

Limit the  number of requests in a second and build enough delays in the spiders so that  you don’t adversely affect the site.

It just makes sense to be nice.

We will cover more techniques in future articles

Source: http://learn.scrapehero.com/webscraping-using-python-without-using-large-frameworks-like-scrapy/

Friday, 22 May 2015

Social Media Crawling & Scraping services for Brand Monitoring

Crawling social media sites for extracting information is a fairly new concept – mainly due to the fact that most of the social media networking sites have cropped up in the last decade or so. But it’s equally (if not more) important to grab this ever-expanding User-Generated-Content (UGC) as this is the data that companies are interested in the most – such as product/service reviews, feedback, complaints, brand monitoring, brand analysis, competitor analysis, overall sentiment towards the brand, and so on.

Scraping social networking sites such as Twitter, Linkedin, Google Plus, Instagram etc. is not an easy task for in-house data acquisition departments of most companies as these sites have complex structures and also restrict the amount and frequency of the data that they let out to crawlers. This kind of a task is best left to an expert, such as PromptCloud’s Social Media Data Acquisition Service – which can take care of your end-to-end requirements and provide you with the desired data in a minimal turnaround time. Most of the popular social networking sites such as Twitter and Facebook let crawlers extract data only through their own API (Application Programming Interface), so as to control the amount of information about their users and their activities.

PromptCloud respects all these restrictions with respect to access to content and frequency of hitting their servers to make sure that user information is not compromised and their experience with the site is unhindered.

Social Media Scraping Experts

At PromptCloud, we have developed an expertise in crawling and scraping social media data in real-time. Such data can be from diverse sources such as – Twitter, Linkedin groups, blogs, news, reviews etc. Popular usage of this data is in brand monitoring, trend watching, sentiment/competitor analysis & customer service, among others.

Our low-latency component can extract data on the basis of specific keywords, categories, geographies, or a combination of these. We can also take care of complexities such as multiple languages as well as tweets and profiles of specific users (based on keywords or geographies). Sample XML data can be accessed through this link – demo.promptcloud.com.

Structured data is delivered via a single REST-based API and every time new content is published, the feed gets updated automatically. We also provide data in any other preferred formats (XML, CSV, XLS etc.).

If you have a social media data acquisition problem that you want to get solved, please do get in touch with us.

Source: https://www.promptcloud.com/social-media-networking-sites-crawling-service/

Monday, 18 May 2015

Scraping Twitter Lists To Boost Social Outreach (+ Free Tool!)

I published a post a few weeks ago describing how to build your own twitter custom audience list, outlining a variety of techniques to build up your list.

This post outlines another method (hat tip to Ade Lewis for the idea) which requires you to scrape Twitter directly.

If you want to skip all the explanations and just want to download the Twitter List Scraper tool, here you go…

Download the Twitter Scraper Tool for Windows or Mac (completely free)

Disclaimer: Scraping Twitter is against their Terms of Service, so if you decide to do this you do it at your own risk.

Some Benchmarks

Building custom audiences on Twitter requires you to identify Twitter usernames that might be interested in your service or product.

In my previous posts, one of the methods I employed was to pull a competitor’s link profile and scrape social accounts from the linking domains.

Once you upload a custom list, Twitter goes through a process of ‘matching’ against profiles in their system, to make sure the user exists and hasn’t opted out of tailored ads.

As our data was scraped from a list of unqualified websites, the data matching wasn’t likely to be perfect.

Experiments

Since I published that post, I have been experimenting a fair bit with list building, and have built up around 10 custom audience lists. I‘ve uploaded a total of 48,857 Twitter usernames using this method, but only 29,260 were matched by Twitter (just less than 60% match rate).

From some other experiments where I have had better control over the input data, this match rate was between 70-80%.

Since we’ll be scraping Twitter directly, I expect our match rate to be much higher – 90%+

Finding Relevant Twitter Lists

So, we’re going to scrape Twitter, and the first step is to find Twitter lists that will contain users potentially interested in what we have to offer.

As an example, we’ll pretend we’re marketing a music website, and we’ve produced a survey we want to collect responses for.

An advanced Google query can give us lists of music bloggers: site:twitter.com inurl:lists inurl:members inurl:music “music blogger”

Source: http://urlprofiler.com/blog/scraping-twitter/

Thursday, 14 May 2015

Web Scraping - Data Collection or Illegal Activity?

Web Scraping Defined

We've all heard the term "web scraping" but what is this thing and why should we really care about it?  Web scraping refers to an application that is programmed to simulate human web surfing by accessing websites on behalf of its "user" and collecting large amounts of data that would typically be difficult for the end user to access.  Web scrapers process the unstructured or semi-structured data pages of targeted websites and convert the data into a structured format.  Once the data is in a structured format, the user can extract or manipulate the data with ease.  Web scraping is very similar to web indexing (used by most search engines), but the end motivation is typically much different.  Whereas web indexing is used to help make search engines more efficient, web scraping is typically used for different reasons like change detection, market research, data monitoring, and in some cases, theft.

Why Web Scrape?

There are lots of reasons people (or companies) want to scrape websites, and there are tons of web scraping applications available today.  A quick Internet search will yield numerous web scraping tools written in just about any programming language you prefer.  In today's information-hungry environment, individuals and companies alike are willing to go to great lengths to gather information about all sorts of topics.  Imagine a company that would really like to gather some market research on one of their leading competitors...might they be tempted to invoke a web scraper that gathers all the information for them?  Or, what if someone wanted to find a vulnerable site that allowed otherwise not-so-free downloads?  Or, maybe a less than honest person might want to find a list of account numbers on a site that failed to properly secure them.  The list goes on and on.

I should mention that web scraping is not always a bad thing.  Some websites allow web scraping, but many do not.  It's important to know what a website allows and prohibits before you scrape it.

The Problem With Web Scraping

Web scraping rides a fine line between collecting information and stealing information.  Most websites have a copyright disclosure statement that legally protects their website information.  It's up to the reader/user/scraper to read these disclosure statements and follow along legally and ethically.  In fact, the F5.com website presents the following copyright disclosure:  "All content included on this site, such as text, graphics, logos, button icons, images, audio clips, and software, including the compilation thereof (meaning the collection, arrangement, and assembly), is the property of F5 Networks, Inc., or its content and software suppliers, except as may be stated otherwise, and is protected by U.S. and international copyright laws."  It goes on to say, "We reserve the right to make changes to our site and these disclaimers, terms, and conditions at any time."

So, scraper beware!  There have been many court cases where web scraping turned into felony offenses.  One case involved an online activist who scraped the MIT website and ultimately downloaded millions of academic articles.  This guy is now free on bond, but faces dozens of years in prison and $1 million if convicted.  Another case involves a real estate company who illegally scraped listings and photos from a competitor in an attempt to gain a lead in the market.  Then, there's the case of a regional software company that was convicted of illegally scraping a major database company's websites in order to gain a competitive edge.  The software company had to pay a $20 million fine and the guilty scraper is serving three years probation.  Finally, there's the case of a medical website that hosted sensitive patient information.  In this case, several patients had posted personal drug listings and other private information on closed forums located on the medical website.  The website was scraped by a media-rese
arch firm, and all this information was suddenly public.

While many illegal web scrapers have been caught by the authorities, many more have never been caught and still run loose on websites around the world.  As you can see, it's increasingly important to guard against this activity.  After all, the information on your website belongs to you, and you don't want anyone else taking it without your permission.

The Good News

As we've noted, web scraping is a real problem for many companies today.  The good news is that F5 has web scraping protection built into the Application Security Manager (ASM) of its BIG-IP product family.  As you can see in the screenshot below, the ASM provides web scraping protection against bots, session opening anomalies, session transaction anomalies, and IP address whitelisting.

The bot detection works with clients that accept cookies and process JavaScript.  It counts the client's page consumption speed and declares a client as a bot if a certain number of page changes happen within a given time interval.  The session opening anomaly spots web scrapers that do not accept cookies or process JavaScript.  It counts the number of sessions opened during a given time interval and declares the client as a scraper if the maximum threshold is exceeded.  The session transaction anomaly detects valid sessions that visit the site much more than other clients.  This defense is looking at a bigger picture and it blocks sessions that exceed a calculated baseline number that is derived from a current session table.  The IP address whitelist allows known friendly bots and crawlers (i.e. Google, Bing, Yahoo, Ask, etc), and this list can be populated as needed to fit the needs of your organization.

I won't go into all the details here because I'll have some future articles that dive into the details of how the ASM protects against these types of web scraping capabilities.  But, suffice it to say, ASM does a great job of protecting your website against the problem of web scraping.

I'm sure as you studied the screenshot above you also noticed lots of other protection capabilities the ASM provides...brute force attack prevention, customized attack signatures, Denial of Service protection, etc.  You might be wondering how it does all that stuff as well.  Give us a little feedback on the topics you would like to see, and we'll start posting some targeted tech tips for you!

Thanks for reading this introductory web scraping article...and, be sure to come back for the deeper look into how the ASM is configured to handle this problem. For more information, check out this video from Peter Silva where he discusses ASM botnet and web scraping defense.

Source: https://devcentral.f5.com/articles/web-scraping-data-collection-or-illegal-activity

Monday, 4 May 2015

Lawyers & Attorneys Website Data Scraping Services

There are so many instances where one end’s up needing information from lawyers or bar associations. However, if you approach them directly or look for other ways to get information it might either be difficult or you might not get the information you are looking for. Thus, the best way to go about the scraping lawyer data.

Scraping lawyer data allow you to get information from various attorney websites, bar association websites, or other related websites. Using web scraping tools for getting such information makes it much easier to get all the relevant and important information without actually having to worry about the same.

If you wish to scrape data from lawyer, you are entitled to information such as lawyer name, firm names, address, contact details, history about the lawyers, educational qualifications, the bar association they are part of and much more.

Scraping lawyer data ensure that you also have images of the lawyer you are concentrating on. The result of scrape data form lawyer can be obtained in any format the user wants such as csv, excel, MySql etc. Scraping lawyer data also ensures that none of the information provided are repetitive or redundant.

If you are in need of information regarding any lawyer such as their contact details, address etc. it could end up being a huge and difficult task to get it manually or physically. Thus, taking off the help of scraping tools would ensure that you get all the needed information without actually having to bother about anything at all. The presence of lots of attorney websites and the fact that more and more lawyers are moving to the internet makes getting information easy with the help of some great tools. Scraping data is a very useful and handy method in which one can get all the required and relevant information and that too in a very easy to read format, which makes the method even worthier.

There are quite a few tools or services that you can take help of to get lawyers data scraped. Most of these services also provide with a sample demo and that free of cost. From the sample one can decide if they wish to continue with the services or try some other services. Thus, if you want any information from attorney websites or information about any lawyers, data scraping is a great way to get the same.

Source: https://3idatascraping.wordpress.com/2014/03/18/lawyers-attorneys-website-data-scraping-services/

Wednesday, 29 April 2015

Benefits of Scraping Data from Real Estate Website

With so much of growth in the recent times in real estate industry, it is likely that companies would want to create something different or use another method, so as to get desired benefits. Thus, it is best to go with the technological advancements and create real estate websites to get an edge over others in the industry. And to get all the information regarding website content, one can opt for real estate data scraping methods.

About real estate website scraping

Internet has become an important part of our daily lives and in industry marketing procedures too. With the use of website scraping one can easily scrape real estate listing from various websites. One just needs the help of experts and with proper software and tools; they can easily collect all the relevant real estate data from the required real estate websites and make a structured file containing the information. With internet becoming a valid platform for information and data submitted by numerous sources from around the globe, it is necessary to gather them all in one place for companies. In this way, the company can know what it lacks and work upon their strategies so as to gain profit and get to the top of the business world by taking one step at a time.

Uses of real estate website scraping

With proper use of website scraping one can collect and scrape the real estate listings which can help the company in the real estate market area. One can draw the attention of potential customers by designing the company strategies in such a way as contemplating the changing trends in the real estate global arena. All this is done with the help of the data collected from various real estate websites. With the help of proper website, one can collect the data and these get updated whenever new information gets into the web portal. In this way the company is kept updated about the various changes happening around the global market and thus, ensure in making plans regarding the company. This way one can plan ahead and take steps that can lead to the company gaining profits in future.

Thus, with the help of proper real estate website scraping one can be sure of getting all the information regarding real estate market. This way one can work upon making the company move as per the market trends and get a stronghold in real estate business.

Source: https://3idatascraping.wordpress.com/2013/09/25/benefit-of-scraping-data-from-real-estate-website/

Monday, 27 April 2015

Three Common Methods For Web Data Extraction

Probably the most common technique used traditionally to extract data from web pages this is to cook up some regular expressions that match the pieces you want (e.g., URL's and link titles). Our screen-scraper software actually started out as an application written in Perl for this very reason. In addition to regular expressions, you might also use some code written in something like Java or Active Server Pages to parse out larger chunks of text. Using raw regular expressions to pull out the data can be a little intimidating to the uninitiated, and can get a bit messy when a script contains a lot of them. At the same time, if you're already familiar with regular expressions, and your scraping project is relatively small, they can be a great solution.

Other techniques for getting the data out can get very sophisticated as algorithms that make use of artificial intelligence and such are applied to the page. Some programs will actually analyze the semantic content of an HTML page, then intelligently pull out the pieces that are of interest. Still other approaches deal with developing "ontologies", or hierarchical vocabularies intended to represent the content domain.

There are a number of companies (including our own) that offer commercial applications specifically intended to do screen-scraping. The applications vary quite a bit, but for medium to large-sized projects they're often a good solution. Each one will have its own learning curve, so you should plan on taking time to learn the ins and outs of a new application. Especially if you plan on doing a fair amount of screen-scraping it's probably a good idea to at least shop around for a screen-scraping application, as it will likely save you time and money in the long run.

So what's the best approach to data extraction? It really depends on what your needs are, and what resources you have at your disposal. Here are some of the pros and cons of the various approaches, as well as suggestions on when you might use each one:

Raw regular expressions and code

Advantages:

- If you're already familiar with regular expressions and at least one programming language, this can be a quick solution.

- Regular expressions allow for a fair amount of "fuzziness" in the matching such that minor changes to the content won't break them.

- You likely don't need to learn any new languages or tools (again, assuming you're already familiar with regular expressions and a programming language).

- Regular expressions are supported in almost all modern programming languages. Heck, even VBScript has a regular expression engine. It's also nice because the various regular expression implementations don't vary too significantly in their syntax.

Disadvantages:

- They can be complex for those that don't have a lot of experience with them. Learning regular expressions isn't like going from Perl to Java. It's more like going from Perl to XSLT, where you have to wrap your mind around a completely different way of viewing the problem.

- They're often confusing to analyze. Take a look through some of the regular expressions people have created to match something as simple as an email address and you'll see what I mean.

- If the content you're trying to match changes (e.g., they change the web page by adding a new "font" tag) you'll likely need to update your regular expressions to account for the change.

- The data discovery portion of the process (traversing various web pages to get to the page containing the data you want) will still need to be handled, and can get fairly complex if you need to deal with cookies and such.

When to use this approach: You'll most likely use straight regular expressions in screen-scraping when you have a small job you want to get done quickly. Especially if you already know regular expressions, there's no sense in getting into other tools if all you need to do is pull some news headlines off of a site.

Ontologies and artificial intelligence

Advantages:

- You create it once and it can more or less extract the data from any page within the content domain you're targeting.

- The data model is generally built in. For example, if you're extracting data about cars from web sites the extraction engine already knows what the make, model, and price are, so it can easily map them to existing data structures (e.g., insert the data into the correct locations in your database).

- There is relatively little long-term maintenance required. As web sites change you likely will need to do very little to your extraction engine in order to account for the changes.

Disadvantages:

- It's relatively complex to create and work with such an engine. The level of expertise required to even understand an extraction engine that uses artificial intelligence and ontologies is much higher than what is required to deal with regular expressions.

- These types of engines are expensive to build. There are commercial offerings that will give you the basis for doing this type of data extraction, but you still need to configure them to work with the specific content domain you're targeting.

- You still have to deal with the data discovery portion of the process, which may not fit as well with this approach (meaning you may have to create an entirely separate engine to handle data discovery). Data discovery is the process of crawling web sites such that you arrive at the pages where you want to extract data.

When to use this approach: Typically you'll only get into ontologies and artificial intelligence when you're planning on extracting information from a very large number of sources. It also makes sense to do this when the data you're trying to extract is in a very unstructured format (e.g., newspaper classified ads). In cases where the data is very structured (meaning there are clear labels identifying the various data fields), it may make more sense to go with regular expressions or a screen-scraping application.

Screen-scraping software

Advantages:

- Abstracts most of the complicated stuff away. You can do some pretty sophisticated things in most screen-scraping applications without knowing anything about regular expressions, HTTP, or cookies.

- Dramatically reduces the amount of time required to set up a site to be scraped. Once you learn a particular screen-scraping application the amount of time it requires to scrape sites vs. other methods is significantly lowered.

- Support from a commercial company. If you run into trouble while using a commercial screen-scraping application, chances are there are support forums and help lines where you can get assistance.

Disadvantages:

- The learning curve. Each screen-scraping application has its own way of going about things. This may imply learning a new scripting language in addition to familiarizing yourself with how the core application works.

- A potential cost. Most ready-to-go screen-scraping applications are commercial, so you'll likely be paying in dollars as well as time for this solution.

- A proprietary approach. Any time you use a proprietary application to solve a computing problem (and proprietary is obviously a matter of degree) you're locking yourself into using that approach. This may or may not be a big deal, but you should at least consider how well the application you're using will integrate with other software applications you currently have. For example, once the screen-scraping application has extracted the data how easy is it for you to get to that data from your own code?

When to use this approach: Screen-scraping applications vary widely in their ease-of-use, price, and suitability to tackle a broad range of scenarios. Chances are, though, that if you don't mind paying a bit, you can save yourself a significant amount of time by using one. If you're doing a quick scrape of a single page you can use just about any language with regular expressions. If you want to extract data from hundreds of web sites that are all formatted differently you're probably better off investing in a complex system that uses ontologies and/or artificial intelligence. For just about everything else, though, you may want to consider investing in an application specifically designed for screen-scraping.

As an aside, I thought I should also mention a recent project we've been involved with that has actually required a hybrid approach of two of the aforementioned methods. We're currently working on a project that deals with extracting newspaper classified ads. The data in classifieds is about as unstructured as you can get. For example, in a real estate ad the term "number of bedrooms" can be written about 25 different ways. The data extraction portion of the process is one that lends itself well to an ontologies-based approach, which is what we've done. However, we still had to handle the data discovery portion. We decided to use screen-scraper for that, and it's handling it just great. The basic process is that screen-scraper traverses the various pages of the site, pulling out raw chunks of data that constitute the classified ads. These ads then get passed to code we've written that uses ontologies in order to extract out the individual pieces we're after. Once the data has been extracted we then insert it
into a database.

Source: http://ezinearticles.com/?Three-Common-Methods-For-Web-Data-Extraction&id=165416

Friday, 24 April 2015

How to Properly Scrape Windows During The Cleaning Process

Removing ordinary dirt such as dust, fingerprints, and oil from windows seem simple enough. However, sometimes, you may find stubborn caked-on dirt or debris on your windows that cannot be removed by standard window cleaning techniques such as scrubbing or using a squeegee. The best way to remove caked-on dirt on your windows is to scrape it off. Nonetheless, you have to be extra careful when you are scraping your windows, because they can be easily scratched and damaged. Here are a number of rules that you need to follow when you are scraping windows.

Rule No. 1: It is recommended that you use a professional window scraper to remove caked-on dirt and debris from your windows. This type of scraper is specially made for use on glass, and it comes with certain features that can prevent scratching and other kinds of damage.

Rule No. 2: It is important to inspect your window scraper before using it. Take a look at the blade of the scraper and make sure that it is not rusted. Also, it must not be bent or chipped off at the corners. If you are not certain whether the blade is in a good enough condition, you should just play it safe by using a new blade.

Rule No. 3: When you are working with a window scraper, always use forward plow-like scraping motions. Scrape forward and lift the scraper off the glass, and then scrape forward again. Try not to slide the scraper backwards, because you may trap debris under the blade when you do so. Consequently, the scraper may scratch the glass.

Rule No. 4: Be extra cautious when you are using a window scraper on tempered glass. Tempered glass may have raised imperfections, which make it more vulnerable to scratches. To find out if the window that you are scraping is made of tempered glass, you have to look for a label in one of its corners.

Window Scraping Procedures

Before you start scraping, you have to wet your window with soapy water first. Then, find out how the window scraper works by testing it in a corner. Scrape on the same spot three or four times in forward motion. If you find that the scraper is moving smoothly and not scratching the glass, you can continue to work on the rest of the window. On the other hand, if you feel as if the scraper is sliding on sandpaper, you have to stop scraping. This indicates that the glass may be flawed and have raised imperfections, and scraping will result in scratches.

After you have ascertained that it is safe to scrape your window, start working along the edges. It is best that you start scraping from the middle of an edge, moving towards the corners. Work in a one or two inch pattern, until all the edges of the glass are properly scraped. After that, scrape the rest of the window in a straight pattern of four or five inches, working from top to bottom. If you find that the window is beginning to dry while you are working, wet it with soapy water again.

Source: http://ezinearticles.com/?How-to-Properly-Scrape-Windows-During-The-Cleaning-Process&id=6592930