411.com Data Scraping: 2014

Tuesday, 30 December 2014

How to scrape address from Google Maps

If you want to build a new online directory based website and want it to be popular with latest web contents, then you need the help of web scraping services from iWeb scraping. If you want to scrape address from maps.google.com, there is a specialized web scraping tool developed by iWeb scraping which can do the job for you. There are plenty of benefits with web scraping which includes market research, gathering customer information, managing product catalogs, compare prices, gather real estate data, gather job posting information etc. Web scraping technology is very popular nowadays and it saves lot of time and effort involved in manual extraction of data from websites.

The web scraping tools developed iWeb Scraping is very user-friendly and can extract specific information from targeted websites. It converts data from HTML web pages to useful formats like Excel spread sheets or Access database. Whatever web scraping requirements you have, you can contact iWeb Scraping as they have more than 3.5 years of web data extraction experience and offer the best prices in the industry. Also their services are available in 24x7 basis and free pilot projects will be done based on request.

Companies which require specific web data and look for an application which can automate the process and export the HTML data in structured format could benefit greatly from web scraping applications of iWeb scraping. You can easily extract data from multiple target websites, parse and re-assemble the information in HTML format to database or spread sheets as you wish. The application has simple point-and-click user-interface and any beginner can use it scrape address from Google Maps. If you want to gather address of people in particular region from Google maps, you can do it with help of web scraping application developed by iWebscraping.

Web Scraping is a technology that able to digest target website databases that are visible only as HTML web pages, and create a local, identical replica of those databases as a information or result. With our web scraping & web data extraction service we can capture web pages, then pin-point specific pieces of data/information you'd like to extract from web pages. What is needed in this process is much more than a Website crawler and set of Website wrappers. The time required to do web data extraction goes down in comparison to manually data copying and pasting job.

Source:http://www.articlesbase.com/information-technology-articles/how-to-scrape-address-from-google-maps-4683906.html

Sunday, 28 December 2014

So What Exactly Is A Private Data Scraping Services To Use You?

If your computer connects to the Internet or resources on the request for this information, and queries to different servers. If you have a website to introduce to the site server recognizes your computer's IP address and displays the data and much more. Many e - commerce sites use to log your IP address, and the browsing patterns for marketing purposes.

Related Articles

Follow Some Tips For Data Scraping Services

Web Data Scraping Assuring Scraping Success Proxy Data Services

Data Scraping Services with Proxy Data Scraping

Web Data Extraction Services for Data Collection - Screen Scrapping Services, Data Mining Services

The Scraping server you connect to your destination or to process your information and make a filter. For example, IP address or protocol filtering traffic through a Scraping service. As you might guess, there are many types of Scraping services. including the ability to a high demand for the software. Email messages are quickly sent to businesses and companies to help you search for contacts.

Although there are Sanding free Scraping IP addresses in this way can work, the use of payment services, and automatic user interface (plug and play) are easy to give. Scraping web information services, thus offering a variety of relevant sources of data. Scraping information service organizations are generally used where large amounts of data every day. It is possible for you to receive efficient, high precision is also affordable.

Information on the various strategies that companies, Scraping excellent information services, and use the structure planned out and has led to the introduction of more rapid relief of the Earth.

In addition, the application software that has flexibility as a priority. In addition, there is a software that can be tailored to the needs of customers, and satisfy various customer requirements play a major role. Particular software, allows businesses to sell, a customer provides the features necessary to provide the best experience.

If you do not use a private Data Scraping Services suggest that you immediately start your Internet marketing. It is an inexpensive but vital to your marketing company. To choose how to set up a private Scraping service, visit my blog for more information. Data Scraping Services software as the activity data and provides a large amount of information, Sorting. In this way, the company reduced the cost and time savings and greater return on investment will be a concept.

Without the steady stream of data from these sites to get stopped? Scraping HTML page requests sent by argument on the web server, depending on changes in production, it is very likely to break their staff.

Data Scraping Services is common in the respective outsourcing company. Many companies outsource Data Scraping Services service companies are increasingly outsourcing these services, and generally dealing with the Internet business-related activities, in particular a lot of money, can earn.

Web Data Scraping Services, pull information from a structured plan format. Informal or semi-structured data source from the source.They are there to just work on your own server to extract data to execute. IP blocking is not a problem for them when they switch servers in minutes and back on track, scraping exercise. Try this service and you'll see what I mean.

It is an inexpensive but vital to your marketing company. To choose how to set up a private Scraping service, visit my blog for more information. Data Scraping Services software as the activity data and provides a large amount of information, Sorting. In this way, the company reduced the cost and time savings and greater return on investment will be a concept.

Source:http://www.articlesbase.com/outsourcing-articles/so-what-exactly-is-a-private-data-scraping-services-to-use-you-5587140.html

Thursday, 25 December 2014

Limitations and Challenges in Effective Web Data Mining

Web data mining and data collection is critical process for many business and market research firms today. Conventional Web data mining techniques involve search engines like Google, Yahoo, AOL, etc and keyword, directory and topic-based searches. Since the Web's existing structure cannot provide high-quality, definite and intelligent information, systematic web data mining may help you get desired business intelligence and relevant data.

Factors that affect the effectiveness of keyword-based searches include:

• Use of general or broad keywords on search engines result in millions of web pages, many of which are totally irrelevant.

• Similar or multi-variant keyword semantics my return ambiguous results. For an instant word panther could be an animal, sports accessory or movie name.

• It is quite possible that you may miss many highly relevant web pages that do not directly include the searched keyword.

The most important factor that prohibits deep web access is the effectiveness of search engine crawlers. Modern search engine crawlers or bot can not access the entire web due to bandwidth limitations. There are thousands of internet databases that can offer high-quality, editor scanned and well-maintained information, but are not accessed by the crawlers.

Almost all search engines have limited options for keyword query combination. For example Google and Yahoo provide option like phrase match or exact match to limit search results. It demands for more efforts and time to get most relevant information. Since human behavior and choices change over time, a web page needs to be updated more frequently to reflect these trends. Also, there is limited space for multi-dimensional web data mining since existing information search rely heavily on keyword-based indices, not the real data.

Above mentioned limitations and challenges have resulted in a quest for efficiently and effectively discover and use Web resources. Send us any of your queries regarding Web Data mining processes to explore the topic in more detail.

Source: http://ezinearticles.com/?Limitations-and-Challenges-in-Effective-Web-Data-Mining&id=5012994

Monday, 22 December 2014

Scraping Fantasy Football Projections from the Web

In this post, I show how to download fantasy football projections from the web using R. In prior posts, I showed how to scrape projections from ESPN, CBS, NFL.com, and FantasyPros. In this post, I compile the R scripts for scraping projections from these sites, in addition to the following sites: Accuscore, Fantasy Football Nerd, FantasySharks, FFtoday, Footballguys, FOX Sports, WalterFootball, and Yahoo.

Why Scrape Projections?

Scraping projections from multiple sources on the web allows us to automate importing the projections with a simple script. Automation makes importing more efficient so we don’t have to manually download the projections whenever they’re updated. Once we import all of the projections, there’s a lot we can do with them, like:

•    Determine who has the most accurate projections
•    Calculate projections for your league
•    Calculate players’ risk levels
•    Calculate players’ value over replacement
•    Identify sleepers
•    Calculate the highest value you should bid on a player in an auction draft
•    Draft the best starting lineup
•    Win your auction draft
•    Win your snake draft

The R Scripts

To scrape the projections from the websites, I use the readHTMLTable function from the XML package in R. Here’s an example of how to scrape projections from FantasyPros:

1 2 3 4 5 6 7 8

#Load libraries

library("XML")

#Download fantasy football projections from FantasyPros.com

qb_fp <- readHTMLTable("http://www.fantasypros.com/nfl/projections/qb.php", stringsAsFactors = FALSE)$data

rb_fp <- readHTMLTable("http://www.fantasypros.com/nfl/projections/rb.php", stringsAsFactors = FALSE)$data

wr_fp <- readHTMLTable("http://www.fantasypros.com/nfl/projections/wr.php", stringsAsFactors = FALSE)$data

te_fp <- readHTMLTable("http://www.fantasypros.com/nfl/projections/te.php", stringsAsFactors = FALSE)$data

view raw FantasyPros projections hosted with ? by GitHub

The R Scripts for scraping the different sources are located below:

1.    Accuscore
2.    CBS - Jamey Eisenberg
3.    CBS – Dave Richard
4.    CBS – Average
5.    ESPN
6.    Fantasy Football Nerd
7.    FantasyPros
8.    FantasySharks
9.    FFtoday
10.    Footballguys – David Dodds
11.    Footballguys – Bob Henry
12.    Footballguys – Maurile Tremblay
13.    Footballguys – Jason Wood
14.    FOX Sports
15.    NFL.com
16.    WalterFootball
17.    Yahoo

Density Plot

Below is a density plot of the projections from the different sources:Calculate projections

Conclusion

Scraping projections from the web is fast, easy, and automated with R. Once you’ve downloaded the projections, there’s so much you can do with the data to help you win your league! Let me know in the comments if there are other sources you want included (please provide a link).

Source:http://fantasyfootballanalytics.net/2014/06/scraping-fantasy-football-projections.html

Friday, 19 December 2014

Affordable Tooth Extractions

In recent times, the cost of dental care has skyrocketed. This includes all types of dentistry including teeth cleaning, extractions, and dental surgery. For those who live in Denver, CO, there are many options to choose from when paying for routine or emergency dental care. In fact, having a tooth extraction Denver might just be more easily afforded than what some may be aware of.

The flat fee for a tooth extraction in Denver may vary between dental offices. The type of extraction can also cause a difference in the price. A simple extraction may cost between $60-$75, but a wisdom tooth extraction that requires more time and effort could cost much more.

One of the great aspects of having dental services performed in Denver is the variety of payment forms that many dental offices accept. Most dental offices in this area accept several different health insurance plans that will allow patients to only be required to pay a small copay at the time of service. If you have chosen an in-network dental provider for your plan, this copay can be even less.

Many dental offices also provide services to those who have state medicaid or medicare as well. While cosmetic dental work may not be covered by these forms of health care, extractions are covered because they are considered a necessary part of the patients good health. Yearly checkups and teeth cleanings are also normally covered as a preventative measure to avoid bad dental health.

For those who may not have any type of health insurance, dental insurance, or state provided health care plan, most dental offices will offer a payment plan. The total cost will be calculated and can be divided up over a few months to make dental care more easily affordable. This will need to be arranged before services and you may need to pay a percentage of the cost upfront before any dental work is performed.

So, if you live in the Denver area and need to have a tooth extraction or other dental care, do not fear that it is impossible to obtain. By calling each dental office and discussing the types of payment forms they accept, you may find a payment plan that fits your budget nicely. You can compare the prices and options of all dentists in your area so that you can make a well informed decision more easily.

Source:http://ezinearticles.com/?Affordable-Tooth-Extractions&id=3241427

Wednesday, 17 December 2014

Data Mining - Techniques and Process of Data Mining

Data mining as the name suggest is extracting informative data from a huge source of information. It is like segregating a drop from the ocean. Here a drop is the most important information essential for your business, and the ocean is the huge database built up by you.

Recognized in Business

Businesses have become too creative, by coming up with new patterns and trends and of behavior through data mining techniques or automated statistical analysis. Once the desired information is found from the huge database it could be used for various applications. If you want to get involved into other functions of your business you should take help of professional data mining services available in the industry

Data Collection

Data collection is the first step required towards a constructive data-mining program. Almost all businesses require collecting data. It is the process of finding important data essential for your business, filtering and preparing it for a data mining outsourcing process. For those who are already have experience to track customer data in a database management system, have probably achieved their destination.

Algorithm selection

You may select one or more data mining algorithms to resolve your problem. You already have database. You may experiment using several techniques. Your selection of algorithm depends upon the problem that you are want to resolve, the data collected, as well as the tools you possess.

Regression Technique

The most well-know and the oldest statistical technique utilized for data mining is regression. Using a numerical dataset, it then further develops a mathematical formula applicable to the data. Here taking your new data use it into existing mathematical formula developed by you and you will get a prediction of future behavior. Now knowing the use is not enough. You will have to learn about its limitations associated with it. This technique works best with continuous quantitative data as age, speed or weight. While working on categorical data as gender, name or color, where order is not significant it better to use another suitable technique.

Classification Technique

There is another technique, called classification analysis technique which is suitable for both, categorical data as well as a mix of categorical and numeric data. Compared to regression technique, classification technique can process a broader range of data, and therefore is popular. Here one can easily interpret output. Here you will get a decision tree requiring a series of binary decisions.

Our best wishes are with you for your endeavors.

Source: http://ezinearticles.com/?Data-Mining---Techniques-and-Process-of-Data-Mining&id=5302867

Monday, 15 December 2014

Do blog scraping sites violate the blog owner's copyright?

I noticed that my blog has been posted on one of these website scraping sites. This is the kind of site that has no original content, but just repeats or scrapes content others have written and does it to get some small amount of ad income from ads on the scraping site. In essence the scraping site is taking advantage of the content of the originating site in order to make a few dollars from people who go to the site looking for something else. Some of these websites prey on misspelling. If you accidentally misspell the name of an original site, you just may end up with one of these patently commercial scraping sites.

Google defines scraping as follows:

•    Sites that copy and republish content from other sites without adding any original content or value
•    Sites that copy content from other sites, modify it slightly (for example, by substituting synonyms or using automated techniques), and republish it
•    Sites that reproduce content feeds from other sites without providing some type of unique organization or benefit to the user

My question, as set out in the title to this post, is whether or not scraping is a violation of copyright. It turns out that the answer is likely very complicated. You have to look at the definition of a scraping site very carefully. Let me give you some hypotheticals to show what I mean.

Let's suppose that I write a blog and put a link in my blog post to your blog. Does that link violate your copyright? I can't imagine that anyone would think that there was problem with linking to another website on the Web. In this case, there is no content from the originating site, just a link.

But let's carry the hypothetical a little further. What if I put a link to your site and quote some of your content? Does this violate copyright law? If you are acquainted with any of the terminology of copyright law; think fair use. The issue here is whether or not the "quoted" material is a substantial reproduction of the entire original content? I would have the opinion that duplicating an entire blog post either with or without attribution would be a violation of the originator's copyright.

So is the scraping website protected by the "fair use" doctrine? Does the fact that the motivation for listing the original websites is to make money have anything to do with how you would decide if there was or was not a violation of the originator's copyright? By the way, the copyright does not make a distinction between a commercial and non-commercial use of the original constituting or not constituting a violation of copyright. The fact that the reproducing (scraping) party does not make money from the reproduction is not a factor in the issue of violation, although it may ultimately be an issue as to the amount of damages assessed.

Does the fact that the actions of the scraper annoy me, make any difference? I would answer, not in the least. Whether or not you are annoyed by the violation of the copyright makes no difference as to whether or not there is a violation. Likewise, you have no independent claims for your wounded feelings because of the copied content. Copyright is a statutory action (i.e. based on statutory law) and unless the cause of action is recognized by the law, there is no cause of action. Now, in an outrageous case, you may have some kind of tort (personal injury) claim, but that is way outside of my hypothetical situation.

So what is the answer? Does scraping violate the originator's copyright? If only a small portion of the blog is copied (scraped) then I would have to have the opinion that it is not. Essentially, no matter what the motivation of the scrapper, there is not enough content copied to violate the fair use doctrine. Now, that is my opinion. Your's might differ. That is what makes lawsuits.

Do I think there are other reasons why scraping websites are objectionable? Certainly, but those reasons have nothing to do with copyright and they are probably the subject of another different blog post. So, if you are reading this from scraping website, bear in mind that there may be a serious problem with that type of website.

Source:http://genealogysstar.blogspot.in/2013/05/do-blog-scraping-sites-violate-blog.html

Saturday, 13 December 2014

Local ScraperWiki Library

It quite annoyed me that you can only use the scraperwiki library on a ScraperWiki instance; most of it could work fine elsewhere. So I’ve pulled it out (well, for Python at least) so you can use it offline.

How to use
pip install scraperwiki_local
A dump truck dumping its payload

You can then import scraperwiki in scripts run on your local computer. The scraperwiki.sqlite component is powered by DumpTruck, which you can optionally install independently of scraperwiki_local.

pip install dumptruck
Differences

DumpTruck works a bit differently from (and better than) the hosted ScraperWiki library, but the change shouldn’t break much existing code. To give you an idea of the ways they differ, here are two examples:

Complex cell values
What happens if you do this?
import scraperwiki
shopping_list = ['carrots', 'orange juice', 'chainsaw']
scraperwiki.sqlite.save([], {'shopping_list': shopping_list})
On a ScraperWiki server, shopping_list is converted to its unicode representation, which looks like this:
[u'carrots', u'orange juice', u'chainsaw']
In the local version, it is encoded to JSON, so it looks like this:
["carrots","orange juice","chainsaw"]

And if it can’t be encoded to JSON, you get an error. And when you retrieve it, it comes back as a list rather than as a string.

Case-insensitive column names
SQL is less sensitive to case than Python. The following code works fine in both versions of the library.

In [1]: shopping_list = ['carrots', 'orange juice', 'chainsaw']
In [2]: scraperwiki.sqlite.save([], {'shopping_list': shopping_list})
In [3]: scraperwiki.sqlite.save([], {'sHOpPiNg_liST': shopping_list})
In [4]: scraperwiki.sqlite.select('* from swdata')

Out[4]: [{u'shopping_list': [u'carrots', u'orange juice', u'chainsaw']}, {u'shopping_list': [u'carrots', u'orange juice', u'chainsaw']}]

Note that the key in the returned data is ‘shopping_list’ and not ‘sHOpPiNg_liST’; the database uses the first one that was sent. Now let’s retrieve the individual cell values.

In [5]: data = scraperwiki.sqlite.select('* from swdata')
In [6]: print([row['shopping_list'] for row in data])
Out[6]: [[u'carrots', u'orange juice', u'chainsaw'], [u'carrots', u'orange juice', u'chainsaw']]

The code above works in both versions of the library, but the code below only works in the local version; it raises a KeyError on the hosted version.

In [7]: print(data[0]['Shopping_List'])
Out[7]: [u'carrots', u'orange juice', u'chainsaw']

Here’s why. In the hosted version, scraperwiki.sqlite.select returns a list of ordinary dictionaries. In the local version, scraperwiki.sqlite.select returns a list of special dictionaries that have case-insensitive keys.

Develop locally

Here’s a start at developing ScraperWiki scripts locally, with whatever coding environment you are used to. For a lot of things, the local library will do the same thing as the hosted. For another lot of things, there will be differences and the differences won’t matter.

If you want to develop locally (just Python for now), you can use the local library and then move your script to a ScraperWiki script when you’ve finished developing it (perhaps using Thom Neale’s ScraperWiki scraper). Or you could just run it somewhere else, like your own computer or web server. Enjoy!

Source:https://blog.scraperwiki.com/2012/06/local-scraperwiki-library/

Friday, 12 December 2014

A quick guide on web scraping: Why and how

Web scraping, which is the collection and cleaning of online data, is the first step in any
data-driven project. Here’s a short video that explains what scraping is, and how to create
automated scraping jobs using a digital tool.

This is a 15-minute video created by an instructor at Ohio State University. In the first six
minutes, the instructor talks about why we need web scraping; he then shows how to use a
scraping tool, OutWit Hub, to collect data scattered in a large database.

FYI: read reviews by Reporters’ Lab of OutWit Hub and other web scraping tools.

Source: http://www.mulinblog.com/quick-guide-web-scraping/

Monday, 8 December 2014

Scraping and Analyzing Angel List Syndicates: Kimono Labs + Silk

Because we use Silk to tell stories and visualize data, we are always looking for interesting ways to pull data into a Silk. Right now that means getting data into the CSV format. Fortunately, a wave of new and powerful visual webscraping tools and services have emerged. These make it very simple for anyone (no technical skills required) to scrape data from a website and export that data into a CSV which we can quickly upload into a Silk.

Cool New Scraping Tools

One of the tools we love in this new space is Kimono Labs. Backed by Y Combinator, Kimono combines a visual scraping editor with the ability to do very granular code-inspector level editing to scraping paths. Saved scrapes can be turned into APIs and exported as JSON. Kimono also lets you save time-series versioning of scrapes.

Data from angel-list-syndicates.silk.co

Like many startups, we watch the goings on at AngelList very closely. Syndicates are of particular interest. Basically, these are DIY venture capital pools that allow a qualified investor to serve as a syndicate leader and aggregate small investments from other qualified investors who are members of AngelList. The idea of the syndicates is to democratize the VC process and make it easier and less risky for individuals to participate.

We used Kimono to scrape information on the Top 25 Syndicates ranked by dollars backing each round. Kimono makes it very easy to visually designate which parts of a page to scrape and how many rows there are on a page. (Here you can see me highlighting the minimum dollar investment). We downloaded the information as a CSV and did a quick scrub to get it ready for upload to Silk. The process took no more than 15 minutes.

We could tell by eyeballing the numbers beforehand that a serious Power Law was in effect. And the actual data analysis on Silk bore this out. We chose to use a pie chart to show distribution. Three syndicates control nearly two-thirds of all the committed capital by Angel.co members in the syndicate program. One of the top three - Tim Ferriss - has no background as a venture capitalist or building technology companies but is rapidly becoming a force in startup investing. The person with the largest committed syndicate pool, Gil Penachina, is someone who is a quiet mover and shaker in Silicon Valley but he clearly packs a huge punch.

The largest syndicate in terms of likely commitments of deals per year is Foundry Group Angels, a group led by Brad Feld (@bfeld). While they put in less per deal, they are planning to back over 50 deals per year - a huge number. Trailing far behind those three was media impresario and Launch conference mogul Jason Calacanis, who is one of the most visible people in the startup space.

Source: http://blog.silk.co/post/83501793279/scraping-and-analyzing-angel-list-syndicates

Monday, 1 December 2014

Web Scraping’s 2013 Review – part 1

Here we are, almost having ended another year and having the chance to analyze the aspects of the Web scraping market over the last twelve months. First of all i want to underline all the buzzwords on the tech field as published in the Yahoo’s year in review article . According to Yahoo, the most searched items wore

iPhone (including 4, 5, 5s, 5c, and 6)
Samsung (including Galaxy, S4, S3, Note)
Siri
iPad Cases
Snapchat
Google Glass
Apple iPad
BlackBerry Z10
Cloud Computing

It’s easy to see that none of this terms regards in any way with the field of data mining, and they rather focus on the gadgets and apps industry, which is just one of the ways technology can evolve to. Regarding actual data mining industry there were a lot of talks about it in this year’s MIT’s Engaging Data 2013 Conference. One of the speakers Noam Chomsky gave an acid speech relating data extraction and its connection to the Big Data phenomena that is also on everyone’s lips this year. He defined a good way to see if Big Data works by following a series of few simple factors: 1. It’s the analysis, not the raw data, that counts. 2. A picture is worth a thousand words 3. Make a big data portal (Not sure if Facebook is planning on dominating in cloud services some day) 4. Use a hybrid organizational model (We’re asleep already, soon) let’s move 5. Train employees Other interesting declaration was given by EETimes saying, “Data science will do more for medicine in the next 10 years than biological science.” which says a lot about the volume of required extracted data.

Because we want to cover as many as possible events about data mining this article will be a two parter, so don’t forget to check our blog tomorrow when the second part of this article will come up!

Source:http://thewebminer.com/blog/2013/12/

Web Scraping’s 2013 Review – part 1

iPhone (including 4, 5, 5s, 5c, and 6)
Samsung (including Galaxy, S4, S3, Note)
Siri
iPad Cases
Snapchat
Google Glass
Apple iPad
BlackBerry Z10
Cloud Computing

Friday, 28 November 2014

Scraping SSL Labs Server Test Results With R

    NOTE: Qualys allows automated access to their SSL Server Test site in their T&C’s, and the R fucntion/script provided here does its best to adhere to their guidelines. However, if you launch multiple scripts at one time and catch their attention you will, no doubt, be banned.

This post will show you how to do some basic web page data scraping with R. To make it more palatable to those in the security domain, we’ll be scraping the results from Qualys’ SSL Labs SSL Test site by building an R function that will:

    fetch the contents of a URL with RCurl
    process the HTML page tags with R’s XML library
    identify the key elements from the page that need to be scraped
    organize the results into a usable R data structure

You can skip ahead to the code at the end (or in this gist) or read on for some expository that isn’t in the code’s comments.

Setting up the script and processing flow

We’ll need some assistance from three R packages to perform the scraping, processing and transformation tasks:

library(RCurl) # scraping
library(XML)   # XML (HTML) processing
library(plyr) # data transformation

If you poke at the SSL Test site with a few different URLs, you’ll see there are three primary inputs to the GET request we’ll need to issue:

    d (the domain)
    s (the IP address to test)
    ignoreMismatch (which we’ll leave as ‘on‘)

You’ll also see that there’s often a delay between issuing a request and getting the results, so we’ll need to build in a GET+check-loop (like the javascript on the page does automagically). Finally, when the results are eventually displayed they are (at least for this example) usually either "Overall Rating" or "Assessment" and, we’ll use that status result in our tests for what to return.

We’ll account for the domain and IP address in the function parameters along with the amount of time we should pause between GET+check attempts. It’s also a good idea to provide a way to pass in any extra curl options (e.g. in the event folks are behind a proxy server and need to input that to make the requests work). We’ll define the function with some default parameters:

get_rating <- function(site="rud.is", ip="", pause=5, curl.opts=list()) {

}

This definition says that if we just call get_rating(), it will

    default to using "rud.is" as the domain (you can pick what you want in your implementation)
    not supply an IP address (which the script will then have to lookup with nsl)
    will pause 5s between GET+check attempts
    pass no extra curl options

Getting into the details

For the IP address logic, we’ll have to test if we passed in an an address string and perform a lookup if not:

# try to resolve IP if not specified; if no IP can be found, return
# a "NA" data frame

if (ip == "") {

    tmp <- nsl(site)
    if (is.null(tmp)) {
      return(data.frame(site=site, ip=NA, Certificate=NA,
                        Protocol.Support=NA, Key.Exchange=NA,
                        Cipher.Strength=NA)) }
    ip <- tmp
}

(don’t worry about the return(...) part yet, we’ll get there in a bit).

Once we have an IP address, we’ll need to make the call to the ssllabs.com test site and perform the check loop:

# get the contents of the URL (will be the raw HTML text)
# build the URL with sprintf

rating.dat <- getURL(sprintf("https://www.ssllabs.com/ssltest/analyze.html?d=%s&s=%s&ignoreMismatch=on", site, ip), .opts=curl.opts)

# while we don't find some indication of a completed request,
# pause and try again

while(!grepl("(Overall Rating|Assessment failed)", rating.dat)) {
Sys.sleep(pause)
rating.dat <- getURL(sprintf("https://www.ssllabs.com/ssltest/analyze.html?d=%s&s=%s&ignoreMismatch=on", site, ip), .opts=curl.opts)
}

We can then start making some decisions based on the results:

# if the assessment failed, return a data frame of NA's

if (grepl("Assessment failed", rating.dat)) {

return(data.frame(site=site, ip=NA, Certificate=NA,
                    Protocol.Support=NA, Key.Exchange=NA,
                    Cipher.Strength=NA))
}

# otherwise, parse the resultant HTML

x <- htmlTreeParse(rating.dat, useInternalNodes = TRUE)

Unfortunately, the results are not “consistent”. While there are plenty of uniquely identifiable <div>s, there are enough differences between runs that we have to be a bit generic in our selection of data elements to extract. I’ll leave the view-source: of a result as an exercise to the reader. For this example, we’ll focus on extracting:

        the overall rating (A-F)
        the “Certificate” score
        the “Protocol Support” score
        the “Key Exchange” score
        the “Cipher Strength” score

There are plenty of additional fields to extract, but you should be able to extrapolate and grab what you want to from the rest of the example.

Extracting the results

We’ll need to delve into XPath to extract the <div> values. We’ll use the xpathSApply function to perform this task. Since there sometimes is a <span> tag within the <div> for the rating and since the rating has a class tag to help identify which color it should be, we use a starts-with selection parameter to just get anything beginning with rating_. If it returns an R list structure, we know we have the one with a <span> element, so we re-issue the call with that extra XPath component.

rating <- xpathSApply(x,"//div[starts-with(@class,'rating_')]/text()", xmlValue)

if (class(rating) == "list") {

rating <- xpathSApply(x,"//div[starts-with(@class,'rating_')]/span/text()", xmlValue)
}

For the four attributes (and values) we’ll be extracting, we can use the getNodeSet call which will give us all of them into a structure we can process with xpathSApply

labs <- getNodeSet(x,"//div[@class='chartBody']/div[@class='chartRow']/div[@class='chartLabel']")

vals <- getNodeSet(x,"//div[@class='chartBody']/div[@class='chartRow']/div[starts-with(@class,'chartValue')]")

# convert them to vectors

labs <- xpathSApply(labs[[1]], "//div[@class='chartLabel']/text()", xmlValue)

vals <- xpathSApply(vals[[1]], "//div[starts-with(@class,'chartValue')]/text()", xmlValue)

At this point, labs will be a vector of label names and vals will be the corresponding values. We’ll put them, the original domain and the IP address into a data frame:

# rbind will turn the vector into row elements, with each

# value being in a column

rating.result <- data.frame(site=site, ip=ip,

                            rating=rating, rbind(vals),
                            row.names=NULL)

# we use the labs vector as the column names (in the right spot)

colnames(rating.result) <- c("site", "ip", "rating",

                              gsub(" ", "\\.", labs))

and return the result:
return(rating.result)
Finishing up

If we run the whole function on one domain we’ll get a one-row data frame back as a result. If we use ldply from the plyr package to run the get_rating function repeatedly on a vector of domains, it will combine them all into one whole data frame. For example:

sites <- c("rud.is", "stackoverflow.com", "er-ant.com")

ratings <- ldply(sites, get_rating)

ratings

##                site              ip rating Certificate Protocol.Support Key.Exchange Cipher.Strength

## 1            rud.is 184.106.97.102      B         100               70           80              90

## 2 stackoverflow.com 198.252.206.140      A         100               90           80              90

## 3        er-ant.com            <NA>   <NA>        <NA>             <NA>         <NA>            <NA>

There are many tweaks you can make to this function to extract more data and perform additional processing. If you make some of your own changes, you’re encouraged to add to the gist (link above & below) and/or drop a note in the comments.

Hopefully you’ve seen how well-suited R is for this type of operation and have been encouraged to use it in your next attempt at some site/data scraping.

library(RCurl)
library(XML)
library(plyr)

#' get the Qualys SSL Labs rating for a domain+cert

#'

#' @param site domain to test SSL configuration of

#' @param ip address of \code{site} (will resolve it and take\cr

#' first response if not specified, but that may not always work as you expect)

#' @param hide.results ["on"|"off"] should the results show up in the SSL Labs history (default "on")

#' @param pause timeout between tries (default 5s)

#' @param curl.opts options to pass to \code{getURL} i.e. proxy setting

#' @return data frame of results

#'

get_rating <- function(site="rud.is", ip="", hide.results="on", pause=5, curl.opts=list()) {

# try to resolve IP if not specified; if no IP can be found, return

# a "NA" data frame

if (ip == "") {

tmp <- nsl(site)

if (is.null(tmp)) { return(data.frame(site=site, ip=NA, Certificate=NA,

Protocol.Support=NA, Key.Exchange=NA, Cipher.Strength=NA)) }

ip <- tmp

}

# need to let it actually process the certificate if not already cached

rating.dat <- getURL(sprintf("https://www.ssllabs.com/ssltest/analyze.html?d=%s&s=%s&ignoreMismatch=on&hideResults=%s", site, ip, hide.results), .opts=curl.opts)

while(!grepl("(Overall Rating|Assessment failed)", rating.dat)) {

Sys.sleep(pause)

rating.dat <- getURL(sprintf("https://www.ssllabs.com/ssltest/analyze.html?d=%s&s=%s&ignoreMismatch=on&hideResults=%s", site, ip, hide.results), .opts=curl.opts)

}

if (grepl("Assessment failed", rating.dat)) {

return(data.frame(site=site, ip=NA, Certificate=NA,

Protocol.Support=NA, Key.Exchange=NA, Cipher.Strength=NA))

}

x <- htmlTreeParse(rating.dat, useInternalNodes = TRUE)

# sometimes there is a <span ...> tag in the <div>, which will result in an

# empty list() object being returned. we check for that and handle it

# appropriately.

rating <- xmlValue(x[["//div[starts-with(@class,'rating_')]/text()"]])

if (class(rating) == "list") {

rating <- xmlValue(x[["//div[starts-with(@class,'rating_')]/span/text()"]])

}

# extract the XML objects for the ratings labels & values

labs <- getNodeSet(x,"//div[@class='chartBody']/div[@class='chartRow']/div[@class='chartLabel']")

vals <- getNodeSet(x,"//div[@class='chartBody']/div[@class='chartRow']/div[starts-with(@class,'chartValue')]")

# convert them to vectors

labs <- xpathSApply(labs[[1]], "//div[@class='chartLabel']/text()", xmlValue)

vals <- xpathSApply(vals[[1]], "//div[starts-with(@class,'chartValue')]/text()", xmlValue)

# make them into a data frame

rating.result <- data.frame(site=site, ip=ip, rating=rating, rbind(vals), row.names=NULL)

colnames(rating.result) <- c("site", "ip", "rating", gsub(" ", "\\.", labs))

return(rating.result)

}

sites <- c("rud.is", "stackoverflow.com", "er-ant.com")

ratings <- ldply(sites, get_rating)

ratings

## site ip rating Certificate Protocol.Support Key.Exchange Cipher.Strength

## 1 rud.is 184.106.97.102 B 100 70 80 90

## 2 stackoverflow.com 198.252.206.140 A 100 90 80 90

## 3 er-ant.com <NA> <NA> <NA> <NA> <NA> <NA>

Source: http://www.r-bloggers.com/scraping-ssl-labs-server-test-results-with-r/

Wednesday, 26 November 2014

Web Scraping Tools for Non-developers

I recently spoke with a resource-limited organization that is investigating government corruption and wants to access various public datasets to monitor politicians and law firms. They don’t have developers in-house, but feel pretty comfortable analyzing datasets in CSV form. While many public datasources are available in structured form, some sources are hidden in what us data folks call the deep web. Amazon is a nice example of a deep website, where you have to enter text into a search box, click on a few buttons to narrow down your results, and finally access relatively structured data (prices, model numbers, etc.) embedded in HTML. Amazon has a structured database of their products somewhere, but all you get to see is a bunch of webpages trapped behind some forms.

A developer usually isn’t hindered by the deep web. If we want the data on a webpage, we can automate form submissions and key presses, and we can parse some ugly HTML before emitting reasonably structured CSVs or JSON. But what can one accomplish without writing code?

This turns out to be a hard problem. Lots of companies have tried, to varying degrees of success, to build a programmer-free interface for structured web data extraction. I had the pleasure of working on one such project, called Needlebase at ITA before Google acquired it and closed things down. David Huynh, my wonderful colleague from grad school, prototyped a tool called Sifter that did most of what one would need, but like all good research from 2006, the lasting impact is his paper rather than his software artifact.

Below, I’ve compiled a list of some available tools. The list comes from memory, the advice of some friends that have done this before, and, most productively, a question on Twitter that Hilary Mason was nice enough to retweet.

The bad news is that none of the tools I tested would work out of the box for the specific use case I was testing. To understand why, I’ll break down the steps required for a working web scraper, and then use those steps to explain where various solutions broke down.

The anatomy of a web scraper

There are three steps to a structured extraction pipeline:

    Authenticate yourself. This might require logging in to a website or filling out a CAPTCHA to prove you’re not…a web scraper. Because the source I wanted to scrape required filling out a CAPTCHA, all of the automated tools I’ll review below failed step 1. It suggests that as a low bar, good scrapers should facilitate a human in the loop: automate the things machines are good at automating, and fall back to a human to perform authentication tasks the machines can’t do on their own.

    Navigate to the pages with the data. This might require entering some text into a search box (e.g., searching for a product on Amazon), or it might require clicking “next” through all of the pages that results are split over (often called pagination). Some of the tools I looked at allowed entering text into search boxes, but none of them correctly handled pagination across multiple pages of results.

    Extract the data. On any page you’d like to extract content from, the scraper has to help you identify the data you’d like to extract. The cleanest example of this that I’ve seen is captured in a video for one of the tools below: the interface lets you click on some text you want to pluck out of a website, asks you to label it, and then allows you to correct mistakes it learns how to extract the other examples on the page.

As you’ll see in a moment, the steps at the top of this list are hardest to automate.

What are the tools?

Here are some of the tools that came highly recommended, and my experience with them. None of those passed the CAPTCHA test, so I’ll focus on their handling of navigation and extraction.

    Web Scraper is a Chrome plugin that allows you to build navigable site maps and extract elements from those site maps. It would have done everything necessary in this scenario, except the source I was trying to scrape captured click events on links (I KNOW!), which tripped things up. You should give it a shot if you’d like to scrape a simpler site, and the youtube video that comes with it helps get around the slightly confusing user interface.

    import.io looks like a clean webpage-to-api story. The service views any webpage as a potential data source to generate an API from. If the page you’re looking at has been scraped before, you can access an API or download some of its data. If the page hasn’t been processed before, import.io walks you through the process of building connectors (for navigation) or extractors (to pull out the data) for the site. Once at the page with the data you want, you can annotate a screenshot of the page with the fields you’d like to extract. After you submit your request, it appears to get queued for extraction. I’m still waiting for the data 24 hours after submitting a request, so I can’t vouch for the quality, but the delay suggests that import.io uses crowd workers to turn your instructions into some sort of semi-automated extraction process, which likely helps improve extraction quality. The site I tried to scrape requires an arcane combination of javascript/POST requests that threw import.io’s connectors for a lo
op, and ultimately made it impossible to tell import.io how to navigate the site. Despite the complications, import.io seems like one of the more polished website-to-data efforts on this list.

    Kimono was one of the most popular suggestions I got, and is quite polished. After installing the Kimono bookmarklet in your browser, you can select elements of the page you wish to extract, and provide some positive/negative examples to train the extractor. This means that unlike import.io, you don’t have to wait to get access to the extracted data. After labeling the data, you can quickly export it as CSV/JSON/a web endpoint. The tool worked seamlessly to extract a feed from the Hackernews front page, but I’d imagine that failures in the automated approach would make me wish I had access to import.io’s crowd workers. The tool would be high on my list except that navigation/pagination is coming soon, and will ultimately cost money.

    Dapper, which is now owned by Yahoo!, provides about the same level of scraping capabilities as Kimono. You can extract content, but like Kimono it’s unclear how to navigate/paginate.

    Google Docs was an unexpected contender. If the data you’re extracting is in an HTML table/RSS Feed/CSV file/XML document on a single webpage with no navigation/authentication, you can use one of the Import* functions in Google Docs. The IMPORTHTML macro worked as advertised in a quick test.

    iMacros is a tool that I could imagine solves all of the tasks I wanted, but costs more than I was willing to pay to write this blog post. Interestingly, the free version handles the steps that the other tools on this list don’t do as well: navigation. Through your browser, iMacros lets you automate filling out forms, clicking on “next” links, etc. To perform extraction, you have to pay at least $495.

    A friend has used Screen-scraper in the past with good outcomes. It handles navigation as well as extraction, but costs money and requires a small amount of programming/tokenization skills.

    Winautomation seems cool, but it’s only available for Windows, which was a dead end for me.

So that’s it? Nothing works?

Not quite. None of these tools solved the problem I had on a very challenging website: the site clearly didn’t want to be crawled given the CAPTCHA, and the javascript-submitted POST requests threw most of the tools that expected navigation through links for a loop. Still, most of the tools I reviewed have snazzy demos, and I was able to use some of them for extracting content from sites that were less challenging than the one I initially intended to scrape.

All hope is not lost, however. Where pure automation fails, a human can step in. Several proposals suggested paying people on oDesk, Mechanical Turk, or CrowdFlower to extract the content with a human touch. This would certainly get us past the CAPTCHA and hard-to-automate navigation. It might get pretty expensive to have humans copy/paste the data for extraction, however. Given that the tools above are good at extracting content from any single page, I suspect there’s room for a human-in-the-loop scraping tool to steal the show: humans can navigate and train the extraction step, and the machine can perform the extraction. I suspect that’s what import.io is up to, and I’m hopeful they keep the tool available to folks like the ones I initially tried to help.

While we’re on the topic of human-powered solutions, it might make sense to hire a developer on oDesk to just implement the scraper for the site this organization was looking at. While a lot of the developer-free tools I mentioned above look promising, there are clearly cases where paying someone for a few hours of script-building just makes sense.

Source: http://blog.marcua.net/post/74655674340

Monday, 24 November 2014

4 Data Mining Tips to Scrap Real Estate Data; Innovative Way to Give Realty Business a boost!

Internet has become a huge source of data – in fact; it has turned into a goldmine for the marketers, from where they can easily dig the useful data!

Web scraping has become a norm in today’s competitive era, where one with maximum and relevant information wins the race!

Real Estate Data Extraction and Scraping Service

It has helped many industries to carve a niche in the market; especially real estate – Scraping real estate data has been of great help for professionals to reach out to a large number of people and gather reliable property data. However, there are some people for whom web scraping is still an alien concept; most probably because most of its advantages are not discussed.

There are institutions, companies and organizations, entrepreneurs, as well as just normal citizens generating an extraordinary amount of information every day. Property information extraction can be effectively used to get an idea about the customer psyche and even generate valuable lead to further the business.

In addition to this, data mining has also some of following uses making it an indispensable part of marketing.

Gather Properties Details from Different Geographical Locations

You are an estate agent and want to expand your business to the neighboring city or state. But, then you are short of information. You are completely aware of the properties in the vicinity and in your town; however, with data mining services will help you to get an idea about the properties in the other state. You can also approach probable clients and increase your database to offer extensive services.

Online Offers and Discounts are just a Click Away

Now, it is tough to deal with the clients, show them the property of their choice and again act as a mediator between the buyer and seller. In all this, it becomes almost difficult to take a look at some special discounts or offers. With the data mining services, you can get an insight into these amazing offers. Thus, you can plan a move or even provide your client an amazing deal.

What people are talking about – Easy Monitoring of your Online Reputation

Internet has become a melting pot where different people come together. In fact, it provides a huge platform where people discuss about their likes and dislikes. When you dig into such online forums, you can get an idea of reputation that you or your firm holds. You can know what people think about you and where you require to buck up and where you need to slow down.

A Chance to Know your Competitors Better!

Last, but not the least, you can keep an eye on the competitor. Real Estate is getting more competitive; and therefore, it is important to have knowledge about your competitors to get an upper hand. It will help you to plan your moves and strategize with more ease. Moreover, you also know what is that “something” that your competitor does not have and you have, with can be subtly highlighted.

Property information extraction can prove to be the most fruitful method to get a cutting edge in the industry.

Source: http://www.hitechbposervices.com/blog/4-data-mining-tips-to-scrap-real-estate-data-innovative-way-to-give-realty-business-a-boost/

Wednesday, 19 November 2014

Web Scraping for SEO with these Open-Source Scrapers

When conducting Search Engine Optimization (SEO), we’re required to scrape websites for data, our campaigns, and reports for our clients. At the lowest level we utilize scraping to keep track of rankings on search engines like Google, Bing, and Yahoo, even keep a track of links on websites to know when it’s completed its lifespan. Then we’ve used them to help us aggregate data from APIs, RSS feeds, and websites to conduct some of our data mining to find patterns to help us become more competitive.

So scraping is a function majority of companies (SEOmoz, Raventools, and Google) have to do to either save money, protect intellectual property, track trends, etc… Businesses can find infinite uses with scraping tools, it just depends if you’re an printed circuit board manufacturer looking for ideas on your e-mail marketing campaign or a Orange County based business trying to keep an eye out on the competition. which is why we’ve created a comprehensive list of open source scrapers out there to help all the businesses out there. Just keep in mind we haven’t used all of them!

Words of caution, web scrapers require knowledge specific to the language such as PHP & cURL. Take into considerations issues like cookie management, fault tolerance, organizing the data properly, not crashing the website being scraped, and making sure the website doesn’t prohibit scraping.

If you’re ready, here’s the list…

Erlang

    eBot

Java

    Heritrix
    Nutch
    Piggy Bank
    WebSPHINX
    WebHarvest

PHP

    PHPCrawl
    Snoopy
    SpiderMonkey

Python

    BeautifulSoap
    HarvestMan
    Scrape.py
    Scrapemark
    Scrapy **
    Mechanize

Ruby

    Anemone
    scRUBYt

We’ll come back and update this list as we encounter more! If you would like to submit a solution we missed, feel free. Also we’re looking for guides related to each of these, so if you know of any or would be interested in guesting blogging about one, let us know!

Source:http://www.annexcore.com/blog/web-scraping-for-seo-with-these-open-source-scrapers/

Tuesday, 18 November 2014

How to scrape data without coding? A step by step tutorial on import.io

Import.io (pronounced import-eye-oh) lets you scrape data from any website into a searchable database. It is perfect for gathering, aggregating and analysing data from websites without the need for coding skills. As Sally Hadadi, from Import.io, told Journalism.co.uk: the idea is to “democratise” data. “We want journalists to get the best information possible to encourage and enhance unique, powerful pieces of work and generally make their research much easier.” Different uses for journalists, supplemented by case studies, can be found here.

A beginner’s guide

After downloading and opening import.io browser, copy the URL of the page you want to scrape into the import.io browser. I decided to scrape the search results website of orphanages in London:

001 Orphanages in London

After opening the website, press the tiny pink button in top right corner of the browser and follow up with “Let’s get cracking!” in the bottom right menu which has just appeared.

Then, choose the type of scraping you want to perform. In my case, it’s a Crawler (we’ll be getting data from multiple similar pages on the same site):

crawler

And confirm the URL of the website you want to scrape by clicking “I’m there”.

As advised, choose “Detect optimal settings” and confirm the following:

data

In the menu “Rows per page” select the format in which data appears on the website, whether it is “single” or “multiple”. I’m opting for the multiple as my URL is a listing of multiple search results:multiple

Now, the time has come to “train your rows” i.e. mark which part of the website you are interested in scraping. Hover over an entire “entry” or “paragraph”:hover over entry

…and he entry will be highlighted in pink or blue. Press “Train rows”.

train rows

Repeat the operation with the next entry/paragraph so that the scraper gets the hang of the pattern of your selections. Two examples should suffice. Scroll down to the bottom of your website to make sure that all entries until the last one are selected (=highlighted in pink or blue alternately).

If it is, press “I’ve got all 50 rows” (the number depends on how many rows you have selected).

Now it’s time to focus on particular chunks of data you would like to extract. My entries consist of a name of the orphanage, address, phone number and a short description so I will extract all those to separate columns. Let’s start by adding a column “name”:

add column

Next, highlight the name of the first orphanage in the list and press “Train”.

highlighttrain

Your table should automatically fill in with names of all orphanages in the list:table name

If it didn’t, try tweaking your selection a bit. Then add another column “address” and extract the address of the orphanage by highlighting the two lines of addresses and “training” the rows.

Repeat the operation for a “phone number” and “description”. Your table should end up looking like this:table final

*Before passing on to the next column it is worth to check that all the rows have filled up. If not, highlighting and training of the individual elements might be necessary.

Once you’ve grabbed all that you need, click “I’ve got what I need”. The menu will now ask you if you want to scrape more pages. In this case, the search yielded two pages of search results so I will add another page. In order to this this, go back to your website in you regular browser, choose page 2 (or any next one) of your search results and copy the URL. Paste it into the import.io browser and confirm by clicking “I’m there”:

i'm there

The scraper should automatically fill in your table for page 2. Click “I’ve got all 45 rows” and “I’ve got what I needed”.

You need to add at least 5 pages, which is a bit frustrating with a smaller data set like this one. The way around it is to add page 2 a couple of times and delete the unnecessary rows in the final table.

Once the cheating is done, click “I’m done training!” and “Upload to import.io”.

upload

Give the name to your Crawler, e.g. “Orphanages in London” and wait for import.io to upload your data. Then, run crawler:run crawler

Make sure that the page depth is 10 and that click “Go”. If you’re scraping a huge dataset with several pages of search results, you can copy your URLs to Excel, highlight them and drag down with a black cross (bottom right of the cell) to obtain a comprehensive list. Paste it into the “Where to start?” window and press “Go”.go

crawlingAfter the crawling is complete, you can download you data in EXCEL, HTML, JSON or CSV.dataset

As a result, we obtain a data set which can be easily turned into a map of orphanages in London, e.g. using Google Fusion Tables.

Source:http://www.interhacktives.com/2014/03/06/scrape-data-without-coding-step-step-tutorial-import-io/

Sunday, 16 November 2014

Is Web Scraping Legal?

Web scraping might be one of the best ways to aggregate content from across the internet, but it comes with a caveat: It’s also one of the hardest tools to parse from a legal standpoint.

For the uninitiated, web scraping is a process whereby an automated piece of software extracts data from a website by “scraping” through the site’s many pages. While search engines like Google and Bing do a similar task when they index web pages, scraping engines take the process a step further and convert the information into a format which can be easily transferred over to a database or spreadsheet.

It’s also important to note that a web scraper is not the same as an API. While a company might provide an API to allow other systems to interact with its data, the quality and quantity of data available through APIs is typically lower than what is made available through web scraping. In addition, web scrapers provide more up-to-date information than APIs and are much easier to customize from a structural standpoint.

The applications of this “scraped” information are widespread. A journalist like Nate Silver might use scrapers to monitor baseball statistics and create numerical evidence for a new sports story he’s working on. Similarly, an eCommerce business might bulk scrape product titles, prices, and SKUs from other sites in order to further analyze them.

Legality of Web ScrapingWhile web scraping is an undoubtedly powerful tool, it’s still undergoing growing pains when it comes to legal matters. Because the scraping process appropriates pre-existing content from across the web, there are all kinds of ethical and legal quandaries that confront businesses who hope to do leverage scrapers for their own processes.

In this “wild west” environment, where the legal implications of web scraping are in a constant state of flux, it helps to get a foothold on where the legal needle currently falls. The following timeline outlines some of the biggest cases involving web scrapers in the United States, and allows us to achieve a greater understanding on the precedents that surround the court rulings.

Terms of Use Tug-of-War—2000-2009

For years after they first came into use, web scrapers went largely unchallenged from a legal standpoint. In 2000, however, the use of scrapers came under heavy and consistent fire when eBay fired the first shot against an auction data aggregator called Bidder’s Edge. In this very early case, eBay argued that Bidder’s Edge was using scrapers in a way that violated Trespass to Chattels doctrine. While the lawsuit was settled out of court, the judge upheld eBay’s original injunction, stating that heavy bot traffic could very well disrupt eBay’s service.

Then in 2003’s Intel Corp. v. Hamidi, the California Supreme court overturned the basis of eBay v. Bidder’s Edge, ruling that Trespass to Chattels could not extend to the context of computers if no actual damage to personal property occurred.

So in terms of legal action against web scraping, Tresspass to Chattels no longer applied, and things were back to square one. This began a period in which the courts consistently rejected Terms of Service as a valid means of prohibiting scrapers, including cases like Perfect 10 v. Google, and Cvent v. Eventbrite.

The Takeaway: The earliest cases against scrapers hinged on Trespass to Chattels law, and were successful. However, that doctrine is no longer a valid approach.

Facebook Web Scraping2009—Facebook Steps In

In 2009, Facebook turned the tides of the web scraping war when Power.com, a site which aggregated multiple social networks into one centralized site, included Facebook in their service. Because Power.com was scraping Facebook’s content instead of adhering to their established standards, Facebook sued Power on grounds of copyright infringement.

In denying Power.com’s motion to dismiss the case, the Judge ruled that scraping can constitute copying, however momentary that copying may be. And because Facebook’s Terms of Service don’t allow for scraping, that act of copying constituted an infringement on Facebook’s copyright. With this decision, the waters regarding the legality of web scrapers began to shift in favor of the content creators.

The Takeaway: Even if a web scraper ignores infringing content on its way to freely-usable content, it might qualify as copyright infringement by virtue of having technically “copied” the infringing content first.

2011-2014— U.S. v Auernheimer

In 2010, hacker Andrew “Weev” Auernheimer found a security flaw in AT&T’s website, which would display the email addresses of users who visited the site via their iPads. By exploiting the flaw using some simple scripts and a scraper, Auernheimer was able to gather thousands of emails from the AT&T site.

Although these email addresses were publicly available, Auernheimer’s exploit led to his 2012 conviction, where he was charged with identity fraud and conspiracy to access a computer without authorization.

Data ScrapingEarlier this year, the court vacated Auernheimer’s conviction, ruling that the trial’s New Jersey venue was improper. But even though the case turned out to be mostly inconclusive, the court noted the fact that there was no evidence to show that “any password gate or code-based barrier was breached.” This seems to leave room for the web scraping of publicly-available personal information, although it’s still very much open to interpretation and not set in stone.

The Takeaway: Using a web scraper to aggregate sensitive personal information can lead to a conviction, even if that information was technically available to the public. While there is hope in the court’s observation that no passwords or barriers were broken to retrieve this information, the waters here are still very volatile.

2013—Associated Press vs. Meltwater

Meltwater is a software company whose “Global Media Monitoring” product uses scrapers to aggregate news stories for paying clients. The Associated Press took issue with Meltwater’s scraping of their original stories, some of which had been copyrighted. In 2012, AP filed suit against Meltwater for copy infringement and hot news misappropriation.

While it’s already been established that facts cannot be copyrighted, the court decided that the AP’s copyrighted articles—and more specifically, the way in which the facts within those articles were arranged—were not fair game for copying. On top of this, Meltwater’s use of the articles failed to meet the established fair use standards, and could not be defended on that front either.

The Takeaway: Fair use is limited when it comes to web scrapers, and copyrighted content is not always open to be scraped.

~~

By closely observing the outcomes of previous rulings, you’ll find that there are a few guidelines that a scraper should attempt to adhere to:

    Content being scraped is not copyright protected
    The act of scraping does not burden the services of the site being scraped
    The scraper does not violate the Terms of Use of the site being scraped
    The scraper does not gather sensitive user information
    The scraped content adheres to fair use standards

While all of these guidelines are important to understand before using scrapers, there are other ways to acclimate to the legal nuances. In many cases, you’ll find that a simple conversation with a business software developer or consultant will lead to some satisfying conclusions: Odds are, they’ve used scrapers in the past and can shed light on any snags they’ve hit in the process. And of course, talking with a lawyer is always an ideal course of action when treading into questionable legal territory.

Source:http://blog.icreon.us/2014/09/12/web-scraping-and-you-a-legal-primer-for-one-of-its-most-useful-tools/

Friday, 14 November 2014

Interactive Crawls for Scraping AJAX Pages on the Web

Crawling pages on the web has become an everyday affair for most enterprises. Too often do we come across offline businesses as well who’d like data gathered from the web for internal analyses. All this eventually to serve customers faster and better. At times, when the crawl job is high-end cum high-scale, businesses also consider DaaS providers to supplement their efforts.

However, the web landscape too has evolved with newer technologies that provide fancy experiences to web users. AJAX elements are one such common aid that leave even the DaaS providers perplexed. They come in various forms from a user’s point of view-

1. Load more results on the same page

2. Filter results based on various selection criteria

3. Submit forms, etc.

When crawling a non-AJAX page, simple GET requests do the job. However, AJAX pages work with POST requests that are not easy to trace for a normal bot.

Difference between GET request and POST request- Scraping

GET vs. POST

At PromptCloud, from our experience with a number of AJAX sites on the web, we’ve crossed the tech barrier. Below is a quick review about the challenges that come with AJAX crawling and its indicative solutions-

1. Javascript Emulations- A bot essentially emulates human browsing to fetch pages. When this needs to be done for Javascript components on a page, it gets tricky. Headless browser, which emulates human interaction with a web page without an interface, is the current approach. These browsers click on various elements/ dropdown lists that are embedded within Javascript code and capture responses to be transferred to programs. Which headless browser to pick depends on what fits well into your current stack.

2. Fetch Bandwidths- Unlike GET requests which complete pretty quickly, POST requests take quite a bit of time due to the number of events involved per fetch. Hence a good amount of bandwidth needs to be allocated in order to receive the response. For the same reason, wait times need to be taken care of too else you might end up with incomplete responses.

3. .NET Architectures- This is a more complex scenario related to maintaining the View State. Most of the postbacks come with an event and its validation. The bot needs to track the view state and pass validations for the event to occur so that the code can be executed and results captured. This is achieved by adopting a mechanism to restore states if things break midway.

4. Page Encoding- Request and response headers need to be taken care of on AJAX pages. The request needs to be sent in the exact format as expected by the server (Content-type or media type, accept fields, etc.) and similarly responses need to be parsed based on the content-type.

A Use Case

One of our clients who is into sale of event tickets at discounted rates had us crawl one of the ticketing sites on the web weekly; one of the most complex AJAX crawling we’ve dealt with so far. For the data that was to be extracted, multiple AJAX fetches were needed depending on the selections made. Requests had to be made for a combination of items from the dropdown box. These came with cookies and session IDs. To add to the challenge the site was extremely dynamic and changed its structure every week making it difficult for us to follow what data was where on the page.

We developed an AJAX crawler specific to this site to take care of all the dynamics. Response times were taken care of so that we didn’t miss any relevant information. We included an ML component to improve the crawler which is now pretty stable irrespective of changes on the site.

Overall, AJAX crawling requires more compute power in addition to the tech expertise. And because there’s no uniformity on the web, there’s always a new challenge to overcome in this landscape. It wouldn’t be an overrating if we said we’ve done a good job at that so far and have developed the knack :)

Reach out to us for any kind of web scraping/ crawling- either AJAX or not. We’ll take care of the complexities.

Source: https://www.promptcloud.com/blog/web-scraping-interactive-ajax-crawls/

Wednesday, 12 November 2014

Web scraping services-importance of scraped data

Web scraping services are provided by computer software which extracts the required facts from the website. Web scraping services mainly aims at converting unstructured data collected from the websites into structured data which can be stockpiled and scrutinized in a centralized databank. Therefore, web scraping services have a direct influence on the outcome of the reason as to why the data collected in necessary.

It is not very easy to scrap data from different websites due to the terms of service in place. So, the there are some legalities that have been improvised to protect altering the personal information on different websites. These ‘rules’ must be followed to the letter and to some extent have limited web scraping services.

Owing to the high demand for web scraping, various firms have been set up to provide the efficient and reliable guidelines on web scraping services so that the information acquired is correct and conforms to the security requirements. The firms have also improvised different software that makes web scraping services much easier.

Importance of web scraping services

Definitely, web scraping services have gone a long way in provision of very useful information to various organizations. But business companies are the ones that benefit more from web scraping services. Some of the benefits associated with web scraping services are:

    Helps the firms to easily send notifications to their customers including price changes, promotions, introduction of a new product into the market. Etc.
    It enables firms to compare their product prices with those of their competitors
    It helps the meteorologists to monitor weather changes thus being able to focus weather conditions more efficiently
    It also assists researchers with extensive information about peoples’ habits among many others.
    It has also promoted e-commerce and e-banking services where the rates of stock exchange, banks’ interest rates, etc. are updated automatically on the customer’s catalog.

Advantages of web scraping services

The following are some of the advantages of using web scraping services

    Automation of the data

    Web scraping can retrieve both static and dynamic web pages

    Page contents of various websites can be transformed

    It allows formulation of vertical aggregation platforms thus even complicated data can still be extracted from different websites.

    Web scraping programs recognize semantic annotation

    All the required data can be retrieved from their websites

    The data collected is accurate and reliable

Web scraping services mainly aims at collecting, storing and analyzing data. The data analysis is facilitated by various web scrapers that can extract any information and transform it into useful and easy forms to interpret.

Challenges facing web scraping

    High volume of web scraping can cause regulatory damage to the pages

    Scale of measure; the scales of the web scraper can differ with the units of measure of the source file thus making it somewhat hard for the interpretation of the data

    Level of source complexity; if the information being extracted is very complicated, web scraping will also be paralyzed.

It is clear that besides web scraping providing useful data and information, it experiences a number of challenges. The good thing is that the web scraping services providers are always improvising techniques to ensure that the information gathered is accurate, timely, reliable and treated with the highest levels of confidentiality.

Source: http://www.loginworks.com/blogs/web-scraping-blogs/191-web-scraping-services-importance-of-scraped-data/

Tuesday, 11 November 2014

How to scrape Amazon with WebDriver in Java

Here is a real-world example of using Selenium WebDriver for scraping.
This short program is written in Java and scrapes book title and author from the Amazon webstore.
This code scrapes only one page, but you can easily make it scraping all the pages by adding a couple of lines.

You can download the souce here.

import java.io.*;
import java.util.*;
import java.util.regex.*;

import org.openqa.selenium.*;
import org.openqa.selenium.firefox.FirefoxDriver;

public class FetchAllBooks {

    public static void main(String[] args) throws IOException {

        WebDriver driver = new FirefoxDriver();

driver.navigate().to("http://www.amazon.com/tag/center%20right?ref_=tag_dpp_cust_itdp_s_t&sto

re=1");

        List<WebElement> allAuthors = driver.findElements(By.className("tgProductAuthor"));
        List<WebElement> allTitles = driver.findElements(By.className("tgProductTitleText"));
        int i=0;
        String fileText = "";

        for (WebElement author : allAuthors){
            String authorName = author.getText();
            String Url = (String)((JavascriptExecutor)driver).executeScript("return

arguments[0].innerHTML;", allTitles.get(i++));
            final Pattern pattern = Pattern.compile("title=(.+?)>");
            final Matcher matcher = pattern.matcher(Url);
            matcher.find();
            String title = matcher.group(1);
            fileText = fileText+authorName+","+title+"\n";
        }

        Writer writer = new BufferedWriter(new OutputStreamWriter(new

FileOutputStream("books.csv"), "utf-8"));
        writer.write(fileText);
        writer.close();

        driver.close();
    }
}

Source: http://scraping.pro/scraping-amazon-webdriver-java/

Saturday, 8 November 2014

Web Scraping Enters Politics

Web scraping is becoming an essential tool in gaining an edge over everything about just anything. This is proven by international news on US political campaigns, specifically by identifying wealthy donors. As is commonly known, election campaigns should follow a rule regarding the use of a certain limited amount of money for the expenses of each candidate. Being so, much of the campaign activities must be paid by supporters and sponsors.

It is not a surprise then that even politics is lured to make use of the dynamic and ever growing data mining processes. Once again, web mining has proven to be an essential component of almost all levels of human existence, the society, and the world as a whole. It proves its extraordinary capacity to dig precious information to reach the much aspired for goals of every individual.

Mining for personal information

The CBC News online very recently disclosed that the US Republican presidential candidate Mitt Romney has used data mining in order to identify rich donors. It is reported that the act of getting personal information such as the buying history and church attendance were vital in this incident. Through this information, the party was able to identify prospective rich donors and indeed tap them. As a businessman himself, Romney knows exactly how to fish and where the fat fish are. Moreover, what is unique about the identified donors is that they have never been donating before.

Source:http://www.loginworks.com/blogs/web-scraping-blogs/web-scraping-enters-politics/

Wednesday, 5 November 2014

Web Scraping: The Invaluable Decision Making Tool

Business decisions are mandatory in any company. They reflect and directly influence about the future of the company. It is important to realize that decisions must be made in any business situation. The generation of new ideas calls for new actions. This in turn calls for decisions. Decisions can only be made when there is adequate information or data regarding the problem and the cause of action to be taken. Web scraping offers the best opportunity in getting the required information that will enable the management make a wise and sound decision.

Therefore web scraping is an important part in generation of the practical interpretations for the business decision making process. Since businesses take many courses of actions the following areas call for adequate web scraping in order to make outstanding decisions.

1. Suppliers. Whether you are running an offline business there is need to get information regarding your suppliers. In this case there are two situations. The first situation is about your current suppliers and the second situation is about the possibility of acquiring new suppliers. By web scraping you has the opportunity to gather about your suppliers. You need to know other business they are supplying to and the kind of discounts and prices they offer to them. Another important aspect about consumers is to determine the periods when they have surplus and therefore be able to determine the purchasing prices.

Web scraping can provide new information concerning new suppliers. This will make a cutting edge in the purchasing sector. You can get new suppliers that have reasonable prices. This will go a long way in ensuring a profitable business. Therefore web scraping is an integral process that should be taken first before making a vital decision concerning suppliers.

Source:http://www.loginworks.com/blogs/web-scraping-blogs/web-scraping-invaluable-decision-making-tool/

Thursday, 11 September 2014

Scraping webdata from a website that loads data in a streaming fashion

I'm trying to scrape some data off of the FEC.gov website using python for a project of mine. Normally I use python

mechanize and beautifulsoup to do the scraping.

I've been able to figure out most of the issues but can't seem to get around a problem. It seems like the data is

streamed into the table and mechanize.Browser() just stops listening.

So here's the issue: If you visit http://query.nictusa.com/cgi-bin/can_ind/2011_P80003338/1/A ... you get the first 500

contributors whose last name starts with A and have given money to candidate P80003338 ... however, if you use

browser.open() at that url all you get is the first ~5 rows.

I'm guessing its because mechanize isn't letting the page fully load before the .read() is executed. I tried putting a

time.sleep(10) between the .open() and .read() but that didn't make much difference.

And I checked, there's no javascript or AJAX in the website (or at least none are visible when you use the 'view-

source'). SO I don't think its a javascript issue.

Any thoughts or suggestions? I could use selenium or something similar but that's something that I'm trying to avoid.

-Will

2 Answers

Why not use an html parser like lxml with xpath expressions.

I tried

>>> import lxml.html as lh
>>> data = lh.parse('http://query.nictusa.com/cgi-bin/can_ind/2011_P80003338/1/A')
>>> name = data.xpath('/html/body/table[2]/tr[5]/td[1]/a/text()')
>>> name
[' AABY, TRYGVE']
>>> name = data.xpath('//table[2]/*/td[1]/a/text()')
>>> len(name)
500
>>> name[499]
' AHMED, ASHFAQ'
>>>

Similarly, you can create xpath expression of your choice to work with.

Source: http://stackoverflow.com/questions/9435512/scraping-webdata-from-a-website-that-loads-data-in-a-streaming-

fashion

Monday, 8 September 2014

How can I circumvent page view limits when scraping web data using Python?

I am using Python to scrape US postal code population data from http:/www.city-data.com, through this directory: http://www.city-data.com/zipDir.html. The specific pages I am trying to scrape are individual postal code pages with URLs like this: http://www.city-data.com/zips/01001.html. All of the individual zip code pages I need to access have this same URL Format, so my script simply does the following for postal_code in range:

    Creates URL given postal code
    Tries to get response from URL
    If (2), Check the HTTP of that URL
    If HTTP is 200, retrieves the HTML and scrapes the data into a list
    If HTTP is not 200, pass and count error (not a valid postal code/URL)
    If no response from URL because of error, pass that postal code and count error
    At end of script, print counter variables and timestamp

The problem is that I run the script and it works fine for ~500 postal codes, then suddenly stops working and returns repeated timeout errors. My suspicion is that the site's server is limiting the page views coming from my IP address, preventing me from completing the amount of scraping that I need to do (all 100,000 potential postal codes).

My question is as follows: Is there a way to confuse the site's server, for example using a proxy of some kind, so that it will not limit my page views and I can scrape all of the data I need?

Thanks for the help! Here is the code:

##POSTAL CODE POPULATION SCRAPER##

import requests

import re

import datetime

def zip_population_scrape():

    """
    This script will scrape population data for postal codes in range
    from city-data.com.
    """
    postal_code_data = [['zip','population']] #list for storing scraped data

    #Counters for keeping track:
    total_scraped = 0
    total_invalid = 0
    errors = 0

    for postal_code in range(1001,5000):

        #This if statement is necessary because the postal code can't start
        #with 0 in order for the for statement to interate successfully
        if postal_code <10000:
            postal_code_string = str(0)+str(postal_code)
        else:
            postal_code_string = str(postal_code)

        #all postal code URLs have the same format on this site
        url = 'http://www.city-data.com/zips/' + postal_code_string + '.html'

        #try to get current URL
        try:
            response = requests.get(url, timeout = 5)
            http = response.status_code

            #print current for logging purposes
            print url +" - HTTP: " + str(http)

            #if valid webpage:
            if http == 200:

                #save html as text
                html = response.text

                #extra print statement for status updates
                print "HTML ready"

                #try to find two substrings in HTML text
                #add the substring in between them to list w/ postal code
                try:

                    found = re.search('population in 2011:</b> (.*)<br>', html).group(1)

                    #add to # scraped counter
                    total_scraped +=1

                    postal_code_data.append([postal_code_string,found])

                    #print statement for logging
                    print postal_code_string + ": " + str(found) + ". Data scrape successful. " + str(total_scraped) + " total zips scraped."
                #if substrings not found, try searching for others
                #and doing the same as above
                except AttributeError:
                    found = re.search('population in 2010:</b> (.*)<br>', html).group(1)

                    total_scraped +=1

                    postal_code_data.append([postal_code_string,found])
                    print postal_code_string + ": " + str(found) + ". Data scrape successful. " + str(total_scraped) + " total zips scraped."

            #if http =404, zip is not valid. Add to counter and print log
            elif http == 404:
                total_invalid +=1

                print postal_code_string + ": Not a valid zip code. " + str(total_invalid) + " total invalid zips."

            #other http codes: add to error counter and print log
            else:
                errors +=1

                print postal_code_string + ": HTTP Code Error. " + str(errors) + " total errors."

        #if get url fails by connnection error, add to error count & pass
        except requests.exceptions.ConnectionError:
            errors +=1
            print postal_code_string + ": Connection Error. " + str(errors) + " total errors."
            pass

        #if get url fails by timeout error, add to error count & pass
        except requests.exceptions.Timeout:
            errors +=1
            print postal_code_string + ": Timeout Error. " + str(errors) + " total errors."
            pass

    #print final log/counter data, along with timestamp finished
    now= datetime.datetime.now()
    print now.strftime("%Y-%m-%d %H:%M")
    print str(total_scraped) + " total zips scraped."
    print str(total_invalid) + " total unavailable zips."
    print str(errors) + " total errors."

Source: http://stackoverflow.com/questions/25452798/how-can-i-circumvent-page-view-limits-when-scraping-web-data-using-python