Wednesday, 10 August 2016

How to Scrape a Website into Excel without programming

This web scraping tutorial shows you visually, step by step, how to scrape (extract) data from websites into Excel using import.io, a free tool, without programming skills.

Personally, I use web scraping to analyse my competitors’ best-performing blog posts and content, such as which posts received the most comments or social media shares.

In this tutorial, we will scrape the following data from a blog:

    All blog post URLs.
    Author names for each post.
    Blog post titles.
    The number of social media shares each post received.

Then we will use the extracted data to determine which blog posts are popular, who their authors are, and which posts received the most engagement from users through social media shares and on-page comments.
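For readers who later want to script the analysis step, here is a minimal Python sketch of ranking extracted rows by share count. The rows, names and URLs are made up for illustration; they only mirror the columns listed above and are not data from the blog.

```python
# Hypothetical rows shaped like the columns this tutorial extracts.
posts = [
    {"title": "Post A", "author": "Asha", "url": "http://example.com/a", "shares": 120},
    {"title": "Post B", "author": "Juma", "url": "http://example.com/b", "shares": 45},
    {"title": "Post C", "author": "Asha", "url": "http://example.com/c", "shares": 300},
]


def rank_by_shares(rows):
    """Return the rows sorted from most shared to least shared."""
    return sorted(rows, key=lambda row: row["shares"], reverse=True)


def shares_per_author(rows):
    """Add up the share counts of every post written by each author."""
    totals = {}
    for row in rows:
        totals[row["author"]] = totals.get(row["author"], 0) + row["shares"]
    return totals


ranked = rank_by_shares(posts)
print([row["title"] for row in ranked])
print(shares_per_author(posts))
```

The same sorting can of course be done in Excel once the data is exported; the sketch only shows what "most popular" means in terms of the extracted columns.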

Let’s get started.

Step 1: Install the import.io app

The first step is to install the import.io app, a free web scraping tool and one of the best web scraping tools available. It is available for Windows, Mac and Linux. Import.io offers advanced data extraction features without coding by allowing you to create custom APIs or crawl entire websites.

After installation, you will need to sign up for an account. It is completely free, so don’t worry. I will not cover the installation process. Once everything is set up correctly, you will see something similar to the window below after your first login.

Step 2: Choose how to scrape data using the import.io extractor

With import.io you can extract data by creating custom APIs or by crawling entire websites. It comes equipped with different tools for data extraction, such as magic, extractor, crawler and connector.

In this tutorial, I will use a tool called “extractor” to create a custom API for our data extraction process.

To get started, click the red “new” button at the top right of the page, then click the “Start Extractor” button in the pop-up window.

After you click “Start Extractor”, the import.io app’s internal browser window will open, as shown below.

Step 3: Data scraping process

Once the import.io browser is open, navigate to the URL of the blog you want to scrape data from, then turn on extraction. In this tutorial, I will use the blog bongo5.com for data extraction.

You can see from the window below that I have already navigated to www.bongo5.com, but the extraction switch is still off.

Turn the extraction switch “ON” as shown in the window below and move to the next step.

Step 4: Training the “columns”, or specifying the data we want to scrape

In this step, I will specify exactly what kind of data I want to scrape from the blog. In the import.io app, specifying the data you want to scrape is referred to as “training the columns”. Columns represent the data set I want to scrape (post titles, authors’ names and post URLs).

In order to understand this step, you need to know the difference between a blog page and a blog post. A page might have a single post or multiple posts, depending on the blog configuration.

A blog might have several blog posts, even hundreds or thousands of posts, but I need only one session to train the “extractor” on the data I want to extract. I will do so using the import.io visual highlighter. Once data extraction is turned on, the highlighter appears by default.

I will do the training session for a single post on a single blog page with multiple posts; the extractor will then extract data automatically for the remaining posts on the “same” blog page.

Step 4a: Creating the “post_title” column

I will start by renaming “my_column” to the name of the data I want to scrape. Our goal in this tutorial is to scrape the blog post titles, post URLs and author names and to get social statistics later, so I will create columns for post titles, post URLs and author names. Later on, I will teach you how to get social statistics for the post URLs.

After renaming “my_column” to “post_title”, point the mouse cursor at any of the post titles on the blog page and the visual highlighter will automatically appear. Using the highlighter, I can select the data I want to extract.

You can see below that I selected one of the blog post titles on the page. The rectangular box with the orange border is the visual highlighter.

The app will ask you how the data is arranged on the page. Since there is more than one post on a single page, you have rows of repeating data. This blog has 25 posts per page, so you will select “many rows”. Sometimes a page has only a single post; in that case you need to select “Just one row”.
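To make the idea of “many rows” concrete, here is a rough Python sketch of what a trained column does conceptually: one example of a pattern is generalised to every repeating element on the page. This is not how import.io works internally; the markup, the class name `post-title` and the parser class are all assumptions made up for the illustration.

```python
from html.parser import HTMLParser

# Hypothetical markup: each post title is a link inside <h2 class="post-title">.
SAMPLE_PAGE = """
<div class="post"><h2 class="post-title"><a href="/post-1">First post</a></h2></div>
<div class="post"><h2 class="post-title"><a href="/post-2">Second post</a></h2></div>
<div class="post"><h2 class="post-title"><a href="/post-3">Third post</a></h2></div>
"""


class PostTitleParser(HTMLParser):
    """Collect one (title, url) row for every post title found on the page."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_title = False
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h2" and attrs.get("class") == "post-title":
            self._in_title = True
        elif tag == "a" and self._in_title:
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that sits inside a title link.
        if self._in_title and self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_title and self._href is not None:
            self.rows.append(("".join(self._text).strip(), self._href))
            self._href = None
        elif tag == "h2":
            self._in_title = False


def extract_rows(html):
    """Return every (title, url) row on the page: the 'many rows' case."""
    parser = PostTitleParser()
    parser.feed(html)
    return parser.rows


rows = extract_rows(SAMPLE_PAGE)
print(rows)
```

A page with a single post would simply yield a list with one row, which corresponds to the “Just one row” choice.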

Source: http://nocodewebscraping.com/web-scraping-for-dummies-tutorial-with-import-io-without-coding/

Thursday, 4 August 2016

Three Common Methods For Web Data Extraction

Probably the most common technique used traditionally to extract data from web pages is to cook up some regular expressions that match the pieces you want (e.g., URLs and link titles). Our screen-scraper software actually started out as an application written in Perl for this very reason. In addition to regular expressions, you might also use some code written in something like Java or Active Server Pages to parse out larger chunks of text. Using raw regular expressions to pull out the data can be a little intimidating to the uninitiated, and can get a bit messy when a script contains a lot of them. At the same time, if you're already familiar with regular expressions, and your scraping project is relatively small, they can be a great solution.
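As a small illustration of the regular-expression approach, here is a Python sketch that pulls URLs and link titles out of a fragment of HTML. The pattern and the sample markup are assumptions chosen for the example; real pages usually need a more forgiving pattern (single-quoted attributes, for instance, are not handled here).

```python
import re

# Matches an <a> tag and captures its href value and its link text.
LINK_PATTERN = re.compile(
    r'<a\s+[^>]*?href="([^"]+)"[^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)


def extract_links(html):
    """Return a list of (url, title) pairs for every link in the HTML."""
    return [(url, title.strip()) for url, title in LINK_PATTERN.findall(html)]


sample = '<p><a href="http://example.com/a">First</a> and <a class="x" href="/b">Second</a></p>'
print(extract_links(sample))
```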

Other techniques for getting the data out can get very sophisticated as algorithms that make use of artificial intelligence and such are applied to the page. Some programs will actually analyze the semantic content of an HTML page, then intelligently pull out the pieces that are of interest. Still other approaches deal with developing "ontologies", or hierarchical vocabularies intended to represent the content domain.

There are a number of companies (including our own) that offer commercial applications specifically intended to do screen-scraping. The applications vary quite a bit, but for medium to large-sized projects they're often a good solution. Each one will have its own learning curve, so you should plan on taking time to learn the ins and outs of a new application. Especially if you plan on doing a fair amount of screen-scraping it's probably a good idea to at least shop around for a screen-scraping application, as it will likely save you time and money in the long run.

So what's the best approach to data extraction? It really depends on what your needs are, and what resources you have at your disposal. Here are some of the pros and cons of the various approaches, as well as suggestions on when you might use each one:

Raw regular expressions and code

Advantages:

- If you're already familiar with regular expressions and at least one programming language, this can be a quick solution.

- Regular expressions allow for a fair amount of "fuzziness" in the matching such that minor changes to the content won't break them.

- You likely don't need to learn any new languages or tools (again, assuming you're already familiar with regular expressions and a programming language).

- Regular expressions are supported in almost all modern programming languages. Heck, even VBScript has a regular expression engine. It's also nice because the various regular expression implementations don't vary too significantly in their syntax.

Disadvantages:

- They can be complex for those that don't have a lot of experience with them. Learning regular expressions isn't like going from Perl to Java. It's more like going from Perl to XSLT, where you have to wrap your mind around a completely different way of viewing the problem.

- They're often confusing to analyze. Take a look through some of the regular expressions people have created to match something as simple as an email address and you'll see what I mean.

- If the content you're trying to match changes (e.g., they change the web page by adding a new "font" tag) you'll likely need to update your regular expressions to account for the change.

- The data discovery portion of the process (traversing various web pages to get to the page containing the data you want) will still need to be handled, and can get fairly complex if you need to deal with cookies and such.

When to use this approach: You'll most likely use straight regular expressions in screen-scraping when you have a small job you want to get done quickly. Especially if you already know regular expressions, there's no sense in getting into other tools if all you need to do is pull some news headlines off of a site.
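The trade-off between "fuzziness" and breakage mentioned in the lists above can be sketched in Python. Both patterns and the sample headline markup are hypothetical: a strict pattern stops matching as soon as a "font" tag is added, while a looser one that tolerates extra attributes and nested tags keeps working.

```python
import re

# Strict pattern: expects the headline text directly inside the <h3> tag.
STRICT = re.compile(r'<h3 class="headline">([^<]+)</h3>')

# Looser pattern: tolerates extra attributes and nested inline tags.
TOLERANT = re.compile(r'<h3\b[^>]*class="headline"[^>]*>(.*?)</h3>', re.DOTALL)

# Used to strip any tags left inside a captured headline.
TAG = re.compile(r"<[^>]+>")


def headlines(pattern, html):
    """Return headline texts matched by the pattern, with inner tags removed."""
    return [TAG.sub("", match).strip() for match in pattern.findall(html)]


before = '<h3 class="headline">Markets rally</h3>'
after = '<h3 class="headline"><font color="red">Markets rally</font></h3>'

print(headlines(STRICT, before))    # strict pattern works on the old markup
print(headlines(STRICT, after))     # and finds nothing once the font tag appears
print(headlines(TOLERANT, after))   # the looser pattern still finds the headline
```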

Ontologies and artificial intelligence

Advantages:

- You create it once and it can more or less extract the data from any page within the content domain you're targeting.

- The data model is generally built in. For example, if you're extracting data about cars from web sites the extraction engine already knows what the make, model, and price are, so it can easily map them to existing data structures (e.g., insert the data into the correct locations in your database).

- There is relatively little long-term maintenance required. As web sites change you likely will need to do very little to your extraction engine in order to account for the changes.

Disadvantages:

- It's relatively complex to create and work with such an engine. The level of expertise required to even understand an extraction engine that uses artificial intelligence and ontologies is much higher than what is required to deal with regular expressions.

- These types of engines are expensive to build. There are commercial offerings that will give you the basis for doing this type of data extraction, but you still need to configure them to work with the specific content domain you're targeting.

- You still have to deal with the data discovery portion of the process, which may not fit as well with this approach (meaning you may have to create an entirely separate engine to handle data discovery). Data discovery is the process of crawling web sites such that you arrive at the pages where you want to extract data.

When to use this approach: Typically you'll only get into ontologies and artificial intelligence when you're planning on extracting information from a very large number of sources. It also makes sense to do this when the data you're trying to extract is in a very unstructured format (e.g., newspaper classified ads). In cases where the data is very structured (meaning there are clear labels identifying the various data fields), it may make more sense to go with regular expressions or a screen-scraping application.

Screen-scraping software

Advantages:

- Abstracts most of the complicated stuff away. You can do some pretty sophisticated things in most screen-scraping applications without knowing anything about regular expressions, HTTP, or cookies.

- Dramatically reduces the amount of time required to set up a site to be scraped. Once you learn a particular screen-scraping application the amount of time it requires to scrape sites vs. other methods is significantly lowered.

- Support from a commercial company. If you run into trouble while using a commercial screen-scraping application, chances are there are support forums and help lines where you can get assistance.

Disadvantages:

- The learning curve. Each screen-scraping application has its own way of going about things. This may imply learning a new scripting language in addition to familiarizing yourself with how the core application works.

- A potential cost. Most ready-to-go screen-scraping applications are commercial, so you'll likely be paying in dollars as well as time for this solution.

- A proprietary approach. Any time you use a proprietary application to solve a computing problem (and proprietary is obviously a matter of degree) you're locking yourself into using that approach. This may or may not be a big deal, but you should at least consider how well the application you're using will integrate with other software applications you currently have. For example, once the screen-scraping application has extracted the data how easy is it for you to get to that data from your own code?

When to use this approach: Screen-scraping applications vary widely in their ease-of-use, price, and suitability to tackle a broad range of scenarios. Chances are, though, that if you don't mind paying a bit, you can save yourself a significant amount of time by using one. If you're doing a quick scrape of a single page you can use just about any language with regular expressions. If you want to extract data from hundreds of web sites that are all formatted differently you're probably better off investing in a complex system that uses ontologies and/or artificial intelligence. For just about everything else, though, you may want to consider investing in an application specifically designed for screen-scraping.

As an aside, I thought I should also mention a recent project we've been involved with that has actually required a hybrid approach of two of the aforementioned methods. We're currently working on a project that deals with extracting newspaper classified ads. The data in classifieds is about as unstructured as you can get. For example, in a real estate ad the term "number of bedrooms" can be written about 25 different ways. The data extraction portion of the process is one that lends itself well to an ontologies-based approach, which is what we've done. However, we still had to handle the data discovery portion. We decided to use screen-scraper for that, and it's handling it just great. The basic process is that screen-scraper traverses the various pages of the site, pulling out raw chunks of data that constitute the classified ads. These ads then get passed to code we've written that uses ontologies in order to extract out the individual pieces we're after. Once the data has been extracted we then insert it into a database.
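To give a feel for the "number of bedrooms" problem, here is a deliberately simplified Python sketch. It is a keyword pattern standing in for a real ontology-based engine, not the code used in the project described above; the vocabulary, the function name and the sample ads are all assumptions.

```python
import re

# A tiny, hypothetical vocabulary of ways an ad might state the bedroom count.
WORD_NUMBERS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}
BEDROOM_PATTERN = re.compile(
    r"\b(\d+|one|two|three|four|five)[\s-]*(?:bedrooms?|beds?|bdrms?|brs?|bd)\b",
    re.IGNORECASE,
)


def bedrooms(ad_text):
    """Return the number of bedrooms mentioned in an ad, or None if not found."""
    match = BEDROOM_PATTERN.search(ad_text)
    if match is None:
        return None
    token = match.group(1).lower()
    return int(token) if token.isdigit() else WORD_NUMBERS[token]


for ad in ["Sunny 3 bedroom house", "2BR apt near park", "Three-bed cottage", "Garage for rent"]:
    print(ad, "->", bedrooms(ad))
```

A real ontology would also relate the extracted value to other concepts in the domain (price, location, property type); the sketch only covers the normalisation of one field.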

Source: http://ezinearticles.com/?Three-Common-Methods-For-Web-Data-Extraction&id=165416

Monday, 1 August 2016

Scraping data from LinkedIn

How to scrape data from LinkedIn public profile for marketing purposes?

You can scrape data from a LinkedIn public profile using data scraper software. LinkedIn data extraction is most beneficial for marketers, and most medium-sized companies rely on LinkedIn for their marketing.

I would recommend the "LinkedIn Lead Extractor" software, which helps to quickly scrape public profiles from LinkedIn. With this tool you can scrape the profile link, First Name, Last Name, Email, Phone, Address, Twitter id, Yahoo messenger id, Skype Id, Google Talk ID, Job Role, Company Name, Address, Country and Connections. The company built this tool specially for LinkedIn marketers who are not satisfied with their drop ship supplier's digital data.

LinkedIn advanced search gives you a list of targeted customer profiles matching your requirements, such as country, city, company, job title, and much more.

In a few weeks you can develop new ways to set up your sales teams differently and create a more technology-driven environment in the strategy department. An internal platform that generates targeted leads can be a very big help. You can go to market in any area or city in far less time than was possible a few years ago.

Source: http://www.ahmadsoftware.com/blogs/4/scraping-data-from-linkedin.html

Monday, 11 July 2016

Extract Data from Multiple Web Pages into Excel using import.io

In this tutorial, I will show you how to extract data from multiple web pages of a website or blog and save the extracted data into an Excel spreadsheet for further processing. There are various methods and tools to do that, but I found them complicated and I prefer to use import.io to accomplish the task. Import.io doesn’t require you to have programming skills. The platform is quite powerful and user-friendly, with a lot of support online, and above all it is FREE to use.

You can use the online version of their data extraction software or a desktop application. The online version will be covered in this tutorial.

Let us get started.

Step 1: Find a web page you want to extract data from.
You can extract data such as prices, images, authors’ names, addresses, dates etc.

Step 2: Enter the URL for that web page into the text box here and click “Extract data”.

After you click “Extract data”, Import.io will transform the web page into data in seconds. Data such as authors, images, post publication dates and post titles will be pulled from the web page, as shown in the image below.

Import.io extracted only 40 posts or articles, all from the first page of the blog!
If you visit bongo5.com you will notice that the website has a total of 600+ pages at the time of writing this article, and each page has 40 posts or articles on it, as shown in the image below.
The next step will show you how to extract data from multiple pages of the website into Excel.

Step 3: Extract Data from Multiple Web Pages into Excel

Using the import.io online tool you can extract data from a maximum of 20 web pages. Go to the bottom right corner of the import.io online tool page and click “Download CSV” to save the extracted data from those 20 pages into Excel.
Note: Using the import.io desktop application you can extract an unlimited number of pages and pinpoint only the data you want to extract. Check out this tutorial on how to use the desktop application.
Once you click “Download CSV”, the following pop-up window will appear. You can specify the number of pages you want to get data from, up to a maximum of 20 pages, then click “Go!”
You will need to sign up for a free account to download that data as a CSV, or save it as an API. If you save it as an API, you can go back to the API later to extract new data if the web page is updated, without the need to repeat the steps we have done so far. You can also use the API for integration into other platforms.
The image below shows 20 rows out of the 800 rows of data extracted from the 20 pages of the website.
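The 800 rows come from 20 pages of 40 posts each. For the curious, the Python sketch below shows that arithmetic and the shape of a CSV file that Excel can open. It does not call import.io; the rows are generated placeholders and the column names are assumptions.

```python
import csv
import io

PAGES = 20           # the online tool's page limit described in this step
POSTS_PER_PAGE = 40  # posts found on each page of the blog in this tutorial


def placeholder_page_rows(page_number):
    """Stand-in for one extracted page: returns placeholder rows, not real data."""
    return [
        {"page": page_number, "title": f"Post {page_number}-{i}"}
        for i in range(1, POSTS_PER_PAGE + 1)
    ]


def rows_to_csv(rows):
    """Serialise the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["page", "title"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


all_rows = [row for page in range(1, PAGES + 1) for row in placeholder_page_rows(page)]
csv_text = rows_to_csv(all_rows)
print(len(all_rows))
print(csv_text.splitlines()[0])
```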

Conclusion

The online tool doesn’t offer as much flexibility as the desktop application. For example, you cannot extract more than 20 pages and you cannot pinpoint the type of data you want to extract. For a more advanced tutorial on how to use the desktop application, you can check out this tutorial I created earlier.

Source URL : http://nocodewebscraping.com/extract-multiple-web-pages-data-into-excel/

Sunday, 10 July 2016

4 Web Scraping Tools To Save You Time On Data Extraction

Whether you are working on a product website, struggling to add a live data feed to your app or merely need to pull out a huge amount of online data for analysis, an accurate web scraping tool can save you loads of time and keep you sane. Here are four powerful web scraping tools to save you from copy-pasting or spending time on writing your own scripts.

Uipath specializes in developing process automation software, including web scraping and screen scraping software for desktop and web. The Uipath web scraper is perfect for non-coders and easily handles the most common data extraction challenges, including page navigation, digging through Flash and even scraping PDF files. All you need to do is open the web scraping wizard and simply highlight the data you need to extract. The tool will scrape all the data following this pattern on all the pages you’ve chosen and sort it accordingly. You can add as many items for scraping as you like and have them sorted in respective columns. As a result, you receive a neat Excel or CSV document with all the data and with duplicates removed.

Moreover, Uipath isn’t just about scraping. This software can be used not only for extracting data, but also to manipulate the interface of another app, thus establishing data transfers between the two. Basically, this tool could be used to conduct any repetitive task a human could do, yet much faster and with higher accuracy.

Pros: You can automate form filling, clicking buttons, navigation etc. Uipath scraper is impressively accurate, fast and simple to use. It “reads” all types of data on screen (JS, HTML, Silverlight and more), plus you can train the software to emulate human actions of various complexity.

Cons: Premium software runs at a premium price. Uipath is an affordable professional solution, but may be a bit too pricey for personal use.
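The duplicate-free output described for Uipath is easy to picture in code. The Python sketch below is not Uipath's implementation, only an illustration of de-duplicating scraped rows while keeping their order; the rows and field names are hypothetical.

```python
def drop_duplicates(rows):
    """Keep the first occurrence of each row, preserving the original order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))  # hashable fingerprint of the row
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


scraped = [
    {"item": "Laptop", "price": "499"},
    {"item": "Phone", "price": "299"},
    {"item": "Laptop", "price": "499"},  # same row scraped from a second page
]
print(drop_duplicates(scraped))
```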

Import.io offers you a free desktop app to help you scrape all the data you need from an unlimited number of web pages. The service treats each page as a potential data source to generate an API from. If the page you’ve submitted has been previously processed, you can access its API and get some of the data. Otherwise, Import.io will guide you through the process of creating the scraping matrix by building connectors (for navigation) or extractors (to pull out the needed data). Afterwards, you submit a request for extraction and it’s typically processed within 24 hours. All the data is private and you can schedule automatic refreshes at any chosen interval.

Pros: The service is easy to use, with no tech skills needed. It can scrape pages with data behind a login (those that need a login/password), plus it’s free. It also comes with a minimalistic, effective design and simple navigation.

Cons: Import.io has a hard time navigating through combinations of JavaScript/POST and cannot navigate from one page to another (e.g. click next, second page etc). Sometimes, it takes over 24 hours to receive the report. Besides, it’s a browser-only app, not compatible with other applications.

Kimono is a popular web scraper among app developers who prefer to power up their products with live data and no additional code. It saves you tons of time when you need to fill your app with mash-up data. Install the Kimono browser bookmarklet, highlight the page elements you need and provide some positive/negative examples to train the tool. After labeling all the data you can download it in CSV or JSON format or expose it as a web endpoint. The APIs created for your pages are stored in the cloud and you can run them on a schedule. So far, Kimono is free to use, with pro and enterprise solutions to be launched soon.

Pros: The tool works pretty fast and is great at scraping newsfeeds and prices. The data is rather accurate.

Cons: No page navigation is available, and you need to spend quite a lot of time training Kimono before it starts to pull out multi-item data accurately enough. In general, I’d say Kimono is more of an app mash-up creator than a full-scale web scraper.

Screen Scraper is pretty neat and tackles a lot of difficult tasks including navigation and precise data extraction; however, it requires a bit of programming/tokenization skill if you’d like to run it super smoothly. Launch the software, add a proxy, start recording the list of your actions and create extraction patterns (some coding required). It works great with HTML and JavaScript; however, you should test it with Citrix and other platforms. Basically, Screen Scraper helps you write simple web scraping scripts and lets you download the extracted data in txt/csv/excel format.

Pros: When set up correctly, there are no data extraction tasks Screen Scraper fails to handle.
Cons: The tool is pricey and you’ll have to go through the documentation and have basic coding skills to use it.

Source URL :  http://tech.co/4-web-scraping-tools-save-time-data-extraction-2015-03

Friday, 8 July 2016

ECJ clarifies Database Directive scope in screen scraping case

The European Court of Justice (ECJ) has clarified the scope of Directive 96/9/EC on the legal protection of databases (Database Directive) in a case concerning the extraction of data from a third party’s website by means of automated systems or software for commercial purposes (so called 'screen scraping').

Flight data extracted

The case, Ryanair Ltd vs. PR Aviation BV, C-30/14, is of interest to a range of companies such as price comparison websites. It stemmed from Dutch company PR Aviation’s operation of a website where consumers can search through the flight data of low-cost airlines (including Ryanair), compare prices and, on payment of a commission, book a flight. The relevant flight data is extracted from third parties’ websites by means of ‘screen scraping’ practices.

Ryanair claimed that PR Aviation’s activity:

• amounted to infringement of copyright (relating to the structure and architecture of the database) and of the so-called sui generis database right (i.e. the right granted to the ‘maker’ of the database where certain investments have been made to obtain, verify, or present the contents of a database) under the Netherlands law implementing the Database Directive;

• constituted breach of contract. In this respect, Ryanair claimed that a contract existed with PR Aviation for the use of its website. Access to the latter requires acceptance, by clicking a box, of the airline’s general terms and conditions which, amongst others, prohibit unauthorized ‘screen scraping’ practices for commercial purposes.

Ryanair asked Dutch courts to prohibit the infringement and order damages. In recent years the company has been engaged in several legal cases against web scrapers across Europe.

The Local Court, Utrecht, and the Court of Appeals of Amsterdam dismissed Ryanair’s claims on different grounds. The Court of Appeals, in particular, cited PR Aviation’s screen scraping of Ryanair’s website as amounting to a “normal use” of said website within the meaning of the lawful user exceptions under Sections 6 and 8 of the Database Directive, which cannot be derogated by contract (Section 15).

Ryanair appealed

Ryanair appealed the decision before the Netherlands Supreme Court (Hoge Raad der Nederlanden), which decided to refer the following question to the ECJ for a preliminary ruling: “Does the application of [Directive 96/9] also extend to online databases which are not protected by copyright on the basis of Chapter II of said directive or by a sui generis right on the basis of Chapter III, in the sense that the freedom to use such databases through the (whether or not analogous) application of Article[s] 6(1) and 8, in conjunction with Article 15 [of Directive 96/9] may not be limited contractually?”

The ECJ’s ruling

The ECJ (without the need of the opinion of the advocate general) ruled that the Database Directive is not applicable to databases which are not protected either by copyright or by the sui generis database right. Therefore, exceptions to restricted acts set forth by Sections 6 and 8 of the Directive do not prevent the database owner from establishing contractual limitations on its use by third parties. In other words, restrictions to the freedom to contract set forth by the Database Directive do not apply in cases of unprotected databases. Whether Ryanair’s website may be entitled to copyright or sui generis database right protection needs to be determined by the competent national court.

The ECJ’s decision is not particularly striking from a legal standpoint. Yet, it could have a significant impact on the business model of price comparison websites, aggregators, and similar businesses. Owners of databases that could not rely on intellectual property protection may contractually prevent extraction and use (“scraping”) of content from their online databases. Thus, unprotected databases could receive greater protection than the one granted by IP law.

Antitrust implications

However, the lawfulness of contractual restrictions prohibiting access and reuse of data through screen scraping practices should be assessed from an antitrust perspective. In this respect, in 2013 the Court of Milan ruled that Ryanair’s refusal to grant access to its database to the online travel agency Viaggiare S.r.l. amounted to an abuse of dominant position in the downstream market of information and intermediation on flights (decision of June 4, 2013 Viaggiare S.r.l. vs Ryanair Ltd). Indeed, a balance should be struck between the need to compensate the efforts and investments made by the creator of the database and the interest of third parties in being granted access to information (especially in those cases where the latter are not entitled to copyright protection).

Additionally, web scraping triggers other issues which have not been considered by the ECJ’s ruling. These include, but are not limited to trademark law (i.e., whether the use of a company’s names/logos by the web scraper without consent may amount to trademark infringement), data protection (e.g., in case the scraping involves personal data), or unfair competition.


Source URL :http://yellowpagesdatascraping.blogspot.in/2015/07/ecj-clarifies-database-directive-scope.html

Wednesday, 29 June 2016

An Easy Way For Data Extraction

There are many data scraping tools available on the internet. With these tools you can download large amounts of data without any stress. Over the past decade, the internet revolution has turned the entire world into an information center. You can obtain any type of information from the internet. However, if you want particular information for one task, you need to search many websites. If you want to download all the information from those websites, you need to copy the information and paste it into your documents. This is hectic work for anyone. Scraping tools save you time and money and reduce manual work.

A web data extraction tool extracts data from the HTML pages of different websites and compares it. Every day, many new websites are hosted on the internet. It is not possible to visit them all in a single day. With a data mining tool, you can cover far more web pages than you could view by hand. If you are using a wide range of applications, these scraping tools are very useful to you.

Data extraction software is used to compare structured data on the internet. Many search engines will help you find a website on a particular topic, but the data on different sites appears in different styles. A scraping tool helps you compare the data on different sites and structures it for your records.

Web crawler software is used to index web pages on the internet; it moves the data from the internet to your hard disk. With this done, you can browse the saved pages much faster. An important use of this tool is downloading data from the internet in off-peak hours. Downloading by hand takes a lot of time; with this tool you can download data from the internet at a fast rate.

Another tool for business people is the email extractor. With this tool, you can easily collect targeted customers' email addresses. You can then send advertisements for your product to the targeted customers at any time. It is the best tool for building a database of customers.

 Source  URL : http://ezinearticles.com/?An-Easy-Way-For-Data-Extraction&id=3517104