Yahoo Local data scraping

Saturday, 11 July 2015

Mobile app developers “duped” into distributing data-scraping malware: NICTA

The surge in mobile malware has led many to condemn developers' poor security practices, yet recent NICTA research suggests that – even though data-stealing is ubiquitous among both paid and free Android applications – many mobile application developers are in fact being “duped” into incorporating data-stealing routines into their applications.

A methodical analysis of Android applications and source code found that all of the top 100 paid and non-paid apps in Australia were collecting personal information, with 60 percent of the apps incorporating some sort of tracking library and 20 percent of the apps featuring more than three different tracking libraries.

While many have blamed developers for their poor security, NICTA mobile systems research group leader, Aruna Seneviratne, who leads the organisation's Networks Research Group, told CSO Australia that many tracking libraries were inadvertently added when developers incorporated third-party libraries into their mobile apps.

“In most cases app developers just use third-party libraries and don't know what's in them,” he said. “They're not being malicious for the sake of being malicious; they are just being duped into doing a thing that collects a lot of information.”

And collect they do. Apps analysed by the team – whose paper 'early detection of spam mobile apps' was accepted for presentation at the recent WWW 2015 conference in Florence, Italy – were siphoning all kinds of personal information off of users' mobile devices, often sending it to enlarge what have become massive databases of personal preferences and behavioural modeling.

“It's amazing how much information each of those apps collects,” he said, “and the scary thing is that most of them actually go to a small number of sources – which means these guys can actually infer a lot of information about you. They have a very good idea of who you are and what you're doing – and they are cross-matching the information they collect.”

Ever more-clever data-siphoning routines were making data collection richer all the time, with many Android apps now being designed with libraries that collect information about nearby Wi-Fi access points and can correctly extrapolate the user's location 90 percent of the time.

Read more: The week in security: Android apps collecting your location data, home routers hit by drive-by malware

Seneviratne blamed Google's relatively lax app-approval process for the proliferation of such apps, which join the malware-laden apps that by the team's figures account for around 3 percent of all Google Play Store apps.

Recognising that developers are often as clueless as users about the extent of the data collection going on, the team has proposed an app-rating system that will give consumers a better idea of what they're enabling by downloading and installing a particular app.

A basic prototype has already been developed and a pilot site is expected to be up and running by the fourth quarter of this year. The service, which rates apps on criteria such as privacy and security, will be available to third parties as a Web service that Seneviratne hopes will eventually help it gain traction on app-rating and other sites.

Read more: Surveillance laws driving companies to limit data collection, developers to boost security

“We've been working to come up with a scheme that is similar to the energy-ratings system that you have for electrical appliances,” he said, noting that the site will also seek to boost developers' security awareness by correlating app ratings “to let consumers know they can download an alternate app that has the same functionality but a higher security rating”.

Israeli developer-tools firm Checkmarx has taken its own approach to improving developers' security skills, recently learning extensive lessons as hackers worked to manipulate its Game of Hacks security application – which is now under development to be sold to large corporates for developer training and testing.

This article is brought to you by Enex TestLab, content directors for CSO Australia.

Read more: The week in security: Budget flags encryption troubles, cross-government IAM

Feeling social? Follow us on Twitter and LinkedIn Now!

Read More:

Victorian Commissioner for Privacy and Data Protection sorts sheep from the goats

Better than email: VISA launches FireEye threat intel platform for merchants

Source: http://www.cso.com.au/article/576533/mobile-app-developers-duped-into-distributing-data-scraping-malware-nicta/

Saturday, 27 June 2015

Data Scraping - Enjoy the Appeal of the Hand Scraped Flooring

Hand scraped flooring is appreciated for the character it brings into the home. This style of flooring relies on hand scraped planks of wood and not the precise milled boards. The irregularities in the planks provide a certain degree of charm and help to create a more unique feature in the home.

Distressed vs. Hand scraped

There are two types of flooring in the market that have an aged and unique charm with a non perfect finish. However, there is a significant difference in the process used to manufacture the planks. The more standard distresses flooring is cut on a factory production line. The grooves, scratches, dents, or other irregularities in these planks are part of the manufacturing process and achieved by rolling or pressed the wood onto a patterned surface.

The real hand scraped planks are made by craftsmen and they work on each plant individually. By using this working technique, there is complete certainty that each plank will be unique in appearance.

Scraping the planks

The hand scraping process on the highest-quality planks is completed by the trained carpenter or craftsmen who will produce a high-quality end product and take great care in their workmanship. It can benefit to ask the supplier of the flooring to see who completes the work.

Beside the well scraped lumber, there are also those planks that have been bought from the less than desirable sources. This is caused by the increased demand for this type of flooring. At the lower end of the market the unskilled workers are used and the end results aren't so impressive.

The high-quality plank has the distinctive look that feels and functions perfectly well as solid flooring, while the low-quality work can appear quite ugly and cheap.

Even though it might cost a little bit more, it benefits to source the hardwood floor dealers that rely on the skilled workers to complete the scraping process.

Buying the right lumber

Once a genuine supplier is found, it is necessary to determine the finer aspects of the wooden flooring. This hand scraped flooring is available in several hardwoods, such as oak, cherry, hickory, and walnut. Plus, it comes in many different sizes and widths. A further aspect relates to the finish with darker colored woods more effective at highlighting the character of the scraped boards. This makes the shadows and lines appear more prominent once the planks have been installed at home.

Why not visit Bellacerafloors.com for the latest collection of luxury floor materials, including the Handscraped Hardwood Flooring.

Source: http://ezinearticles.com/?Enjoy-the-Appeal-of-the-Hand-Scraped-Flooring&id=8995784

Saturday, 20 June 2015

Migrating Table-oriented Web Scraping Code to rvest w/XPath & CSS Selector Examples

My intrepid colleague (@jayjacobs) informed me of this (and didn’t gloat too much). I’ve got a “pirate day” post coming up this week that involves scraping content from the web and thought folks might benefit from another example that compares the “old way” and the “new way” (Hadley excels at making lots of “new ways” in R :-) I’ve left the output in with the code to show that you get the same results.

The following shows old/new methods for extracting a table from a web site, including how to use either XPath selectors or CSS selectors in rvest calls. To stave of some potential comments: due to the way this table is setup and the need to extract only certain components from the td blocks and elements from tags within the td blocks, a simple readHTMLTable would not suffice.

The old/new approaches are very similar, but I especially like the ability to chain output ala magrittr/dplyr and not having to mentally switch gears to XPath if I’m doing other work targeting the browser (i.e. prepping data for D3).

The code (sans output) is in this gist, and IMO the rvest package is going to make working with web site data so much easier.

library(XML)
library(httr)
library(rvest)
library(magrittr)

# setup connection & grab HTML the "old" way w/httr

freak_get <- GET("http://torrentfreak.com/top-10-most-pirated-movies-of-the-week-130304/")

freak_html <- htmlParse(content(freak_get, as="text"))

# do the same the rvest way, using "html_session" since we may need connection info in some scripts

freak <- html_session("http://torrentfreak.com/top-10-most-pirated-movies-of-the-week-130304/")

# extracting the "old" way with xpathSApply

xpathSApply(freak_html, "//*/td[3]", xmlValue)[1:10]

## [1] "Silver Linings Playbook "           "The Hobbit: An Unexpected Journey " "Life of Pi (DVDscr/DVDrip)"

## [4] "Argo (DVDscr)"                      "Identity Thief "                    "Red Dawn "

## [7] "Rise Of The Guardians (DVDscr)"     "Django Unchained (DVDscr)"          "Lincoln (DVDscr)"

## [10] "Zero Dark Thirty "

xpathSApply(freak_html, "//*/td[1]", xmlValue)[2:11]

## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"

xpathSApply(freak_html, "//*/td[4]", xmlValue)

## [1] "7.4 / trailer" "8.2 / trailer" "8.3 / trailer" "8.2 / trailer" "8.2 / trailer" "5.3 / trailer" "7.5 / trailer"

## [8] "8.8 / trailer" "8.2 / trailer" "7.6 / trailer"

xpathSApply(freak_html, "//*/td[4]/a[contains(@href,'imdb')]", xmlAttrs, "href")

##                                    href                                    href                                    href

## "http://www.imdb.com/title/tt1045658/" "http://www.imdb.com/title/tt0903624/" "http://www.imdb.com/title/tt0454876/"

##                                    href                                    href                                    href

## "http://www.imdb.com/title/tt1024648/" "http://www.imdb.com/title/tt2024432/" "http://www.imdb.com/title/tt1234719/"

##                                    href                                    href                                    href

## "http://www.imdb.com/title/tt1446192/" "http://www.imdb.com/title/tt1853728/" "http://www.imdb.com/title/tt0443272/"

##                                    href

## "http://www.imdb.com/title/tt1790885/?"

# extracting with rvest + XPath

freak %>% html_nodes(xpath="//*/td[3]") %>% html_text() %>% .[1:10]

## [1] "Silver Linings Playbook "           "The Hobbit: An Unexpected Journey " "Life of Pi (DVDscr/DVDrip)"

## [4] "Argo (DVDscr)"                      "Identity Thief "                    "Red Dawn "

## [7] "Rise Of The Guardians (DVDscr)"     "Django Unchained (DVDscr)"          "Lincoln (DVDscr)"

## [10] "Zero Dark Thirty "

freak %>% html_nodes(xpath="//*/td[1]") %>% html_text() %>% .[2:11]

## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"

freak %>% html_nodes(xpath="//*/td[4]") %>% html_text() %>% .[1:10]

## [1] "7.4 / trailer" "8.2 / trailer" "8.3 / trailer" "8.2 / trailer" "8.2 / trailer" "5.3 / trailer" "7.5 / trailer"

## [8] "8.8 / trailer" "8.2 / trailer" "7.6 / trailer"

freak %>% html_nodes(xpath="//*/td[4]/a[contains(@href,'imdb')]") %>% html_attr("href") %>% .[1:10]

## [1] "http://www.imdb.com/title/tt1045658/" "http://www.imdb.com/title/tt0903624/"

## [3] "http://www.imdb.com/title/tt0454876/" "http://www.imdb.com/title/tt1024648/"

## [5] "http://www.imdb.com/title/tt2024432/" "http://www.imdb.com/title/tt1234719/"

## [7] "http://www.imdb.com/title/tt1446192/" "http://www.imdb.com/title/tt1853728/"

## [9] "http://www.imdb.com/title/tt0443272/" "http://www.imdb.com/title/tt1790885/?"

# extracting with rvest + CSS selectors

freak %>% html_nodes("td:nth-child(3)") %>% html_text() %>% .[1:10]

## [1] "Silver Linings Playbook "           "The Hobbit: An Unexpected Journey " "Life of Pi (DVDscr/DVDrip)"

## [4] "Argo (DVDscr)"                      "Identity Thief "                    "Red Dawn "

## [7] "Rise Of The Guardians (DVDscr)"     "Django Unchained (DVDscr)"          "Lincoln (DVDscr)"

## [10] "Zero Dark Thirty "

freak %>% html_nodes("td:nth-child(1)") %>% html_text() %>% .[2:11]

## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"

freak %>% html_nodes("td:nth-child(4)") %>% html_text() %>% .[1:10]

## [1] "7.4 / trailer" "8.2 / trailer" "8.3 / trailer" "8.2 / trailer" "8.2 / trailer" "5.3 / trailer" "7.5 / trailer"

## [8] "8.8 / trailer" "8.2 / trailer" "7.6 / trailer"

freak %>% html_nodes("td:nth-child(4) a[href*='imdb']") %>% html_attr("href") %>% .[1:10]

## [1] "http://www.imdb.com/title/tt1045658/" "http://www.imdb.com/title/tt0903624/"

## [3] "http://www.imdb.com/title/tt0454876/" "http://www.imdb.com/title/tt1024648/"

## [5] "http://www.imdb.com/title/tt2024432/" "http://www.imdb.com/title/tt1234719/"

## [7] "http://www.imdb.com/title/tt1446192/" "http://www.imdb.com/title/tt1853728/"

## [9] "http://www.imdb.com/title/tt0443272/" "http://www.imdb.com/title/tt1790885/?"

# building a data frame (which is kinda obvious, but hey)

data.frame(movie=freak %>% html_nodes("td:nth-child(3)") %>% html_text() %>% .[1:10],

           rank=freak %>% html_nodes("td:nth-child(1)") %>% html_text() %>% .[2:11],

           rating=freak %>% html_nodes("td:nth-child(4)") %>% html_text() %>% .[1:10],

           imdb.url=freak %>% html_nodes("td:nth-child(4) a[href*='imdb']") %>% html_attr("href") %>% .[1:10],

           stringsAsFactors=FALSE)

##                                 movie rank        rating                              imdb.url

## 1            Silver Linings Playbook     1 7.4 / trailer http://www.imdb.com/title/tt1045658/

## 2 The Hobbit: An Unexpected Journey     2 8.2 / trailer http://www.imdb.com/title/tt0903624/

## 3          Life of Pi (DVDscr/DVDrip)    3 8.3 / trailer http://www.imdb.com/title/tt0454876/

## 4                       Argo (DVDscr)    4 8.2 / trailer http://www.imdb.com/title/tt1024648/

## 5                     Identity Thief     5 8.2 / trailer http://www.imdb.com/title/tt2024432/

## 6                           Red Dawn     6 5.3 / trailer http://www.imdb.com/title/tt1234719/

## 7      Rise Of The Guardians (DVDscr)    7 7.5 / trailer http://www.imdb.com/title/tt1446192/

## 8           Django Unchained (DVDscr)    8 8.8 / trailer http://www.imdb.com/title/tt1853728/

## 9                    Lincoln (DVDscr)    9 8.2 / trailer http://www.imdb.com/title/tt0443272/

## 10                  Zero Dark Thirty    10 7.6 / trailer http://www.imdb.com/title/tt1790885/?

Source: http://www.r-bloggers.com/migrating-table-oriented-web-scraping-code-to-rvest-wxpath-css-selector-examples/

Tuesday, 9 June 2015

Web Scraping Services : Data Discovery vs. Data Extraction

Looking at screen-scraping at a simplified level, there are two primary stages involved: data discovery and data extraction. Data discovery deals with navigating a web site to arrive at the pages containing the data you want, and data extraction deals with actually pulling that data off of those pages. Generally when people think of screen-scraping they focus on the data extraction portion of the process, but my experience has been that data discovery is often the more difficult of the two.

The data discovery step in screen-scraping might be as simple as requesting a single URL. For example, you might just need to go to the home page of a site and extract out the latest news headlines. On the other side of the spectrum, data discovery may involve logging in to a web site, traversing a series of pages in order to get needed cookies, submitting a POST request on a search form, traversing through search results pages, and finally following all of the "details" links within the search results pages to get to the data you're actually after. In cases of the former a simple Perl script would often work just fine. For anything much more complex than that, though, a commercial screen-scraping tool can be an incredible time-saver. Especially for sites that require logging in, writing code to handle screen-scraping can be a nightmare when it comes to dealing with cookies and such.

In the data extraction phase you've already arrived at the page containing the data you're interested in, and you now need to pull it out of the HTML. Traditionally this has typically involved creating a series of regular expressions that match the pieces of the page you want (e.g., URL's and link titles). Regular expressions can be a bit complex to deal with, so most screen-scraping applications will hide these details from you, even though they may use regular expressions behind the scenes.

As an addendum, I should probably mention a third phase that is often ignored, and that is, what do you do with the data once you've extracted it? Common examples include writing the data to a CSV or XML file, or saving it to a database. In the case of a live web site you might even scrape the information and display it in the user's web browser in real-time. When shopping around for a screen-scraping tool you should make sure that it gives you the flexibility you need to work with the data once it's been extracted.

Source: http://ezinearticles.com/?Data-Discovery-vs.-Data-Extraction&id=165396

Wednesday, 3 June 2015

WordPress Titles: scraping with search url

I’ve blogged for a few years now, and I’ve used several tools along the way. zachbeauvais.com began as a Drupal site, until I worked out that it’s a bit overkill, and switched to WordPress. Recently, I’ve been toying with the idea of using a static site generator (a lá Jekyll or Hyde), or even pulling together a kind of ebook of ramblings. I also want to be able to arrange the posts based on the keywords they contain, regardless of how they’re categorised or tagged.

Whatever I wanted to do, I ended up with a single point of messiness: individual blog posts, and how they’re formatted. When I started, I seem to remember using Drupal’s truly awful WYSIWYG editor, and tweaking the HTML soup it produced. Then, when I moved over to WordPress, it pulled all the posts and metadata through via RSS, and I tweaked with the visual and text tools which are baked into the engine.

A couple years ago, I started to write in Markdown, and completely apart from the blog (thanks to full-screen writing and loud music). This gives me a local .md file, and I copy/paste into WordPress using a plugin to get rid of the visual editor entirely.

So, I wrote a scraper to return a list of blog posts containing a specific term. What I hope is that this very simple scraper is useful to others—WordPress is pretty common, after all—and to get some ideas for improving it, and handle post content. If you haven’t used ScraperWiki before, you might not know that you can see the raw scraper by clicking “view source” from the scraper’s overview page (or going here if you’re lazy).

This scraper is based on WordPress’ built-in search, which can be used by passing the search terms to a url, then scraping the resulting page:

http://zachbeauvais.com/?s=search_term&submit=Search

The scraper uses three Python libraries:

    Requests
    ScraperWiki
    lxml.html

There are two variables which can be changed to search for other terms, or using a different WordPress site:

term = "coffee"

site = "http://www.zachbeauvais.com"

The rest of the script is really simple: it creates a dictionary called “payload” containing the letter “s”, the keyword, and the instruction to search. The “s” is in there to make up the search url: /?s=coffee …

Requests then GETs the site, passing payload as url parameters, and I use Request’s .text function to render the page in html, which I then pass through lxml to the new variable “root”.

payload = {'s': str(term), 'submit': 'Search'}

r = requests.get(site, params=payload) # This'll be the results page

html = r.text

root = lxml.html.fromstring(html) # parsing the HTML into the var root

Now, my WordPress theme renders the titles of the retrieved posts in <h1> tags with the CSS class “entry-title”, so I loop through the html text, pulling out the links and text from all the resulting h1.entry-title items. This part of the script would need tweaking, depending on the CSS class and h-tag your theme uses.

for i in root.cssselect("h1.entry-title a"):

    link = i.cssselect("a")

    text = i.text_content()

    data = {

        'uri': link[0].attrib['href'],

        'post-title': str(text),

        'search-term': str(term)

    }

    if i is not None:

        print link

        print text

        print data

        scraperwiki.sqlite.save(unique_keys=['uri'], data=data)

    else:

        print "No results."

These return into an sqlite database via the ScraperWiki library, and I have a resulting database with the title and link to every blog post containing the keyword.

So, this could, in theory, run on any WordPress instance which uses the same search pattern URL—just change the site variable to match.

Also, you can run this again and again, changing the term to any new keyword. These will be stored in the DB with the keyword in its own column to identify what you were looking for.

See? Pretty simple scraping.

So, what I’d like next is to have a local copy of every post in a single format.

Has anyone got any ideas how I could improve this? And, has anyone used WordPress’ JSON API? It might be a logical next step to call the API to get the posts directly from the MySQL DB… but that would be a new blog post!

Source: https://scraperwiki.wordpress.com/2013/03/11/wordpress-titles-scraping-with-search-url/

Friday, 29 May 2015

Data Scraping Services - Scraping Yelp Business Data With Python Scraping Script

Yelp is a great source of business contact information with details like address, postal code, contact information; website addresses etc. that other site like Google Maps just does not. Yelp also provides reviews about the particular business. The yelp business database can be useful for telemarketing, email marketing and lead generation.

Are you looking for yelp business details database? Are you looking for scraping data from yelp website/business directory? Are you looking for yelp screen scraping software? Are you looking for scraping the business contact information from the online Yelp? Then you are at the right place.

Here I am going to discuss how to scrape yelp data for lead generation and email marketing. I have made a simple and straight forward yelp data scraping script in python that can scrape data from yelp website. You can use this yelp scraper script absolutely free.

I have used urllib, BeautifulSoup packages. Urllib package to make http request and parsed the HTML using BeautifulSoup, used Threads to make the scraping faster.

Yelp Scraping Python Script

import urllib from bs4 import BeautifulSoup import re from threading import Thread #List of yelp urls to scrape url=['http://www.yelp.com/biz/liman-fisch-restaurant-hamburg','http://www.yelp.com/biz/casa-franco-caramba-hamburg','http://www.yelp.com/biz/o-ren-ishii-hamburg','http://www.yelp.com/biz/gastwerk-hotel-hamburg-hamburg-2','http://www.yelp.com/biz/superbude-hamburg-2','http://www.yelp.com/biz/hotel-hafen-hamburg-hamburg','http://www.yelp.com/biz/hamburg-marriott-hotel-hamburg','http://www.yelp.com/biz/yoho-hamburg'] i=0 #function that will do actual scraping job def scrape(ur): html = urllib.urlopen(ur).read() soup = BeautifulSoup(html) title = soup.find('h1',itemprop="name") saddress = soup.find('span',itemprop="streetAddress") postalcode = soup.find('span',itemprop="postalCode") print title.text print saddress.text print postalcode.text print "-------------------" threadlist = [] #making threads while i<len(url): t = Thread(target=scrape,args=(url[i],)) t.start() threadlist.append(t) i=i+1 for b in
threadlist: b.join()

import urllib

from bs4 import BeautifulSoup

import re

from threading import Thread

#List of yelp urls to scrape

url=['http://www.yelp.com/biz/liman-fisch-restaurant-hamburg','http://www.yelp.com/biz/casa-franco-caramba-hamburg','http://www.yelp.com/biz/o-ren-ishii-hamburg','http://www.yelp.com/biz/gastwerk-hotel-hamburg-hamburg-2','http://www.yelp.com/biz/superbude-hamburg-2','http://www.yelp.com/biz/hotel-hafen-hamburg-hamburg','http://www.yelp.com/biz/hamburg-marriott-hotel-hamburg','http://www.yelp.com/biz/yoho-hamburg']

i=0

#function that will do actual scraping job

def scrape(ur):

           html = urllib.urlopen(ur).read()

          soup = BeautifulSoup(html)

       title = soup.find('h1',itemprop="name")

          saddress = soup.find('span',itemprop="streetAddress")

          postalcode = soup.find('span',itemprop="postalCode")

          print title.text

          print saddress.text

          print postalcode.text

          print "-------------------"

threadlist = []

#making threads

while i<len(url):

          t = Thread(target=scrape,args=(url[i],))

          t.start()

          threadlist.append(t)

          i=i+1

for b in threadlist:

          b.join()

Recently I had worked for one German company and did yelp scraping project for them and delivered data as per their requirement. If you looking for scraping data from business directories like yelp then send me your requirement and I will get back to you with sample.

Source: http://webdata-scraping.com/scraping-yelp-business-data-python-scraping-script/

Tuesday, 26 May 2015

Data Mining Services

Data Mining Services, through its data mining services can mine required data for you from any of the available sources. Over the years, we have successfully catered to wide variety of outsource data mining requirements, which specifies our competency in dealing with your data mining requirements.

Based on your requirements, we can mine data from your preferred data sources, or we will use our own reliable sources to mine the data required by you. We have been using automated as well manual data mining strategies to deliver superior data mining services.

Types of data mining services delivered by us

With an extensive variety of data mining services provided by us, you will definitely be able to find the most perfect service package to cater to your requirements. Below listed are just some of the data mining services offered by us:

•    Web data mining
•    Data extraction
•    Data capture
•    Data gathering
•    Collection of required data
•    Validation of data

Outsource data mining requirements to us, and we are sure that the data mining India unit of Hi-Tech BPO Services will be able to formulate the most appropriate and cost effective solutions to include your entire requirements.

Highlights of our data mining services:

•    Most affordable rates
•    Dedicated data mining India unit
•    Latest data mining technologies used to mine all required data
•    Data will be mined, gathered, processed and validated as per your requirements
•    Mined data can be directly included into your database

Competitive advantage of using our data mining services

To mine accurate and relevant data, some level of internet knowledge is essential. And it would also consume a lot of your valuable time. With our data mining services, we will take care of all your data mining tasks, while you look after your business and its core functions.

The affordably priced data mining services delivered by the data mining India unit will also help you to save considerable amount of your money, which you can put into more productive purposes.

Source: http://www.hitechbposervices.com/data-mining.php