A few months ago I discussed with a friend the potential of personal data in online classified ads (OLX, Gumtree and the like). The conclusion was simple: it’s illegal to scrape that data and use it for business purposes (at least in the EU). But if I publish an offer online, how easy is it for someone to scrape my data and use it illegally? I audited 5 online marketplaces to check whether they protect our data from scraping. Below I describe how classified ads portals prevent scraping and whether somebody would be able to scrape phone numbers from the ads. Where possible, I also explain how it is done. I’ll often use the verbs scraping and crawling, both of which here stand for automatic data extraction from websites.
Illegal, but what if…
What if somebody crawls the data to use it against the law? As a user of online classified ads, I agree to display my phone number so that people can contact me about the offer. As soon as the offer is outdated, I remove it along with the phone number displayed on it. However, if the marketplace makes it possible to crawl the data, it’s easy for somebody to save a vast number of phone numbers and use them dishonestly later. Two dangers come to mind first:
- Call centers. It’s illegal to use such data for marketing purposes; however, it’s hard to determine (in Poland) how a telemarketer got your phone number. He says that your number was drawn by a computer and refuses to provide details about the company he represents. That makes it hard to prove misuse if the number came from an online classified ads marketplace.
- Phishing. Someone can pretend to call from a bank and ask for more personal details, like your personal ID. It sounds suspicious, but some people still won’t suspect a scam.
Obviously, scammers can be much more creative.
Protecting personal data in online classified ads
Online classified ads have solid armor to protect our data. They can use the robots.txt file to tell bots what they are allowed to do on the website. However, the most important protections are:
- Don’t display personal data.
People can contact you only through a form. Besides that, you are not allowed to put your contact data in the advertisement’s description. This is usually used to avoid displaying an email address.
- Hide phone number behind XXX.
In fact, the number is not placed in the HTML code. Instead, it’s displayed after hitting the ‘Display’ button (hitting the button sends a request, which returns the phone number as a result). What’s the advantage? (Python) scraping tools like BS4 or Scrapy are not able to click, so they can only crawl information located in the HTML code.
Many of you know the reCAPTCHA “I’m not a robot” checkbox. But in 2017 Google released invisible reCAPTCHA, which doesn’t bother the user. It analyses the user’s interaction and determines whether you are a bot. If so, it can restrict you, e.g. by forbidding you to click any button. From my experience, it’s the most powerful protection against scraping.
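To make the robots.txt part concrete: the file is only a set of instructions that well-behaved bots choose to follow, so it informs rather than blocks. Below is a minimal sketch of how a polite crawler could check it with Python’s standard urllib.robotparser module. The robots.txt content, the example.com URLs and the is_allowed helper are made up for illustration and don’t come from any of the audited sites.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, not copied from any real marketplace.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /ogloszenie/
Allow: /
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Offer pages are off limits for bots, the home page is not.
print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/ogloszenie/sukienka.html"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/"))
```

A scraper that never runs such a check is not stopped by robots.txt at all, which is why the protections listed above matter more.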
I’ve performed an audit of 5 Polish online classified ads marketplaces to check whether anybody could scrape personal data easily.
I chose them based on the first hits in Google. I focused on scraping phone numbers, as emails are usually hidden behind a contact form. I also measured how long it takes to crawl the data. Based on the crawl results, I split the marketplaces into three categories: unscrapable, unworthy scraping and scrapable. Below you can see the results of the audit. I’ll start with the least protected websites 🙂
    from bs4 import BeautifulSoup
    import requests
    import json

    # Put the offer URL here
    r = requests.get("https://sprzedajemy.pl/honda-vtx-1300-tomiczki-2-6bea9f-nr57262583")
    soup = BeautifulSoup(r.text, "html.parser")

    # Try to read every <script> tag as JSON and look for the phone number in it
    for script in soup.find_all("script"):
        try:
            d = json.loads(script.text)
            print(d["owns"]["telephone"])
        except (TypeError, ValueError, KeyError):
            # Not JSON, or JSON without the expected keys: skip this tag
            pass
To scrape many pages, merely find all links to the offers and run the script above for each of them. I deliberately won’t provide the full code, to make scraping harder for people who are not familiar with programming but are willing to harvest personal data.
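The weakness here is simply that the full number is present in the page source. The sketch below is a rough way to check a single page for that, assuming that numbers are written as 9 digits, optionally grouped in threes or prefixed with +48. The pattern, the find_phone_numbers helper and both HTML snippets are illustrative and not taken from the audited sites.

```python
import re
from typing import List

# Assumption: 9-digit numbers, optionally grouped in threes, optional +48 prefix.
# Real ads can use other formats, so treat this as a heuristic only.
PHONE_RE = re.compile(r"(?<!\d)(?:\+48[ -]?)?\d{3}[ -]?\d{3}[ -]?\d{3}(?!\d)")


def find_phone_numbers(html: str) -> List[str]:
    """Return phone-like strings found anywhere in the raw HTML, separators removed."""
    return [re.sub(r"[ -]", "", match) for match in PHONE_RE.findall(html)]


# Made-up page sources: one leaks the number in embedded JSON, one masks it.
exposed = '<script>{"owns": {"telephone": "600 123 456"}}</script>'
masked = '<span id="phone">600 XXX XXX</span>'

print(find_phone_numbers(exposed))
print(find_phone_numbers(masked))
```

If a check like this finds a number in the raw source, hiding it on the screen doesn’t protect it from a crawler.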
On 1 out of 5 audited websites (Krakowlokalnie) I was able to crawl phone numbers, but it’s not trivial and I believe nobody would make the effort.
In this case, some digits of the phone number were also replaced with X on the page, but the phone number wasn’t located in the HTML code. To obtain the full number, the “show” button must be pressed, and standard tools like Scrapy or BS4 can’t cope with that. However, it can still be scraped using Selenium. Selenium is a tool designed for developers to test the behavior of a website; it’s able to ‘press’ buttons and read the results from the page. I used the following code to scrape data from Krakowlokalnie.pl:
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    import time

    browser = webdriver.Firefox()
    browser.get("http://krakowlokalnie.pl/ogloszenie/sukienka,16614.html#")
    time.sleep(0.1)

    # 'Press' the button that reveals the full phone number
    elem2 = browser.find_element(By.ID, "showphone")
    elem2.click()
    time.sleep(20)  # give the page time to load the number

    phone = browser.find_element(By.ID, "phone")
    print("Phone: ", phone.get_attribute("outerHTML"))
    print("Done")
    browser.quit()
Why did I call it ‘unworthy scraping’? It’s very slow: crawling one page took me ~25 seconds at night, and during the day I got a Runtime error, which means that retrieving the data took longer than the default maximum time.
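To put that speed into perspective, here is a back-of-the-envelope sketch. The time_call and estimate_hours helpers and the figure of 10,000 offers are assumptions for illustration; only the ~25 seconds per page comes from the measurement above.

```python
import time
from typing import Callable


def time_call(fetch: Callable[[], object]) -> float:
    """Run fetch() once and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    fetch()
    return time.perf_counter() - start


def estimate_hours(seconds_per_page: float, pages: int) -> float:
    """Total crawl time in hours when pages are fetched one after another."""
    return seconds_per_page * pages / 3600


# ~25 s per page is the measured figure; 10,000 offers is a made-up site size.
hours = estimate_hours(25, 10_000)
print(f"{hours:.1f} hours")  # prints "69.4 hours"
```

At that rate a sequential crawl of 10,000 offers would need almost three days, and that is before counting the runs that fail with a timeout.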
The last of the 5 audited sites, OLX, remains unscrapable. It literally means that it’s impossible to crawl the data from it. OLX not only hides the phone number, like the unworthy scraping websites, but also uses invisible reCAPTCHA, which is able to detect whether I’m a human or a crawler. In the beginning, I didn’t notice reCAPTCHA, as the Selenium scraper constantly returned a Runtime error (I thought it was due to a slow network connection). After several attempts on a very poor internet connection, the page displayed the message “Failed to load Google reCAPTCHA. Please check the internet connection and reload this page”.
I expected that I wouldn’t be able to crawl personal data from online advertising websites; however, 3 out of 5 sites were easy to crawl. This makes it possible to save a vast number of phone numbers and misuse them later. I guess few people would misuse the phone numbers, given the prospective legal consequences. Nevertheless, if I leave my phone number in an offer, I expect the website provider to protect my personal data from misuse.
What do you think about it? Do you have any experience with phone or email abuse? Or maybe I’m worrying too much? Feel free to comment!