Bringing ethical awareness to data extraction practices

The internet is currently undergoing a phenomenon similar to the gold rushes of the early eighteenth century, specifically when it comes to data extraction. With data now dubbed the “new oil” by some analysts because of its value, the field is still open to small and large players alike. This has led to some unprofessional activities that extend all the way to the acquisition of password-protected data. Even in cases of easily extracted public data, ethics can come into question, particularly when the information is acquired without explicit permission.

While many websites employ defensive measures such as IP bans, the invisible conflicts between scrapers and servers are ongoing and gaining in intensity, driven by increased competition and economic factors. Most people don’t realise that these conflicts are taking place between e-commerce stores, even as they happily take advantage of the low prices found on aggregator websites like Expedia, Google Shopping, PriceGrabber and Skyscanner.

About the author

Julius Cerniauskas is CEO at Oxylabs

Ethical web scraping: the importance of intention

Tools can be used for positive and negative purposes, and web scraping is no exception. A fairly common scenario is the scraping of personal data for marketing purposes. Hundreds of millions of users agree to release their data through terms of service agreements on e-commerce sites, whether they realise it or not. The issue with the exposed data, however, is that it has been extracted by social media agencies and used by now-defunct websites that created profiles and listed personal details without user permission.

As a result, web scraping is increasingly the subject of negative press, which has raised public awareness of the value and privacy of personal data. There is nothing inherently unethical about web scraping, as it automates activities that people often do manually. Consider an individual who wants to buy a car and is researching different years, models and brands before purchasing. Entering those details into a spreadsheet for comparison and analysis is exactly what web scraping does. The main difference is that web scraping does it on a much bigger scale by using bots to crawl numerous websites and extract huge amounts of information in seconds.

Extracting publicly available data at scale typically relies on proxies. In short, proxies act as intermediaries between the web scraper and the web server. Routing requests through a pool of proxies allows them to be distributed evenly, so that data is requested from the web server at a fair rate, and it also gives the requesting party a degree of anonymity.
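As a rough illustration of what distributing requests evenly can look like, here is a minimal Python sketch that pairs each URL with the next proxy in round-robin order. The function name `assign_proxies`, the proxy addresses and the target URLs are all hypothetical placeholders, and the sketch only plans which proxy would carry which request; it does not send anything over the network.

```python
from itertools import cycle
from typing import Iterable, Iterator, Tuple


def assign_proxies(
    urls: Iterable[str], proxies: Iterable[str]
) -> Iterator[Tuple[str, str]]:
    """Pair each URL with the next proxy in round-robin order."""
    pool = list(proxies)
    if not pool:
        raise ValueError("at least one proxy is required")
    rotation = cycle(pool)  # endlessly repeats the pool in order
    for url in urls:
        yield url, next(rotation)


# Hypothetical placeholder values, for illustration only.
PROXIES = ["http://proxy-a.example:8080", "http://proxy-b.example:8080"]
URLS = [f"https://shop.example/item/{i}" for i in range(5)]

plan = list(assign_proxies(URLS, PROXIES))
for url, proxy in plan:
    print(f"{url} via {proxy}")
```

Round-robin is only one possible policy. It spreads load across the pool, but it does not by itself limit how often the target server is contacted, so it would normally be combined with a delay between requests.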

The consequences of unethical scraping

Unethical scraping uses data extraction in a way that may compromise privacy and result in server overload. As more businesses become aware of the importance of data, web scraping is increasing and the logical outcome of this is a rise in unethical activities.

While many websites try to prevent it through IP bans, this is becoming futile because proxies can circumvent such defences by making automated traffic resemble human behaviour. The end results can be server overloads that cost online businesses money, reduced internet transparency and greater public distrust over privacy.

A web scraping code of ethics is necessary

Web scraping has many benefits that depend upon the availability of a free and transparent internet. I believe it would benefit the entire tech space if we adopted a few guidelines in order to make the landscape fair for everyone:

  1. Scrape publicly available web pages only
  2. Study the target website’s legal documents to determine whether you will be legally accepting its terms of service and, if so, whether your scraping would breach those terms
  3. Make reasonable requests for data so that server function is not compromised (excessive request volumes can have the same effect as a DDoS attack)
  4. Respect the privacy concerns of source websites with regard to any data obtained
  5. Make use of proxies procured in an ethical manner
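
Guideline 3 can be made concrete with a simple per-host throttle. The Python sketch below enforces a minimum interval between successive requests to the same host. The class name `HostThrottle` and the two-second interval are illustrative assumptions, not an industry standard, and what counts as a reasonable rate depends on the target site.

```python
import time
from typing import Callable, Dict
from urllib.parse import urlsplit


class HostThrottle:
    """Enforce a minimum interval between requests to the same host."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Dict[str, float] = {}  # host -> time of last request

    def wait(self, url: str) -> float:
        """Block until the URL's host may be contacted; return the delay."""
        host = urlsplit(url).netloc
        now = self._clock()
        last = self._last.get(host)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self.min_interval - (now - last))
        if delay > 0:
            self._sleep(delay)
        # Record when this request is actually allowed to go out.
        self._last[host] = now + delay
        return delay


# Illustrative value only: at most one request every 2 seconds per host.
throttle = HostThrottle(min_interval=2.0)
```

A scraper would call `throttle.wait(url)` immediately before each request. The clock and sleep functions are injectable so that the behaviour can be checked without real waiting.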

Not all proxies are equal

It is commonly known that some proxies operating today are not ethically sourced, with many obtained through applications that people download onto their devices. Whether these individuals are aware that their device is being used is difficult to ascertain. What is certain is that it is not ethical to use a device as a proxy when its owner consented only to misleading or confusing terms of service that unwittingly turned the device into a participant in a residential proxy network.

Ethical practices lead to increased fairness and accountability

Some aspects of modern web scraping activity lack clarity, and a code of ethics is needed to bring order to the industry. If those in the industry can agree on a professional approach to web scraping, it will help to maintain a fair, open and free internet that will benefit both businesses and consumers. We are still in the early stages of discovering the full potential of data scraping in different industries, so let’s take advantage of this golden opportunity to drive innovation and create growth in the most ethical way possible.