As we shared in the first blog post in this series, continued investment in the anti-bot and anti-scraping space makes it increasingly hard for automated processes to access data on protected sites. For the most part, sites take one of three strategies to limit the efficacy of automated tools.
1. Sites are making navigation difficult.
- Regular changes to a website's content and structure are often designed to prevent web scraping. As these changes occur, they frequently require script updates, which can at times be large in scale.
- Dynamic content is a web page element that changes according to user data and behavior. This represents a challenge to web crawling bots, which are programmed to scrape static elements.
- Honeypots (or spoofing) are systems designed to attract crawlers and then block them from accessing websites, either by enticing them to extract incorrect data or sending them into an infinite loop of requests.
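To make the infinite-loop style of honeypot concrete, here is a minimal Python sketch of one common defensive pattern: a crawl that never revisits a URL and stops at a fixed depth. It is an illustration only, not a description of QL2's crawler. The names `bounded_crawl` and `fake_links`, the `/trap/N` URLs, and the depth and page limits are all assumptions made for this example.

```python
from collections import deque


def bounded_crawl(start, get_links, max_depth=3, max_pages=100):
    """Breadth-first traversal that skips visited URLs and caps depth and page count."""
    visited = {start}
    queue = deque([(start, 0)])
    order = []
    while queue and len(order) < max_pages:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # do not follow links any deeper than the cap
        for link in get_links(url):
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return order


# Hypothetical link graph: every /trap/N page links to /trap/N+1, without end.
def fake_links(url):
    if url == "/home":
        return ["/products", "/trap/0"]
    if url.startswith("/trap/"):
        n = int(url.rsplit("/", 1)[1])
        return [f"/trap/{n + 1}", "/home"]
    return []


pages = bounded_crawl("/home", fake_links, max_depth=3)
print(pages)
```

Without the depth cap, the `/trap/` chain would keep producing new URLs indefinitely; the visited set alone only protects against links that point back to pages already seen.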
2. Sites are adding intelligence to detect automated scraping.
CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) is a tool used to differentiate between real users and automated users, and one of the most difficult obstacles to overcome in web scraping. Below are some examples of the more difficult CAPTCHAs that websites have begun employing.
3. Sites are preventing scraping through blocking.
IP blocking through fingerprinting is one of the most common and easiest ways for sites to block web scraping traffic. By whitelisting or blacklisting specific IP addresses (or ranges of addresses), a site can prevent incoming requests from those sources from accessing it at all.
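As a rough illustration of how range-based blocking can work on the site's side, the Python sketch below checks a client address against a blocklist of networks using the standard `ipaddress` module. The blocklist contents and the `is_blocked` helper are assumptions for this example; the addresses come from ranges reserved for documentation, not from any real traffic.

```python
import ipaddress

# Hypothetical blocklist: one whole range and one single address.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.42/32"),
]


def is_blocked(client_ip, blocked=BLOCKED_NETWORKS):
    """Return True when client_ip falls inside any blocked network."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in blocked)


for ip in ("203.0.113.7", "198.51.100.42", "198.51.100.43"):
    print(ip, "blocked" if is_blocked(ip) else "allowed")
```

A whitelist is the same check inverted: requests are rejected unless the address falls inside one of the listed networks.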
So, how is acquiring competitive data for our customers different from what the bad bots are doing?
How QL2 Ensures Ethical Data Acquisition
With the bad players out there, it makes a lot of sense to protect your site, and it’s understandable why so much money is being invested in these technical innovations. Unfortunately, when we discuss bad or malicious bots, web scraping companies, like QL2, can unfairly get lumped into this category. Because of this, companies that are focused on sharing publicly available information or providing insights on the competitive landscape have to contend with the same technologies that are designed to stop players who are attempting to steal information or crash your website. This is what makes the web scraping industry so hard to navigate.
US courts have issued several rulings stating that there is nothing illegal about web scraping. The latest landmark ruling, issued in April 2022 by the Ninth Circuit Court of Appeals, “reaffirmed its original decision and found that scraping data that is publicly accessible on the internet is not a violation of the Computer Fraud and Abuse Act, or CFAA” (source: TechCrunch).
QL2 takes several steps to ensure that we can continue to provide the services and competitive data that our customers need to inform their business decisions. Legal precedents notwithstanding, we focus on maintaining the following unwritten rules and standards:
- We ONLY scrape publicly accessible data.
- QL2 does not and will not attempt to scrape or access anything that would be considered personally identifiable information (PII).
- We do everything possible to ensure smooth data access with as little impact on sites as possible.
- QL2 won’t use fake logins, access areas that require secure access, or request data behind paywalls.
- QL2 does not use botnets; we rely on our own systems and hardware to run scraping tools. We are not using malware or hijacking devices from unsuspecting users to run our network.
- When running spiders or crawling sites, we are very careful not to generate click fraud or traverse advertising links.
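Two of the rules above, limiting impact on sites and not traversing advertising links, lend themselves to a short sketch. The Python below is a hedged illustration of the general technique, not QL2's actual implementation: `HostThrottle`, `is_ad_link`, the `AD_HOSTS` set, and the two-second delay are all hypothetical names and values chosen for the example.

```python
import time
from urllib.parse import urlparse

# Hypothetical advertising hosts, for illustration only.
AD_HOSTS = {"ads.example.com", "tracker.example.net"}


def is_ad_link(url, ad_hosts=AD_HOSTS):
    """Return True when the link's host is a known advertising host."""
    host = urlparse(url).hostname or ""
    return host in ad_hosts


class HostThrottle:
    """Enforces a minimum delay between consecutive requests to the same host."""

    def __init__(self, min_delay=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self.clock = clock
        self.sleep = sleep
        self._last = {}  # host -> time of the most recent request

    def wait(self, url):
        """Sleep if the host was contacted too recently; return the delay applied."""
        host = urlparse(url).hostname or ""
        now = self.clock()
        last = self._last.get(host)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self.min_delay - (now - last))
        if delay > 0:
            self.sleep(delay)
        self._last[host] = now + delay
        return delay
```

A crawler built this way would call `throttle.wait(url)` before each request and drop any link for which `is_ad_link(url)` is true. The clock and sleep functions are injectable so the pacing logic can be checked without real waiting.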
Our goal is to continue to be a good citizen on the internet and limit our impact on the sites where we need competitive data. We will continue to focus on the ethical and legal precedents set within our industry.
Next week, we will wrap up our competitive data acquisition blog series by discussing more fingerprinting and some of the challenges that we deal with from an implementation standpoint.
For more information on the topics discussed in this blog series, check out our webinar, The Trials and Tribulations of Competitive Data Acquisition.
Written by: Jeremy Frank, SVP of Engineering