What are the key factors when choosing a web scraping proxy? What is proxy diversity? Do proxies actually have a reputation? Here’s a quick guide!
What Is an IP Address and How It Works
An Internet Protocol address (IP address) is a numerical label, such as 192.0.2.1, assigned to a network interface that uses the Internet Protocol for communication.
An IPv4 address is a 32-bit number, usually read and written in “dot-decimal” notation, which splits the 32 bits into 4 octets (8 bits each) separated by dots.
Because of the rising number of devices connected to the internet, the IPv6 protocol increases the address size from the 32 bits of IPv4 to 128 bits.
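As a small illustration of the dot-decimal notation, here is a minimal sketch that uses Python's standard `ipaddress` module to turn an IPv4 address into its underlying integer and back into four octets. The helper names `to_octets` and `to_dotted` are made up for this example.

```python
import ipaddress


def to_octets(ip: str) -> list[int]:
    """Split an IPv4 address into its four 8-bit octets."""
    as_int = int(ipaddress.IPv4Address(ip))  # the address as one 32-bit integer
    return [(as_int >> shift) & 0xFF for shift in (24, 16, 8, 0)]


def to_dotted(octets: list[int]) -> str:
    """Join four octets back into dot-decimal notation."""
    return ".".join(str(octet) for octet in octets)


octets = to_octets("192.0.2.1")
print(octets)             # the four octets of the address
print(to_dotted(octets))  # back to dot-decimal notation

# Address sizes in bits for the two protocol versions.
print(ipaddress.IPv4Address("192.0.2.1").max_prefixlen)
print(ipaddress.IPv6Address("2001:db8::1").max_prefixlen)
```

Both example addresses come from ranges reserved for documentation, so they are safe to use in snippets like this one.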
When a device connects to the internet, the Internet Service Provider assigns it a free IP address, chosen from the ranges that one of the five Regional Internet Registries has allocated to the ISP.
What Is Rayobyte
Rayobyte is a proxy vendor that sells data center, residential and mobile proxies that can be used for scraping the web. Describing the proxy business, Neil said that every potential user of proxies should be aware of two key aspects:
- diversity, in terms of IPs located in different subnets
- reputation of the IPs
Let’s see them in detail and why we should be careful about them.
Web Scraping Proxy Diversity Is Key
One thing every web scraper developer knows well is that we cannot make too many requests from the same IP address within a given timeframe, otherwise we get blocked.
That’s the main reason why proxy providers are used when it comes to web scraping.
But I didn’t know that some large websites heavily targeted by bots, like Google or Amazon, would temporarily ban not only your IP address but also the other 255 IP addresses in your subnet.
Let’s say Amazon allows 2000 requests per hour from a certain subnet.
It means that from IP 192.0.2.10 I can make 2000 requests in one hour before getting blocked. And not only will my IP be blocked: all the IPs from 192.0.2.0 to 192.0.2.255 will be blocked from requesting data from Amazon.
But this also means that if I make 1000 requests from 192.0.2.10 and 1000 from 192.0.2.20, then all the addresses from 192.0.2.0 to 192.0.2.255 will be blocked again.
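To make the arithmetic concrete, here is a minimal sketch of how a site could count requests per subnet rather than per IP. It assumes /24 subnets and reuses the hypothetical limit of 2000 requests per hour from the example; the `SubnetBudget` class and the documentation-range addresses are made up for illustration and do not describe how Amazon or any other site actually works.

```python
import ipaddress
from collections import Counter

SUBNET_LIMIT = 2000  # hypothetical hourly limit per subnet, from the example


def subnet_of(ip: str, prefix: int = 24) -> ipaddress.IPv4Network:
    """Return the subnet an address belongs to (a /24 is assumed by default)."""
    return ipaddress.ip_network(f"{ip}/{prefix}", strict=False)


class SubnetBudget:
    """Counts requests per subnet instead of per single IP address."""

    def __init__(self, limit: int = SUBNET_LIMIT) -> None:
        self.limit = limit
        self.counts: Counter = Counter()

    def record(self, ip: str, requests: int = 1) -> None:
        """Add requests made from `ip` to the counter of its whole subnet."""
        self.counts[subnet_of(ip)] += requests

    def is_blocked(self, ip: str) -> bool:
        """An IP is blocked once its subnet as a whole has used up the limit."""
        return self.counts[subnet_of(ip)] >= self.limit


budget = SubnetBudget()
budget.record("192.0.2.10", 1000)
budget.record("192.0.2.20", 1000)

# An address that never sent a request shares the fate of its neighbors.
print(budget.is_blocked("192.0.2.200"))
# An address in a different subnet is unaffected.
print(budget.is_blocked("198.51.100.7"))
```

The counter is keyed by network, not by address, so two clients in the same /24 draw from the same budget.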
This leads to the “noisy neighbor problem”: I don’t know what the other users on the same subnet are doing, and if they are scraping Amazon too, they are “burning” part of the total number of requests I can make.
This is why the diversity of the sources in the IP rotation, for scrapers and for proxy providers alike, is a key success factor in web scraping projects involving large websites.
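One way to put diversity into practice is to rotate across subnets, not just across IPs. The sketch below groups a proxy pool by /24 and switches subnet on every request. The function names and the pool of documentation-range addresses are hypothetical, and a real provider may well rotate differently.

```python
import ipaddress
from collections import defaultdict
from itertools import cycle, islice


def group_by_subnet(ips, prefix=24):
    """Group proxy IPs by the subnet they belong to (a /24 is assumed)."""
    groups = defaultdict(list)
    for ip in ips:
        network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
        groups[network].append(ip)
    return dict(groups)


def diverse_rotation(ips, prefix=24):
    """Yield proxy IPs endlessly, moving to the next subnet on every request."""
    groups = group_by_subnet(ips, prefix)
    # One endless iterator per subnet, visited in round-robin order.
    per_subnet = [cycle(members) for members in groups.values()]
    for subnet_iterator in cycle(per_subnet):
        yield next(subnet_iterator)


# Hypothetical pool: two IPs share a subnet, the other two do not.
pool = ["192.0.2.10", "192.0.2.11", "198.51.100.5", "203.0.113.9"]
print(len(group_by_subnet(pool)))  # number of distinct subnets in the pool
for proxy_ip in islice(diverse_rotation(pool), 6):
    print(proxy_ip)
```

Counting the distinct subnets in a pool is also a quick sanity check on a provider: many IPs packed into a few subnets offer less diversity than the raw IP count suggests.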
Web Scraping Proxy IPs Have a Reputation
Several services maintain blacklists of IP addresses that have been used for bad actions, such as spam campaigns or fraud.
Being on these lists lowers an IP’s reputation, and checking this reputation is one of the measures anti-bot software takes to prevent bots from accessing websites.
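Many public blacklists are published as DNS-based lists (DNSBLs), which are queried by reversing the octets of the address and appending the zone of the list. The sketch below shows that convention; the zone `dnsbl.example.org` is a placeholder, not a real list, and commercial anti-bot products likely rely on their own reputation data rather than on this exact check.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name used to look up an IPv4 address in a DNSBL zone."""
    address = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(address).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip: str, zone: str, resolve=socket.gethostbyname) -> bool:
    """Return True if the DNSBL answers for this IP, i.e. the IP is listed.

    `resolve` is injectable so the function can be tried without network access.
    """
    try:
        resolve(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        # No DNS record for the query name: the IP is not on this list.
        return False


# Placeholder zone: replace it with the zone of a list you are allowed to query.
print(dnsbl_query_name("192.0.2.1", "dnsbl.example.org"))
```

Running `is_listed` against the IPs of a trial proxy pool before committing to a provider gives a rough first impression of their reputation, within the usage terms of the list being queried.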
Some years ago, 4 million IP addresses were stolen from AFRINIC, the Regional Internet Registry for Africa, and sold on the black market to be used for fraud and spam.
As a result, these IP addresses and others in the same subnets are almost unusable for web scraping because of their low reputation, and they often trigger CAPTCHAs even during normal browsing.
This must be considered when choosing the proxy provider for a web scraping project: usually, when prices are too good to be true, it is because of the poor reputation of the IP addresses underlying the proxies.
Fun fact: buying an IPv4 address in 2021 performed better than the Dow Jones as an investment.
Due to scarcity and the increasing need for IP addresses, their prices on marketplaces like Neterra Cloud are skyrocketing!
If you’re unsure whether to invest your $1000 in the latest ape NFT or in some IPv4 addresses, I would go for the latter: at least there’s a real need and an intrinsic scarcity, until the usage of IPv6 finally takes off.
Jokes aside, thanks for reading this post.
This article was kindly provided by Pierluigi Vinciguerra, web scraping expert and founder of Web Scraping Club.