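Anti-anti-crawler toolkit: how to use proxies and redial to change your IP

0x01 Preface

Generally speaking, any reasonably well-run website puts some anti-crawler restrictions in place. The main anti-crawler techniques are: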

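1. User-Agent (UA) checks. This is the most basic check, and sites rarely rely on it alone, because it is trivial to get around: send a random UA with each request.
2. Detecting frequent requests from a single IP. This check is simple to implement and laborious to get around, which makes it an excellent anti-crawler measure. Getting around it requires crawling from multiple IPs.
3. Cookie checks, for example requiring a member login with account and password and counting how many requests one account makes in a short time. This is also laborious to get around and requires crawling with multiple accounts.
4. Dynamic page loading. This depends on the skill of the front-end engineers. If the front end is well written, with many JS checks and layers of logic as on Baidu or Taobao, logging in with a plain POST is hard. It is a good approach, although a determined expert can still get through. The usual countermeasure is to crawl with a rendering browser, which is slow.
5. CAPTCHAs. Either the login form has a CAPTCHA, or a site that suspects a crawler shows a CAPTCHA instead of banning the IP, as Lianjia (链家网) does. CAPTCHAs are one of the more cost-effective anti-crawler measures. The usual countermeasures are an OCR CAPTCHA-recognition service, a human-powered solving platform, Tesseract OCR, or a neural network trained to recognize the CAPTCHAs.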
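The random-UA countermeasure mentioned above can be sketched roughly as follows. This is only a minimal illustration: the USER_AGENTS pool and the build_request helper are hypothetical names, and the standard library's urllib.request is used just so the snippet has no third-party dependencies. It builds the request object without sending anything.

```python
import random
import urllib.request

# Hypothetical sample pool; replace with the User-Agent strings you want to send.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
]


def build_request(url):
    """Return a urllib Request that carries a randomly chosen User-Agent."""
    ua = random.choice(USER_AGENTS)
    return urllib.request.Request(url, headers={"User-Agent": ua})


# Build (but do not send) a request and show which UA was picked.
req = build_request("http://httpbin.org/ip")
print(req.get_header("User-agent"))  # urllib stores the key as "User-agent"
```

Passing the returned object to urllib.request.urlopen would send it; the same idea works with requests by passing headers={"User-Agent": ...} to each call.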

0x02 Overview

Today we focus on item 2: how to get around single-IP frequency detection by crawling from multiple IPs.

Multi-IP crawling takes several forms:

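1. Changing IP by ADSL redial. Each dial yields a new IP, which solves the single-IP problem fairly well.
2. On a LAN behind a router, the first method may not work. In that case you can simulate logging in to the router and make it redial to change the IP. This is a compromise, an indirect route to the same result.
3. Proxy IPs. Use purchased proxies, or free ones scraped from the web, to crawl from multiple IPs.
4. Distributed crawling. Multiple servers with multiple IPs run multiple slave crawlers at the same time, scheduled by a master. This is efficient and belongs to large-scale distributed crawling, usually built on Redis. Not covered here.
5. Tor. I recently learned about this encrypted, anonymous proxy network, which can also be used to change IP anonymously. I have not looked into it in detail, so it is not covered here.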

0x03 Main text

1. ADSL dialing

I usually do ADSL dialing on Windows and have not tried other platforms. On Windows, the Python code I use for dialing is:

# -*- coding: utf-8 -*-
import os
import time  # needed for the time.sleep calls below

# "name" is the Windows connection name ("宽带连接" is the default broadband connection).
g_adsl_account = {"name": u"宽带连接", "username": "xxxx", "password": "xxxx"}


class Adsl(object):
    # =============================
    # __init__ : name is the ADSL connection name
    # =============================
    def __init__(self):
        self.name = g_adsl_account["name"]
        self.username = g_adsl_account["username"]
        self.password = g_adsl_account["password"]

    # =============================
    # set_adsl : change the ADSL settings
    # =============================
    def set_adsl(self, account):
        self.name = account["name"]
        self.username = account["username"]
        self.password = account["password"]

    # =============================
    # connect : dial the broadband connection
    # =============================
    def connect(self):
        cmd_str = "rasdial %s %s %s" % (self.name, self.username, self.password)
        os.system(cmd_str)
        time.sleep(5)

    # =============================
    # disconnect : drop the broadband connection
    # =============================
    def disconnect(self):
        cmd_str = "rasdial %s /disconnect" % self.name
        os.system(cmd_str)
        time.sleep(5)

    # =============================
    # reconnect : hang up and dial again
    # =============================
    def reconnect(self):
        self.disconnect()
        self.connect()
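A redial may occasionally come back with the same address, so it can be worth checking that the IP really changed before resuming the crawl. The sketch below is one possible way to do that, not part of the original code: redial_until_new_ip, get_ip and reconnect are hypothetical names, and both callables are supplied by the caller so the logic can be tried without a real ADSL line.

```python
import time


def redial_until_new_ip(get_ip, reconnect, max_attempts=3, wait_seconds=0):
    """Redial until get_ip() returns something different from the starting IP.

    get_ip    -- caller-supplied function returning the current public IP (str)
    reconnect -- caller-supplied function that redials, e.g. Adsl().reconnect
    Returns the new IP, or None if every attempt kept the old address.
    """
    old_ip = get_ip()
    for _ in range(max_attempts):
        reconnect()
        time.sleep(wait_seconds)  # give the link time to settle if needed
        new_ip = get_ip()
        if new_ip and new_ip != old_ip:
            return new_ip
    return None
```

In practice reconnect could be the reconnect method of the Adsl class above, and get_ip could be any function that asks an IP echo service such as http://httpbin.org/ip for the current address.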

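2. Router dialing

If you are on a LAN behind a router and calling the Windows rasdial command directly cannot dial, you can simulate logging in to the router and make it redial to change the IP. This is a compromise, an indirect route to the same result. The following example logs in to a Xiaomi router: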

# -*- coding: utf-8 -*-
import datetime
import hashlib
import random
import re
import time

import requests


class Adsl():
    def __init__(self):
        self.host = "192.168.31.1"
        self.username = "admin"
        self.password = "huangxin250"

    def connect(self):
        host = self.host
        # The router's home page embeds the key and device ID used by the login hash.
        homeRequest = requests.get("http://" + host + "/cgi-bin/luci/web/home")
        key = re.findall(r"key: '(.*)',", homeRequest.text)[0]
        mac = re.findall(r"deviceId = '(.*)';", homeRequest.text)[0]
        aimurl = "http://" + host + "/cgi-bin/luci/api/xqsystem/login"
        nonce = "0_" + mac + "_" + str(int(time.time())) + "_" + str(random.randint(1000, 10000))
        # password field = SHA1(nonce + SHA1(password + key));
        # hashlib.sha1 stands in for Crypto.Hash.SHA and yields the same digest.
        hexpwd1 = hashlib.sha1((self.password + key).encode("utf-8")).hexdigest()
        hexpwd2 = hashlib.sha1((nonce + hexpwd1).encode("utf-8")).hexdigest()
        data = {
            "logtype": 2,
            "nonce": nonce,
            "password": hexpwd2,
            "username": self.username,
        }
        response = requests.post(url=aimurl, data=data, timeout=15)
        token = response.json()["token"]
        # Assumption: the part of the URL before ";stok=" is missing in the source;
        # the prefix below is inferred from the login URL above.
        base = "http://" + host + "/cgi-bin/luci/;stok=" + token
        webstop = requests.get(base + "/api/xqnetwork/pppoe_stop")
        # time.sleep(1)
        webstart = requests.get(base + "/api/xqnetwork/pppoe_start")
        date = datetime.datetime.now()
        nowtime = str(date)[:-10]
        print(nowtime + ", congratulations, the IP is changed !")

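This achieves the goal of changing the IP through the router. The drawback is just as obvious: it is not as general as the first method. Basically every router model needs its own code, so this is custom code.

3. Proxy IPs

Proxy IPs are the most common way to crawl from multiple IPs. Hand the proxy address to the HTTP client with each request and the request goes out through the proxy. The drawback is that crawl speed depends closely on the speed of the proxy. Good proxies are expensive, and free ones are generally slow.

Below is the code for crawling through a proxy with requests and with selenium.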

requests:

# -*- coding: utf-8 -*-
import requests

s = requests.session()
# Replace PROXY_HOST with the address of your proxy.
proxie = {"http": "http://PROXY_HOST:80"}
url = "xxx"
response = s.get(url, verify=False, proxies=proxie, timeout=20)
print(response.text)
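Because free proxies in particular tend to be slow or dead, a crawler usually rotates through several of them and throws away the ones that fail. The sketch below shows one simple way to do that. ProxyPool and fetch_via_proxy are hypothetical names, and the actual HTTP call is passed in by the caller, so the rotation logic stays independent of any particular HTTP library.

```python
class ProxyPool:
    """Round-robin over a list of proxies, discarding any that fail."""

    def __init__(self, proxies):
        self.proxies = list(proxies)

    def fetch(self, url, fetch_via_proxy):
        """Try the remaining proxies in turn until one succeeds.

        fetch_via_proxy(url, proxy) is supplied by the caller; it performs
        the request through the given proxy and raises on failure.
        """
        while self.proxies:
            proxy = self.proxies[0]
            try:
                result = fetch_via_proxy(url, proxy)
            except Exception:
                self.proxies.pop(0)  # drop the dead proxy and try the next one
                continue
            # Move the working proxy to the back so the load is spread out.
            self.proxies.append(self.proxies.pop(0))
            return result
        raise RuntimeError("no working proxies left")
```

With the requests session above, fetch_via_proxy could be something like lambda url, proxy: s.get(url, proxies={"http": proxy}, timeout=20).text. Dropping a proxy after a single failure is a deliberately blunt policy; a real pool might allow a few retries first.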

selenium:

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.common.proxy import Proxy
from selenium.webdriver.common.proxy import ProxyType

# Note: the PhantomJS driver only exists in Selenium releases before 4.0.
proxy = Proxy(
    {
        "proxyType": ProxyType.MANUAL,
        "httpProxy": "ip:port",
    }
)
desired_capabilities = DesiredCapabilities.PHANTOMJS.copy()
proxy.add_to_capabilities(desired_capabilities)
driver = webdriver.PhantomJS(
    executable_path="/path/of/phantomjs",
    desired_capabilities=desired_capabilities,
)
driver.get("http://httpbin.org/ip")
print(driver.page_source)
driver.close()

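0x04 Closing remarks

This section covered some anti-crawler concepts and common techniques, some ways of getting around them, and above all how to crawl from multiple IPs. All of this is basic material in the crawler field. With these basics in hand, your later crawler work will stand on solid ground.

Next, I will talk about how to deal with CAPTCHA-based anti-crawler measures. Stay tuned~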

You are welcome to follow me: 一只IT汪. I will share and update all kinds of IT knowledge from time to time. Thanks!
