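What You Should Know About Proxies in Requests

Anyone who has written a crawler is familiar with proxies. But is your proxy actually taking effect?

Proxies come in several types. For crawlers, the most common choice is a high-anonymity (elite) proxy.

Setting a proxy in Requests is easy: just pass a proxies argument. The official example looks like this: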

    import requests

    proxies = {
        "http": "...:3128",
        "https": "...:1080",
    }

    requests.get("...", proxies=proxies)

Note one detail: the proxies dictionary has two keys, https and http. Why are both needed? Would a single key be enough?
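
Try it and see

Preparing a validation function

This function uses the given proxies to visit two IP-checking sites, one over https and one over http, and prints the IP address each site saw.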

    import requests
    from bs4 import BeautifulSoup


    def validate(proxies):
        https_url = "..."
        http_url = "..."
        headers = {"User-Agent": "curl/7.29.0"}

        https_r = requests.get(https_url, headers=headers, proxies=proxies, timeout=10)
        http_r = requests.get(http_url, headers=headers, proxies=proxies, timeout=10)

        # The http site returns HTML; the IP is on the first line of the card body.
        soup = BeautifulSoup(http_r.content, "html.parser")
        result = soup.find(class_="card-body").get_text().strip().split("\n")[0]

        print(f"Proxies in use: {proxies.values()}")
        print(f"IP seen by the https site: {https_r.json()}")
        print(f"IP seen by the http site: {result}")

Testing
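
Other tests

A Wireshark capture shows that when the scheme of the target URL does not match a key in proxies, no request is sent to the proxy server at all.

Testing with Postman gives the same result as Requests: when the schemes differ, the proxy is not used.

My first guess was that this is some kind of convention or rule, similar to PAC. (If you know the answer, please tell me.)

Finding the answer

Why not start from the source code? In requests.utils there is this function: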

    def select_proxy(url, proxies):
        """Select a proxy for the url, if applicable.

        :param url: The url being for the request
        :param proxies: A dictionary of schemes or schemes and hosts to proxy URLs
        """
        proxies = proxies or {}
        urlparts = urlparse(url)
        if urlparts.hostname is None:
            return proxies.get(urlparts.scheme, proxies.get("all"))

        proxy_keys = [
            urlparts.scheme + "://" + urlparts.hostname,
            urlparts.scheme,
            "all://" + urlparts.hostname,
            "all",
        ]
        proxy = None
        for proxy_key in proxy_keys:
            if proxy_key in proxies:
                proxy = proxies[proxy_key]
                break

        return proxy

That settles it: Requests picks a proxy for each target URL by trying a fixed sequence of keys derived from the URL's scheme and host. Take Case 2 above as an example:
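
    proxies = {
        "http": "222.189.244.56:48304"
    }

When the http URL is requested, Requests looks up the proxies dictionary with the following keys, in this order:

1. scheme + host: http://ip111.cn
2. scheme: http
3. all + host: all://ip111.cn
4. all: all

Step 2 matches 222.189.244.56:48304, so that proxy is used to reach the target.

When the https URL is requested, none of the keys in that sequence match, so Requests connects to the target directly from the local IP.

This explains all 3 cases above.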
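
The lookup order can be tried offline with a small standard-library sketch. It mirrors the key order of the select_proxy function quoted above but is not the library code itself: candidate_keys and pick_proxy are hypothetical names, URLs without a hostname are not handled, and https://example.org is only a stand-in for the https test site.

```python
from urllib.parse import urlparse


def candidate_keys(url):
    """Return the proxies-dict keys to try for `url`, most specific first."""
    parts = urlparse(url)
    return [
        f"{parts.scheme}://{parts.hostname}",  # scheme + host
        parts.scheme,                          # scheme only
        f"all://{parts.hostname}",             # any scheme, this host
        "all",                                 # catch-all
    ]


def pick_proxy(url, proxies):
    """Return the first matching proxy, or None for a direct connection."""
    for key in candidate_keys(url):
        if key in proxies:
            return proxies[key]
    return None


PROXY = "222.189.244.56:48304"
HTTP_URL = "http://ip111.cn"
HTTPS_URL = "https://example.org"  # assumption: stand-in for the https test site

cases = {
    "http and https keys": {"http": PROXY, "https": PROXY},
    "http key only": {"http": PROXY},
    "all key": {"all": PROXY},
}

for name, proxies in cases.items():
    print(name)
    for url in (HTTP_URL, HTTPS_URL):
        print(f"  {url} -> {pick_proxy(url, proxies) or 'direct connection'}")
```

With only an http key, the https URL falls through every candidate key and ends up as a direct connection, which matches what the packet capture showed.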

Going further

The proxy addresses in the official example include a scheme, while the ones in our tests do not, yet both work. Why is that?

    # Official example
    proxies = {
        "http": "...:3128",
        "https": "...:1080",
    }

    # Ours
    proxies = {
        "http": "222.189.244.56:48304",
        "https": "222.189.244.56:48304"
    }

The answer is again in the source code. Look at the following two functions:

requests.adapters

    def get_connection(self, url, proxies=None):
        """Returns a urllib3 connection for the given URL. This should not be
        called from user code, and is only exposed for use when subclassing the
        :class:`HTTPAdapter <requests.adapters.HTTPAdapter>`.

        :param url: The URL to connect to.
        :param proxies: (optional) A Requests-style dictionary of proxies used on this request.
        :rtype: urllib3.ConnectionPool
        """
        proxy = select_proxy(url, proxies)

        if proxy:
            proxy = prepend_scheme_if_needed(proxy, "http")
            proxy_url = parse_url(proxy)
            if not proxy_url.host:
                raise InvalidProxyURL("Please check proxy URL. It is malformed"
                                      " and could be missing the host.")
            proxy_manager = self.proxy_manager_for(proxy)
            conn = proxy_manager.connection_from_url(url)
        else:
            # Only scheme should be lower case
            parsed = urlparse(url)
            url = parsed.geturl()
            conn = self.poolmanager.connection_from_url(url)

        return conn

The line to look at is proxy = prepend_scheme_if_needed(proxy, "http"). Here is the definition of that function:

    def prepend_scheme_if_needed(url, new_scheme):
        """Given a URL that may or may not have a scheme, prepend the given scheme.
        Does not replace a present scheme with the one provided as an argument.

        :rtype: str
        """
        scheme, netloc, path, params, query, fragment = urlparse(url, new_scheme)

        # urlparse is a finicky beast, and sometimes decides that there isn't a
        # netloc present. Assume that it's being over-cautious, and switch netloc
        # and path if urlparse decided there was no netloc.
        if not netloc:
            netloc, path = path, netloc

        return urlunparse((scheme, netloc, path, params, query, fragment))

The docstring gives the answer:

If the proxy address already has a scheme, it is left unchanged. If it has no scheme, http is prepended.
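
To make the rule concrete, here is a deliberately simplified sketch of its effect. It is not how Requests implements it (the library parses the URL, as shown above). The helper name with_default_scheme and the "://" check are assumptions chosen only for illustration.

```python
def with_default_scheme(proxy, default_scheme="http"):
    """Return `proxy` unchanged if it names a scheme, else prefix the default.

    Simplified illustration only: a scheme is detected by looking for "://".
    """
    if "://" in proxy:
        return proxy
    return f"{default_scheme}://{proxy}"


for address in (
    "222.189.244.56:48304",
    "http://222.189.244.56:48304",
    "socks5://222.189.244.56:48304",
):
    print(f"{address!r} -> {with_default_scheme(address)!r}")
```

Writing the scheme explicitly in the proxies dictionary, as the official example does, avoids relying on this default.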
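
Conclusion

Requests chooses a proxy according to the scheme of the target URL. This means you can configure different proxies for different schemes, or even for different hosts. If you want every request to go through the same proxy, use all as the key. If a proxy address does not specify a scheme, Requests defaults to http when talking to the proxy.

References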
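[1] HTTP proxies: principles and implementation (part 1): https://imququ.com/post/web-proxy.html
[2] HTTP proxies: principles and implementation (part 2): https://imququ.com/post/web-proxy-2.html
[3] What are transparent, anonymous, and high-anonymity proxies? A detailed explanation: https://blog.51cto.com/14062184/2315723
[4] The difference between HTTP proxies and HTTPS proxies: https://www.hitoy.org/difference-between-http-and-https-proxy.html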