If a single fixed IP sends a large number of requests to a website within a short time, the server can easily flag the traffic as abnormal and ban that IP. A proxy IP, simply put, routes your requests through other IPs, so the visits come from different addresses and your own IP does not get banned. In this project we build a free proxy IP pool by hand.
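To preview where we are headed: the pool we build is just a list of dictionaries in the shape that requests expects for its proxies argument. A minimal sketch (example.com stands in for any target site; the proxy address is one sample entry scraped below):

import requests

# A proxy entry maps a scheme to "host:port"; requests routes matching
# traffic through that address instead of connecting directly.
proxy = {"http": "123.55.114.77:9999"}  # sample entry from the pool built below
resp = requests.get("http://example.com", proxies=proxy, timeout=5)
print(resp.status_code)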
# Step 1: analyze the target page (Kuaidaili, a site that publishes free proxy
# IPs) and work out the URL to crawl and the request headers.
import requests
import re

url = ""  # Kuaidaili's free-proxy list page; the exact URL was lost in the export
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"}
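The User-Agent header makes the request look like ordinary browser traffic. In the same anti-ban spirit, an optional refinement (not in the original notebook) is to rotate among several UA strings; the list entries below are just examples:

import random

# Hypothetical pool of User-Agent strings; pick one per request so the
# traffic does not always carry the same browser fingerprint.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36",
]
headers = {"User-Agent": random.choice(USER_AGENTS)}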
# Step 2: send the request -- requests imitates a browser and fetches the response.
response = requests.get(url=url, headers=headers).text
# print(response)
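The one-liner above trusts that the request succeeds. A slightly more defensive variant of the same step (a sketch, reusing the url and headers defined above) fails loudly on HTTP errors instead of silently parsing an error page:

resp = requests.get(url, headers=headers, timeout=10)
resp.raise_for_status()  # raise on 4xx/5xx responses instead of continuing
response = resp.text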
# Step 3: parse the data -- regex (re) is the most flexible tool here; jsonpath,
# BeautifulSoup (bs), parsel and the like would also work.
all_ip = re.findall(r"\"IP\">(.*?)</td>", response)
# print(len(all_ip))
all_type = re.findall(r"\"类型\">(.*?)</td>", response)  # "类型" is the type column on the Chinese page
# print(len(all_type))
all_port = re.findall(r"\"PORT\">(.*?)</td>", response)
all_data = zip(all_type, all_ip, all_port)
for i in enumerate(all_data):
    print(i)

Output:
(0, ('HTTP', '123.55.114.77', '9999'))
(1, ('HTTP', '36.248.133.117', '9999'))
(2, ('HTTP', '123.163.117.62', '9999'))
(3, ('HTTP', '1.197.204.52', '9999'))
(4, ('HTTP', '175.155.140.34', '1133'))
(5, ('HTTP', '115.218.5.120', '9000'))
(6, ('HTTP', '182.87.38.156', '9000'))
(7, ('HTTP', '60.168.207.253', '1133'))
(8, ('HTTP', '114.99.13.141', '1133'))
(9, ('HTTP', '119.108.172.169', '9000'))
(10, ('HTTP', '36.248.133.81', '9999'))
(11, ('HTTP', '114.239.29.206', '9999'))
(12, ('HTTP', '1.196.177.81', '9999'))
(13, ('HTTP', '175.44.109.13', '9999'))
(14, ('HTTP', '113.124.86.220', '9999'))
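For comparison, here is roughly what the same extraction looks like with BeautifulSoup, one of the alternatives mentioned above. It assumes the proxy table marks its cells as <td data-title="IP"> and so on, which is also what the regex patterns rely on:

from bs4 import BeautifulSoup

def parse_proxies(html):
    # Pull the IP, port and type columns out of the proxy table by attribute.
    soup = BeautifulSoup(html, "html.parser")
    ips = [td.get_text(strip=True) for td in soup.select('td[data-title="IP"]')]
    ports = [td.get_text(strip=True) for td in soup.select('td[data-title="PORT"]')]
    types = [td.get_text(strip=True) for td in soup.select('td[data-title="类型"]')]
    return list(zip(types, ips, ports))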
# The code above scrapes a single page of proxies; wrap it in a for loop to
# scrape several pages.
import time

all_datas = []  # collects the proxy entries from every page
for page in range(2):  # start with two pages as a test
    # the changing part of the URL is filled in with {}.format()
    url = "{}/".format(page + 1)  # paginated list URL; the site prefix was lost in the export
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"}
    # Step 2: send the request -- requests imitates a browser and fetches the response
    response = requests.get(url=url, headers=headers).text
    # print(response)
    # Step 3: parse the data with regex (jsonpath, bs, parsel etc. would also work)
    all_ip = re.findall(r"\"IP\">(.*?)</td>", response)
    all_type = re.findall(r"\"类型\">(.*?)</td>", response)
    all_port = re.findall(r"\"PORT\">(.*?)</td>", response)
    all_data = zip(all_type, all_ip, all_port)
    for i in enumerate(all_data):
        # print(i)
        all_datas.append(i)  # accumulate this page's entries
    time.sleep(1)  # pause one second so the server is not hit too frequently and refuses us
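The fixed one-second pause already keeps the request rate polite; as a hypothetical refinement, a randomized delay looks less mechanical to the server:

import random
import time

# Sleep a random 1.0-2.5 seconds between page requests instead of exactly 1 s.
time.sleep(random.uniform(1.0, 2.5))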
print(all_datas)

Output:
[(0, ('HTTP', '123.55.114.77', '9999')), (1, ('HTTP', '36.248.133.117', '9999')), (2, ('HTTP', '123.163.117.62', '9999')), (3, ('HTTP', '1.197.204.52', '9999')), (4, ('HTTP', '175.155.140.34', '1133')), (5, ('HTTP', '115.218.5.120', '9000')), (6, ('HTTP', '182.87.38.156', '9000')), (7, ('HTTP', '60.168.207.253', '1133')), (8, ('HTTP', '114.99.13.141', '1133')), (9, ('HTTP', '119.108.172.169', '9000')), (10, ('HTTP', '36.248.133.81', '9999')), (11, ('HTTP', '114.239.29.206', '9999')), (12, ('HTTP', '1.196.177.81', '9999')), (13, ('HTTP', '175.44.109.13', '9999')), (14, ('HTTP', '113.124.86.220', '9999')), (0, ('HTTP', '175.42.68.174', '9999')), (1, ('HTTP', '182.34.34.20', '9999')), (2, ('HTTP', '113.194.140.60', '9999')), (3, ('HTTP', '113.194.48.139', '9999')), (4, ('HTTP', '123.55.114.42', '9999')), (5, ('HTTP', '175.43.151.240', '9999')), (6, ('HTTP', '123.169.118.141', '9999')), (7, ('HTTP', '123.55.101.33', '9999')), (8, ('HTTP', '1.197.203.69', '9999')), (9, ('HTTP', '120.83.122.236', '9999')), (10, ('HTTP', '36.248.132.246', '9999')), (11, ('HTTP', '58.22.177.123', '9999')), (12, ('HTTP', '36.248.133.68', '9999')), (13, ('HTTP', '121.232.199.252', '9000')), (14, ('HTTP', '114.99.15.142', '1133'))]
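Free lists sometimes repeat entries across pages. If that matters for your pool, a set keyed on the (type, ip, port) triple removes duplicates; this sketch builds on the structure used here, where each element is (index, (type, ip, port)) and the index restarts on every page:

# Deduplicate while preserving order; key only on the (type, ip, port) triple.
seen = set()
unique = []
for _, triple in all_datas:
    if triple not in seen:
        seen.add(triple)
        unique.append(triple)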
Use a function to validate the proxies and weed out the low-quality ones:

def check_ip(all_datas):
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"}
    good_id = []
    for i, (type, ip, port) in all_datas:
        id_dict = {}
        # print("{}:{}:{}".format(type, ip, port))
        id_dict[type] = ip + ":" + port
        # print(id_dict)
        try:
            # Fetch Baidu's page; if it responds within 0.1 s, treat the proxy as usable.
            res = requests.get("https://www.baidu.com", headers=headers, proxies=id_dict, timeout=0.1)
            if res.status_code == 200:
                good_id.append(id_dict)
        except Exception as error:
            print(id_dict, error)
    return good_id

good_id = check_ip(all_datas)
print("Usable proxies:", good_id)

Output (each entry also carries a klab_external_proxy_service_port key, which is merged in from the hosting notebook platform's proxy environment variables rather than set by this code):
Usable proxies: [{'HTTP': '123.55.114.77:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '36.248.133.117:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.163.117.62:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '1.197.204.52:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.155.140.34:1133', 'klab_external_proxy_service_port': '80'}, {'HTTP': '115.218.5.120:9000', 'klab_external_proxy_service_port': '80'}, {'HTTP': '182.87.38.156:9000', 'klab_external_proxy_service_port': '80'}, {'HTTP': '60.168.207.253:1133', 'klab_external_proxy_service_port': '80'}, {'HTTP': '114.99.13.141:1133', 'klab_external_proxy_service_port': '80'}, {'HTTP': '119.108.172.169:9000', 'klab_external_proxy_service_port': '80'}, {'HTTP': '36.248.133.81:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '114.239.29.206:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '1.196.177.81:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.44.109.13:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '113.124.86.220:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.42.68.174:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '182.34.34.20:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '113.194.140.60:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '113.194.48.139:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.55.114.42:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.43.151.240:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.169.118.141:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.55.101.33:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '1.197.203.69:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '120.83.122.236:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '36.248.132.246:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '58.22.177.123:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '36.248.133.68:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '121.232.199.252:9000', 'klab_external_proxy_service_port': '80'}, {'HTTP': '114.99.15.142:1133', 'klab_external_proxy_service_port': '80'}]
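One caveat explains why every single candidate appears to pass: requests matches proxies keys case-sensitively against the lowercase URL scheme, so the uppercase 'HTTP' key scraped from the page never matches, and the probe silently degrades into a direct connection. A sketch of a stricter check, with a lowercase key and an echo service (http://httpbin.org/ip, used here as an assumption) that reports the origin IP it sees:

import requests

def exit_ip(scheme, ip, port, timeout=3):
    # Lowercase scheme key so requests actually routes through the proxy,
    # and probe a plain-http URL so the 'http' entry applies.
    proxies = {scheme.lower(): ip + ":" + port}
    try:
        r = requests.get("http://httpbin.org/ip", proxies=proxies, timeout=timeout)
        return r.json().get("origin")  # differs from your own IP when the proxy relays
    except (requests.RequestException, ValueError):
        return None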
# Wrap the scraping code in a function as well, so the number of pages to crawl
# can be passed in as a parameter.
def get_id(pages):
    all_datas = []
    for page in range(pages):
        url = "{}/".format(page + 1)  # paginated list URL; the site prefix was lost in the export
        headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"}
        # Step 2: send the request -- requests imitates a browser and fetches the response
        response = requests.get(url=url, headers=headers).text
        # print(response)
        # Step 3: parse the data with regex (jsonpath, bs, parsel etc. would also work)
        all_ip = re.findall(r"\"IP\">(.*?)</td>", response)
        # print(len(all_ip))
        all_type = re.findall(r"\"类型\">(.*?)</td>", response)
        # print(len(all_type))
        all_port = re.findall(r"\"PORT\">(.*?)</td>", response)
        all_data = zip(all_type, all_ip, all_port)
        for i in enumerate(all_data):
            # print(i)
            all_datas.append(i)
        time.sleep(1)
    print("Collected {} proxies".format(len(all_datas)))
    return all_datas

def check_ip(page=4):
    """Validate the scraped proxy IPs and return the usable ones."""
    all_datas = get_id(page)
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"}
    good_id = []
    for i, (type, ip, port) in all_datas:
        id_dict = {}
        id_dict[type] = ip + ":" + port
        try:
            res = requests.get("https://www.baidu.com", headers=headers, proxies=id_dict, timeout=0.1)
            if res.status_code == 200:
                good_id.append(id_dict)
        except Exception as error:
            print(id_dict, error)
    return good_id
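Because everything now lives in functions, another script can import the checker instead of re-scraping by hand. A hypothetical usage, assuming this file is saved as proxy_pool.py:

import random
import requests
from proxy_pool import check_ip  # hypothetical module name for this file

proxies_pool = check_ip(4)  # scrape 4 pages and keep the proxies that pass the check
if proxies_pool:
    proxy = random.choice(proxies_pool)  # rotate by picking a random pool entry
    resp = requests.get("http://example.com", proxies=proxy, timeout=5)
    print(resp.status_code)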
# Running check_ip() yields the usable proxies. Wrapping the logic in functions
# makes it easy to call from other modules -- collecting proxy IPs is a means
# for the crawler, not its end goal. When other code needs proxies, it simply
# imports this function from this module.
if __name__ == "__main__":
    good_ID = check_ip(4)
    print("Good proxies:", good_ID)

Output:
Collected 60 proxies
Good proxies: [{'HTTP': '123.55.114.77:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '36.248.133.117:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.163.117.62:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '1.197.204.52:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.155.140.34:1133', 'klab_external_proxy_service_port': '80'}, {'HTTP': '115.218.5.120:9000', 'klab_external_proxy_service_port': '80'}, {'HTTP': '182.87.38.156:9000', 'klab_external_proxy_service_port': '80'}, {'HTTP': '60.168.207.253:1133', 'klab_external_proxy_service_port': '80'}, {'HTTP': '114.99.13.141:1133', 'klab_external_proxy_service_port': '80'}, {'HTTP': '119.108.172.169:9000', 'klab_external_proxy_service_port': '80'}, {'HTTP': '36.248.133.81:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '114.239.29.206:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '1.196.177.81:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.44.109.13:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '113.124.86.220:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.42.68.174:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '182.34.34.20:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '113.194.140.60:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '113.194.48.139:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.55.114.42:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.43.151.240:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.169.118.141:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.55.101.33:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '1.197.203.69:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '120.83.122.236:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '36.248.132.246:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '58.22.177.123:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '36.248.133.68:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '121.232.199.252:9000', 'klab_external_proxy_service_port': '80'}, {'HTTP': '114.99.15.142:1133', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.149.136.23:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.149.136.209:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '114.104.138.12:3000', 'klab_external_proxy_service_port': '80'}, {'HTTP': '121.232.148.211:9000', 'klab_external_proxy_service_port': '80'}, {'HTTP': '1.198.72.73:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '110.243.29.17:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.42.123.89:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '140.255.184.238:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.163.27.180:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.55.101.167:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '36.248.133.37:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '171.12.221.227:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.44.108.106:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '182.34.34.222:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.44.109.141:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.55.102.66:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '1.197.204.185:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '171.12.115.240:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '171.12.115.236:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '171.35.160.131:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '36.249.53.36:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '112.111.217.106:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '139.155.41.15:8118', 'klab_external_proxy_service_port': '80'}, {'HTTP': '115.218.7.209:9000', 'klab_external_proxy_service_port': '80'}, {'HTTP': '163.125.112.207:8118', 'klab_external_proxy_service_port': '80'}, {'HTTP': '110.243.15.151:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '175.155.143.39:1133', 'klab_external_proxy_service_port': '80'}, {'HTTP': '221.224.136.211:35101', 'klab_external_proxy_service_port': '80'}, {'HTTP': '36.250.156.31:9999', 'klab_external_proxy_service_port': '80'}, {'HTTP': '123.101.237.216:9999', 'klab_external_proxy_service_port': '80'}]
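To make the pool reusable between runs, one simple option (a sketch; the filename is arbitrary) is to dump the validated proxies to a JSON file that other crawlers load on startup:

import json

def save_pool(good_id, path="proxy_pool.json"):
    # Persist the validated proxies so other scripts can reuse them later.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(good_id, f, ensure_ascii=False, indent=2)

def load_pool(path="proxy_pool.json"):
    # Read the pool back; returns the same list of proxy dictionaries.
    with open(path, encoding="utf-8") as f:
        return json.load(f)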