Scrape free proxies and run your own proxy pool

A long, long time ago I had a dream: to run SQL injection tests without getting my IP banned. IP proxies were the obvious answer, but my wallet had other ideas, so scraping free proxies was the only option. After borrowing ideas from many articles, I ended up with the IP proxy pool described here.

It is written in Python and mainly involves multithreading and web scraping. Don't worry if you can't follow every line of the code below; you are welcome to just take it and use it.

1. Implementation overview:

It consists of four modules: a proxy-fetching module (crawlers that scrape IPs), a proxy-pool module (stores the scraped IPs and writes out the valid ones), a validation module (checks whether the scraped IPs actually work), and a retrieval module (re-reads the valid IPs and deduplicates them).
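Judging from the imports and the file names used below, the project is laid out roughly as follows (the module/__init__.py entry is my assumption, added so that "import module.GetProxy" resolves; ips.txt is created by the pool itself):

StartApi.py              # entry point
module/
    __init__.py          # assumed empty package marker
    GetProxy.py          # one crawler class per free-proxy site
    OptPool.py           # the proxy pool (unverified queue + validated queue)
    IsProxy.py           # validates the pooled IPs
    GetApi.py            # re-validates ips.txt and deduplicates
    QuChong.py           # deduplication helper
ips.txt                  # output: one validated "ip:port" per line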

2. Code walkthrough (in program execution order):

(1) The entry file: StartApi.py

1. __init__: parses the required command-line arguments.

2. getproxy: calls module.GetProxy; every crawler class found there is instantiated in its own thread via exec, and each crawler starts scraping proxy IPs.

3. isproxy: uses module.IsProxy and module.OptPool to repeatedly pull IPs out of the proxy pool and validate them.

4. startproxy: the program entry function, with three modes:

-c api: re-validates the proxies already saved in ips.txt;

-t 100: runs 100 threads; each crawler scrapes once and the results are validated;

-t 50 -m 1 -time 10: 50 threads, scraping valid IPs in an endless loop, repeating every 10 minutes.

The full code:

# -*- coding:utf-8 -*-
# Author:qiuzishan
import module.GetProxy
from module.OptPool import ProxyPool
from module.IsProxy import IsProxy
from threading import Thread
import time
from queue import Queue
from argparse import ArgumentParser
from module.GetApi import GetApi


class StartApi:
    def __init__(self):
        arg = ArgumentParser(description='baidu_url_collection')
        arg.add_argument('-t', help='number of threads, default 10', type=int, default=10)
        arg.add_argument('-c', help='default: scrape and validate; use "-c api" to re-check the IPs already saved in ips.txt', type=str, default='getproxy')
        arg.add_argument('-m', help='default: scrape once; "-m 1" loops forever, scraping every 8 minutes by default', type=int, default=0)
        arg.add_argument('-time', help='scrape interval in minutes, default 8; e.g. "-time 10" scrapes every 10 minutes (use with -m, best not below 5)', type=int, default=8)
        args = arg.parse_args()
        self.que = Queue()
        self.c = args.c
        self.t = args.t  # the thread count: this one matters!
        self.m = args.m
        self.time = args.time

    def getproxy(self):
        """
        One thread per crawler: every crawler class found in module.GetProxy
        gets its own thread, so new crawlers are picked up automatically.
        """
        thread = []
        # names in dir(module.GetProxy) that are NOT crawler classes
        usemodels = ['BeautifulSoup', 'ProxyPool', '__builtins__', '__cached__', '__doc__', '__file__',
                     '__loader__', '__name__', '__package__', '__spec__', 'datetime', 're', 'requests', 'time']
        getmodels = dir(module.GetProxy)
        for i in getmodels:
            if i not in usemodels:
                # print(i)  # uncomment to check which crawlers were found
                # queue a code snippet that instantiates the crawler
                self.que.put('import module.GetProxy\nmodule.GetProxy.' + i + '()')
        proxyNum = self.que.qsize()  # one thread per crawler
        for i in range(0, proxyNum):
            t = Thread(target=exec, args=(self.que.get(),))
            thread.append(t)
        for i in thread:
            i.start()
        for i in thread:
            i.join()

    def isproxy(self, msg):
        """Keep validating proxies until the unverified pool is empty."""
        pp = ProxyPool()
        ip = IsProxy(self.t)
        print(msg)
        while True:
            flag = pp.get_length_pool()
            if flag == 0:
                print('leaving the loop')
                break
            ip.startIsProxy()

    def startproxy(self):
        """Run scraping and validation concurrently."""
        if self.c == 'api':
            ga = GetApi()
            ga.start(self.t)
        else:
            if self.m:
                while True:
                    getproxy = Thread(target=self.getproxy)
                    isproxy = Thread(target=self.isproxy, args=('validating proxies...',))
                    getproxy.start()
                    time.sleep(5)  # IpPool is empty at start, so scrape first before validating
                    isproxy.start()
                    time.sleep(self.time * 60)  # wait before the next scraping round
            else:
                getproxy = Thread(target=self.getproxy)
                isproxy = Thread(target=self.isproxy, args=('validating proxies...',))
                getproxy.start()
                time.sleep(5)  # IpPool is empty at start, so scrape first before validating
                isproxy.start()
                getproxy.join()
                isproxy.join()


if __name__ == '__main__':
    # startup banner (the original prints a large ASCII-art "IpPool" logo)
    print('IpPool {v 1.0 author:秋紫山}')
    start = time.perf_counter()
    sa = StartApi()
    sa.startproxy()
    end = time.perf_counter()
    print('use time:' + str(end - start) + 's')

(2) Scraping proxy IPs: module/GetProxy.py

The custom crawler classes:

1. __init__: needs no changes apart from the print messages; simply instantiating the class starts the crawler and pushes the scraped IPs into the ProxyPool.

2. get_lxml_by_url: fetches the page you want to scrape.

3. get_proxy_by_lxml: your custom parsing code; it must return a list of IPs.

The code, using ip89Proxy as an example:

# -*- coding:utf-8 -*-
# Author:qiuzishan
import requests
from bs4 import BeautifulSoup
from module.OptPool import ProxyPool
import time
import re

# You can define any number of crawler classes in this file. As long as each one
# pushes proxies into the pool in the form "**.**.**.**:**" (ip:port),
# StartApi's getproxy will run it in its own thread.


class demoProxy:
    # demo: template for writing a new crawler
    def __init__(self):
        # Scrapes automatically and feeds the IPs into the proxy pool.
        # This method needs no changes; only adjust the print messages.
        self.proxy = []
        try:
            print('demo crawler started')
            self.proxy = self.get_proxy_by_lxml(self.get_lxml_by_url())
        except Exception as e:
            print('Error: {}\nThe demo source seems to be down, please go check it!'.format(e))
        if self.proxy != []:
            pp = ProxyPool()
            for p in self.proxy:
                pp.get_into_pool(p)
        else:
            print('The demo crawler returned no proxies! The source may be down!')

    def get_lxml_by_url(self):
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'
        }
        r = requests.get('', headers=headers)  # fill in the site you want to scrape!
        if r.status_code == 200:
            response = r.text
            return response
        else:
            exit('Error: Module GetAgent Url Error')

    def get_proxy_by_lxml(self, response):
        ip = []
        # Parse the page and append proxies including the port, e.g. "123.12.2.33:8080", to the ip list.
        # Write your own parsing code here.
        # print(ip)
        return ip


class ip89Proxy:
    # 89ip: get proxies from the 89ip site
    def __init__(self):
        self.proxy = []
        try:
            print('89ip crawler started')
            self.proxy = self.get_proxy_by_lxml(self.get_lxml_by_url())
        except Exception as e:
            print('Error: {}\nThe 89ip source seems to be down, please go check it!'.format(e))
        if self.proxy != []:
            pp = ProxyPool()
            for p in self.proxy:
                pp.get_into_pool(p)
        else:
            print('The 89ip crawler returned no proxies! The source may be down!')

    def get_lxml_by_url(self):
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'
        }
        # the base URL is omitted here; supply the 89ip list page before the query string
        r = requests.get('?num=200&address=&kill_address=&port=&kill_port=&isp=', headers=headers)
        if r.status_code == 200:
            response = r.text
            return response
        else:
            print('Error: Module GetAgent Url Error')

    def get_proxy_by_lxml(self, response):
        ip = []
        soup = BeautifulSoup(response, 'html5lib')
        for div in soup.find_all('div', style="padding-left:20px;"):
            for child in div.children:
                if child.string is not None:
                    # skip the site's promotional line ("更好用的代理ip请访问...")
                    if '更好用的代理ip请访问' not in child.string:
                        ip.append(child.string.strip())
        # print(ip)
        return ip
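To make the crawler contract more concrete, here is a minimal hypothetical example that follows the same template. The source URL (proxy-list.example.com) and its plain-text "one ip:port per line" response format are made up for illustration; this is a sketch, not a working crawler for a real site.

# hypothetical crawler: a source that serves a plain-text list of "ip:port" lines
import re
import requests
from module.OptPool import ProxyPool


class exampleProxy:
    def __init__(self):
        self.proxy = []
        try:
            print('example crawler started')
            self.proxy = self.get_proxy_by_lxml(self.get_lxml_by_url())
        except Exception as e:
            print('Error: {}'.format(e))
        if self.proxy:
            pp = ProxyPool()
            for p in self.proxy:
                pp.get_into_pool(p)  # push "ip:port" strings into the shared pool

    def get_lxml_by_url(self):
        # made-up URL; replace with a real free-proxy page
        r = requests.get('https://proxy-list.example.com/plain', timeout=10)
        return r.text if r.status_code == 200 else ''

    def get_proxy_by_lxml(self, response):
        # the only hard requirement: return a list of "ip:port" strings
        return re.findall(r'\d{1,3}(?:\.\d{1,3}){3}:\d{1,5}', response)

Because StartApi.getproxy discovers crawlers with dir(module.GetProxy), dropping a class like this into module/GetProxy.py is enough for it to be picked up; if the class needs an import that is not already listed in usemodels, add that name there so it is not mistaken for a crawler.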

(3) Putting IPs into the proxy pool: module/OptPool.py

This one should be easy to follow.

# -*- coding:utf-8 -*-
# Author:qiuzishan
from queue import Queue

# IpPool stores all unverified proxies; Proxy holds the ones that passed validation
IpPool = Queue()
Proxy = Queue()


class ProxyPool:
    """The proxy pool."""

    def get_into_pool(self, args):
        # put one unverified ip into the pool
        IpPool.put(args)

    def get_length_pool(self):
        # number of unverified ips left in the pool
        return IpPool.qsize()

    def get_out_pool(self):
        # take one unverified ip out of the pool
        if IpPool.qsize():
            getip = IpPool.get()
            return getip
        else:
            print('IpPool has no unverified proxies left')
            return 0  # avoid raising on an empty queue

    def get_into_proxy(self, args):
        # record a proxy that passed validation and append it to ips.txt
        Proxy.put(args)
        ip = Proxy.get()
        print(ip + ' is valid and has been written to ips.txt')
        with open('ips.txt', 'a', encoding='utf-8') as file:
            file.write(ip + '\n')

    def get_out_proxy(self):
        # read back the validated proxies from ips.txt
        proxy_txt = []
        with open('ips.txt', 'r', encoding='utf-8') as file:
            self.pass_isproxy = file.readlines()
        for i in self.pass_isproxy:
            proxy_txt.append(i.replace('\n', ''))
        return proxy_txt
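A quick round trip through the pool, assuming ips.txt is writable in the working directory, looks like this (the sample address 123.12.2.33:8080 is just the placeholder used above):

from module.OptPool import ProxyPool

pp = ProxyPool()
pp.get_into_pool('123.12.2.33:8080')   # one unverified proxy goes in
print(pp.get_length_pool())            # -> 1
print(pp.get_out_pool())               # -> '123.12.2.33:8080', pool is empty again
pp.get_into_proxy('123.12.2.33:8080')  # marks it as valid and appends it to ips.txt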

(4) Validating the pooled IPs: module/IsProxy.py

# -*- coding:utf-8 -*-
# Author:qiuzishan
import requests
from module.OptPool import ProxyPool
from threading import Thread


class MyThread(Thread):
    # note: this class is never actually used; it is left over from an earlier attempt
    def __init__(self, func):
        Thread.__init__(self)  # call the parent class's init
        self.func = func

    def run(self):
        self.result = self.func()

    def get_result(self):
        Thread.join(self)
        try:
            return self.result
        except Exception:
            return None


class IsProxy:
    def __init__(self, t) -> None:
        self.pp = ProxyPool()
        self.t = t

    def startIsProxy(self):
        self.manyTread()

    def manyTread(self):
        """Validate the pooled proxies with multiple threads."""
        thread = []
        if self.pp.get_length_pool() < self.t:
            # fewer proxies left than threads requested: shrink the thread count
            self.t = self.pp.get_length_pool()
        print('threads: {}'.format(self.t))
        for i in range(0, self.t):
            t = Thread(target=self.test_proxy)
            thread.append(t)
        for i in thread:
            i.start()
        for i in thread:
            i.join()

    def test_proxy(self):
        ip = self.pp.get_out_pool()
        if ip != 0:  # get_out_pool returns 0 when the pool is empty
            ip = str(ip)
            proxies = {
                'http': 'http://' + ip,
                'https': 'https://' + ip,
            }
            try:
                # the test URL is omitted here; fill in the site you want to reach through the proxy
                requests.get('', proxies=proxies, timeout=10)
                self.pp.get_into_proxy(ip)
                return 1
            except requests.exceptions.ProxyError:
                print('invalid ip ' + ip)
                return -1
            except requests.exceptions.ConnectTimeout:
                print('invalid ip ' + ip)
                return -2
            except requests.exceptions.ReadTimeout:
                print('invalid ip ' + ip)
                return -3
            except requests.exceptions.ConnectionError:
                print('invalid ip ' + ip)
                return -4
        else:
            return 0


if __name__ == '__main__':
    ip = IsProxy(10)  # e.g. 10 threads when run standalone

(5) Retrieving the valid IPs: module/GetApi.py

# -*- coding:utf-8 -*-
# Author:qiuzishan
from module.OptPool import ProxyPool
from module.QuChong import QuChong
from module.IsProxy import IsProxy


class GetApi:
    def start(self, t):
        self.t = t
        self.get_api()

    def get_api(self):
        pp = ProxyPool()
        proxy_txt = pp.get_out_proxy()
        for i in proxy_txt:
            print('proxy from ips.txt: ' + i)
            pp.get_into_pool(i)
        with open('ips.txt', 'w', encoding='utf-8') as file:
            file.write('')  # clear ips.txt so that expired proxies are dropped
        ip_test = IsProxy(self.t)
        while True:
            ip_test.startIsProxy()
            flag = pp.get_length_pool()
            if flag == 0:
                print('leaving the loop')
                break
        xiaochu = QuChong()  # deduplicate and then show the final proxies
        xiaochu.quchong()


if __name__ == '__main__':
    ga = GetApi()
    ga.start(10)  # e.g. 10 threads when run standalone

(6) Deduplication: module/QuChong.py

# -*- coding:utf-8 -*-
# Author:qiuzishan


class QuChong:
    def quchong(self):
        # read ips.txt and drop duplicate lines while keeping their order
        f = open('ips.txt', 'r', encoding='utf-8')
        lines = f.readlines()
        n = 0
        lines_clear = []
        for i in lines:
            if i not in lines_clear:
                lines_clear.append(i)
                n = n + 1
        f.close()
        # print(lines_clear)
        # write the deduplicated proxies back to ips.txt
        f_clear = open('ips.txt', 'w', encoding='utf-8')
        for i in range(0, n):
            f_clear.write(lines_clear[i])
            print('final valid ip: ' + lines_clear[i])
        f_clear.close()
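If you prefer something shorter, the same order-preserving deduplication can be sketched with dict.fromkeys; the behaviour should match the class above.

# order-preserving deduplication of ips.txt in a few lines
with open('ips.txt', 'r', encoding='utf-8') as f:
    lines_clear = list(dict.fromkeys(f.readlines()))
with open('ips.txt', 'w', encoding='utf-8') as f:
    f.writelines(lines_clear)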

3. Usage:

1. Command: python StartApi.py -t 100 starts 100 threads and stores the valid IPs in ips.txt (not yet deduplicated).

2. Command: python StartApi.py -c api -t 20 re-validates the IPs already in ips.txt with 20 threads and deduplicates them; the surviving high-quality proxies are written back to ips.txt.

3. Show the help text: python StartApi.py -h

The idea is simple: scrape a large number of IPs, validate them, and keep only the ones that actually work.
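Once ips.txt is filled, using the pool from your own script is straightforward. A minimal sketch, with https://httpbin.org/ip as a stand-in target URL:

import random
import requests

# load the validated proxies written by the pool
with open('ips.txt', 'r', encoding='utf-8') as f:
    proxy_list = [line.strip() for line in f if line.strip()]

ip = random.choice(proxy_list)  # pick one "ip:port" entry at random
proxies = {'http': 'http://' + ip, 'https': 'https://' + ip}
try:
    r = requests.get('https://httpbin.org/ip', proxies=proxies, timeout=10)
    print(r.text)
except requests.exceptions.RequestException:
    print('proxy ' + ip + ' failed, try another one')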

Paid proxies are just too expensive, around 20 yuan a day. If you don't want to spend money, this free proxy pool is the way to go.
