Last edited by jasonlv on 2018-4-4 18:08
The job is split into two steps: scraping proxy IPs, then verifying whether each IP is usable.
After testing, the share of usable proxies turned out to be very low. Haha, free proxies just don't cut it.

1. The first .py file scrapes the proxy IPs and ports

    import re
    import time
    from urllib import request

    # Request headers (urllib sets the Host header itself)
    headers = {
        'Accept': '*/*',
        'Accept-Language': 'zh-CN,zh;q=0.8',
        'User-Agent': 'Mozilla/5.0 (X11; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36',
        'Referer': 'http://www.xicidaili.com/nn',
        'Connection': 'keep-alive'
    }

    for i in range(1, 1000):
        url = "http://www.xicidaili.com/nn/{}".format(i)
        print(url)
        req = request.Request(url=url, headers=headers)
        try:
            html = request.urlopen(req, timeout=3).read().decode("utf-8")
        except Exception:
            print("Error!")
            continue
        # Extract the IPs and ports
        ip_list = re.findall(r"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}).*?(\d{2,6})", html, re.S)
        # Append the extracted IPs and ports to the file, one "ip:port" per line
        with open("ip.txt", "a+") as f:
            for ip, port in ip_list:
                f.write(ip + ":" + port + "\n")
        time.sleep(2)  # pause two seconds after each page
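To see what the extraction step does without hitting the site, here is a minimal offline sketch. The `SAMPLE_HTML` fragment is made up for illustration and assumes the IP and the port sit in two adjacent `<td>` cells; the real page layout may differ. `LOOSE` is the pattern from the script above, while `STRICT` and `extract_proxies` are hypothetical names for a tighter variant that only accepts that two-cell layout.

```python
import re

# Made-up fragment: assumes an IP cell directly followed by a port cell.
SAMPLE_HTML = """
<tr>
  <td>10.0.0.1</td>
  <td>8080</td>
  <td>HTTP</td>
</tr>
<tr>
  <td>192.168.1.20</td>
  <td>3128</td>
  <td>HTTPS</td>
</tr>
"""

# Pattern used in the scraper: an IP, then the next run of 2 to 6 digits.
LOOSE = re.compile(r"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}).*?(\d{2,6})", re.S)

# Tighter variant (an assumption about the markup, not taken from the site).
STRICT = re.compile(r"<td>(\d{1,3}(?:\.\d{1,3}){3})</td>\s*<td>(\d{2,5})</td>")


def extract_proxies(html, pattern=STRICT):
    """Return 'ip:port' strings for every match of `pattern` in `html`."""
    return ["{}:{}".format(ip, port) for ip, port in pattern.findall(html)]


print(extract_proxies(SAMPLE_HTML))         # ['10.0.0.1:8080', '192.168.1.20:3128']
print(extract_proxies(SAMPLE_HTML, LOOSE))  # same result on this sample
```

On clean rows both patterns agree. The loose one will pair an IP with whatever digits come next, so on messier markup it can produce bogus ports, which may be one reason so few scraped entries pass verification.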
2. The second .py file checks whether the IPs in the ip.txt file above are usable

    from urllib import request
    import socket

    socket.setdefaulttimeout(3)  # global 3-second socket timeout

    # Open the file of IPs saved by the first script
    with open("ip.txt") as inf:
        lines = inf.readlines()

    proxys = []
    for i in range(0, len(lines)):
        proxy_host = "http://" + lines[i].strip()
        proxy_temp = {"http": proxy_host}
        proxys.append(proxy_temp)

    # Page used for verification; an unusable IP raises an exception
    url = "http://ip.chinaz.com/getip.aspx"

    # Collect usable IPs here; they are written to 可用ip.txt at the end
    valid_temp = []
    count = 0
    for proxy in proxys:
        count += 1
        if count == 10:  # stop after checking the first 9 proxies
            break
        try:
            # Proxy configuration
            proxy_obj = request.ProxyHandler(proxy)
            opener = request.build_opener(proxy_obj)
            request.install_opener(opener)
            # Verify the proxy (3-second timeout set above)
            res = request.urlopen(url).read()
            valid_ip = proxy['http'][7:]  # drop the "http://" prefix
            print('Valid IP: {}'.format(valid_ip))
            valid_temp.append(valid_ip)
        except Exception:
            # Report unusable proxies
            print("Unusable: {}".format(proxy))
            continue

    with open("可用ip.txt", "w") as f1:
        for i in valid_temp:
            f1.write(i + "\n")
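The check above can also be wrapped in small functions, which makes it easier to reuse and to test. This is only a sketch: `load_proxies` and `check_proxy` are hypothetical helper names, not part of the original scripts. It calls `opener.open()` directly instead of `request.install_opener()`, so the process-wide default opener is left untouched, and it passes the timeout per request rather than through `socket.setdefaulttimeout()`.

```python
from urllib import request


def load_proxies(lines):
    """Turn 'ip:port' lines into ProxyHandler-style dicts, skipping blank lines."""
    proxies = []
    for line in lines:
        address = line.strip()
        if address:
            proxies.append({"http": "http://" + address})
    return proxies


def check_proxy(proxy, url, timeout=3):
    """Return True if `url` can be fetched through `proxy` within `timeout` seconds."""
    # A dedicated opener per proxy; nothing is installed globally.
    opener = request.build_opener(request.ProxyHandler(proxy))
    try:
        with opener.open(url, timeout=timeout) as response:
            response.read()
        return True
    except Exception:
        # Same broad catch as the script above: any failure counts as unusable.
        return False
```

With these, the main loop reduces to something like `valid = [p for p in load_proxies(lines) if check_proxy(p, url)]`, where `lines` and `url` are the same values as in the second script.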