There is no code that can't be written, only programmers who don't try hard enough.

Scraping free proxy servers with Python (Part 2)

Hi! Today I revisited the script and added the proxy-validation code.

Previous post: https://www.jimmyblog.com.cn/2020/09/22/get_proxy/

Full code:

import requests
import bs4

IPlist = []
PORTlist = []
proxylist = []
success = []

def test_baidu(proxyip):
    """Check whether a proxy works over HTTP first, then over HTTPS."""
    proxy = {"http": proxyip}
    try:
        requests.get("http://httpbin.org/get", proxies=proxy, timeout=2)
    except requests.RequestException:
        return "none"
    try:
        proxy = {"https": proxyip}
        requests.get("https://httpbin.org/get", proxies=proxy, timeout=2)
    except requests.RequestException:
        # HTTP works but HTTPS does not
        success.append("http://" + proxyip)
        return "http://" + proxyip
    else:
        success.append("https://" + proxyip)
        return "https://" + proxyip

def get_proxy(x):
    """Scrape page x of the free-proxy list and extend proxylist."""
    response = requests.get("https://www.kuaidaili.com/free/inha/{}/".format(x))
    soup = bs4.BeautifulSoup(response.text, "html.parser")
    tables = soup.find_all(name="table", class_="table table-bordered table-striped")
    for table in tables:
        ips = [td.text for td in table.find_all(name="td", attrs={"data-title": "IP"})]
        ports = [td.text for td in table.find_all(name="td", attrs={"data-title": "PORT"})]
        IPlist.extend(ips)
        PORTlist.extend(ports)
        # pair only the rows scraped from this table, so repeated calls
        # to get_proxy() do not re-append earlier entries
        for ip, port in zip(ips, ports):
            proxylist.append(ip + ":" + port)

for i in range(1, 10):
    get_proxy(i)
for i in proxylist:
    test = test_baidu(i)
    if test != "none":
        print(test)
print(success)

What's new: the logic is now wrapped into functions, and each working proxy is tested and tagged as http or https.
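Because each validated entry in `success` already carries its scheme, building the `proxies` dict that `requests` expects is just a string split. A minimal sketch, using a hypothetical address for illustration:

```python
# A validated entry from the `success` list (hypothetical address).
entry = "http://1.2.3.4:8080"

# The part before "://" tells us which key the proxies dict needs.
scheme = entry.split("://", 1)[0]
proxies = {scheme: entry}

print(proxies)  # {'http': 'http://1.2.3.4:8080'}

# The dict can then be passed straight to requests, e.g.:
# requests.get("http://httpbin.org/get", proxies=proxies, timeout=2)
```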

Coming next: multi-process crawling.
