How to Scrape UC News Content with a Proxy IP
Published: November 28, 2019

  With a good proxy IP you can handle a wide range of network tasks; large-scale web data scraping, in particular, relies on proxy IPs. Today, IP精灵 presents a tutorial on scraping content from a news site.
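  The sample script below makes its requests directly, but urllib can route traffic through a proxy IP by installing a ProxyHandler. Here is a minimal sketch; the address 127.0.0.1:8888 is only a placeholder, not a real proxy endpoint:

from urllib import request

# Placeholder proxy address - substitute a working proxy IP and port
proxy = 'http://127.0.0.1:8888'
proxy_handler = request.ProxyHandler({'http': proxy, 'https': proxy})
opener = request.build_opener(proxy_handler)
request.install_opener(opener)  # subsequent request.urlopen() calls go through the proxy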


  IP精灵 takes the UC news site as the example:

  The site does not use any sophisticated anti-scraping measures, so we can simply fetch and parse the pages directly.

from bs4 import BeautifulSoup
from urllib import request


def download(title, url):
    # Fetch the article page and parse its body
    req = request.Request(url)
    response = request.urlopen(req)
    response = response.read().decode('utf-8')
    soup = BeautifulSoup(response, 'lxml')
    tag = soup.find('div', class_='sm-article-content')
    if tag is None:
        return 0
    # Strip characters that are not allowed in Windows file names
    for ch in (':', '"', '|', '/', '\\', '*', '<', '>', '?'):
        title = title.replace(ch, '')
    # Save the article as a text file (the output folder must already exist)
    with open(r'D:\code\python\spider_news\UC_news\society\\' + title + '.txt', 'w', encoding='utf-8') as file_object:
        file_object.write('\t\t\t\t')
        file_object.write(title)
        file_object.write('\n')
        file_object.write('该新闻地址:')  # "URL of this article:"
        file_object.write(url)
        file_object.write('\n')
        file_object.write(tag.get_text())
        # print('crawling...')


if __name__ == '__main__':
    for i in range(0, 7):
        # Listing page of the society channel (the URL is the same on every pass)
        url = 'https://news.uc.cn/c_shehui/'
        # headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.91 Safari/537.36",
        #            "cookie": "sn=3957284397500558579; _uc_pramas=%7B%22fr%22%3A%22pc%22%7D"}
        # res = request.Request(url, headers=headers)
        res = request.urlopen(url)
        req = res.read().decode('utf-8')
        soup = BeautifulSoup(req, 'lxml')
        # print(soup.prettify())
        tag = soup.find_all('div', class_='txt-area-title')
        for x in tag:
            # Each headline block links to the full article page
            news_url = 'https://news.uc.cn' + x.a.get('href')
            print(x.a.string, news_url)
            download(x.a.string, news_url)
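  If UC starts rejecting the default urllib User-Agent, the commented-out headers above show the fix: build the request with request.Request(url, headers=headers) and pass that request object to request.urlopen() instead of the bare URL.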

  With that, the news data has been scraped; check the output of the run to confirm that the articles were saved successfully.
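  As a quick check, you can list the files written to the output folder used in download() — a minimal sketch, assuming the same hard-coded path as above:

import os

# Print every article file saved by the crawler
save_dir = r'D:\code\python\spider_news\UC_news\society'
for name in os.listdir(save_dir):
    if name.endswith('.txt'):
        print(name)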