update

yingjinghuang · May 13, 2019 · e825d9b
1 parent 5e23533
commit e825d9b

Showing 6 changed files with 423 additions and 1 deletion.
diff --git a/README.md b/README.md
@@ -24,7 +24,11 @@ git clone https://github.com/RealIvyWong/WeiboCrawler.git
 
 ## File descriptions
 
-**WeiboLocationCrawler** crawls **check-in pages**, and **WeiboUserInfoCrawler** crawls **user information** (it needs your own cookie).
+**WeiboLocationCrawler** crawls **check-in pages**.
+
+**WeiboUserInfoCrawler** crawls **user information** (it needs your own cookie).
+
+**WeiboRepoandComm** crawls the **comments or reposts of a single Weibo post** (it may need your own cookie).
 
 For details on each project, see the README on that project's page.
 

diff --git a/WeiboRepoandComm/Method.md b/WeiboRepoandComm/Method.md
@@ -0,0 +1,100 @@
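+# Single-post repost and comment crawler: crawling method
+
+## Target pages
+
+A crawler works on web pages, so first we need to know which pages to crawl.
+
+Crawling weibo.com directly is not practical, because the desktop site's anti-crawling measures are a real hassle. After looking at other people's crawlers, I picked up the following two URLs.
+
+**Reposts** `https://m.weibo.cn/api/statuses/repostTimeline?id={}&page={}`
+
+**Comments** `https://m.weibo.cn/api/comments/show?id={}&page={}`
+
+Here `id` is the post's id, either the short alphanumeric id or the numeric id, and `page` is simply the page number.
+
+Both return JSON, structured as in the screenshot below. Entries 0-9 hold the details of each repost or comment.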
+
+![1557764155552](https://github.com/RealIvyWong/ImageHosting/raw/master/assets/1557764155552.png)
+
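+However, both endpoints generally serve only about 100 pages, so after roughly 1,000 items at most the returned `ok` value becomes 0.
+
+That said, the behavior does not look very stable. Other people seem to use the comment endpoint without a cookie, but when I tried it, from page 3 onward a request without a cookie was redirected to the login page, in both the browser and the crawler. And when I ran the repost endpoint just now, it got past page 120.
+
+So the **strategy** is a `while True` loop that keeps incrementing `page` until the returned `ok` value is 0. When a request without a cookie is redirected to the login page, the status code is 404, so one more check adds the cookie only when the status code is 404. Better not to use your own account information unless you have to. Once the cookie is in use, it is safer to have the program wait 3 seconds between requests, imitating a person browsing Weibo, so the account does not get banned (a shorter interval might work, but 3 seconds is acceptable to me, so I did not try others).
+
+So I wrote the function below to fetch the JSON returned by the endpoint.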
+
+```python
+import json, random, sys, time
+import requests, buildippool  # assumed imports; buildippool is this repo's proxy-pool module
+
+def getJson(mid,page,type,ippool):
+    global cookie  # module-level cookie string, used only after a 404
+    if type=='repost':
+        url='https://m.weibo.cn/api/statuses/repostTimeline?id={}&page={}'.format(mid,page)
+    else:
+        url='https://m.weibo.cn/api/comments/show?id={}&page={}'.format(mid,page)
+    headers = {
+    'User-agent' : 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:55.0) Gecko/20100101 Firefox/55.0',
+    'Host' : 'm.weibo.cn',
+    'Accept' : 'application/json, text/plain, */*',
+    'Accept-Language' : 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
+    'Accept-Encoding' : 'gzip, deflate, br',
+    'Referer' : 'https://m.weibo.cn/status/' + mid,
+    'DNT' : '1',
+    'Connection' : 'keep-alive',
+    }
+
+    proxy_ip=random.choice(ippool)
+    while True:
+        try:
+            res=requests.get(url=url,headers=headers, proxies=proxy_ip,timeout=20)
+            res.encoding='utf-8'
+            if res.status_code==404:
+                if 'cookie' in headers:
+                    cookie=input('The cookie has expired, please enter a new one: ')
+                headers['cookie']=cookie
+                time.sleep(3)
+                continue
+            break
+        except (requests.exceptions.ProxyError, requests.exceptions.Timeout) as e:
+            print('Proxy failed, switching to another one')
+            ippool.remove(proxy_ip)
+            # If the pool is empty, fetch the proxy list again
+            if len(ippool)<1:
+                ippool=buildippool.buildippool()
+            proxy_ip=random.choice(ippool)
+        except Exception as e:
+            # Any other error (rare in practice): print it and exit
+            print(e)
+            sys.exit()
+
+    jd = json.loads(res.text)
+
+    if jd['ok']==1:
+        return jd,ippool
+    else:
+        print(jd) # Optional, but useful in case some other error caused ok to be 0
+        print('No more content here, it seems')
+        return 0,ippool
+```
+
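+## Other notes
+
+I won't explain the code that parses the JSON and writes it to the database; it is mostly basic operations.
+
+Worth mentioning is the free proxy list used in buildippool.py (https://raw.githubusercontent.com/fate0/proxylist/master/proxy.list). I forget where I came across it; it has maybe a few thousand entries and works very well!
+
+Verifying every entry would take a very long time, so I just shuffle the fetched list and take the first 100.
+
+The pages we crawl are all https, so http proxies are filtered out of those 100, which is why you will usually see only about 40 to 50 fetched. After verification, perhaps only 10 or so remain. Don't panic, that is plenty!
+
+The function above also has a mechanism for this: if a proxy cannot connect or times out, it is removed from the current proxy pool. If the pool becomes empty, the proxy list is fetched again.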
+
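+## Contact Me
+
+If you have any suggestions or comments, feel free to contact me ([email protected]) or open an issue!
+
+
+
+[Note] I am really extremely socially anxious, so please don't add me on QQ. If you have a question, email me or open an issue; I will reply as soon as I see it. Thanks!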
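
Note on Method.md: the paging strategy it describes (keep incrementing `page` until `ok` is 0) can be sketched roughly as below. This is only an illustration under assumptions, not the repository's crawler.py: `iter_pages`, `fake_fetch`, the `max_pages` safety cap, and starting at page 1 are all hypothetical, and the fake fetcher stands in for a real call such as `getJson` so the sketch runs offline.

```python
from typing import Callable, Iterator


def iter_pages(fetch_page: Callable[[int], dict], max_pages: int = 1000) -> Iterator[dict]:
    """Yield each page's JSON until the endpoint reports ok != 1.

    fetch_page is any callable that takes a page number and returns the
    decoded JSON dict (hypothetical; a wrapper around getJson would fit).
    """
    page = 1  # assumption: numbering starts at 1
    while page <= max_pages:  # safety cap so a misbehaving endpoint cannot loop forever
        jd = fetch_page(page)
        if jd.get('ok') != 1:
            break  # ok == 0 means there is nothing more to read
        yield jd
        page += 1


def fake_fetch(page: int) -> dict:
    """Offline stand-in for the real endpoint: three pages, then ok == 0."""
    if page <= 3:
        return {'ok': 1, 'page': page}
    return {'ok': 0}


pages = list(iter_pages(fake_fetch))
print(len(pages), 'pages fetched')
```

Passing the fetcher in as a parameter keeps the stop condition separate from the network and proxy handling, which makes the loop easy to check without touching Weibo.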
diff --git a/WeiboRepoandComm/README.md b/WeiboRepoandComm/README.md
@@ -0,0 +1,83 @@
+# Single-post repost and comment crawler
+
+Project link: https://github.com/RealIvyWong/WeiboCrawler/tree/master/WeiboRepoandComm
+
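+## 1 Features
+
+This project crawls the repost or comment data of a single Weibo post and writes it to a sqlite database.
+
+It may use your own cookie (the code requires one to be set, otherwise it raises an error).
+
+## 2 Requirements
+
+Python 3.7.
+
+The only third-party library needed is `requests`.
+
+## 3 Usage
+
+**step1.** Edit cookie, mid (the post id), and type (repost or comment) in start.py.
+
+**step2.** Run start.py.
+
+When the cookie is used to access pages, crawling is slower, because I don't want the account banned, so it is set to one request every 3 seconds.
+
+> [**What is a Weibo post id?**]
+>
+> On the desktop site weibo.com, clicking "view more" in a post's comment area takes you to that post's own page, like the screenshot below.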
+>
+> ![1557762440364](https://github.com/RealIvyWong/ImageHosting/raw/master/assets/1557762440364.png)
+>
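+> In the address bar URL `<https://weibo.com/1929075382/Hu2bGcN5r?filter=hot&root_comment_id=0&type=comment#_rnd1557761434227>`, `Hu2bGcN5r` is this post's short alphanumeric id. Weibo posts also have a numeric id, but the two are interchangeable here, so no need to worry about it.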
+
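+## 4 File descriptions
+
+There are three files.
+
+### buildippool.py
+
+A module that fetches proxy IPs from a proxy list site and builds a proxy pool.
+
+### crawler.py
+
+The crawler itself.
+
+### start.py
+
+The launcher script.
+
+## 5 Crawl example
+
+Once it starts running successfully, the console output looks roughly like this. It first fetches proxy IPs (this can take about 5 minutes).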
+
+![1557762195827](https://github.com/RealIvyWong/ImageHosting/raw/master/assets/1557762195827.png)
+
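+Once the proxies are ready, it crawls reposts or comments page by page. The limit is about 100 pages, so at most roughly 1,000 items. A run looks like the screenshot below; my example post had few comments, so there are only two pages.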
+
+![1557762990107](https://github.com/RealIvyWong/ImageHosting/raw/master/assets/1557762990107.png)
+
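+The resulting `weibo.sqlite` contains a single table: the repost table if you crawled reposts, or the comment table if you crawled comments.
+
+**repost table**
+
+Columns: id (row number), mid (id of the reposting post), uid (user id), user (user name), content (repost text), time (repost time), root_mid (id of the original post).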
+
+![1557763634145](https://github.com/RealIvyWong/ImageHosting/raw/master/assets/1557763634145.png)
+
+**comment table**
+
+Columns: id (row number), cid (comment id), uid (user id), user (user name), content (comment text), time (comment time), root_mid (id of the original post).
+
+![1557762965481](https://github.com/RealIvyWong/ImageHosting/raw/master/assets/1557762965481.png)
+
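+## 6 Crawling details
+
+To learn more about how the crawling works, see **Method.md** in this folder. It outlines the approach behind my code.
+
+## 7 Contact Me
+
+If you have any suggestions or comments, feel free to contact me ([email protected]) or open an issue!
+
+
+
+[Note] I am really extremely socially anxious, so please don't add me on QQ. If you have a question, email me or open an issue; I will reply as soon as I see it. Thanks!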
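
Note on README.md: the repost table it describes could be created along the lines below. This is a sketch under assumptions rather than the project's actual schema: the column names come from the README, but the column types, the `AUTOINCREMENT` row number, the helper names `create_repost_table` and `insert_repost`, and the sample row are all hypothetical. It uses an in-memory database so it runs without creating `weibo.sqlite`.

```python
import sqlite3


def create_repost_table(conn: sqlite3.Connection) -> None:
    """Create the repost table with the columns listed in the README (types assumed)."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS repost (
               id INTEGER PRIMARY KEY AUTOINCREMENT,  -- row number
               mid TEXT,        -- id of the reposting post
               uid TEXT,        -- user id
               user TEXT,       -- user name
               content TEXT,    -- repost text
               time TEXT,       -- repost time, stored as text here
               root_mid TEXT    -- id of the original post
           )"""
    )


def insert_repost(conn: sqlite3.Connection, row: tuple) -> None:
    """Insert one repost; placeholders keep user-supplied text out of the SQL string."""
    conn.execute(
        "INSERT INTO repost (mid, uid, user, content, time, root_mid) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        row,
    )


conn = sqlite3.connect(':memory:')  # in-memory database, for illustration only
create_repost_table(conn)
insert_repost(conn, ('m001', 'u001', 'someone', 'hello', '2019-05-13', 'Hu2bGcN5r'))
conn.commit()
rows = conn.execute('SELECT id, user, root_mid FROM repost').fetchall()
print(rows)
```

The comment table would look the same with `cid` in place of `mid`.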
diff --git a/WeiboRepoandComm/buildippool.py b/WeiboRepoandComm/buildippool.py
@@ -0,0 +1,66 @@
+# coding:utf-8
+# version:python3.7
+# author:Ivy
+
+import requests,json
+import random
+
+# Fetch usable proxies from a public proxy list and build a proxy pool
+class Proxies:
+    def __init__(self):
+        self.proxy_list = []
+
+    # Fetch candidates from the fate0/proxylist feed (one JSON object per line)
+    def get_proxy_nn(self):
+        proxy_list = []
+        res = requests.get("https://raw.githubusercontent.com/fate0/proxylist/master/proxy.list")
+        proxyList=str(res.text).split('\n')
+        random.shuffle(proxyList)
+        # Sample at most 100 shuffled entries; slicing avoids an IndexError on short lists
+        for proxy in proxyList[:100]:
+            if '{' not in proxy:
+                continue
+            jd=json.loads(proxy)
+            if jd['type']=='http':
+                continue
+            host_port=str(jd['host'])+':'+str(jd['port'])
+            proxy_list.append(host_port)
+            #print(host_port,'Success')
+        return proxy_list
+
+    # Check whether each proxy actually works
+    def verify_proxy(self, proxy_list):
+        for proxy in proxy_list:
+            proxies = {
+                "https": proxy
+            }
+            try:
+                if requests.get('https://www.baidu.com', proxies=proxies, timeout=5).status_code == 200:
+                    if proxy not in self.proxy_list:
+                        self.proxy_list.append(proxy)
+                    print('Success',proxy)
+            except Exception:
+                print('Fail',proxy)
+
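+    # Collect the verified proxies into the ippool list
+    def save_proxy(self):
+        ippool=[]
+        print("Saving proxies to the pool...")
+        # Key by "https": the crawled URLs are https, and requests picks the proxy by URL scheme
+        for proxy in self.proxy_list:
+            proxies={"https":proxy}
+            ippool.append(proxies)
+        return ippool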
+
+
+# Build the proxy pool using the class above
+def buildippool():
+    p = Proxies()
+    results = p.get_proxy_nn()
+    print("Proxies fetched:", len(results))
+    print("Verifying:")
+    p.verify_proxy(results)
+    print("Verification done.")
+    print("Usable proxies:", len(p.proxy_list))
+    ippool = p.save_proxy()
+    return ippool
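
Note on buildippool.py: the parsing and filtering step can be tried offline with a sketch like the one below. It is an illustration under assumptions, not the module itself: `parse_proxy_lines` and the sample lines are hypothetical (the addresses are documentation-range placeholders), and unlike the code above it keeps only entries whose type is exactly `https` instead of skipping only `http`. The one-JSON-object-per-line layout with `type`, `host`, and `port` fields follows what `get_proxy_nn` reads.

```python
import json


def parse_proxy_lines(text: str, limit: int = 100) -> list:
    """Turn proxy-list text into requests-style proxy dicts, keeping https entries only."""
    pool = []
    for line in text.splitlines()[:limit]:
        line = line.strip()
        if not line.startswith('{'):
            continue  # skip blank lines and anything that is not a JSON object
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed entries instead of crashing
        if entry.get('type') != 'https':
            continue
        # Keyed by "https" so requests applies the proxy to https URLs
        pool.append({'https': '{}:{}'.format(entry['host'], entry['port'])})
    return pool


sample = '\n'.join([
    '{"type": "http", "host": "192.0.2.1", "port": 8080}',
    '{"type": "https", "host": "192.0.2.2", "port": 3128}',
    '',
    'not json',
])
pool = parse_proxy_lines(sample)
print(pool)
```

Each resulting dict can be passed directly as the `proxies` argument of `requests.get`.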