Skip to content

Latest commit

 

History

History
482 lines (311 loc) · 21.5 KB

2939.md

File metadata and controls

482 lines (311 loc) · 21.5 KB

用 Spark、Optimus 和 Twint 在几分钟内分析推文的 NLP

www.kdnuggets.com/2019/05/analyzing-tweets-nlp-spark-optimus-twint.html

c 评论 figure-name

介绍

如果你来到这里,很可能是因为你对分析推文(或类似的东西)感兴趣,并且你有很多推文,或者可以获取它们。为此,最烦人的事情之一就是获取 Twitter 应用程序,获取认证以及所有其他内容。而且如果你使用 Pandas,就没有办法进行扩展。

那么,有一个系统不需要通过 Twitter API 认证,可以获取几乎无限数量的推文,并且具备分析这些推文的能力,包括 NLP 和更多功能,你觉得怎么样?你真幸运,因为这正是我现在要展示给你的。

获取项目和仓库

figure-name matrixds.com/

你可以非常轻松地跟随我展示的内容。只需克隆这个 MatrixDS 项目:

MatrixDS | 数据项目工作台

MatrixDS 是一个可以在任何规模上构建、共享和管理数据项目的地方。 社区平台

figure-name

另外,还有一个包含所有内容的 GitHub 仓库:

FavioVazquez/twitter_optimus_twint

使用 Twint、Optimus 和 Apache Spark 分析推文。 - FavioVazquez/twitter_optimus_twint github.com

使用 MatrixDS,你实际上可以免费运行笔记本,获取数据并进行分析,因此如果你想了解更多,请动手尝试。

获取 Twint 和 Optimus

figure-name

Twint 利用 Twitter 的搜索操作符,让你可以从特定用户那里抓取推文,抓取与某些主题、标签和趋势相关的推文,或从推文中筛选出敏感信息,如电子邮件和电话号码。

使用我共同创建的库 Optimus,你可以清理数据、准备数据、分析数据、创建分析器和图表,并进行机器学习和深度学习,所有这些都是以分布式方式进行的,因为在后台我们有 Spark、TensorFlow、Sparkling Water 和 Keras。

所以,首先让我们安装你所需的一切,在 Matrix 项目中,转到 Analyze Tweets 笔记本并运行(你也可以通过 JupyterLab 终端来完成):

!pip install --user -r requirements.txt

在此之后,我们需要安装 Twint,运行以下命令:

!pip install --upgrade --user -e git+https://github.com/twintproject/twint.git@origin/master#egg=twint

这将下载一个 scr/ 文件夹,所以我们需要做一些配置:

!mv src/twint .
!rm -r src

然后要导入 Twint,我们需要运行:

%load_ext autoreload
%autoreload 2
import sys
sys.path.append("twint/")

最后:

import twint

Optimus 在第一步中已安装,所以让我们直接启动它(这将为你启动一个 Spark 集群):

from optimus import Optimus
op = Optimus()

设置 Twint 以抓取推文

figure-namewww.theverge.com/2015/11/3/9661180/twitter-vine-favorite-fav-likes-hearts

# Set up TWINT config
c = twint.Config()

如果你在笔记本上运行,你还需要运行:

# Solve compatibility issues with notebooks and RunTime errors.
import nest_asyncio
nest_asyncio.apply()

搜索数据科学推文

figure-name

我将开始我们的分析,抓取关于数据科学的推文,你可以将其更改为你想要的内容。

为了做到这一点,我们只需运行:

c.Search = "data science"
# Custom output format
c.Format = "Username: {username} |  Tweet: {tweet}"
c.Limit = 1
c.Pandas = True
twint.run.Search(c)

让我向你解释这段代码。在最后一节中,当我们运行代码时:

c = twint.Config()

我们开始了新的 Twint 配置。之后我们需要传递不同的选项来抓取推文。以下是配置选项的完整列表:

Variable             Type       Description
--------------------------------------------
Username             (string) - Twitter user's username
User_id              (string) - Twitter user's user_id
Search               (string) - Search terms
Geo                  (string) - Geo coordinates (lat,lon,km/mi.)
Location             (bool)   - Set to True to attempt to grab a Twitter user's location (slow).
Near                 (string) - Near a certain City (Example: london)
Lang                 (string) - Compatible language codes: https://github.com/twintproject/twint/wiki/Langauge-codes
Output               (string) - Name of the output file.
Elasticsearch        (string) - Elasticsearch instance
Timedelta            (int)    - Time interval for every request (days)
Year                 (string) - Filter Tweets before the specified year.
Since                (string) - Filter Tweets sent since date (Example: 2017-12-27).
Until                (string) - Filter Tweets sent until date (Example: 2017-12-27).
Email                (bool)   - Set to True to show Tweets that _might_ contain emails.
Phone                (bool)   - Set to True to show Tweets that _might_ contain phone numbers.
Verified             (bool)   - Set to True to only show Tweets by _verified_ users
Store_csv            (bool)   - Set to True to write as a csv file.
Store_json           (bool)   - Set to True to write as a json file.
Custom               (dict)   - Custom csv/json formatting (see below).
Show_hashtags        (bool)   - Set to True to show hashtags in the terminal output.
Limit                (int)    - Number of Tweets to pull (Increments of 20).
Count                (bool)   - Count the total number of Tweets fetched.
Stats                (bool)   - Set to True to show Tweet stats in the terminal output.
Database             (string) - Store Tweets in a sqlite3 database. Set this to the DB. (Example: twitter.db)
To                   (string) - Display Tweets tweeted _to_ the specified user.
All                  (string) - Display all Tweets associated with the mentioned user.
Debug                (bool)   - Store information in debug logs.
Format               (string) - Custom terminal output formatting.
Essid                (string) - Elasticsearch session ID.
User_full            (bool)   - Set to True to display full user information. By default, only usernames are shown.
Profile_full         (bool)   - Set to True to use a slow, but effective method to enumerate a user's Timeline.
Store_object         (bool)   - Store tweets/user infos/usernames in JSON objects.
Store_pandas         (bool)   - Save Tweets in a DataFrame (Pandas) file.
Pandas_type          (string) - Specify HDF5 or Pickle (HDF5 as default).
Pandas               (bool)   - Enable Pandas integration.
Index_tweets         (string) - Custom Elasticsearch Index name for Tweets (default: twinttweets).
Index_follow         (string) - Custom Elasticsearch Index name for Follows (default: twintgraph).
Index_users          (string) - Custom Elasticsearch Index name for Users (default: twintuser).
Index_type           (string) - Custom Elasticsearch Document type (default: items).
Retries_count        (int)    - Number of retries of requests (default: 10).
Resume               (int)    - Resume from a specific tweet id (**currently broken, January 11, 2019**).
Images               (bool)   - Display only Tweets with images.
Videos               (bool)   - Display only Tweets with videos.
Media                (bool)   - Display Tweets with only images or videos.
Replies              (bool)   - Display replies to a subject.
Pandas_clean         (bool)   - Automatically clean Pandas dataframe at every scrape.
Lowercase            (bool)   - Automatically convert uppercases in lowercases.
Pandas_au            (bool)   - Automatically update the Pandas dataframe at every scrape.
Proxy_host           (string) - Proxy hostname or IP.
Proxy_port           (int)    - Proxy port.
Proxy_type           (string) - Proxy type.
Tor_control_port     (int) - Tor control port.
Tor_control_password (string) - Tor control password (not hashed).
Retweets             (bool)   - Display replies to a subject.
Hide_output          (bool)   - Hide output.
Get_replies          (bool)   - All replies to the tweet.

所以在这段代码中:

c.Search = "data science"
# Custom output format
c.Format = "Username: {username} |  Tweet: {tweet}"
c.Limit = 1
c.Pandas = True

我们设置搜索词,然后格式化响应(只是检查一下),只获取 20 条推文,限制为 1(它们是以 20 为增量的),最后使结果与 Pandas 兼容。

然后当我们运行:

twint.run.Search(c)

我们正在启动搜索。结果是:

Username: tmj_phl_pharm |  Tweet: If you're looking for work in Spring House, PA, check out this Biotech/Clinical/R&D/Science job via the link in our bio: KellyOCG Exclusive: Data Access Analyst in Spring House, PA- Direct Hire at Kelly Services #KellyJobs #KellyServices
Username: DataSci_Plow |  Tweet: Bring your Jupyter Notebook to life with interactive widgets  https://www.plow.io/post/bring-your-jupyter-notebook-to-life-with-interactive-widgets?utm_source=Twitter&utm_campaign=Data_science+1 Hal2000Bot #data #science
Username: ottofwagner |  Tweet: Top 7 R Packages for Data Science and AI   https://noeliagorod.com/2019/03/07/top-7-r-packages-for-data-science-and-ai/#DataScience #rstats #MachineLearning
Username: semigoose1 |  Tweet: ëäSujy #crypto #bitcoin #java #competition #influencer #datascience #fintech #science #EU  https://vk.com/id15800296  https://semigreeth.wordpress.com/2019/05/03/easujy-crypto-bitcoin-java-competition-influencer-datascience-fintech-science-eu- https-vk-com-id15800296/ …
Username: Datascience__ |  Tweet: Introduction to Data Analytics for Business  http://zpy.io/c736cf9f  #datascience #ad
Username: Datascience__ |  Tweet: How Entrepreneurs in Emerging Markets can master the Blockchain Technology  http://zpy.io/f5fad501  #datascience #ad
Username: viktor_spas |  Tweet: [Перевод] Почему Data Science командам нужны универсалы, а не специалисты  https://habr.com/ru/post/450420/?utm_source=dlvr.it&utm_medium=twitter&utm_campaign=450420pic.twitter.com/i98frTwPSE
Username: gp_pulipaka |  Tweet: Orchestra is a #RPA for Orchestrating Project Teams. #BigData #Analytics #DataScience #AI #MachineLearning #Robotics #IoT #IIoT #PyTorch #Python #RStats #TensorFlow #JavaScript #ReactJS #GoLang #CloudComputing #Serverless #DataScientist #Linux @lruettimann  http://bit.ly/2Hn6qYd  pic.twitter.com/kXizChP59U
Username: amruthasuri |  Tweet: "Here's a typical example of a day in the life of a RagingFX trader. Yesterday I received these two signals at 10am EST. Here's what I did... My other activities have kept me so busy that ...  http://bit.ly/2Jm9WT1  #Learning #DataScience #bigdata #Fintech pic.twitter.com/Jbes6ro1lY
Username: PapersTrending |  Tweet: [1/10] Real numbers, data science and chaos: How to fit any dataset with a single parameter - 192 stars - pdf:  https://arxiv.org/pdf/1904.12320v1.pdf- github: https://github.com/Ranlot/single-parameter-fitUsername: webAnalyste |  Tweet: Building Data Science Capabilities Means Playing the Long Game  http://dlvr.it/R41k3t  pic.twitter.com/Et5CskR2h4
Username: DataSci_Plow |  Tweet: Building Data Science Capabilities Means Playing the Long Game  https://www.plow.io/post/building-data-science-capabilities-means-playing-the-long-game?utm_source=Twitter&utm_campaign=Data_science+1 Hal2000Bot #data #science
Username: webAnalyste |  Tweet: Towards Well Being, with Data Science (part 2)  http://dlvr.it/R41k1K  pic.twitter.com/4VbljUcsLh
Username: DataSci_Plow |  Tweet: Understanding when Simple and Multiple Linear Regression give Different Results  https://www.plow.io/post/understanding-when-simple-and-multiple-linear-regression-give-different-results?utm_source=Twitter&utm_campaign=Data_science+1 Hal2000Bot #data #science
Username: DataSci_Plow |  Tweet: Artificial Curiosity  https://www.plow.io/post/artificial-curiosity?utm_source=Twitter&utm_campaign=Data_science+1 Hal2000Bot #data #science
Username: gp_pulipaka |  Tweet: Synchronizing the Digital #SCM using AI for Supply Chain Planning. #BigData #Analytics #DataScience #AI #RPA #MachineLearning #IoT #IIoT #Python #RStats #TensorFlow #JavaScript #ReactJS #GoLang #CloudComputing #Serverless #DataScientist #Linux @lruettimann  http://bit.ly/2KX8vrt  pic.twitter.com/tftxwilkQf
Username: DataSci_Plow |  Tweet: Extreme Rare Event Classification using Autoencoders in Keras  https://www.plow.io/post/extreme-rare-event-classification-using-autoencoders-in-keras?utm_source=Twitter&utm_campaign=Data_science+1 Hal2000Bot #data #science
Username: DataSci_Plow |  Tweet: Five Methods to Debug your Neural Network  https://www.plow.io/post/five-methods-to-debug-your-neural-network?utm_source=Twitter&utm_campaign=Data_science+1 Hal2000Bot #data #science
Username: iamjony94 |  Tweet: 26 Mobile and Desktop Tools for Marketers  http://bit.ly/2LkL3cN  #socialmedia #digitalmarketing #contentmarketing #growthhacking #startup #SEO #ecommerce #marketing #influencermarketing #blogging #infographic #deeplearning #ai #machinelearning #bigdata #datascience #fintech pic.twitter.com/mxHiY4eNXR
Username: TDWI |  Tweet: #ATL #DataPros: Our #analyst, @prussom is headed your way to speak @ the #FDSRoadTour on Wed, 5/8! Register to attend for free, learn about Modern #DataManagement in the Age of #Cloud & #DataScience: Trends, Challenges & Opportunities.  https://bit.ly/2WlYOJb  #Atlanta #freeevent

看起来不是很好,但我们得到了我们想要的。推文!

将结果保存到 Pandas

figure-name

可惜 Twint 和 Spark 之间没有直接的连接,但我们可以通过 Pandas 实现,然后将结果传递给 Optimus。

我创建了两个简单的函数,你可以在实际项目中看到,它们可以帮助你处理 Pandas 和 Twint API 的这一部分。所以当我们运行:

available_columns()

你会看到:

Index(['conversation_id', 'created_at', 'date', 'day', 'hashtags', 'hour','id', 'link', 'location', 'name', 'near', 'nlikes', 'nreplies','nretweets', 'place', 'profile_image_url', 'quote_url', 'retweet','search', 'timezone', 'tweet', 'user_id', 'user_id_str', 'username'],dtype='object')

这些是我们刚刚查询得到的列。对于这些数据有很多不同的操作,但在这篇文章中我只使用其中的一部分。所以为了将 Twint 的结果转换为 Pandas,我们运行:

df_pd = twint_to_pandas(["date", "username", "tweet", "hashtags", "nlikes"])

你会看到这个 Pandas 数据框:

figure-name

好得多,不是吗?

情感分析(简单方法)

figure-name

我们将对一些推文进行情感分析,使用 Optimus 和 TextBlob,一个用于自然语言处理的库。我们需要做的第一件事是清理这些推文,为此 Optimus 是最佳选择。

为了将数据保存为 Optimus (Spark) 数据框,我们需要运行:

df = op.create.data_frame(pdf= df_pd)

我们将使用 Optimus 移除重音符号和特殊字符(在实际工作场景中你需要做更多的工作,比如移除链接、图片和停用词),为此:

clean_tweets = df.cols.remove_accents("tweet") \
                 .cols.remove_special_chars("tweet")

然后我们需要从 Spark 收集这些推文,将其放入一个 Python 列表中,为此:

tweets = clean_tweets.select("tweet").rdd.flatMap(lambda x: x).collect()

然后为了分析这些推文的情感,我们将使用 TextBlob 情感 函数:

from textblob import TextBlob
from IPython.display import Markdown, display

# Pretty printing the result
def printmd(string, color=None):
    colorstr = "<span style='color:{}'>{}</span>".format(color, string)
    display(Markdown(colorstr))
for tweet in tweets:
    print(tweet)
    analysis = TextBlob(tweet)
    print(analysis.sentiment)
    if analysis.sentiment[0]>0:
        printmd('Positive', color="green")
    elif analysis.sentiment[0]<0:
        printmd('Negative', color="red")
    else:
        printmd("No result", color="grey")
        print("")

这将给我们:

IAM Platform Curated Retweet  Via  httpstwittercomarmaninspace  ArtificialIntelligence AI What About The User Experience  httpswwwforbescomsitestomtaulli20190427artificialintelligenceaiwhatabouttheuserexperience  AI DataScience MachineLearning BigData DeepLearning Robots IoT ML DL IAMPlatform TopInfluence ArtificialIntelligence
Sentiment(polarity=0.0, subjectivity=0.0)

中立

Seattle Data Science Career Advice Landing a Job in The Emerald City Tips from Metis Seattle Career Advisor Marybeth Redmondhttpsbitly2IYjzaj  pictwittercom98hMYZVxsu
Sentiment(polarity=0.0, subjectivity=0.0)

中立

This webinarworkshop is designed for business leaders data science managers and decision makers who want to build effective AI and data science capabilities for their organization Register here  httpsbitly2GDQeQT  pictwittercomxENQ0Dtv1X
Sentiment(polarity=0.6, subjectivity=0.8)

积极

Contoh yang menarik dari sport science kali ini dari sisi statistik dan pemetaan lapangan Dengan makin gencarnya scientific method masuk di sport maka pengolahan data seperti ini akan semakin menjadi hal biasa  httpslnkdinfQHqgjh
Sentiment(polarity=0.0, subjectivity=0.0)

中立

Complete handson machine learning tutorial with data science Tensorflow artificial intelligence and neural networks  Machine Learning Data Science and Deep Learning with Python   httpsmedia4yousocialcareerdevelopmenthtmlmachinelearning  python machine learning online data science udemy elearning pictwittercomqgGVzRUFAM
Sentiment(polarity=-0.16666666666666666, subjectivity=0.6)

消极

We share criminal data bases have science and medical collaoarations Freedom of movement means we can live and work in EU countries with no hassle at all much easier if youre from a poorer background We have Erasmus loads more good things
Sentiment(polarity=0.18939393939393936, subjectivity=0.39166666666666666)

积极

Value of Manufacturers Shipments for Durable Goods BigData DataScience housing rstats ggplot pictwittercomXy0UIQtNHy
Sentiment(polarity=0.0, subjectivity=0.0)

中立

Top DataScience and MachineLearning Methods Used in 2018 2019 AI MoRebaie TMounaged AINow6 JulezNorton  httpswwwkdnuggetscom201904topdatasciencemachinelearningmethods20182019html
Sentiment(polarity=0.5, subjectivity=0.5)

积极

Come check out the Santa Monica Data Science  Artificial Intelligence meetup to learn about In PersonComplete Handson Machine Learning Tutorial with Data Science  httpbitly2IRh0GU
Sentiment(polarity=-0.6, subjectivity=1.0)

消极

Great talks about the future of multimodality clinical translation and data science Very inspiring 1stPETMRIsymposium unitue PETMRI molecularimaging AI pictwittercomO542P9PKXF
Sentiment(polarity=0.4833333333333334, subjectivity=0.625)

积极

Did engineering now into data science last 5 years and doing MSC in data science this year
Sentiment(polarity=0.0, subjectivity=0.06666666666666667)

中立

Program OfficerData Science  httpbitly2PV3ROF
Sentiment(polarity=0.0, subjectivity=0.0)

中立。

以此类推。

嗯,这非常简单,但它不能扩展,因为最终我们是从 Spark 中收集数据,所以驱动程序的 RAM 是限制。让我们做得更好一点。


我们的前三个课程推荐

1. Google 网络安全证书 - 快速进入网络安全职业生涯。

2. Google 数据分析专业证书 - 提升你的数据分析技能

3. Google IT 支持专业证书 - 支持你的组织的 IT 工作


更多相关内容