Client for the ZEIT ONLINE Content API - Interface to gather newspaper articles from DIE ZEIT and ZEIT ONLINE, based on a multilevel query.
This package is a lightweight successor of the rzeit package. The main functions are completely rewritten using httr. Additionally, the package provides new functionality to directly download article texts, comments and even images using web scraping. The old grouping and visualisation functions have been removed and will probably be rewritten at a later point. The package is under continuous development and will be extended with additional features in the future.
# install package from CRAN
install.packages("rzeit2")
# load package
library(rzeit2)
# install devtools package if it's not already
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}
# install package from GitHub
devtools::install_github("jandix/rzeit2")
# load package
library(rzeit2)
You need an API key to access the content endpoint. You can apply for a key at http://developer.zeit.de/quickstart/. The function below appends the key automatically to your R environment file; hence, the key is loaded every time you start R. get_content and get_content_all access your key automatically by executing Sys.getenv("ZEIT_ONLINE_KEY"). Replace api_key with the key you receive from ZEIT and path with the location of your R environment file.
# save the API key in the .Renviron file
set_api_key(api_key = "xxx",
            path = "~/.Renviron")
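To confirm the key was stored, you can reload the environment file and read the variable back; this is a quick sanity check using base R only:

# reload the environment file in the current session
readRenviron("~/.Renviron")
# should return your key instead of an empty string
Sys.getenv("ZEIT_ONLINE_KEY")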
Important: These functions require an API key. See Authentication.
You can query the whole ZEIT and ZEIT ONLINE archive using the content endpoint. Define your query using the query argument. The API supports query syntax such as + and AND/OR. You can find more information in the ZEIT API documentation. As stated above, get_content and get_content_all access your key automatically by executing Sys.getenv("ZEIT_ONLINE_KEY"), but you can also provide your own key using the api_key argument.
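For illustration, a sketch of a query that combines two terms with the API's boolean syntax; the search terms here are made up, while the arguments match the calls shown below:

# combine two search terms using the API's AND syntax
krimi_articles <- get_content(query = "Tatort AND Krimi",
                              begin_date = "20180101",
                              end_date = "20180131")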
Why are there two functions?
rzeit2 provides two functions to query the content endpoint: get_content and get_content_all. get_content supports only 1000 rows per call because of the API specifications. You can use get_content_all to retrieve all articles matching your query. Internally, get_content_all executes get_content but fills in all the arguments automatically. Please set a timeout to be polite.
# fetch articles (up to 1000 rows)
tatort_articles <- get_content(query = "Tatort",
                               begin_date = "20180101",
                               end_date = "20180131")

# fetch ALL articles matching the query
tatort_articles <- get_content_all(query = "Tatort",
                                   timeout = 2)
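The article URLs used by the scraping functions below are stored in the content element of the result; counting them is a quick way to see how many articles your query matched:

# number of article URLs returned for the query
length(tatort_articles$content$href)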
get_article_text allows you to download the article text for a given URL. The function is vectorized; hence, you can define multiple URLs. The function automatically scrapes all pages if the article has multiple pages. Please set a timeout to be polite.
tatort_content <- get_article_text(url = tatort_articles$content$href,
                                   timeout = 1)
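As a follow-up sketch, assuming the result can be flattened to one character string per article (the exact return structure is not documented here), you could check which articles mention a given word:

# flatten to one string per article (an assumption about the return structure)
texts <- vapply(tatort_content, paste, character(1), collapse = " ")
# flag articles that mention the word "Kommissar"
grepl("Kommissar", texts, fixed = TRUE)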
get_article_comments allows you to download the article comments for a given URL. The function is not vectorized. It may take quite a long time due to the structure of the ZEIT ONLINE website. Since ZEIT does not provide an API for comments, this function downloads all comments as your browser would. Please set a timeout to be polite.
tatort_comments <- get_article_comments(url = tatort_articles$content$href[1],
                                        timeout = 1)
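Because the structure of the returned comment object is not described here, a sensible first step is simply to inspect it:

# inspect the structure of the returned object
str(tatort_comments, max.level = 2)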
get_article_images allows you to download the article images for a given URL. The function is vectorized; hence, you can define multiple URLs. The function automatically scrapes all pages if the article has multiple pages. You can directly download the images if you define a path in download. Please ensure that the folder you define exists. The file name is derived as the md5 hash of the image URL; hence, it should be unique. Please set a timeout to be polite.
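Since the target folder must already exist, you can create it beforehand with base R:

# create the download folder if it does not exist yet
dir.create("~/Documents/tatort-img/", showWarnings = FALSE, recursive = TRUE)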
tatort_images <- get_article_images(url = tatort_articles$content$href,
                                    timeout = 1,
                                    download = "~/Documents/tatort-img/")
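Afterwards, you can check which files were written; the file names are the md5 hashes of the image URLs, as described above:

# list the downloaded image files
list.files("~/Documents/tatort-img/")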
Jan Dix [email protected]
Special thanks to Simon Munzert, who helped me enter the world of R and GitHub. Additionally, I would like to thank Peter Meißner and Christian Graul, who helped with the first version of this package. Lastly, I would like to thank Jana Blahak, who wrote the documentation for the first package, much of which is reused here.