This repository has been archived by the owner on Jun 18, 2021. It is now read-only.

Importing from Wikipedia

Jesse Himmelstein edited this page Jun 13, 2013 · 2 revisions

Importing Wikipedia pages is currently done in several steps. First, the import/getwiki.js script starts at a given category and descends through a set number of subcategory levels, collecting the titles of all articles in those categories. It creates two text files, article.txt and category.txt, which list the imported articles and categories, respectively. The starting category and depth are currently hard-coded at the end of the script, but it could easily be modified to accept them as command-line parameters.
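The descent described above can be sketched as a breadth-first walk with a depth limit. This is only an illustration and not the code in import/getwiki.js: the function name collectTitles, the getMembers callback, and the in-memory fakeWiki tree are all assumptions made so the example runs without network access. It also shows one way the category and depth might be read from the command line.

```javascript
// Sketch only: getMembers is a hypothetical stand-in for a Wikipedia API call
// that resolves to { articles: [...], subcategories: [...] } for a category.
async function collectTitles(rootCategory, maxDepth, getMembers) {
  const articles = new Set();
  const categories = new Set([rootCategory]);
  let frontier = [rootCategory];

  // Visit the root (depth 0), then up to maxDepth levels of subcategories.
  for (let depth = 0; depth <= maxDepth && frontier.length > 0; depth++) {
    const next = [];
    for (const category of frontier) {
      const { articles: found, subcategories } = await getMembers(category);
      found.forEach((title) => articles.add(title));
      // Queue subcategories only if another level is still allowed.
      if (depth < maxDepth) {
        for (const sub of subcategories) {
          if (!categories.has(sub)) {
            categories.add(sub);
            next.push(sub);
          }
        }
      }
    }
    frontier = next;
  }
  return { articles: [...articles], categories: [...categories] };
}

// Hypothetical in-memory category tree used instead of real requests.
const fakeWiki = {
  Science: { articles: ["Scientific method"], subcategories: ["Physics", "Biology"] },
  Physics: { articles: ["Gravity"], subcategories: ["Optics"] },
  Biology: { articles: ["Cell (biology)"], subcategories: [] },
  Optics: { articles: ["Lens"], subcategories: [] },
};
const fakeGetMembers = async (name) =>
  fakeWiki[name] || { articles: [], subcategories: [] };

// Read category and depth from the command line, with fallback defaults.
const [category = "Science", depthArg = "1"] = process.argv.slice(2);
collectTitles(category, Number(depthArg), fakeGetMembers).then((result) => {
  console.log(result.articles.join("\n"));
});
```

Tracking visited categories in a set keeps the walk from looping if a category appears under more than one parent.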

Next, the makeNodes.js script takes a list of article names (such as those in article.txt) and asks KnowNodes to create a new node for each one. It accepts the file to read on the command line, as well as a line number to start on (in case the script was interrupted previously) and a maximum number of lines to read.
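As a rough illustration of the start-line and line-limit options, the helper below picks a window of lines out of a file's contents. The name selectLines and its parameters are hypothetical and are not taken from makeNodes.js. The sketch assumes the start line is 1-based and that blank lines inside the window are skipped.

```javascript
// Sketch only: selectLines is a hypothetical helper, not part of makeNodes.js.
function selectLines(text, startLine = 1, maxLines = Infinity) {
  const first = startLine - 1; // startLine is 1-based
  return text
    .split(/\r?\n/)
    .slice(first, first + maxLines) // window of raw file lines
    .map((line) => line.trim())
    .filter((line) => line !== ""); // drop blank lines inside the window
}

const sample = "Gravity\nLens\nCell (biology)\nOptics\n";
// Resume at line 2 and read at most 2 lines.
console.log(selectLines(sample, 2, 2));
```

In a real script the text would typically come from something like fs.readFileSync(process.argv[2], "utf8"), with the start line and limit parsed from the remaining arguments.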

Finally, the actual import of the Wikipedia article is done within the KnowNodes API, in controllers/knownodes/index.coffee. It uses the "nodemw" module to request the article text as well as its links to other Wikipedia articles. The API endpoint is a POST request to /knownodes/wikinode, with a form argument of title: <article title>.
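A request to this endpoint might be assembled as in the sketch below. The base URL is a hypothetical local address, and the sketch assumes the form argument is sent as application/x-www-form-urlencoded, which this page does not confirm. The helper name buildWikinodeRequest is made up for the example. Only the path /knownodes/wikinode and the title argument come from the description above.

```javascript
// Sketch only: baseUrl is a hypothetical server address, and URL-encoded
// form data is an assumption about what the endpoint accepts.
function buildWikinodeRequest(title, baseUrl = "http://localhost:3000") {
  const body = new URLSearchParams({ title }).toString();
  return {
    url: new URL("/knownodes/wikinode", baseUrl).toString(),
    options: {
      method: "POST",
      headers: { "Content-Type": "application/x-www-form-urlencoded" },
      body,
    },
  };
}

const request = buildWikinodeRequest("Cell (biology)");
console.log(request.url);
console.log(request.options.body);
// To send it on Node 18 or later, which has a global fetch:
//   await fetch(request.url, request.options);
```

Building the body with URLSearchParams takes care of escaping spaces, parentheses, and other special characters that are common in article titles.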

Bugs and Missing Features

  • #20 - article.txt and category.txt are CSV files, but Wikipedia names can have commas in them! They should be changed to be tab-separated.
  • There is no way to update an already imported Wikipedia page or to handle its removal.
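The comma problem from #20 can be seen in a small example. The two-field layout used here (a title followed by a depth) is purely hypothetical, since this page does not describe the actual columns of article.txt. The point is only that splitting on commas breaks a title that contains one, while splitting on tabs does not.

```javascript
// Sketch only: the (title, depth) layout is a hypothetical example,
// not the actual layout of article.txt.
const rows = [
  ["Washington, D.C.", "2"],
  ["Gravity", "1"],
];

const toLine = (fields, sep) => fields.join(sep);
const fromLine = (line, sep) => line.split(sep);

const csvLine = toLine(rows[0], ",");
const tsvLine = toLine(rows[0], "\t");

console.log(fromLine(csvLine, ",").length); // 3 fields: the title was split in two
console.log(fromLine(tsvLine, "\t").length); // 2 fields: the title survives intact
```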