UnicodeWarning and UnicodeEncodeError issues #136

Open
nemobis opened this issue Jun 30, 2014 · 15 comments

Comments

@nemobis
Member

nemobis commented Jun 30, 2014

Simple incompatibility between old image list and current master, or something more?

Resuming download, using directory eswikiarquitecturacom-20140628-wikidump
[...]
You didn't provide a path for index.php, we try this one: http://es.wikiarquitectura.com/index.php
Checking api.php... http://es.wikiarquitectura.com/api.php
api.php is OK
Checking index.php... http://es.wikiarquitectura.com/index.php
index.php is OK
Analysing http://es.wikiarquitectura.com/api.php
Loading config file...
Resuming previous dump process...
Title list was completed in the previous session
XML dump was completed in the previous session
Image list was completed in the previous session
./dumpgenerator.py:1232: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
if filename2 not in listdir:

@emijrp
Member

emijrp commented Jun 30, 2014

It now reads the image list file as unicode and compares it with the output of os.listdir(), which is not returning unicode. I don't think it is serious, but I can check it tomorrow.
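
The mismatch described above can be sketched as follows. In Python 2, os.listdir() returns byte strings when it is given a byte-string path, so comparing those names with unicode names from the image list raises the UnicodeWarning and treats non-ASCII names as unequal. This is only a minimal sketch, assuming the names on disk are UTF-8 encoded; `to_text` and `missing_files` are hypothetical helpers, not functions from dumpgenerator.py.

```python
def to_text(name, encoding='utf-8'):
    """Return name as a text (unicode) string, decoding bytes if needed."""
    if isinstance(name, bytes):
        return name.decode(encoding)
    return name


def missing_files(expected, existing, encoding='utf-8'):
    """Return the expected names that are not present in existing.

    Both sides are normalized to text first, so a bytes name from the
    directory listing and a unicode name from the image list compare equal.
    """
    existing_text = set(to_text(name, encoding) for name in existing)
    return [name for name in expected
            if to_text(name, encoding) not in existing_text]
```

Passing a unicode path to os.listdir() in Python 2 also makes it return unicode names, which may be the simpler change if every caller can be adjusted.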

@nemobis
Member Author

nemobis commented Jun 30, 2014

Ok. The dump is proceeding, I'll check at the end if some image is missing. (Update: I forgot to count them, there is a big dump at https://archive.org/details/wiki-eswikiarquitecturacom though.)

@nemobis
Member Author

nemobis commented Jul 5, 2014

Some more errors despite #124, on wikihow.com with the latest master:

Downloaded 30 pages
"Hit" Someone on Pandanda, 0 edits
"Hog Flip" in Halo, 0 edits
File "dumpgenerator.py", line 1503, in <module>
main()
File "dumpgenerator.py", line 1495, in main
createNewDump(config=config, other=other)
File "dumpgenerator.py", line 1241, in createNewDump
generateXMLDump(config=config, titles=titles, session=other['session'])
File "dumpgenerator.py", line 579, in generateXMLDump
xml = getXMLPage(config=config, title=title, session=session)
File "dumpgenerator.py", line 512, in getXMLPage
print ' %s, %d edits' % (title, numberofedits)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 119: ordinal not in range(128)
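
For illustration, the failing print in the traceback above can be made tolerant of consoles whose encoding cannot represent every title. This is a sketch under assumptions, not necessarily the fix that was applied upstream; `format_progress` is a hypothetical helper and the default `encoding='ascii'` stands in for whatever the console actually reports.

```python
def format_progress(title, numberofedits, encoding='ascii'):
    """Build the progress line so that it is safe for the given encoding."""
    line = u'    %s, %d edits' % (title, numberofedits)
    # Characters the target encoding cannot represent are replaced with '?'
    # instead of raising UnicodeEncodeError.
    return line.encode(encoding, 'replace').decode(encoding)


print(format_progress(u'Title with a dash \u2013 in it', 0))
```

With `encoding='utf-8'` the line is returned unchanged, so the replacement only applies when the console really is limited.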

@nemobis changed the title from "UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal" to "UnicodeWarning and UnicodeEncodeError issues" on Jul 5, 2014
@nemobis added the bug label and removed the question label on Aug 24, 2014
@nemobis added this to the 0.3 milestone on Aug 24, 2014
@PiRSquared17
Member

Can you reproduce this error still? The one you mentioned in the last comment has already been fixed. Not sure about the original one.

@nemobis
Member Author

nemobis commented Sep 19, 2014

Can't reproduce now either. Though the original comment might have been about an image list produced with one version of dumpgenerator and then used with another, incompatible one.

federico@lakka:~/siilo/wikiteam/wikiteam$ python dumpgenerator.py --api=http://es.wikiarquitectura.com/api.php --xml --namespaces=8 --images  
Checking API... http://es.wikiarquitectura.com/api.php
API is OK
Checking index.php... http://es.wikiarquitectura.com/index.php
index.php is OK
#########################################################################
# Welcome to DumpGenerator 0.3.0-alpha by WikiTeam (GPL v3)                   #
# More info at: https://github.com/WikiTeam/wikiteam                    #
#########################################################################

#########################################################################
# Copyright (C) 2011-2014 WikiTeam                                      #
# This program is free software: you can redistribute it and/or modify  #
# it under the terms of the GNU General Public License as published by  #
# the Free Software Foundation, either version 3 of the License, or     #
# (at your option) any later version.                                   #
#                                                                       #
# This program is distributed in the hope that it will be useful,       #
# but WITHOUT ANY WARRANTY; without even the implied warranty of        #
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the         #
# GNU General Public License for more details.                          #
#                                                                       #
# You should have received a copy of the GNU General Public License     #
# along with this program.  If not, see <http://www.gnu.org/licenses/>. #
#########################################################################

Analysing http://es.wikiarquitectura.com/api.php
Trying generating a new dump into a new directory...
Loading page titles from namespaces = 8
Excluding titles from namespaces = None
1 namespaces found
    Retrieving titles in the namespace 8
.    5 titles retrieved in the namespace 8
5 page titles loaded
Titles saved at... eswikiarquitecturacom-20140919-titles.txt
Retrieving the XML for every page from "start"
    MediaWiki:Common.css, 8 edits
    MediaWiki:Mainpage, 1 edit
    MediaWiki:Newarticletext, 1 edit
    MediaWiki:Sidebar, 1 edit
    MediaWiki:Sitenotice, 1 edit
XML dump saved at... eswikiarquitecturacom-20140919-history.xml
Retrieving image filenames
....................................................................    Found 33592 images
33592 image names loaded
Image filenames and URLs saved at... eswikiarquitecturacom-20140919-images.txt
Retrieving images from "start"
Creating "./eswikiarquitecturacom-20140919-wikidump/images" directory
    Downloaded 10 images
^CTraceback (most recent call last):
  File "dumpgenerator.py", line 1602, in <module>
    main()
  File "dumpgenerator.py", line 1594, in main
    createNewDump(config=config, other=other)
  File "dumpgenerator.py", line 1288, in createNewDump
    generateImageDump(config=config, other=other, images=images, session=other['session'])
  File "dumpgenerator.py", line 869, in generateImageDump
    filename), session=session)  # use Image: for backwards compatibility
  File "dumpgenerator.py", line 377, in getXMLFileDesc
    return getXMLPage(config=config, title=title, verbose=False, session=session)
  File "dumpgenerator.py", line 472, in getXMLPage
    xml = getXMLPageCore(params=params, config=config, session=session)
  File "dumpgenerator.py", line 440, in getXMLPageCore
    r = session.post(url=config['index'], data=params, headers=headers)
  File "/home/users/federico/.local/lib/python2.7/site-packages/requests/sessions.py", line 498, in post
    return self.request('POST', url, data=data, **kwargs)
  File "/home/users/federico/.local/lib/python2.7/site-packages/requests/sessions.py", line 456, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/users/federico/.local/lib/python2.7/site-packages/requests/sessions.py", line 559, in send
    r = adapter.send(request, **kwargs)
  File "/home/users/federico/.local/lib/python2.7/site-packages/requests/adapters.py", line 327, in send
    timeout=timeout
  File "/home/users/federico/.local/lib/python2.7/site-packages/requests/packages/urllib3/connectionpool.py", line 493, in urlopen
    body=body, headers=headers)
  File "/home/users/federico/.local/lib/python2.7/site-packages/requests/packages/urllib3/connectionpool.py", line 319, in _make_request
    httplib_response = conn.getresponse(buffering=True)
  File "/usr/lib/python2.7/httplib.py", line 1034, in getresponse
    response.begin()
  File "/usr/lib/python2.7/httplib.py", line 407, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python2.7/httplib.py", line 365, in _read_status
    line = self.fp.readline()
  File "/usr/lib/python2.7/socket.py", line 447, in readline
    data = self._sock.recv(self._rbufsize)
KeyboardInterrupt
federico@lakka:~/siilo/wikiteam/wikiteam$ python dumpgenerator.py --api=http://es.wikiarquitectura.com/api.php --xml --namespaces=8 --images
Checking API... http://es.wikiarquitectura.com/api.php
API is OK
Checking index.php... http://es.wikiarquitectura.com/index.php
index.php is OK
#########################################################################
# Welcome to DumpGenerator 0.3.0-alpha by WikiTeam (GPL v3)                   #
# More info at: https://github.com/WikiTeam/wikiteam                    #
#########################################################################

#########################################################################
# Copyright (C) 2011-2014 WikiTeam                                      #
# This program is free software: you can redistribute it and/or modify  #
# it under the terms of the GNU General Public License as published by  #
# the Free Software Foundation, either version 3 of the License, or     #
# (at your option) any later version.                                   #
#                                                                       #
# This program is distributed in the hope that it will be useful,       #
# but WITHOUT ANY WARRANTY; without even the implied warranty of        #
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the         #
# GNU General Public License for more details.                          #
#                                                                       #
# You should have received a copy of the GNU General Public License     #
# along with this program.  If not, see <http://www.gnu.org/licenses/>. #
#########################################################################

Analysing http://es.wikiarquitectura.com/api.php

Warning!: "./eswikiarquitecturacom-20140919-wikidump" path exists
There is a dump in "./eswikiarquitecturacom-20140919-wikidump", probably incomplete.
If you choose resume, to avoid conflicts, the parameters you have chosen in the current session will be ignored
and the parameters available in "./eswikiarquitecturacom-20140919-wikidump/config.txt" will be loaded.
Do you want to resume ([yes, y], [no, n])? y
You have selected: YES
Loading config file...
Resuming previous dump process...
Title list was completed in the previous session
XML dump was completed in the previous session
Image list was completed in the previous session
17 images were found in the directory from a previous session
Retrieving images from "00 centro kimmel.jpg"
    Downloaded 10 images

@nemobis
Member Author

nemobis commented Dec 1, 2014

Analysing http://africanspecies.net/api.php
Loading config file...
Resuming previous dump process...
Title list was completed in the previous session
Resuming XML dump from "불활성화 백신"
Retrieving the XML for every page from "불활성화 백신"
./dumpgenerator.py:624: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
if title == start: # start downloading from start, included
XML dump saved at... africanspeciesnet-20141127-history.xml
Image list is incomplete. Reloading...
Retrieving image filenames
. Found 337 images

@nemobis
Member Author

nemobis commented Dec 4, 2014

I'm also wondering whether resume works... it would be terrible if the bug makes us "close" incomplete dumps.

Analysing http://wiki.megatec.ru/api.php
Loading config file...
Resuming previous dump process...
Title list was completed in the previous session
Resuming XML dump from "Мастер-Web:Установка версии 7.2"
Retrieving the XML for every page from "Мастер-Web:Установка версии 7.2"
./dumpgenerator.py:624: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
if title == start: # start downloading from start, included
XML dump saved at... wikimegatecru-20141203-history.xml
Image list is incomplete. Reloading...
Retrieving image filenames
........ Found 3722 images
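
The log above goes straight from the warning to "XML dump saved", which would be consistent with the `title == start` comparison never matching, although the thread does not confirm this. A minimal sketch of a resume loop that compares text with text and refuses to finish silently when the resume point is never found; `titles_from` is a hypothetical helper, and both the UTF-8 assumption and the decision to raise are assumptions of this sketch.

```python
def titles_from(titles, start, encoding='utf-8'):
    """Yield titles beginning at start (inclusive), comparing text with text."""
    def to_text(value):
        return value.decode(encoding) if isinstance(value, bytes) else value

    start = to_text(start)
    started = False
    for title in titles:
        title = to_text(title)
        if title == start:
            started = True
        if started:
            yield title
    if not started:
        # Assumption: a resume point that is never found should be an error,
        # not an empty (and seemingly complete) dump.
        raise ValueError('resume title not found in title list')
```

Raising here is a design choice: it turns a silently incomplete dump into a visible failure that can be investigated.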

@nemobis reopened this on Dec 4, 2014
@DrDevice

Sorry if this is bad etiquette (I'm new), but I was wondering if there is any update on this? I get a UnicodeEncodeError whenever I run python dumpgenerator.py --api=http://ark.gamepedia.com/api.php --xml --curonly --images --delay 5 --resume --path=arkgamepediacom-20150717-wikidump/. These are the results:

Analysing http://ark.gamepedia.com/api.php
Loading config file...
Resuming previous dump process...
Title list was completed in the previous session
XML dump was completed in the previous session
Image list was completed in the previous session
195 images were found in the directory from a previous session
Retrieving images from "Campfire.png"
Sleeping... 5 seconds...
Sleeping... 5 seconds...
Sleeping... 5 seconds...
Traceback (most recent call last):
  File "dumpgenerator.py", line 2031, in <module>
    main()
  File "dumpgenerator.py", line 2021, in main
    resumePreviousDump(config=config, other=other)
  File "dumpgenerator.py", line 1745, in resumePreviousDump
    session=other['session'])
  File "dumpgenerator.py", line 1071, in generateImageDump
    imagefile = open(filename3, 'wb')
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 53: ordinal not in range(128)

I'm using the most recent dumpgenerator.py as of this writing.
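
One plausible cause of the error above: in Python 2, open() encodes a unicode filename with the filesystem encoding, which is ASCII under a C/POSIX locale. A sketch of encoding the path explicitly before opening it, assuming a POSIX filesystem where UTF-8 byte paths are acceptable; `fs_path` is a hypothetical helper, not part of dumpgenerator.py.

```python
import os


def fs_path(directory, filename, encoding='utf-8'):
    """Join directory and filename into a byte-string path.

    Encoding explicitly avoids relying on the locale-dependent
    filesystem encoding when the path is handed to open().
    """
    def to_bytes(value):
        return value if isinstance(value, bytes) else value.encode(encoding)

    return os.path.join(to_bytes(directory), to_bytes(filename))
```

The result would then be passed to open(..., 'wb') in place of the unicode filename.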

@emijrp
Member

emijrp commented Jul 18, 2015

Hello DrDevice. This bug still needs a fix. A workaround: you can remove the offending image filename from the -images.txt file in the dump directory and then resume. According to that wiki, it is "Capture d'écran 2015-06-13 11.20.59.png". If you find more errors, remove those filenames too, but I don't see any other weird chars in the list.

http://ark.gamepedia.com/index.php?title=Special%3APrefixIndex&prefix=&namespace=6
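
To apply the workaround above without reading the whole list by eye, the entries containing non-ASCII characters can be located first. A small sketch, assuming the list is UTF-8 text with one entry per line; `non_ascii_lines` is a hypothetical helper, and the exact column layout of -images.txt is not assumed here.

```python
def non_ascii_lines(lines, encoding='utf-8'):
    """Return (line_number, text) pairs for lines with non-ASCII characters."""
    flagged = []
    for number, line in enumerate(lines, 1):
        text = line.decode(encoding) if isinstance(line, bytes) else line
        if any(ord(char) > 127 for char in text):
            flagged.append((number, text.rstrip(u'\n')))
    return flagged
```

The function accepts any iterable of lines, so it can be fed a file opened in binary mode, for example `non_ascii_lines(open(path, 'rb'))`.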

@DrDevice

emijrp, thank you very much! That seems to have cleared it up! It's been trucking on for a couple hours now, no errors. Crossing my fingers! :)

@burner1024

This is still an issue. I tried the patches from #279; they didn't help.

@ouaibe

ouaibe commented Aug 15, 2018

I recently ran into the same issue with a similar message but for another part of the script.

The decode statement at https://github.com/WikiTeam/wikiteam/blob/master/dumpgenerator.py#L1999
was raising an exception, which made the script conclude that the image folder didn't exist, so resuming the dump re-downloaded all the images for no good reason.
That code should probably be modified to distinguish a non-existent directory from any other exception.
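
A sketch of that distinction, assuming the directory is read with os.listdir(); `list_downloaded` is a hypothetical helper and does not reproduce the code at the linked line.

```python
import errno
import os


def list_downloaded(path):
    """Return the names in path, or an empty list if path does not exist.

    Any other failure (permissions, encoding problems, ...) is re-raised
    instead of being mistaken for "no images downloaded yet".
    """
    try:
        return os.listdir(path)
    except OSError as error:
        if error.errno == errno.ENOENT:
            return []
        raise
```

Only the missing-directory case is treated as "nothing downloaded yet", so an unrelated exception can no longer trigger a full re-download.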

Anyways, the exception thrown was:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xxxx' in position YY: ordinal not in range(128)

It turned out that Python 2.7 was using 'ascii' as its default encoding, as shown by python -c 'import sys; print(sys.getdefaultencoding())'

I fixed it by adding the following lines to /usr/lib/python2.7/sitecustomize.py, which force UTF-8 as the default encoding in the Python 2.7 environment.

import sys
sys.setdefaultencoding('UTF8')
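
Note that sys.setdefaultencoding is removed from the sys module once site.py has run, which is why the call has to live in sitecustomize.py, and that it changes behaviour for every script using that interpreter. A narrower alternative, sketched here for console output only (it would not help with filenames passed to open()), is to wrap the output stream in a UTF-8 writer; `utf8_writer` is a hypothetical helper.

```python
import codecs
import io


def utf8_writer(byte_stream):
    """Wrap a byte stream so that unicode text written to it is UTF-8 encoded."""
    return codecs.getwriter('utf-8')(byte_stream, errors='replace')


# Demonstration on an in-memory stream. Assumption: in a Python 2 script the
# same wrapper would be applied to sys.stdout.
buffer = io.BytesIO()
out = utf8_writer(buffer)
out.write(u'Title with a dash \u2013 in it, 0 edits\n')
```

Setting the PYTHONIOENCODING environment variable to utf-8 before launching the script is another per-process option that leaves the interpreter installation untouched.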

@Slider-Whistle

@ouaibe Thanks for the tip, I thought it must've been a bug in wikiteam. They should be able to set this somewhere themselves, right?

@wlhlm

wlhlm commented Aug 25, 2019

I'd like to pile on and say that I've also stumbled upon this issue or a similar one:

$ python ../wikidump/wikiteam/dumpgenerator.py "https://minecraft-de.gamepedia.com/" --xml --images
[...]
    Downloaded 5600 images
    Downloaded 5610 images
    Downloaded 5620 images
Traceback (most recent call last):
  File "../wikidump/wikiteam/dumpgenerator.py", line 2323, in <module>
    main()
  File "../wikidump/wikiteam/dumpgenerator.py", line 2313, in main
    resumePreviousDump(config=config, other=other)
  File "../wikidump/wikiteam/dumpgenerator.py", line 2030, in resumePreviousDump
    session=other['session'])
  File "../wikidump/wikiteam/dumpgenerator.py", line 1318, in generateImageDump
    text=u'The page "%s" was missing in the wiki (probably deleted)' % (title.decode('utf-8'))
  File "/home/wlhlm/vault/share/mc/wikidump/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe5' in position 13: ordinal not in range(128)

Trying to resume, I'm hitting #250, meaning that dumpgenerator.py fails to detect previously downloaded images and starts from the beginning:

$ python ../wikidump/wikiteam/dumpgenerator.py "https://minecraft-de.gamepedia.com/" --xml
 --images --resume --path minecraft_degamepediacom-20190825-wikidump/
Checking API... https://minecraft-de.gamepedia.com/api.php
API is OK: https://minecraft-de.gamepedia.com/api.php
Checking index.php... https://minecraft-de.gamepedia.com/index.php
index.php is OK
#########################################################################
# Welcome to DumpGenerator 0.4.0-alpha by WikiTeam (GPL v3)                   #
# More info at: https://github.com/WikiTeam/wikiteam                    #
#########################################################################

#########################################################################
# Copyright (C) 2011-2019 WikiTeam developers                           #

# This program is free software: you can redistribute it and/or modify  #
# it under the terms of the GNU General Public License as published by  #
# the Free Software Foundation, either version 3 of the License, or     #
# (at your option) any later version.                                   #
#                                                                       #
# This program is distributed in the hope that it will be useful,       #
# but WITHOUT ANY WARRANTY; without even the implied warranty of        #
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the         #
# GNU General Public License for more details.                          #
#                                                                       #
# You should have received a copy of the GNU General Public License     #
# along with this program.  If not, see <http://www.gnu.org/licenses/>. #
#########################################################################

Analysing https://minecraft-de.gamepedia.com/api.php
Loading config file...
Resuming previous dump process...
Title list was completed in the previous session
XML dump was completed in the previous session
Image list was completed in the previous session
0 images were found in the directory from a previous session
Retrieving images from "start"
    Downloaded 10 images
^C

But, of course, resuming doesn't do a whole lot, since it will hit the same UnicodeEncodeError again.

The workaround described by @ouaibe worked. Editing sitecustomize.py and adding sys.setdefaultencoding('UTF8') was unproblematic because I was working in a virtualenv, but I'm not sure how well it would work with the global /usr/lib/python2.7/sitecustomize.py, since changing that file can affect other Python scripts.

Python 2.7.16
dumpgenerator.py 080b723
