let's see what's new in this fork of the warc lib #22

Open
wants to merge 28 commits into base: master

28 commits
6ec3366
Test to make sure arc file header is written just once.
May 15, 2012
7fd90e3
Fixed the issue of writing duplicate file header in arc file.
May 15, 2012
249953d
ARCFile: filename is specified should be used and should work even if…
May 15, 2012
c5d1dfa
bump version to 0.2.1
May 15, 2012
9297ded
Added some bugfixes
recrm Nov 15, 2014
09d0e31
Ran 2to3, fixed CaseInsensitiveDict
recrm Nov 15, 2014
bd83608
Update for python3
recrm Nov 22, 2014
9bd3ff8
Quick update of readme
recrm Nov 22, 2014
dae3938
Updated testing, improved HTTPObject. Numerous bug fixes in warc
recrm Dec 3, 2014
c4895d5
update to HTTPObject
recrm Dec 12, 2014
ed170b7
Added warcscrape.py and supporting files.
recrm Dec 15, 2014
00be647
Fix TypeError gzip.open()
jpbruinsslot Jul 27, 2015
9679ead
Fix TypeError
jpbruinsslot Jul 27, 2015
dea56d6
Fix TypeError: Unicode-objects must be encoded before hashing
jpbruinsslot Jul 27, 2015
31a1217
Factor out HTTPObject
jpbruinsslot Jul 27, 2015
1fd24d8
Fix _compute_digest()
jpbruinsslot Jul 27, 2015
34c990f
Remove reference to HTTPObject
jpbruinsslot Jul 27, 2015
985be9c
* warc.py: fix for creating warc files based on a requests response.
Jul 28, 2015
dff7aca
* warc.py: disable encoding/decoding and simply store and work with r…
Aug 11, 2015
7a5fc7d
* Remove outdated build link, add documentation note, add credits.
almer-t Aug 13, 2015
70aaad5
* stylify
almer-t Aug 13, 2015
f89837e
Fix corner case when record_length is zero
sacabuche Apr 14, 2016
142bc7a
extract http headers and return a io.BytesIO as payload
sacabuche Apr 22, 2016
ef20c0c
parse http status code
sacabuche Apr 22, 2016
fc927ce
fix bug when there are only 2 values in HTTP protocol
sacabuche Apr 22, 2016
152ce58
promote status_code to utils.py
sacabuche Apr 26, 2016
cf920f3
start support for 0.18 version and parse http headers
sacabuche Apr 26, 2016
52469aa
a little of pep8 for arc.py
sacabuche Apr 26, 2016
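
Commit dea56d6 names the TypeError that Python 3 raises when a str is passed to a hashlib constructor. The sketch below shows the general remedy, which is to encode text to bytes before hashing. It is not the actual change made to _compute_digest(); the helper name sha1_hexdigest and the UTF-8 default are assumptions made for illustration.

```python
import hashlib


def sha1_hexdigest(payload, encoding="utf-8"):
    """Return the SHA-1 hex digest of payload (hypothetical helper).

    hashlib only accepts bytes-like input in Python 3, so str payloads
    are encoded first; bytes payloads are hashed unchanged.
    """
    if isinstance(payload, str):
        payload = payload.encode(encoding)
    return hashlib.sha1(payload).hexdigest()


if __name__ == "__main__":
    try:
        hashlib.sha1("abc")  # str instead of bytes: rejected by Python 3
    except TypeError as exc:
        print("TypeError:", exc)
    print(sha1_hexdigest("abc"))
    print(sha1_hexdigest(b"abc"))
```

Accepting both str and bytes keeps callers that already hold raw payload bytes from being forced through a decode and re-encode round trip.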
1 change: 1 addition & 0 deletions .gitignore
@@ -4,3 +4,4 @@ docs/_build/
 build/
 .coverage
 htmlcov/
+.ropeproject/
31 changes: 23 additions & 8 deletions Readme.rst
@@ -1,9 +1,7 @@
-warc: Python library to work with WARC files
-============================================
+warc3: Python3 library to work with WARC files
+==============================================
 
-.. image:: https://secure.travis-ci.org/anandology/warc.png?branch=master
-   :alt: build status
-   :target: http://travis-ci.org/anandology/warc
+Note: This is a fork of the original (now dead) warc repository.
 
 WARC (Web ARChive) is a file format for storing web crawls.
 
@@ -12,18 +10,35 @@ http://bibnum.bnf.fr/WARC/
 This `warc` library makes it very easy to work with WARC files.::
 
     import warc
-    f = warc.open("test.warc")
-    for record in f:
-        print record['WARC-Target-URI'], record['Content-Length']
+    with warc.open("test.warc") as f:
+        for record in f:
+            print record['WARC-Target-URI'], record['Content-Length']
 
 Documentation
 -------------
 
 The documentation of the warc library is available at http://warc.readthedocs.org/.
 
+Apart from the install from pip, which will not work for this warc3 version, the
+interface as described there is unchanged.
+
 License
 -------
 
 This software is licensed under GPL v2. See LICENSE_ file for details.
 
 .. LICENSE: http://github.com/internetarchive/warc/blob/master/LICENSE
+
+Authors
+-------
+
+Original Python2 Versions:
+
+* Anand Chitipothu
+* Noufal Ibrahim
+
+Python3 Port:
+
+* Ryan Chartier
+* Jan Pieter Bruins Slot
+* Almer S. Tigelaar
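
The README example in this hunk still uses the Python 2 print statement, which is a SyntaxError under Python 3. A minimal Python 3 sketch of the same loop follows. The records list is a stand-in (an assumption) so that the snippet runs without the library installed; with the library, the loop would iterate over the object returned by warc.open("test.warc"). The helper describe is hypothetical.

```python
def describe(record):
    """Format the two header fields the README example prints (hypothetical helper)."""
    return "{} {}".format(record["WARC-Target-URI"], record["Content-Length"])


# Stand-in data (an assumption): plain dicts keyed like WARC header fields.
records = [
    {"WARC-Target-URI": "http://example.com/", "Content-Length": "1234"},
]

for record in records:
    # print is a function in Python 3, so its arguments need parentheses.
    print(describe(record))
```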
12 changes: 6 additions & 6 deletions docs/conf.py
@@ -40,8 +40,8 @@
 master_doc = 'index'
 
 # General information about the project.
-project = u'warc'
-copyright = u'2012, Internet Archive'
+project = 'warc'
+copyright = '2012, Internet Archive'
 
 # The version info for the project you're documenting, acts as replacement for
 # |version| and |release|, also used in various other places throughout the
@@ -178,8 +178,8 @@
 # Grouping the document tree into LaTeX files. List of tuples
 # (source start file, target name, title, author, documentclass [howto/manual]).
 latex_documents = [
-  ('index', 'warc.tex', u'WARC Documentation',
-   u'Internet Archive', 'manual'),
+  ('index', 'warc.tex', 'WARC Documentation',
+   'Internet Archive', 'manual'),
 ]
 
 # The name of an image file (relative to this directory) to place at the top of
@@ -211,6 +211,6 @@
 # One entry per manual page. List of tuples
 # (source start file, name, description, authors, manual section).
 man_pages = [
-    ('index', 'warc', u'WARC Documentation',
-     [u'Internet Archive'], 1)
+    ('index', 'warc', 'WARC Documentation',
+     ['Internet Archive'], 1)
 ]
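
Dropping the u prefix does not change any of these values under Python 3, where every str literal is already Unicode text. The prefix was a syntax error in Python 3.0 to 3.2 and has been accepted again as a no-op since Python 3.3 (PEP 414), so the change is consistent with the 2to3 run listed in the commits. A quick check, as a sketch (the variable names are illustrative only):

```python
# Under Python 3.3+, the u prefix is accepted and has no effect.
project_with_prefix = u'warc'
project_plain = 'warc'

assert project_with_prefix == project_plain
assert type(project_with_prefix) is str
print(type(project_plain).__name__)
```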
2 changes: 1 addition & 1 deletion requirements.txt
@@ -1 +1 @@
-pytest
+nose
4 changes: 2 additions & 2 deletions setup.py
@@ -4,7 +4,7 @@
 
 setup(
     name="warc",
-    version="0.2.0",
+    version="0.2.2",
     description="Python library to work with ARC and WARC files",
     long_description=open('Readme.rst').read(),
     license='GPLv2',
@@ -19,7 +19,7 @@
         'Development Status :: 4 - Beta',
         'Environment :: Web Environment',
         'Intended Audience :: Developers',
-        'License :: OSI Approved :: BSD License',
+        'License :: OSI Approved :: GNU General Public License v2 (GPLv2)',
         'Operating System :: OS Independent',
         'Programming Language :: Python',
     ],
19 changes: 11 additions & 8 deletions warc/__init__.py
@@ -7,32 +7,35 @@
 :copyright: (c) 2012 Internet Archive
 """
 
-from .arc import ARCFile, ARCRecord, ARCHeader
-from .warc import WARCFile, WARCRecord, WARCHeader, WARCReader
+from .arc import ARCFile
+from .warc import WARCFile
+
 
 def detect_format(filename):
     """Tries to figure out the type of the file. Return 'warc' for
     WARC files and 'arc' for ARC files"""
 
-    if ".arc" in filename:
-        return "arc"
-    if ".warc" in filename:
+    if filename.endswith(".warc") or filename.endswith(".warc.gz"):
         return "warc"
 
+    if filename.endswith('.arc') or filename.endswith('.arc.gz'):
+        return 'arc'
+
     return "unknown"
 
-def open(filename, mode="rb", format = None):
+
+def open(filename, mode="rb", format=None):
     """Shorthand for WARCFile(filename, mode).
 
     Auto detects file and opens it.
 
     """
-    if format == "auto" or format == None:
+    if format == "auto" or format is None:
         format = detect_format(filename)
 
     if format == "warc":
         return WARCFile(filename, mode)
     elif format == "arc":
         return ARCFile(filename, mode)
     else:
-        raise IOError("Don't know how to open '%s' files"%format)
+        raise IOError("Don't know how to open '%s' files" % format)
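
The move from substring matching to suffix matching changes which file names are recognised. The sketch below reimplements both variants as standalone functions so they can be compared without the library. The names detect_format_substring and detect_format_suffix are hypothetical, and the tuple form of str.endswith is a compaction of the chained "or" in the diff, not the PR's exact code.

```python
def detect_format_substring(filename):
    """Pre-PR behaviour: match the extension anywhere in the name."""
    if ".arc" in filename:
        return "arc"
    if ".warc" in filename:
        return "warc"
    return "unknown"


def detect_format_suffix(filename):
    """Post-PR behaviour: match only a trailing extension."""
    # str.endswith accepts a tuple of candidate suffixes.
    if filename.endswith((".warc", ".warc.gz")):
        return "warc"
    if filename.endswith((".arc", ".arc.gz")):
        return "arc"
    return "unknown"


for name in ("crawl.warc.gz", "crawl.arc", "notes.archive.txt", "crawl.warc.bak"):
    print(name, detect_format_substring(name), detect_format_suffix(name))
```

The suffix variant no longer mistakes a name such as notes.archive.txt for an ARC file, but it is also stricter: a name such as crawl.warc.bak, which the substring check accepted, now yields "unknown", so open() raises IOError for it unless format is passed explicitly.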