- Improved handling of page encoding see PR #92
- Improved author and published date extraction see PR #93 Thanks @timoilya!
- Added additional schema extractors for schema.org parser see PR #89
- Allow for pulling more than the first og:type data for Opengraph see PR #90
- Added additional date parsing see PR #71 Thanks @dlrobertson!
- Added datetime representation of the publish date, `publish_datetime_utc`, see issue #72
- Fixed mismatch encoding error see issue #74
- Fixed og_type with NoneType error see issue #81 Thanks dust0x!
- Fix IndexError when the title contains only a title splitter or is the site name see issue #59 Thanks @dlrobertson!
- Retry the calculate_top_node function with the root node if the first pass failed to find an article, which may occur if one or more known article patterns are found but none contain content see PR #66 Thanks @dlrobertson!
- Add parsing of schema.org's ReportageNewsArticle tags see PR #67 Thanks @dlrobertson!
- Add additional parsing of opengraph tags see PR #64 Thanks @dlrobertson!
- Parse headers and include in `cleaned_text`
- Additional Configuration options:
    - Parse Headers: `parse_headers`
    - Parse Lists: `parse_lists`
    - Pretty Lists: `pretty_lists`
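The three options above might be set as follows. This is a minimal sketch that assumes the options are plain attributes on a `Configuration` object importable from `goose3.configuration`; the helper `build_config` is hypothetical, and the values shown are illustrative rather than defaults taken from this changelog.

```python
# Illustrative values only; the changelog does not state the defaults.
PARSE_OPTIONS = {
    "parse_headers": True,   # include header text in cleaned_text
    "parse_lists": True,     # capture list items from the article body
    "pretty_lists": False,   # list formatting toggle; exact output is not described here
}


def build_config(options):
    """Return a Configuration with the given options applied (hypothetical helper)."""
    # Imported lazily so the options mapping can be inspected without goose3 installed.
    from goose3.configuration import Configuration  # assumed import path

    config = Configuration()
    for name, value in options.items():
        setattr(config, name, value)
    return config
```

The resulting object would then be handed to `Goose` in place of the default configuration.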
- Catch mismatch between the encoding meta tag and the document encoding see pull request #53 Thanks @jeffquach!
- Capture lists from text see issue #48 Thanks @polosatyi!
- Catch more PIL exceptions see issue #42
- Update opengraph parsing to maintain all information see issue #45
- Changed configuration to not pull images by default see issue #31
- Update `get_encodings_from_content` to return a string and remove trailing spaces see PR #35
- Remove infinite recursion on parser selection see PR #39
- Document video and image classes
- Re-add remaining image tests
- Add `soup` as a parser option to use `lxml.html.soupparser` see issue #27
- Fix an issue with passing the requests session object to the crawler
- Pylint changes
    - Added pylintrc file
    - Updated variable and positional argument names to be more Pythonic
    - Fixed line continuation issues
    - Updated variable names when ambiguous
    - Cleaned up class and static methods
- Fix using different `requests` session for each URL fetched
    - Added `close` method to the Goose object
- Allow the Goose object to be a context manager

        from goose3 import Goose
        with Goose() as g:
            g.extract(url='some-url-here')

    NOTE: No need to change code as it will attempt to automatically close the connection on garbage collection
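The `close` method, the context-manager support, and the garbage-collection fallback described above follow a common resource-management pattern. The sketch below shows that pattern with a hypothetical stand-in class, `ManagedFetcher`; it is not Goose's actual implementation, and every name in it is an assumption made for illustration.

```python
class ManagedFetcher:
    """Hypothetical stand-in for an object that owns a network session."""

    def __init__(self):
        self.closed = False  # stands in for an open connection

    def close(self):
        """Release the underlying resource; safe to call more than once."""
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()
        return False  # do not swallow exceptions raised inside the block

    def __del__(self):
        # Best-effort cleanup when the object is garbage collected.
        self.close()


with ManagedFetcher() as fetcher:
    assert not fetcher.closed  # the resource is usable inside the block
# On leaving the block, close() has already been called.
```

Because `close` is idempotent in this sketch, calling it explicitly, leaving a `with` block, and garbage collection can all happen without conflicting with one another.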
- Configuration object changes
    - Better handling of the `known_context_patterns` configuration
    - Added `http_headers` configuration option to be passed to `requests`
    - Added `http_proxies` configuration option to be passed to `requests`
    - Added `http_auth` configuration option to be passed to `requests`
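As a rough illustration of the three `http_*` options above, the sketch below collects them in a plain dictionary. It assumes that `Goose` accepts a dictionary of configuration values and that each option takes the shape `requests` expects (a headers mapping, a proxies mapping, and a `(user, password)` tuple). The helper `fetch_text`, the header value, the proxy address, and the credentials are all placeholders.

```python
# Placeholder values; replace with real headers, proxy, and credentials.
HTTP_OPTIONS = {
    "http_headers": {"User-Agent": "example-agent/1.0"},
    "http_proxies": {"https": "http://proxy.example:8080"},
    "http_auth": ("user", "password"),
}


def fetch_text(url, options=HTTP_OPTIONS):
    """Extract an article's cleaned text using the HTTP options (hypothetical helper)."""
    # Imported lazily so the options mapping can be inspected without goose3 installed.
    from goose3 import Goose

    # Assumption: Goose accepts a dict whose keys name configuration options.
    with Goose(options) as g:
        return g.extract(url=url).cleaned_text
```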
- Fix base64 image parsing see issue #7
- Fix installation issue
- Removed unused/broken regex
- Include all necessary files
- Fix failed tests (most)
- Resolved relative URL issue see issue #21
- Resolved temporary files not being properly removed see issue #18
- Removed unused dependencies and code to support Python 2 see issue #16
- Fix error when using the configuration object to configure goose see issue #14
- First working version of Goose3!