Goose3

Version 3.1.6

Improved handling of page encoding see PR #92
Improved author and published date extraction see PR #93 Thanks @timoilya!
Added additional schema extractors for schema.org parser see PR #89
Allow for pulling more than the first og:type data for Opengraph see PR #90

Version 3.1.5

Added additional date parsing see PR #71 Thanks @dlrobertson!
Added datetime representation of the publish date publish_datetime_utc see issue #72
Fixed mismatch encoding error see issue #74
Fixed og_type with NoneType error see issue #81 Thanks dust0x!
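
The publish_datetime_utc entry above describes a datetime form of the publish date normalised to UTC. The snippet below is a minimal standard-library sketch of that idea, not Goose3's actual implementation. The helper name `to_utc` and the choice to treat timestamps without an offset as UTC are assumptions made for illustration.

```python
from datetime import datetime, timezone


def to_utc(publish_date: str) -> datetime:
    """Parse an ISO 8601 timestamp and normalise it to UTC (hypothetical helper)."""
    parsed = datetime.fromisoformat(publish_date)
    if parsed.tzinfo is None:
        # Assumption: a timestamp without an offset is already in UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


print(to_utc("2019-03-04T10:30:00+02:00").isoformat())  # 2019-03-04T08:30:00+00:00
```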

Version 3.1.4

Fix IndexError when title has only a title splitter or is the site name see issue #59 Thanks @dlrobertson!
Retry the calculate_top_node function with the root node if the first pass fails to find an article, which may occur when one or more known article patterns are found but none contain content see PR #66 Thanks @dlrobertson!
Add parsing of schema.org's ReportageNewsArticle tags see PR #67 Thanks @dlrobertson!
Add additional parsing of opengraph tags see PR #64 Thanks @dlrobertson!
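
The IndexError fix above concerns titles that consist only of a splitter or only of the site name. The sketch below shows one way to handle those edge cases without indexing into an empty list. It is an illustration only: `clean_title`, its parameters, and the splitter set are hypothetical and are not taken from the Goose3 source.

```python
def clean_title(title, site_name="", splitters=("|", " - ", "»", ":")):
    """Return the longest non-site-name piece of a title, or "" if nothing is left."""
    pieces = [title]
    for splitter in splitters:
        if splitter in title:
            # Split on the first splitter that occurs in the title.
            pieces = title.split(splitter)
            break
    site = site_name.strip().casefold()
    # Drop empty pieces and pieces that are just the site name.
    candidates = [p.strip() for p in pieces]
    candidates = [p for p in candidates if p and p.casefold() != site]
    # Guard against an empty candidate list instead of indexing into it.
    return max(candidates, key=len) if candidates else ""


print(clean_title("Breaking story | Example News", "Example News"))  # Breaking story
```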

Version 3.1.3

Parse headers and include in cleaned_text
Additional Configuration options:
- Parse Headers: parse_headers
- Parse Lists: parse_lists
- Pretty Lists: pretty_lists
Catch mismatch encoding meta tag and document encoding see pull request #53 Thanks @jeffquach!
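
The three configuration options listed above can be set through the dictionary form of the Goose configuration. The following is a usage sketch that assumes goose3 is installed. The values shown are illustrative rather than the library defaults, and `list_config` and `extract_text` are names chosen for this example.

```python
# Illustrative values for the options added in this release.
list_config = {
    "parse_headers": True,  # include headers in cleaned_text
    "parse_lists": True,    # capture lists from the article text
    "pretty_lists": True,   # format captured lists
}


def extract_text(url):
    """Extract the cleaned text of a page using the options above."""
    from goose3 import Goose  # imported lazily so the dict can be built without goose3

    with Goose(list_config) as g:
        return g.extract(url=url).cleaned_text
```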

Version 3.1.2

Capture lists from text see issue #48 Thanks @polosatyi!

Version 3.1.1

Catch more PIL exceptions see issue #42
Update opengraph parsing to maintain all information see issue #45

Version 3.1.0

Changed configuration to not pull images by default see issue #31
Update get_encodings_from_content to return a string and remove trailing spaces see PR #35
Remove infinite recursion on parser selection see PR #39
Document video and image classes
Re-add remaining image tests
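
The get_encodings_from_content entry above mentions returning a single string with trailing spaces removed. The snippet below sketches that behaviour with the standard library. It is not the Goose3 function itself: `declared_encoding` and its regular expression are hypothetical stand-ins.

```python
import re

# Matches charset declarations in both <meta charset=...> and
# <meta http-equiv=... content="...; charset=..."> forms.
_CHARSET_RE = re.compile(r'<meta[^>]+charset\s*=\s*["\']?\s*([^"\'>;]+)', re.IGNORECASE)


def declared_encoding(html):
    """Return the first declared charset as a stripped string, or "" if none."""
    match = _CHARSET_RE.search(html)
    return match.group(1).strip() if match else ""


print(declared_encoding('<meta charset="utf-8 ">'))  # utf-8
```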

Version 3.0.9

Add soup as a parser option to use lxml.html.soupparser see issue #27
Fix an issue with passing the requests session object to the crawler
Pylint changes
- Added pylintrc file
- Updated variable and positional argument names to be more pythonic
- Fixed line continuation issues
- Updated variable names when ambiguous
- Cleaned up class and static methods

Version 3.0.8

Fix using different requests session for each url fetched
- Added close method to the Goose object
Allow the Goose object to be a context manager

from goose3 import Goose
with Goose() as g:
    g.extract(url='some-url-here')

NOTE: Existing code does not need to change; Goose will attempt to close the connection automatically on garbage collection

Configuration object changes
- Better handling of the known_context_patterns configuration
- Added http_headers configuration option to be passed to requests
- Added http_proxies configuration option to be passed to requests
- Added http_auth configuration option to be passed to requests
Fix base64 image parsing see issue #7
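
The http_headers, http_proxies, and http_auth options above are passed through to requests, so they can take the shapes requests accepts: a header dict, a proxy mapping, and a (user, password) tuple. The sketch below assumes goose3 is installed. The header, proxy URL, and credentials are placeholders, and `http_config` and `fetch_title` are names chosen for this example.

```python
# Placeholder values; replace them with real settings before use.
http_config = {
    "http_headers": {"Accept-Language": "en-US"},
    "http_proxies": {"https": "http://proxy.example.com:3128"},
    "http_auth": ("user", "password"),
}


def fetch_title(url):
    """Extract an article title using the HTTP options above."""
    from goose3 import Goose  # imported lazily so the dict can be built without goose3

    with Goose(http_config) as g:
        return g.extract(url=url).title
```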

Version 3.0.7

Fix installation issue
- Removed unused/broken regex
- Include all necessary files
- Fix most of the failing tests
Resolved relative URL issue see issue #21
Resolved temporary files not being properly removed see issue #18
Removed unused dependencies and code to support python 2 see issue #16
Fix error when using the configuration object to configure goose see issue #14

Version 3.0.1

First working version of Goose3!
