This repository has been archived by the owner on Mar 9, 2021. It is now read-only.

Detect publish date #10

Open
bejean opened this issue May 18, 2012 · 3 comments

Comments

@bejean
Contributor

bejean commented May 18, 2012

A great feature would be to detect the publication date of the web page.
This information is often located near the top or the bottom of the main text.

@karussell
Owner

Any ideas of 'how'?

Or even better some code :) ?

@karussell
Owner

BTW: at the moment the date is guessed from the URL only.

@bejean
Contributor Author

bejean commented Sep 23, 2012

Hi, I tested this and it is a good first step.
I haven't really thought about how to do this. Maybe create an array of regexps and apply them to the extracted text.

Anyway, today it is not possible to get the date directly from an ArticleTextExtractor object; the only way is to use the SHelper class:

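// Assumes rawData (the page HTML) and url are already defined.
ArticleTextExtractor extractor = new ArticleTextExtractor();
JResult res = extractor.extractContent(rawData);
String text = res.getText();
String title = res.getTitle();
// The date does not come from the extraction result; it is estimated from the URL.
String date = SHelper.completeDate(SHelper.estimateDate(url));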
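
The regexp idea mentioned above could be sketched roughly as follows. This is only an illustration, not snacktory code: the class PublishDateGuesser, its guessDate method, and the two date patterns are assumptions made for the example, and a real implementation would need many more formats and locales.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of snacktory.
class PublishDateGuesser {

    // Pairs a regex that finds a date candidate with the formatter that parses it.
    private static final class Rule {
        final Pattern pattern;
        final DateTimeFormatter formatter;

        Rule(String regex, DateTimeFormatter formatter) {
            this.pattern = Pattern.compile(regex);
            this.formatter = formatter;
        }
    }

    // Rules are tried in order; the first candidate that parses as a valid date wins.
    private static final List<Rule> RULES = new ArrayList<>();
    static {
        // ISO style, e.g. 2012-05-18
        RULES.add(new Rule("\\b\\d{4}-\\d{2}-\\d{2}\\b",
                DateTimeFormatter.ISO_LOCAL_DATE));
        // English long form, e.g. May 18, 2012
        RULES.add(new Rule("\\b[A-Z][a-z]+ \\d{1,2}, \\d{4}\\b",
                DateTimeFormatter.ofPattern("MMMM d, yyyy", Locale.ENGLISH)));
    }

    // Returns the first parseable date found in the text, or empty if none is found.
    static Optional<LocalDate> guessDate(String text) {
        for (Rule rule : RULES) {
            Matcher m = rule.pattern.matcher(text);
            while (m.find()) {
                try {
                    return Optional.of(LocalDate.parse(m.group(), rule.formatter));
                } catch (DateTimeParseException e) {
                    // Looked like a date but was not valid (e.g. month 13); keep searching.
                }
            }
        }
        return Optional.empty();
    }
}
```

With the snippet above, the text passed in would be the result of res.getText(), for example PublishDateGuesser.guessDate(text). Since the date is often near the top or the bottom of the main text, restricting the search to the first and last few lines might reduce false matches on dates mentioned inside the article body.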

kinow added a commit to kinow/snacktory that referenced this issue Feb 2, 2015
karussell added a commit that referenced this issue Feb 4, 2015
Fix issue #10 allow users to set a proxy
rborer referenced this issue in finity-ai/snacktory Aug 27, 2015
added some fields to JResult:  page type, site name and language (locale).