Getting results beyond defined time period #671

Closed
juste97 opened this issue Dec 1, 2022 · 11 comments


@juste97

juste97 commented Dec 1, 2022

Hey,

I'm using twarc2 as a library in Python (not from the command line) with search_all and academic access.

Everything is working fine except the defined start and end time does not seem to be working properly.

Whenever my time settings look like this, from 2017-01-28 01:00:00+00:00 to 2017-01-28 23:00:00+00:00 or from 2017-01-28 00:00:00+00:00 to 2017-01-29 00:00:00+00:00, I get tweets from 2017-01-29 as well.

To clarify, I just want to get Tweets for one particular day, 500 Tweets per query (6 pages using enumerate), but this returns around 60 tweets for the following day as well.

Can someone help or point me toward a solution? I can post Python code too if needed.

@igorbrigadir
Contributor

What's the full code snippet you have? How are you defining the times?

@juste97
Author

juste97 commented Dec 1, 2022

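    import datetime
    from datetime import timedelta

    # sample_df and ind come from the surrounding loop over the dataframe
    date_string = str(sample_df['Date'][ind])
    year = date_string[0:4]
    month = date_string[5:7]
    day = date_string[8:10]

    # Midnight UTC on the date taken from the dataframe
    base_date = datetime.datetime(int(year), int(month), int(day), 0, 0, 0, 0, datetime.timezone.utc)
    start_list = []

    # Offsets +7 down to -7 days (15 dates); range(7, -7, -1) would stop at -6
    for x in range(7, -8, -1):
        start_list.append(base_date + timedelta(days=x))

    # Each end time is exactly one day after its start time
    end_list = [(y + timedelta(days=1)) for y in start_list]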

This is the snippet relevant for the dates. I'm iterating over a dataframe, getting dates like 2017-01-28 per row and turning them into datetime objects. A loop then expands each one into 15 dates (±7 days around the original date), which are appended to start_list; its items are later used as start_date. Applying timedelta again to each element of start_list gives me my end_date.
The code works fine and returns results, with the exception of the problem explained above.
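
For comparison, here is a minimal sketch of the same window construction that parses the date instead of slicing substrings. The helper name `day_windows` and its `days_each_side` parameter are made up for illustration, and it assumes the dataframe value starts with an ISO `YYYY-MM-DD` date.

```python
from datetime import date, datetime, time, timedelta, timezone


def day_windows(date_string, days_each_side=7):
    """Return (start, end) pairs covering one UTC day each, newest first.

    Hypothetical helper: assumes date_string starts with YYYY-MM-DD.
    """
    base_day = date.fromisoformat(date_string[:10])
    base = datetime.combine(base_day, time.min, tzinfo=timezone.utc)
    windows = []
    # +days_each_side down to -days_each_side inclusive (15 windows for 7)
    for offset in range(days_each_side, -days_each_side - 1, -1):
        start = base + timedelta(days=offset)
        windows.append((start, start + timedelta(days=1)))
    return windows


windows = day_windows("2017-01-28")
print(len(windows))  # 15
print(windows[7])    # the window for 2017-01-28 itself
```

Each pair could then be passed as the start and end time of one query, which keeps every window exactly 24 hours long and timezone-aware.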

@igorbrigadir
Contributor

Ah very strange - does the same thing happen if you run the same query on the command line? What's the twarc.log output if you try this?

@juste97
Author

juste97 commented Dec 1, 2022

Yeah, unfortunately same thing using command line.

    twarc2 search --limit 500 --archive --start-time "2019-10-01T00:00:00" --end-time "2019-10-02T00:00:00" "@Amazon -is:retweet" test_date_amazon_2.jsonl

or

    twarc2 search --limit 500 --archive --start-time "2019-10-01T00:00:00" --end-time "2019-10-01T23:30:00" "@Amazon -is:retweet" test_date_amazon_3.jsonl

both yield results including tweets from day 2019-10-02.

Attached the log file for first command:
twarc.log
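
One way to check how many tweets land on each day is to count them straight from the JSONL output. This is only a sketch: `tweets_per_day` is a made-up helper, and it assumes each line is either a response page with a `data` list or a single tweet object, with `created_at` in the API's ISO 8601 UTC form.

```python
import json
from collections import Counter


def tweets_per_day(lines):
    """Count tweets per UTC calendar day from JSONL lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Assumption: a line is either a page ({"data": [...]}) or one tweet
        data = record.get("data", record)
        tweets = data if isinstance(data, list) else [data]
        for tweet in tweets:
            # created_at looks like "2019-10-01T23:59:59.000Z";
            # the first 10 characters are the UTC day
            counts[tweet["created_at"][:10]] += 1
    return counts


sample = [
    '{"data": [{"id": "1", "created_at": "2019-10-02T00:10:00.000Z"},'
    ' {"id": "2", "created_at": "2019-10-01T23:50:00.000Z"}]}',
    '{"id": "3", "created_at": "2019-10-01T12:00:00.000Z"}',
]
print(tweets_per_day(sample))
```

For a real file, the same function can be called with an open file handle, for example the `test_date_amazon_2.jsonl` written by the first command.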

@edsu
Member

edsu commented Dec 1, 2022

It's probably the result of some microservice Elon turned off.

@edsu
Member

edsu commented Dec 1, 2022

More seriously, if you are using twarc2 as a library then you should be able to notice when you've passed the limit and stop right?
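
A rough sketch of that idea: check each tweet's `created_at` against the window on the client side and only count the ones inside it. The helper names `in_window` and `take_in_window` are made up, and the tweets are assumed to be plain dicts with an ISO 8601 `created_at` ending in `Z`.

```python
from datetime import datetime, timezone


def in_window(tweet, start, end):
    """True if the tweet's created_at falls in [start, end)."""
    # Swap the trailing "Z" for an offset so fromisoformat accepts it
    # on Python versions older than 3.11
    created = datetime.fromisoformat(tweet["created_at"].replace("Z", "+00:00"))
    return start <= created < end


def take_in_window(tweets, start, end, wanted):
    """Yield up to `wanted` tweets inside the window, skipping the rest."""
    kept = 0
    for tweet in tweets:
        if kept >= wanted:
            break
        if in_window(tweet, start, end):
            kept += 1
            yield tweet


start = datetime(2019, 10, 1, tzinfo=timezone.utc)
end = datetime(2019, 10, 2, tzinfo=timezone.utc)
tweets = [
    {"id": "1", "created_at": "2019-10-02T00:10:00.000Z"},  # past the end time
    {"id": "2", "created_at": "2019-10-01T23:50:00.000Z"},
    {"id": "3", "created_at": "2019-10-01T12:00:00.000Z"},
]
kept = list(take_in_window(tweets, start, end, wanted=2))
print([t["id"] for t in kept])  # ['2', '3']
```

Because out-of-window tweets are skipped rather than counted, they no longer eat into the number of tweets wanted per day.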

@juste97
Author

juste97 commented Dec 1, 2022

Yeah, I print every argument to check if they're correct.
The problem is that I want to analyze 500 tweets from only particular days for a given query, so results going beyond the set date take away from my total results. It seems like the date is simply being shifted a few hours beyond the set end date.

@SamHames
Contributor

SamHames commented Dec 2, 2022 via email

@igorbrigadir
Contributor

Something to note is that --limit can return more than the set limit, because it's more like a threshold than an exact target. If the limit is 500 and 5 calls return 499 tweets (because one call returned 99 instead of 100), it will fetch another page of 100. That's just an aside for future reference (and to remind myself of #647 too). Also, the results are in reverse chronological order, so it's important not to use limit for "sampling" of any kind.

I can't seem to reproduce the same thing on the command line now anyway, so it may be something with their index or cache. I don't think we can or should try to fix it in twarc's code though.
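
The threshold behaviour of --limit described above can be illustrated with a toy simulation. This is not twarc's actual implementation; `fetch_with_limit` and `page_sizes` are made up to show why a short page leads to an overshoot.

```python
def fetch_with_limit(page_sizes, limit):
    """Simulate paging that stops only once the running total reaches `limit`."""
    total = 0
    for size in page_sizes:
        if total >= limit:
            break
        # A whole page is always kept, so the total can overshoot the limit
        total += size
    return total


# One page returns 99 instead of 100, so after 5 pages the total is 499
print(fetch_with_limit([100, 100, 99, 100, 100, 100, 100], limit=500))  # 599
# With full pages the threshold is hit exactly
print(fetch_with_limit([100] * 7, limit=500))  # 500
```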

@juste97
Author

juste97 commented Dec 3, 2022

At first I thought it was only +1h, but the more results I get for a given query, the more tweets belong to the next day, which would fit igorbrigadir's explanation of reverse chronological order...

@SamHames leaving out the timezone argument does not change the results at all...

I'm not using twarc from the command line, but I suppose iterating over the pages returned by the API would be the same as using the --limit argument?

So there is nothing you can help me with?

@igorbrigadir
Contributor

Yeah, the command line uses the exact same code, so there shouldn't be any difference.

It's a very strange error. Is it still happening? Since I can't reproduce the error with exact dates, maybe the bug is actually somewhere in the code that reads a dataframe and extracts dates with substrings here: #671 (comment). This seems very specific to your case, so I can't offer any help here, but the pandas tag on Stack Overflow is usually much more active for figuring these things out!

@edsu edsu closed this as completed Dec 5, 2022