Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid off-by-one when scraping 'servings_text' #3363

Closed
wants to merge 1 commit into from

Conversation

jmerdich
Copy link

This was showing something like servings=4 and servings_text="['4']" for all recipes I imported. Use the first item in the list instead of the second-- my list only has one entry which threw an exception, dropped it, and stringified the list. This matches the behavior of servings.

This was showing something like `servings=4` and `servings_text="['4']"` for all recipes I imported. Use the *first* item in the list instead of the second-- my list only has one entry which threw an exception, dropped it, and stringified the list. This matches the behavior of `servings`.
@CLAassistant
Copy link

CLAassistant commented Oct 13, 2024

CLA assistant check
All committers have signed the CLA.

@vabene1111
Copy link
Collaborator

Interesting, thanks for the PR. I guess that using 1 as an index was done on purpose at some point because the person implementing it (probably smilerz or me) had test data that was like ["1","pcs"].

What do you think about looping the list for an entry and trying do some regex matching to find the best fit (contains number/is only number/ does not contain a number) to potentially improve the results?

@jmerdich
Copy link
Author

I'm not super keen on writing a heuristic, since I don't have a lot of experience on real-world data here and I'm not entirely sure what the intent of the field is (original text? units?). I also may need to change the numeric equivalent depending on the heuristic (which just takes the first item today).

Here are a few more options, do any sound good?

  • Last item in list (maybe what the last one meant?)
  • String of all items in list (" ".join(...))
  • String of all items in list, deduplicated.

@vabene1111
Copy link
Collaborator

Interesting, I like your second option combined with removing the first item that looks like a number, do you think that makes sense?

So

  1. remove the first item in the list that regex matches a number (in any style with , and . as seperators)
  2. join the rest of the list into one string with spaces as delimiter

@vabene1111 vabene1111 closed this Jan 18, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants