Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Arxivce 1273 browse refactor #240

Merged
merged 31 commits into from
Mar 28, 2024
Merged

Arxivce 1273 browse refactor #240

merged 31 commits into from
Mar 28, 2024

Conversation

mnazzaro
Copy link
Contributor

No description provided.


CLASSIC_DB_URI = os.environ.get("CLASSIC_DB_URI", DEFAULT_DB)
LATEXML_DB_URI = os.environ.get("LATEXML_DB_URI", DEFAULT_LATEXML_DB)
ECHO_SQL = os.environ.get("ECHO_SQL", True)
Copy link
Contributor

@kyokukou kyokukou Mar 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure we would want the default ECHO_SQL to be true

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We very much do not want ECHO_SQL default enabled.


Usually the same as BASE_SERVER but can be configured.
"""

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for completeness RSS also has a server (although I doubt anyone outside of RSS cares)




t_arXiv_bogus_subject_class = Table(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are some tables and some classes for the models?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a few reasons this happens. One example is that some tables don't have primary keys and therefore can't be orm models. Another reason is that sqlacodegen decides to make tables that represent many-to-many relationships tables instead of orm models.

@mnazzaro
Copy link
Contributor Author

This is more of a draft PR. Lots of stuff left to do here:

  • Make object store useful
  • Try to get rid of cloudpathlib stuff. Only browse uses it...
  • Get tests running, including new tests for relevant stuff from browse
  • Make the config into a pydantic model
  • Refactor everything in arxiv/base because that's the biggest mess

@mnazzaro mnazzaro marked this pull request as draft March 12, 2024 14:00
arxiv/config/__init__.py Outdated Show resolved Hide resolved

with get_db() as session:
self.query = session.query(Metadata.paper_id, Metadata.version) \
.filter(Metadata.paper_id >= start_yymm) \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fyi this will only work for papers after march 2007

.filter(Metadata.paper_id >= start_yymm) \
.filter(Metadata.paper_id < f'{end_yymm}.999999')
if categories:
self.query = self.query.filter(or_(*[Metadata.abs_categories.like(f'%{category.id}%') for category in categories]))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

possible hiccup here due to some entries not including the most modern/relevant name of their category. Technically this code is fine, but to find every paper whatever calls this would have to include all possible category names (both versions of aliases and subsumed arxivs)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm yeah I wrote this piece of code sort of carelessly, and I don't know what it would be useful for. I'd like to search/sort by announce date but that doesn't seem easy with the columns available to me. I may just delete it and write again when I start work on converting the full corpus to HTML

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ive spent way too much time figuring how to search for all papers of a given demographic. LMK when you work on that

arxiv/py.typed Outdated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

adding empty file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This tells mypy in other packages to observe the type hints from this package

@@ -304,7 +318,14 @@
'test': date(2010, 1, 1)
}

CATEGORIES = {
class tCategory(TypedDict):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I vote for combining these with the Archive Category and Group classes. Anything trying to use them will want the features of both

Copy link
Contributor

@ntai-arxiv ntai-arxiv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry about my not-so-curret docstring.

arxiv/files/__init__.py Outdated Show resolved Hide resolved
arxiv/files/__init__.py Outdated Show resolved Hide resolved
arxiv/files/__init__.py Outdated Show resolved Hide resolved
Copy link
Contributor

@bdc34 bdc34 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Some minor changes requested. Mostly I've left comments.

arxiv/authors/__init__.py Outdated Show resolved Hide resolved
arxiv/base/__init__.py Outdated Show resolved Hide resolved

CLASSIC_DB_URI = os.environ.get("CLASSIC_DB_URI", DEFAULT_DB)
LATEXML_DB_URI = os.environ.get("LATEXML_DB_URI", DEFAULT_LATEXML_DB)
ECHO_SQL = os.environ.get("ECHO_SQL", True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We very much do not want ECHO_SQL default enabled.

arxiv/base/config.py Outdated Show resolved Hide resolved
arxiv/config/__init__.py Outdated Show resolved Hide resolved
arxiv/ops/fastly_log_ingest/app.py Outdated Show resolved Hide resolved
arxiv/urls/__init__.py Outdated Show resolved Hide resolved
arxiv/util/tests/test_authors.py Outdated Show resolved Hide resolved
mypy.ini Outdated Show resolved Hide resolved
pyproject.toml Outdated Show resolved Hide resolved
@mnazzaro mnazzaro marked this pull request as ready for review March 27, 2024 19:33
@bdc34 bdc34 self-requested a review March 27, 2024 19:39
@ntai-arxiv ntai-arxiv self-requested a review March 27, 2024 19:40
@kyokukou kyokukou self-requested a review March 27, 2024 19:41
Copy link
Contributor

@kyokukou kyokukou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This currently has a bug in it with the new category structure, I've fixed it in a branch that hasn't been pulled into this one yet. Also I think it would be good to get all of the tests running.

PR here: #244

@mnazzaro mnazzaro merged commit 0dfef6c into develop Mar 28, 2024
1 check passed
@kyokukou kyokukou deleted the ARXIVCE-1273-browse-refactor branch April 1, 2024 15:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants