
Broken PKs #345

Closed
karafecho opened this issue Feb 1, 2023 · 11 comments

@karafecho

This issue is to report a known problem with broken PKs. @MarkDWilliams is aware of it and has investigated, but I don't believe he has identified a root cause. The problem arose most recently while preparing the QotM manuscript. Here is an example: https://arax.ncats.io/?source=ARS&id=26ae39c7-d601-4315-b978-5a48d4596677, pulled from the QotM #1 ticket, Chris B.'s post on March 16, 2022.

[screenshot attached]

I have not been able to identify any systematic cause for this issue, although I can speculate:

  1. Something may be broken on the ARS backend with respect to ensuring PK persistence.
  2. There may be a limit on how long PKs are stored.
  3. Folks may not be saving the full PK, or may be saving an ARA- or KP-specific PK rather than the ARS PK.
  4. None of the above.

I may have posted this issue to another repo a while back, but I think this is the most appropriate repo for this ticket. If not, please feel free to reassign.

Thanks!

@edeutsch
Contributor

edeutsch commented Feb 2, 2023

After turning on some debugging on the ARAX back end, here is the debugging info showing what ARAX is trying to do. This may be useful for tracking this down, @MarkDWilliams:

```
Trying ars-prod.transltr.io...
--- Fetch of 26ae39c7-d601-4315-b978-5a48d4596677 from ars-prod.transltr.io yielded 404
Trying ars.test.transltr.io...
--- Fetch of 26ae39c7-d601-4315-b978-5a48d4596677 from ars.test.transltr.io yielded 404
Trying ars.ci.transltr.io...
--- Fetch of 26ae39c7-d601-4315-b978-5a48d4596677 from ars.ci.transltr.io yielded 404
Trying ars-dev.transltr.io...
--- Fetch of 26ae39c7-d601-4315-b978-5a48d4596677 from ars-dev.transltr.io yielded 404
Trying ars.transltr.io...
--- Fetch of 26ae39c7-d601-4315-b978-5a48d4596677 from ars.transltr.io yielded 502
Cannot fetch from ARS a response corresponding to response_id=26ae39c7-d601-4315-b978-5a48d4596677
b'<html>\r\n<head><title>502 Bad Gateway</title></head>\r\n<body bgcolor="white">\r\n<center><h1>502 Bad Gateway</h1></center>\r\n<hr><center>nginx/1.14.0 (Ubuntu)</center>\r\n</body>\r\n</html>\r\n'
```
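For context, here is a minimal sketch (Python, using `requests`) of the kind of multi-instance fallback fetch the log above reflects. The hosts are taken from the log; the `/ars/api/messages/{pk}` path is an assumption for illustration, not a confirmed ARS contract.

```python
# Hedged sketch: probe each ARS instance for a PK, mirroring the fallback
# behavior in the ARAX debug output above. The message path is an assumption.
import requests

ARS_HOSTS = [
    "ars-prod.transltr.io",
    "ars.test.transltr.io",
    "ars.ci.transltr.io",
    "ars-dev.transltr.io",
    "ars.transltr.io",
]

def fetch_response(pk: str, timeout: float = 30.0):
    """Try each ARS host in turn; return the first successful JSON payload."""
    for host in ARS_HOSTS:
        url = f"https://{host}/ars/api/messages/{pk}"  # assumed path, for illustration
        print(f"Trying {host}...")
        try:
            resp = requests.get(url, timeout=timeout)
        except requests.RequestException as exc:
            print(f"--- Fetch of {pk} from {host} failed: {exc}")
            continue
        if resp.status_code == 200:
            return resp.json()
        print(f"--- Fetch of {pk} from {host} yielded {resp.status_code}")
    return None

if __name__ == "__main__":
    result = fetch_response("26ae39c7-d601-4315-b978-5a48d4596677")
    print("Found" if result else "Cannot fetch this PK from any ARS instance")
```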

@MarkDWilliams
Collaborator

I believe that, as part of the ITRB switch, there is currently a 30-day data retention policy on records. I think it would be helpful for us to have some means of flagging PKs that people want retained for longer periods of time. I don't imagine that there would be any problem holding the small number of PKs that people intend to go back to at a later date, but I believe there were issues with retaining all records, especially given the volume sent in by automated testing.
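As an illustration of that flagging idea (not actual ARS code), here is a minimal sketch of a 30-day sweep that skips records flagged for retention; the record schema here is an assumption.

```python
# Hypothetical sketch of a retention-aware cleanup pass; not the real ARS data model.
from datetime import datetime, timedelta, timezone

RETENTION_PERIOD = timedelta(days=30)

def sweep(records):
    """records: iterable of dicts with assumed keys 'pk', 'created' (aware datetime), 'retain' (bool)."""
    cutoff = datetime.now(timezone.utc) - RETENTION_PERIOD
    for rec in records:
        if rec["retain"]:
            continue  # explicitly flagged for longer retention, so never purged
        if rec["created"] < cutoff:
            print(f"would purge {rec['pk']}")
```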

@MarkDWilliams
Collaborator

MarkDWilliams commented Feb 2, 2023

I've created a ticket to allow for longer retention of records and will speak with ITRB about any strictures I'll need to adhere to in implementing something like this. I apologize for any inconvenience.
#346

EDIT: To clarify, I suspect what you are seeing with the record you referenced is due to the automated data cleaning, but I still have not found anything in our logs to reveal the source of the previous idiosyncratic data loss.

@karafecho
Author

Thanks, Mark!

@MarkDWilliams
Collaborator

MarkDWilliams commented Feb 3, 2023

I spoke with ITRB, and they actually haven't implemented the automated data clearing yet. Looking back through past tickets in the Testing repo, it does look like there is a consistent date before which PKs become inaccessible (mid-June), but there is no code that I am aware of in the ARS that would delete older records, and ITRB confirms they're not doing anything like that either. I'm hunting down a few possibilities and will let you know what I find. That said, I still intend to add an option for PKs to be flagged for retention before the automated data clearing is implemented.

@karafecho
Author

Thanks for the investigative work. WRT flags, would it make sense (or even be feasible) to flag all PKs for retention and toss those generated for automated testing? I ask because I'm not sure I would know when (and when not) to flag a PK for retention.

@MarkDWilliams
Collaborator

I don't know how feasible it would be to detect whether a query was submitted by automated means. Would it be acceptable to have an endpoint where you could submit a PK (after you've run your query and seen the results) to be retained?
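To make that concrete, a client-side call to such an endpoint might look like the sketch below; the URL path and behavior are purely illustrative assumptions, since no retention endpoint exists yet.

```python
# Hypothetical only: no such retention endpoint exists in the ARS yet.
import requests

def flag_for_retention(pk: str, host: str = "ars-prod.transltr.io") -> bool:
    """Ask the ARS (hypothetically) to keep a PK beyond the normal cleanup window."""
    url = f"https://{host}/ars/api/retain/{pk}"  # assumed path, for illustration
    resp = requests.post(url, timeout=30)
    return resp.ok
```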

@karafecho
Author

Hmmmm. I'm not sure if a separate endpoint for submitting PKs would work for users. Let's brainstorm during our next TACT call.

@karafecho
Author

Following up from today's TACT call, I'm not sure that your proposal for PK persistence will address the issue, at least as I see it. Specifically, I have often gone back to review query results using PKs that either I or others have generated. This happens most commonly for publication development, but also for testing related to the TCDC or independent efforts, e.g., a query that I ran previously no longer runs, or a query that is nearly identical to one I ran previously no longer runs. In both cases, but especially for publication development, I would not know at the time of query generation that I would later want to pull the PK and review the results. Moreover, even if I flagged every PK that I generate as "for persistence", I would not be able to predict the need for future reuse of PKs that others generate.

I'll add that I've encountered far too many broken PKs for the issue to be considered a series of one-offs.

I'm thinking it may make sense to explore the option I suggested above, which is to flag all PKs as "for persistence" and toss all PKs that are generated by automated testing.

@MarkDWilliams
Collaborator

MarkDWilliams commented Mar 3, 2023

I see. I can continue to look into the issue with ITRB and see if there is some root cause for the broken PKs that we're able to surface. I agree that it has happened too often, and apologize for the inconvenience this has caused.

With respect to data retention, is it the case that you know, at some point after the query is run, that you'll want to go back to its results? The retention flag would allow things to be marked for retention at any point before the automated data cleanup (which hasn't been implemented yet). I worry that doing the inverse, where things are flagged for deletion rather than retention, would still lead to a lot of unneeded data being retained and the DB growing to a size that could cause issues.

Would a 'keep-alive' system be any better? That is, in addition to being able to manually flag PKs for retention, they could also be shielded from data cleanup if they had been accessed within the last [period of time].

How much time has typically passed after a query was run when you first know or suspect that you'll want to reference it further? Is there a set of sources (standup testing, Relay sessions, etc.) that you usually pull from? I think blanket retention of all standups, QotM, etc. would be viable. I'm just trying to think of some way that retention could be applied where needed without being the default behavior.
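For the keep-alive idea, here is a minimal sketch of the eligibility check such a cleanup might apply; the field names and the window are assumptions, not an existing implementation.

```python
# Hypothetical sketch: decide whether a stored record may be purged.
# Field names (retain_flag, last_accessed) and the window are assumptions.
from datetime import datetime, timedelta, timezone
from typing import Optional

KEEP_ALIVE_WINDOW = timedelta(days=90)  # placeholder; the actual window is TBD

def eligible_for_cleanup(retain_flag: bool, last_accessed: datetime,
                         now: Optional[datetime] = None) -> bool:
    """A record survives cleanup if it is flagged for retention
    or was accessed within the keep-alive window."""
    now = now or datetime.now(timezone.utc)
    if retain_flag:
        return False
    return (now - last_accessed) > KEEP_ALIVE_WINDOW
```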

@karafecho
Author

karafecho commented Mar 3, 2023

Let's assume that the ITRB data migration is the root cause of the issue and hold off on any further investigative work. If another broken PK emerges, then we can reconsider that decision.

Thanks for your suggestions. Of those listed, the last one might address the issues I've encountered to date. Specifically, if we could retain all PKs posted to the Testing repo (standups, QotM), then I'm pretty sure the issue would be resolved, although a new use case may require a different solution.
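For illustration only, one way the "retain everything posted to the Testing repo" idea could be automated, assuming PKs appear as UUIDs in issue bodies; the repo name and the downstream retention call it would feed are assumptions.

```python
# Hypothetical sketch: harvest PK-like UUIDs from a GitHub repo's issues so they
# can be flagged for retention. The repo name is an assumption for illustration.
import re
import requests

UUID_RE = re.compile(r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b")

def harvest_pks(repo: str = "NCATSTranslator/Testing", pages: int = 5):
    """Yield PK-like UUIDs found in issue bodies via the GitHub REST API."""
    for page in range(1, pages + 1):
        resp = requests.get(
            f"https://api.github.com/repos/{repo}/issues",
            params={"state": "all", "per_page": 100, "page": page},
            timeout=30,
        )
        resp.raise_for_status()
        issues = resp.json()
        if not issues:
            break
        for issue in issues:
            yield from UUID_RE.findall(issue.get("body") or "")
```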

I also like the retention-flag and 'keep-alive' solutions. In terms of a time frame, I think I started gathering materials for the QotM manuscript in December 2022, and the broken PKs dated back to February 2022. So I'd say that if we implemented a 'keep-alive' system in which PKs are shielded from cleanup for one year, that might also help.

Anyway, I appreciate all the digging you've done.
