Use datapoint timestamp to determine persist to cassandra idx #1281
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
We had another customer complaining about this. They publish forecasted data every Sunday for the next week. Because of the compressed publishing window, the wallclock-based check keeps the persisted lastUpdate behind the actual data.

One thing to think about is that an accelerated publish rate could also cause a lot of writes to be triggered. Imagine, for example, backfilling ten years of data: if the update interval is 3 hours, we might trigger a lot of useless intermediary updates.
The original link from the OP doesn't work anymore, but it was pointing to this code, which is still unchanged: `metrictank/idx/cassandra/cassandra.go` line 260 at commit 8626886.
You mention LastUpdate, but that code uses LastSave. The difference is documented here: https://github.com/grafana/metrictank/blob/master/docs/metadata.md#lastsave-vs-lastupdate (TL;DR: LastUpdate uses data time, LastSave uses wallclock). In both of your comments/cases, did you mean LastSave?
Yes, I said LastUpdate and meant LastSave.
To be very explicit:

**The Issue**

Data published at an accelerated rate will have datapoints published faster than real-time (e.g. backfill or forecast data). This means that the value of LastUpdate advances much faster than wallclock time, while the decision to persist is gated on the wallclock-based LastSave. Generally, the stored LastUpdate will lag behind the newest datapoint until enough wallclock time has passed.

**Proposed Solution**

Use the datapoint timestamp (LastUpdate) rather than wallclock time when deciding whether to persist an entry to the cassandra index.
OK I get it now.
Are you referring to 1) the current status, or 2) with your proposal applied? I'm having trouble finding any issues with your suggestion other than the above, which is the main reason I want to explore alternative options. To be clear, the issue is with young metrictank instances that haven't seen any of the data for the given series from kafka.

How about this: we replace the cassandra writeQueue with a smart queue.
The smart queue is not just a channel; it has a staging area.
This is obviously more complicated, and doing it right means we need to be aware of how long saves take (in the write queue, but also in the staging background routine), but in exchange we won't overload cassandra with lots of intermediary updates.
That was with the proposal applied. I was pointing out a potential issue.
I sort of had a similar idea to this, but not quite as involved. My thought was that instead of writing exactly what was pulled off the queue, the cassandra writer could do a lookup in the index to get the latest values. If this latest value in the index has a recent-enough LastSave, the write can be skipped.

I think this version is less work than your proposal but not as "complete" in preventing duplicate writes.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
In the cassandra index, the current (wallclock) time is used to determine if lastUpdate should be saved to the cassandra table. When doing an initial backfill, this means that we need to wait 3h (by default) to publish the final datapoint to make sure that the lastUpdate is properly persisted. If the datapoint timestamp was used instead, we wouldn't need to worry about it.