Introduce sync utility with aria2c. #69
Conversation
Aria2_jll, well this is exciting. I'll take it for a test run.
There still seems to be some functionality missing, or I don't know how to use the new sync properly. Here's what I did:
```julia
using SpaceLiDAR  # feat/aria2 branch
using Extents

test_area = Extent(X=(102.0999, 102.1), Y=(11.999, 12.0))
granules = search(:ICESat2, :ATL08; extent=test_area, version=6)

# Afterward you can download the dataset
folder = "/Users/gardnera/Downloads/test"

# this does not work (I thought `folder` was a keyword)
fn = SpaceLiDAR.download(granules, folder=folder)

# this does
fn = SpaceLiDAR.download(granules, folder)

# this does not work, but it seems it should
SpaceLiDAR.sync(granules, folder)

# this causes all of the data to be downloaded ????
SpaceLiDAR.sync!(granules, folder, false)
```
If I have a local archive of granules and I only want to update the granules for a certain region, how can I do that with sync?
```diff
-function download!(granules::Vector{Granule}, folder::AbstractString = ".")
+function download!(granules::Vector{<:Granule}, folder::AbstractString = ".")

 # Download serially if s3 links are present
```
Can s3 links not be downloaded in parallel?
Aria2c doesn't support s3 downloading afaik.
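For illustration, a rough sketch of the branching that comment describes: serial Downloads.jl when s3 URLs are involved, a single parallel aria2c run otherwise. The helper name `download_all` and the flag choices are hypothetical, not the package's actual implementation.

```julia
using Aria2_jll: aria2c
import Downloads

# Sketch: fall back to serial Downloads.jl when any granule points at an
# s3 URL, otherwise hand all URLs to aria2c in one parallel run.
function download_all(urls::Vector{String}, folder::AbstractString)
    if any(startswith("s3://"), urls)
        # aria2c can't authenticate against s3, so fetch one by one
        for url in urls
            Downloads.download(url, joinpath(folder, basename(url)))
        end
    else
        input = joinpath(folder, "urls.txt")
        write(input, join(urls, '\n'))
        # -i reads the download list, -d sets the output directory,
        # -j controls the number of parallel downloads
        run(`$(aria2c()) -i $input -d $folder -j 8 --continue=true`)
    end
    return folder
end
```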
src/granule.jl
```julia
`all`, false by default, will search only for granules past the date of
the latest granule found in `folders`. If true, will search for all granules.
"""
function sync(folders::AbstractVector{<:AbstractString}, all::Bool = false)
```
Is there a test for sync? I didn't see one added.
Will add one, but I need to craft something to make sure I only sync a single granule or something.
src/granule.jl
```julia
`all`, false by default, will search only for granules past the date of
the latest granule found in `folders`. If true, will search for all granules.
"""
function sync(folders::AbstractVector{<:AbstractString}, all::Bool = false)
```
If sync is only supplied a folder, or a product and folder, does it just download the full global dataset for that product? If so, this is super helpful for me and you, but maybe dangerous for others?
Agreed, it could download the whole archive. That's especially the case for ICESat, because I can't derive a date from those granule filenames (ICESat-2 and GEDI have the date encoded in them). Will add a warning.
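For reference, a hedged sketch of the two entry points visible in the diff above; the paths and the `:ATL08` product symbol are made-up examples, and without a spatial filter either form spans the full product archive:

```julia
using SpaceLiDAR

# Sync one or more existing folders; with all=false (the default) only granules
# newer than the latest granule already present are searched for and downloaded.
SpaceLiDAR.sync(["/data/atl08/2022", "/data/atl08/2023"])

# Sync a single folder for a given product; on an empty folder or with all=true
# this can pull the complete global archive for that product.
SpaceLiDAR.sync(:ATL08, "/data/atl08", false)
```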
src/granule.jl
```julia
Base.hash(g::Granule, h::UInt) = hash(g.id, h)

"""
    sync(folder::AbstractString, all::Bool=false)
```
missing "sync!(granules, ..."
Should there also be a "sync(granules, ..."? It's a little bit confusing to need to add the ! to sync a list of granules.
Hmm, so I should be clearer about the distinction between sync and download. Sync is for doing search and download all at once, to update an existing archive. sync(granules, ...) is missing, because if you do have granules (from search), download(granules, ...) would just work.
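To make that distinction concrete, a small hedged sketch (the folder path and extent values are placeholders):

```julia
using SpaceLiDAR
using Extents

test_area = Extent(X=(102.0999, 102.1), Y=(11.999, 12.0))

# Two steps: search for granules, then download them
granules = search(:ICESat2, :ATL08; extent=test_area, version=6)
SpaceLiDAR.download(granules, "/data/atl08")

# One step: sync searches and downloads in one go, skipping
# granules that are already present in the local folder
SpaceLiDAR.sync("/data/atl08")
```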
src/granule.jl
```julia
end
sync(product::Symbol, folder::AbstractString, all::Bool = false) = sync(product, [folder], all)

function sync!(granules, folder, all)
```
Should `folder` and `all` have default values?
It felt like this was a more internal function, as it is called from the other sync() functions. Downloading many granules to a temporary folder seems unwise, also given that significant disk space is probably needed.
At the moment you can't; you'd have to construct the search, filter, and download yourself. But what could work is passing any sync keywords on to search, so an extent could be passed.
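A hedged sketch of what that keyword pass-through could look like in use; the folder path is a placeholder, and the call form mirrors the one tried later in this thread:

```julia
using SpaceLiDAR
using Extents

test_area = Extent(X=(102.0999, 102.1), Y=(11.999, 12.0))

# Keywords given to sync are forwarded to search, so the update can be
# limited to granules intersecting a region of interest.
SpaceLiDAR.sync("/data/atl08", extent=test_area, version=6)
```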
Ok, I think I addressed the review comments. Made the sync! function internal, clarified the docstring, which now includes a warning, and sync now has kwargs which are passed to search.
This looks good to me; one question about the design. What improved functionality do I get out of sync over download? Is it simply that download requires two steps (search and download) and sync only requires one? If so, it seems that a new function is not really needed and only adds complexity to the package (I'm probably missing something). Is it that sync only updates the latest files by default? Could this not be handled by passing an update_latest_only flag to download?
A fair question. It's a feature I use a lot, but in a script. I thought others could make use of it, now that Aria2_jll can be easily installed. Sync, on top of the search and download, also detects the local granules and skips these for download, using the datetime of each granule. An update_latest_only flag for download can only work if you know which local granules there are. Note that download, like sync, skips existing files; it won't overwrite them.
OK, so the real benefit of sync is that I don't need to do a CMR search of all granules that meet the search criteria; I can do a search that starts at the last datetime in the local granule list, since the datetime is parsed from the granule file names. However, if I use sync with all=true, then I guess this is identical to search -> download?
Yep!
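A minimal sketch of that filename-based cutoff, assuming a hypothetical `granule_datetime` helper; the actual parsing lives inside SpaceLiDAR, and only the idea of an `after`-style cutoff is taken from the stack trace reported below:

```julia
using Dates

# Hypothetical helper: pull the acquisition timestamp out of an ATL08 filename,
# e.g. "ATL08_20230502132512_06441907_006_01.h5" -> 2023-05-02T13:25:12
granule_datetime(fn::AbstractString) =
    DateTime(split(basename(fn), '_')[2], dateformat"yyyymmddHHMMSS")

# The newest local granule gives the cutoff; only granules acquired after it
# need to be searched for and downloaded.
local_files = filter(endswith(".h5"), readdir("/data/atl08"))
cutoff = maximum(granule_datetime, local_files)
```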
I get an error when trying to update an existing folder; it looks like a string vs symbol parsing issue:

```julia
julia> readdir(folder)
15-element Vector{String}:
 ".DS_Store"
 "ATL08_20181109192834_06440107_006_02.h5"
 "ATL08_20190208150819_06440207_006_02.h5"
 "ATL08_20190510104758_06440307_006_02.h5"
 "ATL08_20200507172706_06440707_006_01.h5"
 "ATL08_20201105084639_06440907_006_01.h5"
 "ATL08_20210309144835_11551001_006_01.h5"
 "ATL08_20210407132440_02101101_006_02.h5"
 "ATL08_20210506000626_06441107_006_01.h5"
 "ATL08_20210804194616_06441207_006_01.h5"
 "ATL08_20211207014820_11551301_006_01.h5"
 "ATL08_20220202110606_06441407_006_01.h5"
 "ATL08_20220803022559_06441607_006_01.h5"
 "ATL08_20230103070339_02101801_006_01.h5"
 "ATL08_20230502132512_06441907_006_01.h5"

julia> SpaceLiDAR.sync(folder, extent=test_area, version=6)
ERROR: ArgumentError: Mission ICESat2 not supported. Currently supported are :ICESat, :ICESat2, and :GEDI.
Stacktrace:
  [1] search(::SpaceLiDAR.Mission{…}, ::Symbol, ::Pair{…}, ::Vararg{…}; kwargs::@Kwargs{…})
    @ SpaceLiDAR ~/.julia/packages/SpaceLiDAR/Uw901/src/search.jl:115
  [2] search(::Symbol, ::Symbol, ::Pair{…}, ::Vararg{…}; kwargs::@Kwargs{…})
    @ SpaceLiDAR ~/.julia/packages/SpaceLiDAR/Uw901/src/search.jl:134
  [3] search(::ICESat2_Granule{:ATL08}, ::Pair{Symbol, Any}, ::Vararg{Pair{…}}; kwargs::@Kwargs{after::Dates.DateTime})
    @ SpaceLiDAR ~/.julia/packages/SpaceLiDAR/Uw901/src/search.jl:138
  [4] _sync!(granules::Vector{…}, folder::String, all::Bool; kwargs::@Kwargs{…})
    @ SpaceLiDAR ~/.julia/packages/SpaceLiDAR/Uw901/src/granule.jl:223
  [5] sync(folders::Vector{String}, all::Bool; kwargs::@Kwargs{extent::Extent{(:X, :Y), Tuple{…}}, version::Int64})
    @ SpaceLiDAR ~/.julia/packages/SpaceLiDAR/Uw901/src/granule.jl:205
  [6] sync
    @ SpaceLiDAR ~/.julia/packages/SpaceLiDAR/Uw901/src/granule.jl:203 [inlined]
  [7] sync
    @ SpaceLiDAR ~/.julia/packages/SpaceLiDAR/Uw901/src/granule.jl:207 [inlined]
  [8] top-level scope
    @ REPL[4]:1
Some type information was truncated. Use `show(err)` to see complete types.
```
Thanks for reporting. Fixed it, and made sure that if it happens again somewhere else, you get a better error message.
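For context, a hedged illustration of the kind of string-vs-symbol mismatch the error message hints at (the mission prints without a leading colon while the supported values are listed as Symbols); this is not the actual fix in the package, and all names below are hypothetical:

```julia
const SUPPORTED_MISSIONS = (:ICESat, :ICESat2, :GEDI)

# Normalize a mission given as a String (e.g. derived from a filename or folder)
# to the Symbol form that the rest of the API dispatches on.
mission_symbol(m::Symbol) = m
mission_symbol(m::AbstractString) = Symbol(m)

function check_mission(m)
    s = mission_symbol(m)
    s in SUPPORTED_MISSIONS ||
        throw(ArgumentError("Mission $s not supported. Currently supported are $(SUPPORTED_MISSIONS)."))
    return s
end
```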
Looks good to go.
sync updates your local folder(s) with newer granules if available. Downloads are now handled in parallel with Aria2_jll, instead of per file using Downloads.