
Introduce sync utility with aria2c. #69

Merged
evetion merged 3 commits into master from feat/aria2 on Jan 9, 2024

Conversation

@evetion (Owner) commented Jan 6, 2024

This introduces a sync utility, which updates your local folder(s) with newer granules if available. Downloads are now handled in parallel with Aria2_jll, instead of per file using Downloads.

@evetion evetion requested a review from alex-s-gardner January 6, 2024 14:26
@alex-s-gardner (Collaborator)

Aria2_jll, well this is exciting. I'll take it for a test run

@alex-s-gardner (Collaborator) left a comment

There still seems to be some functionality missing, or I don't know how to use the new sync properly.

Here's what I did:

using SpaceLiDAR  # feat/aria2 branch
using Extents

test_area = Extent(X=(102.0999, 102.1), Y=(11.999, 12.0))
granules = search(:ICESat2, :ATL08; extent=test_area, version=6)

# Afterward you can download the dataset
folder = "/Users/gardnera/Downloads/test"

# this does not work (I thought `folder` was a keyword)
fn = SpaceLiDAR.download(granules, folder=folder)

# this does
fn = SpaceLiDAR.download(granules, folder)

# this does not work, but it seems it should
SpaceLiDAR.sync(granules, folder)

# this causes all of the data to be downloaded ????
SpaceLiDAR.sync!(granules, folder, false)

If I have a local archive of granules and I only want to update the granules for a certain region, how can I do that with sync?

function download!(granules::Vector{Granule}, folder::AbstractString = ".")
function download!(granules::Vector{<:Granule}, folder::AbstractString = ".")

# Download serially if s3 links are present
@alex-s-gardner (Collaborator)

Can s3 links not be downloaded in parallel?

@evetion (Owner, Author)

Aria2c doesn't support s3 downloading afaik.
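
For illustration, a minimal sketch (a hypothetical helper, not SpaceLiDAR's actual code) of how the two download paths could be split, assuming a plain vector of URL strings and that an aria2c binary is on the PATH (the package itself ships one via Aria2_jll):

using Downloads

# Hypothetical helper, illustration only: serial fallback for s3 links,
# parallel aria2c download otherwise.
function download_all(urls::Vector{String}, folder::AbstractString)
    if any(u -> startswith(u, "s3://"), urls)
        # aria2c can't fetch s3:// URLs, so download them one by one
        for u in urls
            Downloads.download(u, joinpath(folder, basename(u)))
        end
    else
        # hand the whole list to aria2c, one file per URL (-Z), into `folder`
        run(`aria2c -Z -d $folder $urls`)
    end
end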

src/granule.jl Outdated
`all`, false by default, will search only for granules past the date of
the latest granule found in `folders`. If true, will search for all granules.
"""
function sync(folders::AbstractVector{<:AbstractString}, all::Bool = false)
@alex-s-gardner (Collaborator)

Is there a test for sync? I didn't see one added.

@evetion (Owner, Author)

Will add one, but I need to craft something to make sure I only sync a single granule or something.

src/granule.jl Outdated
`all`, false by default, will search only for granules past the date of
the latest granule found in `folders`. If true, will search for all granules.
"""
function sync(folders::AbstractVector{<:AbstractString}, all::Bool = false)
@alex-s-gardner (Collaborator)

If sync is only supplied a folder, or a product and folder, does it just download the full global dataset for that product? If so, this is super helpful for me and you, but maybe dangerous for others?

@evetion (Owner, Author)

Agreed, it could download the whole archive. That's especially the case for ICESat, because I can't derive a date from those granule filenames (ICESat-2 and GEDI have the date encoded in them). Will add a warning.

src/granule.jl Outdated
Base.hash(g::Granule, h::UInt) = hash(g.id, h)

"""
sync(folder::AbstractString, all::Bool=false)
@alex-s-gardner (Collaborator)

missing "sync!(granules, ..."

@alex-s-gardner (Collaborator)

Should there also be a "sync(granules, ..."? It's a little bit confusing to need to add the ! to sync a list of granules.

@evetion (Owner, Author)

Hmm, so I should be clearer about the distinction between sync and download. Sync does the search and download all at once, to update an existing archive. sync(granules, ...) is missing because, if you already have granules (from search), download(granules, ...) would just work.
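
To illustrate the distinction, a short sketch based on the calls shown elsewhere in this thread (the folder path is a placeholder):

using SpaceLiDAR
using Extents

folder = "/data/ATL08"  # placeholder local archive
test_area = Extent(X=(102.0999, 102.1), Y=(11.999, 12.0))

# Two steps: search CMR for granules, then download them.
granules = search(:ICESat2, :ATL08; extent=test_area, version=6)
SpaceLiDAR.download(granules, folder)

# One step: sync detects the granules already in `folder`, searches for newer
# ones, and downloads only those.
SpaceLiDAR.sync(folder)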

src/granule.jl Outdated
end
sync(product::Symbol, folder::AbstractString, all::Bool = false) = sync(product, [folder], all)

function sync!(granules, folder, all)
@alex-s-gardner (Collaborator)

should folder and all have default values?

@evetion (Owner, Author)

It felt like this was a more internal function, as it is called from the other sync() functions. Downloading many granules to a temporary folder seems unwise, especially given that significant disk space is probably needed.

@evetion (Owner, Author) commented Jan 7, 2024

# this does
1) fn = SpaceLiDAR.download(granules, folder)

# this does not work, but it seems it should
2) SpaceLiDAR.sync(granules, folder)

# this causes all of the data to be downloaded ????
3) SpaceLiDAR.sync!(granules, folder, false)
  1. Yes, that is existing functionality, but it now works with aria2c in parallel; it should be significantly more stable and faster, and it has resume functionality.
  2. Sync is for doing a search based on a local folder, to get you the correct granules automatically. If you already have them, you can just use download as in 1.
  3. I should make this function internal. It might download (almost) all granules, except for the ones you already have. Note that granules here are local granules.

If I have a local archive of granules and I only want to update the granules for a certain region, how can I do that with sync?

At the moment you can't; you'd have to construct the search, filter, and download yourself. But what could work is passing any sync keywords on to search, so an extent could be passed, like so: sync(folder; extent=Extent()).
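
A rough sketch of that manual region update, assuming the granule id matches the local filename (folder and extent are placeholders):

using SpaceLiDAR
using Extents

folder = "/data/ATL08"  # placeholder local archive
test_area = Extent(X=(102.0999, 102.1), Y=(11.999, 12.0))

# Search the region, drop granules already present locally, download the rest.
granules = search(:ICESat2, :ATL08; extent=test_area, version=6)
todo = filter(g -> !isfile(joinpath(folder, g.id)), granules)
SpaceLiDAR.download(todo, folder)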

@evetion (Owner, Author) commented Jan 7, 2024

OK, I think I addressed the review comments: made the sync! function internal, clarified the docstring (which now includes a warning), and sync now has kwargs which are passed on to search.
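
For example, with the new keyword passthrough (folder and extent are placeholders; this mirrors the call used further down in this thread):

using SpaceLiDAR
using Extents

folder = "/data/ATL08"  # placeholder local archive
# extent and version are forwarded to search, so only granules in this
# region are considered for syncing.
SpaceLiDAR.sync(folder; extent=Extent(X=(102.0999, 102.1), Y=(11.999, 12.0)), version=6)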

@alex-s-gardner (Collaborator)

This looks good to me; one question about the design. What improved functionality do I get out of sync over download? Is it simply that download requires 2 steps (search and download) and sync only requires 1? If so, it seems that a new function is not really needed and only adds complexity to the package (I'm probably missing something). Or is it that sync only updates the latest files by default? Could this not be handled by passing an update_latest_only flag to download?

@evetion (Owner, Author) commented Jan 8, 2024

This looks good to me; one question about the design. What improved functionality do I get out of sync over download? Is it simply that download requires 2 steps (search and download) and sync only requires 1? If so, it seems that a new function is not really needed and only adds complexity to the package (I'm probably missing something). Or is it that sync only updates the latest files by default? Could this not be handled by passing an update_latest_only flag to download?

A fair question. It's a feature I use a lot, but in a script; I thought others could make use of it, now that Aria2_jll can be easily installed. Sync, on top of the search and download, also detects the local granules and skips them for download, based on the datetime of the granule. An update_latest_only flag for download could only work if you know which local granules there are. Note that download, like sync, skips existing files; it won't overwrite them.

@alex-s-gardner (Collaborator)

Sync, on top of the search and download, also detects the local granules and skips them for download, based on the datetime of the granule

OK, so the real benefit of sync is that I don't need to do a CMR search of all granules that meet the search criteria; I can do a search that starts at the last datetime in the local granule list, since the datetime is parsed from the granule file names. However, if I use sync with all=true then I guess this is identical to search -> download?

@evetion (Owner, Author) commented Jan 8, 2024

OK, so the real benefit of sync is that I don't need to do a CMR search of all granules that meet the search criteria; I can do a search that starts at the last datetime in the local granule list, since the datetime is parsed from the granule file names. However, if I use sync with all=true then I guess this is identical to search -> download?

Yep!
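
As an aside, a minimal sketch (a hypothetical helper, not the package's implementation) of deriving the latest local datetime from ICESat-2 filenames like those listed in the next comment, where the second underscore-separated field encodes yyyymmddHHMMSS:

using Dates

# Hypothetical helper: parse the acquisition datetime from a granule filename
# such as "ATL08_20230502132512_06441907_006_01.h5".
granule_datetime(fn) = DateTime(split(basename(fn), '_')[2], dateformat"yyyymmddHHMMSS")

# The most recent datetime present locally, usable as the lower bound for the
# next CMR search (folder is a placeholder).
folder = "/data/ATL08"
latest = maximum(granule_datetime, filter(endswith(".h5"), readdir(folder)))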

@alex-s-gardner (Collaborator) commented Jan 8, 2024

I get an error when trying to update an existing folder; it looks like a string vs. symbol parsing issue:

readdir(folder)
15-element Vector{String}:
 ".DS_Store"
 "ATL08_20181109192834_06440107_006_02.h5"
 "ATL08_20190208150819_06440207_006_02.h5"
 "ATL08_20190510104758_06440307_006_02.h5"
 "ATL08_20200507172706_06440707_006_01.h5"
 "ATL08_20201105084639_06440907_006_01.h5"
 "ATL08_20210309144835_11551001_006_01.h5"
 "ATL08_20210407132440_02101101_006_02.h5"
 "ATL08_20210506000626_06441107_006_01.h5"
 "ATL08_20210804194616_06441207_006_01.h5"
 "ATL08_20211207014820_11551301_006_01.h5"
 "ATL08_20220202110606_06441407_006_01.h5"
 "ATL08_20220803022559_06441607_006_01.h5"
 "ATL08_20230103070339_02101801_006_01.h5"
 "ATL08_20230502132512_06441907_006_01.h5"

julia> SpaceLiDAR.sync(folder, extent=test_area, version=6)
ERROR: ArgumentError: Mission ICESat2 not supported. Currently supported are :ICESat, :ICESat2, and :GEDI.
Stacktrace:
 [1] search(::SpaceLiDAR.Mission{…}, ::Symbol, ::Pair{…}, ::Vararg{…}; kwargs::@Kwargs{})
   @ SpaceLiDAR ~/.julia/packages/SpaceLiDAR/Uw901/src/search.jl:115
 [2] search(::Symbol, ::Symbol, ::Pair{…}, ::Vararg{…}; kwargs::@Kwargs{})
   @ SpaceLiDAR ~/.julia/packages/SpaceLiDAR/Uw901/src/search.jl:134
 [3] search(::ICESat2_Granule{:ATL08}, ::Pair{Symbol, Any}, ::Vararg{Pair{…}}; kwargs::@Kwargs{after::Dates.DateTime})
   @ SpaceLiDAR ~/.julia/packages/SpaceLiDAR/Uw901/src/search.jl:138
 [4] _sync!(granules::Vector{…}, folder::String, all::Bool; kwargs::@Kwargs{})
   @ SpaceLiDAR ~/.julia/packages/SpaceLiDAR/Uw901/src/granule.jl:223
 [5] sync(folders::Vector{String}, all::Bool; kwargs::@Kwargs{extent::Extent{(:X, :Y), Tuple{…}}, version::Int64})
   @ SpaceLiDAR ~/.julia/packages/SpaceLiDAR/Uw901/src/granule.jl:205
 [6] sync
   @ SpaceLiDAR ~/.julia/packages/SpaceLiDAR/Uw901/src/granule.jl:203 [inlined]
 [7] sync
   @ SpaceLiDAR ~/.julia/packages/SpaceLiDAR/Uw901/src/granule.jl:207 [inlined]
 [8] top-level scope
   @ REPL[4]:1
Some type information was truncated. Use `show(err)` to see complete types.

@evetion (Owner, Author) commented Jan 8, 2024

Thanks for reporting. Fixed it, and made sure if it happens again somewhere else, you get a better error message.
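
For context, the kind of normalization that avoids such a string-vs-symbol mismatch might look like this (a guess at the shape of the fix, not the actual patch):

# Supported missions, per the error message above.
const SUPPORTED_MISSIONS = (:ICESat, :ICESat2, :GEDI)

# Accept "ICESat2" or :ICESat2 alike, and fail with a clear message otherwise.
function normalize_mission(mission)
    s = Symbol(mission)
    s in SUPPORTED_MISSIONS ||
        throw(ArgumentError("Mission $s not supported. Currently supported are $(join(SUPPORTED_MISSIONS, ", "))."))
    return s
end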

@alex-s-gardner (Collaborator)

Looks good-to-go

@evetion merged commit b5829fe into master on Jan 9, 2024
11 of 12 checks passed
@evetion deleted the feat/aria2 branch on January 9, 2024 08:15