String-valued dimension incorrectly loaded as matrix of characters #237

Closed
Datseris opened this issue Nov 8, 2023 · 5 comments

@Datseris
Contributor

Datseris commented Nov 8, 2023

Describe the bug

A colleague of mine who uses Python and xarray sent me a .nc file. One of the dimensions of the file has string values (i.e., it is like a list of names). When I try to load this file I get:

Dimensions        
   time = 4000    
   diagnostic = 19
   ic = 101       
   string13 = 13  

Variables
  values   (101 × 19 × 4000)
    Datatype:    Union{Missing, Float64} (Float64)
    Dimensions:  ic × diagnostic × time
    Attributes:
     _FillValue           = NaN

  time   (4000)
    Datatype:    Union{Missing, Float64} (Float64)
    Dimensions:  time
    Attributes:
     _FillValue           = NaN

  ic   (101)
    Datatype:    Int32 (Int32)
    Dimensions:  ic

  diagnostic   (13 × 19)
    Datatype:    Char (Char)
    Dimensions:  string13 × diagnostic
    Attributes:
     _Encoding            = utf-8

and accessing the diagnostic variable gives:

julia> v = data["diagnostic"]; v[:]
13×19 Matrix{Char}:
 's'   's'   's'   't'   't'   …  's'  's'   'a'   'a'   'a'        
 'a'   'a'   'a'   'e'   'e'      'a'  'e'   'm'   'm'   'a'        
 'l'   'l'   'l'   'm'   'm'      'l'  'a'   'o'   'o'   'b'        
 't'   't'   't'   'p'   'p'      't'  'i'   'c'   'c'   'w'        
 '_'   '_'   '_'   '_'   '_'      '_'  'c'   '_'   '_'   '\0'       
 't'   's'   's'   's'   's'   …  'f'  'e'   'm'   'E'   '\0'       
 'o'   'u'   'u'   'u'   'u'      'o'  '\0'  'a'   'Q'   '\0'       
 't'   'b'   'b'   'b'   'b'      'r'  '\0'  'x'   '\0'  '\0'       
 '\0'  '_'   '_'   '_'   '_'      'c'  '\0'  '\0'  '\0'  '\0'       
 '\0'  'N'   'S'   'N'   'S'      '_'  '\0'  '\0'  '\0'  '\0'       
 '\0'  'A'   'A'   'A'   'A'   …  't'  '\0'  '\0'  '\0'  '\0'       
 '\0'  '\0'  '\0'  '\0'  '\0'     'o'  '\0'  '\0'  '\0'  '\0'       
 '\0'  '\0'  '\0'  '\0'  '\0'     't'  '\0'  '\0'  '\0'  '\0'

Each column here is a variable name, so each column should have been a string.

To Reproduce

Please give me an email address to which I can grant access to the file, as it is not possible to share the data publicly on GitHub. Once the file is downloaded, reproduce the problem with:

data = NCDataset("filename.nc")
v = data["diagnostic"]
v[:]

Expected behavior

The dimension values for "diagnostic" should be a vector of strings instead of a matrix of chars.
I admit, I do not know where the problem comes from. My colleague insists that he saves the data "correctly" with xarray and once he loads the data he gets the dimension as a vector of strings.

Environment

  • operating system: Windows 10
  • Julia version: 1.9.3
  • NCDatasets version: ⌅ [85f8d34a] NCDatasets v0.12.17 (currently checking if problem persists in new version 0.13)
@Alexander-Barth
Member

Alexander-Barth commented Nov 8, 2023 via email

@Datseris
Contributor Author

Datseris commented Nov 9, 2023

Hi, I do not know where I have to run this command: my shell does not recognize the name ncdump, and it does not appear to be a Julia command.

Meanwhile, my colleague has given me a way to reproduce the problem. In Python's xarray do:

Ntime = 4000
Nobs = 19
N = 101
data = np.empty((Ntime, Nobs, N))

observables = ['salt_tot', 'salt_sub_NA', 'salt_sub_SA', 'temp_sub_NA', 'temp_sub_SA', 'sst_NA', 'sst_SA', 'sss_NA', 'sss_SA', 'rho_sub_NA', 'rho_sub_SA', 'rho_NA', 'rho_SA', 'salt_forc', 'salt_forc_tot', 'seaice', 'amoc_max', 'amoc_EQ', 'aabw']

time_vector = 5.*np.arange(Ntime)
initial_cond = np.arange(Nobs)

ds = xr.Dataset({'values': (['time', 'diagnostic', 'ic'], data)}, coords={'time': time_vector, 'diagnostic': observables, 'ic': initial_cond})

ds.to_netcdf('file.nc')

@wobagi

wobagi commented Nov 9, 2023

Just stumbled upon this topic and checked the output on my machine. The Python code has a typo. It should have
initial_cond = np.arange(N)
and it is lacking imports:

import xarray as xr
import numpy as np

Anyway, everything looks good on OS X 12.6.6 and Linux CentOS 7 using NCDatasets 0.12.17:

julia> v = data["diagnostic"]
diagnostic (19)
  Datatype:    String
  Dimensions:  diagnostic

julia> v[:]
19-element Vector{String}:
 "salt_tot"
 "salt_sub_NA"
 "salt_sub_SA"
 "temp_sub_NA"
 "temp_sub_SA"
 "sst_NA"
 "sst_SA"
 "sss_NA"
 "sss_SA"
 "rho_sub_NA"
 "rho_sub_SA"
 "rho_NA"
 "rho_SA"
 "salt_forc"
 "salt_forc_tot"
 "seaice"
 "amoc_max"
 "amoc_EQ"
 "aabw"

ncdump gives correct data here as well (I lowered the numbers for time and ic dimensions):

dimensions:
	time = 24 ;
	diagnostic = 19 ;
	ic = 3 ;
variables:
	double values(time, diagnostic, ic) ;
		values:_FillValue = NaN ;
	double time(time) ;
		time:_FillValue = NaN ;
	string diagnostic(diagnostic) ;
	int64 ic(ic) ;
}

Hopefully that helps.

@Alexander-Barth
Member

Alexander-Barth commented Nov 9, 2023

@Datseris If you need ncdump in the future, here is some information for Windows users: https://docs.unidata.ucar.edu/netcdf-c/current/winbin.html
https://pjbartlein.github.io/REarthSysSci/install_netCDF.html

It helps me a lot when users provide this additional information, as ncdump is independent of NCDatasets (and xarray) and shows the metadata of the NetCDF file as it is stored. I know it can take some time to get these tools installed on Windows, but ncdump is really valuable for troubleshooting issues with NetCDF files. With the shell tool ncgen I can generate a NetCDF file with exactly the same metadata as your file (but with "blank" data), so you do not even need to share your data file and we can still have a reproducible issue report.
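
The ncdump/ncgen round trip described above could be scripted roughly as follows. This is only a sketch: it assumes both tools are on PATH, and the function names and the default file names `header.cdl` and `blank.nc` are made up for illustration. The only flags used are `ncdump -h` (header only) and `ncgen -o` (output file name).

```python
import shutil
import subprocess
from pathlib import Path


def header_commands(nc_path, cdl_path, blank_path):
    """Return the two command lines: dump the header, then rebuild a file from it."""
    dump = ["ncdump", "-h", str(nc_path)]
    gen = ["ncgen", "-o", str(blank_path), str(cdl_path)]
    return dump, gen


def make_blank_copy(nc_path, cdl_path="header.cdl", blank_path="blank.nc"):
    """Save the header of nc_path as CDL text and generate a NetCDF file from it."""
    for tool in ("ncdump", "ncgen"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} was not found on PATH")
    dump, gen = header_commands(nc_path, cdl_path, blank_path)
    # ncdump -h prints only the metadata: dimensions, variables, attributes.
    header = subprocess.run(dump, check=True, capture_output=True, text=True).stdout
    Path(cdl_path).write_text(header)
    # ncgen reads the CDL text and writes a file with that metadata but no real data.
    subprocess.run(gen, check=True)
    return Path(blank_path)
```

Only the small CDL text file would need to be attached to an issue report; the data themselves never leave the reporter's machine.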

@wobagi Thanks a lot for your input and for correcting @Datseris's example. I just installed xarray (2023.10.1) and I get:

abarth@GHERLaptop ~ $ ncdump -h file.nc 
netcdf file {
dimensions:
	time = 4000 ;
	diagnostic = 19 ;
	ic = 101 ;
	string13 = 13 ;
variables:
	double values(time, diagnostic, ic) ;
		values:_FillValue = NaN ;
	double time(time) ;
		time:_FillValue = NaN ;
	int ic(ic) ;
	char diagnostic(diagnostic, string13) ;
		diagnostic:_Encoding = "utf-8" ;
}

So the data is indeed stored as a matrix of chars. Also the file is a NetCDF 3 file (NetCDF 3 does not support strings).
But if I install the package netCDF4 (pip install -U netCDF4), the data is stored by xarray as:

netcdf file {
dimensions:
	time = 4000 ;
	diagnostic = 19 ;
	ic = 101 ;
variables:
	double values(time, diagnostic, ic) ;
		values:_FillValue = NaN ;
	double time(time) ;
		time:_FillValue = NaN ;
	string diagnostic(diagnostic) ;
	int64 ic(ic) ;
}

Now the data is a vector of strings (as in @wobagi's case) and the format is NetCDF4 (note that the data type of ic also changed).
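
When ncdump is not at hand, the file format can also be guessed from the first bytes of the file. The sketch below relies on the documented signatures: classic NetCDF files start with `CDF` followed by a version byte, and NetCDF-4 files are HDF5 files. The function names are made up for illustration, and the check assumes the HDF5 signature sits at offset 0, which is the usual case but not guaranteed.

```python
def kind_from_magic(magic):
    """Map the leading bytes of a file to a NetCDF format name."""
    if magic.startswith(b"\x89HDF\r\n\x1a\n"):
        return "NetCDF-4 (HDF5)"
    if len(magic) >= 4 and magic[:3] == b"CDF":
        kinds = {
            1: "NetCDF-3 classic",
            2: "NetCDF-3 64-bit offset",
            5: "NetCDF-3 CDF-5",
        }
        return kinds.get(magic[3], "unknown")
    return "unknown"


def netcdf_kind(path):
    """Read the first 8 bytes of the file at path and classify them."""
    with open(path, "rb") as f:
        return kind_from_magic(f.read(8))
```

Calling `netcdf_kind("file.nc")` on the two files discussed here should distinguish the NetCDF 3 output from the NetCDF4 one.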

I think that NCDatasets is correct to read a matrix as a matrix and a vector as a vector.
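
For anyone who is stuck with a NetCDF 3 file, the character matrix can be collapsed into strings by hand. The sketch below is plain Python with a made-up helper name. It assumes one name per row, as in the on-disk layout `char diagnostic(diagnostic, string13)`, with `'\0'` as padding; in Julia the axes are reversed, so each name is a column instead.

```python
def chars_to_strings(rows, pad="\0"):
    """Join each fixed-width row of characters and drop the trailing padding."""
    return ["".join(row).rstrip(pad) for row in rows]


# Two names padded to 13 characters, mimicking the string13 dimension.
padded = [
    list("salt_tot".ljust(13, "\0")),
    list("aabw".ljust(13, "\0")),
]
names = chars_to_strings(padded)
print(names)  # ['salt_tot', 'aabw']
```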

(Maybe xarray should give the user a warning when NetCDF 4 features are "approximated", as in this case, because python-netCDF4 is not installed.)
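
Until such a warning exists, the check can be done on the user side before writing. The following is a sketch under assumptions: the helper names are made up, `ds` is taken to be an xarray Dataset, and the only xarray call used is `Dataset.to_netcdf` with its `format` and `engine` parameters.

```python
import importlib.util


def netcdf4_backend_available():
    """Return True if the netCDF4 Python package can be imported."""
    return importlib.util.find_spec("netCDF4") is not None


def write_netcdf4(ds, path):
    """Write ds as NetCDF4, failing loudly instead of falling back silently."""
    if not netcdf4_backend_available():
        raise RuntimeError(
            "netCDF4 is not installed; install it with `pip install -U netCDF4`"
        )
    # Naming the format and engine explicitly means a missing backend
    # raises an error rather than producing a NetCDF 3 file.
    ds.to_netcdf(path, format="NETCDF4", engine="netcdf4")
```

With this in place, `write_netcdf4(ds, "file.nc")` either stores the coordinate as a vector of strings or stops with an error.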

@Datseris
Contributor Author

Thank you very much, you have proven concretely that this is not an issue with NCDatasets.jl. I will ask my colleague to update to NetCDF4.
