Read YAXArrays and Datasets
This section describes how to read files, URLs, and directories into YAXArrays and datasets.
Read Zarr
Open a Zarr store as a `Dataset`:
```julia
using YAXArrays
using Zarr
path = "gs://cmip6/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp585/r1i1p1f1/3hr/tas/gn/v20190710/"
store = zopen(path, consolidated=true)
ds = open_dataset(store)
```

```
YAXArray Dataset
Shared Axes:
None
Variables:
height
Variables with additional axes:
Additional Axes:
(↓ lon Sampled{Float64} 0.0:0.9375:359.0625 ForwardOrdered Regular Points,
→ lat Sampled{Float64} [-89.28422753251364, -88.35700351866494, …, 88.35700351866494, 89.28422753251364] ForwardOrdered Irregular Points,
↗ Ti Sampled{DateTime} [2015-01-01T03:00:00, …, 2101-01-01T00:00:00] ForwardOrdered Irregular Points)
Variables:
tas
Properties: Dict{String, Any}("initialization_index" => 1, "realm" => "atmos", "variable_id" => "tas", "external_variables" => "areacella", "branch_time_in_child" => 60265.0, "data_specs_version" => "01.00.30", "history" => "2019-07-21T06:26:13Z ; CMOR rewrote data to be consistent with CMIP6, CF-1.7 CMIP-6.2 and CF standards.", "forcing_index" => 1, "parent_variant_label" => "r1i1p1f1", "table_id" => "3hr"…)
```
We can set `path` to a URL, a local directory, or, as in this case, a cloud object storage path.
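For illustration, the same `open_dataset` call works with the other path forms; a minimal sketch, where both the local path and the URL below are hypothetical placeholders:

```julia
using YAXArrays
using Zarr

# Hypothetical local Zarr directory store
ds_local = open_dataset(zopen("data/my_store.zarr"))

# Hypothetical Zarr store served over HTTP
ds_http = open_dataset(zopen("https://example.com/my_store.zarr", consolidated=true))
```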
A Zarr store may contain multiple arrays. Individual arrays can be accessed using subsetting:
```julia
ds.tas
```

```
╭────────────────────────────────────╮
│ 384×192×251288 YAXArray{Float32,3} │
├────────────────────────────────────┴─────────────────────────────────── dims ┐
↓ lon Sampled{Float64} 0.0:0.9375:359.0625 ForwardOrdered Regular Points,
→ lat Sampled{Float64} [-89.28422753251364, -88.35700351866494, …, 88.35700351866494, 89.28422753251364] ForwardOrdered Irregular Points,
↗ Ti Sampled{DateTime} [2015-01-01T03:00:00, …, 2101-01-01T00:00:00] ForwardOrdered Irregular Points
├──────────────────────────────────────────────────────────────────── metadata ┤
Dict{String, Any} with 10 entries:
"units" => "K"
"history" => "2019-07-21T06:26:13Z altered by CMOR: Treated scalar dime…
"name" => "tas"
"cell_methods" => "area: mean time: point"
"cell_measures" => "area: areacella"
"long_name" => "Near-Surface Air Temperature"
"coordinates" => "height"
"standard_name" => "air_temperature"
"_FillValue" => 1.0f20
"comment" => "near-surface (usually, 2 meter) air temperature"
├─────────────────────────────────────────────────────────────── loaded lazily ┤
data size: 69.02 GB
└──────────────────────────────────────────────────────────────────────────────┘
```
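Because the array is loaded lazily, it can be subset along its named dimensions before any data is read, so only the requested chunks are fetched. A minimal sketch; the specific index ranges below are illustrative, not part of the original example:

```julia
# Index by dimension name; only the selected region is read from the store
subset = ds.tas[lon=1:10, lat=1:10, Ti=1:8]

# Value-based selection with DimensionalData-style selectors (e.g. Near)
# is also possible, assuming they are in scope:
# point_series = ds.tas[lon=Near(13.4), lat=Near(52.5)]
```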
Read NetCDF
Open a NetCDF file as a `Dataset`:
```julia
using YAXArrays
using NetCDF
using Downloads: download

path = download("https://www.unidata.ucar.edu/software/netcdf/examples/tos_O1_2001-2002.nc", "example.nc")
ds = open_dataset(path)
```

```
YAXArray Dataset
Shared Axes:
(↓ lon Sampled{Float64} 1.0:2.0:359.0 ForwardOrdered Regular Points,
→ lat Sampled{Float64} -79.5:1.0:89.5 ForwardOrdered Regular Points,
↗ Ti Sampled{CFTime.DateTime360Day} [CFTime.DateTime360Day(2001-01-16T00:00:00), …, CFTime.DateTime360Day(2002-12-16T00:00:00)] ForwardOrdered Irregular Points)
Variables:
tos
Properties: Dict{String, Any}("cmor_version" => 0.96f0, "references" => "Dufresne et al, Journal of Climate, 2015, vol XX, p 136", "realization" => 1, "Conventions" => "CF-1.0", "contact" => "Sebastien Denvil, sebastien.denvil@ipsl.jussieu.fr", "history" => "YYYY/MM/JJ: data generated; YYYY/MM/JJ+1 data transformed At 16:37:23 on 01/11/2005, CMOR rewrote data to comply with CF standards and IPCC Fourth Assessment requirements", "table_id" => "Table O1 (13 November 2004)", "source" => "IPSL-CM4_v1 (2003) : atmosphere : LMDZ (IPSL-CM4_IPCC, 96x71x19) ; ocean ORCA2 (ipsl_cm4_v1_8, 2x2L31); sea ice LIM (ipsl_cm4_v", "title" => "IPSL model output prepared for IPCC Fourth Assessment SRES A2 experiment", "experiment_id" => "SRES A2 experiment"…)
```
A NetCDF file may contain multiple arrays. Individual arrays can be accessed using subsetting:
```julia
ds.tos
```

```
╭────────────────────────────────────────────────╮
│ 180×170×24 YAXArray{Union{Missing, Float32},3} │
├────────────────────────────────────────────────┴─────────────────────── dims ┐
↓ lon Sampled{Float64} 1.0:2.0:359.0 ForwardOrdered Regular Points,
→ lat Sampled{Float64} -79.5:1.0:89.5 ForwardOrdered Regular Points,
↗ Ti Sampled{CFTime.DateTime360Day} [CFTime.DateTime360Day(2001-01-16T00:00:00), …, CFTime.DateTime360Day(2002-12-16T00:00:00)] ForwardOrdered Irregular Points
├──────────────────────────────────────────────────────────────────── metadata ┤
Dict{String, Any} with 10 entries:
"units" => "K"
"missing_value" => 1.0f20
"history" => " At 16:37:23 on 01/11/2005: CMOR altered the data in t…
"cell_methods" => "time: mean (interval: 30 minutes)"
"name" => "tos"
"long_name" => "Sea Surface Temperature"
"original_units" => "degC"
"standard_name" => "sea_surface_temperature"
"_FillValue" => 1.0f20
"original_name" => "sosstsst"
├─────────────────────────────────────────────────────────────── loaded lazily ┤
data size: 2.8 MB
└──────────────────────────────────────────────────────────────────────────────┘
```
Please note that NetCDF is built on HDF5, which is not thread-safe in Julia. Add manual locks in your own code to avoid data races:
```julia
my_lock = ReentrantLock()
Threads.@threads for i in 1:10
    @lock my_lock @info ds.tos[1, 1, 1]
end
```

```
[ Info: missing
[ Info: missing
[ Info: missing
[ Info: missing
[ Info: missing
[ Info: missing
[ Info: missing
[ Info: missing
[ Info: missing
[ Info: missing
```
This code ensures that the data is accessed by only one thread at a time, i.e. access becomes effectively single-threaded but thread-safe.
Read GDAL (GeoTIFF, GeoJSON)
All GDAL-compatible files can be read as a `YAXArrays.Dataset` after loading ArchGDAL:
```julia
using YAXArrays
using ArchGDAL
using Downloads: download

path = download("https://github.com/yeesian/ArchGDALDatasets/raw/307f8f0e584a39a050c042849004e6a2bd674f99/gdalworkshop/world.tif", "world.tif")
ds = open_dataset(path)
```

```
YAXArray Dataset
Shared Axes:
(↓ X Sampled{Float64} -180.0:0.17578125:179.82421875 ForwardOrdered Regular Points,
→ Y Sampled{Float64} 90.0:-0.17578125:-89.82421875 ReverseOrdered Regular Points)
Variables:
Blue, Green, Red
Properties: Dict{String, Any}("projection" => "GEOGCS[\"WGS 84\",DATUM[\"WGS_1984\",SPHEROID[\"WGS 84\",6378137,298.257223563,AUTHORITY[\"EPSG\",\"7030\"]],AUTHORITY[\"EPSG\",\"6326\"]],PRIMEM[\"Greenwich\",0,AUTHORITY[\"EPSG\",\"8901\"]],UNIT[\"degree\",0.0174532925199433,AUTHORITY[\"EPSG\",\"9122\"]],AXIS[\"Latitude\",NORTH],AXIS[\"Longitude\",EAST],AUTHORITY[\"EPSG\",\"4326\"]]")
```
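As with the other backends, the individual variables (here the raster bands listed in the printout above) can be accessed by name. A brief sketch, assuming the `ds` opened above:

```julia
# Each band is a lazily loaded YAXArray over the shared X and Y axes
blue  = ds.Blue
green = ds.Green
red   = ds.Red
```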
Load data into memory
For datasets or variables that fit into RAM, you might want to load them completely into memory. This can be done using the `readcubedata` function. As an example, let's use the NetCDF workflow; the same applies to the other backends.

```julia
readcubedata(ds.tos)
```
```
╭────────────────────────────────────────────────╮
│ 180×170×24 YAXArray{Union{Missing, Float32},3} │
├────────────────────────────────────────────────┴─────────────────────── dims ┐
↓ lon Sampled{Float64} 1.0:2.0:359.0 ForwardOrdered Regular Points,
→ lat Sampled{Float64} -79.5:1.0:89.5 ForwardOrdered Regular Points,
↗ Ti Sampled{CFTime.DateTime360Day} [CFTime.DateTime360Day(2001-01-16T00:00:00), …, CFTime.DateTime360Day(2002-12-16T00:00:00)] ForwardOrdered Irregular Points
├──────────────────────────────────────────────────────────────────── metadata ┤
Dict{String, Any} with 10 entries:
"units" => "K"
"missing_value" => 1.0f20
"history" => " At 16:37:23 on 01/11/2005: CMOR altered the data in t…
"cell_methods" => "time: mean (interval: 30 minutes)"
"name" => "tos"
"long_name" => "Sea Surface Temperature"
"original_units" => "degC"
"standard_name" => "sea_surface_temperature"
"_FillValue" => 1.0f20
"original_name" => "sosstsst"
├──────────────────────────────────────────────────────────── loaded in memory ┤
data size: 2.8 MB
└──────────────────────────────────────────────────────────────────────────────┘
```
Note how the loading status changes from `loaded lazily` to `loaded in memory`.
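The same pattern can be applied one variable at a time when you want an entire dataset in memory. A sketch, assuming the NetCDF `ds` from above; iterating over `ds.cubes` is an assumption about the `Dataset` internals, so check the API reference for your YAXArrays version:

```julia
# Load every variable of the dataset into memory individually
in_memory = Dict(name => readcubedata(cube) for (name, cube) in ds.cubes)
```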