Cache Input Data on Fast Drives
This page describes how to set up a cache of GEOS-Chem input data. This is useful if you want to temporarily transfer a simulation’s input data to a fast drive, which can speed up your GCHP simulation by reducing the time spent reading input data. Caching is also useful if the file system that stores your GEOS-Chem input data repository has issues that cause simulations to crash; in that case, you can transfer the data for your simulation to more stable drives.
Install the bashdatacatalog
Install the bashdatacatalog with the following command. Follow the prompts and restart your console.
gcuser:~$ bash <(curl -s https://raw.githubusercontent.com/LiamBindle/bashdatacatalog/main/install.sh)
Note
You can rerun this command to upgrade to the latest version.
Set Up the ExtDataCache Directory
Next, set up the ExtDataCache directory. Put this directory on a path that uses the drives you want. For example, if you have high-performance drives at /scratch/, create a directory like /scratch/ExtDataCache/.
ExtDataCache/ temporarily stores the input data for simulations. The idea is that you copy the prerequisite input data to ExtDataCache/ before you run a simulation. Because the data in ExtDataCache/ is temporary, you can delete it periodically to “purge” it. Alternatively, you can use bashdatacatalog commands to selectively remove files. If you are running long simulations, you can keep a few years of data in ExtDataCache/, like a moving window that tracks the progress of your simulation.
Create a subdirectory in ExtDataCache/ to store catalog files.
You need a set of four catalog files for each simulation:
MeteorologicalInputs.csv – Specifies the simulation’s meteorological input data
ChemistryInputs.csv – Specifies the simulation’s chemistry input data
EmissionsInputs.csv – Specifies the simulation’s emissions input data
InitialConditions.csv – Specifies the default restart files for the simulation
A good directory structure for catalog files is ExtDataCache/CatalogFiles/SIMULATION_ID, where SIMULATION_ID is a placeholder for a unique identifier for your simulation.
These instructions will put a demo set of catalog files in ExtDataCache/CatalogFiles/DemoSimulation:
gcuser:~$ cd /scratch
gcuser:/scratch$ mkdir ExtDataCache # for storing input data for simulations
gcuser:/scratch$ mkdir ExtDataCache/CatalogFiles # for storing catalog files
gcuser:/scratch$ mkdir ExtDataCache/CatalogFiles/DemoSimulation # for storing catalog files for a specific simulation
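Next, download the catalog files for the appropriate version of GEOS-Chem. The GEOS-Chem catalog files are available under http://geoschemdata.wustl.edu/ExtData/DataCatalogs/. The commands below fetch the version 13.3 chemistry, emissions, and initial-conditions catalogs; the meteorological catalog URL has no version directory.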
gcuser:/scratch$ cd ExtDataCache/CatalogFiles/DemoSimulation
gcuser:/scratch/ExtDataCache/CatalogFiles/DemoSimulation$ wget http://geoschemdata.wustl.edu/ExtData/DataCatalogs/MeteorologicalInputs.csv
gcuser:/scratch/ExtDataCache/CatalogFiles/DemoSimulation$ wget http://geoschemdata.wustl.edu/ExtData/DataCatalogs/13.3/ChemistryInputs.csv
gcuser:/scratch/ExtDataCache/CatalogFiles/DemoSimulation$ wget http://geoschemdata.wustl.edu/ExtData/DataCatalogs/13.3/EmissionsInputs.csv
gcuser:/scratch/ExtDataCache/CatalogFiles/DemoSimulation$ wget http://geoschemdata.wustl.edu/ExtData/DataCatalogs/13.3/InitialConditions.csv
Edit the catalog files according to your simulation configuration. You can enable or disable data collections by editing column 3 (1 to enable a collection, 0 to disable it).
If you are not sure whether your simulation needs a collection, err on the side of inclusion.
The meteorological data collections are the largest by volume.
Only one meteorological data collection in MeteorologicalInputs.csv needs to be enabled.
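As a rough sketch of how the enable flag could be toggled without opening an editor, the snippet below defines a hypothetical helper, set_collection_enabled, that rewrites column 3 with awk. It is not part of bashdatacatalog. It assumes the catalog rows are plain comma-separated values with no quoted commas, and the sample rows and collection names are made up for illustration. Inspect your real catalog files before running anything like this on them.

```shell
# Hypothetical helper (not part of bashdatacatalog): set column 3 to VALUE
# on every row whose text contains PATTERN.
# Assumes comma-separated rows with no quoted commas.
set_collection_enabled() {
    local pattern="$1" value="$2" file="$3"
    awk -F',' -v OFS=',' -v pat="$pattern" -v val="$value" \
        'index($0, pat) > 0 { $3 = val } { print }' "$file" > "${file}.tmp" &&
        mv "${file}.tmp" "$file"
}

# Demo on a throwaway file with made-up rows (not real catalog entries).
demo_csv=$(mktemp)
printf '%s\n' \
    'DemoMetA,http://example.org/ExtData/DemoMetA,1' \
    'DemoMetB,http://example.org/ExtData/DemoMetB,1' > "$demo_csv"

set_collection_enabled 'DemoMetB' 0 "$demo_csv"   # disable the second collection
cat "$demo_csv"
```

Because the helper matches on a substring of the whole row, choose a pattern that is specific enough to select only the collection you mean.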
Update the Collection URLs
The default collection URLs in the catalog files point to http://geoschemdata.wustl.edu/ExtData.
To copy data from your primary ExtData repository, edit column 2 of the catalog files.
For example, if your primary ExtData repository is at /storage/ExtData, you would replace http://geoschemdata.wustl.edu/ExtData with file:///storage/ExtData in column 2 of the catalog files.
Below is a sed command that will do the replacement.
gcuser:/scratch/ExtDataCache/CatalogFiles/DemoSimulation$ export FIND_STR="http://geoschemdata.wustl.edu/ExtData"
gcuser:/scratch/ExtDataCache/CatalogFiles/DemoSimulation$ export REPLACE_STR="file:///storage/ExtData" # replace '/storage/ExtData' with the path to your ExtData
gcuser:/scratch/ExtDataCache/CatalogFiles/DemoSimulation$ sed -i "s#${FIND_STR}#${REPLACE_STR}#g" *.csv # do url find/replace
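It can be worth confirming that no catalog file still points at the old URL after the replacement. The following is a minimal sketch of such a check, run in a throwaway directory against a made-up one-row file (Demo.csv and its contents are illustrative, not a real catalog). It assumes GNU sed for the -i flag.

```shell
# Work in a scratch directory so no real catalog files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Made-up catalog row for illustration only.
printf '%s\n' 'DemoCollection,http://geoschemdata.wustl.edu/ExtData/DEMO,1' > Demo.csv

FIND_STR="http://geoschemdata.wustl.edu/ExtData"
REPLACE_STR="file:///storage/ExtData"
sed -i "s#${FIND_STR}#${REPLACE_STR}#g" ./*.csv

# Count the files that still contain the old URL; 0 means the swap is complete.
remaining=$(grep -lF "${FIND_STR}" ./*.csv | wc -l | tr -d ' ')
echo "files still using the old URL: ${remaining}"
```

On real catalog files, only the grep line is needed after running the sed command.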
Copy Data to ExtDataCache
Navigate to ExtDataCache/.
Once you are there, run bashdatacatalog-fetch to fetch metadata from ExtData.
The arguments to bashdatacatalog-fetch are catalog files.
This metadata includes the file list for each data collection and the details needed to classify each file as temporal or static.
gcuser:/scratch/ExtDataCache/CatalogFiles/DemoSimulation$ cd ../..
gcuser:/scratch/ExtDataCache$ bashdatacatalog-fetch CatalogFiles/DemoSimulation/*.csv
Now you can run bashdatacatalog-list commands to generate file lists.
The output of bashdatacatalog-list is controlled using flags.
For example, add the -s flag to list “static” files (input files that are always required regardless of the simulation period).
You can list “temporal” files with the -t flag.
You can filter temporal files according to a date range with the -r START,END argument.
You can list only the files that are missing (that is, filter out files that already exist) with the -m flag.
You can specify different file list formats using the -f FORMAT argument.
Below is a command that lists all the files that are missing from ExtDataCache for a simulation starting on 2017-01-01 and ending on 2017-12-31.
gcuser:/scratch/ExtDataCache$ bashdatacatalog-list -stm -r 2016-12-31,2018-01-01 CatalogFiles/DemoSimulation/*.csv
Note
Pad the date range by one day on each side of your simulation period: subtract one day from the start date and add one day to the end date.
The example above uses -r 2016-12-31,2018-01-01 because the simulation period is 2017-01-01 to 2017-12-31.
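If you prefer not to work out the padded dates by hand, they can be derived from the simulation period. The sketch below assumes GNU date (the -d option is not available in BSD/macOS date); the variable names SIM_START, SIM_END, RANGE_START, and RANGE_END are illustrative.

```shell
# Simulation period (taken from the example above).
SIM_START=2017-01-01
SIM_END=2017-12-31

# Pad by one day on each side. Requires GNU date; -u avoids
# daylight-saving edge cases in the local time zone.
RANGE_START=$(date -u -d "${SIM_START} 1 day ago" +%F)
RANGE_END=$(date -u -d "${SIM_END} 1 day" +%F)

# Print the argument to pass to bashdatacatalog-list.
printf '%s\n' "-r ${RANGE_START},${RANGE_END}"   # prints: -r 2016-12-31,2018-01-01
```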
To copy the missing files to ExtDataCache, you can use the argument -f xargs-curl to specify that the output list should be formatted as input to xargs curl.
You can use a command similar to the one below to copy all the missing files for your simulation to ExtDataCache.
gcuser:/scratch/ExtDataCache$ bashdatacatalog-list -stm -r 2016-12-31,2018-01-01 -f xargs-curl CatalogFiles/DemoSimulation/*.csv | xargs -P 4 curl
Note
The -P 4 argument to xargs allows up to 4 parallel copies at a time.
Update Run Directory to use ExtDataCache
To update a run directory to use ExtDataCache, you can run the following commands.
Make sure to set FIND_PATH to the path of your primary ExtData and REPLACE_PATH to the path of your ExtDataCache.
gcuser:/scratch/ExtDataCache$ cd /MyRunDirectory # cd to your run directory
gcuser:/MyRunDirectory$ export FIND_PATH=/storage/ExtData # replace with the path to your primary ExtData
gcuser:/MyRunDirectory$ export REPLACE_PATH=/scratch/ExtDataCache # replace with the path to your ExtDataCache
gcuser:/MyRunDirectory$ function swap_extdata_link { ln -sfn "$(readlink "$1" | sed "s#${FIND_PATH}/*#${REPLACE_PATH}/#")" "$1"; }
gcuser:/MyRunDirectory$ swap_extdata_link ChemDir
gcuser:/MyRunDirectory$ swap_extdata_link HcoDir
gcuser:/MyRunDirectory$ swap_extdata_link MetDir
gcuser:/MyRunDirectory$ sed -i "s#${FIND_PATH}#${REPLACE_PATH}#g" HEMCO_Config.rc geoschem_config.yml
Now your GCHP simulation will use input data from ExtDataCache.
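To see what the link swap does before touching a real run directory, you can try it in a sandbox. The sketch below recreates swap_extdata_link (with quoting added) and applies it to a single symbolic link inside a temporary directory. Every path here is a throwaway created by mktemp, and the CHEM_INPUTS subdirectory name is used only for illustration.

```shell
# Build a sandbox that mimics a primary ExtData, an ExtDataCache, and a run directory.
sandbox=$(mktemp -d)
mkdir -p "$sandbox/storage/ExtData/CHEM_INPUTS" \
         "$sandbox/scratch/ExtDataCache/CHEM_INPUTS" \
         "$sandbox/rundir"

FIND_PATH="$sandbox/storage/ExtData"
REPLACE_PATH="$sandbox/scratch/ExtDataCache"

# The run directory starts out linked to the primary ExtData.
ln -s "$FIND_PATH/CHEM_INPUTS" "$sandbox/rundir/ChemDir"

# Same logic as the function above: rewrite the link target in place.
swap_extdata_link() {
    ln -sfn "$(readlink "$1" | sed "s#${FIND_PATH}/*#${REPLACE_PATH}/#")" "$1"
}

swap_extdata_link "$sandbox/rundir/ChemDir"

# Show where the link points now.
new_target=$(readlink "$sandbox/rundir/ChemDir")
echo "$new_target"
```

In a real run directory, running readlink on ChemDir, HcoDir, and MetDir after the swap is a quick way to confirm that each link now resolves under ExtDataCache.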