Research data management services at UQ
Data management is extremely important in research, but the variety of options for managing research data can be a little overwhelming at first. This guide will explain the options available for reliable, long-term research data storage at UQ and will walk through some best-practices for safely transferring your data from HPC clusters to long-term storage.
Table of Contents
- What options are available for long-term research data storage?
- Transferring data to and from HPC clusters
- Changes with Setonix - the Acacia object storage
- Useful links
What options are available for long-term research data storage?
There are two main options for long-term storage of research data for UQ researchers: UQ’s Research Data Manager (RDM) and AARNet’s CloudStor. These resources are not mutually exclusive and serve different needs - you can and should make heavy use of both of them when managing your research data.
Additionally, the Pawsey Centre has developed a new service for long-term research data storage to be used with the new Setonix cluster. This service is an object storage service called Acacia and is intended to provide research data storage for the lifetime of our projects at the Pawsey centre. It is substantially different from traditional hierarchical filesystems, so will require a different workflow to what you may be used to from Magnus.
UQ RDM
The Research Data Manager is a long-term research data storage service managed by the University of Queensland. The service is available to all UQ researchers and research students, and access is organised as per-project records. If you don’t already have an RDM record associated with your project, you can apply through the RDM web interface. Researchers can be a member of more than one record, with access-control handled through the RDM web interface, which also has tools for granting access to collaborators from other institutions.
Each RDM record has 1TB of storage, which is backed up across multiple locations both on- and off-campus. Data is protected against hardware and software failure, but not user error like accidental deletion (it doesn’t have file versioning like Dropbox or Google Drive). Data on the RDM is meant to last beyond the duration of a single project, so it prioritises storage capacity and robustness over speedy access. This is not to say that transferring data to and from the RDM is intolerably slow, just that its intended use case is archival storage rather than a working directory for simulations.
You can access your data on the RDM through a few different mechanisms, but there’s no singular “best” option - each has its use cases and you’ll likely find yourself using all of them as needed.
Accessing the RDM through the web interface
You can access your RDM data through the web interface at
Accessing the RDM through a mapped drive
You can access your RDM as a folder using your computer’s file manager by mounting it as a networked drive. General instructions can be found here, while AIBN-specific instructions for high-speed access can be found here.
Accessing the RDM through the Nextcloud and ownCloud sync clients
RDM also supports access through the open-source desktop sync clients Nextcloud and ownCloud, which automatically synchronise files between a folder on your local computer and the RDM, in a similar fashion to Dropbox. To use Nextcloud to synchronise files with your RDM, follow the instructions in this guide provided by the UQ Library.
The guide also works for ownCloud (meaning you can use the same
program to synchronise with the RDM and AARNet’s CloudStor), with the following exception: in step
2 of the guide for UQ users,
you’ll need to generate a temporary password to use with ownCloud, since it doesn’t natively
support two-factor authentication. You can do this
through the RDM web interface at
Use your UQ email address and the temporary password when prompted to provide a username and password for ownCloud, then follow the rest of the steps in the UQ library guide.
Accessing the RDM via SSH/SFTP
You can access your RDM from the command line through the SSH/SFTP interface provided by QCIF’s
QRISCloud.
QCIF recommends using the rsync command-line tool when transferring data via the SSH interface, as it is more robust against network connectivity interruptions - an important consideration when transferring large amounts of data over unreliable networks (e.g. weak WiFi connections). This is the only way to directly access the RDM from external clusters like Gadi or Magnus.
The data access servers for QRISCloud have the following addresses: ssh1.qriscloud.org.au, ssh2.qriscloud.org.au and data.qriscloud.org.au, the last of which is a load balancer that distributes traffic between the two ssh nodes. QCIF recommends using the data.qriscloud.org.au load balancer for most data transfers, but this may cause degraded performance in some cases - use the ssh1 and ssh2 addresses directly if this happens.
The basic syntax for transferring data to your RDM record using rsync
is as follows:
rsync -rvz /path/to/your_data <uq_username>@data.qriscloud.org.au:<destination_path>
where you’d replace <uq_username> with the UQ username you use to log in to Tinaroo, etc., and <destination_path> with the path to your RDM collection on the data access server. Note the colon between the server address and the destination path - without it, rsync will treat the destination as the name of a local file. The -r flag tells rsync to do a “recursive” copy, so it will copy all the contents of a folder (and any sub-folders) to the RDM. You need this flag if you’re copying a folder, but not if you just want to copy a single file (like a compressed archive). The -v flag enables “verbose” output and -z compresses the data before sending it to the QRISCloud server. It’s a good idea to use -z if you’re transferring a lot of data, but it’s unnecessary if you’ve already compressed it with tar or zip.
Accessing the RDM from RCC’s clusters
The QRISCloud directories are also mounted on RCC clusters (e.g. Tinaroo). These are available under the /QRISData/ mount point and are labelled according to your RDM record Q-number, which can be found on the RDM web interface. For example, if your RDM has the Q-number 9999 then it would be mounted on Tinaroo at /QRISData/Q9999/.
The mounted directories behave the same as any other filesystem directory on Tinaroo, with one important exception: do not run calculations in /QRISData. The RDM filesystem is not designed to handle lots of small reads and writes, so you must not run batch jobs in that directory or directly output the results of simulations into /QRISData. You must output your simulation results to /scratch first and then copy them over to your RDM in one go once it’s all finished, as in the sketch below.
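For example, a minimal sketch of that workflow (the paths and Q-number are placeholders - substitute your own):
# bundle the finished results on /scratch, then copy the archive to the RDM in one go
cd /scratch/<your_username>/my_simulation
tar -czf my_results.tar.gz results/
cp my_results.tar.gz /QRISData/Q9999/
Bundling the results into a single archive first also avoids hitting the RDM filesystem with lots of small writes.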
Accessing the RDM with Rclone
Rclone is another command-line tool for accessing remote file servers, but with a broader set of supported protocols than rsync. It is particularly suited to cloud-storage solutions, but works equally well with UQ’s RDM. rclone is available on Windows, Mac and Linux, as well as Tinaroo, Gadi and the Pawsey clusters. It is the only tool on this list which works identically with both RDM and CloudStor.
You’ll need to configure a remote with the correct settings before you can connect to RDM, which
is a long but mostly turnkey process. Before you get started, it’s best to create an app-specific password
for rclone
so you can revoke access if needed without changing your main UQ password. You can do this
through the RDM web interface at
Once you’ve set up the app password, the basic steps to set up rclone are as follows:
1. Run rclone config and type n to create a new configuration. Give it a descriptive name like rdm.
2. Select webdav from the list of protocols.
3. Enter the following remote URL: https://cloud.rdm.uq.edu.au/remote.php/dav/files/<your_username>@uq.edu.au
4. Choose nextcloud for “vendor”.
5. Enter <your_username>@uq.edu.au when asked for a username.
6. Enter the temporary app password when prompted.
7. Leave the “bearer token” field empty.
8. Select n when asked to edit the advanced config.
9. Accept the configuration.
10. (Optional) You’ll want to encrypt the config if you’re using rclone on a shared system (like Magnus) to make sure nobody else can use the configuration to mess with your RDM. Return to the rclone config menu, choose s) Set configuration password, then enter a password.
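Once the remote is configured, a quick way to check that it works is to list the top level of the remote, which should show the RDM project folders you have access to (assuming you named the remote rdm in step 1):
rclone lsd rdm: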
You can then use rclone to push and pull files from the RDM, as well as run basic queries and filesystem operations like listing the contents of directories. A full list of options is available via rclone --help, but the basic commands are as follows:
rclone ls <remote>:<directory>
- list the files in<directory>
in your remote (replace<remote>
with the name you chose in step 1 above). The directory paths must use the project’s full name as their root (which is different to the QRISCloud convention, which only uses the numbers), so if your project is calledMyProject-Q1234
, then you would dorclone ls rdm:MyProject-Q1234/some_folder
to list the contents ofsome_folder
in your RDM.rclone lsd <remote>:<directory>
- as forrclone ls
, but lists directories rather than files.rclone copy <src> <dst>
- copy a file or directory from source to destination. Usually at least one of these will be a remote, e.g.rclone copy some_file <remote>:<directory>
to copysome_file
to your RDM orrclone copy <remote>:<directory>/some_file ./some_file
to copysome file
from the RDM to your current directory on whichever machine you’re usingrclone
on. It’s also possible to copy from remote to remote, e.g. copying data from RDM to CloudStor. Also, from therclone copy --help
page: “Note that it is always the contents of the directory that is synced, not the directory so when source:path is a directory, it’s the contents of source:path that are copied, not the directory name and contents.”. This means that if you runrclone copy
on a folder, it will splat the contents in your remote directory without creating a new folder. This means it’s usually a good idea to create a folder on the RDM first if you want to copy a whole folder to the RDM.rclone mkdir <remote>:<directory>
- create a directory on the remote if it doesn’t already exist.rclone --dry-run <command>
- do a dry-run of anrclone
command, which will show what operations would be performed, but will not actually make any changes. It’s a good idea to do a rdy-run before running a potentially destructive command to make sure you’re not deleting or overwriting anything you don’t intend to.rclone --progress <command>
- show real time statistics like file uploads while running a command.
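Putting these together, a typical push of a finished dataset to the RDM might look something like the following sketch, assuming the remote is called rdm and your project is MyProject-Q1234 as in the examples above:
# create a destination folder, preview the transfer, then do it for real
rclone mkdir rdm:MyProject-Q1234/my_results
rclone --dry-run copy ./my_results rdm:MyProject-Q1234/my_results
rclone --progress copy ./my_results rdm:MyProject-Q1234/my_results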
Full documentation for Rclone can also be found on the project’s website: https://rclone.org/.
AARNet CloudStor
IMPORTANT UPDATE 31/10/2022: AARNet has announced that CloudStor will be decommissioned at the end of 2023. Any data still on the service at that point will be lost, so if you use CloudStor then start transferring your data off the service ASAP so you’re not caught out.
CloudStor is a cloud storage platform for research data provided by AARNet - Australia’s Academic and Research Network. AARNet maintains the IT and communications infrastructure used by Australian universities and research institutions, including fast fibre connections between institutes and the eduroam wireless network. Most Australian universities are connected to AARNet, as are NCI and the Pawsey Centre.
CloudStor is a service which provides secure, high-capacity cloud storage free of charge to Australian researchers and research students as an alternative to commercial services like Dropbox and Google Drive. Data on CloudStor is backed up in multiple locations and has file versioning, allowing you to revert files to previous versions if you make a mistake. CloudStor is on the same high-speed fibre network which connects Australian universities, so it has very high upload/download speeds on campus, but can also be accessed off-campus. It supports multiple protocols, including an open-source desktop sync client, ownCloud, which automatically synchronises file changes when they occur - almost a drop-in replacement for Dropbox.
Any Australian researcher or research student can create an account with CloudStor, which comes with 1TB of storage by default. You will also be able to take your data with you when you leave UQ, provided you stay within the AARNet network (see this link for instructions). The signup page is available at https://cloudstor.aarnet.edu.au/, with detailed instructions at this link.
There are three main methods for transferring data to and from CloudStor: the web app, the ownCloud desktop client and the Rclone command line tool.
Accessing CloudStor through the web app
You can log in to the CloudStor web app at the following URL: https://cloudstor.aarnet.edu.au/. The web app can upload/download files through a web browser, and perform rudimentary file-system operations like renaming and moving files and folders.
Accessing CloudStor with the ownCloud desktop client
First, download the ownCloud desktop client from this link and
follow the setup instructions in the CloudStor user guide.
You’ll need to designate a folder on your computer to sync with CloudStor - by default ownCloud will create
a folder called ownCloud
, but you can choose a different folder if you want. The contents of this
folder will be synchronised with CloudStor and any changes (file creations, deletions and modifications)
will be automatically mirrored
across both copies. You can also install ownCloud on multiple computers which will all be kept
synchronised with each other and with CloudStor. If you’ve ever used Dropbox, this is functionally the
same behaviour.
Accessing CloudStor with Rclone
Rclone is a command-line tool for accessing remote file servers, but with a broader set of supported protocols than rsync. It is particularly suited to cloud-storage solutions, and CloudStor is no exception. rclone is available on Windows, Mac and Linux, as well as the RCC, Pawsey and NCI clusters, and is the only tool which works identically with both RDM and CloudStor.
The setup process for using rclone with CloudStor is very similar to the process for UQ’s RDM, but with a few small differences in configuration options.
You’ll need to configure a remote with the correct settings before you can connect to CloudStor, which is a long but mostly turnkey process. Before you get started, it’s best to create an app-specific password for rclone so you can revoke access if needed without changing your main CloudStor password. You can do this through the web interface at https://cloudstor.aarnet.edu.au/.
Log in to the web interface then click the gear in the top right corner of the page to go to the “settings”
page. In the “security” tab on the left side of the settings page, there will be an option to “create new app
password” (you may have to scroll down to find it); create a new password and give it a descriptive name like
“rclone”. Copy the password somewhere safe before leaving the page - I recommend a secure password manager
from this list
by the UQ Library. You can delete the app-password from the same page in the CloudStor web interface if you no
longer need it.
Once you’ve set up the app password, the basic steps to set up rclone are as follows:
1. Run rclone config and type n to create a new configuration. Give it a descriptive name like cloudstor.
2. Select webdav from the list of protocols.
3. Enter the following remote URL: https://cloudstor.aarnet.edu.au/plus/remote.php/webdav/
4. Choose owncloud for “vendor”.
5. Enter the username you used when signing up for CloudStor. This will most likely be your UQ email <uq_username>@uq.edu.au.
6. Enter the temporary app password when prompted.
7. Leave the “bearer token” field empty.
8. Select n when asked to edit the advanced config.
9. Accept the configuration.
10. (Optional) You’ll want to encrypt the config if you’re using rclone on a shared system (like Magnus) to make sure nobody else can use the configuration to mess with your CloudStor data. Return to the rclone config menu, choose s) Set configuration password, then enter a password.
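As with the RDM, you can check that the new remote works by listing its top level, which should show the folders at the root of your CloudStor storage (assuming you named the remote cloudstor in step 1):
rclone lsd cloudstor: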
A detailed guide can be found in the CloudStor knowledge base.
You can then use rclone to push and pull files from CloudStor, as well as run basic queries and filesystem operations like listing the contents of directories. A full list of options is available via rclone --help, but the basic commands are as follows:
rclone ls <remote>:<directory>
- list the files in<directory>
in your remote (replace<remote>
with the name you chose in step 1 above). The directory paths are relative to the root of your CloudStor repository and, by extension, the sync directory on your computer. For example, if your sync directory isownCloud
and you want to find the contents ofownCloud/my_folder
, then you would dorclone ls cloudstor:/my_folder
.rclone lsd <remote>:<directory>
- as forrclone ls
, but lists directories rather than files.rclone copy <src> <dst>
- copy a file or directory from source to destination. Usually at least one of these will be a remote, e.g.rclone copy some_file <remote>:<directory>
to copysome_file
to CloudStor orrclone copy <remote>:<directory>/some_file ./some_file
to copysome file
from CloudStor to your current directory on whichever machine you’re usingrclone
on. It’s also possible to copy from remote to remote, e.g. copying data from your RDM to CloudStor. Also, from therclone copy --help
page: “Note that it is always the contents of the directory that is synced, not the directory so when source:path is a directory, it’s the contents of source:path that are copied, not the directory name and contents.”. This means that if you runrclone copy
on a folder, it will splat the contents in your remote directory without creating a new folder. This means it’s usually a good idea to create a folder in CloudStor first if you want to copy a whole folder to the remote.rclone mkdir <remote>:<directory>
- create a directory on the remote if it doesn’t already exist.rclone --dry-run <command>
- do a dry-run of anrclone
command, which will show what operations would be performed, but will not actually make any changes. It’s a good idea to do a dry-run before running a potentially destructive command to make sure you’re not deleting or overwriting anything you don’t intend to.rclone --progress <command>
- show real time statistics like file uploads while running a command.
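As noted above, rclone can also copy directly between two configured remotes, e.g. from the RDM to CloudStor. A rough sketch, assuming you’ve set up both the rdm and cloudstor remotes described in this guide (note that the data still passes through the machine running rclone, so run it somewhere with a fast connection):
rclone --dry-run copy rdm:MyProject-Q1234/my_results cloudstor:/my_results
rclone --progress copy rdm:MyProject-Q1234/my_results cloudstor:/my_results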
Full documentation for Rclone can also be found on the project’s website: https://rclone.org/.
Transferring data to and from HPC clusters
rsync
RCC, Pawsey and NCI clusters all support using rsync for data transfer, either to/from your computer or another server. The rsync man page (man rsync) and web page have comprehensive documentation with example commands for common operations, but the basic syntax for transferring a file or directory from one location (src) to another (dest) is:
rsync <src> <dest>
rsync works equally well with both local and remote files, but remote files must be prefaced with a host URL to tell rsync where to find them. The remote URL must also contain your username for that server, with the general form:
rsync <src> <username>@<SSH_URL>:<dest>
rsync <username>@<SSH_URL>:<src> <dest>
Files without a remote prefix are assumed to be local to the machine you’re running rsync
on.
The remote prefixes for each set of clusters are presented in their respective sections below.
In all cases, you will (probably) need to initiate transfers on your local computer, since it is
possible to initiate an SSH connection from your computer to the cluster, but not vice versa.
Tinaroo
The remote prefix to connect to Tinaroo is uq_username@tinaroo.rcc.uq.edu.au
(the same address
you use for an SSH session). You can also use rsync
on the cluster (during an SSH session) to
transfer data from Tinaroo to other clusters like Gadi or Magnus. Just log in to the cluster and
use rsync
with the desired remote prefix.
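For example, to push a folder of results to Tinaroo and later pull it back (the destination path is a placeholder - replace it with wherever the data should live on Tinaroo):
# push a local folder to Tinaroo
rsync -rvz ./my_results/ <uq_username>@tinaroo.rcc.uq.edu.au:<path_on_tinaroo>/my_results/
# pull it back down to the current directory
rsync -rvz <uq_username>@tinaroo.rcc.uq.edu.au:<path_on_tinaroo>/my_results/ ./my_results/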
Pawsey clusters - Magnus, Topaz and Zeus
Pawsey clusters have special nodes dedicated to moving data on and off the cluster, which helps
reduce the load on the login nodes. You should use these nodes for large data transfers to
improve data transfer speeds and avoid getting a cranky email from the Pawsey system administrators.
To initiate an rsync
transfer from your local computer or a non-Pawsey server, use the remote
prefix pawsey_username@hpc-data.pawsey.org.au
- this will connect to the filesystem shared
between all Pawsey clusters.
To initiate a transfer while logged into a Pawsey cluster, use
a SLURM batch or interactive job in the copyq
partition. You’ll need to use Zeus - a
“cloud-like” cluster maintained by Pawsey for high-throughput and data-intensive workloads.
Log in using the same username and password you use for Magnus, at the SSH address
pawsey_username@zeus.pawsey.org.au
. Do not do large data transfers on Magnus, as it does
not have the same optimisations for data movement.
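As a rough sketch, a copyq batch job on Zeus for pushing results to the RDM might look something like this (the account, time limit and paths are placeholders - check the Pawsey user guide for the exact directives your project needs):
#!/bin/bash -l
#SBATCH --partition=copyq
#SBATCH --ntasks=1
#SBATCH --time=02:00:00
#SBATCH --account=<your_pawsey_project>
# copy finished results from /scratch to the RDM via QRISCloud
rsync -rvz /scratch/<project>/<user>/my_results <uq_username>@data.qriscloud.org.au:<destination_path>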
For more detailed instructions and best-practices for data transfers at Pawsey, see this page in the Pawsey user guide.
NCI - Gadi
As with Pawsey clusters, Gadi has special nodes dedicated to data transfers on and off clusters.
For rsync
commands from your local computer or non-NCI servers, use the remote prefix
nci_username@gadi-dm.nci.org.au
. To initiate a transfer while logged into Gadi, use a PBS job
(batch or interactive) with the copyq
queue. See this page
in the Gadi user guide for more information.
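A similarly hedged sketch of a copyq PBS job on Gadi (project code, resource limits and paths are placeholders - see the Gadi user guide for the exact requirements, including any storage directives your job needs):
#!/bin/bash
#PBS -q copyq
#PBS -P <your_nci_project>
#PBS -l ncpus=1
#PBS -l mem=4GB
#PBS -l walltime=02:00:00
# copy finished results from Gadi's scratch space to the RDM via QRISCloud
rsync -rvz /scratch/<project>/<user>/my_results <uq_username>@data.qriscloud.org.au:<destination_path>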
Rclone
Rclone is available on Tinaroo (UQ RCC) and Zeus (Pawsey) as a software module. To use it,
you’ll need to load the module with module load rclone
then follow the instructions
for RDM and/or CloudStor from earlier in this guide. Rclone is also installed on Gadi, but
is available on login without needing to load a software module.
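For example, to push a folder from a cluster’s working storage to the rdm remote configured earlier (the paths and remote name here are just illustrative):
module load rclone
rclone --progress copy /path/to/my_results rdm:MyProject-Q1234/my_results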
Changes with Setonix - the Acacia object storage
Acacia is a new long-term research data storage service maintained by Pawsey to work in tandem with the new Setonix cluster.
Setonix will use a different set of filesystems to the /home, /group and /scratch setup used by Magnus, Topaz and Zeus. Setonix will still have /home and /scratch, which will behave as they did on Magnus - i.e. /home is for storing configuration files and scripts and /scratch is for short-term storage of files. The /group filesystem will be replaced by two separate services: /software, a standard Linux filesystem for storing application executables, and Acacia, which will hold long-term research data.
Acacia is different from the other long-term data storage services in this guide as it is an object store. Object storage uses a fundamentally different model of data storage and management than traditional hierarchical filesystems (like /home and /scratch) in that it is designed from the ground up to store unstructured data.
There are no folders, directories or hierarchies in an object store - data is stored as binary objects and organised into buckets. As this metaphor would suggest, objects are not organised within buckets - you chuck the data in and let the object storage system figure out what to do with it. Instead of accessing data by filename and path, you access objects by a unique identifier string, with the ability to search for objects based on a rich set of user-supplied metadata. Object stores support a much wider range of file metadata than traditional filesystems, as you can define arbitrary key-value pairs to attach to an object (note: metadata is limited to 2KB per object). For example, you can tag a molecular structure file with its molecular formula, the number of atoms it contains, the software used to create it or anything else that might be relevant when looking for it later.
Object storage requires a different workflow to traditional filesystems. Most importantly, objects
cannot be modified in-place on the object server, a property called atomicity. Instead of the
traditional I/O workflow of applications modifying files with many small, incremental reads and writes
(e.g. writing output as it is generated), an object store requires you to first check out a file into
local storage (e.g. /scratch
or your local computer), make modifications locally, then check it back
in to the object store (overwriting the old data). As such, object storage is particularly well suited
to data which is read more often than it is written, such as molecular structure files you might use as
the starting point for an MD simulation. Frequently written files like checkpoint files or MD
trajectories should not be stored on Acacia - write them to /scratch
and then move to long-term
storage once you’re done with them.
Fortunately, the Pawsey Centre has some nice tutorials and very thorough documentation on object storage in general and Acacia in particular, including suggested workflows for using Acacia with Setonix. Check out the following links for more information:
- Video tutorials on YouTube: https://www.youtube.com/playlist?list=PLmu61dgAX-aYxrbqtSYHS1ufVZ9xs1AnI
- Watch these two videos first. They aren’t as thorough as the written documentation, but they’re a nicer introduction to the general concepts.
- Acacia user guide: https://support.pawsey.org.au/documentation/display/US/Acacia+-+User+Guide
- Check this guide if you have questions or need reminders of specific commands.
- Follow the instructions in this guide when you access Acacia from Setonix.
- Acacia user portal: https://portal.pawsey.org.au/origin/portal/account/acaciabuckets
- Use this web interface to manage access keys and check buckets/storage usage.
Accessing Acacia with rclone
It is also possible to access data on Acacia using rclone, through its S3
interface. This will allow
you to create and delete buckets, upload and download objects and synchronise with external storage
(including CloudStor and UQ RDM). rclone’s support for user-generated metadata is limited, though, so
you’ll need to use one of the other tools mentioned in the Acacia user guide
to tag objects.
Setup is less involved than for CloudStor or the RDM, as Pawsey will provide an rclone configuration when you create a new access key for Acacia. Simply copy the parameters from this sample configuration when doing rclone config, or paste the configuration into ~/.config/rclone/rclone.conf if you haven’t encrypted your rclone configuration file (don’t do this if your config is encrypted, as it could break your configuration). As always, make sure to give the remote a descriptive name such as acacia. Here is an example of an rclone config for Acacia:
[acacia]
type = s3
provider = ceph
access_key_id = <your_ID_key>
secret_access_key = <your_secret_key>
endpoint = https://projects.pawsey.org.au
acl = private
For information on usage, see the rclone user documentation for the S3
protocol: https://rclone.org/s3/.
The basic syntax is similar to using rclone for CloudStor or the RDM - refer to buckets as if they
were folders on the remote (even though they technically aren’t) and everything should Just Work. Some
basic commands:
- rclone lsd <remote>: will list all buckets you have on Acacia
- rclone mkdir <remote>:<bucket> will create a new bucket on Acacia
- rclone ls <remote>:<bucket> will list the objects in a bucket
- rclone copy <some_file> <remote>:<bucket> will copy a file to a bucket
Some rclone commands can cause data loss, so it’s important to test the command first by running rclone with the --dry-run flag (e.g. rclone --dry-run <command>) to make sure you don’t accidentally delete something you didn’t mean to.
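Putting this together with the check-out/check-in workflow described earlier, a minimal sketch might look like the following (the bucket names, object paths and scratch paths are placeholders):
# check an input file out of Acacia into local/scratch storage
rclone copy acacia:my-inputs/structure.pdb /scratch/<project>/<user>/work/
# ... run the simulation against the local copy ...
# check the finished results back in, previewing the transfer first
rclone mkdir acacia:my-results
rclone --dry-run copy /scratch/<project>/<user>/work/results acacia:my-results/results
rclone --progress copy /scratch/<project>/<user>/work/results acacia:my-results/results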
Useful links
UQ RDM
- UQ RDM knowledge base: https://rdm.uq.edu.au/resources
- More RDM documentation: https://guides.library.uq.edu.au/for-researchers/uq-research-data-manager
AARNet CloudStor
- AARNet CloudStor knowledge base: https://support.aarnet.edu.au/hc/en-us/categories/200217608-CloudStor
- ownCloud desktop client documentation: https://owncloud.com/docs-guides/
Rclone and rsync
- Rclone project website: https://rclone.org/
- rsync documentation: https://rsync.samba.org/documentation.html
HPC clusters
- Tinaroo storage system documentation: http://www2.rcc.uq.edu.au/hpc/guides/index.html?secure/Tinaroo_userguide.html#The-Storage-Subsystem
- QRISCloud data documentation (for RDM): https://www.qriscloud.org.au/support/qriscloud-documentation/93-using-qrisdata-collections
- Pawsey data workflow documentation: https://support.pawsey.org.au/documentation/display/US/Data+Documentation
- Gadi user guide: https://opus.nci.org.au/display/Help/Gadi+User+Guide
Acacia
- Video tutorials on the Pawsey YouTube channel: https://www.youtube.com/playlist?list=PLmu61dgAX-aYxrbqtSYHS1ufVZ9xs1AnI
- Acacia user guide: https://support.pawsey.org.au/documentation/display/US/Acacia+-+User+Guide
- Acacia user portal: https://portal.pawsey.org.au/origin/portal/account/acaciabuckets
- rclone documentation for S3 object stores: https://rclone.org/s3