Research Data Support
The library helps campus research labs and centers manage and publish their research data.
CaltechDATA and DOIs
CaltechDATA offers standard data preservation and DOI (permanent identifier) services. We also offer services (at an additional cost) for preserving large volumes of data (> 500 GB). Contact us at data [at] library.caltech.edu to discuss the options.
Caltech Library also manages custom DOIs for groups on campus. To find out more, see our DOI page.
Data Consultations and Training
Library staff can help you or your research group work through challenges associated with research data. We can provide:
- Data Management Plan (DMP) assistance
- Data management guidance
- Capacity planning
- Storage technology recommendations
- Data processing suggestions
- Interactive visualization hosting
- Long-term archival and data sharing
Contact us at data [at] library.caltech.edu to schedule a consultation appointment.
Library staff offer workshops and training in the following data-related areas:
- Data management
- Data visualization
- Software and Data Carpentry
Contact us at library [at] caltech.edu to schedule a session for your research group, class, or organization.
We've worked with Information Management Systems & Services (IMSS) to put together a centralized Research Data FAQ.
Data Management Plans
Funding agencies, such as the National Science Foundation and the National Institutes of Health, have specific requirements and templates for Data Management Plans (DMPs). Funder templates and further guidance are available at the DMPTool.
Caltech Library has created a number of Caltech-specific DMP resources that are available on this webpage, including standard language, a DMP checklist, and an example DMP.
Project Close-Out Checklist for Research Data
The close-out checklist describes a range of activities for helping ensure that research data are properly managed at the end of a project or at researcher departure. Activities include: making stewardship decisions, preparing files for archiving, sharing data, and setting aside important files in a "FINAL" folder.
File Naming Convention Worksheet
The naming convention worksheet walks researchers through the process of creating a file naming convention for a group of files. This process includes: choosing metadata, encoding and ordering the metadata, adding version information, and properly formatting the file names.
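The steps the worksheet describes can be sketched in a few lines of Python. This is only an illustration, not the worksheet's own method: the chosen metadata fields (collection date, project, sample), their order, the underscore separator, and the function names are all assumptions made for the example.

```python
import re
from datetime import date


def clean(part: str) -> str:
    """Lowercase a metadata value and remove characters that are awkward in file names."""
    part = part.strip().lower()
    part = re.sub(r"[\s_]+", "-", part)     # spaces/underscores become hyphens inside a field
    return re.sub(r"[^a-z0-9-]", "", part)  # drop any other punctuation


def build_name(collected: date, project: str, sample: str, version: int, ext: str) -> str:
    """Encode the metadata in a fixed order, separated by underscores."""
    fields = [
        collected.strftime("%Y%m%d"),  # YYYYMMDD sorts chronologically
        clean(project),
        clean(sample),
        f"v{version:02d}",             # zero-padded so v02 sorts before v10
    ]
    return "_".join(fields) + "." + ext.lstrip(".").lower()


print(build_name(date(2020, 3, 5), "Coral Growth", "Site A #3", 2, ".CSV"))
# prints: 20200305_coral-growth_site-a-3_v02.csv
```

Putting the date first and zero-padding the version keeps files in a sensible order when a folder is sorted by name; a different group of files may call for different fields or ordering.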
Data Storage Hardware and Services
Network Attached Storage
Network attached storage (NAS) devices are boxes that contain both storage and the hardware needed to manage the storage. They can be thought of as a small computer with lots of storage. Using a NAS to store your research data has many benefits. Because NAS devices are internet accessible, it is easy to centralize data collected on different instruments and to access data for later analysis. Most models contain multiple hard drives and can be set up with RAID to protect against data loss in case of a hard drive failure. NAS devices are generally affordable ($300-$1500 depending on the storage space needed) and are usually cheaper than purchasing cloud storage over 4-5 years. We currently provide instructions for setting up Synology NAS devices, but many manufacturers make a similar product. If you've got a different NAS, let us know at data [at] caltech.edu and we can work on putting together setup instructions.
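RAID protection comes at the cost of usable space, which matters when sizing a NAS. The sketch below estimates usable capacity for a few common RAID levels, assuming all drives are the same size. The function name and the 4 x 4 TB example are illustrative assumptions; real devices report somewhat less because of file system overhead, and vendor-specific RAID modes are not covered.

```python
def usable_tb(level: str, drives: int, drive_tb: float) -> float:
    """Approximate usable capacity (TB) for equal-size drives at a given RAID level."""
    if level == "RAID 1":
        if drives < 2:
            raise ValueError("RAID 1 needs at least 2 drives")
        return drive_tb                 # every drive holds a full copy
    if level == "RAID 5":
        if drives < 3:
            raise ValueError("RAID 5 needs at least 3 drives")
        return (drives - 1) * drive_tb  # one drive's worth of parity
    if level == "RAID 6":
        if drives < 4:
            raise ValueError("RAID 6 needs at least 4 drives")
        return (drives - 2) * drive_tb  # two drives' worth of parity
    raise ValueError(f"unsupported level: {level}")


for level in ("RAID 5", "RAID 6"):
    print(level, usable_tb(level, drives=4, drive_tb=4.0), "TB usable")
# prints: RAID 5 12.0 TB usable
#         RAID 6 8.0 TB usable
```

RAID guards against drive failure only; it is not a substitute for a separate backup copy.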
Interested in trying out a NAS? Send an email to data [at] caltech.edu to get access to our demo NAS!
Caltech IMSS manages a campus site license for Box.com. Campus users get 50 GB of free storage. Box.com is a good resource for storing backup copies of data and syncing between computers, but should not be used as a primary data storage location. Note that Box.com has a 5 GB individual file limit and lacks a Linux sync client. Continued availability of Box.com is dependent on IMSS and Box. A comparison of IMSS-provided file storage systems is available at http://imss.caltech.edu/services/collaboration-storage-backups/storage-comparison
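Before copying a data folder to Box.com, it can help to check for files that exceed the 5 GB individual file limit. The following is a minimal sketch using only the Python standard library; the function name is hypothetical, and it assumes the limit is counted in decimal gigabytes (5 x 10^9 bytes), which may not match how Box measures it.

```python
from pathlib import Path

# 5 GB per the limit described above; decimal gigabytes are an assumption.
BOX_FILE_LIMIT_BYTES = 5 * 1000**3


def oversized_files(folder, limit=BOX_FILE_LIMIT_BYTES):
    """Return (path, size_in_bytes) pairs for files under folder larger than limit."""
    results = []
    for path in sorted(Path(folder).rglob("*")):  # walk the folder recursively
        if path.is_file():
            size = path.stat().st_size
            if size > limit:
                results.append((path, size))
    return results
```

Calling oversized_files on a data folder returns the files that would need to be split, compressed, or stored elsewhere.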
Have questions about other storage services? Send an email to data [at] caltech.edu.
High Performance Computing Resources
IMSS now provides centralized HPC resources via a campus cluster. There is a per-hour charge for computing, and research groups can make an investment to get additional computing time. Find all the details at hpc.caltech.edu.
XSEDE is a nationwide high performance computing resource funded by the National Science Foundation. Researchers can request time on more than 10 national supercomputers, visualization resources, storage systems, and scientific gateways; current resources are listed below. A separate NSF grant is not required to gain access to these resources. Caltech users interested in testing one of these systems can contact the Caltech Campus Champion, Tom Morrell at tmorrell [at] caltech.edu, for trial access. Faculty, postdoctoral researchers, and NSF Graduate Research Fellows can submit a startup allocation request, which provides up to 50,000 compute hours to test XSEDE resources. The startup allocation request process is straightforward and enables users to quickly access computing resources. After one year or when the startup resources are exhausted, researchers can submit a more thorough research allocation proposal.
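To get a feel for how far a startup allocation goes, a rough budget can be worked out in a few lines. This sketch assumes one compute hour equals one core running for one hour and that whole nodes are charged; actual charging rules differ between resources, so treat the numbers as an estimate only. The function names and the example job (4 nodes of 24 cores for 12 hours) are illustrative.

```python
STARTUP_HOURS = 50_000  # upper bound quoted for a startup allocation


def job_cost(nodes: int, cores_per_node: int, wall_hours: float) -> float:
    """Core-hours consumed by one job, assuming every core on each node is charged."""
    return nodes * cores_per_node * wall_hours


def runs_that_fit(budget: float, nodes: int, cores_per_node: int, wall_hours: float) -> int:
    """Number of identical jobs that fit within the budget."""
    return int(budget // job_cost(nodes, cores_per_node, wall_hours))


cost = job_cost(nodes=4, cores_per_node=24, wall_hours=12)
print(cost, "core-hours per run;", runs_that_fit(STARTUP_HOURS, 4, 24, 12), "runs fit")
# prints: 1152 core-hours per run; 43 runs fit
```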
Current XSEDE Resources (3/2020)
Each entry lists the resource name, host, node specifications, and max queue time.
- Comet, SDSC, 24 Core; 128 GB RAM; 320 GB SSD, 2 days
- Stampede2, TACC, select KNL or SKX: 68 or 48 Core; 96 or 192 GB RAM; 107 or 144 GB SSD, 4 days
- Bridges, PSC, 28 Core; 128 GB RAM; 8 TB Storage, 2 days (can also schedule segments of a node)
- Bridges Large, PSC, 3 or 12 TB RAM; 16 or 64 TB Storage, 4 days
- Comet, SDSC, 64 Core; 1.5 TB RAM; 400 GB SSD, 2 days
- Jetstream, Indiana/TACC, Can spin up various sized custom imaged environments
- Open Science Grid, Distributed computing for smaller jobs (single thread, < 2 GB memory, 1-12 hour execution, <10 GB storage)
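As a quick screen, the Open Science Grid limits listed above can be turned into a small check that flags why a job may not be a good fit. This is a sketch only: the Job fields and the function name are hypothetical, and the thresholds are taken directly from the list entry rather than from OSG documentation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Job:
    threads: int
    memory_gb: float
    runtime_hours: float
    storage_gb: float


def osg_issues(job: Job) -> List[str]:
    """Return reasons the job falls outside the limits listed for Open Science Grid."""
    issues = []
    if job.threads != 1:
        issues.append("uses more than one thread")
    if job.memory_gb >= 2:
        issues.append("needs 2 GB of memory or more")
    if not 1 <= job.runtime_hours <= 12:
        issues.append("runtime is outside the 1-12 hour range")
    if job.storage_gb >= 10:
        issues.append("needs 10 GB of storage or more")
    return issues


print(osg_issues(Job(threads=1, memory_gb=1.5, runtime_hours=6, storage_gb=4)))
# prints: []
```

An empty list means the job fits the listed limits; otherwise one of the other resources above may be a better match.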
CyVerse (formerly iPlant) is another NSF-funded cyberinfrastructure project that provides computing resources, primarily targeted at life science researchers. It offers free access to Atmosphere, a cloud-based computing resource where you can spin up computing resources with specific images. A basic allocation is available by registering, and additional allocations require an application.
Have questions about high performance computing resources? Send an email to data [at] caltech.edu.