How long will data be stored in CaltechDATA?

In most cases, indefinitely. Any data that violate the Terms of Deposit or fail to meet minimum standards for retention may be withdrawn. For example, we may eventually remove deposits that consist of unusable or obsolete files or that are inadequately described. Larger files may be held to higher standards for retention. The DOI for every record will be retained and will lead to a tombstone page listing the reason for withdrawal.

Who runs CaltechDATA?

CaltechDATA is a service of the Caltech Library as part of CODA. All members of the Caltech community can upload research data for long-term preservation and public access.

Who can upload files to CaltechDATA?

Any member of the Caltech community with an access.caltech username can upload files to CaltechDATA. For issues with your username or password, please contact IMSS.

What can I upload to CaltechDATA?

You can upload any digital files to CaltechDATA. You can also directly import software from GitHub for long-term preservation. Publications should be submitted to


Is there anything that can't be deposited to CaltechDATA?

You must have the rights to any data you deposit. Data cannot violate the publicity, privacy, or confidentiality rights of others or be covered by HIPAA or FERPA. Read our Terms of Deposit for complete details.

How much data can I store on CaltechDATA?

There are currently no hard storage limits, but you should only deposit data that will be useful to others. All data must be described in sufficient detail so that others can understand it. If you're planning on uploading more than 500 GB of data, please contact us first.

Does CaltechDATA have any file size restrictions?

No, but file size can affect the availability of files. Files under 1 GB will always be immediately available. Files under 100 GB may be immediately available only if they are accessed at least once every four months. Otherwise, files may be stored on Infrequent Access Storage (IAS) and may take up to 24 hours to retrieve.
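The availability rule above can be sketched in a few lines of Python. This is only an illustration of the policy as described, not the repository's actual implementation: the function name is hypothetical, gigabytes are assumed to be decimal (10^9 bytes), and "four months" is approximated as 120 days.

```python
from datetime import datetime, timedelta

GB = 10**9                           # assumption: decimal gigabytes
ACCESS_WINDOW = timedelta(days=120)  # assumption: "four months" is about 120 days


def may_move_to_ias(size_bytes, last_accessed, now):
    """Return True if a file could be placed on Infrequent Access Storage (IAS)."""
    if size_bytes < 1 * GB:
        # Files under 1 GB are always immediately available.
        return False
    if size_bytes < 100 * GB:
        # Mid-sized files stay immediately available only while they are accessed regularly.
        return now - last_accessed > ACCESS_WINDOW
    # Files of 100 GB or more fall under the "otherwise" case.
    return True
```

For example, a 5 GB file last downloaded 200 days ago would be a candidate for IAS, while the same file downloaded last month would not.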


How do I access my CaltechDATA files in Infrequent Access Storage (IAS)?

Currently no files are stored on IAS. When this feature becomes active, instructions will be provided.


Can I make my files in CaltechDATA private?

Not indefinitely. You can embargo data for a specific period of time, but all data must be intended for public access at some point in the future.

Can I restrict data in CaltechDATA to specific users or groups?

No, CaltechDATA is an open repository.

Does my data need to be published before uploading to CaltechDATA?

No, you can upload unpublished data. If you wish, you can embargo the data until publication. You can easily link your data to publications by entering the DOI in the related publications field.

What metadata fields are used in CaltechDATA?

Our metadata is derived from the DataCite 4.0 schema, and is compliant with the Project Open Data 1.1 schema (used by US Federal Government Agencies). Our metadata includes an explicit related publications list.
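As a rough illustration, a record's metadata can be thought of as a dictionary covering the mandatory DataCite 4.0 properties (Identifier, Creator, Title, Publisher, PublicationYear, ResourceType) plus an optional related publication. The JSON-style key spellings, the helper name, and all values below are assumptions made for this sketch. They are not taken from the CaltechDATA schema itself.

```python
# Mandatory DataCite 4.0 properties, written as JSON-style keys.
# The exact key spellings are an assumption for illustration.
REQUIRED_KEYS = (
    "identifier",
    "creators",
    "titles",
    "publisher",
    "publicationYear",
    "resourceType",
)


def missing_required(metadata):
    """Return the mandatory keys that are absent or empty in a metadata dict."""
    return [key for key in REQUIRED_KEYS if not metadata.get(key)]


# Hypothetical record; the DOIs and names are placeholders, not real records.
example = {
    "identifier": {"identifier": "10.1234/example-data", "identifierType": "DOI"},
    "creators": [{"creatorName": "Researcher, Example"}],
    "titles": [{"title": "Example spectra"}],
    "publisher": "CaltechDATA",
    "publicationYear": "2018",
    "resourceType": {"resourceTypeGeneral": "Dataset"},
    # A related publication, linked by its DOI.
    "relatedIdentifiers": [
        {
            "relatedIdentifier": "10.1234/example-paper",
            "relatedIdentifierType": "DOI",
            "relationType": "IsSupplementTo",
        }
    ],
}

print(missing_required(example))
```

A check like this is a quick way to spot an incomplete description before submitting a record.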

What can I do with data in CaltechDATA?

You can use the read API to load files in CaltechDATA into another application. For example, a demonstration web application lets you interactively plot two mineral spectra files held in CaltechDATA, and the API code that generates the demo is publicly available. Feel free to send us an email at data [at] if you'd like help integrating the repository with your application.

Is there any charge for storing data in CaltechDATA?

CaltechDATA storage is provided by the Library at no cost in most cases. Users planning to deposit more than 500 GB of data should email us at data [at] to discuss their requirements.

Can I make changes to a record in CaltechDATA?

Yes, once you're logged in you should see an edit button appear for all records you created. This allows you to edit the metadata in the record. If you need to change the files associated with a record, send us an email at data [at]

How do I use the CaltechDATA API to create records?

First, you need to generate an access token. Log into CaltechDATA, and then click on your user menu (the person icon in the upper right hand corner). Then click "Applications".

Click on the "+ New Token" button in the Personal access tokens section.

Make up a name for your token and check all of the scope buttons.

Your token will be shown on screen. Copy it down and store it somewhere secure. It functions just like an account password.  

You can create records using our Python library, caltechdata_api. To install it, download the source code of the latest release, extract the file, and navigate to the caltechdata_api-x.x.x directory on the command line. Then type 'python setup.py install' to install the library.

To use the library, you'll need to set the access token you just created. Type 'export TINDTOK=TOKEN', where TOKEN is replaced by your actual token - or use the token.bash script that is distributed with the library.
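If you write your own scripts, one way to pick up the token is to read it from the same TINDTOK environment variable. The helper below is a minimal sketch. The function name is hypothetical and is not part of caltechdata_api.

```python
import os


def load_token(env=os.environ):
    """Read the access token from the TINDTOK environment variable."""
    token = env.get("TINDTOK", "").strip()
    if not token:
        # Fail early with a hint rather than sending an empty token.
        raise RuntimeError("TINDTOK is not set; run 'export TINDTOK=TOKEN' first")
    return token
```

Reading the token from the environment keeps it out of your source code, which matters because the token works like a password.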

Some scripts used for creating more complex data records are located in the caltechdata_migrate repository. An example that publishes a Mercurial repository to CaltechDATA is available at caltechdata_hg.

We're also here to help - just send us an email at data [at]

Does the CaltechDATA repository provide private links for journal peer reviewers to use before publication?

A peer review access password/link is the most requested feature for CaltechDATA.  Our development partner is working on implementing the feature, but we do not have a schedule for when it will be available.  

For now, we recommend submitting the files to CaltechDATA with an embargo. This will generate a permanent DOI link that you can include in the paper text or references. The reviewers will be able to follow the DOI and see that the files have been uploaded, but they will not be able to download the files. You should also upload the files to a file sharing service where you can generate a link (or password) to give the reviewers so they can access all the files.


How do I get a DOI badge on my GitHub repository?

First, submit your software to CaltechDATA, or to Zenodo if you're not affiliated with Caltech. Once you log in, select GitHub from the profile menu in the upper right-hand side of the screen. You'll need to enter your GitHub credentials and then select the repository of interest (you might have to refresh the page for the buttons to activate). Your software will be preserved only when you make a release in GitHub.

Different formats of the GitHub badge will appear on the GitHub section of CaltechDATA or Zenodo once you make your first software release. If you want the DOI badge to appear with your first release, find your GitHub repo id at: (swap out your repo name).

Copy the markdown snippet below to your README file and replace the two long number sequences with your GitHub repo id:


The badge will not appear until you make a release in GitHub and a DOI is assigned.
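The lookup and substitution steps can be sketched in Python. This sketch rests on two assumptions: that the repo id is read from the GitHub REST API repository endpoint (which returns a numeric "id" field), and that the badge follows the Zenodo-style pattern in which the id appears twice. The host is left as a parameter because the exact badge URL is not shown here, and the function names are hypothetical.

```python
import json


def github_repo_api_url(owner, repo):
    """URL of the GitHub REST API record for a repository."""
    return f"https://api.github.com/repos/{owner}/{repo}"


def repo_id_from_response(body):
    """Extract the numeric repository id from the API's JSON response text."""
    return int(json.loads(body)["id"])


def badge_markdown(repo_id, host):
    """Build a DOI badge snippet; the URL pattern is an assumption (Zenodo style)."""
    return (
        f"[![DOI](https://{host}/badge/{repo_id}.svg)]"
        f"(https://{host}/badge/latestdoi/{repo_id})"
    )


# Placeholder id for illustration only.
print(badge_markdown(123456789, "zenodo.org"))
```

Open the URL from github_repo_api_url in a browser, find the "id" value near the top of the response, and pass it to badge_markdown.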

Who should be listed as an author in CaltechDATA?

The authors of a dataset are "The main researchers involved in producing the data", according to the DataCite specification.  The authors listed in CaltechDATA do not need to match the authors in a related publication.  You can list other individuals who assisted with a research project as contributors. 

How can I test making records in CaltechDATA?

You should not conduct testing in CaltechDATA, as submitting a record will automatically generate a DOI that cannot be deleted.  We have a separate instance for testing - send us an email at data [at] so we can set up a test account for you.

How does CaltechDATA ensure that files will be available over time?

CaltechDATA is designed as a long-term data repository and it is expected that data will be available for the foreseeable future. This service is run by TIND and managed by the Caltech Library, which has a decade-plus history of preserving and making digital content available online.  All records are assigned a digital object identifier (DOI), which provides a persistent machine-readable web link to the data set. All data is preserved using the OAIS reference model and multiple copies of all files are stored in different locations using varied storage approaches.
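To show what "persistent machine-readable web link" means in practice, here is a small sketch that turns a DOI into its resolver URL on doi.org. The helper name is hypothetical. The sample DOI is the one used in the video example elsewhere in this FAQ.

```python
from urllib.parse import quote


def doi_url(doi):
    """Turn a DOI such as '10.22002/D1.1278' into a resolvable https://doi.org link."""
    doi = doi.strip()
    # Accept a few common ways of writing a DOI and reduce them to the bare form.
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    if not doi.startswith("10."):
        raise ValueError(f"not a DOI: {doi!r}")
    # Percent-encode unusual characters but keep the slash between prefix and suffix.
    return "https://doi.org/" + quote(doi, safe="/")


print(doi_url("10.22002/D1.1278"))
```

Because the link goes through the DOI resolver rather than pointing at a specific server, it keeps working even if the record's landing page moves.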

How do I preserve a GitHub repository in CaltechDATA?

Log into CaltechDATA with your Caltech (access.caltech) account credentials.  Then select GitHub from the profile menu in the upper right hand side of the screen. 

You'll need to enter your GitHub credentials and then select the repository of interest (you might have to refresh the page for the buttons to activate). Your software will be preserved only when you make a release in GitHub.

If you want more control over the metadata in CaltechDATA, add a codemeta.json file to your GitHub repository. You can see an example in one of our library repositories, which generates a CaltechDATA record that includes multiple authors with ORCIDs, custom keywords, and license information.

How do I make my GitHub repository interactive using Binder?

Binder is an open-source service for making GitHub repos interactive. With the click of a button, users get a virtual compute environment where they can run your code and reproduce your results. It's also a great way to test whether you've defined all the dependencies for your project. How you set up a Binder depends on your programming language.


See a demo. One approach for running a Jupyter notebook is to add a requirements.txt file to your GitHub repo listing all your Python dependencies and versions. If you need libraries that aren't installed by pip, you can also add an apt.txt file with any Linux (Debian) packages to install.  Add a binder button to your repo by copying the following to your README file:


and swapping out 'github' for your GitHub account and 'repo' for your GitHub repo name.  If you want a specific Jupyter notebook to launch, add '?filepath=tccon-plotting.ipynb' to the end of the URL. 
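For reference, the badge markdown can be assembled as in the sketch below. It assumes the public mybinder.org service, its standard badge image, and its `v2/gh/<account>/<repo>/<ref>` launch URL form. The helper name and the default ref of `HEAD` are choices made for this illustration.

```python
from urllib.parse import quote


def binder_badge(account, repo, ref="HEAD", filepath=None):
    """Build README markdown for a Binder launch badge (assumes mybinder.org)."""
    url = f"https://mybinder.org/v2/gh/{account}/{repo}/{ref}"
    if filepath:
        # Open a specific notebook instead of the repository file listing.
        url += "?filepath=" + quote(filepath)
    return f"[![Binder](https://mybinder.org/badge_logo.svg)]({url})"


print(binder_badge("github", "repo", filepath="tccon-plotting.ipynb"))
```

Replace 'github' and 'repo' with your own account and repository names, and set ref to a branch, tag, or commit if you want the badge pinned to a specific version.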


See a demo.  You can generate an RStudio session by adding an install.R file with your project dependencies to your GitHub repo.  Then add a runtime.txt file with the date you last tested your code.  Add a Binder button by copying the following to your README:


and swapping out 'AuthorCarpentry' for your GitHub account and 'repo' for your GitHub repo name.


Many more configurations are described on the Binder support page.

How are usage statistics collected in CaltechDATA?

CaltechDATA tracks aggregate visits to a record page and downloads of files.  We follow the COUNTER Code of Practice for Research Data Usage Metrics to define usage of our data files.  Our reported statistics include only unique access, which is one or more downloads or views of content within a 1-hour time-window. 
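The idea of a unique access can be illustrated with a short sketch. It assumes each event carries an anonymous visitor or session identifier, a record identifier, and a timestamp, and it treats the 1-hour window as a clock-hour bucket. Both are simplifications chosen for this example and may differ from how the reported statistics are actually computed.

```python
from datetime import datetime


def count_unique_accesses(events):
    """Count unique accesses from (visitor_id, record_id, timestamp) tuples.

    Repeated views or downloads of the same record by the same visitor
    within the same clock hour are counted once.
    """
    seen = set()
    for visitor, record, when in events:
        # Truncate the timestamp to the start of its hour to form the window key.
        window = when.replace(minute=0, second=0, microsecond=0)
        seen.add((visitor, record, window))
    return len(seen)


events = [
    ("visitor-a", "record-1", datetime(2019, 5, 1, 10, 5)),
    ("visitor-a", "record-1", datetime(2019, 5, 1, 10, 40)),  # same hour: not counted again
    ("visitor-a", "record-1", datetime(2019, 5, 1, 12, 0)),   # later hour: counted
    ("visitor-b", "record-1", datetime(2019, 5, 1, 10, 10)),  # different visitor: counted
]
print(count_unique_accesses(events))  # 3
```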

All usage tracking is done via JavaScript using Matomo, so our usage statistics definitely undercount actual usage. Automated downloads from applications like curl, and visits from individuals who disable JavaScript manually or via an ad blocker, are not included. Research has shown this may be up to 60% of repository usage. We hope to capture more of this usage in the future.

Our repository service provider, TIND, hosts Matomo on behalf of Caltech Library. Caltech Library shares aggregate usage statistics with DataCite, which are available as standard Usage Reports and via the DataCite search service. We do not share raw usage statistics with third parties, and we do not connect usage with individual logins.

How do I get a Binder badge on my CaltechDATA record?

Binder badges allow users to instantly re-run software in CaltechDATA in their browser with a single click. Caltech Library staff will add a badge to your record after they verify that your submission builds correctly. You should test that your software works correctly from the original source (e.g. GitHub) before submitting the software to CaltechDATA. Once you've submitted working software, email data [at] to request a badge.

What Resource Type should I select in CaltechDATA?

Dataset and Software are the two most commonly used Resource Types in CaltechDATA. We provide the DataCite taxonomy of other Resource Types that may be appropriate for specific submissions.

  • InteractiveResource - A resource requiring interaction from the user to be understood. This is used for Virtual Reality content, training modules, or query/response portals.
  • Model - An abstract, conceptual, graphical, mathematical or visualization model that represents empirical objects, phenomena, or physical processes. This could be descriptions of languages or a molecular biology reaction chain.
  • Workflow - A structured series of steps which can be executed to produce a final outcome, allowing users a means to specify and enact their work in a reproducible manner.
  • Audiovisual - Movies with or without sound. If a record has movies and other files, it should be described as a Dataset or Workflow.
  • Image - Image files. If a record has image and other files, it should be described as a Dataset or Workflow.
  • Sound - Audio files. If a record has sound and other files, it should be described as a Dataset or Workflow.
  • Text - A resource consisting primarily of words for reading. CaltechAUTHORS is a better repository for most written content from Caltech.
  • Collection - An aggregation of resources that are already described with persistent identifiers. You can use this type for a record that contains links to other CaltechDATA records.


How can I display a video from CaltechDATA on my web site?

We've developed a JavaScript-based solution to embed video content from CaltechDATA into any web site.  You only need to know the DOI of the CaltechDATA record.  Paste the following code into your website:

<div id="videodiv1"></div>
<script src=""></script>
<script>
  let div = document.getElementById("videodiv1"),
      doi = '10.22002/D1.1278',
      item_no = 0;
</script>

If your CaltechDATA record contains multiple videos, select the specific one you want to display using the item_no variable. Our CL.js function is general and will work with any DOI where the content provider has provided media information to DataCite. The underlying viewer is video.js, an open-source JavaScript viewer.