How long will data be stored in CaltechDATA?

In most cases, indefinitely. Any data that violate the Terms of Deposit or fail to meet minimum standards for retention may be withdrawn. For example, we may eventually remove deposits that consist of unusable or obsolete files or that are inadequately described. Larger files may be held to higher standards for retention. The DOI for every record will be retained and will lead to a tombstone page listing the reason for withdrawal.

Who runs CaltechDATA?

CaltechDATA is a service of the Caltech Library, provided as part of CODA. All members of the Caltech community can upload research data for long-term preservation and public access.

Who can upload files to CaltechDATA?

Any member of the Caltech community with an access.caltech username can upload files to CaltechDATA. For issues with your username or password, please contact IMSS.

What can I upload to CaltechDATA?

You can upload any digital files to CaltechDATA. You can also directly import software from GitHub for long-term preservation. Publications should be submitted to CaltechAUTHORS.


Is there anything that can't be deposited to CaltechDATA?

You must have the rights to any data you deposit. Data cannot violate the publicity, privacy, or confidentiality rights of others or be covered by HIPAA or FERPA. Read our Terms of Deposit for complete details.

How much data can I store on CaltechDATA?

There are currently no hard storage limits, but you should only deposit data that will be useful to others. All data must be described in sufficient detail so that others can understand it. If you're planning on uploading more than 500 GB of data, please contact us first.

Does CaltechDATA have any file size restrictions?

No, but file size can affect how quickly files are available. Files under 1 GB will always be immediately available. Files under 100 GB may be immediately available only if they are accessed at least once every four months. Otherwise, files may be stored on Infrequent Access Storage (IAS) and may take up to 24 hours to retrieve.
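The availability rules above can be sketched as a small decision function. This is only an illustration of the policy as worded in this FAQ, not code used by the repository; the function name, the labels it returns, and the approximation of "four months" as 120 days are all assumptions.

```python
def expected_availability(size_gb, days_since_last_access):
    """Return a rough availability label for a file, following the FAQ wording."""
    four_months_days = 120  # assumption: "four months" approximated as 120 days

    # Files under 1 GB are always immediately available.
    if size_gb < 1:
        return "immediate"

    # Files under 100 GB may stay immediately available if accessed recently.
    if size_gb < 100 and days_since_last_access <= four_months_days:
        return "likely immediate"

    # Everything else may sit on Infrequent Access Storage (IAS).
    return "possibly IAS (up to 24 hours to retrieve)"


print(expected_availability(0.5, 400))
print(expected_availability(50, 30))
print(expected_availability(50, 200))
```

If you expect people to need a large, rarely accessed file on short notice, it may be worth telling them to allow up to a day for retrieval.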


Can I make my files in CaltechDATA private?

Not indefinitely. You can embargo data for a specific period of time, but all data must be intended for public access at some point in the future.

Can I restrict data in CaltechDATA to specific users or groups?

No, CaltechDATA is an open repository.

Does my data need to be published before uploading to CaltechDATA?

No, you can upload unpublished data. If you wish, you can embargo the data until publication. You can easily link your data to publications by entering the DOI in the related publications field.

What metadata fields are used in CaltechDATA?

Our metadata is derived from the DataCite 4.0 schema, and is compliant with the Project Open Data 1.1 schema (used by US Federal Government Agencies). Our metadata includes an explicit related publications list.
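As a rough illustration, a record in the style of DataCite 4.0, including a related publications list, might be assembled like this. It is a minimal sketch: the field names follow the DataCite 4.0 schema, but the exact shape CaltechDATA accepts is an assumption, and the helper name, title, creator, and DOI are hypothetical placeholders.

```python
import re

# Loose check that a string looks like a DOI (for example 10.1234/abc).
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def build_metadata(title, creators, year, related_publication_dois=()):
    """Build a DataCite 4.0 style metadata dictionary (hypothetical helper)."""
    for doi in related_publication_dois:
        if not DOI_PATTERN.match(doi):
            raise ValueError(f"Not a valid DOI: {doi}")

    return {
        "titles": [{"title": title}],
        "creators": [{"creatorName": name} for name in creators],
        "publicationYear": str(year),
        "publisher": "CaltechDATA",
        "resourceType": {"resourceTypeGeneral": "Dataset"},
        # Related publications are expressed as related identifiers.
        "relatedIdentifiers": [
            {
                "relatedIdentifier": doi,
                "relatedIdentifierType": "DOI",
                "relationType": "IsSupplementTo",
            }
            for doi in related_publication_dois
        ],
    }


# Placeholder values only; none of these identify a real record.
record = build_metadata(
    "Example mineral spectra", ["Doe, Jane"], 2019, ["10.1234/example.5678"]
)
print(record["relatedIdentifiers"][0]["relatedIdentifier"])
```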

What can I do with data in CaltechDATA?

You can use the read API to load files in CaltechDATA into another application. For example, we have a demo web application that lets you interactively plot two mineral spectra files, and you can view the API code that generates the demo. Feel free to send us an email at data [at] if you'd like help integrating the repository with your application.

Is there any charge for storing data in CaltechDATA?

Groups get 500 GB of free storage in CaltechDATA, provided by the Caltech Library. Groups planning to deposit more than 500 GB of data should email us at data [at] to discuss their options.

Can I make changes to a record in CaltechDATA?

Yes, once you're logged in you should see an edit button appear for all records you created. This allows you to edit the metadata in the record. If you need to change the files associated with a record, send us an email at data [at]

How do I use the CaltechDATA API to create records?

First, you need to generate an access token. Log into CaltechDATA, and then click on your user menu (the person icon in the upper right hand corner). Then click "Applications".

Click on the "+ New Token" button in the Personal access tokens section.

Make up a name for your token and check all of the scope buttons.

Your token will be shown on screen. Copy it down and store it somewhere secure. It functions just like an account password.  

You can create records using our Python library, caltechdata_api. To install it, download the source code of the latest release, extract the file, and navigate to the caltechdata_api-x.x.x directory on the command line. Then type 'python setup.py install' to install the library.

To use the library, you'll need to set the access token you just created. Type 'export TINDTOK=TOKEN', where TOKEN is replaced by your actual token - or use the token.bash script that is distributed with the library.
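In your own scripts, you might read the token from the TINDTOK environment variable rather than pasting it into code. The sketch below shows one way to do that; the helper name load_token is hypothetical and is not part of caltechdata_api.

```python
import os


def load_token(env=None):
    """Return the access token stored in TINDTOK, or fail with a clear message."""
    # Allow a custom mapping to be passed in, which makes the helper easy to test.
    env = os.environ if env is None else env
    token = env.get("TINDTOK", "").strip()
    if not token:
        raise RuntimeError(
            "TINDTOK is not set; run 'export TINDTOK=TOKEN' with your token first"
        )
    return token


# Example with an explicit mapping instead of the real environment.
print(load_token({"TINDTOK": "my-secret-token"}))
```

Because the token functions like a password, reading it from the environment keeps it out of files you might share or commit.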

Some scripts for creating more complex data records are located in the caltechdata_migrate repository. An example that publishes a Mercurial repository to CaltechDATA is available at caltechdata_hg.

We're also here to help - just send us an email at data [at]

Does the CaltechDATA repository provide private links for journal peer reviewers to use before publication?

A peer review access password/link is the most requested feature for CaltechDATA. Our development partner is working on implementing the feature, but we do not have a schedule for when it will be available.  

For now, we recommend submitting the files to CaltechDATA with an embargo. This will generate a permanent DOI link that you can include in the paper text or references. The reviewers will be able to follow the DOI and see that the files have been uploaded, but they will not be able to download them. You should also upload the files to a file-sharing service where you can generate a link (or password) to give the reviewers so they can access all of the files.

How do I get a DOI badge on my GitHub repository?

First, submit your software to CaltechDATA, or to Zenodo if you're not affiliated with Caltech. Once you log in, select GitHub from the profile menu in the upper right-hand side of the screen. You'll need to enter your GitHub credentials and then select the repository of interest (you might have to refresh the page for the buttons to activate). Your software will be preserved only when you make a release in GitHub.

Different formats of the GitHub badge will appear in the GitHub section of CaltechDATA or Zenodo once you make your first software release. If you want the DOI badge to appear with your first release, find your GitHub repo id at: (swap out your repo name).
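One way to look up a repository id programmatically is the public GitHub REST API, whose repository endpoint returns a JSON document with a numeric id field. The sketch below assumes that endpoint and may not match the link this FAQ originally pointed to; the function names are hypothetical.

```python
import json
import urllib.request


def repo_api_url(owner, repo):
    """Build the GitHub REST API URL for a repository."""
    return f"https://api.github.com/repos/{owner}/{repo}"


def extract_repo_id(response_text):
    """Pull the numeric repository id out of the API's JSON response."""
    return json.loads(response_text)["id"]


def fetch_repo_id(owner, repo):
    """Request the repository description and return its id (needs network access)."""
    with urllib.request.urlopen(repo_api_url(owner, repo)) as response:
        return extract_repo_id(response.read().decode("utf-8"))


# Offline demonstration with a made-up response body.
print(extract_repo_id('{"id": 12345, "name": "example"}'))
```

Calling fetch_repo_id with your own account and repository name should return the number to use in the badge.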

Copy the Markdown snippet below to your README file and replace the two long number sequences with your GitHub repo id:


The badge will not appear until you make a release in GitHub and a DOI is assigned.

Who should be listed as an author in CaltechDATA?

The authors of a dataset are "The main researchers involved in producing the data", according to the DataCite specification.  The authors listed in CaltechDATA do not need to match the authors in a related publication.  You can list other individuals who assisted with a research project as contributors. 

How can I test making records in CaltechDATA?

You should not conduct testing in CaltechDATA, as submitting a record will automatically generate a DOI that cannot be deleted. We have a separate instance for testing — send us an email at data [at] so we can set up a test account for you.

How does CaltechDATA ensure that files will be available over time?

CaltechDATA is designed as a long-term data repository and it is expected that data will be available for the foreseeable future. This service is run by TIND and managed by the Caltech Library, which has a decade-plus history of preserving and making digital content available online. All records are assigned a digital object identifier (DOI), which provides a persistent machine-readable web link to the data set. All data is preserved using the OAIS reference model and multiple copies of all files are stored in different locations using varied storage approaches.

How do I preserve a GitHub repository in CaltechDATA?

Log into CaltechDATA with your Caltech (access.caltech) account credentials. Then select GitHub from the profile menu in the upper right hand side of the screen.

You'll need to enter your GitHub credentials and then select the repository of interest (you might have to refresh the page for the buttons to activate). Your software will be preserved only when you make a release in GitHub.  

If you want more control over the metadata in CaltechDATA, you should add a codemeta.json file to your GitHub repository. You can see an example in one of our library repositories, which generates a CaltechDATA record that includes multiple authors with ORCIDs, custom keywords, and license information.
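For orientation, a small codemeta.json with authors, ORCIDs, keywords, and a license could be generated as follows. This is a sketch based on the CodeMeta 2.0 vocabulary; which fields CaltechDATA actually reads is an assumption, the helper name is hypothetical, and the author shown uses the fictitious example ORCID published by ORCID.

```python
import json


def build_codemeta(name, authors, keywords, license_url):
    """Return a CodeMeta 2.0 style dictionary (hypothetical helper).

    authors is a list of (given_name, family_name, orcid) tuples.
    """
    return {
        "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
        "@type": "SoftwareSourceCode",
        "name": name,
        "author": [
            {
                "@type": "Person",
                "givenName": given,
                "familyName": family,
                "@id": f"https://orcid.org/{orcid}",
            }
            for given, family, orcid in authors
        ],
        "keywords": list(keywords),
        "license": license_url,
    }


codemeta = build_codemeta(
    "example-tool",
    [("Josiah", "Carberry", "0000-0002-1825-0097")],
    ["spectra", "plotting"],
    "https://spdx.org/licenses/BSD-3-Clause",
)
# Save this text as codemeta.json in the root of the repository.
print(json.dumps(codemeta, indent=2))
```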

How do I make my GitHub repository interactive using Binder?

Binder is an open-source service for making GitHub repos interactive. With the click of a button, users get a virtual compute environment where they can run your code and reproduce your results. It's also a great way to test whether you've defined all the dependencies for your project. How you set up Binder depends on your programming language.


See a demo. One approach for running a Jupyter notebook is to add a requirements.txt file to your GitHub repo listing all your Python dependencies and versions. If you need libraries that aren't installed by pip, you can also add an apt.txt file with any Linux (Debian) packages to install. Add a binder button to your repo by copying the following to your README file:


and swapping out 'github' for your GitHub account and 'repo' for your GitHub repo name. If you want a specific Jupyter notebook to launch, add '?filepath=tccon-plotting.ipynb' to the end of the URL.
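The README line can also be generated. The sketch below assumes the public mybinder.org URL scheme (v2/gh/account/repo/ref) and its standard badge image; the function name is hypothetical, and the default ref of "master" is an assumption, so use whichever branch or tag your repository actually has.

```python
from urllib.parse import quote


def binder_badge(owner, repo, ref="master", filepath=None):
    """Return Markdown for a Binder launch badge (hypothetical helper)."""
    url = f"https://mybinder.org/v2/gh/{owner}/{repo}/{ref}"
    if filepath:
        # Open a specific notebook instead of the file browser.
        url += "?filepath=" + quote(filepath)
    return f"[![Binder](https://mybinder.org/badge_logo.svg)]({url})"


print(binder_badge("github", "repo", filepath="tccon-plotting.ipynb"))
```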


See a demo. You can generate an RStudio session by adding an install.R file with your project dependencies to your GitHub repo. Then add a runtime.txt file with the date you last tested your code. Add a Binder button by copying the following to your README:


and swapping out 'AuthorCarpentry' for your GitHub account and 'repo' for your GitHub repo name.
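As a sketch of the RStudio setup, the two configuration files and the launch link could be produced like this. It assumes the Binder conventions that runtime.txt holds a line of the form r-YYYY-MM-DD and that appending urlpath=rstudio to the launch URL opens RStudio; the helper names, package names, and the "master" ref are illustrative assumptions.

```python
import datetime


def runtime_txt(last_tested):
    """Contents of runtime.txt for the date the code was last tested."""
    return f"r-{last_tested.isoformat()}\n"


def install_r(packages):
    """Contents of install.R that installs the listed R packages."""
    quoted = ", ".join(f'"{name}"' for name in packages)
    return f"install.packages(c({quoted}))\n"


def rstudio_url(owner, repo, ref="master"):
    """Binder launch URL that opens an RStudio session."""
    return f"https://mybinder.org/v2/gh/{owner}/{repo}/{ref}?urlpath=rstudio"


print(runtime_txt(datetime.date(2019, 4, 10)), end="")
print(install_r(["ggplot2", "dplyr"]), end="")
print(rstudio_url("AuthorCarpentry", "repo"))
```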

Many more configurations are described on the Binder support page.

How are usage statistics collected in CaltechDATA?

CaltechDATA tracks aggregate visits to a record page and downloads of files. We follow the COUNTER Code of Practice for Research Data Usage Metrics to define usage of our data files. Our reported statistics include only unique access, which is one or more downloads or views of content within a 1-hour time-window. 
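To make the "unique access" idea concrete, the sketch below collapses repeated views or downloads by the same visitor of the same record within the same hour into a single count. It illustrates the counting rule only and is not our production pipeline; the visitor identifier and the choice to bucket by clock hour are assumptions.

```python
from datetime import datetime


def count_unique_accesses(events):
    """Count unique accesses from (visitor_id, record_id, timestamp) tuples."""
    seen = set()
    for visitor_id, record_id, timestamp in events:
        # Truncate to the start of the hour so repeat hits in that hour collapse.
        hour_bucket = timestamp.replace(minute=0, second=0, microsecond=0)
        seen.add((visitor_id, record_id, hour_bucket))
    return len(seen)


events = [
    ("visitor-a", "record-1", datetime(2020, 1, 1, 10, 5)),
    ("visitor-a", "record-1", datetime(2020, 1, 1, 10, 45)),  # same hour, not counted again
    ("visitor-a", "record-1", datetime(2020, 1, 1, 12, 0)),   # later hour, counted
]
print(count_unique_accesses(events))  # 2
```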

All usage tracking is done via JavaScript using Matomo, so our usage statistics undercount actual usage. Automated downloads from applications like curl, and visits from individuals who disable JavaScript manually or via an ad blocker, are not included. Research has shown that this may account for up to 60% of repository usage. We hope to capture more of this usage in the future.

Our repository service provider, TIND, hosts Matomo on behalf of Caltech Library. Caltech Library shares aggregate usage statistics with DataCite, where they are available as standard Usage Reports and via the DataCite search service. We do not share raw usage statistics with third parties, and we do not connect usage with individual logins.

How do I get a Binder badge on my CaltechDATA record?

Binder badges allow users to instantly re-run software in CaltechDATA in their browser with a single click. Caltech Library staff will add a badge to your record after they verify that your submission builds correctly. You should test that your software works correctly on Binder from the original source (e.g. GitHub) before submitting the software to CaltechDATA. Once you've submitted working software, email data [at] to request a badge.

What Resource Type should I select in CaltechDATA?

Dataset and Software are the two most commonly used Resource Types in CaltechDATA. We provide the DataCite taxonomy of other Resource Types that may be appropriate for specific submissions.

  • InteractiveResource - A resource requiring interaction from the user to be understood. This is used for Virtual Reality content, training modules, or query/response portals.
  • Model - An abstract, conceptual, graphical, mathematical or visualization model that represents empirical objects, phenomena, or physical processes. This could be descriptions of languages or a molecular biology reaction chain.
  • Workflow - A structured series of steps which can be executed to produce a final outcome, allowing users a means to specify and enact their work in a reproducible manner.
  • Audiovisual - Movies with or without sound. If a record has movies and other files, it should be described as a Dataset or Workflow.
  • Image - Image files. If a record has image and other files, it should be described as a Dataset or Workflow.
  • Sound - Audio files. If a record has sound and other files, it should be described as a Dataset or Workflow.
  • Text - A resource consisting primarily of words for reading. CaltechAUTHORS is a better repository for most written content from Caltech.
  • Collection - An aggregation of resources that are already described with persistent identifiers. You can use this type for a record that contains links to other CaltechDATA records.

How can I display a video from CaltechDATA on my website?

We've developed a JavaScript-based solution to embed video content from CaltechDATA into any web site. You only need to know the DOI of the CaltechDATA record. Paste the following code into your website:

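<div id="videodiv1"></div>
<!-- The src below must point to the CL.js library -->
<script src=""></script>
<script>
    let div = document.getElementById("videodiv1"),
        doi = '10.22002/D1.1278',
        item_no = 0; // index of the video within the record
    CL.doi_video_player(div, doi, item_no);
</script>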

If your CaltechDATA record contains multiple videos, select the specific one you want to display using the item_no variable. Our CL.js function is general and will work with any DOI where the content provider has provided media information to DataCite. The underlying viewer is video.js, an open-source JavaScript viewer.

If you need to resize the video player, you can pass in the width and height like:

CL.doi_video_player(div, doi, item_no, width, height)

How do I handle data from a Contract Research Organization (CRO) in my publication? Do they become authors?

It’s important to separate the legal requirements of data ownership from the scientific norms of authorship. Legal ownership (copyright/trade secrets) of any of the data files or images should be laid out in the contract or policies of the CRO. While I’d expect that paying the CRO gives you full control over the data they provide, you should check the contract to be sure.

Authorship is a separate concept from the legal ownership of the data. It’s a good idea to maintain professional relationships by offering to collaborate on authoring the paper (this isn’t a legal requirement, just good interpersonal practice). The International Committee of Medical Journal Editors has a good definition of authorship that includes collecting data, drafting the work, and providing final approval of the publication. If the CRO does not want to be part of the writing or approval process, they should not be authors. Including them in an acknowledgment would be appropriate. Alternatively, you could release the files as a dataset through a data repository (like CaltechDATA or Zenodo). There you could include the CRO members as authors or contributors to the specific data files they collected, and then reference the data files in the publication.