Datasets
Unity hosts a variety of commonly used public datasets so that Unity workloads can access them easily. You can find all of Unity’s hosted datasets in the /datasets directory from any Unity node.
You can also view /datasets in the Open OnDemand file browser by navigating to the “/datasets” entry in the “Files” dropdown. To get information about each dataset, see the menu below.
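For a quick look from a script rather than the file browser, the sketch below walks a dataset root and groups the dataset directories by category. It assumes the two-level layout suggested by the paths on this page (/datasets/<category>/<dataset>); the function name `list_datasets` is hypothetical and not part of any Unity tooling.

```python
from pathlib import Path


def list_datasets(root="/datasets"):
    """Return {category: [dataset directory names]} for a dataset root.

    Assumes a two-level layout such as /datasets/ai/coco; plain files at
    either level are ignored.
    """
    root_path = Path(root)
    catalog = {}
    for category in sorted(p for p in root_path.iterdir() if p.is_dir()):
        catalog[category.name] = sorted(
            child.name for child in category.iterdir() if child.is_dir()
        )
    return catalog


if __name__ == "__main__":
    # On a Unity node this prints one line per category, e.g. "ai: ...".
    for category, names in list_datasets().items():
        print(f"{category}: {', '.join(names)}")
```

Passing a different `root` lets you try the function on any directory tree before running it on a Unity node.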
AI and ML
AlpacaFarm
AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback
/datasets/ai/alpaca-farm
biomed_clip
BiomedCLIP is a biomedical vision-language foundation model that is pretrained on PMC-15M, a dataset of 15 million figure-caption pairs extracted from biomedical research articles in PubMed Central, using contrastive learning
/datasets/ai/biomed-clip
blip_2
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
/datasets/ai/blip2
coco
COCO is a large-scale object detection, segmentation, and captioning dataset
/datasets/ai/coco
Code Llama
DeepAccident
DeepAccident is the first V2X (vehicle-to-everything simulation) autonomous driving dataset that contains diverse collision accidents that commonly occur in real-world driving scenarios
/datasets/ai/deep-accident
DeepSeek
DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities. Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing reasoning behaviors.
/datasets/ai/deepseek
DINO v2
DINOv2 is a self-supervised method to learn visual representation
/datasets/ai/dinov2
epic-kitchens
Epic-Kitchens-100 is a large-scale dataset in first-person (egocentric) vision; multi-faceted, audio-visual, non-scripted recordings in kitchen environments.
/datasets/ai/epic-kitchens
gemma
Gemma is a family of lightweight, state-of-the-art open models built from the research and technology used to create Gemini models
/datasets/ai/gemma
gte-Qwen2
gte-Qwen2-7B-instruct is the latest model in the gte (General Text Embedding) model family that ranks No.1 in both English and Chinese evaluations on the Massive Text Embedding Benchmark (MTEB) (as of June 16, 2024).
/datasets/ai/alibaba
ibm-granite
Granite 3.0, a new set of lightweight, state-of-the-art, open foundation models ranging in scale from 400 million to 8 billion active parameters
/datasets/ai/ibm-granite
Idefics2
Idefics2 is an open multimodal model that accepts arbitrary sequences of image and text inputs
/datasets/ai/idefics2
Imagenet 1K
inaturalist
The iNaturalist 2017 dataset (iNat) contains 675,170 training and validation images from 5,089 natural fine-grained categories
/datasets/ai/inaturalist
infly
INF-Retriever-v1 is an LLM-based dense retrieval model developed by INF TECH. It is built upon the gte-Qwen2-7B-instruct model and specifically fine-tuned to excel in retrieval tasks, particularly for Chinese and English data
/datasets/ai/infly
instruct-blip
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
/datasets/ai/instruct-blip
intfloat
A novel and simple method for obtaining high-quality text embeddings using only synthetic data and less than 1k training steps
/datasets/ai/intfloat
LAION
linq
Linq-Embed-Mistral has been developed by building upon the foundations of the E5-mistral-7b-instruct and Mistral-7B-v0.1 models
/datasets/ai/intfloat
llama
LLaMA, a collection of foundation language models ranging from 7B to 65B parameters
/datasets/ai/llama
Llama2
llama3
Llama 3 is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage
/datasets/ai/llama3
Llava_OneVision
mixtral
msmarco
The MS MARCO dataset is a large-scale information retrieval benchmark that uses real-world questions from Bing’s search queries to evaluate the performance of machine learning models in generating answers
/datasets/ai/msmarco
natural-questions
Natural Questions corpus, a question answering data set. Questions consist of real anonymized, aggregated queries issued to the Google search engine
/datasets/ai/natural-questions
objaverse
Objaverse is a Massive Dataset with 800K+ Annotated 3D Objects
/datasets/ai/objaverse
qwen
Qwen2 is the new series of Qwen large language models. For Qwen2, we release a number of base language models and instruction-tuned language models ranging from 0.5 to 72 billion parameters, including a Mixture-of-Experts model
/datasets/ai/qwen
R1-1776
rag-sequence-nq
RAG models where the parametric memory is a pre-trained seq2seq model and the non-parametric memory is a dense vector index of Wikipedia, accessed with a pre-trained neural retriever
/datasets/ai/rag-sequence-nq
red-pajama-v2
RedPajama-V2, a massive web-only dataset consisting of raw, unfiltered text data together with quality signals and metadata
/datasets/ai/red-pajama-v2
s1-32B
s1 is a reasoning model finetuned from Qwen2.5-32B-Instruct on just 1,000 examples. It matches o1-preview & exhibits test-time scaling via budget forcing.
/datasets/ai/simplescaling
satlas_pretrain
SatlasPretrain, a remote sensing dataset that is large in both breadth and scale, combining Sentinel-2 and NAIP images with 302M labels under 137 categories and seven label types
/datasets/ai/allenai
SlimPajama
SlimPajama is a rigorously deduplicated, multi-source dataset, which has been refined and further deduplicated to 627B tokens from the extensive 1.2T token RedPajama dataset contributed by Together
/datasets/ai/slim-pajama
t5
The T5 model, short for Text-to-Text Transfer Transformer, is a machine learning model developed by Google
/datasets/ai/t5
V2X
V2X-Sim, a comprehensive simulated multi-agent perception dataset for V2X-aided autonomous driving
/datasets/ai/v2x
vit
The Vision Transformer (ViT) model uses the transformer architecture to process image patches for tasks like image classification
/datasets/ai/vit
wildchat
WildChat, a corpus of 1 million user-ChatGPT conversations, which consists of over 2.5 million interaction turns
/datasets/ai/wildchat
Bioinformatics
AlphaFold3 Databases
Protein structure and sequence databases used with AlphaFold3, an updated version of AlphaFold capable of predicting the structure and interactions of biomolecules.
/datasets/bio/alphafold3
BFD/MGnify
BFD/MGnify is a database built for ColabFold by combining the Big Fantastic Database (BFD) with the MGnify database.
/datasets/bio/colabfold/bfd_mgy_colabfold
Big Fantastic Database
Big Fantastic Database (BFD) is a protein sequence database. BFD was created by clustering 2.5 billion protein sequences from Uniprot/TrEMBL+Swissprot, Metaclust, Soil Reference Catalog and Marine Eukaryotic Reference Catalog. It consists of over 65M protein families represented as multiple sequence alignments and hidden Markov models. BFD was built using the Uniclust pipeline and is one of the protein sequence databases used with AlphaFold.
/datasets/bio/alphafold/bfd
checkm
Database associated with CheckM, a tool for assessing the quality of genomes recovered from isolates, single cells, or metagenomes.
/datasets/bio/checkm/
ColabFoldDB
ColabFoldDB is a protein database built for ColabFold by extending BFD/MGnify with additional metagenomic protein catalogs containing eukaryotic proteins, phage catalogs and an updated version of MetaClust.
/datasets/bio/colabfold/colabfold_envdb_202108
dfam
Dfam is a database of Transposable Element DNA sequence alignments, hidden Markov Models (HMMs), consensus sequences, and genome annotations.
/datasets/bio/dfam/
EggNOG
EggNOG is a database of orthology relationships, functional annotation, and gene evolutionary histories. The EggNOG database is used with EggNOG-mapper, a tool for functional annotation of novel sequences.
/datasets/bio/eggnog-data/
EggNOG
EggNOG is a database of orthology relationships, functional annotation, and gene evolutionary histories. The EggNOG database is used with EggNOG-mapper, a tool for functional annotation of novel sequences.
/datasets/bio/eggnog6-data/
gmap
GTDB
The Genome Taxonomy Database (GTDB) is a genome-based taxonomy for prokaryotic genomes collected from the NCBI RefSeq and GenBank Assembly databases.
/datasets/bio/gtdb/
igenomes
Kraken2
Database for Kraken2, a tool that assigns taxonomic labels to DNA sequences. The database was built with the complete archaeal, bacterial and viral genomes downloaded from the NCBI Reference Sequence Database on July 22nd 2024.
/datasets/bio/kraken2
MGnify
MGnify is a database of non-redundant protein sequences predicted from metagenomic assemblies. MGnify is one of the protein sequence databases that can be used with AlphaFold.
/datasets/bio/alphafold/mgnify
NCBI BLAST databases
National Center for Biotechnology Information (NCBI) database presented in the format required for running Basic Local Alignment Search Tool (BLAST) as well as the sequence aligner DIAMOND. It contains the nucleotide database, the non-redundant Reference Sequence protein database for archaeal and bacterial genomes, the Reference Sequence Prokaryotic Representative Genome Database and the Reference Sequence Eukaryotic Representative Genome Database. NCBI’s BLAST databases are downloaded weekly. See the full details for more information.
/datasets/bio/ncbi-db/
NCBI RefSeq database
Complete archaeal, bacterial and viral genomes retrieved from the National Center for Biotechnology Information (NCBI) Reference Sequence Database.
/datasets/bio/ncbi-refseq/
params
PDB70
PDB70 is a protein database that contains profile hidden Markov models for a representative set of protein sequences from the Protein Data Bank database filtered with a maximum pairwise sequence identity of 70%. PDB70 can be used with AlphaFold.
/datasets/bio/alphafold/pdb70
PDB70 for ColabFold
PDB70 database (see PDB70 database) built in MMseqs2 format to be used with ColabFold.
/datasets/bio/colabfold
PINDER
PINDER, or Protein Interaction Dataset and Evaluation Resource, is a dataset and resource for training and evaluation of protein-protein docking algorithms.
/datasets/bio/pinder
PLINDER
PLINDER, or Protein Ligand Interactions Dataset and Evaluation Resource, is a comprehensive, annotated, high-quality dataset and resource for training and evaluation of protein-ligand docking algorithms.
/datasets/bio/plinder
Protein Data Bank
Protein sequences from the Protein Data Bank in CIF format.
/datasets/bio/colabfold/pdb
Protein Data Bank database in mmCIF format
Protein sequences from the Protein Data Bank in mmCIF format.
/datasets/bio/alphafold/pdb_mmcif
Protein Data Bank database in SEQRES records
Protein sequences from the Protein Data Bank in SEQRES records. SEQRES records contain the amino acid sequence of residues in each chain of the proteins.
/datasets/bio/alphafold/pdb_seqres
Tara Oceans 18S amplicon
18S amplicon sequencing data from the Tara Oceans expedition (2009-2013) DNA samples corresponding to size fractions for protists. The sequence files were downloaded from the European Nucleotide Archive under project number PRJEB6610.
/datasets/bio/tara-oceans/18S-amplicon
Tara Oceans MATOU gene catalog
Reference collection of expressed eukaryotic genes called Marine Atlas of Tara Oceans Unigenes (MATOU), obtained with the TARA Oceans expedition (2009-2013) samples.
/datasets/bio/tara-oceans/MATOU-gene-catalog
Tara Oceans MGT transcriptomes
Collection of metagenomics-based transcriptomes (MGTs) of eukaryotic marine plankton communities obtained with the TARA Oceans expedition (2009-2013) samples.
/datasets/bio/tara-oceans/MGT-transcriptomes
Uniclust30
Uniclust30 is a database of annotated protein sequences and alignments. It is built by clustering the sequences in UniProt Knowledgebase (UniProtKB) at the level of 30% pairwise sequence identity. Uniclust30 can be used with AlphaFold.
/datasets/bio/alphafold/uniclust30
UniProtKB
The UniProt Knowledgebase (UniProtKB) is a database of protein sequences consisting of two sections called UniProtKB/Swiss-Prot and UniProtKB/TrEMBL. UniProtKB/Swiss-Prot contains manually annotated and non-redundant protein sequence records while UniProtKB/TrEMBL contains computationally analyzed and unreviewed protein sequence records.
/datasets/bio/alphafold/uniprot
UniRef100
UniRef100 is a database of protein sequences from UniProtKB and selected UniParc records.
/datasets/bio/uniref100
UniRef30
UniRef30 is a database of protein sequences built for ColabFold by clustering UniRef100 sequences with 30% sequence identity.
/datasets/bio/colabfold/uniref30_2103
UniRef90
UniRef90 is a database of protein sequences from UniProtKB and selected UniParc records. UniRef90 is built by clustering UniRef100 sequences such that each clustered set is composed of sequences that have at least 90% sequence identity to, and 80% overlap with, the longest sequence in the cluster.
/datasets/bio/alphafold/uniref90
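Several entries above share the /datasets/bio/alphafold prefix, so a job that depends on them may want to confirm the directories are visible before it starts. The following is a minimal sketch of such a check, using only the paths listed on this page; the names `ALPHAFOLD_DATABASES` and `missing_paths` are hypothetical helpers, not part of AlphaFold or Unity.

```python
from pathlib import Path

# Paths copied from the catalog entries above; the dictionary itself is
# an illustrative convenience, not an official Unity configuration.
ALPHAFOLD_DATABASES = {
    "bfd": "/datasets/bio/alphafold/bfd",
    "mgnify": "/datasets/bio/alphafold/mgnify",
    "pdb70": "/datasets/bio/alphafold/pdb70",
    "pdb_mmcif": "/datasets/bio/alphafold/pdb_mmcif",
    "pdb_seqres": "/datasets/bio/alphafold/pdb_seqres",
    "uniclust30": "/datasets/bio/alphafold/uniclust30",
    "uniprot": "/datasets/bio/alphafold/uniprot",
    "uniref90": "/datasets/bio/alphafold/uniref90",
}


def missing_paths(paths):
    """Return the sorted names whose path is not an existing directory."""
    return sorted(
        name for name, path in paths.items() if not Path(path).is_dir()
    )


if __name__ == "__main__":
    missing = missing_paths(ALPHAFOLD_DATABASES)
    if missing:
        print("Not found on this node:", ", ".join(missing))
    else:
        print("All listed AlphaFold database directories are present.")
```

The same function works for any name-to-path mapping, for example the ColabFold paths listed above.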