
Datasets

Unity hosts a variety of commonly used public datasets so that Unity workloads can access them easily. All of Unity’s hosted datasets are in the /datasets directory, which is available from any Unity node.

Note: You can also view /datasets in the Open OnDemand file browser by navigating to the “/datasets” entry in the “Files” dropdown.
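To browse the directory from a script rather than the file browser, the following is a minimal sketch. It assumes the /datasets/<category>/<dataset> layout suggested by the paths on this page (for example /datasets/ai/coco); the function name list_datasets is hypothetical and not part of any Unity tooling. It lists only the first level under each category, so nested collections such as /datasets/bio/alphafold appear as a single entry.

```python
from pathlib import Path


def list_datasets(root="/datasets"):
    """Map each category directory under root to its sorted dataset directories.

    Assumes a root/<category>/<dataset> layout; returns {} if root is missing.
    """
    root_path = Path(root)
    if not root_path.is_dir():
        return {}
    catalog = {}
    for category in sorted(p for p in root_path.iterdir() if p.is_dir()):
        try:
            names = sorted(c.name for c in category.iterdir() if c.is_dir())
        except PermissionError:
            # Skip categories the current user is not allowed to read.
            continue
        catalog[category.name] = names
    return catalog


if __name__ == "__main__":
    # On a Unity node this walks /datasets; elsewhere it prints nothing.
    for category, names in list_datasets().items():
        print(f"{category}: {len(names)} entries")
```

Passing a different root (for example a scratch copy of a dataset tree) lets the same function be reused outside /datasets.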

To get information about each dataset, see the menu below.

AI and ML

AlpacaFarm

AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback
Path: /datasets/ai/alpaca-farm

audioset

AudioSet is an ontology and human-labeled dataset for audio event detection. It consists of 2,084,320 ten-second sound clips from YouTube videos labeled with a hierarchical ontology of 632 audio event classes, including human and animal sounds, musical instruments, and everyday environmental noises.
Path: /datasets/ai/audioset

bigcode

BigCode is an open scientific collaboration working on responsible training of large language models for coding applications
Path: /datasets/ai/bigcode

biomed_clip

BiomedCLIP is a biomedical vision-language foundation model that is pretrained on PMC-15M, a dataset of 15 million figure-caption pairs extracted from biomedical research articles in PubMed Central, using contrastive learning
Path: /datasets/ai/biomed-clip

blip_2

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
Path: /datasets/ai/blip

coco

COCO is a large-scale object detection, segmentation, and captioning dataset
Path: /datasets/ai/coco

Code Llama

Model for Code Llama LLM
Path: /datasets/ai/codellama/

DeepAccident

DeepAccident is the first V2X (vehicle-to-everything simulation) autonomous driving dataset that contains diverse collision accidents that commonly occur in real-world driving scenarios
Path: /datasets/ai/deep-accident

DeepSeek

DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities. Through RL, DeepSeek-R1-Zero naturally develops numerous powerful and intriguing reasoning behaviors
Path: /datasets/ai/deepseek

DINO v2

DINOv2 is a self-supervised method for learning visual representations
Path: /datasets/ai/dinov2

epic-kitchens

Epic-Kitchens-100 is a large-scale dataset in first-person (egocentric) vision; multi-faceted, audio-visual, non-scripted recordings in kitchen environments.
Path: /datasets/ai/epic-kitchens

florence

Florence-2 is an advanced vision foundation model that uses a prompt-based approach to handle a wide range of vision and vision-language tasks.
Path: /datasets/ai/florence

gemma

Gemma is a family of lightweight, state-of-the-art open models built from the research and technology used to create Gemini models
Path: /datasets/ai/gemma

glm

ChatGLM. To date, the GLM-4 models are pre-trained on ten trillion tokens, mostly in Chinese and English, along with a small corpus from 24 languages, and aligned primarily for Chinese and English usage
Path: /datasets/ai/glm

gpt

Language models are unsupervised multitask learners
Path: /datasets/ai/gpt

gte-Qwen2

gte-Qwen2-7B-instruct is the latest model in the gte (General Text Embedding) model family. It ranked No. 1 in both English and Chinese evaluations on the Massive Text Embedding Benchmark (MTEB) as of June 16, 2024.
Path: /datasets/ai/alibaba

ibm-granite

Granite 3.0, a new set of lightweight, state-of-the-art, open foundation models ranging in scale from 400 million to 8 billion active parameters
Path: /datasets/ai/ibm-granite

Idefics2

Idefics2 is an open multimodal model that accepts arbitrary sequences of image and text inputs
Path: /datasets/ai/idefics2

Imagenet 1K

Imagenet 1K dataset
Path: /datasets/ai/imagenet/

inaturalist

The iNaturalist 2017 dataset (iNat) contains 675,170 training and validation images from 5,089 natural fine-grained categories
Path: /datasets/ai/inaturalist

infly

INF-Retriever-v1 is an LLM-based dense retrieval model developed by INF TECH. It is built upon the gte-Qwen2-7B-instruct model and specifically fine-tuned to excel in retrieval tasks, particularly for Chinese and English data
Path: /datasets/ai/infly

internLM

InternLM2, an open-source LLM that outperforms its predecessors in comprehensive evaluations across 6 dimensions and 30 benchmarks, long-context modeling, and open-ended subjective evaluations through innovative pre-training and optimization techniques
Path: /datasets/ai/blip

intfloat

A novel and simple method for obtaining high-quality text embeddings using only synthetic data and less than 1k training steps
Path: /datasets/ai/intfloat

lg

Large Language Models (LLMs) and Large Multimodal Models (LMMs) developed by LG AI Research. EXAONE stands for EXpert AI for EveryONE, a vision that LG is committed to realizing
Path: /datasets/ai/lg

linq

Linq-Embed-Mistral has been developed by building upon the foundations of the E5-mistral-7b-instruct and Mistral-7B-v0.1 models
Path: /datasets/ai/intfloat

Llama2

Models for Llama 2 LLM
Path: /datasets/ai/llama2/

llama3

Llama 3 is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage
Path: /datasets/ai/llama3

llama4

Llama 4, developed by Meta, introduces a new auto-regressive Mixture-of-Experts (MoE) architecture
Path: /datasets/ai/llama4/

Llava_OneVision

LLaVA-OneVision: Easy Visual Task Transfer
Path: /datasets/ai/llava

Lumina

Lumina-Image 2.0: A Unified and Efficient Image Generative Framework
Path: /datasets/ai/lumina

mims

TxAgent, an AI agent that leverages multi-step reasoning and real-time biomedical knowledge retrieval across a toolbox of 211 tools to analyze drug interactions, contraindications, and patient-specific treatment strategies
Path: /datasets/ai/mims

mixtral

Model for Laion 2 (2B)
Path: /datasets/ai/mixtral/

msmarco

The MS MARCO dataset is a large-scale information retrieval benchmark that uses real-world questions from Bing’s search queries to evaluate the performance of machine learning models in generating answers
Path: /datasets/ai/msmarco

natural-questions

Natural Questions corpus, a question answering data set. Questions consist of real anonymized, aggregated queries issued to the Google search engine
Path: /datasets/ai/natural-questions

objaverse

Objaverse is a massive dataset with 800K+ annotated 3D objects
Path: /datasets/ai/objaverse

openai-whisper

Whisper is a state-of-the-art model for automatic speech recognition (ASR) and speech translation
Path: /datasets/ai/whisper

Perplexity AI

R1-1776, Perplexity AI
Path: /datasets/ai/perplexity

phi

Phi-3.5-mini is a lightweight, state-of-the-art open model built upon datasets used for Phi-3
Path: /datasets/ai/phi

playgroundai

A model that generates highly aesthetic images of resolution 1024x1024, as well as portrait and landscape aspect ratios
Path: /datasets/ai/playgroundai

pythia

Pythia is the first LLM suite designed specifically to enable scientific research on LLMs
Path: /datasets/ai/pythia

qwen

Qwen2 is the new series of Qwen large language models. For Qwen2, we release a number of base language models and instruction-tuned language models ranging from 0.5 to 72 billion parameters, including a Mixture-of-Experts model
Path: /datasets/ai/qwen

rag-sequence-nq

RAG models where the parametric memory is a pre-trained seq2seq model and the non-parametric memory is a dense vector index of Wikipedia, accessed with a pre-trained neural retriever
Path: /datasets/ai/rag-sequence-nq

s1-32B

s1 is a reasoning model finetuned from Qwen2.5-32B-Instruct on just 1,000 examples. It matches o1-preview and exhibits test-time scaling via budget forcing.
Path: /datasets/ai/simplescaling

satlas_pretrain

SatlasPretrain, a remote sensing dataset that is large in both breadth and scale, combining Sentinel-2 and NAIP images with 302M labels under 137 categories and seven label types
Path: /datasets/ai/allenai

stabilityai

A novel transformer-based architecture for text-to-image generation that uses separate weights for the two modalities and enables a bidirectional flow of information between image and text tokens
Path: /datasets/ai/stabilityai/

sft

A sentence-transformers model finetuned from sentence-transformers/all-mpnet-base-v2
Path: /datasets/ai/sft

SlimPajama

SlimPajama is a rigorously deduplicated, multi-source dataset, which has been refined and further deduplicated to 627B tokens from the extensive 1.2T token RedPajama dataset contributed by Together
Path: /datasets/ai/slim-pajama

t5

The T5 model, short for Text-to-Text Transfer Transformer, is a machine learning model developed by Google
Path: /datasets/ai/t5

Tulu

Tülu 3: Pushing Frontiers in Open Language Model Post-Training
Path: /datasets/ai/tulu

V2X

V2X-Sim, a comprehensive simulated multi-agent perception dataset for V2X-aided autonomous driving
Path: /datasets/ai/v2x

video-MAE

Video masked autoencoder (VideoMAE) is a scalable and general self-supervised pre-trainer for building video foundation models
Path: /datasets/ai/opengvlab

videoMAE-v2

Video masked autoencoder (VideoMAE) is a scalable and general self-supervised pre-trainer for building video foundation models
Path: /datasets/ai/opengvlab

vit

The Vision Transformer (ViT) model uses the transformer architecture to process image patches for tasks like image classification
Path: /datasets/ai/vit

wildchat

WildChat, a corpus of 1 million user-ChatGPT conversations, which consists of over 2.5 million interaction turns
Path: /datasets/ai/wildchat

Bioinformatics

AlphaFold3 Databases

Protein structure and sequence databases used with AlphaFold3, an updated version of AlphaFold capable of predicting the structure and interactions of biomolecules.
Path: /datasets/bio/alphafold3

BFD/MGnify

BFD/MGnify is a database built for ColabFold by combining the Big Fantastic Database (BFD) with the MGnify database.
Path: /datasets/bio/colabfold/bfd_mgy_colabfold

Big Fantastic Database

Big Fantastic Database (BFD) is a protein sequence database. BFD was created by clustering 2.5 billion protein sequences from Uniprot/TrEMBL+Swissprot, Metaclust, Soil Reference Catalog and Marine Eukaryotic Reference Catalog. It consists of over 65M protein families represented as multiple sequence alignments and hidden Markov models. BFD was built using the Uniclust pipeline and is one of the protein sequence databases used with AlphaFold.
Path: /datasets/bio/alphafold/bfd

checkm

Database associated with CheckM, a tool for assessing the quality of genomes recovered from isolates, single cells, or metagenomes.
Path: /datasets/bio/checkm/

ColabFoldDB

ColabFoldDB is a protein database built for ColabFold by extending BFD/MGnify with additional metagenomic protein catalogs containing eukaryotic proteins, phage catalogs and an updated version of MetaClust.
Path: /datasets/bio/colabfold/colabfold_envdb_202108

Databases for ColabFold

“Databases built in MMseqs2 format to be used with ColabFold. The databases include PDB70 (version 220313), UniRef70 (versions 2103 and 2202), BFD/MGnify and the environmental database ColabFoldDB (version 202108)”
Path: /datasets/bio/colabfold

dfam

Dfam is a database of Transposable Element DNA sequence alignments, hidden Markov Models (HMMs), consensus sequences, and genome annotations.
Path: /datasets/bio/dfam/

EggNOG

EggNOG is a database of orthology relationships, functional annotation, and gene evolutionary histories. The EggNOG database is used with EggNOG-mapper, a tool for functional annotation of novel sequences.
Path: /datasets/bio/eggnog-data/

EggNOG

EggNOG is a database of orthology relationships, functional annotation, and gene evolutionary histories. The EggNOG database is used with EggNOG-mapper, a tool for functional annotation of novel sequences.
Path: /datasets/bio/eggnog6-data/

GMAP-GSNAP database (human genome)

The programs GMAP (Genomic Mapping and Alignment Program) and GSNAP (Genomic Short-read Nucleotide Alignment Program) align RNA and DNA sequences from next-generation sequencing data to a genome reference sequence. The GMAP-GSNAP human genomic database available on Unity was built using the human genome assembly GRCh38.p14 (NCBI RefSeq assembly GCF_000001405.40).
Path: /datasets/bio/gmap-gsnap

GTDB

The Genome Taxonomy Database (GTDB) is a genome-based taxonomy for prokaryotic genomes collected from the NCBI RefSeq and GenBank Assembly databases.
Path: /datasets/bio/gtdb/

Illumina iGenomes

The Illumina iGenomes dataset is an assortment of genomes and annotation files (downloaded from UCSC, NCBI, or Ensembl) for commonly analyzed organisms.
Path: /datasets/bio/igenomes

Kraken2

Database for Kraken2, a tool that assigns taxonomic labels to DNA sequences. The database was built with the complete archaeal, bacterial and viral genomes downloaded from the NCBI Reference Sequence Database on July 22, 2024.
Path: /datasets/bio/kraken2

MGnify

MGnify is a database of non-redundant protein sequences predicted from metagenomic assemblies. MGnify is one of the protein sequence databases that can be used with AlphaFold.
Path: /datasets/bio/alphafold/mgnify

NCBI BLAST databases

National Center for Biotechnology Information (NCBI) databases provided in the format required for running the Basic Local Alignment Search Tool (BLAST) as well as the sequence aligner DIAMOND. The collection contains the nucleotide database, the non-redundant Reference Sequence protein database for archaeal and bacterial genomes, the Reference Sequence Prokaryotic Representative Genome Database and the Reference Sequence Eukaryotic Representative Genome Database. NCBI’s BLAST databases are downloaded weekly. See the dataset’s detail page for more information.
Path: /datasets/bio/ncbi-db/

NCBI RefSeq database

Complete archaeal, bacterial and viral genomes retrieved from the National Center for Biotechnology Information (NCBI) Reference Sequence Database.
Path: /datasets/bio/ncbi-refseq/

Parameters of AlphaFold

AlphaFold is a deep learning model designed to predict the 3D structure of proteins.
Path: /datasets/bio/alphafold/params

Parameters of Evolutionary Scale Modeling (ESM) models

ESM models are a group of transformer protein language models designed to predict variant effects on protein function, protein sequences from backbone atom coordinates or protein structures from primary sequences.
Path: /datasets/bio/esm/

PDB70

PDB70 is a protein database that contains profile hidden Markov models for a representative set of protein sequences from the Protein Data Bank database filtered with a maximum pairwise sequence identity of 70%. PDB70 can be used with AlphaFold.
Path: /datasets/bio/alphafold/pdb70

PINDER

PINDER, or Protein Interaction Dataset and Evaluation Resource, is a dataset and resource for training and evaluation of protein-protein docking algorithms.
Path: /datasets/bio/pinder

PLINDER

PLINDER, or Protein Ligand Interactions Dataset and Evaluation Resource, is a comprehensive, annotated, high-quality dataset and resource for training and evaluation of protein-ligand docking algorithms.
Path: /datasets/bio/plinder

Protein Data Bank

Protein sequences from the Protein Data Bank in CIF format.
Path: /datasets/bio/colabfold/pdb

Protein Data Bank database in mmCIF format

Protein sequences from the Protein Data Bank in mmCIF format.
Path: /datasets/bio/alphafold/pdb_mmcif

Protein Data Bank database in SEQRES records

Protein sequences from the Protein Data Bank in SEQRES records. SEQRES records contain the amino acid sequence of residues in each chain of the proteins.
Path: /datasets/bio/alphafold/pdb_seqres

Tara Oceans 18S amplicon

18S amplicon sequencing data from the Tara Oceans expedition (2009-2013) DNA samples corresponding to size fractions for protists. The sequence files were downloaded from the European Nucleotide Archive under project number PRJEB6610.
Path: /datasets/bio/tara-oceans/18S-amplicon

Tara Oceans MATOU gene catalog

Reference collection of expressed eukaryotic genes called Marine Atlas of Tara Oceans Unigenes (MATOU), obtained with the TARA Oceans expedition (2009-2013) samples.
Path: /datasets/bio/tara-oceans/MATOU-gene-catalog

Tara Oceans MGT transcriptomes

Collection of metagenomics-based transcriptomes (MGTs) of eukaryotic marine plankton communities obtained with the TARA Oceans expedition (2009-2013) samples.
Path: /datasets/bio/tara-oceans/MGT-transcriptomes

Uniclust30

Uniclust30 is a database of annotated protein sequences and alignments. It is built by clustering the sequences in UniProt Knowledgebase (UniProtKB) at the level of 30% pairwise sequence identity. Uniclust30 can be used with AlphaFold.
Path: /datasets/bio/alphafold/uniclust30

UniProtKB

The UniProt Knowledgebase (UniProtKB) is a database of protein sequences consisting of two sections called UniProtKB/Swiss-Prot and UniProtKB/TrEMBL. UniProtKB/Swiss-Prot contains manually annotated and non-redundant protein sequence records while UniProtKB/TrEMBL contains computationally analyzed and unreviewed protein sequence records.
Path: /datasets/bio/alphafold/uniprot

UniRef100

UniRef100 is a database of protein sequences from UniProtKB and selected UniParc records.
Path: /datasets/bio/uniref100

UniRef30

UniRef30 is a database of protein sequences built for ColabFold by clustering UniRef100 sequences with 30% sequence identity.
Path: /datasets/bio/colabfold/uniref30_2103

UniRef90

UniRef90 is a database of protein sequences from UniProtKB and selected UniParc records. UniRef90 is built by clustering UniRef100 sequences such that each clustered set is composed of sequences that have at least 90% sequence identity to, and 80% overlap with, the longest sequence in the cluster.
Path: /datasets/bio/alphafold/uniref90

Updated databases for ColabFold

“Databases built in MMseqs2 format to be used with ColabFold. The databases include PDB100 (version 230517), UniRef30 (version 2302) and the environmental database ColabFoldDB (version 202108)”
Path: /datasets/bio/colabfold_new

Using HuggingFace Datasets

Last modified: Tuesday, April 15, 2025 at 11:38 AM.
University of Massachusetts Amherst, University of Rhode Island, University of Massachusetts Dartmouth, University of Massachusetts Lowell, University of Massachusetts Boston, Mount Holyoke College, Smith College