AI and ML
Datasets and models commonly used in Artificial Intelligence (AI) and Machine Learning (ML) work.
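Each entry below gives a name, a short description, and a directory under /datasets/ai. As a minimal sketch of how such a catalog root could be enumerated programmatically (the `list_datasets` helper and the mount point are illustrative assumptions, not part of the catalog itself):

```python
from pathlib import Path

def list_datasets(root: str) -> list[str]:
    """Return the sorted names of dataset directories under a catalog root.

    Plain files (e.g. a README) are skipped; a missing root yields [].
    """
    base = Path(root)
    if not base.is_dir():
        return []
    return sorted(p.name for p in base.iterdir() if p.is_dir())

# Example (assumes the catalog is mounted at /datasets/ai):
# for name in list_datasets("/datasets/ai"):
#     print(name)
```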
AlpacaFarm
AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback.
/datasets/ai/alpaca-farm

amass
AMASS is a large database of human motion that unifies different optical marker-based motion capture datasets by representing them within a common framework and parameterization. It is readily useful for animation, visualization, and generating training data for deep learning.
/datasets/ai/amass

audioset
AudioSet is an ontology and human-labeled dataset for audio event detection. It consists of 2,084,320 ten-second sound clips from YouTube videos, labeled with a hierarchical ontology of 632 audio event classes that includes human and animal sounds, musical instruments, and everyday environmental noises.
/datasets/ai/audioset

bigcode
BigCode is an open scientific collaboration working on responsible training of large language models for coding applications.
/datasets/ai/bigcode

biomed_clip
BiomedCLIP is a biomedical vision-language foundation model pretrained with contrastive learning on PMC-15M, a dataset of 15 million figure-caption pairs extracted from biomedical research articles in PubMed Central.
/datasets/ai/biomed-clip

blip_2
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models.
/datasets/ai/blip

bloom
BLOOM is an autoregressive Large Language Model (LLM) trained to continue text from a prompt on vast amounts of text data using industrial-scale computational resources.
/datasets/ai/bloom

coco
COCO is a large-scale object detection, segmentation, and captioning dataset.
/datasets/ai/coco

Code Llama

DeepAccident
DeepAccident is the first V2X (vehicle-to-everything) autonomous driving simulation dataset containing the diverse collision accidents that commonly occur in real-world driving scenarios.
/datasets/ai/deep-accident

DeepSeek
DeepSeek-R1-Zero, trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities; through RL it naturally develops numerous powerful and intriguing reasoning behaviors.
/datasets/ai/deepseek

DINO v2
DINOv2 is a self-supervised method for learning visual representations.
/datasets/ai/dinov2

epic-kitchens
EPIC-KITCHENS-100 is a large-scale dataset in first-person (egocentric) vision: multi-faceted, audio-visual, non-scripted recordings in kitchen environments.
/datasets/ai/epic-kitchens

florence
Florence-2 is an advanced vision foundation model that uses a prompt-based approach to handle a wide range of vision and vision-language tasks.
/datasets/ai/florence

FLUX.1 Kontext

fomo
FOMO-60K is a large-scale dataset of brain MRI scans, including both clinical and research-grade scans, covering a wide range of sequences such as T1, MPRAGE, T2, T2*, FLAIR, SWI, T1c, PD, DWI, ADC, and more.
/datasets/ai/fomo

gemma
Gemma is a family of lightweight, state-of-the-art open models built from the research and technology used to create the Gemini models.
/datasets/ai/gemma

glm
ChatGLM. To date, the GLM-4 models have been pre-trained on roughly ten trillion tokens, mostly in Chinese and English, along with a small corpus covering 24 other languages, and aligned primarily for Chinese and English usage.
/datasets/ai/glm

gte-Qwen2
gte-Qwen2-7B-instruct is the latest model in the gte (General Text Embedding) family; it ranks No. 1 in both the English and Chinese evaluations of the Massive Text Embedding Benchmark (MTEB) as of June 16, 2024.
/datasets/ai/alibaba

HiDream-I1

ibm-granite
Granite 3.0 is a set of lightweight, state-of-the-art open foundation models ranging in scale from 400 million to 8 billion active parameters.
/datasets/ai/ibm-granite

Idefics2
Idefics2 is an open multimodal model that accepts arbitrary sequences of image and text inputs.
/datasets/ai/idefics2

Imagenet 1K

inaturalist
The iNaturalist 2017 dataset (iNat) contains 675,170 training and validation images from 5,089 natural fine-grained categories.
/datasets/ai/inaturalist

infly
INF-Retriever-v1 is an LLM-based dense retrieval model developed by INF TECH. It is built upon the gte-Qwen2-7B-instruct model and fine-tuned specifically to excel at retrieval tasks, particularly for Chinese and English data.
/datasets/ai/infly

internLM
InternLM2 is an open-source LLM that outperforms its predecessors in comprehensive evaluations across 6 dimensions and 30 benchmarks, in long-context modeling, and in open-ended subjective evaluations, through innovative pre-training and optimization techniques.
/datasets/ai/internlm

internvl3-8b-hf
InternVL3-8B is an open-source multimodal vision-language model optimized for fine-grained visual understanding, multimodal reasoning, and instruction following. It supports complex tasks including visual question answering, captioning, OCR, and diagram reasoning. Built upon advanced scaling strategies and alignment techniques, it bridges the gap to proprietary models like GPT-4V through high-quality pretraining and preference optimization.
/datasets/ai/internvl

intfloat
A novel, simple method for obtaining high-quality text embeddings using only synthetic data and fewer than 1k training steps.
/datasets/ai/intfloat

kinetics
Kinetics is a collection of large-scale, high-quality datasets of URL links to up to 650,000 video clips covering 400/600/700 human action classes, depending on the dataset version.
/datasets/ai/kinetics

lg
Large Language Models (LLMs) and Large Multimodal Models (LMMs) developed by LG AI Research. EXAONE stands for EXpert AI for EveryONE, a vision that LG is committed to realizing.
/datasets/ai/lg

linq
Linq-Embed-Mistral was developed by building upon the foundations of the E5-mistral-7b-instruct and Mistral-7B-v0.1 models.
/datasets/ai/linq

Llama2

llama3
Llama 3 is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage.
/datasets/ai/llama3

llama4
Llama 4, developed by Meta, introduces a new auto-regressive Mixture-of-Experts (MoE) architecture.
/datasets/ai/llama4/

Llava_OneVision

llm-compiler
LLM Compiler: Foundation Language Models for Compiler Optimization.
/datasets/ai/llm-compiler

Lumina
Lumina-Image 2.0: A Unified and Efficient Image Generative Framework.
/datasets/ai/lumina

mims
TxAgent is an AI agent that leverages multi-step reasoning and real-time biomedical knowledge retrieval across a toolbox of 211 tools to analyze drug interactions, contraindications, and patient-specific treatment strategies.
/datasets/ai/mims

mixtral

monai
M3 is a medical visual language model that empowers medical imaging professionals, researchers, and healthcare enterprises by enhancing medical imaging workflows across various modalities.
/datasets/ai/monai

moonshot-ai
Kimi-Audio is an open-source audio foundation model excelling in audio understanding, generation, and conversation.
/datasets/ai/moonshot

msmarco
MS MARCO is a large-scale information retrieval benchmark that uses real-world questions from Bing’s search queries to evaluate the performance of machine learning models in generating answers.
/datasets/ai/msmarco

natural-questions
The Natural Questions corpus is a question answering dataset whose questions consist of real, anonymized, aggregated queries issued to the Google search engine.
/datasets/ai/natural-questions

objaverse
Objaverse is a massive dataset with 800K+ annotated 3D objects.
/datasets/ai/objaverse

openai-whisper
Whisper is a state-of-the-art model for automatic speech recognition (ASR) and speech translation.
/datasets/ai/whisper

Perplexity AI

phi
Phi-3.5-mini is a lightweight, state-of-the-art open model built upon the datasets used for Phi-3.
/datasets/ai/phi

playgroundai
A model that generates highly aesthetic images at 1024x1024 resolution, as well as portrait and landscape aspect ratios.
/datasets/ai/playgroundai

pythia
Pythia is the first LLM suite designed specifically to enable scientific research on LLMs.
/datasets/ai/pythia

qwen
Qwen2 is the new series of Qwen large language models, comprising base and instruction-tuned models ranging from 0.5 to 72 billion parameters, including a Mixture-of-Experts model.
/datasets/ai/qwen

rag-sequence-nq
RAG models in which the parametric memory is a pre-trained seq2seq model and the non-parametric memory is a dense vector index of Wikipedia, accessed with a pre-trained neural retriever.
/datasets/ai/rag-sequence-nq

s1-32B
s1 is a reasoning model finetuned from Qwen2.5-32B-Instruct on just 1,000 examples. It matches o1-preview and exhibits test-time scaling via budget forcing.
/datasets/ai/simplescaling

satlas_pretrain
SatlasPretrain is a remote sensing dataset that is large in both breadth and scale, combining Sentinel-2 and NAIP images with 302M labels under 137 categories and seven label types.
/datasets/ai/allenai

stabilityai
A novel transformer-based architecture for text-to-image generation that uses separate weights for the two modalities and enables a bidirectional flow of information between image and text tokens.
/datasets/ai/stabilityai/

sft
A sentence-transformers model finetuned from sentence-transformers/all-mpnet-base-v2.
/datasets/ai/sft

SlimPajama
SlimPajama is a rigorously deduplicated, multi-source dataset, refined and further deduplicated to 627B tokens from the extensive 1.2T-token RedPajama dataset contributed by Together.
/datasets/ai/slim-pajama

t5
T5, short for Text-to-Text Transfer Transformer, is a machine learning model developed by Google.
/datasets/ai/t5

Tulu
Tülu 3: Pushing Frontiers in Open Language Model Post-Training.
/datasets/ai/tulu

V2X
V2X-Sim is a comprehensive simulated multi-agent perception dataset for V2X-aided autonomous driving.
/datasets/ai/v2x

video-MAE
Video masked autoencoder (VideoMAE) is a scalable and general self-supervised pre-trainer for building video foundation models.
/datasets/ai/opengvlab

vit
The Vision Transformer (ViT) model uses the transformer architecture to process image patches for tasks like image classification.
/datasets/ai/vit

wildchat
WildChat is a corpus of 1 million user-ChatGPT conversations comprising over 2.5 million interaction turns.
/datasets/ai/wildchat
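As a small worked example of the patch-based processing the ViT entry above describes, an image is divided into fixed-size, non-overlapping patches that become the transformer's input tokens. The sizes below (a 224x224 input with 16x16 patches, the common ViT-Base defaults) are illustrative, not taken from this catalog:

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches a square image yields."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be divisible by patch size")
    return (image_size // patch_size) ** 2

# A 224x224 image with 16x16 patches yields 196 patch tokens
# (plus one extra [CLS] token in the standard ViT formulation).
print(num_patches(224, 16))  # -> 196
```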