UCI ML Hackathon: Challenge Datasets
California Wildfires
California Wildfires
Donated by: Casey Graff (email)
About 50k instances in an average year and about 350k instances across all years. Each instance is a multi-channel image in which each channel corresponds to a different feature (ground vegetation, observed fire detections, weather); the target is a single-channel image of the next day’s fire detections.
Possible Applications: Image-to-image classification (binary classification per pixel of the output image, similar to image segmentation).
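To make the per-pixel framing concrete, here is a minimal sketch of scoring a predicted fire mask against the next day's detections. It assumes masks are 2D grids of 0/1 values. The persistence baseline (predict that tomorrow's fires equal today's) and the function names are illustrative assumptions, not part of the dataset description.

```python
def pixel_counts(pred, target):
    """Count per-pixel true positives, false positives, and false negatives."""
    tp = fp = fn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t and not p:
                fn += 1
    return tp, fp, fn


def precision_recall(pred, target):
    """Per-pixel precision and recall; 0.0 when a denominator is zero."""
    tp, fp, fn = pixel_counts(pred, target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy 3x3 masks with made-up values (not real data).
today_fire = [[0, 1, 0],
              [0, 1, 0],
              [0, 0, 0]]
next_day_fire = [[0, 1, 1],
                 [0, 0, 0],
                 [0, 0, 0]]

# Hypothetical persistence baseline: tomorrow's mask equals today's.
precision, recall = precision_recall(today_fire, next_day_fire)
```

Because fire pixels are usually a small fraction of each image, precision and recall on the fire class tend to be more informative than raw pixel accuracy.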
Galaxy Spiral Structure
Spiral Structure Analysis of 1 Million Sloan Galaxies
Donated by: Wayne Hayes (email)
CSV file describing 856,734 galaxies (plus a header line), with 338 features each.
Possible Applications: Measuring the structure and evolution of galaxies; evolution of the Cosmos at large.
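As a first sanity check on a CSV of this shape, the sketch below counts data rows and columns using only the standard library. The sample text and its column names are invented stand-ins; the real file's column names are not given here.

```python
import csv
import io


def summarize_csv(handle):
    """Return (number of data rows, number of columns) for a CSV with a header line."""
    reader = csv.reader(handle)
    header = next(reader)            # first line holds the column names
    n_rows = sum(1 for _ in reader)  # every remaining line is one galaxy
    return n_rows, len(header)


# Tiny stand-in for the real file; these column names are hypothetical.
sample = io.StringIO("name,arm_count,pitch_angle\nG1,2,15.0\nG2,4,22.5\n")
rows, cols = summarize_csv(sample)
```

Run against the real file (opened with `newline=""`), this should report 856,734 data rows if the description is accurate; the column count may differ slightly from 338 depending on whether identifier columns are counted as features.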
GPA
Geometric Pose Affordance
Donated by: Zhe Wang (email)
Task: given the 2D location of a person in the image, recover the person’s 3D location in root-relative coordinates (xyz location relative to the pelvis joint). Number of instances: training 222,514; validation 8,000; testing 82,378. Type of data: human keypoint positions extracted from images, in 2D (input) and 3D (output), plus geometry information (multi-layer depth map) extracted from the 2D joint locations (input).
Possible Applications: Motion retargeting in Adobe: https://sites.google.com/umich.edu/nik; action recognition and explanation; Hollywood motion capture (3D avatars) and animation; sports analysis (NBA, FIFA); health care (autism, Parkinson’s, anthropometry, physical rehabilitation); robot learning (action anticipation, affordance learning, assisted living); self-driving cars (motion prediction); virtual reality (HoloLens 2, Facebook Reality Labs); Amazon Go; scene understanding and proxemics recognition.
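Root-relative coordinates, as used for the 3D targets above, can be sketched as subtracting the pelvis joint's position from every joint. The joint ordering (pelvis at index 0) and the sample values below are assumptions for illustration; the dataset's actual joint layout is not specified here.

```python
def to_root_relative(joints, root_index=0):
    """Express each (x, y, z) joint relative to the root (pelvis) joint.

    root_index=0 is an assumed position of the pelvis in the joint list.
    """
    rx, ry, rz = joints[root_index]
    return [(x - rx, y - ry, z - rz) for x, y, z in joints]


# Made-up pose with three joints; the first is treated as the pelvis.
pose = [(1.0, 2.0, 3.0), (1.5, 2.0, 2.0), (0.0, 4.0, 3.5)]
relative = to_root_relative(pose)
```

After the transform the root joint sits at the origin, so a model only has to predict body configuration rather than absolute position in the scene.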
Amyloid Positivity
Clinical memory assessments and biomarkers associated with Alzheimer’s Disease and Related Disorders
Donated by: Michael Lee (email)
Tabular data with 939 cases of 19 variables. Each case is a clinical test of a patient. Variables involve demographic information (age, gender, years of education), protocol information (time since baseline test), memory test outcomes (free recall scores, recognition scores), biomarkers (APOE genotype, beta amyloid), and diagnosis of memory impairment (cognitively normal or impaired).
Possible Applications: Prediction of amyloid status. Prediction of progression to cognitive impairment. Visualization of relationship between memory test performance, biomarkers, and demographics.
DNS network captures
DNS network captures
Donated by: Zhou Li (email)
DNS data is often captured and used by security companies to find cyber-attacks. There are two pcap files consisting of millions of DNS query packets. Some of the queries are benign, while others are malicious (e.g., directed to domains owned by cyber-attackers). The first file contains various kinds of DNS attacks. The second contains DNS queries to many algorithmically generated domains (produced by domain generation algorithms, DGAs) from various families. DGA domains are often used by malware as rendezvous points linked to command-and-control servers.
Possible Applications: The first dataset could be used to build a detection system that identifies various kinds of network attacks based on DNS communication patterns. The second dataset could be used to build a detection system for DGA domains.
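One common starting point for DGA detection is a handful of lexical features computed from the queried domain name. The sketch below computes character-level Shannon entropy, length, and digit ratio. It assumes domain names have already been extracted from the pcap files as strings, and it naively takes the second-to-last label as the registered name, which ignores multi-part suffixes such as `co.uk`. The feature set and function names are illustrative, not a method prescribed by the dataset.

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Character-level Shannon entropy in bits; 0.0 for an empty string."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def domain_features(domain):
    """Simple lexical features of a domain's second-to-last label.

    Assumption: the second-to-last label is the registered name, which is
    wrong for multi-part public suffixes (e.g. example.co.uk).
    """
    labels = domain.lower().rstrip(".").split(".")
    label = labels[-2] if len(labels) >= 2 else labels[0]
    digits = sum(ch.isdigit() for ch in label)
    return {
        "label": label,
        "length": len(label),
        "entropy": shannon_entropy(label),
        "digit_ratio": digits / len(label) if label else 0.0,
    }


features = domain_features("www.Example.com.")
```

Features like these could feed any standard classifier; on their own they are weak signals, since some legitimate domains are also long or random-looking.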
Satellite Imagery of Cambodia
Satellite imagery of Chbar Mon, Kampong Speu, Cambodia
Donated by: Daniel Parker (email)
These are raster data (satellite images).
Possible Applications: Classify the images by simple land types such as urban, rice fields, other agricultural fields, water, buildings, and houses.
UCI Clinical Data
UCI OMOP DeID database
Donated by: Alessandro Ghigi, Zhaoxian Hu, Wu Fu (email)
De-identified (DeID) clinical data covering 800,000 patients and 15,000,000 visits. Available clinical information: encounters, conditions (diagnoses), procedures, measurements (lab tests and vital signs), drugs, observations.
Possible Applications: Feasibility studies, clinical projects that can run against DeID data.