Galaxy10 SDSS Dataset

Note

This page has been renamed to Galaxy10 SDSS. A new Galaxy10 using DECaLS is available at https://github.com/henrysky/Galaxy10

Introduction

Galaxy10 SDSS is a dataset containing 21785 colored galaxy images of 69x69 pixels (g, r and i bands) separated into 10 classes. The images come from the Sloan Digital Sky Survey and the labels come from Galaxy Zoo.

Galaxy10 dataset (21785 images)
├── Class 0 (3461 images): Disk, Face-on, No Spiral
├── Class 1 (6997 images): Smooth, Completely round
├── Class 2 (6292 images): Smooth, in-between round
├── Class 3 (394 images): Smooth, Cigar shaped
├── Class 4 (1534 images): Disk, Edge-on, Rounded Bulge
├── Class 5 (17 images): Disk, Edge-on, Boxy Bulge
├── Class 6 (589 images): Disk, Edge-on, No Bulge
├── Class 7 (1121 images): Disk, Face-on, Tight Spiral
├── Class 8 (906 images): Disk, Face-on, Medium Spiral
└── Class 9 (519 images): Disk, Face-on, Loose Spiral

These classes are mutually exclusive, but Galaxy Zoo relies on human volunteers to classify galaxy images and the volunteers do not agree on all images. For this reason, Galaxy10 only contains images for which more than 55% of the votes agree on the class, that is, more than 55% of the votes among the 10 classes go to a single class for that image. If no class receives more than 55% of the votes, the image is excluded from Galaxy10 because no agreement was reached. 21785 images remain after this cut.
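The selection rule can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the `vote_fractions` array (one row per image, one column per class) is made up, and how the Galaxy Zoo votes are mapped onto the 10 classes is not shown.

```python
import numpy as np

# Hypothetical vote fractions: one row per image, one column per class.
# Each row sums to 1. These are NOT real Galaxy Zoo votes.
vote_fractions = np.array([
    [0.70, 0.10, 0.05, 0.05, 0.02, 0.00, 0.03, 0.02, 0.02, 0.01],  # agreement on class 0
    [0.40, 0.35, 0.10, 0.05, 0.02, 0.00, 0.03, 0.02, 0.02, 0.01],  # no agreement
    [0.05, 0.05, 0.60, 0.10, 0.05, 0.00, 0.05, 0.04, 0.03, 0.03],  # agreement on class 2
])

THRESHOLD = 0.55  # more than 55% of the votes must go to a single class

top_class = np.argmax(vote_fractions, axis=1)   # most voted class per image
top_fraction = np.max(vote_fractions, axis=1)   # its share of the votes
keep = top_fraction > THRESHOLD                 # strict inequality, as in the text

kept_indices = np.flatnonzero(keep)  # images that survive the cut
kept_labels = top_class[keep]        # their class labels
print(kept_indices, kept_labels)
```

Only the first and third images pass the cut in this toy example; the second is dropped because its most voted class holds only 40% of the votes.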

The 55% threshold was chosen by validation. Galaxy10 is meant to be an alternative to MNIST or Cifar10 as a deep learning toy dataset for astronomers, so astroNN.models.Cifar10_CNN trained on Cifar10 is used as the reference, and the validation was done with the same astroNN.models.Cifar10_CNN. A 50% threshold keeps around 36000 images but gives poor classification accuracy: many of those images are probably misclassified, so the neural network has a hard time learning. A 60% threshold gives results similar to 55%. In both cases the classification accuracy is similar to that on Cifar10 with the same network, but the 55% threshold keeps more images in the dataset. Thus 55% was chosen as the threshold.

The original images are 424x424 pixels. They were cropped to 207x207 around the image center and then downscaled by a factor of 3 via bilinear interpolation to 69x69, so that the dataset fits in the memory of most computers and graphics cards.
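As a rough illustration of this preprocessing, the sketch below center-crops an image and downscales it with a hand-written bilinear interpolation in NumPy. The helper names `center_crop` and `bilinear_resize` are hypothetical, the input is random noise rather than an SDSS image, and the pixel-center sampling convention is an assumption; the tool actually used to build Galaxy10 is not stated here and may differ in detail.

```python
import numpy as np

def center_crop(img, size):
    """Crop a (H, W, C) image to (size, size, C) around its center."""
    h, w = img.shape[:2]
    top = (h - size) // 2
    left = (w - size) // 2
    return img[top:top + size, left:left + size]

def bilinear_resize(img, out_h, out_w):
    """Resize a (H, W, C) image with bilinear interpolation (pixel-center convention)."""
    in_h, in_w = img.shape[:2]
    # Positions of the output pixel centers expressed in input pixel coordinates
    ys = np.clip((np.arange(out_h) + 0.5) * in_h / out_h - 0.5, 0, in_h - 1)
    xs = np.clip((np.arange(out_w) + 0.5) * in_w / out_w - 0.5, 0, in_w - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None, None]  # vertical interpolation weights
    wx = (xs - x0)[None, :, None]  # horizontal interpolation weights
    img = img.astype(np.float64)
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bottom = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

# Random stand-in for a 424x424 three-band image
rng = np.random.default_rng(0)
original = rng.integers(0, 256, size=(424, 424, 3), dtype=np.uint8)

cropped = center_crop(original, 207)       # 424x424 -> 207x207
small = bilinear_resize(cropped, 69, 69)   # 207x207 -> 69x69 (factor of 3)
print(cropped.shape, small.shape)
```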

There is no guarantee on the accuracy of the labels. Moreover, Galaxy10 is not a balanced dataset and it should only be used for educational or experimental purposes. If you use Galaxy10 for research purposes, please cite Galaxy Zoo and the Sloan Digital Sky Survey.
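Because the classes are imbalanced, it can help to look at the per-class counts before training. The sketch below counts labels and derives inverse-frequency weights using the common n_samples / (n_classes * count) heuristic. The `labels` array is a small made-up stand-in for the integer Galaxy10 labels, and whether such weights are useful for a given model is left to the reader.

```python
import numpy as np

NUM_CLASSES = 10

# Hypothetical integer labels (0-9) standing in for the real Galaxy10 labels
# before one-hot encoding; the counts below are made up for illustration.
labels = np.repeat(np.arange(NUM_CLASSES), [35, 70, 63, 4, 15, 1, 6, 11, 9, 5])

# Number of images in each class
counts = np.bincount(labels, minlength=NUM_CLASSES)

# Inverse-frequency weights: rare classes get larger weights.
# This assumes every class has at least one image.
class_weights = labels.size / (NUM_CLASSES * counts)

for cls, (n, w) in enumerate(zip(counts, class_weights)):
    print(f"Class {cls}: {n} images, weight {w:.2f}")
```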

For more information on the original classification tree: Galaxy Zoo Decision Tree

_images/galaxy10sdss_example.png

Download Galaxy10 SDSS

Galaxy10.h5: http://www.astro.utoronto.ca/~bovy/Galaxy10/Galaxy10.h5

SHA256: 969A6B1CEFCC36E09FFFA86FEBD2F699A4AA19B837BA0427F01B0BC6DED458AF

Size: 200 MB (210,234,548 bytes)
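After a manual download, the file can be checked against the SHA256 above. The following is a minimal sketch using only the Python standard library; the helper names `sha256_of_file` and `matches_expected` are illustrative, and it assumes Galaxy10.h5 sits in the current directory.

```python
import hashlib
import os

# Checksum copied from this page
EXPECTED_SHA256 = "969A6B1CEFCC36E09FFFA86FEBD2F699A4AA19B837BA0427F01B0BC6DED458AF"

def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_expected(path, expected=EXPECTED_SHA256):
    """Compare case-insensitively: hexdigest() is lower case, the page uses upper case."""
    return sha256_of_file(path).lower() == expected.lower()

# Only runs the check if the file has actually been downloaded
if os.path.exists("Galaxy10.h5"):
    print("checksum OK" if matches_expected("Galaxy10.h5") else "checksum MISMATCH")
```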

Or see below to load (and automatically download) the dataset with astroNN.

TL;DR for Beginners

You can view the Jupyter notebook here: https://github.com/henrysky/astroNN/blob/master/demo_tutorial/galaxy10/Galaxy10_Tutorial.ipynb

Or you can train with astroNN: copy and paste the following script to download Galaxy10 and train a simple neural network on it.

The script first loads Galaxy10 with astroNN and splits it into a training set and a test set. astroNN then splits the training set into training data and validation data and normalizes them automatically.

Galaxy10CNN is a simple 4-layer convolutional neural network consisting of 2 convolutional layers and 2 dense layers.

# import everything we need first
from tensorflow.keras import utils
import numpy as np
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

from astroNN.models import Galaxy10CNN
from astroNN.datasets import load_galaxy10sdss
from astroNN.datasets.galaxy10sdss import galaxy10cls_lookup, galaxy10_confusion

# To load images and labels (will download automatically the first time)
# On first use the download location is ~/.astroNN/datasets/
images, labels = load_galaxy10sdss()

# To convert the labels to categorical 10 classes
labels = utils.to_categorical(labels, 10)

# Select 10 random images to inspect
plt.ion()
print('===================Data Inspection===================')
random_idx = np.random.randint(0, labels.shape[0], size=10)
for counter, i in enumerate(random_idx):
    plt.imshow(images[i])
    plt.title('Class {}: {} \n Random Demo images {} of 10'.format(np.argmax(labels[i]), galaxy10cls_lookup(labels[i]), counter+1))
    plt.draw()
    plt.pause(2.)
plt.close('all')
print('===============Data Inspection Finished===============')

# To convert to desirable type
labels = labels.astype(np.float32)
images = images.astype(np.float32)

# Split the dataset into training set and testing set
train_idx, test_idx = train_test_split(np.arange(labels.shape[0]), test_size=0.1)
train_images, train_labels = images[train_idx], labels[train_idx]
test_images, test_labels = images[test_idx], labels[test_idx]

# To create a neural network instance
galaxy10net = Galaxy10CNN()

# Set the maximum number of epochs the neural network can run; 5 gives a quick result
galaxy10net.max_epochs = 5

# To train the neural net
# astroNN will normalize the data by default
galaxy10net.train(train_images, train_labels)

# Print the model summary (called after train() in this script)
galaxy10net.keras_model.summary()

# After the training, you can test the neural net performance
# Note that predicted_labels are the labels predicted by the neural network; test_labels are the ground truth from the dataset
predicted_labels = galaxy10net.test(test_images)

# Convert predicted_labels to class
prediction_class = np.argmax(predicted_labels, axis=1)

# Convert test_labels to class
test_class = np.argmax(test_labels, axis=1)

# Prepare a confusion matrix
confusion_matrix = np.zeros((10,10))

# Fill the confusion matrix (rows: predicted class, columns: true class)
for counter, i in enumerate(prediction_class):
    confusion_matrix[i, test_class[counter]] += 1

# Plot the confusion matrix
galaxy10_confusion(confusion_matrix)
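The confusion matrix built in the script above can also be summarized numerically. The sketch below fills the matrix without an explicit loop and derives per-class recall and overall accuracy from it. The two small index arrays are made-up stand-ins for `test_class` and `prediction_class`, and the row/column convention (rows are predictions, columns are true classes) follows the script.

```python
import numpy as np

NUM_CLASSES = 10

# Hypothetical class indices standing in for test_class and prediction_class
test_class = np.array([0, 0, 1, 1, 1, 2, 2, 9])
prediction_class = np.array([0, 1, 1, 1, 2, 2, 2, 9])

# Same convention as the script: rows are predictions, columns are true classes
confusion_matrix = np.zeros((NUM_CLASSES, NUM_CLASSES))
np.add.at(confusion_matrix, (prediction_class, test_class), 1)

# Per-class recall: correct predictions divided by the number of true members.
# Classes with no test images are reported as NaN.
true_counts = confusion_matrix.sum(axis=0)
correct = np.diag(confusion_matrix)
recall = np.divide(correct, true_counts,
                   out=np.full(NUM_CLASSES, np.nan), where=true_counts > 0)

overall_accuracy = correct.sum() / confusion_matrix.sum()
print(recall)
print(overall_accuracy)
```

Per-class numbers matter here because the dataset is imbalanced: a high overall accuracy can hide poor performance on rare classes.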

Load with astroNN

from astroNN.datasets import load_galaxy10sdss
from tensorflow.keras import utils
import numpy as np

# To load images and labels (will download automatically the first time)
# On first use the download location is ~/.astroNN/datasets/
images, labels = load_galaxy10sdss()

# To convert the labels to categorical 10 classes
labels = utils.to_categorical(labels, 10)

# To convert to desirable type
labels = labels.astype(np.float32)
images = images.astype(np.float32)

OR Load with Python & h5py

Download Galaxy10.h5 first, then start Python in the same directory and run the following to open it:

import h5py
import numpy as np
from tensorflow.keras import utils

# To get the images and labels from file
with h5py.File('Galaxy10.h5', 'r') as F:
    images = np.array(F['images'])
    labels = np.array(F['ans'])

# To convert the labels to categorical 10 classes
labels = utils.to_categorical(labels, 10)

# To convert to desirable type
labels = labels.astype(np.float32)
images = images.astype(np.float32)
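If TensorFlow is not installed, the one-hot conversion can be reproduced with NumPy alone. The sketch below assumes the labels are integer class indices from 0 to 9 and uses a short made-up array in place of the real one; it should give the same one-hot layout as the to_categorical call above.

```python
import numpy as np

NUM_CLASSES = 10

# Hypothetical integer class labels standing in for the array read from the file
labels = np.array([0, 3, 9, 5])

# Row i of the identity matrix is the one-hot vector of class i
one_hot = np.eye(NUM_CLASSES, dtype=np.float32)[labels]

# np.argmax recovers the integer labels from the one-hot rows
recovered = np.argmax(one_hot, axis=1)
print(one_hot.shape, recovered)
```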

Split into train and test set

import numpy as np
from sklearn.model_selection import train_test_split

train_idx, test_idx = train_test_split(np.arange(labels.shape[0]), test_size=0.1)
train_images, train_labels = images[train_idx], labels[train_idx]
test_images, test_labels = images[test_idx], labels[test_idx]
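With a purely random split, a very small class can end up entirely in the training set or the test set. One option is a stratified split, which keeps roughly the same class mix on both sides. scikit-learn's train_test_split accepts a `stratify` argument for this; the NumPy-only sketch below shows the idea. The helper `stratified_split_indices` is hypothetical and the labels are made up.

```python
import numpy as np

def stratified_split_indices(class_indices, test_size=0.1, seed=0):
    """Return (train_idx, test_idx) with roughly the same class mix in both."""
    rng = np.random.default_rng(seed)
    train_parts, test_parts = [], []
    for cls in np.unique(class_indices):
        idx = np.flatnonzero(class_indices == cls)
        rng.shuffle(idx)
        # Keep at least one test sample for every class with 2 or more members
        n_test = max(1, int(round(test_size * idx.size))) if idx.size > 1 else 0
        test_parts.append(idx[:n_test])
        train_parts.append(idx[n_test:])
    return np.concatenate(train_parts), np.concatenate(test_parts)

# Hypothetical imbalanced integer labels; with one-hot Galaxy10 labels,
# np.argmax(labels, axis=1) would give the class indices instead.
class_indices = np.repeat(np.arange(3), [100, 17, 40])

train_idx, test_idx = stratified_split_indices(class_indices, test_size=0.1)
print(train_idx.size, test_idx.size)
```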

Lookup Galaxy10 Class

You can look up the name corresponding to a Galaxy10 class with:

from astroNN.datasets.galaxy10sdss import galaxy10cls_lookup

# Pass a class number (0 to 9) to get back the class name
print(galaxy10cls_lookup(0))

Galaxy10 Dataset Authors

  • Henry Leung - Compiled the Galaxy10 dataset - henrysky
    Department of Astronomy & Astrophysics, University of Toronto
  • Jo Bovy - Supervisor of Henry Leung - jobovy
    Department of Astronomy & Astrophysics, University of Toronto

Acknowledgments

For astroNN acknowledgment, please refer to Acknowledging astroNN

  1. Galaxy10 dataset classification labels come from Galaxy Zoo

  2. Galaxy10 dataset images come from Sloan Digital Sky Survey (SDSS)

Galaxy Zoo is described in Lintott et al. 2008, MNRAS, 389, 1179 and the data release is described in Lintott et al. 2011, MNRAS, 410, 166.

Funding for the SDSS and SDSS-II has been provided by the Alfred P. Sloan Foundation, the Participating Institutions, the National Science Foundation, the U.S. Department of Energy, the National Aeronautics and Space Administration, the Japanese Monbukagakusho, the Max Planck Society, and the Higher Education Funding Council for England. The SDSS Web Site is http://www.sdss.org/.

The SDSS is managed by the Astrophysical Research Consortium for the Participating Institutions. The Participating Institutions are the American Museum of Natural History, Astrophysical Institute Potsdam, University of Basel, University of Cambridge, Case Western Reserve University, University of Chicago, Drexel University, Fermilab, the Institute for Advanced Study, the Japan Participation Group, Johns Hopkins University, the Joint Institute for Nuclear Astrophysics, the Kavli Institute for Particle Astrophysics and Cosmology, the Korean Scientist Group, the Chinese Academy of Sciences (LAMOST), Los Alamos National Laboratory, the Max-Planck-Institute for Astronomy (MPIA), the Max-Planck-Institute for Astrophysics (MPA), New Mexico State University, Ohio State University, University of Pittsburgh, University of Portsmouth, Princeton University, the United States Naval Observatory, and the University of Washington.