Toys: The data science toolbox¶
Toys is a toolbox for data science, built with PyTorch, and designed for rapid research.
Documentation¶
Datasets¶
The Dataset protocol is borrowed from PyTorch and is the boundary between preprocessing and the model. The protocol is easy to implement: a dataset need only have __len__() and __getitem__() methods with integer indexing. Most simple collections can be used as datasets, including list and ndarray.
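E.g., the following class satisfies the protocol (a minimal sketch for illustration; the class name and its fields are made up):

import numpy as np

class RandomPairs:
    '''A toy dataset whose rows are (feature, target) pairs.'''

    def __init__(self, n=100):
        self.features = np.random.random(size=(n, 5))
        self.targets = np.random.random(size=n)

    def __len__(self):
        return len(self.features)

    def __getitem__(self, index):
        # Each row is a tuple of two columns: a (5,) array and a scalar.
        return self.features[index], self.targets[index]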
We use the following vocabulary when discussing datasets:
Row:
    The value at dataset[i] is called the ith row of the dataset. Each row must be a sequence of arrays and/or scalars, and each array may have a different shape.

Column:
    The positions in a row are called the columns. The jth column of the dataset is the sequence of the jth column of every row.

Supervised:
    A supervised dataset has at least two columns, where the last column is designated as the target column and the rest as feature columns. In unsupervised datasets, all columns are considered feature columns.

Feature:
    The data in any one feature column of a row is called a feature of that row.

Target:
    Likewise, the data in the target column of a row is called the target of that row.

Instance:
    The features of a row are collectively called an instance.

Shape:
    The shape of a row or instance is the sequence of shapes of its columns. The shape of a dataset is the shape of its rows. Note that the shape of a dataset does not include its length.
For example, the CIFAR10 dataset is a supervised dataset with two columns. The feature column contains 32x32 pixel RGB images, and the target column contains integer class labels. The shape of the feature is (32, 32, 3), and the shape of the target is () (i.e. the target is a scalar). The shape of the CIFAR10 dataset is thus ((32, 32, 3), ()).
Note
Unlike arrays, columns need not have the same shape across all rows. In fact, the same column may have a different number of dimensions in different rows, and rows may even have different numbers of columns altogether. While most estimators expect some consistency, this freedom allows us to efficiently represent, e.g., variable sequence lengths. A dataset shape (as opposed to a row or instance shape) may use None to represent a variable aspect of its shape.
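For instance, a dataset of variable-length sequences might look like the following sketch; its dataset shape would be written ((None,), ()):

>>> import numpy as np
>>> dataset = [
...     (np.array([1.0, 2.0, 3.0]), 0),   # first column has length 3
...     (np.array([4.0, 5.0]), 1),        # ... length 2
...     (np.array([6.0]), 0),             # ... length 1
... ]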
Batching and iteration¶
The function toys.batches() iterates over mini-batches of a dataset by delegating to PyTorch's DataLoader class. The batches() function forwards all of its arguments to the DataLoader constructor, but it allows the dataset to recommend default values through the Dataset.hints attribute. This allows the dataset to, e.g., specify an appropriate collate function or sampling strategy.
The most common arguments are:
batch_size:
    The maximum number of rows per batch.

shuffle:
    A boolean; set to true to sample batches at random without replacement.

collate_fn:
    A function to merge a list of samples into a mini-batch. This is required if the shape of the dataset is variable, e.g. to pad or pack variable-length sequences.

pin_memory:
    If true, batches are loaded into CUDA pinned memory. Unlike vanilla PyTorch, this defaults to true whenever CUDA is available.
Note
Most estimators require an explicit batch_size argument when it can affect model performance. Thus the batch_size hint provided by the dataset is more influential to scoring functions than to estimators; the hinted value should therefore be chosen for scoring purposes and can be quite large.
See also
See torch.utils.data.DataLoader for a full description of all possible arguments.
Todo
Add examples
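In the meantime, iterating over shuffled mini-batches might look like this (a sketch; it assumes each batch is a tuple of collated columns, as with PyTorch's default collate function):

>>> for features, target in toys.batches(dataset, batch_size=32, shuffle=True):
...     pass  # each batch holds up to 32 rows, collated column by column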
Creating and combining datasets¶
The primary functions for combining datasets are toys.concat() and toys.zip(), which concatenate datasets by rows and by columns respectively.
Of these, toys.zip() is the more commonly used. It allows us to, e.g., combine the features and target from separate datasets:
>>> features = np.random.random(size=(100, 1, 5)) # 100 rows, 1 column of shape (5,)
>>> target = np.prod(features, axis=-1) # 100 rows, 1 scalar column
>>> dataset = toys.zip(features, target) # 100 rows, 2 columns
>>> toys.shape(features)
((5,),)
>>> toys.shape(target)
((),)
>>> toys.shape(dataset)
((5,), ())
Most estimators will automatically zip datasets if you pass more than one:
>>> from toys.supervised import LeastSquares
>>> estimator = LeastSquares()
>>> model = estimator(dataset) # Each of these calls
>>> model = estimator(features, target) # is equivalent to the other
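toys.concat() stacks datasets row-wise. A sketch, assuming it takes datasets as positional arguments like toys.zip() does:

>>> left = np.random.random(size=(60, 1, 5))    # 60 rows, 1 column of shape (5,)
>>> right = np.random.random(size=(40, 1, 5))   # 40 rows, same shape
>>> combined = toys.concat(left, right)         # 100 rows, 1 column of shape (5,)
>>> len(combined)
100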
Style Guide¶
All Python code should follow the Google Python Style Guide with the following exceptions and additions.
Doc strings¶
Docstrings should use triple single-quotes (''').
All values in return, yield, attributes, arguments, and keyword arguments sections must include both names and type annotations (see the following section on type annotations).
The description of an argument or return value starts on the line following its name and type annotation. This is more visually appealing when long type annotations are used, so we require it globally for consistency.
E.g.:
def torch_dtype(dtype):
    '''Casts dtype to a PyTorch tensor class.

    The input may be a conventional name, like 'float' and 'double', or an
    explicit name like 'float32' or 'float64'. If the input is a known
    tensor class, it is returned as-is.

    Args:
        dtype (str or TorchDtype):
            A conventional name, explicit name, or known tensor class.

    Returns:
        cls (TorchDtype):
            The tensor class corresponding to `dtype`.
    '''
    ...
Type Annotations¶
Type hints are useful for both documentation and static analysis tooling but can be very distracting syntactically. As a compromise, always include PEP 484 compliant type hints in docstrings for arguments, return, and yield values. Don’t include type annotations in code.
The following sugar is allowed, given in order of precedence:

- Union[A, B] may be written as A or B.
- Callable[A, B] may be written as A -> B.

Note that Optional[T] is equivalent to Union[T, None]. The preferred notation for optional types is T or None.
When types become complex, create an alias, e.g.:
CrossValSplitter = Callable[[Dataset], Iterable[Tuple[Dataset, Dataset]]]
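For example, a hypothetical helper's docstring might combine an alias with the Union sugar (the function and its types are made up for illustration):

def cross_validate(estimator, dataset, cv=3):
    '''Scores an estimator over cross validation folds.

    Args:
        estimator (Estimator):
            The estimator to evaluate.
        dataset (Dataset):
            The data to split into folds.
        cv (int or CrossValSplitter):
            Either a number of folds or a splitting function,
            i.e. Union[int, CrossValSplitter].

    Returns:
        scores (Sequence[float]):
            One score per fold.
    '''
    ...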
Imports¶
Use relative imports for anything under the current package, and use absolute imports for everything else. This allows packages to be moved without modifying their contents in the common case (other cases are a code smell).
Import classes directly, using from pkg import MyClass.
Group imports by dependency, and separate each group by a single blank line. The first import should be the top-level package of the dependency. Sort groups by dependency name except when conflicting with the following.
Reserve the first group for the Python standard library.
Reserve the second group for the SciPy stack, e.g. numpy, scipy, matplotlib, and pandas. Other general purpose data handling tools may be included in this section, like dask and xarray. Use simple import statements in this group. If you find yourself writing many imports from the same package, use a dedicated group instead.
Place relative imports last in their own group.
Within each group, sort bare import ... statements before from ... import ... statements. Otherwise sort imports lexicographically.
Always import the top-level package for each dependency. Import all objects used in docstrings, and use objects in docstrings as imported. Otherwise avoid dead imports.
E.g.:
from typing import Any, Mapping, Sequence
import numpy as np
import scipy
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
import torch
from torch.nn import DataParallel, Module
from torch.optim import Optimizer
import toys
from toys import Dataset
from toys.metrics import Mean
from .cross_val import KFold
Code layout¶
Code is divided into packages (folders) and modules (*.py files). By default, all code in modules is considered private. Public objects should be reexported by the package's __init__.py file. Other than comments and a package-level docstring, each __init__.py file should only contain relative import statements for the public objects in submodules of the package.
Do not use __all__. The rules above serve the same purpose.
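E.g., a package's __init__.py might look like the following sketch (the submodule names here are hypothetical):

# toys/metrics/__init__.py
'''Metrics for evaluating models.'''

from .accuracy import Accuracy, FScore, Precision, Recall
from .regression import MeanSquaredError, NegMeanSquaredError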
toys¶
Core protocols¶
The core protocols are pure abstract classes. They provide no functionality and are for documentation purpose only. There is no requirement to subclass them; however doing so provides certain runtime protections through Python’s abstract base class (abc
) functionality.
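For example, assuming the protocols declare their required methods as abstract, a subclass that forgets one of them fails at instantiation rather than at first use (a sketch of standard abc behavior; the exact error is not guaranteed):

import toys

class Incomplete(toys.Dataset):
    def __len__(self):
        return 0
    # __getitem__ is missing.

Incomplete()  # TypeError: Can't instantiate abstract class Incomplete ...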
- Dataset
- Estimator
- Model
Common classes¶
- BaseEstimator
- TorchModel
Dataset utilities¶
- toys.batches
- toys.zip
- toys.concat
- toys.flatten
- toys.subset
- toys.shape
Argument parsers¶
- parse_activation
- parse_initializer
- parse_optimizer
- parse_loss
- parse_dtype
- parse_metric
Type aliases¶
These type aliases exist to aid in documentation and static analysis. They are irrelevant at runtime.
class toys.ColumnShape = Optional[Tuple[Optional[int], ...]]

The shape of a single datum in a column. None is used for dimensions of variable length, and when the total number of dimensions is variable. Note that the shape of a column does not include the index dimension.
class toys.RowShape = Optional[Tuple[ColumnShape, ...]]

The shape of a row is the sequence of (possibly variable) shapes of its columns. The dataset shape may be None to indicate that the number of columns is variable.
For example, the CIFAR10 dataset has two columns. The first contains 32x32 RGB images; its shape is (32, 32, 3). The second contains scalar class labels; its shape is (). The shape of the whole row is thus ((32, 32, 3), ()).
>>> from toys.datasets import CIFAR10
>>> cifar = CIFAR10()
>>> toys.shape(cifar)
((32, 32, 3), ())
toys.metrics¶
Classification metrics¶
- Accuracy
- TruePositives
- FalsePositives
- TrueNegatives
- FalseNegatives
- Precision
- Recall
- FScore
Regression metrics¶
- MeanSquaredError
- NegMeanSquaredError
toys.model_selection¶
Functions¶
- combinations
Hyperparameter search¶
- GridSearchCV
Cross validation splitting¶
- KFold
Type aliases¶
These type aliases exist to aid in documentation and static analysis. They are irrelevant at runtime.
class CrossValSplitter = Callable[[Dataset], Iterable[Fold]]

A function that takes a dataset and returns an iterable over some Folds of the dataset. These can be used by meta-estimators, like GridSearchCV, to test how estimators generalize to unseen data.
class Fold = Tuple[Dataset, Dataset]

A fold is the partitioning of a dataset into two disjoint subsets, (train, test).
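For example, a leave-one-out splitter conforming to these aliases might look like this sketch (it assumes toys.subset(dataset, indices) returns the rows at the given indices; the actual signature may differ):

import toys

def leave_one_out(dataset):
    '''Yields one Fold per row, holding that row out as the test set.'''
    n = len(dataset)
    for i in range(n):
        train = toys.subset(dataset, [j for j in range(n) if j != i])
        test = toys.subset(dataset, [i])
        yield (train, test)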
class ParamGrid = Mapping[str, Sequence]
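Presumably each key names an estimator argument and each value lists the settings to try, e.g.:

param_grid = {
    'learning_rate': [0.1, 0.01, 0.001],
    'batch_size': [32, 128],
}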
toys.supervised¶
- GradientDescent
- LeastSquares
Contributing¶
All are welcome to contribute, but because the project is so young, coordination is key. If you are interested in contributing, please reach out on the issue tracker, or in person if you are around UGA.
The contributing file contains style guides and other useful guidelines for contributing to the project.