A library for making stability analysis simple. Easily evaluate the effect of judgment calls to your data-science pipeline (e.g. choice of imputation strategy)!
How vflow works internally
VeridicalFlow provides three main abstractions: Vset, Vfunc, and Subkey. Vfuncs run arbitrary computations on inputs. A Vset collects related Vfunc instances (which wrap user-provided functions or classes) and determines which Vfunc to apply to which inputs and how to do so. Vset outputs are dictionaries with tuple keys composed of one or more Subkey instances that help associate outputs with the Vfunc that produced them and the Vset in which that Vfunc was collected. Below we describe these three abstractions in more detail.
Vset
Fundamentally, all Vset objects generate permutations by applying their functions to the Cartesian product of input parameters. Internally, Vset functions and inputs are wrapped in dictionaries with tuple keys. These tuples contain Subkey objects, which identify the function wrapped by the Vfunc (the Subkey "value") as well as its originating Vset (the Subkey "origin"). The following is a simple example of what a dictionary created by vflow might look like after fitting a random forest classifier from scikit-learn to some data:
{ (X, y, RF): RandomForestClassifier(max_depth=5, n_estimators=50) }
Each of X, y, and RF is an instance of Subkey whose printed representation is its value, with an origin corresponding to the name of the Vset that created it. The values may be user-supplied at Vset initialization; if not, they are generated in the format {vset.name}_{index}, e.g. "modeling_0", "modeling_1", etc. Since inputs must be dictionaries with a certain format, all raw inputs (e.g., numpy arrays) must be initialized using helpers.init_args.
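To make the key structure concrete, here is a minimal, self-contained sketch. ToySubkey and default_values are hypothetical stand-ins written for this illustration, not vflow's actual classes or helpers; the sketch only mirrors the behavior described above (a value, an origin, and a printed form that shows the value).

```python
from dataclasses import dataclass


@dataclass(frozen=True, repr=False)
class ToySubkey:
    """Hypothetical stand-in for a Subkey: a value plus the name of its origin."""
    value: str
    origin: str

    def __repr__(self):
        # Only the value is shown when the key is printed.
        return str(self.value)


def default_values(vset_name, n):
    """Generate fallback values in the {vset.name}_{index} format."""
    return [f"{vset_name}_{i}" for i in range(n)]


# "init" and "modeling" are assumed origin names chosen for this example.
key = (
    ToySubkey("X", "init"),
    ToySubkey("y", "init"),
    ToySubkey("RF", "modeling"),
)
results = {key: "RandomForestClassifier(max_depth=5, n_estimators=50)"}
print(results)
print(default_values("modeling", 2))
```

Because the dataclass is frozen, the toy keys are hashable and can be used inside dictionary key tuples, which is the property the rest of the pipeline relies on.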
When a Vset method such as fit or predict is called with multiple input arguments, the inputs are first combined left to right by Cartesian product; the right Subkey tuple is always filtered on any matching items in the left Subkey tuple, and the two keys' remaining Subkey instances are concatenated. The values of the resulting combined dictionary are tuples containing the values of the inputs concatenated in the order in which they were passed in.
Next, the Vset computes the Cartesian product of the combined input dictionary with Vset.vfuncs or Vset.fitted_vfuncs (depending on the Vset method that was called), combining Subkey tuples in a similar process to determine which Vfuncs to apply to which inputs.
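As a rough illustration of this second product, the sketch below applies every function in a keyed dictionary to every combined input. apply_vfuncs is a hypothetical helper, the keys are plain (value, origin) tuples, and the ordering of the output key (input Subkeys first, function Subkey last) is an assumption based on the (X, y, RF) example above.

```python
from itertools import product


def apply_vfuncs(vfuncs, inputs):
    """Apply each keyed function to each keyed tuple of arguments (toy version)."""
    outputs = {}
    for (ikey, args), (fkey, func) in product(inputs.items(), vfuncs.items()):
        # The output key records both where the data came from and
        # which function produced the result.
        outputs[ikey + fkey] = func(*args)
    return outputs


# Two toy "Vfuncs" collected under an assumed origin named "summarize".
vfuncs = {
    (("total", "summarize"),): sum,
    (("largest", "summarize"),): max,
}
inputs = {(("data", "init"),): ([1, 2, 3],)}
outputs = apply_vfuncs(vfuncs, inputs)
print(outputs)
```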
output_matching
An important Vset initialization parameter is the boolean output_matching. By default, this parameter is False, but it should be set to True when the Vset is used multiple times and its outputs need to be combined regardless of when it was called, as, for example, with a data cleaning Vset that is used first on training data before model fitting and later on test data for model evaluation.
To demonstrate, if a data imputation Vset with values "mean_impute" and "med_impute" was used at an earlier step in the pipeline with output_matching=False, then the following bad matches may occur at the testing stage:
Input dictionaries:
# dictionary of fitted models
{ (X_train, mean_impute, y_train, RF): RF_fit_on_mean_imputed_train_data,
  (X_train, med_impute, y_train, RF): RF_fit_on_med_imputed_train_data }
# dictionary of imputed testing data
{ (X_test, mean_impute, y_test): mean_imputed_test_data,
  (X_test, med_impute, y_test): med_imputed_test_data }
Output dictionary:
{ (X_train, mean_impute, y_train, X_test, mean_impute, RF): # good match
    RF_fit_on_mean_imputed_train_data(mean_imputed_test_data),
  (X_train, mean_impute, y_train, X_test, med_impute, RF): # bad match!
    RF_fit_on_mean_imputed_train_data(med_imputed_test_data),
  (X_train, med_impute, y_train, X_test, med_impute, RF): # good match
    RF_fit_on_med_imputed_train_data(med_imputed_test_data),
  (X_train, med_impute, y_train, X_test, mean_impute, RF): # bad match!
    RF_fit_on_med_imputed_train_data(mean_imputed_test_data) }
Internally, when output_matching=True, new Subkey instances added to the output dictionary keys by the Vset will have an output_matching attribute with value True, which is used to reject Cartesian product combinations when the Subkey origins do not match, or when the origins match but the values differ. See below for more info on Subkey matching.
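The effect on the imputation example can be reproduced with a deliberately simplified sketch. Plain strings stand in for Subkeys, and pair_up and impute_value are hypothetical helpers that compare only the imputation value; the real matching logic works on Subkey origins and values as described in this document.

```python
from itertools import product

fitted_models = {
    ("X_train", "mean_impute", "y_train", "RF"): "RF_fit_on_mean_imputed_train_data",
    ("X_train", "med_impute", "y_train", "RF"): "RF_fit_on_med_imputed_train_data",
}
imputed_test = {
    ("X_test", "mean_impute", "y_test"): "mean_imputed_test_data",
    ("X_test", "med_impute", "y_test"): "med_imputed_test_data",
}

# Values produced by the (assumed) imputation Vset.
IMPUTE_VALUES = {"mean_impute", "med_impute"}


def impute_value(key):
    """Return the imputation entry of a key."""
    return next(item for item in key if item in IMPUTE_VALUES)


def pair_up(models, data, output_matching):
    """List the (model key, data key) pairs that would be combined."""
    pairs = []
    for mkey, dkey in product(models, data):
        # With matching on, reject pairs whose imputation strategies differ.
        if output_matching and impute_value(mkey) != impute_value(dkey):
            continue
        pairs.append((mkey, dkey))
    return pairs


print(len(pair_up(fitted_models, imputed_test, output_matching=False)))
print(len(pair_up(fitted_models, imputed_test, output_matching=True)))
```

Without matching, all four combinations survive, including the two bad ones; with matching, only the two pairs that share an imputation strategy remain.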
In contrast, use the default output_matching=False when:
- Separate calls to the Vset result in entirely independent outputs, as is usually the case for a Vset that subsamples the data. In this case, output_matching=True will result in bad matches, i.e., unnecessary matching on subsamples of the same or different datasets.
- output_matching=True was already used earlier in the pipeline. For example, if you use various strategies to clean your data that must be matched at training and test time, you don't need to initialize a modeling Vset with output_matching=True, but should instead use output_matching=True when initializing the data cleaning Vset.
Asynchronous and lazy computation
There are two important Vset initialization parameters that control how functions in the Vset are computed:
- is_async: when True, all functions are computed asynchronously using Ray. The resources used to distribute computation are determined by the user's call to ray.init() before applying the Vset to inputs. Default is False.
- lazy: when True, functions are computed lazily, meaning that no computation occurs until their results are required downstream in the pipeline. Default is False.
For more on is_async and lazy, see below.
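To illustrate the difference between synchronous and asynchronous evaluation without requiring a Ray cluster, the sketch below uses Python's standard concurrent.futures thread pool as a stand-in for Ray. It is not how vflow dispatches work; run_sync and run_async are hypothetical helpers that only show the submit-then-collect pattern.

```python
from concurrent.futures import ThreadPoolExecutor


def run_sync(funcs, data):
    """Evaluate each function one after another."""
    return {name: func(data) for name, func in funcs.items()}


def run_async(funcs, data):
    """Submit every function first, then collect the results."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(func, data) for name, func in funcs.items()}
        # result() blocks until the corresponding task has finished.
        return {name: future.result() for name, future in futures.items()}


funcs = {"total": sum, "largest": max}
data = [1, 2, 3]
print(run_sync(funcs, data))
print(run_async(funcs, data))
```

Both paths produce the same dictionary; only the scheduling differs, which is why the choice can be exposed as a single boolean flag.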
Vfunc
When a Vset is initialized, the items in the modules arg are wrapped in Vfunc objects, which are like named functions that can optionally support a fit or transform method. Initializing the Vset with is_async=True and lazy=True has the following effects:
- is_async=True: wraps user functions with AsyncVfunc objects, which compute function outputs asynchronously using ray.
- lazy=True: wraps the Vset output dictionary values with VfuncPromise objects, which are lazily evaluated. At the moment, VfuncPromise objects are only resolved if passed to a Vset with lazy=False downstream or if called manually by the user.
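The idea behind lazy evaluation can be sketched with a small promise object. ToyPromise is a hypothetical class written for this illustration, not vflow's VfuncPromise; in particular, caching the result after the first call is an assumption of the sketch.

```python
class ToyPromise:
    """Hypothetical deferred computation: nothing runs until it is called."""

    def __init__(self, func, *args):
        self.func = func
        self.args = args
        self._resolved = False
        self._value = None

    def __call__(self):
        # Compute on first use, then reuse the stored value (assumed behavior).
        if not self._resolved:
            self._value = self.func(*self.args)
            self._resolved = True
        return self._value


calls = []


def square(x):
    calls.append(x)  # record that the function actually ran
    return x * x


promise = ToyPromise(square, 4)
print(calls)      # nothing has been computed yet
print(promise())  # calling the promise manually resolves it
print(calls)
```

A downstream step that needs concrete values would call each promise before using it, which corresponds to the "resolved if passed to a Vset with lazy=False" case described above.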
Subkey
Subkey instances identify the user's functions in a Vset and help to correctly match inputs to other inputs and functions.
Behavior when Subkey has output_matching=False (default)
By default, Subkey objects are non-matching, meaning that vflow does not look for a matching Subkey when combining data or deciding which Vfunc to apply to which inputs. As described above, Vset dictionaries are combined from left to right, two at a time, and if every Subkey in the keys of both dictionaries is non-matching, then the result is the full Cartesian product of the two dictionaries.
Behavior when Subkey has output_matching=True
If one of the entries in a given dictionary's Subkey tuple is matching, then vflow tries to find a match in the tuple keys of the other dictionary. The first dictionary's value is combined with the other dictionary's value in two cases:
- The Subkey instances of the other dictionary's tuple key are all non-matching.
- The other dictionary's key has a Subkey with output_matching=True that has the same origin and same value as the first dictionary's Subkey.
For a Subkey to be matching, it must have been created by a Vset that was initialized with output_matching=True, as described above.
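The two cases can be written as a small predicate. This is a sketch of the rule as stated in this section, checked in one direction only; MatchSubkey and should_combine are hypothetical names, and vflow's own matching code may differ in details such as symmetry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MatchSubkey:
    """Hypothetical Subkey stand-in carrying an output_matching flag."""
    value: str
    origin: str
    output_matching: bool = False


def should_combine(first_key, other_key):
    """Decide whether two keyed entries may be combined (toy version)."""
    other_matching = [sk for sk in other_key if sk.output_matching]
    for sk in first_key:
        if not sk.output_matching:
            continue  # non-matching Subkeys never block a combination
        if not other_matching:
            continue  # case 1: the other key is entirely non-matching
        # Case 2: the other key needs a matching Subkey with the same
        # origin and the same value.
        if not any(o.origin == sk.origin and o.value == sk.value
                   for o in other_matching):
            return False
    return True


# "impute" and "init" are assumed origin names for this example.
mean = MatchSubkey("mean_impute", "impute", output_matching=True)
med = MatchSubkey("med_impute", "impute", output_matching=True)
x_train = MatchSubkey("X_train", "init")
x_test = MatchSubkey("X_test", "init")

print(should_combine((x_train, mean), (x_test, mean)))
print(should_combine((x_train, mean), (x_test, med)))
print(should_combine((x_train, mean), (x_test,)))
```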