A library for making stability analysis simple. Easily evaluate the effect of judgment calls to your data-science pipeline (e.g. choice of imputation strategy)!
How vflow works internally
VeridicalFlow provides three main abstractions: Vset, Vfunc, and Subkey. Vfuncs run arbitrary computations on inputs. A Vset collects related Vfunc instances (which wrap user-provided functions or classes) and determines which Vfunc to apply to which inputs and how to do so. Vset outputs are dictionaries with tuple keys composed of one or more Subkey instances that help associate outputs with the Vfunc that produced them and the Vset in which that Vfunc was collected. Below we describe these three abstractions in more detail.
Vset
Fundamentally, all Vset objects generate permutations by applying their functions to the Cartesian product of input parameters. Internally, Vset functions and inputs are wrapped in dictionaries with tuple keys. These tuples contain Subkey objects, which identify the functions wrapped by the Vfunc (the Subkey "value"), as well as its originating Vset (the Subkey "origin"). The following is a simple example of what a dictionary created by vflow might look like after fitting a random forest classifier from scikit-learn to some data:
{ (X, y, RF): RandomForestClassifier(max_depth=5, n_estimators=50) }
Each of X, y, and RF is an instance of Subkey whose printed representation is the Subkey value; its origin is the name of the Vset that created the Subkey. The values may be user-supplied at Vset initialization or, if not, will be generated in the format {vset.name}_{index}, e.g. "modeling_0", "modeling_1", etc. Since inputs must be dictionaries in this format, all raw inputs (e.g., numpy arrays) must be initialized using helpers.init_args.
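The value/origin pairing and the default naming scheme can be pictured with a small stand-in class. This is a minimal sketch for illustration only: ToySubkey and default_keys are hypothetical names, not part of vflow, and the real Subkey carries more state than shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True, repr=False)
class ToySubkey:
    """Hypothetical stand-in for a Subkey: a value plus the name of its originating Vset."""
    value: str
    origin: str

    def __repr__(self):
        # As in the text, the printed representation shows only the value.
        return self.value


def default_keys(vset_name, n_funcs):
    """Generate default keys in the {vset.name}_{index} format (hypothetical helper)."""
    return [ToySubkey(f"{vset_name}_{i}", vset_name) for i in range(n_funcs)]


keys = default_keys("modeling", 2)
print(keys)            # [modeling_0, modeling_1]
print(keys[0].origin)  # modeling
```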
When a Vset method such as fit or predict is called with multiple input arguments, the inputs are first combined left to right by Cartesian product; the right Subkey tuple is always filtered on any matching items in the left Subkey tuple, and the two keys' remaining Subkey instances are concatenated. The values of the resulting combined dictionary are tuples containing the values of the inputs concatenated in the order in which they were passed in.
Next, the Vset computes the Cartesian product of the combined input dictionary with Vset.vfuncs or Vset.fitted_vfuncs (depending on the Vset method that was called), combining Subkey tuples in a similar process to determine which Vfuncs to apply to which inputs.
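The two-step combination can be approximated in plain Python. The sketch below is a toy model, not vflow's implementation: keys are tuples of strings rather than Subkey instances, the helper names (combine_keys, combine_inputs, apply_funcs) are hypothetical, and "filtering" is read here as dropping items from the right key that already appear in the left key.

```python
from itertools import product


def combine_keys(left, right):
    """Concatenate two key tuples, dropping items of `right` already in `left`."""
    return left + tuple(k for k in right if k not in left)


def combine_inputs(*dicts):
    """Combine input dictionaries left to right by Cartesian product."""
    combined = {(): ()}  # neutral start: empty key, empty tuple of values
    for d in dicts:
        combined = {
            combine_keys(lk, rk): lv + (rv,)
            for (lk, lv), (rk, rv) in product(combined.items(), d.items())
        }
    return combined


def apply_funcs(inputs, funcs):
    """Apply every function to every combined input (second Cartesian product)."""
    return {
        combine_keys(ik, fk): func(*args)
        for (ik, args), (fk, func) in product(inputs.items(), funcs.items())
    }


X = {("X_train",): "X data"}
y = {("y_train",): "y data"}
funcs = {
    ("RF",): lambda a, b: f"RF({a}, {b})",
    ("LR",): lambda a, b: f"LR({a}, {b})",
}

inputs = combine_inputs(X, y)
outputs = apply_funcs(inputs, funcs)
print(inputs)         # {('X_train', 'y_train'): ('X data', 'y data')}
print(list(outputs))  # [('X_train', 'y_train', 'RF'), ('X_train', 'y_train', 'LR')]
```

Each output key ends with the identifier of the function that produced it, mirroring the (X, y, RF) key shown earlier.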
output_matching
An important Vset initialization parameter is the boolean output_matching. By default, this parameter is False, but it should be set to True when the Vset is used multiple times and its outputs need to be combined regardless of when it was called, as, for example, with a data cleaning Vset that is used first on training data before model fitting and later on test data for model evaluation.
To demonstrate, if a data imputation Vset with values "mean_impute" and "med_impute" was used at an earlier step in the pipeline with output_matching=False, then the following bad matches may occur at the testing stage:
Input dictionaries:
# dictionary of fitted models
{ (X_train, mean_impute, y_train, RF): RF_fit_on_mean_imputed_train_data,
(X_train, med_impute, y_train, RF): RF_fit_on_med_imputed_train_data }
# dictionary of imputed testing data
{ (X_test, mean_impute, y_test): mean_imputed_test_data,
(X_test, med_impute, y_test): med_imputed_test_data }
Output dictionary:
{ (X_train, mean_impute, y_train, X_test, mean_impute, RF): # good match
RF_fit_on_mean_imputed_train_data(mean_imputed_test_data),
(X_train, mean_impute, y_train, X_test, med_impute, RF): # bad match!
RF_fit_on_mean_imputed_train_data(med_imputed_test_data),
(X_train, med_impute, y_train, X_test, med_impute, RF): # good match
RF_fit_on_med_imputed_train_data(med_imputed_test_data),
(X_train, med_impute, y_train, X_test, mean_impute, RF): # bad match!
RF_fit_on_med_imputed_train_data(mean_imputed_test_data) }
Internally, when output_matching=True, each new Subkey that the Vset adds to the output dictionary keys has an output_matching attribute set to True. This attribute is used to reject Cartesian product combinations in which the Subkey origins do not match, or in which the origins match but the values differ. See below for more info on Subkey matching.
In contrast, use the default output_matching=False when:
- Separate calls to the Vset result in entirely independent outputs, such as is usually the case for a Vset that does subsampling of the data. In this case, output_matching=True will result in bad matches, i.e., unnecessary matching on subsamples of the same or different datasets.
- output_matching=True was already used earlier in the pipeline. For example, if you use various strategies to clean your data that must be matched at training and test time, you don't need to initialize a modeling Vset with output_matching=True, but should instead use output_matching=True when initializing the data cleaning Vset.
Asynchronous and lazy computation
There are two important Vset initialization parameters that control how functions in the Vset are computed:
- is_async: when True, all functions are computed asynchronously using Ray. The resources used to distribute computation are determined by the user's call to ray.init() before applying the Vset to inputs. Default is False.
- lazy: when True, functions are computed lazily, meaning that no computation occurs until their results are required downstream in the pipeline. Default is False.
For more on is_async and lazy, see below.
Vfunc
When a Vset is initialized, the items in the modules arg are wrapped in Vfunc objects, which are like named functions that can optionally support a fit or transform method. Initializing the Vset with is_async=True and lazy=True has the following effects:
- is_async=True: wraps user functions with AsyncVfunc objects, which compute function outputs asynchronously using ray.
- lazy=True: wraps the Vset output dictionary values with VfuncPromise objects, which are lazily evaluated. At the moment, VfuncPromise objects are only resolved if passed to a Vset with lazy=False downstream or if called manually by the user.
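The idea behind lazy evaluation can be sketched with a small promise class. This is an illustration of the general pattern only: ToyPromise is a hypothetical class, not vflow's VfuncPromise, and it assumes a promise is resolved by calling it.

```python
class ToyPromise:
    """Hypothetical lazily evaluated result (not vflow's VfuncPromise)."""

    def __init__(self, func, *args):
        self.func = func
        self.args = args
        self._resolved = False
        self._value = None

    def __call__(self):
        if not self._resolved:
            # Resolve upstream promises first, then run the function once.
            args = [a() if isinstance(a, ToyPromise) else a for a in self.args]
            self._value = self.func(*args)
            self._resolved = True
        return self._value


calls = []  # records which functions have actually run


def double(x):
    calls.append("double")
    return 2 * x


def add_one(x):
    calls.append("add_one")
    return x + 1


p = ToyPromise(add_one, ToyPromise(double, 5))
print(calls)  # [] (nothing has been computed yet)
print(p())    # 11
print(calls)  # ['double', 'add_one']
```

Building the chain of promises runs no user code; the work happens only when the final promise is called, and the cached result is reused on later calls.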
Subkey
Subkey instances identify the user's functions in a Vset and help to correctly match inputs to other inputs and functions.
Behavior when Subkey has output_matching=False (default)
By default, Subkey objects are non-matching, meaning that vflow won't bother to look for a matching Subkey when combining data or deciding which Vfunc to apply to which inputs. As described above, Vset dictionaries are combined from left to right two-at-a-time, and if every Subkey in the keys of both dictionaries is non-matching then the result is the full Cartesian product of the two dictionaries.
Behavior when Subkey has output_matching=True
If one of the entries in a given dictionary's Subkey tuple is matching then vflow tries to find a match in the tuple keys of the other dictionary. The first dictionary's value is combined with the other dictionary's value in two cases:
- The Subkey instances of the other dictionary's tuple key are all non-matching.
- The other dictionary's key has a Subkey with output_matching=True that has the same origin and same value as the first dictionary's Subkey.
For a Subkey to be matching, it must have been created by a Vset that was initialized with output_matching=True, as described above.
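The two cases can be expressed as a small predicate and checked against the imputation example from the output_matching section. This is a sketch under assumptions, not vflow's code: ToySubkey and can_combine are hypothetical names, the origins "init", "impute", and "modeling" are made up for the example, and the rule is applied symmetrically to both keys.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class ToySubkey:
    """Hypothetical stand-in for a Subkey with an output_matching flag."""
    value: str
    origin: str
    output_matching: bool = False


def can_combine(left_key, right_key):
    """Return True if two key tuples may be combined under the two cases above."""
    for first, other in ((left_key, right_key), (right_key, left_key)):
        other_matching = {s for s in other if s.output_matching}
        if not other_matching:
            continue  # Case 1: the other key is entirely non-matching.
        for subkey in first:
            # Case 2: a matching Subkey needs a counterpart with the same
            # origin and value among the other key's matching Subkeys.
            if subkey.output_matching and subkey not in other_matching:
                return False
    return True


# Only the imputation Subkeys are matching; origins here are assumed names.
mean = ToySubkey("mean_impute", "impute", output_matching=True)
med = ToySubkey("med_impute", "impute", output_matching=True)
X_train, y_train = ToySubkey("X_train", "init"), ToySubkey("y_train", "init")
X_test, y_test = ToySubkey("X_test", "init"), ToySubkey("y_test", "init")
RF = ToySubkey("RF", "modeling")

fitted_keys = [(X_train, mean, y_train, RF), (X_train, med, y_train, RF)]
test_keys = [(X_test, mean, y_test), (X_test, med, y_test)]

good = [
    (f[1].value, t[1].value)
    for f, t in product(fitted_keys, test_keys)
    if can_combine(f, t)
]
print(good)  # [('mean_impute', 'mean_impute'), ('med_impute', 'med_impute')]
```

With the imputation Subkeys marked as matching, the two mismatched pairings are rejected and only the two good matches remain; if every Subkey were non-matching, all four pairings would survive.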