Mixture Model
Provides the LymphMixture class for wrapping multiple lymph models.
Each component and subgroup of the mixture model is a
Unilateral instance. Its properties, parametrization, and
data are orchestrated by the LymphMixture class. It provides the methods
and computations necessary to use the expectation-maximization algorithm to fit the
model to data.
- class lymixture.models.LymphMixture(model_cls: type[~lymixture.models.ModelType] = <class 'lymph.models.unilateral.Unilateral'>, model_kwargs: dict[str, ~typing.Any] | None = None, num_components: int = 2, *, universal_p: bool = False, shared_transmission: bool = False, split_midext: bool = False)
Bases: Composite, Composite, Model
Class that handles the individual components of the mixture model.
- __init__(model_cls: type[~lymixture.models.ModelType] = <class 'lymph.models.unilateral.Unilateral'>, model_kwargs: dict[str, ~typing.Any] | None = None, num_components: int = 2, *, universal_p: bool = False, shared_transmission: bool = False, split_midext: bool = False) -> None
Initialize the mixture model.
The mixture will be based on the given model_cls (which is instantiated with the model_kwargs), and will have num_components components. universal_p indicates whether the model shares the time prior distribution over all components.
- midext_prob_builder() -> ndarray
Build an array of midext probabilities for each patient and component. The result matches the number of patients in the model and assigns each patient the correct midext / 1 - midext probability: in column 0 if there is an extension and in column 1 if there is no extension. If the extension is NaN, both columns hold the midext and 1 - midext probability.
- get_mixture_coefs(component: int | None = None, subgroup: str | None = None, *, norm: bool = True) -> float | Series | DataFrame
Get mixture coefficients for the given subgroup and component.
The mixture coefficients are sliced by the given subgroup and component, which means that if no subgroup and/or component is given, multiple mixture coefficients are returned.
If norm is set to True, the mixture coefficients are normalized along the component axis before being returned.
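To illustrate what normalizing "along the component axis" means, here is a minimal, library-independent sketch. The helper `normalize_over_components` and the subgroup names are hypothetical and not part of lymixture; the real method works on a DataFrame rather than a plain dict.

```python
def normalize_over_components(coefs):
    """Normalize raw coefficients so they sum to one within each subgroup.

    `coefs` maps a (hypothetical) subgroup name to a list holding one raw
    coefficient per component.
    """
    normalized = {}
    for subgroup, values in coefs.items():
        total = sum(values)
        normalized[subgroup] = [value / total for value in values]
    return normalized


# Two components, two made-up subgroups.
raw = {"oropharynx": [2.0, 6.0], "larynx": [1.0, 1.0]}
normed = normalize_over_components(raw)
```

After this step, each subgroup's coefficients can be read as the probability that a patient of that subgroup belongs to each component.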
- set_mixture_coefs(new_mixture_coefs: float | ndarray, component: int | None = None, subgroup: str | None = None) -> None
Assign new mixture coefficients to the model.
As in get_mixture_coefs(), subgroup and component can be used to slice the mixture coefficients and therefore assign entirely new coefs to the entire model, to one subgroup, to one component, or to one component of one subgroup.
Note
After setting, these coefficients are not normalized.
- repeat_mixture_coefs(t_stage: str | None = None, subgroup: str | None = None, *, log: bool = False) -> ndarray
Repeat mixture coefficients.
The result will match the number of patients with tumors of t_stage that are in the specified subgroup (or all if it is set to None). The mixture coefficients are returned in log-space if log is set to True.
This method enables easy multiplication of the mixture coefficients with the likelihoods of the patients under the components, as in the method patient_mixture_likelihoods().
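The idea of repeating coefficients so they line up with a per-patient likelihood table can be sketched in plain Python. Everything here (`repeat_coefs`, the subgroup labels, the numbers) is made up for illustration and does not reflect the library's internals.

```python
def repeat_coefs(coefs, patient_subgroups):
    """Return one row of component coefficients per patient.

    `coefs` maps subgroup -> coefficients per component, and
    `patient_subgroups` lists the subgroup of every patient.
    """
    return [list(coefs[subgroup]) for subgroup in patient_subgroups]


coefs = {"A": [0.25, 0.75], "B": [0.5, 0.5]}
patient_subgroups = ["A", "A", "B"]

# Hypothetical likelihood of each patient under each of the two components.
likelihoods = [[0.2, 0.4], [0.1, 0.1], [0.3, 0.6]]

repeated = repeat_coefs(coefs, patient_subgroups)

# Same shape as `likelihoods`, so an element-wise product is straightforward.
weighted = [
    [coef * lik for coef, lik in zip(coef_row, lik_row)]
    for coef_row, lik_row in zip(repeated, likelihoods)
]
```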
- infer_mixture_coefs(new_resps: ndarray | None = None, *, log: bool = False) -> DataFrame
Infer optimal mixture coefficients based on responsibilities.
This method updates the mixture coefficients by averaging the corresponding responsibilities, which can be provided via new_resps or taken from the model if new_resps is None.
The result is a DataFrame of shape (num_components, num_subgroups), which can be used to update the mixture coefficients via set_mixture_coefs.
If log is True, both the input new_resps and the output coefficients are in log-space for numerical stability.
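Averaging responsibilities per subgroup is the textbook maximization step for mixture coefficients. The sketch below shows that computation on plain lists; `infer_coefs` is a hypothetical stand-in, not the library's implementation, and it ignores the log-space variant.

```python
def infer_coefs(resps, patient_subgroups):
    """Average the responsibilities of all patients within each subgroup."""
    sums, counts = {}, {}
    for row, subgroup in zip(resps, patient_subgroups):
        acc = sums.setdefault(subgroup, [0.0] * len(row))
        for k, resp in enumerate(row):
            acc[k] += resp
        counts[subgroup] = counts.get(subgroup, 0) + 1
    return {
        subgroup: [total / counts[subgroup] for total in acc]
        for subgroup, acc in sums.items()
    }


# One row per patient, one column per component.
resps = [[1.0, 0.0], [0.5, 0.5], [0.25, 0.75]]
patient_subgroups = ["A", "A", "B"]
coefs = infer_coefs(resps, patient_subgroups)
```

If every row of responsibilities sums to one, the resulting coefficients of each subgroup sum to one as well.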
- get_params(*, as_dict: bool = True, as_flat: bool = True, model_params_only: bool = False) -> Iterable[float] | dict[str, float]
Get the parameters of the mixture model.
This includes both the parameters of the individual components and the mixture coefficients. If a dictionary is returned (i.e. if as_dict is set to True), the components’ parameters are nested under keys that simply enumerate them. The mixture coefficients are returned under keys of the form <subgroup>from<component>_coef.
The parameters are returned as a dictionary if as_dict is True, and as an iterable of floats otherwise. The argument as_flat determines whether the returned dict is flat or nested.
See also
In the lymph package, the model parameters are also set and retrieved using the get_params() and set_params() methods. We tried to keep the interface as similar as possible.
>>> graph_dict = {
...     ("tumor", "T"): ["II", "III"],
...     ("lnl", "II"): ["III"],
...     ("lnl", "III"): [],
... }
>>> mixture = LymphMixture(
...     model_kwargs={"graph_dict": graph_dict},
...     num_components=2,
... )
>>> mixture.get_params(as_dict=True)
{'0_TtoII_spread': 0.0, '0_TtoIII_spread': 0.0, '0_IItoIII_spread': 0.0,
 '1_TtoII_spread': 0.0, '1_TtoIII_spread': 0.0, '1_IItoIII_spread': 0.0}
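The relation between a nested and a flat parameter dictionary can be pictured with a small helper. This is only a sketch: the nested layout shown (component index, then edge name, then parameter type) is an assumption inferred from the flat keys such as `0_TtoII_spread`, and `flatten` is a hypothetical function written for this example.

```python
def flatten(nested, sep="_", prefix=""):
    """Flatten a nested dict by joining the keys along each path with `sep`."""
    flat = {}
    for key, value in nested.items():
        full_key = f"{prefix}{sep}{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, sep, full_key))
        else:
            flat[full_key] = value
    return flat


# Assumed nested layout: component -> edge -> parameter type.
nested = {
    "0": {"TtoII": {"spread": 0.1}},
    "1": {"TtoII": {"spread": 0.4}},
}
flat = flatten(nested)
```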
- set_params(*args: float, **kwargs: float) -> tuple[float]
Assign new params to the component models.
This includes both the spread parameters for the component’s models (if provided as positional arguments, they are used up first), as well as the mixture coefficients for the subgroups.
See also
In the lymph package, the model parameters are also set and retrieved using the get_params() and set_params() methods. We tried to keep the interface as similar as possible.
Important
After setting all parameters, the mixture coefficients are normalized and may thus not be the same as the ones provided in the arguments.
>>> graph_dict = {
...     ("tumor", "T"): ["II", "III"],
...     ("lnl", "II"): ["III"],
...     ("lnl", "III"): [],
... }
>>> mixture = LymphMixture(
...     model_kwargs={"graph_dict": graph_dict},
...     num_components=2,
... )
>>> mixture.set_params(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7)
(0.7,)
>>> mixture.get_params(as_dict=True)
{'0_TtoII_spread': 0.1, '0_TtoIII_spread': 0.2, '0_IItoIII_spread': 0.3,
 '1_TtoII_spread': 0.4, '1_TtoIII_spread': 0.5, '1_IItoIII_spread': 0.6}
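The doctest above returns `(0.7,)` because positional arguments are consumed in order and whatever is left over is handed back. A minimal sketch of that pattern, using the hypothetical helper `assign_positional` (not a lymixture function):

```python
def assign_positional(param_names, *args):
    """Assign positional values to names in order and return the leftovers."""
    values = dict(zip(param_names, args))  # zip stops at the shorter input
    remaining = args[len(values):]
    return values, remaining


names = [
    "0_TtoII_spread", "0_TtoIII_spread", "0_IItoIII_spread",
    "1_TtoII_spread", "1_TtoIII_spread", "1_IItoIII_spread",
]
values, remaining = assign_positional(names, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7)
```

Returning the unused values lets a caller pass one long argument tuple through several setters in sequence.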
- get_resp_indices(subgroup: str | None = None, t_stage: str | None = None) -> ndarray
Get the indices of the responsibilities.
Returns a boolean array of shape (num_patients,) that is True for each patient that has the given t_stage and belongs to the given subgroup.
Both subgroup and t_stage are optional.
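A boolean mask with optional filters can be sketched as follows. The function `resp_mask` and the patient records are invented for illustration; the real method operates on the model's loaded patient data.

```python
def resp_mask(patients, subgroup=None, t_stage=None):
    """Return one boolean per patient; a filter set to None matches everyone."""
    return [
        (subgroup is None or patient["subgroup"] == subgroup)
        and (t_stage is None or patient["t_stage"] == t_stage)
        for patient in patients
    ]


patients = [
    {"subgroup": "A", "t_stage": "early"},
    {"subgroup": "A", "t_stage": "late"},
    {"subgroup": "B", "t_stage": "early"},
]
mask_all = resp_mask(patients)
mask_a_early = resp_mask(patients, subgroup="A", t_stage="early")
```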
- get_resps(subgroup: str | None = None, component: int | None = None, t_stage: str | None = None, *, norm: bool = True) -> Series | DataFrame
Get the responsibilities of each patient for a component.
One can filter the returned table of responsibilities by the patient’s subgroup and T-stage. If norm is set to True, the responsibilities are normalized to sum to one along the component axis.
- set_resps(new_resps: float | ndarray, subgroup: str | None = None, component: int | None = None, t_stage: str | None = None) -> None
Assign new_resps (responsibilities) to the model.
They should have the shape (num_patients, num_components), where num_patients is either the total number of patients in the model or only the number of patients in the subgroup (if that argument is not None), and summing them along the last axis should yield a vector of ones.
Note that these responsibilities essentially become the latent variables of the model or the expectation values of the latent variables (depending on whether or not they are “hardened”, see harden_responsibilities()).
Note
As in the set_mixture_coefs() method, the responsibilities are not normalized after setting them.
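The difference between soft and "hardened" responsibilities can be made concrete with a short sketch. It assumes that hardening means replacing each row by a one-hot vector at its largest entry, which is the usual meaning but is not confirmed by the text above; `harden` is a hypothetical helper.

```python
def harden(resps):
    """Turn soft responsibilities into one-hot rows (assumed meaning).

    Each patient is assigned entirely to the component with the largest
    responsibility; ties go to the first such component.
    """
    hardened = []
    for row in resps:
        winner = max(range(len(row)), key=row.__getitem__)
        hardened.append([1.0 if k == winner else 0.0 for k in range(len(row))])
    return hardened


soft = [[0.7, 0.3], [0.2, 0.8], [0.5, 0.5]]
hard = harden(soft)
```

Soft rows are expectation values of the latent assignment, while one-hot rows act like the latent assignment itself.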
- load_patient_data(patient_data: DataFrame, split_by: tuple[str, str, str], **kwargs) -> None
Split the patient_data into subgroups and load it into the model.
This amounts to computing the diagnosis matrices for the individual subgroups. The split_by tuple should contain the three-level header of the LyProX-style data. Any additional keyword arguments are passed to the load_patient_data() method.
- property patient_data: DataFrame
Return all patients stored in the individual subgroups.
- patient_component_likelihoods(t_stage: str | None = None, component: int | None = None, *, log: bool = True) -> ndarray
Compute the (log-)likelihood of all patients, given the components.
The returned array has shape (num_patients, num_components) and contains the likelihood of each patient with t_stage under each component. If log is set to True, the likelihoods are returned in log-space.
- patient_mixture_likelihoods(t_stage: str | None = None, component: int | None = None, *, log: bool = True, marginalize: bool = False) -> ndarray
Compute the (log-)likelihood of all patients under the mixture model.
This is essentially the (log-)likelihood of all patients given the individual components as computed by patient_component_likelihoods(), but weighted by the mixture coefficients. This means that the returned array, when marginalize is set to False, represents the unnormalized expected responsibilities of the patients for the components.
If marginalize is set to True, the likelihoods are summed over the components, effectively marginalizing the components out of the likelihoods and yielding the incomplete data likelihood per patient.
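In log-space, weighting by the mixture coefficients becomes an addition and the sum over components becomes a log-sum-exp. The sketch below shows both for a single subgroup; `mixture_log_likelihoods` and all numbers are hypothetical and only illustrate the computation described above.

```python
import math


def mixture_log_likelihoods(log_coefs, component_log_liks, marginalize=False):
    """Add log-coefficients to per-component log-likelihoods.

    With `marginalize=True`, the components are summed out per patient
    using a numerically stable log-sum-exp.
    """
    joint = [
        [log_coef + log_lik for log_coef, log_lik in zip(log_coefs, row)]
        for row in component_log_liks
    ]
    if not marginalize:
        return joint

    marginal = []
    for row in joint:
        largest = max(row)
        marginal.append(
            largest + math.log(sum(math.exp(v - largest) for v in row))
        )
    return marginal


log_coefs = [math.log(0.25), math.log(0.75)]
component_log_liks = [[math.log(0.2), math.log(0.4)]]  # one patient

joint = mixture_log_likelihoods(log_coefs, component_log_liks)
marginal = mixture_log_likelihoods(log_coefs, component_log_liks, marginalize=True)
```

Subtracting the marginal from each joint entry of a patient would give that patient's normalized log-responsibilities.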
- incomplete_data_likelihood(t_stage: str | None = None, component: int | None = None, *, log: bool = True) -> float
Compute the incomplete data likelihood of the model.
- complete_data_likelihood(t_stage: str | None = None, component: int | None = None, *, log: bool = True) -> float
Compute the complete data likelihood of the model.
- likelihood(given_params: Iterable[float] | dict[str, float] | None = None, given_resps: ndarray | None = None, *, log: bool = True, use_complete: bool = True) -> float
Compute the (in-)complete data likelihood of the model.
The likelihood is computed for the given_params. If no parameters are given, the currently set parameters of the model are used.
If responsibilities for each patient and component are given via given_resps, they are used to compute the complete data likelihood. Otherwise, the incomplete data likelihood is computed, which marginalizes over the responsibilities.
The likelihood is returned in log-space if log is set to True.
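The two likelihoods can be contrasted using their textbook definitions for mixture models. This sketch assumes those standard forms (it is not taken from the lymixture source) and works in linear space on made-up numbers for a single subgroup.

```python
import math


def incomplete_log_likelihood(coefs, liks):
    """Sum over patients of log(sum_k coef_k * lik_ik)."""
    return sum(
        math.log(sum(coef * lik for coef, lik in zip(coefs, row)))
        for row in liks
    )


def complete_log_likelihood(coefs, liks, resps):
    """Sum over patients and components of resp_ik * log(coef_k * lik_ik)."""
    total = 0.0
    for lik_row, resp_row in zip(liks, resps):
        for coef, lik, resp in zip(coefs, lik_row, resp_row):
            if resp > 0.0:  # skip terms that do not contribute
                total += resp * (math.log(coef) + math.log(lik))
    return total


coefs = [0.25, 0.75]
liks = [[0.2, 0.4], [0.1, 0.1]]   # patients x components
resps = [[1.0, 0.0], [0.0, 1.0]]  # hardened responsibilities

incomplete = incomplete_log_likelihood(coefs, liks)
complete = complete_log_likelihood(coefs, liks, resps)
```

With responsibilities that sum to one per patient, the complete-data value defined this way never exceeds the incomplete-data value.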
- state_dist(t_stage: str = 'early', subgroup: str | None = None) -> ndarray
Compute the distribution over possible states.
Do this for a given t_stage and subgroup. If no subgroup is given, the distribution is computed for all subgroups. The result is a matrix with shape (num_subgroups, num_states).
- posterior_state_dist(subgroup: str | None = None, given_params: Iterable[float] | dict[str, float] | None = None, given_state_dist: ndarray | None = None, given_diagnosis: dict[str, dict[str, Literal[False, 0, 'healthy', True, 1, 'involved', 'micro', 'macro', 'notmacro'] | None]] | None = None, t_stage: str | int = 'early', midext: bool | None = None, central: bool | None = None) -> ndarray
Compute the posterior distribution over hidden states given a diagnosis.
The given_diagnosis is a dictionary of diagnoses for each modality. E.g., this could look like this:
    given_diagnosis = {
        "MRI": {"II": True, "III": False, "IV": False},
        "PET": {"II": True, "III": True, "IV": None},
    }
The t_stage parameter determines the T-stage for which the posterior is computed.
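Conceptually, a posterior over hidden states follows from Bayes' rule: the prior state distribution is multiplied by the probability of the observed diagnosis in each state and then renormalized. The sketch below does this for two lymph node levels and a single modality. The function `posterior_over_states`, the sensitivity and specificity values, and the uniform prior are all assumptions made for illustration.

```python
from itertools import product


def posterior_over_states(prior, diagnosis, sensitivity, specificity):
    """Bayes update of a state distribution given one modality's diagnosis.

    `prior` maps a tuple of booleans (one per LNL) to a probability and
    `diagnosis` holds True/False/None per LNL, where None means unobserved.
    """
    unnormalized = {}
    for state, prob in prior.items():
        likelihood = 1.0
        for involved, observed in zip(state, diagnosis):
            if observed is None:
                continue  # unobserved LNLs carry no information
            if involved:
                likelihood *= sensitivity if observed else 1.0 - sensitivity
            else:
                likelihood *= 1.0 - specificity if observed else specificity
        unnormalized[state] = prob * likelihood
    total = sum(unnormalized.values())
    return {state: value / total for state, value in unnormalized.items()}


states = list(product([False, True], repeat=2))
prior = {state: 0.25 for state in states}
posterior = posterior_over_states(
    prior, (True, None), sensitivity=0.8, specificity=0.9
)
```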
- risk(subgroup: str, involvement: dict[str, Literal[False, 0, 'healthy', True, 1, 'involved', 'micro', 'macro', 'notmacro'] | None], given_params: Iterable[float] | dict[str, float] | None = None, given_state_dist: ndarray | None = None, given_diagnosis: dict[str, dict[str, Literal[False, 0, 'healthy', True, 1, 'involved', 'micro', 'macro', 'notmacro'] | None]] | None = None, t_stage: str = 'early', midext: bool | None = None) -> float
Compute risk of a certain involvement, using the given_diagnosis.
If an involvement pattern of interest is provided, this method computes the risk of seeing just that pattern for the set of given parameters and a dictionary of diagnoses for each modality.
If no involvement is provided, this will simply return the posterior distribution over hidden states, given the diagnosis, as computed by the posterior_state_dist() method. See its documentation for more details about the arguments and the return value.
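Given a posterior over hidden states, the risk of an involvement pattern is the total posterior probability of all states compatible with it. A minimal sketch with a hypothetical helper `pattern_risk` and an invented posterior over two lymph node levels:

```python
def pattern_risk(posterior, involvement=None):
    """Sum the posterior over all states that match `involvement`.

    An entry of None in `involvement` leaves that LNL unspecified. If no
    pattern is given at all, the posterior itself is returned.
    """
    if involvement is None:
        return posterior
    return sum(
        prob
        for state, prob in posterior.items()
        if all(
            wanted is None or wanted == actual
            for wanted, actual in zip(involvement, state)
        )
    )


# Made-up posterior over the states of two LNLs (False = healthy).
posterior = {
    (False, False): 0.1,
    (False, True): 0.2,
    (True, False): 0.3,
    (True, True): 0.4,
}
risk_second_involved = pattern_risk(posterior, (None, True))
risk_both_involved = pattern_risk(posterior, (True, True))
```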