Mixture Model

Provides the LymphMixture class for wrapping multiple lymph models.

Each component and subgroup of the mixture model is a Unilateral instance. Its properties, parametrization, and data are orchestrated by the LymphMixture class. It provides the methods and computations necessary to use the expectation-maximization algorithm to fit the model to data.

class lymixture.models.LymphMixture(model_cls: type[lymixture.models.ModelType] = <class 'lymph.models.unilateral.Unilateral'>, model_kwargs: dict[str, typing.Any] | None = None, num_components: int = 2, *, universal_p: bool = False, shared_transmission: bool = False, split_midext: bool = False)

Bases: Composite, Composite, Model

Class that handles the individual components of the mixture model.

__init__(model_cls: type[lymixture.models.ModelType] = <class 'lymph.models.unilateral.Unilateral'>, model_kwargs: dict[str, typing.Any] | None = None, num_components: int = 2, *, universal_p: bool = False, shared_transmission: bool = False, split_midext: bool = False) -> None

Initialize the mixture model.

The mixture will be based on the given model_cls (which is instantiated with the model_kwargs), and will have num_components. universal_p indicates whether the model shares the time prior distribution over all components.

property is_trinary: bool

Check if the model is trinary.

midext_prob_builder() -> ndarray

Build an array of midext probabilities for each patient and component. The result has one row per patient in the model. For each patient, the correct midext/1-midext probability is placed in column 0 if there is an extension and in column 1 if there is no extension. If the extension is NaN, both columns will have the midext and 1-midext probability.

get_mixture_coefs(component: int | None = None, subgroup: str | None = None, *, norm: bool = True) -> float | Series | DataFrame

Get mixture coefficients for the given subgroup and component.

The mixture coefficients are sliced by the given subgroup and component. If no subgroup and/or no component is given, multiple mixture coefficients are returned.

If norm is set to True, the mixture coefficients are normalized along the component axis before being returned.

set_mixture_coefs(new_mixture_coefs: float | ndarray, component: int | None = None, subgroup: str | None = None) -> None

Assign new mixture coefficients to the model.

As in get_mixture_coefs(), subgroup and component can be used to slice the mixture coefficients and therefore assign entirely new coefs to the entire model, to one subgroup, to one component, or to one component of one subgroup.

Note

After setting, these coefficients are not normalized.

normalize_mixture_coefs() -> None

Normalize the mixture coefficients to sum to one.
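
As a rough illustration of what normalizing along the component axis means, the sketch below uses a plain dictionary in place of the model's internal table. The function name `normalize_coefs`, the data layout, and the subgroup labels are hypothetical and not part of lymixture.

```python
def normalize_coefs(coefs):
    """Normalize each subgroup's coefficients so that they sum to one.

    `coefs` maps a subgroup label to a list with one value per component.
    This layout is an assumption made for illustration only.
    """
    normalized = {}
    for subgroup, values in coefs.items():
        total = sum(values)
        normalized[subgroup] = [value / total for value in values]
    return normalized


# Unnormalized coefficients for two hypothetical subgroups and two components.
raw = {"oropharynx": [2.0, 6.0], "larynx": [1.0, 1.0]}
normalized = normalize_coefs(raw)
print(normalized)
```

After normalization, the coefficients of every subgroup can be read as the probabilities of that subgroup's patients belonging to each component.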

repeat_mixture_coefs(t_stage: str | None = None, subgroup: str | None = None, *, log: bool = False) -> ndarray

Repeat mixture coefficients.

The result will match the number of patients with tumors of t_stage that are in the specified subgroup (or all if it is set to None). The mixture coefficients are returned in log-space if log is set to True.

This method enables easy multiplication of the mixture coefficients with the likelihoods of the patients under the components as in the method patient_mixture_likelihoods().
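
The idea of repeating coefficients can be sketched without the library: every patient gets the coefficient row of their subgroup, so the result lines up element by element with the per-patient component likelihoods. The helper names and the toy numbers below are assumptions for illustration, not lymixture code.

```python
def repeat_coefs(coefs, patient_subgroups):
    """Return one row of coefficients per patient, picked by the patient's subgroup."""
    return [list(coefs[subgroup]) for subgroup in patient_subgroups]


def weight_likelihoods(repeated, likelihoods):
    """Multiply per-patient coefficients with per-patient component likelihoods."""
    return [
        [coef * llh for coef, llh in zip(coef_row, llh_row)]
        for coef_row, llh_row in zip(repeated, likelihoods)
    ]


# Hypothetical data: two subgroups, two components, three patients.
coefs = {"A": [0.25, 0.75], "B": [0.5, 0.5]}
patient_subgroups = ["A", "A", "B"]
likelihoods = [[0.5, 0.25], [0.125, 0.5], [0.25, 0.75]]

repeated = repeat_coefs(coefs, patient_subgroups)
weighted = weight_likelihoods(repeated, likelihoods)
```

In log-space, the multiplication in `weight_likelihoods` would become an addition.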

infer_mixture_coefs(new_resps: ndarray | None = None, *, log: bool = False) -> DataFrame

Infer optimal mixture coefficients based on responsibilities.

This method updates the mixture coefficients by averaging the corresponding responsibilities, which can be provided via new_resps or taken from the model if new_resps is None.

The result is a DataFrame of shape (num_components, num_subgroups), which can be used to update the mixture coefficients via set_mixture_coefs.

If log is True, both the input new_resps and the output coefficients are in log-space for numerical stability.
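
The averaging step described above can be sketched in plain Python. This is not the library's implementation: the function `infer_coefs`, the list-based layout, and the subgroup labels are assumptions, and the sketch works in linear space only.

```python
def infer_coefs(resps, patient_subgroups):
    """Average the responsibilities over the patients of each subgroup.

    `resps` holds one row per patient and one column per component.
    """
    sums, counts = {}, {}
    for row, subgroup in zip(resps, patient_subgroups):
        if subgroup not in sums:
            sums[subgroup] = [0.0] * len(row)
            counts[subgroup] = 0
        sums[subgroup] = [acc + r for acc, r in zip(sums[subgroup], row)]
        counts[subgroup] += 1
    return {
        subgroup: [total / counts[subgroup] for total in totals]
        for subgroup, totals in sums.items()
    }


# Hypothetical responsibilities for three patients and two components.
resps = [[1.0, 0.0], [0.5, 0.5], [0.25, 0.75]]
patient_subgroups = ["A", "A", "B"]
inferred = infer_coefs(resps, patient_subgroups)
```

If every row of responsibilities sums to one, the averaged coefficients of each subgroup sum to one as well, so no further normalization is needed in this sketch.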

get_params(*, as_dict: bool = True, as_flat: bool = True, model_params_only: bool = False) -> Iterable[float] | dict[str, float]

Get the parameters of the mixture model.

This includes both the parameters of the individual components and the mixture coefficients. If a dictionary is returned (i.e. if as_dict is set to True), the components’ parameters are nested under keys that simply enumerate them, while the mixture coefficients are returned under keys of the form <subgroup>from<component>_coef.

The parameters are returned as a dictionary if as_dict is True, and as an iterable of floats otherwise. The argument as_flat determines whether the returned dict is flat or nested.

See also

In the lymph package, the model parameters are also set and retrieved using the set_params() and get_params() methods. We tried to keep the interface as similar as possible.

>>> graph_dict = {
...     ("tumor", "T"): ["II", "III"],
...     ("lnl", "II"): ["III"],
...     ("lnl", "III"): [],
... }
>>> mixture = LymphMixture(
...     model_kwargs={"graph_dict": graph_dict},
...     num_components=2,
... )
>>> mixture.get_params(as_dict=True)     
{'0_TtoII_spread': 0.0,
 '0_TtoIII_spread': 0.0,
 '0_IItoIII_spread': 0.0,
 '1_TtoII_spread': 0.0,
 '1_TtoIII_spread': 0.0,
 '1_IItoIII_spread': 0.0}
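
To show how flat keys such as `0_TtoII_spread` relate to a nested dictionary, here is a small sketch that joins nested keys with underscores. The helper `flatten` and the nested layout are assumptions for illustration; the structure actually returned when as_flat is False may differ.

```python
def flatten(nested, prefix=""):
    """Flatten a nested dict by joining the keys along each path with '_'."""
    flat = {}
    for key, value in nested.items():
        name = f"{prefix}_{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


# Hypothetical nested parameters: component index -> edge -> parameter name.
nested = {
    0: {"TtoII": {"spread": 0.0}, "IItoIII": {"spread": 0.0}},
    1: {"TtoII": {"spread": 0.0}, "IItoIII": {"spread": 0.0}},
}
flat = flatten(nested)
```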

set_params(*args: float, **kwargs: float) -> tuple[float]

Assign new params to the component models.

This includes both the spread parameters for the component’s models (if provided as positional arguments, they are used up first), as well as the mixture coefficients for the subgroups.

See also

In the lymph package, the model parameters are also set and retrieved using the set_params() and get_params() methods. We tried to keep the interface as similar as possible.

Important

After setting all parameters, the mixture coefficients are normalized and may thus not be the same as the ones provided in the arguments.

>>> graph_dict = {
...     ("tumor", "T"): ["II", "III"],
...     ("lnl", "II"): ["III"],
...     ("lnl", "III"): [],
... }
>>> mixture = LymphMixture(
...     model_kwargs={"graph_dict": graph_dict},
...     num_components=2,
... )
>>> mixture.set_params(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7)
(0.7,)
>>> mixture.get_params(as_dict=True)   
{'0_TtoII_spread': 0.1,
 '0_TtoIII_spread': 0.2,
 '0_IItoIII_spread': 0.3,
 '1_TtoII_spread': 0.4,
 '1_TtoIII_spread': 0.5,
 '1_IItoIII_spread': 0.6}
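
The doctest above shows that positional arguments are consumed in order and that any unused values are returned. A minimal sketch of that pattern, using a hypothetical helper `assign_positional` rather than the library's own code, could look like this:

```python
def assign_positional(param_names, *args):
    """Assign positional values to parameter names in order.

    Returns the assignments and a tuple of the values that were left over.
    """
    values = list(args)
    assigned = {}
    for name in param_names:
        if not values:
            break  # fewer values than parameters: the rest stay untouched
        assigned[name] = values.pop(0)
    return assigned, tuple(values)


# Parameter names copied from the doctest output above.
names = [
    "0_TtoII_spread", "0_TtoIII_spread", "0_IItoIII_spread",
    "1_TtoII_spread", "1_TtoIII_spread", "1_IItoIII_spread",
]
assigned, leftover = assign_positional(names, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7)
```

Returning the leftover values lets a caller chain several such assignments, handing the remainder to the next one.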

get_resp_indices(subgroup: str | None = None, t_stage: str | None = None) -> ndarray

Get the indices of the responsibilities.

Returns a boolean array of shape (num_patients,) that is True for each patient that has the given t_stage and belongs to the given subgroup.

Both subgroup and t_stage are optional.
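
The boolean selection described here can be pictured with a plain list of patient records. The record layout and the function `resp_indices` are hypothetical; the sketch only illustrates how a filter that is None matches every patient.

```python
def resp_indices(patients, subgroup=None, t_stage=None):
    """Return one boolean per patient; a filter that is None matches everyone."""
    return [
        (subgroup is None or patient["subgroup"] == subgroup)
        and (t_stage is None or patient["t_stage"] == t_stage)
        for patient in patients
    ]


# Hypothetical patient records.
patients = [
    {"subgroup": "A", "t_stage": "early"},
    {"subgroup": "A", "t_stage": "late"},
    {"subgroup": "B", "t_stage": "early"},
]
mask_all = resp_indices(patients)
mask_a_early = resp_indices(patients, subgroup="A", t_stage="early")
mask_early = resp_indices(patients, t_stage="early")
```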

get_resps(subgroup: str | None = None, component: int | None = None, t_stage: str | None = None, *, norm: bool = True) -> Series | DataFrame

Get the responsibilities of each patient for a component.

One can filter the returned table of responsibilities by the patient’s subgroup and T-stage. If norm is set to True, the responsibilities are normalized to sum to one along the component axis.

set_resps(new_resps: float | ndarray, subgroup: str | None = None, component: int | None = None, t_stage: str | None = None) -> None

Assign new_resps (responsibilities) to the model.

They should have the shape (num_patients, num_components), where num_patients is either the total number of patients in the model or, if subgroup is not None, only the number of patients in that subgroup. Summing them along the last axis should yield a vector of ones.

Note that these responsibilities essentially become the latent variables of the model or the expectation values of the latent variables (depending on whether or not they are “hardened”, see harden_responsibilities()).

Note

As with the set_mixture_coefs() method, the responsibilities are not normalized after setting them.
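
To make the distinction between soft and “hardened” responsibilities concrete, the sketch below assumes that hardening means replacing each patient's row by a one-hot vector at its largest entry, as in hard-assignment EM. That reading and the helper `harden` are assumptions; the behavior of harden_responsibilities() itself is not shown in this section.

```python
def harden(resps):
    """Replace each row by a one-hot vector at its largest entry.

    Ties are resolved in favor of the first component with the maximum value.
    """
    hardened = []
    for row in resps:
        winner = max(range(len(row)), key=row.__getitem__)
        hardened.append([1.0 if k == winner else 0.0 for k in range(len(row))])
    return hardened


# Soft responsibilities for two patients and two components.
soft = [[0.9, 0.1], [0.25, 0.75]]
hard = harden(soft)
```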

load_patient_data(patient_data: DataFrame, split_by: tuple[str, str, str], **kwargs) -> None

Split the patient_data into subgroups and load it into the model.

This amounts to computing the diagnosis matrices for the individual subgroups. The split_by tuple should contain the three-level header of the column in the LyProX-style data by which the patients are split. Any additional keyword arguments are passed to the load_patient_data() method of the underlying lymph models.

property patient_data: DataFrame

Return all patients stored in the individual subgroups.

patient_component_likelihoods(t_stage: str | None = None, component: int | None = None, *, log: bool = True) -> ndarray

Compute the (log-)likelihood of all patients, given the components.

The returned array has shape (num_patients, num_components) and contains the likelihood of each patient with t_stage under each component. If log is set to True, the likelihoods are returned in log-space.

patient_mixture_likelihoods(t_stage: str | None = None, component: int | None = None, *, log: bool = True, marginalize: bool = False) -> ndarray

Compute the (log-)likelihood of all patients under the mixture model.

This is essentially the (log-)likelihood of all patients given the individual components as computed by patient_component_likelihoods(), but weighted by the mixture coefficients. This means that, when marginalize is set to False, the returned array represents the unnormalized expected responsibilities of the patients for the components.

If marginalize is set to True, the likelihoods are summed over the components, effectively marginalizing the components out of the likelihoods and yielding the incomplete data likelihood per patient.
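
In log-space, weighting becomes an addition and the sum over components becomes a log-sum-exp. The following sketch shows both steps with the standard library only; the function names and inputs are assumptions for illustration and do not reproduce the library's code.

```python
import math


def log_sum_exp(values):
    """Compute log(sum(exp(v))) in a numerically stable way."""
    peak = max(values)
    if peak == -math.inf:
        return -math.inf  # every term has zero probability
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def mixture_log_likelihoods(log_coefs, log_llhs, marginalize=False):
    """Add log-coefficients to log-likelihoods, optionally summing out components."""
    weighted = [
        [lc + ll for lc, ll in zip(coef_row, llh_row)]
        for coef_row, llh_row in zip(log_coefs, log_llhs)
    ]
    if marginalize:
        return [log_sum_exp(row) for row in weighted]
    return weighted


# One hypothetical patient, two components.
log_coefs = [[math.log(0.25), math.log(0.75)]]
log_llhs = [[math.log(0.5), math.log(0.25)]]
per_component = mixture_log_likelihoods(log_coefs, log_llhs)
marginal = mixture_log_likelihoods(log_coefs, log_llhs, marginalize=True)
```

Here the marginal value equals log(0.25 * 0.5 + 0.75 * 0.25), the incomplete data log-likelihood of that single patient.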

incomplete_data_likelihood(t_stage: str | None = None, component: int | None = None, *, log: bool = True) -> float

Compute the incomplete data likelihood of the model.

complete_data_likelihood(t_stage: str | None = None, component: int | None = None, *, log: bool = True) -> float

Compute the complete data likelihood of the model.
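
For orientation, the sketch below contrasts the two quantities using their usual textbook definitions for mixture models: the incomplete data log-likelihood sums the log of the component-marginalized likelihood over patients, while the complete data log-likelihood weights each component's log-term by the patient's responsibility. It is assumed, not confirmed by this section, that the library follows these definitions; the function names are hypothetical.

```python
import math


def incomplete_log_likelihood(weighted):
    """Sum over patients of log(sum over components of coef * likelihood)."""
    return sum(math.log(sum(row)) for row in weighted)


def complete_log_likelihood(weighted, resps):
    """Sum over patients and components of resp * log(coef * likelihood)."""
    total = 0.0
    for weighted_row, resp_row in zip(weighted, resps):
        for value, resp in zip(weighted_row, resp_row):
            if resp > 0.0:  # skip zero responsibilities to avoid 0 * log(0)
                total += resp * math.log(value)
    return total


# Hypothetical coefficient-weighted likelihoods and hardened responsibilities.
weighted = [[0.125, 0.1875], [0.125, 0.375]]
resps = [[0.0, 1.0], [0.0, 1.0]]
incomplete = incomplete_log_likelihood(weighted)
complete = complete_log_likelihood(weighted, resps)
```

With responsibilities that sum to one per patient, the complete data value as defined here never exceeds the incomplete data value.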

likelihood(given_params: Iterable[float] | dict[str, float] | None = None, given_resps: ndarray | None = None, *, log: bool = True, use_complete: bool = True) -> float

Compute the (in-)complete data likelihood of the model.

The likelihood is computed for the given_params. If no parameters are given, the currently set parameters of the model are used.

If responsibilities for each patient and component are given via given_resps, they are used to compute the complete data likelihood. Otherwise, the incomplete data likelihood is computed, which marginalizes over the responsibilities.

The likelihood is returned in log-space if log is set to True.

state_dist(t_stage: str = 'early', subgroup: str | None = None) -> ndarray

Compute the distribution over possible states.

Do this for a given t_stage and subgroup. If no subgroup is given, the distribution is computed for all subgroups. The result is a matrix with shape (num_subgroups, num_states).

posterior_state_dist(subgroup: str | None = None, given_params: Iterable[float] | dict[str, float] | None = None, given_state_dist: ndarray | None = None, given_diagnosis: dict[str, dict[str, Literal[False, 0, 'healthy', True, 1, 'involved', 'micro', 'macro', 'notmacro'] | None]] | None = None, t_stage: str | int = 'early', midext: bool | None = None, central: bool | None = None) -> ndarray

Compute the posterior distribution over hidden states given a diagnosis.

The given_diagnosis is a dictionary that holds one diagnosis per modality. For example:

given_diagnosis = {
    "MRI": {"II": True, "III": False, "IV": False},
    "PET": {"II": True, "III": True, "IV": None},
}

The t_stage parameter determines the T-stage for which the posterior is computed.
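
Conceptually, a posterior over hidden states follows from Bayes' rule: the prior state distribution is multiplied by the probability of the observed diagnosis under each state and then renormalized. The sketch below illustrates only that step; the function name, the four-state toy example, and all numbers are assumptions.

```python
def posterior_over_states(prior, diagnosis_given_state):
    """Apply Bayes' rule: posterior is proportional to prior times likelihood."""
    joint = [p * llh for p, llh in zip(prior, diagnosis_given_state)]
    evidence = sum(joint)
    return [value / evidence for value in joint]


# Four hypothetical hidden states with their prior probabilities, and the
# probability of the observed diagnosis under each of them.
prior = [0.5, 0.25, 0.125, 0.125]
diagnosis_given_state = [0.25, 0.5, 0.5, 1.0]
posterior = posterior_over_states(prior, diagnosis_given_state)
```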

risk(subgroup: str, involvement: dict[str, Literal[False, 0, 'healthy', True, 1, 'involved', 'micro', 'macro', 'notmacro'] | None], given_params: Iterable[float] | dict[str, float] | None = None, given_state_dist: ndarray | None = None, given_diagnosis: dict[str, dict[str, Literal[False, 0, 'healthy', True, 1, 'involved', 'micro', 'macro', 'notmacro'] | None]] | None = None, t_stage: str = 'early', midext: bool | None = None) -> float

Compute risk of a certain involvement, using the given_diagnosis.

If an involvement pattern of interest is provided, this method computes the risk of seeing just that pattern for the set of given parameters and a dictionary that holds one diagnosis per modality.

If no involvement is provided, this will simply return the posterior distribution over hidden states, given the diagnosis, as computed by the posterior_state_dist() method. See its documentation for more details about the arguments and the return value.
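
One way to picture the risk computation is as a sum of posterior probabilities over all hidden states that agree with the involvement pattern, where levels set to None are left unconstrained. The sketch below assumes binary states and boolean involvement values only; `pattern_risk` and the toy numbers are hypothetical.

```python
def pattern_risk(posterior, states, lnls, involvement):
    """Sum the posterior over all states that match the involvement pattern.

    A lymph node level whose involvement is None (or missing) matches any state.
    """
    total = 0.0
    for prob, state in zip(posterior, states):
        matches = all(
            involvement.get(lnl) is None
            or bool(state[i]) == bool(involvement[lnl])
            for i, lnl in enumerate(lnls)
        )
        if matches:
            total += prob
    return total


# Hypothetical posterior over the four binary states of the levels II and III.
lnls = ["II", "III"]
states = [(0, 0), (0, 1), (1, 0), (1, 1)]
posterior = [0.5, 0.25, 0.125, 0.125]

risk_ii = pattern_risk(posterior, states, lnls, {"II": True, "III": None})
risk_none = pattern_risk(posterior, states, lnls, {"II": False, "III": False})
```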