gcages.cmip7_scenariomip.pre_processing.reaggregation

Reaggregation of timeseries from raw reporting to sectors needed for gridding

The idea here is that we receive raw data following some variable specification. Based on this, we reaggregate to the variables needed for gridding (see gcages.cmip7_scenariomip.gridding_emissions). To do the reaggregation sensibly, two things must be true:

  1. all the timeseries we require must be there
  2. the data must be internally consistent
    • including consideration of any optional timeseries
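The two conditions above can be sketched with pandas. This is only an illustration, not the package's implementation: the variable and region names, the `(variable, region)` index layout, the single relative tolerance and the helper names are all assumptions, and the real checks may cover more than the regional sum shown here.

```python
import pandas as pd

# Hypothetical names, for illustration only
REQUIRED_VARIABLES = ["Emissions|CO2|Energy", "Emissions|CO2|Industry"]
MODEL_REGIONS = ["model|Asia", "model|Europe"]
WORLD_REGION = "World"


def missing_timeseries(df: pd.DataFrame) -> pd.MultiIndex:
    """Return the required (variable, region) pairs that are absent from df."""
    required = pd.MultiIndex.from_product(
        [REQUIRED_VARIABLES, MODEL_REGIONS], names=["variable", "region"]
    )
    return required.difference(df.index)


def regions_sum_to_world(df: pd.DataFrame, variable: str, rtol: float = 1e-3) -> bool:
    """Check that the model regions sum to the reported world total."""
    regional_sum = df.loc[[(variable, r) for r in MODEL_REGIONS], :].sum()
    world = df.loc[(variable, WORLD_REGION), :]
    return bool(((regional_sum - world).abs() <= rtol * world.abs()).all())


# Toy data: one required timeseries is deliberately left out
index = pd.MultiIndex.from_tuples(
    [
        ("Emissions|CO2|Energy", "model|Asia"),
        ("Emissions|CO2|Energy", "model|Europe"),
        ("Emissions|CO2|Energy", "World"),
        ("Emissions|CO2|Industry", "model|Asia"),
    ],
    names=["variable", "region"],
)
df = pd.DataFrame(
    {2020: [3.0, 2.0, 5.0, 1.0], 2030: [2.5, 1.5, 4.0, 0.8]}, index=index
).sort_index()

missing = missing_timeseries(df)
consistent = regions_sum_to_world(df, "Emissions|CO2|Energy")
```

Here `missing` holds the one absent pair, so a completeness check would fail, while the energy timeseries pass the regional-sum check.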

Reaggregation is a data problem, i.e. the hard part is making sure that the data we receive matches our data model. As a result, the code is tightly coupled to the data we expect, and general solutions are hard to write. This is why the code that supports each data model lives in its own standalone module. We tried writing a general solution from the start and found it extremely difficult, we think because it creates couplings that are incredibly difficult to reason through.

Modules:

basic
    Basic reaggregation
common
    Common components used across different re-aggregation strategies

Classes:

ReaggregatorBasic
    Reaggregator that follows this module's logic
ToCompleteResult
    Result of calling to_complete on a reaggregator

ReaggregatorBasic

Reaggregator that follows this module's logic

Methods:

assert_has_all_required_timeseries
    Assert that the data has all the required timeseries
assert_is_internally_consistent
    Assert that the data is internally consistent
default_tols_internal_consistency
    Get default tolerances for internal consistency checks
get_internal_consistency_checking_index
    Get the index which selects only data relevant for checking internal consistency
to_complete
    Convert the raw data to complete data
to_gridding_sectors
    Re-aggregate data to the sectors used for gridding

Attributes:

internal_consistency_tolerances : Mapping[str, Mapping[str, float]] | Mapping[str, Mapping[str, PINT_SCALAR]]
    Tolerances to apply when checking the internal consistency of the data
model_regions : tuple[str, ...]
    Model regions to use while reaggregating
region_level : str
    Region level in the data index
unit_level : str
    Unit level in the data index
variable_level : str
    Variable level in the data index
world_region : str
    The value used when the data represents the sum over all regions

Source code in src/gcages/cmip7_scenariomip/pre_processing/reaggregation/basic.py
@define
class ReaggregatorBasic:
    """
    Reaggregator that follows this module's logic
    """

    model_regions: tuple[str, ...]
    """Model regions to use while reaggregating"""

    region_level: str = "region"
    """Region level in the data index"""

    unit_level: str = "unit"
    """Unit level in the data index"""

    variable_level: str = "variable"
    """Variable level in the data index"""

    world_region: str = "World"
    """
    The value used when the data represents the sum over all regions

    (Having a value for this is odd,
    there should really just be no region level when data is the sum,
    but this is the data format used so we have to follow this convention.)
    """

    internal_consistency_tolerances: (
        Mapping[str, Mapping[str, float]] | Mapping[str, Mapping[str, PINT_SCALAR]]
    ) = field()
    """
    Tolerances to apply when checking the internal consistency of the data
    """

    @internal_consistency_tolerances.default
    def default_tols_internal_consistency(
        self,
    ) -> Mapping[str, Mapping[str, float]] | Mapping[str, Mapping[str, PINT_SCALAR]]:
        """
        Get default tolerances for internal consistency checks
        """
        return get_default_internal_conistency_checking_tolerances()

    def assert_has_all_required_timeseries(self, indf: pd.DataFrame) -> None:
        """
        Assert that the data has all the required timeseries

        Parameters
        ----------
        indf
            Data to check

        Raises
        ------
        NotCompleteError
            `indf` is not complete
        """
        assert_has_all_required_timeseries(
            indf,
            model_regions=self.model_regions,
            world_region=self.world_region,
            region_level=self.region_level,
            variable_level=self.variable_level,
        )

    def assert_is_internally_consistent(self, indf: pd.DataFrame) -> None:
        """
        Assert that the data is internally consistent

        Parameters
        ----------
        indf
            Data to check

        Raises
        ------
        InternalConsistencyError
            The data is not internally consistent
        """
        assert_is_internally_consistent(
            indf,
            model_regions=self.model_regions,
            tolerances=self.internal_consistency_tolerances,
            world_region=self.world_region,
            region_level=self.region_level,
            unit_level=self.unit_level,
            variable_level=self.variable_level,
        )

    def get_internal_consistency_checking_index(self) -> pd.MultiIndex:
        """
        Get the index which selects only data relevant for checking internal consistency

        Returns
        -------
        :
            Internal consistency checking index
        """
        return get_internal_consistency_checking_index(
            model_regions=self.model_regions,
            world_region=self.world_region,
            region_level=self.region_level,
            variable_level=self.variable_level,
        )

    def to_complete(self, raw: pd.DataFrame) -> ToCompleteResult:
        """
        Convert the raw data to complete data

        Parameters
        ----------
        raw
            Raw data

        Returns
        -------
        :
            To complete result
        """
        return to_complete(
            indf=raw,
            model_regions=self.model_regions,
            unit_level=self.unit_level,
            variable_level=self.variable_level,
            region_level=self.region_level,
            world_region=self.world_region,
        )

    def to_gridding_sectors(self, indf: pd.DataFrame) -> pd.DataFrame:
        """
        Re-aggregate data to the sectors used for gridding

        Parameters
        ----------
        indf
            Data to re-aggregate

        Returns
        -------
        :
            Data re-aggregated to the gridding sectors
        """
        return to_gridding_sectors(
            indf=indf, region_level=self.region_level, world_region=self.world_region
        )
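The methods of this class can be chained into a small driver. The sketch below is duck-typed so it runs without gcages installed: `Reaggregator`, `reaggregate` and `RecordingStub` are hypothetical names introduced here, and the order of the steps (validate, complete, then re-aggregate) is an assumption rather than something this page specifies.

```python
from types import SimpleNamespace
from typing import Any, Protocol


class Reaggregator(Protocol):
    """Structural type listing the methods documented for ReaggregatorBasic."""

    def assert_has_all_required_timeseries(self, indf: Any) -> None: ...
    def assert_is_internally_consistent(self, indf: Any) -> None: ...
    def to_complete(self, raw: Any) -> Any: ...
    def to_gridding_sectors(self, indf: Any) -> Any: ...


def reaggregate(reaggregator: Reaggregator, raw: Any) -> Any:
    """Validate raw data, complete it, then re-aggregate (ordering is assumed)."""
    # Fail early if required timeseries are missing or the data does not add up
    reaggregator.assert_has_all_required_timeseries(raw)
    reaggregator.assert_is_internally_consistent(raw)
    # Fill in whatever was not reported, then collapse to the gridding sectors
    completed = reaggregator.to_complete(raw)
    return reaggregator.to_gridding_sectors(completed.complete)


class RecordingStub:
    """Hypothetical stand-in that only records which methods were called."""

    def __init__(self) -> None:
        self.calls = []

    def assert_has_all_required_timeseries(self, indf: Any) -> None:
        self.calls.append("assert_has_all_required_timeseries")

    def assert_is_internally_consistent(self, indf: Any) -> None:
        self.calls.append("assert_is_internally_consistent")

    def to_complete(self, raw: Any) -> Any:
        self.calls.append("to_complete")
        # Mirrors the two attributes documented for ToCompleteResult
        return SimpleNamespace(complete=raw, assumed_zero=None)

    def to_gridding_sectors(self, indf: Any) -> Any:
        self.calls.append("to_gridding_sectors")
        return indf


stub = RecordingStub()
result = reaggregate(stub, raw={"toy": "data"})
```

With a real `ReaggregatorBasic` instance in place of the stub, `raw` would be a `pd.DataFrame` whose index contains the variable, region and unit levels.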

internal_consistency_tolerances class-attribute instance-attribute

internal_consistency_tolerances: (
    Mapping[str, Mapping[str, float]]
    | Mapping[str, Mapping[str, PINT_SCALAR]]
) = field()

Tolerances to apply when checking the internal consistency of the data
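As a rough illustration of how a mapping of mappings like this could be used, the sketch below looks up per-variable tolerances and applies them with `math.isclose`. The outer keys (variable names), the inner keys (`rtol`, `atol`) and the values are all assumptions made for the example; the source only gives the type.

```python
import math
from collections.abc import Mapping

# Hypothetical tolerances: outer key is a variable, inner keys name the tolerance.
# The key names "rtol" and "atol" are assumed, not taken from the package.
tolerances: Mapping[str, Mapping[str, float]] = {
    "Emissions|CO2": {"rtol": 1e-3, "atol": 1e-6},
    "Emissions|CH4": {"rtol": 1e-2, "atol": 1e-4},
}


def within_tolerance(variable: str, reported: float, derived: float) -> bool:
    """Compare a reported total with a derived sum using that variable's tolerances."""
    tol = tolerances[variable]
    return math.isclose(
        reported, derived, rel_tol=tol["rtol"], abs_tol=tol["atol"]
    )
```

The same 0.5 difference on a total of 100 fails for the tighter CO2 tolerance and passes for the looser CH4 one, which is the reason for keeping tolerances per variable.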

model_regions instance-attribute

model_regions: tuple[str, ...]

Model regions to use while reaggregating

region_level class-attribute instance-attribute

region_level: str = 'region'

Region level in the data index

unit_level class-attribute instance-attribute

unit_level: str = 'unit'

Unit level in the data index

variable_level class-attribute instance-attribute

variable_level: str = 'variable'

Variable level in the data index

world_region class-attribute instance-attribute

world_region: str = 'World'

The value used when the data represents the sum over all regions

(Having a value for this is odd, there should really just be no region level when data is the sum, but this is the data format used so we have to follow this convention.)

assert_has_all_required_timeseries

assert_has_all_required_timeseries(indf: DataFrame) -> None

Assert that the data has all the required timeseries

Parameters:

indf : DataFrame, required
    Data to check

Raises:

NotCompleteError
    indf is not complete

Source code in src/gcages/cmip7_scenariomip/pre_processing/reaggregation/basic.py
def assert_has_all_required_timeseries(self, indf: pd.DataFrame) -> None:
    """
    Assert that the data has all the required timeseries

    Parameters
    ----------
    indf
        Data to check

    Raises
    ------
    NotCompleteError
        `indf` is not complete
    """
    assert_has_all_required_timeseries(
        indf,
        model_regions=self.model_regions,
        world_region=self.world_region,
        region_level=self.region_level,
        variable_level=self.variable_level,
    )

assert_is_internally_consistent

assert_is_internally_consistent(indf: DataFrame) -> None

Assert that the data is internally consistent

Parameters:

indf : DataFrame, required
    Data to check

Raises:

InternalConsistencyError
    The data is not internally consistent

Source code in src/gcages/cmip7_scenariomip/pre_processing/reaggregation/basic.py
def assert_is_internally_consistent(self, indf: pd.DataFrame) -> None:
    """
    Assert that the data is internally consistent

    Parameters
    ----------
    indf
        Data to check

    Raises
    ------
    InternalConsistencyError
        The data is not internally consistent
    """
    assert_is_internally_consistent(
        indf,
        model_regions=self.model_regions,
        tolerances=self.internal_consistency_tolerances,
        world_region=self.world_region,
        region_level=self.region_level,
        unit_level=self.unit_level,
        variable_level=self.variable_level,
    )

default_tols_internal_consistency

default_tols_internal_consistency() -> (
    Mapping[str, Mapping[str, float]]
    | Mapping[str, Mapping[str, PINT_SCALAR]]
)

Get default tolerances for internal consistency checks

Source code in src/gcages/cmip7_scenariomip/pre_processing/reaggregation/basic.py
@internal_consistency_tolerances.default
def default_tols_internal_consistency(
    self,
) -> Mapping[str, Mapping[str, float]] | Mapping[str, Mapping[str, PINT_SCALAR]]:
    """
    Get default tolerances for internal consistency checks
    """
    return get_default_internal_conistency_checking_tolerances()

get_internal_consistency_checking_index

get_internal_consistency_checking_index() -> MultiIndex

Get the index which selects only data relevant for checking internal consistency

Returns:

MultiIndex
    Internal consistency checking index

Source code in src/gcages/cmip7_scenariomip/pre_processing/reaggregation/basic.py
def get_internal_consistency_checking_index(self) -> pd.MultiIndex:
    """
    Get the index which selects only data relevant for checking internal consistency

    Returns
    -------
    :
        Internal consistency checking index
    """
    return get_internal_consistency_checking_index(
        model_regions=self.model_regions,
        world_region=self.world_region,
        region_level=self.region_level,
        variable_level=self.variable_level,
    )
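One way a returned index could be used to select rows is sketched below with plain pandas. The data, the level names and the `select_with_index` helper are hypothetical; the sketch assumes the checking index covers a subset of the data's index levels (at least two of them) and that the remaining levels, such as the unit, should be ignored when matching.

```python
import pandas as pd

# Hypothetical checking index over (variable, region)
checking_index = pd.MultiIndex.from_product(
    [["Emissions|CO2", "Emissions|CO2|Energy"], ["World", "model|Asia"]],
    names=["variable", "region"],
)

# Toy data with an extra "unit" level that the checking index does not have
data_index = pd.MultiIndex.from_tuples(
    [
        ("Emissions|CO2", "World", "Mt CO2/yr"),
        ("Emissions|CO2|Energy", "model|Asia", "Mt CO2/yr"),
        ("Carbon Removal", "World", "Mt CO2/yr"),
    ],
    names=["variable", "region", "unit"],
)
df = pd.DataFrame({2020: [10.0, 4.0, 1.0]}, index=data_index)


def select_with_index(df: pd.DataFrame, idx: pd.MultiIndex) -> pd.DataFrame:
    """Keep the rows of df whose values on idx's levels appear in idx."""
    # Drop the levels idx does not know about, then match on the rest
    other_levels = [name for name in df.index.names if name not in idx.names]
    reduced = df.index.droplevel(other_levels).reorder_levels(idx.names)
    return df[reduced.isin(idx)]


relevant = select_with_index(df, checking_index)
```

The "Carbon Removal" row is dropped because its variable is not part of the checking index; the other two rows are kept with their unit level intact.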

to_complete

to_complete(raw: DataFrame) -> ToCompleteResult

Convert the raw data to complete data

Parameters:

raw : DataFrame, required
    Raw data

Returns:

ToCompleteResult
    To complete result

Source code in src/gcages/cmip7_scenariomip/pre_processing/reaggregation/basic.py
def to_complete(self, raw: pd.DataFrame) -> ToCompleteResult:
    """
    Convert the raw data to complete data

    Parameters
    ----------
    raw
        Raw data

    Returns
    -------
    :
        To complete result
    """
    return to_complete(
        indf=raw,
        model_regions=self.model_regions,
        unit_level=self.unit_level,
        variable_level=self.variable_level,
        region_level=self.region_level,
        world_region=self.world_region,
    )

to_gridding_sectors

to_gridding_sectors(indf: DataFrame) -> DataFrame

Re-aggregate data to the sectors used for gridding

Parameters:

indf : DataFrame, required
    Data to re-aggregate

Returns:

DataFrame
    Data re-aggregated to the gridding sectors

Source code in src/gcages/cmip7_scenariomip/pre_processing/reaggregation/basic.py
def to_gridding_sectors(self, indf: pd.DataFrame) -> pd.DataFrame:
    """
    Re-aggregate data to the sectors used for gridding

    Parameters
    ----------
    indf
        Data to re-aggregate

    Returns
    -------
    :
        Data re-aggregated to the gridding sectors
    """
    return to_gridding_sectors(
        indf=indf, region_level=self.region_level, world_region=self.world_region
    )
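The general idea of re-aggregating to sectors can be sketched as a mapping followed by a grouped sum. The sector mapping, the variable names and the `to_sectors` helper below are hypothetical and much simpler than what the package does; in particular, any special handling of the world region is left out.

```python
import pandas as pd

# Hypothetical mapping from reported variables to gridding sectors
SECTOR_MAP = {
    "Emissions|CO2|Energy|Supply": "Energy Sector",
    "Emissions|CO2|Energy|Demand|Industry": "Industrial Sector",
    "Emissions|CO2|Industrial Processes": "Industrial Sector",
}


def to_sectors(df: pd.DataFrame) -> pd.DataFrame:
    """Sum reported variables into sectors, keeping the region level."""
    # Variables missing from SECTOR_MAP map to NaN and are dropped by groupby
    sector = df.index.get_level_values("variable").map(SECTOR_MAP).rename("sector")
    region = df.index.get_level_values("region")
    return df.groupby([sector, region]).sum()


index = pd.MultiIndex.from_tuples(
    [
        ("Emissions|CO2|Energy|Supply", "World"),
        ("Emissions|CO2|Energy|Demand|Industry", "World"),
        ("Emissions|CO2|Industrial Processes", "World"),
    ],
    names=["variable", "region"],
)
df = pd.DataFrame({2020: [10.0, 3.0, 2.0]}, index=index)

out = to_sectors(df)
```

The two industry-related variables end up in one row of `out`, indexed by sector and region.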

ToCompleteResult

Result of calling to_complete on a reaggregator

Attributes:

assumed_zero : DataFrame | None
    The timeseries that were assumed to be zero to make self.complete
complete : DataFrame
    Complete pd.DataFrame

Source code in src/gcages/cmip7_scenariomip/pre_processing/reaggregation/common.py
@define
class ToCompleteResult:
    """
    Result of calling `to_complete` on a reaggregator
    """

    complete: pd.DataFrame
    """Complete [pd.DataFrame][pandas.DataFrame]"""

    assumed_zero: pd.DataFrame | None
    """
    The timeseries that were assumed to be zero to make `self.complete`

    If `None`, no timeseries were assumed to be zero.
    """

assumed_zero instance-attribute

assumed_zero: DataFrame | None

The timeseries that were assumed to be zero to make self.complete

If None, no timeseries were assumed to be zero.

complete instance-attribute

complete: DataFrame

Complete pd.DataFrame
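The relationship between the two attributes can be sketched as follows. `CompleteSketch` and `fill_missing_with_zero` are hypothetical names, and filling by concatenating zero rows is an assumption about how "assumed to be zero" could work, not a description of the package's code.

```python
from dataclasses import dataclass
from typing import Optional

import pandas as pd


@dataclass
class CompleteSketch:
    """Hypothetical stand-in mirroring the two documented attributes."""

    complete: pd.DataFrame
    assumed_zero: Optional[pd.DataFrame]


def fill_missing_with_zero(
    raw: pd.DataFrame, required: pd.MultiIndex
) -> CompleteSketch:
    """Add all-zero rows for required timeseries that raw does not contain."""
    missing = required.difference(raw.index)
    if missing.empty:
        # Nothing had to be assumed, which is signalled with None
        return CompleteSketch(complete=raw, assumed_zero=None)

    zeros = pd.DataFrame(0.0, index=missing, columns=raw.columns)
    complete = pd.concat([raw, zeros]).sort_index()
    return CompleteSketch(complete=complete, assumed_zero=zeros)


required = pd.MultiIndex.from_tuples(
    [("Emissions|A", "World"), ("Emissions|B", "World")],
    names=["variable", "region"],
)
raw = pd.DataFrame({2020: [1.5]}, index=required[:1])

first = fill_missing_with_zero(raw, required)
second = fill_missing_with_zero(first.complete, required)
```

Keeping the assumed-zero rows separate lets a caller report exactly which timeseries were filled in, while `second.assumed_zero` is `None` because the input was already complete.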