
eii.training

NPP model training and validation utilities.

This module provides functions for training and validating the NPP potential model, including sampling from natural areas and model evaluation.

Requires additional dependencies: `pip install eii[training]`

Note

Most users will not need this module. Use eii.client for retrieving pre-computed EII data, or eii.compute for calculating EII with the existing trained model.

Example

```python
from eii.training import setup_training_grid, train_npp_model

grid_cells = setup_training_grid()
model = train_npp_model(training_data)
```

setup_training_grid(grid_size_deg=TRAINING_GRID_SIZE_DEG)

Create a global grid for spatial cross-validation.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `grid_size_deg` | `int` | Size of grid cells in degrees. | `TRAINING_GRID_SIZE_DEG` |

Returns:

| Type | Description |
| --- | --- |
| `dict` | Dictionary mapping cell names to `ee.Geometry` objects. |

Source code in src/eii/training/sampling.py
def setup_training_grid(grid_size_deg: int = TRAINING_GRID_SIZE_DEG) -> dict:
    """
    Create a global grid for spatial cross-validation.

    Args:
        grid_size_deg: Size of grid cells in degrees.

    Returns:
        Dictionary mapping cell names to ee.Geometry objects.
    """

    lon_min, lon_max = -180, 180
    lat_min, lat_max = -60, 90  # Exclude Antarctica

    grid_cells = {}
    for lon in range(lon_min, lon_max, grid_size_deg):
        for lat in range(lat_min, lat_max, grid_size_deg):
            cell_name = f"grid_{lon}_{lat}"
            cell = ee.Geometry.Rectangle([lon, lat, lon + grid_size_deg, lat + grid_size_deg])
            grid_cells[cell_name] = cell

    return grid_cells
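The grid arithmetic above can be checked client-side without Earth Engine. Below is a minimal mirror of the loop (using an illustrative `grid_size_deg` of 10; the actual default is the package constant `TRAINING_GRID_SIZE_DEG`) that returns plain bounding-box tuples in place of `ee.Geometry` rectangles:

```python
def grid_layout(grid_size_deg: int = 10) -> dict:
    """Client-side sketch of setup_training_grid's cell naming and bounds."""
    lon_min, lon_max = -180, 180
    lat_min, lat_max = -60, 90  # Antarctica excluded, as in the source

    cells = {}
    for lon in range(lon_min, lon_max, grid_size_deg):
        for lat in range(lat_min, lat_max, grid_size_deg):
            # Same [lon, lat, lon + size, lat + size] corners as the
            # ee.Geometry.Rectangle built in setup_training_grid.
            cells[f"grid_{lon}_{lat}"] = (lon, lat, lon + grid_size_deg, lat + grid_size_deg)
    return cells

cells = grid_layout(10)
print(len(cells))                # 36 lon steps x 15 lat steps = 540 cells
print(cells["grid_-180_-60"])    # (-180, -60, -170, -50)
```

Note that the latitude span (150 degrees) divides evenly by 10 here; a `grid_size_deg` that does not divide the span would make the top row of cells overshoot 90 degrees, mirroring the behaviour of the source loop.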

train_npp_model(training_data, predictor_names=None, response_property='longterm_avg_npp_sum', output_asset_path=None, num_trees=RF_NUM_TREES, min_leaf_population=RF_MIN_LEAF_POPULATION, variables_per_split=RF_VARIABLES_PER_SPLIT, bag_fraction=RF_BAG_FRACTION, seed=RF_SEED, export=True)

Train Random Forest model for NPP prediction.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `training_data` | `FeatureCollection` | FeatureCollection with predictor variables and response. | *required* |
| `predictor_names` | `list[str] \| None` | List of predictor property names. If `None`, defaults to `PREDICTOR_VARIABLES`. | `None` |
| `response_property` | `str` | Name of the response variable property. | `'longterm_avg_npp_sum'` |
| `output_asset_path` | `str \| None` | Asset path to export model. If `None`, generates default. | `None` |
| `num_trees` | `int` | Number of trees in the forest. | `RF_NUM_TREES` |
| `min_leaf_population` | `int` | Minimum samples in a leaf. | `RF_MIN_LEAF_POPULATION` |
| `variables_per_split` | `int` | Number of variables to consider per split. | `RF_VARIABLES_PER_SPLIT` |
| `bag_fraction` | `float` | Fraction of data to bag per tree. | `RF_BAG_FRACTION` |
| `seed` | `int` | Random seed. | `RF_SEED` |
| `export` | `bool` | Whether to export the model to an asset. | `True` |

Returns:

| Type | Description |
| --- | --- |
| `tuple[Classifier, Task \| None]` | Tuple of `(trained_model, export_task or None)`. |

Source code in src/eii/training/model.py
def train_npp_model(
    training_data: ee.FeatureCollection,
    predictor_names: list[str] | None = None,
    response_property: str = "longterm_avg_npp_sum",
    output_asset_path: str | None = None,
    num_trees: int = RF_NUM_TREES,
    min_leaf_population: int = RF_MIN_LEAF_POPULATION,
    variables_per_split: int = RF_VARIABLES_PER_SPLIT,
    bag_fraction: float = RF_BAG_FRACTION,
    seed: int = RF_SEED,
    export: bool = True,
) -> tuple[ee.Classifier, ee.batch.Task | None]:
    """
    Train Random Forest model for NPP prediction.

    Args:
        training_data: FeatureCollection with predictor variables and response.
        predictor_names: List of predictor property names. If None, defaults
            to PREDICTOR_VARIABLES.
        response_property: Name of the response variable property.
        output_asset_path: Asset path to export model. If None, generates default.
        num_trees: Number of trees in the forest.
        min_leaf_population: Minimum samples in a leaf.
        variables_per_split: Number of variables to consider per split.
        bag_fraction: Fraction of data to bag per tree.
        seed: Random seed.
        export: Whether to export the model to an asset.

    Returns:
        Tuple of (trained_model, export_task or None).
    """
    import ee

    if predictor_names is None:
        predictor_names = PREDICTOR_VARIABLES

    model = (
        ee.Classifier.smileRandomForest(
            numberOfTrees=num_trees,
            minLeafPopulation=min_leaf_population,
            variablesPerSplit=variables_per_split,
            bagFraction=bag_fraction,
            seed=seed,
        )
        .setOutputMode("REGRESSION")
        .train(
            features=training_data,
            classProperty=response_property,
            inputProperties=predictor_names,
        )
    )

    export_task = None
    if export:
        if output_asset_path is None:
            output_asset_path = f"{MODEL_ASSETS_PATH}/potential_npp_classifier"

        export_task = ee.batch.Export.classifier.toAsset(
            classifier=model,
            description="Export_NPP_Classifier",
            assetId=output_asset_path,
        )
        export_task.start()

    return model, export_task

get_train_test_split(training_data, split_ratio=TRAIN_TEST_SPLIT_RATIO, seed=RF_SEED, cv_grid_size=CV_GRID_SIZE_DEG, cv_buffer_size=CV_BUFFER_DEG)

Perform a spatially stratified train/test split.

Uses a grid (e.g., 2 degrees) to create spatial blocks. Can optionally apply a negative buffer (margin) around each block to ensure physical separation between training and validation sets.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `training_data` | `FeatureCollection` | FeatureCollection. | *required* |
| `split_ratio` | `float` | Fraction for training (default 0.9). | `TRAIN_TEST_SPLIT_RATIO` |
| `seed` | `int` | Random seed for reproducibility. | `RF_SEED` |
| `cv_grid_size` | `int` | Grid size in degrees for cross-validation blocks. | `CV_GRID_SIZE_DEG` |
| `cv_buffer_size` | `float` | Buffer size in degrees to exclude from block edges. `0.0` means no buffer; `0.5` means 0.5 deg excluded from all sides. | `CV_BUFFER_DEG` |

Returns:

| Type | Description |
| --- | --- |
| `tuple[FeatureCollection, FeatureCollection]` | Tuple of `(training_set, validation_set)`. |

Source code in src/eii/training/model.py
def get_train_test_split(
    training_data: ee.FeatureCollection,
    split_ratio: float = TRAIN_TEST_SPLIT_RATIO,
    seed: int = RF_SEED,
    cv_grid_size: int = CV_GRID_SIZE_DEG,
    cv_buffer_size: float = CV_BUFFER_DEG,
) -> tuple[ee.FeatureCollection, ee.FeatureCollection]:
    """
    Perform spatially stratified train/test split.

    Uses a grid (e.g., 2 degrees) to create spatial blocks.
    Can optionally apply a negative buffer (margin) around each block
    to ensure physical separation between training and validation sets.

    Args:
        training_data: FeatureCollection.
        split_ratio: Fraction for training (default 0.9).
        seed: Random seed for reproducibility.
        cv_grid_size: Grid size in degrees for cross-validation blocks.
        cv_buffer_size: Buffer size in degrees to exclude from block edges.
                        0.0 means no buffer. 0.5 means 0.5 deg excluded from all sides.

    Returns:
        Tuple of (training_set, validation_set).
    """

    def add_block_info(feature):
        coords = feature.geometry().coordinates()
        lon = coords.get(0)
        lat = coords.get(1)

        # Shift to positive range for easier modulo/floor
        lon_shifted = ee.Number(lon).add(180)
        lat_shifted = ee.Number(lat).add(90)

        x = lon_shifted.divide(cv_grid_size).floor()
        y = lat_shifted.divide(cv_grid_size).floor()

        # relative position within the block [0, cv_grid_size)
        x_rel = lon_shifted.mod(cv_grid_size)
        y_rel = lat_shifted.mod(cv_grid_size)

        block_id = y.multiply(1000).add(x).toInt()

        # Keep if buffer <= pos <= size - buffer
        inner_cond = (
            x_rel.gte(cv_buffer_size)
            .And(x_rel.lte(ee.Number(cv_grid_size).subtract(cv_buffer_size)))
            .And(y_rel.gte(cv_buffer_size))
            .And(y_rel.lte(ee.Number(cv_grid_size).subtract(cv_buffer_size)))
        )

        return feature.set({"cv_block_id": block_id, "cv_keep": inner_cond})

    data_with_info = training_data.map(add_block_info)

    if cv_buffer_size > 0:
        data_to_split = data_with_info.filter(ee.Filter.eq("cv_keep", 1))
    else:
        data_to_split = data_with_info

    distinct_blocks = data_to_split.aggregate_array("cv_block_id").distinct().getInfo()

    random.seed(seed)
    random.shuffle(distinct_blocks)

    split_index = int(len(distinct_blocks) * split_ratio)
    training_blocks = distinct_blocks[:split_index]
    validation_blocks = distinct_blocks[split_index:]

    if len(distinct_blocks) > 5000:
        # Fallback to server-side hashing if too many blocks
        print(
            f"Warning: Large number of CV blocks ({len(distinct_blocks)}). Using hashed block split."
        )

        def add_hashed_split(f):
            bid = ee.Number(f.get("cv_block_id"))
            h = bid.multiply(12345).add(seed).mod(10000).divide(10000)
            return f.set("cv_random", h)

        data_hashed = data_to_split.map(add_hashed_split)

        training_set = data_hashed.filter(ee.Filter.lt("cv_random", split_ratio))
        validation_set = data_hashed.filter(ee.Filter.gte("cv_random", split_ratio))

    else:
        training_set = data_to_split.filter(ee.Filter.inList("cv_block_id", training_blocks))
        validation_set = data_to_split.filter(ee.Filter.inList("cv_block_id", validation_blocks))

    return training_set, validation_set
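The block-id and buffer arithmetic inside `add_block_info` can be exercised with plain Python floats, since the server-side `ee.Number` calls mirror ordinary `floor` and modulo operations. A minimal sketch, using 2 degrees and 0.5 degrees as illustrative stand-ins for the `CV_GRID_SIZE_DEG` and `CV_BUFFER_DEG` constants:

```python
import math

def block_info(lon: float, lat: float, grid: float = 2.0, buffer: float = 0.5):
    """Plain-Python mirror of add_block_info: block id plus buffer test."""
    # Shift to positive range for easier modulo/floor, as in the source.
    lon_s, lat_s = lon + 180.0, lat + 90.0

    x = math.floor(lon_s / grid)
    y = math.floor(lat_s / grid)

    # Relative position within the block, in [0, grid).
    x_rel, y_rel = lon_s % grid, lat_s % grid

    block_id = int(y * 1000 + x)
    # Keep only points at least `buffer` degrees from every block edge.
    keep = (buffer <= x_rel <= grid - buffer) and (buffer <= y_rel <= grid - buffer)
    return block_id, keep

print(block_info(11.0, 45.0))   # (67095, True)  - centre of its block
print(block_info(10.1, 45.0))   # (67095, False) - 0.1 deg from the western edge
```

Both points land in the same 2-degree block (same `block_id`), but only the interior point survives the buffer filter; this is the separation margin the docstring describes.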

validate_model(validation_set, model_asset_path, response_vars=['current_npp'], prediction_names=['classification'])

Validate a trained model using a FeatureCollection of validation points.

This method applies the classifier directly to the features, which is much faster than image-based validation because it uses the predictor values already stored in the table.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `validation_set` | `FeatureCollection` | FeatureCollection containing predictors and actual response. | *required* |
| `model_asset_path` | `str` | Path to the trained GEE classifier. | *required* |
| `response_vars` | `list[str]` | List of property names for actual values (e.g. `['current_npp', 'npp_std']`). | `['current_npp']` |
| `prediction_names` | `list[str]` | List of property names for predicted values (matching model output). | `['classification']` |

Returns:

| Type | Description |
| --- | --- |
| `DataFrame` | DataFrame containing metrics for each response variable. |

Source code in src/eii/training/validation.py
def validate_model(
    validation_set: ee.FeatureCollection,
    model_asset_path: str,
    response_vars: list[str] = ["current_npp"],
    prediction_names: list[str] = ["classification"],
) -> pd.DataFrame:
    """
    Validate a trained model using a FeatureCollection of validation points.

    This method applies the classifier directly to the features, which is
    much faster than image-based validation because it uses the predictor
    values already stored in the table.

    Args:
        validation_set: FeatureCollection containing predictors and actual response.
        model_asset_path: Path to the trained GEE classifier.
        response_vars: List of property names for actual values (e.g. ['current_npp', 'npp_std'])
        prediction_names: List of property names for predicted values (matching model output)

    Returns:
        DataFrame containing metrics for each response variable.
    """
    print(f"Loading model from {model_asset_path}...")
    model = ee.Classifier.load(model_asset_path)

    predictions = validation_set.classify(model)

    return calculate_metrics(predictions, response_vars, prediction_names)
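The metrics themselves come from `calculate_metrics`, whose implementation is not shown here. Purely as an illustration of the kind of regression metrics such a function typically reports (not the actual `eii.training.validation` code), R² and RMSE can be computed from paired actual/predicted values pulled client-side:

```python
import math

def regression_metrics(actual: list[float], predicted: list[float]) -> dict:
    """Illustrative R^2 and RMSE for paired observations (hypothetical
    helper; the real calculate_metrics may report different metrics)."""
    n = len(actual)
    mean_a = sum(actual) / n
    # Sum of squared residuals and total sum of squares.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return {
        "r2": 1 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
    }

m = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
print(round(m["r2"], 3), round(m["rmse"], 3))  # 0.98 0.158
```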