support runoff normalization for partial years with annual normalization data#434
support runoff normalization for partial years with annual normalization data#434fneum wants to merge 2 commits into
Conversation
There was a problem hiding this comment.
Pull Request Overview
This PR enhances the runoff normalization function to support partial years by applying proportional scaling based on available data, while issuing a warning about the strong assumption of even runoff distribution.
- Improved handling of date ranges for runoff and normalization data
- Added partial year normalization logic with proportional scaling and a warning message
- Updated reindexing to use a dynamic coordinate based on the altered dimension
Files not reviewed (1)
- RELEASE_NOTES.rst: Language not supported
Comments suppressed due to low confidence (3)
atlite/convert.py:1026
- [nitpick] Ensure that reindexing using the dynamic dimension (dim) aligns with the intended coordinate, especially if the original 'countries' coordinate was explicitly required in other parts of the code.
).reindex({dim: result.coords[dim]})
atlite/convert.py:1010
- [nitpick] Verify that the proportional scaling logic correctly handles all edge cases, such as when the overlap period almost covers a full year, to ensure the computed fraction accurately reflects the intended partial data scaling.
fraction = (end - start).total_seconds() / ((year_end - year_start).total_seconds())
atlite/convert.py:965
- [nitpick] Consider renaming 'result_time' to a more descriptive name like 'time_index' to more clearly indicate that this variable represents a DatetimeIndex.
result_time = pd.DatetimeIndex(result.coords["time"].values)
brynpickering
left a comment
There was a problem hiding this comment.
I think this could be considerably simplified if we assume normalize_using_yearly is always annual data. It's unclear if this is the case since the name suggests so, but the possibility of a datetime index suggests it could e.g. be monthly data.
If definitely annual, you could do something like:
result_time = result.coords["time"].to_series()
annual_grouper = result_time.groupby(result_timeindex.year)
result_annual_secs = annual_grouper.apply(lambda x: (x.iloc[-1] - x.iloc[0]).total_seconds())
full_annual_secs = annual_grouper.apply(lambda x: (pd.Timestamp(str(x.name + 1)) - pd.Timestamp(str(x.name))).total_seconds())
fractions = result_annual_secs / full_annual_secs
if (sub_annual := fractions < 0.99).any(): # 99% to handle cases where the result data has a coarse frequency and therefore `result_annual_secs < full_annual_secs` even when it covers the full year
logger.warning(...)
overlap = normalize_using_yearly.index.intersection(fractions.index) # assuming you've previously normalised the index to integer years
if not overlap:
raise ValueError("no overlap ...")
overlap_slice = slice(overlap.min(), overlap.max()) # for use later
norm_data = normalize_using_yearly.mul(fractions, fill_value=0)Even if not definitely annual, the overall amount of code in this conditional can be significantly reduced to make it easier to maintain (I just haven't considered exactly what would be needed)
| cutout, | ||
| smooth=None, | ||
| lower_threshold_quantile=None, | ||
| normalize_using_yearly=None, |
There was a problem hiding this comment.
normalize_using_yearly is super opaque as an argument. Can you type hint it and add a docstring? After more time than should be needed, I now understand it to be a pd.Series with a date range on the index and floats as values, but that should not have required several minutes of code inspection to work out 😅
|
|
||
| if normalize_using_yearly is not None: | ||
| normalize_using_yearly_i = normalize_using_yearly.index | ||
| if isinstance(normalize_using_yearly_i, pd.DatetimeIndex): |
There was a problem hiding this comment.
You're doing this check multiple times. I get the impression that it's because you want to leave open the option for normalize_using_yearly to have an arbitrary datetime index (e.g. monthly values)? My preference would be to accept only yearly values and then work on that basis from the start. This would be easiest done by formatting normalize_using_yearly to have an integer index (as datetimeindex will set the timestamp to <year>-01-01 00:00 which could easily lead to a mistake when doing slicing.
| ) | ||
| assert len(years), "Need at least a full year of data (more is better)" | ||
| years_overlap = slice(str(min(years)), str(max(years))) | ||
| result_time = pd.DatetimeIndex(result.coords["time"].values) |
There was a problem hiding this comment.
Won't "time" always be a datetime?
| result_time = pd.DatetimeIndex(result.coords["time"].values) | |
| result_time = result.coords["time"].to_index() |
| ) | ||
|
|
||
| years_overlap = slice(str(overlap_start.date()), str(overlap_end.date())) | ||
| dim = result.dims[1 - result.get_axis_num("time")] |
There was a problem hiding this comment.
Why expect only ever one dim? To be agnostic of this, you could do:
| dim = result.dims[1 - result.get_axis_num("time")] | |
| non_time_dims = list(set(result) - {"time"}) |
and then assume a list later on as well (so [dim] -> non_time_dims and reindex({dim: result.coords[dim]}) -> reindex({dim: result.coords[dim] for dim in dims}) or reindex_like(result.coords[dims]) (I think))
| result *= ( | ||
| xr.DataArray(normalize_using_yearly.loc[years_overlap].sum(), dims=[dim]) | ||
| xr.DataArray(norm_data, dims=[dim]) | ||
| / result.sel(time=years_overlap).sum("time") | ||
| ).reindex(countries=result.coords["countries"]) | ||
| ).reindex({dim: result.coords[dim]}) |
There was a problem hiding this comment.
Pretty sure this could just be result *= norm_data / result.sel(time=years_overlap).sum("time") and you wouldn't have to also do the reindexing etc.
Changes proposed in this Pull Request
It makes
cutout.runoff(normalize_using_yearly=...)more lenient by supporting proportional scaling of partially represented years in thecutout, assuming even distribution of runoff throughout the year. In this case, a warning is printed due to the strong assumption.Relevant for PyPSA/pypsa-eur#1613 (e.g. July-June meteorological years, in case the original cutout does not span Jan-Dec for each year included)
This also means we can include hydro in the CI test of PyPSA-Eur, which uses only a week of weather data.
Checklist
doc.[] Unit tests for new features were added (if applicable).[] Newly introduced dependencies are added toenvironment.yaml,environment_docs.yamlandsetup.py(if applicable).doc/release_notes.rstof the upcoming release is included.