When processing time series in pandas, I found it quite hard to find local minima and maxima within a DataFrame. Even when knowing the correct function from scipy, taking correct results is still hard. Doing it effeciently is even harder. I found a partially correct answer on stackoverflow and expand it here to provide the correct answer with examples.
First we need data to work on. I am importing a historical set of ethereum transactions. To make things simple, I resample the DataFrame to daily set and leave only price column. Notice that you can parse dates on the fly when parsing the CSV, even with custom callback function. The next plot presents the data loaded.
import pandas as pd import matplotlib.pyplot as plt # This makes plots prettier import seaborn; seaborn.set() %matplotlib inline import datetime def dateparse (time_in_secs): return datetime.datetime.fromtimestamp(float(time_in_secs)) df = pd.read_csv('../csvs/ethusd.csv', index_col='date', parse_dates=True, date_parser=dateparse) df = df.resample('D').mean() df = df[['price']]
price date 2019-01-17 122.002639 2019-01-18 120.840360 2019-01-19 123.167901 2019-01-20 119.378474 2019-01-21 116.331089
Finding indexes for extreams with scipy and argrelextrema
Scipy provides a argrelextreama function that does a LOT of heavy lifting in this problem.
The actual implementation can be found here.
The result is an numpy array of indexes that are the peaks.
So in essence,
ilocs of the DataFrame.
If you are fuzzy on what iloc means it is a Purely integer-location based indexing for selection by position.
In order to get prices that are the peaks you can use
argrelextreama means a range from both sides.
We have daily data, so
+-3days - weekly peaks.
from scipy.signal import argrelextrema import numpy as np # first argument can a multidimentional np array as # data : ndarray # Array in which to find the relative extrema. # This is reflected in what gets returned: #Returns # ------- # extrema : ndarray # Boolean array of the same shape # That is why at the end of the line we just take the first element # order: How many points on each side to use for the comparison to consider ilocs_min = argrelextrema(df.price.values, np.less_equal, order=3) ilocs_max = argrelextrema(df.price.values, np.greater_equal, order=3) df.price.plot(figsize=(20,8), alpha=.3) # filter prices that are peaks and plot them differently to be visable on the plot df.iloc[ilocs_max].price.plot(style='.', lw=10, color='red', marker="v"); df.iloc[ilocs_min].price.plot(style='.', lw=10, color='green', marker="^");
How to update DataFrame with found peaks
Now that we know how to get the peaks in the time series, it is time to learn how to include the results in the DataFrame.
The naive approach would be to use
iloc and combine with filtering.
It would be not only very slow (for bigger datasets) but also would not work.
Snippet bellow shows that approach.
Notice that, you get a warning, but no errors nor exceptions.
# http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html df['weekly_max'] = False df['weekly_min'] = False df.iloc[ilocs_min]['weekly_min'] = True df.iloc[ilocs_max]['weekly_max'] = True # this will return empty DataFrame, which is not what we wanted. print(df[df['weekly_min'] == True])
Empty DataFrame Columns: [price, weekly_max, weekly_min] Index:  /Users/ed/.local/share/virtualenvs/data-saE_DZ1J/lib/python3.6/site-packages/ipykernel_launcher.py:4: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy after removing the cwd from sys.path. /Users/ed/.local/share/virtualenvs/data-saE_DZ1J/lib/python3.6/site-packages/ipykernel_launcher.py:5: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy """
The warning is quite verbose.
The key here is: A value is trying to be set on a copy of a slice from a DataFrame..
Good explanation of this can be found on stackoverflow and pandas documentation.
Warning suggests using .loc[row_indexer,col_indexer] = value, which is what is shown in the snippet bellow.
row_indexer means an index. To transform list of
ilocs to the index, use:
df['weekly_max'] = False df['weekly_min'] = False df.loc[df.iloc[ilocs_min].index, 'weekly_min'] = True df.loc[df.iloc[ilocs_max].index, 'weekly_max'] = True # display prices that are minimums and maximums print(df[df['weekly_min'] | df['weekly_max']]) df.price.plot(figsize=(20,8), alpha=.3) df[df['weekly_max']].price.plot(style='.', lw=10, color='red', marker="v"); df[df['weekly_min']].price.plot(style='.', lw=10, color='green', marker="^");
price weekly_max weekly_min date 2019-01-19 123.167901 True False 2019-01-29 104.371118 False True 2019-02-03 107.372949 True False 2019-02-06 102.280320 False True 2019-02-19 145.809688 True False 2019-02-23 153.401689 True False 2019-03-04 126.320644 False True 2019-03-07 137.093550 True False 2019-03-14 131.361888 False True 2019-03-16 140.445883 True False 2019-03-26 132.669041 False True 2019-04-03 170.548572 True False 2019-04-08 179.932684 True False 2019-04-16 163.505072 False True 2019-04-20 174.206577 True False 2019-04-26 151.879252 False True 2019-05-07 174.703111 True False 2019-05-16 260.186439 True False 2019-05-21 254.475075 True False 2019-05-23 239.937994 False True 2019-05-30 271.248413 True False 2019-05-31 255.605779 False True
Plot generated from updated DataFrame is the same as the one created by
df.iloc[ilocs_max].price.plot(). This is a good confirmation that DataFrame has been updated properly.
I hope it helped you and saved you some time. Cheers!