A sudden spike or drop in a metric's value is anomalous behaviour, and both cases need attention. If we had information about the anomalous behaviour before modelling, anomaly detection could be solved with a supervised learning algorithm, but without that feedback it is hard to identify these points up front. We can therefore model it as an unsupervised problem, using algorithms such as Isolation Forest, support vector machines, and LSTMs. Below, an Isolation Forest is used to identify the anomalous points.
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load in
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory
import warnings
warnings.filterwarnings('ignore')
import os
print(os.listdir("../input"))
# Any results you write to the current directory are saved as output.
The data here is one use case (e.g. revenue, traffic, etc.) with 12 metrics per day. We first have to determine whether anomalies exist at the use-case level; then, to make the results more actionable, we drill down into the individual metrics and identify the anomalies in each of them.
df=pd.read_csv("../input/metric_data.csv")
df.head()
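The cell that actually fits the model is missing from this excerpt: `clf` and `to_model_columns` are used below without being defined. A minimal sketch of what that cell presumably looked like, assuming `to_model_columns` selects the metric columns and using the 12% contamination discussed further down (synthetic data stands in for `metric_data.csv`; column names here are placeholders):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Hypothetical reconstruction of the missing modelling cell.
# Synthetic stand-in for metric_data.csv: 121 daily rows, 3 example metrics
# (the real data has 12 metrics plus a load_date column).
rng = np.random.RandomState(42)
metrics_df = pd.DataFrame(rng.normal(size=(121, 3)),
                          columns=['metric_1', 'metric_2', 'metric_3'])
to_model_columns = metrics_df.columns[0:3]

# contamination=0.12: roughly 12% of points get flagged as -1
clf = IsolationForest(n_estimators=100, max_samples='auto',
                      contamination=0.12, random_state=42)
clf.fit(metrics_df[to_model_columns])
```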
pred = clf.predict(metrics_df[to_model_columns])
metrics_df['anomaly'] = pred
outliers = metrics_df.loc[metrics_df['anomaly'] == -1]
outlier_index = list(outliers.index)
# Count normal vs anomalous points: points classified as -1 are anomalous
print(metrics_df['anomaly'].value_counts())
1 109
-1 12
Name: anomaly, dtype: int64
Now the Isolation Forest, fitted on the 12 metrics, has classified each point as anomalous or normal. Let's visualize the results and check whether the classification makes sense.
Standardize the metrics and fit a PCA to reduce the dimensionality, then plot the points in 3D with the anomalies highlighted.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from mpl_toolkits.mplot3d import Axes3D
pca = PCA(n_components=3) # Reduce to k=3 dimensions
scaler = StandardScaler()
# Standardize the metrics (zero mean, unit variance) before PCA
X = scaler.fit_transform(metrics_df[to_model_columns])
X_reduce = pca.fit_transform(X)
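The plotting cell itself is not in this excerpt. A sketch of how the PCA-reduced points could be drawn in 3D, with anomalies in red over normal points in green; it assumes `X_reduce` and `outlier_index` from the cells above, with stand-ins generated here so the snippet runs on its own:

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe backend; not needed inside a notebook
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401  (registers the 3d projection)
import numpy as np

# Stand-ins for X_reduce and outlier_index computed above
rng = np.random.RandomState(0)
X_reduce = rng.normal(size=(121, 3))
outlier_index = list(range(12))

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.set_zlabel("x_composite_3")
# All points in green, then overlay the anomalies in red
ax.scatter(X_reduce[:, 0], X_reduce[:, 1], X_reduce[:, 2],
           s=4, lw=1, label="inliers", c="green")
ax.scatter(X_reduce[outlier_index, 0],
           X_reduce[outlier_index, 1],
           X_reduce[outlier_index, 2],
           lw=2, s=60, marker="x", c="red", label="outliers")
ax.legend()
plt.show()
```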
So the plot gives us a clear picture that the algorithm has correctly classified the anomalous points in the use case.
Anomalies are highlighted with red edges; normal points are shown as green dots.
The contamination parameter plays a big role here. The idea is to capture all the anomalous points in the system, so it is better to flag a few points that may be normal as anomalies (false positives) than to miss real anomalies (false negatives). (I therefore set contamination to 12%; the right value depends on the use case.)
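To see how contamination drives this trade-off, a small illustration on synthetic data (not the notebook's dataset): the fraction of points flagged as -1 tracks the contamination value, so raising it trades more false positives for fewer missed anomalies.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(1)
X = rng.normal(size=(200, 2))  # synthetic stand-in data

for contamination in (0.05, 0.12, 0.25):
    labels = IsolationForest(contamination=contamination,
                             random_state=1).fit_predict(X)
    # the fraction flagged as anomalous (-1) roughly equals contamination
    print(f"contamination={contamination}: flagged {(labels == -1).mean():.2%}")
```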
# Pin plotly to 2.7.0: recent versions reject this color property in the table fill, which would need layout changes
!pip install plotly==2.7.0
We have now found anomalous behaviour at the use-case level. But to act on an anomaly, it is important to identify and report which individual metrics are anomalous.
Business users can act on an anomaly when they can see it visually (the sudden drop or spike the algorithm identified), so building a good visual is just as important in this process.
The function below plots the actual time series with the anomalous points highlighted on it, plus a table that shows the actual values, the change from the previous point, and conditional formatting based on the anomaly severity.
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.graph_objs as go
init_notebook_mode(connected=True)

def plot_anomaly(df, metric_name):
    df.load_date = pd.to_datetime(df['load_date'].astype(str), format="%Y%m%d")
    dates = df.load_date
    # Identify the anomaly points and create an array of their values for the plot
    bool_array = (abs(df['anomaly']) > 0)
    actuals = df["actuals"][-len(bool_array):]
    anomaly_points = bool_array * actuals
    anomaly_points[anomaly_points == 0] = np.nan
    # A dictionary for the conditionally formatted table, keyed by anomaly class
    color_map = {0: "rgba(228, 222, 249, 0.65)", 1: "yellow", 2: "red"}
    # Table with date, actual value, and % change from the previous point
    table = go.Table(
        domain=dict(x=[0, 1], y=[0, 0.3]),
        columnwidth=[1, 2],
        header=dict(height=20,
                    values=[['<b>Date</b>'], ['<b>Actual Values </b>'], ['<b>% Change </b>']],
                    font=dict(color=['rgb(45, 45, 45)'] * 5, size=14),
                    fill=dict(color='#d562be')),
        cells=dict(values=[df.round(3)[k].tolist()
                           for k in ['load_date', 'actuals', 'percentage_change']],
                   line=dict(color='#506784'),
                   align=['center'] * 5,
                   font=dict(color=['rgb(40, 40, 40)'] * 5, size=12),
                   suffix=[None, '', '%'],
                   height=27,
                   # Color each row by its anomaly class via the dictionary above
                   fill=dict(color=[df['anomaly_class'].map(color_map)])))
    # Plot the actual values as a blue line
    Actuals = go.Scatter(name='Actuals',
                         x=dates, y=df['actuals'],
                         xaxis='x1', yaxis='y1',
                         mode='lines',
                         marker=dict(size=12, line=dict(width=1), color="blue"))
    # Overlay the anomalous points in red
    anomalies_map = go.Scatter(name="Anomaly",
                               x=dates, y=anomaly_points,
                               xaxis='x1', yaxis='y1',
                               mode='markers',
                               marker=dict(color="red", size=11,
                                           line=dict(color="red", width=2)))
    layout = go.Layout(title=metric_name, showlegend=True,
                       xaxis1=dict(domain=[0, 1], anchor='y1'),
                       yaxis1=dict(domain=[0.4, 1], anchor='x1'))
    fig = go.Figure(data=[table, Actuals, anomalies_map], layout=layout)
    iplot(fig)
A helper function to compute the percentage change and classify anomalies by severity.
The prediction function classifies the data as anomalous based on the result of the decision function. If the business needs to find the next points that could become impactful anomalies, these scores can be used to identify them.
Points in the bottom 12 percentiles of the decision-function score are identified as high-severity anomalies; points between the 12th and 24th percentiles are classified as low-severity anomalies.
def classify_anomalies(df, metric_name):
    df['metric_name'] = metric_name
    df = df.sort_values(by='load_date', ascending=False)
    # Shift actuals by one timestamp to find the percentage change
    # between the current and the previous data point
    df['shift'] = df['actuals'].shift(-1)
    df['percentage_change'] = ((df['actuals'] - df['shift']) / df['actuals']) * 100
    # Categorise anomalies: 0 - no anomaly, 1 - low severity, 2 - high severity
    df.loc[df['anomaly'] == 1, 'anomaly'] = 0
    df.loc[df['anomaly'] == -1, 'anomaly'] = 2
    df['anomaly_class'] = df['anomaly']
    max_anomaly_score = df['score'].loc[df['anomaly_class'] == 2].max()
    medium_percentile = df['score'].quantile(0.24)
    df.loc[(df['score'] > max_anomaly_score) &
           (df['score'] <= medium_percentile), 'anomaly_class'] = 1
    return df
Identify the anomalies for each individual metric and plot the results. X axis - date, Y axis - actual values and anomaly points.
The actual values of the metric are shown as a blue line and the anomalous points are highlighted as red dots. In the table, a red background marks a high-severity anomaly and yellow marks a low-severity one.
import warnings
warnings.filterwarnings('ignore')

for i in range(1, len(metrics_df.columns) - 1):
    clf.fit(metrics_df.iloc[:, i:i + 1])
    pred = clf.predict(metrics_df.iloc[:, i:i + 1])
    test_df = pd.DataFrame()
    test_df['load_date'] = metrics_df['load_date']
    # Use the decision function score to classify anomalies by severity
    test_df['score'] = clf.decision_function(metrics_df.iloc[:, i:i + 1])
    test_df['actuals'] = metrics_df.iloc[:, i:i + 1]
    test_df['anomaly'] = pred
    # Keep the outlier indexes to compare metric anomalies with the use-case anomalies if required
    outliers = test_df.loc[test_df['anomaly'] == -1]
    outlier_index = list(outliers.index)
    test_df = classify_anomalies(test_df, metrics_df.columns[i])
    plot_anomaly(test_df, metrics_df.columns[i])