| Title: | Concept Drift Detection Methods for Stream Data |
|---|---|
| Description: | A system for detecting concept drift in streaming datasets. It offers a suite of statistical methods for detecting concept drift, including methods for monitoring changes in data distributions over time. The package supports several tests, such as the Drift Detection Method (DDM), Early Drift Detection Method (EDDM), Hoeffding Drift Detection Methods (HDDM_A, HDDM_W), Kolmogorov-Smirnov test-based Windowing (KSWIN), Adaptive WINdowing (ADWIN) and Page-Hinkley (PH) tests. The methods implemented in this package are based on established research and have been demonstrated to be effective in real-time data analysis. For more details on the methods, see the following sources: Kobylińska et al. (2023) <doi:10.48550/arXiv.2308.11446>, S. Kullback & R.A. Leibler (1951) <doi:10.1214/aoms/1177729694>, Gama et al. (2004) <doi:10.1007/978-3-540-28645-5_29>, Baena-Garcia et al. (2006) <https://www.researchgate.net/publication/245999704_Early_Drift_Detection_Method>, Frías-Blanco et al. (2014) <https://ieeexplore.ieee.org/document/6871418>, Bifet and Gavalda (2007) <doi:10.1137/1.9781611972771>, Raab et al. (2020) <doi:10.1016/j.neucom.2019.11.111>, Page (1954) <doi:10.1093/biomet/41.1-2.100>, Montiel et al. (2018) <https://jmlr.org/papers/volume19/18-251/18-251.pdf>. |
| Authors: | Ugur Dar [aut, cre] (ORCID: <https://orcid.org/0009-0005-8076-2199>), Mustafa Cavus [aut] (ORCID: <https://orcid.org/0000-0002-6172-5449>) |
| Maintainer: | Ugur Dar <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 1.1.0 |
| Built: | 2026-04-22 21:40:42 UTC |
| Source: | https://github.com/ugurdar/datadriftr |
ADWIN (ADaptive WINdowing) is a drift detection method that maintains a variable-length window of recent data and detects concept drift with mathematical guarantees. It is based on the algorithm of Bifet and Gavalda (2007).
The algorithm maintains a sliding window W of the most recent elements. It compares the means of two sub-windows (W0 and W1) and signals drift when they differ by more than a threshold derived from the Hoeffding bound. Buckets are compressed to keep memory use low.
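The sub-window comparison described above can be sketched in base R. This is a simplified illustration rather than the package's implementation: it stores the whole window instead of compressed buckets, assumes values lie in [0, 1], and uses the Hoeffding-style cut threshold from Bifet and Gavalda (2007). The function names `adwin_cut` and `window_has_drift` are hypothetical.

```r
# Hoeffding-style cut test between an older sub-window w0 and a newer one w1.
# Assumes values are bounded in [0, 1]; no bucket compression is used here.
adwin_cut <- function(w0, w1, delta = 0.002) {
  n0 <- length(w0)
  n1 <- length(w1)
  m <- 1 / (1 / n0 + 1 / n1)             # harmonic-mean term of the two sizes
  delta_prime <- delta / (n0 + n1)       # confidence split over the window
  eps_cut <- sqrt(log(4 / delta_prime) / (2 * m))
  abs(mean(w0) - mean(w1)) > eps_cut
}

# Try every admissible split point; TRUE if any split signals a change.
window_has_drift <- function(window, delta = 0.002, min_len = 5) {
  n <- length(window)
  if (n < 2 * min_len) return(FALSE)
  for (k in seq(min_len, n - min_len)) {
    if (adwin_cut(window[1:k], window[(k + 1):n], delta)) return(TRUE)
  }
  FALSE
}
```

In the real algorithm the older sub-window is dropped when a cut is found, which is what makes the window length adaptive.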
delta: Significance threshold for drift detection
clock: Frequency of drift checks
max_buckets: Maximum buckets per level
min_window_length: Minimum window length for comparison
grace_period: Initial period before detection starts
drift_detected: Check if drift was detected (active binding)
new()
Initialize ADWIN detector
ADWIN$new( delta = 0.002, clock = 32, max_buckets = 5, min_window_length = 5, grace_period = 10 )
delta: Significance threshold (default 0.002)
clock: Drift check frequency (default 32)
max_buckets: Max buckets per level (default 5)
min_window_length: Min window for comparison (default 5)
grace_period: Startup period (default 10)
reset()
Reset the detector state
ADWIN$reset()
add_element()
Add a new element to the detector
ADWIN$add_element(value)
value: Numeric value to add
detected_change()
Check if drift was detected
ADWIN$detected_change()
Logical indicating drift detection
width()
Get current window width
ADWIN$width()
Integer window width
n_detections()
Get number of detections
ADWIN$n_detections()
Integer count of detections
estimation()
Get current mean estimate
ADWIN$estimation()
Numeric mean
variance()
Get current variance
ADWIN$variance()
Numeric variance
clone()
The objects of this class are cloneable with this method.
ADWIN$clone(deep = FALSE)
deep: Whether to make a deep clone.
Bifet, A., & Gavalda, R. (2007). Learning from time-changing data with adaptive windowing. In Proceedings of the 2007 SIAM International Conference on Data Mining (pp. 443-448).
    set.seed(12345)
    adwin <- ADWIN$new()

    # Stream: 1000 values from {0,1}, then 1000 values from {4,5,6,7}
    stream <- c(sample(0:1, 1000, replace = TRUE),
                sample(4:7, 1000, replace = TRUE))

    for (i in seq_along(stream)) {
      adwin$add_element(stream[i])
      if (adwin$detected_change()) {
        message("Change detected at index ", i, ", input value: ", stream[i])
      }
    }
Implements the Drift Detection Method (DDM), used for detecting concept drift in data streams by analyzing the performance of online learners. The method monitors changes in the error rate of a learner, signaling potential concept drift.
DDM is designed to be simple yet effective for detecting concept drift by monitoring the error rate of any online classifier. The method is particularly sensitive to increases in the error rate, which is typically a strong indicator of concept drift.
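As a rough illustration of the error-rate monitoring described above, the following base-R sketch tracks the running error rate p and its standard deviation s, remembers the point where p + s was lowest, and reports the first warning and drift positions. It is a simplified reading of Gama et al. (2004), not the package's code: it does not reset after a drift, and the name `ddm_sketch` is hypothetical.

```r
# Minimal DDM-style monitor over a 0/1 error stream (1 = misclassified).
ddm_sketch <- function(errors, min_instances = 30,
                       warning_level = 2, out_control_level = 3) {
  n_err <- 0
  p_min <- Inf
  s_min <- Inf
  first_warning <- NA_integer_
  for (i in seq_along(errors)) {
    n_err <- n_err + errors[i]
    p <- n_err / i                     # running error rate
    s <- sqrt(p * (1 - p) / i)         # binomial standard deviation of p
    if (i < min_instances) next
    if (p + s <= p_min + s_min) {      # new best (lowest) operating point
      p_min <- p
      s_min <- s
    }
    if (p + s > p_min + out_control_level * s_min) {
      return(list(drift = i, warning = first_warning))
    }
    if (is.na(first_warning) && p + s > p_min + warning_level * s_min) {
      first_warning <- i
    }
  }
  list(drift = NA_integer_, warning = first_warning)
}
```

With a stream whose error rate jumps from about 10% to 100%, the warning position precedes the drift position, which is the behaviour the `warning_level` and `out_control_level` multipliers are meant to produce.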
min_instances: Minimum number of instances required before drift detection begins.
warning_level: Multiplier for the standard deviation to set the warning threshold.
out_control_level: Multiplier for the standard deviation to set the out-of-control threshold.
sample_count: Counter for the number of samples processed.
miss_prob: Current estimated probability of misclassification.
miss_std: Current estimated standard deviation of misclassification probability.
miss_prob_sd_min: Minimum recorded value of misclassification probability plus its standard deviation.
miss_prob_min: Minimum recorded misclassification probability.
miss_sd_min: Minimum recorded standard deviation.
estimation: Current estimation of misclassification probability.
change_detected: Boolean indicating if a drift has been detected.
warning_detected: Boolean indicating if a warning level has been reached.
delay: Delay since the last relevant sample.
new()
Initializes the DDM detector with specific parameters.
DDM$new(min_num_instances = 30, warning_level = 2, out_control_level = 3)
min_num_instances: Minimum number of samples required before starting drift detection.
warning_level: Threshold multiplier for setting a warning level.
out_control_level: Threshold multiplier for setting the out-of-control level.
reset()
Resets the internal state of the DDM detector.
DDM$reset()
add_element()
Adds a new prediction error value to the model, updates the calculation of the misclassification probability and its standard deviation, and checks for warnings or drifts based on updated statistics.
DDM$add_element(prediction)
prediction: The new data point (prediction error) to be added to the model.
detected_change()
Returns a boolean indicating whether a drift has been detected based on the monitored statistics.
DDM$detected_change()
clone()
The objects of this class are cloneable with this method.
DDM$clone(deep = FALSE)
deep: Whether to make a deep clone.
João Gama, Pedro Medas, Gladys Castillo, Pedro Pereira Rodrigues: Learning with Drift Detection. SBIA 2004: 286-295
Implementation: https://github.com/scikit-multiflow/scikit-multiflow/blob/a7e316d1cc79988a6df40da35312e00f6c4eabb2/src/skmultiflow/drift_detection/ddm.py
    set.seed(123)  # Setting a seed for reproducibility
    data_part1 <- sample(c(0, 1), size = 100, replace = TRUE, prob = c(0.7, 0.3))
    # Introduce a change in data distribution
    data_part2 <- sample(c(0, 1), size = 100, replace = TRUE, prob = c(0.3, 0.7))
    # Combine the two parts
    data_stream <- c(data_part1, data_part2)
    ddm <- DDM$new()
    # Iterate through the data stream
    for (i in seq_along(data_stream)) {
      ddm$add_element(data_stream[i])
      if (ddm$change_detected) {
        message(paste("Drift detected!", i))
      } else if (ddm$warning_detected) {
        # message(paste("Warning detected at position:", i))
      }
    }
A user-friendly wrapper function that detects data drift in streaming data without requiring explicit for-loops. This function processes the entire data stream and returns all detected drift points (and optionally warning points) as a dataframe.
detect_drift( stream, method = c("ddm", "eddm", "page_hinkley", "hddm_a", "hddm_w", "kswin", "adwin"), include_warnings = TRUE, ... )
| Argument | Description |
|---|---|
| stream | Numeric vector representing the data stream to monitor. |
| method | Character string specifying the drift detection method to use. Options are: "ddm", "eddm", "page_hinkley", "hddm_a", "hddm_w", "kswin", "adwin". Default is "ddm". |
| include_warnings | Logical indicating whether to include warning detections in the results (applicable for DDM, EDDM, HDDM_A, and HDDM_W). Default is TRUE. |
| ... | Additional parameters to pass to the specific detector constructor. See individual detector documentation for available parameters. |
This function provides a simplified interface to all streaming drift detectors in the datadriftR package. Instead of manually iterating through the stream and checking for drift at each point, users can simply pass the entire stream to this function and receive a dataframe with all drift detection results.
The function supports the following drift detection methods:
ddm: Drift Detection Method
eddm: Early Drift Detection Method
page_hinkley: Page-Hinkley Test
hddm_a: HDDM with Adaptive Windows
hddm_w: HDDM with Weighted Windows
kswin: Kolmogorov-Smirnov Windowing
adwin: ADaptive WINdowing
A data.frame with the following columns:
Integer index in the stream where detection occurred
Numeric value at that index
Character indicating "drift" or "warning"
If no drift or warnings are detected, returns an empty data.frame with the same structure.
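To make the "no explicit for-loop" idea concrete, here is a small base-R sketch of the pattern such a wrapper automates: feed each value to a stateful detector and collect the flagged positions in a data.frame. Everything in it is hypothetical and for illustration only, including the toy detector, the helper names, and the column names `index` and `value`; it does not reproduce the internals or the exact output columns of `detect_drift()`.

```r
# Toy stateful detector: flags a value that lies more than `k` away from the
# running mean of the values seen before it. Not one of the package's methods.
make_toy_detector <- function(k = 3) {
  n <- 0
  m <- 0
  function(x) {
    flagged <- n > 0 && abs(x - m) > k
    n <<- n + 1
    m <<- m + (x - m) / n              # incremental running mean
    flagged
  }
}

# Run a detector over the whole stream and tabulate the flagged positions.
collect_detections <- function(stream, step) {
  hits <- vapply(stream, step, logical(1))
  data.frame(index = which(hits), value = stream[hits])
}
```

When nothing is flagged, the helper still returns a data.frame with the same two columns and zero rows, mirroring the empty-result behaviour described above.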
    library(datadriftR)
    set.seed(123)

    # Generate synthetic stream with drift at index 501
    pre <- sample(c(0, 1), 500, replace = TRUE, prob = c(0.7, 0.3))
    post <- sample(c(0, 1), 500, replace = TRUE, prob = c(0.3, 0.7))
    stream <- c(pre, post)

    # Detect drift using DDM
    results <- detect_drift(stream, method = "ddm")
    print(results)

    # Detect drift using Page-Hinkley with custom parameters
    results <- detect_drift(stream, method = "page_hinkley", delta = 0.005, threshold = 50)
    print(results)

    # Detect drift using KSWIN
    results <- detect_drift(stream, method = "kswin")
    print(results)

    # Get only drift detections, exclude warnings
    results <- detect_drift(stream, method = "ddm", include_warnings = FALSE)
    print(results)
This class implements the Early Drift Detection Method (EDDM), which detects concept drift in online learning by monitoring the distance between consecutive classification errors. EDDM is intended to detect gradual drift earlier than error-rate-based methods such as DDM, while still handling abrupt changes.
EDDM is a statistical process control method that is more sensitive to changes that happen more slowly and can provide early warnings of deterioration before the error rate increases significantly.
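The distance-between-errors idea can be sketched in a few lines of base R. The sketch below is a batch simplification, not the package's online implementation: it computes the running mean and standard deviation of the gaps between errors and the ratio of (mean + 2 sd) to its historical maximum, which EDDM compares with the warning and out-of-control thresholds. The function name `eddm_ratio` and the output column names are hypothetical, and the minimum-error-count rule is omitted.

```r
# For a 0/1 error stream, return the EDDM-style ratio after each error.
eddm_ratio <- function(errors) {
  err_pos <- which(errors == 1)
  d <- diff(err_pos)                       # distances between consecutive errors
  k <- seq_along(d)
  run_mean <- cumsum(d) / k                # running mean distance
  run_var <- pmax(cumsum(d^2) / k - run_mean^2, 0)
  m2s <- run_mean + 2 * sqrt(run_var)      # mean + 2 standard deviations
  data.frame(index = err_pos[-1],
             distance = d,
             ratio = m2s / cummax(m2s))    # falls below 1 as errors get closer
}
```

A ratio that drops below the warning threshold (0.95 by default in `EDDM$new()`) and then below the out-of-control threshold (0.9) corresponds to the warning and drift states.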
eddm_warning: Warning threshold setting.
eddm_outcontrol: Out-of-control threshold setting.
m_num_errors: Current number of errors encountered.
m_min_num_errors: Minimum number of errors to initialize drift detection.
m_n: Total instances processed.
m_d: Distance to the last error from the current instance.
m_lastd: Distance to the previous error from the last error.
m_mean: Mean of the distances between errors.
m_std_temp: Temporary standard deviation accumulator for the distances.
m_m2s_max: Maximum mean plus two standard deviations observed.
delay: Delay count since the last detected change.
estimation: Current estimated mean distance between errors.
warning_detected: Boolean indicating if a warning has been detected.
change_detected: Boolean indicating if a change has been detected.
new()
Initializes the EDDM detector with specific parameters.
EDDM$new(min_num_instances = 30, eddm_warning = 0.95, eddm_outcontrol = 0.9)
min_num_instances: Minimum number of errors before drift detection starts.
eddm_warning: Threshold for warning level.
eddm_outcontrol: Threshold for out-of-control level.
reset()
Resets the internal state of the EDDM detector.
EDDM$reset()
add_element()
Adds a new observation and updates the drift detection status.
EDDM$add_element(prediction)
prediction: Numeric value representing a new error (usually 0 or 1).
clone()
The objects of this class are cloneable with this method.
EDDM$clone(deep = FALSE)
deep: Whether to make a deep clone.
Early Drift Detection Method. Manuel Baena-Garcia, Jose Del Campo-Avila, Raúl Fidalgo, Albert Bifet, Ricard Gavalda, Rafael Morales-Bueno. In Fourth International Workshop on Knowledge Discovery from Data Streams, 2006.
Implementation: https://github.com/scikit-multiflow/scikit-multiflow/blob/a7e316d1cc79988a6df40da35312e00f6c4eabb2/src/skmultiflow/drift_detection/eddm.py
    set.seed(123)  # Setting a seed for reproducibility
    data_part1 <- sample(c(0, 1), size = 100, replace = TRUE, prob = c(0.7, 0.3))
    # Introduce a change in data distribution
    data_part2 <- sample(c(0, 1), size = 100, replace = TRUE, prob = c(0.3, 0.7))
    # Combine the two parts
    data_stream <- c(data_part1, data_part2)
    eddm <- EDDM$new()
    for (i in seq_along(data_stream)) {
      eddm$add_element(data_stream[i])
      if (eddm$change_detected) {
        message(paste("Drift detected!", i))
      } else if (eddm$warning_detected) {
        message(paste("Warning detected!", i))
      }
    }
This class implements the HDDM_A drift detection method that uses adaptive windows to detect changes in the mean of a data stream. It is designed to monitor online streams of data and can detect increases or decreases in the process mean in a non-parametric and online manner.
HDDM_A adapts to changes in the data stream by adjusting its internal windows to track the minimum and maximum values of the process mean. It triggers alerts when a significant drift from these benchmarks is detected.
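The kind of comparison behind "a significant drift from these benchmarks" can be illustrated with a Hoeffding-style bound on the difference between the overall mean and the mean of the reference prefix. The sketch below follows the general form used in the referenced scikit-multiflow implementation as far as can be told from it; treat the exact bound as an assumption rather than a statement of what this package computes. The function name `hoeffding_mean_incr` is hypothetical; its arguments mirror those of `mean_incr()`.

```r
# TRUE if the mean over all total_n samples exceeds the mean over the first
# n_min samples by more than a Hoeffding-style bound at the given confidence.
hoeffding_mean_incr <- function(c_min, n_min, total_c, total_n, confidence) {
  if (n_min == total_n) return(FALSE)       # nothing new to compare against
  m <- (total_n - n_min) / n_min * (1 / total_n)
  bound <- sqrt(m / 2 * log(2 / confidence))
  (total_c / total_n - c_min / n_min) >= bound
}
```

Here `c_min / n_min` is the mean of the reference prefix and `total_c / total_n` the mean of everything seen so far; a smaller `confidence` value widens the bound and makes the test more conservative.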
drift_confidence: Confidence level for detecting a drift.
warning_confidence: Confidence level for warning detection.
two_side_option: Boolean flag for one-sided or two-sided mean monitoring.
total_n: Total number of samples seen.
total_c: Total cumulative sum of the samples.
n_max: Maximum window end for sample count.
c_max: Maximum window end for cumulative sum.
n_min: Minimum window start for sample count.
c_min: Minimum window start for cumulative sum.
n_estimation: Number of samples since the last detected change.
c_estimation: Cumulative sum since the last detected change.
change_detected: Boolean indicating if a change was detected.
warning_detected: Boolean indicating if a warning has been detected.
estimation: Current estimated mean of the stream.
delay: Current delay since the last update.
new()
Initializes the HDDM_A detector with specific settings.
HDDM_A$new( drift_confidence = 0.001, warning_confidence = 0.005, two_side_option = TRUE )
drift_confidence: Confidence level for drift detection.
warning_confidence: Confidence level for issuing warnings.
two_side_option: Whether to monitor both increases and decreases.
add_element()
Adds an element to the data stream and updates the detection status.
HDDM_A$add_element(prediction)
prediction: Numeric, the new data value to add.
mean_incr()
Calculates if there is an increase in the mean.
HDDM_A$mean_incr(c_min, n_min, total_c, total_n, confidence)
c_min: Minimum cumulative sum.
n_min: Minimum count of samples.
total_c: Total cumulative sum.
total_n: Total number of samples.
confidence: Confidence threshold for detection.
mean_decr()
Calculates if there is a decrease in the mean.
HDDM_A$mean_decr(c_max, n_max, total_c, total_n)
c_max: Maximum cumulative sum.
n_max: Maximum count of samples.
total_c: Total cumulative sum.
total_n: Total number of samples.
reset()
Resets all internal counters and accumulators to their initial state.
HDDM_A$reset()
update_estimations()
Updates estimations of the mean after detecting changes.
HDDM_A$update_estimations()
clone()
The objects of this class are cloneable with this method.
HDDM_A$clone(deep = FALSE)
deep: Whether to make a deep clone.
Frías-Blanco I, del Campo-Ávila J, Ramos-Jimenez G, et al. Online and non-parametric drift detection methods based on Hoeffding’s bounds. IEEE Transactions on Knowledge and Data Engineering, 2014, 27(3): 810-823.
Albert Bifet, Geoff Holmes, Richard Kirkby, Bernhard Pfahringer. MOA: Massive Online Analysis; Journal of Machine Learning Research 11: 1601-1604, 2010.
Implementation: https://github.com/scikit-multiflow/scikit-multiflow/blob/a7e316d1cc79988a6df40da35312e00f6c4eabb2/src/skmultiflow/drift_detection/hddm_a.py
    set.seed(123)  # Setting a seed for reproducibility
    data_part1 <- sample(c(0, 1), size = 100, replace = TRUE, prob = c(0.7, 0.3))
    # Introduce a change in data distribution
    data_part2 <- sample(c(0, 1), size = 100, replace = TRUE, prob = c(0.3, 0.7))
    # Combine the two parts
    data_stream <- c(data_part1, data_part2)
    # Initialize the hddm_a object
    hddm_a_instance <- HDDM_A$new()
    # Iterate through the data stream
    for (i in seq_along(data_stream)) {
      hddm_a_instance$add_element(data_stream[i])
      if (hddm_a_instance$warning_detected) {
        message(paste("Warning detected at index:", i))
      }
      if (hddm_a_instance$change_detected) {
        message(paste("Concept drift detected at index:", i))
      }
    }
Implements HDDM_W, the weighted-window variant of the Hoeffding-bound drift detection methods of Frías-Blanco et al. (2014). It tracks an exponentially weighted moving average (EWMA) of the stream and detects changes in a non-parametric and online manner.
HDDM_W weights recent observations through the decay rate lambda_option and monitors the EWMA estimate for significant increases and, when two_side_option is TRUE, decreases. It reports a warning or a drift depending on which confidence level (warning_confidence or drift_confidence) is exceeded.
drift_confidence: Confidence level for detecting a drift (default: 0.001).
warning_confidence: Confidence level for warning detection (default: 0.005).
lambda_option: Decay rate for the EWMA statistic, smaller values give less weight to recent data (default: 0.050).
two_side_option: Boolean flag for one-sided or two-sided error monitoring (default: TRUE).
total: Container for the EWMA estimator and its bounded conditional sum.
sample1_decr_monitor: First sample monitor for detecting decrements.
sample1_incr_monitor: First sample monitor for detecting increments.
sample2_decr_monitor: Second sample monitor for detecting decrements.
sample2_incr_monitor: Second sample monitor for detecting increments.
incr_cutpoint: Cutpoint for deciding increments.
decr_cutpoint: Cutpoint for deciding decrements.
width: Current width of the window.
delay: Delay count since last reset.
change_detected: Boolean indicating if a change was detected.
warning_detected: Boolean indicating if currently in a warning zone.
estimation: The current estimation of the stream's mean.
new()
Initializes the HDDM_W detector with specific parameters.
HDDM_W$new( drift_confidence = 0.001, warning_confidence = 0.005, lambda_option = 0.05, two_side_option = TRUE )
drift_confidence: Confidence level for drift detection.
warning_confidence: Confidence level for issuing warnings.
lambda_option: Decay rate for the EWMA statistic.
two_side_option: Whether to monitor both increases and decreases.
add_element()
Adds a new element to the data stream and updates the detection status.
HDDM_W$add_element(prediction)
prediction: The new data value to add.
SampleInfo()
Provides current information about the monitoring samples, typically used for debugging or monitoring.
HDDM_W$SampleInfo()
reset()
Resets the internal state to initial conditions.
HDDM_W$reset()
detect_mean_increment()
Detects an increment in the mean between two samples based on the provided confidence level.
HDDM_W$detect_mean_increment(sample1, sample2, confidence)
sample1: First sample information, containing EWMA estimator and bounded conditional sum.
sample2: Second sample information, containing EWMA estimator and bounded conditional sum.
confidence: The confidence level used for calculating the bound.
Boolean indicating if an increment in mean was detected.
monitor_mean_incr()
Monitors the data stream for an increase in the mean based on the set confidence level.
HDDM_W$monitor_mean_incr(confidence)
confidence: The confidence level used to detect changes in the mean.
Boolean indicating if an increase in the mean was detected.
monitor_mean_decr()
Monitors the data stream for a decrease in the mean based on the set confidence level.
HDDM_W$monitor_mean_decr(confidence)
confidence: The confidence level used to detect changes in the mean.
Boolean indicating if a decrease in the mean was detected.
update_incr_statistics()
Updates increment statistics for drift monitoring based on new values and confidence. This method adjusts the cutpoint for increments and updates the monitoring samples.
HDDM_W$update_incr_statistics(value, confidence)
value: The new value to update statistics.
confidence: The confidence level for the update.
update_decr_statistics()
Updates decrement statistics for drift monitoring based on new values and confidence. This method adjusts the cutpoint for decrements and updates the monitoring samples.
HDDM_W$update_decr_statistics(value, confidence)
value: The new value to update statistics.
confidence: The confidence level for the update.
clone()
The objects of this class are cloneable with this method.
HDDM_W$clone(deep = FALSE)
deep: Whether to make a deep clone.
Frías-Blanco I, del Campo-Ávila J, Ramos-Jimenez G, et al. Online and non-parametric drift detection methods based on Hoeffding’s bounds. IEEE Transactions on Knowledge and Data Engineering, 2014, 27(3): 810-823.
Albert Bifet, Geoff Holmes, Richard Kirkby, Bernhard Pfahringer. MOA: Massive Online Analysis; Journal of Machine Learning Research 11: 1601-1604, 2010.
Implementation: https://github.com/scikit-multiflow/scikit-multiflow/blob/a7e316d1cc79988a6df40da35312e00f6c4eabb2/src/skmultiflow/drift_detection/hddm_w.py
    set.seed(123)  # Setting a seed for reproducibility
    data_part1 <- sample(c(0, 1), size = 100, replace = TRUE, prob = c(0.7, 0.3))
    # Introduce a change in data distribution
    data_part2 <- sample(c(0, 1), size = 100, replace = TRUE, prob = c(0.3, 0.7))
    # Combine the two parts
    data_stream <- c(data_part1, data_part2)
    # Initialize the HDDM_W object
    hddm_w_instance <- HDDM_W$new()
    # Iterate through the data stream
    for (i in seq_along(data_stream)) {
      hddm_w_instance$add_element(data_stream[i])
      if (hddm_w_instance$warning_detected) {
        message(paste("Warning detected at index:", i))
      }
      if (hddm_w_instance$change_detected) {
        message(paste("Concept drift detected at index:", i))
      }
    }
Implements the Kullback-Leibler Divergence (KLD) calculation between two probability distributions using histograms. The class can detect drift by comparing the divergence to a predefined threshold.
The Kullback-Leibler Divergence (KLD) is a measure of how one probability distribution diverges from a second, expected probability distribution. This class uses histograms to approximate the distributions and calculates the KLD to detect changes over time. If the divergence exceeds a predefined threshold, it signals a detected drift.
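A histogram-based KLD can be sketched in base R as follows. This is one plausible way to do it under stated assumptions, not necessarily how the class bins its data: both samples are binned on a shared grid spanning their joint range, `epsilon` is added to every bin probability, and the samples are assumed not to be all identical (otherwise the grid collapses). The function name `kld_hist` is hypothetical.

```r
# KL divergence D(P || Q) estimated from two numeric samples via histograms.
kld_hist <- function(p_sample, q_sample, bins = 10,
                     epsilon = 1e-10, base = exp(1)) {
  rng <- range(c(p_sample, q_sample))
  breaks <- seq(rng[1], rng[2], length.out = bins + 1)   # shared bin edges
  p <- hist(p_sample, breaks = breaks, plot = FALSE)$counts
  q <- hist(q_sample, breaks = breaks, plot = FALSE)$counts
  p <- p / sum(p) + epsilon            # avoid log(0) and division by zero
  q <- q / sum(q) + epsilon
  sum(p * log(p / q, base = base))
}
```

Comparing the returned value with a threshold such as `drift_level = 0.2` gives the drift decision described above; bins that are occupied in one sample and empty in the other dominate the sum.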
epsilon: Value to add to small probabilities to avoid log(0) issues.
base: The base of the logarithm used in KLD calculation.
bins: Number of bins used for the histogram.
drift_level: The threshold for detecting drift.
drift_detected: Boolean indicating if drift has been detected.
p: Initial distribution.
kl_result: The result of the KLD calculation.
new()
Initializes the KLDivergence class.
KLDivergence$new(epsilon = 1e-10, base = exp(1), bins = 10, drift_level = 0.2)
epsilon: Value to add to small probabilities to avoid log(0) issues.
base: The base of the logarithm used in KLD calculation.
bins: Number of bins used for the histogram.
drift_level: The threshold for detecting drift.
reset()
Resets the internal state of the detector.
KLDivergence$reset()
set_initial_distribution()
Sets the initial distribution.
KLDivergence$set_initial_distribution(initial_p)
initial_p: The initial distribution.
add_distribution()
Adds a new distribution and calculates the KLD.
KLDivergence$add_distribution(q)
q: The new distribution.
calculate_kld()
Calculates the KLD between two distributions.
KLDivergence$calculate_kld(p, q)
p: The initial distribution.
q: The new distribution.
The KLD value.
get_kl_result()
Returns the current KLD result.
KLDivergence$get_kl_result()
The current KLD value.
is_drift_detected()
Checks if drift has been detected.
KLDivergence$is_drift_detected()
TRUE if drift is detected, otherwise FALSE.
clone()
The objects of this class are cloneable with this method.
KLDivergence$clone(deep = FALSE)
deep: Whether to make a deep clone.
Kullback, S., and Leibler, R.A. (1951). On Information and Sufficiency. Annals of Mathematical Statistics, 22(1), 79-86.
    set.seed(123)  # Setting a seed for reproducibility
    initial_data <- c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0)
    kld <- KLDivergence$new(bins = 10, drift_level = 0.2)
    kld$set_initial_distribution(initial_data)
    new_data <- c(0.2, 0.2, 0.3, 0.4, 0.4, 0.5, 0.6, 0.7, 0.7, 0.8)
    kld$add_distribution(new_data)
    kl_result <- kld$get_kl_result()
    message(paste("KL Divergence:", kl_result))
    if (kld$is_drift_detected()) {
      message("Drift detected.")
    }
Implements the Kolmogorov-Smirnov test for detecting distribution changes within a window of streaming data. KSWIN is a non-parametric method for change detection that compares two samples to determine if they come from the same distribution.
KSWIN is effective for detecting changes in the underlying distribution of data streams. It is particularly useful in scenarios where data properties may evolve over time, allowing for early detection of changes that might affect subsequent data processing.
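The two-sample comparison at the heart of KSWIN can be sketched with `stats::ks.test()`. The sketch assumes the scheme described by Raab et al. (2020): the most recent `stat_size` values are tested against a random sample of the same size drawn from the older part of the window. The function name `kswin_check` is hypothetical, and the package may apply additional conditions before signalling a change.

```r
# One KSWIN-style check on a full window of observations.
kswin_check <- function(window, stat_size = 30, alpha = 0.005) {
  n <- length(window)
  stopifnot(n >= 2 * stat_size)
  recent <- window[(n - stat_size + 1):n]               # newest observations
  older <- sample(window[1:(n - stat_size)], stat_size) # random older sample
  p_value <- suppressWarnings(ks.test(older, recent)$p.value)
  list(p_value = p_value, drift = p_value < alpha)
}
```

Because the older sample is drawn at random, repeated checks on the same window can give different p-values; calling `set.seed()` first makes a run reproducible.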
alpha: Significance level for the KS test.
window_size: Total size of the data window used for testing.
stat_size: Number of data points sampled from the window for the KS test.
window: Current data window used for change detection.
change_detected: Boolean flag indicating whether a change has been detected.
p_value: P-value of the most recent KS test.
new()
Initializes the KSWIN detector with specific settings.
KSWIN$new(alpha = 0.005, window_size = 100, stat_size = 30, data = NULL)
alpha: The significance level for the KS test.
window_size: The size of the data window for change detection.
stat_size: The number of samples in the statistical test window.
data: Initial data to populate the window, if provided.
reset()
Resets the internal state of the detector to its initial conditions.
KSWIN$reset()
add_element()
Adds a new element to the data window and updates the detection status based on the KS test.
KSWIN$add_element(x)
x: The new data value to add to the window.
detected_change()
Checks if a change has been detected based on the most recent KS test.
KSWIN$detected_change()
Boolean indicating whether a change was detected.
clone()
The objects of this class are cloneable with this method.
KSWIN$clone(deep = FALSE)
deep: Whether to make a deep clone.
Christoph Raab, Moritz Heusinger, Frank-Michael Schleif, Reactive Soft Prototype Computing for Concept Drift Streams, Neurocomputing, 2020.
Implementation: https://github.com/scikit-multiflow/scikit-multiflow/blob/a7e316d1cc79988a6df40da35312e00f6c4eabb2/src/skmultiflow/drift_detection/kswin.py
    set.seed(123)
    x <- c(rnorm(100, mean = 0, sd = 1), rnorm(100, mean = 3, sd = 1))

    # High-level interface (returns a data.frame of detections)
    detect_drift(x, method = "kswin", alpha = 0.001, window_size = 50, stat_size = 20)

    # Online usage (update one observation at a time)
    kswin <- KSWIN$new(alpha = 0.001, window_size = 50, stat_size = 20)
    drift_idx <- integer()
    for (i in seq_along(x)) {
      kswin$add_element(x[i])
      if (kswin$detected_change()) {
        drift_idx <- c(drift_idx, i)
      }
    }
    drift_idx
Implements the Page-Hinkley test, a sequential analysis technique used to detect changes in the average value of a continuous signal or process. It is effective in detecting small but persistent changes over time, making it suitable for real-time monitoring applications.
The Page-Hinkley test is a type of cumulative sum (CUSUM) test that accumulates differences between data points and a reference value (running mean). It triggers a change detection signal when the cumulative sum exceeds a predefined threshold. This test is especially useful for early detection of subtle shifts in the behavior of the monitored process.
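The accumulation described above can be written as a short base-R loop. This is a minimal sketch under assumptions, not the class itself: it detects upward shifts only, clamps the cumulative sum at zero, does not reset after a detection, and reuses the parameter names of `PageHinkley$new()` for readability. The function name `page_hinkley_first_drift` is hypothetical.

```r
# Return the index of the first Page-Hinkley detection, or NA if none occurs.
page_hinkley_first_drift <- function(x, min_instances = 30, delta = 0.05,
                                     threshold = 50, alpha = 1) {
  x_mean <- 0
  ph_sum <- 0
  for (i in seq_along(x)) {
    x_mean <- x_mean + (x[i] - x_mean) / i               # running mean
    # accumulate deviations above the running mean, minus the tolerance delta
    ph_sum <- max(0, alpha * ph_sum + (x[i] - x_mean - delta))
    if (i >= min_instances && ph_sum > threshold) return(i)
  }
  NA_integer_
}
```

A larger `threshold` delays detection but reduces false alarms, while `delta` sets how large a deviation must be before it contributes to the sum.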
min_instances: Minimum number of instances required to start detection.
delta: Minimal change considered significant for detection.
threshold: Decision threshold for signaling a change.
alpha: Forgetting factor for the cumulative sum calculation.
x_mean: Running mean of the observed values.
sample_count: Counter for the number of samples seen.
sum: Weighted cumulative sum used for mean calculation.
PH: Page-Hinkley statistic.
min_PH: Minimum value of PH statistic observed.
change_detected: Boolean indicating if a drift has been detected.
new()
Initializes the Page-Hinkley test with specific parameters.
PageHinkley$new(min_instances = 30, delta = 0.05, threshold = 50, alpha = 1)
min_instances: Minimum number of samples before detection starts.
delta: Change magnitude to trigger detection.
threshold: Cumulative sum threshold for change detection.
alpha: Weight for older data in cumulative sum.
reset()
Resets all the internal states of the detector to initial values.
PageHinkley$reset()
add_element()
Adds a new element to the data stream and updates the detection status based on the Page-Hinkley test.
PageHinkley$add_element(x)
x: New data value to add and evaluate.
detected_change()
Checks if a change has been detected based on the last update.
PageHinkley$detected_change()
Boolean indicating whether a change was detected.
get_PH()
Returns the current Page-Hinkley statistic.
PageHinkley$get_PH()
The current PH value.
clone()
The objects of this class are cloneable with this method.
PageHinkley$clone(deep = FALSE)
deep: Whether to make a deep clone.
E. S. Page. 1954. Continuous Inspection Schemes. Biometrika 41, 1/2 (1954), 100–115.
Montiel, Jacob, et al. "Scikit-Multiflow: A Multi-output Streaming Framework." Journal of Machine Learning Research, 2018. This framework provides tools for multi-output and stream data mining and was an inspiration for some of the implementations in this class.
Implementation: https://github.com/scikit-multiflow/scikit-multiflow/blob/a7e316d1cc79988a6df40da35312e00f6c4eabb2/src/skmultiflow/drift_detection/page_hinkley.py
    set.seed(123)  # Setting a seed for reproducibility
    data_part1 <- sample(c(0, 1), size = 100, replace = TRUE, prob = c(0.7, 0.3))

    # Introduce a change in data distribution
    data_part2 <- sample(c(0, 5), size = 100, replace = TRUE, prob = c(0.3, 0.7))

    # Combine the two parts
    data_stream <- c(data_part1, data_part2)
    ph <- PageHinkley$new()
    for (i in seq_along(data_stream)) {
      ph$add_element(data_stream[i])
      if (ph$detected_change()) {
        cat(sprintf("Change has been detected in data: %s - at index: %d\n", data_stream[i], i))
      }
    }
Implements the calculation of profile differences using various methods such as PDI, L2, and L2 derivative. The class provides methods for setting profiles and calculating the differences.
The class supports multiple methods for calculating profile differences, including the Profile Disparity Index (PDI) using gold or simple derivative methods, and L2 norm and L2 derivative calculations. It allows for customization of various parameters such as embedding dimensions, derivative orders, and thresholds.
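As a rough illustration of the two families of measures named above, the base R sketch below computes a sign-based disparity and a plain L2 distance for profiles given as `list(x, y)`. The helpers `slope_signs`, `pdi_sketch`, and `l2_sketch` are hypothetical and are not the package's implementation. The sketch assumes that PDI is the share of points at which the derivative signs of the two profiles disagree, uses simple finite differences in place of the gold or spline-based derivatives, and assumes an unnormalized Euclidean norm for L2.

```r
# Hypothetical helpers, not part of the package.

# Signs of finite-difference slopes for a profile given as list(x, y).
slope_signs <- function(profile) {
  sign(diff(profile$y) / diff(profile$x))
}

# Assumed PDI: fraction of intervals where the slope signs disagree.
pdi_sketch <- function(profile1, profile2) {
  mean(slope_signs(profile1) != slope_signs(profile2))
}

# Assumed L2: Euclidean distance between the profile values on a shared grid.
l2_sketch <- function(profile1, profile2) {
  sqrt(sum((profile1$y - profile2$y)^2))
}

x <- seq(0, 1, length.out = 50)
rising <- list(x = x, y = x^2)
falling <- list(x = x, y = -x^2)

pdi_sketch(rising, rising)   # 0: the slopes agree everywhere
pdi_sketch(rising, falling)  # 1: the slopes disagree everywhere
l2_sketch(rising, rising)    # 0: identical profiles
```

A sign-based measure of this kind ignores vertical offsets between profiles and reacts only to differences in shape, whereas the L2 distance is sensitive to both.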
method: The method used for profile difference calculation.
deriv: The method used for derivative calculation.
gold_spline: Boolean indicating if cubic spline should be used in gold method.
gold_embedding: Embedding dimension for gold method.
nderiv: Order of the derivative for simple method.
gold_spline_threshold: Threshold for cubic spline in gold method.
epsilon: Small value to avoid numerical issues.
profile1: The first profile.
profile2: The second profile.
new()
Initializes the ProfileDifference class.
ProfileDifference$new(method = "pdi", deriv = "gold", gold_spline = TRUE, gold_embedding = 4, nderiv = 4, gold_spline_threshold = 0.01, epsilon = NULL)
method: The method used for profile difference calculation.
deriv: The method used for derivative calculation.
gold_spline: Boolean indicating if cubic spline should be used in gold method.
gold_embedding: Embedding dimension for gold method.
nderiv: Order of the derivative for simple method.
gold_spline_threshold: Threshold for cubic spline in gold method.
epsilon: Small value to avoid numerical issues.
reset()
Resets the internal state of the detector.
ProfileDifference$reset()
set_profiles()
Sets the profiles for comparison.
ProfileDifference$set_profiles(profile1, profile2)
profile1: The first profile.
profile2: The second profile.
calculate_difference()
Calculates the difference between the profiles.
ProfileDifference$calculate_difference()
A list containing the method details and the calculated distance.
clone()
The objects of this class are cloneable with this method.
ProfileDifference$clone(deep = FALSE)
deep: Whether to make a deep clone.
Kobylińska, K., Krzyziński, M., Machowicz, R., Adamek, M., & Biecek, P. (2023). Exploration of the Rashomon Set Assists Trustworthy Explanations for Medical Data. arXiv preprint arXiv:2308.11446.
    set.seed(123)
    profile1 <- list(x = 1:100, y = sin(1:100))
    profile2 <- list(x = 1:100, y = sin(1:100) + rnorm(100, 0, 0.1))
    pd <- ProfileDifference$new(method = "pdi", deriv = "gold")
    pd$set_profiles(profile1, profile2)
    result <- pd$calculate_difference()
    message("Method:", result$method)
    message("Distance:", round(result$distance, 4))