Neural networks detect anomalies by learning patterns in complex datasets; in climate data, they are used to identify unusual weather conditions or measurement errors. EcoLlama uses a neural network based on the Llama3 model as its control mechanism, with an architecture that can be customized through fine-tuning. In some cases, it also leverages the outputs of code that interacts with plugins referred to as "external modules" to perform complementary analyses. This serves two purposes: evaluating data beyond the groups the model generalizes well to, and validating its existing outputs. The approach can be thought of as a company broadening its scope by studying the work of its competitors. EcoLlama always works with external modules in its comprehensive analyses. If the data obtained from the external modules are consistent with its own results, the output is considered validated. If there is an inconsistency, it gathers additional external-module data and proceeds according to a set of evaluation criteria: principles such as data diversity, scope of operations, and data structure determine which data is treated as correct.
Below are the external modules, and their functions, planned for the EcoLlama 1.0 Beta. The descriptions focus on anomaly detection in climate data; for several modules (marked "sketched below"), minimal illustrative code sketches follow the list.
- Benford's Law: Benford's Law is a statistical rule describing the distribution of leading digits in a dataset, with significant applications in climate science. It can be an effective tool for detecting potential anomalies in climate data: unexpected changes or errors become apparent when a dataset fails to conform to the law's expected distribution. For example, errors in measurements or records, or attempts at data manipulation, can be revealed through a Benford analysis (sketched below). Conformance with Benford's Law can support the reliability of the data, but it should always be considered alongside other verification methods.
- Z-Score Analysis: The z-score measures how far a data point lies from the mean of the dataset, in units of standard deviation. In climate data, high z-scores in variables like temperature or precipitation flag values that deviate significantly from the rest of the dataset (sketched below). Such deviations may point to potential anomalies and can support the early detection of natural disasters or climate changes.
- Interquartile Range (IQR): The interquartile range spans the middle 50% of a dataset and is defined as the distance between the first and third quartiles (Q1 and Q3). In climate data, values falling more than 1.5 × IQR below Q1 or above Q3 are conventionally considered abnormal (sketched below). This method is used to identify abnormal temperature changes or extreme precipitation events.
- Grubbs' Test: Grubbs' test detects a single outlier in a dataset (sketched below). In climate science, it is useful for determining whether a specific measurement in a small dataset differs significantly from the others, which could indicate a measurement error or an extreme weather event.
- Boxplot Method: A boxplot visualizes the distribution of data; the box spans the interquartile range, and the whiskers typically extend 1.5 × IQR beyond it. In climate data, points beyond the whiskers can be considered anomalies, such as extreme temperatures or unexpected precipitation.
- Median Absolute Deviation (MAD): MAD measures the typical deviation of data points from the median of the dataset and is robust to the very outliers it is used to find (sketched below). In climate data, values with extreme deviations may indicate extreme weather conditions or data collection errors, both of which matter for early warning systems.
- Mahalanobis Distance: The Mahalanobis distance measures how far a point lies from the center of a multivariate dataset while accounting for correlations between variables (sketched below). In climate data, this method is used to detect atmospheric variability and the extreme weather events it produces.
- Density-Based Spatial Clustering of Applications with Noise (DBSCAN): DBSCAN is a density-based clustering algorithm that labels points in low-density regions as noise, which can be treated as anomalies (sketched below). In climate data, this method is used to identify unexpected changes, particularly in temperature and humidity data.
- K-Nearest Neighbors (KNN): KNN-based anomaly detection compares each data point with its nearest neighbors, for example via the average distance to them, and flags points that deviate significantly. In climate data, the KNN method is effective for detecting unusual weather conditions or data entry errors.
- Local Outlier Factor (LOF): LOF compares the local density of each data point with that of its neighbors and flags points whose density is markedly lower (sketched below). In climate science, the LOF method is used to identify temperature or pressure measurements that behave unusually relative to their local surroundings.
- Principal Component Analysis (PCA): PCA can be used for anomaly detection in high-dimensional data. In climate data, PCA projects the data onto a smaller number of dimensions; points with extreme scores in the reduced space, or with large reconstruction error, can indicate extreme weather events or measurement errors.
- Gaussian Mixture Models (GMM): A GMM models the dataset as a mixture of normal distributions. In climate data, points assigned a low likelihood under the fitted mixture can be considered abnormal weather conditions or data processing errors.
- Time Series Analysis: In time-dependent climate data, time series methods (e.g., ARIMA models) can detect movements that depart from the expected trend and seasonality. This can help identify seasonal anomalies and long-term climate changes.
- Support Vector Machines (SVM): SVMs find the hyperplane that best separates data points into two classes; the one-class variant instead learns a boundary around the normal data. In climate data, SVM-based anomaly detection classifies points as normal or abnormal, identifying unexpected weather events or measurement errors.
- Isolation Forest: Isolation Forest rests on the assumption that anomalies are rare and therefore easier to isolate by random partitioning (sketched below). In climate data, this method is effective for detecting unusual deviations, especially in temperature and humidity data.
- Seasonal-Trend Decomposition using Loess (STL): STL analyzes a time series by decomposing it into trend, seasonal, and residual components (sketched below). In climate data, this method helps separate ordinary seasonal variation from anomalies that remain in the residual.
- Parametric and Non-parametric Tests: Parametric tests (such as the t-test) and non-parametric tests (such as the Mann-Whitney U test) are used to detect unexpected changes in a dataset. In climate data, these tests can be used to identify unusual weather conditions or measurement errors.
- Neyman-Pearson Hypothesis Test: The Neyman-Pearson approach seeks the most powerful test for distinguishing between two hypotheses at a fixed false-alarm rate. In climate data, it can be used to detect abnormal measurements and unexpected changes.
- Automatic Statistical Process Control (SPC) Methods: SPC methods are used to detect deviations and out-of-control conditions in a dataset. In climate science, these methods are effective for identifying extreme weather conditions or unexpected measurement errors.
- Manual Review of Suspect Data: Finally, potential anomalies identified by algorithms can be verified through manual reviews and expert evaluations. In climate data, expert review is critical for identifying erroneous data entries or incorrect measurements.
- Histogram: A histogram is used to visualize the frequency distribution of a data set. In climate data, histograms can reveal abnormal frequency distributions by visualizing how often certain temperature or precipitation levels occur.
- Autoregressive Integrated Moving Average (ARIMA): ARIMA models capture autocorrelation and trends in time series data in order to forecast future values. In climate data, ARIMA can be used for forecasting seasonal changes and for anomaly detection, for example by flagging large forecast errors.
- Run Sequence Analysis: Run sequence analysis is used to identify patterns and anomalies in a time series or sequential data. In climate data, this method is useful for monitoring the impacts of natural events or climate changes.
- Regression Analysis: Regression analysis is used to model the relationships between independent and dependent variables. In climate data, this method can be used to identify changes or trends caused by specific weather events.
- Ensemble Learning Methods (Bagging, Boosting): Ensemble learning methods improve anomaly detection by combining the results of multiple models. In climate data, these methods provide more robust and reliable data analysis.
- Advanced Neural Networks (Deep Learning): Deep neural networks learn patterns in complex data structures. In climate data, they enable more accurate weather forecasts and better estimates of the impacts of climate change.
- High-Frequency Data Analysis: High-frequency data analysis involves the analysis of data collected over short time intervals. In climate data, this method is used to detect sudden weather changes and short-term extreme events.
- Association Rule Mining: Association rules capture probabilistic relationships between variables or events. In climate data, this method can be used to examine the likelihood of specific climate events occurring together.
- Geographic Information Systems (GIS): Geographic information systems perform spatial analysis of climate data. This method can be used to detect climate anomalies in specific regions and analyze geographical patterns.
- Spatial Statistics: Spatial statistics are methods that examine the spatial distribution and relationships of data. In climate data, this method is used to determine geographic changes and spatial anomalies.
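The sketches below illustrate, in Python, how some of the modules above might flag anomalies. All function names, parameters, and thresholds are illustrative assumptions, not EcoLlama's actual module interfaces. First, a minimal Benford's Law check: it compares the observed leading-digit frequencies against the expected log10(1 + 1/d) distribution using a chi-square-style statistic.

```python
import numpy as np

def benford_deviation(values):
    """Compare the leading-digit distribution of `values` against
    Benford's Law via a chi-square statistic (illustrative sketch)."""
    values = np.abs(np.asarray(values, dtype=float))
    values = values[values > 0]
    # Leading digit: shift each value into [1, 10) and truncate.
    leading = (values / 10.0 ** np.floor(np.log10(values))).astype(int)
    observed = np.bincount(leading, minlength=10)[1:] / len(leading)
    # Benford's expected probability for leading digit d is log10(1 + 1/d).
    digits = np.arange(1, 10)
    expected = np.log10(1 + 1 / digits)
    # Large values suggest the data do not conform to Benford's Law.
    return len(leading) * np.sum((observed - expected) ** 2 / expected)
```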
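A z-score module could be as simple as the following; the 3-sigma threshold is a common convention, not a value taken from EcoLlama.

```python
import numpy as np

def zscore_anomalies(series, threshold=3.0):
    """Return indices of points whose absolute z-score exceeds `threshold`."""
    x = np.asarray(series, dtype=float)
    z = (x - x.mean()) / x.std()
    return np.where(np.abs(z) > threshold)[0]
```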
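The IQR rule translates directly into code; `k=1.5` is the conventional Tukey multiplier mentioned in the list item.

```python
import numpy as np

def iqr_anomalies(series, k=1.5):
    """Return indices of points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    x = np.asarray(series, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return np.where((x < q1 - k * iqr) | (x > q3 + k * iqr))[0]
```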
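Grubbs' test has no ready-made function in SciPy, so a module would likely implement the textbook formula itself, as in this sketch (two-sided form; assumes approximately normal data).

```python
import numpy as np
from scipy import stats

def grubbs_test(series, alpha=0.05):
    """Two-sided Grubbs' test for a single outlier.
    Returns (is_outlier, G, critical_value)."""
    x = np.asarray(series, dtype=float)
    n = len(x)
    g = np.max(np.abs(x - x.mean())) / x.std(ddof=1)
    # Critical value derived from the t-distribution.
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    g_crit = ((n - 1) / np.sqrt(n)) * np.sqrt(t**2 / (n - 2 + t**2))
    return g > g_crit, g, g_crit
```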
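A MAD-based module would typically use the modified z-score of Iglewicz and Hoaglin; the 0.6745 constant and the 3.5 cutoff come from that convention.

```python
import numpy as np

def mad_anomalies(series, threshold=3.5):
    """Return indices of points whose modified z-score exceeds `threshold`."""
    x = np.asarray(series, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    # 0.6745 makes MAD consistent with the standard deviation
    # under normality (Iglewicz & Hoaglin's modified z-score).
    modified_z = 0.6745 * (x - med) / mad
    return np.where(np.abs(modified_z) > threshold)[0]
```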
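For the multivariate case, a Mahalanobis module might flag points whose squared distance exceeds a chi-square quantile, since squared distances are approximately chi-square distributed with p degrees of freedom under multivariate normality.

```python
import numpy as np
from scipy import stats

def mahalanobis_anomalies(X, alpha=0.01):
    """Return indices of rows of X with unusually large Mahalanobis distance."""
    X = np.asarray(X, dtype=float)
    diff = X - X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    # Squared Mahalanobis distance of each row.
    d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)
    threshold = stats.chi2.ppf(1 - alpha, df=X.shape[1])
    return np.where(d2 > threshold)[0]
```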
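A DBSCAN module could lean on scikit-learn, treating the noise label (-1) as the anomaly class; `eps` and `min_samples` would need tuning for the data at hand.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

def dbscan_anomalies(X, eps=0.5, min_samples=5):
    """Return indices of points DBSCAN labels as noise (-1)."""
    X_scaled = StandardScaler().fit_transform(X)  # DBSCAN is scale-sensitive
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X_scaled)
    return np.where(labels == -1)[0]
```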
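Likewise for LOF: scikit-learn's LocalOutlierFactor returns -1 for points whose local density is anomalously low. The contamination rate used here is an assumed parameter.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def lof_anomalies(X, n_neighbors=20, contamination=0.05):
    """Return indices of points flagged as local-density outliers."""
    lof = LocalOutlierFactor(n_neighbors=n_neighbors,
                             contamination=contamination)
    labels = lof.fit_predict(X)  # -1 marks outliers
    return np.where(labels == -1)[0]
```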
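An Isolation Forest module follows the same pattern; points the forest isolates quickly receive the -1 label.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def iforest_anomalies(X, contamination=0.05, random_state=0):
    """Return indices of points the forest isolates as anomalies."""
    model = IsolationForest(contamination=contamination,
                            random_state=random_state)
    labels = model.fit_predict(X)  # -1 marks anomalies
    return np.where(labels == -1)[0]
```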
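Finally, an STL-based module might decompose the series with statsmodels and apply a z-score rule to the residual component; the period (here 12, as for monthly data) is an assumption.

```python
import numpy as np
from statsmodels.tsa.seasonal import STL

def stl_residual_anomalies(series, period=12, threshold=3.0):
    """Return indices where the STL residual is more than
    `threshold` standard deviations from its mean."""
    result = STL(np.asarray(series, dtype=float), period=period).fit()
    resid = result.resid
    z = (resid - resid.mean()) / resid.std()
    return np.where(np.abs(z) > threshold)[0]
```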
These modules are among the planned additions for the EcoLlama 1.0 Beta and can help detect anomalies in climate data more effectively. Each module supports a specific analytical perspective and is tailored to different data types.