My Webpage

Abstract

We propose using Ensemble Learning to improve Cluster Analysis accuracy by combining multiple models.

Our project aims to enhance clustering accuracy for datasets with related and non-related attributes.

By applying Ensemble Learning techniques, we can uncover hidden patterns and improve real-world clustering applications.

Cluster Analysis

Clustering groups similar data points together without relying on pre-defined labels.
It uncovers hidden patterns and structures in data that may not be immediately apparent.
Clustering simplifies complex data sets by summarizing many observations as a small number of groups, which aids analysis and trend identification.
Used as a preprocessing step, it can also improve the performance of other machine learning algorithms, for example by isolating noisy or outlying points.

Ensemble Clustering

Ensemble clustering combines the results of multiple clustering algorithms into a single consensus partition, aiming for more accurate and robust results than any one algorithm gives alone.
It improves stability, reliability, and accuracy of clustering, particularly for complex data sets with noise or outliers.
Applications include image analysis, text mining, and bioinformatics, where traditional clustering techniques struggle with large, high-dimensional data sets.
Ensemble clustering can also help determine a suitable number of clusters and can reveal structure that individual algorithms miss.
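
One common way to combine several clusterings is a co-association (evidence accumulation) matrix. The sketch below is an illustration of that general idea, not the method used in this project: the synthetic data, the choice of KMeans as the base algorithm, and the helper name co_association are all assumptions made for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs


def co_association(label_sets):
    """Fraction of base clusterings that put each pair of points together."""
    label_sets = np.asarray(label_sets)            # shape: (n_runs, n_samples)
    same = label_sets[:, :, None] == label_sets[:, None, :]
    return same.mean(axis=0)                       # shape: (n_samples, n_samples)


# Synthetic stand-in data (assumption: not the Weather History dataset).
X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [6, 6], [0, 6]],
    cluster_std=0.8,
    random_state=0,
)

# Base clusterings: KMeans started from different random seeds.
runs = [
    KMeans(n_clusters=3, n_init=1, random_state=seed).fit_predict(X)
    for seed in range(5)
]

C = co_association(runs)

# Consensus partition: average-linkage clustering on the distance 1 - C,
# cut so that at most 3 clusters remain.
Z = linkage(squareform(1.0 - C, checks=False), method="average")
consensus = fcluster(Z, t=3, criterion="maxclust")
```

The matrix has one entry per pair of points, so for a data set with tens of thousands of records it would have to be built on a sample rather than on the full data.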

About Dataset

The Weather History dataset provides historical weather data for various locations. It contains detailed information about weather conditions recorded over a specific period.

Attributes:

Formatted Date: The date and time of the recorded weather data.
Summary: A brief summary of the weather condition.
Precip Type: Indicates the type of precipitation, such as rain or snow.
Temperature (C): The temperature in Celsius.
Apparent Temperature (C): The perceived temperature in Celsius.
Humidity: The relative humidity recorded.
Wind Speed (km/h): The speed of the wind in kilometers per hour.
Wind Bearing (degrees): The direction of the wind in degrees.
Visibility (km): The visibility distance in kilometers.
Loud Cover: A value indicating the presence of cloud cover (0 or 1). The column name is spelled "Loud Cover" in the dataset itself.
Pressure (millibars): Atmospheric pressure measured in millibars.
Daily Summary: A summary of the weather conditions for the day.

The dataset contains a total of 96,453 records, each representing a specific timestamp with corresponding weather measurements. It offers valuable insights for analyzing and modeling historical weather patterns.

Data Records Info

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96453 entries, 0 to 96452
Data columns (total 12 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Formatted Date            96453 non-null  object 
 1   Summary                   96453 non-null  object 
 2   Precip Type               95936 non-null  object 
 3   Temperature (C)           96453 non-null  float64
 4   Apparent Temperature (C)  96453 non-null  float64
 5   Humidity                  96453 non-null  float64
 6   Wind Speed (km/h)         96453 non-null  float64
 7   Wind Bearing (degrees)    96453 non-null  float64
 8   Visibility (km)           96453 non-null  float64
 9   Loud Cover                96453 non-null  float64
 10  Pressure (millibars)      96453 non-null  float64
 11  Daily Summary             96453 non-null  object 
dtypes: float64(8), object(4)
memory usage: 8.8+ MB
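
The summary above shows that Precip Type is the only column with missing values (95,936 of 96,453 non-null). A minimal preprocessing sketch under stated assumptions follows: the page does not say which columns were clustered or how missing values were handled, so the feature list, the decision to drop incomplete rows, the standardization step, and the helper name prepare_features are all hypothetical. The small frame is hand-made for illustration and is not taken from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Assumption: cluster on these numeric measurements only.
NUMERIC_COLUMNS = [
    "Temperature (C)",
    "Apparent Temperature (C)",
    "Humidity",
    "Wind Speed (km/h)",
    "Wind Bearing (degrees)",
    "Visibility (km)",
    "Pressure (millibars)",
]


def prepare_features(df):
    """Drop rows with a missing Precip Type and standardize the numeric columns."""
    cleaned = df.dropna(subset=["Precip Type"])
    scaled = StandardScaler().fit_transform(cleaned[NUMERIC_COLUMNS])
    return pd.DataFrame(scaled, columns=NUMERIC_COLUMNS, index=cleaned.index)


# Illustrative rows only (made-up values, not real records).
sample = pd.DataFrame(
    {
        "Precip Type": ["rain", "snow", None, "rain"],
        "Temperature (C)": [9.5, -1.2, 15.0, 21.3],
        "Apparent Temperature (C)": [7.4, -4.0, 15.0, 21.3],
        "Humidity": [0.89, 0.92, 0.60, 0.45],
        "Wind Speed (km/h)": [14.1, 8.0, 3.9, 11.2],
        "Wind Bearing (degrees)": [251.0, 10.0, 180.0, 90.0],
        "Visibility (km)": [15.8, 4.0, 9.9, 16.1],
        "Pressure (millibars)": [1015.1, 1020.4, 1012.0, 1009.8],
    }
)

features = prepare_features(sample)
```

Standardizing matters for distance-based clustering because pressure in millibars would otherwise dominate columns such as humidity, which lies between 0 and 1.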

Number of Clusters

Elbow Graph

Calinski-Harabasz
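
Both diagnostics named above can be computed in a few lines. The sketch below is only an illustration on synthetic data: the range of k, the use of KMeans, and the blob data are assumptions, and the page does not state how its own curves were produced.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

# Synthetic stand-in data with four well-separated groups (assumption).
X, _ = make_blobs(
    n_samples=400,
    centers=[[0, 0], [8, 0], [0, 8], [8, 8]],
    cluster_std=0.7,
    random_state=0,
)

inertias = {}    # within-cluster sum of squares, plotted in an elbow graph
ch_scores = {}   # Calinski-Harabasz index, higher is better

for k in range(2, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_
    ch_scores[k] = calinski_harabasz_score(X, model.labels_)

# The k with the largest Calinski-Harabasz index is one candidate choice.
best_k = max(ch_scores, key=ch_scores.get)

for k in sorted(inertias):
    print(f"k={k}  inertia={inertias[k]:.1f}  CH={ch_scores[k]:.1f}")
```

In an elbow graph the inertia is plotted against k and the chosen value is where the curve stops dropping sharply; the Calinski-Harabasz index gives a second, independent opinion.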

Traditional Clustering Algorithm Analysis

Algorithm	No. of Clusters	Davies-Bouldin Score	Silhouette Score
KMeans Clustering	4	0.401	0.608
Mean Shift Clustering	4	0.435	0.867
Agglomerative Clustering	4	0.405	0.588
Spectral Clustering	4	0.401	0.605
OPTICS Clustering	4	1.870	-0.561
BIRCH Clustering	4	0.405	0.028
Ensembled Clustering	4	0.184	0.873
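
A comparison of this kind can be produced with scikit-learn's built-in metrics. The following is a rough sketch on synthetic data, not a reproduction of the table: only three of the listed algorithms are included for brevity, and their parameters are assumptions.

```python
from sklearn.cluster import AgglomerativeClustering, Birch, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Synthetic stand-in data (assumption: not the Weather History dataset).
X, _ = make_blobs(
    n_samples=400,
    centers=[[0, 0], [8, 0], [0, 8], [8, 8]],
    cluster_std=0.7,
    random_state=0,
)

models = {
    "KMeans": KMeans(n_clusters=4, n_init=10, random_state=0),
    "Agglomerative": AgglomerativeClustering(n_clusters=4),
    "BIRCH": Birch(n_clusters=4),
}

scores = {}
for name, model in models.items():
    labels = model.fit_predict(X)
    scores[name] = {
        "davies_bouldin": davies_bouldin_score(X, labels),  # lower is better
        "silhouette": silhouette_score(X, labels),          # higher is better
    }

for name, s in scores.items():
    print(f"{name}\t{s['davies_bouldin']:.3f}\t{s['silhouette']:.3f}")
```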

Davies-Bouldin Score
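
For reference, the two scores reported in the table have the following standard definitions. Here a(i) is the mean distance from point i to the other points of its own cluster, b(i) is the mean distance from point i to the points of the nearest other cluster, sigma_i is the mean distance of the points in cluster i to its centroid c_i, and d(c_i, c_j) is the distance between two centroids. The symbol names are chosen for this note.

```latex
% Silhouette coefficient of point i, and its mean over all n points
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad
S = \frac{1}{n} \sum_{i=1}^{n} s(i)

% Davies-Bouldin index for k clusters
\mathrm{DB} = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i}
  \frac{\sigma_i + \sigma_j}{d(c_i, c_j)}
```

The silhouette score lies between -1 and 1 and is better when higher; the Davies-Bouldin index is non-negative and is better when lower.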

Conclusion

The voting-based ensemble of the Mean Shift and BIRCH clustering algorithms yielded the highest silhouette score (0.873) and the lowest Davies-Bouldin score (0.184) of all methods compared on the Weather History dataset.
The higher silhouette score indicates that points lie close to the other members of their own cluster and far from neighboring clusters, which suggests that the ensemble captures inherent structure in the weather data.
The lower Davies-Bouldin score indicates compact, well-separated clusters with minimal overlap, that is, high intra-cluster similarity and low inter-cluster similarity, supporting the identification of homogeneous groups within the dataset.
On this dataset, the ensemble of Mean Shift and BIRCH improved on each individual algorithm, which suggests that ensembling is a useful way to increase clustering accuracy and robustness.
Overall, these results support data exploration and a better understanding of the underlying structure of the Weather History dataset.
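
The page does not describe how the votes were combined, so the following is only a sketch of one plausible voting scheme. Cluster ids from different algorithms are arbitrary, so each labeling is first aligned to a reference with the Hungarian algorithm and a per-point majority vote is taken afterwards. The helper names align_labels and majority_vote are hypothetical, and the three label arrays are hand-made examples rather than Mean Shift or BIRCH output.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def align_labels(reference, labels, n_clusters):
    """Relabel `labels` so its cluster ids agree with `reference` as much as possible."""
    overlap = np.zeros((n_clusters, n_clusters), dtype=int)
    for ref_id, lab_id in zip(reference, labels):
        overlap[lab_id, ref_id] += 1
    # Hungarian algorithm: negate to maximize the total overlap.
    rows, cols = linear_sum_assignment(-overlap)
    mapping = dict(zip(rows, cols))
    return np.array([mapping[lab_id] for lab_id in labels])


def majority_vote(label_sets, n_clusters):
    """Align every labeling to the first one, then vote point by point."""
    reference = np.asarray(label_sets[0])
    aligned = [reference] + [
        align_labels(reference, np.asarray(labels), n_clusters)
        for labels in label_sets[1:]
    ]
    votes = np.stack(aligned)                      # shape: (n_models, n_samples)
    # Ties are resolved in favor of the smallest cluster id.
    return np.array(
        [np.bincount(column, minlength=n_clusters).argmax() for column in votes.T]
    )


# Hand-made labelings of nine points (illustrative only).
a = [0, 0, 0, 1, 1, 1, 2, 2, 2]
b = [2, 2, 2, 0, 0, 1, 1, 1, 1]   # permuted ids, one point assigned differently
c = [1, 1, 0, 0, 0, 0, 2, 2, 2]   # permuted ids, one point assigned differently

consensus = majority_vote([a, b, c], n_clusters=3)
print(consensus)
```

With only two base algorithms a plain majority vote can tie on every disagreement, so a scheme like this needs either a third voter or an explicit tie-breaking rule.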