About Dataset
Introduction

The Weather Prediction Dataset contains meteorological data collected from 18 different European cities or places, including Basel (Switzerland), Budapest (Hungary), Dresden, Düsseldorf, Kassel, München (Germany), De Bilt and Maastricht (the Netherlands), Heathrow (UK), Ljubljana (Slovenia), Malmo and Stockholm (Sweden), Montélimar, Perpignan and Tours (France), Oslo (Norway), Roma (Italy), and Sonnblick (Austria).


The dataset includes daily observations from the years 2000 to 2010, 3654 in total. The variables are mean, maximum, and minimum temperature, cloud cover, wind speed, wind gust, humidity, sea level pressure, global radiation, precipitation, and sunshine. The data has undergone basic cleaning: columns with more than 5% invalid entries were removed, and invalid entries in the remaining columns were replaced with mean values. As part of the preprocessing, some units were also transformed so that all attributes fall in similar ranges.


The dataset comprises 165 variables over 3654 days. Temperatures are given in degrees Celsius, wind speed and gust in m/s, humidity as a fraction of 100%, sea level pressure in units of 1000 hPa, global radiation in units of 100 W/m², precipitation amounts in centimeters, and sunshine in hours.
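
The cleaning rule described above (drop columns with more than 5% invalid entries, then fill the remaining gaps with a mean) can be sketched as follows. This is a minimal illustration, not the dataset authors' code: it assumes invalid entries are encoded as NaN and that "mean values" refers to the per-column mean. The function name clean_columns and the toy data are hypothetical, and the unit rescaling is not shown.

```python
import numpy as np

def clean_columns(X, max_invalid_frac=0.05):
    """Drop columns with too many NaNs, then mean-impute the rest.

    Returns the cleaned array and the indices of the columns that were kept.
    Assumption: invalid entries are represented as NaN.
    """
    X = np.asarray(X, dtype=float)
    invalid_frac = np.isnan(X).mean(axis=0)       # share of invalid entries per column
    keep = np.flatnonzero(invalid_frac <= max_invalid_frac)
    kept = X[:, keep].copy()                      # copy so the input is not modified
    col_means = np.nanmean(kept, axis=0)          # means over the valid entries only
    rows, cols = np.nonzero(np.isnan(kept))
    kept[rows, cols] = col_means[cols]            # fill each gap with its column mean
    return kept, keep

# Toy data standing in for the real observations.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:2, 0] = np.nan     # 2% invalid: column is kept and imputed
X[:20, 1] = np.nan    # 20% invalid: column is dropped
cleaned, kept_idx = clean_columns(X)
```

Here column 1 exceeds the 5% threshold and is removed, while the two gaps in column 0 are filled with that column's mean.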

Attribute Details
Attributes:
  • Cloud Cover (CC): Extent of cloud cover in oktas.
  • Wind Direction (DD): Direction of the wind in degrees.
  • Wind Speed (FG): Wind speed in m/s.
  • Wind Gust (FX): Maximum wind gust in m/s.
  • Humidity (HU): Humidity as a fraction of 100%.
  • Sea Level Pressure (PP): Sea level pressure in units of 1000 hPa.
  • Global Radiation (QQ): Global radiation in units of 100 W/m².
  • Precipitation Amount (RR): Amount of precipitation in units of 10 mm (centimeters).
  • Sunshine (SS): Duration of sunshine in hours.
  • Mean Temperature (TG): Mean temperature in °C.
  • Minimum Temperature (TN): Minimum temperature in °C.
  • Maximum Temperature (TX): Maximum temperature in °C.
Number of Clusters
[Figure: Calinski-Harabasz graph]

[Figure: Silhouette score]
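
One way to pick the number of clusters from these two criteria is to fit a clusterer for several candidate values and compare the scores. The sketch below uses scikit-learn's calinski_harabasz_score and silhouette_score. The choice of KMeans, the candidate range 2 to 6, and the synthetic blobs standing in for the weather features are assumptions for illustration, not the settings used in this analysis.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score, silhouette_score

# Synthetic stand-in for the feature matrix: three well-separated groups (assumption).
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Higher is better for both criteria.
    scores[k] = (calinski_harabasz_score(X, labels), silhouette_score(X, labels))

best_k_ch = max(scores, key=lambda k: scores[k][0])
best_k_sil = max(scores, key=lambda k: scores[k][1])
```

Plotting each score against k gives graphs of the kind shown above. When the two criteria disagree, the choice is a judgment call.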

Traditional Clustering Algorithm Analysis
Algorithm                         No. of Clusters   Davies-Bouldin Score   Silhouette Score
KMeans Clustering                 2                 0.937                   0.414
Affinity Propagation Clustering   2                 1.502                   0.171
Mean Shift Clustering             3                 0.955                  -0.002
Agglomerative Clustering          2                 1.021                   0.354
Spectral Clustering               2                 0.939                   0.412
OPTICS Clustering                 2                 1.300                  -0.255
Gaussian Clustering               2                 0.997                   0.381
BIRCH Clustering                  2                 0.970                   0.378
Ensembled Clustering              3                 0.683                   0.277
[Figure: Davies-Bouldin score]
[Figure: Silhouette score]
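
To make the Davies-Bouldin score concrete: for each cluster it takes the worst-case ratio of within-cluster scatter (sum of the two clusters' mean distances to their centroids) to the distance between the two centroids, then averages these ratios over all clusters. Lower values mean tighter, better-separated clusters. The following is a from-scratch sketch of that definition for illustration. The function name davies_bouldin and the toy points are hypothetical, and the scores in the table above were presumably produced with a library routine such as scikit-learn's davies_bouldin_score.

```python
import numpy as np

def davies_bouldin(X, labels):
    """Davies-Bouldin index from its definition (needs at least two clusters)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    ids = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in ids])
    # scatter[i]: mean distance from the members of cluster i to its centroid
    scatter = np.array([
        np.linalg.norm(X[labels == c] - centroids[i], axis=1).mean()
        for i, c in enumerate(ids)
    ])
    # sep[i, j]: distance between the centroids of clusters i and j
    sep = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=2)
    np.fill_diagonal(sep, np.inf)     # makes the i == j ratio zero, so it never wins the max
    ratios = (scatter[:, None] + scatter[None, :]) / sep
    return ratios.max(axis=1).mean()  # worst ratio per cluster, averaged

labels = np.array([0, 0, 1, 1])
tight = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
loose = np.array([[0.0, 0.0], [0.0, 4.0], [10.0, 0.0], [10.0, 4.0]])
db_tight = davies_bouldin(tight, labels)
db_loose = davies_bouldin(loose, labels)
```

With the centroids held 10 units apart, spreading the points within each cluster raises the score, which is why a lower value is read as better.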
Conclusion
  • Ensembling the Mean Shift and BIRCH clustering algorithms by voting yielded the lowest Davies-Bouldin score in the analysis of the Weather Prediction dataset (0.683). Its silhouette score (0.277) is higher than that of Mean Shift alone (-0.002) but lower than those of BIRCH (0.378) and KMeans (0.414).
  • The lower Davies-Bouldin score indicates compact, well-separated clusters, that is, high similarity within clusters and little overlap between them, which supports the identification of homogeneous groups within the dataset.
  • By the Davies-Bouldin score, the ensemble of Mean Shift and BIRCH improved on each algorithm used alone, which suggests that ensembling can make the clustering more robust.
  • Overall, these results support data exploration and a better understanding of the underlying structure of the Weather Prediction dataset.
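
The exact voting scheme behind the ensemble is not described here. One common way to combine labelings is pairwise (co-association) voting: two samples are linked when more than a given fraction of the base clusterings place them in the same cluster, and the connected groups become the consensus clusters. The sketch below illustrates that idea under this assumption. The function name consensus_labels, the threshold, and the toy labelings are hypothetical; in practice the inputs would be the label arrays returned by the Mean Shift and BIRCH models.

```python
import numpy as np

def consensus_labels(labelings, threshold=0.5):
    """Combine several labelings of the same samples by pairwise voting."""
    labelings = np.asarray(labelings)            # shape: (n_clusterings, n_samples)
    n = labelings.shape[1]
    # votes[i, j]: fraction of clusterings that put samples i and j together
    votes = (labelings[:, :, None] == labelings[:, None, :]).mean(axis=0)
    linked = votes > threshold
    # Connected components of the "linked" graph become the consensus clusters.
    out = np.full(n, -1)
    current = 0
    for start in range(n):
        if out[start] != -1:
            continue                             # already assigned to a component
        out[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in np.flatnonzero(linked[i] & (out == -1)):
                out[j] = current
                stack.append(j)
        current += 1
    return out

# Toy labelings standing in for the outputs of two base clusterers.
labels_a = [0, 0, 0, 1, 1, 1]    # two clusters
labels_b = [5, 5, 5, 9, 9, 7]    # three clusters, arbitrary label names
combined = consensus_labels([labels_a, labels_b])
```

Because only pairwise agreement is used, the label names of the base clusterings do not need to be aligned first. With two base clusterings and a threshold of 0.5, samples stay together only when both agree, so the consensus can have more clusters than either input.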