Copper Price Forecasting Using Machine Learning
Overview
This project focuses on forecasting monthly copper prices by employing advanced machine learning techniques. The project was undertaken to address the volatile nature of the copper market, enabling companies like MetalliQ Resources Inc. to make informed decisions for strategic planning, budgeting, and risk management.
Objective
To build a robust machine learning model capable of forecasting copper prices with high accuracy by analyzing historical data on commodity prices and exchange rates.
Dataset
Source
- Investing.com and other reputable financial platforms.
- Validated against multiple sources to ensure data reliability.
Features
- Copper Prices: Target variable for forecasting.
- Related Commodities: Iron, gold, crude oil prices.
- Exchange Rates: USD-AUD, USD-CLP, USD-CNY, USD-PEN.
- Time-Dependent Features: Month encoded using cyclical transformations (sine/cosine).
Timeframe
- 1991–2023
- Approximately 400 rows of data due to alignment across multiple datasets.
Preprocessing Steps
-
Data Cleaning:
- Handled missing values using interpolation techniques.
- Removed outliers to ensure model accuracy.
- Aligned features to a common timeframe and standardized currencies to USD.
-
Normalization:
- Applied Min-Max Scaling with a range of -1 to 1.
- Ensured separate scaling for training and testing datasets to prevent data leakage.
-
Feature Selection:
- Selected essential attributes: "Price" for each commodity and currency pair, alongside date information.
- Reduced dimensionality to focus on predictive variables.
-
Cyclical Encoding:
- Transformed the "Month" feature into sine and cosine components to capture seasonal patterns effectively.
-
Dataset Splitting:
- Non-random 80:20 train-test split to preserve time-series data integrity.
Machine Learning Models
Implemented Models
- GA-ANN: Genetic Algorithm-optimized Artificial Neural Network.
- GA-SVM: Support Vector Machine with genetic tuning.
- GA-KNN: K-Nearest Neighbors enhanced with genetic optimization.
- GA-GBT: Gradient Boosting Tree.
- GA-RF: Random Forest.
Why These Models?
- Captures the non-linear relationships between variables (GA-ANN, GA-SVM).
- Handles high-dimensional and complex data (GA-SVM, GA-GBT, GA-RF).
- Identifies patterns in time-series data (GA-ANN, GA-KNN).
Hyperparameter Tuning
- Used Distributed Evolutionary Algorithms in Python (DEAP) for fine-tuning:
- Parameters: population size, number of generations, mutation rate, crossover rate.
- Custom fitness functions optimized for project-specific goals (minimizing RMSE and MSE).
Validation
- Applied 10-fold time-series cross-validation to rigorously evaluate model robustness.
- Metrics Used:
- Root Mean Squared Error (RMSE)
- Mean Squared Error (MSE)
- Mean Absolute Error (MAE)
- Coefficient of Determination (R²)
My Contributions
1. Dataset Acquisition and Preprocessing
- Dataset Sourcing: Researched, identified, and compiled datasets from reliable sources.
- Data Cleaning:
- Aligned features from multiple datasets to a common timeframe.
- Normalized data for consistent scaling across models.
- Interpolated missing values and addressed data outliers.
- Feature Engineering:
- Applied Min-Max scaling and cyclical encoding for seasonal variables.
- Reduced dimensionality to focus on predictive variables.
2. Model Development
- Artificial Neural Network (ANN):
- Implemented and tuned the GA-ANN model for optimal performance.
- Fine-tuned hyperparameters using genetic algorithms.
- Conducted rigorous validation through time-series cross-validation.
- Testing:
- Compared model performance using RMSE, MSE, MAE, and R² metrics.
- Ensured the model generalized well to unseen test data.
3. Data Analysis Techniques
- Exploratory Data Analysis (EDA):
- Conducted correlation analysis to identify key relationships between features.
- Performed regression analysis on copper prices vs. exchange rates.
- Visualized principal components to reduce dimensionality and uncover patterns.
- Performance Evaluation:
- Evaluated and compared models based on statistical metrics and hypothesis testing.
- Ranked models using metrics to identify the best-performing solution.
Results
Best Model
- Genetic Algorithm-Optimized Artificial Neural Network (GA-ANN):
- RMSE: 0.1040 (Lowest among models)
- MSE: 0.0108 (Lowest among models)
- R²: 0.9171 (Highest among models)
- Ranked second in MAE: 0.0824
Comparative Metrics Summary
RMSE
- GA-ANN: 0.1040 (Best)
- GA-SVM: 0.1041
- GA-KNN: 0.1912
- GA-GBT: 0.128
- GA-RF: 0.126
MSE
- GA-ANN: 0.0108 (Best)
- GA-SVM: 0.0109
- GA-KNN: 0.0366
- GA-GBT: 0.0162
- GA-RF: 0.0160
MAE
- GA-ANN: 0.0824
- GA-SVM: 0.0811 (Best)
- GA-KNN: 0.1497
- GA-GBT: 0.1209
- GA-RF: 0.1007
R²
- GA-ANN: 0.9171 (Best)
- GA-SVM: 0.9168
- GA-KNN: 0.7200
- GA-GBT: 0.8180
- GA-RF: 0.8775
Recommendations
- Deploy the GA-ANN model for forecasting copper prices due to its superior performance across multiple metrics.
- Incorporate additional features like GDP growth, inflation, and unemployment rates to improve prediction accuracy.
Tools and Technologies
- Programming: Python (pandas, numpy, sklearn, DEAP)
- Data Manipulation: Excel for feature selection and alignment
- Validation: 10-fold cross-validation for time-series data
- Visualization: Matplotlib, PCA visualizations
Future Enhancements
- Feature Expansion:
- Incorporate macroeconomic indicators like GDP growth, inflation, and interest rates.
- Scalability:
- Leverage cloud platforms for real-time forecasting and increased dataset size.
- Deployment:
- Build a user-friendly interface for business stakeholders to utilize the forecasting model.