Startup Quant Fund

Enhancing Stock Price Prediction Accuracy Through Optimized Data Processing and Model Execution

This project aimed to improve the accuracy of stock price movement predictions by leveraging big data analytics and machine learning. Initially, the project faced challenges due to the poor quality of the available data (“dirty data”), which hindered the team’s ability to isolate meaningful signals for effective model training. This case study details our approach to overcoming these challenges and optimizing the prediction process.

Challenge: Data Integrity and Signal Generation

The core challenge was extracting actionable insights from noisy market data. In share trading, signal generation (identifying patterns indicative of future price movements) is crucial. These signals, derived from factors such as historical prices, volume, and technical indicators, inform buy/sell decisions. However, irrelevant or erroneous data (“noise”) can lead to inaccurate signals and, ultimately, poor trading decisions. Common signal generation techniques include approaches such as moving-average crossovers, momentum and mean-reversion rules, and volume-based indicators.

The noise in our initial data hampered the effectiveness of all of these techniques.
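
To make signal generation concrete, the sketch below implements one common technique, a trailing moving-average crossover, in R. The data frame layout, the close column, and the window lengths are illustrative assumptions, not the signals the fund actually trades.

```r
# Minimal trailing moving-average crossover signal (illustrative only).
# Assumes a data frame `prices` with a numeric `close` column in time order.
generate_signal <- function(prices, short_n = 20, long_n = 50) {
  short_ma <- stats::filter(prices$close, rep(1 / short_n, short_n), sides = 1)
  long_ma  <- stats::filter(prices$close, rep(1 / long_n,  long_n),  sides = 1)
  # +1 = short MA above long MA (buy), -1 = below (sell), 0 = not enough history yet
  signal <- ifelse(short_ma > long_ma, 1, -1)
  signal[is.na(signal)] <- 0
  as.numeric(signal)
}

# Usage with simulated prices, just to show the shape of the output:
set.seed(1)
prices <- data.frame(close = 100 + cumsum(rnorm(250, mean = 0.1)))
table(generate_signal(prices))
```

With clean inputs a rule like this produces stable, interpretable buy/sell states; with dirty inputs (bad ticks, gaps, duplicated rows) the same rule flips erratically, which is exactly the problem described above.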

Solution: Data Cleaning, Infrastructure Optimization, and Model Enhancement

The project involved a multi-pronged approach:

  1. Data Cleaning: Significant effort was dedicated to cleaning and preprocessing the raw market data, removing inconsistencies and errors to improve signal quality (an illustrative cleaning sketch follows this list).

  2. Infrastructure Upgrade: The original model backtesting process, running on local machines, took approximately 8 hours. Migrating to Google Cloud Platform (GCP) reduced this to 20 minutes, roughly a 24-fold speedup. This optimization balanced performance needs with cost-effectiveness, avoiding unnecessarily high-spec virtual machines (VMs).

  3. Model Refinement: The initial R-based models relied heavily on sequential processing (for loops), creating a bottleneck. Refactoring the code to introduce parallelism significantly reduced execution time. This involved leveraging R’s built-in support for parallel computation, drawing on functional programming patterns (see the parallelization sketch after this list).

  4. Tooling and Integration: A Bloomberg Excel plugin was integrated for data acquisition, requiring troubleshooting and customization to ensure seamless operation. Additionally, a Java/Spring Boot application was developed to support front/middle/back-office functionalities.
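
As a concrete illustration of step 1, the sketch below shows the kind of rule-based cleaning pass such a pipeline typically needs. The column names (ticker, date, close, volume) and the filtering rules are assumptions made for the example, not the project’s actual schema or rules.

```r
# Illustrative cleaning pass over raw daily price data (assumed column names).
clean_prices <- function(raw) {
  cleaned <- raw[!duplicated(raw[, c("ticker", "date")]), ]          # drop duplicated ticker/date rows
  cleaned <- cleaned[!is.na(cleaned$close) & cleaned$close > 0, ]    # drop missing or non-positive prices
  cleaned <- cleaned[!is.na(cleaned$volume) & cleaned$volume >= 0, ] # negative volume is a data error
  cleaned[order(cleaned$ticker, cleaned$date), ]                     # restore per-ticker time order
}
```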
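
For step 3, the sketch below shows the general refactoring pattern: a sequential for loop over a parameter grid replaced by a functional, parallel map using R’s base parallel package. Here run_backtest and the parameter grid are placeholders standing in for the real backtest, not the fund’s actual model.

```r
library(parallel)

# Placeholder for a single backtest over one parameter set; the real model is far heavier.
run_backtest <- function(params) {
  Sys.sleep(0.1)                     # stand-in for the expensive evaluation
  list(params = params, score = sum(unlist(params)))
}

param_grid <- lapply(1:100, function(i) list(short_n = i, long_n = i + 30))

# Original pattern: a sequential for loop over the grid.
results_seq <- vector("list", length(param_grid))
for (i in seq_along(param_grid)) {
  results_seq[[i]] <- run_backtest(param_grid[[i]])
}

# Refactored pattern: a functional map over the grid, run on one worker per core.
cl <- makeCluster(max(1, detectCores() - 1))
clusterExport(cl, "run_backtest")
results_par <- parLapply(cl, param_grid, run_backtest)
stopCluster(cl)
```

Because each parameter set is evaluated independently, the work distributes cleanly across worker processes, and the same pattern scales up further on higher-vCPU cloud machines.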

Internal Knowledge Sharing and Skill Development

Beyond technical improvements, the project fostered internal skill development. Team members received coaching and training in various technologies, including:

  • GCP: Leveraging cloud resources for computation and data storage.
  • Excel/G-Suite: Improving data manipulation and analysis skills.
  • Git/Bash: Adopting version control and building command-line proficiency.
  • R and Java Fundamentals: Enhancing programming skills relevant to the project.

Results and Conclusion

By addressing data quality issues, optimizing infrastructure, and refining model execution, the project significantly improved the speed and efficiency of stock price prediction. The transition to GCP drastically reduced backtesting time, while code parallelization further enhanced model performance. Furthermore, the project facilitated valuable skill development within the team, strengthening their expertise in relevant technologies. This project demonstrates the importance of data integrity, efficient infrastructure, and continuous learning in achieving accurate and timely financial predictions.