MIDS Capstone Project Summer 2024

Hoops IQ

Team members

Problem & Motivation

Hoops IQ responds to two trends: the growing interest in the WNBA and the increasing complexity of basketball data analysis. The core problem is the lack of accessible, user-friendly, and accurate data analytics tools tailored to non-expert basketball enthusiasts and fantasy sports players. Hoops IQ addresses this gap with a robust platform that integrates advanced data analysis, machine learning, and user-friendly interfaces to support engagement and decision-making in fantasy sports and broader WNBA fandom. The project aims to turn raw sports data into insightful, actionable information that improves the user experience and deepens understanding of the game.

Data Source & Data Science Approach

Data Source

We sourced a rich variety of WNBA statistics and information from multiple reliable platforms, covering the 2018 season onward, to ensure comprehensive coverage and accuracy. The primary sources are SportsDataverse for in-depth play-by-play event data, official WNBA resources for detailed player statistics and career data, and ESPN for up-to-date schedule and game information.

Data Science Approach

Tech Overview

We use Route53 to map the URL wnba.hoops-iq.com to our EC2 instance. The EC2 instance hosts two main components. The first is the chatbot interface, built in Streamlit, which the user interacts with. The second, working behind the scenes, is the LLM Translation component. It uses Athena to query our data stored in S3 and DynamoDB to handle chat history.

Front End Overview & Compute

In Route53, a record routes the traffic for our website URL to the DNS name of an EC2 load balancer. The load balancer has a listener on port 80 whose action is to forward to a target group. The target group's target is our EC2 instance on port 8080, which is where the Streamlit app runs.

LLM Architecture

The current model contains three components: the Table Retrieval Chain, the main Text-to-SQL Chain, and the Visualization Chain.

Table Retrieval Chain: This chain contains all tables’ names and descriptions and their corresponding columns’ names and descriptions. Based on user input, it returns a list of the most relevant tables’ names and descriptions.
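The retrieval step described above can be illustrated with a minimal sketch. It assumes a simple keyword-overlap score as a stand-in for whatever relevance logic the actual chain uses (most likely an LLM or embedding lookup, which the source does not specify). The `TableInfo` class, the `retrieve_tables` function, and the three catalog entries are hypothetical names invented for this example, not the project's real schema.

```python
import re
from dataclasses import dataclass, field


@dataclass
class TableInfo:
    """Hypothetical catalog entry: a table plus its column descriptions."""
    name: str
    description: str
    columns: dict[str, str] = field(default_factory=dict)  # column name -> description


def _tokens(text: str) -> set[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve_tables(question: str, catalog: list[TableInfo], top_k: int = 2) -> list[TableInfo]:
    """Return up to top_k tables whose names and descriptions share words with the question."""
    question_tokens = _tokens(question)
    scored = []
    for table in catalog:
        text = " ".join([table.name, table.description, *table.columns, *table.columns.values()])
        score = len(question_tokens & _tokens(text))
        if score > 0:
            scored.append((score, table))
    # Highest overlap first; ties keep their catalog order because the sort is stable.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [table for _, table in scored[:top_k]]


# Hypothetical catalog used only for illustration.
CATALOG = [
    TableInfo("player_box", "Per-game player box score statistics",
              {"player_name": "Player full name", "points": "Points scored in the game"}),
    TableInfo("schedule", "Game schedule with dates and venues",
              {"game_date": "Date of the game", "home_team": "Home team name"}),
    TableInfo("play_by_play", "Event-level play-by-play data",
              {"event_type": "Type of event such as shot or foul"}),
]

best = retrieve_tables("How many points did the player score?", CATALOG, top_k=1)
print([table.name for table in best])
```

A production version would likely filter stop words or use semantic similarity, since plain word overlap lets common words such as "the" contribute to the score.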

Text-to-SQL Chain: This chain performs the following steps:

1. Invokes the Table Retrieval Chain and fetches the list of relevant table information
2. Loads a list of query samples
3. Loads chat history from AWS DynamoDB
4. Combines the information from Steps 1 to 3 with the user input to form the Text-to-SQL prompt
5. Invokes the LangChain agent with the prompt
   a. The agent generates a SQL query based on the information in the prompt
   b. The agent attempts to query the database via AWS Athena
   c. The agent returns a natural language output if the query succeeds
   d. The agent retries or regenerates the SQL query if the query fails, until a set limit is reached
6. Fetches the output from the agent
7. Invokes the Visualization Chain with the natural language output
8. Returns the natural language output to the user
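The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's implementation: the real chain delegates generation and retries to a LangChain agent and runs queries on Athena, whereas here the model, the database, and the summarizer are passed in as plain functions. All names (`build_prompt`, `run_text_to_sql`, and the `fake_*` helpers) are hypothetical.

```python
from typing import Callable, Optional, Sequence


def build_prompt(question: str, tables: Sequence[str], samples: Sequence[str],
                 history: Sequence[str]) -> str:
    """Combine retrieved tables, sample queries, and chat history with the user input."""
    sections = [
        "Relevant tables:\n" + "\n".join(tables),
        "Example queries:\n" + "\n".join(samples),
        "Chat history:\n" + "\n".join(history),
        "Question: " + question,
    ]
    return "\n\n".join(sections)


def run_text_to_sql(prompt: str,
                    generate_sql: Callable[[str, Optional[str]], str],
                    run_query: Callable[[str], list],
                    summarize: Callable[[list], str],
                    max_attempts: int = 3) -> str:
    """Generate SQL, run it, and retry with the error message until the limit is reached."""
    last_error: Optional[str] = None
    for _ in range(max_attempts):
        sql = generate_sql(prompt, last_error)
        try:
            rows = run_query(sql)
        except Exception as exc:  # a failed query triggers a regeneration
            last_error = str(exc)
            continue
        return summarize(rows)
    return f"Sorry, I could not answer that after {max_attempts} attempts."


# Stand-ins for the LLM and the database, used only to exercise the loop.
def fake_generate_sql(prompt: str, last_error: Optional[str]) -> str:
    # The first attempt contains a typo; once an error is reported, it is fixed.
    return "SELECT 1" if last_error else "SELEC 1"


def fake_run_query(sql: str) -> list:
    if not sql.startswith("SELECT"):
        raise ValueError("syntax error")
    return [(1,)]


def fake_summarize(rows: list) -> str:
    return f"The answer is {rows[0][0]}."


prompt = build_prompt("How many?", ["demo: a demo table"], ["SELECT 1"], [])
print(run_text_to_sql(prompt, fake_generate_sql, fake_run_query, fake_summarize))
```

Passing the previous error back into the generator is one plausible way to make each retry more informed. The source states only that the agent retries or regenerates until a set limit is reached.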

Visualization Chain: This chain takes the natural language output from the Text-to-SQL Chain and decides which graph, if any, best represents the output data. If a graph is selected, the chain displays it to the user; if no graph is appropriate, it simply terminates.
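The graph-selection decision can be pictured with a small sketch. It makes two loud assumptions: the label and value pairs have already been extracted from the natural language output, and a hand-written rule replaces the decision the actual chain makes (the source does not describe the selection logic). The function name `choose_chart` and the chart names are hypothetical.

```python
import re
from typing import Optional

# Labels that look like a year, a year-month, or a full date (e.g. 2023, 2023-07, 2023-07-15).
DATE_PATTERN = re.compile(r"^\d{4}(-\d{2}){0,2}$")


def choose_chart(points: list[tuple[str, float]]) -> Optional[str]:
    """Pick a chart type for (label, value) pairs, or None when no chart is appropriate."""
    if len(points) < 2:
        return None  # a single value reads better as text, so the chain would terminate
    if all(DATE_PATTERN.match(label) for label, _ in points):
        return "line"  # values over time
    return "bar"  # comparison across categories


print(choose_chart([("Player A", 20.0)]))
print(choose_chart([("2022", 18.0), ("2023", 21.0)]))
print(choose_chart([("Player A", 20.0), ("Player B", 17.5)]))
```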

Evaluation

We developed a gold-standard question-and-answer set of statistics questions tailored to the needs of daily fantasy WNBA players. Our scoring framework specifically penalized models for hallucinations to ensure reliability. We compared several language models, both queried directly and run through our Text-to-SQL architecture, and found that models run through the Text-to-SQL architecture uniformly outperformed direct queries to the same models. Among the tested models, GPT-4o demonstrated the strongest performance within our Text-to-SQL framework, surpassing GPT-3.5, GPT-4 Turbo, GPT-4o Mini, and Claude 3.5 Sonnet. This evaluation highlights the effectiveness of a Text-to-SQL architecture for accurate and reliable basketball data analysis and the strength of GPT-4o in this context.
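One way to picture a scoring framework that penalizes hallucinations is sketched below. The weights (+1 for a correct answer, 0 for declining to answer, -1 for a wrong answer) and the numeric tolerance are assumptions chosen for illustration. The source does not state the actual scoring rule, and the names `Outcome`, `grade`, and `score_model` are hypothetical.

```python
from enum import Enum
from typing import Optional, Sequence


class Outcome(Enum):
    CORRECT = "correct"
    ABSTAINED = "abstained"        # the model declined to answer
    HALLUCINATED = "hallucinated"  # the model answered confidently but wrongly


# Assumed weights: a wrong answer costs as much as a right answer earns.
WEIGHTS = {Outcome.CORRECT: 1.0, Outcome.ABSTAINED: 0.0, Outcome.HALLUCINATED: -1.0}


def grade(predicted: Optional[float], gold: float, tolerance: float = 1e-6) -> Outcome:
    """Compare one numeric prediction against the gold-standard answer."""
    if predicted is None:
        return Outcome.ABSTAINED
    if abs(predicted - gold) <= tolerance:
        return Outcome.CORRECT
    return Outcome.HALLUCINATED


def score_model(predictions: Sequence[Optional[float]], gold_answers: Sequence[float]) -> float:
    """Average the weighted outcomes over the question set."""
    outcomes = [grade(p, g) for p, g in zip(predictions, gold_answers)]
    return sum(WEIGHTS[o] for o in outcomes) / len(outcomes)


# Made-up predictions: one correct, one abstention, one wrong answer.
print(score_model([10.0, None, 7.0], [10.0, 5.0, 8.0]))
```

Under this weighting, a model that abstains when unsure scores higher than one that guesses wrongly, which is the behavior a hallucination penalty is meant to reward.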

Key Learnings & Impact

A specialized architecture for WNBA data analysis is needed because reliable tools for understanding WNBA-specific statistics are currently lacking. Hoops IQ's Text-to-SQL capabilities translate natural language queries into SQL commands, simplifying data retrieval for analysts. Its dynamic multimodal UI provides interactive data visualization, improving user experience and understanding. Hoops IQ limits hallucinations to keep outputs reliable and trustworthy, and it streamlines manual processes so analysts can focus on higher-level analysis. This positions Hoops IQ as a next-generation tool for basketball data analysis, with comprehensive solutions for WNBA data. By automating routine tasks and providing intuitive interfaces, Hoops IQ helps analysts derive deeper insights and make data-driven decisions more efficiently. Its impact underscores the importance of integrating advanced NLP and visualization technologies in sports analytics.

Acknowledgements

A special thank you to our Capstone Advisors, Fred Nugen and Todd Holloway, for their support and guidance throughout this project, and to everyone who provided valuable feedback as we developed this tool!

Course

Data Science 210. Capstone, Summer 2024

Class Project Gallery

More Information

Hoops IQ - Project Webpage

Hoops IQ - Github Repo

Hoops IQ - Demo Video

hoopsiq-finalpresentation.pdf

Last updated: August 5, 2024