For my WPI IQP project, I traveled to Hangzhou, China, where I worked with a Chinese educational technology company (similar to Google Classroom) to collect data and build a prototype AI assistant that helps students improve their writing in preparation for the IELTS and TOEFL exams.
My responsibility on this project was single-handedly designing and implementing the writing portion of the prototype. I designed and built a Retrieval-Augmented Generation (RAG) system that evaluates TOEFL writing responses with rubric-aligned scoring and structured feedback. The goal was to move beyond generic LLM grading by grounding evaluations in official scoring criteria and curated exemplar essays retrieved from a database.
The system architecture combines semantic embeddings and algorithmic feature vectors with RAG to support large language model evaluation. I built a document ingestion pipeline that generates two performance vectors for each essay. The first is produced by running an LLM performance analysis, embedding the result with an embedding model, and storing it in a vector database for similarity search. The second is a much shorter custom feature vector that categorizes the essay using error counts and other algorithmically derived signals, providing a more consistent representation to counterbalance the LLM-embedded version. When a user submits an essay, the system retrieves the most relevant rubric descriptors and exemplar responses based on both vectors, then passes that grounded context to the LLM for scoring and feedback generation. I also designed and coded the SQL database management and the API calls for this pipeline.
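The dual-vector retrieval step can be sketched roughly as follows. This is a minimal illustration, not the production code: the corpus entries, the 70/30 weighting, and the `retrieve` helper are all hypothetical, standing in for the vector-database similarity search described above.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_emb, query_feat, corpus, k=2, w_emb=0.7, w_feat=0.3):
    """Rank stored essays by a weighted blend of the LLM-embedding
    similarity and the algorithmic feature-vector similarity."""
    scored = []
    for doc in corpus:
        score = (w_emb * cosine(query_emb, doc["embedding"])
                 + w_feat * cosine(query_feat, doc["features"]))
        scored.append((score, doc["id"]))
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]

# Toy corpus: each exemplar carries both vectors described above.
corpus = [
    {"id": "essay_a", "embedding": [0.9, 0.1, 0.0], "features": [3, 1, 0]},
    {"id": "essay_b", "embedding": [0.1, 0.9, 0.2], "features": [0, 4, 2]},
    {"id": "essay_c", "embedding": [0.8, 0.2, 0.1], "features": [2, 2, 1]},
]
print(retrieve([0.85, 0.15, 0.05], [3, 1, 1], corpus, k=2))
```

Blending the two scores lets the consistent algorithmic vector temper any drift in the LLM-derived embedding.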
I implemented prompt engineering strategies to enforce rubric-aligned grading across four criteria: task response, coherence and cohesion, lexical resource, and grammatical range and accuracy. The model outputs structured JSON containing predicted band scores, justifications tied to the retrieved rubric text, and targeted improvement recommendations, which are then shown to the user. The goal is for this to reduce hallucination and ensure traceable evaluation logic.
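A stripped-down sketch of the grounded prompt and the structured-output check might look like this. The prompt wording, schema field names, and `validate_output` helper are illustrative assumptions, not the exact prompts used in the prototype.

```python
import json

RUBRIC_CRITERIA = ["task_response", "coherence_cohesion",
                   "lexical_resource", "grammatical_range"]

def build_prompt(essay, rubric_snippets, exemplars):
    """Assemble a grounded prompt: retrieved rubric text and exemplar
    responses precede the essay, and the model is told to reply only
    in a fixed JSON schema."""
    return (
        "You are a writing rater. Score strictly against the rubric below.\n\n"
        "RUBRIC:\n" + "\n".join(rubric_snippets) + "\n\n"
        "EXEMPLAR RESPONSES:\n" + "\n---\n".join(exemplars) + "\n\n"
        "ESSAY:\n" + essay + "\n\n"
        'Respond with JSON only: {"scores": {criterion: band}, '
        '"justifications": {criterion: text}, "recommendations": [text]}'
    )

def validate_output(raw):
    """Reject a model response unless it parses as JSON and scores
    every rubric criterion."""
    data = json.loads(raw)
    missing = [c for c in RUBRIC_CRITERIA if c not in data.get("scores", {})]
    if missing:
        raise ValueError(f"missing criteria: {missing}")
    return data

# Simulated model reply, standing in for an actual LLM call.
sample = json.dumps({
    "scores": {c: 4 for c in RUBRIC_CRITERIA},
    "justifications": {c: "..." for c in RUBRIC_CRITERIA},
    "recommendations": ["Vary sentence openings."],
})
print(validate_output(sample)["scores"]["task_response"])
```

Validating the JSON against the criterion list is what makes each score traceable back to a specific rubric descriptor.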
To improve reliability, I tested different vectorization and retrieval strategies to determine how best to use the RAG pipeline to generate more accurate feedback.
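One simple way to compare such retrieval configurations is to score the same essay set under each and measure the gap from reference scores. The numbers below are made-up placeholders purely to show the comparison; they are not results from the project.

```python
def mean_absolute_error(predicted, reference):
    """Average absolute band-score gap between system output and reference scores."""
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)

# Hypothetical outputs from two retrieval configurations on five essays.
reference_scores = [4, 3, 5, 2, 4]
embedding_only = [5, 3, 4, 3, 4]   # retrieval by LLM embedding alone
dual_vector = [4, 3, 5, 3, 4]      # embedding + algorithmic feature vector

print(mean_absolute_error(embedding_only, reference_scores))  # prints 0.6
print(mean_absolute_error(dual_vector, reference_scores))     # prints 0.2
```

A lower error for one configuration indicates its retrieved context led the LLM to scores closer to the reference.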
This project demonstrates applied NLP system design, retrieval engineering, web hosting, prompt optimization, and structured LLM output validation in an educational assessment context. I also gained experience adjusting the design based on user feedback.