For my WPI IQP project, I traveled to Hangzhou, China, where I worked with a Chinese ed-tech company (comparable to Google Classroom) to collect data and build a prototype AI assistant that helps students improve their writing in preparation for the IELTS and TOEFL exams.
I was responsible for the writing portion of the prototype: I designed and implemented a Retrieval-Augmented Generation (RAG) system that evaluates TOEFL writing responses with rubric-aligned scoring and structured feedback. The goal was to move beyond generic LLM grading by grounding evaluations in official scoring criteria and curated exemplar essays pulled from a database of relevant examples.
The system architecture combines semantic embeddings with algorithmic feature vectors to support RAG-assisted LLM evaluation. I built a document ingestion pipeline that generates two performance vectors per essay. The first is produced by running an LLM performance analysis on the essay, embedding the result with an embedding model, and storing it in a vector database for similarity search. The second is a much shorter vector that categorizes the essay using error counters and other algorithmically derived features, providing a more replicable and consistent signal to complement the LLM-embedded version. When a user submits an essay, the system retrieves the most relevant rubric descriptors and exemplar responses according to both vectors, then passes the grounded context to the LLM for scoring and feedback generation.
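The dual-vector retrieval step can be sketched as follows. This is a minimal illustration, not the production code: the vectors are mocked stand-ins for real embedding and error-counter output, and the names (`EssayRecord`, `retrieve`, the 0.7 blend weight) are hypothetical choices for the example.

```python
from dataclasses import dataclass
import math

@dataclass
class EssayRecord:
    """One ingested essay with its two performance vectors (illustrative schema)."""
    essay_id: str
    semantic_vec: list[float]  # LLM analysis -> embedding model (mocked here)
    rubric_vec: list[float]    # short algorithmic vector: error counts, length stats

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: EssayRecord, corpus: list[EssayRecord], k: int = 2,
             semantic_weight: float = 0.7) -> list[str]:
    """Rank exemplars by a weighted blend of both vector similarities."""
    scored = []
    for rec in corpus:
        score = (semantic_weight * cosine(query.semantic_vec, rec.semantic_vec)
                 + (1 - semantic_weight) * cosine(query.rubric_vec, rec.rubric_vec))
        scored.append((score, rec.essay_id))
    scored.sort(reverse=True)
    return [essay_id for _, essay_id in scored[:k]]

# Mock corpus: the vectors stand in for real pipeline output.
corpus = [
    EssayRecord("band-7-exemplar", [0.9, 0.1, 0.3], [2.0, 1.0]),
    EssayRecord("band-5-exemplar", [0.2, 0.8, 0.5], [8.0, 4.0]),
    EssayRecord("band-6-exemplar", [0.7, 0.3, 0.4], [4.0, 2.0]),
]
query = EssayRecord("student-essay", [0.85, 0.15, 0.35], [3.0, 1.5])
print(retrieve(query, corpus))  # -> ['band-7-exemplar', 'band-6-exemplar']
```

Blending the two scores lets the stable algorithmic vector temper the noisier LLM-derived embedding when the two disagree.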
I implemented prompt-engineering strategies to enforce rubric-aligned grading across four TOEFL criteria: task response, coherence and cohesion, lexical resource, and grammatical range and accuracy. The model outputs structured JSON containing predicted band scores, justifications tied to the retrieved rubric text, and targeted improvement recommendations, which are then shown to the user. This structure reduces hallucination and keeps the evaluation logic traceable.
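One way to enforce that structure is to validate the model's JSON reply against the rubric schema before showing anything to the user. The sketch below assumes a per-criterion layout of score, justification, and recommendations; the exact field names are illustrative, not the project's actual schema.

```python
import json

# The four rubric criteria the prompt instructs the model to cover.
CRITERIA = ["task_response", "coherence_and_cohesion",
            "lexical_resource", "grammatical_range_and_accuracy"]

def validate_evaluation(raw: str) -> dict:
    """Parse the model's JSON reply and check that every rubric criterion
    has a numeric score, a justification, and at least one recommendation."""
    data = json.loads(raw)
    for criterion in CRITERIA:
        entry = data.get(criterion)
        if not isinstance(entry, dict):
            raise ValueError(f"missing criterion: {criterion}")
        if not isinstance(entry.get("score"), (int, float)):
            raise ValueError(f"non-numeric score for {criterion}")
        if not entry.get("justification"):
            raise ValueError(f"empty justification for {criterion}")
        if not entry.get("recommendations"):
            raise ValueError(f"no recommendations for {criterion}")
    return data

# Mocked model reply; in the real system this string comes from the LLM call.
reply = json.dumps({
    c: {"score": 6.0,
        "justification": "Matches the band-6 descriptor retrieved from the rubric.",
        "recommendations": ["Vary linking phrases between paragraphs."]}
    for c in CRITERIA
})
result = validate_evaluation(reply)
print(result["task_response"]["score"])  # -> 6.0
```

A reply that fails validation can be retried or flagged rather than surfaced to the student, which is what makes the evaluation logic auditable.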
To improve reliability, I tested different vectorization and retrieval strategies against the database to determine which RAG configuration produced the most accurate feedback.
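A simple way to compare retrieval strategies is to check, over a labeled set of essays, how often the top retrieved exemplar shares the query essay's band score. The harness below is a hedged sketch with mocked data; `retrieval_accuracy` and the strategy tables are hypothetical, not the project's actual evaluation code.

```python
def retrieval_accuracy(query_ids, retrieve_fn, bands):
    """Fraction of queries whose top retrieved exemplar carries the same
    band label as the query (bands: essay_id -> ground-truth band)."""
    hits = 0
    for query_id in query_ids:
        retrieved = retrieve_fn(query_id)
        if retrieved and bands[retrieved[0]] == bands[query_id]:
            hits += 1
    return hits / len(query_ids)

# Mock labels and two mock strategies' top-1 retrieval results.
bands = {"q1": 6, "q2": 7, "q3": 5, "e6": 6, "e7": 7, "e5": 5}
semantic_only = {"q1": ["e6"], "q2": ["e5"], "q3": ["e5"]}
blended       = {"q1": ["e6"], "q2": ["e7"], "q3": ["e5"]}

print(retrieval_accuracy(["q1", "q2", "q3"], semantic_only.get, bands))  # 2/3
print(retrieval_accuracy(["q1", "q2", "q3"], blended.get, bands))        # 1.0
```

Running the same harness with each candidate vectorization swapped in gives a direct, repeatable comparison of configurations.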
This project demonstrates applied NLP system design, retrieval engineering, prompt optimization, and structured LLM output validation in an educational assessment context.