As an AI Engineer at TestZeus, you will take ownership of designing, building, and maintaining production-grade LLM-based systems that power our core testing platform. You’ll work closely with cross-functional teams (backend, product, design) to ship features quickly, test with real users, and iterate in production. You’ll also play a key role in defining evaluation metrics, optimizing prompts and retrieval flows, and helping the team stay current with the latest research. This role is ideal for someone who has already delivered LLM applications end-to-end and wants to deepen their expertise in retrieval-augmented generation (RAG), prompt workflows, and agent-driven evaluation.
Key Responsibilities
Design & Build LLM Workflows:
Create systems that score freeform answers, generate contextual feedback, and assist users in real time.
Develop prompt templates and chaining strategies that improve relevance, reduce token usage, and mitigate hallucinations.
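To give a flavor of this kind of work (the template names, stub `call_llm` function, and rubric are purely illustrative, not TestZeus code), a minimal two-step prompt chain that summarizes a freeform answer before scoring it, trading one extra call for fewer tokens in the scoring step, might look like:

```python
# Minimal sketch of a two-step prompt chain: first summarize the user's
# freeform answer, then score the summary against a rubric. call_llm is a
# stub standing in for any provider client (OpenAI, Claude, self-hosted).

SUMMARIZE_TMPL = (
    "Summarize the candidate answer below in one sentence.\n"
    "Answer: {answer}"
)
SCORE_TMPL = (
    "Given this rubric: {rubric}\n"
    "Score the following summary from 1-5 and explain briefly.\n"
    "Summary: {summary}"
)

def call_llm(prompt: str) -> str:
    """Stub LLM call; replace with a real provider client."""
    return f"[model output for: {prompt[:40]}...]"

def score_answer(answer: str, rubric: str) -> str:
    # Step 1: compress the freeform answer to cut token usage in step 2.
    summary = call_llm(SUMMARIZE_TMPL.format(answer=answer))
    # Step 2: score the compact summary against the rubric.
    return call_llm(SCORE_TMPL.format(rubric=rubric, summary=summary))

print(score_answer("Binary search halves the range each step.", "correctness"))
```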
RAG Pipeline Implementation & Optimization:
Build and tune retrieval-augmented generation pipelines that fetch dynamic context from vector stores (e.g., Pinecone, Weaviate).
Ensure low-latency, high-accuracy retrieval combined with LLM-driven generation to personalize experiences (e.g., mock interviews, code reviews).
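As a sketch of the retrieval half of such a pipeline (the `embed` function here is a deliberately crude bag-of-words stand-in; a real system would use a learned embedding model and a vector store like Pinecone or Weaviate), top-k selection by cosine similarity reduces to:

```python
# Toy retrieval step of a RAG pipeline: rank stored chunks by cosine
# similarity to a query embedding, then keep the top k as context.
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Crude bag-of-words 'embedding'; a real pipeline uses a model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

docs = [
    "mock interviews cover system design questions",
    "code reviews check style and correctness",
    "the cafeteria opens at nine",
]
print(retrieve("system design interview", docs, k=1))
# → ['mock interviews cover system design questions']
```

The retrieved chunks would then be interpolated into the generation prompt, grounding the model's answer in fetched context rather than parametric memory alone.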
LLM Evaluation & Analysis:
Define and implement evaluation frameworks covering accuracy, consistency, bias, and interpretability for model outputs.
Automate evaluation pipelines that monitor LLM performance over time and flag failure modes.
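One minimal shape such an automated evaluation loop can take (the checks, phrases, and thresholds below are illustrative placeholders, not a production framework) is: run a batch of prompt/output pairs through simple predicates, aggregate a pass rate, and flag failures for review:

```python
# Sketch of an automated evaluation loop: apply simple checks to a batch
# of (prompt, output) pairs, aggregate a pass rate, and flag failure modes.

FAILURE_PHRASES = {"i cannot", "as an ai"}

def check_output(prompt: str, output: str) -> dict:
    return {
        "prompt": prompt,
        "nonempty": bool(output.strip()),
        "no_refusal": not any(p in output.lower() for p in FAILURE_PHRASES),
        "on_length": len(output.split()) <= 100,
    }

def evaluate(batch: list[tuple[str, str]]) -> dict:
    results = [check_output(p, o) for p, o in batch]
    flagged = [r["prompt"] for r in results
               if not (r["nonempty"] and r["no_refusal"] and r["on_length"])]
    pass_rate = 1 - len(flagged) / len(results)
    return {"pass_rate": pass_rate, "flagged": flagged}

report = evaluate([
    ("q1", "Paris is the capital of France."),
    ("q2", "As an AI, I cannot answer that."),
])
print(report)  # flags q2 with a 0.5 pass rate
```

Tracking such a report over time (per model version, per prompt template) is what lets regressions surface automatically rather than via user complaints.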
Agent-Based System Development:
Build tool-augmented agents that can evaluate coding, system design, or reasoning questions, using frameworks like LangChain, AutoGen, or LlamaIndex.
Research and integrate new agent orchestration techniques to improve multi-step reasoning.
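At its core, the tool-augmented agent pattern these frameworks wrap is a dispatch loop: the model emits an action, the runtime executes the matching tool, and the observation is fed back until a final answer is produced. A minimal self-contained sketch (with a stubbed `fake_model` policy in place of a real LLM call) looks like:

```python
# Minimal tool-dispatch loop illustrating the agent pattern. Frameworks
# like LangChain or AutoGen wrap this loop with real LLM calls, parsing,
# and error handling; here the model policy is a deterministic stub.

TOOLS = {
    "add": lambda args: str(int(args[0]) + int(args[1])),
    "upper": lambda args: args[0].upper(),
}

def fake_model(history: list[str]) -> str:
    """Stub policy: call `add` once, then finish with the observation."""
    if not any(h.startswith("observe:") for h in history):
        return "call:add:2:3"
    last_obs = [h for h in history if h.startswith("observe:")][-1]
    return "final:" + last_obs.split(":", 1)[1]

def run_agent(task: str, max_steps: int = 5) -> str:
    history = [f"task:{task}"]
    for _ in range(max_steps):
        action = fake_model(history)
        if action.startswith("final:"):
            return action.split(":", 1)[1]
        _, name, *args = action.split(":")
        # Execute the requested tool and feed the observation back.
        history.append("observe:" + TOOLS[name](args))
    return "gave up"

print(run_agent("what is 2+3?"))  # → 5
```

Multi-step reasoning improvements largely come from making this loop smarter: better action parsing, reflection on failed tool calls, and planning across steps.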
Cross-Functional Collaboration:
Partner with backend engineers (Go, FastAPI), frontend engineers (React), and product managers to iterate on features, gather user feedback, and refine in production.
Participate in agile ceremonies—standups, sprint planning, retrospectives—and provide regular status updates.
Stay Current & Innovate:
Review state-of-the-art papers, benchmarks, and open-source tools (e.g., retrieval research, prompt optimization techniques).
Prototype new ideas (e.g., advanced retrieval strategies, custom fine-tuning flows) and demonstrate their feasibility to the team.
Required Skills & Qualifications
Have Real-World LLM Production Experience: You’ve built and deployed LLM-powered applications (beyond toy projects) that solve concrete business problems.
Are Proficient in Python & LLM Frameworks: Comfortable writing Python code to integrate with OpenAI, Claude, or self-hosted models; familiar with LangChain, LlamaIndex, or similar libraries.
Understand LLM Failure Modes: You know why models hallucinate, go off-topic, or repeat themselves, and you can engineer around these issues using retrieval, prompt chaining, or evaluation loops.
Think Like a Product Engineer: You ship experiments quickly, gather user feedback, and iterate fast—always focused on delivering measurable value and improving user experience.
Are Passionate About Advanced LLM Features: You’re excited to build functionality that goes beyond chat—scoring, ranking, summarization, bias detection, and automated feedback loops.
LLM & Prompt Engineering
At least 2 years of hands-on experience designing and deploying prompt workflows in production.
Familiarity with OpenAI API, Claude API, or open-weight LLMs (e.g., Hugging Face models).
Experience with LangChain, LlamaIndex, or equivalent frameworks for agent/chain construction.
Retrieval & RAG
Built at least one RAG pipeline that integrates vector search (Pinecone, Weaviate, or Elasticsearch) with LLM generation.
Understand embedding generation, similarity search, and dynamic context selection to reduce hallucinations.
Evaluation Frameworks
Defined metrics for LLM output quality (accuracy, consistency, bias, interpretability) and automated evaluation pipelines.
Implemented unit/functional tests to monitor LLM failure modes and aggregate performance statistics.
Python Engineering
4–5 years of Python development experience, including building production services using FastAPI, Flask, or similar.
Strong knowledge of data preprocessing, ETL pipelines, and integration testing for AI systems.
Collaboration & Agile
Demonstrated ability to work collaboratively in cross-functional teams (backend, product, UX) within an Agile/Scrum environment.
Clear communicator—able to translate research insights and technical trade-offs to non-technical stakeholders.
Degree in Computer Science, Engineering, or a related field, or equivalent professional experience.
Bonus Skills
Vector Databases & Semantic Search: Hands-on experience with Pinecone, Weaviate, or open-source vector search libraries.
Domain Experience: Exposure to AI in edtech, developer tooling, hiring/assessment platforms, or similar.
Fine-Tuning & Custom Models: Experience fine-tuning LLMs or building lightweight custom models.
Future Growth Potential: Interest in scaling into a Founding AI Lead role as TestZeus expands.
What We Offer
Real Impact: Own and shape the AI features that power our flagship product, influencing quality improvements for thousands of users.
Competitive Compensation: Market-aligned salary and meritocratic equity grants.
Cutting-Edge Environment: Continuous exposure to SOTA research, with opportunities to prototype and ship innovative AI features.
Collaborative Culture: Work alongside a small, dedicated team of engineers, researchers, and product leaders—everyone’s voice matters.
Learning & Growth: Regular “Tech Talks,” knowledge-share sessions, and support for attending conferences or workshops.
Application Process
To apply, please share the following details with us:
Your CV
Current and Expected CTC
Months of Experience in building AI agents
Links to Public Work (e.g., GitHub, Medium, personal website)
Complete the test at:
https://app.utkrusht.ai/assessment/bb24e9d4-a9d0-4e7b-bc30-0e6c8b56c3fd/interview
📬 Send everything to: hiring@testzeus.com
We’re excited to review your application!