Key Responsibilities
1. LLM Workflow & Prompt Engineering
Design, build, and maintain LLM workflows that score freeform answers, generate contextual feedback, and assist users in real time.
Develop and refine prompt templates, chains, and tools to improve relevance, reduce token usage, and mitigate hallucinations.
Implement multi-step prompt workflows for tasks like mock interviews, code reviews, and automated test guidance (a minimal sketch follows below).
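As a purely illustrative example of the kind of multi-step workflow this involves, the sketch below chains two calls: one to score a freeform answer and one to generate feedback conditioned on that score. It assumes the OpenAI Python SDK; the model name, prompts, and the call_llm helper are placeholders, not our production setup.

```python
# Illustrative only: a two-step prompt chain that first scores a freeform
# answer and then generates feedback conditioned on that score.
# Model name, prompts, and call_llm are assumptions for the sketch.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def call_llm(system: str, user: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
        temperature=0,
    )
    return resp.choices[0].message.content

def score_and_feedback(question: str, answer: str) -> dict:
    # Step 1: score the answer on a 1-5 scale.
    score = call_llm(
        "You grade interview answers. Reply with a single integer 1-5.",
        f"Question: {question}\nAnswer: {answer}",
    )
    # Step 2: generate feedback conditioned on the score from step 1.
    feedback = call_llm(
        "You give concise, actionable feedback on interview answers.",
        f"Question: {question}\nAnswer: {answer}\nScore: {score}\n"
        "Explain the score and suggest one improvement.",
    )
    return {"score": score, "feedback": feedback}
```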
2. Retrieval & RAG Pipelines
Build and optimise retrieval-augmented generation (RAG) pipelines using vector stores (e.g., Pinecone, Weaviate, Elasticsearch).
Implement embedding generation, similarity search, and dynamic context selection that reduce hallucinations and improve answer quality.
Keep the combined retrieval-and-generation path low-latency and accurate so user experiences stay highly personalised (a minimal pipeline is sketched below).
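For illustration only, here is a minimal sketch of a RAG flow of the kind described in this section. It assumes the OpenAI Python SDK for embeddings and generation, with a plain in-memory cosine-similarity search standing in for a managed vector store such as Pinecone or Weaviate; the documents, prompts, and model names are placeholders.

```python
# Minimal RAG sketch: embed documents, retrieve the most similar ones for a
# query, and pass them as context to the LLM. In production the in-memory
# search would be a vector store (Pinecone, Weaviate, Elasticsearch, ...).
import numpy as np
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

docs = [
    "pytest fixtures provide reusable test setup.",
    "Playwright can drive browsers for end-to-end tests.",
]
doc_vecs = embed(docs)

def retrieve(query: str, k: int = 2) -> list[str]:
    q = embed([query])[0]
    # Cosine similarity between the query and every document vector.
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    return [docs[i] for i in np.argsort(sims)[::-1][:k]]

def answer(query: str) -> str:
    context = "\n".join(retrieve(query))
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return resp.choices[0].message.content
```

In practice the in-memory search would be replaced by vector-store queries, typically with reranking and dynamic context selection layered on top.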
3. LLM Evaluation & Analytics
Define and implement evaluation frameworks for LLM outputs: accuracy, consistency, bias, interpretability, and robustness.
Build automated evaluation pipelines that monitor performance over time, detect regressions, and flag failure modes.
Instrument systems with metrics and logging to understand model behaviour in production and drive data-informed decisions (a simple evaluation harness is sketched below).
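To give a flavour of what an automated evaluation pipeline can look like, here is a minimal, dependency-free sketch that runs a scorer over a labelled set, aggregates accuracy, and flags a regression against a stored baseline. The eval cases, stubbed model_fn, and thresholds are assumptions for the sketch, not our actual harness.

```python
# Illustrative evaluation harness: run an LLM-backed scorer over a small
# labelled set, aggregate accuracy, and flag a regression against a baseline.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    prompt: str
    expected: str

def run_eval(model_fn: Callable[[str], str], cases: list[EvalCase]) -> dict:
    results = [model_fn(c.prompt).strip().lower() == c.expected.lower()
               for c in cases]
    return {
        "accuracy": sum(results) / len(cases),
        "n": len(cases),
        "failures": [c.prompt for c, ok in zip(cases, results) if not ok],
    }

def is_regression(report: dict, baseline_accuracy: float,
                  tolerance: float = 0.02) -> bool:
    # Flag a regression if accuracy drops more than `tolerance` below baseline.
    return report["accuracy"] < baseline_accuracy - tolerance

if __name__ == "__main__":
    cases = [EvalCase("Is 2 + 2 equal to 4? Answer yes or no.", "yes")]
    report = run_eval(lambda p: "yes", cases)  # stub model_fn for the sketch
    print(report, "regression:", is_regression(report, baseline_accuracy=0.95))
```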
4. Agent-Based System Development
Build tool-augmented agents capable of evaluating coding, system design, or reasoning questions using frameworks like LangChain, AutoGen, LlamaIndex, or similar.
Design and experiment with agent orchestration patterns (multi-agent workflows, planners, evaluators) to improve multi-step reasoning and reliability.
Integrate agents with external tools and APIs (code execution, documentation search, test runners, etc.) to extend capabilities (see the loop sketched below).
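The sketch below illustrates the basic shape of a tool-augmented agent loop: a planner (an LLM call in practice, stubbed here) chooses a registered tool, the tool runs, and its observation feeds the next step until a final answer is produced. The tool names and the plan_step stub are hypothetical; in practice this orchestration would typically sit behind LangChain, AutoGen, or a similar framework.

```python
# Illustrative agent loop: plan -> act -> observe, repeated until the planner
# returns a final answer or the step limit is hit.
from typing import Callable

TOOLS: dict[str, Callable[[str], str]] = {
    "run_tests": lambda arg: f"ran tests for {arg}: 12 passed, 1 failed",
    "search_docs": lambda arg: f"top doc snippet for '{arg}'",
}

def plan_step(goal: str, history: list[str]) -> dict:
    # In a real agent this is an LLM call returning either a tool invocation
    # or a final answer; stubbed here so the sketch runs on its own.
    if not history:
        return {"tool": "run_tests", "arg": goal}
    return {"final": f"Summary for '{goal}': {history[-1]}"}

def run_agent(goal: str, max_steps: int = 5) -> str:
    history: list[str] = []
    for _ in range(max_steps):
        step = plan_step(goal, history)
        if "final" in step:
            return step["final"]
        observation = TOOLS[step["tool"]](step["arg"])  # execute the chosen tool
        history.append(observation)
    return "Stopped: step limit reached."

print(run_agent("checkout flow"))
```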
5. Cross-Functional Collaboration & Product Impact
Partner with backend engineers (Go, FastAPI) and frontend engineers (React) to ship features end-to-end.
Work closely with product managers and designers to shape user journeys, gather feedback, and iterate quickly in production.
Participate in agile ceremonies (standups, sprint planning, retrospectives) and provide clear, consistent status updates.
6. Research, Experimentation & Innovation
Stay current with state-of-the-art LLM and retrieval research, benchmarks, and open-source tools.
Rapidly prototype new ideas (e.g., advanced retrieval strategies, custom fine-tuning flows, new evaluation methods) and demonstrate feasibility.
Contribute to internal best practices, playbooks, and reusable components for LLM and agent development.
Qualifications & Skills
Education
Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field, or equivalent professional experience.
Experience
4–5 years of professional software engineering experience, with a strong focus on Python.
At least 2 years of hands-on experience building and deploying LLM-powered applications in production (beyond toy projects).
Core Skills
LLM & Prompt Engineering
Proven experience designing prompt workflows, tools, and chains for production use.
Familiarity with OpenAI, Claude, and/or open-weight LLMs (e.g., Hugging Face ecosystem).
Experience with LangChain, LlamaIndex, AutoGen, or equivalent frameworks for chains/agents.
Retrieval & RAG
Built at least one end-to-end RAG pipeline integrating vector search (Pinecone, Weaviate, Elasticsearch, or similar) with LLMs.
Understanding of embeddings, similarity search, reranking, and dynamic context selection for reducing hallucinations.
Evaluation Frameworks
Experience defining LLM output quality metrics (accuracy, consistency, bias, interpretability).
Experience implementing automated evaluation or testing pipelines that monitor LLM behaviour and aggregate performance statistics.
Python Engineering
Strong Python skills and experience building production services using frameworks like FastAPI, Flask, or similar.
Good understanding of data preprocessing, ETL flows, and integration patterns for AI/ML components.
Familiarity with unit testing, integration testing, and observability for AI systems.
Collaboration & Product Mindset
Demonstrated ability to work in cross-functional teams (backend, product, UX) within an Agile/Scrum environment.
Ability to explain research concepts and technical trade-offs to non-technical stakeholders.
Product mindset: you care deeply about user experience, measurable impact, and iteration speed.
Preferred
Hands-on experience with vector databases & semantic search (Pinecone, Weaviate, pgvector, Elasticsearch, etc.).
Domain exposure to developer tooling, QA/testing platforms, edtech, or hiring/assessment products.
Experience with fine-tuning LLMs or building lightweight custom models.
Interest in growing into a Founding AI Lead role as TestZeus scales.
Why Join TestZeus?
Real Impact: Own and shape the AI systems that power our flagship testing platform, impacting quality for thousands of users.
Cutting-Edge Environment: Work with SOTA LLMs, agents, and retrieval systems, and get space to prototype bold ideas.
High Ownership: Small, talented team where your decisions and code directly shape the product and company direction.
Learning & Growth: Regular tech talks, deep dives into research, and support for conferences, workshops, and open-source work.
Bangalore – Work From Office: Collaborate closely with the team, iterate quickly, and build a strong product culture together.
Application Process
To apply, please share the following details with us:
Your CV
Your Current and Expected CTC
Months of experience building LLM applications and/or AI agents
Links to Public Work (e.g., GitHub, Medium, personal website, open-source contributions)
📬 Send everything to: hiring@testzeus.com
We’re excited to review your application!
// Start testing //



