Evals, Red Teaming and Test Generation for Agentic Systems

Modular, Lightweight, Dynamic and Async-first

Docs • Website • Community

Important

Giskard v3 is a fresh rewrite designed for dynamic, multi-turn testing of AI agents. It drops heavy dependencies for better efficiency and will bring a more powerful AI vulnerability scanner and enhanced RAG evaluation capabilities. For now, the vulnerability scanner and RAG evaluation still rely on Giskard v2, which remains available but is no longer actively maintained. Follow progress → Read the v3 Announcement · Roadmap

Install

pip install giskard

Requires Python 3.12+.

Telemetry: Libraries built on giskard-core (including giskard-checks) may send optional, aggregated usage analytics to help improve the product. No prompts, model outputs, or scenario text are included. See what is collected and how to opt out.

Giskard is an open-source Python library for testing and evaluating agentic systems. The v3 architecture is a modular set of focused packages — each carrying only the dependencies it needs — built from scratch to wrap anything: an LLM, a black-box agent, or a multi-step pipeline.

Status	Package	Description
✅ Beta	`giskard-checks`	Testing & evaluation — scenario API, built-in checks, LLM-as-judge
🚧 In progress	`giskard-scan`	Agent vulnerability scanner — red teaming, prompt injection, data leakage (successor of v2 Scan)
📋 Planned	`giskard-rag`	RAG evaluation & synthetic data generation (successor of v2 RAGET)

Giskard Checks — create and apply evals for testing agents

pip install giskard-checks

Giskard Checks is a lightweight library for creating evaluations (evals) that test LLM-based systems — from simple assertions to LLM-as-judge assessments. Unlike traditional unit tests, evals are designed for non-deterministic outputs where the same input can produce different valid responses.

Use Giskard Checks to:

Catch regressions — verify your system still behaves correctly after changes
Validate RAG quality — check if answers are grounded in retrieved context
Enforce safety rules — ensure outputs conform to your content policies
Evaluate multi-turn agents — test full conversations, not just single exchanges

Built-in evals include string matching, comparisons, regex, semantic similarity, and LLM-as-judge checks (Groundedness, Conformity, LLMJudge).
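
To illustrate why evals differ from exact-match unit tests, here is a minimal sketch in plain Python. It does not use the Giskard Checks API: the `exact_match_check` and `regex_check` helpers and the sample answers are hypothetical, chosen only to show how a pattern-based check accepts several valid phrasings of the same answer.

```python
import re


def exact_match_check(output: str, expected: str) -> bool:
    """Brittle: passes only when the output equals the expected string."""
    return output == expected


def regex_check(output: str, pattern: str) -> bool:
    """Tolerant: passes when the pattern appears anywhere in the output."""
    return re.search(pattern, output, flags=re.IGNORECASE) is not None


# Hypothetical answers a non-deterministic model might return for the same prompt.
answers = [
    "Paris",
    "The capital of France is Paris.",
    "It's Paris, of course!",
]

exact_results = [exact_match_check(a, "Paris") for a in answers]
regex_results = [regex_check(a, r"\bparis\b") for a in answers]

print(exact_results)
print(regex_results)
```

The exact-match check rejects two answers that a human would accept, which is the gap that regex, semantic-similarity, and LLM-as-judge evals are meant to close.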

Quickstart

from openai import OpenAI
from giskard.checks import Scenario, Groundedness

client = OpenAI()

def get_answer(inputs: str) -> str:
    response = client.chat.completions.create(
        model="gpt-5-mini",
        messages=[{"role": "user", "content": inputs}],
    )
    return response.choices[0].message.content

scenario = (
    Scenario("test_dynamic_output")
    .interact(
        inputs="What is the capital of France?",
        outputs=get_answer,
    )
    .check(
        Groundedness(
            name="answer is grounded",
            context="France is a country in Western Europe. Its capital is Paris.",
        )
    )
)

# Top-level await works in a notebook or an async REPL.
result = await scenario.run()
result.print_report()

The run() method is async. In a script, wrap it with asyncio.run(). See the full docs for Suites, LLMJudge, multi-turn scenarios, and more.
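
The `asyncio.run()` pattern can be sketched as follows. `StubScenario` is a hypothetical stand-in with an async `run()` method, used so the snippet runs without Giskard installed or an API key; in a real script you would await the `scenario` object built in the Quickstart instead.

```python
import asyncio


class StubScenario:
    """Hypothetical stand-in for an object exposing an async run() method."""

    async def run(self) -> dict:
        # Simulate awaiting a model call and a check.
        await asyncio.sleep(0)
        return {"passed": True}


async def main() -> dict:
    scenario = StubScenario()
    # Same shape as the Quickstart: await the coroutine returned by run().
    return await scenario.run()


# asyncio.run() creates an event loop, runs main() to completion, then closes the loop.
result = asyncio.run(main())
print(result)
```

Keeping the `await` inside a single `main()` coroutine means the script has one entry point into the event loop, which is the usual layout for async-first libraries.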

Looking for Giskard v2?

Giskard v2 included Scan (automatic vulnerability detection) and RAGET (RAG evaluation test set generation) for both ML models and LLM applications. These features are not yet available in v3; to use them, install v2:

pip install "giskard[llm]>2,<3"

Scan — automatically detect performance, bias & security issues

Wrap your model and run the scan:

import giskard
import pandas as pd

# Replace my_llm_chain with your actual LLM chain or model inference logic
def model_predict(df: pd.DataFrame):
    """The function takes a DataFrame and must return a list of outputs (one per row)."""
    return [my_llm_chain.run({"query": question}) for question in df["question"]]

giskard_model = giskard.Model(
    model=model_predict,
    model_type="text_generation",
    name="My LLM Application",
    description="A question answering assistant",
    feature_names=["question"],
)

scan_results = giskard.scan(giskard_model)
# display() is available in notebooks (IPython/Jupyter), where it renders the report inline
display(scan_results)

RAGET — generate evaluation datasets for RAG applications

Automatically generate questions, reference answers, and context from your knowledge base:

import pandas as pd
from giskard.rag import generate_testset, KnowledgeBase

# Load your knowledge base documents
df = pd.read_csv("path/to/your/knowledge_base.csv")
knowledge_base = KnowledgeBase.from_pandas(df, columns=["column_1", "column_2"])

testset = generate_testset(
    knowledge_base,
    num_questions=60,
    language='en',
    agent_description="A customer support chatbot for company X",
)

Full v2 docs

👋 Community

We welcome contributions from the AI community! Read this guide to get started, and join our thriving community on Discord.

Follow the progress and share feedback: v3 Announcement · Roadmap

🌟 Leave us a star: it helps others discover the project and keeps us motivated to build awesome open-source tools! 🌟

❤️ If you find our work useful, please consider sponsoring us on GitHub. With a monthly sponsorship, you can get a sponsor badge, display your company in this readme, and get your bug reports prioritized. We also offer one-time sponsorships if you want us to get involved in a consulting project, run a workshop, or give a talk at your company.

