Context
Some analysis workflows require a tightly bounded dataset. That boundary matters for reproducibility, traceability, and confidence in the resulting output.
In this case, the goal was straightforward: create a project scoped to a single folder and use that folder as the complete knowledge boundary for the analysis. The intent was not broad retrieval across a storage account. The intent was a constrained, project-level working set.
That distinction matters. Permission scope and project scope are not the same thing. A connector may have access to an entire account, while a workflow still depends on strict logical isolation to a narrower subset.
Setup
The environment contained multiple folders in the same cloud storage account. A project was configured with only one specific folder added as a knowledge source, and the project was set to use project-only knowledge.
The expected operating model was simple: retrieval should remain confined to the files contained within that explicitly linked folder. That folder defined the intended dataset.
Expected and observed behavior
Expected behavior
Retrieval remains strictly limited to files inside the explicitly linked project folder. Queries operate only against that bounded dataset, regardless of what else may exist in the underlying storage account.
Observed behavior
Queries surfaced files that were not present in the linked folder and were instead located elsewhere in the same cloud storage account. The workflow crossed the declared project boundary even though no broader dataset had been intentionally included.
Failure mode
This was not a permission problem; it was a boundary problem. The relevant failure was a scope violation relative to the declared project dataset.
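A violation of this kind can be checked mechanically whenever retrieval results expose their source paths. The sketch below is illustrative only: the function and parameter names are assumptions, not part of any specific connector API. It flags every retrieved file that falls outside the linked project folder.

```python
from pathlib import PurePosixPath

def violations(retrieved_paths, linked_folder):
    """Return retrieved paths that fall outside the linked project folder.

    `retrieved_paths` and `linked_folder` are hypothetical inputs; in a real
    workflow they would come from the connector's retrieval metadata.
    """
    root = PurePosixPath(linked_folder)
    out = []
    for p in retrieved_paths:
        path = PurePosixPath(p)
        # In scope only if the linked folder is the path itself or an ancestor.
        if root != path and root not in path.parents:
            out.append(p)
    return out
```

Run against the retrieval log of every query, this check should return an empty list; any non-empty result is direct evidence of the boundary problem described above.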
Failure classification
The core issue is best understood as a dataset isolation failure in a connector-based workflow.
The system was expected to honor a logical boundary defined by explicit project configuration. Instead, retrieval appears to have followed a wider account-level access path than the declared project dataset.
Put differently, this was a scope-control failure rather than an access-control one: the working corpus was no longer reliably constrained to the intended source boundary.
Why isolation matters
Reproducibility
A bounded dataset makes it possible to rerun the same workflow and understand why the output changed. If retrieval scope is unstable, reproducibility degrades immediately.
Traceability
A controlled dataset supports source attribution and auditability. If files can enter from outside the declared boundary, traceability becomes uncertain.
Contamination risk
When unrelated files are eligible for retrieval, analysis can be influenced by material that was never meant to be part of the task. That introduces contamination into both reasoning and output.
Trust in system behavior
A system does not become trustworthy merely because it returns useful results. It becomes trustworthy when its operating boundary is legible and dependable under normal use.
Implications
The broader takeaway is simple: connectors should not be treated as a guarantee of strict dataset isolation unless that behavior has been directly validated.
A connector can be operationally convenient while still being methodologically loose. For exploratory work, that may be acceptable. For controlled analysis, reference generation, validation tasks, or any workflow that depends on a known corpus, it is not.
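One direct way to validate isolation before relying on it is a canary probe: place a file containing a unique token somewhere outside the linked folder, then confirm that queries for that token return nothing. A minimal sketch, assuming a hypothetical `query` callable that wraps the project's retrieval interface (no such function is implied by any particular product):

```python
def check_isolation(query, canary_token):
    """Probe whether retrieval respects the declared project boundary.

    `query` is a hypothetical callable wrapping the project's retrieval
    interface; `canary_token` is a unique string placed only in a file
    OUTSIDE the linked folder before the probe is run.
    """
    hits = query(canary_token)
    # An empty result means the boundary held for this one probe.
    return len(hits) == 0
```

A single passing probe is not proof of isolation, only evidence; repeated probes with fresh tokens in different out-of-scope locations build confidence that the boundary is actually enforced.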
This is the practical distinction between access and control. Access answers what the system can potentially reach. Control answers what it is actually allowed to use for a given workflow. Reliable methodology depends on the second, not the first.
Method implication
When dataset boundaries matter, connector convenience should be treated with caution. If the task depends on strict source control, the safest approach is to work from an explicit local snapshot rather than a live connected corpus.
In this case, the mitigation was to switch to local file uploads so the analysis dataset could be fixed, inspectable, and intentionally bounded. That is slower than connector-based retrieval, but it restores the conditions required for trustworthy evaluation.
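One way to make a local snapshot fixed and inspectable is to record a content hash for every file at capture time. The sketch below assumes a plain local folder; the function name and manifest layout are illustrative, not taken from any particular tool.

```python
import hashlib
import json
from pathlib import Path

def snapshot_manifest(folder):
    """Map each file's relative path to its SHA-256 digest.

    Freezing the corpus this way makes the working set inspectable and
    lets a later rerun verify that nothing entered or changed.
    """
    root = Path(folder)
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*")) if p.is_file()
    }

def write_manifest(folder, out="manifest.json"):
    # Persist the manifest alongside the analysis for later comparison.
    Path(out).write_text(json.dumps(snapshot_manifest(folder), indent=2, sort_keys=True))
```

Comparing a fresh manifest against the stored one before each rerun turns "the dataset has not drifted" from an assumption into a checkable condition.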
The operational lesson is straightforward: connectors improve access, but they do not by themselves establish isolation.
Related principle
This note is an example of a broader working principle: systems should be evaluated not only by what they can do, but by how reliably they respect declared constraints.