Skip to content

Semantic File Systems Research

Brian Welcker, a Microsoft veteran who witnessed WinFS from inside, identified the key failure: “No two members of the team seemed to be able to answer the question ‘What is it?’ in a succinct, cohesive way.” Some said database in the OS, others said XML in the database, SQL-based file system, object/relational mapping layer, rich storage for Office apps, file system metadata indexer, .NET in the database, etc.

  • Attempted to build a relational database layer on top of NTFS
  • Bill Gates mandated SQL Server as the foundation, but “the Achilles’ heel of the relational database has always been modeling hierarchy. File systems are hierarchical. XML is strictly hierarchical.”
  • Consumed “many thousands of hours of efforts from really smart engineers” without shipping
  • Cairo was the predecessor project with the same ambitions
  • Became “a bit of a monster, sucking up Microsoft employees at an insatiable rate with no clear shipping date”
  • Eventually canceled as overambitious, though components shipped separately
  • The object-oriented file system component never shipped due to “performance concerns on hardware of the day”
  1. Search matters more than hierarchy — “Hierarchies alone don’t let you find things”
  2. Metadata is important, but doesn’t require librarians — “folksonomy” (ad-hoc tagging) can work
  3. RDBMSs were designed to model data, not content — Codd wasn’t thinking about Word documents
  4. People like the simplicity of the file system
  5. People are only willing to do content preparation for very high-value content

Gates called WinFS his “biggest product regret,” believing the core idea — “rich database as client/cloud store with schema” — was ahead of its time and will eventually resurface in cloud systems.


NEPOMUK was an EU-funded (€11.5 million) research project to create a “Social Semantic Desktop” using RDF (Resource Description Framework) to store semantic metadata across applications.

  • RDF proved “oversized for desktop systems”
  • Initial indexing consumed excessive RAM and CPU, often taking days on large datasets
  • Systems would hang during indexing
  • Poor performance led users to disable it, depriving developers of testing and feedback

KDE replaced NEPOMUK with Baloo in 2014. Key architectural shifts:

NEPOMUKBaloo
Central RDF databaseDecentralized, specialized databases per data type
Broad semantic desktop visionFocused on 3 specific use-cases
Heavyweight storageSQLite + Xapian (lightweight)
Complex ontologySimple data stores, search stores, relations

Baloo’s Three Use Cases (vs NEPOMUK’s many)

Section titled “Baloo’s Three Use Cases (vs NEPOMUK’s many)”
  1. Content-based file finding
  2. Storing simple metadata (tags/ratings)
  3. Managing relationships between data

Critical Insight: Rather than attempting comprehensive semantic desktop functionality, Baloo succeeded by dramatically narrowing scope and using purpose-built storage for each data type.


Spotlight is an index layer on top of the file system, not a replacement for it. Key aspects:

  • Background indexing via importer plug-ins triggered by kernel file notifications
  • Each file type has its own importer that extracts metadata
  • Metadata stored in .Spotlight-V100 folder per volume
  • Search queries hit the index, not the files themselves

Apple didn’t try to change how files are stored — they added a search/metadata layer that complements the existing file system. This is fundamentally different from WinFS’s approach of replacing/augmenting the file system itself.

Smart Folders are saved searches, not containers:

  • Items remain in original locations
  • Smart Folder shows matching results based on criteria
  • Updates automatically as files change
  • You can’t delete items from Smart Folders (they’re references)

This pattern succeeds because:

  1. No data migration required
  2. Files remain accessible via normal paths
  3. Organization is a view, not a storage location
  4. Multiple overlapping organizations possible
Integrated (WinFS, Cairo)Overlay (Spotlight, Baloo)
Replace or augment file systemAdd layer on top
All-or-nothing migrationWorks with existing files
Performance tied to file operationsIndex can be rebuilt
Single point of failureGraceful degradation

The lesson: Index layers that complement file systems succeed; attempts to replace file systems with databases fail.


AI file organization fundamentally differs from manual approaches:

  • Analyzes file content, not just filenames
  • Reads PDFs, extracts text, OCR for images, transcribes audio/video
  • No ongoing human decision-making required (vs tags which require discipline)

The Decision Fatigue Insight: “The average person makes about 100 file-related decisions daily, causing decision fatigue.” AI removes this burden.

Why AI Tagging Succeeds Where Manual Tagging Fails

Section titled “Why AI Tagging Succeeds Where Manual Tagging Fails”

Manual tagging (including manual “semantic” approaches):

  • Requires ongoing effort
  • Inconsistent over time (taxonomies drift)
  • Abandoned after a few months by most users

AI tagging:

  • Consistent regardless of time or user state
  • Understands relationships between files (e.g., files from same email batch)
  • “No manual work after setup”

Modern NLP extracts:

  • People (names)
  • Organizations
  • Locations
  • Dates and times
  • Monetary values
  • Products, Events
  • Domain-specific terms

Tools: spaCy, Stanza, Haystack’s NamedEntityExtractor

Each extracted entity includes: type, character span, confidence score.

Modern approaches can automatically generate extraction schemas:

  • LlamaExtract: Generate schema from a prompt or example document
  • PARSE: Autonomous schema refinement with up to 64.7% accuracy improvement
  • AutoSchemaKG: Process millions of documents with “95% semantic alignment with human-crafted schemas with zero manual intervention”

Key Insight: Schemas are becoming “dynamic specifications that LLMs can interpret and systematically improve based on document characteristics” rather than static contracts defined upfront.


Turn notes into queryable database while keeping them as readable Markdown files.

  1. Automatic fields (tags, list items)
  2. YAML frontmatter at top of file
  3. Inline fields: [key:: value] syntax
  • Only indexed data is queryable — regular paragraph text is not
  • This is a feature, not a bug: keeps queries fast and predictable
  • Scales to “hundreds of thousands of annotated notes without issue”

“Dataview is meant for displaying and calculating data. It is not meant to edit your notes/metadata and will always leave them untouched.”

This separation of concerns (query layer ≠ storage layer) mirrors Spotlight’s success pattern.

Supertags define what something is, not just what it’s about. A node tagged #Task is a task, with automatic fields (Due Date, Status, etc).

When you apply a supertag, it populates default fields relevant to that object type. This provides structured data without requiring upfront schema definition.

Every bullet point is a discrete node in a knowledge graph, but the interface remains a familiar outliner.

  • Notion: Real-time collaboration, web access, but requires internet, privacy concerns
  • Local-first: 100% private, works offline, no file size limits, instant access

You can recreate Notion’s organizational patterns (databases, tags, properties) using local files — gaining organization benefits while keeping local-first advantages.


  1. Replacing the file system — Too ambitious, no migration path, all-or-nothing
  2. Central databases for everything — Performance bottleneck, complexity
  3. Complex ontologies — Over-engineered, poor adoption
  4. Manual tagging at scale — Users abandon after months
  5. Fuzzy vision — “What is it?” must have a clear answer
  1. Index/search layer on top of existing storage — Spotlight, Baloo
  2. Purpose-built storage per data type — Not one-size-fits-all
  3. Smart folders (saved searches) — Organization as view, not location
  4. Inline metadata in documents — Dataview, Tana supertags
  5. AI-powered extraction — Removes human effort from tagging
  6. Focused scope — Baloo’s 3 use-cases vs NEPOMUK’s broad vision
ComponentRole
Files remain filesStorage unchanged
Lightweight index layerFast search, relationships
AI extractionAutomatic entity/metadata discovery
Query languageDynamic views over indexed data
Smart foldersMultiple overlapping organizations

Based on this research, several architectural principles emerge:

  1. Don’t build a “semantic file system” — Build an index/query layer over existing files
  2. Purpose-built storage per entity type — Like Baloo, not one RDF store
  3. AI extracts entities automatically — Don’t rely on manual tagging
  4. Files → Entities mapping is a view — Same file can map to multiple entities
  5. Clear, narrow scope — Define exactly what the entity layer does (and doesn’t)
  6. Graceful degradation — System works even if indexing fails