Skip to content

Google Structured Data & Knowledge Graph Research

Google is moving from keyword-based search to entity-based understanding. They said “things, not strings” back in 2012. This has massive implications:

  1. Knowledge Graph — Google’s database of 500+ billion facts about entities
  2. Schema.org — Google co-created the vocabulary standard
  3. Structured Data — JSON-LD, microdata that websites use to describe entities
  4. Entity-first SEO — The future of search is entity recognition

If Google is betting on entities, we should understand how they model them.


  • When did Google launch Knowledge Graph? (May 16, 2012)
  • How many entities does it contain? (5 billion entities, 500 billion facts as of 2020)
  • What entity types exist in Knowledge Graph?
  • How do they resolve entity identity? (Machine IDs / MIDs)
  • How do they handle relationships? (Graph structure with edges as relationships)
  • Who built it? Key engineers/researchers?
  • Google’s role in creating Schema.org (co-founded with Microsoft, Yahoo, Yandex in 2011)
  • What types does Schema.org define? (827 Types, 1,528 Properties)
  • How does the type hierarchy work? (Multiple inheritance from Thing)
  • Most important types for search (Person, Organization, Product, etc.)
  • How has it evolved? (Community-driven via W3C Community Group)
  • JSON-LD format deep dive
  • Required vs recommended properties
  • How Google uses structured data for rich results
  • Common patterns: Article, LocalBusiness, Product, Event, FAQ
  • Validation: Rich Results Test
  • Recent announcements about entity-first search
  • How AI/LLMs are changing entity recognition (Gemini integration)
  • Implications for SEO
  • What new entity types are emerging?
  • How does Google use Wikidata?
  • Knowledge Graph vs Wikidata relationship
  • Entity reconciliation across sources
  • Who led Knowledge Graph development?
  • Key researchers at Google on entity recognition
  • Schema.org founders/maintainers

PersonRoleOrganizationEra
Amit SinghalSVP Engineering, SearchGoogle2000-2016
John GiannandreaDirector of Engineering (KG), later SVP SearchGoogle → Apple2010-2018
R.V. GuhaGoogle Fellow, Schema.org founder/chairGoogle → OpenAI → Microsoft2005-present
Dan BrickleySchema.org daily operations, W3C Community Group chairGoogle2011-present
Steve MacbethMicrosoft Schema.org co-founderMicrosoft2009-2020
Peter MikaYahoo Schema.org lead, semantic searchYahoo2011-2016
Danny HillisMetaweb/Freebase co-founderMetaweb2005-2010
Robert CookMetaweb/Freebase co-founderMetaweb2005-2010
Jack MenzelProduct Management, Google SearchGoogle2010s
Johanna WrightProduct Manager, Google SearchGoogle2010s

Full career arc:

  • 1987-1994: Co-leader of Cyc Project at MCC. Designed CycL (representation language), upper ontological layers, NLU components
  • 1995-1997: Apple Computer’s Advanced Technology Group. Created Meta Content Framework (MCF), FlyThru visualization
  • 1997-1999: Principal Engineer at Netscape. Created RSS and RDF (with W3C/Tim Bray), helped establish Open Directory Project
  • 1999-2000: Co-founded Epinions (Web of Trust concept)
  • 2000-2002: Founded Alpiri, created TAP (semantic web application)
  • 2002-2005: IBM Almaden Theory group
  • 2005-2024: Google Fellow. Started Google Custom Search, Schema.org, and Data Commons
  • Aug 2024: Moved to OpenAI as Technical Advisor to CEO
  • May 2025: Joined Microsoft as Technical Fellow working on NLWeb

John Giannandrea — From Freebase to Knowledge Graph

Section titled “John Giannandrea — From Freebase to Knowledge Graph”
  • CTO of Netscape (related browsing technology)
  • Co-founded Metaweb, helped build Freebase
  • Joined Google in 2010 when Google acquired Metaweb
  • Led the creation of Google Knowledge Graph (launched May 2012)
  • Appointed Google search chief in 2016, replacing Amit Singhal
  • Left Google in 2018 to join Apple as SVP of Machine Learning and AI Strategy
  • Announced retirement December 2025

Launch & Scale:

  • Launched May 16, 2012 with 500 million entities, 3.5 billion facts
  • By 2020: 5 billion entities, 500 billion facts
  • Tripled in size within 7 months of launch (570M entities, 18B facts)

Philosophy:

“Things, not strings” — Google’s fundamental shift from keyword matching to entity understanding

Data Sources:

  • Freebase (acquired 2010 from Metaweb)
  • Wikipedia
  • CIA World Factbook
  • Google Books
  • Online event listings
  • Web crawl data
  • Commercial datasets (undisclosed)
  • Wikidata (after Freebase sunset)

Primary Entity Categories:

  • People — individuals, historical figures, fictional characters
  • Places — locations, landmarks, cities, geographical features
  • Things — products, objects, concepts
  • Organizations — companies, institutions, groups
  • Events — occurrences at specific times/locations
  • Creative Works — books, movies, music, articles

Entity Identity Resolution:

  • Uses Machine IDs (MIDs) — globally unique identifiers
  • Format examples: /m/0dl567 (Taylor Swift in original KG), c-024dcv3mk (Cloud KG)
  • MID namespaces: c (Cloud KG), r (reconciliation candidate), e (reconciled master)
  • Entity Reconciliation API handles deduplication across sources
  • Creates master entity with new MID when merging

Graph Structure:

  • Non-hierarchical graph: nodes (topics/entities) connected by edges (relationships)
  • Each entity can have multiple types assigned
  • Relationships are typed and can carry attributes

Current Scale (as of 2024):

  • 827 Types (classes)
  • 1,528 Properties
  • 14 Datatypes
  • 94 Enumerations with 522 enumeration members
  • Used on 45+ million web domains
  • Over 450 billion Schema.org objects marked up

Founding (2011):

  • Co-created by Google, Microsoft (Bing), Yahoo, and Yandex
  • Key founders: R.V. Guha (Google), Steve Macbeth (Microsoft), Peter Mika (Yahoo), Alex Shubin (Yandex)
  • Based on RDF Schema (derived from CycL)
  • Pragmatic design over academic purity

Hierarchy Design:

  • Thing is the root type — all other types inherit from it
  • Multiple inheritance — types can have multiple parent types
  • Properties can have multiple domains and ranges
  • Avoids creating artificial types

Core Type Branches (direct children of Thing):

Thing
├── Action (performed by agent on object)
│ ├── AchieveAction, AssessAction, ConsumeAction
│ ├── CreateAction, InteractAction, MoveAction
│ ├── OrganizeAction, PlayAction, SearchAction
│ ├── TradeAction, TransferAction, UpdateAction
│ └── ... (100+ action subtypes)
├── BioChemEntity
│ └── Gene, Protein, MolecularEntity, ChemicalSubstance
├── CreativeWork
│ ├── Article → NewsArticle, BlogPosting, ScholarlyArticle
│ ├── Book → Audiobook
│ ├── Dataset, SoftwareApplication, WebPage
│ ├── Movie, MusicRecording, Photograph
│ └── ... (100+ subtypes)
├── Event
│ ├── BusinessEvent, MusicEvent, SportsEvent
│ ├── ConferenceEvent, Festival, Hackathon
│ └── ... (30+ event types)
├── Intangible (utility class)
│ ├── Brand, Language, Occupation, Service
│ ├── Offer, Order, Invoice, Reservation
│ ├── Rating, Role, Schedule
│ └── ... (200+ intangible types)
├── MedicalEntity
│ └── AnatomicalStructure, Drug, MedicalCondition...
├── Organization
│ ├── Corporation, EducationalOrganization
│ ├── GovernmentOrganization, LocalBusiness
│ ├── MedicalOrganization, PerformingGroup
│ └── ... (150+ org subtypes)
├── Person (alive, dead, undead, or fictional)
├── Place
│ ├── Accommodation, AdministrativeArea, CivicStructure
│ ├── Landform, LocalBusiness (also under Organization)
│ └── ... (100+ place types)
├── Product
│ ├── IndividualProduct, ProductGroup, Vehicle
│ └── Drug, DietarySupplement
└── Taxon (biological taxonomy)

Key Property Categories:

TypeKey Properties
Personname, birthDate, birthPlace, affiliation, jobTitle, email, alumniOf, colleague, children
Organizationname, address, contactPoint, founder, employee, numberOfEmployees, areaServed
CreativeWorkname, author, datePublished, about, publisher, copyrightHolder
Eventname, startDate, endDate, location, organizer, performer, offers
Placename, address, geo (coordinates), containedInPlace, openingHours
Productname, brand, offers, sku, gtin, manufacturer, productID

Multi-Typed Entities (MTEs):

  • Single entity can be marked as multiple types simultaneously
  • Example: A Book that is also a Product (when it has an Offer)
  • Extends multiple inheritance at the instance level

Extensions:

  • health-lifesci.schema.org — Medical/healthcare vocabulary (moved from core in 2016)
  • pending section — 825 work-in-progress terms
  • auto extension — Vehicle-specific types

What is JSON-LD?

  • JavaScript Object Notation for Linked Data
  • Google’s recommended format (used in 83.2% of implementations)
  • Placed in <script type="application/ld+json"> tags
  • Can appear anywhere in HTML document

Basic Structure:

{
"@context": "https://schema.org",
"@type": "Person",
"name": "John Doe",
"jobTitle": "Software Engineer",
"email": "john@example.com"
}

Common Patterns:

Article/BlogPosting:

{
"@context": "https://schema.org",
"@type": "Article",
"headline": "Article Title",
"author": {
"@type": "Person",
"name": "Author Name"
},
"datePublished": "2024-01-15",
"publisher": {
"@type": "Organization",
"name": "Publisher Name",
"logo": {
"@type": "ImageObject",
"url": "https://example.com/logo.png"
}
}
}

LocalBusiness:

{
"@context": "https://schema.org",
"@type": "Restaurant",
"name": "Restaurant Name",
"address": {
"@type": "PostalAddress",
"streetAddress": "123 Main St",
"addressLocality": "City",
"addressRegion": "State"
},
"openingHours": "Mo-Sa 11:00-22:00",
"priceRange": "$$"
}

Product with Offer:

{
"@context": "https://schema.org",
"@type": "Product",
"name": "Product Name",
"offers": {
"@type": "Offer",
"price": "99.99",
"priceCurrency": "USD",
"availability": "https://schema.org/InStock"
}
}

FAQPage:

{
"@context": "https://schema.org",
"@type": "FAQPage",
"mainEntity": [{
"@type": "Question",
"name": "What is the question?",
"acceptedAnswer": {
"@type": "Answer",
"text": "The answer text."
}
}]
}

Validation Tools:

  • Rich Results Test — Google’s official validator
  • Schema Markup Validator — General Schema.org validation
  • Google Search Console — Rich results performance reporting

Format Comparison:

FormatRecommendationNotes
JSON-LDPreferredSeparate from HTML, easy to manage
MicrodataSupportedEmbedded in HTML elements
RDFaSupportedPart of broader RDF ecosystem

2012: “Things, Not Strings”

  • Knowledge Graph launch marked shift from keyword matching to entity understanding
  • Enabled disambiguation (Taj Mahal: monument vs. musician vs. casino)
  • Information panels appear alongside search results

2016: AI Transition

  • Amit Singhal (search traditionalist) replaced by John Giannandrea (AI/ML head)
  • Signaled Google’s move toward machine learning-driven search

2024: AI Overviews

  • May 2024: Introduced AI Overviews powered by custom Gemini model
  • Multi-step reasoning beyond simple entity retrieval
  • Synthesized answers, not just entity cards

2025: Gemini 3 Integration

  • November 2025: Gemini 3 brought directly into Search
  • “State-of-the-art reasoning, deep multimodal understanding, powerful agentic capabilities”
  • Enhanced “query fan-out” for discovering relevant content
  • Dynamic visual layouts and interactive tools
  • AI Mode for sophisticated research capabilities

Evolution Pattern:

  1. 2012: Entity recognition (what is this thing?)
  2. 2016: ML-enhanced entity understanding (context, relationships)
  3. 2024: Entity reasoning (synthesize knowledge about entities)
  4. 2025: Agentic entity interaction (take actions, research deeply)

Freebase History:

  • Created by Metaweb Technologies (founded 2005 by Danny Hillis, Robert Cook, John Giannandrea)
  • Collaborative knowledge base with 39+ million topics
  • Google acquired Metaweb in July 2010
  • Used MQL (Metaweb Query Language) via graphd triplestore
  • Had 23,000-40,000 types organized into domains

Shutdown & Migration:

  • December 2014: Google announced Freebase closure
  • Data transferred to Wikidata community
  • Initial import covered 10%+ of Freebase collection
  • Google released Primary Sources Tool for migration
  • Final data dump: 1.9 billion RDF triples

Entity Reconciliation Challenges:

  • Freebase used domains/types structure
  • Wikidata uses properties like “instance of” (P31) instead of formal types
  • 4.4 million relations mapped
  • Conflicts resolved by preferring Wikidata mappings (6,000 cases)

Wikidata Entity Model:

  • Q numbers — Entity identifiers (Q80 = Tim Berners-Lee)
  • P numbers — Property identifiers (P108 = employer)
  • L numbers — Lexeme identifiers (linguistics)
  • Statements: Subject (Q) — Predicate (P) — Object (Q or value)
  • Qualifiers add context to statements
  • Example: Tim Berners-Lee (Q80) — employer (P108) — CERN (Q42944)

Structure:

  • Independent project with its own steering group (chaired by R.V. Guha)
  • W3C Schema.org Community Group handles public discussion
  • Dan Brickley chairs Community Group and runs daily operations

Decision Process:

  1. Proposals discussed via GitHub and public mailing list
  2. Webmaster maintains staging site reflecting discussions
  3. Release candidates submitted to steering group
  4. 10 business day review period
  5. Official release every few weeks

Community Scale:

  • Open participation via W3C Community Group
  • Public mailing lists and GitHub issues
  • “Early Access Fixes” fast-track for simple changes

Academic Research on Entity Disambiguation

Section titled “Academic Research on Entity Disambiguation”

Key Approaches:

  1. Freebase-based Disambiguation (Google Research)

    • Used Freebase’s 22M entities (more than Wikipedia)
    • Leveraged naturally disambiguated aliases and rich taxonomy
    • Achieved 90% accuracy without labeled training data
  2. GEEK (Graphical Entity Extraction Kit)

    • Graph-based entity linking
    • Considers entity commonness, relatedness, contextual similarity
    • Integrates Knowledge Graph and Wikipedia APIs
  3. Plato (Google)

    • Probabilistic model for entity resolution
    • Handles noisy features
    • Scales to 10^7+ entities
    • Combines Wikipedia training with unlabeled text
  4. SMAPH

    • Entity recognition in search queries
    • Uses SVM classifiers for disambiguation
    • Handles query brevity and noisy language

  • 44% of websites use schema markup globally
  • 10+ million websites implement Schema.org
  • 83.2% use JSON-LD as format

Most Popular Schema Types (Top 1M sites):

TypeAdoption
Organization22.26% (222,550 sites)
Person14.37% (143,694 sites)
Product6.97% (69,747 sites)
Offer6.54% (65,396 sites)
PostalAddress6.2% (61,966 sites)
AggregateRating4.28% (42,752 sites)

SEO Impact:

  • Rich results appear in 33%+ of Google searches
  • 88% of featured snippets pull from schema-enhanced pages
  • Case studies show 25-82% increases in click-through rates

1. “Thing” as Universal Base

  • Every entity inherits from Thing
  • Thing has universal properties: name, description, url, image, identifier
  • This provides a consistent baseline for all entities

2. Multiple Inheritance is Practical

  • Schema.org uses multiple inheritance (type can have multiple parents)
  • More flexible than strict single-inheritance hierarchies
  • Example: LocalBusiness appears under both Organization and Place
  • Reflects real-world messiness

3. Multi-Typed Entities (MTEs)

  • Single instance can have multiple types
  • A Book can also be a Product (when being sold)
  • This is instance-level, not class-level
  • Implication: Our entities should support multiple type assignments

4. Properties vs Types

  • Schema.org has 1,528 properties vs 827 types
  • Properties carry more information than types
  • Relationships are expressed as properties, not separate edge types
  • Implication: Rich properties may matter more than type granularity

5. Pragmatic Over Pure

  • Properties accept text even when specific types expected
  • Multiple domains/ranges on properties
  • Avoiding “artificial types” for technical reasons
  • Implication: Usability > ontological purity

6. Core Types Are Few

  • Despite 827 types, the core everyday types are limited:
    • Person, Organization, Place, Event, CreativeWork, Product
  • Most sites use just Organization, Person, Product
  • Implication: Start with 6-10 core types, extend later

7. Action Types Exist

  • Schema.org has 100+ Action subtypes (CreateAction, UpdateAction, BuyAction…)
  • Actions are first-class entities, not just events
  • Implication: Consider modeling actions/activities as entities

8. Extensions Are Layered

  • Core vocabulary + pending + extensions (health, auto)
  • Domain-specific vocabulary separated from core
  • Implication: Design for extensibility from day one

1. Relationships Are Properties

  • No separate “edge” or “relationship” type
  • Relationships are properties on entities
  • author, creator, memberOf, worksFor — all properties

2. Key Relationship Properties:

PropertyDomainRangeNotes
authorCreativeWorkPerson, OrganizationWho created it
creatorCreativeWorkPerson, OrganizationBroader than author
publisherCreativeWorkOrganization, PersonWho published it
memberOfPerson, OrganizationOrganizationMembership
worksForPersonOrganizationEmployment
affiliationPersonOrganizationAny affiliation
alumniOfPersonEducationalOrganizationAlumni relationship
parent / childrenPersonPersonFamily relationships
spousePersonPersonMarriage
colleaguePersonPersonProfessional relationship
knowsPersonPersonKnows another person
isPartOfCreativeWorkCreativeWorkPart/whole
hasPartCreativeWorkCreativeWorkContains parts
subOrganizationOrganizationOrganizationOrg hierarchy
parentOrganizationOrganizationOrganizationParent org
locationEvent, OrganizationPlaceWhere something is
containedInPlacePlacePlaceGeographic containment
sameAsThingURLIdentity across sources

3. Inverse Properties

  • hasPartisPartOf
  • parentOrganizationsubOrganization
  • Explicitly defined inverses aid traversal

4. Role-Based Relationships

  • Schema.org has Role type for qualified relationships
  • OrganizationRole, PerformanceRole, EmployeeRole
  • Allows adding metadata to relationships (start date, end date)
  • Implication: Consider role/reification for relationship metadata

5. sameAs for Identity

  • Links to equivalent entities on other sites
  • Key for entity reconciliation
  • Points to authoritative sources (Wikipedia, Wikidata)

6. Relationship Cardinality

  • Properties are multi-value by default
  • Can have multiple authors, multiple locations
  • Implication: Design for multi-valued relationships

1. Machine IDs (MIDs)

  • Globally unique identifiers per entity
  • Different namespaces for different sources
  • Master entity created when merging

2. Reconciliation Approach:

  • Fuzzy text matching on names/aliases
  • Common relationships as signals
  • Entity types and attributes
  • Semantic clustering

3. Cross-Source Linking:

  • sameAs property for web identity
  • MID mappings between systems
  • Prefer established mappings, resolve conflicts

4. Multi-Source Truth:

  • No single source of truth
  • Entities assembled from multiple sources
  • Wikipedia, Wikidata, web crawl, commercial data

Implications for Our Knowledge Graph:

  • Need stable entity identifiers
  • Support for aliases and alternative names
  • Explicit cross-reference properties
  • Ability to merge entities from different sources
  • Confidence/source tracking per fact

Google Official:

Schema.org:

Wikidata:

Academic Papers:

Technical Deep Dives:

Industry Analysis:

People Profiles:

Videos:


  • Created research doc
  • Outlined research tasks
  • Next: Subagent deep-dives on Schema.org and Knowledge Graph
  • Extensive web research completed on all research areas
  • Knowledge Graph: Launch date, scale (5B entities, 500B facts), architecture, MID system, entity reconciliation
  • Schema.org: Complete type hierarchy (827 types, 1528 properties), founding story, governance
  • Key People: Documented 10 key figures including Guha, Giannandrea, Singhal, Brickley, Macbeth, Mika
  • JSON-LD: Patterns for Article, LocalBusiness, Product, FAQ, etc.
  • Entity Evolution: 2012 → 2016 (AI transition) → 2024 (Gemini) → 2025 (agentic)
  • Freebase/Wikidata: Migration history, entity reconciliation approaches
  • Adoption: 44% of sites, 83% use JSON-LD, Organization most popular type

Key Insights for Our Project:

  1. Start with ~6 core types (Person, Org, Place, Event, CreativeWork, Product)
  2. Use multiple inheritance and multi-typed entities
  3. Properties matter more than type granularity
  4. Stable identifiers (like MIDs) are critical
  5. Support sameAs for cross-source identity
  6. Design for extensibility from day one

Next Steps:

  • Phase 2: Synthesize findings into entity model recommendations
  • Compare with Wikidata research
  • Draft initial type hierarchy for our knowledge graph