An Experimental Observation of AI's Cultural Understanding Gap

During an experiment on generating historical scenes, I encountered a puzzling phenomenon: no matter how I optimized my prompts, the so-called “classical Chinese scenes” the AI produced consistently exhibited an unmistakable “modern period drama feel.” This discovery prompted a series of comparative tests and a review of the relevant literature to understand the nature of the problem.

Experimental Observation: The Limits of Prompt Engineering

In my initial experiment, I attempted to generate a scene of a secret meeting in a classical Chinese garden. The result, however, featured characters in pristine costumes resembling stage outfits, with impeccable modern makeup and lighting strikingly similar to contemporary film and television productions. I then refined the prompt with more targeted instructions: specifying “Ming or Qing dynasty scholarly attire,” telling the model to “avoid modern makeup,” invoking the “style of classical Chinese landscape painting,” and emphasizing “weathered textures” and “natural lighting.”

Yet, the outcome did not improve fundamentally.
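
For concreteness, the kind of refinement I attempted looked roughly like the sketch below. It assumes the Hugging Face diffusers library with Stable Diffusion XL; the prompt wording and negative prompt are illustrative examples of the adjustments described above, not a recipe that solved the problem.

```python
# A minimal sketch of the prompt refinements described above, assuming the
# Hugging Face diffusers StableDiffusionXLPipeline. The wording is illustrative;
# in my tests, refinements of this kind did not remove the "modern period drama" look.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

prompt = (
    "A secret meeting in a classical Chinese garden at dusk, "
    "Ming dynasty scholarly attire, weathered fabric textures, "
    "natural lighting, in the style of classical Chinese landscape painting"
)
negative_prompt = (
    "modern makeup, studio lighting, TV drama set, cosplay, "
    "pristine stage costume, film still"
)

image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=30,
    guidance_scale=7.0,
).images[0]
image.save("garden_meeting.png")
```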

Interestingly, when I attempted to generate Western historical scenes, such as “a philosophical discussion in an 18th-century French salon” or “a Victorian family gathering,” the AI seemed able to produce images with a noticeably stronger sense of historical immersion. Details of costume and interior styling appeared more “natural,” with less of the obvious “modern reenactment” feel. This discrepancy caught my attention: why does the AI seem to hit a peculiar “barrier” when it comes to non-Western historical and cultural contexts?

Analysis Through the Lens of Literature

Consulting relevant literature revealed that this issue is not an isolated case but is rooted in the structural limitations of current Text-to-Image (TTI) models.

1. Quantitative Evidence of Systemic Cultural Bias
The quantitative framework proposed in “On the Cultural Gap in Text‑to‑Image Generation” (2023) demonstrates a significant gap in quality and accuracy when mainstream diffusion models generate content from non-Western cultures. This “cultural gap” is particularly pronounced for East Asian historical content: models often reduce “classical Chinese” to a handful of highly stereotyped visual symbols (particular colors, decorative patterns) and fail to capture its internal diversity and historical evolution.
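
As a rough illustration only (this is not the framework from the paper), one way to probe such a gap is to score generated images against culture-specific reference descriptions with CLIP and compare the results across cultures. The model checkpoint, file names, and reference captions below are assumptions for the sketch.

```python
# Toy probe of a "cultural gap": compare CLIP alignment of generated images
# with culture-specific reference descriptions. This is NOT the metric from
# the cited paper, only a back-of-the-envelope illustration.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def alignment_score(image_path: str, reference_text: str) -> float:
    """CLIP image-text alignment score between one image and one caption."""
    inputs = processor(
        text=[reference_text],
        images=Image.open(image_path),
        return_tensors="pt",
        padding=True,
    )
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.logits_per_image.item()

# Hypothetical file names for images generated from matched prompts.
score_cn = alignment_score(
    "ming_garden.png", "a Ming dynasty scholar in a garden, period-accurate attire"
)
score_eu = alignment_score(
    "victorian_parlor.png", "a Victorian family gathering in a period-accurate parlor"
)
print(f"CLIP alignment, Chinese scene: {score_cn:.2f}, Western scene: {score_eu:.2f}")
```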

2. “Synthetic History” and Data Contamination
The paper “Synthetic History: Evaluating Visual Representations of the Past in Diffusion Models” (2024) speaks directly to my confusion. It finds that historically themed images generated by these models are re-syntheses of existing media representations of the past, rather than of the past itself. Since the vast majority of online visual material about Chinese history consists of period dramas, games, and influencer photography from recent decades, what the model has learned is a “synthetic history” repeatedly filtered through modern aesthetics. This explains why my prompt optimization was ineffective: the model’s knowledge base itself is contaminated by “studio photography style” and “film set aesthetics.”
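
One way to make this contamination concrete is to audit the captions of a web-scale image dataset for modern-media markers. The keyword lists below are my own guesses, not categories taken from the paper.

```python
# Rough audit of caption text for "synthetic history" contamination: count how
# often captions about Chinese historical imagery mention modern media contexts
# versus primary sources. Keyword lists are illustrative assumptions.
from collections import Counter
from typing import Iterable

MODERN_MEDIA_MARKERS = ("tv drama", "costume drama", "cosplay", "photoshoot",
                        "film still", "game art", "xianxia")
PRIMARY_SOURCE_MARKERS = ("song dynasty painting", "ming dynasty scroll",
                          "museum collection", "handscroll", "woodblock print")

def audit_captions(captions: Iterable[str]) -> Counter:
    """Tally captions by whether they look like modern media or primary sources."""
    tally = Counter()
    for caption in captions:
        text = caption.lower()
        if any(marker in text for marker in MODERN_MEDIA_MARKERS):
            tally["modern_media"] += 1
        elif any(marker in text for marker in PRIMARY_SOURCE_MARKERS):
            tally["primary_source"] += 1
        else:
            tally["other"] += 1
    return tally

# Usage with a hypothetical caption dump (e.g., filtered from a LAION-style index):
sample = ["hanfu cosplay photoshoot in a park",
          "Ming dynasty scroll, museum collection",
          "costume drama still, imperial palace set"]
print(audit_captions(sample))
```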

3. Structural Lack of Knowledge Representation
The review in “A Systematic Review of Cultural Bias in Text‑to‑Image (TTI) Models” (2025) points out that current models lack a structured understanding of cultural concepts. A model may know the term “Hanfu,” for instance, yet be unable to associate it with specific dynasties, social classes, or ceremonial occasions. When I requested “Ming dynasty scholarly attire,” the model only loosely matched it to common visual patterns for “historical costume + scholar,” unable to invoke knowledge of the specific cut, fabric, and manner of wearing a Ming dynasty lanshan. This stands in stark contrast to the structured, knowledge-graph-based approach to historical analysis advocated in “Knowledge Graph based Analysis and Exploration of Historical Theatre Photographs” (2020): current models rely on “pattern matching” rather than “knowledge reasoning.”
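
To make the contrast between “pattern matching” and “knowledge reasoning” concrete, here is a minimal sketch of how structured knowledge could expand a vague prompt before generation. The triples and attribute values are illustrative placeholders I wrote for this example, not a vetted historical database.

```python
# Minimal sketch: a hand-written knowledge fragment used to expand a vague
# prompt into historically grounded attributes before image generation.
# The facts below are illustrative placeholders, not a vetted source.
COSTUME_KG = {
    ("lanshan", "dynasty"): "Ming",
    ("lanshan", "wearer"): "scholar-official or examination candidate",
    ("lanshan", "cut"): "cross-collar robe with wide sleeves and a dark hem border",
    ("lanshan", "fabric"): "plain-woven silk or ramie, matte rather than glossy",
    ("lanshan", "headwear"): "four-directions flat cap (sifangjin)",
}

def expand_prompt(base_prompt: str, entity: str) -> str:
    """Append knowledge-graph attributes for `entity` to the base prompt."""
    facts = [
        f"{attr}: {value}"
        for (name, attr), value in COSTUME_KG.items()
        if name == entity
    ]
    return base_prompt + "; " + "; ".join(facts)

print(expand_prompt("A Ming dynasty scholar in a garden, natural light", "lanshan"))
```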

4. Why Do Western Scenarios “Seem” Better?
“Stable Bias: Evaluating Societal Representations in Diffusion Models” (2023) and “Deconstructing Bias: A Multifaceted Framework for Diagnosing Cultural and Compositional Inequities in Text‑to‑Image Generative Models” (2023) offer clues. Western historical and aesthetic systems (especially the artistic tradition since the Renaissance) occupy a central and relatively unified position in training data, so models have learned a comparatively coherent visual language running from classical oil painting to historical film. When a user (especially a Western user) requests a “Victorian era” scene, the user’s expectations align more closely with the representations the model has learned from Western art-historical data. This does not mean the model truly understands Western history; it only means that the distribution of its training data makes the output conform better to certain common “visual conventions.”

In contrast, for Chinese history, models face a triple dilemma:

  • Modern Contamination of Data Sources: High-quality, serious visual materials of ancient China (e.g., classical paintings, artifact catalogs) constitute an extremely low proportion of the training data.
  • Diversity of User Expectations: Different users’ imaginations of “Chinese classical” may stem from vastly different sources (serious historical dramas, xianxia fantasy dramas, Japanese anime, Western Orientalist paintings), leading to greater confusion in matching prompts with the model’s internal representations.
  • Loss in Cultural Translation: Even with English prompts such as “scholarly attire,” the model must pass through multiple layers of mapping (“English vocabulary → abstract concept → visual pattern”), each of which can introduce distortions rooted in training-data bias; a crude probe of this chain is sketched after this list.
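
As referenced in the last item, a crude way to probe where this chain loses specificity is to compare text embeddings of a vague English term, a historically specific term, and a modern-media term. The sketch below assumes the public CLIP text encoder (which may differ from the encoder of any given TTI model) and is a diagnostic intuition, not evidence in itself.

```python
# Crude probe of the "English word -> abstract concept -> visual pattern" chain:
# if the embedding of a vague term sits closer to a modern-media term than to
# the historically specific one, the mapping is likely to drift toward modern looks.
# Model choice and phrasing are assumptions for illustration.
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

phrases = [
    "scholarly attire",                   # vague English prompt term
    "Ming dynasty lanshan scholar robe",  # historically specific term
    "TV period drama stage costume",      # modern-media term
]
inputs = tokenizer(phrases, padding=True, return_tensors="pt")
with torch.no_grad():
    emb = model.get_text_features(**inputs)
emb = emb / emb.norm(dim=-1, keepdim=True)

sim = emb @ emb.T  # pairwise cosine similarities
print(f"'scholarly attire' vs historical term:   {sim[0, 1]:.3f}")
print(f"'scholarly attire' vs modern-media term: {sim[0, 2]:.3f}")
```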

The Chinese technical report “AI文生图模型测评:从基础美学到文化理解的多维度分析” (“Evaluating AI Text-to-Image Models: A Multidimensional Analysis from Basic Aesthetics to Cultural Understanding,” 2025) and the article “AI图像生成技术的蓬勃发展与语料、语境的作用” (“The Rapid Development of AI Image Generation and the Role of Corpora and Context,” 2025) likewise emphasize that the quality of generated content is fundamentally constrained by the training corpus. The lack of high-quality, contextualized Chinese historical and cultural corpora is a key bottleneck.

The Core Issue: The Missing “Cultural-Context Layer”

Synthesizing my experiment and the literature, I argue that the core problem lies in the current models’ lack of a flexibly accessible “cultural-historical context layer.”

The model is like a “stills photographer” with a massive archive of movie stills, but it lacks a “historical consultant.” It can collage elements that look “historical,” but it cannot understand the social rules, lived logic, and aesthetic spirit behind these elements. When I say “secret meeting,” it thinks of dramatic cinematography; when I say “sheer summer garments,” it presents the texture of modern chiffon rather than the drape of classical gauze. What it generates is always a modern visual commentary on history, not a visual hypothesis attempting to approach history itself.

Conclusion and Outlook

My experimental observations and the literature jointly point to one conclusion: the fundamental challenge current TTI models face when handling deep, non-Western historical and cultural concepts is a representational gap. Prompt optimization alone treats the symptoms; it cannot repair the defects at the model’s “knowledge” and “context” layers.

Future improvements may not lie in pursuing larger general-purpose models, but rather in:

  1. Developing Specialized Cultural Computing Models: Building fine-tuned models or plugins for specific historical and cultural domains, integrating structured knowledge (e.g., historical knowledge graphs, clothing systems).
  2. Innovating Data Curation Paradigms: Proactively incorporating more diverse, high-quality local historical visual materials and academic research to balance data distribution.
  3. Exploring New Interaction Paradigms: Allowing users to engage in “multi-turn conversational co-creation” with the model regarding historical background and character relationships, gradually building context rather than producing a finalized image in one shot (a minimal sketch of this idea follows the list).
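
As an example of the third direction, here is a minimal sketch of conversational context building: the user and system accumulate historical constraints over several turns, and only the final, consolidated brief is sent to the image model. The class, turn structure, and generate_image() hook are hypothetical, not an existing API.

```python
# Minimal sketch of "multi-turn conversational co-creation": accumulate
# historical context across turns, then render once from the merged brief.
# The turn structure and the generate_image() hook are hypothetical.
from dataclasses import dataclass, field

@dataclass
class SceneBrief:
    setting: str
    constraints: list[str] = field(default_factory=list)

    def add(self, constraint: str) -> None:
        """Record one agreed-upon historical constraint."""
        self.constraints.append(constraint)

    def to_prompt(self) -> str:
        return self.setting + "; " + "; ".join(self.constraints)

brief = SceneBrief("A secret meeting in a private Ming dynasty garden at dusk")
# Turn 1: who is meeting, and what does their status imply visually?
brief.add("two mid-ranking scholar-officials, plain dark lanshan robes, no stage makeup")
# Turn 2: what would the space actually look like and how is it lit?
brief.add("weathered whitewashed walls, lattice windows, a single paper lantern")
# Turn 3: what should be explicitly excluded?
brief.add("avoid glossy studio lighting and television drama set dressing")

final_prompt = brief.to_prompt()
# generate_image(final_prompt)  # hypothetical call into a TTI backend
print(final_prompt)
```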

Only when models learn to consider, during generation, not just “what it looks like” but also “what it might have been like in a specific historical context” can we hope to bypass this “modern period drama” filter and glimpse a more authentic historical light and shadow. This is not only a technical challenge but also a profound exploration of how AI can comprehend the complexity of human civilization.