
Semantic Layer for the People and by the People

My 3 [+1] jokers with templates for building a powerful analytical semantic layer

TL;DR:

My 3 straightforward Jokers and the one hidden Joker are:

  • Joker #1: Pattern-Driven Repository Structure 🗂️
  • Joker #2: Organized Code 👩🏻‍💻
  • Joker #3: (Non-) Embedded Documentation 📜
  • [🃏 Hidden Joker: Refinement Loop 🃏]
"Simple and Consistent. – This is how I would describe to someone the 2 most important dimensions to keep in mind when building a semantic layer." [Photo by Zuzana Ruttkay on Unsplash]

Semantics – the study of linguistic meaning

According to Wikipedia, semantics, the study of linguistic meaning, examines what meaning is.

Or, put differently, how words get their meaning and how the meaning of a complex expression depends on its parts [1].

Although the term semantics is straightforwardly explained, I honestly had to dwell for a while on the part "the meaning of a complex expression depends on its parts", because I wanted to reuse it to explain the semantic layer in analytics.

After rereading it, my explanation goes as follows:

Similar to semantics in the linguistic context, the semantic layer in analytics is about making data meaningful.

Much like words together form a specific meaning, which leads to an understanding of what was said, the raw data from different sources gets enriched and forms specific insights.

Much like meaning depends on the combination of the words in expressions, the outcome derived from the raw data depends on the modelling approach in the semantic layer.

And much like properly structured and well-formed expressions lead to ease of understanding, properly modelled data leads to quality data insights.


In summary, it is all about how one can create better, faster, and novel value from raw data, which leads to insights and an understanding of the business action(s) to take.


This is the core purpose of the semantic layer, and building one is filled with numerous challenges.

The two key ones that I always face when building a semantic layer are (1) simplicity and (2) consistency.

Or, better said, how to achieve them both.

  • When I try to simplify my semantic layer by placing focus on one area of my analytical development, this usually undermines my consistency.
  • And when I try to make my semantic layer consistent, it usually results in being overly complex.

So finding a balance, or more precisely, a semantic layer structure that is both simple and consistent, matters because it impacts the speed and sometimes the quality of delivering business value.

Hence, in this post, I’ll share my 3 "jokers" for balancing these challenges while creating a semantic layer that’s built to evolve with your business needs.

My explanations, with the provided visual templates, will mostly be based on how I built a semantic layer using Looker.

However, they will be generalized and can be applied to building a semantic layer in any other BI or modelling tool (like dbt Core or Dataform).

Let’s dive in.

Joker #1: Pattern-Driven Repository Structure 🗂️

The repository structure of the semantic layer is the blueprint that you need to focus on. As a foundation task, it really needs to be both simple and consistent.

Why, you ask?

  • Because the proper repository structure of the semantic layer needs to serve both technical (for data modelling and testing development) and business colleagues (for the self-service analytics part). Long story short, it needs to be understandable to both sides.
  • Because business requirements will "explode" over time, and so will your analytical development. If you don’t keep track of the semantic layer repository structure patterns, chances are high that your development will become redundant and lack proper development standards.

So, what are the key components that a good semantic layer repository should contain?

  • (1) Layers of folders or sub-folders – create clear layers in your repository hierarchy, separating business logic from technical logic. This way, both technical and business users can "be on the same page" when they talk about the same data model or the same method for comparing data insights (e.g., a specific period-over-period method or a specific forecasting model).
  • (2) Naming conventions – correlated to the first component, naming conventions play an important part in standardizing development and keeping your repository structure tidy. Properly defined naming conventions result in faster data modelling through consistent patterns and also ease navigation during troubleshooting.
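For illustration, here is a naming-convention sketch in the spirit of the structure used throughout this post (the exact prefixes and suffixes are my assumptions; adapt them to your own project):

```text
# Folder prefixes
area_<business_area>/    business-specific development, e.g. area_finances
base_<shared_logic>/     cross-shared core logic, e.g. base_pop_logic

# File suffixes (Looker's file-type conventions)
<name>.model.lkml        model files
<name>.view.lkml         view files
manifest.lkml            project manifest
```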

To provide more context to my theory above, I will explain visually my blueprint for the semantic layer repository structure.

.
└── Semantic Layer/
    ├── area_association_rules/
    │   ├── models/
    │   │   ├── association_rules.model.lkml
    │   │   └── frequent_items.model.lkml
    │   └── views_shop/
    │       ├── association_rules.view.lkml
    │       └── frequent_items.view.lkml
    ├── area_benchmarking/
    ├── area_demand_forecast/
    ├── area_financial_forecasting/
    ├── area_customer_intelligence/
    ├── ...
    ├── area_business_controlling/
    │   ├── models/
    │   ├── views/
    │   ├── views_derived/
    │   └── views_aggregated/
    ├── area_finances/
    ├── area_logistics/
    ├── area_performance_marketing/
    ├── ...
    ├── base_views/
    ├── base_date_granularity/
    ├── base_date_granularity_customised/
    ├── base_pop_logic/
    ├── base_pop_logic_customised/
    ├── ...
    ├── data_tests/
    ├── documentation/
    ├── locales/
    ├── manifest.lkml
    └── README.md
I usually structure my semantic layer repository by separating it into the area_* and the base_* folders:

(1) The area_* folders.

  • The use of these folders is to focus analytical development on organizing the code into specific business areas (e.g., area_finances or area_marketing) OR cross-shared business analytical cases (e.g., area_demand_forecast, area_benchmarking, etc.).
  • Each area_* folder encapsulates the logic and code relevant to that specific business department or use case, and it’s further split into models and views_* folders.
  • The models folder in the semantic layer contains the model files, while the views_* folders contain the view files or aggregated/derived views. | 👉🏼 Note: More info about these files can be found in the next section.

(2) The base_* folders.

  • The use of these folders is to focus on separating the core logic or functionality, like customized timeframe analysis methods, and cross-shared views that are used in multiple business areas and encapsulated within numerous model files of the semantic layer.
  • This organization ensures that the common business modelling logic is not scattered across multiple areas, and it’s centrally managed, resulting in consistency across the whole semantic layer.
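As a minimal sketch of what this central management looks like in practice (the model file names are hypothetical, while the include paths mirror the repository blueprint above), two area models pull in the same base_pop_logic views instead of duplicating them:

```lookml
# area_finances/models/finances.model.lkml (hypothetical model file)
include: "../views/*.view"          # area-specific views
include: "/base_pop_logic/*.view"   # shared period-over-period logic

# area_logistics/models/logistics.model.lkml (hypothetical model file)
include: "../views/*.view"
include: "/base_pop_logic/*.view"   # the same shared logic, managed centrally
```

Any correction to the period-over-period logic is then made once, in the base_pop_logic folder, and propagates to every model that includes it.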

Applying this template, I experienced faster data development (across the whole team) and easier troubleshooting, because I focused on creating a pattern-driven repository structure.

Bearing in mind that the repository folders themselves contain different files, keeping the code tidy is equally important.

This leads me to my second Joker: creating clean code files.

Joker #2: Organized Code 👩🏻‍💻

In most semantic layers, there are common types of code files that a data project repository contains.

As an example, in the Looker semantic layer, the 3 main types of code files are model, view, and manifest [2].

| 👉🏼 Note: There are more file types in Looker (https://cloud.google.com/looker/docs/lookml-project-files#view_files), but I will focus on the three main ones listed above.

Every single one of the listed files has its own code logic, and this code logic can be organized.

Let me provide you again with visual templates of how I organized my code within specific files in Looker:

(1) The model file – contains information about the views (tables) and how they should be joined together in an Explore [2].

###########
# MODEL_NAME & METADATA
# Description: This model reflects the [business context].
# Author: [Your Name] | Created on: [Date]
# Contributors: [Your Name] | Last change on: [Date]
###########

#########
# 1. DISPLAY & CONNECTION PARAMETERS
#########

## 1.1 Label and Connection Setup
connection: "your_connection_name"
label: "Your Model Label"

#########
# 2. STRUCTURAL PARAMETERS
#########

## 2.1 Include Statements for Views
include: "../views/*.view"
include: "../views_aggregated/*.view"
include: "../views_derived/*.view"

## 2.2 Additional Include Statements (E.g. Logistics and Base Logic Views | Optional)
include: "/area_logistics/views/*.view"
include: "/base/*.view"
include: "/base_granularity/*.view"

## 2.3 Include Statements for Data Tests
include: "../../data_tests/*.lkml"
include: "../../data_tests/views/*.view"

#########
# 3. ACCESS CONTROL (Optional)
#########

## 3.1 Define Access Grants 
access_grant: access_grant_name {
  user_attribute: user_attribute_name
  allowed_values: ["value_1", "value_2"]
}

#########
# 4. EXPLORES
#########

## 4.1 Main Explore for Your Data Model
explore: explore_name {
  label: "Explore Label"
  view_name: view_name
  persist_for: "N hours"

  ## 4.1.1 SQL Filters and Conditions (Optional)
  sql_always_where:
    @{sql_always_where_condition_1}
    AND
    {% if some_field._in_query %} ${some_field} IS NOT NULL {% else %} 1=1 {% endif %}
    AND
    @{sql_always_where_condition_2};;

  ## 4.1.2 Joins for Additional Views
  join: another_view_name {
    type: left_outer
    relationship: one_to_many
    sql_on: ${view_name.field_name} = ${another_view_name.field_name};;
  }

  join: another_join_view {
    type: inner
    relationship: one_to_one
    sql_on: ${view_name.field_name} = ${another_join_view.field_name};;
  }

  # Add more joins to create data model
}

#########
# 5. DATA TESTING EXPLORE
#########

## 5.1 Define Explore for Data Testing (Optional)
explore: data_testing {
  label: "Data Testing Explore"
  view_name: view_name
  hidden: yes

  join: another_test_view {
    type: left_outer
    relationship: one_to_one
    sql_on: ${view_name.field_name} = ${another_test_view.field_name};;
  }

  # Add more testing joins as required
}

#########
# 6. MAP LAYER (Optional)
#########

## 6.1 Define Map Layer for Geographic Data (Optional)
map_layer: map_name {
  file: "../files/your_map_file.json"
  property_key: "your_property_key"
  label: "Map Label"
  format: topojson
  max_zoom_level: 15
  min_zoom_level: 2
}

(2) The view file – contains dimensions and measures accessed from a specific database table (or across multiple joined tables) [2].

###########
# VIEW_NAME & METADATA
# Description: This view reflects the [business context] or [data source] it represents.
# Author: [Your Name] | Created on: [Date]
# Contributors: [Your Name] | Last change on: [Date]
###########

view: view_name {
  sql_table_name: `project.dataset.table_name` ;;

  #########
  # 1. DISPLAY PARAMETERS
  #########

  ## 1.1 Label for View
  # Specifies how the view name will appear in the field picker
  label: "Your View Display Name"

  ## 1.2 Fields Hidden by Default
  # When set to yes, hides all fields in the view by default
  fields_hidden_by_default: yes

  #########
  # 2. STRUCTURAL & FILTER PARAMETERS (Optional)
  #########

  ## 2.1 Include Files 
  # Includes additional files or views to be part of this view
  include: "filename_or_pattern"

  ## 2.2 Extends View 
  # Specifies views that this view will extend
  extends: [another_view_name]  

  ## 2.3 Drill Fields 
  # Specifies the default list of fields shown when drilling into measures
  drill_fields: [dimension_name, another_dimension]

  ## 2.4 Default Filters for Common Queries
  filter: default_date_filter {
    label: "Date Filter"
    type: date
    sql: ${order_date} ;;
    description: "Filter data based on order date."
  }

  ## 2.5 Suggestions for Dimensions
  # Enables or disables suggestions for all dimensions in this view
  suggestions: yes  

  ## 2.6 Set of Fields
  # Defines a reusable set of dimensions and measures
  set: set_name {
    fields: [dimension_name, measure_name]
  }

  #########
  # 3. DIMENSIONS
  #########

  ## 3.1 Simple Dimensions (Directly from DB)
  dimension: dimension_name {
    label: "Dimension Display Name"
    type: string
    sql: ${TABLE}.column_name ;;
    description: "This dimension represents [business context] and contains values like [example]."
  }

  dimension: another_dimension {
    label: "Another Dimension Display Name"
    type: number
    sql: ${TABLE}.other_column ;;
    description: "Explanation of the dimension, including business context and possible values."
  }

  ## 3.2 Compound Dimensions (Concatenated from Existing Dimensions)
  dimension: compound_dimension {
    label: "Compound Dimension Name"
    type: string
    sql: CONCAT(${dimension_name}, "-", ${another_dimension}) ;;
    description: "A compound dimension created by concatenating [dimension_name] and [another_dimension]."
  }

  ## 3.3 Derived Dimensions (Filtered/Grouped Values from Existing Dimensions)
  dimension: filtered_dimension {
    label: "Filtered Dimension Name"
    type: string
    sql: CASE
            WHEN ${dimension_name} = 'specific_value' THEN 'Subset Value'
            ELSE 'Other'
         END ;;
    description: "This dimension subsets values from [dimension_name] based on specific business rules."
  }

  ## 3.4 Tiered Dimension (Grouped by Tiers)
  dimension: order_amount_tier {
    label: "Order Amount Tier [€]"
    type: integer
    tiers: [50, 100, 150]
    sql: ${revenue_column} ;;
    description: "This dimension creates tiers of order amounts based on thresholds (50, 100, 150)."
  }

  #########
  # 4. MEASURES
  #########

  ## 4.1 Simple Aggregated Measures (Sum, Count, Average)
  measure: total_revenue {
    group_label: "KPIs"
    label: "Total Revenue [€]"
    type: sum
    sql: ${revenue_column} ;;
    value_format_name: currency_format
    description: "Total revenue, summing up all revenue from each record."
  }

  ## 4.2 Calculated Measures (Derived from Existing Measures)
  measure: profit_margin {
    group_label: "KPIs"
    label: "Profit Margin [%]"
    type: number
    sql: (${total_revenue} - ${cost_column}) / NULLIF(${total_revenue}, 0) ;;
    value_format_name: percent_2
    description: "Calculated profit margin as (Revenue - Cost) / Revenue."
  }

}

(3) The manifest file – a configuration file that contains project constants, code for using files imported from other projects, and localization settings, and serves for adding extensions or custom visualizations [2].

###########
# MANIFEST_INFO & METADATA
# Description: This file reflects the [business context].
# Author: [Your Name] | Created on: [Date]
# Contributors: [Your Name] | Last change on: [Date]
###########

#########
# 1. STRUCTURAL PARAMETERS
#########

## 1.1 Project Name & LookML Runtime
project_name: "Current Project Name"
new_lookml_runtime: yes 

## 1.2 Local Dependency
local_dependency: {
  project: "project_name"
  override_constant: constant_name {
    value: "string value"
  }
}
# Add additional local dependencies as needed.

## 1.3 Remote Dependency (Optional)
remote_dependency: remote_project_name {
  url: "remote_project_url"
  ref: "remote_project_ref"
  override_constant: constant_name {
    value: "string value"
  }
}
# Add additional remote dependencies as needed.

## 1.4 Constants (Optional, but useful)
constant: constant_name {
  value: "string value"
  export: none | override_optional | override_required
}

#########
# 2. LOCALIZATION PARAMETERS
#########

## 2.1 Localization Settings
localization_settings: {
  localization_level: strict | permissive
  default_locale: locale_name
}

#########
# 3. EXTENSION FRAMEWORK PARAMETERS (Optional)
#########

## 3.1 Application Definitions
application: application_name {
  label: "Application Label"
  url: "application_url"
  file: "application_file_path"

  ## 3.1.1 Mount Points
  mount_points: {
    # Define mount points here (refer to the application page for more details)
  }

  ## 3.1.2 Entitlements
  entitlements: {
    # Define entitlements here (refer to the application page for more details)
  }
}
# Add additional application declarations as required.

#########
# 4. CUSTOM VISUALIZATION PARAMETERS (Optional)
#########

## 4.1 Visualization Definition
visualization: {
  id: "unique-id"
  label: "Visualization Label"
  url: "visualization_url"
  sri_hash: "SRI hash"
  dependencies: ["dependency_url_1", "dependency_url_2"]
  file: "visualization_file_path"
}
# Add additional visualizations as needed.

Finally, after presenting the visual examples, I will list the two traits of organizing the model, view, and manifest files in a similar way:

  • (1) Design clarity: By using enumeration in the code sections, the code logic can be more easily tracked, adapted, extended, and troubleshot.
  • (2) Consistency framework: Creating patterns in your code files creates uniformity in development. This results in easier collaboration and faster development delivery.

These two traits lead me to a Joker that is correlated with code structure and organization, i.e., embedding documentation.

Joker #3: (Non-) Embedded Documentation 📜

When I start with my analytical development in the semantic layer, I always focus on the input-output flow or the process flow.

I can’t stress enough the reason for this, so I will repeat it again: I want my semantic layer to be understandable to both technical and business colleagues.

And to achieve this, I need to document my inputs, explain how I modelled them, and then elaborate on derived outputs.

In my early days of development, I always used Confluence for documentation of my flows and explanation of the data models in my semantic layers.

Lucky for me, most analytical tools now offer embedded documentation features, and the two that I always tend to adopt are:

(1) Data dictionary

It serves as my reference guide to explain to my stakeholders the core of the data models – i.e., dimensions and measures.

I tend to embed the formulas, business rules, and terminology within the code files, and then, by using the Data Dictionary, educate my end-users on how they can explore the data models.

With this approach, I save myself numerous minutes, if not hours, each week responding to chat and email enquiries about the meaning of specific measures and dimensions.

On top of this, it serves me for the technical purpose of identifying if I have redundant fields in my project or if I have fields without annotated descriptions.
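As a small sketch of this embedding (the field name and wording are hypothetical, following the view template from Joker #2), the description parameter is where the business definition lives, and the Data Dictionary surfaces it to end-users:

```lookml
measure: total_net_revenue {
  group_label: "KPIs"
  label: "Total Net Revenue [€]"
  type: sum
  sql: ${TABLE}.net_revenue ;;
  value_format_name: currency_format
  # This description is what the Data Dictionary shows to business users
  description: "Sum of order revenue after discounts and returns. Business rule: [your rule here]."
}
```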

(2) Data lineage

Nowadays, a must-have for me in my data project is to have a graphical explanation of where my data comes from, how it flows through transformation, and where it ends up.

For this purpose, I tend to adopt tools like dbt Core with its native data lineage feature. And if I don’t have a tool with native data lineage in my data architecture/project, I will make sure to get one that serves only this purpose.

This gives my data team and stakeholders a clear view of the data’s entire journey – from the raw tables to presentational tables.

It makes code troubleshooting easier and elevates the usability of the Semantic Layer by providing a clear data path.
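For readers on dbt Core, the native lineage graph is derived from ref() calls: instead of hard-coding table names, each model references its upstream models, and dbt draws the data path from those references. A minimal sketch (the model and column names are hypothetical):

```sql
-- models/marts/fct_orders.sql (hypothetical dbt model)
-- dbt parses the ref() calls below and builds the lineage graph from them
select
    o.order_id,
    o.order_date,
    c.customer_segment
from {{ ref('stg_orders') }} as o
left join {{ ref('stg_customers') }} as c
    on o.customer_id = c.customer_id
```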

And the usability topic leads me to my final Joker, which is to constantly work on the improvement of your semantic layer.

[🃏 Hidden Joker: Refinement Loop 🃏]

My Refinement Loop: Iterate, Correct, Educate 🔄

This is where it all comes together if you want to build a semantic layer that is both simple and consistent.

To achieve this, you will probably need to go through the process of constant iteration, correction, and education.

— Why?

Because every semantic layer is built by people to benefit people – or, to use a catchier phrase, "for the people, by the people."

With this sentence, I want to remind you that the semantic layer is a living and evolving component of the data project that usually grows together with the data team and rising business requirements.

So, even if you currently don’t have enough knowledge or experience to build a semantic layer that is both simple and consistent, just keep investing time and effort until it becomes one.

With each improvement iteration, the goal is to think about the end users, how to increase the usability of the insights created within the semantic layer, and how to create business value faster by delivering quality insights.

And if you manage to bring more value to your stakeholders in a shorter time, that’s the real Joker that can position you and your data team in the business landscape.


Thank you for reading my post. Stay connected for more stories on Medium, Substack ✍️ and LinkedIn 🖇️.


Looking for more analytical [templates | tutorials | reads]?

My 3 posts related to this topic:

  1. The One Page Data and Analytics Templates | Master Data and Analytics Reports and Processes with 5 Templates
  2. 3 Clusters of Knowledge Base Templates for Data and Analytics Teams | Create External, Shared and Internal Documentation in a Systematic Way
  3. Decoding Data Development Process | How Is Kowalski Providing "Ad-Hoc" Data Insights?

References

[1] Wikipedia: Semantics, https://en.wikipedia.org/wiki/Semantics, accessed on September 18, 2024

[2] Google Cloud Documentation: LookML terms and concepts, https://cloud.google.com/looker/docs/lookml-terms-and-concepts, accessed on September 19, 2024

