Marina Tosic, Author at Towards Data Science

Ivory Tower Notes: The Problem

When a data science problem is "the" problem

Have you ever spent months on a Machine Learning project, only to discover you never defined the “correct” problem at the start? If so, or even if not and you are only starting out in the data science or AI field, welcome to my first Ivory Tower Note, where I will address this topic.


The term “Ivory Tower” is a metaphor for a situation in which someone is isolated from the practical realities of everyday life. In academia, the term often refers to researchers who engage deeply in theoretical pursuits and remain distant from the realities that practitioners face outside academia.

As a former researcher, I wrote a short series of posts from my old Ivory Tower notes — the notes before the LLM era.

Scary, I know. I am writing this to manage expectations and preempt the question, “Why ever did you do things this way?” — “Because no LLM told me how to do otherwise 10+ years ago.”

That’s why my notes contain “legacy” topics such as data mining, machine learning, multi-criteria decision-making, and (sometimes) human interactions, airplanes ✈ and art.

Nonetheless, whenever there is an opportunity, I will map my “old” knowledge to generative AI advances and explain how I applied it to datasets beyond the Ivory Tower.

Welcome to post #1…


How every Machine Learning and AI journey starts

 — It starts with a problem. 

For you, this is usually “the” problem because you need to live with it for months or, in the case of research, years.

With “the” problem, I am addressing the business problem you don’t fully understand or know how to solve at first. 

An even worse scenario is when you think you fully understand it and know how to solve it quickly. This then creates even more problems that are, again, only yours to solve. But more about this in the upcoming sections.

So, what’s “the” problem about?

Causa: It’s mostly about not managing or leveraging resources properly —  workforce, equipment, money, or time. 

Ratio: It’s usually about generating business value, which can range from improved accuracy and increased productivity to cost savings, revenue gains, and faster reaction, decision, planning, delivery, or turnaround times.

Veritas: It’s always about finding a solution that relies on, and is hidden somewhere in, the existing dataset.

Or, more than one dataset that someone labelled as “the one”, and that’s waiting for you to solve the problem. Because datasets follow, and are created from, technical or business process logs, “there has to be a solution lying somewhere within them.”

Ah, if only it were so easy.

To avoid drifting into a different chain of thought again, the point is that you will need to:

1 — Understand the problem fully,
2 — If not given, find the dataset “behind” it, and 
3 — Create a methodology to get to the solution that will generate business value from it. 

On this path, you will be tracked and measured, and time will not be on your side to deliver the solution that will solve “the universe equation.” 

That’s why you will need to approach the problem methodically, drill down to smaller problems first, and focus entirely on them because they are the root cause of the overall problem.

That’s why it’s good to learn how to…

Think like a Data Scientist.

Returning to the problem itself, let’s imagine that you are a tourist lost somewhere in a big museum, and you want to figure out where you are. What you do next is walk to the closest info map on the floor, which will show your current location.

At this moment, in front of you, you see something like this: 

Data Science Process. Image by Author, inspired by Microsoft Learn

The next thing you might tell yourself is, “I want to get to Frida Kahlo’s painting.” (Note: These are the insights you want to get.)

Because your goal is to see this one painting that brought you miles away from your home and now sits two floors below, you head straight to the second floor. Beforehand, you memorized the shortest path to reach your goal. (Note: This is the initial data collection and discovery phase.)

However, along the way, you stumble upon some obstacles — the elevator is shut down for renovation, so you have to use the stairs. The museum paintings were reordered just two days ago, and the info maps didn’t reflect the changes, so the path you had in mind to get to the painting is no longer accurate.

Then you find yourself wandering around the third floor already, asking quietly again, “How do I get out of this labyrinth and get to my painting faster?”

While you don’t know the answer, you ask the museum staff on the third floor to help you out, and you start collecting the new data to get the correct route to your painting. (Note: This is a new data collection and discovery phase.)

Nonetheless, once you get to the second floor, you get lost again. What you do next is start noticing a pattern: the paintings have been ordered chronologically and thematically to group artists whose styles overlap, giving you an indication of where to go to find your painting. (Note: This is a modelling phase, overlapping with an enrichment phase drawing on the dataset you collected during your school days — your art knowledge.)

Finally, after adapting the pattern analysis and recalling the collected inputs on the museum route, you arrive in front of the painting you had been planning to see since booking your flight a few months ago. 

What I described now is how you approach data science and, nowadays, generative AI problems. You always start with the end goal in mind and ask yourself:

“What is the expected outcome I want or need to get from this?”

Then you start planning from this question backwards. The example above started with requesting holidays, booking flights, arranging accommodation, traveling to a destination, buying museum tickets, wandering around in a museum, and then seeing the painting you’ve been reading about for ages. 

Of course, there is more to it, and this process should be approached differently if you need to solve someone else’s problem, which is a bit more complex than locating the painting in the museum. 

In this case, you have to…

Ask the “good” questions.

To do this, let’s define what a good question means [1]: 

A good data science question must be concrete, tractable, and answerable. Your question works well if it naturally points to a feasible approach for your project. If your question is too vague to suggest what data you need, it won’t effectively guide your work.

Formulating good questions keeps you on track so you don’t get lost in the data that should lead you to the specific problem solution, or end up solving the wrong problem.

Going into more detail, good questions will help identify gaps in reasoning, avoid faulty premises, and create alternative scenarios in case things do go south (which almost always happens)👇🏼.

Image created by Author after analyzing “Chapter 2. Setting goals by asking good questions” from “Think Like a Data Scientist” book [2]

From the diagram above, you can see how good questions, first and foremost, need to support concrete assumptions. This means they need to be formulated so that your premises are clear and can be tested without mixing up facts with opinions.

Good questions produce answers that move you closer to your goal, whether through confirming hypotheses, providing new insights, or eliminating wrong paths. They are measurable, and with this, they connect to project goals because they are formulated with consideration of what’s possible, valuable, and efficient [2].

Good questions are answerable with available data, considering current data relevance and limitations. 

Last but not least, good questions anticipate obstacles. If anything is certain in data science, it is uncertainty, so having backup plans for when things don’t work as expected is important to produce results for your project.

Let’s exemplify this with a use case: an airline company struggling to increase its fleet availability due to unplanned technical groundings (UTG).

These unexpected maintenance events disrupt flights and cost the company significant money. Because of this, executives decided to react to the problem and call in a data scientist (you) to help them improve aircraft availability.

Now, if this were the first data science task you ever got, you might start the investigation by asking:

“How can we eliminate all unplanned maintenance events?”

You can see how this question is an example of a wrong or “poor” one because:

  • It is not realistic: It lumps every possible defect, big and small, into one impossible goal of “zero operational interruptions”.
  • It doesn’t include a measure of success: There’s no concrete metric to show progress, and if you’re not at zero, you’re at “failure.”
  • It is not data-driven: The question doesn’t cover which data is recorded before delays occur, or how aircraft unavailability is measured and reported from it.

So, instead of this vague question, you would probably ask a set of targeted questions:

  1. Which aircraft (sub)system is most critical to flight disruptions?
    (Concrete, specific, answerable) This question narrows down your scope, focusing on only one or two specific (sub)systems affecting most delays.
  2. What constitutes “critical downtime” from an operational perspective?
    (Valuable, ties to business goals) If the airline (or regulatory body) doesn’t define how many minutes of unscheduled downtime matter for schedule disruptions, you might waste effort solving less urgent issues.
  3. Which data sources capture the root causes, and how can we fuse them?
    (Manageable, narrows the scope of the project further) This clarifies which data sources one would need to find the problem solution.

With these sharper questions, you will drill down to the real problem (illustrated with a short data sketch below):

  • Not all delays weigh the same in cost or impact. The “correct” data science problem is to predict critical subsystem failures that lead to operationally costly interruptions so maintenance crews can prioritize them.
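
To make this first drill-down concrete, here is a minimal, purely illustrative sketch of how question 1 could be approached once the delay logs are available. The DataFrame, its column names (subsystem, delay_minutes, cancelled), and the values are hypothetical stand-ins for whatever sources question 3 identifies; pandas is assumed to be available.

```python
import pandas as pd

# Hypothetical delay-log extract; columns and values are illustrative only.
events = pd.DataFrame({
    "subsystem": ["hydraulics", "avionics", "landing_gear", "hydraulics", "avionics"],
    "delay_minutes": [45, 120, 30, 95, 210],
    "cancelled": [0, 1, 0, 0, 1],
})

# Question 1: which (sub)system drives the most operationally costly interruptions?
impact = (
    events.groupby("subsystem")
    .agg(
        total_delay=("delay_minutes", "sum"),
        event_count=("delay_minutes", "count"),
        cancellations=("cancelled", "sum"),
    )
    .sort_values("total_delay", ascending=False)
)
print(impact)
```

Ranking by total delay minutes and cancellations, rather than by raw event counts, already reflects the point above: not all delays weigh the same.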

That’s why…

Defining the problem determines every step after. 

It’s the foundation upon which your data, modelling, and evaluation phases are built 👇🏼.

Image created by Author after analyzing and overlapping different images from “Chapter 2. Setting goals by asking good questions, Think Like a Data Scientist” book [2]

It means you are clarifying the project’s objectives, constraints, and scope; you need to articulate the ultimate goal first and, besides asking “What’s the expected outcome I want or need to get from this?”, also ask:

What would success look like and how can we measure it?

From there, drill down to (possible) next-level questions — the ones I have learned from the Ivory Tower days:
 — History questions: “Has anyone tried to solve this before? What happened? What is still missing?”
 — Context questions: “Who is affected by this problem and how? How are they partially resolving it now? Which sources, methods, and tools are they using now, and can they still be reused in the new models?”
 — Impact questions: “What happens if we don’t solve this? What changes if we do? Is there a value we can create by default? How much will this approach cost?”
 — Assumption questions: “What are we taking for granted that might not be true (especially when it comes to data and stakeholders’ ideas)?”
 — …

Then, do this in a loop and always “ask, ask again, and don’t stop asking” questions so you can drill down and understand which data and analysis are needed and what the root problem is.

This is the evergreen knowledge you can apply nowadays, too, when deciding whether your problem is of a predictive or generative nature.

(More about this in another note, where I will explain how problematic it is to try to solve a problem with models that have never seen — or have never been trained on — similar problems before.)

Now, going back to memory lane…

I want to add one important note: I have learned from late nights in the Ivory Tower that no amount of data or data science knowledge can save you if you’re solving the wrong problem and trying to get the solution (answer) from a question that was simply wrong and vague. 

When you have a problem at hand, do not rush into assumptions or into building models without understanding what you need to do (Festina lente).

In addition, prepare yourself for unexpected situations and do a proper investigation with your stakeholders and domain experts because their patience will be limited, too. 

With this, I want to say that the “real art” of being successful in data projects is knowing precisely what the problem is, figuring out if it can be solved in the first place, and then coming up with the “how” part. 

You get there by learning to ask good questions.

To end this narrative, recall the quote often attributed to Einstein:

If I were given one hour to save the planet, I would spend 59 minutes defining the problem and one minute solving it.


Thank you for reading, and stay tuned for the next Ivory Tower note.

If you found this post valuable, feel free to share it with your network. 👏

Connect for more stories on Medium ✍ and LinkedIn 🖇.


References: 

[1] DS4Humans, Backwards Design, accessed: April 5th 2025, https://ds4humans.com/40_in_practice/05_backwards_design.html#defining-a-good-question

[2] Godsey, B. (2017), Think Like a Data Scientist: Tackle the data science process step-by-step, Manning Publications.

Human Minds vs. Machine Learning Models

Exploring the parallels and differences between psychology and machine learning

Disclaimers:

  • This blog post was co-authored with my friend, Dee Penco, a certified therapist and counsellor.
  • For the past 2.5+ years, Dee and I have spent hours and hours discussing and reasoning about human behaviours, which has sparked my passion for psychology. With time, we cross-shared ideas about psychology and ML/AI modelling concepts.
  • While I found parallels between these fields, I asked Dee if she could explain how humans function and make decisions. Being intrigued by the rise of artificial general intelligence (AGI), my question was: Do humans follow the same steps in generating outputs as machine learning does, and how feasible is it to mimic human-like decision-making?
  • To my great joy, Dee decided to write this post with me, to share a dual perspective on how humans vs. ML models create outputs.
  • In the text below, we sometimes interchange AI and ML as terms, but we understand they are not the same, and that ML is a subset of AI.
_"If your mind is a swirling galaxy of influences, an ML model is more like a solar system with defined orbits. Both revolve and evolve, just at different scales of complexity."_ [Photo by Growtika on Unsplash]
_"If your mind is a swirling galaxy of influences, an ML model is more like a solar system with defined orbits. Both revolve and evolve, just at different scales of complexity."_ [Photo by Growtika on Unsplash]

The year 2024 was a big one for recognizing Machine Learning and artificial intelligence contributions.

The Nobel Prize in Chemistry was awarded for advancements in protein science: David Baker for creating new kinds of proteins, alongside Demis Hassabis and John Jumper for developing an AI model that solved a 50-year-old challenge of predicting proteins’ complex structures.

Furthermore, John Hopfield and Geoffrey Hinton were awarded the Nobel Prize in Physics for their work on artificial neural networks, brainlike models capable of recognizing patterns and producing outcomes that resemble human decision-making processes.

Although artificial intelligence increasingly and more accurately models human problem-solving and decision-making, the mechanisms behind human cognition are still far from fully understood.

The Psychology of human (re-)action involves complex interconnected dimensions, shaped by layers of conscious and subconscious factors.

— So, what sets human and ML/AI models apart in generating outputs?

To address this question, let’s explore these two worlds – psychology and machine learning – and uncover the connections that shape how humans and human-created AI models produce outputs.

The aims of this post are:

  1. Bring high-level professional psychology explanations closer to technical readers on what affects human decision-making.
  2. Showcase the high-level machine learning (ML) modelling process and explain how ML models generate outputs to non-technical professionals.
  3. Identify the differences and similarities between the two processes—the human and the machine—that produce outputs.

Psychology aspect: How do humans generate outputs? | By Dee

Before I start writing this section, I want to emphasize that every psychologist worldwide would be thrilled if there were a way for people to function more simply – or at least as simply as AI or ML models do.

Experts in AI and ML are probably appalled by what I just said because I implied that "AI and ML are simple."

— But that wasn’t my intention.

I only mean to emphasize how much simpler these models are compared to the complexities of humans.

When Marina was explaining, at a high level, how machine learning modelling works – I couldn’t help but think:

If we could "reduce" humans to this "straightforward" methodology, we would cure most psychological problems, transform lives for the better, and dramatically improve overall population wellbeing.

Imagine if a person could receive input(s), pass them to some internal algorithms that could determine the weight, importance, and quality of that input, make the most likely prediction, and, based on that, produce a controlled output – a thought, emotion, or behaviour.

But unlike ML or AI, the human mind processes information in far more complex ways, influenced by numerous interconnected factors.

From this point, I’ll stop speculating about what happens within an AI or ML model and explain the "human" modelling flow.

To illustrate the concept, I will discuss several factors influencing humans’ decision-making.

For this, I invite you to imagine a person as a "black box pre-trained model". In other words, a model that already comes pre-loaded with knowledge, patterns, and weights learned in the training step.

These patterns and weights or factors vary from person to person and are known as:

  • (1) Intelligence and IQ
  • (2) Emotional world and EQ
  • (3) Conscious world – what the "model" has learned so far: values, experiences, purpose
  • (4) Unconscious and subconscious world – what the "model" learned and repressed so far – short-term/long-term memory + (again) values, experiences, purpose
  • (5) Genetic predispositions – what we’re born with
  • (6) Environment – social, cultural, physical
  • (7) Physiological needs – Maslow’s Hierarchy (Hierarchy of Needs)
  • (8) Hormonal + physiological status – Neurobiology, Endocrine System, Arousal
  • (9) Decision-making centres – Id, Ego, Superego, the separate entities within us
  • (10) Intuition and creativity – which can be considered part of the above-grouped variables or separate entities on their own (Intuition, Divergent Thinking, Flow State)

So far, we have identified 10 factors that vary for each person.

I want to emphasize that they are all interconnected and sometimes so "fused" together that they can even be compounded.

In addition, each one can be thicker or thinner and may contain "particles", or information that is either predominant or deficient.

  • For example, the hormonal factors may have a predominant hormone (such as serotonin, which affects mood; cortisol in response to stress; dopamine, which is essential for excitement, etc.). The intellectual factor can be higher or lower.

Now imagine that there is some algorithm inside the person constantly rearranging the order of factor importance, so one may sometimes end up in the front, sometimes in the middle, and sometimes in the back.

  • Let’s take the physiological needs – hunger, for example. If you place it right up front, it will dictate which information reaches the second, third, fourth factor, and so on; the output will depend on that.

In other words, if you make decisions while hungry, the output will probably not be the same as when your stomach is full.

📌 The factors are ordered by importance, with the most important at the specific moment always in the first position, then the second, and so on.

What I just described briefly above is how humans operate when they receive input information under specific circumstances.

🙋🏽♀ Returning to the intro thought again, did I explain why psychologists everywhere long for a more straightforward human experience? Why would we, on this side, actually be happy if people were as "complicated" as ML?

  • Consider only how many psychological problems could be resolved – given that psychological factors are fundamental to everything – and how many other issues around us would be resolved if only we could "tune" or "reset" our factors as easily as in the ML modelling.

— But, what can you do to control your outputs better?

Remember the part I mentioned above on the "thickness" of your factors? Well, since the thickness of your factors and their order determine your output (the emotion you feel, the thought you form, and the reaction to information), it’s useful to know that you can absolutely thicken some of these factors to your advantage and hold them firmly in the first position.

I’ll simplify this again with several examples: You can probably (and hopefully) ensure you’re never hungry. You can work on regulating your hormones. In therapy, you can address your fears and work to eliminate them. You can adjust deeply rooted beliefs stored in your subconscious, and so on. 👉🏼 We (You) can indeed do this, and it’s a pity more people don’t work on adjusting their factors.

And, for now, let’s leave it at that.

Until some form of AI figures out a more efficient way to do this – or does it instead of us – we’ll continue to make certain decisions, feel certain emotions, and take specific actions as we do now.


Machine Learning aspect: How do models generate outputs? | By Marina

When Dee talked about the "human black box" with pre-trained patterns, I couldn’t help but think about how closely that parallels the machine learning process. Just as humans have multiple interconnected factors influencing their decisions, ML models have their version of this complexity.

So, what is Machine Learning?

It is a subset of AI that allows machines to learn from past data (or historical data) and then make predictions or decisions on new data records without being explicitly programmed for every possible scenario.

With this said, some of the more common ML "scenarios" are:

  • Forecasting or Regression (e.g., predicting house prices)
  • Classification (e.g., labelling images of cats and dogs)
  • Clustering (e.g., finding groups of customers by analyzing their shopping habits)
  • Anomaly Detection (e.g., finding outliers in your transactions for fraud analysis)

Or, to exemplify these scenarios with our human cognitive daily tasks, we also predict (e.g., will it rain today?), classify (e.g., is that a friend or stranger?), and detect anomalies (e.g., the cheese that went bad in our fridge). The difference lies in how we process these tasks and which inputs or data we have (e.g., the presence of clouds vs. a bright, clear sky).
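
To make one of these scenarios tangible in code, here is a minimal anomaly-detection sketch. It assumes scikit-learn and NumPy are available, and the transaction amounts are synthetic values made up purely for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic transaction amounts: mostly "normal" spending plus two suspicious outliers.
rng = np.random.default_rng(42)
amounts = rng.normal(loc=50, scale=10, size=(200, 1))
amounts = np.vstack([amounts, [[400.0], [650.0]]])

# Unsupervised anomaly detector: fit_predict returns -1 for anomalies and 1 for normal points.
detector = IsolationForest(contamination=0.01, random_state=42)
labels = detector.fit_predict(amounts)

print("Flagged transactions:", amounts[labels == -1].ravel())
```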

So, data (and its quality) is always at the core of producing quality model outcomes from the above scenarios.

Data: The Core "Input"

Similar to humans, who gather multimodal sensory inputs from various sources (e.g., videos from YouTube, music from the radio, blog posts from Medium, financial records from Excel sheets, etc.), ML models rely on data that can be:

  • Structured (like rows in a spreadsheet)
  • Semi-structured (JSON, XML files)
  • Unstructured (images, PDF documents, free-form text, audio, etc.)

Because data fuels every insight an ML model produces, we (data professionals) spend a substantial amount of time preparing it – often cited as 50–70% of the overall ML project effort.

This preparation phase gives ML models a taste of the "filtering and pre-processing" that humans do naturally.

We look for outliers, handle missing values and duplicates, remove unnecessary inputs (features), or create new ones.

Beyond the above-listed tasks, we can additionally "tune" the data inputs. – Remember how Dee mentioned factors being "thicker" or "thinner"? – In ML, we achieve something similar through feature engineering and weight assignments, though in a fully mathematical way.

In summary, we are "organizing" the data inputs so the model can "learn" from clean, high-quality data, yielding more reliable model outputs.
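
As a minimal sketch of this "organizing" step, here is how a hypothetical customer table (with illustrative column names) could be cleaned and enriched with pandas:

```python
import pandas as pd

# Hypothetical raw extract with a duplicate record, missing values, and an outlier.
raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "age": [34, None, None, 29, 120],
    "monthly_spend": [250.0, 80.0, 80.0, None, 95.0],
})

clean = (
    raw.drop_duplicates(subset="customer_id")            # remove the duplicate record
    .assign(age=lambda df: df["age"].clip(upper=100))    # cap an implausible outlier
)
clean["age"] = clean["age"].fillna(clean["age"].median())        # impute missing ages
clean["monthly_spend"] = clean["monthly_spend"].fillna(0.0)      # fill missing spend
clean["spend_per_age"] = clean["monthly_spend"] / clean["age"]   # simple engineered feature

print(clean)
```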

Modelling: Training and Testing

While humans can learn and adapt their "factor weights" through deliberate practices, as Dee described, ML models have a similarly structured learning process.

Once our data is in good shape, we feed it into ML algorithms (like neural networks, decision trees, or ensemble methods).

In a typical supervised learning setup, the algorithm sees examples labelled with the correct answers (like a thousand images labelled "cat" or "dog").

It then adjusts its internal weights – its version of "importance factors" – to match (predict) those labels as accurately as possible. In other words, the trained model might assign a probability score indicating how likely each new image is a "cat" or a "dog", based on the learned patterns.

This is where ML is more "straightforward" than the human mind: the model’s outputs come from a defined process of summing up weighted inputs, while humans shuffle around multiple factors – like hormones, subconscious biases, or immediate physical needs – making our internal process far less transparent.

So, the two core phases in model building are the following (a short sketch after this list illustrates them):

  • Training: The model is shown the labelled data. It "learns" patterns linking inputs (image features, for example) to outputs (the correct pet label).
  • Testing: We evaluate the model on new, unseen data (new images of cats and dogs) to gauge how well it generalizes. If it consistently mislabels certain images, we might tweak parameters or gather more training examples to improve the accuracy of generated outputs.
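
As an illustration of these two phases, here is a minimal scikit-learn sketch. The synthetic feature vectors stand in for the image features a real pipeline would first have to extract; everything else is assumed for the sake of the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for labelled "cat vs. dog" feature vectors.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Keep a portion of the labelled data aside for testing on "unseen" examples.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)             # training: learn patterns linking inputs to labels
y_pred = model.predict(X_test)          # testing: generalize to unseen examples

print(f"Accuracy on unseen data: {accuracy_score(y_test, y_pred):.2f}")
print(model.predict_proba(X_test[:3]))  # probability scores, e.g. P("cat") vs. P("dog")
```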

As it all comes back to the data, it’s relevant to mention that there can be more to the modelling part, especially if we have "imbalanced data."

For example: if the training set has 5,000 dog images but only 1,000 cat images, the model might lean toward predicting dogs more often – unless we apply special techniques to address the "imbalance". But this is a story that would call for a fully new post.

The idea behind this mention is that the number of examples in the input dataset for each possible outcome (the image "cat" or "dog") influences the complexity of the model’s training process and its output accuracy.
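
As a small, hedged illustration of one such technique (the full topic deserves its own post), class weighting tells the algorithm to penalize mistakes on the rare class more heavily. A minimal scikit-learn sketch with the 5,000-dog / 1,000-cat split from the example:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_class_weight

# Imbalanced labels: 5,000 "dog" (0) vs. 1,000 "cat" (1) examples, as in the text.
y = np.array([0] * 5000 + [1] * 1000)

# "Balanced" weighting makes the rare class count proportionally more during training.
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))  # roughly {0: 0.6, 1: 3.0}

# The same idea in one argument when fitting a model:
model = RandomForestClassifier(class_weight="balanced", random_state=42)
```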

Ongoing Adjustments and the Human Factor

However, despite its seeming straightforwardness, an ML pipeline isn’t "fire-and-forget".

When the model’s predictions start drifting off track (maybe because new data has changed the scenario), we retrain and fine-tune the system.

Again, the data professionals behind the scenes need to decide how to clean or enrich data and re-tune the model parameters to improve model performance metrics.

That’s the "re-learning" in machine learning.
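
A minimal sketch of such a re-training trigger; the threshold and metric values are entirely illustrative and would normally come from a monitoring job.

```python
# Minimal re-training trigger; thresholds and metric values are illustrative only.
RETRAIN_THRESHOLD = 0.05   # acceptable accuracy drop before we intervene

baseline_accuracy = 0.91   # measured on the hold-out set at deployment time
current_accuracy = 0.83    # measured on freshly labelled production data

if baseline_accuracy - current_accuracy > RETRAIN_THRESHOLD:
    print("Performance drift detected: schedule retraining on the new data.")
```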

This is important because bias and errors in data or models can ripple through to flawed outputs and have real-life consequences. For instance, a credit-scoring model trained on biased historical data might systematically lower scores for certain demographic groups, leading to unfair denial of loans or financial opportunities.

In essence, humans still drive the feedback loop that improves machine training, shaping how the ML/AI model evolves and "behaves".


Concluding remarks

"While psychologists dream of a day when the human mind's complexity is fully understood, or data professionals dream of a day when AI models reach AGI, each domain still retains its unique field. Yet, the dialogue between these two fields deepens our understanding of both." [Image generated by Authors using DALL-E]
"While psychologists dream of a day when the human mind’s complexity is fully understood, or data professionals dream of a day when AI models reach AGI, each domain still retains its unique field. Yet, the dialogue between these two fields deepens our understanding of both." [Image generated by Authors using DALL-E]

In this post, we’ve explored how humans generate outputs, influenced by at least ten major interwoven factors, and how ML models produce results through data-driven algorithms.

Although machines and humans don’t derive outcomes in the same way, the core ideas are remarkably similar.

  • For humans, the process is intuitive: we receive and collect sensory inputs and store them in various forms of memory. Our cognitive processes then combine logic, emotions, hormones, past experiences, in-the-moment external inputs, etc., to produce actions or behaviours. As Dee illustrated, this unique combination of factors creates our human ability to be both predictable and surprising, rational and emotional, all at once.
  • For ML or AI, the process flow is data-based. We provide structured, semi-structured, or unstructured data, store, clean, label, or enrich them, model them with ML/AI algorithms, and then let the developed model generate predictions, decisions, or recommendations. Yet, as shown above, even this seemingly straightforward pipeline requires ongoing human oversight and adaptation to new scenarios.

The key difference lies in the unpredictability of human mental processes, versus the more trackable, parameterized nature of ML/AI models.

And while most of us wonder what the future brings – what will happen when AI reaches AGI and possibly replaces our jobs with its "superhuman" abilities – Dee shared an interesting perspective on that point:

What if AI starts following our path? – Developing emotions, cluttering its memory with conflicting or harmful data, somehow creating an Id, and becoming unpredictable.

Hmm… You weren’t expecting that, were you? – The outcome doesn’t necessarily have to be that AI will surpass us. – Maybe we’ll surpass AI, which opens up another question: do human traits (factors) necessarily evolve…? Because maybe that’s the future of AI.

Until then, it’s important to recognize that both human and machine models can learn, adapt, and change their outputs over time, given the right inputs and a willingness to keep refining the process.

We can work on improving the quality of outputs by selecting better inputs and refining how they’re processed – whether in guided sessions with counsellors or ML/AI data and algorithm designs.


✨ Thank you for diving into our post! ✨

For a fresh psychology perspective, subscribe to Dee’s posts👇🏼

For data trends and tips, don’t miss Marina’s posts👇🏼

The Lead, Shadow, and Sparring Roles in New Data Settings

From data engineer to domain expert—what it takes to build a new data platform
"The best data engineers are runaway software engineers; the best data analysts, scientists, and solution (data) architects are runaway data engineers; and the best data product managers are runaway data analysts or scientists." [Photo by Yurii Khomitskyi on Unsplash]
"The best data engineers are runaway software engineers; the best data analysts, scientists, and solution (data) architects are runaway data engineers; and the best data product managers are runaway data analysts or scientists." [Photo by Yurii Khomitskyi on Unsplash]

Ever wonder why roles with multidisciplinary skills are the key to successful delivery in new data settings? — If curious, read this post to discover how hybrid roles and their LSS capabilities can get you from idea to revenue.


After spending years navigating the complexities of evolving technologies and business demands in the data world, I’ve gathered some wisdom of my own—much of it centred on driving new developments or re-engineering existing ones.

I can’t say it was always an intuitive or straightforward journey to improve or create something from scratch, or that this path was free of mistakes. Yet, through these experiences, I have learned what it takes to build a new and high-performing data team, lead a new data project, or design an ecosystem of new data products that drive business value up.

With all the business and technical variables, the positive outcome usually depends on one constant—the right people who will support and shadow you.

At its core, the success of the new Data Platforms boils down to the proper selection of individual contributors with multidisciplinary and complementary skills. These individuals go beyond job titles; they bring together expertise across domains and a shared innovation mindset.

Because of this, one of my favourite sayings related to data staffing is:

"The best data engineers are runaway software engineers; the best data analysts, scientists, and solution (data) architects are runaway data engineers; and the best data product managers are runaway data analysts or scientists."

Or in semi-visual format: SWE → DE → DA/DS/SA → DPM.

This flow motivated me to write my post today, where I want to focus on answering the question:

  • Why do you need multidisciplinary data roles and how do they navigate and balance their responsibilities in new settings?

Or, to say it bluntly, how different data roles "split the bill" when building a new data platform.

Moreover, I will address how they act as Leads, Shadows, or Sparring (LSS) partners during different phases of the new data setting.

The new data settings and the role dynamics

There are usually four phases in building a new data platform: (1) preparation, (2) prototyping, (3) productionalizing, and finally—(4) monetizing.

Speaking from my experience of working in medium to large data projects, to arrive at phase (4), you mostly need a few key data roles:

  • Engineer/Architect
  • Analyst/Scientist
  • Business/Technical Analyst
  • Team/Product/Project Lead
  • Domain Expert/Stakeholder

Now you probably wonder, "Which roles take on the lead, shadow, and sparring tasks? And what’s the significance of the slashes in every role?"

Let me address the latter question first and explain the slashes. With project budgets in mind, few companies nowadays can afford to dedicate a single position to a single role in new developments. Hence, you will mostly find hybrid roles in these settings, with the "M-shaped" profile, where individuals bring both depth and breadth of expertise.

Consequently, every role in a new setting can be a lead, a shadow, and a sparring one in some capacity or scenario (which answers the first question above). This is where multidisciplinarity becomes crucial; it’s no longer about focusing on small project components but about taking on bigger responsibility and contributing to multiple areas.

This leads me back to my intro section, where I explained the wisdom of the SWE → DE → DA/DS/SA → DPM flow.

In other words, if you get a chance, staff the people with knowledge spanning multiple areas. As knowledge is power (lat. Scientia potentia est), they will understand what comes "before" and what comes "after." Their ability to lead, shadow, and spar effectively will enhance the quality and efficiency of project delivery.

With this said, it’s important to note that the intensity of every role isn’t uniform across all phases.

— Why?

Because the workload distribution shifts as development evolves. The technical roles are most active during the core technical phases, and non-technical roles maintain consistent engagement across all phases.

To better understand how these roles evolve, I’ve created a matrix that maps their involvement across the four phases of new developments.

LSS Dynamics per New Data Setting Phase and Data Role (Image by Author)

This matrix shows how LSS capabilities shift across four development phases, and their involvement intensity can be summarized as follows:

  • Lead: Takes the primary ownership and direction.
  • Lead/Sparring: Primary leads but also actively provides feedback.
  • Sparring: Provides parallel delivery, support, and/or feedback.
  • Shadow: Observes with limited active participation, but prepares the work to continue the delivery process.
  • Shadow/Sparring: A combination of observation, preparation, parallel delivery, and providing occasional feedback.

Let’s dive into their tasks and a more detailed scope of the work per project phase.

1 – Preparation Phase

This phase is about laying the groundwork. It’s where you identify the potential threats (business and technical problems that can be expected), outline probable solutions, and then translate these into project/product work packages, architectural plans, and budget estimates. With this in mind, the work-task distribution of the different data roles usually looks like this:

Lead roles:

  • Business/Technical Analyst: Typically, this role is the lead in preparation, as they gather requirements and assess feasibility. The role aims to bridge the business and development inputs and help with "boundary setting." More precisely, it provides inputs on what you need, which functionalities to focus on, and how to set realistic timelines to meet business and technical expectations.
  • Team/Product/Project Lead: Coordinates tasks, timelines, approvals, and overall planning. This role maps out the big picture – project timelines, team composition, communication channels, and high-level solution design. Their main task is to organize the other roles according to their strengths and ensure the structure is in place to keep development on track in future phases. They are responsible for communicating with project sponsors and serve as SPOCs (single points of contact).

Shadow/Sparring roles:

  • Domain Expert/Stakeholder: Functions as a shadow and sparring role, advising on specific business enquiries, and works closely with other business and technical roles on full plan creation. It usually pinpoints relevant data sources and supports setting the project success metrics.
  • Engineer/Architect: Acts as a sparring role here, contributing to the technical plan (data architecture and service definition) and feasibility assessment, but not as a lead. Their early architectural decisions impact important aspects of the data platform (e.g., data quality, platform scalability, or even delivery speed).
  • Analyst/Scientist: Limited involvement; has a shadow role and contributes in a similar way to a Domain Expert. They usually provide inputs to support the creation of the architecture by listing their requirements for specific analytical/data features that data services (e.g., BI or ML platforms) should have.

2 – Prototyping Phase

This is where technical plans become tangible, and the technical data roles start their hands-on part in delivering PoCs and MVPs that will be "brought to life" in the next phase.

Lead roles:

  • Engineer/Architect: Responsible for building the initial prototype of the integration pipelines, with a definition of the data storage and management architecture, data quality tools, and orchestration services. In this phase, they aim to ensure that the selected technical infrastructure supports smooth data flow and scalability, especially if this is a first-time build.
  • Analyst/Scientist: Steps up to a lead role, building and prototyping data models or algorithms that will be new data products. They run preliminary analyses to explore data’s potential and design early-stage self-service models, or BI/ML reports/dashboards.

Shadow/Sparring roles:

  • Business/Technical Analyst: Becomes a sparring role, focusing on refining business requirements based on prototyping inputs from technical colleagues. In summary, they gather and share feedback from the business and technical side to ensure new data products are aligned with end-user expectations.
  • Team/Product/Project Lead: Acts as a sparring role, overseeing and resolving project blockers – e.g., resolving connectivity issues to the source/target systems. In addition, they keep track of the project scope and ensure that business requirements don’t change too much.
  • Domain Expert/Stakeholder: Maintains a shadow/sparring role, with the task of validating the data product prototypes and providing feedback for their design. They check for the relevance of the delivered MVPs, ensuring the initial data products align with the expected business impact.

3 – Productionalizing Phase

The production phase brings data products to life. This stage also covers the last-mile development and testing/improvement of new developments. It is a critical phase focused on deployment and its coordination, requiring collaboration between technical and business roles.

Lead roles:

  • Engineer/Architect: Remains in the lead seat by overseeing the deployment of the different data pipelines, performance adjustments, and maintenance setups to ensure a smooth production run.
  • Analyst/Scientist: Also a lead role, responsible for supporting the deployment of data models and their performance monitoring. Accordingly, they implement the necessary tweaks to deliver fully functional data products to end-users.
  • Team/Product/Project Lead: Has a lead role and responsibility for managing release coordination and communication with stakeholders. They manage the final handover to operations and conduct a post-launch review to assess the project’s performance.

Shadow/Sparring roles:

  • Business/Technical Analyst: Moves to a sparring role, supporting user acceptance testing (UAT) and documenting processes. They assist with creating user guides and training resources for new developments.
  • Domain Expert/Stakeholder: Remains in shadow mode, validating final outputs and giving the data products a sign-off, ensuring they align with business goals and are ready for market use.

4 – Monetizing Phase

In the monetizing stage, non-technical roles are essential to ensure the new data platform and new products start creating actual business value.

Lead roles:

  • Business/Technical Analyst: A lead role that defines and collaborates on pricing strategy. If necessary, it also conducts market analysis to identify potential revenue streams. They articulate the product’s value proposition, ensuring it resonates with potential customers and stakeholders.
  • Domain Expert/Stakeholder: Takes on a lead/sparring role, sharing market insights and helping differentiate the new developments within the market. Their industry knowledge directly supports the product’s positioning.
  • Team/Product/Project Lead: As a lead role, manages the monetization strategy, explores partnerships, tracks revenue, and reports performance back to the team and project stakeholders.

Shadow/Sparring roles:

  • Engineer/Architect: As a shadow/sparring role, ensures the data infrastructure can scale and remain secure as the product is monetized. They set up the necessary compliance measures to align with market standards, especially if data privacy regulations are involved.
  • Analyst/Scientist: Acts as a sparring role and provides analytical and data product development insights to support the business side in monetizing new developments.

Final remarks

Building a new data platform isn’t just about modern technology or tools.

It’s more about selecting people with multidisciplinary expertise who can collaborate, and balance responsibilities in new settings to drive business value up.

After exploring different hybrid data roles and their dynamics across the four delivery phases, I wanted to showcase how the success of a new data platform depends on each role’s LSS capabilities.

So, the next time you find yourself at the beginning of a new development, ask yourself one question: Do I have the right people with the right knowledge?


Thank you for reading my post. Stay connected for more stories on Medium, Substack ✍ and LinkedIn 🖇 .


Looking for more similar reads?

My 3 posts related to this topic:

  1. Decoding Data Development Process | How Is Kowalski Providing "Ad-Hoc" Data Insights?
  2. 3 Animated Movie Quotes to Follow when Establishing a Data and Analytics Team | Kung Fu Panda quotes for the future Data and Analytics leaders
  3. Opening Pandora’s Box: Conquer the 7 "Bringers of Evil" in Data Cloud Migration and Greenfield Projects | A guide to conquering cloud migration challenges

The Two Sides of Hiring: Recruiting vs. Interviewing for Data Roles in Diverse Markets

Factors of success in recruiting and interviewing after applying for 150+ positions and reviewing 500+ CVs in 4 different countries

Stories From Both Sides of the Table

I have stories from sitting on both sides of the hiring table, some successful and others not so successful.

For example, I can tell you stories about how I was successful in the interview process and got a job offer when:

  • I had zero industry experience but had shown my academic knowledge and interest in working in a customer-facing data [%] role, even though I didn’t speak the local language.
  • In the last round of interviews for another data [%] role, the CEO cracked a joke, and I was the only one in the room who didn’t laugh at it. However, I convinced him and other colleagues that I was responsible and could build the cloud data platform they would need.

Or, I can share with you my stories about how I didn’t get a job offer when:

  • I failed to convince the VP, senior manager, and HR lead at my final-round case study presentation to hire me over another candidate, even though I thought I aced my presentation.
  • In one interview, I spoke too much and expressed my strong view on the evolution of the company data landscape by stating, "I think you should not build a BI team only, but rather focus on building an ecosystem of data teams and start by hiring data engineers first."

Furthermore, I can share stories of success in making good decisions in staffing when:

  • I structured the hiring process to have at least three interview rounds with my peers, team colleagues, manager, or talent acquisition colleague and me.
  • I went with my "gut feeling" and selected the candidates who had fewer years of experience than listed in the job vacancy but were able to debate the data problems with correct methodologies and reason their approaches with a business value in mind.

And the stories of making less successful decisions in staffing when:

  • I did not pay enough attention to the candidate’s low adaptability to new technologies or soft skills.
  • I didn’t do a reference check, although I had doubts about the truthfulness of the CV.

I could go on with more success and non-success stories and share more firsthand examples of job hunting and staffing. However, I want to delve deeper into this post and focus on the common factors contributing to success on both sides of the hiring table.

But first, to provide a clearer picture, I will step back and quantify my experiences a bit:

  • The stories I shared with you are collected from multiple markets, or more precisely—4 different ones. For confidentiality purposes, I won’t mention which story is coming from which market or what the industry is.
  • During the past 6+ years of my career, I applied for more than 150 data positions in 3 European markets—1 native to my nationality and 2 foreign ones. My job search in foreign markets was successful, and as my experience grew, my career grew too. This allowed me to cover different data roles, such as data analyst, data scientist, and data engineer, and become a data lead in these markets.
  • Consequently, the data lead role enabled me to build the team(s) and lead the hiring for various data roles in 4 diverse markets—3 European and 1 non-European—for the past 3+ years, during which time I reviewed over 500 CVs and interviewed more than 100 candidates.

In summary, this is how I gained perspective on getting a job offer or making a good staffing decision in diverse markets. I’ve learnt that it’s not just one factor that makes the difference, but a combination of them.

And, as mentioned before, I will showcase in this post the success factors that worked for me on both sides of the hiring process.

Now, let’s dive in.

Side 1: My factors of success in interviewing and landing a job offer

In this section, I will provide the most important factors that helped me land data interviews, progress through the hiring process, and eventually secure job offers in different markets.

(Spoiler alert is in the image below.)

My factors of success in the job hunt [Created by Author using draw.io]

Before anything else, you need to know that landing a data role across borders requires you to answer one question: "What skills do I have over someone who is already in this market, and why would I be a better pick to hire?"

I am telling you this because, in most markets, you have Ad A) immigration quotas and Ad B) work permit requirements for foreign citizens.

This was the case for me in both foreign (European) markets where I searched for the job opportunity.

However, there was some good news for me.

In the markets where I searched for a job, the data roles were on the shortage occupation list (even with a 5-year time difference between the job searches in the two markets).

In addition, my higher university degree made me eligible to get a visa for a skilled worker.

These were my first and second factors of success: my education and technical expertise.

Yet, again, landing interviews and the first industry job in the foreign market didn’t go without additional effort.

In my first search, after more than one month of straight rejections and not getting even a single call of interest on my CV, I learnt that you need to have a strategic approach to the job search itself and be familiar with the market’s particularities.

With this said, I researched the market standards about CV structures and cover letters to understand better what can land me more interviews.

For example, after I changed my CV according to specific market criteria—like adding my professional photo, birthday, and even my marital status—I saw higher interest in my CV. As crazy as it might sound to someone, in some markets this is still standard, or at least welcomed.

On the technical side, I learnt how to structure my CV better using the Google-preferred CV format and the XYZ formula, bolding the keywords aligned with numerical achievements.

Example given: "Secured the company’s Microsoft Partner status by completing certification for DWH implementation and data model development using SQL in the first 2 months of employment." or "Co-developed a Spark ETL framework, saving an average of 10 man-hours per month on a greenfield Data Lake project."

In addition, I followed the rule that my CV was a maximum of two pages long.

Although these points are not a "CV-local trait," but rather a "CV-global trait," they boosted interest in my CV, so it’s worth mentioning them.

This was my third factor of success: familiarization with the local (market) and global resume standards.

When this was settled, the next factor that helped me out was the speed at which I applied for roles.

Simply put, I was applying daily for the new vacancies. Waiting for the weekends to do bulk applications was not an option, as I knew that companies tend to close job ads quickly if the application count is high.

So, I activated the job notifications on my LinkedIn, the main job portal sites, and the career sites of the big companies in the market.

This consistency in job applications ensured I got noticed early in the hiring process and resulted in more interviews.

These were my fourth and fifth factors of success: speed and consistency in job applications.

Now you wonder what to do next and how to move from only interviews to landing an offer—especially when you are not yet on the market.

What worked for me in all cases was that I pointed out my curiosity, potential, and flexibility. For every role I applied for, I stressed my interdisciplinary skills and that I could cover more than the tasks described in the vacancy and support business development. With time, this helped me develop the M-shaped profile—i.e., being eager to work on multiple roles and cover both technical and business sides in the data area.

This was my sixth factor of success: interdisciplinary skills.

However, getting constant rejections one to two months in a row sometimes resulted in low morale. So I had to keep my big goal in mind and focus on what was important. Even though time passed, I invested it in further developing my skills and adding new "bullet points" to my resume. But I never gave up on my search.

This is my seventh and most important factor of success: determination.

Being persistent was also the factor that enabled me to grow in my career and experience the other side of the hiring process – recruiting.

Side 2: My factors of success in recruiting and selecting good candidates

In this section, I’ll share the "universal" factors that, in my experience, contributed to making successful hiring decisions in different markets.

(Again, spoiler alert is in the image below.)

My factors of success in leading the hiring [Created by Author using draw.io]

A successful hiring process starts with a foundation, or, to be precise, a defined structure and steps.

In my early days as a hiring lead, I learnt that a lack of structure or having different hiring process structures per market can lead to inconsistencies in the evaluation process. This makes it difficult to objectively compare candidates in different markets, leading to missed opportunities in staffing good candidates.

So, if I didn’t have a global or standardized cross-border interview process in my company, I would set it up myself.

With this said, I would aim to have at least three interview stages:

  • the general one (with me only, to check the team and the role fit),
  • the technical one (with me and at least 2 colleagues to check the use case presentation or hands-on coding), and
  • the final one (with the n+2 manager and talent acquisition manager to check the company fit).

This was my first factor of success in recruiting: a structured cross-border hiring process that was transparent and less biased.

While having the structure is important, being attentive in recruiting is equally important if you want to build a high-performing team.

This means that as a hiring lead, you need to "look beyond" the bullet points on a resume and find individuals who want to work in the international environment, collaborate, and contribute to business value.

You can do this by creating a set of standard (market-agnostic) questions to check the candidate’s motivations, interests, and ability to work in an international setting.

For example, I always ask one open-ended question: "How do you approach collaborating with colleagues who have different perspectives or working styles?"

This was my second factor of success in recruiting: focussing on a candidate’s potential and their ability to adapt and grow beyond technical skills.

During this whole process, I try to maintain a positive candidate experience.

With this said, in every market and at every interview stage, I come prepared with a structure that includes an "icebreaker" intro to present the company, the team, and the newly opened role. I try to encourage open conversation and flow in every interview and speak less than the candidate, so I have a higher chance of identifying the top talents.

On top of this, I always "leave my doors open" and mention to the candidates that they can reach out to me via email at any moment during the interview process. Finally, I tend to provide timely feedback and minimize unnecessary assessments and delays in the recruiting process.

This was my third factor of success in recruiting: creating a positive candidate experience to be able to attract the top talents in every market.

With this last factor, I have learnt that I can build strong cross-market relationships with future employees even before they join the team.

In addition, building relationships with candidates during the hiring process has already decreased my recruiting efforts in one market. It allowed me to reach out to a talented individual in the network and offer him a newly opened position in the team.

Key Takeaway

This post aimed to show you the factors of success I collected after experiencing how it is to "sit on both sides" of the hiring process.

From my past experiences, I have understood that successful hiring, regardless of the market, is a two-way street.

For candidates, this means finding a place where they will feel appreciated and be able to grow, and for hiring managers, it means finding new colleagues who will enjoy working with the team and for the company.

To achieve this win-win situation, regardless of which side you sit on, track your success factors and establish a process that will work for you in the long run.

Until next time, happy interviewing and recruiting!


Thank you for reading my post. Stay connected for more stories on Medium, Substack ✍ and LinkedIn 🖇 .


Looking for more data career reads?

My 3 posts related to this topic:

  1. End-of-Year Report on a 12-Year Data Journey | Three stories about the data career journey
  2. Why Formal Education Is Beyond Degrees and Income: The Broader Impact | Inspired by The Forbes Article, I Share My Two Cents on The Power of Formal Education
  3. Growing a Data Career in the Generative-AI Era | Raising awareness about learning three fundamental data concepts

The post The Two Sides of Hiring: Recruiting vs. Interviewing for Data Roles in Diverse Markets appeared first on Towards Data Science.

]]>
Opening Pandora’s Box: Conquer the 7 “Bringers of Evil” in Data Cloud Migration and Greenfield… https://towardsdatascience.com/opening-pandoras-box-conquer-the-7-bringers-of-evil-in-data-cloud-migration-and-greenfield-d7a18912f2a1/ Tue, 08 Oct 2024 01:28:28 +0000 https://towardsdatascience.com/opening-pandoras-box-conquer-the-7-bringers-of-evil-in-data-cloud-migration-and-greenfield-d7a18912f2a1/ A guide to conquering cloud migration challenges

The post Opening Pandora’s Box: Conquer the 7 “Bringers of Evil” in Data Cloud Migration and Greenfield… appeared first on Towards Data Science.

]]>
Opening Pandora’s Box: Conquer the 7 "Bringers of Evil" in Data Cloud Migration and Greenfield Projects
"Despite warnings, Pandora was curious, and she opened the jar, releasing the evils of the world - leaving only hope trapped inside." [Photo by Bailey Heedick on Unsplash]
"Despite warnings, Pandora was curious, and she opened the jar, releasing the evils of the world – leaving only hope trapped inside." [Photo by Bailey Heedick on Unsplash]

Pandora, the first mortal woman, was created by the gods as part of Zeus‘s plan to punish humanity for Prometheus‘s theft of fire [1].

She was gifted with beauty and intelligence, and Zeus sent her to Epimetheus, Prometheus’s brother. For the wedding gift, Zeus gave Pandora a jar (often interpreted as a "box") and warned her never to open it [1].

Despite warnings, Pandora was curious. She opened the jar, releasing the world’s bringers of evil – leaving only hope trapped inside [2].

Since then, "to open a Pandora’s Box" has been synonymous with doing or starting something that will cause many unforeseen problems [3].

Comparing this to my professional life, the only occasion I felt like I had opened "Pandora’s Box" was when I began working on a data cloud migration/greenfield project several years ago.

And the funny thing is that this thought hasn’t changed years later, even after working on two additional and almost identical projects.

Not only did I experience new "bringers of evil" with every new data cloud migration project, but I also felt I had managed to release "hope" from the box.

Now, with (a bit) more wisdom, knowledge, and experience in a few data migration/greenfield projects, I will share the "7 bringers of evil" and how to overcome them.

Let’s dive in.

A Guide to Conquering Cloud Migration Challenges

I thought a lot about how to compare the mythical bringers of evil—envy, remorse, avarice, poverty, scorn, ignorance, and inconstancy [3]—to the real-world challenges I’ve seen or gone through in data cloud migration/greenfield projects.

And, as a result, I first created a one-sentence explanation of what a specific bringer of evil can cause in a project.

Then, I provided the [hypothetical] project scenario for more context.

Lastly, I provided my input on best- and worst-case solutions, i.e., how to "conquer" a specific [hypothetical] scenario in case it happens in your project.

The flow of explaining and presenting solutions to project challenges [Diagram by Author]

1. Envy

Comparing your migration project unfavourably to other projects, leading to poor project planning and unrealistic expectations.

Scenario:

You did your research, you talked to people, you called in consultants, and they all confirmed it: "The company XY, which is similar to your company, managed to migrate their whole data platform to the cloud in only 10 months."

By all means, this implies only one thing – you should be able to migrate in the same period, if not faster.

So, you start by creating a project plan, keeping in mind external inputs on the migration deadline. This leads to budget approval, which leads to the execution phase.

Then, suddenly, reality kicks in during the project execution phase.

  • You start realizing that your on-prem infrastructure is more complex, and you have security restrictions that don’t allow you to connect to the source database(s). You then understand that you can’t even use a 3rd-party data integration tool to move your data to the cloud. You need to develop a new solution to overcome this problem.
  • Then you realize that your legacy development depends on the data sources, which you are not allowed to move to the cloud without approval, and you don’t have this approval.
  • Then you figure out that 15 to 20% of the data stored locally is no longer used by the business, but you didn’t have time to do a proper analysis to design a clean cloud file storage or database landing layer, and you need to do it now.
  • …..

And the problems just started piling up because the planning was rushed and biased by external inputs.

On top of that, project "what-if" scenarios were not developed beforehand to have a "best-case" vs. "worst-case" implementation plan to justify what is happening.

What follows is that you need to communicate this information to project sponsors and inform them of the increased project scope and budget.

Not a fun thing to do.

Solution(s):

["Best-case"] OR [What you can do to prevent this]

  • Develop a customized migration plan: Create a migration plan that fits your organisation’s needs and infrastructure, rather than competing with external benchmarks.
  • Think about all pillars during planning: During the planning, consider technical, regulatory, and cost pillars, and plan contingency money (plus "what-if" scenarios) for project implementation.
  • Plan for Contingencies: From developed "what-if" scenarios, think and prepare how you will be able to resolve the "worst-case" ones if they happen.

["Worst-case"] OR [What you can do for damage control]

  • Take responsibility and be transparent: Ask for additional resources (both human and monetary ones), but this time be transparent in presenting the maximum project over-budget scenario. Outline the valid arguments/blockers you have faced, and take responsibility for what happened.
  • Highlight the long-term benefits: Emphasize to project sponsors that resolving these issues during the migration phase will lead to long-term cost savings on the data platform.
  • Focus on quick wins: If feasible, ensure that you speed up your developments in components where you don’t have blockers, so you can show the progress and maintain a "positive image" in front of the stakeholders.

2. Remorse

Regretting the selection of the new technologies and development principles, leading to delays in migration and impacting the solution design of the new data components.

Scenario:

You started your cloud migration, designed the architecture, selected the cloud services, and finally – the development began.

What could go wrong, ha?

  • Then you realize that some of the newly selected cloud services may be missing some of the features that you were so used to, or that are critical, in your legacy ones, and again – you need to find a solution or even an additional service to cover the core purpose of these features.
  • Then you understand that your legacy development standards should be adapted because you changed your data management system and didn’t develop the new development standards. Pressured by deadlines, you "freelance" in the development, creating a new, even bigger mess.
  • With all the new pitfalls, you dwell on your destiny and regret not starting the migration project 5 years ago, when your dataset was smaller and business requirements manageable.

Finally, in Cher‘s words, you sing (quietly) the title of the song "If I Could Turn Back Time." On a loop.

Solution(s):

["Best-case"] OR [What you can do to prevent this]

  • Conduct feature analysis beforehand: Before selecting cloud services, do a feature comparison with the legacy services. Involve technical teams early in the evaluation and run proofs-of-concept (PoC) before the project starts.
  • Create new development standards early: Think in advance about how cloud development principles differ from on-prem systems. Develop new standards by creating a mapping table against the legacy ones.

  • Again: contingency planning.

["Worst-case"] OR [What you can do for damage control]

  • Reevaluate technology choices quickly: When you realize some cloud services don’t fit your needs, temporarily pause the affected data migration component and re-plan the development while focusing on resolving this issue.
  • Temporarily assign extra or external resources to resolve blockers: To resolve missed feature gaps or lack of standards, assign additional resources (preferably consultants) to clean up these specific problems.
  • Again: take responsibility, communicate issues/needs transparently, and highlight the long-term benefits.

3. Avarice

Overemphasizing cost savings and future time-to-market value at the expense of critical components, like data quality, security, or performance, leading to higher costs and a poor data platform in the end.

Scenario:

You landed yourself a new job.

You were hired to develop the new data platform where architecture is already defined, services selected, and the first data integration pipelines are functional.

Your main task is to create value from the data as fast as possible, design the semantic layer from scratch, and engineer new data products for key business colleagues.

—No big deal.

  • You start by focusing on transformations and data modelling, and voila – the first data products are here, and you are the company star. Then the business comes to you and starts questioning the accuracy of the delivered insights. You return to your data, compare it to the source systems, and then realize they don’t match. Something went wrong in the data integration part, but you didn’t see this before because you didn’t implement quality checks.
  • In addition, you were so eager to deliver the data products faster that you didn’t even think about infrastructure nor tried to put in place multiple environments. Instead, you did your development in production. Yes, production.
  • As a cherry on top, you started designing machine learning data products as fast as possible, realizing only later that their development and run costs are high with a low return on investment (ROI).
  • ….

And here you are now.

You promised that your cloud costs would be lower than the on-prem ones, with the motto: "The storage is cheap in the cloud, and I don’t need SRE colleagues or a variety of data roles to develop a functional data platform."

Solution(s):

["Best-case"] OR [What you can do to prevent this]

  • Data quality as a mission: Plan time and staff human resources to set up quality checks at every stage of the data pipeline design—from the data integration to the final products that contain insights (a minimal reconciliation sketch follows after this list).
  • Design a stable and secure infrastructure: Plan time and staff human resources to set up multiple secure environments (dev, test, prod), even under tight deadlines. This allows you to develop quality and reliable data products.
  • Assess the costs of the data products: Before committing to deliver new data products, perform a cost-benefit analysis to assess their (potential) ROI. Avoid starting expensive developments without an understanding of the financial impact.
  • Again: contingency.
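
To make the "quality checks at every stage" point above more concrete, here is a minimal sketch of a row-count reconciliation between on-prem extracts and their cloud landing tables. This is not a full framework: the table names, the source_counts input, and the tolerance are hypothetical placeholders, and it assumes an authenticated google-cloud-bigquery client.

from google.cloud import bigquery

# Hypothetical counts taken from the legacy (on-prem) system; how you collect
# them depends on your source databases.
source_counts = {"sales_orders": 1_250_300, "customers": 98_457}

# Hypothetical mapping: on-prem table -> its landing table in the cloud.
cloud_tables = {
    "sales_orders": "my-project.landing.sales_orders",
    "customers": "my-project.landing.customers",
}

def reconcile_row_counts(tolerance: float = 0.0) -> list:
    """Compare source vs. cloud row counts and return a list of mismatches."""
    client = bigquery.Client()
    issues = []
    for name, expected in source_counts.items():
        query = f"SELECT COUNT(*) AS n FROM `{cloud_tables[name]}`"
        actual = next(iter(client.query(query).result()))["n"]
        # Flag the table if the relative difference exceeds the tolerance.
        if abs(actual - expected) > tolerance * expected:
            issues.append(f"{name}: expected {expected}, found {actual} in the cloud")
    return issues

for issue in reconcile_row_counts(tolerance=0.001):
    print("Quality check failed ->", issue)

Even a check this crude, scheduled per pipeline stage, would have caught the mismatch from the scenario above before the business did.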

["Worst-case"] OR [What you can do for damage control]

  • Shut down the high-cost, low-value data products quickly, and manage expectations: Implement a cost monitoring framework and shut down the data products with low ROI (a minimal cost-report sketch follows after this list). Show the numbers to the business transparently, and propose scaling back until business needs are more mature.
  • Again: take responsibility, communicate issues/needs transparently, and highlight the long-term benefits.
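
As one possible way to "show the numbers" mentioned in the first point above, here is a minimal sketch that approximates 30-day query spend per user from BigQuery's job metadata. The region qualifier and the price per TiB are assumptions (they depend on your location and pricing model), so treat the output as an estimate, not an invoice.

from google.cloud import bigquery

PRICE_PER_TIB_USD = 6.25  # placeholder on-demand rate; adjust to your contract

QUERY = """
SELECT
  user_email,
  SUM(total_bytes_billed) / POW(1024, 4) AS tib_billed
FROM `region-eu`.INFORMATION_SCHEMA.JOBS_BY_PROJECT   -- assumed region
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  AND job_type = "QUERY"
GROUP BY user_email
ORDER BY tib_billed DESC
"""

client = bigquery.Client()
for row in client.query(QUERY).result():
    cost = row["tib_billed"] * PRICE_PER_TIB_USD
    print(f"{row['user_email']}: ~{row['tib_billed']:.2f} TiB billed (~${cost:.0f})")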

4. Poverty

Cutting down the budget and resources allocated to the migration project and not being granted "feel-good" benefits, leading to high stress and low morale.

Scenario:

You received your project budget, and implementation is going as planned.

However, the project sponsors expect you to keep the development costs under the assigned budget, while speeding up the delivery.

You got the assignment to build an additional, remote team in a more budget-friendly location.

  • Although this was not originally planned, you acknowledge their input and start the hiring process. However, the process of staffing takes longer than expected. You initially invest 3 months in staffing the new resources, then another 3 months in repeated staffing to replace candidates who declined offers. But finally, the staffing is done, and the new team members are slowly joining the project.
  • You start making organizational and backlog changes to consolidate the project development that is now split between two locations.
  • In the meantime, your small local cloud team is investing daily effort to share the knowledge and onboard the new (remote) colleagues with the current project development.
  • The collaboration is going well, except for minor delays in communication/providing feedback due to different time zones. Because of this, you would like to reward your local team members, who cover 2 to 3 roles simultaneously and work long hours to ensure the project gets successfully delivered.
  • However, this is not feasible. You can’t get this extra money for travel to a remote location, team events, etc.
  • The dissatisfaction only piles up when you realize that you and the team are not allowed to visit your remote team, with whom you work daily, but the business colleagues are.

Solution(s):

["Best-case"] OR [What you can do to prevent this]:

  • Ensure a higher budget from the start: During the project planning phase, ask for a budget line that covers project team-building events and travel. Try to argue that "feel-good" benefits are not only perks but are important for maintaining team morale.
  • Plan for unforeseen changes: Even if you don’t expect to get an additional team in your project, you can adopt remote-friendly processes. Create project documentation (space), and adopt Scrum/Kanban to track your development. In addition, create communication channels/groups and organize virtual coffee sessions/open discussion weekly meetings. These will help with every new onboarding, regardless of the new team member’s location.

["Worst-case"] OR [What you can do for damage control]:

  • Request project onboarding time: When building a new remote team, communicate to management the additional time and resources required for onboarding. Get the "buffer time" in the project to ensure proper knowledge transfer and collaboration from the start of the cooperation. Ensure that the local team doesn’t constantly work long hours. The team’s well-being should be a priority, even if this means prolonging project delivery.
  • Focus on non-monetary recognition after budget cuts: Recognize the team’s hard work in a non-financial way. This includes flexible working hours/locations, skipping admin meetings, giving public acknowledgement, or even small tokens of appreciation like chocolate.
  • Negotiate for minimal travel: If travel between project locations isn’t generally feasible, negotiate for in-real-life meetings at least to celebrate the project’s main milestones.

5. Scorn

Facing internal resistance from colleagues who feel their legacy systems are being dismissed, leading to passive and active obstruction of the migration project.

Scenario:

You know it, business knows it, everyone knows it – the cloud data platform is the fastest way to meet rising business requirements and, if developed properly, stop the rising on-prem maintenance costs.

  • Yet, despite the obvious need for the migration, you begin hearing the same concerns from various colleagues: "Our legacy system is completely functional; why do you want to replace it?" – OR – "This cloud migration is just a trend, and you will fail in it." – OR – "The cloud is not secure enough, and we need to keep data in our on-prem platform." You only hear reasons why this can’t and won’t work.
  • On top of this, the resistance comes in actions too. The colleagues are not sharing inputs in the project planning phase. They show a lack of interest in providing support in the development of proof of concepts. Consequently, project approvals get delayed.
  • Then you manage to resolve the blockers in the project planning/preparation phase, staff the cloud-skilled colleagues, and the project takes off. However, the resistance is higher than ever. The same colleagues who blocked the project now feel excluded and share constant and negative feedback on the new development and colleagues. All this blocks the normal development of the project, as your focus needs to be on conflict resolution instead of delivery.

You seek advice from your business coach, and he tells you the following sentence: "Welcome to big-corp business; now you need to learn how to deal with it." (Side input: it was only a starting sentence said as a joke.)

Solution(s):

["Best-case"] OR [What you can do to prevent this]:

  • Include middle/high management to share the vision: Before the cloud migration starts, seek help from management and project sponsors to share the positive vision of the change and the strategic direction of why platform modernization is important for the business.
  • Organize workshops: Seek consultancy support in organizing workshops for everyone to get insights into the new technology and present customer success stories. This, too, should result in creating awareness of the cloud platform’s advantages and benefits.
  • Create a transition plan: To show the benefits of the cloud platform, create a plan for comparing data products on both platforms. In other words, during the PoC phase, compare the metrics of the same data product developed on-prem vs. cloud. Focus on improvements in performance, costs, and development steps, showing that the legacy work will not be disregarded but evolved.

["Worst-case"] OR [What you can do for damage control]:

  • Address resistance head-on: If you encounter individuals who actively resist and block the migration, try to address this behaviour in a direct conversation. Address their concerns by pointing out the positive aspects of the new development, then observe whether their behaviour changes.
  • Escalate if necessary: If resistance persists and affects the project delivery, escalate the issue to middle/higher management. Show them the impact of this behaviour, and if necessary, ask them to help out in distancing some individuals from the project. You will need support from leadership to help push through roadblocks caused by internal resistance.

6. Ignorance

Lack of expertise in cloud technologies and migration best practices, increasing the risk of failure and delays.

Scenario:

You were assigned a new data-stack modernization project from management, and the expectation is to deliver the first PoCs in the next quarter.

You have never worked on anything similar and don’t know where to start.

Your colleagues and you start comparing the existing cloud platforms and services, and you select "the one" you will work with.

  • However, no one in the existing team is familiar with the new technologies or data migration best practices related to them.
  • So, you staff one or two new colleagues, or "fresh blood," who are enthusiastic techies but haven’t had experience in similar projects before.
  • They engage with new technologies, catch up very quickly, and manage to showcase new PoCs. And all of this is done before you get the official project approval and budget.
  • The enthusiasm on your and the team’s side gets higher, and you think that everything from now on will be smooth sailing.
  • Then the project officially starts, and you get a tangible budget. All happy, you again start staffing new colleagues. But this time, you pick the people who have worked on similar projects.
  • And they all bring new ideas and migration concepts from their previous roles. This results in discarding the initial concepts and PoCs you had. And again, you are starting the development from scratch.
  • As the initial 3–4 months of effort are gone, you find yourself behind schedule from the start of the project.
  • This causes pressure on your team and you, and some colleagues even leave the project due to this.
  • …..

Solution(s):

["Best-case"] OR [What you can do to prevent this]

  • Conduct pre-project skills gap analysis: Before starting the cloud migration, identify the skills you need for your cloud project. Then try to get a hiring budget before the project budget. Following this, ensure that the same people who work on PoCs will work on the cloud development.
  • Commit to PoC development: Instead of rushing the PoC development with a quick-win solution, invest additional time in the design of several solutions. Approach this problem strategically, and for one data pipeline, try to develop and test two PoCs. On top of the "theoretical" one (read: best practices), a hands-on approach in selecting the optimal solution will bring your team confidence in the development phase.

["Worst-case"] OR [What you can do for damage control]

  • Reassess the staffing and get expert consultants: If the mixed ideas on the development solutions start causing project delays, bring in external consultants who can help with their expertise. These people can assist in improving the team’s learning curve and ensure the project stays on track.
  • Go with the suboptimal solution: Pick the suboptimal solution for part of your development if you have an existing skillset in the team. Existing hands-on experience can speed up the project’s delivery. For example, pick the programming language that everyone is already familiar with.
  • Again: take responsibility, communicate issues/needs transparently, and highlight the long-term benefits.

7. Inconstancy

Changing project scope and priorities, causing confusion and disrupting planned project delivery.

Scenario:

  • You: So, there are new requirements for this project component?
  • Your colleague: Yes, and some additional ones are still being discussed.
  • You: Really?
  • Your colleague: Yes, really.
  • You: But you know the project deadline is in 5 months? And what’s with that – is it still the same?
  • Your colleague: Mhm.
  • ……………………………………………
  • Here we go again. For the fourth time since the project started, the business requirements have changed.
  • Twice in the current release.
  • Some work that you already delivered is now out of the project scope. However, new work has been included in the scope. And you guessed it – with no extra time for this development.
  • Similar to the previous iterations, you return this information to your team, which only confuses them more.
  • And again – you know it, they know it, everyone knows it – this time, the changed project scope will break the deadline.
  • ….

Solution(s):

["Best-case"] OR [What you can do to prevent this]

  • Develop a change management process: Establish rules and guidelines for handling unplanned requirements before the project starts. Ensure that this process includes an impact analysis on the timelines and obtaining approval for the requested changes from the project sponsors.
  • Introduce "hard deadlines": Lock down the changes in the project scope early enough and define the "hard deadlines" to get fixed requirements. In addition, track your project backlog delivery carefully and communicate the pending critical work to the requesters in case of new requirements.
  • Plan project meetings: Organize weekly meetings with project managers, business requirements engineers, technical analysts, and the delivery team. Ensure that everyone is on the same page when it comes to understanding the pending backlog and the timelines.

["Worst-case"] OR [What you can do for damage control]

  • Escalate fast: Escalate the changes in the project scope to your project sponsors. Create awareness of how these changes affect the deadline and the project outcome.
  • Re-prioritize even faster: If the new requirements have an impact on critical components, prioritize this development by re-planning lower-priority tasks for later project phases.
  • Again: take responsibility, communicate issues/needs transparently, and highlight the long-term benefits.

See no evil 🙈 , hear no evil 🙉 , speak no evil 🙊

Although cloud migration or greenfield projects can often feel like opening Pandora’s Box, it doesn’t mean they are not rewarding to work on.

Despite all the challenges I experienced and saw in these projects, the learnings I got from them resulted in my personal and professional growth. Bigger and more positive than I was ever able to imagine.

I have learnt that good preparation, contingency planning, taking responsibility for mistakes, staffing skilled people, maintaining transparent communication towards sponsors, and consistently highlighting long-term benefits always lead to positive project outcomes.

In summary, it’s important to stay proactive in every challenging situation and find a solution for it.

Hence, in this blog post, I’ve shared my solutions in the hope you can reuse them to handle the challenges in your cloud migration/greenfield projects.

Until next time, happy ☁ data migration planning & development!


Thank you for reading my post. Stay connected for more stories on Medium, Substack ✍ and LinkedIn 🖇 .


References

[1] Theoi Greek Mythology: Pandora, https://www.theoi.com/Heroine/Pandora.html, accessed on September 25, 2024

[2] Britannica: Pandora, https://www.britannica.com/topic/Pandora-Greek-mythology, accessed on September 25, 2024

[3] Wikipedia: Pandora’s box, https://en.wikipedia.org/wiki/Pandora%27s_box, accessed on September 25, 2024

The post Opening Pandora’s Box: Conquer the 7 “Bringers of Evil” in Data Cloud Migration and Greenfield… appeared first on Towards Data Science.

]]>
Semantic Layer for the People and by the People https://towardsdatascience.com/semantic-layer-for-the-people-and-by-the-people-ce9ecbd0a6f6/ Mon, 23 Sep 2024 19:29:54 +0000 https://towardsdatascience.com/semantic-layer-for-the-people-and-by-the-people-ce9ecbd0a6f6/ My 3 [+1] jokers with templates for building a powerful analytical semantic layer

The post Semantic Layer for the People and by the People appeared first on Towards Data Science.

]]>
TL;DR:

My 3 straightforward and one hidden Joker are:

  • Joker #1: Pattern-Driven Repository Structure 🗂
  • Joker #2: Organized Code 👩🏻💻
  • Joker #3: (Non-) Embedded Documentation 📜
  • [🃏 Hidden Joker: Refinement Loop 🃏]
"Simple and Consistent. - This is how I would describe to someone the 2 most important dimensions to keep in mind when building a semantic layer." [Photo by Zuzana Ruttkay on Unsplash]
"Simple and Consistent. – This is how I would describe to someone the 2 most important dimensions to keep in mind when building a semantic layer." [Photo by Zuzana Ruttkay on Unsplash]

Semantic – the study of linguistic meaning

According to Wikipedia, the term semantics refers to the study of linguistic meaning, which examines what meaning is.

Or, per se – how words get their meaning and how the meaning of a complex expression depends on its parts [1].

Although the term semantic is straightforwardly explained, I honestly had to dwell for a while on the part "the meaning of a complex expression depends on its parts", because I wanted to re-use it to explain the semantic layer in analytics.

After rereading it, my explanation goes as follows:

Similar to semantics in the linguistic context, the semantic layer in analytics is about making data meaningful.

Much like words together form a specific meaning, which leads to an understanding of what was said, the raw data from different sources gets enriched and forms specific insights.

Much like meaning depends on the combination of the words in expressions, the outcome derived from the raw data depends on the modelling approach in the semantic layer.

And much like properly structured and well-formed expressions lead to ease of understanding, properly modelled data leads to quality data insights.


In summary, it is all about how one can create better, faster, and novel value from raw data, which leads to insights and an understanding of the business action(s) to take.


This is the core purpose of the semantic layer, and building one is filled with numerous challenges.

The two key ones that I always face when building a semantic layer are (1) simplicity and (2) consistency.

Or, better said, how to achieve them both.

  • When I try to simplify my semantic layer by placing focus on one area of my analytical development, this usually undermines my consistency.
  • And when I try to make my semantic layer consistent, it usually results in being overly complex.

So, finding a balance, or more precisely, the structure of the semantic layer that will be both simple and consistent, is relevant as it impacts the speed and sometimes the quality of delivering business value.

Hence, in this post, I’ll share my 3 "jokers" for balancing these challenges while creating a semantic layer that’s built to evolve with your business needs.

My explanations, with the provided visual templates, will mostly be based on how I built a semantic layer using Looker.

However, they will be generalized and can be applied to building a semantic layer in any other BI or modelling tool (like dbt Core or Dataform).

Let’s dive in.

Joker #1: Pattern-Driven Repository Structure 🗂

The repository structure of the semantic layer is the blueprint that you need to focus on. As a foundation task, it really needs to be both simple and consistent.

Why, you ask?

  • Because the proper repository structure of the semantic layer needs to serve both technical (for data modelling and testing development) and business colleagues (for the self-service analytics part). Long story short, it needs to be understandable to both sides.
  • Because business requirements will "explode" with time, and so will your analytical development. If you don’t keep track of the semantic layer repository structure patterns, the chances are high your development will result in redundancy and it will lack proper development standards.

So, what are the key components that a good semantic layer repository should contain:

  • (1) Layers of folders or sub-folders—create clear layers in your repository hierarchy, separating business logic from technical logic. This way, both technical and business users can "be on the same page" when they talk about the same data model or the same method for comparing the data insights (e.g., specific period-over-period method or specific forecasting model).
  • (2) Naming conventions—correlated to the first component, naming conventions play an important part in standardizing development and keeping your repository structure tidy. Properly defined naming conventions result in faster data modelling by following consistent patterns and also ease navigability during the troubleshooting process.

To provide more context to my theory above, I will explain visually my blueprint for the semantic layer repository structure.

.
└── Semantic Layer/
    ├── area_association_rules/
        ├── models/
          ├── association_rules.model.lkml
          ├── frequent_items.model.lkml
        ├── views_shop/
          ├── association_rules.view.lkml
          ├── frequent_items.view.lkml
    ├── area_benchmarking/
    ├── area_demand_forecast/
    ├── area_financial_forecasting/
    ├── area_customer_intelligence/
    ├── ....
    ├── area_business_controlling/
        ├── models/
        ├── views/
        ├── views_derived/
        ├── views_aggregated/  
    ├── area_finances/
    ├── area_logistics/
    ├── area_performance_marketing/
    ├── ....
    ├── base_views/
    ├── base_date_granularity/
    ├── base_date_granularity_customised/
    ├── base_pop_logic/
    ├── base_pop_logic_customised/
    ├── ....
    ├── data_tests/
    ├── documentation/
    ├── locales/
    ├── manifest.lkml
    ├── README.md

How I usually structure my semantic layer repository is that I separate it into the area_* and the base_* folders:

(1) The area_* folders.

  • The use of these folders is to focus analytical development on organizing the code into specific business areas (e.g., area_finances or area_marketing) OR the cross-shared business analytical cases (e.g., area_demand forecast, area_benchmarking, etc.).
  • Each area_* folder encapsulates the logic and code relevant to that specific business department or use case, and it’s further split into models and views_* folders.
  • The models folder in the semantic layer contains the model files, while the views_* folders contain the view files or aggregated/derived views. | 👉🏼 Note: More info about these files can be found in the next section.

(2) The base_* folders.

  • The use of these folders is to focus on separating the core logic or functionality, like customized timeframe analysis methods, and cross-shared views that are used in multiple business areas and encapsulated within numerous model files of the semantic layer.
  • This organization ensures that the common business modelling logic is not scattered across multiple areas, and it’s centrally managed, resulting in consistency across the whole semantic layer.

Applying this template, I experienced faster data development (of the whole team) and speed/ease in troubleshooting because I focused on creating a pattern-driven repository structure.

Having in mind the repository folders themselves contain different files, keeping the code tidy is equally important.

This leads me to my second Joker – creating clean code files.

Joker #2: Organized Code 👩🏻‍💻

In most of the semantic layers, there are common types of code files that a data project repository contains.

As an example, in the Looker semantic layer, the 3 main types of code files are model, view, and manifest [2].

| 👉🏼 Note: There are more file types in Looker, but I will focus on the three main ones listed above.

Every single one of the listed files has its own code logic, and this code logic can be organized.

Let me provide you again with visual templates of how I organized my code within specific files in Looker:

(1) The model file – contains information about the views (tables) and how they should be joined together in an Explore [2].

###########
# MODEL_NAME &amp; METADATA
# Description: This model reflects the [business context].
# Author: [Your Name] | Created on: [Date]
# Contributors: [Your Name] | Last change on: [Date]
###########

#########
# 1. DISPLAY &amp; CONNECTION PARAMETERS
#########

## 1.1 Label and Connection Setup
connection: "your_connection_name"
label: "Your Model Label"

#########
# 2. STRUCTURAL PARAMETERS
#########

## 2.1 Include Statements for Views
include: "../views/*.view"
include: "../views_aggregated/*.view"
include: "../views_derived/*.view"

## 2.2 Additional Include Statements (E.g. Logistics and Base Logic Views | Optional)
include: "/area_logistics/views/*.view"
include: "/base/*.view"
include: "/base_granularity/*.view"

## 2.3 Include Statements for Data Tests
include: "../../data_tests/*.lkml"
include: "../../data_tests/views/*.view"

#########
# 3. ACCESS CONTROL (Optional)
#########

## 3.1 Define Access Grants 
access_grant: access_grant_name {
  user_attribute: user_attribute_name
  allowed_values: ["value_1", "value_2"]
}

#########
# 4. EXPLORES
#########

## 4.1 Main Explore for Your Data Model
explore: explore_name {
  label: "Explore Label"
  view_name: view_name
  persist_for: "N hours"

  ## 4.1.1 SQL Filters and Conditions (Optional)
  sql_always_where:
    @{sql_always_where_condition_1}
    AND
    {% if some_field._in_query %} ${some_field} IS NOT NULL {% else %} 1=1 {% endif %}
    AND
    @{sql_always_where_condition_2};;

  ## 4.1.2 Joins for Additional Views
  join: another_view_name {
    type: left_outer
    relationship: one_to_many
    sql_on: ${view_name.field_name} = ${another_view_name.field_name};;
  }

  join: another_join_view {
    type: inner
    relationship: one_to_one
    sql_on: ${view_name.field_name} = ${another_join_view.field_name};;
  }

  # Add more joins to create data model
}

#########
# 5. DATA TESTING EXPLORE
#########

## 5.1 Define Explore for Data Testing (Optional)
explore: data_testing {
  label: "Data Testing Explore"
  view_name: view_name
  hidden: yes

  join: another_test_view {
    type: left_outer
    relationship: one_to_one
    sql_on: ${view_name.field_name} = ${another_test_view.field_name};;
  }

  # Add more testing joins as required
}

#########
# 6. MAP LAYER (Optional)
#########

## 6.1 Define Map Layer for Geographic Data (Optional)
map_layer: map_name {
  file: "../files/your_map_file.json"
  property_key: "your_property_key"
  label: "Map Label"
  format: topojson
  max_zoom_level: 15
  min_zoom_level: 2
}

(2) The view file – contains dimensions and measures accessed from a specific database table (or across multiple joined tables) [2].

###########
# VIEW_NAME &amp; METADATA
# Description: This view reflects the [business context] or [data source] it represents.
# Author: [Your Name] | Created on: [Date]
# Contributors: [Your Name] | Last change on: [Date]
###########

view: view_name {
  sql_table_name: `project.dataset.table_name` ;;

  #########
  # 1. DISPLAY PARAMETERS
  #########

  ## 1.1 Label for View
  # Specifies how the view name will appear in the field picker
  label: "Your View Display Name"

  ## 1.2 Fields Hidden by Default
  # When set to yes, hides all fields in the view by default
  fields_hidden_by_default: yes

  #########
  # 2. STRUCTURAL &amp; FILTER PARAMETERS (Optional)
  #########

  ## 2.1 Include Files 
  # Includes additional files or views to be part of this view
  include: "filename_or_pattern"

  ## 2.2 Extends View 
  # Specifies views that this view will extend
  extends: [another_view_name]  

  ## 2.3 Drill Fields 
  # Specifies the default list of fields shown when drilling into measures
  drill_fields: [dimension_name, another_dimension]

  ## 2.4 Default Filters for Common Queries
  filter: default_date_filter {
    label: "Date Filter"
    type: date
    sql: ${order_date} ;;
    description: "Filter data based on order date."
  }

  ## 2.5 Suggestions for Dimensions
  # Enables or disables suggestions for all dimensions in this view
  suggestions: yes  

  ## 2.6 Set of Fields
  # Defines a reusable set of dimensions and measures
  set: set_name {
    fields: [dimension_name, measure_name]
  }

  #########
  # 3. DIMENSIONS
  #########

  ## 3.1 Simple Dimensions (Directly from DB)
  dimension: dimension_name {
    label: "Dimension Display Name"
    type: string
    sql: ${TABLE}.column_name ;;
    description: "This dimension represents [business context] and contains values like [example]."
  }

  dimension: another_dimension {
    label: "Another Dimension Display Name"
    type: number
    sql: ${TABLE}.other_column ;;
    description: "Explanation of the dimension, including business context and possible values."
  }

  ## 3.2 Compound Dimensions (Concatenated from Existing Dimensions)
  dimension: compound_dimension {
    label: "Compound Dimension Name"
    type: string
    sql: CONCAT(${dimension_name}, "-", ${another_dimension}) ;;
    description: "A compound dimension created by concatenating [dimension_name] and [another_dimension]."
  }

  ## 3.3 Derived Dimensions (Filtered/Grouped Values from Existing Dimensions)
  dimension: filtered_dimension {
    label: "Filtered Dimension Name"
    type: string
    sql: CASE
            WHEN ${dimension_name} = 'specific_value' THEN 'Subset Value'
            ELSE 'Other'
         END ;;
    description: "This dimension subsets values from [dimension_name] based on specific business rules."
  }

  ## 3.4 Tiered Dimension (Grouped by Tiers)
  dimension: order_amount_tier {
    label: "Order Amount Tier [€]"
    type: integer
    tiers: [50, 100, 150]
    sql: ${revenue_column} ;;
    description: "This dimension creates tiers of order amounts based on thresholds (50, 100, 150)."
  }

  #########
  # 4. MEASURES
  #########

  ## 4.1 Simple Aggregated Measures (Sum, Count, Average)
  measure: total_revenue {
    group_label: "KPIs"
    label: "Total Revenue [€]"
    type: sum
    sql: ${revenue_column} ;;
    value_format_name: currency_format
    description: "Total revenue, summing up all revenue from each record."
  }

  ## 4.2 Calculated Measures (Derived from Existing Measures)
  measure: profit_margin {
    group_label: "KPIs"
    label: "Profit Margin [%]"
    type: number
    sql: (${total_revenue} - ${cost_column}) / NULLIF(${total_revenue}, 0) ;;
    value_format_name: percent_2
    description: "Calculated profit margin as (Revenue - Cost) / Revenue."
  }

}

(3) The manifest file – a configuration file that contains project constants, code for using files imported from other projects, localization settings, and declarations for adding extensions or custom visualizations [2].

###########
# MANIFEST_INFO &amp; METADATA
# Description: This file reflects the [business context].
# Author: [Your Name] | Created on: [Date]
# Contributors: [Your Name] | Last change on: [Date]
###########

#########
# 1. STRUCTURAL PARAMETERS
#########

## 1.1 Project Name &amp; LookML Runtime
project_name: "Current Project Name"
new_lookml_runtime: yes 

## 1.2 Local Dependency
local_dependency: {
  project: "project_name"
  override_constant: constant_name {
    value: "string value"
  }
}
# Add additional local dependencies as needed.

## 1.3 Remote Dependency (Optional)
remote_dependency: remote_project_name {
  url: "remote_project_url"
  ref: "remote_project_ref"
  override_constant: constant_name {
    value: "string value"
  }
}
# Add additional remote dependencies as needed.

## 1.4 Constants (Optional, but useful)
constant: constant_name {
  value: "string value"
  export: none | override_optional | override_required
}

#########
# 2. LOCALIZATION PARAMETERS
#########

## 2.1 Localization Settings
localization_settings: {
  localization_level: strict | permissive
  default_locale: locale_name
}

#########
# 3. EXTENSION FRAMEWORK PARAMETERS (Optional)
#########

## 3.1 Application Definitions
application: application_name {
  label: "Application Label"
  url: "application_url"
  file: "application_file_path"

  ## 3.1.1 Mount Points
  mount_points: {
    # Define mount points here (refer to the application page for more details)
  }

  ## 3.1.2 Entitlements
  entitlements: {
    # Define entitlements here (refer to the application page for more details)
  }
}
# Add additional application declarations as required.

#########
# 4. CUSTOM VISUALIZATION PARAMETERS (Optional)
#########

## 4.1 Visualization Definition
visualization: {
  id: "unique-id"
  label: "Visualization Label"
  url: "visualization_url"
  sri_hash: "SRI hash"
  dependencies: ["dependency_url_1", "dependency_url_2"]
  file: "visualization_file_path"
}
# Add additional visualizations as needed.

Finally, after presenting the visual examples, I will list the two traits of organizing the model, view, and manifest files in a similar way:

  • (1) Design clarity: By using enumeration in the code sections, the code logic can be more easily tracked, adapted, extended, and troubleshot.
  • (2) Consistency framework: By creating patterns of your code files, it creates uniformity in development. This results in easier collaboration and faster development delivery.

These two traits lead me to a Joker that is correlated with the code structure and organization, i.e., embedding of the documentation.

Joker #3: (Non-) Embedded Documentation 📜

When I start with my analytical development in the semantic layer, I always focus on the input-output flow or the process flow.

I can’t stress enough the reason for this, so I will repeat it again: I want my semantic layer to be understandable to both technical and business colleagues.

And to achieve this, I need to document my inputs, explain how I modelled them, and then elaborate on derived outputs.

In my early days of development, I always used Confluence for documentation of my flows and explanation of the data models in my semantic layers.

Lucky for me, most of the analytical tools now offer an embedded documentation feature, and the two that I tend to always adopt are:

_(1) Data dictionary_

Serves as my reference guide to explain to my stakeholders the core of the data models—i.e., dimensions and measures.

I tend to embed the formulas, business rules, and terminology within the code files, and then, by using the Data Dictionary, educate my end-users on how they can explore the data models.

With this approach, I save myself numerous minutes, if not hours, per week on responding to chat and mail enquiries about the meaning of the specific measures and dimensions.

On top of this, it serves me for the technical purpose of identifying if I have redundant fields in my project or if I have fields without annotated descriptions.
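
As a small illustration of that technical purpose, below is a crude helper I could imagine running before a commit: it scans the view files of the repository structure shown earlier and flags files where fields outnumber descriptions. The repository path and the *.view.lkml suffix are assumptions based on my template above, and counting lines is, of course, no substitute for Looker’s built-in Data Dictionary.

import re
from pathlib import Path

# Hypothetical repository root; point this at your own semantic layer project.
REPO_ROOT = Path("semantic_layer")

FIELD_PATTERN = re.compile(r"^\s*(dimension|measure):\s*\w+", re.MULTILINE)
DESCRIPTION_PATTERN = re.compile(r"^\s*description:", re.MULTILINE)

def audit_descriptions() -> None:
    """Crude audit: flag view files with fewer descriptions than fields."""
    for view_file in sorted(REPO_ROOT.rglob("*.view.lkml")):
        text = view_file.read_text(encoding="utf-8")
        fields = len(FIELD_PATTERN.findall(text))
        descriptions = len(DESCRIPTION_PATTERN.findall(text))
        if descriptions < fields:
            print(f"{view_file}: {fields - descriptions} field(s) likely missing a description")

audit_descriptions()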

_(2) Data lineage_

Nowadays, a must-have for me in my data project is to have a graphical explanation of where my data comes from, how it flows through transformation, and where it ends up.

For this purpose, I tend to adopt tools like dbt Core with the native Data Lineage feature. And if I don’t have tools in my data architecture/project with native data lineage features, I will make sure to get one that only serves this purpose.

This gives my data team and stakeholders a clear view of the data’s entire journey – from the raw tables to presentational tables.

It makes code troubleshooting easier and elevates the usability of the Semantic Layer by providing a clear data path.

And the usability topic leads me to my final Joker, which is to constantly work on the improvement of your semantic layer.

[🃏Hidden Joker: Refinement Loop🃏]

My Refinement Loop: Iterate, Correct, Educate 🔄

This is where it all comes together if you want to build a semantic layer that is both simple and consistent.

To achieve this, you will probably need to go through the process of constant iteration, correction, and education.

— Why?

Because every semantic layer is built by the people to benefit people – or in a more catchy title "For the people, by the people."

With this sentence, I want to remind you that the semantic layer is a living and evolving component of the data project that usually grows together with the data team and rising business requirements.

So, even if you currently don’t have enough knowledge or experience to build a semantic layer that is both simple and consistent, just keep investing time and effort until it becomes one.

With each improvement iteration, the goal is to think about the end users, how to increase the usability of the insights created within the semantic layer, and how to create business value faster by delivering quality insights.

And if you manage to bring more value to your stakeholders in a shorter time, that’s the real Joker that can position you and your data team in the business landscape.


Thank you for reading my post. Stay connected for more stories on Medium, Substack ✍ and LinkedIn 🖇 .


Looking for more analytical [templates | tutorials | reads]?

My 3 posts related to this topic:

  1. The One Page Data and Analytics Templates | Master Data and Analytics Reports and Processes with 5 Templates
  2. 3 Clusters of Knowledge Base Templates for Data and Analytics Teams | Create External, Shared and Internal Documentation in a Systematic Way
  3. Decoding Data Development Process | How Is Kowalski Providing "Ad-Hoc" Data Insights?

References

[1] Wikipedia: Semantics, https://en.wikipedia.org/wiki/Semantics, accessed on September 18, 2024

[2] Google Cloud Documentation: LookML terms and concepts, https://cloud.google.com/looker/docs/lookml-terms-and-concepts, accessed on September 19, 2024

The post Semantic Layer for the People and by the People appeared first on Towards Data Science.

]]>
Fantastic Beasts of BigQuery and When to Use Them https://towardsdatascience.com/fantastic-beasts-of-bigquery-and-when-to-use-them-13af9a17f3db/ Sun, 31 Dec 2023 16:05:49 +0000 https://towardsdatascience.com/fantastic-beasts-of-bigquery-and-when-to-use-them-13af9a17f3db/ Unveiling the Traits of BigQuery Studio, DataFrames, Generative AI/AI Functions, and DuetAI

The post Fantastic Beasts of BigQuery and When to Use Them appeared first on Towards Data Science.

]]>
"BigQuery is an all-in-one Google service with DB-BI-ML-GenAI features." [Photo by Korng Sok on Unsplash]
"BigQuery is an all-in-one Google service with DB-BI-ML-GenAI features." [Photo by Korng Sok on Unsplash]

Discover more from the BigQuery world

One of my favourite books is "Fantastic Beasts and Where to Find Them" by J.K. Rowling. It is a story about the world of magical creatures who get loose in the non-wizard world. It’s also a story that shows how Maj (wizards) and No-Maj (non-wizard) people form a friendship to protect magical creatures. On this mission, the lead No-Maj character discovers a world of magic and falls in love with all the challenges it offers, wishing he was a wizard too.

As a No-Maj myself, my transition from mechanical engineering to the data world was initially filled with challenges. Each time I entered a new data area, I was thinking: "If only I were a wizard". 😉

When I first started learning about databases (DB) and business intelligence (BI), I had this thought in my head.

As I progressed to machine learning (ML) topics, this thought was again present.

Nowadays, I am trying to manoeuvre through generative AI (GenAI) development, and—you guessed it—this thought is again with me.

Even after gaining experience in DB, BI, and ML, the GenAI area would be more challenging for me if it weren’t for one Google service—BigQuery (BQ).

Do you know why?

Because BigQuery is offering “all-in-one” when it comes to the “DB-BI-ML-GenAI” combination. Or, as Google has announced in one of its webinars, it is covering features “From Data to AI” [1].

And how I thought it should be announced: "Fantastic Beasts of BigQuery".

On top of my all-time favourite BigQuery feature—BQML—Google has recently implemented additional transformative features, making BQ more similar to an analytical development environment and less to a database environment.

These new features enable data professionals to conduct end-to-end analytical tasks without the need to switch between multiple tools.

With end-to-end tasks, I have in mind performing exploratory data analysis (EDA), predictive modelling using either SQL or Python with Spark, and creating new insights by using generative AI features. And all this can now be done with the assistance of the code co-creating feature.

The recent evolution of BigQuery’s ecosystem motivated me to write this post and to show the new BQ advances that will transform the way (we) data professionals work and possibly make us feel a bit like Maj people. 😉

In other words, this post aims to present when the new BQ features can be used in the analytical workflow.

But before we start the explanation, I need to share the names of these "fantastic beasts":

  1. BigQuery Studio and BigQuery DataFrames
  2. BigQuery GenAI and AI Functions
  3. DuetAI in BigQuery

Let’s begin now by unveiling their unique traits and pointing out how they can be leveraged to enhance your performance.

Fantastic Beasts of BigQuery

To hint at the purpose of the new BQ features, I created an illustration of how they align with the knowledge data discovery process (analytical workflow).

The new BigQuery features aligned with analytical workflow [Image by author using draw.io]

As visible from the picture, at the base of the analytical workflow is DuetAI, which is an AI coding assistance feature. Except for the coding support, DuetAI is a chatbot, and you can use it for brainstorming.

This means that data professionals can ask the chatbot different questions related to input problem definitions (e.g., how can I subset my dataset or could you explain a specific function) and recommendations on how to structure the analytical output (e.g., how can I present my findings).

In between the analytical input → output flow, other features come in handy:

  • In Phase I, i.e., the data preparation and understanding phase, BigQuery SQL (for subsetting and wrangling datasets), GenAI/AI functions (for enriching datasets), BigQuery DataFrames, and other Python libraries (for exploring datasets) can be used via BigQuery Studio.
  • In Phase II, i.e., the data modelling and insights synthesis phase, BigQuery SQL or BQML functions can be used alongside BigQuery DataFrames in BQ Studio (for BI/ML model creation) to get the required analytical output (descriptive or predictive outcome).

Now we will showcase the magical traits of these features.

BigQuery Studio and DataFrames

Don’t confuse BigQuery Studio with Looker Studio (formerly Data Studio). While Looker Studio is a self-service BI tool, BigQuery Studio is a new collaborative workspace that supports the complete analytical workflow.

With this said, BigQuery Studio has the following main traits [2]:

#1: Supports multiple languages and tools in a unified interface.

By this, I mean that it eases collaboration across different data professions, as:

  • data engineering (ingestion and data wrangling),
  • data analytics (descriptive statistics/EDA), and
  • data science tasks (predictive modelling) can be done in one environment, or better yet, in one notebook.

BQ Studio provides a Colab notebook interface and enables data professionals to use SQL, Python, Spark, or natural language (in combination with DuetAI) within a single notebook file in BigQuery. In addition, developed notebooks can later be accessed via Vertex AI for ML workflows.

When it comes to data ingestion formats, it supports structured, semi-structured, and unstructured formats from different cloud platforms.

Google’s presentation of BigQuery Studio notebook [2]

#2: Enhances collaboration and versioning by connecting to external code repositories.

I have to say this trait is something I have wanted for quite some time. Although it doesn’t yet support all git commands, BQ Studio supports software development practices like continuous integration/continuous deployment (CI/CD), version history, and source control of data code assets. Simply put, it is now possible to review the history of a notebook and revert to, or branch from, a specific notebook version.

Google’s presentation of version control BQ Studio feature [2]

#3: Enforces security and data governance within BigQuery.

BQ Studio enforces security because it reduces the need to share the data outside of BigQuery. In other words, by adopting unified credential management between services, analysts can, e.g., access Vertex AI foundational models to perform complex analytical tasks (like sentiment analysis) without sharing data with third-party tools.

On top of this, there are data governance traits, including data lineage tracking, data profiling, and enforcing quality constraints. I can only add "Amen to that".

Google’s presentation of data governance features [2]

To summarise the above-listed traits, it is evident that BigQuery Studio is a magical feature, as it enables everything from data ingestion to data modelling while enforcing security and governance.

The story would not be complete if Google didn’t provide an additional feature that can be used within BigQuery Studio notebooks for data analysis and modelling purposes—BigQuery DataFrames.

By installing the bigframes package (similar to the installation of any other Python package with pip), data professionals can use the following Python APIs [3]:

  • bigframes.pandas, a [pandas](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html)-compatible API for data analysis and manipulation, and
  • bigframes.ml, a scikit-learn-like API for machine learning.

And the machine learning topic is a good place to close this section, because in the next one I will elaborate on the new BigQuery ML functions.

BigQuery GenAI and AI functions

As mentioned in the introduction, I am a big fan of the BQML functions because they enable data professionals to create predictive models using SQL syntax.

In addition to its already nice portfolio of BQML functions for supervised and unsupervised learning, Google has now added my next favourite functions: generative AI and AI functions.

When it comes to the generative AI function ML.GENERATE_TEXT, I recently wrote a blog to show its traits.

The New Generative AI Function in BigQuery

In summary, the function can be used to create new insights from unstructured text stored in the BigQuery datasets. Or more precisely, you can use it to create new classes or attributes (classification analysis, sentiment analysis, or entity extraction), summarise or rewrite natural text records, and generate ads or ideation concepts.

I would say that SQL-based data engineering now has next-level powers.

Alongside this magic function, other powerful new SQL-based AI functions are:

  • ML.UNDERSTAND_TEXT – a function that helps with text analysis of records stored in BigQuery and supports features similar to the ML.GENERATE_TEXT function, i.e., entity, sentiment, classification, and syntax analysis (see the sketch after this list).
  • ML.TRANSLATE – a function used for text translation from one language to another.
  • ML.PROCESS_DOCUMENT – a function used for processing unstructured documents from an object table (e.g., invoices).
  • ML.TRANSCRIBE – a function used for transcribing audio files from an object table.
  • ML.ANNOTATE_IMAGE – a function used for image annotation from object tables.
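
To give a feeling for how these functions are called, below is a minimal sketch of a sentiment analysis call with ML.UNDERSTAND_TEXT. It assumes a remote model named my_dataset.nlp_model (connected to the Cloud Natural Language service) and a source table with a review text column already exist; the exact option names and the expected input column name are assumptions here and should be checked against the BigQuery documentation.

--A minimal sketch (assumed object names and option values) of ML.UNDERSTAND_TEXT
SELECT *
FROM
  ML.UNDERSTAND_TEXT(
    MODEL `my_dataset.nlp_model`,            --assumed remote model over the Cloud Natural Language service
    (
      SELECT review AS text_content          --assumption: the input text column is expected to be named text_content
      FROM `my_dataset.product_reviews`      --assumed source table
    ),
    STRUCT('ANALYZE_SENTIMENT' AS nlu_option));  --assumed option name for the sentiment analysis task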

Although these functions are written in SQL, understanding their query structure and parameters is a must for proper usage. To speed up the learning curve, a little coding assistance—DuetAI—comes in handy.

DuetAI in BigQuery

Simple as that: the DuetAI magic feature can help data professionals write, improve, and understand their multi-language code within the BQ and BQ Studio environments.

More precisely, this feature possesses the following traits:

#1: Creates queries or code from scratch within the BQ environment or BQ Studio notebooks

Example of the code completion by using DuetAI in BigQuery [Image by author]

#2: Explains queries and code snippets within the BQ environment or BQ Studio notebooks

Example of the code completion by using DuetAI in BigQuery [Image by author]

#3: Enhances code quality within the BQ environment or BQ Studio notebooks

By this, I mean that DuetAI can improve code by assisting with:

  • syntax correction: it can identify and suggest corrections for syntax errors.
  • logic improvement: it can suggest alternative ways to structure the code, improving the overall efficiency and readability.
  • documentation generation: it can automatically generate documentation for the code, making it easier to understand.

Finally, with this magic feature, I will conclude the section and presentation of the new BigQuery "fantastic beasts".

A World of Endless Possibilities

In the book "Fantastic Beasts and Where to Find Them", J.K. Rowling showed how magical creatures become less scary for Maj and even No-Maj people when they learn about their positive traits. Similarly, in this blog post, I wanted to showcase the new fantastic features of BigQuery and point out their positive traits on different analytical levels.

Illustration of the three BigQuery beasts created by the author using DALL-E extension in ChatGPT (Correct number of beasts, but a wrong number of the names ;))

My goal was to present how the new features can support you in the complete analytical workflow and ease your work, whether you are a data engineer, analyst, or scientist. In addition, I wanted to point out how they can enhance collaboration among team members with different data backgrounds.

Hopefully, you will engage in hands-on magic yourself and learn more about the "Fantastic Beasts of BigQuery".

Happy learning!

Thank you for reading my post. Stay connected for more stories on Medium and LinkedIn.

Knowledge resources

[1] Google Cloud webinar: "Cloud OnBoard: From Data to AI with BigQuery and Vertex AI", accessed December 10th 2023, https://cloudonair.withgoogle.com/events/cloud-onboard-data-to-ai

[2] Google Cloud blog: "Announcing BigQuery Studio – a collaborative analytics workspace to accelerate data-to-AI workflows", accessed December 11th 2023, https://cloudonair.withgoogle.com/events/cloud-onboard-data-to-AI

[3] Google Cloud documentation: "BigQuery DataFrames", accessed December 11th 2023, https://cloud.google.com/python/docs/reference/bigframes/latest

The post Fantastic Beasts of BigQuery and When to Use Them appeared first on Towards Data Science.

]]>
End-of-Year Report on a 12-Year Data Journey https://towardsdatascience.com/ending-the-year-with-12-lessons-about-data-career-8786afc068f4/ Sat, 09 Dec 2023 06:53:23 +0000 https://towardsdatascience.com/ending-the-year-with-12-lessons-about-data-career-8786afc068f4/ Three stories about the data career journey

The post End-of-Year Report on a 12-Year Data Journey appeared first on Towards Data Science.

]]>
Introduction: Beyond Numbers

In my previous position, I created end-of-year reports for my business colleagues and CEOs.

Nothing special, you would say. Standard reports with a bunch of numbers for different business areas—from general business controlling and marketing to supply chain management and finance.

True, and I was aware these reports were part of my tasks and would not draw much attention from every colleague in the company.

So, I gave some thought to how to make the reports better "attention-getters". The answer was simple—make them sound cool.

In other words, I decided to spice up the report names, as spicing up the numbers was not an option.

The difference between any other standard end-of-year report and "mine" was that I named mine after that year’s trending buzzwords.

With this said, the 2020 report was named "The Notorious 2020". The 2021 report was named "The Vax 2021", and the 2022 report was "The ChatGPT 2022".

You already have a clear idea of why I decided on these names.

From the turbulent year 2020, when the world was hit by the coronavirus, to the vaccine development in 2021, and to one of the greatest developments in the data world in 2022—the launch of the generative AI chatbot.

To move on from the end-of-year business reports, I (finally) decided to create a personal end-of-year report on my career progress.

Why, you ask?

The first reason is that 2023 was the year that marked 12 years of my career, and it was—let’s put it like this—transformational, in the sense that I once again changed country, job, and living environment all at once.

The second reason is that 2023 was one of the biggest years of new technological development since I started my career, and the knowledge barrier for entering the data area has increased.

With all the new developments and sudden demands on generative AI skills in the data field, I think it’s one of the most challenging times to start a career.

To confirm this statement, I will just add that even a pioneer of machine learning, Andrew Ng, recently wrote a memo on how to build a career in AI [1].

It almost goes without saying that this memo is truly inspirational.

Not only does it encourage people to enter the field of AI, but it also gives advice on which skills are required and how to overcome your internal "human" struggles when searching for a job or switching career paths.

Although I can’t fully understand how challenging it is to start a data career nowadays, I was able to "find myself" in most of the challenges Andrew described in the memo.

In other words, I know how tricky it is to start building a data career if you didn’t explicitly study this area. The feeling of not knowing where to start, what to learn first, or how to develop is well-known to me.

This is exactly the second main reason I finally decided to write this blog post and sum up the "insights" I gathered in the past 12 years of my career.

With this post, my goal is to help someone "out there" with their data career struggles and make new joiners feel encouraged to enter this area. I aim to humanize this process by sharing my stories and a few lessons learned on the path of building a data career.

And now the same old question has arisen: How to name the report to sound cool?

Again, having in mind that 2023 is the year that marked 12 years of my career, I only thought the appropriate name for the report would be "The Cosmic 2023".

So, let me share with you the journey that shaped me as a data professional and made me decide to build, and stick to, a career in the most amazing field ever—the data field.

The Cosmic 2023: The 3 Stories with Input-Output Flow

As the title of the section states, I will share with you three stories that mark the main stages of my professional data journey.

Story one, named "Per Aspera ad Astra", is a story of how I started a data journey. And how you probably shouldn’t. 😉

Story two, named "Terra Incognita", is a story of how I managed to land my data industry job(s) in a new country.

Story three, named "Semper Crescens", is a story of the next career level and adopting a Growth Mindset in the data field.

Each story has an input-output flow:

  • The story input part represents a story background.
  • The story output part represents a story outcome.

In the section ending, I will share the lessons learned to sum up the most important insights from the experience.

Story #1: Per Aspera ad Astra 🌟

A story about how (not ;)) to start a data journey without a computer science background.

➡ Story input

Having studied mechanical engineering (ME), I always lacked passion for most of the ME core courses. By this, I mean the "special chapters" of the courses like Elements of Construction, Mechanics of Fluids, Thermodynamics, etc.

I was just not 100% into the course material, and the best part of every exam for me was when I had to solve mathematical equations and derive the calculations. This was "green flag #1" that I should orient my career toward the analytical area.

The "green flags #2 and #3" happened during my master’s studies. I remember it being a hot summer day, with my classmates and me sitting in a small lab room attending a course named Information Systems. And there it was—the course where I learned the basics of relational database architecture design and how to create CRUD applications and SQL reports on top of it.

Later on, I took another course named Information Management, where I was able to learn about advanced analysis using only SQL. I loved it instantly, and I knew back then this was exactly the area for me.

You would probably expect this to be enough to automatically start working in the data field. However, this was far from my path.

After finishing my studies in 2011, I got an internship at a manufacturing company, where I worked on the project management side. A bit of the reporting was present in my daily tasks, but this was just not "it". I wanted to work with databases.

After almost 1 year, I found an "exit door" in the unstable work market, and I got the position of Research and Teaching Assistant at my alma mater. Accepting the role with a high dose of romanticism, I was finally hoping for my analytical journey to start.

I was, to put it simply, wrong. 🙂

One thing I didn’t think about before was that when you work in academia, no dataset is waiting for you. In a lot of cases, there is not even a project or a company partner waiting for you. You need to find both on your own.

To get to this point, i.e., to find a company partner who would give me a dataset, took me another 3 years. I won’t go into details about the problems and obstacles along the way, as this is a topic for another blog post. 🙂 But I can tell you it was an iterative process of networking and relying on good people’s belief in me.

After this longer process, I just remember the feeling of being proud and having immense motivation to work with "my precious" data.

⬅ Story output

I finally got my datasets.

And, as soon as I thought the hardest part was behind me, the difficulties only began.

Questions were piling up: What to do now with this data? How do you develop something new to have a scientific contribution?

I had no clue. No clue where to start or how to finish the analytical part of my thesis.

Then somehow, in the process of reading scientific publications, I stumbled on the area called "data mining".

And there it was again—something I had been waiting for. The analytical techniques I can use to develop my models from the obtained datasets.

If only it was as easy as that.

The new struggles began once I delved into the world of ETL processing to prepare the data for the modelling stage. I had messy, missing, and imbalanced datasets without proper join keys that had to be joined together. This was the time when I regularly stayed awake and worked until 2–3 AM with my business colleagues (note: they were aircraft engineers working in the Maintenance Control Center on a 24/7 schedule) to understand the different data sources at their core.

Once this battle was won and input datasets were prepared, the next one was learning about machine learning modelling. From feature selection and dimensionality reduction to selecting the proper machine learning algorithm and understanding the math behind it. On top of that, I was collecting knowledge of statistical analysis and how to evaluate, compare, tune, and present the model outcomes.

This all took another 3 years.

Finally, putting it all together, it took me 6 years to be able to say that I understand how to create value from data.

In other words, this time I finally got a degree that had "something to do with data". 🙂

Story #2: Terra Incognita 🗺

A story about how to land industry job(s) (in a foreign country).

➡ Story input

Following my "big-life victory", i.e., getting a new degree, I started working on my next career move—finding an industry job.

I created a completely new CV, as the old one no longer matched what the market expected. I created profiles on all the job and recruiting portals in the country. And finally, I started applying for the few available data-related positions listed and sending open applications.

I say "a few available positions" because, back in 2017, the market in a country of 4 million people did not have high demand for these roles.

If I recall properly, all together in Q4–2017, there were 4–5 data positions open to which I applied. From these 4–5 vacancies and several open applications, I managed to land 2 interviews.

After passing the initial assessment tests (technical, intelligence, and/or organisational), the next stage was an interview with the hiring leads.

Again, I won’t go into the details here, but I will just add how I wish I could forget some of the questions I got in these interviews. One direct question was "What do you even know?" after I elaborated on my thesis work.

You can imagine my level of confusion about this and similar questions. I felt this was not okay, and things should look different. I should be treated differently.

I took matters into my own hands and made a plan to search for a job where I could get an equal chance, where things were different, and where I could get more life opportunities.

To cut to the chase: I did get "different" after a while. I got a job in a different industry in a different country, with all the different challenges included in the package. 🙂

It was not the smoothest road to get to this point, and the following statistics will give you a better idea of the process:

  • Time duration: it took me 14 weeks;
  • Number of applications: 60 job applications;
  • Number of landed interviews: 3 interviews with several rounds (2–3);
  • Number of offers: 1 job offer.

As you can see from the above-listed numbers, I was in a hurry. I spent most of my free time looking at new job openings on a specific job board and writing cover letters.

However, every minute spent on this was worth my time. I got a job.

⬅ Story output

I finally got it—the job as an IT consultant. Someone thought I had what it takes for this role, regardless of the missing language skills, not knowing how to work with specific tools, and the bureaucratic obstacles (work permit). 🙂

Accepting this offer was a no-brainer for me, and I knew it was a good choice. The first reason for this was that, as a consultant, I was again working on projects. So, something similar to what I had in my previous role. The second reason was the knowledge. This job gave me the freedom to collect and build knowledge in new data areas.

This was the time when I got acquainted with data warehouse architecture design concepts and cloud platforms. Both areas were completely new to me and extremely interesting.

And then one day I got a call from an unknown number. It was a recruiter, explaining something about a new team starting a new data project and that he still had my CV, sent almost a year earlier, on his stack.

At first, I didn’t get what this call was about, but I agreed to have a follow-up call. What I can still recall from the second call is one piece of information: "You will work with billions of records." My eyes sparkled. Never before had I worked with Big Data, and this opportunity sounded great. So, I decided to take it and start a new role—as a Data Engineer on a Big Data migration project.

The job was fulfilling—learning intensively about coding and co-developing near-real-time data ingestion pipelines, as well as co-developing analytical customer-oriented models and insights. Again, I was collecting new data engineering and data science knowledge by working on use cases I had never worked on before.

This process was constant…until one day, the company went bankrupt.

Well, this was not in my plans at all. 🙂 I was suddenly jobless and started sending out my CV in the pandemic pre-summer (i.e., dead) job season.

Luckily enough, by this time I already had a small network, and one of my former colleagues recommended me to his company. His company was migrating its analytical dataset to a cloud platform and had no one to work full-time on this topic. The stars aligned—I was jobless, and they were searching. A win-win situation. So, again, I started a new role—as a Data Analyst.

This position has shaped me in different directions. The ones I never expected.

The reasons are numerous: being a business analyst and a data generalist (analyst, engineer, and scientist in one), working on the front line between IT and business colleagues, learning almost every aspect of the business through data, and going from a "one-woman show" to building a data team.

To sum it up, it took me another 3 years, or altogether 9 years, to progress to the next career level.

Story #3: Semper Crescens 🧠

A story of the new role and working towards adopting a growth mindset.

➡ Story input

The next career level—I got an offer to build a data team and become a Data Lead. I mean, I already had 6 years of experience mentoring and coaching students, so this should be easy, right?

Except it wasn’t.

But let’s rewind a bit on how I arrived at this point. To this day, I still think the most influential factor was "being in the right place at the right time".

Meaning: I was the first to work full-time on a migration project; I had experience in different data fields; the business requirements on the data side were exploding; and I brought ideas on how to create a long-term data roadmap. Again, the stars aligned, and apparently, it was a logical choice to be "the one" who would form and lead a data team.

A leap of faith was given to me. So, I grabbed it and started a new role.

It would be an understatement to say how lost I was in this position at the beginning. It was again like building a new career and not knowing what to do first.

Initially, I was not able to let go of the technical work, and I was clinging to it. I mean, I did all the development from scratch, and now someone wants to take it over? Although I was not able to manage the development requirements by myself anymore, it was hard to let go of hands-on work.

Then, I realised I needed to do "human-oriented" tasks too. Motivate colleagues, conduct 1-on-1 talks, provide guidance wherever possible, create team vision and objectives, lead hiring, and all together, create a pleasant working environment.

As a bonus, there were other tasks—the "management" tasks—doing team budgeting, controlling, collecting, and organising work per data role, and presenting the team.

These were challenging tasks for me, and I needed help.

Luckily, I had it. Not only did I get support from my supervisors, but I also got support from my peers. In addition, I even got support from former supervisors and their peers.

However, this was not enough for me. The real change happened, and the role became easier when I started reading books on psychology and organisation, following people who were sharing their stories on leadership, attending workshops, and taking coaching.

After this, I realised everyone can grow in unknown dimensions faster and more successfully with proper guidance. And these were my first steps to acquiring what I later discovered is called a "growth mindset".

⬅ Story output

Now, I am not saying I possess a growth mindset in every life situation. In the end, I am only a human, and my fixed mindset is present in me. However, I will say I am dedicated to getting one.

Like everything else in life, this takes constant work and discipline. It takes being able to reflect and take several steps back when needed. It takes not caring if you turn out looking ignorant sometimes. And it takes focusing on your growth through learning.

Finally, you realize that it’s not about chasing roles and the hierarchical ladder; it’s about knowledge.

After the 12-year journey and being currently in the second Data Lead role, the important thing for me is that I further acquire knowledge inside and outside of the data field. Especially when it comes to generative AI development, as it will impact the entire way of working in the data industry.

Beyond this, I believe I am finally able to empower and support others on a similar path by sharing my experiences. All to create more interest in the data field and attract new talents.

Lessons learned 🧐

I will try to keep it short here and list the most important lessons learned on the 12-year-long path described before. More technical details will follow in another blog. 😉

Data Career Essentials

  • Build the foundational knowledge. Understand the importance of foundational knowledge in the data field. In other words, build your logic by first learning math and statistics, then algorithms, data structures, architectures, and coding principles. It’s a lot, of course, but knowing the general concepts will make your hands-on work easier later on.
  • Deliver quality work (whenever possible). Prioritise quality over quantity in your work, and adopt a methodological way of working for clarity and better performance.
  • Avoid excuses (whenever possible). It’s always easy to justify yourself and find excuses for—well, everything. However, taking responsibility and owning your mistakes will make you stand out from the crowd.
  • Create your opinions (whenever possible). Develop critical thinking skills by evaluating and questioning existing findings and conclusions.
  • Always keep learning. Recognise the value of self-paced learning and continuous education by using online e-learning platforms. Learn about business, psychology, and other sciences to complement your technical knowledge.
  • Do personal retrospectives. Keep track of your failures to measure your progress later on.

Data Career Personal Insights

  • Keep searching until you get a "yes". Train yourself to be persistent in challenging career situations, and don’t take "noes" personally.
  • Learn that "no, thank you" is a full sentence. Focus on the goals that matter to you, and respect others along the way.
  • Someone’s ceiling is your floor. Probably the best advice I ever got was not to limit my ambitions on account of the viewpoints of others.
  • Credit where credit is due. Understand the importance of giving credit and the value of sharing knowledge.
  • Rely on 2F. Seek support from friends and family during tough times.
  • It’s you against you. Your career is not a competition, and no one has the same starting point in life. Everyone is struggling on their own paths, and it makes no sense to compare yourself to others. It is only you against you.

Conclusion: "There is no cure for curiosity"

By sharing my stories in this blog, I wanted to normalise the struggles on the path of building your data career. I aimed to give you examples of how every obstacle can be resolved if you persist in finding solutions and working towards your goals.

With this, I wanted to motivate the people who didn’t study this field and the ones who are thinking of switching careers to join the data field.

It’s probably not going to be a smooth journey, but have confidence that every second will be worth it a few years from now—maybe 3, 6, 9, or the cosmic 12. 😉

Lastly, I will end this blog with a saying from Dorothy Parker:

The cure for boredom is curiosity; there is no cure for curiosity.

So, be curious and join the most amazing field ever – the data field. 🙂


Thank you for reading my post. Stay connected for more stories on Medium and LinkedIn.


Acknowledging the pillars of support

As this is an end-of-year "report", I need to express my gratitude to everyone who shared and is still sharing my journey.

  • Family & Friends. On the path of pivoting to and sticking to a data career, I got immense support from family and friends. You cried and laughed together with me through challenging times. Thank you for sticking with me.
  • Mentors & Colleagues. To my former supervisors and their peers, whom I considered mentors and whom I can still call today for a piece of advice. The same is true for my colleagues. Thank you for everything I have learned from you.
  • Community. On Medium and LinkedIn, I crossed paths with people who inspired, applauded, and shared my stories (TDS). This sometimes casts an amazing light on my days. Thank you for sharing your kindness.

Knowledge references

[1] DeepLearning.AI resource, "How to Build Your Career in AI" by Andrew Ng, accessed October 13th 2023, https://info.deeplearning.ai/how-to-build-a-career-in-ai-book

The post End-of-Year Report on a 12-Year Data Journey appeared first on Towards Data Science.

]]>
The New Generative AI Function in BigQuery https://towardsdatascience.com/the-new-generative-ai-function-in-bigquery-38d7a16d4efc/ Fri, 01 Dec 2023 21:47:54 +0000 https://towardsdatascience.com/the-new-generative-ai-function-in-bigquery-38d7a16d4efc/ How to use BigQuery GENERATE_TEXT remote function

The post The New Generative AI Function in BigQuery appeared first on Towards Data Science.

]]>
"Everyone can code and do NLP analysis in BigQuery with SQL knowledge and a good prompt structure" [Photo by Adi Goldstein on Unsplash]
"Everyone can code and do NLP analysis in BigQuery with SQL knowledge and a good prompt structure" [Photo by Adi Goldstein on Unsplash]

Introduction

Since I started working with the Google Cloud Platform, Google has not stopped surprising me with its BigQuery (BQ) features and development.

A real "wow" moment for me happened four years ago.

I remember it like it was yesterday, and I was sitting in the front row at the Big Data London 2019 conference. Little did I know back then about the possibility of creating machine learning models using only BQ functions, or, better said, what BQ Machine Learning (BQML) is.

At least until the conference session where a Google colleague presented how you can create classification, clustering, and time-series forecasting models by simply using Google’s SQL.

The first thought that went through my mind back then was "You must be joking".

The second thought in my head was, "Does this mean that everyone who knows only SQL will be able to create machine learning models?"

As you can assume, the answer is "yes" if you are using BigQuery as your data warehouse.

Now, after using the BQML functions for a while, the correct answer to the question listed above is "maybe."

This means that even though the [CREATE MODEL](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create) syntax is written in SQL, knowledge of machine learning modelling and statistics is still needed.

In other words, you still need to understand the math behind the available models for different types of machine learning use cases (supervised/unsupervised), conduct feature engineering, hyperparameter tuning and model evaluation tasks.
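
To make this point concrete, here is a minimal sketch of what a BQML classification model can look like; all dataset, table, and column names below are made up for illustration, and the feature engineering, tuning, and evaluation work mentioned above still happens around statements like these.

--A minimal, illustrative BQML classification model (all object names are made up)
CREATE OR REPLACE MODEL `my_dataset.churn_model`
OPTIONS (
  model_type = 'logistic_reg',          --supervised classification model
  input_label_cols = ['churned']        --the label column to predict
) AS
SELECT
  customer_tenure,
  monthly_spend,
  churned
FROM `my_dataset.customer_features`;

--Evaluate the trained model
SELECT * FROM ML.EVALUATE(MODEL `my_dataset.churn_model`);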

Fast forward to the year 2023, and BigQuery is further amazing me with its new features.

This time, we are talking about the new generative AI BigQuery machine learning functions.

With these new functions, data engineers and analysts are able to perform generative natural language tasks on textual data stored in BQ tables with a few query lines.

Hence, the goal of this blog post is to showcase the new analytical advances of BQ in generative AI, with a focus on one function—the GENERATE_TEXT function.


About the GENERATE_TEXT function

The main idea behind the GENERATE_TEXT function is to assist data professionals with the following tasks using only SQL and prompt instructions in BigQuery [1]:

  • Classification
  • Sentiment Analysis
  • Entity extraction
  • Extractive Question Answering
  • Summarization
  • Re-writing text in a different style
  • Ad copy generation
  • Concept ideation

In other words, the function can perform generative natural language tasks on textual attributes stored in BQ by using the Vertex AI [text-bison](https://cloud.google.com/vertex-ai/docs/generative-ai/model-reference/text) natural language foundation model [1].

It works in the following way [1]:

  • First, it sends requests to a BigQuery ML remote model representing a Vertex AI text-bison natural language foundation model (LLM).
  • Then, it returns the response with defined input parameters.

This means that different function parameters and the input prompt design (prompt instructions for analysis) affect the LLM’s response.

With this said, the following function-specific parameters can be passed to the GENERATE_TEXT function and affect the response quality [1]:

#1: model [STRING] → specifies the name of a remote model that uses one of the text-bison Vertex AI LLMs.

#2: query_statement [STRING] → specifies the SQL query that is used to generate the prompt data.

#3: max_output_tokens [INT64] → sets the maximum number of tokens that the model outputs. For shorter responses, the lower value can be specified.

  • Note: A token might be smaller than a word and is approximately four characters.

#4: temperature [FLOAT64] → argument in the range [0.0, 1.0] that is used for sampling during the response generation, which occurs when top_k and top_p are applied.

  • Note: The argument presents the degree of randomness in token selection, i.e., the lower values are good for prompts that require a more deterministic response, while higher values can lead to more diverse results.

#5: top_k [FLOAT64] → argument in the range [1, 40] that changes how the model selects tokens for output.

  • Note: To get less random responses, lower values should be specified.

#6: top_p [FLOAT64] → argument in the range [0.0, 1.0] that changes how the model selects tokens for output.

  • Note: To get less random responses, lower values should be specified. Tokens are selected from the most probable (based on the top_k value) to the least probable until the sum of their probabilities equals the top_p value.

After we understand the purpose of the function and the role of each parameter, we can start to demo how the BigQuery GENERATE_TEXT function can be used.


The 4-step methodology for using the function

This section presents the four methodological steps for testing the generative AI function GENERATE_TEXT.

GENERATE_TEXT function workflow [Image by author]

In summary, the methodology consists of:

  • Generation of the small mockup dataset using ChatGPT and exporting it to a Google Sheets document.
  • Creation of the BigQuery dataset on top of the Google Sheets document.
  • Setting up the connection between Google service Vertex AI and BQ in order to use a remote generative AI model.
  • Hands-on use-case examples for testing the GENERATE_TEXT function on top of the mockup dataset in BigQuery.

More context for every step is shown in the subsections below.


Step 1: Generate a mockup dataset using ChatGPT

As I didn’t have a real-life dataset, I decided to create a mockup dataset for testing purposes.

For this, I used ChatGPT by entering the following prompt text:

"I would need to automatically generate customer reviews for 50 different imaginary hair products.

For each product, I would need 2 positive, 2 negative, and 1 neutral review.

In addition, I would expect reviews to be at least 4 sentences long and contain different information: product name, location of the store where the product was bought, and price of the product.

In negative reviews, please include different reasons, like product quality issues, price issues, and delivery issues."

The outcome was a small dataset table with five attributes (previewed in the image below):

  • product_name – contains the fake product names,
  • review_type – contains the review sentiment type (positive, negative, or neutral),
  • store_location – contains a random city and state name,
  • product_price – contains a random product price in dollars, and
  • product_review – contains the five-sentence-long fake product review text.
BigQuery mockup dataset preview [Image by author]

The complete dataset can be found in the Git repository here [4].

After preparing the mockup dataset and storing it in Google Sheets, the next step is transferring it to BigQuery.


Step 2: Create a BQ table on top of the Google Sheet document

My go-to method when I have a small dataset, which I need to tweak or change often manually, is to create a BigQuery table on top of the editable Google Sheet document.

To do so, the following substeps are needed [2]:

Substep #1: Create a dataset and table in the BigQuery project by specifying the file location and format

The CREATE TABLE option, located in the top-right corner of the BigQuery environment, should be selected to create the new dataset and table. Values marked with a star sign (*) need to be entered, as shown in the image below.

Creating a BQ dataset and table on top of the mockup dataset in Google Sheets [Image by author]

As visible from the image above, the name of the newly created BQ dataset is hair_shop, while the name of the newly created table is product_review_table.

Important note to keep in mind:

  • In case you don’t want to define the table schema in the Schema section shown in the image above, all imported attributes from Google Sheets will have the datatype STRING by default.
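
If you prefer SQL over the UI for this substep, the same Google Sheets-backed table can also be defined with a CREATE EXTERNAL TABLE statement, roughly as sketched below. The spreadsheet URI is a placeholder, the exact options should be double-checked in the BigQuery documentation, and the dataset itself still needs to exist first.

--A rough SQL alternative to the UI flow (the spreadsheet URI is a placeholder)
CREATE SCHEMA IF NOT EXISTS hair_shop;

CREATE EXTERNAL TABLE `hair_shop.product_review_table`
(
  product_name STRING,
  review_type STRING,
  store_location STRING,
  product_price STRING,
  product_review STRING
)
OPTIONS (
  format = 'GOOGLE_SHEETS',
  uris = ['https://docs.google.com/spreadsheets/d/YOUR_SHEET_ID'],  --placeholder spreadsheet URI
  skip_leading_rows = 1    --skip the header row in the sheet
);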

Substep #2: Query the mockup dataset in BQ

The second substep is to simply explore the hair_shop.product_review_table dataset directly in BigQuery.

Query the mockup dataset in BQ [Image by author]

Now that you can access your mockup dataset in BQ, it’s time to connect the generative AI remote model to it.


Step 3: Connect the Vertex AI service to BQ

A detailed explanation that served as a guideline for this step can be found in Google’s documentation here [3].

To summarize the provided Google tutorial, the three main substeps for connecting the Vertex AI to BQ are:

  • Substep #1: Create a cloud resource connection from BigQuery to get the connection’s service account.
  • Substep #2: Grant the connection’s service account an appropriate role to access the Vertex AI service.
  • Substep #3: Create the text-bison remote model that represents a hosted Vertex AI large language model in the created BigQuery dataset hair_shop.

Once these substeps are concluded, a new object Model will be visible within the hair_shop dataset.
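
For reference, substep #3 roughly boils down to a single CREATE MODEL statement. The sketch below uses a placeholder connection ID, and the OPTIONS clause reflects the remote-model syntax available at the time of writing; the exact service type (or a newer ENDPOINT-based variant) should be verified against the current documentation.

--A rough sketch of the remote model creation (the connection ID is a placeholder)
CREATE OR REPLACE MODEL `hair_shop.llm_model`
REMOTE WITH CONNECTION `us.my-vertex-ai-connection`                    --placeholder cloud resource connection
OPTIONS (REMOTE_SERVICE_TYPE = 'CLOUD_AI_LARGE_LANGUAGE_MODEL_V1');    --maps to the Vertex AI text-bison LLM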

A preview of the new database object Model in the selected BQ dataset [Image by author]

Finally, the magic can begin, and we can start using the GENERATE_TEXT function.


Step 4: Use the GENERATE_TEXT function in BQ

In this step, we will focus on two use cases for testing the function usage: sentiment analysis and entity extraction.

The reason for selecting these two use cases is that we already created two attributes in the input mockup dataset (review_type and product_review) which can be used to validate the function results. By validating the function results, we mean checking the accuracy of the AI-generated values within the new attributes.

Let’s now present the concrete input-output flow for generating the new attributes for each use case.

Sentiment analysis example

The query for sentiment analysis is shown in the code block below.

--Generate the positive/negative sentiment
SELECT
  ml_generate_text_result['predictions'][0]['content'] AS review_sentiment,
  * EXCEPT (ml_generate_text_status,ml_generate_text_result)
FROM
  ML.GENERATE_TEXT(
    MODEL `hair_shop.llm_model`,
    (
      SELECT
        CONCAT('Extract the one word sentiment from product_review column. Possible values are positive/negative/neutral ', product_review) AS prompt,
        *
      FROM
        `macro-dreamer-393017.hair_shop.product_review_table`
      LIMIT 5
    ),
    STRUCT(
      0.1 AS temperature,
      1 AS max_output_tokens,
      0.1 AS top_p, 
      1 AS top_k));

The breakdown of the query is as follows:

Outer SELECT statement:

  • The outer query selects the newly generated string attribute review_sentiment, obtained from the LLM by applying the ML.GENERATE_TEXT function to the hair_shop.llm_model object.
  • In addition, it selects all other columns from the input dataset hair_shop.product_review_table.

Inner SELECT statement:

  • The inner query statement applies a CONCAT function with a specific prompt instruction to each product_review record from the input dataset hair_shop.product_review_table.
  • In other words, the prompt instruction guides the model to extract a one-word sentiment (positive, negative, or neutral) from the product_review attribute.

STRUCT for model parameters:

  • temperature: 0.1 – a lower value of 0.1 leads to more predictable text generation.
  • max_output_tokens: 1 – limits the model’s output to 1 token (the sentiment analysis outcome can only be positive, negative, or neutral).
  • top_p: 0.1 – influences the likelihood distribution of the next token.
  • top_k: 1 – restricts the number of top tokens considered.

After triggering the presented query, the first five outcomes were analysed:

The sentiment analysis outcome of the GENERATE_TEXT function [Image by author]

As visible from the image, the newly generated attribute named review_sentiment is added to the mockup dataset. The values of the review_sentiment attribute were compared to the values in the review_type attribute, and their records matched.

Following the sentiment analysis outcome, the aim was to identify the reason behind every sentiment, i.e., to perform entity extraction.

Entity extraction example

The query for entity extraction analysis is shown in the code block below.

--Check what's the reason behind the positive/negative sentiment
SELECT
  ml_generate_text_result['predictions'][0]['content'] AS generated_text,
  * EXCEPT (ml_generate_text_status,ml_generate_text_result)
FROM
  ML.GENERATE_TEXT(
    MODEL `hair_shop.llm_model`,
    (
      SELECT
        CONCAT('What is the reason for a review of the product? Possible options are: good/bad/average product quality, delivery delay, low/high price', product_review) AS prompt,
        *
      FROM
        `macro-dreamer-393017.hair_shop.product_review_table`
      LIMIT 5
    ),
    STRUCT(
      0.2 AS temperature,
      20 AS max_output_tokens,
      0.2 AS top_p, 
      5 AS top_k));

The breakdown of the query is analogous to the sentiment analysis query breakdown, except that the prompt instruction and model parameter values have been adjusted for entity extraction analysis.

After triggering the presented query, the first five outcomes were analysed:

The entity extraction analysis outcome of the GENERATE_TEXT function [Image by author]

The results of the newly generated review-reason attribute were compared to the values of the product_review attribute. As in the previous example, the outcomes matched.

And here you have it, folks. This was the tutorial that showed the two use cases for creating new attributes from unstructured text by using only SQL and prompt text. All with the aid of the new generative AI function in BigQuery.

We can now share a short summary of this blog post.

Conclusion

The goal of this blog post was to present the usage of the new GENERATE_TEXT ML function in BigQuery.

With this function, data professionals who are SQL-oriented now have options to create new insights (attributes) from the unstructured text stored directly in the BQ data warehouse table.

In other words, the function enables data professionals to leverage NLP machine learning tasks directly within the BigQuery environment.

Furthermore, it narrows the knowledge gap between the data professionals who develop new analytical insights using Python and those who are more database-oriented.

Finally, I will conclude this blog post with one saying: "Everyone can code and do NLP analysis in BigQuery with SQL knowledge and a good prompt structure."

Thank you for reading my post. Stay connected for more stories on Medium and LinkedIn.

Knowledge references

The post The New Generative AI Function in BigQuery appeared first on Towards Data Science.

]]>
Growing a Data Career in the Generative-AI Era https://towardsdatascience.com/one-blogpost-comment-growing-the-data-career-in-the-generative-ai-era-9ef1242d3019/ Thu, 06 Jul 2023 17:58:10 +0000 https://towardsdatascience.com/one-blogpost-comment-growing-the-data-career-in-the-generative-ai-era-9ef1242d3019/ Raising awareness about learning three fundamental data concepts As a data professional, I am just amazed by all the recent developments in the area of generative AI. While some call it hype and are willing to quickly write it off as just another tech trend, others are convinced it is a game-changer. Regardless of which […]

The post Growing a Data Career in the Generative-AI Era appeared first on Towards Data Science.

]]>
Raising awareness about learning three fundamental data concepts

As a data professional, I am just amazed by all the recent developments in the area of generative AI.

While some call it hype and are willing to quickly write it off as just another tech trend, others are convinced it is a game-changer.

Regardless of which stream you support, it is hard to ignore the transformational possibilities generative AI can bring to the future of education and the workplace.

To back up this statement, it is enough to mention that Harvard University is introducing an AI chatbot into classrooms this fall (fall 2023) to approximate a one-to-one teacher-student ratio. The students will use the Harvard-developed chatbot to guide them to solutions rather than to provide them with straightforward answers.

For me, this is a clear indicator that Harvard is triggering a wave of change in how the new generations will learn and, consequently, work.

Meaning, generative AI is not just a passing trend, and we need to start finding a way to adapt our working environments to it.

Despite my enthusiastic view of generative AI, I have never had such FOMO before.

In other words, although I have navigated through various data roles in the past 12 years and gained knowledge of machine learning concepts, I am not able to keep up with the new developments in the generative AI area.

The new terminology, the concept of prompt engineering, the development of new large language models, numerous apps and solutions built on top of them, new e-learning courses, and the sheer volume of posts on this topic – all of this is simply overwhelming.

Moreover, I can’t shake off the unsettling feeling that some of my data skills are now just, well, obsolete.

The idea that my business colleagues will replace my hard-earned query skills with a few keystrokes is scary.

However, when giving it a second thought, I have to admit that I don’t even mind the fact that some (but only some) of my skills will be replaced. Executing ad-hoc queries several times per week to answer the same repetitive business questions is something I never liked to do.

Among other things, I am aware that having "me" in between the data stored in the data warehouse and the generation of business insights just slows down the decision-making process.

The other thing I am aware of is that this transition, i.e., my substitution, won’t happen overnight.

First of all, the current development environments need to be adapted, i.e., they need to be more "business-user friendly", and less "developer-friendly".

Second, the business users will need to gain a technical understanding of what happens "under the hood". The freedom to generate analytical insights from natural-language entries comes with the same old issues.

Problems like slow insight generation, incorrect insight generation, enrichment of the insights without new inputs (new data sources), and the technical process of insight quality checks will still exist.

And someone will still need to handle and "fix" these problems for the business users.

In other words, generative AI won’t be able to easily replace fundamental data knowledge.


So, what do I mean by "foundational" data knowledge?

To back up my answer with the above-listed problems, it comes down to three core concepts:

  1. Building Data Architecture

Argument: Technical knowledge and an understanding of how to design an appropriate data architecture in a specific industry are crucial.

Let me give you an example from the fintech industry.

In the fintech industry, there are strict regulations, i.e., the PCI Data Security Standards, that need to be considered when building a data platform. On top of these standards, sometimes there are market-based standards.

For example, in Switzerland, among others, there are FINMA regulations that need to be taken into consideration to make your data platform, and consequently your data architecture, compliant.

Of course, the regulations are prone to changes, implying that the data architecture needs to follow these changes. And this imposes a real challenge for generative AI.

Generative AI can support architecture design and development up to a specific level.

But it is not able to design customizable architectural solutions in industries where regulations are changing.

It does not possess the ability to apply specific architectural adaptations if it’s not trained on similar historical examples.

2. Data Quality Management

Argument: The saying "garbage in – garbage out" will always be valid, and everyone working in the data world knows exactly what the cost of poor data input quality is.

Using generative AI solutions, the cost of poor output quality is even higher.

For example, I need to refer to a recent article I read in the Guardian. It was about a lawyer using ChatGPT to provide examples of similar previous legal cases. He wanted to back up his argument about why his client’s lawsuit against the aviation company should not be dismissed.

I think you can already imagine how the story goes: when the airline’s lawyers checked the cited decisions and legal citations, they found out none of them ever existed. In short, ChatGPT was hallucinating.

To draw my conclusions from this article: poor data quality outputs could cost you a business and lead to a complete project shutdown or to losing your clients and reputation.

Hence, data professionals will be even busier managing the quality of data inputs and outputs.

3. Data Privacy and Security

Argument: As a data professional, you are aware of the concepts of SQL injections and database security.

With the developments in generative AI and the simple usage of prompts, data warehouse attacks and data breach scenarios are more likely than ever to happen.

The danger of prompt injections — e.g., the possibility that with one text input, someone could potentially drop the whole database or retrieve confidential records—needs to be placed at the centre of data security.

Meaning that data and IT professionals will continue to play crucial roles in protecting and securing the data.


To summarize: data professionals with knowledge of foundational data concepts will stay in the workplace as "constants" that will manage data efficiently, identify issues, and optimize solutions to be compliant, secure, and reliable.

This is the part that generative AI will not be able to easily replace.

So, if you are a young professional seeking advice on how to grow your data career in the generative AI era, start by learning the above-listed core concepts.

Trust me on this: investing time and resources to acquire fundamental data knowledge will pay off long-term in your data career.

Generative AI will boost your learning curve and work performance in these areas, but it will only help you up to a certain level. The "important" work will still be up to you and your knowledge.

The post Growing a Data Career in the Generative-AI Era appeared first on Towards Data Science.

]]>