Understanding the Tech Stack Behind Generative AI
https://towardsdatascience.com/tech-stack-generative-ai/ (Tue, 01 Apr 2025)

From foundation models to vector databases and AI agents — what makes modern AI work

When ChatGPT reached the one million user mark within five days and took off faster than any other technology in history, the world began to pay attention to artificial intelligence and AI applications.

And so it continued apace. Since then, many different terms have been buzzing around — from ChatGPT and Nvidia H100 chips to Ollama, LangChain, and Explainable AI. But what does each of these terms actually refer to, and how do they fit together?

That’s exactly what you’ll find in this article: A structured overview of the technology ecosystem around generative AI and LLMs.

Let’s dive in!

Table of Contents
1 What makes generative AI work – at its core
2 Scaling AI: Infrastructure and Compute Power
3 The Social Layer of AI: Explainability, Fairness and Governance
4 Emerging Abilities: When AI Starts to Interact and Act
Final Thoughts

Where Can You Continue Learning?

1 What makes generative AI work – at its core

New terms and tools in the field of artificial intelligence seem to emerge almost daily. At the core of it all are the foundational models, frameworks and the infrastructure required to run generative AI in the first place.

Foundation Models

Think of a Swiss Army knife: foundation models are like such a multifunctional tool – you can perform many different tasks with just one instrument.

Foundation models are large AI models that have been pre-trained on huge amounts of data (text, code, images, etc.). What is special about these models is that they can not only solve a single task but can also be used flexibly for many different applications. They can write texts, correct code, generate images or even compose music. And they are the basis for many generative AI applications.

The following three aspects are key to understanding foundation models:

  • Pre-trained
    These models were trained on huge data sets. This means that the model has ‘read’ a huge amount of text or other data. This phase is very costly and time-consuming.
  • Multitask-capable
    These foundation models can solve many tasks. If we look at GPT-4o, you can use it for everyday knowledge questions, text improvement and code generation.
  • Transferable
    Through fine-tuning or Retrieval Augmented Generation (RAG), we can adapt such Foundation Models to specific domains or specialise them for specific application areas. I have written about RAG and fine-tuning in detail in How to Make Your LLM More Accurate with RAG & Fine-Tuning. But the core of it is that you have two options to make your LLM more accurate: With RAG, the model remains the same, but you improve the input by providing the model with additional sources. For example, the model can access past support tickets or legal texts during a query – but the model parameters and weightings remain unchanged. With fine-tuning, you retrain the pre-trained model with additional sources – the model saves this knowledge permanently.

To get a feel for the amount of data we are talking about, let’s look at FineWeb. FineWeb is a massive dataset developed by Hugging Face to support the pre-training phase of LLMs. The dataset was created from 96 Common Crawl snapshots and comprises 15 trillion tokens – which takes up about 44 terabytes of storage space.

Most foundation models are based on the Transformer architecture. In this article, I won’t go into this in more detail as it’s about the high-level components around AI. The most important thing to understand is that these models can look at the entire context of a sentence at the same time, for example – and not just read word by word from left to right. The foundational paper introducing this architecture was Attention is All You Need (2017).

All major players in the AI field have released foundation models — each with different strengths, use cases, and licensing conditions (open-source or closed-source).

GPT-4 from OpenAI, Claude from Anthropic and Gemini from Google, for example, are powerful but closed models. This means that neither the model weights nor the training data are accessible to the public.

There are also high-performing open-source models from Meta, such as LLaMA 2 and LLaMA 3, as well as from Mistral and DeepSeek.

A great resource for comparing these models is the LLM Arena on Hugging Face. It provides an overview of various language models, ranks them and allows for direct comparisons of their performance.

Screenshot taken by the author: We can see a comparison of different LLM models in the LLM Arena.

Multimodal models

If we look at the GPT-3 model, it can only process pure text. Multimodal models now go one step further: They can process and generate not only text, but also images, audio and video. In other words, they can process and generate several types of data at the same time.

What does this mean in concrete terms?

Multimodal models process different types of input (e.g. an image and a question about it) and combine this information to provide more intelligent answers. For example, with Gemini 1.5 you can upload a photo of different ingredients and ask which ingredients you see on the plate.

How does this work technically?

Multimodal models understand not only text but also visual or auditory information. Like pure text models, they are usually based on the Transformer architecture. However, an important difference is that not only words are processed as ‘tokens’ but also images as so-called patches. These are small image sections that are converted into vectors and can then be processed by the model.

Let’s have a look at some examples:

  • GPT-4-Vision
    This model from OpenAI can process text and images. It recognises content in images and combines it with language.
  • Gemini 1.5
    Google’s model can process text, images, audio and video. It is particularly strong at retaining context across modalities.
  • Claude 3
    Anthropic’s model can process text and images and is very good at visual reasoning. It is good at recognising diagrams, graphics and handwriting.

Other examples are Flamingo from DeepMind, Kosmos-2 from Microsoft or Grok from Elon Musk’s xAI, which is integrated into X (formerly Twitter).

GPU & Compute Providers

When generative AI models are trained, this requires enormous computing capacity – especially for pre-training, but also for inference, the subsequent application of the model to new inputs.

Imagine a musician practising for months to prepare for a concert – that’s what pre-training is like. During pre-training, a model such as GPT-4, Claude 3, LLaMA 3 or DeepSeek-VL learns from trillions of tokens that come from texts, code, images and other sources. These data volumes are processed with GPUs (Graphics Processing Units) or TPUs (Tensor Processing Units). This is necessary because this hardware enables parallel computing (compared to CPUs). Many companies rent computing power in the cloud (e.g. via AWS, Google Cloud, Azure) instead of operating their own servers.

When a pre-trained model is adapted to specific tasks with fine-tuning, this, in turn, requires a lot of computing power. This is one of the major differences compared to customising the model with RAG. One way to make fine-tuning more resource-efficient is low-rank adaptation (LoRA). Here, small parts of the model are specifically retrained instead of the entire model being trained with new data.

If we stay with the music example, inference is the moment when the actual live concert takes place – and it has to be played over and over again. This also makes it clear that inference requires resources. Inference is the process of applying an AI model to a new input (e.g. you ask ChatGPT a question) to generate an answer or a prediction.

Some examples:

Specialised hardware components that are optimised for parallel computing are used for this. For example, NVIDIA’s A100 and H100 GPUs are the standard in many data centres. AMD’s Instinct MI300X, for example, is catching up as a high-performance alternative. Google TPUs are also used for certain workloads – especially in the Google ecosystem.

ML Frameworks & Libraries

Just like in programming languages or web development, there are frameworks for AI tasks. For example, they provide ready-made functions for building neural networks without the need to program everything from scratch. Or they make training more efficient by parallelising calculations with the framework and making efficient use of GPUs.

The most important ML frameworks for generative AI:

  • PyTorch was developed by Meta and is open source. It is very flexible and popular in research & open source.
  • TensorFlow was developed by Google and is very powerful for large AI models. It supports distributed training and is often used in cloud environments.
  • Keras is a part of TensorFlow and is mainly used for beginners and prototype development.
  • JAX is also from Google and was specially developed for high-performance AI calculations. It is often used for advanced research and Google DeepMind projects. For example, it is used for the latest Google AI models such as Gemini and Flamingo.

PyTorch and TensorFlow can easily be combined with other tools such as Hugging Face Transformers or ONNX Runtime.
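
To get a feel for how little boilerplate these frameworks require, here is a minimal sketch using the Hugging Face Transformers pipeline API on the PyTorch backend. The model name is only an illustrative choice, and the transformers and torch packages are assumed to be installed:

# Minimal sketch: text generation with a Hugging Face Transformers pipeline.
# Assumptions: transformers and torch are installed; "gpt2" is just an example model.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
result = generator("Generative AI is", max_new_tokens=20)
print(result[0]["generated_text"])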

AI Application Frameworks

These frameworks enable us to integrate the Foundation Models into specific applications. They simplify access to the Foundation Models, the management of prompts and the efficient administration of AI-supported workflows.

Three tools, as examples:

  1. LangChain enables the orchestration of LLMs for applications such as chatbots, document processing and automated analyses. It supports access to APIs, databases and external storage. And it can be connected to vector databases – which I explain in the next section – to perform contextual queries.

    Let’s look at an example: A company wants to build an internal AI assistant that searches through documents. With LangChain, it can now connect GPT-4 to the internal database and the user can search company documents using natural language.
  2. LlamaIndex was specifically designed to make large amounts of unstructured data efficiently accessible to LLMs and is therefore important for Retrieval Augmented Generation (RAG). Since LLMs only have a limited knowledge base derived from their training data, RAG allows them to retrieve additional information before generating an answer. And this is where LlamaIndex comes into play: it can be used to convert unstructured data, e.g. from PDFs, websites or databases, into searchable indices.

    Let’s take a look at a concrete example:

    A lawyer needs a legal AI assistant to search laws. LlamaIndex organises thousands of legal texts and can therefore provide precise answers quickly.
  3. Ollama makes it possible to run large language models on your own laptop or server without having to rely on the cloud. No API access is required as the models run directly on the device.

    For example, you can run a model such as Mistral, LLaMA 3 or DeepSeek locally on your device.
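
To make the last point concrete, here is a hedged sketch of calling a locally running model through the Ollama Python client. It assumes the ollama package is installed, the local Ollama server is running, and a model such as llama3 has already been pulled:

# Minimal sketch: querying a local model via the Ollama Python client.
# Assumptions: the ollama package is installed, the local Ollama server is running,
# and the "llama3" model has been pulled beforehand.
import ollama

response = ollama.chat(
    model="llama3",
    messages=[{"role": "user", "content": "Explain what a foundation model is in one sentence."}],
)
print(response["message"]["content"])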

Databases & Vector Stores

In traditional data processing, relational databases (SQL databases) store structured data in tables, while NoSQL databases such as MongoDB or Cassandra are used to store unstructured or semi-structured data.

With LLMs, however, we now also need a way to store and search semantic information.

This requires vector databases: A foundation model does not process input as text, but converts it into numerical vectors – so-called embeddings. Vector databases make it possible to perform fast similarity and memory management for embeddings and thus provide relevant contextual information.

How does this work, for example, with Retrieval Augmented Generation?

  1. Each text (e.g. a paragraph from a PDF) is translated into a vector.
  2. You pass a query to the model as a prompt. For example, you ask a question. This question is now also translated into a vector.
  3. The database now calculates which vectors are closest to the input vector.
  4. These top results are made available to the LLM before it answers. And the model then uses this information additionally for the answer.

Examples of this are Pinecone, FAISS, Weaviate, Milvus, and Qdrant.
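
As a rough sketch of these steps, the following example embeds a few text snippets with a sentence-transformers model and searches them with FAISS. The library choices, the embedding model, and the sample documents are illustrative assumptions:

# Minimal sketch of a vector search: embed texts, index them, query by similarity.
# Assumptions: sentence-transformers and faiss-cpu are installed;
# "all-MiniLM-L6-v2" is only one possible embedding model.
import faiss
from sentence_transformers import SentenceTransformer

docs = [
    "Employees get 25 vacation days per year.",
    "Support tickets are answered within 24 hours.",
    "The VPN must be used when working remotely.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
doc_vectors = model.encode(docs)                 # step 1: texts become vectors
index = faiss.IndexFlatL2(doc_vectors.shape[1])  # simple exact-search index
index.add(doc_vectors)

query_vector = model.encode(["How many vacation days do I have?"])  # step 2
distances, ids = index.search(query_vector, k=2)                    # step 3
context = [docs[i] for i in ids[0]]              # step 4: hand these snippets to the LLM
print(context)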

Programming Languages

Generative AI development also needs a programming language.

Of course, Python is probably the first choice for almost all AI applications. Python has established itself as the main language for AI & ML and is one of the most popular and widely used languages. It is flexible and offers a large AI ecosystem with all the previously mentioned frameworks such as TensorFlow, PyTorch, LangChain or LlamaIndex.

Why isn’t Python used for everything?

Python is not very fast. But thanks to CUDA backends, TensorFlow or PyTorch are still very performant. However, if performance is really very important, Rust, C++ or Go are more likely to be used.

Another language that must be mentioned is Rust: This language is used when it comes to fast, secure and memory-efficient AI infrastructures. For example, for efficient databases for vector searches or high-performance network communication. It is primarily used in the infrastructure and deployment area.

Julia is a language that is close to Python, but much faster – this makes it perfect for numerical calculations and tensor operations.

TypeScript or JavaScript are not directly relevant for AI applications but are often used in the front end of LLM applications (e.g., React or Next.js).

Own visualization — Illustrations from unDraw.co

2 Scaling AI: Infrastructure and Compute Power

Apart from the core components, we also need ways to scale and train the models.

Containers & Orchestration

Not only traditional applications, but also AI applications need to be provided and scaled. I wrote about containerisation in detail in this article Why Data Scientists Should Care about Containers – and Stand Out with This Knowledge. But at its core, the point is that with containers, we can run an AI model (or any other application) on any server and it works the same. This allows us to provide consistent, portable and scalable AI workloads.

Docker is the standard for containerisation. Generative AI is no different. We can use it to develop AI applications as isolated, repeatable units. Docker is used to deploy LLMs in the cloud or on edge devices. Edge means that the AI does not run in the cloud, but locally on your device. The Docker images contain everything you need: Python, ML frameworks such as PyTorch, CUDA for GPUs and AI APIs.

Let’s take a look at an example: A developer trains a model locally with PyTorch and saves it as a Docker container. This allows it to be easily deployed to AWS or Google Cloud.

Kubernetes is there to manage and scale container workloads. It can manage GPUs as resources. This makes it possible to run multiple models efficiently on a cluster – and to scale automatically when demand is high.

Kubeflow is less well-known outside of the AI world. It allows ML models to be orchestrated as a workflow from data processing to deployment. It is specifically designed for machine learning in production environments and supports automatic model training & hyperparameter tuning.

Chip manufacturers & AI hardware

The immense computing power that is required must be produced. This is done by chip manufacturers. Powerful hardware reduces training times and improves model inference.

There are now also some models that have been trained with fewer parameters or fewer resources for the same performance. When DeepSeek was published at the end of February, it was somewhat questioned how many resources are actually necessary. It is becoming increasingly clear that huge models and extremely expensive hardware are not always necessary.

Probably the best-known chip manufacturer in the field of AI is Nvidia, one of the most valuable companies. With its specialised A100 and H100 GPUs, the company has become the de facto standard for training and inferencing large AI models. In addition to Nvidia, however, there are other important players such as AMD with its Instinct MI300X series, Google, Amazon and Cerebras.

API Providers for Foundation Models

The Foundation Models are pre-trained models. We use APIs so that we can access them as quickly as possible without having to host them ourselves. API providers offer quick access to the models, such as OpenAI API, Hugging Face Inference Endpoints or Google Gemini API. To do this, you send a text via an API and receive the response back. However, APIs such as the OpenAI API are subject to a fee.

The best-known provider is OpenAI, whose API provides access to GPT-3.5, GPT-4, DALL-E for image generation and Whisper for speech-to-text. Anthropic also offers a powerful alternative with Claude 2 and 3. Google provides access to multimodal models such as Gemini 1.5 via the Gemini API.

Hugging Face is a central hub for open source models: the inference endpoints allow us to directly address Mistral 7B, Mixtral or Meta models, for example.

Another exciting provider is Cohere, which provides Command R+, a model specifically for Retrieval Augmented Generation (RAG) – including powerful embedding APIs.
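
To make the request-response pattern concrete, here is a hedged sketch of such an API call with the OpenAI Python client. The model name is only an example, and an API key is assumed to be set in the environment:

# Minimal sketch: sending a prompt to a hosted foundation model via an API.
# Assumptions: the openai package (v1+) is installed and OPENAI_API_KEY is set;
# "gpt-4o-mini" is only an illustrative model choice.
from openai import OpenAI

client = OpenAI()  # reads the API key from the environment
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain in one sentence what an API provider does."}],
)
print(response.choices[0].message.content)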

Serverless AI architectures

Serverless computing does not mean that there is no server but that you do not need your own server. You only define what is to be executed – not how or where. The cloud environment then automatically starts an instance, executes the code and shuts the instance down again. The AWS Lambda functions, for example, are well-known here.

Something similar is also available specifically for AI. Serverless AI reduces the administrative effort and scales automatically. This is ideal, for example, for AI tasks that are used irregularly.

Let’s take a look at an example: A chatbot on a website that answers questions from customers doesn’t have to run all the time. However, when a visitor comes to the website and asks a question, it must have resources. It is, therefore, only called up when needed.

Serverless AI can save costs and reduce complexity. However, it is not useful for continuous, latency-critical tasks.

Examples: AWS Bedrock, Azure OpenAI Service, Google Cloud Vertex AI

3 The Social Layer of AI: Explainability, Fairness and Governance

With great power and capability comes responsibility. The more we integrate AI into our everyday applications, the more important it becomes to engage with the principles of Responsible AI.

So…Generative AI raises many questions:

  • Does the model explain how it arrives at its answers?
    -> Question about Transparency
  • Are certain groups favoured?
    -> Question about Fairness
  • How is it ensured that the model is not misused?
    -> Question about Security
  • Who is liable for errors?
    -> Question about Accountability
  • Who controls how and where AI is used?
    -> Question about Governance
  • Which available data from the web (e.g. images from artists) may be used?
    -> Question about Copyright / data ethics

While we have comprehensive regulations for many areas of the physical world — such as noise control, light pollution, vehicles, buildings, and alcohol sales — similar regulatory efforts in the IT sector are still rare and often avoided.

I’m not making a generalisation or a value judgment about whether this is good or bad. Less regulation can accelerate innovation – new technologies reach the market faster. At the same time, there is a risk that important aspects such as ethical responsibility, bias detection or energy consumption by large models will receive too little attention.

With the AI Act, the EU is focusing more on a regulated approach that is intended to create clear framework conditions – but this, in turn, can reduce the speed of innovation. The USA tends to pursue a market-driven, liberal approach with voluntary guidelines. This promotes rapid development but often leaves ethical and social issues in the background.

Let’s take a look at three concepts:

Explainability

Many large LLMs such as GPT-4 or Claude 3 are considered so-called black boxes: they provide impressive answers, but we do not know exactly how they arrive at these results. The more we entrust them with – especially in sensitive areas such as education, medicine or justice – the more important it becomes to understand their decision-making processes.

Tools such as LIME, SHAP or Attention Maps are ways of minimising these problems. They analyse model decisions and present them visually. In addition, model cards (standardised documentation) help to make the capabilities, training data, limitations and potential risks of a model transparent.

Fairness

If a model has been trained with data that contains biases or biased representations, it will also inherit these biases and distortions. This can lead to certain population groups being systematically disadvantaged or stereotyped. There are methods for recognising bias and clear standards for how training data should be selected and tested.

Governance

Finally, the question of governance arises: Who actually determines how AI may be used? Who checks whether a model is being operated responsibly?

4 Emerging Abilities: When AI Starts to Interact and Act

This is about the new capabilities that go beyond the classic prompt-response model. AI is becoming more active, more dynamic and more autonomous.

Let’s take a look at a concrete example:

A classic LLM like GPT-3 follows the typical process: For example, you ask a question like ‘Please show me how to create a button with rounded corners using HTML & CSS’. The model then provides you with the appropriate code, including a brief explanation. It returns a pure text output without actively executing anything or taking any further steps.

Screenshot taken by the author: The answer from ChatGPT if we ask for creating buttons with rounded corners.

AI agents go much further. They not only analyse the prompt but also develop plans independently, access external tools or APIs and can complete tasks in several steps.

A simple example:

Instead of just writing the template for an email, an agent can monitor a data source and independently send an email as soon as a certain event occurs. For example, an email could go out when a sales target has been exceeded.

AI agents

AI agents are application logic built on top of the foundation models. They orchestrate decisions and execute steps independently. Agents such as AutoGPT carry out multi-step tasks independently. They think in loops and try to improve or achieve a goal step by step.

Some examples:

  • Your AI agent analyzes new market reports daily, summarizes them, stores them in a database, and notifies the user in case of deviations.
  • An agent initiates a job application process: It scans submitted profiles and matches them with job offers.
  • In an e-commerce shop, the agent monitors inventory levels and customer demand. If a product is running low, it automatically reorders it – including price comparisons between suppliers.

What typically makes up an AI agent?

An AI agent consists of several specialized components that together make it possible to autonomously plan, execute, and learn tasks (a minimal loop sketch follows the list below):

  • Large Language Model
    The LLM is the core or thinking engine. Typical models include GPT-4, Claude 3, Gemini 1.5, or Mistral 7B.
  • Planning unit
    The planner transforms a higher-level goal into a concrete plan or sequence of steps. Often based on methods like Chain-of-Thought or ReAct.
  • Tool access
    This component enables the agent to use external tools. For example, using a browser for extended search, a Python environment for code execution or enabling access to APIs and databases.
  • Memory
    This component stores information about previous interactions, intermediate results, or contextual knowledge. This is necessary so that the agent can act consistently across multiple steps.
  • Executor
    This component executes the planned steps in the correct order, monitors progress, and replans in case of errors.
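
Putting these components together, the loop below is a deliberately simplified, hypothetical sketch: call_llm and search_web are placeholders for a real LLM call and a real tool integration, and the plan format is invented for illustration.

# Highly simplified, hypothetical agent loop: plan, use a tool, remember, repeat.
# All functions are placeholders and the "TOOL ..." / "FINAL ..." format is invented.
def call_llm(prompt: str) -> str:
    """Placeholder for the call to the underlying language model."""
    raise NotImplementedError

def search_web(query: str) -> str:
    """Placeholder for a tool, e.g. a web search API."""
    raise NotImplementedError

TOOLS = {"search_web": search_web}

def run_agent(goal: str, max_steps: int = 5) -> str:
    memory = []                                 # memory: previous steps and results
    for _ in range(max_steps):                  # executor: work through the task step by step
        plan = call_llm(                        # planning unit: decide the next action
            f"Goal: {goal}\nHistory: {memory}\n"
            "Reply with 'TOOL <name> <input>' or 'FINAL <answer>'."
        )
        if plan.startswith("FINAL"):
            return plan.removeprefix("FINAL").strip()
        _, tool_name, tool_input = plan.split(" ", 2)   # tool access
        observation = TOOLS[tool_name](tool_input)
        memory.append(f"{plan} -> {observation}")
    return "Stopped: step limit reached without a final answer."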

There are also tools like Make or n8n (low-code / no-code automation platforms) that let you implement “agent-like” logic. They execute workflows with conditions, triggers, and actions. For example, an automated reply can be formulated when a new email arrives in the inbox. And there are a lot of templates for such use cases.

Screenshot taken by the author: Templates on n8n as an example for low-code or no-code platforms.

Reinforcement Learning

With reinforcement learning, the models are made more “human-friendly.” In this training method, the model learns through reward. This is especially important for tasks where there is no clear “right” or “wrong,” but rather gradual quality.

An example of this is when you use ChatGPT, receive two different responses and are asked to rate which one you prefer.

The reward can come either from human feedback (Reinforcement Learning from Human Feedback – RLHF) or from another model (Reinforcement Learning from AI Feedback – RLAIF). In RLHF, a human rates several responses from a model, allowing the LLM to learn what “good” responses look like and better align with human expectations. In RLAIF, an AI model takes over the rating, which makes it possible to scale feedback far beyond what human annotators could provide. In both cases, the reward does not have to be binary (e.g., good vs. bad) but can be differentiated and context-dependent – which is especially useful where there are many possible “good” responses, but some match the user’s intent much better.

On my Substack, I regularly write summaries about the published articles in the fields of Tech, Python, Data Science, Machine Learning and AI. If you’re interested, take a look or subscribe.

Final Thoughts

It would probably be possible to write an entire book about generative AI right now – not just a single article. Artificial intelligence has been researched and applied for many years. But we are currently in a moment when an explosion of tools, applications, and frameworks is happening – AI, and especially generative AI, has truly arrived in our everyday lives. Let’s see where this takes us and end with a quote from Alan Kay:

The best way to predict the future is to invent it.

Where Can You Continue Learning?

How to Make Your LLM More Accurate with RAG & Fine-Tuning
https://towardsdatascience.com/how-to-make-your-llm-more-accurate-with-rag-fine-tuning/ (Tue, 11 Mar 2025)

And when to use which one

Imagine studying a module at university for a semester. At the end, after an intensive learning phase, you take an exam – and you can recall the most important concepts without looking them up.

Now imagine the second situation: You are asked a question about a new topic. You don’t know the answer straight away, so you pick up a book or browse a wiki to find the right information for the answer.

These two analogies represent two of the most important methods for improving the basic model of an LLM or adapting it to specific tasks and areas: Retrieval Augmented Generation (RAG) and Fine-Tuning.

But which example belongs to which method?

That’s exactly what I’ll explain in this article: After that, you’ll know what RAG and fine-tuning are, the most important differences and which method is suitable for which application.

Let’s dive in!

Table of contents
1. Basics: What is RAG? What is fine-tuning?
2. Differences between RAG and fine-tuning
3. Ways to build a RAG model
4. Options for fine-tuning a model
5. When is RAG recommended? When is fine-tuning recommended?
Final thoughts
Where can you continue learning?

1. Basics: What is RAG? What is fine-tuning?

Large Language Models (LLMs) such as ChatGPT from OpenAI, Gemini from Google, Claude from Anthropic or DeepSeek are incredibly powerful and have established themselves in everyday work over an extremely short time.

One of their biggest limitations is that their knowledge is limited to their training data. A model that was trained in 2024 does not know events from 2025. If we ask ChatGPT’s GPT-4o model who the current US President is and give the clear instruction that the Internet should not be used, we see that it cannot answer this question with certainty:

Screenshot taken by the author

In addition, the models cannot easily access company-specific information, such as internal guidelines or current technical documentation. 

This is exactly where RAG and fine-tuning come into play.

Both methods make it possible to adapt an LLM to specific requirements:

RAG — The model remains the same, the input is improved

An LLM with Retrieval Augmented Generation (RAG) remains unchanged.

However, it gains access to an external knowledge source and can therefore retrieve information that is not stored in its model parameters. RAG extends the model in the inference phase by using external data sources to provide the latest or specific information. The inference phase is the moment when the model generates an answer. 

This allows the model to stay up to date without retraining.

How does it work?

  1. A user question is asked.
  2. The query is converted into a vector representation.
  3. A retriever searches for relevant text sections or data records in an external data source. The documents or FAQs are often stored in a vector database.
  4. The content found is transferred to the model as additional context.
  5. The LLM generates its answer on the basis of the retrieved and current information.

The key point is that the LLM itself remains unchanged and the internal weights of the LLM remain the same. 

Let’s assume a company uses an internal AI-powered support chatbot.

The chatbot helps employees to answer questions about company policies, IT processes or HR topics. If you asked ChatGPT a question about your company (e.g. How many vacation days do I have left?), the model would logically not give you back a meaningful answer. A classic LLM without RAG would know nothing about the company – it has never been trained with this data.

This changes with RAG: The chatbot can search an external database of current company policies for the most relevant documents (e.g. PDF files, wiki pages or internal FAQs) and provide specific answers.

RAG works much like when we humans look up specific information in a library or a Google search – but in real time.

A student who is asked about the meaning of CRUD quickly looks up the Wikipedia article and answers Create, Read, Update and Delete – just like a RAG model retrieves relevant documents. This process allows both humans and AI to provide informed answers without memorizing everything.

And this makes RAG a powerful tool for keeping responses accurate and current.

Own visualization by the author

Fine-tuning — The model is trained and stores knowledge permanently

Instead of looking up external information, an LLM can also be directly updated with new knowledge through fine-tuning.

Fine-tuning is used during the training phase to provide the model with additional domain-specific knowledge. An existing base model is further trained with specific new data. As a result, it “learns” specific content and internalizes technical terms, style or certain content, but retains its general understanding of language.

This makes fine-tuning an effective tool for customizing LLMs to specific needs, data or tasks.

How does this work?

  1. The LLM is trained with a specialized data set. This data set contains specific knowledge about a domain or a task.
  2. The model weights are adjusted so that the model stores the new knowledge directly in its parameters.
  3. After training, the model can generate answers without the need for external sources.

Let’s now assume we want to use an LLM that provides us with expert answers to legal questions.

To do this, this LLM is trained with legal texts so that it can provide precise answers after fine-tuning. For example, it learns complex terms such as “intentional tort” and can name the appropriate legal basis in the context of the relevant country. Instead of just giving a general definition, it can cite relevant laws and precedents.

This means that you no longer just have a general LLM like GPT-4o at your disposal, but a useful tool for legal decision-making.

If we look again at the analogy with humans, fine-tuning is comparable to having internalized knowledge after an intensive learning phase.

After this learning phase, a computer science student knows that the term CRUD stands for Create, Read, Update, Delete. He or she can explain the concept without needing to look it up. The general vocabulary has been expanded.

This internalization allows for faster, more confident responses—just like a fine-tuned LLM.

2. Differences between RAG and fine-tuning

Both methods improve the performance of an LLM for specific tasks.

Both methods require well-prepared data to work effectively.

And both methods help to reduce hallucinations – the generation of false or fabricated information.

But if we look at the table below, we can see the differences between these two methods:

RAG is particularly flexible because the model can always access up-to-date data without having to be retrained. It requires less computational effort in advance, but needs more resources while answering a question (inference). The latency can also be higher.

Fine-tuning, on the other hand, offers faster inference times because the knowledge is stored directly in the model weights and no external search is necessary. The major disadvantage is that training is time-consuming and expensive and requires large amounts of high-quality training data.

RAG provides the model with tools to look up knowledge when needed without changing the model itself, whereas fine-tuning stores the additional knowledge in the model with adjusted parameters and weights.

Own visualization by the author

3. Ways to build a RAG model

A popular framework for building a Retrieval Augmented Generation (RAG) pipeline is LangChain. This framework facilitates the linking of LLM calls with a retrieval system and makes it possible to retrieve information from external sources in a targeted manner.

How does RAG work technically?

1. Query embedding

In the first step, the user request is converted into a vector using an embedding model. This is done, for example, with text-embedding-ada-002 from OpenAI or all-MiniLM-L6-v2 from Hugging Face.

This is necessary because vector databases do not search through conventional texts, but instead calculate semantic similarities between numerical representations (embeddings). By converting the user query into a vector, the system can not only search for exactly matching terms, but also recognize concepts that are similar in content.

2. Search in the vector database

The resulting query vector is then compared with a vector database. The aim is to find the most relevant information to answer the question.

This similarity search is carried out using Approximate Nearest Neighbors (ANN) algorithms. Well-known open source tools for this task are, for example, FAISS from Meta for high-performance similarity searches in large data sets or ChromaDB for small to medium-sized retrieval tasks.

3. Insertion into the LLM context

In the third step, the retrieved documents or text sections are integrated into the prompt so that the LLM generates its response based on this information.

4. Generation of the response

The LLM now combines the information received with its general language vocabulary and generates a context-specific response.
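
The four steps above can be wired together in a few lines. The following is a hedged sketch using LangChain with OpenAI embeddings and a FAISS vector store; the package names, model names, and example documents are assumptions for illustration, not the only way to build this:

# Minimal RAG sketch with LangChain: embed documents, retrieve the most similar
# ones, insert them into the prompt, and let the LLM answer.
# Assumptions: langchain-openai, langchain-community and faiss-cpu are installed,
# and an OpenAI API key is configured; model names are only examples.
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS

documents = [
    "Vacation requests must be submitted at least two weeks in advance.",
    "The IT helpdesk is reachable Monday to Friday from 8 am to 6 pm.",
]

vectorstore = FAISS.from_texts(documents, OpenAIEmbeddings())   # steps 1 + 2
question = "When can I reach the IT helpdesk?"
hits = vectorstore.similarity_search(question, k=2)             # step 2
context = "\n".join(doc.page_content for doc in hits)           # step 3

llm = ChatOpenAI(model="gpt-4o-mini")
answer = llm.invoke(                                            # step 4
    f"Answer the question using only this context:\n{context}\n\nQuestion: {question}"
)
print(answer.content)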

An alternative to LangChain is the Hugging Face Transformers library, which provides specially developed RAG classes:

  • ‘RagTokenizer’ tokenizes the input and the retrieval result. The class processes the text entered by the user and the retrieved documents.
  • The ‘RagRetriever’ class performs the semantic search and retrieval of relevant documents from the predefined knowledge base.
  • The ‘RagSequenceForGeneration’ class takes the documents provided, integrates them into the context and transfers them to the actual language model for answer generation.

4. Options for fine-tuning a model

While an LLM with RAG uses external information for the query, with fine-tuning we change the model weights so that the model permanently stores the new knowledge.

How does fine-tuning work technically?

1. Preparation of the training data

Fine-tuning requires a high-quality collection of data. This collection consists of inputs and the desired model responses. For a chatbot, for example, these can be question-answer pairs. For medical models, this could be clinical reports or diagnostic data. For a legal AI, these could be legal texts and judgments.

Let’s take a look at an example: If we look at the documentation of OpenAI, we see that these models use a standardized chat format with roles (system, user, assistant) during fine-tuning. The data format of these question-answer pairs is JSONL and looks like this, for example:

{"messages": [{"role": "system", "content": "Du bist ein medizinischer Assistent."}, {"role": "user", "content": "Was sind Symptome einer Grippe?"}, {"role": "assistant", "content": "Die häufigsten Symptome einer Grippe sind Fieber, Husten, Muskel- und Gelenkschmerzen."}]}  

Other models use other data formats such as CSV, JSON or PyTorch datasets.

2. Selection of the base model

We can use a pre-trained LLM as a starting point. These can be closed-source models such as GPT-3.5 or GPT-4 via the OpenAI API, or open-source models such as DeepSeek, LLaMA, Mistral or Falcon – or T5 and FLAN-T5 for classic NLP tasks.
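
For closed-source models accessed via an API, the fine-tuning job is started through the provider instead of being trained locally. Here is a hedged sketch with the OpenAI Python client; the file name and base model are only examples, and the set of models that support fine-tuning changes over time:

# Minimal sketch: uploading training data and starting a fine-tuning job via the OpenAI API.
# Assumptions: openai (v1+) is installed, OPENAI_API_KEY is set, and a prepared
# "training_data.jsonl" file exists; the base model name is only an example.
from openai import OpenAI

client = OpenAI()

# 1. Upload the JSONL file with the question-answer pairs
training_file = client.files.create(
    file=open("training_data.jsonl", "rb"),
    purpose="fine-tune",
)

# 2. Start the fine-tuning job on a base model that supports fine-tuning
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)
print(job.id, job.status)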

3. Training of the model

Fine-tuning requires a lot of computing power, as the model is trained with new data to update its weights. Especially large models such as GPT-4 or LLaMA 65B require powerful GPUs or TPUs.

To reduce the computational effort, there are optimized methods such as LoRA (Low-Rank Adaptation), where only a small number of additional parameters are trained, or QLoRA (Quantized LoRA), where quantized model weights (e.g. 4-bit) are used.
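
For open-source models that you train yourself, LoRA is typically applied with the peft library on top of Transformers. The following is a hedged sketch rather than a complete training script; the model name, target modules, and hyperparameters are illustrative assumptions that depend on the architecture:

# Minimal sketch: wrapping a pre-trained model with LoRA adapters using peft.
# Assumptions: transformers and peft are installed; model name, target modules
# and hyperparameters are only examples.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model_name = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
model = AutoModelForCausalLM.from_pretrained(base_model_name)

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,                        # scaling factor for the adapter weights
    target_modules=["q_proj", "v_proj"],  # which attention projections get adapters
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # usually only a small fraction of all weights
# The wrapped model can now be passed to a regular Trainer or training loop.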

4. Model deployment & use

Once the model has been trained, we can deploy it locally or on a cloud platform such as Hugging Face Model Hub, AWS or Azure.

5. When is RAG recommended? When is fine-tuning recommended?

RAG and fine-tuning have different advantages and disadvantages and are therefore suitable for different use cases:

RAG is particularly suitable when content is updated dynamically or frequently.

For example, in FAQ chatbots where information needs to be retrieved from a knowledge database that is constantly expanding. Technical documentation that is regularly updated can also be efficiently integrated using RAG – without the model having to be constantly retrained.

Another point is resources: If limited computing power or a smaller budget is available, RAG makes more sense as no complex training processes are required.

Fine-tuning, on the other hand, is suitable when a model needs to be tailored to a specific company or industry.

The response quality and style can be improved through targeted training. For example, the LLM can then generate medical reports with precise terminology.

The basic rule is: RAG is used when the knowledge is too extensive or too dynamic to be fully integrated into the model, while fine-tuning is the better choice when consistent, task-specific behavior is required.

And then there’s RAFT — the magic of combination

What if we combine the two?

That’s exactly what happens with Retrieval Augmented Fine-Tuning (RAFT).

The model is first enriched with domain-specific knowledge through fine-tuning so that it understands the correct terminology and structure. The model is then extended with RAG so that it can integrate specific and up-to-date information from external data sources. This combination ensures both deep expertise and real-time adaptability.

Companies use the advantages of both methods. 

Final thoughts

Both methods—RAG and fine-tuning—extend the capabilities of a basic LLM in different ways.

Fine-tuning specializes the model for a specific domain, while RAG equips it with external knowledge. The two methods are not mutually exclusive and can be combined in hybrid approaches. Looking at computational costs, fine-tuning is resource-intensive upfront but efficient during operation, whereas RAG requires fewer initial resources but consumes more during use.

RAG is ideal when knowledge is too vast or dynamic to be integrated directly into the model. Fine-tuning is the better choice when stability and consistent optimization for a specific task are required. Both approaches serve distinct but complementary purposes, making them valuable tools in AI applications.

On my Substack, I regularly write summaries about the published articles in the fields of Tech, Python, Data Science, Machine Learning and AI. If you’re interested, take a look or subscribe.

Where can you continue learning?

Deep Research by OpenAI: A Practical Test of AI-Powered Literature Review
https://towardsdatascience.com/deep-research-by-openai-a-practical-test-of-ai-powered-literature-review/ (Tue, 04 Mar 2025)

How Deep Research handled a state-of-the-art review and possible challenges for research

“Conduct a comprehensive literature review on the state-of-the-art in Machine Learning and energy consumption. […]”

With this prompt, I tested the new Deep Research function, which has been integrated into the OpenAI o3 reasoning model since the end of February — and conducted a state-of-the-art literature review within 6 minutes.

This function goes beyond a normal web search (for example, with ChatGPT 4o): The research query is broken down & structured, the Internet is searched for information, which is then evaluated, and finally, a structured, comprehensive report is created.

Let’s take a closer look at this.

Table of Contents
1. What is Deep Research from OpenAI and what can you do with it?
2. How does deep research work?
3. How can you use deep research? — Practical example
4. Challenges and risks of the Deep Research feature
Final Thoughts
Where can you continue learning?

1. What is Deep Research from OpenAI and what can you do with it?

If you have an OpenAI Plus account (the $20 per month plan), you have access to Deep Research. This gives you access to 10 queries per month. With the Pro subscription ($200 per month) you have extended access to Deep Research and access to the research preview of GPT-4.5 with 120 queries per month.

OpenAI promises that we can perform multi-step research using data from the public web.

Duration: 5 to 30 minutes, depending on complexity. 

Previously, such research usually took hours.

It is intended for complex tasks that require a deep search and thoroughness.

What do concrete use cases look like?

  • Conduct a literature review: Conduct a literature review on state-of-the-art machine learning and energy consumption.
  • Market analysis: Create a comparative report on the best marketing automation platforms for companies in 2025 based on current market trends and evaluations.
  • Technology & software development: Investigate programming languages and frameworks for AI application development with performance and use case analysis
  • Investment & financial analysis: Conduct research on the impact of AI-powered trading on the financial market based on recent reports and academic studies.
  • Legal research: Conduct an overview of data protection laws in Europe compared to the US, including relevant rulings and recent changes.

2. How does Deep Research work?

Deep Research uses various Deep Learning methods to carry out a systematic and detailed analysis of information. The entire process can be divided into four main phases:

1. Decomposition and structuring of the research question

In the first step, the tool processes the research question using natural language processing (NLP) methods. It identifies the most important key terms, concepts, and sub-questions.

This step ensures that the AI understands the question not only literally, but also in terms of content.

2. Obtaining relevant information

Once the tool has structured the research question, it searches specifically for information. Deep Research uses a mixture of internal databases, scientific publications, APIs, and web scraping. These can be open-access databases such as arXiv, PubMed, or Semantic Scholar, for example, but also public websites or news sites such as The Guardian, New York Times, or BBC. In the end, any content that can be accessed online and is publicly available.

3. Analysis & interpretation of the data

The next step is for the AI model to summarize large amounts of text into compact and understandable answers. Transformers and attention mechanisms ensure that the most important information is prioritized. This means that it does not simply create a summary of all the content found. The quality and credibility of the sources are also assessed, and cross-validation methods are normally used to identify incorrect or contradictory information. Here, the AI tool compares several sources with each other. However, it is not publicly known exactly how this is done in Deep Research or what criteria are applied.

4. Generation of the final report

Finally, the final report is generated and displayed to us. This is done using Natural Language Generation (NLG) so that we see easily readable texts.

The AI system generates diagrams or tables if requested in the prompt and adapts the response to the user’s style. The primary sources used are also listed at the end of the report.

3. How you can use Deep Research: A practical example

In the first step, it is best to use one of the standard models to ask how you should optimize the prompt in order to conduct deep research. I have done this with the following prompt with ChatGPT 4o:

“Optimize this prompt to conduct a deep research:
Carrying out a literature search: Carry out a literature search on the state of the art on machine learning and energy consumption.”

The 4o model suggested the following prompt for the Deep Research function:

Deep Research screenshot (German and English)
Screenshot taken by the author

The tool then asked me if I could clarify the scope and focus of the literature review. I have, therefore, provided some additional specifications:

Deep research screenshot
Screenshot taken by the author

ChatGPT then returned the clarification and started the research.

In the meantime, I could see the progress and how more sources were gradually added.

After 6 minutes, the state-of-the-art literature review was complete, and the report, including all sources, was available to me.

Screen recording by the author: Deep Research Example.mp4

4. Challenges and risks of the Deep Research feature

Let’s take a look at two definitions of research:

“A detailed study of a subject, especially in order to discover new information or reach a new understanding.”

Reference: Cambridge Dictionary

“Research is creative and systematic work undertaken to increase the stock of knowledge. It involves the collection, organization, and analysis of evidence to increase understanding of a topic, characterized by a particular attentiveness to controlling sources of bias and error.”

Reference: Wikipedia Research

The two definitions show that research is a detailed, systematic investigation of a topic — with the aim of discovering new information or achieving a deeper understanding.

Basically, the deep research function fulfills these definitions to a certain extent: it collects existing information, analyzes it, and presents it in a structured way.

However, I think we also need to be aware of some challenges and risks:

  • Danger of superficiality: Deep Research is primarily designed to efficiently search, summarize, and provide existing information in a structured form (at least at the current stage). Absolutely great for overview research. But what about digging deeper? Real scientific research goes beyond mere reproduction and takes a critical look at the sources. Science also thrives on generating new knowledge.
  • Reinforcement of existing biases in research & publication: Papers are already more likely to be published if they have significant results. “Non-significant” or contradictory results, on the other hand, are less likely to be published. This is known as publication bias. If the AI tool now primarily evaluates frequently cited papers, it reinforces this trend, and rare or less widespread but possibly important findings are lost. A possible solution here would be to implement a mechanism for weighted source evaluation that also takes into account less cited but relevant papers. Presumably, this effect also applies to us humans.
  • Quality of research papers: While it is obvious that a bachelor’s, master’s, or doctoral thesis cannot be based solely on AI-generated research, the question I have is how universities or scientific institutions deal with this development. Students can get a solid research report with just a single prompt. Presumably, the solution here must be to adapt assessment criteria to give greater weight to in-depth reflection and methodology.

Final thoughts

In addition to OpenAI, other companies and platforms have also integrated similar functions (some even before OpenAI): For example, Perplexity AI has introduced a deep research function that independently conducts and analyzes searches. Google’s Gemini has also integrated such a deep research function.

The function gives you an incredibly quick overview of an initial research question. It remains to be seen how reliable the results are. Currently (beginning March 2025), OpenAI itself writes as limitations that the feature is still at an early stage, can sometimes hallucinate facts into answers or draw false conclusions, and has trouble distinguishing authoritative information from rumors. In addition, it is currently unable to accurately convey uncertainties.

But it can be assumed that this function will be expanded further and become a powerful tool for research. If you have simpler questions, it is better to use the standard GPT-4o model (with or without search), where you get an immediate answer.

Where can you continue learning?

Want more tips & tricks about tech, Python, data science, data engineering, machine learning and AI? Then regularly receive a summary of my most-read articles on my Substack — curated and for free.

Click here to subscribe to my Substack!

Why Data Scientists Should Care about Containers — and Stand Out with This Knowledge
https://towardsdatascience.com/why-data-scientists-should-care-about-containers-and-stand-out-with-this-knowledge/ (Thu, 20 Feb 2025)

“I train models, analyze data and create dashboards — why should I care about Containers?”

Many people who are new to the world of data science ask themselves this question. But imagine you have trained a model that runs perfectly on your laptop. However, error messages keep popping up in the cloud when others access it — for example, because they are using different library versions.

This is where containers come into play: They allow us to make machine learning models, data pipelines and development environments stable, portable and scalable — regardless of where they are executed.

Let’s take a closer look.

Table of Contents
1 — Containers vs. Virtual Machines: Why containers are more flexible than VMs
2 — Containers & Data Science: Do I really need Containers? And 4 reasons why the answer is yes.
3 — First Practice, then Theory: Container creation even without much prior knowledge
4 — Your 101 Cheatsheet: The most important Docker commands & concepts at a glance
Final Thoughts: Key takeaways as a data scientist
Where Can You Continue Learning?

1 — Containers vs. Virtual Machines: Why containers are more flexible than VMs

Containers are lightweight, isolated environments. They contain applications with all their dependencies. They also share the kernel of the host operating system, making them fast, portable and resource-efficient.

I have written extensively about virtual machines (VMs) and virtualization in ‘Virtualization & Containers for Data Science Newbies’. But the most important thing is that VMs simulate complete computers and have their own operating system with their own kernel, running on a hypervisor. This means that they require more resources, but also offer greater isolation.

Both containers and VMs are virtualization technologies.

Both make it possible to run applications in an isolated environment.

But in the two descriptions, you can also see the 3 most important differences:

  • Architecture: While each VM has its own operating system (OS) and runs on a hypervisor, containers share the kernel of the host operating system. However, containers still run in isolation from each other. A hypervisor is the software or firmware layer that manages VMs and abstracts the operating system of the VMs from the physical hardware. This makes it possible to run multiple VMs on a single physical server.
  • Resource consumption: As each VM contains a complete OS, it requires a lot of memory and CPU. Containers, on the other hand, are more lightweight because they share the host OS.
  • Portability: You have to customize a VM for different environments because it requires its own operating system with specific drivers and configurations that depend on the underlying hardware. A container, on the other hand, can be created once and runs anywhere a container runtime is available (Linux, Windows, cloud, on-premise). Container runtime is the software that creates, starts and manages containers — the best-known example is Docker.

Created by the author

You can experiment faster with Docker — whether you’re testing a new ML model or setting up a data pipeline. You can package everything in a container and run it immediately. And you don’t have any “It works on my machine”-problems. Your container runs the same everywhere — so you can simply share it.

2 — Containers & Data Science: Do I really need Containers? And 4 reasons why the answer is yes.

As a data scientist, your main task is to analyze, process and model data to gain valuable insights and predictions, which in turn are important for management.

Of course, you don’t need to have the same in-depth knowledge of containers, Docker or Kubernetes as a DevOps Engineer or a Site Reliability Engineer (SRE). Nevertheless, it is worth having container knowledge at a basic level — because these are 4 examples of where you will come into contact with it sooner or later:

Model deployment

You are training a model. You not only want to use it locally but also make it available to others. To do this, you can pack it into a container and make it available via a REST API.

Let’s look at a concrete example: Your trained model runs in a Docker container with FastAPI or Flask. The server receives the requests, processes the data and returns ML predictions in real-time.
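
To illustrate this, here is a hedged sketch of such a prediction service with FastAPI. The model file, the feature format, and the endpoint name are assumptions made for the example:

# Minimal sketch: serving a trained model as a REST API with FastAPI.
# Assumptions: fastapi, uvicorn, joblib and scikit-learn are installed and a
# trained model has been saved as "model.joblib" (hypothetical path).
from fastapi import FastAPI
from pydantic import BaseModel
import joblib

app = FastAPI()
model = joblib.load("model.joblib")

class Features(BaseModel):
    values: list[float]  # raw input features for one prediction

@app.post("/predict")
def predict(features: Features):
    prediction = model.predict([features.values])
    return {"prediction": prediction.tolist()}

# Run locally with: uvicorn main:app --host 0.0.0.0 --port 8000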

Reproducibility and easier collaboration

ML models and pipelines require specific libraries. For example, if you want to use a deep learning model like a Transformer, you need TensorFlow or PyTorch. If you want to train and evaluate classic machine learning models, you need Scikit-Learn, NumPy and Pandas. A Docker container now ensures that your code runs with exactly the same dependencies on every computer, server or in the cloud. You can also deploy a Jupyter Notebook environment as a container so that other people can access it and use exactly the same packages and settings.

Cloud integration

Containers include all packages, dependencies and configurations that an application requires. They therefore run uniformly on local computers, servers or cloud environments. This means you don’t have to reconfigure the environment.

For example, you write a data pipeline script. This works locally for you. As soon as you deploy it as a container, you can be sure that it will run in exactly the same way on AWS, Azure, GCP or the IBM Cloud.
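
As an illustration, such a pipeline script can be very small. The following sketch (the file names are placeholders) only depends on what is packaged in the container, so it behaves identically on every platform:

# pipeline.py: minimal sketch of a pipeline step (file names are placeholders)
import pandas as pd

def run_pipeline(input_path: str = "raw_orders.csv", output_path: str = "clean_orders.parquet") -> None:
    df = pd.read_csv(input_path)

    # Simple cleaning: drop incomplete rows and normalize the column names
    df = df.dropna()
    df.columns = [c.strip().lower() for c in df.columns]

    # Write the result in a compact, analysis-friendly format (requires pyarrow)
    df.to_parquet(output_path, index=False)

if __name__ == "__main__":
    run_pipeline()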

Scaling with Kubernetes

Kubernetes helps you to orchestrate containers. But more on that below. If you now get a lot of requests for your ML model, you can scale it automatically with Kubernetes. This means that more instances of the container are started.

3 — First Practice, then Theory: Container creation even without much prior knowledge

Let’s take a look at an example that anyone can run through with minimal time — even if you haven’t heard much about Docker and containers. It took me 30 minutes.

We’ll set up a Jupyter Notebook inside a Docker container, creating a portable, reproducible Data Science environment. Once it’s up and running, we can easily share it with others and ensure that everyone works with the exact same setup.

0 — Install Docker Desktop and create a project directory

To be able to use containers, we need Docker Desktop. To do this, we download Docker Desktop from the official website.

Now we create a new folder for the project. You can do this directly in your file explorer or via the terminal — on Windows, press Windows + R and open CMD.

We use the following command:

Screenshot taken by the author
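
The command in the screenshot simply creates the project folder. Judging by the directory name used in the later steps, it is:

mkdir jupyter-docker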

1. Create a Dockerfile

Now we open VS Code or another editor and create a new file with the name ‘Dockerfile’. We save this file without an extension in the same directory. Why no extension? Because docker build looks for a file that is literally named ‘Dockerfile’ by default.

We add the following code to this file:

# Use the official Jupyter notebook image with SciPy
FROM jupyter/scipy-notebook:latest  

# Set the working directory inside the container
WORKDIR /home/jovyan/work  

# Copy all local files into the container
COPY . .

# Start Jupyter Notebook without token
CMD ["start-notebook.sh", "--NotebookApp.token=''"]

We have thus defined a container environment for Jupyter Notebook that is based on the official Jupyter SciPy Notebook image.

First, we define with FROM on which base image the container is built. jupyter/scipy-notebook:latest is a preconfigured Jupyter notebook image that contains libraries such as NumPy, SciPy, Matplotlib and Pandas. Alternatively, we could also use a different image here.

With WORKDIR we set the working directory within the container. /home/jovyan/work is the default path used by Jupyter. User jovyan is the default user in Jupyter Docker images. Another directory could also be selected — but this directory is best practice for Jupyter containers.

With COPY . . we copy all files from the local directory — in this case the Dockerfile, which is located in the jupyter-docker directory — to the working directory /home/jovyan/work in the container.

With CMD ["start-notebook.sh", "--NotebookApp.token=''"] we specify the default start command for the container: it runs the Jupyter Notebook start script and disables the token, which allows us to access the notebook directly via the browser.

2. Create the Docker image

Next, we will build the Docker image. Make sure the previously installed Docker Desktop is running. We now go back to the terminal and use the following commands:

cd jupyter-docker
docker build -t my-jupyter .

With cd jupyter-docker we navigate to the folder we created earlier. With docker build we create a Docker image from the Dockerfile. With -t my-jupyter we give the image a name. The dot at the end tells Docker to use the current directory as the build context, which is where it looks for the Dockerfile and the files to copy. Note the space between the image name and the dot.

The Docker image is the template for the container. This image contains everything needed for the application such as the operating system base (e.g. Ubuntu, Python, Jupyter), dependencies such as Pandas, Numpy, Jupyter Notebook, the application code and the startup commands. When we “build” a Docker image, this means that Docker reads the Dockerfile and executes the steps that we have defined there. The container can then be started from this template (Docker image).

We can now watch the Docker image being built in the terminal.

Screenshot taken by the author

We use docker images to check whether the image exists. If the output my-jupyter appears, the creation was successful.

docker images

If yes, we see the data for the created Docker image:

Screenshot taken by the author

3. Start Jupyter container

Next, we want to start the container and use this command to do so:

docker run -p 8888:8888 my-jupyter

We start a container with docker run, followed by the name of the image we want to run (my-jupyter). With -p 8888:8888 we map the local port 8888 to port 8888 inside the container, the port Jupyter listens on, so that we can reach the notebook from the browser.

Alternatively, you can also perform this step in Docker Desktop:

Screenshot taken by the author

4. Open Jupyter Notebook & create a test notebook

Now we open the URL http://localhost:8888 in the browser. You should now see the Jupyter Notebook interface.

Here we will now create a Python 3 notebook and insert the following Python code into it.

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)
y = np.sin(x)

plt.plot(x, y)
plt.title("Sine Wave")
plt.show()

Running the code will display the sine curve:

Screenshot taken by the author

5. Terminate the container

Finally, we stop the container either with CTRL + C in the terminal or in Docker Desktop.

With docker ps we can check in the terminal whether containers are still running and with docker ps -a we can display the container that has just been terminated:

Screenshot taken by the author

6. Share your Docker image

If you now want to upload your Docker image to a registry, you can do this with the following commands. They upload your image to Docker Hub (you need a Docker Hub account for this). You can also push it to a private registry such as AWS Elastic Container Registry, Google Container Registry, Azure Container Registry or IBM Cloud Container Registry.

docker login

docker tag my-jupyter your-dockerhub-name/my-jupyter:latest

docker push your-dockerhub-name/my-jupyter:latest

If you then open Docker Hub and go to your repositories in your profile, the image should be visible.

This was a very simple example to get started with Docker. If you want to dive a little deeper, you can deploy a trained ML model with FastAPI via a container.

4 — Your 101 Cheatsheet: The most important Docker commands & concepts at a glance

You can actually think of a container like a shipping container. Regardless of whether you load it onto a ship (local computer), a truck (cloud server) or a train (data center) — the content always remains the same.

The most important Docker terms

  • Container: Lightweight, isolated environment for applications that contains all dependencies.
  • Docker: The most popular container platform that allows you to create and manage containers.
  • Docker Image: A read-only template that contains code, dependencies and system libraries.
  • Dockerfile: Text file with commands to create a Docker image.
  • Kubernetes: Orchestration tool to manage many containers automatically.

The basic concepts behind containers

  • Isolation: Each container contains its own processes, libraries and dependencies
  • Portability: Containers run wherever a container runtime is installed.
  • Reproducibility: You can create a container once and it runs exactly the same everywhere.

The most basic Docker commands

docker --version # Check if Docker is installed
docker ps # Show running containers
docker ps -a # Show all containers (including stopped ones)
docker images # List of all available images
docker info # Show system information about the Docker installation

docker run hello-world # Start a test container
docker run -d -p 8080:80 nginx # Start Nginx in the background (-d) with port forwarding
docker run -it ubuntu bash # Start interactive Ubuntu container with bash

docker pull ubuntu # Load an image from Docker Hub
docker build -t my-app . # Build an image from a Dockerfile

Final Thoughts: Key takeaways as a data scientist

👉 With Containers you can solve the “It works on my machine” problem. Containers ensure that ML models, data pipelines, and environments run identically everywhere, independent of OS or dependencies.

👉 Containers are more lightweight and flexible than virtual machines. While VMs come with their own operating system and consume more resources, containers share the host operating system and start faster.

👉 There are three key steps when working with containers: Create a Dockerfile to define the environment, use docker build to create an image, and run it with docker run — optionally pushing it to a registry with docker push.

And then there’s Kubernetes, a term that comes up a lot in this context: an orchestration tool that automates container management, ensuring scalability, load balancing and fault recovery. This is particularly useful for microservices and cloud applications.

Before Docker, VMs were the go-to solution (see more in ‘Virtualization & Containers for Data Science Newbies’). VMs offer strong isolation, but require more resources and start slower.

So, Docker was developed in 2013 by Solomon Hykes to solve this problem. Instead of virtualizing entire operating systems, containers run independently of the environment — whether on your laptop, a server or in the cloud. They contain all the necessary dependencies so that they work consistently everywhere.

I simplify tech for curious minds 🚀 If you enjoy my tech insights on Python, data science, data engineering, machine learning and AI, consider subscribing to my substack.

Where Can You Continue Learning?

The post Why Data Scientists Should Care about Containers — and Stand Out with This Knowledge appeared first on Towards Data Science.

]]>
Virtualization & Containers for Data Science Newbies https://towardsdatascience.com/virtualization-containers-for-data-science-newbies/ Wed, 12 Feb 2025 01:04:25 +0000 https://towardsdatascience.com/?p=597744 Virtualization makes it possible to run multiple virtual machines (VMs) on a single piece of physical hardware. These VMs behave like independent computers, but share the same physical computing power. A computer within a computer, so to speak. Many cloud services rely on virtualization. But other technologies, such as containerization and serverless computing, have become […]

The post Virtualization & Containers for Data Science Newbies appeared first on Towards Data Science.

]]>
Virtualization makes it possible to run multiple virtual machines (VMs) on a single piece of physical hardware. These VMs behave like independent computers, but share the same physical computing power. A computer within a computer, so to speak.

Many cloud services rely on virtualization. But other technologies, such as containerization and serverless computing, have become increasingly important.

Without virtualization, many of the digital services we use every day would not be possible. Of course, this is a simplification, as some cloud services also use bare-metal infrastructures.

In this article, you will learn how to set up your own virtual machine on your laptop in just a few minutes — even if you have never heard of Cloud Computing or containers before.

Table of Contents
1 — The Origins of Cloud Computing: From Mainframes to Serverless Architecture
2 — Understanding Virtualization: Why it’s the Basis of Cloud Computing
3 — Create a Virtual Machine with VirtualBox
Final Thoughts
Where can you continue learning?

1 — The Origins of Cloud Computing: From Mainframes to Serverless Architecture

Cloud computing has fundamentally changed the IT landscape — but its roots go back much further than many people think. In fact, the history of the cloud began back in the 1950s with huge mainframes and so-called dumb terminals.

  • The era of mainframes in the 1950s: Companies used mainframes so that several users could access them simultaneously via dumb terminals. The central mainframes were designed for high-volume, business-critical data processing. Large companies still use them today, even if cloud services have reduced their relevance.
  • Time-sharing and virtualization: In the next decade (1960s), time-sharing made it possible for multiple users to access the same computing power simultaneously — an early model of today’s cloud. Around the same time, IBM pioneered virtualization, allowing multiple virtual machines to run on a single piece of hardware.
  • The birth of the internet and web-based applications in the 1990s: Six years before I was born, Tim Berners-Lee developed the World Wide Web, which revolutionized online communication and our entire working and living environment. Can you imagine our lives today without internet? At the same time, PCs were becoming increasingly popular. In 1999, Salesforce revolutionized the software industry with Software as a Service (SaaS), allowing businesses to use CRM solutions over the internet without local installations.
  • The big breakthrough of cloud computing in the 2000s and 2010s:
    The modern cloud era began in 2006 with Amazon Web Services (AWS): Companies were able to flexibly rent infrastructure with S3 (storage) and EC2 (virtual servers) instead of buying their own servers. Microsoft Azure and Google Cloud followed with PaaS and IaaS services.
  • The modern cloud-native era: This was followed by the next innovation with containerization. Docker made Containers popular in 2013, followed by Kubernetes in 2014 to simplify the orchestration of containers. Next came serverless computing with AWS Lambda and Google Cloud Functions, which enabled developers to write code that automatically responds to events. The infrastructure is fully managed by the cloud provider.

Cloud computing is more the result of decades of innovation than a single new technology. From time-sharing to virtualization to serverless architectures, the IT landscape has continuously evolved. Today, cloud computing is the foundation for streaming services like Netflix, AI applications like ChatGPT and global platforms like Salesforce.

2 — Understanding Virtualization: Why Virtualization is the Basis of Cloud Computing

Virtualization means abstracting physical hardware, such as servers, storage or networks, into multiple virtual instances.

Several independent systems can be operated on the same physical infrastructure. Instead of dedicating an entire server to a single application, virtualization enables multiple workloads to share resources efficiently. For example, Windows, Linux or another environment can be run simultaneously on a single laptop — each in an isolated virtual machine.

This saves costs and resources.

Even more important, however, is the scalability: Infrastructure can be flexibly adapted to changing requirements.

Before cloud computing became widely available, companies often had to maintain dedicated servers for different applications, leading to high infrastructure costs and limited scalability. If more performance was suddenly required, for example because webshop traffic increased, new hardware was needed. The company had to add more servers (horizontal scaling) or upgrade existing ones (vertical scaling).

This is different with virtualization: For example, I can simply upgrade my virtual Linux machine from 8 GB to 16 GB RAM or assign 4 cores instead of 2. Of course, only if the underlying infrastructure supports this. More on this later.

And this is exactly what cloud computing makes possible: The cloud consists of huge data centers that use virtualization to provide flexible computing power — exactly when it is needed. So, virtualization is a fundamental technology behind cloud computing.

How does serverless computing work?

What if you didn’t even have to manage virtual machines anymore?

Serverless computing goes one step further than virtualization and containerization. The cloud provider handles most infrastructure tasks — including scaling, maintenance and resource allocation. Developers can focus on writing and deploying code.

But does serverless really mean that there are no more servers?

Of course not. The servers are still there, but they are invisible to the user. Developers no longer have to worry about them. Instead of manually provisioning a virtual machine or container, you simply deploy your code, and the cloud automatically executes it in a managed environment. Resources are only provided while the code is running. For example, you can use AWS Lambda, Google Cloud Functions or Azure Functions.

What are the advantages of serverless?

As a developer, you don’t have to worry about scaling or maintenance. This means that if there is a lot more traffic at a particular event, the resources are automatically adjusted. Serverless computing can be cost-efficient, especially in Function-as-a-Service (FaaS) models. If nothing is running, you pay nothing. However, some serverless services have baseline costs (e.g. Firestore).

Are there any disadvantages?

You have much less control over the infrastructure and no direct access to the servers. There is also a risk of vendor lock-in. The applications are strongly tied to a cloud provider.

A concrete example of serverless: API without your own server

Imagine you have a website with an API that provides users with the current weather. Normally, a server runs around the clock — even at times when no one is using the API.

With AWS Lambda, things work differently: A user enters ‘Mexico City’ on your website and clicks on ‘Get weather’. This request triggers a Lambda function in the background, which retrieves the weather data and sends it back. The function is then stopped automatically. This means you don’t have a permanently running server and no unnecessary costs — you only pay when the code is executed.
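
To make this concrete, the handler of such a function can be very short. Here is a minimal sketch in Python; the weather lookup is stubbed out, since the real weather API is not part of this example:

# lambda_function.py: minimal sketch of an AWS Lambda handler (weather data is stubbed)
import json

def get_weather(city: str) -> dict:
    # In a real function you would call a weather API here
    return {"city": city, "temperature_c": 22, "condition": "sunny"}

def lambda_handler(event, context):
    # API Gateway passes query parameters in the event dictionary
    city = (event.get("queryStringParameters") or {}).get("city", "Mexico City")
    weather = get_weather(city)
    return {"statusCode": 200, "body": json.dumps(weather)}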

3 — What Data Scientists should Know about Containers and VMs — What’s the Difference?

You’ve probably heard of containers. But what is the difference to virtual machines — and what is particularly relevant as a data scientist?

Both containers and virtual machines are virtualization technologies.

Both make it possible to run applications in isolation.

Both offer advantages depending on the use case: While VMs provide strong security, containers excel in speed and efficiency.

The main difference lies in the architecture:

  • Virtual machines virtualize the entire hardware — including the operating system. Each VM has its own operating system (OS), which in turn requires more memory and resources.
  • Containers, on the other hand, share the host operating system and only virtualize the application layer. This makes them significantly lighter and faster.

Put simply, virtual machines simulate entire computers, while containers only encapsulate applications.

Why is this important for data scientists?

Since as a data scientist you will come into contact with machine learning, data engineering or data pipelines, it is also important to understand something about containers and virtual machines. Sure, you don’t need to have in-depth knowledge of it like a DevOps Engineer or a Site Reliability Engineer (SRE).

Virtual machines are used in data science, for example, when a complete operating system environment is required — such as a Windows VM on a Linux host. Data science projects often need specific environments. With a VM, it is possible to provide exactly the same environment — regardless of which host system is available.

A VM is also needed when training deep learning models with GPUs in the cloud. With cloud VMs such as AWS EC2 or Azure Virtual Machines, you have the option of training the models with GPUs. VMs also completely separate different workloads from each other to ensure performance and security.

Containers are used in data science for data pipelines, for example, where tools such as Apache Airflow run individual processing steps in Docker containers. This means that each step can be executed in isolation and independently of each other — regardless of whether it involves loading, transforming or saving data. Even if you want to deploy machine learning models via Flask / FastAPI, a container ensures that everything your model needs (e.g. Python libraries, framework versions) runs exactly as it should. This makes it super easy to deploy the model on a server or in the cloud.

3 — Create a Virtual Machine with VirtualBox

Let’s make this a little more concrete and create an Ubuntu VM. 🚀

I use the VirtualBox software on my Windows Lenovo laptop. The virtual machine runs in isolation from your main operating system, so no changes are made to your actual system. If you have Windows Pro Edition, you can also enable Hyper-V (pre-installed by default, but disabled). On an Intel Mac, you should also be able to use VirtualBox. On an Apple Silicon Mac, Parallels Desktop or UTM is reportedly the better alternative (I have not tested this myself).

1) Install VirtualBox

The first step is to download the installation file from the official VirtualBox website and install VirtualBox. All necessary drivers are installed along with it.

You can ignore the note about missing dependencies Python Core / win32api as long as you do not want to automate VirtualBox with Python scripts.

Then we start the Oracle VirtualBox Manager:

Screenshot taken by the author

2) Download the Ubuntu ISO file

Next, we download the Ubuntu ISO file from the Ubuntu website. An Ubuntu ISO file is a disk image of the Ubuntu operating system, which means it contains a complete copy of the installation data. I download the LTS version because this version receives security and maintenance updates for 5 years (Long Term Support). Note the location of the .iso file as we will use it later in VirtualBox.

Screenshot taken by the author

3) Create a virtual machine in VirtualBox

Next, we create a new virtual machine in the VirtualBox Manager and give it the name Ubuntu VM 2025. Here we select Linux as the type and Ubuntu (64-bit) as the version. We also select the previously downloaded ISO file from Ubuntu as the ISO image. It would also be possible to add the ISO file later in the mass storage menu.

Screenshot taken by the author

Next, we select a user name vboxuser2025 and a password for access to the Ubuntu system. The hostname is the name of the virtual machine within the network or system. It must not contain any spaces. The domain name is optional and would be used if the network has multiple devices.

We then assign the appropriate resources to the virtual machine. I choose 8 GB (8192 MB) RAM, as my host system has 64 GB RAM. I recommend 4GB (4096) as a minimum. I assign 2 processors, as my host system has 8 cores and 16 logical processors. It would also be possible to assign 4 cores, but this way I have enough resources for my host system. You can find out how many cores your host system has by opening the Task Manager in Windows and looking at the number of cores under the Performance tab under CPU.

Screenshot taken by the author

Next, we click on ‘Create a virtual hard disk now’ to create a virtual hard disk. A VM requires its own virtual hard disk to install the OS (e.g. Ubuntu, Windows). All programs, files and configurations of the VM are stored on it — just like on a physical hard disk. The default value is 25 GB. If you want to use a VM for machine learning or data science, more storage space (e.g. 50–100 GB) would be useful to have room for large data sets and models. I keep the default setting.

We can then see that the virtual machine has been created and can be used:

Screenshot taken by the author

4) Use Ubuntu VM

We can now use the newly created virtual machine like a normal separate operating system. The VM is completely isolated from the host system. This means you can experiment in it without changing or jeopardizing your main system.

If you are new to Linux, you can try out basic commands like ls, cd, mkdir or sudo to get to know the terminal. As a data scientist, you can set up your own development environments, install Python with Pandas and Scikit-learn to develop data analysis and machine learning models. Or you can install PostgreSQL and run SQL queries without having to set up a local database on your main system. You can also use Docker to create containerized applications.

Final Thoughts

Since the VM is isolated, we can install programs, experiment and even destroy the system without affecting the host system.

Let’s see if virtual machines remain relevant in the coming years. As companies increasingly use microservice architectures (instead of monoliths), containers with Docker and Kubernetes will certainly become even more important. But knowing how to set up a virtual machine and what it is used for is certainly useful.

I simplify tech for curious minds. If you enjoy my tech insights on Python, data science, data engineering, machine learning and AI, consider subscribing to my substack.

Where Can You Continue Learning?

The post Virtualization & Containers for Data Science Newbies appeared first on Towards Data Science.

]]>
The Concepts Data Professionals Should Know in 2025: Part 2 https://towardsdatascience.com/the-concepts-data-professionals-should-know-in-2025-part-2-c0e308946463/ Mon, 20 Jan 2025 11:02:02 +0000 https://towardsdatascience.com/the-concepts-data-professionals-should-know-in-2025-part-2-c0e308946463/ From AI Agent to Human-In-The-Loop - Master 12 critical data concepts and turn them into simple projects to stay ahead in IT.

The post The Concepts Data Professionals Should Know in 2025: Part 2 appeared first on Towards Data Science.

]]>
From AI Agent to Human-In-The-Loop — Master 12 data concepts and turn them into simple projects to stay ahead in IT.

Innovation in the field of data is progressing rapidly.

Let’s take a quick look at the timeline of GenAI: ChatGPT, launched in November 2022, became the world’s best-known application of generative AI in early 2023. By spring 2025, leading companies like Salesforce (Marketing Cloud Growth) and Adobe (Firefly) integrated it into mainstream applications – making it accessible to companies of various sizes. Tools like MidJourney advanced image generation, while at the same time, discussions about agentic AI took center stage. Today, tools like ChatGPT have already become common for many private users.

That’s why I have compiled 12 terms that you will certainly encounter as a data engineer, data scientist and data analyst in 2025 and are important to understand. Why are they relevant? What are the challenges? And how can you apply them to a small project?

Table of Contents
Terms 1–6 in part 1: Data Warehouse, Data Lake, Data Lakehouse; Cloud Platforms; Optimizing Data Storage; Big Data Technologies; ETL, ELT and Zero-ETL; Event-Driven Architecture
7 – Data Lineage & XAI
8 – Gen AI
9 – Agentic AI
10 – Inference Time Compute
11 – Near Infinite Memory
12 – Human-In-The-Loop Augmentation
Final Thoughts

In the first part, we looked at terms for the basics of understanding modern data systems (storage, management & processing of data). In part 2, we now move beyond infrastructure and dive into some terms related to Artificial Intelligence that use this data to drive innovation.

7 – Explainability of predictions and traceability of data: XAI & Data Lineage

As data and AI tools become increasingly important in our everyday lives, we also need to know how to track them and create transparency for decision-making processes and predictions:

Let’s imagine a scenario in a hospital: A deep learning model is used to predict the chances of success of an operation. A patient is categorised as ‘unsuitable’ for the operation. The problem for the medical team? There is no explanation as to how the model arrived at this decision. The internal processes and calculations that led to the prediction remain hidden. It is also not clear which attributes – such as age, state of health or other parameters – were decisive for this assessment. Should the medical team nevertheless believe the prediction and not proceed with the operation? Or should they proceed as they see best fit?

This lack of transparency can lead to uncertainty or even mistrust in AI-supported decisions. Why does this happen? Many deep learning models provide us with results and excellent predictions – much better than simple models can do. However, the models are ‘black boxes’ – we don’t know exactly how the models arrived at the results and what features they used to do so. While this lack of transparency hardly plays a role in everyday applications, such as distinguishing between cat and dog photos, the situation is different in critical areas: For example, in healthcare, financial decisions, criminology or recruitment processes, we need to be able to understand how and why a model arrives at certain results.

This is where Explainable AI (XAI) comes into play: techniques and methods that attempt to make the decision-making process of AI models understandable and comprehensible. Examples of this are SHAP (SHapley Additive ExPlanations) or LIME (Local Interpretable Model-agnostic Explanations). These tools can at least show us which features contributed most to a decision.

Data Lineage, on the other hand, helps us understand where data comes from, how it has been processed and how it is ultimately used. In a BI tool, for example, a report with incorrect figures could be used to check whether the problem occurred with the data source, the transformation or when loading the data.

Why are the terms important?

XAI: The more AI models we use in everyday life and as decision-making aids, the more we need to know how these models have achieved their results. Especially in areas such as finance and healthcare, but also in processes such as HR and social services.

Data Lineage: In the EU there is GDPR, in California CCPA. These require companies to document the origin and use of data in a comprehensible manner. What does that mean in concrete terms? If companies have to comply with data protection laws, they must always know where the data comes from and how it was processed.

What are the challenges?

  1. Complexity of the data landscape (data lineage): In distributed systems and multi-cloud environments, it is difficult to fully track the data flow.
  2. Performance vs. transparency (XAI): Deep learning models often deliver more precise results, but their decision paths are difficult to trace. Simpler models, on the other hand, are usually easier to interpret but less accurate.

Small project idea to better understand the terms:

Use SHAP (SHapley Additive ExPlanations) to explain the decision logic of a machine learning model: Create a simple ML model with scikit-learn to predict house prices, for example. Then install the SHAP library in Python and visualize how the different features influence the price prediction.
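
A minimal version of this project could look like the following sketch, which uses the California housing dataset from scikit-learn as a stand-in for house prices:

# Explain a house-price model with SHAP (pip install scikit-learn shap)
import shap
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor

# Load the data and train a simple model
data = fetch_california_housing(as_frame=True)
X, y = data.data, data.target
model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X, y)

# Compute SHAP values for a small sample to keep the runtime short
explainer = shap.TreeExplainer(model)
sample = X.sample(100, random_state=42)
shap_values = explainer.shap_values(sample)

# Visualize which features drive the price prediction
shap.summary_plot(shap_values, sample)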

8 – Generative AI (Gen AI)

Since ChatGPT took off in January 2023, the term Gen AI has also been on everyone’s lips. Generative AI refers to AI models that can generate new content from an input. Outputs can be texts, images, music or videos. For example, there are now even fashion stores that have created their advertising images using generative AI (e.g. Calvin Klein, Zalando).

"We started OpenAI almost nine years ago because we believed that AGI was possible, and that it could be the most impactful technology in human history. We wanted to figure out how to build it and make it broadly beneficial; […]"

Reference: Sam Altman, CEO of OpenAI

Why is the term important?

Clearly, GenAI can greatly increase efficiency. The time required for tasks such as content creation, design or texts is reduced for companies. GenAI is also changing many areas of our working world. Tasks are being performed differently, jobs are changing and data is becoming even more important.

In Salesforce’s latest marketing automation tool, for example, users can enter a prompt in natural language, which generates an email layout – even if this does not always work reliably in reality.

What are the challenges?

  1. Copyrights and ethics: The models are trained with huge amounts of data that originate from us humans and try to generate the most realistic results possible based on this (e.g. also with texts by authors or images by well-known painters). One problem is that GenAI can imitate existing works. Who owns the result? A simple way to minimize this problem at least somewhat is to clearly label AI-generated content as such.
  2. Costs and energy: Large models require a very large amount of computing resources.
  3. Bias and misinformation: The models are trained with specific data. If the data already contains a bias (e.g. less data from one gender, less data from one country), these models can reproduce biases. For example, if an HR tool has been trained with more male than female data, it could favor male applicants in a job application. And of course, sometimes the models simply provide incorrect information.

Small project idea to better understand the terms:

Create a simple chatbot that accesses the GPT-4 API and can answer a question. I have attached a step-by-step guide at the bottom of the page.
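
If you want a compact starting point, here is a minimal sketch with the official OpenAI Python client. You need your own API key, and the exact model name depends on what your account currently has access to:

# Minimal chatbot call with the OpenAI Python client (pip install openai)
# Assumes the API key is set in the OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

question = "Explain the difference between a data lake and a data warehouse in two sentences."

response = client.chat.completions.create(
    model="gpt-4o",  # the model name depends on your account and current availability
    messages=[
        {"role": "system", "content": "You are a helpful data science tutor."},
        {"role": "user", "content": question},
    ],
)

print(response.choices[0].message.content)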

9 – Agentic AI / AI Agents

Agentic AI is currently a hotly debated topic and is based on generative AI. AI agents describe intelligent systems that can think, plan and act "autonomously":

"This is what AI was meant to be. […] And I am really excited about this. I think this is going to change companies forever. I think it’s going to change software forever. And I think it’ll change Salesforce forever."

_Reference: Marc Benioff, Salesforce CEO about Agents & Agentforce_

AI Agents are, so to speak, a continuation of traditional chatbots and bots. These systems promise to solve complex problems by creating multi-level plans, learning from data and making decisions based on this and executing them autonomously.

Multi-step plans mean that the AI thinks several steps ahead to achieve a goal.

Let’s imagine a quick example: An AI agent has the task of delivering a parcel. Instead of simply following the sequence of orders, the AI could first analyze the traffic situation, calculate the fastest route and then deliver the various parcels in this calculated sequence.

Why is the term important?

The ability to execute multi-step plans sets AI Agents apart from previous bots and chatbots and brings a new era of autonomous systems.

If AI Agents can actually be used in businesses, companies can automate repetitive tasks through agents, reducing costs and increasing efficiency. The economic benefits and competitive advantage would be there. As the Salesforce CEO says in the interview, it can change our corporate world tremendously.

What are the challenges?

  1. Logical consistency and (current) technological limitations: Current models struggle with consistent logical thinking – especially when it comes to handling complex scenarios with multiple variables. And that’s exactly what they’re there for – or that’s how they’re advertised. This means that in 2025 there will definitely be an increased need for better models.
  2. Ethics and acceptance: Autonomous systems can make decisions and solve their own tasks independently. How can we ensure that autonomous systems do not make decisions that violate ethical standards? As a society, we also need to define how quickly we want to integrate such changes into our everyday (working) lives without taking employees by surprise. Not everyone has the same technical know-how.

Small project idea to better understand the term:

Create a simple AI agent with Python: first define what the agent should do, for example retrieve data from an API. Use Python to coordinate the API query, the filtering of results and an automatic email to the user. Then implement a simple decision logic: for example, if no result matches the filter criteria, the search radius is extended.
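
A minimal sketch of this idea could look as follows. The API call and the email step are stubbed out with simulated data and a print statement, so everything runs locally:

# Minimal "agent" loop: query -> filter -> widen the search if nothing matches (all data is simulated)
import random

def fetch_listings(radius_km: int) -> list[dict]:
    # Stand-in for a real API call; a larger radius returns more candidate listings
    return [
        {"name": f"Listing {i}", "price": random.randint(800, 2500)}
        for i in range(radius_km)
    ]

def notify_user(results: list[dict]) -> None:
    # Stand-in for sending an email to the user
    print(f"Sending {len(results)} matching listings to the user: {results}")

def run_agent(max_price: int = 1000) -> None:
    radius_km = 2
    while radius_km <= 32:  # safety limit so the loop always terminates
        listings = fetch_listings(radius_km)
        matches = [l for l in listings if l["price"] <= max_price]
        if matches:
            notify_user(matches)
            return
        # Decision logic: nothing matched, so the agent extends the search radius
        radius_km *= 2
    print("No suitable listings found, even after extending the search radius.")

run_agent()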

10 – Inference Time Compute

Next, we focus on the efficiency and performance of using AI models: An AI model receives input data, makes a prediction or decision based on it and gives an output. This process requires computing time, which is referred to as inference time compute. Modern models such as AI agents go one step further by flexibly adapting their computing time to the complexity of the task.

Basically, it’s the same as with us humans: When we have to solve more complex problems, we invest more time. AI models use dynamic reasoning (adapting computing time according to task requirements) and chain reasoning (using multiple decision steps to solve complex problems).

Why is the term important?

AI and models are becoming increasingly important in our everyday lives. The demand for dynamic AI systems (AI that adapts flexibly to requests and understands our requests) will increase. Inference time affects the performance of systems such as chatbots, autonomous vehicles and real-time translators. AI models that adapt their inference time to the complexity of the task and therefore "think" for different lengths of time will improve efficiency and accuracy.

What are the challenges?

  1. Performance vs. quality: Do you want a fast but less accurate or a slow but very accurate solution? Shorter inference times improve efficiency, but can compromise accuracy for complex tasks.
  2. Energy consumption: The longer the inference time, the more computing power is required. This in turn increases energy consumption.

11 – Near Infinite Memory

Near Infinite Memory is a concept that describes how technologies can store and process enormous amounts of data almost indefinitely.

For us users, it seems like infinite storage – but it is actually more of a combination of scalable cloud services, data-optimized storage solutions and intelligent data management systems.

Why is this term important?

The data we generate is growing exponentially due to the increasing use of IoT, AI and Big Data. As already described in terms 1–3, this creates ever greater demands on data architectures such as data lakehouses. AI models also require enormous amounts of data for training and validation. It is therefore important that storage solutions become more efficient.

What are the challenges?

  1. Energy consumption: Large storage solutions in cloud data centers consume immense amounts of energy.
  2. Security concerns and dependence on centralized services: Many near-infinite memory solutions are provided by cloud providers. This can create a dependency that brings financial and data protection risks.

Small project idea to better understand the terms:

Develop a practical understanding of how different data types affect storage requirements and learn how to use storage space efficiently. Take a look at the project under the term "Optimizing Data Storage".

12 – Human-In-The-Loop Augmentation

AI is becoming increasingly important, as the previous terms have shown. However, with the increasing importance of AI, we should ensure that the human part is not lost in the process.

"We need to let people who are harmed by technology imagine the future that they want."

_Reference: Timnit Gebru, former Head of Department of Ethics in AI at Google_

Human-in-the-loop augmentation is the interface between computer science and psychology, so to speak. It describes the collaboration between us humans and artificial intelligence. The aim is to combine the strengths of both sides:

  • A great strength of AI is that such models can efficiently process data in large quantities and discover patterns in it that are difficult for us to recognize.
  • We humans, on the other hand, bring judgment, ethics, creativity and contextual understanding to the table, without needing to be trained on data first, and we can cope with unforeseen situations.

The goal must be for AI to serve us humans – and not the other way around.

Why is the term important?

AI can improve decision-making processes and minimize errors. In particular, AI can recognize patterns in data that are not visible to us, for example in the field of medicine or biology.

The MIT Center for Collective Intelligence published a study in Nature Human Behavior in which they analyzed how well human-AI combinations perform compared to purely human or purely AI-controlled systems:

  • In decision-making tasks, human-AI combinations often performed worse than AI systems alone (e.g. medical diagnoses / classification of deepfakes).
  • In creative tasks, the interaction already works better. Here, human-AI teams outperformed both humans and AI alone.

However, the study shows that human-in-the-loop augmentation does not yet work perfectly.

Reference: Humans and AI: Do they work better together or alone?

What are the challenges?

  1. Lack of synergy and mistrust: It seems that there is a lack of intuitive interfaces that make it easier for us humans to interact effectively enough with AI tools. Another challenge is that AI systems are sometimes viewed critically or even rejected.
  2. (Current) technological limitations of AI: Current AI systems struggle to understand logical consistency and context. This can lead to erroneous or inaccurate results. For example, an AI diagnostic system could misjudge a rare case because it does not have enough data for such cases.

Final Thoughts

The terms in this article only show a selection of the innovations that we are currently seeing – the list could definitely be extended. For example, in the area of AI models, the size of the models will also play an important role: In addition to very large models (with up to 50 trillion parameters), individual very small models will probably also be developed that will only contain a few billion parameters. The advantage of these small models will be that they do not require huge data centers and GPUs, but can run on our laptops or even on our smartphones and perform very specific tasks.

Which terms do you think are super important? Let us know in the comments.

Where can you continue learning?

Own visualization – Illustrations from unDraw.co

All information in this article is based on the current status in January 2025.

The post The Concepts Data Professionals Should Know in 2025: Part 2 appeared first on Towards Data Science.

]]>
The Concepts Data Professionals Should Know in 2025: Part 1 https://towardsdatascience.com/the-concepts-data-professionals-should-know-in-2025-part-1-47e7e797801d/ Sun, 19 Jan 2025 19:02:04 +0000 https://towardsdatascience.com/the-concepts-data-professionals-should-know-in-2025-part-1-47e7e797801d/ From Data Lakehouses to Event-Driven Architecture - Master 12 data concepts and turn them into simple projects to stay ahead in IT.

The post The Concepts Data Professionals Should Know in 2025: Part 1 appeared first on Towards Data Science.

]]>
From Data Lakehouses to Event-Driven Architecture — Master 12 data concepts and turn them into simple projects to stay ahead in IT.

When I scroll through YouTube or LinkedIn and see topics like RAG, Agents or Quantum Computing, I sometimes get a queasy feeling about keeping up with these innovations as a data professional.

But when I reflect then on the topics my customers face daily as a Salesforce Consultant or as a Data Scientist at university, the challenges often seem more tangible: examples are faster data access, better data quality or boosting employees’ tech skills. The key issues are often less futuristic and can usually be simplified. That’s the focus of this and the next article:

I have compiled 12 terms that you will certainly encounter as a data engineer, data scientist and data analyst in 2025. Why are they relevant? What are the challenges? And how can you apply them to a small project?

So – Let’s dive in.

Table of Contents
1 – Data Warehouse, Data Lake, Data Lakehouse
2 – Cloud Platforms such as AWS, Azure & Google Cloud Platform
3 – Optimizing Data Storage
4 – Big Data Technologies such as Apache Spark & Kafka
5 – How Data Integration Becomes Real-Time Capable: ETL, ELT and Zero-ETL
6 – Event-Driven Architecture (EDA)
Terms 7–12 in part 2: Data Lineage & XAI, Gen AI, Agentic AI, Inference Time Compute, Near Infinite Memory, Human-In-The-Loop Augmentation
Final Thoughts

1 – Data Warehouse, Data Lake, Data Lakehouse

We start with the foundation for data architecture and storage to understand modern data management systems.

Data warehouses became really well known in the 1990s thanks to Business Intelligence tools from Oracle and SAP, for example. Companies began to store structured data from various sources in a central database. An example is weekly processed sales data displayed in a business intelligence tool.

The next innovation was data lakes, which arose from the need to be able to store unstructured or semi-structured data flexibly. A data lake is a large, open space for raw data. It stores both structured and unstructured data, such as sales data alongside social media posts and images.

The next step in innovation combined data lake architecture with warehouse architecture: Data lakehouses were created.

The term was popularized by companies such as Databricks when it introduced its Delta Lake technology. This concept combines the strengths of both previous data platforms. It allows us to store unstructured data as well as quickly query structured data in a single system. The need for this data architecture has arisen primarily because warehouses are often too restrictive, while lakes are difficult to search.

Why are the terms important?

We are living in the era of Big Data – companies and private individuals are generating more and more data (structured as well as semi-structured and unstructured data).

A short personal anecdote: The year I turned 15, Facebook passed the 500 million active user mark for the first time, and Instagram was founded in the same year. In addition, the release of the iPhone 4 significantly accelerated the global spread of smartphones and shaped the mobile era. Microsoft was also further developing and promoting Azure (released in 2008) to compete with Google Cloud and AWS. From today’s perspective, I can see how all these events made 2010 a key year in which digitalisation and the transition to cloud technologies gained momentum.

In 2010, around 2 zettabytes (ZB) of data were generated, in 2020 it was around 64 ZB, in 2024 we are at around 149 zettabytes.

Reference: Statista

Due to the explosive data growth in recent years, we need to store the data somewhere – efficiently. This is where these three terms come into play. Hybrid architectures such as data lakehouses solve many of the challenges of big data. The demand for (near) real-time data analysis is also rising (see term 5 on zero ETL). And to remain competitive, companies are under pressure to use data faster and more efficiently. Data lakehouses are becoming more important as they offer the flexibility of a data lake and the efficiency of a data warehouse – without having to operate two separate systems.

What are the challenges?

  1. Data integration: As there are many different data sources (structured, semi-structured, unstructured), complex ETL / ELT processes are required.
  2. Scaling & costs: While data warehouses are expensive, data lakes can easily lead to data chaos (if no good data governance is in place) and lakehouses require technical know-how & investment.
  3. Access to the data: Permissions need to be clearly defined if the data is stored in a centralized storage.

Small project idea to better understand the terms:

Create a mini data lake with AWS S3: Upload JSON or CSV data to an S3 bucket, then process the data with Python and perform data analysis with Pandas, for example.
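
A minimal sketch of this project with boto3 could look like this. The bucket name, the object key and the column names are placeholders, your AWS credentials must be configured locally, and the bucket must already exist:

# Mini data lake: upload a CSV to S3, read it back and analyze it (pip install boto3 pandas)
import boto3
import pandas as pd

BUCKET = "my-mini-data-lake"   # placeholder bucket name
KEY = "raw/sales_2024.csv"     # placeholder object key

s3 = boto3.client("s3")

# 1. Upload a local CSV file into the "data lake"
s3.upload_file("sales_2024.csv", BUCKET, KEY)

# 2. Read the raw object back and load it into a DataFrame
obj = s3.get_object(Bucket=BUCKET, Key=KEY)
df = pd.read_csv(obj["Body"])

# 3. A first simple analysis (assumes 'region' and 'revenue' columns exist)
print(df.describe())
print(df.groupby("region")["revenue"].sum())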

2 – Cloud Platforms such as AWS, Azure & Google Cloud Platform

Now we move on to the platforms on which the concepts from 1 are often implemented.

Of course, everyone knows the term cloud platforms such as AWS, Azure or Google Cloud. These services provide us with a scalable infrastructure for storing large volumes of data. We can also use them to process data in real-time and to use Business Intelligence and Machine Learning tools efficiently.

But why are the terms important?

I work in a web design agency where we host our clients’ websites in one of the other departments. Before the easy availability of cloud platforms, this meant running our own servers in the basement – with all the challenges such as cooling, maintenance and limited scalability.

Today, most of our data architectures and AI applications run in the cloud. Cloud platforms have changed the way we store, process and analyse data over the last decades. Platforms such as AWS, Azure or Google Cloud offer us a completely new level of flexibility and scalability for model training, real-time analyses and generative AI.

What are the challenges?

  1. A quick personal example of how complex things get: While preparing for my Salesforce Data Cloud Certification (a data lakehouse), I found myself diving into a sea of new terms – all specific to the Salesforce world. Each cloud platform has its own terminology and tools, which makes it time-consuming for employees in companies to familiarize themselves with them.
  2. Data security: Sensitive data can often be stored in the cloud. Access control must be clearly defined – user management is required.

Small project idea to better understand the terms:

Create a simple data pipeline: Register with AWS, Azure or GCP with a free account and upload a CSV file (e.g. to an AWS S3 bucket). Then load the data into a relational database and use an SQL tool to perform queries.

3 – Optimizing Data Storage

More and more data = more and more storage space required = more and more costs.

With the use of large amounts of data and the platforms and concepts from 1 and 2, there is also the issue of efficiency and cost management. To save on storage, reduce costs and speed up access, we need better ways to store, organize and access data more efficiently.

Strategies include data compression (e.g. Gzip) by removing redundant or unneeded data, data partitioning by splitting large data sets, indexing to speed up queries and the choice of storage format (e.g. CSV, Parquet, Avro).

Why is the term important?

Not only is my Google Drive and OneDrive storage nearly maxed out…

… in 2028, a total data volume of 394 zettabytes is expected.

It will therefore be necessary for us to be able to cope with growing data volumes and rising costs. In addition, large data centers consume immense amounts of energy, which in turn is critical in terms of the energy and climate crisis.

What are the challenges?

  1. Different formats are optimized for different use cases. Parquet, for example, is particularly suitable for analytical queries and large data sets, as it is organized on a column basis and read access is efficient. Avro, on the other hand, is ideal for streaming data because it can quickly convert data into a format that is sent over the network (serialization) and just as quickly convert it back to its original form when it is received (deserialization). Choosing the wrong format can affect performance by either wasting disk space or increasing polling times.
  2. Cost / benefit trade-off: Compression and partitioning save storage space but can slow down computing performance and data access.
  3. Dependency on cloud providers: As a lot of data is stored in the cloud today, optimization strategies are often tied to specific platforms.

Small project idea to better understand the terms:

Compare different storage optimization strategies: Generate a 1 GB dataset with random numbers. Save the data set in three different formats such as CSV, Parquet & Avro (using the corresponding Python libraries). Then compress the files with Gzip or Snappy. Now load the data into a Pandas DataFrame using Python and compare the query speed.
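
A scaled-down sketch of this comparison could look like this: one million rows instead of 1 GB, and only gzip-compressed CSV versus snappy-compressed Parquet, so that pandas and pyarrow are the only dependencies (adding Avro via a library such as fastavro is left as an extension):

# Compare gzip-compressed CSV vs. Parquet for file size and read speed (pip install pandas pyarrow)
import os
import time

import numpy as np
import pandas as pd

# Generate a synthetic dataset with random numbers
df = pd.DataFrame(np.random.rand(1_000_000, 5), columns=list("abcde"))

df.to_csv("data.csv.gz", index=False, compression="gzip")
df.to_parquet("data.parquet", compression="snappy")

for path, reader in [("data.csv.gz", pd.read_csv), ("data.parquet", pd.read_parquet)]:
    start = time.perf_counter()
    reader(path)
    duration = time.perf_counter() - start
    size_mb = os.path.getsize(path) / 1e6
    print(f"{path}: {size_mb:.1f} MB, read in {duration:.2f} s")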

4 – Big Data Technologies such as Apache Spark & Kafka

Once the data has been stored using the storage concepts described in sections 1–3, we need technologies to process it efficiently.

We can use tools such as Apache Spark or Kafka to process and analyze huge amounts of data. They allow us to do this in real-time or in batch mode.

Spark is a framework that processes large amounts of data in a distributed manner and is used for tasks such as machine learning, Data Engineering and ETL processes.

Kafka is a tool that transfers data streams in real-time so that various applications can access and use them immediately. One example is the processing of real-time data streams in financial transactions or logistics.

Why is the term important?

In addition to the exponential growth in data, AI and machine learning are becoming increasingly important. Companies want to be able to process data in (almost) real-time: These Big Data technologies are the basis for real-time and batch processing of large amounts of data and are required for AI and streaming applications.

What are the challenges?

  1. Complexity of implementation: Setting up, maintaining and optimizing tools such as Apache Spark and Kafka requires in-depth technical expertise. In many companies, this is not readily available and must be built up or brought in externally. Distributed systems in particular can be complex to coordinate. In addition, processing large volumes of data can lead to high costs if the computing capacities in the cloud need to be scaled.
  2. Data quality: If I had to name one of my customers’ biggest problems, it would probably be data quality. Anyone who works with data knows that data quality can often be optimized in many companies… When data streams are processed in real-time, this becomes even more important. Why? In real-time systems, data is processed without delay and the results are sometimes used directly for decisions or are followed by reactions. Incorrect or inaccurate data can lead to wrong decisions.

Small project idea to better understand the terms:

Develop a small pipeline with Python that simulates, processes and saves real-time data: For example, simulate real-time data streams of temperature values. Then check whether the temperature exceeds a critical threshold value. As an extension, you can plot the temperature data in real-time.
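
A minimal sketch of this simulation, without the optional real-time plot, could look like this:

# Simulate a stream of temperature readings and flag values above a threshold
import csv
import random
import time

THRESHOLD_C = 30.0

with open("temperatures.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["reading", "temperature_c", "alert"])

    for i in range(20):
        temp = round(random.uniform(20.0, 35.0), 1)
        alert = temp > THRESHOLD_C
        writer.writerow([i, temp, alert])
        if alert:
            print(f"Reading {i}: {temp} °C exceeds the threshold of {THRESHOLD_C} °C!")
        time.sleep(0.5)  # pretend a new reading arrives every half second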

5 – How Data Integration Becomes Real-Time Capable: ETL, ELT and Zero-ETL

ETL, ELT and Zero-ETL describe different approaches to integrating and transforming data.

While ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) are familiar to most, Zero-ETL is a data integration concept introduced by AWS in 2022. It eliminates the need for separate extraction, transformation, and loading steps. Instead, data is analyzed directly in its original format – almost in real-time. The technology promises to reduce latency and simplify processes within a single platform.

Let’s take a look at an example: A company using Snowflake as a data warehouse can create a table that references the data in the Salesforce Data Cloud. This means that the organization can query the data directly in Snowflake, even if it remains in the Data Cloud.

Why are the terms important?

We live in an age of instant results – thanks to the success of platforms such as WhatsApp, Netflix and Spotify.

This is exactly the promise that cloud providers such as Amazon Web Services, Google Cloud and Microsoft Azure have made: data should be processed and analyzed almost in real-time, without major delays.

What are the challenges?

Here, too, there are similar challenges as with big data technologies: Data quality must be adequate, as incorrect data can lead directly to incorrect decisions during real-time processing. In addition, integration can be complex, although less so than with tools such as Apache Spark or Kafka.

Let me share a quick example to illustrate this: We implemented Data Cloud for a customer – the first-ever implementation in Switzerland since Salesforce started offering the Data Lakehouse solution. The entire knowledge base had to be built up on the customer’s side. What did that mean? 1:1 training sessions with the power users and writing a lot of documentation.

This demonstrates a key challenge companies face: they must first build up this knowledge internally or rely on external resources such as agencies or consulting companies.

Small project idea to better understand the terms:

Create a relational database with MySQL or PostgreSQL, add (simulated) real-time data from orders and use a cloud service such as AWS to stream the data directly into an analysis tool. Then visualize the data in a dashboard and show how new data becomes immediately visible.
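
If you want to try the core idea locally before involving a cloud service, here is a minimal sketch. It uses SQLite as a stand-in for MySQL/PostgreSQL and simulates the order stream in Python; the table and column names are made up for illustration.

import random
import sqlite3
import time

# Local SQLite database as a stand-in for MySQL/PostgreSQL
conn = sqlite3.connect("orders.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS orders (order_id INTEGER PRIMARY KEY, amount REAL, created_at TEXT)"
)

# Simulate a small stream of incoming orders
for order_id in range(1, 6):
    amount = round(random.uniform(10, 200), 2)
    conn.execute(
        "INSERT OR REPLACE INTO orders VALUES (?, ?, datetime('now'))",
        (order_id, amount),
    )
    conn.commit()
    time.sleep(1)  # a new order arrives every second

# A dashboard would poll a query like this to show new data immediately
total, count = conn.execute("SELECT SUM(amount), COUNT(*) FROM orders").fetchone()
print(f"{count} orders so far, total revenue: {total:.2f}")
conn.close()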

6 – Event-Driven Architecture (EDA)

If we can transfer data between systems in (almost) real time, we also want to be able to react to it in (almost) real time: This is where the term Event-Driven Architecture (EDA) comes into play.

EDA is an architectural pattern in which applications are driven by events. An event is any relevant change in the system. Examples are when customers log in to the application or when a payment is received. Components of the architecture react to these events without being directly connected to each other. This in turn increases the flexibility and scalability of the application. Typical technologies include Apache Kafka or AWS EventBridge.

Why is the term important?

EDA plays an important role in real-time data processing. With the growing demand for fast and efficient systems, this architecture pattern is becoming increasingly important as it makes the processing of large data streams more flexible and efficient. This is particularly crucial for IoT, e-commerce and financial technologies.

Event-driven architecture also decouples systems: By allowing components to communicate via events, the individual components do not have to be directly dependent on each other.

Let’s take a look at an example: In an online store, the "order sent" event can automatically start a payment process or inform the warehouse management system. The individual systems do not have to be directly connected to each other.

What are the challenges?

  1. Data consistency: The asynchronous nature of EDA makes it difficult to ensure that all parts of the system have consistent data. For example, an order may be saved as successful in the database while the warehouse component has not correctly reduced the stock due to a network issue.
  2. Scaling the infrastructure: With high data volumes, scaling the messaging infrastructure (e.g. Kafka cluster) is challenging and expensive.

Small project idea to better understand the terms:

Simulate an Event-Driven Architecture in Python that reacts to customer events:

  1. First define an event: An example could be ‘New order’.
  2. Then create two functions that react to the event: 1) Send an automatic message to a customer. 2) Reduce the stock level by -1.
  3. Call the two functions one after the other as soon as the event is triggered (see the sketch below). If you want to extend the project, you can work with frameworks such as Flask or FastAPI to trigger the events through external user input.
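
Here is a minimal sketch of how this could look, using a small in-memory ‘event bus’. The event name, handlers and inventory are made up for illustration.

# Minimal in-memory event bus: handlers subscribe to an event name
# and are called whenever that event is published.
subscribers = {}

def subscribe(event_name, handler):
    subscribers.setdefault(event_name, []).append(handler)

def publish(event_name, payload):
    for handler in subscribers.get(event_name, []):
        handler(payload)

stock = {"book": 10}  # simulated inventory

def send_confirmation(order):
    print(f"Message to {order['customer']}: your order has been received.")

def reduce_stock(order):
    stock[order["item"]] -= 1  # reduce the stock level by 1
    print(f"Stock of {order['item']} is now {stock[order['item']]}.")

# Both handlers react to the same event without knowing about each other
subscribe("new_order", send_confirmation)
subscribe("new_order", reduce_stock)

publish("new_order", {"customer": "Anna", "item": "book"})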

Final Thoughts

In this part, we have looked at terms that focus primarily on the storage, management & processing of data. These terms lay the foundation for understanding modern data systems.

In part 2, we shift the focus to AI-driven concepts and explore some key terms such as Gen AI, agent-based AI and human-in-the-loop augmentation.

Own visualization – Illustrations from unDraw.co

All information in this article is based on the current status in January 2025.

The post The Concepts Data Professionals Should Know in 2025: Part 1 appeared first on Towards Data Science.

What is MicroPython? Do I Need to Know it as a Data Scientist? https://towardsdatascience.com/what-is-micropython-do-i-need-to-know-it-as-a-data-scientist-5b4567a21b99/ Sun, 12 Jan 2025 16:32:01 +0000 https://towardsdatascience.com/what-is-micropython-do-i-need-to-know-it-as-a-data-scientist-5b4567a21b99/ In this year's edition of the Stack Overflow survey, MicroPython appears in the Most Popular Technologies list with 1.6% - but why?

The post What is MicroPython? Do I Need to Know it as a Data Scientist? appeared first on Towards Data Science.

When I saw MicroPython on the list of the Stack Overflow survey from this year, I wanted to know what I could use this language for. And I wondered if it could serve as a bridge between hardware and software. In this article, I break down what MicroPython is and what data scientists should know about it.

Table of Contents
1 – What is MicroPython and why is it special?
2 – Why should I know MicroPython as a data scientist?
3 – What is the difference to Python and other programming languages?
4 – What does this look like in practice? (Only with web based simulator)
5 – Final Thoughts & Where to continue learning?

Image from Stack Overflow

What is MicroPython and why is it special?

MicroPython is a simplified, compact version of Python 3 designed specifically for use on microcontrollers and other low-resource embedded systems. As we can read on the official website, the language offers a reduced standard library and special modules to interact directly with hardware components such as GPIO pins, sensors or LEDs.

Reference: Official MicroPython Website

Let’s break this definition down:

  • Simplified, compact Python: MicroPython is designed to use less memory and computing power than the standard Python version. The language is perfect for devices with just a few kilobytes of RAM.
  • Microcontrollers & embedded systems: Think of a microcontroller as a tiny computer on a chip. It can control devices such as IoT sensors, smart home devices and robots.
  • Low-resource systems: This means that these systems have little memory (often less than 1 MB) and limited computing power.
  • GPIO Pins: These are pins on a microcontroller that can be used for various input and output functions. For example, they can be used to control LEDs or read sensor data.

Why is MicroPython relevant?

If you know Python, you can program hardware with MicroPython – without learning a new complex language like C++ or assembly. Sure, you have more options with C++ and assembly and both are closer to machine languages. But if you want to create a prototype with relatively little effort, MicroPython offers you an ideal starting point.

Why should I know MicroPython as a data scientist?

Simply put: Because it is listed in the Stack Overflow survey and gaining traction in the developer community…

IoT and edge computing are playing an increasingly important role in AI and Data Science projects. Especially as we want to make our cities smarter (smart cities).

MicroPython can serve as a bridge between hardware and software here, as it makes it possible to collect sensor data and process it in data science pipelines or machine learning models. For example, a MicroPython sensor can measure air quality and send the data to a Machine Learning pipeline. MicroPython can also run simple AI models directly on devices (edge computing) – this makes it ideal for local computing without the device being dependent on the cloud.

So my conclusion: MicroPython makes hardware more accessible for data scientists. If you know Python, you can also use MicroPython and apply it in a smart home project.

What is the difference to Python and other programming languages?

While Python was developed for general software applications that run on powerful devices such as PCs or servers, MicroPython was developed for low-resource devices such as microcontrollers, which often only have a few kilobytes of memory and computing power.

As we all know, Python offers an extensive library for data analysis (pandas, numpy), machine learning (scikit-learn, tensorflow) or web development. MicroPython, on the other hand, only contains a reduced standard library and slimmed-down modules such as ‘math’ or ‘os’. Instead, it offers special hardware modules such as ‘utime’ for timers or ‘machine’ for controlling microcontroller pins.

While Python is better suited for data-intensive tasks, MicroPython enables direct access to hardware components and is therefore ideal for embedded systems (e.g. everyday electronics such as microwaves & smart TVs or medical devices such as blood pressure monitors) and IoT projects.

What does this look like in practice? Application areas and a quick simulator demo

In which areas is MicroPython used?

  • Internet of Things (IoT): MicroPython can be used to control smart home devices or control sensor data for dashboards.
  • Edge computing: You can run machine learning models directly on edge devices (e.g. IoT sensors, smartphones, routers, intelligent cameras, smart home devices, etc.).
  • Prototyping: With relatively little effort, you can quickly set up a prototype for a hardware project – especially if you know Python.
  • Robotics: MicroPython can be used to control motors or sensors in robotics projects.

Flashing LED in the simulator as a practical example

Since as a data scientist or software specialist you probably don’t want to buy hardware just to try out MicroPython, I explored a MicroPython simulator that is available online. This is a simple and beginner-friendly way to get started with hardware programming concepts without the need for physical devices:

  1. Open https://micropython.org/unicorn/
  2. Import time, then define the function and call the function at the end. Type in each code snippet in the web terminal separately and then click ‘Enter’. You can use the following code for this:
#Provides functions to work with time
#(standard Python library instead of 'utime', as the code is used in the simulator)
import time
# Simulate an LED by defining a function
def blink_led():
    for _ in range(5):
        print("LED is now: ON")
        time.sleep(0.5)  # Waits for 0.5 seconds
        print("LED is now: OFF")
        time.sleep(0.5)
# Start the blinking by calling the function
blink_led()

Now we see that the LED (only in the console output) switches back and forth between ON and OFF. In this simulator example, I only used the time library for the delay. To run the example on real hardware, you should use additional libraries such as ‘machine’ or ‘utime’.
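
For reference, a hardware version of the same example could look roughly like the sketch below. It assumes a board whose on-board LED is connected to GPIO 25 (e.g. a Raspberry Pi Pico); check the pinout of your board and adjust the pin number.

# MicroPython version for real hardware (not the web simulator)
# Assumption: the on-board LED sits on GPIO 25, as on a Raspberry Pi Pico
from machine import Pin
import utime

led = Pin(25, Pin.OUT)

def blink_led():
    for _ in range(5):
        led.value(1)   # LED on
        utime.sleep(0.5)
        led.value(0)   # LED off
        utime.sleep(0.5)

blink_led()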

In a web simulator, we can write a ‘hello world’ by writing a small script to output a flashing LED.

Final Thoughts

MicroPython is certainly important for people working in hardware projects such as IoT and edge computing. But due to its easy accessibility and because anyone who knows Python can also use MicroPython, the language bridges a gap between data science, AI and hardware technology. It is certainly good to at least know the purpose of MicroPython and the differences to Python. If you are interested in trying out smart home devices or IoT for yourself, it is certainly an accessible entry point.

Where to continue learning?

Own visualization – Illustrations from unDraw.co

The post What is MicroPython? Do I Need to Know it as a Data Scientist? appeared first on Towards Data Science.

5 Simple Projects to Start Today: A Learning Roadmap for Data Engineering https://towardsdatascience.com/5-simple-projects-to-start-today-a-learning-roadmap-for-data-engineering-940ecbad6b5f/ Thu, 02 Jan 2025 11:31:37 +0000 https://towardsdatascience.com/5-simple-projects-to-start-today-a-learning-roadmap-for-data-engineering-940ecbad6b5f/ Start with 5 practical projects to lay the foundation for your data engineering roadmap.

The post 5 Simple Projects to Start Today: A Learning Roadmap for Data Engineering appeared first on Towards Data Science.

Start with 5 practical projects to lay the foundation for your data engineering roadmap

Tutorials help you to understand the basics. You will definitely learn something. However, the real learning effect comes when you directly implement small projects. And thus combine theory with practice.

You will benefit even more if you explain what you have learned to someone else. You can also use ChatGPT as a learning partner or tutor – explain in your own words what you have learned and get feedback. Use one of the prompts that I have attached after the roadmap.

In this article, I present a roadmap for 4 months to learn the most important concepts in data engineering for beginners. You start with the basics and increase the level of difficulty to tackle more complex topics. The only requirements are that you have some Python programming skills, basic knowledge of data manipulation (e.g. simple SQL queries) and motivation 🚀

Why only 4 months?

It is much easier for us to commit to a goal over a shorter period of time. We stay more focused and motivated. Open your favorite app right away and start a project based on the examples. Or set a calendar entry to make time for the implementation.

5 projects for your 4-month roadmap

As a data engineer, you ensure that the right data is collected, stored and prepared in such a way that it is accessible and usable for data scientists and analysts.

You are, so to speak, the kitchen manager who organizes the kitchen and ensures that all ingredients are fresh and ready to hand. The data scientist is the head chef who combines them into creative dishes.

Month 1 – Programming and SQL

Deepen your knowledge of Python basics. CSV and JSON files are common formats for data exchange, so learn how to read and edit them. Understand how to manipulate data with the Python libraries Pandas and NumPy.

A small project to start in month 1: Clean a CSV file with unstructured data, prepare it for data analysis and save it in a clean format (a full sketch follows after the steps). Use Pandas for data manipulation and basic Python functions for editing.

  1. Read the file with ‘pd.read_csv()’ and get an overview with ‘df.head()’ and ‘df.info()’.
  2. Remove duplicates with ‘df.drop_duplicates()’ and fill in missing values with the average using ‘df.fillna(df.mean())’. Optional: Research what options are available to handle missing values.

  3. Create a new column with ‘df[‘new_column’]’, which, for example, fills all rows above a certain value with a ‘True’ and all others with a ‘False’.
  4. Save the cleansed data with ‘df.to_csv(‘new_name.csv’, index=False)’ in a new CSV file.
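
Here is a minimal sketch of how these steps could fit together. The file name, the ‘price’ column and the threshold of 100 are assumptions – replace them with the names and values of your own dataset.

import pandas as pd

# Assumed file and column names - adjust them to your dataset
df = pd.read_csv("raw_data.csv")
print(df.head())
print(df.info())

# Remove duplicates and fill missing numeric values with the column average
df = df.drop_duplicates()
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

# New flag column: True if the (assumed) 'price' column exceeds a threshold
df["is_expensive"] = df["price"] > 100

df.to_csv("clean_data.csv", index=False)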

What problem does this project solve? Data quality is key. Unfortunately, the data you receive in the business world is often far from clean.

Tools & Languages: Python (Pandas & NumPy library), Jupyter Lab

Understanding SQL: SQL allows you to query and organize data efficiently. Understand how to use the most important commands such as CREATE TABLE, ALTER TABLE, DROP TABLE, SELECT, WHERE, ORDER BY, GROUP BY, HAVING, COUNT, SUM, AVG, MAX & MIN, JOIN.

A small project to deepen your knowledge in month 1: Create a relational data model that maps real business processes. Is there a medium-sized bookstore in your city? That is certainly a good scenario to start with.

  1. Think about what data the bookshop manages. For example, books with the data title, author, ISBN (unique identification number), customers with the data name, e-mail, etc.
  2. Now draw a diagram that shows the relationships between the data. A bookstore stocks several books, which can be written by several authors, and customers buy these books. Think about how this data is connected.
  3. Next, write down which tables you need and which columns each table has. For example, the columns ISBN, title, author and price for the book table. Do this step for all the data you identified in step 1.
  4. Optional: Create the tables with ‘CREATE TABLE nametable ();’ in a SQLite database. You can create a table with the following code.
-- Creating a table with the name of the columns and their data types
CREATE TABLE Books (
    BookID INT PRIMARY KEY,
    Title VARCHAR(100),
    Author VARCHAR(50),
    Price DECIMAL(10, 2)
);

What problem does this project solve? With a well thought-out data model, a company can efficiently set up important processes such as tracking customer purchases or managing inventory.

Tools & languages: SQL, SQLite, MySQL or PostgreSQL

Month 2 – Databases and ETL pipelines

Mastering relational and NoSQL databases: Understand the concepts of tables, relationships, normalization and queries with SQL. Understand what CRUD operations (Create, Read, Update, Delete) are. Learn how to store, organize and query data efficiently, and understand the advantages of NoSQL over relational databases.

Tools and languages: SQLite, MySQL, PostgreSQL for relational databases; MongoDB or Apache Cassandra for NoSQL databases

Understand the ETL basics: Learn how to extract data from CSV, JSON or XML files and from APIs, and how to load cleansed data into a relational database.

A small project for month 2: Create a pipeline that extracts data from a CSV file, transforms it and loads it into a SQLite database. Implement a simple ETL logic.

  1. Load a CSV file with ‘pd.read_csv()’ and get an overview of the data again. Again, remove missing values and duplicates (see project 1). You can find publicly accessible datasets on Kaggle. For example, search for a dataset with products.
  2. Create a SQLite database and define a table according to the data from the CSV. Below you can see an example code for this. SQLite is easier to get started with, as the SQLite library is available in Python by default (module sqlite3).
  3. Load the cleaned data from the DataFrame into the SQLite database with ‘df.to_sql(‘tablename’, conn, if_exists=’replace’, index=False)’.
  4. Now execute a simple SQL query, e.g. with SELECT and ORDER BY. Limit the results to 5 rows and close the connection to the database at the end (a possible continuation for steps 3 and 4 is sketched after the code below).
import sqlite3

# Create the connection to the SQLite-DB
conn = sqlite3.connect('produkte.db')

# Create the table
conn.execute('''
CREATE TABLE IF NOT EXISTS Produkte (
    ProduktID INTEGER PRIMARY KEY,
    Name TEXT,
    Kategorie TEXT,
    Preis REAL
)
''')
print("Tabelle erstellt.")

Tools and languages: Python (SQLAlchemy library), SQL

Month 3 – Workflow orchestration and cloud storage

Workflow orchestration: This means that you automate and coordinate processes (tasks) in a specific order. Learn how to plan and execute simple workflows. You will also gain a basic understanding of the DAG (Directed Acyclic Graph) concept. A DAG is the basic structure in Apache Airflow and describes which tasks are executed in a workflow and in which order.

Tools and languages: Apache Airflow

Cloud storage: Learn how to store data in the cloud. Know at least the names of the major products from the biggest cloud providers, such as S3, EC2 and Redshift from AWS; BigQuery, Dataflow and Cloud Storage from Google Cloud; and Azure Blob Storage, Synapse Analytics and Azure Data Factory from Azure. The many different products can be overwhelming – start with something you enjoy.

A small project for month 3: Create a simple workflow orchestration concept with Python (without Apache Airflow, as this lowers the barrier to getting started) that sends you automated reminders during your daily routine:

  1. Plan the workflow: Define tasks such as reminders to "Drink water", "Exercise for 3 minutes" or "Get some fresh air".
  2. Create a sequence of the tasks (DAG): Decide the order in which the tasks should be executed. Define if they are dependent on each other. For example, Task A ("Drink water") runs first, followed by Task B ("Exercise for 3 minutes"), and so on.
  3. Implement the tasks in Python: Write a Python function for each reminder (see the first part of the code below as an example).
  4. Link the tasks: Arrange the functions so that they execute sequentially in a main block (see the second part of the code below as an example).
import time

# Task 1: Send a reminder
def send_reminder():
    print("Reminder: Drink water!")  # Print a reminder message
    time.sleep(1)  # Pause for 1 second before proceeding to the next task

if __name__ == "__main__":
    print("Start Workflow...")  # Indicate the workflow has started

    # Execute tasks in sequence
    send_reminder()  # Task 1: Send a reminder to drink water

    # Additional tasks (uncomment and define these functions if needed)
    # reminder_exercise()  # Example: Send the second reminder
    # create_task_list()   # Advanced example: Create a daily task list

    print("Workflow is done!")  # Indicate the workflow has completed

Too easy? Install Apache Airflow and create your first DAG that performs the task of printing out "Hello World" or load your transformed data into an S3 bucket and analyze it locally.
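
A ‘Hello World’ DAG could look roughly like the sketch below. It assumes Airflow 2.x is installed; the dag_id and schedule are arbitrary, and newer Airflow versions use the ‘schedule’ parameter instead of ‘schedule_interval’.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def say_hello():
    print("Hello World")

with DAG(
    dag_id="hello_world",           # arbitrary name for the DAG
    start_date=datetime(2025, 1, 1),
    schedule_interval="0 8 * * *",  # every morning at 8:00
    catchup=False,
) as dag:
    hello_task = PythonOperator(
        task_id="say_hello",
        python_callable=say_hello,
    )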

Tools and languages: AWS, Google Cloud, Azure

Implement the 5 projects to learn twice as much as if you only look at the theory.

Month 4 – Introduction to Big Data and Visualization

Big data basics: Understand the basics of Hadoop and Apache Spark. Below you can find a great, super-short video from Simplilearn to introduce you to Hadoop and Apache Spark.

Tools and languages: Hadoop, Apache Spark, PySpark (Python API for Apache Spark), Python

Data visualization: Understand the basics of data visualization.

A small project for month 4: To apply the concepts without needing big data tools like Apache Spark or Hadoop, download a dataset from Kaggle, analyze it with Python and visualize the results (a full sketch follows after the steps):

  1. Download a publicly available, medium-sized dataset from Kaggle (e.g. weather data), read in the dataset with Pandas and get an overview of your data.
  2. Perform a small exploratory data analysis (EDA).
  3. Create e.g. a line chart of average temperatures or a bar chart of rain and sun days per month.
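
A minimal sketch of these steps could look like this. The file name and the ‘date’ and ‘temperature’ columns are assumptions – adjust them to the Kaggle dataset you pick.

import pandas as pd
import matplotlib.pyplot as plt

# Assumed file and column names - adjust them to your dataset
df = pd.read_csv("weather.csv", parse_dates=["date"])
print(df.describe())  # quick EDA overview

# Average temperature per month as a line chart
monthly = df.groupby(df["date"].dt.month)["temperature"].mean()
monthly.plot(kind="line", marker="o")
plt.xlabel("Month")
plt.ylabel("Average temperature")
plt.title("Average temperature per month")
plt.show()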

Tools and languages: Python (Matplotlib & Seaborn libraries)

2 prompts to use ChatGPT as a learning partner or tutor

When I learn something new, the two prompts help me to reproduce what I have learned and use ChatGPT to check whether I have understood it. Try it out and see if it helps you too.

  1. I have just learned about the [topic / project] and want to make sure I have understood it correctly. Here is my explanation: [your explanation]. Give me feedback on my explanation. Add anything that is missing or that I have not explained clearly.
  2. I would like to understand the topic [topic/project] better. Here is what I have learned so far: [your explanation]. Are there any mistakes, gaps or tips on how I can explain this even better? Optional: How could I expand the project? What could I learn next?

What comes next?

  • Deepen the concepts from months 1–4.
  • Learn complex SQL queries such as subqueries and database optimization techniques.
  • Understand the principles of data warehouses, data lakes and data lakehouses. Look at tools such as Snowflake, Amazon Redshift, Google BigQuery or Salesforce Data Cloud.
  • Learn CI/CD practices for data engineers.
  • Learn how to prepare data pipelines for machine learning models.
  • Deepen your knowledge of cloud platforms – especially in the area of serverless computing (e.g. AWS Lambda).
Own visualization – Illustrations from unDraw.co

Final Thoughts

Companies and individuals are generating more and more data – and the growth continues to accelerate. One reason for this is that we have more and more data from sources such as IoT devices, social media and customer interactions. At the same time, data forms the basis for machine learning models, the importance of which will presumably continue to increase in everyday life. The use of cloud services such as AWS, Google Cloud or Azure is also becoming more widespread. Without well-designed data pipelines and scalable infrastructures, this data can neither be processed efficiently nor used effectively. In addition, in areas such as e-commerce or financial technology, it is becoming increasingly important that we can process data in real-time.

As data engineers, we create the infrastructure so that the data is available for machine learning models and real-time streaming (zero ETL). With the points from this roadmap, you can develop the foundations.

Where can you continue learning?

The post 5 Simple Projects to Start Today: A Learning Roadmap for Data Engineering appeared first on Towards Data Science.

Master Bots Before Starting with AI Agents: Simple Steps to Create a Mastodon Bot with Python https://towardsdatascience.com/master-bots-before-starting-with-ai-agents-simple-steps-to-create-a-mastodon-bot-with-python-cce4f9ed24ee/ Fri, 27 Dec 2024 14:01:48 +0000 https://towardsdatascience.com/master-bots-before-starting-with-ai-agents-simple-steps-to-create-a-mastodon-bot-with-python-cce4f9ed24ee/ I recently published a post on Mastodon that was shared by six other accounts within two minutes. Curious, I visited the profiles and...

The post Master Bots Before Starting with AI Agents: Simple Steps to Create a Mastodon Bot with Python appeared first on Towards Data Science.

I recently published a post on Mastodon that was shared by six other accounts within two minutes. Curious, I visited the profiles and discovered that at least one of them was a tech bot – accounts that automatically share posts based on tags such as #datascience or #opensource.

Mastodon is currently growing rapidly as a decentralized alternative to X (formerly Twitter). How can bots on a platform like this make our everyday lives easier? And what are the risks? Do bots enrich or disrupt social networks? How do I have to use the Mastodon API to create a bot myself?

In this article, I will not only show you how bots work in general but also give you a step-by-step guide with code examples and screenshots on how to create a Mastodon bot with Python and use the API.

Table of Contents
1 – Why do Mastodon and tech bots exist?
2 – Technical basics for a bot on a social network
3 – Bots: The balancing act between benefit and risk
4 – How to create a Mastodon bot: Step-by-step instructions with Python
Final Thoughts

1 – Why do Mastodon and tech bots exist?

Mastodon is a decentralized social network developed by Eugen Rochko in Germany in 2016. The platform is open-source and is based on a network of servers that together form the so-called ‘Fediverse’. If you want to share posts, you select a server such as mastodon.social or techhub.social and share your posts on this server. Medium also has its own server at me.dm. Each server sets its own rules and moderation guidelines.

Bots are basically software applications that perform tasks automatically. For example, there are simple bots such as crawler bots that search the internet and index websites. Other bots can do repetitive tasks for you, such as sending notifications or processing large amounts of data (Automation bots). Social media bots go one step further by sharing posts or reacting to content and thus interacting with the platforms. For example, a bot can collect and share the latest news from the technology industry so that followers of this bot profile are always up to date – the bot becomes a curator that curates according to precisely defined algorithms…

Chatbots are also a specific type of bot that are used for customer support, for example. They were developed primarily for dialog with us humans and focus much more on natural language processing (NLP) in order to understand our language and respond to it as meaningfully as possible. Agents, which are currently a hot topic of discussion, are in turn a further development of bots and chatbots: agents can generally take on more complex tasks, learn from data and make decisions independently.

Fun fact: ELIZA, a chatbot developed at MIT in the mid-1960s, was already able to simulate simple conversations. Around 60 years later, we have arrived in the world of agents…

Reference: ELIZA-Chatbot

However, bots can also spread disinformation by automatically disseminating false or misleading information on social networks to manipulate public opinion. Such troll bots are repeatedly observed in political elections or crisis situations, for example. Unfortunately, they are also sometimes used for spam messages, data scraping, DDOS cyberattacks or automated ticket sales. It is therefore important that we handle automated bots responsibly.

2 – Technical basics for a bot on a social network

In simple terms, you need these three ingredients for a bot:

  1. Programming language: Typical programming languages are Python or JavaScript with Node.js. But you can also use languages such as Ruby or PHP.
  2. API access: Your bot sends a request to the application programming interface (API) of a social network and receives a response back.
  3. Hosting: Your bot must be hosted on a service such as Heroku, AWS, Replit or Google Cloud. Alternatively, you can run it locally, but this is more suitable for testing.

Programming language: Popular languages for a bot are Python or JavaScript – depending on the requirements and target platform. Python offers many helpful libraries such as Tweepy for Twitter (now limited in use due to the changes at Twitter-X), Mastodon.py for the Mastodon API or the Python Reddit API Wrapper (PRAW) to manage posts and comments on Reddit. Node.js is particularly suitable if your bot requires real-time communication, server-side requests or integration with multiple APIs; there are libraries such as mastodon-api or Botpress that support multiple channels. For bots on Facebook and Instagram, on the other hand, you need to use the Facebook Graph API, which has much stronger restrictions. And for LinkedIn, you can use the LinkedIn REST API, which is designed more for company pages.

API: Most modern APIs for social networks are based on the REST architecture. This API architecture uses HTTP methods such as GET (to retrieve data), POST (to send data), PUT (to update data) or DELETE (to delete data). For many platforms, you need a secure method such as OAuth2 for your bot to access the API: for this, you first register your bot with the platform to receive a client ID and a client secret. These credentials are used to request an access token, which is then sent with every request to the API.

Hosting: Once your bot is ready, you need an environment in which your bot can run. You can run it locally for test purposes or prototypes. For longer-term solutions, there are cloud hosting platforms such as AWS, Google Cloud or Heroku. To ensure that your bot also works reliably regardless of the server environment, you can use Docker, which packages your bot together with all the necessary settings, libraries and dependencies in a standardized "package" that can be started on any server.

In addition, you can automate your bot with cron jobs by running your bot at certain times (e.g. every morning at 8.00 a.m.) or when certain events occur (e.g. a post with a certain hashtag was shared).
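
If you prefer to stay in Python rather than set up a system cron job, the third-party ‘schedule’ package (installed with pip install schedule) offers a simple alternative. The sketch below is not tied to the Mastodon code later in this article; the run time and the bot function are placeholders.

import time

import schedule  # third-party package: pip install schedule

def run_bot():
    # Placeholder: call your actual bot logic here
    print("Bot run triggered.")

# Run the bot every morning at 08:00
schedule.every().day.at("08:00").do(run_bot)

while True:
    schedule.run_pending()
    time.sleep(60)  # check once per minute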

Own visualization – Illustrations from unDraw.co

3 – Bots: The balancing act between benefit and risk

There are big differences in quality between bots – while a well-programmed bot responds efficiently to requests and delivers added value, a poorly designed bot can be unreliable or even disruptive. As described at the beginning, a bot is a software application that performs automated tasks: The quality of the bot depends on how the underlying algorithms are programmed, what data the bot has been fed with in the case of AI bots and how the design and interactions are structured.

So how do we create ethically responsible bots?

  1. Transparency: Users need to know that they are interacting with a bot and not a human. Bots that disguise this only destroy trust in the technology. For example, Mastodon has a rule that bots’ profiles must be clearly labeled. It is also possible for the bot to add a small note to every interaction or post that makes it clear that the interaction originates from a bot.
  2. No manipulation: Bots must not be used to spread disinformation or manipulate users in a targeted manner.
  3. Respect for the platform and people: Bots must follow the rules of the respective platform.
  4. Data protection must be respected: For example, if bots analyze user profiles, it must be ensured that the bot does not store data that it should not or it must be defined who has access to this data and how it is used in order to comply with data protection laws such as the GDPR in Europe.

Are bots good or bad? Do bots disrupt social networks or enrich them?

In my opinion, technology that automates repetitive tasks is always valuable. On the one hand, well-developed bots can provide us with valuable information, stimulate discussions or act as support for curators. On the other hand, bots can spread spam, be discriminatory or dominate discussions. In my opinion, such technologies are most useful when they are used as support.

Let’s imagine for a moment a social platform that consists only of trained bots carrying out the discussions among themselves – in my opinion, that would be a pretty boring platform, because the humanity is missing. The interactions would have a "bland aftertaste". With automation in general, I often feel that although technology performs the task more "perfectly", the creativity and love are missing compared to when the task is carried out by a human who works carefully and with attention to detail. The human touch, the unforeseen, is missing.

4 – How to create a Mastodon bot: Step-by-step instructions with Python

We want to create a bot that regularly searches Mastodon posts with the hashtag #datascience and automatically reposts these posts.

Everything you need to get started

  • Python must be installed on your device. Tip for newbies: On Windows, you can use ‘python --version’ in PowerShell to check if you already have Python installed.

  • You need an IDE, such as Visual Studio Code, to create the Python files.
  • Optional: If you are working with the Anaconda distribution, it is best to create a new environment with ‘conda create --name NameEnvironment python=3.9 -y’ and install the libraries in this environment so that there are no dependency conflicts between projects. Tips for newbies: You can then activate the environment with ‘conda activate NameEnvironment’. The -y means that all confirmations are automatically accepted during the installation.

1) Install the Mastodon.py library

First we install Mastodon.py with pip:

pip install Mastodon.py

Tips for newbies: With ‘pip --version’ you can check if pip is installed. If no version is displayed, you can install pip with ‘conda install pip’.

2) Register the app for the bot on techhub.social

If you don’t have an account on techhub.social yet, register. Techhub.social describes itself as a Mastodon instance for passionate technologists and states in the rules that bots must be marked as Bot in their profile.

We now register our app for our bot using the ‘Mastodon.create_app()’ function. To do this, we create a Python file with the name ‘register_app.py’ and insert this code: In this code, we register the bot with Mastodon to gain API access and save the necessary access data. First, we create the app with ‘Mastodon.create_app()’. We save the client credentials in the file ‘pytooter_clientcred.secret’. Then we log in to Mastodon to generate the user credentials. We save these in another file ‘pytooter_usercred.secret’. We add the error handling to catch problems such as incorrect login data.

from mastodon import Mastodon, MastodonIllegalArgumentError, MastodonUnauthorizedError

try:
    # Step 1: Creating the app and saving the client-credentials
    Mastodon.create_app(
        'pyAppName',  # Name of your app
        api_base_url='https://techhub.social',  # URL to the Mastodon instance
        to_file='pytooter_clientcred.secret'  # File to store app credentials
    )
    print("App registered. Client-Credentials are saved.")

    # Step 2: Login & Saving of the User-Credentials
    print("Log in the user...")
    mastodon = Mastodon(
        client_id='pytooter_clientcred.secret',
        api_base_url='https://techhub.social'
    )

    mastodon.log_in(
        'useremail@example.com',  # Your Mastodon-Account-Email
        'YourPassword',  # Your Mastodon-Password
        to_file='pytooter_usercred.secret'  # File to store user credentials
    )
    print("Login successful. User-Credentials saved in 'pytooter_usercred.secret'.")

except MastodonUnauthorizedError as e:
    print("Login failed: Invalid email or password.")
except MastodonIllegalArgumentError as e:
    print("Login failed: Check the client credentials or base URL.")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

Then we enter this command in the Anaconda prompt to execute the script:

python register_app.py

If everything worked successfully, you will find the file ‘pytooter_clientcred.secret’ in your directory, which contains the app-specific credentials for our app that were generated when the app was registered. In addition, there should be the file ‘pytooter_usercred.secret’, which contains the user-specific access data. This information was generated after the successful login.

You will see the following output in the terminal:

Tips for newbies: ‘Tooting’ is the Mastodon term for publishing a post (like tweeting on Twitter). The two secret files contain sensitive information. It is important that you do not share them publicly (e.g. do not add them to your GitHub repository). If you want to use 2FA, you must use the OAuth2 flow instead. If you open your Mastodon account in the desktop application, you can check this setting under Settings > Account > Two-Factor-Authentication.

3) Publish test post via API

Once the registration and login has worked successfully, we create an additional file ‘test_bot.py’ and use the following code. First we load the user credentials from ‘pytooter_usercred.secret’ and connect to the Mastodon API. With ‘mastodon.toot()’ we specify the content we want to publish. We display a confirmation in the terminal that the toot has been sent successfully.

from mastodon import Mastodon

mastodon = Mastodon(
    access_token='pytooter_usercred.secret',
    api_base_url='https://techhub.social'
)

mastodon.toot('Hello from my Mastodon Bot! #datascience')
print("Toot gesendet!")

We save the file in the same directory as the previous files. Then we run the file in the terminal with this command:

python test_bot.py

On Mastodon we see that the post has been successfully tooted:

4) Reblog posts with a specific hashtag

Now we want to implement that the bot searches for posts with hashtag #datascience and re-shares them.

In a first step, we create a new file ‘reblog_bot.py’ with the following code: Using the ‘reblog_datascience()’ function, we first connect to the Mastodon API by loading the user credentials from ‘pytooter_usercred.secret’. Then the bot uses ‘timeline_hashtag()’ to retrieve the last 3 posts with the hashtag #datascience. With ‘status_reblog()’ we automatically share each post and display the ID of the shared post in the terminal.

To avoid overloading, the API allows up to 300 requests per account within 5 minutes. With ‘limit=3’ we specify that only 3 posts are reblogged at a time – so this is not a problem.

from mastodon import Mastodon

def reblog_datascience():
    mastodon = Mastodon(
        access_token='pytooter_usercred.secret',
        api_base_url='https://techhub.social'
    )
    # Retrieve posts with the hashtag #datascience
    posts = mastodon.timeline_hashtag('datascience', limit=3)
    for post in posts:
        # Reblogging posts
        mastodon.status_reblog(post['id'])
        print(f"Reblogged post ID: {post['id']}")

# Run the function
reblog_datascience()

As soon as you run the file, 3 posts will be reblogged in your profile and you will see the IDs of the 3 posts in the terminal:

3 posts containing the hashtag #datascience will be reposted.
In the terminal we see the IDs of the 3 posts that were reposted.

Note: I have removed the posts from my Mastodon account afterwards as my profile is not labeled as a bot.

Final Thoughts

We could extend the bot even further, for example by adding functions so that duplicate posts are not reblogged or that error messages (e.g. due to missing authorizations) are caught and logged. We could also host the bot on a platform such as AWS, Google Cloud or Heroku instead of running it locally on our computer. For automated execution, it would also make sense to set up a scheduler. On Windows, for example, this can be tried out with the Task Scheduler. This will run the bot regularly (e.g. every morning at 8.00 a.m.), even if the terminal is closed. On Linux or Mac, we could use alternatives such as cron jobs.
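
To illustrate the first extension, one way the duplicate check could look is sketched below: it stores the IDs of already reblogged posts in a local text file. The file name and function name are arbitrary; the Mastodon calls are the same ones used above.

from mastodon import Mastodon

SEEN_FILE = "reblogged_ids.txt"  # arbitrary file for already shared post IDs

def load_seen_ids():
    try:
        with open(SEEN_FILE) as f:
            return set(line.strip() for line in f)
    except FileNotFoundError:
        return set()

def reblog_datascience_once():
    mastodon = Mastodon(
        access_token='pytooter_usercred.secret',
        api_base_url='https://techhub.social'
    )
    seen = load_seen_ids()
    for post in mastodon.timeline_hashtag('datascience', limit=3):
        post_id = str(post['id'])
        if post_id in seen:
            continue  # skip posts that were already reblogged
        mastodon.status_reblog(post['id'])
        with open(SEEN_FILE, "a") as f:
            f.write(post_id + "\n")
        print(f"Reblogged post ID: {post_id}")

reblog_datascience_once()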

Like practically any technology, bots can offer great benefits if we use them in a considered, ethical and data protection-compliant manner. However, they can also disrupt social platforms if we misuse them.

Where can you continue learning?

The post Master Bots Before Starting with AI Agents: Simple Steps to Create a Mastodon Bot with Python appeared first on Towards Data Science.
