My Learning to Be Hired Again After a Year… Part 2
https://towardsdatascience.com/my-learning-to-be-hired-again-after-a-year-part-2/ (Mon, 31 Mar 2025)
One year later: what I learned still matters


This is the second part of My learning to being hired again after a year… Part I.

Hard to believe, but it’s been a full year since I published the first part on TDS. And in that time, something beautiful happened. Every so often, someone would leave a comment, highlight a sentence, or send me a message. Most were simple notes like, “Thank you, Amy. Your post helped me.” But those words lit me up. They brightened entire days. They reminded me that I was never truly alone, not during those long months of unemployment, not in the struggle of figuring out who I was without a job title or company name beneath my email signature or LinkedIn profile.

Funny enough, those hard days turned out to be some of the most meaningful ones I’ve had. Maybe even more meaningful than my busiest days at work. Because in losing an identity, I found new ones. I didn’t need a job or a title to feel connected. To many of you, I’m just a pretty lazy writer getting back into the groove. And here I am — returning to my writing routine. So, thank you to everyone who reached out. Your messages rank second on my list of happiest things people give me. The first? That’s easy. My daughter Ellie’s three S’s: her smell, her smile, and her surprises.

Enough talk. Let’s get into Part 2. I’ll pick up where I left off — sharing the lessons that helped me get hired again. This time, I’ll also reflect on how those lessons show up in my work and life today. And for those of you curious about the methods from the book Never Search Alone, I’ve got some thoughts on that too. What worked, what didn’t, and how I made it my own.

Knock, Knock: Opportunity’s at the Door — You Won’t Lose a Penny for Trying

A year into working as a Machine Learning Engineer, I can say this was my biggest life lesson.

Here’s the backstory. I’d been working as a data scientist ever since I finished grad school. Over the past 7 years, I’ve built multiple machine learning models: linear regression, neural networks, XGBoost. All solid stuff. But when it came to designing an entire machine learning system from start to finish? That was a different story. I hadn’t really done that. I knew how to develop models, sure. I even had some experience deploying them, but only parts of the process. If you asked me to design, build, and run an entire system end-to-end, I couldn’t say I had that experience.

And the job market? It was changing fast. Companies didn’t want someone who could just build models anymore. Generative AI was handling a lot of the data analysis now. What they really wanted was someone who could take machine learning and use it to solve real business problems, someone who could own the whole process. Meanwhile, I had just been laid off. I had time. So I decided maybe this was the right moment to pivot. Maybe it was time to go for machine learning engineering.

The first thing I did was reach out to people who had already made that move. Two friends said yes. One had gone from data scientist to machine learning engineer. The other was a data scientist, and her husband worked as an MLE at Apple. We ended up having this long phone call for two hours, maybe more. They were kind. And they didn’t sugarcoat anything. Both of them told me it was tough to make the switch. Not impossible, but tough. If you didn’t have MLOps experience or a solid GitHub portfolio to show off, landing a senior MLE job would be really hard. Especially with how competitive things were getting.

That conversation hit hard. I remember feeling my heart pound, like cold water had been poured over my head. I had two options: I could keep chasing data scientist jobs — applied scientist roles at places like Amazon — but there weren’t many out there. Or swallow my pride, let go of seven years of experience as a data scientist and go for an entry-level MLE role. Honestly, neither choice felt great.

It took me two weeks to work through it. Two long long weeks. But in the end, I made up my mind: I’d try for machine learning engineer jobs at least, even if I had to start from the bottom. I got back to my routine and prepped for interviews. During those hard days, I started blogging on Medium and published on TDS to show my technical muscle, sharing my “Courage to Learn ML” series. Ready for a spoiler alert? I ended up with three offers for senior and even staff level machine learning engineering roles. And I had three other final-round interviews lined up that I had to walk away from, because there just wasn’t enough time or energy for me to do them all.

No, none of those offers came from FAANG companies. But I’m more than happy with where I landed. It was worth the try.

Even now, writing this, I can still feel that chill from when my friends told me the odds were slim. And I can still laugh at how panicked I was. Just the other day, I spoke with a friend who’s looking to move from data engineering into MLE. I told him the same thing I learned for myself: You can do it. And if you decide it’s worth trying, don’t get hung up on the odds. Even if it’s a 1% chance, why not see if you’re in that 1%? But if you don’t try at all, you’re 100% in the group that never made it.

For me, the takeaway is simple. Don’t be afraid of probabilities. Even 99.999999% is not 100%. If you’re worried about the outcome, stop thinking about the outcome. Just do it for fun, for your mental health, for the chance to live without regrets.

A Year Later: I use this lesson almost every day. I blog shamelessly, pretending I don’t care whether people actually read them. I make those awkward customer service calls, just to see if someone on the other end might actually help me. I even buy a lottery ticket now and then when the jackpot tops a billion dollars. Who knows? I might end up in that 0.0000…001%. And you know what? I recently won $12 on a ticket. So yes — it’s worth trying.

Learning During the Struggle: Don’t Beg for Jobs 

This was another hard lesson from my “to be an MLE or not to be” chapter.

When I spoke with those two friends, they made one thing clear. If I wanted to become a machine learning engineer, I needed hands-on experience with MLOps (machine learning operations). The problem? In my past roles, I’d either handed off my models to software engineers for deployment or handled just one small part of the system myself. I knew I had a gap. And my first instinct was to fill it by any means necessary. So I figured, why not get involved in some real projects? Something complex. Something I could proudly add to my resume.

Since I was out of work, I had time. I joined MLOps communities on Slack and Discord. I posted about my background and offered to work for free with any startup or team that needed help, just to get some experience in exchange. The response? Pretty discouraging. Hardly anyone replied. A few did, but they expected me to work 50+ hours a week… for free and without any concrete plan. I remember sending a message to a PhD student after reading his job posting. I told him how I liked his work and wanted to make his product a reality. He didn’t get back to me. Instead, he changed his posting to say he was seeking experienced MLEs or someone with a PhD. Ouch.

After a few weeks of all that, I was demotivated and burned out. I was pleading for opportunities, and it showed. It was then that I decided to join a Job Search Council (JSC) (I explained JSCs in detail in Part 1). We shared the emotional weight of job hunting every Friday. I slowly started letting go of the tension. And that’s when something clicked. I needed to stop pleading for jobs. Instead, I decided to sell what I had.

I rewrote my resume into two versions, one for data scientist roles and the other for MLE roles. I applied to MLE jobs like crazy just to increase my chances. But this time around, I approached it differently. I broke down what hiring managers were actually looking for in an MLE. I saw how all the model-building experience I had acquired had actually taught me about debugging, monitoring, and resolving messy business problems. While I didn’t have a lot of MLOps experience, I wasn’t coming from zero. I had a master’s degree in computer science, I was familiar with software development, and I knew data engineering.

In those MLE interviews, I started highlighting those skills. I explained how I applied machine learning to solve business problems, offered subtle hints about my favorite model-training tricks. I showed hiring managers I knew how it felt to run systems in production. I was honest about where I needed to gain more experience. But I made it clear this wasn’t a cold start.

At some point, I stopped acting like a job-beggar and became a salesperson. I wasn’t asking someone to “please hire me, I’m willing to work more for less.” I was selling something. When a company didn’t hire me, it wasn’t a rejection. It just meant they didn’t need someone like me. Maybe I needed to tighten the pitch next time.

This mental shift made all the difference. Negative feedback wasn’t personal anymore. It was just feedback, a little data point I could use to make adjustments. When you ask for something, people think less of you. But when you treat yourself as a product, you’re refining and searching for the right buyers. If there’s a flaw, you fix it. If there are good things, you point them out. And sooner or later, you find your people.

A Year Later: I don’t beg anymore. Not for jobs. Not for opportunities. I exchange. I sell. That mindset has become part of me now. It’s my inner tiny salesperson. 

Mock Interviews and the Interview Marathon: Practice Really Does Make a Difference

I’ll be straight with you. Before I started interviewing for machine learning engineer roles after my layoff, I had never really practiced behavioral interviews. Not once in my seven years of working. Sure, I wrote out a few stories using the STAR method, like everyone says you should. But I never practiced them out loud, and I definitely never got feedback. It was like stepping on stage to perform in a play without ever going to rehearsal. I never realized how big a mistake that was, probably because, back when the job market was good, I didn’t have to.

But after the layoff? After spending nearly a year at home because of pregnancy? The market was ice cold. There weren’t many chances, and I couldn’t afford to blow any of them. I had to nail the behavioral interviews. Not just by memorizing my stories, but by actually practicing. For real.

So, I made my husband do mock interviews with me. I sat in one room, he sat in another, and we jumped on Zoom like it was the real thing. Poor guy — he’s been at the same job since forever and works in a totally different field, but there he was, asking me random behavioral questions. At first, I didn’t think it was going to help. I figured he didn’t get what I did anyway. But when I started answering with my “well-crafted” stories, something surprising happened. I got nervous. And wordy. Way too wordy.

And then he cut me off. Not gently, either. He told me straight up: I was spending way too much time talking about the background. The company, the project, all the setup. He said by the time I got to the part about what I actually did, he had already tuned out. You know what? He was 100% correct and I’d never noticed it before. I never thought about how much time I was wasting on details that didn’t really matter to the person listening.

After that, I went back through my stories. Almost all of them had the same problem. Too much setup, not enough focus on action and results. Honestly? I was grateful for his brutal feedback. It was a little embarrassing, but I wished I’d done mock interviews like that years ago.

From then on, I decided to practice a lot more. With my new MLE resume ready, I started applying like crazy. Interviews came in, and instead of trying to avoid them, I leaned in. Earlier in my career, I was the kind of person who’d grab the first offer just to escape the stress of interviewing. Selling myself has always made me a little panicky. After all, I’m an introvert. But this time, things were different. The book Never Search Alone and those early mock interviews changed my mindset. (I’ll talk more about the book and how it kept me from rushing out of the interview process later.)

So I gave myself time. I said yes to almost every interview I could get. At one point, I interviewed with four companies over three days. It felt like a marathon, but somewhere along the way, I got good at telling my story. I watched how the interviewers reacted. I collected feedback from the process. And something strange happened: I stopped caring so much about the results. Whether I got a yes or a no didn’t shake me anymore. I wasn’t just interviewing to get a job. I was practicing to get the job I really wanted.

By the time I had three offers on the table and finally chose the one I liked, I knew I was done. That was my finish line. It felt like I’d run the full race and actually won the prize I wanted, not the one I settled for.

Seriously, I can’t say this enough: KEEP interviewing. Back-to-back if you can. Do mock interviews with whoever you trust, even if they aren’t in your field. Practice until you’re less worried about the outcome and more focused on getting better.

A Year Later: It’s hard to say how much of those interview skills I still have in me now. But if I ever need to practice again, you better believe I’ll be dragging my husband back into another round of mock interviews. Maybe even for business presentations. He’s a tough crowd, but he gets results :]

Panic Mode? Deep Breath, the Show Must Go On

During my interview marathon, I started noticing something that completely threw me off. Some interviewers looked… disappointed. Others seemed bored. And me? I cared. A lot. Probably too much. Every time I saw a face that wasn’t smiling or nodding, I panicked. In my head, I’d hear this loud voice saying, “Amy, you’re blowing it.” And once that thought crept in, it was over. My brain and body would scramble to fix the situation, so I’d start talking faster, throwing out more words, hoping to change their minds. I wanted to come across as sharp and impressive. But the truth is, I probably looked like a nervous, rambling mess. 

My husband confirmed it after one of our mock interviews. He didn’t sugarcoat it. “You’re not even looking at the camera,” he said. “And you seem really tense.” Again, he was right.

For an introvert like me, fixing this wasn’t easy. But I found two things that helped, so I’ll share them here.

The first was simple: breathe. Every time I spotted what I thought was a bad reaction, a frown, a yawn, that blank expression that felt like doom, I forced myself to pause. I took a breath. And instead of rushing to say more, I slowed down. Sometimes I even cracked a cold joke. (I’m surprisingly good at bad jokes. It might be my secret talent.) Then I’d apologize for the joke, take another breath, and move on. That little reset worked in two ways. First, it quieted the voice in my head screaming “You’re ruining this!” Second, it made the interviewer’s expression change. Maybe they smiled and got the joke. Maybe they just looked confused and didn’t like it. But at least they weren’t bored or disappointed anymore. I’ll take that.

The second thing I did was tape a picture of my daughter right behind the camera. Her big, shiny smile was right there, and every time I glanced at it, I smiled too. Which, by the way, made me look more relaxed and human on camera. Sometimes the interviewer smiled back, and just like that, the energy shifted. I wasn’t panicking anymore. I was back in control. The show was back on.

I started thinking of myself as a salesperson. Or maybe a showman. What do they do when the audience looks tired or distracted? They keep going. They adjust. They bring the energy back. If you’re like me, someone who takes those reactions personally, you need to have a plan. These were my two tricks. You’ll probably find your own. But the point is: don’t panic. Pause. Breathe. No one will notice. And then, get back to the show.

A Year Later: Honestly, this might be the most important skill I picked up during that tough year. I still use it all the time at work. When I’m presenting my work to a room full of people, I slow myself down. I picture myself in a fancy tailcoat, like an old-school showman, selling my ideas to the audience. Sometimes I throw in one of my classic cold jokes to keep things light.

When I wrap up a presentation, I make sure to give people something easy to take with them. I’ll say, “If you’re heading out and want one thing to remember about this project, here’s the punchline.” Then I boil it down to one or two sentences and say it clearly. Loud enough to stick.

I even use this trick in regular conversations, especially the awkward ones. A little pause makes everything less uncomfortable. And more often than not, things turn out better after that moment to reset.

Do the Mnookin Two-Pager exercise: How I Found a Job That Actually Fit Me

I keep mentioning the book Never Search Alone, and there’s a reason for that. When I first heard about it, I was skeptical. As an introvert, the idea of joining a group of strangers to talk about job hunting made me extremely uncertain and nervous. 

My first group didn’t go well. There were five of us, but two people refused to follow the process. They were often late or skipped meetings entirely. It was frustrating, and I almost gave up. Instead, I found another group through the Slack community. That time, it clicked. We met every Friday, and kept each other accountable. We helped one another stay sane through the search. It made a huge difference. If you want to know more about how the JSC (Job Search Council) helped me, I wrote about it in part one of this story.

Looking back, another useful thing the book offered was the Mnookin Two-Pager exercise. You sit down and write out what you love in a job, what you hate, and what your career goals are. Simple, but surprisingly powerful. It forced me to get honest with myself. Without it, I probably would have grabbed the very first offer and rushed out of the market, just to be done with it. I’ve done that before. And regretted it.

This time was different. My two-pager kept me grounded. I knew what I wanted and where I wasn’t willing to settle. That’s how I ended up at Disney. The role hits about 85% of what I was hoping for. More importantly, it steers clear of every red flag on my “hard no” list. A year later, I’m still glad I took the time to figure out exactly what I was looking for before saying yes to anything.


Finally! We Made It to the End. 

I’m so glad I finally sat down and finished this. Honestly, I’m the kind of person who thinks a lot. But writing things out like this helps me clear my head and hold on to the lessons I actually want to keep.

If you’ve enjoyed reading this, and you want to read more stories from me, or you just want to smile at how bad my jokes are, please keep an eye on my posts on TDS. Or better yet, subscribe to my newsletter where I write more frequently about AI and ML, along with life lessons, parenting, and, of course, a few of my cold jokes! If you’d like to support my writing, you can also just buy me a coffee at https://ko-fi.com/amyma101! ☕✨

My learning to being hired again after a year… Part I
https://towardsdatascience.com/my-learning-to-being-hired-again-after-a-year-part-i-b99a11255c5d/ (Sun, 23 Jun 2024)
For anyone job hunting, not just tech folks

One year ago today, on May 13th 2023, I was laid off. Today, I started the first day at my new job. Over the past year, I became a mother and discovered parts of myself I never knew existed.

I want to share some of my learnings from this journey. But if you’re looking for tips on cracking coding interviews or nailing behavioral questions, this isn’t that kind of post. Those often detail how many big tech companies the authors interviewed with, the offers they received, the prep resources they used, and even provide a funnel of their interview pipeline. They always conclude with, "It wasn’t easy. I cried and worried, but here I am. Good luck!" While I respect and appreciate their candor, they often leave me feeling anxious and inadequate.

This post is for anyone searching for a job, regardless of the type or stage you’re at. I want to reach out to those who feel cold and frustrated on their journey, as I once did. Here’s one of my personal philosophies: success stories don’t motivate me unless they detail the hardships and how they were overcome. I want to learn from mistakes and obstacles, not from someone else’s cheerful party.

Reclaim Your Identity: You Are More Than Just a Job Title

After my layoff last May, I was deeply depressed and couldn’t focus on anything. Pregnancy added to my free-floating anxiety, since I was worried about whether I would have enough income to raise the baby. After Thanksgiving, I started looking for ways to earn extra money alongside job hunting, unsure how long this "technical job winter" would last. That is when I started blogging and writing down all my learnings while reviewing technical concepts, which led to my series, ‘Courage to Learn ML’. The courage wasn’t just about reviewing ML basics, but about finding the courage to start looking for a job again after my delivery. For the first post, there were twenty views almost immediately after I published via TDS, and on that first day, I made $0.06. I was ecstatic. That paltry sum felt like a life preserver, a tiny but mighty affirmation that I could still contribute, still matter.

What I would like to share is the following: For most people who leave their last job, whether they quit or were laid off, the most significant challenge and frustration is the loss of their social identity. To regain balance, try to find new ways to connect and contribute. Start your own business (you don’t need any permission or offer letter to do that), blog, volunteer with local NGOs, or just spend quality time with your family. Those small incomes, contributions, and comments will be a light through the dark moments; the love, wisdom, and courage you give daily are what truly define you.

Rediscovering your identity is a journey, and every small step and the tiniest of victories, even $0.06, matters. You are valuable, you are needed, and you are more than your job title. Take the time to say goodbye to your old job and routines. Now is the time for something new.

Find Your Job Hunting Crew

During my job search, I read ‘Never Search Alone’ by Phyl Terry. It’s a simple book with a great premise: find a group of others in your job search and provide mutual support. Luckily, I was a member of a fabulous Job Search Council (JSC). We followed the book and did a lot of its job-search exercises together, but the most significant benefit for me was the emotional boost. Although we each went for different roles at various levels, we were all in the same emotional boat. We all had families to support, and everyone felt wholly devastated and anxious about the future. We shared weird yet wild stories about arrogant interviewers, about people being ghosted right after talking to hiring managers, shared tips to tweak our resumes and LinkedIn profiles in exciting ways, and created a bond. Even now, I still attend the JSC meetings every Friday that I can because they’ve become part of my family. I care about them deeply after being in this Job Search Squad.

For those frustrated job seekers: find fellow job seekers to share the journey – kind of like gym buddies for losing weight. Support each other, and be there through the highs and lows together. It will keep you going without burning out from self-doubt after being ghosted and failing.

Network Makeover: Reconnect and Refresh

As a typical introvert, I would sweat when running into former colleagues. I would start to script numerous ways to say hello in my mind before they even found me. To avoid ‘over’ socializing, I would leave a job and stay quiet, rarely contacting old coworkers. But with the waves of technical layoffs this year, I found myself needing to network to get back to work. Reluctantly, I messaged former colleagues and expected no replies.

Yet, ironically, the results were astonishing. My old manager from my second job offered an hour of feedback and motivation, pointing out areas for improvement. Another one, whom I thought did not have a positive impression of me, called to ask me to drop by the old office and look for openings on his team. I also made new friends with whom I could hang out, although we did not always share lunch.

One tough encounter came from my last direct manager, someone I respected. She postponed our hangouts multiple times, and when we finally met, it was awkward. During the conversation, she said that my layoff during pregnancy was perfect timing. I was shocked and disappointed. I knew she meant that the pregnancy bought me some time, or at least gave me a good excuse for finding a job later, but it still hurt, because she is also a woman, and even she didn’t understand the difficulties of pregnancy and postpartum. She didn’t realize my jobless status made me too anxious to fully enjoy parenting. After that meeting, I reached out to her for a recommendation, but she refused. I accepted this painful reality and acknowledged that while she was a good coworker, she didn’t want to be part of my network. So I stopped connecting with her.

My takeaway: don’t be afraid to ask for help, even if you think the person won’t assist. Don’t be too upset if they don’t want to help. Be grateful to those who accept and understand those who do not. Be a minimalist.

Don’t Over Plan, Just Do It.

When I started preparing for interviews, I created a detailed plan focusing on complex machine learning concepts. Two days later, I changed my plan because someone mentioned in their blog that these concepts are only asked in later interview stages, with coding algorithm questions coming first. A week later, I was back online, searching for materials on ML system design, because, according to some posts, this is the hardest part of a machine learning engineering interview. Over the next two months, I modified my plan multiple times, from version 1.0 to 7.0. In the end, I had devised a plan so detailed it could have been a NASA mission.

One day, I shared my brilliant plan with the smartest person I know. He took a look and said, "So, your actual preparation progress is zero." Ouch! Yes! I spent more time on planning than doing. My version 1.0 plan needed 4–6 months, while version 7.0 promised 3 months. But here’s the plot twist: the endless planning yielded almost zero benefit.

What I learned is that you often land the job when you’re only 80% done with your plan, or even less if the job market is hot. Also, interviews reveal many things that need tweaking, making your plan more personalized, not based on someone else’s job-hunting posts. So, dive into the messy preparation with your version 1.0 plan. Jump in! Even a bad plan beats no plan. Also, try to picture yourself as the hero in an RPG game – you’ll figure out your way to achieve your goal, one quirky quest at a time.

Every job hunter knows that overplanning can be a trap. However, it’s hard to avoid because anxiety often drives the search for shortcuts. The key is not to focus on the results but on the process.

Recruiters Know Best, Pick Their Brains

By the end of February, I had started to panic a bit. I had applied to several jobs, but my inbox remained suspiciously empty: zero interview invitations. I still spent my days glued to the computer screen, scanning through jobs and, in my mind, ticking off every requirement. "I’m the perfect candidate," I thought. But why weren’t they inviting me to interviews? Perhaps the 9-month gap? Should I mention that I spent those months being pregnant, parenting, and blogging? I was busy and trying to stay active in the field, after all.

The problem was that I had no idea how recruiters viewed my qualifications. Over time, I was getting more panicky as I wondered if my gap made me less valuable in this tough job market. Recruiters are the gatekeepers to the interview process, so how did they see me? Just as I was about to throw in the towel, a recruiter named Amanda reached out to the JSC community and offered to chat with anyone interested. I jumped at the chance and signed up for one of the open spots.

During our session, she reassured me that the gap wasn’t a problem at all but pointed out that my resume was too wordy and stuffed with technical jargon. Instead, she told me to highlight my most significant accomplishments on LinkedIn and not to include my gap months in the summary. She encouraged me to prepare a set of questions to ask interviewers; asking the same questions repeatedly would help me compare their answers and assess company fit. She also taught me a great way to introduce myself on calls, like asking what type of person the job requires and then highlighting my experience to match those requirements. This method quickly helps in two ways: finding out if the hiring manager understands what they need and showcasing your qualifications effectively.

After our chat, I started building connections with recruiters. Whenever I got a call about a position, I ended the conversation by asking for a little favor, a chance to ask one more question. I used the chance to ask about my resume, my LinkedIn, the current market, and the hiring process. I even gossiped a bit to gather more info. During one call, a recruiter suggested I include a link to my blog in the summary so managers could easily check it out. This trick proved very helpful when I started applying for jobs again with my updated resume.

My takeaway? Work with recruiters, chat with them beyond the initial basic call, and get their professional opinions. Gather every piece of advice you can to better market yourself. Think of it as getting the inside scoop and using it to sell yourself like a pro.

Market Yourself Right: Resume Reality Check

From the painful experience of applying to thousands of jobs and getting zero interviews, I learned some small but powerful lessons about resumes. At the beginning, I believed my resume was good enough because I had devoted so much time to it. I crafted each line to follow the famous Google XYZ rule. I was so sure of my resume that if anyone questioned its effectiveness, I would become defensive. But the reality is, my resume was not good enough to pass those resume screenings. So I started to work on my resume by treating it as if it were not mine. Here are some key things I learned during this painful process:

  1. Make your achievements visible and valuable to the audience. A resume isn’t just for industry insiders. Most readers of your resume are recruiters who may not fully understand the weight of your accomplishments. Be clear and direct about them. For instance, a 3% increase in model accuracy might be mind-blowing for people who work in the machine learning area, but it might not mean much to non-ML recruiters or engineers.
  2. Understand your job and highlight the exemplary aspects. If your job emphasizes outcomes, focus on describing those outcomes. If it requires specific technical skills, highlight those skills. What I want to say is: weight different parts of your resume to match the job. In my last position, for instance, my machine learning model brought revenue growth – an obvious and significant outcome. But as an ML engineer, I should focus on the algorithms, techniques, and methods used to achieve that, not the size of the impact.
  3. Value the reader’s time by making their life easier. If your resume is detail-packed, give some keywords and a summary. Help recruiters see your strengths quickly. One of the key tricks is adding specific keywords from the job descriptions to your resume. I took a lazier approach of going through job descriptions for keywords and then making a superset covering maybe 80–95% of them before including that in my resume. For example, I specifically mentioned A/B testing for experiment design and highlighted deep learning methods for algorithms. That change gave me an 80% chance of a recruiter call after applying for a job.
  4. Treat yourself like a product; your resume is your advertisement. The resume is all about language and presentation – margins, white space, and font size. For a long time, I felt my resume was too precious to touch because of the time I had spent on it. A good resource I found was the Reddit community r/EngineeringResumes, especially the wiki. It helped me rethink my resume twice rather than wait for recruiters to notice it.

This article, as it turned out, ended up much longer than I had expected. I started writing about these learnings in May and didn’t publish until June. I kept rewriting most of the content again and again, because I wanted to make it personal and genuine yet encouraging for my readers, so they could take away some kind of enjoyment from it and get a bit of strength to survive this bad job market.

A bit of an update about my current life: while the ordinary routine days have brought a lot of hustle and bustle, my baby, Ellie, has grown smarter. Being a mother and a machine learning engineer, I am experimenting with teaching her using machine learning techniques. These are exciting methods and interesting experiments, and probably one day I’m going to write about this learning process. For now, I hope you enjoy my insights, and I look forward to sharing the second part of my learnings soon.


Since you’ve made it to the end, it looks like you really enjoy my writing. So here’s a little shameless self-promotion about my other posts.


Courage to Learn ML: Tackling Vanishing and Exploding Gradients (Part 2)
https://towardsdatascience.com/courage-to-learn-ml-tackling-vanishing-and-exploding-gradients-part-2-d0b8aed1ce7a/ (Fri, 03 May 2024)
A Comprehensive Survey on Activation Functions, Weights Initialization, Batch Normalization, and Their Applications in PyTorch

Welcome back to a new chapter of "Courage to Learn ML." For those new to it, this series aims to make complex ML topics accessible and engaging, much like a casual conversation between a mentor and a learner, inspired by the writing style of "The Courage to Be Disliked," with a specific focus on machine learning.

This time we will continue our exploration of how to overcome the challenges of vanishing and exploding gradients. In our opening segment, we talked about why it’s critical to maintain stable gradients to ensure effective learning within our networks. We uncovered how unstable gradients can be barriers to deepening our networks, essentially putting a cap on the potential of deep "learning". To bring these concepts to life, we used the analogy of running a miniature ice cream factory named DNN (short for Delicious Nutritious Nibbles), drawing parallels to illuminate strategies for DNN training, akin to orchestrating a seamless factory production line.

Now, in this second installment, we’re diving deeper into each proposed solution, examining them with the same clarity and creativity that brought our ice cream factory to life. Here is the list of topics we’ll cover in this part:

  1. Activation Functions
  2. Weight Initialization
  3. Batch Normalization
  4. In Practice (Personal Experience)

Activation Functions

Activation functions are the backbone of our "factory" setup. They’re responsible for passing along information in both forward and backward propagation within our DNN assembly line. Picking the right ones is crucial for the smooth operation of our DNN assembly line and, by extension, our DNN training process. This part isn’t just a simple rundown of activation functions along with their advantages and disadvantages. Here, I will use a Q&A format to uncover the deeper reasoning behind the creation of different activation functions and to answer some important questions that are often overlooked.

Think of these functions as the blenders in our ice cream production analogy. Rather than offering a catalog of available blenders, I’m here to provide an in-depth review, examining the innovations of each and the reasons behind any specific enhancements.

What exactly are activation functions, and how do I choose the right one?

Image created by the author using ChatGPT.

Activation functions are the key elements that grant a neural network the flexibility and power to capture both linear and nonlinear relationships. The key distinction between logistic regression and DNNs lies in these activation functions combined with multiple layers; together they allow NNs to approximate a wide range of functions. However, this power comes with challenges. The choice of activation function needs careful consideration, since the wrong selection can stop the model from learning effectively, especially during backpropagation.

Picture yourself as the manager of our DNN ice cream factory. You’d want to meticulously select the right activation function (think of them as ice cream blenders) for your production line. This means doing your homework and sourcing the best fit for your needs.

So, the first step in choosing an effective activation function involves addressing two key questions:

How does the choice of activation function affect issues like vanishing and exploding gradients? What criteria define a good activation function?

Note: to deal with unstable gradients, our discussion focuses on the activations in the hidden layers. For the output activation function, the choice depends on the task, whether it’s a regression or classification problem, and whether it’s a multiclass problem.

When it comes to the choice of activation function in hidden layers, the problem is more related to vanishing gradients. This can be traced back to the traditional sigmoid activation function (our very basic model). The sigmoid function was widely used due to its ability to map inputs to a probability range (0, 1), which is particularly useful in binary classification tasks. This capability allowed researchers to adjust the probability threshold for categorizing predictions, enhancing model flexibility and performance.

However, its application in hidden layers has led to significant challenges, most notably the vanishing gradient problem. This can be attributed to two main factors:

  • During the forward pass, the sigmoid function compresses inputs to a very narrow range between 0 and 1. If a network uses only sigmoid activations in its hidden layers, repeated application through multiple layers further narrows this range. This compression effect not only reduces the variability of outputs but also introduces a bias towards positive values, since outputs remain between 0 and 1 regardless of the input sign.
  • During backpropagation, the derivative of the sigmoid function (which has a bell-shaped curve) yields values between 0 and 0.25. This small range can cause gradients to diminish rapidly, regardless of the input, as they propagate through multiple layers, resulting in vanishing gradients. Since earlier layer gradients are products of successive layer derivatives, this compounded product of small derivatives results in exponentially smaller gradients, preventing effective learning in earlier layers, as the short sketch after this list illustrates.
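To make the compounding effect concrete, here is a minimal numeric sketch (the depths chosen are arbitrary; the point is the exponential shrinkage):

```python
import torch

x = torch.linspace(-4, 4, steps=9)
s = torch.sigmoid(x)
# Sigmoid's derivative is s * (1 - s); it peaks at 0.25 when x = 0.
print((s * (1 - s)).max())  # tensor(0.2500)

# Even in the best case (0.25 per layer), the product of derivatives
# shrinks exponentially with depth.
for depth in (5, 10, 20):
    print(depth, 0.25 ** depth)
# 5  0.0009765625
# 10 9.5367431640625e-07
# 20 9.094947017729282e-13
```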

To overcome these limitations, an ideal activation function should exhibit the following properties:

  • Non-linearity. Allowing the network to capture complex patterns.
  • Non-saturation. The function and its derivative should not compress the input range excessively, preventing vanishing gradients.
  • Zero-centered Output. The function should allow for both positive and negative outputs, ensuring that the mean output across the nodes does not introduce bias towards any direction.
  • Computational Efficiency. Both the function and its derivative should be computationally simple to facilitate efficient learning.

Given these essential properties, how do popular activation functions build upon our basic model, the Sigmoid, and what makes them stand out?

This section aims to provide a general overview of the most widely used activation functions.

Tanh, A Simple Adjustment to Sigmoid. The hyperbolic tangent (tanh) function can be seen as a modified version of the sigmoid, offering a straightforward enhancement in terms of output range. By scaling and shifting the sigmoid, tanh achieves an output range of [-1, 1] with zero mean. This zero-centered output is advantageous as it aligns with our criteria for an effective activation function, ensuring that the input data and gradients are less biased toward any specific direction, whether positive or negative.
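The "scaled and shifted sigmoid" claim can be checked directly, since tanh(x) = 2·σ(2x) − 1. A quick sanity check in PyTorch (the random inputs are arbitrary):

```python
import torch

x = torch.randn(5)
lhs = torch.tanh(x)
rhs = 2 * torch.sigmoid(2 * x) - 1  # tanh written as a rescaled, shifted sigmoid
print(torch.allclose(lhs, rhs, atol=1e-6))  # True
```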

Despite these benefits, tanh retains the core characteristic of sigmoid in terms of its saturating S-shape, which means it still compresses the output into a narrow range. This compression leads to similar issues as observed with sigmoid, causing gradients to saturate and therefore affecting the network’s ability to learn effectively during backpropagation.

ReLU, a popular choice in NNs. ReLU (Rectified Linear Unit) stands out for its simplicity, operating as a piecewise linear function where f(x) = max(0, x). This means it outputs zero for any negative input and mirrors the input otherwise. What makes ReLU particularly appealing is its straightforward design, satisfying three of the four key properties we discussed above with ease. Its linear nature on the positive side avoids compressing outputs into a tight range, unlike sigmoid or tanh, and its derivative is simple, being either 0 or 1.

One intriguing aspect of ReLU is its ability to turn off neurons for negative inputs, introducing sparsity to models, similar to the effect of dropout regularization, which deactivates certain neurons. This can lead to more generalized models. However, it also leads to the "dying ReLU" issue, where neurons become inactive and stop learning due to zero output and gradient. While some neurons may come back to life, those in early layers in particular can be permanently deactivated. This is similar to halting feedback in an ice cream production line, where the early stages fail to adapt based on customer feedback or contribute useful intermediate products for subsequent stages.

Another point of consideration is ReLU’s non-differentiability at x=0, due to the sharp transition between its linear segments. In practice, frameworks like PyTorch manage this using the concept of subgradients, setting the derivative at x=0 to some fixed value within [0, 1]. This typically doesn’t pose an issue due to the rarity of exact zero inputs and the variability of data.
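If you are curious what value your own framework actually assigns at x = 0, a one-off autograd probe like this (just a check, not training code) will show it:

```python
import torch

x = torch.tensor(0.0, requires_grad=True)
torch.relu(x).backward()
# Whatever prints here is the subgradient your PyTorch build uses at exactly x = 0;
# any value in [0, 1] is a valid subgradient of max(0, x).
print(x.grad)
```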

So, is ReLU the right choice for you? Many researchers say yes, thanks to its simplicity, efficiency, and support from major DNN frameworks. Moreover, recent studies, like one at https://arxiv.org/abs/2310.04564, highlight ReLU’s ongoing relevance, marking a kind of renaissance in the ML world.

In certain applications, a variant known as ReLU6, which caps the output at 6, is used to prevent overly large activations. This modification, inspired by practical considerations, further illustrates the adaptability of ReLU in various neural network architectures. Why cap at 6? You can find the answer in this post.

Leaky ReLUs, a slight twist on the classic ReLU. When we take a closer look at ReLU, a couple of issues emerge: its zero output for negative inputs, leading to the "dying ReLU" problem where neurons cease to update during training, and its preference for positive values, which can introduce a directional bias in the model. To counter these drawbacks while retaining ReLU’s advantages, researchers developed several variations, including the concept of ‘leaky’ ReLUs.

Leaky ReLU modifies the negative part of ReLU, giving it a small, non-zero slope. This adjustment allows negative inputs to produce small negative outputs, effectively ‘leaking’ through the otherwise zero-output region. The slope of this leak is controlled by a hyperparameter α, which is typically set close to 0 to maintain a balance between sparsity and keeping neurons active. By allowing a slight negative output, Leaky ReLU aims to center the activation function’s output around zero and prevent neurons from becoming inactive, thus addressing the "dying ReLU" issue.

However, introducing α as a hyperparameter adds a layer of complexity to model tuning. To manage this, variations of the original Leaky ReLU have been developed:

  • Randomized Leaky ReLU (RReLU): This version randomizes α within a specified range during training, fixing it during evaluation. The randomness can help in regularizing the model and preventing overfitting.
  • Parametric Leaky ReLU (PReLU): PReLU allows α to be learned during training, adapting the activation function to the specific needs of the dataset. Even though this can enhance model performance by tailoring α to the training data, it also risks overfitting.

Exponential Linear Unit (ELU), an Improvement on Leaky ReLU by Enhancing Control Over Leakage. Both Leaky ReLUs and ELUs allow negative values, which help in pushing mean unit activations closer to zero and maintaining the vitality of the activation functions. The challenge with Leaky ReLUs is their inability to regulate the extent of these negative values; theoretically, these values could extend to negative infinity, despite intentions to keep them small. ELU addresses this by incorporating a nonlinear exponential curve for non-positive inputs, effectively narrowing and controlling the negative output range so that it saturates at −α (where α is a new hyperparameter, typically set to 1). Additionally, ELU is a smooth function. Its exponential component enables a seamless transition between negative and positive values, which is advantageous for gradient-based optimization because it ensures a well-defined gradient across all input values. This feature also resolves the non-differentiability issues seen with ReLU and Leaky ReLUs.
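To see side by side how these variants treat negative inputs, here is a small sketch using PyTorch’s built-in modules (the sample inputs, the 0.01 slope, and the RReLU range are arbitrary illustration choices):

```python
import torch
import torch.nn as nn

x = torch.tensor([-3.0, -1.0, 0.0, 1.0, 3.0])

acts = {
    "ReLU": nn.ReLU(),                               # zeroes out all negatives
    "LeakyReLU": nn.LeakyReLU(negative_slope=0.01),  # small constant slope for x < 0
    "PReLU": nn.PReLU(init=0.25),                    # the slope alpha is a learnable parameter
    "RReLU": nn.RReLU(lower=0.1, upper=0.3),         # alpha randomized per element during training
    "ELU": nn.ELU(alpha=1.0),                        # negatives saturate smoothly toward -alpha
}

for name, act in acts.items():
    act.eval()  # RReLU uses the midpoint of [lower, upper] in eval mode
    print(f"{name:10s} {act(x).detach()}")
```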

Scaled Exponential Linear Unit (SELU), an Enhanced ELU with Self-Normalizing Properties. SELU is essentially a scaled version of ELU designed to maintain zero mean and unit variance within neural networks – a concept we’ll explore further in our discussion on Batch Normalization. By integrating a fixed scale factor, λ (which is greater than 1), SELU ensures that the slope for positive net inputs exceeds one. This characteristic is particularly useful as it amplifies the gradient in scenarios where the gradients of the lower layers are diminished, helping to prevent the vanishing gradient problem often encountered in deep neural networks.

Note that the scale factor λ is applied to both negative and positive inputs to uniformly scale the gradient during backpropagation. This uniform scaling helps maintain variance within the network, which is crucial for the SELU activation function’s self-normalizing properties.

For SELU, the parameters (α and λ) have fixed values and are not learnable, which simplifies the tuning process since there are fewer parameters to adjust. You can find these specific values in the SELU implementation in PyTorch.

SELU was introduced by Günter Klambauer et al. in their paper. This comprehensive paper includes an impressive 92-page appendix, which provides detailed insights for those curious about the derivation of the specific values of α and λ. You can find the calculations and rationale behind these parameters in the paper itself.

SELU is indeed a sophisticated "blender" in the world of activation functions, but it comes with specific requirements. It’s most effective in feedforward or sequential networks and may not perform as well in architectures like RNNs, LSTMs, or those with skip connections due to its design.

The self-normalizing feature of SELU requires that input features be standardized – having a mean of 0 and a unit standard deviation is crucial. Additionally, every hidden layer’s weights must be initialized using LeCun normal initialization, where weights are sampled from a normal distribution with a mean of 0 and a variance of 1/fan_in. If you’re not familiar with the term "fan_in," I’ll explain it in a dedicated section on weight initialization.

In summary, for SELU’s self-normalization to function effectively, you need to ensure that the input features are normalized and that the network structure remains consistent without any interruptions. This consistency helps maintain the self-normalizing effect throughout the network without any leakage.
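Here is a minimal sketch of what that setup could look like in PyTorch. PyTorch has no dedicated "LeCun normal" initializer, but kaiming_normal_ with mode='fan_in' and nonlinearity='linear' produces the same N(0, 1/fan_in) weights; the lecun_normal_ helper and the layer sizes below are my own illustrative choices:

```python
import torch
import torch.nn as nn

def lecun_normal_(linear):
    # Assumed helper: gain 1 ('linear') with mode='fan_in' gives std = sqrt(1 / fan_in),
    # i.e. LeCun normal initialization.
    nn.init.kaiming_normal_(linear.weight, mode="fan_in", nonlinearity="linear")
    nn.init.zeros_(linear.bias)

model = nn.Sequential(
    nn.Linear(64, 128), nn.SELU(),
    nn.Linear(128, 128), nn.SELU(),
    nn.Linear(128, 1),
)
for m in model:
    if isinstance(m, nn.Linear):
        lecun_normal_(m)

# Self-normalization also expects standardized inputs (mean 0, unit variance).
x = torch.randn(256, 64)                # already ~N(0, 1) here for illustration
h = model[1](model[0](x))               # first Linear + SELU block
print(h.mean().item(), h.std().item())  # should stay roughly near 0 and 1
```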

GELU (Gaussian Error Linear Unit) is an innovative activation function that incorporates the idea of regularization from Dropout. Unlike traditional ReLU, which outputs zero for negative inputs, Leaky ReLU, ELU, and SELU allow negative outputs. This helps shift the mean of the activations closer to zero, reducing bias, but without zeroing out negative inputs entirely. However, this leakage means they lose some of the benefits of the "dying ReLU" behavior, where inactivity in some neurons can lead to a sparser, more generalized model.

Considering the benefits of sparsity seen in dying ReLU and Dropout’s ability to randomly deactivate and reactivate neurons, GELU takes this a step further. It combines dying ReLU’s feature of zero outputs with an element of randomness, allowing neurons to potentially ‘come back to life’. This approach not only maintains beneficial sparsity but also reintroduces neuron activity, making GELU a robust solution. To fully appreciate its mechanics, let’s take a closer look at GELU’s definition:

GELU(x) = x · Φ(x), where Φ(x) is the cumulative distribution function (CDF) of the standard normal distribution. (Formula image created by the author using Mathcha.com.)

In the GELU activation function, the CDF, Φ(x), or the standard Gaussian cumulative distribution function, plays a key role. This function represents the probability that a standard normal random variable will have a value less than or equal to x. Φ(x) transitions smoothly from 0 (for negative inputs) to 1 (for positive inputs), effectively controlling the scaling of the input when modeled with a normal distribution N(0,1). According to a paper by Dan Hendrycks et al. (source), the use of the normal distribution is justified because neuron inputs tend to follow a normal distribution, particularly when using batch normalization.

The function’s design allows inputs to be "dropped" more frequently as x decreases, making the transformation both stochastic and dependent on the input value. This mechanism helps keep the shape similar to the ReLU function by making the usual straight-line function, f(x) = x, smoother, and avoiding sudden changes that you get with a piecewise linear function. The most significant feature of GELU is that it can completely inactivate neurons, potentially allowing them to reactivate with changes in input. This stochastic nature acts like a selective dropout that isn’t entirely random but instead relies on the input, giving neurons the chance to become active again.

Cumulative distribution function from Wikipedia. Source: https://upload.wikimedia.org/wikipedia/commons/thumb/c/ca/Normal_Distribution_CDF.svg/300px-Normal_Distribution_CDF.svg.png

To summarize, GELU’s main advantage over ReLU is that it considers the entire range of input values, not just whether they’re positive or negative. As Φ(x) decreases, it increases the chances that the GELU function output will be closer to 0, subtly "dropping" the neuron in a probabilistic way. This method is more sophisticated than the typical dropout approach because it depends on the data to determine neuron deactivation, rather than doing it randomly. I find this approach fascinating; it’s like adding a soft cream to an artisan dessert, enhancing it subtly but significantly.

GELU has become a popular activation function in models like GPT-3, BERT, and other Transformers due to its efficiency and strong performance in language processing tasks. Although it’s computationally intensive because of its probabilistic nature, the curve of the standard Gaussian cumulative distribution, Φ(x), is similar to the sigmoid and tanh functions. Interestingly, GELU can be approximated using tanh or with the formula x·σ(1.702x), where σ is the sigmoid function. Despite these possibilities for simplification, PyTorch’s implementation of GELU is fast enough that such approximations are often unnecessary.
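PyTorch exposes both the exact GELU and, in recent versions, the tanh approximation; the sigmoid form x·σ(1.702x) is easy to write by hand. A quick comparison sketch (the input range is arbitrary):

```python
import torch
import torch.nn as nn

x = torch.linspace(-3, 3, steps=7)

exact = nn.GELU()(x)                           # x * Phi(x) with the exact Gaussian CDF
tanh_approx = nn.GELU(approximate="tanh")(x)   # tanh-based approximation (recent PyTorch)
sigmoid_approx = x * torch.sigmoid(1.702 * x)  # x * sigmoid(1.702 x) approximation

print((exact - tanh_approx).abs().max())       # very small
print((exact - sigmoid_approx).abs().max())    # small, but noticeably larger
```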

Before we dive deeper into more whys, let’s try to summarize.

What exactly makes a good activation function, judging from ReLU and the other activation functions inspired by it?

Günter Klambauer et al.’s paper, where SELU was introduced, highlights essential characteristics of an effective activation function:

  1. Range: It should output both negative and positive values to help manage the mean activation level across the network.
  2. Saturation regions: These are areas where the derivative approaches zero, helping to stabilize overly high variances from lower layers.
  3. Amplifying slopes: A slope greater than one is crucial to boost variance when it’s too low in the lower layers.
  4. Continuity: A continuous curve ensures a fixed point where the effects of variance damping and increasing are balanced.

Additionally, I would suggest two more criteria for an "ideal" activation function:

  1. Non-linearity: This is obvious and necessary because linear functions can’t model complex patterns effectively.
  2. Dynamic output: The ability to output zero and change outputs based on input data allows for dynamic neuron activation and deactivation, which lets the network adjust to varying data conditions efficiently.

Can you give me a more intuitive explanation of why we want activation functions to output negative values?

Think of activation functions as blenders that transform the original input data. Just like blenders that might favor certain ingredients, activation functions can introduce biases based on their inherent characteristics. For example, sigmoid and ReLU functions typically yield only non-negative outputs, regardless of the input. This is akin to a blender that always produces the same flavor, no matter what ingredients you put in.

Image created by the author using ChatGPT.

To minimize this bias, it’s beneficial to have activation functions that can output both negative and positive values. Essentially, we aim for zero-centered outputs. Imagine a seesaw representing the output of an activation function: with functions like Sigmoid and ReLU, the seesaw is heavily tilted towards the positive side, as these functions either ignore or zero out negative inputs. Leaky ReLU attempts to balance this seesaw by allowing negative inputs to produce slightly negative outputs, although the adjustment is minor due to the linear and constant nature of its negative slope. Exponential Linear Unit (ELU), on the other hand, provides a more dynamic push on the negative side with its exponential component, helping the seesaw approach a more balanced state at zero. This balance is crucial for maintaining healthy gradient flow and efficient learning in neural networks, as it ensures that both positive and negative updates contribute to training, avoiding the limitations of unidirectional updates.
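One way to make the seesaw picture concrete is to push the same zero-mean inputs through each activation and compare the mean of the outputs (a rough empirical check; the sample size is an arbitrary choice):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(100_000)  # zero-mean, unit-variance inputs

for name, fn in [
    ("sigmoid", torch.sigmoid),    # outputs in (0, 1): mean pulled toward ~0.5
    ("relu", F.relu),              # outputs >= 0: mean clearly positive
    ("leaky_relu", F.leaky_relu),  # tiny negative leak: mean only slightly lower
    ("elu", F.elu),                # negatives saturate at -1: mean much closer to 0
    ("tanh", torch.tanh),          # symmetric around 0: mean near 0
]:
    print(f"{name:10s} mean activation: {fn(x).mean().item():+.3f}")
```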

Could we create an activation function like ReLU that zeros out positive inputs instead, similar to using min(0, x)? Why do we prefer functions that approach zero from the negative side rather than zeroing out the positive inputs?

Here, saturation on the negative side means that once the input value drops below a certain threshold, further decreases have less and less effect on the output. This limits the impact of large negative inputs.

Certainly, you could design a version of ReLU that zeroes out positive values and lets negative values pass unchanged, like f(x) = min(x, 0). This is technically feasible because the important aspect here isn’t the sign of the values but rather introducing non-linearity into the network. It’s important to remember that these activation functions are typically used in hidden layers, not output layers, so they don’t directly affect the sign of the final output. In other words, the presence of these activation functions within the network means the final output can still be both positive and negative, unaffected by the specific characteristics of these layers.

No matter the sign of the output, the network’s weights and biases can adjust to match the required sign of the output. For example, with traditional ReLU, if the output is 1 and the subsequent layer’s weight is 1, the output remains 1. Similarly, if a proposed ReLU variant outputs -1, and the weight is -1, the result is still 1. Essentially, we are more concerned with the magnitude of the output rather than its sign.

Therefore, ReLU saturating on the negative side is not fundamentally different from it saturating on the positive side. However, the reason we value zero-centered activation functions is that they prevent any inherent preference for positive or negative values, avoiding unnecessary bias in the model. This balance helps maintain neutrality and effectiveness in learning across the network.

I get that for functions like Leaky ReLU, we want to output negative values to keep the output centered around zero. But why are ELU, SELU, and GELU specifically designed to saturate with negative inputs?

To understand this, we can look at the biological inspiration behind ReLU. ReLU mimics biological neurons which have a threshold; inputs above this threshold activate the neuron, while inputs below it do not. This ability to switch between active and inactive states is crucial in neural function. When considering variations like ELU, SELU, and GELU, you’ll notice that their design addresses two distinct needs:

  • Positive region: Allows signals that exceed the threshold to pass through unchanged during the forward pass, essentially transmitting the desired signals.
  • Negative region: Serves to minimize or filter out unwanted signals and mitigate the impact of large negative values, acting like a leaky gate.

These functions essentially act as gates for inputs, managing what should and should not influence the neuron’s output. For instance, SELU utilizes these two aspects distinctively:

  • Positive region: The scaling factor λ (greater than 1) not only passes but slightly amplifies the signal. During backpropagation, the derivative in this region remains constant (about 1.0507), enhancing small but useful gradients to counteract vanishing gradients.
  • Negative region: The derivative ranges between 0 and λα (with typical values λ ≈ 1.0507 and α ≈ 1.6733), leading to a maximum derivative of about 1.7583. Here, the function nearly approaches zero, effectively reducing overly large gradients to help with the exploding gradient problem.

Plotting the first derivatives of these activation functions side by side illustrates this behavior nicely.
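To make the numbers above concrete, here's a minimal sketch using PyTorch's autograd to check SELU's outputs and first derivatives at a few points (the constants are the λ ≈ 1.0507 and α ≈ 1.6733 mentioned earlier):

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-5.0, -1.0, -0.1, 0.1, 1.0, 5.0], requires_grad=True)
y = F.selu(x)
y.sum().backward()

print(y)       # large negative inputs saturate near -lambda * alpha, about -1.758
print(x.grad)  # ~1.0507 for positive inputs; shrinks toward 0 as x goes to -infinity
```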

This design allows these activation functions to balance enhancing useful signals while dampening potentially harmful extremes, providing a more stable learning environment for the network.

The concept of activation functions serving as gates is not a new idea. It has a strong precedent in structures like LSTMs where sigmoid functions decide what to remember, update, or forget. This gating concept helps us understand why variations of ReLU are designed in specific ways. For instance, GELU acts as a dynamic gate that uses a scale factor derived from the standard normal distribution’s cumulative distribution function (CDF). This scaling allows a small fraction of the input to pass through when it’s close to zero, and lets larger positive values pass through largely unaltered. By controlling how much of the input influences subsequent layers, GELU facilitates effective information flow management, particularly useful in architectures like transformers.

All three mentioned activation functions, ELU, SELU, and GELU, make the negative side smoother. This smooth saturation of negative inputs doesn’t just mitigate the effects of large negative values. It also makes the network less sensitive to fluctuations in input data, leading to more stable feature representations.

In summary, the specific area of saturation, whether positive or negative, doesn’t fundamentally matter since these activation functions operate within the middle layers of a network, where weights and biases can adapt accordingly. However, the design of these functions, which allows one side to pass signals unchanged or even amplified while the other side saturates, is important. This arrangement helps organize the signal and facilitate effective backpropagation, enhancing the overall performance and learning stability of the network.

When should we choose each activation function? Why is ReLU still the most popular activation function in practice?

Choosing the right activation function depends on several factors, including computational resources, the specific needs of the network architecture, and empirical evidence from prior models.

  1. Computational Resources: If you have enough computational resources, experimenting with different activation functions using cross-validation can be insightful. This allows you to tailor the activation function to your specific model and dataset. Note that when using SELU, you generally don’t need batch normalization, which can simplify the architecture, unlike other functions where batch normalization might be necessary.
  2. Empirical Evidence: Certain functions have become standard for specific applications. For example, GELU is often the preferred choice for training transformer models due to its effectiveness in these architectures. SELU, with its self-normalizing properties and lack of hyperparameters to tune, is particularly useful for deeper networks where training stability is crucial.
  3. Computation Efficiency and Simplicity: For scenarios where computational efficiency and simplicity are priorities, ReLU and its variants are excellent choices. Plain ReLU in particular avoids the need for extra parameter tuning and supports the model's sparsity and generalization, helping to reduce overfitting.

Despite the advent of more sophisticated functions, ReLU remains extremely popular due to its simplicity and efficiency. It’s straightforward to implement, easy to understand, and provides a clear method to introduce non-linearity without complicating the computation. The function’s ability to zero out negative parts simplifies calculations and enhances computational speed, which is advantageous especially in large networks.

ReLU’s design inherently increases the model’s sparsity by zeroing out negative activations, which can improve generalization – a critical factor given that overfitting is a significant challenge in training deep neural networks. Moreover, ReLU does not require any extra hyperparameters, which contrasts with functions like PReLU or ELU that introduce additional complexity into model training. Furthermore, because ReLU has been widely adopted, many machine learning frameworks and libraries offer optimizations specifically for it, making it a practical choice for many developers.

In summary, while newer activation functions offer certain benefits for specific scenarios, ReLU’s balance of simplicity, efficiency, and effectiveness makes it a go-to choice for many applications. When moving forward with any activation function, understanding its characteristics thoroughly is crucial to ensure it aligns with your model’s needs and to facilitate troubleshooting during model training.

PyTorch offers a variety of activation functions, each with specific applications and benefits detailed in its documentation. I won't cover every one of them here due to length constraints (softplus, for example). It's important to think of these functions as blenders that modify inputs in different ways, each building upon the functionality of its predecessors. Understanding how each function evolves from the last helps you quickly grasp new ones and evaluate their advantages and disadvantages. We will dive into how these activation functions interact with different weight initialization strategies later, further enhancing the effective use of these tools in neural network design.

For a more detailed exploration of PyTorch's activation functions, you can always refer to the official documentation from PyTorch.
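As a quick illustration of the "blender" idea, here's a minimal sketch that pushes the same inputs through several of PyTorch's activation modules so you can compare how each one reshapes them:

```python
import torch
import torch.nn as nn

x = torch.linspace(-3, 3, 7)

# A few of the activation "blenders" PyTorch ships with
activations = {
    "ReLU": nn.ReLU(),
    "LeakyReLU": nn.LeakyReLU(negative_slope=0.01),
    "ELU": nn.ELU(alpha=1.0),
    "SELU": nn.SELU(),
    "GELU": nn.GELU(),
    "Softplus": nn.Softplus(),
}

for name, act in activations.items():
    print(f"{name:10s}", act(x))
```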

Weight Initialization

Alright, let's set aside the search for the perfect activation function to stabilize gradients and focus on another crucial part of setting up our neural network properly: initializing the weights efficiently.

Before diving into the most popular methods for weight initialization, let’s address a fundamental question:

Note that weight initialization is actually more complex than it might seem at first glance, and this post only scratches the surface of the subject. As I mentioned, choosing the right starting point for weight initialization is crucial for the effective optimization of your network. If you’re looking for a deeper and more comprehensive understanding, I recommend checking out the detailed review available here. This could really enhance your grasp of the techniques involved.

Why is weight initialization important, and how can it help mitigate unstable gradients?

Proper weight initialization ensures that gradients flow correctly throughout the model, similar to how semi-finished products are passed around in an ice cream factory: it's important not only that the initial machine settings are correct but also that every department works efficiently.

Weight initialization aims to maintain a stable flow of information both forward and backward through the network. Weights that are too large or too small can cause problems. Excessively large weights might inflate the output during the forward pass, leading to oversized predictions, while very small weights might diminish the output too much. During backpropagation, the magnitude of these weights becomes critical: if a weight is too large, it can cause the gradient to explode; if too small, the gradient might vanish. Understanding this, we avoid initializing weights at extremes, such as zero (which nullifies outputs and gradients) or excessively high values. This balanced approach helps maintain the network's efficacy and prevents the issues associated with unstable gradients.

What is a good way to initialize weights?

First and foremost, the best weight initialization often comes from using weights that have been pre-trained. If you can obtain a set of weights that have already undergone some learning and are trending towards minimizing loss, continuing from this point is ideal.

However, if you’re starting from scratch, you’ll need to carefully consider how to initially set your weights, especially to prevent unstable gradients. Here’s what you should aim for in a good weight initialization:

  • Avoid extreme values. As we discussed previously, weights should be neither too large, too small, nor zero. Properly scaled weights help maintain stability during both the forward and backward passes of network training.
  • Break symmetry. It's quite important that weights are diverse to prevent neurons from mirroring each other's behavior, which would lead them to learn the same features and ignore others. This lack of differentiation can severely limit the network's ability to model complex patterns. Different initial weights help each neuron start learning different aspects of the data. This is like how having various types of production lines in an ice cream factory enhances the range of flavors that can be produced.
  • Position favorably on the loss surface. Initial weights should place the model in a decent starting position on the loss surface to make the journey toward the global minimum more feasible. Since we don’t have a clear picture of what the loss landscape looks like, introducing some randomness in weight initialization can be beneficial.

This is why setting all weights to zero is problematic. It causes symmetry issues, where all neurons behave the same and learn at the same rate, preventing the network from effectively capturing diverse patterns. Zero weights also lead to zero outputs, especially with ReLU and its variations, resulting in zero gradients. This lack of gradient flow stops learning altogether, rendering all neurons inactive.
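A minimal sketch makes this concrete: with all-zero weights and biases feeding a ReLU, both the outputs and the gradients collapse to zero, so nothing ever updates.

```python
import torch
import torch.nn as nn

layer = nn.Linear(4, 3)
nn.init.zeros_(layer.weight)
nn.init.zeros_(layer.bias)

x = torch.randn(8, 4)
out = torch.relu(layer(x))   # every pre-activation is 0, so every output is 0
out.sum().backward()

print(out.abs().max())                # tensor(0.)
print(layer.weight.grad.abs().max())  # tensor(0.) -- zero gradients, no learning
```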

Why not initialize all weights with a small random number?

While using small random numbers to initialize weights can be helpful, it often lacks sufficient control. Randomly assigned weights might be too small, leading to a vanishing gradient problem, where updates during training become insignificantly small, stalling the learning process. Furthermore, completely random initialization doesn’t guarantee the breaking of symmetry. For example, if the initialized values are too similar or all have the same sign, the neurons might still behave too similarly, failing to learn diverse aspects of the data.

In practice, more structured approaches to initialization are used. Famous methods include Glorot (or Xavier) initialization, He (or Kaiming) initialization, and LeCun initialization. These techniques typically rely on either normal or uniform distributions but are calibrated to consider the size of the previous and next layers, providing a balance that promotes effective learning without the risk of vanishing or exploding gradients.

If so, why not just use a standard normal distribution (N(0,1)) for weight initialization?

Using a standard normal distribution (N(0,1)) provides some control over the randomization process, but it isn’t sufficient for optimal weight initialization due to the lack of control over variance. The mean of zero is a solid choice as it helps ensure weights do not all share the same sign, effectively helping to break symmetry. However, a variance of 1 can be problematic.

Consider a scenario where the activation function inputs, 𝑍, depend on the weights. Suppose 𝑍 is calculated by summing the outputs of 𝑁 neurons from the previous layer, each with weights initialized from a standard normal distribution. Here, 𝑍 would also be normally distributed with a mean of zero, but its variance would be 𝑁. If 𝑁=100, for example, the variance of 𝑍 becomes 100, which is too large and leads to uncontrolled inputs into the activation function, potentially causing unstable gradients during backpropagation. Using an ice cream factory as an analogy, this would be like setting a high tolerance for errors in each machine’s settings, resulting in a final product that deviates significantly from the desired outcome due to lack of quality control.

So why do we care about the variance of 𝑍? The variance controls the spread of 𝑍 values. If the variance is too small, the output of 𝑍 may not vary enough to effectively break symmetry. However, a variance that is too large produces values that are either too high or too low. For activation functions like sigmoid, extremely high or low inputs push the outputs towards the function's saturating tails, which can cause the vanishing gradient problem.

Therefore, when initializing weights with a random draw from a distribution, both the mean and the variance are crucial. The goal is to set the mean to zero to break symmetry effectively, while also minimizing the variance to ensure that the semi-product (i.e., the neuron outputs) is neither too large nor too small. Proper initialization ensures a stable flow of information through the network, both forward and backward, maintaining an efficient learning process without introducing instability in gradients. A thoughtful approach to initialization can, therefore, result in a network that learns effectively and robustly.
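Here's a minimal sketch of the point above: with unit-variance inputs and weights drawn from a standard normal, the variance of Z grows with the number of incoming neurons, while simply scaling the weight variance down to 1/N keeps it near 1 (the numbers in the comments are approximate).

```python
import torch

fan_in = 100
x = torch.randn(10_000, fan_in)   # unit-variance inputs from the previous layer

w_standard = torch.randn(fan_in)                         # weights from N(0, 1)
w_scaled = torch.randn(fan_in) * (1.0 / fan_in) ** 0.5   # weight variance 1/fan_in

print((x @ w_standard).var())  # ~100: far too spread out
print((x @ w_scaled).var())    # ~1: controlled input to the activation function
```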

So, to control the output values in the middle layers of a neural network, which also serve as inputs for subsequent layers, we use distributions with carefully chosen mean and variance for weight initialization. But how do the most popular methods achieve control over this variance?

Before diving into the most common ways to initialize weights, it's important to note that the variance of 𝑍 is influenced not only by the variance of the weight initialization but also by the number of neurons involved in calculating 𝑍. If only 16 neurons are used, the variance of 𝑍 is 16, whereas with 100 neurons, it rises to 100. Essentially, this variance isn't only influenced by the distribution from which the weights are drawn but also by the number of neurons contributing to the calculation, known as the "fan-in." Fan-in refers to the number of input connections coming into a neuron. Similarly, "fan-out" refers to the number of output connections a neuron has.

Let me illustrate with an example: Suppose there is a middle layer in a neural network with 200 neurons, connected to a previous layer of 100 neurons and a subsequent layer of 300 neurons. In this case, the fan-in for this layer is 100, and the fan-out is 300.

Using fan-in and fan-out provides a mechanism to control the variance during weight initialization.

  • Fan-in helps control the variance of the output 𝑍 of the current layer during the forward pass.
  • Fan-out helps control the variance of the gradients flowing back through the layer's weights during backpropagation.

Building on the idea of counting the neurons that feed into and out of the current layer, researchers have developed a family of initialization methods: LeCun, Xavier/Glorot, and He/Kaiming initialization. The idea behind these methods is quite similar: weights are drawn at random from either a uniform or a normal distribution, with fan-in and/or fan-out used to control the variance. The mean of each distribution is 0, so that the output values are centered around zero.

In this post, I only provide a quick overview of weight initialization methods. To see the detailed explanations, one can refer to the book Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow and the Weights & Biases post here for practical insights and historical context.

Different types of initializations

| Initialization | Activation functions          | σ² (Normal) |
| -------------- | ----------------------------- | ----------- |
| Xavier/Glorot  | None, tanh, logistic, softmax | 1 / fan_avg |
| He/Kaiming     | ReLU and variants             | 2 / fan_in  |
| LeCun          | SELU                          | 1 / fan_in  |

Lecun Initialization is based on scaling down the variance of 𝑍 by using a smaller variance for the weight distribution. If the variance of 𝑍 is the product of fan-in and the variance of each weight, then to ensure 𝑍 has a variance of 1, the variance of each weight should be 1/fan-in. Thus, Lecun initialization draws weights randomly from 𝑁(0,1/fan-in).

Xavier/Glorot Initialization considers not just the impact of the previous layer's weights (fan-in) but also the effect these weights have during backpropagation on the subsequent layer (fan-out). It balances the variance during both the forward and backward pass by using 2/(fan_in + fan_out) for the variance, from which weights can be drawn either from a normal distribution, N(0, 2/(fan_in + fan_out)), or from a uniform distribution, U(-sqrt(6/(fan_in + fan_out)), sqrt(6/(fan_in + fan_out))).

The uniform distribution U(-sqrt(6/(fan_in + fan_out)), sqrt(6/(fan_in + fan_out))) has the same variance as the normal version, 2/(fan_in + fan_out): for a uniform distribution defined by its lower and upper bounds, the variance is simply (upper bound - lower bound)² / 12. (source)

He/Kaiming Initialization is tailored for ReLU and its variants due to their unique properties. Since ReLU zeroes out negative inputs, roughly half of the neuron activations are expected to be zero, which reduces the variance of the layer's outputs and can contribute to vanishing gradients. To compensate, He initialization doubles the variance used in LeCun's method, effectively using 2 * 1/fan_in, thus maintaining the necessary balance for layers using ReLU. For leaky ReLUs and ELUs, while the adjustments are minor (e.g., using a factor of 1.55 for ELU instead of 2, source), the principle remains the same: we adjust the variance to stabilize gradients during backpropagation. In contrast, SELU requires using LeCun initialization across all hidden layers to leverage its self-normalizing properties.

This discussion opens up an interesting aspect of how weight initialization is implemented in frameworks like PyTorch, which can be framed as a question –

How is weight initialization implemented in PyTorch, and what makes it special?

In PyTorch, the default approach for initializing weights in linear layers is based on the Lecun initialization method. On the other hand, the default initialization technique used in Keras is the Xavier/Glorot initialization.

However, PyTorch offers a particularly flexible approach when it comes to weight initialization, allowing users to fine-tune the process to match the specific requirements of different activation functions used in their models. This fine-tuning is achieved by considering two key components:

  1. Mode: This component determines whether the variance of the initialized weights is adjusted based on the number of input connections (fan-in) or the number of output connections (fan-out) in the layer.
  2. Gain: This is a scaling factor that adjusts the scale of the initialized weights depending on the activation function employed in the model. PyTorch provides a [torch.nn.init.calculate_gain](https://pytorch.org/docs/stable/nn.init.html#torch.nn.init.calculate_gain) function that calculates a tailored gain value, optimizing the initialization process to enhance the overall functioning of the neural network.

This flexibility in customizing weight initialization parameters allows you to set up an initialization approach that is comparable to and compatible with the specific activation functions used in your model. Interestingly, PyTorch’s implementation of weight initialization can help reveal some underlying relationships between different initialization methods.

For instance, while reviewing the PyTorch documentation on the SELU activation function, I discovered an intriguing aspect of weight initialization. The documentation notes that when using kaiming_normal or kaiming_normal_ for initialization with the SELU activation, one should opt for nonlinearity='linear' instead of nonlinearity='selu' to achieve self-normalization. This detail is fascinating because it highlights how the default Lecun initialization in PyTorch, when adjusted with the Kaiming method set to a gain of 1 from nonlinearity='linear', effectively replicates the Lecun initialization method. This demonstrates that the Lecun initialization is a specific application of the more general Kaiming initialization approach. Similarly, the Xavier initialization method can be seen as another variant of the Lecun initialization that adjusts for both the number of input connections (fan-in) and the number of output connections (fan-out).
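Here's a minimal sketch of these initializers in code, reusing the fan-in/fan-out sizes from the earlier example; note the nonlinearity='linear' setting that reproduces the LeCun-style initialization recommended for SELU:

```python
import torch
import torch.nn as nn

layer = nn.Linear(100, 300)   # fan_in = 100, fan_out = 300

# He/Kaiming (normal), intended for ReLU-like activations: Var(W) = 2 / fan_in
nn.init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='relu')

# Xavier/Glorot (uniform): Var(W) = 2 / (fan_in + fan_out)
nn.init.xavier_uniform_(layer.weight)

# LeCun-style initialization for SELU: Kaiming with gain 1, i.e. Var(W) = 1 / fan_in
nn.init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='linear')

# Biases are commonly just zeroed out (more on this below)
nn.init.zeros_(layer.bias)

print(layer.weight.std())              # should be close to sqrt(1 / 100) = 0.1
print(nn.init.calculate_gain('relu'))  # sqrt(2), the gain He initialization uses for ReLU
```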

I get that we need to be careful in choosing the mean and variance when initializing the weights from a distribution. But what I’m still not clear on is why we would want to draw the initial weights from a normal distribution versus a uniform distribution. Can you explain the reasoning behind using one over the other?

You make a fair point regarding the importance of carefully choosing the mean and variance of the distribution when initializing weights. When initializing weights in neural networks, an important consideration is whether to draw from a normal or uniform distribution. While there is no definitive research-backed answer, there are some plausible reasons behind these choices:

The uniform distribution has the highest entropy, meaning all values within the range are equally likely. This unbiased approach can be useful when you lack prior knowledge about which values might work better for initialization. It treats each potential weight value fairly, assigning an equal probability. This is akin to betting evenly across all teams in a game with limited information – it maximizes the likelihood of a favorable outcome. Since you don’t know which specific values make good initial weights, using a uniform distribution ensures an unbiased starting point.

On the other hand, a normal distribution is more likely to initialize weights with smaller values closer to zero. Smaller initial weights are generally preferred because they reduce the variance of the output and help maintain stable gradients during training. This is similar to why we prefer smaller variance in weight initialization methods over unit variance. Additionally, certain activation functions like sigmoid and tanh tend to perform better with smaller initial weight values, even if these activations are only used in the final output layer rather than hidden layers.

Regarding the concept of likelihood, you can refer to my old post where I used an example involving my cat, Bubble, to explain likelihood, maximum likelihood estimation (MLE), and maximum a posteriori (MAP) estimation.

Ultimately, the uniform distribution provides an unbiased start when lacking prior knowledge, treating all potential weight values as equally likely. In contrast, the normal distribution favors smaller initial weights close to zero, which can aid gradient stability and suit certain activation functions like sigmoid and tanh. The choice between these distributions is often guided by empirical findings across different neural architectures and tasks. While no universally optimal approach exists, understanding the properties of uniform and normal distributions allows for more informed, problem-specific initialization decisions.

Do we also use those weight initialization methods for the bias terms? How do we initialize the biases?

Good question. We don’t necessarily use the same initialization techniques for the bias terms as we do for the weights. In fact, it’s a common practice to simply initialize all biases to 0. The reason is that while the weights determine the shape of the function each neuron is learning to approximate the underlying data, the biases just act as an offset value to shift those functions up or down. So the biases don’t directly impact the overall shape learned by the weights.

Since the main goal of initialization is to break symmetry and provide a good starting point for the weight learning, we’re less concerned with how the biases are initialized. Setting them all to 0 is generally good enough. You can find more detailed discussion on this in the CS231n course notes.

Batch Normalization

With the activation functions chosen and weights properly initialized, we can start training our neural network (firing up our mini ice cream factory production line). But quality control is needed, both initially and during training iterations. Two key techniques are feature normalization and batch normalization.

As discussed earlier in my post about gradient descent, these techniques reshape the loss landscape for faster convergence. Feature normalization applies this to the initial data inputs, while batch normalization normalizes inputs to hidden layers between epochs. Both techniques are akin to implementing quality assurance checks at different stages of the ‘production line’.

Why does batch normalization work? Why is making the input to each layer have zero mean and unit variance helpful for solving gradient issues?

Batch normalization helps mitigate issues like vanishing/exploding gradients by reducing the internal covariate shift between layers during training. Let's think about why this internal shift occurs in the first place, picturing each layer of the network as a different department in the factory. As we update the parameters of each layer based on the gradients, every change to one department's settings alters the input for the next department. This can create a bit of chaos, as each following department has to adjust to the new changes. This is what we call internal covariate shift in Deep Learning. Now, what happens when these shifts occur frequently? The network struggles to stabilize because each layer's input keeps changing. It's similar to how constant changes in one part of the factory can lead to inconsistencies in product quality, confusing the workers and messing up the workflow.

Batch normalization aims to fix this by normalizing the inputs to each layer to have zero mean and unit variance across the mini-batches during training. It enforces a consistent, controlled input distribution that layers can expect. Going back to the factory analogy, it’s like setting a strict quality standard for each department’s output before it gets passed to the next department. For example, setting rules that the baking department must produce ice cream cones of consistent size and shape. This way, the next decoration department doesn’t have to account for cone variance – they can simply add the same amount of ice cream to each standardized cone.

By reducing this internal covariate shift through normalization, batch norm prevents the gradients from going haywire during the training process. The layers don’t have to constantly readjust to wildly shifting input distributions, so the gradients remain more stable.

Additionally, the normalization acts as a regularizer, smoothing out the objective landscape. This allows using higher learning rates for faster convergence. Generally, batch normalization reduces internal variance shifts, stabilizes gradients, regularizes the objective, and enables training acceleration.

As we touched on earlier in the activation section, SELU uses the principles of batch normalization to achieve self-normalization. For a more in-depth exploration of batch normalization, I highly recommend Johann Huber’s detailed post on Medium.

How to Apply Batch Normalization? Should It Be Before or After Activation? How to Handle It During Training and Testing?

Batch normalization has really changed how we train DNNs by adding an extra layer that stabilizes gradients. There's a debate in the deep learning community about whether to apply it before or after activation functions. Honestly, it depends on your model, and you might need to experiment a bit. Just make sure to keep your method consistent, as switching it up can cause unexpected issues.

During training, the batch normalization layer computes the mean and standard deviation for each dimension across the mini-batches. These statistics are then used to normalize the output, ensuring it has zero mean and unit variance. This process can be thought of as transforming the input’s distribution into a standard normal distribution. Unlike feature normalization, which normalizes features using the entire training dataset, batch normalization adjusts based on each mini-batch, making it dynamic and responsive to the data being processed.

Now, testing is a different story. It’s important not to use the mean and variance from the testing data for normalization. Instead, these parameters, viewed as learned features, should be carried over from the training process. Although each mini-batch during training has its own mean and variance, a common practice is to use a moving average of these values throughout the training phase. This provides a stable estimate that can be applied during testing. Another less common method involves conducting one more epoch using the entire training dataset to compute a comprehensive mean and variance.

When training with PyTorch as your DNN framework, batch normalization layers also offer extra flexibility through two learnable parameters, γ and β, which scale and shift the normalized output. Generally, the default settings are quite effective. However, it's important to note that during the training's forward pass, PyTorch uses a biased estimator for calculating the variance, but it switches to an unbiased estimator for the moving average used during testing. This adjustment is beneficial for more accurately approximating the population standard deviation, enhancing the model's reliability in unseen conditions.
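Here's a minimal sketch (using a simple 1D feature tensor) of both the learnable γ/β parameters and the train-versus-eval behavior described above:

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(num_features=8)   # gamma = bn.weight, beta = bn.bias (both learnable)

x = torch.randn(32, 8) * 5 + 3        # a mini-batch with non-standard mean and variance

bn.train()
out_train = bn(x)   # normalized with this batch's own statistics;
                    # running_mean / running_var are updated as a moving average

bn.eval()
out_eval = bn(x)    # normalized with the stored running statistics instead

print(out_train.mean().item(), out_train.std().item())  # roughly 0 and 1
print(bn.running_mean[:3], bn.running_var[:3])          # what gets used at test time
```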

Applying batch normalization correctly is important for effective learning in your network. It ensures that the network not only learns well but also maintains its performance across different datasets and testing scenarios. Think of it as precisely calibrating each segment of a production line, ensuring seamless and consistent operation throughout.

Why is batch normalization applied during the forward pass rather than directly to the gradients during backpropagation?

There are several reasons why batch normalization is typically applied to inputs or activations during the forward pass rather than directly to the gradients during backpropagation.

Firstly, there’s a lack of empirical evidence or practice showing the benefits of applying batch normalization directly to gradients. The concept of internal covariate shift primarily occurs during the forward pass as the distribution of layer inputs changes due to updates in the parameters. Therefore, it makes sense to apply batch normalization during this phase to stabilize these inputs before they are processed by subsequent layers. Also, applying batch normalization directly to the gradients could potentially distort the valuable information carried by the gradients’ magnitude and direction. This is similar to altering customer feedback in a way that changes its inherent meaning, which could mislead future adjustments in a production process of our mini ice cream factory.

However, making minor adjustments to gradients, such as gradient clipping, is generally acceptable and beneficial. This technique caps the gradients to prevent them from becoming excessively large, effectively keeping them within a safe range. This is similar to filtering out extreme outliers in feedback, which helps maintain the integrity of the overall feedback while preventing any drastic reactions that could derail the process. In PyTorch, monitoring gradient norms is a common practice, and if gradients begin to explode, techniques like gradient clipping can be employed. PyTorch offers functions such as [torch.nn.utils.clip_grad_norm_](https://pytorch.org/docs/stable/generated/torch.nn.utils.clip_grad_norm_.html) and [torch.nn.utils.clip_grad_value_](https://pytorch.org/docs/stable/generated/torch.nn.utils.clip_grad_value_.html) to help manage this.
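Here's a minimal sketch of where clipping fits in a typical PyTorch training step; model, loss_fn, optimizer, x_batch, and y_batch are assumed to exist elsewhere in the loop:

```python
import torch

# ... inside the training loop ...
optimizer.zero_grad()
loss = loss_fn(model(x_batch), y_batch)
loss.backward()

# Cap the global gradient norm at 1.0 (the threshold is a tunable hyperparameter)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Or clip each gradient element to the range [-0.5, 0.5] instead
# torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5)

optimizer.step()
```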

You mentioned the option of clipping gradients instead of directly normalizing them. Why exactly do we choose to clip gradients rather than flooring them?

Clipping gradients is a simple yet efficient technique that helps prevent the issue of exploding gradients. We often manually cap the maximum value of gradients. For instance, the ReLU activation function can be modified to have an upper limit of 6, known as ReLU6 in PyTorch (learn more about ReLU6 here). By setting this cap, we ensure that during the backpropagation process, when gradients are multiplied at each layer according to the chain rule, their values do not become excessively large. This clipping directly prevents the gradients from escalating to a point where they could derail the learning process, by ensuring they remain within manageable limits.

Flooring gradients, on the other hand, would set a lower limit to prevent them from getting too small. However, it doesn't address the fundamental issue of vanishing gradients. Vanishing gradients often occur because certain activation functions, like sigmoid or tanh, saturate and squash the gradient values severely as inputs move away from zero. This leads to very small gradient values that make learning extremely slow or stagnant. Flooring the gradients doesn't solve this because the root of the problem lies in the nature of the activation function compressing the gradient values, not just in the values being too small. Instead, to effectively combat vanishing gradients, it's more beneficial to adjust the network architecture or the choice of activation functions. Techniques such as using activation functions that do not saturate (like ReLU), adding skip connections (as seen in ResNet architectures), or employing gated mechanisms in RNNs (like LSTM or GRU) can inherently prevent gradients from vanishing by ensuring a healthier flow of gradients throughout the network during backpropagation.

To summarize, while gradient clipping effectively manages overly large gradients, flooring, which sets a lower limit, does not effectively address the issue of overly small gradients. Instead, resolving problems associated with vanishing gradients typically requires making architectural adjustments.

It’s important to note that when using gradient clipping, the Gradient Clipping Threshold becomes an additional hyperparameter that may need to be tuned or set based on findings from other research and the choice of activation function. As always, introducing this extra hyperparameter adds another layer of complexity to the model training process.

In Practice (Personal Experience)

Before wrapping up, it’s clear that all the methods discussed are valuable for addressing vanishing and exploding gradient issues. These are all practical approaches that could enhance your model’s training process. To conclude this post, I’d like to end it with one last question –

What’s the reality? What’s the common process in practice?

In practice, the good news is that you don't need to experiment with every possible solution. When it comes to choosing an activation function, ReLU is often the go-to choice and is very cost-effective. It passes the magnitude of positive inputs unchanged (unlike sigmoid and tanh, which compress large values towards 1 regardless of their size) and is straightforward in terms of calculation and its derivatives. It's also well-supported across major frameworks. If you're concerned about the issue of dead ReLUs, you might consider alternatives like Leaky ReLU, ELU, SELU, or GELU, but generally, it's advisable to steer clear of sigmoid and tanh to avoid vanishing gradients.

With ReLU being the preferred activation function, there’s less worry about weight initialization being overly sensitive, which is more of a concern with functions like sigmoid, tanh, and SELU. Instead, focusing on the recommended weight initialization methods for your chosen activation function should suffice (for example, using He/Kaiming initialization with ReLU due to its considerations for the non-linearities of ReLU).

Always incorporate batch normalization in your networks. Decide (or experiment) whether to apply it before or after the activation function, and stick with that choice consistently throughout your model. Batch normalization offers multiple benefits, including regularization effects and enabling the use of higher learning rates, which can speed up training and convergence.

So, what’s worth experimenting with? Optimizers are worth some exploration. In a previous post, I discussed various optimizers, including gradient descent and its popular variations (read more here). While Adam is fast, it can lead to overfitting and might decrease the learning rate too quickly. SGD is reliable and can be very effective, especially in parallel computing environments. Though it tends to be slower, it’s a solid choice if you’re aiming to squeeze every bit of performance from your model. Sometimes, RMSprop might be a better alternative. I personally find starting with Adam for its speed and then switching to SGD in later epochs to find a better minimum and prevent overfitting is a good strategy.
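Here's a rough sketch of that Adam-then-SGD idea; the switch epoch, learning rates, and the train_one_epoch helper are placeholders you'd adapt to your own setup, not a prescription:

```python
import torch

switch_epoch = 20   # a hypothetical point at which to hand training over to SGD
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(total_epochs):
    if epoch == switch_epoch:
        # Switch to SGD for the remaining epochs to settle into a better minimum
        optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
    train_one_epoch(model, optimizer)   # hypothetical helper: one pass over the data
```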


If you’re enjoying this series, remember that your interactions – claps, comments, and follows – do more than just support; they’re the driving force that keeps this series going and inspires my continued sharing.

Other posts in this series:


Reference

Activation function

Weight initialization

Gradient Clipping

The post Courage to Learn ML: Tackling Vanishing and Exploding Gradients (Part 2) appeared first on Towards Data Science.

]]>
Courage to Learn ML: Tackling Vanishing and Exploding Gradients (Part 1) https://towardsdatascience.com/courage-to-learn-ml-tackling-vanishing-and-exploding-gradients-part-1-799debbf60a0/ Mon, 05 Feb 2024 20:11:23 +0000 https://towardsdatascience.com/courage-to-learn-ml-tackling-vanishing-and-exploding-gradients-part-1-799debbf60a0/ In the last installment of the ‘Courage to Learn ML‘ series, our learner and mentor focus on learning two essential theories of DNN training, gradient descent and backpropagation. Their journey began with a look at how gradient descent is pivotal in minimizing the loss function. Curious about the complexities of computing gradients in deep neural […]

The post Courage to Learn ML: Tackling Vanishing and Exploding Gradients (Part 1) appeared first on Towards Data Science.

]]>

In the last installment of the ‘Courage to Learn ML‘ series, our learner and mentor focus on learning two essential theories of DNN training, gradient descent and backpropagation.

Their journey began with a look at how gradient descent is pivotal in minimizing the loss function. Curious about the complexities of computing gradients in deep neural networks across multiple hidden layers, the learner then turned to backpropagation. By decomposing backpropagation into 3 components, the learner saw how it uses the chain rule to calculate gradients efficiently across these layers. During this Q&A session, the learner questioned the importance of understanding these complex processes in an era of automated, advanced deep learning frameworks such as PyTorch and TensorFlow.


This is the first post of our deep dive into Deep Learning, guided by the interactions between a learner and a mentor. To keep things digestible, I’ve decided to break down my DNN series into more manageable pieces. This way, I can explore each concept thoroughly without overwhelming you.

Today’s discussion promises to address this question by focusing on the challenge of unstable gradients, a major factor making DNN training difficult. We’ll explore various strategies to address this issue, using an analogy of running a miniature ice cream factory, aptly named DNN (short for Delicious Nutritious Nibbles), to illustrate effective solutions. In subsequent posts, the mentor will talk about each solution in detail, showing how these solutions are implemented within the PyTorch framework.

Diving into the world of DNNs, we’re going to use a unique analogy that I’ve been fond of – envisioning DNN as an ice cream factory. Curiously, I once asked ChatGPT what ‘DNN’ might stand for in the realm of ice cream, and after 5 minutes of thinking, it suggested "Delicious Nutritious Nibbles." I loved it! So, I’ve decided to embrace this playful analogy to help demystify those daunting DNN concepts with a dash of sweetness and fun. As we delve into the depths of deep learning, imagine we’re managers running a mini ice cream factory called DNN. Who knows, maybe one day, DNN ice cream will become a reality. It would be a real treat for ML/DL enthusiasts to enjoy!

Let's begin this journey by learning the basic structure of using PyTorch to train a NN. As always, let's rephrase it as a fundamental question –

Can you illustrate backpropagation and gradient descent in PyTorch?

The basic code for training a NN in PyTorch helps illustrate the relationship and roles of gradient descent and backpropagation.

​import torch
​
# Define the model
model = CustomizedModel()
# Define the loss function
loss_fn = torch.nn.L1Loss()
# Define optimizer
optimizer = torch.optim.SGD(params = model.parameters(), 
                       lr = 0.01, momentum = 0.9)
​
epochs = 10
for epoch in range(epochs):
  # Step 1: Setting the Model to Training Mode:
  model.train() 

  # Step 2: Forward Pass - Making Predictions
  y_pred = model(X_train)

  # Step 3: Calculating the Loss
  loss = loss_fn(y_pred, y_train)

  # Step 4: Backpropagation - Calculating Gradients
  optimizer.zero_grad() # clears old gradients
  loss.backward() #  performs backpropagation to compute the gradients of the loss w.r.t model parameters

  # Step 5: Gradient Descent - Updating Parameters
  optimizer.step()

In this code snippet, loss.backward() is utilized to execute backpropagation. This process begins from the loss, not the optimizer, because backpropagation’s purpose is to compute the gradients of the loss with respect to each parameter. Once these gradients are determined, the optimizer uses these gradients to update each parameter with a gradient descent step. The optimizer.step() method, as I view it, is appropriately named ‘step’ to indicate a single update of all the model parameters. This can be thought of as taking a step along the loss surface during the optimization process.

For a better understanding of what "a step along the loss surface" means, I strongly suggest reading my post on gradient descent. In it, I use a video game analogy to vividly demonstrate how we navigate the loss surface. Also, you will love the title picture drawn by ChatGPT.

Can you explain the concepts of vanishing and exploding gradients in neural networks? What are the primary causes of these issues?

Computing gradients in a deep neural network (DNN) is complex, primarily due to the extensive number of parameters spread across many hidden layers. Backpropagation simplifies this calculation by computing the gradient of the loss with respect to each parameter using the chain rule. Because of the chain rule, this process involves derivatives calculated through successive multiplications, where values can change drastically. The result is gradients that might become extremely small or large as we move to the lower layers (layers that are close to the input layer). It's akin to tracing back through a series of intricate water filters to enhance water quality, layer by layer: the early layers' influence is hard to evaluate because it gets altered by subsequent layers. Such unstable gradients introduce significant challenges when training deep neural networks, leading to the phenomena known as "vanishing" (extremely small gradients) and "exploding" (excessively large gradients) gradient problems. This instability makes DNN training challenging and restricts the architecture to fewer layers. In other words, without addressing this issue, DNNs cannot achieve the desired depth.

If you wonder why backpropagation involves successive multiplications, I suggest checking out my earlier article on backpropagation. In that post, I explained the calculations by breaking them down into 3 components and used code examples to make them easier to understand. I also explored why researchers generally favor deeper and narrower DNNs over wider and shallower ones.

I get that vanishing gradients can be an issue because the lower layers’ parameters barely update, making learning difficult with such small gradients. But why are exploding gradients problematic too? Wouldn’t they provide substantial updates for the later layers?

You're correct in noting that large gradients can lead to significant updates in parameters. However, this isn't always the case, and a large update is not always beneficial. Let me explain why exploding gradients prevent effective training in deep neural networks:

  1. Exploding gradients don't always mean large, meaningful updates; they can result in numerical instability and overflow. When training with computers, excessively large gradients can cause numerical overflow. For example, consider a network where the gradient at a certain layer is 10⁷, a large but feasible number. During backpropagation, this value, multiplied by other large values due to the chain rule, can exceed the limits of standard floating-point representation, resulting in overflow and the gradients becoming NaN (Not a Number). When updating parameters using parameter = parameter - learning_rate * gradient, if the gradient is NaN, the parameters also become NaN. This corrupts the forward pass of the network, making it incapable of generating useful predictions.
  2. Large updates aren't always beneficial. Large gradient values can lead to dramatic changes in parameters, causing oscillations and instability in the parameter updates. This can result in longer training times, difficulty in converging, and the potential to overshoot the global minimum of the loss function. Coupled with a relatively large learning rate, these oscillations can significantly slow down the training process.
  3. Large gradients aren't necessarily informative. It's a misunderstanding that larger gradients are always more informative and lead to meaningful updates. In fact, if gradients become excessively large, they can overshadow the contributions of smaller, yet more meaningful, gradients. Imagine navigating the loss landscape, where large gradients, often influenced by outliers or noise, can misguide us in choosing our next step. Additionally, large gradients may benefit some layers but not others, resulting in imbalanced learning. For instance, large gradients in upper layers might lead to extremely small derivatives in layers using sigmoid activation functions. This can result in an imbalanced learning process within the network.

Why is it that the oscillation in weight updates caused by exploding gradients turns out to be harmful? I understand that Stochastic Gradient Descent (SGD) with small batch sizes also leads to oscillation, but why doesn’t that cause similar issues?

You’re right in noting that Stochastic Gradient Descent (SGD) with small batch sizes introduces some instability in the weight updates. However, this type of oscillation is both controllable and relatively minor, primarily because it originates from the data itself. Utilizing a small subset of training data typically won’t deviate too far from the behavior of the full dataset, meaning the noise is manageable. Additionally, this manageable level of noise can enhance the model’s generalizability by improving its insensitivity to minor data variations. Essentially, while navigating the loss surface, SGD might make decisions that slightly deviate from the most optimal path to the global minimum. However, these suboptimal steps aren’t drastically far from the ideal ones. This approach reduces the likelihood of getting stuck in plateaus and saddle points, enabling movement away from these areas towards potentially better local minima or even the global minimum in the complex landscape of DNN.

On the other hand, the oscillation caused by exploding gradients behaves quite differently. Due to exploding gradients, we might make substantially large updates to parameters, leading to significant movements on the loss surface. These updates can be so large that they catapult the model to a position quite distant from where it was initially, or even farther from the global minimum. Unlike with SGD, where the steps are small and keep us close to our original position, exploding gradients can negate our previous progress, forcing us to redo all the hard work to find the global minimum.

To visualize it, think of it like playing an RPG video game, where we use magic (akin to an optimizer) to guide our movements towards the treasure located at the lowest point of the map. With the magic of SGD, we might stray from the best route, but we generally head towards the treasure. Thus, moving fast enough, we'll likely reach and get the treasure. But with exploding gradients, it's like being thrown randomly to a new, unfamiliar place on the map, requiring us to restart our exploration. In the worst-case scenario, we might end up at the farthest point from the treasure, making it nearly impossible to reach our goal with limited time and computational resources.

So how do we know whether a gradient is too large or too small? How do we detect these problems?

Detecting unstable gradients is important for ensuring the effective learning of different layers at a consistent pace. If you suspect your model is suffering from exploding or vanishing gradients, consider these methods to identify the problem effectively:

  • Tracking training loss. Observing the training loss across epochs is a straightforward and valuable method. A sudden spike in loss or its transition to NaN might indicate that some weights have been updated too aggressively, potentially due to large gradients causing numerical overflow. This scenario often points to exploding gradients. Conversely, a plateau in the loss graph or minuscule decreases over several epochs could be a sign of vanishing gradients, suggesting that the weights are barely updating due to very small gradients. However, interpreting these signs isn't always straightforward and requires ongoing evaluation to determine whether the training process is progressing or not.
  • Monitoring gradients directly. A more direct approach involves keeping an eye on the gradients themselves. Since inspecting each gradient individually is impractical in large, deep networks, calculating the gradient norm offers a simplified yet effective alternative. The norm, which can be computed for all layers collectively or individually, focuses on the magnitude of the gradients, providing a single metric for comparison over time. This method is particularly useful for identifying exploding gradients, as excessively large values will always stand out. For vanishing gradients, small norms might hint at an issue, but they're less definitive. Visualizing gradient distributions through histograms can also highlight outliers and extreme values, and these histograms are easy to capture during training.
  • Monitoring weight updates and activation outputs. Since unstable gradients directly impact weight updates, tracking changes in weights is a logical step. Sudden, significant changes or shifts to NaN in weights indicate exploding gradients. If weights remain static over time, vanishing gradients could be the reason. Similarly, analyzing the activation outputs of each layer can offer additional insights into how weights and gradients are behaving.

For practical implementation, TensorBoard stands out as a popular tool for visualizing training progress. It supports both TensorFlow and PyTorch and allows for detailed tracking of losses, gradients and more. It can be a great tool to identify gradient-related issues.
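For example, here's a minimal sketch of computing the overall gradient norm after loss.backward() and logging it; the TensorBoard writer is optional (printing works too), and the tag name is just an assumption:

```python
import torch
from torch.utils.tensorboard import SummaryWriter

def gradient_norm(model):
    # L2 norm over all parameter gradients; call this after loss.backward()
    total = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total += p.grad.detach().pow(2).sum().item()
    return total ** 0.5

writer = SummaryWriter()
# ... inside the training loop, after loss.backward() ...
# writer.add_scalar("grad_norm", gradient_norm(model), global_step=step)
```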

Are you suggesting that, in DNNs, we aim for all layers to learn at the same pace?

We don't actually expect every layer in a DNN to learn at the same rate. Our goal is to have consistent and balanced weight updates across the entire network. Vanishing gradients, for instance, pose a problem because they lead to the upper layers (those closer to the output) learning more quickly and converging earlier due to larger gradients, while the lower layers (those closer to the input) lag behind with almost random weight adjustments due to small gradients. This can result in the network converging to a less-than-ideal local minimum, with only the upper layers properly trained and the lower layers nearly unchanged. In general, we want to make sure each layer's update rate aligns with its impact on the final model prediction. The problem with both vanishing and exploding gradients is that they break this balance, causing layers to be updated disproportionately. It's similar to trying to walk left by only turning your body without moving your feet: you won't get very far.

The objective, then, is to achieve stable and uniform weight updates across the network throughout training. This is where adaptive learning rate optimizers come into play, offering significant benefits. They dynamically adjust the learning rate based on historical gradients, aiding in more efficient loss reduction.

Practically, frameworks like PyTorch and TensorFlow allow you to specify different learning rates for each layer. This capability is particularly beneficial when fine-tuning pretrained models or during transfer learning, as it allows customized learning rate adjustments per layer to suit specific requirements. Here are some useful discussions on Stack Overflow and the PyTorch forums.
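As a quick illustration, here's a minimal sketch of PyTorch parameter groups for per-layer learning rates; the backbone/head split is a hypothetical model structure used only for the example:

```python
import torch

# Assumes a model with a pretrained `backbone` and a freshly added `head`
optimizer = torch.optim.SGD([
    {"params": model.backbone.parameters(), "lr": 1e-4},  # fine-tune the pretrained layers slowly
    {"params": model.head.parameters(),     "lr": 1e-2},  # train the new layers faster
], momentum=0.9)
```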

How do we generally address the issue of unstable gradients when training DNNs?

Imagine our DNN model as a mini-ice cream factory, where ingredients flow through various departments to produce ice cream that’s then rated by customers at the end of the production line. Each department in this mini-factory represents a layer in the DNN, and the customer feedback on the ice cream’s taste represents the loss gradient used to improve the recipe through backpropagation.

However, our factory faces a challenge: the customers’ feedback isn’t being effectively communicated back through the departments. Some departments overreact to the feedback, making drastic changes (akin to exploding gradients), while others overlook it, barely making any adjustments (akin to vanishing gradients).

As the factory managers, our goal is to ensure that feedback is processed appropriately at each stage to consistently produce ice cream that delights our customers. Here are the strategies we can employ:

  • Set up the factory properly (Weight Initialization). The first step is ensuring the factory operates smoothly and produces quality ice cream, so setting it up correctly is quite important. We need to ensure that the initial setting of our production line (akin to the weights in a DNN) is just right, neither too strong nor too weak, to produce a base flavor that meets general customer preferences. This is like choosing a proper weight initialization in DNNs to avoid excessively small or large gradients, ensuring a stable flow of adjustments based on feedback.
  • Quality control at the beginning and during production (Batch Normalization). As the ingredients mix and progress through the production line (the layers in a DNN), we introduce quality checkpoints to standardize the mix at various stages. This ensures that each batch of semi-finished product remains consistent, preventing any layer from producing ice cream that is too varied, which would make the customer feedback misleading and our adjustments less effective. This mirrors batch normalization in DNNs, where layer outputs are normalized to maintain a stable distribution of activations, aiding a smoother gradient flow.
  • Adjust the feedback system to avoid overreactions (Gradient Clipping). When customer feedback arrives, it’s crucial that no department overreacts and makes drastic changes based on one batch of feedback, which could throw off the entire production line. By implementing a system where extreme feedback (either too positive or too negative) is moderated or clipped, we ensure that changes are gradual and controlled, akin to gradient clipping in DNNs, which prevents gradients from exploding. In a similar spirit, special network architectures like ResNet, with their skip connections, can also mitigate vanishing gradients.
  • Optimize workflow and feedback paths (Activation Functions). In our DNN ice cream factory, some departments act as messengers, shuttling semi-finished products forward and customer feedback backward. They’re key to ensuring the final product turns out correctly: if they don’t transfer the intermediate products accurately, the end result could be a batch of ice cream that misses the mark. The way they handle customer feedback and pass it between departments matters just as much. If feedback is overlooked, important information might be missed, while overly amplified feedback can cause overreactions and dramatic changes in the factory. Choosing the right communicators (activation functions) ensures the smooth transfer of both products and feedback and keeps the production line efficient and responsive. Just like in our factory, choosing the right activation functions in a DNN helps produce accurate predictions and maintain stable gradients, avoiding the extremes of vanishing or exploding during backpropagation and ensuring effective updates for every parameter.
  • Tailor feedback response intensity by department (Adaptive Learning Rates). Finally, not all departments should react to feedback with the same intensity. For example, the flavor department might need to be more sensitive to taste feedback than the packaging department. By adjusting how much each department learns from feedback (akin to setting adaptive learning rates for different layers of a DNN), you can fine-tune the factory’s response to customer preferences, ensuring more targeted and efficient improvements. (A minimal code sketch that combines several of these strategies follows this list.)
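
Here is that sketch: a small PyTorch example combining He initialization, batch normalization, ReLU, gradient clipping, and an adaptive optimizer (the sizes, clipping threshold, and learning rate are illustrative assumptions, not recommendations):

import torch
import torch.nn as nn

class StableMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.block = nn.Sequential(
            nn.Linear(128, 256),
            nn.BatchNorm1d(256),   # quality control between "departments"
            nn.ReLU(),             # an activation whose derivative doesn't shrink gradients
            nn.Linear(256, 10),
        )
        # Weight initialization suited to ReLU (He/Kaiming), to start with balanced signals
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.kaiming_normal_(m.weight, nonlinearity="relu")
                nn.init.zeros_(m.bias)

    def forward(self, x):
        return self.block(x)

model = StableMLP()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # adaptive learning rates

# One hypothetical training step on random data
x, y = torch.randn(32, 128), torch.randint(0, 10, (32,))
loss = nn.CrossEntropyLoss()(model(x), y)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # moderate extreme feedback
optimizer.step()
optimizer.zero_grad()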

Next time, I’ll cover the topic of activation functions and weight initialization methods in detail. Rather than simply running through the formulas and listing their pros and cons, I plan to share the essential questions that have shaped my understanding – questions that frequently go unasked but are vital for grasping why each method and function is shaped the way it is. These discussions might even empower us to innovate our own functions and weight initialization methods.

For those keen to journey with me through this series, feel free to follow along. Your engagement – through claps, comments, and follows – fuels this endeavor. It’s not just encouragement; it’s the very heartbeat of this educational series. As I continue to refine my grasp on these topics, I often revisit and update my past posts, enriching them with new insights. So, stay tuned for more!

(Unless otherwise noted, all images are by the author)


Other posts in this series:


If you liked the article, you can find me on LinkedIn.

The post Courage to Learn ML: Tackling Vanishing and Exploding Gradients (Part 1) appeared first on Towards Data Science.

Courage to Learn ML: Explain Backpropagation from Mathematical Theory to Coding Practice https://towardsdatascience.com/courage-to-learn-ml-explain-backpropagation-from-mathematical-theory-to-coding-practice-21e670415378/ Wed, 17 Jan 2024 14:27:56 +0000 https://towardsdatascience.com/courage-to-learn-ml-explain-backpropagation-from-mathematical-theory-to-coding-practice-21e670415378/ Transforming Backpropagation's Complex Math into Manageable and Easy-to-Learn Bites

The post Courage to Learn ML: Explain Backpropagation from Mathematical Theory to Coding Practice appeared first on Towards Data Science.

Image created by the author using ChatGPT.

Welcome back to the latest chapter of ‘Courage to Learn ML’. In this series, I aim to demystify complex ML topics and make them engaging through a Q&A format.

This time, our learner is exploring backpropagation and has chosen to approach it through coding. He discovered a Python tutorial on Machine Learning Mastery, which explains backpropagation from scratch using basic Python, without any deep learning frameworks. Finding the code a bit puzzling, he visited the mentor and asked for guidance to better understand both the code and the concept of backpropagation.

As always, here’s a list of the topics we’ll be exploring today:

  • Understanding backpropagation and its connection to gradient Descent
  • Exploring the preference for depth over width in DNNs and the rarity of shallow, wide networks.
  • What is the chain rule?
  • Breaking down backpropagation calculation into 3 components and examining each thoroughly. Why is it called backpropagation?
  • Understand backpropagation through straightforward Python code
  • Vanishing gradients and common preferences in activation functions

Let’s start with the fundamental why –

What is backpropagation and how is it related to gradient descent?

Gradient descent is a key optimization method in Machine Learning. It’s not just limited to training DNNs but is also used to train models like logistic and linear regression. The fundamental idea behind it is that by minimizing the differences between predictions and true labels (prediction error), our model will get closer to the underlying true model. In gradient descent, the gradient, represented by ∇𝜃𝐽(𝜃) and formed by the loss function’s partial derivatives with respect to each parameter, guides the update of parameters: 𝜃 = 𝜃 – 𝜂⋅∇𝜃𝐽(𝜃). This process is akin to dissecting a complex movement into basic movements in video games.

However, in DNNs, which involve multiple layers and millions of parameters, calculating these partial derivatives for each parameter becomes computationally intensive. Particularly in a layered structure, it’s crucial to distinguish the error contribution of each layer to understand how the parameters at different layers should change with respect to the overall loss.

This is where Backpropagation comes in. It is a method to efficiently compute gradients for DNNs. Backpropagation assists DNNs in using gradient descent to guide their learning process, calculate the partial derivatives, and adjust each parameter more efficiently. The key to backpropagation lies in its name – it stands for ‘backward propagation of errors.’ This means the process involves sending the error (between the current prediction and the true label) backward and distributing the gradient from the output layer to the input layer and hidden layers in between. This distribution is in the reverse direction compared to the forward direction used to generate predictions.

So the hard part of training DNNs is due to the multiple layers. But before talking more about backpropagation, I’m curious why DNNs typically go deeper rather than wider. Why aren’t shallow but wide networks popular?

This question is about the preference for deep networks over shallow ones. Before jumping into it, let’s define shallow networks as having only 1 or 2 hidden layers. According to the universal approximation theorem, a single-layer network that’s wide enough can theoretically approximate any function. However, in practice, having many neurons in a wide network is not always practical due to high computational demands.

Deep networks, with multiple layers, are generally more efficient at modeling complex functions with fewer neurons compared to shallow networks. They are particularly good at learning different levels of data representation. For example, in facial recognition using CNNs, the initial layers might learn simple patterns like edges, and deeper layers can recognize more complex features like parts of a face.

Shallow networks, though, have their own advantages. They are easier to train and don’t face some common problems of deep networks, such as vanishing or exploding gradients. They are also more straightforward to understand. But, to capture complex functions, they might need many more neurons, making them less efficient for certain tasks.

To sum up, deep networks are typically favored because they can learn complex patterns in a hierarchical, structured manner and do so efficiently with fewer neurons. But the study of shallow versus deep networks is still an active field in machine learning.

Now, let’s explore backpropagation in detail and utilize the code snippets as a resource to deepen our understanding of the concept.

If you’re new to the gradients (gradient descent) or need a refresher, my previous article offers a detailed exploration of gradient descent and popular optimizers, accessible here. This section, inspired by Hung-yi Lee’s engaging lectures, blends my personal insights with his teachings. For those fluent in Chinese, I highly recommend his engaging machine learning lectures, available on YouTube.

The initial step involves understanding the role of gradients in our process. Gradients enable us to calculate the partial derivative of the loss with respect to each parameter from different layers.

Consider a scenario where we train our model using SGD (Stochastic Gradient Descent) with a batch size equal to 1. This means we use a single sample at a time to train our network. Through a certain magic, we can determine the partial derivative of the loss with respect to any weight, regardless of its depth in the network. For instance, within the network below, assume we already know the partial derivative of the loss with respect to the weights of the first layer, such as the gradient of w1 at the first layer, denoted as ∂L(θ)/∂w1. Our goal is to understand what ∂L(θ)/∂w1 actually represents.

We will study backpropagation based on this network. Source: https://www.youtube.com/watch?v=ibJpTrp5mcE&t=1510s

Author’s Note: A valuable technique I learned while practicing coding on Leetcode is to approach recursive functions by assuming a part of the task is already completed. This method helps in using the results of this assumed part to tackle larger problems. Adopting an attitude of assumed knowledge can provide comfort and ease your mind, fostering confidence as you delve into the topic. Essentially, it’s about ‘faking it till you make it’ in problem-solving, and even as an attitude toward life.

To dissect ∂L(θ)/∂w1 into comprehensible segments, we must explore the chain rule of partial derivatives. So, as is customary, let’s begin with the question:

What is the chain rule?

The chain rule is a fundamental technique in calculus, used for computing the derivative of a composite function. To illustrate what the chain rule is, consider the following two examples:

  • If y = f(x) and x = g(t), this implies y = f(g(t)). Therefore, the derivative of y with respect to t (∂y/∂t) is calculated as the product of the derivative of y with respect to x (∂y/∂x) and the derivative of x with respect to t (∂x/∂t). So we have ∂y/∂t = ∂y/∂x * ∂x/∂t
Image created by the author.
  • If z = f(x, y), with y = g(t) and x = q(t), the derivative of z with respect to t (∂z/∂t) is the sum of two products: the derivative of z with respect to x (∂z/∂x) times the derivative of x with respect to t (∂x/∂t), plus the derivative of z with respect to y (∂z/∂y) times the derivative of y with respect to t (∂y/∂t). So we have ∂z/∂t = (∂z/∂x * ∂x/∂t) + (∂z/∂y * ∂y/∂t).

To visualize the chain rule, imagine derivatives as streams of water in a landscape. Calculating the derivative of an element in a process is like tracing a water stream’s journey. When functions are nested within each other, envision it as a small stream flowing into a river, then merging into the ocean. This represents how derivatives are multiplied together as they progress from small-scale to large-scale impacts. When two streams (derivatives) originate from different sources and converge, think of them as combining into a single stream. This illustrates the summing of two derivatives in the chain rule’s context.

Imagine derivatives as streams of water in a landscape. Image created by the author using ChatGPT.

An intuitive but technically ‘incorrect’ approach to grasping the chain rule is to liken it to fraction manipulation. Visualize it as a scenario where the denominator cancels out with the numerator when they multiply together. While this visualization simplifies the concept, it’s important to note that it’s incorrect. Nonetheless, it serves as a useful aid in comprehending derivatives.

This image shows a simple but incorrect way to understand the chain rule. Image created by the author.

How is the chain rule applied to our calculation of ∂L(θ)/∂w1?

Returning to our neural network, it’s helpful to visualize the connections within it as water streams. Here, the edges of the network symbolize the directional flow from input to output, with activation functions acting like complex gates altering the stream’s properties. To understand how the first layer’s weights (w1) affect the loss, envision tracing the ‘water stream’ from the loss back to w1. In the network diagram below, the red lines illustrate this water stream from the output back to w1, where z’ and z” are separate streams converging before an activation gate.

Image by source https://www.youtube.com/watch?v=ibJpTrp5mcE&t=1510s, annotated by the author.

This stream flow analogy helps us reinterpret the partial derivative of the loss with respect to the first layer’s weights (∂L(θ)/∂w1) more intuitively.

We can then dissect ∂L(θ)/∂w1, representing the general partial derivative of loss with respect to weight, into three parts using the chain rule:

  • The partial derivative of the overall loss relative to the current layer, taking into account its role as the input for subsequent layers. This reflects the portion of the loss that can be influenced or contributed by this layer. It’s important to note that the current layer’s output is influenced by the preceding layers.
  • The partial derivative of the activation function’s output a with respect to its input z, where a = 𝜎(z). Since the activation function is like a gate altering the stream’s characteristics, understanding its impact on the stream becomes crucial when allocating loss to different layers.
  • The partial derivative of a neuron’s direct output, z, with respect to its weight, w1. In the formula z = w1*x1 + w2*x2 + b, z is the direct outcome of w1 and represents the stream prior to encountering the activation function. (The full decomposition is written out right after this list.)
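
Putting the three parts together, the chain rule decomposition we will work with is (restating the breakdown above in the notation we have been using, with a = σ(z) and z = w1*x1 + w2*x2 + b):

∂L(θ)/∂w1 = (∂L/∂a) * (∂a/∂z) * (∂z/∂w1)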

After breaking down the partial derivatives into individual components, we can tackle the calculation by addressing each part separately. Let’s begin with the simplest component by posing this question:

How do we compute ∂z/∂w1, which is the partial derivative of z with respect to the weight that directly determines it in a linear fashion?

∂z/∂w1 represents the gradient of the output z (before the activation function) with respect to the weight directly associated with its calculation. Since z, calculated as z = w1*x1 + w2*x2 + b, is a linear combination of inputs and weights, the derivative ∂z/∂w1 is simply the input associated with the given weight w1, which is x1. This indicates that the gradient of z with respect to its direct weight does not depend on the other layers or on the loss function (error). It’s a direct relationship: the partial derivative equals the input that w1 multiplies in the forward pass. For a bias term b, the derivative is always 1, since a bias is a constant term.

Image by source https://www.youtube.com/watch?v=ibJpTrp5mcE&t=1510s, annotated by the author.

To clarify with examples, let’s consider two scenarios, starting with the case where z is in the first hidden layer of the network. As highlighted in the graph, with z and its inputs marked in red, we can calculate the derivatives by using the value of the related input directly.

The same principle applies to hidden layers that are not directly connected to the input data: the partial derivative of a linear combination with respect to one of its weights is the input attached to that weight. In the graph, for subsequent layers like z’ and z” in the network, the partial derivatives with respect to their associated weights, such as w3 and w4, are the outputs of the previous layer’s activation function, which in this case is a. I highlighted those relationships in blue.

In summary, the computation of the partial derivative of z (the output before activation) with respect to one of its weights is straightforward: it’s the input value attached to that weight, and the derivative with respect to the bias is always 1. One can trace back along the weight’s edge in the network graph to find the corresponding input, and this input is the value of the partial derivative of the output z with respect to that weight. This part of the backpropagation process does not require any "backward" information from the loss and is determined during the forward pass while processing the input data.

So, the calculation of ∂z/∂w1 isn’t actually a "backward" process. It’s simply equal to the input of the neuron. How about the partial derivative of the activation output with respect to its input, ∂a/∂z?

The computation of ∂a/∂z is just as straightforward, considering that a is simply a function of z. Specifically, a = σ(z), where σ(z) represents the activation function. For those unfamiliar with activation functions, a function like σ(z) is used to introduce non-linearity into the neural network. Using our analogy of water streams, you can think of it as a ‘water gate’ that alters the stream of computation in a non-linear fashion, which is vital for the network’s ability to capture the complex, non-linear patterns underlying the data.

As activation functions are crucial for making the network non-linear, the derivative ∂a/∂z is a key component in backpropagation: it helps adjust gradients as they traverse the network in reverse. Given a = σ(z), the derivative ∂a/∂z is simply σ′(z), the derivative of the activation function with respect to its original input z.

Calculating σ′(z) is quite simple: it involves plugging the input z into the derivative of the activation function. The derivative of the activation function is simply another function, which doesn’t require any information about the loss of the current prediction. For instance, if the activation function is the sigmoid, its derivative can be written in terms of the function itself as σ′(z) = σ(z)(1 − σ(z)). In practice, our program can define σ′(z) before training; we then plug z into σ′(z) to get the partial derivative ∂a/∂z.

Thus, like ∂z/∂w1, ∂a/∂z doesn’t need loss information and can be computed during the forward pass. This derivative is then used in the chain rule during backpropagation to find the gradient of the loss with respect to the weights.

Considering our discussion thus far, when computing ∂L/∂w1, two of the three components require no loss information or input from later layers. Then why do we refer to the calculation process as ‘backpropagation’?

I’m glad you noticed that. Let’s recap our discussion so far:

We’ve established that the calculation of ∂z/∂w1 is straightforward, because it is just the input associated with weight w1. Similarly, for the activation function’s output a, where a = σ(z), the derivative ∂a/∂z is just σ′(z), the derivative of the activation function. Both calculations are independent of the loss function and can be either predefined ahead of the training process or computed during the forward pass.

Now, the remaining part involves ∂L/∂a, which tells us how changes in the activation output a affect the overall loss L. Imagine the entire neural network as a cake factory’s production line, tasked with baking, packaging, and shipping cakes to stores. If a cake arrives damaged (akin to a poor prediction), it’s necessary to backtrack and identify which stage of the process was responsible for the damage and to what extent.

Imagine the entire neural network as a cake factory’s production line. Image created by the author using ChatGPT.

In backpropagation, we use the chain rule to decompose ∂L/∂a. In our example network, a serves as an input to two outputs, z′ and z′′. So when trying to understand how much a contributes to the loss, we need to gather information from both of those outputs to measure its impact.

Image by source https://www.youtube.com/watch?v=ibJpTrp5mcE&t=1510s, annotated by the author.

This is analogous to the cake production line, where ∂z′/∂a and ∂z′′/∂a represent the contribution of a to each output, and we must trace the impact back from the loss through these channels.

Given that

We combine these with the activation function derivative ∂a/∂z to get

This is key to understanding why the process is named as backpropagation.

This formula shows that ∂L/∂z is influenced by the gradients from the next layer, showcasing the ‘backward’ aspect of the computation. This is key to understanding why the process is named backpropagation.
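
To make this concrete, here is a sketch of what those combined pieces look like for our example network (assuming, as in the diagram, that w3 and w4 are the weights connecting a to z’ and z”):

∂L/∂a = (∂z’/∂a) * (∂L/∂z’) + (∂z”/∂a) * (∂L/∂z”) = w3 * ∂L/∂z’ + w4 * ∂L/∂z”

∂L/∂z = σ′(z) * [ w3 * ∂L/∂z’ + w4 * ∂L/∂z” ]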

To update weights throughout the neural network, we calculate the partial derivative of the loss with respect to each weight, ∂L/∂w. A forward calculation (from the input to the output layer) would be inefficient, requiring repeated computation of ∂L/∂z for each layer starting from the input layer. Using the formula above, a more efficient approach is to compute ∂L/∂z starting from the output layer and moving backwards to the input layer, storing each result and reusing the current layer’s ∂L/∂z for the preceding layers. Note that, to find ∂L/∂z for the (n-1)th layer, we only need the derivative of the loss with respect to the nth layer’s z. This process allows for calculating each layer’s derivative just once, in a single, elegant backward pass.

So, we recursively compute ∂L/∂z for each layer backwards. Then, how do we calculate the partial derivatives of the loss with respect to the output layer’s z, which serves as the first value of ∂L/∂z to start the process?

Let’s consider the ∂L/∂z calculation for the output layer. Here, z is the input to the output layer, which gets transformed into the prediction y-hat by the activation function σ(z). The prediction (y-hat) is then used to calculate the loss, expressed as L(y, y-hat), where L is the loss function. We then apply the chain rule to break the calculation of ∂L/∂z for the last layer into two parts.

  • The derivative of the loss function with respect to the network’s prediction, which depends on the type of loss function. For instance, for a regression task we often use MSE (Mean Squared Error), while for multi-class classification we’d choose CE (cross entropy) as the loss function. For a single prediction, this part is the derivative of the chosen loss function evaluated at that prediction (a concrete sketch follows this list).
  • The derivative of the prediction (y-hat) with respect to the last layer activation’s input, similar to ∂a/∂z, is the derivative of the last layer’s activation function, σ'(z).
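
As a concrete sketch, assume the loss is MSE on a single prediction, written with a 1/2 factor for convenience: L(y, y-hat) = 1/2 * (y - y-hat)², so ∂L/∂y-hat = -(y - y-hat) = y-hat - y. Combining the two parts then gives the starting value for the backward recursion: ∂L/∂z = (y-hat - y) * σ′(z) at the output layer.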

Our discussions so far about backpropagation have focused on using SGD (Stochastic Gradient Descent) with a batch size = 1. What if we use a larger batch size for training, would that alter the calculations?

You are correct that the calculations with a larger batch size do differ slightly, but the fundamental calculations we discussed for backpropagation remain valid. It primarily involves an additional averaging step.

When a batch size > 1, the loss during the forward pass is the average loss across all samples in the batch. So, the partial derivative of the loss with respect to weight, ∂L/∂w1, is the average of the derivatives of individual losses (∂l/∂w) for all data points in the batch.

Mathematically, for a batch of size n, the gradient of the loss L with respect to weight w1 is the average of the per-sample gradients: ∂L/∂w1 = (1/n) * Σᵢ ∂lᵢ/∂w1.

Here lᵢ represents the loss for the ith sample within the batch. It’s important to note that we use the average of the gradients. By averaging, we ensure that the update to the weight reflects the average direction in which the loss decreases across all samples in the batch.

Alright, with our discussion, let’s apply this knowledge to understand the code you’ve got. It’s an excellent way to grasp complex concepts. As usual, let’s shape this into a question:

Based on our discussion, how do we combine these elements to construct the backpropagation calculation? How do we use our insights to decode this code snippet?

Let’s walk through the calculation step by step. All the code shown here is from the Python tutorial on Machine Learning Mastery.

Define activation functions and their derivatives.

The code uses the sigmoid function as the activation function. To calculate the derivative of the activation output with respect to its input, the code defines the sigmoid function as transfer and a transfer_derivative function for its derivative.

from math import exp  # exp is needed for the sigmoid

# Transfer neuron activation (the sigmoid function)
def transfer(activation):
    return 1.0 / (1.0 + exp(-activation))

# Calculate the derivative of a neuron's output: sigma'(z) = sigma(z) * (1 - sigma(z)),
# expressed here in terms of the already-computed output
def transfer_derivative(output):
    return output * (1.0 - output)

Calculate ∂L/∂a.

The backward_propagate_error function loops backwards from the output layer to the input layer. In the function, the variable error represents ∂L/∂a. It differentiates between the output and hidden layers for the ∂L/∂a calculation.

  • For the output layer, the error is simply the difference between the output and the expected value. The loss used here is the raw prediction error, (y – y_hat), and the derivative of this loss with respect to the prediction is -1. Therefore, ∂L/∂a is calculated as neuron['output'] - expected[j]. This part of the code can be confusing, because typically we’d use MSE (mean squared error) or CE (cross entropy) as the loss function, and the partial derivative of the loss with respect to the prediction is not spelled out explicitly in the code.
def backward_propagate_error(network, expected):
    for i in reversed(range(len(network))):
        layer = network[i]
        errors = list()
        # calculate the error for each layer
        if i != len(network)-1:
            ...
        else:
            # ∂L/∂a of the output layer
            for j in range(len(layer)):
                neuron = layer[j]
                errors.append(neuron['output'] - expected[j])
  • For hidden layers, it’s calculated based on the weights and delta (representing ∂L/∂z) of the following layer.
# Backpropagate error and store in neurons
def backward_propagate_error(network, expected):
    for i in reversed(range(len(network))):
        layer = network[i]
        errors = list()
        # calculate the error for each layer
        if i != len(network)-1:
            # ∂L/∂a of the hidden layer
            for j in range(len(layer)):
                error = 0.0
                for neuron in network[i + 1]:
                    error += (neuron['weights'][j] * neuron['delta'])
                errors.append(error)
        else:
            ...

Calculate ∂L/∂z.

Here, we compute ∂L/∂z as ∂L/∂a (represented as errors in the code) * σ′(z), and store the result as neuron['delta'].

# Backpropagate error and store in neurons
def backward_propagate_error(network, expected):
    for i in reversed(range(len(network))):
        layer = network[i]
        errors = list()
        ...
        # calculate the partial derivative of the loss with respect to the input before activation
        for j in range(len(layer)):
            neuron = layer[j]
            neuron['delta'] = errors[j] * transfer_derivative(neuron['output'])

Calculate ∂L/∂w to adjust weights.

Finally, we update the weights at the corresponding layer using ∂L/∂w = ∂L/∂z * ∂z/∂w. The weights are adjusted based on the delta (representing ∂L/∂z) and the inputs, while for the bias term we only need the delta.

# Update network weights with error
def update_weights(network, row, l_rate):
    for i in range(len(network)):
        inputs = row[:-1]
        if i != 0:
            inputs = [neuron['output'] for neuron in network[i - 1]]
        for neuron in network[i]:
            for j in range(len(inputs)):
                neuron['weights'][j] -= l_rate * neuron['delta'] * inputs[j]
            neuron['weights'][-1] -= l_rate * neuron['delta']  # for the bias term, the input is 1

Note that the code is based on SGD with a batch size of 1. Therefore, it doesn’t include the calculations that involve averaging the loss or the partial derivatives over multiple samples.

You know, backpropagation is kind of like the reverse of the forward pass, isn’t it? Is that a fair way to look at it?

The backpropagation calculation is akin to a reversed version of the forward pass, largely because:

  • The derivative of the activation function mirrors the role of the activation function in forward propagation.
  • The ∂L/∂z calculation is like calculating z in the forward pass, adding up products of outputs and weights; for backpropagation, we instead sum the products of the next layer’s weights and deltas (∂L/∂z values).

However, they are still quite different:

  • Different starting points. Forward pass begins with input data X, while backpropagation starts from the loss, making the choice of loss function pivotal.
  • Different purposes. Forward propagation aims to generate predictions from the given data, while the goal of backpropagation is to train the model by adjusting its parameters based on the comparison between predictions and actual values.
  • Calculation dependency. Backpropagation needs the results of the forward pass, and it uses the chain rule to integrate elements from the forward pass with the partial derivatives of the loss from the following layers. So backpropagation is inseparable from the forward propagation process.

Why is it important for us to understand the intricate details of backpropagation’s calculations?

Understanding backpropagation is crucial, even with deep learning frameworks like PyTorch and TensorFlow. A thorough understanding of its calculation provides an intuitive grasp of various challenges and training tricks in deep learning, without needing to memorize them. One key insight gained from understanding backpropagation is captured by the next question.

How does backpropagation inform the choice of activation functions?

Recall our discussion on calculating ∂a/∂z. When implementing backpropagation in code, to save memory, we’d define both the activation function and its derivative, which explains why tutorials often list derivatives alongside activation functions. These derivatives are as essential as the activation functions themselves and often predefined before modeling.

Activation Functions and their Derivatives. Source: https://dwaithe.github.io/images/activationFunctions.png

Consider the computation of ∂L/∂z during backpropagation. This calculation relies on the outputs from subsequent layers, and it follows a pattern: the gradient for earlier layers is a product of the activation function derivatives at each successive layer. For instance, if our example network had one more layer after z’ and z”, with weights w5 and w6, this calculation would become an even longer multiplication of these derivatives.

If we choose sigmoid or tanh as our activation function, they may lead us to the vanishing gradient problem, because their derivatives have a very small range: the sigmoid derivative lies between 0 and 0.25, while tanh’s derivative lies between 0 and 1. As a result, when multiplied together, these factors tend to diminish progressively, leading to smaller and smaller gradients. This reduction in gradient magnitude means the earlier layers receive very small gradient updates, keeping them from learning and adjusting their parameters effectively.

Source: https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Vanishing-and-Exploding-Gradients-in-Neural-Network-Models-Debugging-Monitoring-and-Fixing-Practical-Guide_7.png?resize=636%2C497&ssl=1

One solution is to use an activation function whose derivative has a broader range. ReLU, for example, has a derivative of either 1 or 0, effectively addressing the vanishing gradient issue. However, ReLU has its own drawback: it can cause the product of derivatives to become exactly 0. Consequently, those neurons receive no updates, become inactive, and fail to contribute to model learning. This problem is known as "dying ReLU".
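
To see the shrinkage numerically, here is a tiny NumPy sketch that simply multiplies per-layer sigmoid derivatives together, the way backpropagation would (a toy calculation with made-up pre-activation values, not a real network):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # bounded above by 0.25

rng = np.random.default_rng(0)
z_values = rng.normal(size=20)  # pretend pre-activation values for 20 layers

# The gradient reaching early layers is (roughly) a product of these derivatives
product = 1.0
for depth, z in enumerate(z_values, start=1):
    product *= sigmoid_derivative(z)
    if depth in (1, 5, 10, 20):
        print(f"after {depth:2d} layers: remaining factor ~ {product:.2e}")

# With ReLU the per-layer factor is either 0 or 1: no shrinkage,
# but a factor of 0 anywhere zeroes the whole product ("dying ReLU").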

In summary, a thorough understanding of backpropagation can be very beneficial. It’s a cornerstone in developing effective neural network models and troubleshooting training problems.

(Unless otherwise noted, all images are by the author)


If you’re enjoying this series, remember that your interactions – claps, comments, and follows – do more than just support; they’re the driving force that keeps this series going and inspires my continued sharing.

Other posts in this series:


If you liked the article, you can find me on LinkedIn.

Reference:

https://machinelearningmastery.com/implement-backpropagation-algorithm-scratch-python/

The post Courage to Learn ML: Explain Backpropagation from Mathematical Theory to Coding Practice appeared first on Towards Data Science.

Courage to Learn ML: A Detailed Exploration of Gradient Descent and Popular Optimizers https://towardsdatascience.com/courage-to-learn-ml-a-detailed-exploration-of-gradient-descent-and-popular-optimizers-022ecf97be7d/ Tue, 09 Jan 2024 13:14:47 +0000 https://towardsdatascience.com/courage-to-learn-ml-a-detailed-exploration-of-gradient-descent-and-popular-optimizers-022ecf97be7d/ Are You Truly Mastering Gradient Descent? Use This Post as Your Ultimate Checkpoint

The post Courage to Learn ML: A Detailed Exploration of Gradient Descent and Popular Optimizers appeared first on Towards Data Science.

We will use an RPG game as an analogy today. Created By ChatGPT

Welcome back to a new chapter of ‘Courage to Learn ML’. For those new to this series: it aims to make complex topics accessible and engaging, much like a casual conversation between a mentor and a learner, inspired by the writing style of "The Courage to Be Disliked", with a specific focus on machine learning.

In our previous discussions, our mentor and learner covered some common loss functions and the three fundamental principles of designing loss functions. Today, they’ll explore another key concept: gradient descent.

As always, here’s a list of the topics we’ll be exploring today:

  • What exactly is a gradient, and why is the technique called ‘gradient descent’?
  • Why doesn’t vanilla gradient descent perform well in Deep Neural Networks (DNNs), and what are the improvements?
  • A review of various optimizers and their relationships (Newton’s method, Adagrad, Momentum, RMSprop, and Adam)
  • Practical insights on selecting the right optimizer based on my personal experience

So, we’ve set up the loss function to measure how different our predictions are from actual results. To close this gap, we adjust the model’s parameters. Why do most algorithms use gradient descent for their learning and updating process?

To address this question, let’s imagine developing our own update theory, assuming we’re unfamiliar with gradient descent. We start by using a loss function to quantify the discrepancy, which encompasses both signal (the divergence of the current model from the underlying pattern) and noise (such as data irregularities). The subsequent step involves conveying this discrepancy back to the model and using it to adjust the various parameters. The challenge then becomes determining how much each parameter should change. A basic approach might involve calculating the contribution of each parameter and updating it proportionally. For instance, in a linear model like W*x + b = y, if the prediction is 50 for x = 1 but the actual value is 100, the gap is 50. We could then compute the contributions of w and b, adjusting them to align the prediction with the actual value of 100.

However, two significant issues arise:

  1. Calculating the Contribution: With many potential combinations of w and b that could yield 100 when x = 1, how do we decide which combination is better?
  2. Computational Demand in Complex Models: Updating a Deep Neural Network (DNN) with millions of parameters could be computationally demanding. How can we efficiently manage this?

These difficulties highlight the complexity of the problem. It’s nearly impossible to accurately determine each parameter’s contribution to the final prediction, especially in non-linear and intricate models. Therefore, to update model parameters effectively based on the loss, we need a method that can precisely dictate the adjustments for each parameter without being computationally costly.

Thus, rather than focusing on how to allocate the loss across each parameter, we can treat updating as a strategy to traverse the loss surface. The objective is to locate a set of parameters that guides us to the global minimum – the closest approximation achievable by the model. This adjustment process is akin to playing an RPG game, where the player seeks the lowest point on a map. This is the foundational idea behind gradient descent.

We have a set of basic movements to guide our hero towards finding the treasure, which is located at the lowest point in the map. Created by ChatGPT

So what exactly is gradient descent? Why is it called gradient descent?

Let’s break it down. The loss surface, central to optimization, is shaped by the loss function and model parameters, varying with different parameter combinations. Imagine a 3D loss surface plot: the vertical axis represents the loss function value, and the other two axes are the parameters. At the global minimum, we find the parameter set with the lowest loss, our ultimate target to minimize the gap between actual results and our predictions.

Source: https://miro.medium.com/v2/resize:fit:1400/format:webp/1*DDjCOEPSHLsU7tff7LmYUQ.png

But how do we navigate towards this global minimum? That’s where the gradient comes in. It guides us in the direction to move. You might wonder, why calculate the gradient? Ideally, we’d see the entire loss surface and head straight for the minimum. But in reality, especially with complex models and numerous parameters, we can’t visualize the entire landscape – it’s more complex than a simple valley. We can only see what’s immediately around us, like being in a foggy landscape in an RPG game. So, we use the gradient, which points towards the steepest ascent, and then head in the opposite direction, towards the steepest descent. By following the gradient, we gradually descend to the global minimum on the loss surface. This journey is what we call gradient descent.

The foggy landscape. Created By ChatGPT

How exactly does gradient descent decide the adjustments needed for each parameter, and why is it more effective than our initially proposed simple method?

Our objective is to minimize the loss function with a particular set of parameters, achievable only by adjusting these parameters. We indirectly influence the loss function through these changes.

Let’s revisit our RPG analogy. The hero, with only basic movements (left/right, forward/backward) and limited visibility, aims to find the lowest point on an uncharted map to unearth a legendary weapon. We know the gradient indicates the direction to move, but it’s more than just a pointer. It also decomposes into fundamental movements.

The gradient, a vector of partial derivatives with respect to each parameter, signifies how much and in which basic direction (left, right, forward, backward) to move. It’s like a magical guide that not only tells us the hill’s left side will lead to the legendary weapon but also instructs us on the specific turns and steps to take.

However, it’s crucial to understand what the gradient actually is. Tutorials often suggest imagining you’re on a hill and looking around to choose the steepest direction to descend. But this can be misleading. The gradient isn’t a direction on the loss surface itself but the projection of that direction onto the parameter dimensions (in the graph, the x,y coordinates), guiding us toward lower values of the loss function. This distinction is crucial – the gradient isn’t on the loss surface but a directional guide within the parameter space.

The vector v represents the gradient and can be expressed in the coordinate form determined by our parameters x and y. Source: https://eli.thegreenplace.net/images/2016/plot-3d-parabola.png

That’s why most visualizations use parameter contours, not the loss function to illustrate gradient descent processes. The movement is about adjusting parameters, with changes on the loss function being a consequence.

Source: https://machinelearningmastery.com/wp-content/uploads/2021/07/gradientDescent1.png

The gradient is formed by partial derivatives. A partial derivative tells us how a function changes in relation to a specific parameter while holding the others constant. This is how gradients quantify each parameter’s influence on the direction in which the loss function changes.

Gradient descent naturally and efficiently resolves the parameter tuning dilemma we initially faced. It’s especially adept at locating global minima in convex problems and local minima in nonconvex scenarios. Modern implementations benefit from parallelization and acceleration via GPUs or TPUs. Variations like mini-batch gradient descent and Adam optimize its efficiency across different contexts. In summary, gradient descent is stable, capable of handling large datasets and numerous parameters, making it a superior choice for our learning purposes.

In the gradient descent formula 𝜃 = 𝜃 – 𝜂⋅∇𝜃𝐽(𝜃), we use the learning rate (𝜂) multiplied by the gradient (∇𝜃𝐽(𝜃)) to update each parameter. Why is the learning rate necessary? Wouldn’t adjusting the parameter directly with the gradient be faster?

At first glance, using only the gradient to reach the global minimum seems straightforward. However, this overlooks the gradient’s nature. The gradient, a vector of partial derivatives of the loss function with respect to parameters, indicates the direction of steepest ascent and its steepness. Essentially, it guides us in the most promising direction given our limited perspective, but it’s just that – a direction. We must decide how to act on this information.

The learning rate, or step size, is crucial here. It controls the pace of our learning process. Imagine the character at the top of a mountain in our video game example. The game indicates that moving left and downhill is the quickest way to the treasure (representing the global minimum). It also informs you about the hill’s steepness: a steepness of 5 to the left and 10 forward. Now, you decide how far to move your hero. If you’re cautious, you might choose a small step, say a learning rate of 0.001. This means your hero moves 0.001 * 5 = 0.005 units to the left and 0.001 * 10 = 0.01 units forward. As a result, your hero moves towards the goal, aligned with the gradient’s direction but at a controlled pace without overshooting.

To sum up, it’s important not to confuse the gradient’s magnitude with the learning pace. The magnitude indicates the steepness of the ascent direction, which can vary depending on the loss surface. The learning rate, on the other hand, is a choice you make, independent of the dataset. It’s a hyperparameter, signifying how cautiously or aggressively we want to proceed in our learning journey.

If the learning rate isn’t dependent on the dataset, how do we determine it? What’s the typical range for a learning rate, and what are the issues with setting it too high or too low?

Since the learning rate is a hyperparameter, we typically choose it through experimentation, like cross-validation, or based on previous experience. The common range for learning rates is between 0.001 and 0.1. This range is based on empirical observations from the Data Science community, who have found that learning rates within this range tend to converge faster and more effectively. Theoretically, we prefer a learning rate no higher than 0.1 because larger rates can alter our parameters too drastically at each step, leading to risks like overshooting. On the practical side, we avoid rates lower than 0.001 as they can slow down the learning process, making it computationally expensive and time-consuming.

The common range gives us insights into the problems with extreme learning rates. When the rate is too high, the large step size might cause the algorithm to move too fast, leading to overshooting and never reaching the goal. Conversely, a very low rate results in tiny steps, potentially taking an excessive amount of time to reach the global minimum, thus wasting computational resources and time.

Here’s a visual representation to help understand the impact of different learning rates on the learning process:

Source: https://www.jeremyjordan.me/content/images/2018/02/Screen-Shot-2018-02-24-at-11.47.09-AM.png

What are the limitations of vanilla gradient descent, especially when applied to DNN models?

Even though gradient descent works well in theory, it faces significant challenges in practical applications, particularly with DNN models. The key limitations include:

  • Vanilla gradient descent is computationally intensive. The vanilla gradient descent formula 𝜃 = 𝜃 – 𝜂⋅∇𝜃𝐽(𝜃) requires calculating the average loss 𝐽(𝜃) across the entire dataset. This means comparing all predicted values with their true labels and averaging these differences. For a typical DNN model, which often involves millions of parameters and large datasets, this process becomes computationally intense. This complexity is one of the reasons why practical applications of DNNs weren’t feasible until the advent of models like AlexNet in 2012, despite gradient descent being a much older concept.
  • The loss surface in DNNs, usually non-convex, contains multiple local minima. The simplistic criterion of using gradient descent to find points where ∇𝜃𝐽(𝜃) = 0 is inadequate in such scenarios. In our video game analogy, this is like our hero being unsure of where to move next, as the steepest descent direction becomes indiscernible. Particularly problematic are plateaus and saddle points, where ∇𝜃𝐽(𝜃) is zero. The areas close to saddle points and local minima can be even more harmful, since there ∇𝜃𝐽(𝜃) is very small and close to zero; such tiny gradients can significantly slow down the learning process, consuming time and computational resources unnecessarily.
Keep in mind that in the foggy landscape scenario, we have no idea whether we’re on a plateau, at a global minimum, or a local minimum. Source: https://miro.medium.com/v2/resize:fit:1400/1*vx235JUNzgv0fGzONlNIYg.png

Why does gradient descent work, and why must the learning rate be small?

Author’s Note: I debated whether to include this section in my post, knowing that mathematical formulas can be intimidating. Yet, while exploring gradient descent, I found myself questioning why the direction of the steepest descent is believed to lead to the global minimum and why tutorials emphasize setting a small learning rate to prevent overshooting. My attempts to rationalize these concepts through RPG game analogies were helpful but not entirely satisfying. It was the straightforward mathematical proof behind gradient descent that finally put my doubts to rest. If you’ve had similar questions, I encourage you to read this section. It might just offer the clarity you need to grasp these concepts fully.

To grasp gradient descent, we need to discuss approximating the loss function using Taylor series. The Taylor series is an incredibly powerful tool for estimating the value of a complex function at a specific point. It’s like trying to describe a complex situation using simpler, individual elements. Instead of using a broad statement like "I had a car accident," you break it down into specific events: driving the dog to the vet, the dog popping up in the backseat, the phone ringing, and then the crash. The Taylor series does something similar. Instead of trying to describe a complex function f(x) with a single general term, it uses a series of terms to describe the specific value of f(x) at x = a. Taylor series breaks down a function into a sum of polynomial terms based on the function’s derivatives at a specific point (x = a). Each term of the series adds more detail to the approximation.

For an engaging explanation of the Taylor series and how the first and second derivatives contribute to its expansion, check out 3Blue1Brown’s video on Taylor series. It effectively demonstrates why these derivatives are often sufficient for a solid approximation around the point of interest.

Now, returning to our objective of minimizing the loss function, 𝐽(𝜃), we can apply the Taylor series, primarily relying on the first derivative for an effective approximation. In this context, 𝐽(𝜃) is a multivariable function, where 𝜃 is a vector representing a set of parameter variables, and a is the current value of these parameters.

To minimize 𝐽(𝜃) with this approximation, we must keep (𝜃 – a) really small, meaning our next set of 𝜃 values must stay very close to our current position, represented by the vector a. This need for proximity is why a small learning rate is crucial: if 𝜃 deviates significantly from the current parameter values a, the approximation breaks down and minimizing our loss function becomes challenging. Additionally, since the gradient 𝐽′(a) indicates a direction, we choose (𝜃 – a) to point in the opposite direction. This explains why following the direction of steepest descent (opposite to the gradient) steers us towards the global minimum, where ∇𝜃𝐽(𝜃) = 0.
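
Written out, the first-order approximation behind this argument is:

𝐽(𝜃) ≈ 𝐽(a) + 𝐽′(a)⋅(𝜃 – a)

If we pick (𝜃 – a) = –𝜂⋅𝐽′(a) with a small 𝜂 > 0, then 𝐽(𝜃) ≈ 𝐽(a) – 𝜂⋅‖𝐽′(a)‖², which is no larger than 𝐽(a). The approximation (and with it this guarantee) only holds while (𝜃 – a) stays small, which is exactly why the learning rate must be small.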

You particularly mentioned that vanilla gradient descent isn’t ideal for DNNs. Why?

The core issue with applying vanilla gradient descent to DNNs lies in the nature of their loss functions, which are typically non-convex. Gradient descent struggles in these scenarios. Imagine navigating a complex map that isn’t just a simple hill or valley but a terrain filled with unpredictable elevations and depressions. In such a landscape, gradient descent may fail to guide us to the global minimum and might even increase the loss function’s value after parameter updates. This is partly because first-order partial derivatives provide limited information, showing how the loss changes with respect to one parameter while holding others constant. But this doesn’t account for interactions between multiple parameters, which are crucial in complex models.

Source: https://www.researchgate.net/publication/334986617/figure/fig1/AS:789459528335363@1565233227897/Illustration-of-the-non-convex-mean-squared-error-cost-function-of-a-two-dimensional.png

Revisiting the Taylor series concept from our previous discussion, we see that when the objective or loss function becomes highly complex, using only the first derivative doesn’t provide an accurate approximation. Hence, vanilla gradient descent is less effective in navigating complex, non-convex loss functions.

It’s important to note that gradient descent can still reduce the loss in non-convex scenarios. It’s a general method for continuous optimization and can be applied to nonconvex functions, where it will converge to a stationary point. However, this point might not be the global minimum but could be a local minimum or even a saddle point, depending on the function’s convexity.

How can we improve upon vanilla gradient descent to accelerate reaching the global minimum?

Great question! Our previous discussions and the video game analogy reveal several ways to refine gradient descent. Let’s explore these enhancements:

Making the map more player-friendly.

Just like navigating a more accommodating map in a game makes reaching the treasure easier, certain modifications can make the path to the global minimum smoother in gradient descent.

One of the most obvious ways to modify your map (the loss surface) is to select an appropriate loss function. For example, in classification problems, cross-entropy is commonly used (or log loss for binary classification). Opting for Mean Squared Error (MSE) in classification tasks can result in a loss surface with multiple local minima, increasing the likelihood of the learning process getting stuck in one of these minima.

Some other methods include using techniques like feature selection, regularization, feature scaling and batch normalization. Feature selection not only reduces computational costs but also simplifies the loss surface, as the loss function is influenced by the number of parameters. L1/L2 regularization can help in this process: L1 for feature selection and L2 for smoothing the loss surface. Feature scaling is crucial because features on different scales can cause uneven step sizes, potentially slowing down convergence or preventing it altogether. By scaling features to similar ranges, we ensure more uniform steps, facilitating faster and more consistent convergence.

Like feature scaling, batch normalization aims to normalize the inputs between layers, and it introduces additional hyperparameters for tuning. Batch normalization standardizes a layer’s output before it becomes the input for the subsequent layer, establishing a dependency among samples in a batch through the calculation of the mean and variance of the outputs from the whole batch. Selecting an appropriate batch size is therefore crucial, as a batch size of 1 is typically ineffective for batch normalization. The paper "How Does Batch Normalization Help Optimization?" suggests that this technique makes the loss surface smoother while minimally altering the position of the global minimum. One thing to note: during the prediction phase, batch normalization uses the moving average and variance computed during training to normalize inputs, rather than calculating batch-wise statistics from the test data. This mirrors the standard practice of preventing data leakage between training and test data during any data transformation.
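To illustrate the training-versus-prediction behavior described above, here is a minimal PyTorch-style sketch; the layer sizes and data are made-up placeholders:

```python
import torch
import torch.nn as nn

# A toy block: linear layer followed by batch normalization.
model = nn.Sequential(
    nn.Linear(20, 64),
    nn.BatchNorm1d(64),  # normalizes each feature using batch statistics
    nn.ReLU(),
    nn.Linear(64, 1),
)

x_train = torch.randn(32, 20)   # batch of 32 samples, 20 features

model.train()                   # training mode: uses batch mean/variance
out = model(x_train)            # and updates the running (moving) averages

model.eval()                    # evaluation mode: uses the stored moving
with torch.no_grad():           # averages instead of test-batch statistics
    pred = model(torch.randn(5, 20))
```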

Source: https://qph.cf2.quoracdn.net/main-qimg-591ce5575f71110585d9246bf8c2ca0b-pjlq

Move faster.

There’s a Chinese saying, "天下武功无快不破" (no martial art is indestructible, except for speed). This concept applies to gradient descent as well. In vanilla gradient descent, calculating the loss using the whole training dataset in each iteration is time-consuming and resource-intensive. However, we can achieve similar results with less data. The idea is that the average loss from a smaller sample is not significantly different from that over the entire dataset. By employing methods like mini-batch gradient descent (using a subset of data) or Stochastic Gradient Descent (SGD, selecting a random sample each time), we can speed up the process significantly. These approaches enable quicker computations and updates, making them particularly effective in DNNs.

Source: https://editor.analyticsvidhya.com/uploads/58182variations_comparison.png
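As a rough sketch of mini-batch gradient descent in plain NumPy (the data, learning rate, and batch size are arbitrary placeholders, not tuned values):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                 # 1000 samples, 5 features
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w + rng.normal(scale=0.1, size=1000)

w = np.zeros(5)
lr, batch_size, epochs = 0.05, 32, 20

for _ in range(epochs):
    idx = rng.permutation(len(X))              # shuffle each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(Xb)  # MSE gradient on the mini-batch
        w -= lr * grad                         # parameter update
```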

In the world of machine learning, the term ‘SGD’ (Stochastic Gradient Descent) has evolved to generally refer to any variant of gradient descent that uses a small subset of data for each run. This includes both the original form of SGD and mini-batch gradient descent. Conversely, ‘batch gradient descent’ denotes the approach where the entire training dataset is used for each gradient descent iteration.

Here are some common terms you’ll encounter in discussions about gradient descent:

Batch Size: This is the number of observations (or training data points) used for one iteration of gradient descent.

Iterations per epoch: This indicates how many gradient descent iterations are required to go through the entire dataset once, given the current batch size.

Epoch: An epoch represents one complete pass through the entire training dataset. The number of epochs is a choice made by the modeler and is independent from both the batch size and the number of iterations.

To illustrate, consider a training dataset with 1000 observations. If you opt for a batch size of 10, you’ll need 1000/10 = 100 iterations to cover the entire dataset once. If you set your epoch count to 5, the model will go through the entire dataset 5 times. This means a total of 5 * 100 = 500 gradient descent iterations will be performed.

Move smarter by gathering more information.

Moving smarter in gradient descent involves looking beyond just the steepest descent. In our previous session on the mathematics behind gradient descent, we learned that using the first derivative of f(x) at x = a provides a solid approximation of f(x). However, to enhance this approximation, incorporating additional derivative terms is beneficial. In our video game analogy, this equates to not just finding the steepest descent direction but also moving our camera around to gain a comprehensive understanding of the landscape. This approach is particularly useful for identifying whether we’re at a local/global minimum (like being at the bottom of a valley), or a maximum (at the top of a hill), or a saddle point (surrounded by hills and valleys). However, our limited view means we can’t always distinguish between a local and a global one.

Expanding on this concept, we can apply Newton’s Method. This technique calculates both the first and second derivatives of the objective or loss function. The second derivatives create the Hessian matrix, offering a more detailed view of the loss function’s landscape. With information from both the first and second derivatives, we gain a closer approximation of the loss function. Unlike traditional gradient descent, Newton’s Method doesn’t use a learning rate. However, variations such as Newton’s method with line search do include a learning rate, adding a level of adjustability. While Newton’s Method might seem more efficient than vanilla gradient descent, it’s not commonly used in practice due to its higher computational demands. For more insights into why Newton’s Method is less prevalent in Machine Learning, you can refer to the discussion here.
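For reference, a sketch of the standard Newton update, where H denotes the Hessian of the loss at the current parameters:

$$\theta_{t+1} = \theta_t - H^{-1}\,\nabla J(\theta_t), \qquad H = \nabla^2 J(\theta_t)$$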

Adjust step size while moving around.

This strategy can make our journey more efficient. The simplest approach is to reduce the step size as we approach the global minimum. This is particularly relevant in gradient descent variants like SGD, where the learning rate decreases with each epoch (via learning rate schedules). However, this method uniformly adjusts the learning rate across all parameters, which might not be ideal given the complexity of the loss landscape. In the realm of video games, this is akin to varying the intensity of different movements to navigate more effectively. That’s where methods like Adagrad come in, offering an adaptive gradient approach. It does so by accumulating the history of squared gradients for each parameter and using the square root of this accumulation to adjust the learning rate individually. This method is like fine-tuning our actions in the game based on past experiences, especially when unusual updates occur.

However, while intuitive, Adagrad can slow down due to its aggressive rate decay. Variations of this method seek to balance this aspect.

There are two ways to understand why Adagrad uses the sum of squared gradients. Firstly, it allows more cautious learning when the current gradient is significantly larger than historical ones. If we encounter a substantially larger gradient at a given time, adding its squared value to the decay term increases the term significantly, leading to a smaller learning rate and more cautious updates. Secondly, this approach approximates the magnitude of the function’s second derivatives. A larger sum of squared first derivatives suggests a steeper surface, indicating the need for smaller steps.
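In symbols, the per-parameter Adagrad update is commonly written as follows, where g_t is the gradient at step t, η the base learning rate, and ε a small constant to avoid division by zero (some formulations place ε inside the square root):

$$G_t = \sum_{k=1}^{t} g_k^2, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t} + \epsilon}\, g_t$$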

Choosing better basic movements.

Optimizing our moves in gradient descent involves refining our approach based on the landscape. In vanilla gradient descent, our path often zigzags, similar to uncertain steps in a video game. This can be visualized with the contour of two parameters showing more vertical movements and fewer horizontal ones, even though a horizontal path might be more efficient.

Feature scaling can help, but with the complex surfaces in DNNs, a more directed approach is needed. This is where Momentum Gradient Descent comes into play. It identifies if movements in certain directions are counterproductive. Instead of directly updating parameters with the current gradient, it calculates an exponential moving average of past gradients. This ‘momentum’ helps accelerate progress in consistent directions and dampens movements in unproductive ones.

Notice here, unlike Adagrad which uses squared gradients, we accumulate the historical gradients directly. This approach is crucial because it allows positive and negative movements to cancel each other out. Think of this as building momentum in a specific direction, accelerating the learning process if we consistently move in a similar direction. Conversely, a small accumulated value suggests little progress in that direction, hinting that it may not be the most promising movement toward the minimum.

To enhance this smoothing effect and capitalize on past movements, we assign more weight to the accumulated gradient history. This is done by setting a higher value for the momentum coefficient, typically denoted as β. A common choice for β is 0.9, which strikes a balance between giving importance to historical movements while still being responsive to new gradient information. This method ensures a smoother journey by favoring directions with consistent progress and dampening oscillations that do not contribute effectively towards reaching the goal.
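One common way to write the momentum update (other formulations accumulate the raw gradient without the (1 - β) factor) is:

$$v_t = \beta\, v_{t-1} + (1-\beta)\, g_t, \qquad \theta_{t+1} = \theta_t - \eta\, v_t$$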

Combine those strategies!

Merging the principles of Adagrad and Momentum Gradient Descent offers an innovative way to enhance gradient descent. Both these methods rely on historical gradients, but with a key difference in their approach. Momentum Gradient Descent uses the exponential moving average of gradients instead of a simple average. The advantage here is that by adjusting the momentum coefficient β, we can strike a balance between the influence of historical gradient trends and the current gradient.

  • Inspired by this, we can apply a similar logic to Adagrad, leading to the development of RMSprop (Root Mean Square Propagation). RMSprop is essentially an evolved version of Adagrad, utilizing the exponential moving average of historical gradients rather than a simple average. This modification places more weight on historical gradients, reducing the impact of exceptionally large current gradients. Consequently, it leads to a less aggressive decrease in the learning rate, addressing the issue of slow learning rates that Adagrad often faces.
  • Building further on this idea, why not combine the learning rate adjustment of Adagrad/RMSprop with the gradient adjustment strategy of Momentum? This thought led to the creation of Adam (Adaptive Moment Estimation, I remember it as the baby of Adaptive learning rate and Momentum). Adam essentially combines these two methods by using historical gradients in two ways: one for adjusting the exponential moving average (momentum), and the other for managing the scale of historical gradients (RMSprop). This dual application makes Adam a highly effective and stable optimizer. Adam is a popular choice of optimizer, even though it introduces two additional hyperparameters for fine-tuning.
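A sketch of the standard Adam update, combining the two ideas above; m and v are exponential moving averages of the gradient and squared gradient, and the hats denote bias correction:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2$$

$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}, \qquad \theta_{t+1} = \theta_t - \frac{\eta\, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$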

With various optimizers available, how should one choose the right optimizer in practice?

In practice, choosing the right optimizer depends on the specifics of your data and learning objectives. Here are some general guidelines from my observations and experience:

  • SGD for Online Learning: Online learning involves processing a continuous stream of incoming data, which requires frequent updates to the model with new, small data batches. Stochastic Gradient Descent (SGD) is particularly well-suited for this scenario as it efficiently uses small batches for more frequent model updates compared to other optimizers. Additionally, SGD is effective in environments where data is unstable or experiencing minor shifts. Using an appropriate learning rate to avoid overshooting, SGD can quickly adapt the model to subtle changes in data. However, it’s important to note that while SGD is capable of handling non-stationary environments, it may not be as effective in capturing major shifts in data patterns.
  • Adam and sparse data. When dealing with data that has a high number of zero entries, it’s described as being sparse. The challenge with sparse data lies in the limited information available for certain features. Adam optimizer is particularly effective in this context as it integrates both momentum and adaptive learning rate mechanisms. This combination allows Adam to tailor the learning rate for each parameter individually. Consequently, it provides more frequent updates for features that are underrepresented or have less information due to the data’s sparsity, ensuring a more balanced and effective learning process.
  • Don’t combine optimizers, but use them together wisely. While it’s possible to use multiple optimizers in the learning process, they shouldn’t be applied simultaneously within a single learning phase, as this can lead to confusion in the model and complicate the learning path. Instead, there are two strategic approaches to effectively utilizing multiple optimizers:
  • Switching between optimizers at different stages. For instance, when training DNNs, you might begin with Adam for its rapid progress capabilities in the initial epochs. Later, as the training progresses, switching to SGD in the subsequent epochs can offer more precise control over the learning process, aiding the model in converging to a more optimal local minimum.
  • Using different optimizers to train different parts of a model. In scenarios where new layers are added to an existing, pre-trained model, a nuanced approach can be beneficial. For the pre-trained layers, a stable and less aggressive optimizer is ideal for maintaining the integrity of the already learned features. In contrast, for the newly added layers, an optimizer that’s more aggressive and facilitates faster learning adjustments would be more suitable. This method ensures that each part of the model receives the most appropriate optimization technique based on its specific learning requirements.
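Here is a rough PyTorch-style sketch of both strategies; the model, epoch counts, and learning rates are placeholders for illustration only:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
loss_fn = nn.MSELoss()

def train(optimizer, epochs):
    for _ in range(epochs):
        x, y = torch.randn(64, 10), torch.randn(64, 1)  # placeholder batch
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()

# Strategy 1: start with Adam for fast initial progress, then switch to SGD
# in later epochs for finer, more stable updates.
train(torch.optim.Adam(model.parameters(), lr=1e-3), epochs=10)
train(torch.optim.SGD(model.parameters(), lr=1e-2), epochs=10)

# Strategy 2: use different optimizers for different parts of the model,
# e.g., a gentle SGD for "pre-trained" layers and Adam for newly added ones.
opt_old = torch.optim.SGD(model[0].parameters(), lr=1e-4)
opt_new = torch.optim.Adam(model[2].parameters(), lr=1e-3)
for _ in range(10):
    x, y = torch.randn(64, 10), torch.randn(64, 1)
    opt_old.zero_grad(); opt_new.zero_grad()
    loss_fn(model(x), y).backward()
    opt_old.step(); opt_new.step()
```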

Wait, why can’t we use Adam and SGD together? And why is SGD often preferred in later epochs for finer optimization, even though Adam is considered more advanced?

Adam and SGD differ significantly, especially in how they manage the learning rate. While Adam uses an adaptive learning rate, SGD typically employs a fixed rate or a scheduled adjustment. This distinction makes them fundamentally different optimizers.

Adam’s ability to adjust the learning rate doesn’t always make it "more advanced". In some cases, simplicity is better. SGD’s simple learning rate and steady approach are more effective, particularly in later epochs. Adam may reduce the learning rate too aggressively, leading to instability, whereas SGD’s consistent rate can more reliably approach a better local minimum and enhance stability.

Furthermore, SGD’s slower convergence can help prevent overfitting. Its fine-grained and controlled adjustments allow for a more precise fit to training data, potentially improving generalization to unseen data. The fixed or scheduled learning rate of SGD also offers researchers better control over the model’s learning process, emphasizing preferences on precision and stability over speed, especially in the final phases of model tuning.


If you’re enjoying this series, remember that your interactions – claps, comments, and follows – do more than just support; they’re the driving force that keeps this series going and inspires my continued sharing.

Other posts in this series:


If you liked the article, you can find me on LinkedIn.

The post Courage to Learn ML: A Detailed Exploration of Gradient Descent and Popular Optimizers appeared first on Towards Data Science.

Courage to Learn ML: An In-Depth Guide to the Most Common Loss Functions https://towardsdatascience.com/courage-to-learn-ml-an-in-depth-guide-to-the-most-common-loss-functions-84a6b07cca17/ Thu, 28 Dec 2023 01:13:43 +0000 https://towardsdatascience.com/courage-to-learn-ml-an-in-depth-guide-to-the-most-common-loss-functions-84a6b07cca17/ MSE, Log Loss, Cross Entropy, RMSE, and the Foundational Principles of Popular Loss Functions

Photo by William Warby on Unsplash

Welcome back to the ‘Courage to Learn ML‘ series, where we conquer machine learning fears one challenge at a time. Today, we’re diving headfirst into the world of loss functions: the silent superheroes guiding our models to learn from mistakes. In this post, we’ll cover the following topics:

  • What is a loss function?
  • Difference between loss functions and metrics
  • Explaining MSE and MAE from two perspectives
  • Three basic ideas when designing loss functions
  • Using those three basic ideas to interpret MSE, log loss, and cross-entropy loss
  • Connection between log loss and cross-entropy loss
  • How to handle multiple loss functions (objectives) in practice
  • Difference between MSE and RMSE

What are loss functions, and why are they important in machine learning models?

Loss functions are crucial in evaluating a model’s effectiveness during its learning process, akin to an exam or a set of criteria. They serve as indicators of how far the model’s predictions deviate from the true labels (the ‘correct’ answers). Typically, loss functions assess performance by measuring the discrepancy between the predictions made by the model and the actual labels. This evaluation of the gap informs the model about the extent of adjustments needed in its parameters, such as weights or coefficients, to more accurately capture the underlying patterns in the data.

There are many different loss functions in machine learning, and the right choice depends on several factors. These include the nature of the predictive task at hand, whether it’s regression or classification; the distribution of the target variable, as illustrated by the use of Focal Loss for handling imbalanced datasets; and the specific learning methodology of the algorithm, such as the application of hinge loss in SVMs. Understanding and selecting the appropriate loss function is quite important, since it directly influences how a model learns from the data.

To learn machine learning, one should know the most popular ones. For example, Mean Squared Error (MSE) and Mean Absolute Error (MAE) are commonly used in regression problems, while cross-entropy is the most common loss function for classification tasks.

How do loss functions differ from metrics, and in what ways can a loss function also serve as a metric?

Your statement that a loss function can also serve as a metric is a bit misleading. Loss functions and metrics both assess model performance, but at different stages and for different purposes:

  • Loss Functions: These are used during the model’s learning process to guide its adjustments. They need to be differentiable to facilitate optimization. For instance, Mean Squared Error (MSE) and Mean Absolute Error (MAE) are common loss functions in regression models.
  • Metrics: These evaluate the model’s performance after training. Metrics should be interpretable and provide clear insights into model effectiveness. While some metrics, like accuracy, can be straightforward, others like F1 score involve threshold decisions and are non-differentiable, making them less suitable for guiding learning.

Notably, some measures, such as MSE and MAE, can serve both as loss functions and metrics due to their differentiability and interpretability. However, not all metrics are suitable as loss functions, primarily due to the need for differentiability in loss functions for optimization purposes.

In practice, one should always carefully choose the loss function and metrics together for learning, and ensure that the learning and evaluation are aligned in the same direction. This alignment ensures that the model is optimized and evaluated based on the same criteria that reflect the end goals of the application.

Author’s Note: It’s important to clarify that using the F1 score as a loss function in machine learning models isn’t entirely infeasible. In my ongoing study, I’ve encountered innovative methods that address the non-differentiability issue commonly associated with the F1 score. For instance, Ashref Maiza’s post introduces a differentiable approximation of the F1 score. This approach involves "softening" precision and recall using likelihood concepts, rather than setting arbitrary thresholds. Additionally, some online discussions explore similar themes.

The challenge lies in the inherent nature of the F1 score. While it’s a highly informative metric, selecting an appropriate loss function to effectively optimize the model under the same criteria can be complex. Moreover, tuning thresholds adds another layer of complexity. I’m really interested in this topic. If you have insights or experiences to share, please feel free to connect with me. I’m eager to expand my understanding and engage in further discussions.

You said MSE and MAE as typical metrics in regression problems. What are they and when to use them?

In regression problems, where the predictions are continuous values, the goal is to minimize the difference between the model’s predictions and the actual values. To assess the model’s effectiveness in grasping the underlying pattern, we use metrics like Mean Squared Error (MSE) and Mean Absolute Error (MAE). Both these metrics quantify the gap between predictions and actual values, but they do so using different evaluation approaches.

MSE is defined as:

Here, y_i is the actual value, y_hat_i is the predicted value, and n is the number of observations.
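Written out, the formula is:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$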

MSE calculates the average of the squared differences between predictions and actual values; up to the averaging factor, this is the squared Euclidean distance (l2 norm) between the predictions and the true labels.

On the other hand, MAE is defined as:
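Written out, the formula is:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\,y_i - \hat{y}_i\,\right|$$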

Here, the absolute differences between the actual and predicted values are averaged, corresponding to the Manhattan distance (l1 norm). In other words, MAE calculates the average distance between the estimated values and the actual values without considering the direction (positive or negative).

We talked about the Lp norm and different distance measures in our discussion on L1 and L2 regularization: https://medium.com/p/1bb171e43b35

The primary distinction between MSE and MAE is their response to outliers. MSE, by squaring the errors, amplifies and gives more weight to larger errors, making it sensitive to outliers. This is useful if larger errors are more significant in your problem context. However, MAE assigns equal weight to all errors, making it more robust to outliers and non-normal error distributions.

The choice between MSE and MAE should be based on the properties of the training data and the implications of larger errors in the model. MSE is preferable when we want to heavily penalize larger errors, while MAE is better when we want to treat all errors equally.

I get that squaring the differences in MSE amplifies the errors, leading to a greater emphasis on outliers. Are there other perspectives or aspects that help differentiate between these two metrics?

Certainly, there’s another perspective to understand the differences between Mean Squared Error (MSE) and Mean Absolute Error (MAE) beyond their handling of outliers. Imagine you’re tasked with predicting a value ‘y’ without any additional features (no ‘Xs’). In this scenario, the simplest model would predict a constant value for all inputs.

When using MSE as the loss function, the constant that minimizes the MSE is the mean of the target values. This is because the mean is the central point that minimizes the sum of squared differences from all other points. On the other hand, if you use MAE, the median of the target values is the minimizing constant. The median, unlike the mean, is less influenced by extreme values or outliers.

In the universe of Douglas Adams’ ‘The Hitchhiker’s Guide to the Galaxy,’ 42 is the ultimate answer to life, the universe, and everything. Who knows, maybe 42 is also the magic number to shrink your loss function – but hey, it all depends on what your loss function is! Image created by ChatGPT

This difference in sensitivity to outliers stems from how the mean and median are calculated. The mean takes into account the magnitude of each value, making it more easily skewed by outliers. The median, however, is only concerned with the order of the values, thus maintaining its position regardless of the extremities in the dataset. This intrinsic property of the median contributes to MAE’s robustness to outliers, providing an alternative interpretation of the distinct behaviors of MSE and MAE in modeling contexts.

You can find an explanation of why the mean minimizes MSE and the median minimizes MAE in Shubham Dhingra’s post.

We’ve talked about how MSE and MAE measure errors, but there’s more to the story. Different tasks need different ways to measure how good our models are doing. This is where loss functions come in, and there are three basic ideas behind them. Understanding these ideas will help you pick the right loss function for any job. So, let’s get started with the most important question:

What are the 3 basic ideas that guide the design of any loss function?

In designing loss functions, three basic ideas generally guide the process:

  1. Minimizing Residuals: The key is to reduce the residuals, which are the differences between predicted and actual values. To address both negative and positive discrepancies, we often square these residuals, as seen in the least squares method. This approach, which sums the squared residuals, is a staple in regression problems for its simplicity and effectiveness.
  2. Maximizing Likelihood (MLE): Here, the goal is to adjust the model parameters to maximize the likelihood of the observed data, making the model as representative of the underlying process as possible. This probabilistic approach is fundamental in models like logistic regression and neural networks, where fitting the model to the data distribution is crucial.
  3. Distinguishing Signal from Noise: This principle, rooted in information theory, involves separating valuable data (signal) from irrelevant data (noise). Methods based on this idea, focusing on entropy and impurity, are essential in classification tasks and form the basis for algorithms like decision trees.

Additionally, it’s important to recognize that some loss functions are tailored to specific algorithms, such as the hinge loss for SVM, indicating that the nature of the algorithm also plays a role in loss function design. The nature of the data matters too: for instance, in cases of imbalanced training data, we might adjust our loss function to a class-balanced loss or opt for focal loss.

Now, equipped with these fundamental concepts, let’s apply them for interpretive analysis to enhance our comprehension. With this approach, we can attempt to address the following question:

How might we apply MLE and the least squares method to enhance our comprehension of MSE?

First, let’s break down MSE with the least squares method. The LSE approach finds the best model fit by minimizing the sum of the squares of the residuals. In linear regression (which deals with continuous outputs), a residual is the difference between the predicted value and the actual label. MSE, or Mean Squared Error, is essentially the average of these squared differences. Therefore, the least squares method aims to minimize MSE (factoring in this averaging step), making MSE an appropriate loss function for this method.

Next, looking at MSE from a Maximum Likelihood Estimation (MLE) perspective, under the assumption of linear regression, we typically assume that residuals follow a normal distribution. This allows us to model the likelihood of observing our data as a product of individual probability density functions (PDFs). For simplification, we take the natural logarithm of this likelihood, transforming it into a sum of the logs of individual PDFs. It’s important to note that we use density functions for continuous variables, as opposed to probability mass functions for discrete variables.

Note: Likelihood calculations differ for discrete and continuous variables. Discrete variables use a probability mass function, while continuous variables employ a probability density function. For more on MLE, refer to my previous post.

When we examine the log likelihood, it comprises two parts: a constant component and a variable component that calculates the squared differences between the true labels and predictions. To maximize this log likelihood, we focus on minimizing the variable component, which is essentially the sum of squared residuals. In the context of linear regression, this minimization equates to minimizing MSE, especially when we consider the scaling factor 1/(2σ²) that arises from the normal distribution assumption.
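A compact sketch of that derivation, assuming the residuals are i.i.d. Gaussian with variance σ²:

$$\log L(\theta) = \sum_{i=1}^{n}\log\!\left[\frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{(y_i-\hat{y}_i)^2}{2\sigma^2}\right)\right] = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2$$

The first term is constant in the model parameters, so maximizing the log likelihood is equivalent to minimizing the sum of squared residuals, i.e., MSE up to the 1/n factor.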

In summary, MSE can be derived and understood from both the perspectives of the Least Squares Estimation (LSE) and MLE, with each approach providing a unique lens into the significance and application of MSE in regression analysis.

So MSE is a common loss function for regression problem. But can I use it for classification problem? Such as logistic regression?

MSE, while common in regression, isn’t ideal for classification tasks, such as logistic regression. The primary reason is the mismatch in the nature of outputs: logistic regression predicts probabilities, whereas MSE assumes continuous numerical values. This misalignment leads to theoretical and practical challenges.

Practically, MSE creates a non-convex loss surface when combined with logistic regression, which often uses a sigmoid activation function. This non-convexity means the error surface has multiple local minima, making it difficult for optimization algorithms like gradient descent to find the global minimum. Essentially, the algorithm might get ‘stuck’ in a local minimum, leading to suboptimal model performance.

Moreover, combining MSE with the sigmoid function can cause the gradients to become very small, particularly for extreme input values. This leads to the ‘gradient vanishing’ problem, where the model stops learning or learns very slowly because the updates to the model parameters become insignificantly small.

Therefore, for classification problems, especially binary ones like logistic regression, MSE is not an ideal loss function.

So what is a good loss function for logistic regression or more general classification problem?

Alright, diving into the world of loss functions for logistic regression, let’s see how we can apply some basic design ideas to understand them better.

First off, let’s look at the least squares method. The core idea here is to minimize the gap between our model’s output and the true labels. A straightforward approach is setting a threshold to convert logistic regression’s probability outputs into binary labels, and then comparing these with the true labels. If we choose, say, a 0.5 threshold for classifying donuts and bagels, we label predictions above 0.5 as donuts and below as bagels, then tally up the mismatches. This approach, known as the 0–1 loss, directly corresponds to accuracy but isn’t used as a loss function for training due to its non-differentiability and non-convex nature, making it impractical for optimization methods like gradient descent. It’s more of a conceptual approach than a practical loss function.

When I first visited America, I couldn’t tell the difference between a donut and a bagel. A classifier to distinguish between donuts and bagels could be useful. Image created by ChatGPT

Moving on, let’s use the MLE (Maximum Likelihood Estimation) idea. In logistic regression, MLE tries to find the weights and bias that maximize the probability of seeing the actual observed data. Imagine our goal is to find a set of weights and bias that maximize the log likelihood, where the likelihood L is the product of individual probabilities of observing each outcome. We’re assuming our data points are independent and each follows a Bernoulli distribution.

So we’d have the log loss as:
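Reconstructed in symbols, with p_i the predicted probability that sample i is positive and y_i ∈ {0, 1} its true label:

$$\text{Log Loss} = -\frac{1}{n}\sum_{i=1}^{n}\Big[\,y_i\log p_i + (1-y_i)\log(1-p_i)\,\Big]$$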

Finally, let’s bring in some information theory, treating logistic regression as a signal capture machine. In this approach, we employ concepts like entropy and cross-entropy to assess the information our model captures. Entropy measures the amount of uncertainty or surprise in an event. Cross-entropy gauges how well our model’s predicted probability distribution lines up with the actual, true distribution. The goal here is to minimize cross-entropy, which is closely related to minimizing the KL divergence. Though not exactly a ‘distance’ in the strict sense, KL divergence represents how far off our model’s predictions are from the actual labels.

Softmax is another topic on my writing list. Source: https://towardsdatascience.com/cross-entropy-loss-function-f38c4ec8643e

So, through the application of three distinct design principles for loss functions, we’ve crafted various types of loss functions suitable for logistic regression and broader classification challenges.

It’s particularly fascinating to observe that, despite originating from different perspectives, log loss and cross-entropy loss are essentially the same in the context of binary classification. This convergence occurs in situations where only two possible outcomes exist; under these conditions, cross-entropy effortlessly simplifies and transforms into log loss. Comprehending this shift is vital for understanding the interplay and practical application of these theoretical concepts:

Derive log loss from cross-entropy loss. Source: https://towardsdatascience.com/cross-entropy-loss-function-f38c4ec8643e
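In brief, for a binary problem the true distribution places probability y_i on the positive class and 1 - y_i on the negative class, so the per-sample cross-entropy collapses to:

$$H(y_i, p_i) = -\big[\,y_i\log p_i + (1-y_i)\log(1-p_i)\,\big]$$

Averaging over all samples gives exactly the log loss above.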

Author’s Note: In the future, I’m considering delving into the fascinating world of information theory – a topic that, surprisingly, is both intuitive and practical in real-world applications. Until then, I highly recommend Kiprono Elijah Koech’s post as an excellent resource on the subject. Stay tuned for more!

In practical scenarios, how should one approach the situation where multiple loss functions need to be minimized?

When managing multiple loss functions in a model, balancing them can be challenging, as they may conflict. One common approach is to create a weighted sum of these loss functions, assigning specific weights to each. However, this introduces new hyperparameters (the weights), necessitating careful tuning. Adjusting these weights means retraining the model, which can be time-consuming and may affect interpretability and performance.
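As a minimal sketch of the weighted-sum idea (the two losses and the weights here are arbitrary placeholders, not a recommendation):

```python
import torch
import torch.nn as nn

mse = nn.MSELoss()
l1 = nn.L1Loss()

# Hypothetical weights, treated as extra hyperparameters to tune.
w_mse, w_l1 = 1.0, 0.3

def combined_loss(pred, target):
    return w_mse * mse(pred, target) + w_l1 * l1(pred, target)

pred = torch.randn(8, 1, requires_grad=True)
target = torch.randn(8, 1)
loss = combined_loss(pred, target)
loss.backward()   # gradients flow through both terms, scaled by their weights
```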

Alternatively, a constraint-based approach can be effective. For instance, in SVM, we aim to maximize the margin (reducing variance) while minimizing classification error (reducing bias). This can be achieved by treating the margin maximization as a constraint, using techniques like Lagrange multipliers, and focusing on minimizing the classification error. This method requires a strong mathematical foundation and thoughtful formulation of constraints.

A third option is to decouple the objectives, building separate models for each and then combining their results. This approach simplifies model development and maintenance, as each model can be independently monitored and retrained. It also offers flexibility in responding to changes in objectives or business goals. However, it’s important to consider how the combined results of these models align with the overall objective.

One well-known case of interacting objectives is the adversarial loss in GANs. However, it’s important to understand that adversarial loss isn’t just a weighted combination of the discriminator’s and generator’s losses: the two networks are engaged in a responsive interaction, learning and adapting in response to each other, rather than optimizing their losses independently.

Before we conclude, I’d like to address a straightforward yet practical query:

Why do we sometimes prefer using RMSE (Root Mean Squared Error) instead of MSE?

RMSE (Root Mean Squared Error) is often preferred over MSE (Mean Squared Error) in certain situations due to its interpretability. By taking the square root of MSE, RMSE converts the error units back to the original units of the data. This makes RMSE more intuitive and directly comparable to the scale of the data being analyzed. For instance, if you’re predicting housing prices, RMSE provides an error metric in the same unit as the prices themselves, making it easier to understand the magnitude of the errors.

Additionally, RMSE remains more sensitive to larger errors than MAE (Mean Absolute Error), because the errors are squared before averaging, emphasizing significant deviations. This can be particularly useful in scenarios where larger errors are more undesirable.

(Unless otherwise noted, all images are by the author)


If you’re enjoying this series, remember that your interactions – claps, comments, and follows – do more than just support; they’re the driving force that keeps this series going and inspires my continued sharing.

Other posts in this series:


If you liked the article, you can find me on LinkedIn.

The post Courage to Learn ML: An In-Depth Guide to the Most Common Loss Functions appeared first on Towards Data Science.

Courage to Learn ML: A Deeper Dive into F1, Recall, Precision, and ROC Curves https://towardsdatascience.com/courage-to-learn-ml-a-deeper-dive-into-f1-recall-precision-and-roc-curves-d5c0a46e5eb7/ Sun, 17 Dec 2023 17:42:43 +0000 https://towardsdatascience.com/courage-to-learn-ml-a-deeper-dive-into-f1-recall-precision-and-roc-curves-d5c0a46e5eb7/ F1 Score: Your Key Metric for Imbalanced Data - But Do You Really Know Why?

Welcome back to our journey with the ‘Courage to Learn ML‘ series. In this session, we’re exploring the nuanced world of metrics. Many resources introduce these metrics or delve into their mathematical aspects, yet the logic behind these ‘simple’ maths can sometimes remain opaque. For those new to this topic, I recommend checking out Shervin’s thorough post along with the comprehensive guide from neptune.ai.

In typical Data Science interview preparations, when addressing how to handle imbalanced data, the go-to metric is often the F1 score, known as the harmonic mean of recall and precision. However, the rationale behind why the F1 score is particularly suitable for such cases is frequently left unexplained. This post is dedicated to unraveling these reasons, helping you understand the choice of specific metrics in various scenarios.

As usual, this post will outline all the questions we’re tackling. If you’ve been pondering these same queries, you’re in the right place:

  • What exactly are precision and recall, and how can we intuitively understand them?
  • Why are precision and recall important, and why do they often seem to conflict with each other? Is it possible to achieve high levels of both?
  • What’s the F1 score, and why do we calculate it as the harmonic mean of recall and precision?
  • Why is the F1 score frequently used for imbalanced data? Is it only useful in these scenarios?
  • How does the interpretation of the F1 score change when the positive class is the majority?
  • What’s the difference between PR and ROC curves, and when should we prefer using one over the other?

With a fundamental understanding of these metrics, our learner approaches the mentor, who is busy doing laundry, with the first question:

I’m working on a game recommendation system. It’s designed to suggest video games based on users’ preferences and lifestyles. But I’ve noticed that it mostly recommends popular games, like this year’s TGA game – Baldur’s Gate, and users are missing out on niche and cult classic games they’re searching for. How can I tackle this issue? Should I change my algorithm or maybe use LLM, given its power?

Photo by Nick Hamze on Unsplash

Let’s not rush to the conclusion that you need the most advanced algorithm just yet. Instead, let’s explore why your model isn’t performing as expected. It seems your model scores well on Precision@k but gets a low Recall@k.

To understand this better, let’s break down these metrics:

  • Precision@k = (# of top k recommendations that are relevant)/(# of items that are recommended). In simple terms, it measures how many of the games your model recommends are actually relevant to the users.
  • Recall@k = (# of top k recommendations that are relevant)/(# of all relevant items). This tells us how many of the relevant games actually make it to your top k recommendations.

From this, it seems users often find relevant games in your recommendations, but not all relevant games are making it to your top k list. It’s important to note that the items recommended are those your model predicts to be relevant, which can be considered as ‘the number of items predicted to be relevant’.
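A small sketch of how these two quantities could be computed for a single user; the game titles and the relevant set are made up for illustration:

```python
def precision_recall_at_k(recommended, relevant, k):
    """recommended: ranked list of item ids; relevant: set of truly relevant ids."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision_at_k = hits / k
    recall_at_k = hits / len(relevant) if relevant else 0.0
    return precision_at_k, recall_at_k

# Example: the model favors popular titles, while the user's relevant set includes niche games.
recommended = ["baldurs_gate_3", "zelda", "fifa", "disco_elysium", "hades"]
relevant = {"disco_elysium", "hades", "outer_wilds", "pentiment"}
print(precision_recall_at_k(recommended, relevant, k=5))  # (0.4, 0.5)
```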

Hold on, are you suggesting I should use both recall and precision to evaluate my model? But aren’t recall and precision mainly used for imbalanced data, similar to their harmonic mean, the F1 score?

You’ve grasped an essential aspect of precision and recall, and you understand why accuracy isn’t always reliable. However, your perspective on recall and precision seems a bit limited, and you shouldn’t restrict them to just one scenario, handling imbalanced data. Let’s dissect this into smaller parts, starting with:

What are precision and recall?

Precision measures the accuracy of the model’s positive predictions, calculated as

Precision = # of samples correctly predicted as positive / total # of samples predicted as positive = true positive / (true positive + false positive)

On the other hand, recall assesses how well the model identifies all positive cases, calculated as

recall = # of samples correctly predicted as positive / total # of actual positive samples = true positive / (true positive + false negative)

A quick tip for remembering these terms: The first letter (True/False) indicates whether your prediction is correct, while the second (Positive/Negative) refers to the predicted label. So, a true positive means ‘correctly predicted as positive,’ and a false negative means ‘incorrectly predicted as negative…it’s actually positive!’

The total number of predicted positives is the sum of true positives (TP) and false positives (FP).

source: https://upload.wikimedia.org/wikipedia/commons/thumb/2/26/Precisionrecall.svg/800px-Precisionrecall.svg.png

Let me offer an intuitive example to understand these two terms. Suppose I’m sorting laundry, and my aim is to select all the dirty clothes from the pile for washing. In this scenario, precision is about how accurately I can identify a piece of clothing as dirty. Meanwhile, recall measures how many of the actual dirty clothes I correctly identify.

Then, our next question is:

Why do we care about precision and recall, and why do they often seem to conflict? Is it possible to have both high precision and recall?

The importance of precision and recall lies in their complementary nature. Let’s use the laundry sorting analogy again. My objectives are twofold: first, to ensure all dirty clothes are picked up, and second, to avoid unnecessary washing. Linking this back to metrics, precision is like my aim to correctly identify dirty clothes and save effort by not washing clean ones. Recall, meanwhile, assesses how well I manage to gather all the dirty clothes.

To clarify, let’s look at two extreme scenarios:

  • With a focus solely on high precision, I’d be extremely selective, only choosing visibly stained clothes for the wash. This means potentially overlooking less obvious dirt, like a shirt with only a bit of cat hair. Consequently, I’d end up washing only a small portion of the laundry, leaving behind some dirty items (hence, low recall).
  • If I prioritize high recall, I’d wash everything without sorting. This ensures all dirty clothes are cleaned but at the expense of washing clean items too (resulting in low precision).

No matter what kind of laundry sorter you are, you can see that the choice of metrics does impact our (the model’s) behavior. Recall and precision measure different aspects, and optimizing both simultaneously is challenging. That’s why, in classification, we talk about the trade-off between them. They work together to ensure our model predicts accurately while capturing all positive cases.

Next, let’s dive into:

What is the F1 score, and why is it calculated as the harmonic mean of recall and precision?

Most data science interview guides suggest handling imbalanced data by using the F1 score, which is the harmonic mean of recall and precision. But often, they don’t explain why the F1 score is effective in these situations.

Then, why use the F1 score? In model evaluations, we’re concerned with balancing precision and recall – we want correct predictions and comprehensive coverage. Monitoring both metrics separately can be tedious, so a single measure that reflects the balance is preferred. A simple average doesn’t reveal much about the balance; a high score could still mask an imbalance. However, the harmonic mean, used in the F1 score, penalizes extreme values more severely. If either recall or precision is low, it significantly lowers the F1 score.
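For reference, with precision P and recall R, the F1 score is:

$$F_1 = \frac{2}{\frac{1}{P}+\frac{1}{R}} = \frac{2PR}{P+R}$$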

Consider two hypothetical cases to understand why we use the harmonic mean instead of a simple average:

  • Scenario A: Precision = 0.9, Recall = 0.1
  • Scenario B: Precision = 0.5, Recall = 0.5 (more balanced case)

Simple average calculation:

  • Scenario A: (0.9 + 0.1) / 2 = 0.5
  • Scenario B: (0.5 + 0.5) / 2 = 0.5

Harmonic mean calculation (F1 Score):

  • Scenario A: 2 × (0.9 × 0.1) / (0.9 + 0.1) = 0.18
  • Scenario B: 2 × (0.5 × 0.5) / (0.5 + 0.5) = 0.5

While both scenarios have the same average, the simple average hides Scenario A’s poor recall. The harmonic mean, on the other hand, provides a more accurate reflection of the balance between precision and recall. A higher F1 score indicates a better balance.

Then… why is the F1 Score often used for imbalanced data, and is its use limited to these scenarios?

Let’s explore the challenges of imbalanced data, which is common in binary classification problems. Here, one class often has far fewer samples and represents rare but significant cases (like customer churn or cancer diagnosis). These rare cases usually have higher consequences, and accurately identifying them is crucial. We need a model that not only makes accurate predictions but also effectively identifies these rare cases. This requirement leads us to seek a balance between precision and recall, and the F1 score becomes a handy tool. It provides a single number that reflects this balance, making it a preferred metric in imbalanced datasets. The F1 score’s value lies in its ability to accurately portray a model’s efficacy in spotting the minority class.

However, the F1 score’s usefulness isn’t confined to just imbalanced datasets. It’s also relevant wherever balancing precision and recall is essential, even in balanced datasets. The F1 score remains a vital metric for balancing precision and recall, and it simplifies model comparisons.

Beyond the F1 score, other metrics are also useful for assessing model performance in cases of imbalanced data.

I’ve heard that F1, precision, and recall are asymmetric metrics, meaning they depend on which class is labeled as positive. How does the F1 score’s interpretation change when the positive class is actually the majority?

Good question. To answer that, let’s think about how recall and precision would shift if the majority class is considered positive. Achieving a high recall becomes easier because most samples will be predicted as positive.

But here’s the catch: high precision might be misleading in this scenario. With a larger majority class, it’s easy to get high precision just by predicting the majority class all the time. By switching the majority class to positive, we lose sight of how the model handles the rare class, especially in imbalanced data situations. So, the balance between precision and recall doesn’t guarantee the model’s effectiveness anymore since its focus has shifted. This means a model might show a high F1 score even if it’s not great at identifying the minority class.

When the positive class forms the majority, a high F1 score might not truly reflect the model’s ability to identify the minority class. It could simply mean the model often predicts the majority class.

In such cases, it’s wise to include other metrics less biased towards the majority, like the recall of the negative (minority) class, to get a fuller picture of the model’s performance.

What are the limitations of the F1 score, and what other metrics can we use to evaluate model performance on imbalanced data?

You know, we often approach classification problems in a regression-like manner. What I mean is, some algorithms predict probabilities, not just classes. For these, you need to set a threshold. But the F1 score doesn’t really show how the model performs at different thresholds. That’s where ROC curves or precision-recall curves come in, helping us assess performance across various thresholds.

Additionally, the Area Under the Curve (AUC) metric serves as a single-number summary, facilitating comparison between multiple models. The AUC scale ranges from 1, indicating a perfect classifier, down to 0, the mark of the poorest classifier. Notably, an AUC of 0.5 signifies performance equivalent to random guessing, where the True Positive Rate (TPR) and the False Positive Rate (FPR) are equal at all thresholds.

An interesting question to explore is

How does an AUC (Area Under the Curve) value of 0.5 equate to a classifier making random guesses?

When the AUC equals 0.5, the ROC curve is represented by a diagonal line connecting the points (0,0) and (1,1) on the plot. This diagonal line signifies that with this classifier, we have TPR = FPR at all thresholds. In simpler terms, at any threshold the classifier is just as likely to flag a negative case as positive as it is to flag a positive case as positive, so its scores carry no information for separating the two classes. Equivalently, AUC can be read as the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one; at 0.5, that is no better than the randomness of a coin flip. This equivalence to chance highlights why an AUC of 0.5 is considered indicative of a model performing no better than random guessing.

When it comes to imbalanced data, we can use the precision-recall curve to observe how the model balances precision and recall at different thresholds.

To sum it up, for some scenarios, we’d love the model to balance recall and precision well. That’s why we use the F1 score, the harmonic mean of recall and precision, as a single-number metric. But with imbalanced data, where the focus is more on the minority class’s performance, the F1 score becomes particularly valuable since achieving balance is tougher. Other handy tools are ROC and PR (Precision-Recall) curves.

So, for your game recommendation system, consider using multiple metrics. This way, you can better evaluate how well the model retrieves relevant items (recall) and ensures those items are indeed relevant (precision). You could evaluate precision@k and recall@k together, calculate f1@k, or draw PR curves.

In practice, it’s crucial to select model metrics based on the actual cost of errors, like whether recall matters more than precision to you. Using multiple metrics gives a fuller picture of your model’s performance. And remember, the key is to align your metrics with the model’s business or application goals.

Before wrapping up this post, there’s one more topic I’d like to touch on:

What’s the difference between the PR curve and the ROC curve, and when should you choose one over the other?

Most DS interview guides recommend using the PR curve instead of the ROC curve for imbalanced data, but they often don’t explain when to opt for the ROC curve. While I won’t delve into how to draw these curves here (for that, check out the excellent explanation by StatQuest with Josh Starmer here), let’s understand that these curves are drawn by varying the threshold and calculating two metrics (precision and recall for PR, or TPR and FPR for ROC). Both curves represent different balances in binary classification:

A sample calculation of ROC curve vs. PR curve. source: https://modtools.files.wordpress.com/2020/01/roc_pr-1.png?w=946

The ROC curve focuses on TPR and FPR; the PR curve on precision and recall:

  • TPR (Recall) = # of samples correctly predicted as positive / total actual positive samples.
  • FPR = # of samples wrongly classified as positive / total actual negative samples.

While precision and recall focus solely on the model’s performance for the positive class, TPR and FPR provide a broader view of predictability (correct positives vs. misclassified samples).

ROC curves are less sensitive to data distribution, as FPR is computed over the size of the negative class. If the negative class is the majority, the FPR value can remain low even with a fair number of false positives, due to the larger size of this class. This means ROC is less affected by data imbalance. On the other hand, PR curves, with precision calculated over predicted positives, are more sensitive to the positive class.

What does this imply? It means when comparing model performance across different datasets, ROC curves offer more stability than PR curves and can better reflect a model’s performance. So, rather than just remembering PR curves as preferable for imbalanced data, it’s important to recognize that ROC curves provide a consistent measure less influenced by data distribution.


In our upcoming session, the mentor-learner duo will delve into the common loss functions, exploring cross-entropy through the lenses of information theory and MLE. If you’re enjoying this series, remember that your interactions – claps, comments, and follows – do more than just support; they’re the driving force that keeps this series going and inspires my continued sharing.


Other posts in this series:


If you liked the article, you can find me on LinkedIn, and please don’t hesitate to connect or reach out with your questions and suggestions!

The post Courage to Learn ML: A Deeper Dive into F1, Recall, Precision, and ROC Curves appeared first on Towards Data Science.

]]>
Courage to Learn ML: Demystifying L1 & L2 Regularization (part 4) https://towardsdatascience.com/courage-to-learn-ml-demystifying-l1-l2-regularization-part-4-27c13dc250f9/ Mon, 11 Dec 2023 16:47:37 +0000 https://towardsdatascience.com/courage-to-learn-ml-demystifying-l1-l2-regularization-part-4-27c13dc250f9/ Explore L1 & L2 Regularization as Bayesian Priors

The post Courage to Learn ML: Demystifying L1 & L2 Regularization (part 4) appeared first on Towards Data Science.

]]>
Photo by Dominik Jirovský on Unsplash

Welcome back to ‘Courage to Learn ML: Unraveling L1 & L2 Regularization,’ in its fourth post. Last time, our mentor-learner pair explored the properties of L1 and L2 regularization through the lens of Lagrange Multipliers.

In this concluding segment on L1 and L2 regularization, the duo will delve into these topics from a fresh angle – Bayesian priors. We’ll also summarize how L1 and L2 regularizations are applied across different algorithms.

In this article, we’ll address several intriguing questions. If any of these topics spark your curiosity, you’ve come to the right place!

  • How MAP priors relate to L1 and L2 regularizations
  • An intuitive breakdown of using Laplace and normal distributions as priors
  • Understanding the sparsity induced by L1 regularization with a Laplace prior
  • Algorithms that are compatible with L1 and L2 regularization
  • Why L2 regularization is often referred to as ‘weight decay’ in neural network training
  • The reasons behind the less frequent use of L1 norm in neural networks

So, we’ve talked about how MAP differs from MLE, mainly because MAP takes into account an extra piece of information: our beliefs before seeing the data, or the prior. How does this tie in with L1 and L2 regularizations?

Let’s dive into how different priors in the MAP formula shape our approach to L1 and L2 regularization (for a detailed walkthrough on formulating this equation, check out this post).

When considering priors for weights, our initial intuition often leads us to choose a normal distribution. Typically, we use a zero-mean normal distribution for each weight wi, all sharing the same standard deviation 𝜎. Plugging this belief into the prior term log p(w) in MAP (where p(w) represents the weights’ prior) naturally yields a sum of squared weights, which is precisely the (squared) L2 norm. This implies that using a normal distribution as our prior equates to applying L2 regularization.

Conversely, adopting a Laplace distribution as our belief results in the L1 norm for weights. Hence, a Laplace prior essentially translates to L1 regularization.

In short, L1 regularization aligns with a Laplace distribution prior, while L2 regularization corresponds to a Normal distribution prior.
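For readers who want the algebra, here is a compact sketch of that correspondence, assuming independent, zero-mean priors with a shared scale on each weight (a standard simplification rather than the only possible setup).

```latex
\hat{w}_{\text{MAP}} = \arg\max_{w}\; \log p(x \mid w) + \log p(w)

% Zero-mean Gaussian prior, w_i \sim \mathcal{N}(0, \sigma^2):
\log p(w) = -\frac{1}{2\sigma^2} \sum_i w_i^2 + \text{const}
\quad\Rightarrow\quad \text{an L2 penalty } \lambda \lVert w \rVert_2^2

% Zero-mean Laplace prior, w_i \sim \mathrm{Laplace}(0, b):
\log p(w) = -\frac{1}{b} \sum_i \lvert w_i \rvert + \text{const}
\quad\Rightarrow\quad \text{an L1 penalty } \lambda \lVert w \rVert_1
```

In both cases the constants and the scale parameter get absorbed into the penalty coefficient λ, which is why the strength of the prior and the strength of the regularization are two views of the same knob.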

Interestingly, when employing a uniform prior in the MAP framework, it essentially "disappears" from the equation (go ahead and try it yourself!). This leaves the likelihood term as the sole determinant of the optimal weight values, effectively transforming the MAP estimation into maximum likelihood estimation (MLE).

So, can you explain the reasoning for having different beliefs when our prior is a Laplace distribution versus a normal distribution? I’d like to visualize this better.

This is a great question. Indeed, having different priors means you hold various initial assumptions about the situation before collecting any data. We’ll delve into the purpose of different distributions later, but for now, let’s look at a simple, intuitive example using Laplace and normal distributions. Consider the number of views on my new Medium posts. Two weeks ago, as a new writer with no followers, I expected zero views. My assumption was that the average daily view count would start low, possibly at zero, but might increase as readers interested in similar topics discover my work. A Laplace prior fits this scenario well. It suggests a range of possible view counts but assigns higher probability to numbers near zero, reflecting my expectation of few views initially but allowing for growth over time.

Now, with 55 viewers (thanks, everyone!), and followers who receive updates on my posts, my expectations have changed. I anticipate that new posts will perform similarly to my previous ones, averaging around my historical view count. This is where a normal distribution prior comes into play, predicting future views based on my established track record.

Hmm… Can you explain the L1 regularization sparsity with a Laplace prior?

Indeed, understanding L1 regularization’s promotion of sparsity can be illuminated by comparing the Laplace distribution to the normal distribution. The key difference lies in their probability densities around zero. The Laplace distribution is sharply peaked at zero, indicating a higher likelihood of values close to zero. This characteristic mirrors the effect of L1 regularization, where most weights in the model are driven towards zero, promoting sparsity. In contrast, the normal distribution, associated with L2 regularization, is less peaked at zero and more spread out, indicating a preference for distributing weights more evenly.

source: https://austinrochford.com/posts/2013-09-02-prior-distributions-for-bayesian-regression-using-pymc.html

Additionally, the Laplace distribution has heavier tails than the normal distribution, meaning it extends further out. This property allows a few weights to remain far from zero while most of the others are pushed very close to zero. So, by choosing the Laplace distribution as a prior for the weights (L1 regularization), we encourage the model to learn solutions where most weights are close to zero, achieving sparsity without sacrificing potentially relevant features. This is why L1 regularization can also be used as a feature selection method.
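If you want to see the two shapes side by side, here is a small sketch using SciPy; matching the two distributions at unit variance is an arbitrary choice made purely for comparison.

```python
import numpy as np
from scipy.stats import laplace, norm

x = np.linspace(-4, 4, 9)
normal_pdf = norm.pdf(x, loc=0, scale=1.0)                 # variance 1
laplace_pdf = laplace.pdf(x, loc=0, scale=1 / np.sqrt(2))  # variance 2*b^2 = 1

for xi, n_p, l_p in zip(x, normal_pdf, laplace_pdf):
    print(f"x={xi:+.1f}  normal={n_p:.3f}  laplace={l_p:.3f}")

# At x = 0 the Laplace density is higher (sharper peak), and far from zero
# it is higher again (heavier tails); in between, the normal density is larger.
```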

So, I see that L1 and L2 regularizations are key for avoiding overfitting and boosting a model’s generalizability. Can you tell me which algorithms these methods can be applied to?

L1 and L2 regularization can be applied to many algorithms by adding a penalty term to their loss functions. Here are some specific examples of algorithms where L1 and L2 regularization are applied (a short sketch of the corresponding library knobs follows the list):

  • Linear models. These techniques are particularly useful in high-dimensional problems. In linear models, L1 and L2 regularization are known as lasso and ridge regression, respectively. One thing to note is that L1 regularization not only helps prevent overfitting but also performs feature selection, which helps mitigate multicollinearity.
  • SVM. Regularization sits at the core of SVMs. The penalty on the weight vector controls the trade-off between fitting the training points and keeping a wide margin between the decision boundary and the closest support vectors (a smaller weight norm corresponds to a larger margin), which leads to better generalization. The standard formulation penalizes the L2 norm of the weights; L1-penalized variants also exist and additionally drive many weights to zero, yielding sparser models.
  • Neural Networks. L2 regularization is more commonly used in neural networks and is often referred to as weight decay. L1 regularization can also be used in neural networks, but it is less common due to its tendency to lead to sparse weights.
  • Ensemble algorithms. Gradient boosting machines such as XGBoost add L1 and L2 penalties to limit the complexity of the individual trees within the ensemble. Both penalties act on the leaf weights (scores) of each tree: L1 can shrink some leaf weights all the way to zero, while L2 shrinks them smoothly toward zero; a separate term (gamma in XGBoost) penalizes the number of leaves.
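Here is a quick sketch of how these penalties typically surface in common libraries; the hyperparameter values below are illustrative, not tuned.

```python
from sklearn.linear_model import Lasso, Ridge       # L1 / L2 linear models
from xgboost import XGBRegressor                     # gradient-boosted trees
from tensorflow.keras import layers, regularizers    # neural network layers

# Linear models: alpha is the penalty strength
lasso = Lasso(alpha=0.1)   # L1 -> sparse coefficients
ridge = Ridge(alpha=1.0)   # L2 -> small, dense coefficients

# XGBoost: reg_alpha (L1) and reg_lambda (L2) penalize leaf weights
booster = XGBRegressor(reg_alpha=0.5, reg_lambda=1.0)

# Keras: attach an L2 kernel regularizer to a layer's weights
dense = layers.Dense(64, activation="relu",
                     kernel_regularizer=regularizers.l2(1e-4))
```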

Why is L2 regularization also called ‘weight decay’ in neural network training? And why is the L1 norm less commonly used in neural networks?

To tackle those two questions, let’s bring in a bit of math to illustrate how weights get updated in the presence of L1 and L2 regularizations.

Loss function with L2 regularization; λ is penalty coefficient and α represents learning rate

In L2 regularization, the weight update process involves a slight reduction of the weights, scaled down according to their own magnitude. This results in what is termed "weight decay." Specifically, each weight is decreased by an amount that is directly proportional to its current value. This proportional reduction, governed by the typically small settings of the penalty coefficient (λ) and the learning rate (α), ensures that larger weights are penalized more heavily than smaller weights. The essence of weight decay lies in this method of scaling down weights, encouraging the model to maintain smaller weights. Such behavior is advantageous in neural networks as it tends to produce smoother decision boundaries.

Loss function with L1 regularization; λ is penalty coefficient and α represents learning rate

In contrast, L1 regularization modifies the weight update rule by subtracting or adding a constant amount, determined by αλ and the sign of the weight (w). This approach pushes weights towards zero, regardless of whether they are positive or negative. Under L1 regularization, all weights, irrespective of their magnitude, are adjusted by the same fixed amount. This results in larger weights remaining relatively large, while smaller weights are more rapidly driven to zero, promoting sparsity in the network.
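To make the comparison concrete, here is a sketch of the plain gradient-descent update rules implied by the two penalties, with λ as the penalty coefficient and α as the learning rate (momentum, adaptive optimizers, and other refinements are ignored).

```latex
% L2: the penalty gradient is \lambda w
w \;\leftarrow\; w - \alpha\bigl(\nabla_w L + \lambda w\bigr)
  \;=\; (1 - \alpha\lambda)\, w - \alpha \nabla_w L

% L1: the penalty gradient is \lambda\,\mathrm{sign}(w)
w \;\leftarrow\; w - \alpha\bigl(\nabla_w L + \lambda\,\mathrm{sign}(w)\bigr)
```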

Let’s compare!

Comparing the two, L2’s adjustment is proportional to the weight’s existing value, leading to larger weights diminishing more quickly than smaller ones. This multiplicative shrinking of every weight is why it’s termed ‘weight decay’. On the other hand, L1’s fixed adjustment amount, regardless of weight size, can lead to some issues that make it less favorable in neural networks:

  • It can zero out some weights, causing ‘dead neurons’ and potentially disrupting information flow within the network, which could impair model performance.
  • The non-differentiable points at zero introduced by L1 make optimization algorithms like gradient descent less effective.

What effects do adding L1 and L2 regularization have on our loss function? Does incorporating these regularizations lead us away from the original global minimum?

It’s a great question! In short, once we incorporate regularization, we intentionally shift our focus away from the original global minimum. This means adding penalty terms to the loss function, fundamentally changing its landscape. It’s crucial to understand that this change is desirable, not accidental.

By introducing these penalties, we aim to achieve a new optimal solution that balances two crucial goals: fitting the training data well to minimize empirical risk while simultaneously reducing model complexity and enhancing generalization to unseen data. The original global minimum might not achieve this balance, potentially leading to overfitting and poor performance on new data.

If you’re interested in the mathematical details of measuring the distance between the original and regularized optima, I highly recommend chapter 7 (pages 224–229) of Deep Learning by Ian Goodfellow. Pay particular attention to formulas 7.7 and 7.13 for L2 and 7.22 and 7.23 for L1. This provides a quantifiable assessment of the impact regularization terms have on weights, deepening your understanding of L1 and L2 regularization.


We’ve now reached the conclusion of our exploration into L1 and L2 regularization. In our next discussion, I’m excited to delve into the basics of loss functions. A big thank you to all the readers who enjoyed the first part of this series. Initially, my goal was to solidify my grasp of basic ML concepts, but I’m thrilled to see it resonate with many of you 😃 . If you have suggestions for our next topic, please feel free to leave a comment!


Other posts in this series:


If you liked the article, you can find me on LinkedIn.


Reference

A Probabilistic Interpretation of Regularization

https://keras.io/api/layers/regularizers/

Why Do We Need Weight Decay in Modern Deep Learning?

The post Courage to Learn ML: Demystifying L1 & L2 Regularization (part 4) appeared first on Towards Data Science.

]]>
Courage to Learn ML: Decoding Likelihood, MLE, and MAP https://towardsdatascience.com/courage-to-learn-ml-decoding-likelihood-mle-and-map-65218b2c2b99/ Sun, 03 Dec 2023 19:05:57 +0000 https://towardsdatascience.com/courage-to-learn-ml-decoding-likelihood-mle-and-map-65218b2c2b99/ With A Tail of Cat Food Preferences

The post Courage to Learn ML: Decoding Likelihood, MLE, and MAP appeared first on Towards Data Science.

]]>
Photo by Anastasiia Rozumna on Unsplash

Welcome to the ‘Courage to learn ML’. This series aims to simplify complex Machine Learning concepts, presenting them as a relaxed and informative dialogue, much like the engaging style of "The Courage to Be Disliked," but with a focus on ML.

In this installment of our series, our mentor-learner duo dives into a fresh discussion on statistical concepts like MLE and MAP. This discussion will lay the groundwork for us to gain a new perspective on our previous exploration of L1 & L2 Regularization. For a complete picture, I recommend reading this post before reading the fourth part of ‘Courage to Learn ML: Demystifying L1 & L2 Regularization’.

This article is designed to tackle, in Q&A style, fundamental questions that might have crossed your path. As always, if you find yourself asking similar questions, you’ve come to the right place:

  • What exactly is ‘likelihood’?
  • The difference between likelihood and probability
  • Why is likelihood important in the context of machine learning?
  • What is MLE (Maximum Likelihood Estimation)?
  • What is MAP (Maximum A Posteriori Estimation)?
  • The difference between MLE and least squares
  • The Links and Distinctions Between MLE and MAP

What exactly is ‘likelihood’?

Likelihood, or more specifically the likelihood function, is a statistical concept used to evaluate the probability of observing the given data under various sets of model parameters. It is called likelihood (function) because it’s a function that quantifies how likely it is to observe the current data for different parameter values of a statistical model.

Likelihood seems similar to probability. Is it a form of probability, or if not, how does it differ from probability?

The concepts of likelihood and probability are fundamentally different in statistics. Probability measures the chance of observing a specific outcome in the future, given known parameters or distributions. In this scenario, the parameters or the distribution are known, and we’re interested in predicting the probability of various outcomes. Likelihood, in contrast, measures how well a set of underlying parameters explains the observed outcomes. In this setting, the outcomes are already observed, and we seek to understand what underlying parameters or conditions could have led to these outcomes.

To illustrate this with an intuitive example, consider my cat Bubble’s preference for chicken over beef.

A photo of my cat, Bubble

When I buy cat food, I choose more chicken-flavored cans because I know there’s a higher probability she will enjoy them and finish them all. This is an application of probability, where I use my knowledge of Bubble’s preferences to predict future outcomes. However, Bubble’s preference is not something she explicitly communicates. I inferred it by observing her eating habits over the past six years. Noticing that she consistently eats more chicken than beef indicates a higher likelihood of her preferring chicken. This inference process is an example of using likelihood.

It’s important to note that, in statistics, likelihood is a function. It measures how well a particular set of parameters explains the observed data, with the data held fixed. Unlike probability, the values of a likelihood function do not necessarily sum (or integrate) to 1. This is because probability sums over all possible outcomes for given parameters, which must total 1, while likelihood varies the parameter sets for the data we actually observed.
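A tiny numerical check makes this tangible. In the sketch below (the 7-out-of-10 count is invented), probabilities over outcomes sum to 1, but the area under the likelihood curve over θ does not.

```python
from scipy.integrate import quad
from scipy.stats import binom

n, k = 10, 7  # hypothetical data: 7 "successes" in 10 trials

# Probability: fix theta, sum over all possible outcomes -> 1
theta = 0.5
print(sum(binom.pmf(x, n, theta) for x in range(n + 1)))  # ~1.0

# Likelihood: fix the data, vary theta -> the area is not 1
likelihood = lambda t: binom.pmf(k, n, t)
area, _ = quad(likelihood, 0.0, 1.0)
print(area)  # ~1/(n+1) = 0.0909..., clearly not 1
```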

Why is likelihood important in the context of machine learning?

Understanding the application of likelihood in the machine learning context requires us to consider how we evaluate model results. Essentially, we need a set of rules to judge between different sets of parameters. There are two primary approaches to measure how well a model, with its current parameters, explains the observed data:

The first method involves using a difference-based approach. We compare each true label with the corresponding prediction and attempt to find a set of model parameters that minimizes these differences. This is the basic idea behind the least squares method, which focuses on error minimization.

The second method is where likelihood, specifically Maximum Likelihood Estimation (MLE), comes into play. MLE seeks to find a set of parameters that makes the observed data most probable. In other words, by observing the data, we choose parameters that maximize the likelihood of observing the current data set. This approach goes beyond just minimizing error; it considers the probability and models the uncertainty in parameter estimation.

In Maximum Likelihood Estimation (MLE), the underlying assumption is that the optimal parameters for a model are those that maximize the likelihood of observing the given dataset.

In summary, while the least squares method and MLE differ in their approaches – one being error-minimizing and the other probabilistic – both are essential in the machine learning toolkit for parameter estimation and model evaluation. We will explore these methods further, discussing their differences and connections, in future posts.

Could you provide an intuitive example to contrast those two evaluation approaches (MLE vs. least squares)?

Considering my cat Bubble’s preference for food, let’s say I initially assume she likes chicken and beef equally. To test this using the least squares method, I would collect data by buying an equal number of chicken and beef flavored cans. As Bubble eats, I’d record how much of each she consumes. The least squares method would then help me adjust my initial assumption (parameters) by minimizing the difference (squared error) between my prediction (equal preference) and the actual consumption pattern (true labels).

For the MLE approach, instead of starting with an assumption about Bubble’s preference, I would first observe her eating habits over time. Based on this data, I’d use MLE to find the parameter values (in this case, preference for chicken or beef) that make the observed data most probable. For example, if Bubble consistently chooses chicken over beef, the MLE method would identify a higher probability for chicken preference.
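Here is a small, hedged sketch of the MLE side of that story; the sequence of meals is invented, and the numerical optimizer is just one way to maximize the log-likelihood (for a Bernoulli model the closed-form answer is simply the sample proportion).

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical observations: 1 = chose chicken, 0 = chose beef
meals = np.array([1, 1, 0, 1, 1, 1, 0, 1, 1, 1])

def neg_log_likelihood(theta):
    # Bernoulli log-likelihood of the observed choices for preference theta
    return -np.sum(meals * np.log(theta) + (1 - meals) * np.log(1 - theta))

result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print("MLE of chicken preference:", round(result.x, 3))  # ~0.8
print("Sample proportion:        ", meals.mean())        # 0.8, the closed-form MLE
```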

So MLE uses likelihood to select parameters. What are their mathematical representations?

In Maximum Likelihood Estimation (MLE), the primary goal is to identify the set of parameters (θ) that most likely produces the observed data. This process involves defining the likelihood function, denoted as L(θ) or L(θ; x), where x represents the observed data. The likelihood function calculates the probability of observing the given data x assuming the model parameters are θ.

The essence of MLE is to find the parameter values that maximize the likelihood function. Mathematically, this is its representation and calculation process:
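Written out under the usual assumption of independent, identically distributed observations x₁, …, xₙ, the objective looks like this (a sketch of the standard form):

```latex
\hat{\theta}_{\text{MLE}}
  = \arg\max_{\theta} L(\theta)
  = \arg\max_{\theta} p(x \mid \theta)
  = \arg\max_{\theta} \prod_{i=1}^{n} p(x_i \mid \theta)
  = \arg\max_{\theta} \sum_{i=1}^{n} \log p(x_i \mid \theta)
```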

The post will explore this equation in depth in subsequent questions.

Hold on a moment… we define likelihood as L(θ) = p(x|θ), signifying the probability of observing the data x given a set of parameters θ. But earlier, we mentioned that likelihood involves having a set of observations and then calculating the likelihood for a set of parameters. Shouldn’t it be L(θ) = p(θ|x) instead?

In understanding MLE, it’s crucial to distinguish between the likelihood function and probability. The likelihood function, denoted as L(θ), is not the same as the probability p(θ∣x).

While p(θ∣x) refers to the probability of the parameter values θ given the observed data x (a concept central to Bayesian inference), L(θ) is about the likelihood function, which evaluates how plausible different parameter values are in explaining the observed data.

For calculating the likelihood function, we use the probability of observing the data x given certain parameter values θ, denoted as p(x∣θ). This probability function is used to assess the adequacy of different parameter settings. Therefore, in MLE, we have L(θ)=p(x∣θ). It’s important to interpret this equation correctly: the equal sign here signifies that we calculate the likelihood L(θ) using the probability p(x∣θ); it does not imply a direct conceptual equivalence between L(θ) and p(x∣θ).

In summary, L(θ) quantifies how well the parameters θ explain the data x, while p(θ∣x) is about the probability of the parameters after observing the data. Understanding this distinction is fundamental to grasping the principles of MLE and its application in statistical modeling.

But wouldn’t using p(θ|x) provide a more direct evaluation of which parameter set is better, instead of relying on the likelihood function?

I’m glad you noticed this important distinction. Theoretically, calculating p(θ|x) for different parameter sets θ and choosing the one with the highest probability would indeed provide a direct evaluation of which set of parameters is better. This is achievable through Bayes’ theorem, which helps in computing the posterior probability p(θ|x).

To calculate this posterior, we consider three key elements:

  • Likelihood p(x|θ): This represents how probable the observed data is given a set of parameters. It’s the basis of MLE, focusing on how well the parameters explain the observed data.
  • Prior p(θ): This reflects our initial beliefs about the parameters before observing any data. It’s an essential part of Bayesian inference, where prior knowledge about the parameter distribution is factored in.
  • Marginal Likelihood or Evidence p(x): This measures how probable the observed data is under all possible parameter sets, essentially assessing the probability of observing the data without making specific assumptions about parameters.

In practice, the marginal likelihood p(x) can often be ignored, especially when comparing different sets of parameters, as it remains constant and doesn’t influence the relative comparison.

With Bayes’ theorem, we find that the posterior p(θ|x) is proportional to the product of the likelihood and the prior, p(x|θ) · p(θ).

This means to compare different parameter sets, we must consider both our prior beliefs about the parameters and the likelihood, which is how the observed data modifies our beliefs. Like MLE, in MAP (Maximum A Posteriori Estimation), we seek to maximize the posterior to find the best set of model parameters, integrating both prior knowledge and observed data.

So, MAP incorporates an additional element, which is our prior belief about the parameter.

Correct. MAP indeed uses an extra piece of information, which is our prior belief about the parameters. Let’s use the example of my cat Bubble (again) to illustrate this. In the context of MAP, when determining Bubble’s preferred food flavor – beef or chicken – I would consider a hint from the breeder. The breeder mentioned that Bubble likes to eat boiled chicken breast, so this information forms my prior belief that Bubble may prefer chicken flavor. Consequently, when initially choosing her food, I would lean towards buying more chicken-flavored food. This approach of incorporating the breeder’s insight represents the ‘prior’ in MAP estimation.

I understand that MAP and MLE are related, with MAP adding in our assumption about the parameter. Can you offer a more straightforward example to show me the difference and connections between those two methods?

To demonstrate the connection between MAP and MLE, I’ll introduce some mathematical formulas. While the goal of this discussion is to intuitively explain machine learning concepts through dialogue, showcasing these functions will help. Don’t fret over complexity; these formulas simply highlight the extra insights MAP offers compared to MLE for a clearer understanding.

Maximum Likelihood Estimation (MLE) focuses on identifying the parameter set θ that makes the observed data x most probable. It achieves this by maximizing the likelihood function P(X|θ).

However, directly maximizing the product of probabilities, which are typically less than 1, can be impractical due to computational underflow – a condition where numbers become too small to be represented accurately. To overcome this, we use logarithms, transforming the product into a sum. Since the logarithm function is monotonically increasing, maximizing a function is equivalent to maximizing its logarithm. Thus, the MLE formula often involves the sum of the logarithms of probabilities.

On the other hand, Maximum A Posteriori (MAP) estimation aims to maximize the posterior probability. Applying Bayes’ theorem, we see that maximizing the posterior is equivalent to maximizing the product of the prior probability P(θ) and the likelihood. Like in MLE, we introduce logarithms to simplify computation, converting the product into a sum.

The primary distinction between MLE and MAP lies in the inclusion of the prior P(θ) in MAP. This addition means that in MAP, the likelihood is effectively weighted by the prior, influencing the estimation based on our prior beliefs about the parameters. In contrast, MLE does not include such a prior and focuses solely on the likelihood derived from the observed data.
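Putting the two objectives side by side in log form (again assuming i.i.d. data, as a sketch), the only difference is the log-prior term:

```latex
\hat{\theta}_{\text{MLE}} = \arg\max_{\theta} \sum_{i} \log p(x_i \mid \theta)

\hat{\theta}_{\text{MAP}} = \arg\max_{\theta} \Bigl[ \sum_{i} \log p(x_i \mid \theta) + \log p(\theta) \Bigr]
```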

It seems like MAP might be superior to MLE. Why don’t we always opt for MAP then?

MAP estimation incorporates our pre-existing knowledge about the parameter distribution, but that doesn’t inherently make it superior to MLE. There are several factors to consider:

  • An assumption about the parameter distribution isn’t always available. In cases where the parameter distribution is assumed to be uniform, MAP and MLE yield equivalent results.
  • The computational simplicity of MLE often makes it a more practical choice. While MAP provides a comprehensive Bayesian approach, it can be computationally intensive.
  • MAP’s effectiveness heavily relies on the selection of an appropriate prior. An inaccurately chosen prior can lead to increased computational costs for MAP to identify an optimal set of parameters.

In our next session, our mentor-learner team will return to delve deeper into L1 and L2 regularization. Armed with a solid understanding of MLE and MAP, we’ll be able to view L1 and L2 regularization from a fresh perspective. Looking forward to seeing you in the next post!


Other posts in this series:


If you liked the article, you can find me on LinkedIn.

Reference:

MLE vs MAP: the connection between Maximum Likelihood and Maximum A Posteriori Estimation

Difference between Maximum Likelihood Estimation (MLE) and Maximum A Posteriori (MAP) – Amir Masoud…

What is the likelihood function, and how is it used in particle physics?

The post Courage to Learn ML: Decoding Likelihood, MLE, and MAP appeared first on Towards Data Science.

]]>