I'm back in the Natural Language Processing Game
My Journey to Large Language Models: From Recommenders to GenAI
#beyondAI
After years of studying Natural Language Processing (NLP) as a computer science student, I'm thrilled to be back in the field, this time as an AI Product Manager with a special focus on Generative AI (GenAI).
I wrote my last paper on NLP in 2016, a year before the groundbreaking Attention Is All You Need paper was published. Back then, my biggest challenge was getting my system to understand language at all, and finding ways to label large datasets for training was another major hurdle.
Now, everything seems much easier.
But let me first start by sharing with you what motivates me today as an AI Product Manager to dive back into NLP.
You might have seen my announcement on LinkedIn in May 2024 that I'm focusing on some major AI initiatives. One of these is organizational, but the other is purely AI product-focused. With everyone becoming a GenAI Product Manager these days, I thought I should join in too 😁.
Fortunately, I received a fantastic offer from Vodafone Germany's CIO, who believed I would be a great fit for his team.
Now, three months into this role, I can confidently say it was one of the best decisions I've made, and I hope my new team feels the same way.
So far so good.
Before, I focused mainly on building Recommender & Predictor Systems as AI products. Now, my focus has shifted to GenAI products. I quickly needed not only to understand at a high level how Large Language Models work (as a heavy user, I already knew their capabilities) but also to grasp the details, since we were about to build our own adapted models.
Understanding the potential of modern NLP, such as LLMs, would enable me to address user pain points quickly in my new role and to identify further improvements together with our AI developers. To do that effectively, however, I realized I first needed to understand the foundations of LLMs.
Thanks to my previous studies in NLP, I quickly grasped the advancements in today's technologies and understood why they are so powerful. The challenges I faced back then have largely been solved, particularly by models that capture the context of an entire text rather than looking at words in isolation. That capability is what makes modern NLP systems so effective.
You might have already guessed it—I plan to start a series about Large Language Models, but this time from an AI Product Manager's perspective rather than a mathematical or architectural one.
But first, I had to relearn the foundations myself. That’s why I also felt the need to dive back into my old studies and papers to see how things have changed.
For today's issue, since I became quite nostalgic reading through my papers, I want to share the Introduction & Motivation text of my thesis. I think it's a great starting point for this series on LLMs and will allow us to compare it to today's technological progress.
I hope you enjoy this introduction to my work on "The Machine Learning Genome 🧬" – maybe not as much as I did, but I believe you'll find it valuable too:
Introduction (The Machine Learning Genome)
As he does every year, Mark Zuckerberg outlined his personal goals at the beginning of 2016 on his Facebook page: "My personal challenge for 2016 is to build a simple AI to run my home and help me with my work. You can think of it kind of like Jarvis in Iron Man."
In the Hollywood movie Iron Man, Tony Stark, an eccentric billionaire and tech genius, gives commands to a computer system named Jarvis, which follows and executes them precisely. Considering that Iron Man also appears in The Avengers, where he fights alongside Captain America, the Hulk, and other superheroes, one might think Zuckerberg's goal of creating "a simple AI [...] like Jarvis in Iron Man" is more fiction than science. However, projects like IBM's Watson demonstrate that it is indeed possible to build at least a light version of Jarvis with today's technology. Since Watson won the famous quiz show Jeopardy! in 2011 against the best human players, it has gone on to achieve remarkable feats in other, far more consequential areas. Watson now answers oncology questions and enables personalized therapy options for patients who could otherwise not afford them. Its effectiveness was demonstrated in a test at the University of North Carolina School of Medicine: given 1,000 cancer patient cases, it not only suggested the same treatments as the doctors but also offered alternative treatment options, and it even pointed to a newly discovered treatment method in a research paper that the doctors were unaware of.
It's clear that Mark Zuckerberg defined a realistic, though ambitious, project for 2016, and it also shows how topical intelligent computer systems have become.
What's special about these systems is their ability to research and acquire knowledge from unstructured data largely on their own, needing only initial support from domain experts. They read human-produced content and structure the knowledge in databases at a speed no human can match, and in terms of accuracy they can rival or even surpass humans. These databases, also known as knowledge bases, are the core of any intelligent system. Before such a system can answer questions about a specific topic or science, expert knowledge from millions of documents must be evaluated, processed, and analyzed. This is partly done with the help of experts who guide the system, point out errors, and teach it what is relevant in a particular domain.
In our project, The Machine Learning Genome, we are entering a new field to enable intelligent systems to answer questions in the area of machine learning.
Motivation (The Machine Learning Genome)
In 1957, Ray Solomonoff published "An Inductive Inference Machine," one of the first papers on machine learning. Since then, countless research papers have appeared in this field. Searching for "Machine Learning" on Google Scholar returns 3.68 million results (as of October 2016), and Google Trends suggests that many more will follow. Most of today's knowledge about machine learning exists in digital form but is hidden in texts, diagrams, and tables. There is no database for machine learning in which this knowledge is structured and publicly accessible. Although search engines like Google Scholar and CiteSeer have made it easier to find scientific documents, researchers still struggle with simple questions such as "Which algorithms were compared with which others?" or "When was algorithm X first mentioned?" Such questions still require tedious manual research and can be very time-consuming. While machine learning methods help structure and analyze other sciences, it seems all the more important to have solutions that organize the knowledge about machine learning itself. With the Machine Learning Genome, we aim to create a computer system that consolidates this knowledge in one place, allowing research to progress faster by reducing search time and by uncovering undiscovered knowledge through independent reasoning.
As mentioned in the introduction, the long-term goal of the Machine Learning Genome project is to create a Watson-like system for machine learning, and this work lays the groundwork for it.
Objective of the Work (The Machine Learning Genome)
The overarching goal of the Machine Learning Genome project is to create a system that contains knowledge about machine learning and autonomously expands it using Stanford's knowledge base construction framework DeepDive (which no longer exists today). This work lays the foundation by automatically extracting initial knowledge about machine learning from scientific documents and storing it in a database.
Machine learning information is primarily derived from scientific papers. As a preparatory step, it is sensible to familiarize oneself with the technical language of machine learning and teach the system certain technical terms. This work specifically addresses the recognition of machine learning algorithms and then examines whether the recognized algorithms were compared with each other. In this context, recognizing a comparison relation between algorithms means determining whether algorithms were examined and measured against each other for the same task. It does not matter which algorithm performed better, only whether they were compared. The goals of this work are summarized as follows:
Recognize machine learning algorithms within natural language texts.
Extract algorithms that have been compared with each other.
This is one output of my work:
Connected graph (figure from the thesis): a section of the algorithm mentions linked to one another by the comparison relation.
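To make the task concrete before jumping ahead: the following is a minimal, present-day sketch in Python, not the DeepDive pipeline from the thesis. It uses a hypothetical hard-coded list of algorithm names and a few comparison cue words, purely to illustrate what "extract algorithms that have been compared with each other" means in practice.

import re
from itertools import combinations

# Hypothetical seed list of algorithm names; the thesis did not hard-code these.
ALGORITHMS = ["Random Forest", "SVM", "Gradient Boosting", "k-NN", "Naive Bayes"]

# A few surface cues that typically signal a comparison between methods.
COMPARISON_CUES = re.compile(
    r"\b(compared (with|to|against)|versus|vs\.?|outperform(s|ed)?)\b", re.IGNORECASE
)

def comparison_pairs(sentence: str):
    """Return unordered pairs of algorithms that co-occur with a comparison cue."""
    found = [a for a in ALGORITHMS if a.lower() in sentence.lower()]
    if len(found) >= 2 and COMPARISON_CUES.search(sentence):
        return list(combinations(found, 2))
    return []

print(comparison_pairs(
    "In our experiments, Random Forest was compared with SVM and Gradient Boosting."
))
# [('Random Forest', 'SVM'), ('Random Forest', 'Gradient Boosting'), ('SVM', 'Gradient Boosting')]

A real pipeline has to handle far more linguistic variation than these hand-written cues, which is exactly what made this a thesis-sized problem in 2016.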
A Reflection in 2024
Reflecting on my thesis and comparing it to today's technology, especially Large Language Models, it's incredible to see how far we've come and the challenges we've overcome.
In my thesis, a major hurdle was getting the system to understand language and finding solutions for labeling large datasets. This was tough due to unstructured data and the lack of advanced tools. Today, LLMs like OpenAI's GPT-4 excel at understanding and generating human-like text. They process vast amounts of data and grasp contextual relationships. What was once a significant hurdle is now much easier.
Back then, answering specific questions, like finding algorithm comparisons or pinpointing when an algorithm was first mentioned, required laborious manual research. Now, LLMs can understand complex queries and generate accurate responses. They sift through enormous datasets quickly, reducing the time and effort needed for such tasks.
Entity recognition, especially for specialized terms like machine learning algorithms, was another challenge. Modern NLP systems are much better at this. They can identify and categorize technical terms accurately, making it easier to build knowledge bases and perform precise language understanding tasks.
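To make that concrete: recognizing algorithm names as entities now takes only a few lines of open-source tooling. Here is a minimal sketch using spaCy's rule-based EntityRuler with a handful of hypothetical seed patterns; a production setup would layer a trained statistical or transformer-based model on top of such rules.

import spacy

# Blank English pipeline with a rule-based entity ruler; real systems would
# combine this with a trained NER model.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

# Hypothetical seed patterns for algorithm names.
ruler.add_patterns([
    {"label": "ML_ALGORITHM", "pattern": "Random Forest"},
    {"label": "ML_ALGORITHM", "pattern": "Support Vector Machine"},
    {"label": "ML_ALGORITHM", "pattern": "Gradient Boosting"},
])

doc = nlp("We compared Random Forest with a Support Vector Machine on the same task.")
print([(ent.text, ent.label_) for ent in doc.ents])
# [('Random Forest', 'ML_ALGORITHM'), ('Support Vector Machine', 'ML_ALGORITHM')]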
Data preparation involved converting and cleaning text from PDFs, which was a big task. Today's tools and techniques for data extraction and cleaning have vastly improved. Automated systems handle various formats efficiently, extracting and structuring data with minimal manual work. This streamlines the process, making it more efficient and less prone to errors.
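As a small illustration, here is roughly what that preparation step can look like today; a minimal sketch assuming the open-source pypdf library and a placeholder file name. Back then, this step alone meant a whole pipeline of conversion and cleanup scripts.

from pypdf import PdfReader

# Read a (hypothetical) paper and concatenate the extracted text of all pages.
reader = PdfReader("paper.pdf")
text = "\n".join(page.extract_text() or "" for page in reader.pages)

print(text[:500])  # first 500 characters, ready for tokenization and entity recognition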
Yes, things are much easier 🔥 But it's not just the advancements in model architecture that have driven this progress. The availability of a tremendous amount of diverse labeled datasets for all kinds of topics has been a game-changer. These rich datasets allow models to learn from a wide range of examples, improving their ability to understand and generate accurate, contextually relevant responses.
To demonstrate how these advancements have resolved the challenges mentioned in my thesis, let's pick up one of the questions: "Which algorithms were compared with which others?"
Let's ask an LLM today: Which algorithms have been compared with Random Forest in recent research?
Answer: In recent research, Random Forest has been compared with several other algorithms, including Support Vector Machines (SVM), Gradient Boosting Machines (GBM), k-Nearest Neighbors (k-NN), and Neural Networks. Studies often evaluate these algorithms based on performance metrics such as accuracy, precision, recall, and F1 score across various datasets. For instance, a common comparison found in literature is between Random Forest and Gradient Boosting Machines, where researchers examine their respective strengths in handling overfitting and interpretability.
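For the curious, the call behind such an answer is only a few lines today. The snippet below is a minimal sketch assuming the OpenAI Python client and the gpt-4o model name; treat both as placeholders for whatever provider and model you actually use.

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[
        {"role": "system", "content": "You are a research assistant for machine learning literature."},
        {"role": "user", "content": "Which algorithms have been compared with Random Forest in recent research?"},
    ],
)

print(response.choices[0].message.content)

Unlike the database-backed system from my thesis, the model answers from what it has seen during training, so for citable comparisons you would still want to ground it in retrieved papers.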
Well, an answer like this from the system I built back then would definitely have been worth a Paper of the Year award 🥇.
But it only got me an A and I am still proud of it 🥹
Humble but hungry ♥️
JBK 🕊️
P.S. If you’ve found my posts valuable, consider supporting my work. While I’m not accepting payments, you can help by sharing, liking, and commenting here or on my LinkedIn posts. This helps me reach more people on this journey, and your feedback is invaluable for improving the content. Thank you for being part of this community ❤️.