This place holds my updated content.
You can find many of my learning notes, shared readings, and more...
To see my OMSCS learning experience, please go to this page.
I hadn't caught up with what has evolved in the field of MLOps for a while, and then I encountered this nice article, which gives a quite comprehensive yet effective review of MLOps practices. I think any data scientist or machine learning engineer would benefit from reading this post (it is long, but clear and effective).
Here are my takeaways. They are more like organizational practices, and often blind spots, in ML development.
Three Vs of MLOps:
Success in MLOps hinges on three crucial factors: Velocity for rapid iteration and prototyping; Validation for early and proactive identification of errors; and Versioning for managing model and data versions to facilitate rollbacks and minimize downtime.
These are principles to follow in DS/ML project development. I really like this three-Vs slogan; it is accurate and strong, and I think every data scientist and machine learning engineer should remind themselves of it.
ML Engineering’s Experimental Nature:
Collaboration is key
Focus on data iteration
Account for diminishing returns
Small changes are preferable
The points "Collaboration is key" and "Focus on data iteration" resonate with me a lot. Data science and ML projects are naturally dynamic (sometimes even chaotic). And one has to really understand what "data-driven" means here. It means the value of the project stems from your data, not your models! That's why collaboration plays a big role here. You have to loop in business people (for the business needs) and domain experts (for a deep understanding of the data). One quick discussion with these people will probably save you many days of crunching the data in the dark.
Of course, there is a lot more to say about what makes DS/ML projects succeed. How do you iterate on your ML pipeline? How do you put an effective monitoring plan in place? You can find more discussion of these aspects in the article.
Note that this article is based on the insights of the publication Operationalizing Machine Learning: An Interview Study.
It is amazing to see how the field of Transformer models has evolved in the past three years. Well, now Large Language Model (LLM) is actually the buzzword in town.
Among all of the tremendous emerging technologies, models, and AI talks, perhaps Reinforcement Learning from Human Feedback (RLHF) is the most important idea that has enabled this LLM boom.
I encountered this nice post by Chip Huyen when I tried to catch up quickly on LLMs, more specifically the RLHF idea. It is a very thorough yet technically friendly post, which presents an overview of the current LLM development process and clearly explains RLHF and its magic. I definitely love it and recommend it. Some excerpts and my notes are as follows.
Reward model (RM)
It uses comparison data for training: (prompt, winning_response, losing_response).
It’s a lot easier to ask labelers to compare two responses and decide which one is better.
...
Labelers both give concrete scores from 1 to 7 and rank the responses in the order of preference, but only the ranking is used to train the RM.
...
To speed up the labeling process, they ask each annotator to rank multiple responses. 4 ranked responses, e.g. A > B > C > D, will produce 6 ranked pairs, e.g. (A > B), (A > C), (A > D), (B > C), (B > D), (C > D).
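If I read the setup correctly, the RM is then trained on each (winning, losing) pair to score the winner above the loser. As a quick sanity check on the pair counting, here is a tiny plain-Python sketch (the response labels are hypothetical):

```python
from itertools import combinations

# 4 responses ranked best-to-worst by one annotator.
ranking = ["A", "B", "C", "D"]

# Every (earlier, later) pair preserves the preference order,
# so one ranking of 4 responses yields C(4, 2) = 6 training pairs.
pairs = list(combinations(ranking, 2))
print(pairs)  # [('A', 'B'), ('A', 'C'), ('A', 'D'), ('B', 'C'), ('B', 'D'), ('C', 'D')]
```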
Finetuning using the reward model
Essentially, at this step, we do reinforcement learning to improve the SFT model.
In this phase, we will further train the SFT model to generate output responses that maximize the scores given by the RM. Today, most people use Proximal Policy Optimization (PPO), a reinforcement learning algorithm released by OpenAI in 2017.
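As I understand it from the InstructGPT paper, the quantity maximized at this step is roughly the RM score minus a KL penalty that keeps the RL policy from drifting too far from the SFT model (my paraphrase, with β as the penalty coefficient):

$$\mathrm{objective}(\phi) = \mathbb{E}_{x \sim D,\; y \sim \pi_\phi(\cdot \mid x)}\left[\, r_\theta(x, y) - \beta \log \frac{\pi_\phi(y \mid x)}{\pi^{\mathrm{SFT}}(y \mid x)} \,\right]$$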
A great diagram explains the SFT and RLHF steps for InstructGPT (see the image at the end).
Suggested reading
"Reinforcement Learning for Language Models" by Yoav Goldberg, April 2023.
A great diagram that explains the SFT and RLHF for InstructGPT (source: RLHF post by Chip Huyen)
I recently resumed my Java learning. To be fair, I am not a beginner, because I started learning Java two years ago; I finished the "Object Oriented Programming in Java Specialization" on Coursera, and I read Java materials and did coding exercises now and then along the way. But surprisingly, I felt an absolute boost in my Java knowledge after taking the course "The Ultimate Java Mastery Series", and the reasons can be summarized as:
The course is so well structured that it covers all the core Java knowledge in a concise yet effective way.
The entire series has three parts: Fundamentals; Object-Oriented Programming; Advanced Topics. The instructor organized the topics in a very friendly way, so you build a framework of Java knowledge easily. Meanwhile, you can seamlessly get to the advanced and difficult topics without struggling too much with the learning curve.
The instructor introduces and explains important Java concepts by providing clear, real-world contexts and scenarios.
The instructor uses simple but non-trivial examples to introduce key Java concepts, such as interfaces and generics. It's more fun and easier to grasp those concepts when you see them in action in real-world scenarios.
I enjoyed Part 2, where the instructor tore down the mortgage calculator, which was first built in a procedural programming fashion, and refactored it into an OOP design step by step. I now have a deeper understanding of building loosely coupled applications.
Other very useful topics are lambda expressions, streams, and concurrency, which I hadn't had the chance to study closely before. Some Java code I encountered in the past now makes more sense; it was obscure to me because I lacked knowledge of features such as lambda expressions.
In comparison, it came to my awareness that courses (programming courses, at least) on Coursera are not necessarily the best choice for learning programming. Much of the time, they are less practical, and the content is out of date. Now when I look back at the Java Specialization courses created by Duke University, they are less effective even though they are good courses (I hate to say this, but it is really kind of a waste of time to use the BlueJ IDE they adopted for the courses; a terrible IDE from the dinosaur age).
YouTube: https://youtu.be/06-AZXmwHjo
MLOps is a relatively nascent field in which many methodologies and tools still need to be defined and developed. Similar to DevOps in software application development, MLOps is an important piece of ML application development, but it is trickier because we have to tackle both code and data. In his talk, Andrew Ng advocated the shift from model-centric to data-centric AI. One of MLOps's major responsibilities is to ensure data quality through the lifecycle of ML model development and, more importantly, to do so in a systematic way. The phases of ML model development include project scoping, data collection, model training, and deployment in production. He shed some light on MLOps by addressing data-related tasks in the different phases:
Data collection:
Consistent labeling;
Collect more data, or better prepare the available data (it's actually surprising that improving the currently available data is usually easier than collecting more data, yet people always try to collect more data first!)
Model training:
Error analysis;
Iterative improvement.
Deployment in production:
Systematically check for concept drift and data drift (performance degradation); see the sketch after this list.
Feed new data back for continuous refinement of models.
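The talk stays at the level of principles, but as a minimal sketch of what a systematic data-drift check could look like, here is a per-feature two-sample Kolmogorov–Smirnov test using scipy (detect_feature_drift is a hypothetical helper, and the synthetic data is for illustration only):

```python
import numpy as np
from scipy.stats import ks_2samp

def detect_feature_drift(train_values, prod_values, alpha=0.05):
    """Flag drift on one numeric feature with a two-sample KS test."""
    _, p_value = ks_2samp(train_values, prod_values)
    return p_value < alpha, p_value

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 5000)   # feature distribution at training time
prod = rng.normal(0.3, 1.0, 5000)    # same feature in production, slightly shifted

drifted, p = detect_feature_drift(train, prod)
print(f"drift detected: {drifted} (p = {p:.2e})")
```

In practice, one would run such a check per feature on a schedule and alert when several features drift at once.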
It will also be interesting to anticipate what roles will be responsible for MLOps in the future. Should it be the machine learning engineer, the data scientist, the DevOps engineer, or the domain expert? Finally, there are barely any good resources/courses on MLOps; the industry is still exploring it! Let's keep an eye on this and hope the landscape takes shape soon. Personally, I think what is missing in his talk is how to manage the ML application's lifecycle in a more systematic way, not only for data quality. We really need best practices that put all the pieces from the different phases in one central place for the ML application, so we have a clear picture of the roadmap (code + data, what iteration we are in, and what direction we are heading). A lot to learn and a lot to look forward to.
Happy New Year!
2020 was an extraordinary year for the history books. Now we enter 2021. As an ML enthusiast, I will continue pursuing my career goal in this field, and the way to do it is: KEEP LEARNING! This year, I wish to read 10 books on machine learning/data science/advanced statistics/programming, and to occasionally read 2 papers a month to catch up with the latest research. I haven't made my reading list yet, but here are some topics that I will definitely look into:
Reinforcement learning;
Causality in ML (The Book of Why by Judea Pearl and Dana Mackenzie);
Time Series Analysis;
Graphical models;
More software engineering.
Scikit-learn has nice summaries of feature selection. One of the most straightforward and intuitive methods is univariate feature selection. To some degree, this is related to testing the dependency between a predictor (feature) and the target variable, or, in a more general sense, testing the dependency between two variables.
Let's first look at how we can check the dependency (or correlation) between two variables. See the table below.
Some notes:
Cramér's V is based on Pearson's chi-squared test, and it is symmetric;
Theil's U (also called uncertainty coefficient) is based on conditional entropy.
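For my own reference, here is a minimal sketch of Cramér's V built on scipy's chi-squared test (cramers_v is my own helper name; pandas is used only to build the contingency table):

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    """Cramér's V between two categorical series; symmetric, in [0, 1]."""
    table = pd.crosstab(x, y)                  # contingency table of counts
    chi2, _, _, _ = chi2_contingency(table)
    n = table.to_numpy().sum()                 # total number of observations
    r, c = table.shape
    return np.sqrt(chi2 / (n * (min(r, c) - 1)))
```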
Now let's check what univariate feature selection methods are provided by Scikit-learn.
For regression: f_regression, mutual_info_regression
For classification: chi2, f_classif, mutual_info_classif
The documentation doesn't really give very clear instructions on how you should use them under different circumstances. To summarize:
Regression
f_regression. This method fits a univariate linear regression between each feature and the target and applies an F-test based on the reduction in error. I am not sure it can be used for both continuous and categorical variables. And if it could be used for categorical features, should we one-hot-encode the feature first? (see more discussion here)
mutual_info_regression. To be discussed later.
Classification
chi2. From the documentation, it's quite clear you can only use it for discrete variables: "X, which must contain only non-negative features such as booleans or frequencies (e.g., term counts in document classification), relative to the classes". This matches the definition of Pearson's chi-squared test. So I guess if you use this method for a categorical feature, you need to do one-hot encoding (or index encoding) first.
f_classif. This can be confusing. It took me a while to realize that the data instances are grouped by class, and an F-score is then examined (ANOVA F-test). Therefore, you can only apply this method to continuous variables; or, if you really want to use it for categorical variables, you need to do OHE first. Also, this method examines the features one by one. (reference: here)
mutual_info_classif. I would say this is the most clearly documented one. You can apply it to both continuous and categorical features. For continuous features, it uses neighboring data points as an approach to discretizing the data; there is a control parameter, n_neighbors. When applied to a continuous target variable (i.e., mutual_info_regression), the idea is the same, since mutual information uses the distribution of the variable, here p(y).
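To make the API concrete, here is a minimal usage sketch of SelectKBest with mutual_info_classif on the iris dataset (the choice of k=2 is just illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)

# Keep the 2 features with the highest estimated mutual information with y.
selector = SelectKBest(score_func=mutual_info_classif, k=2)
X_selected = selector.fit_transform(X, y)

print(selector.scores_)   # per-feature MI estimates
print(X_selected.shape)   # (150, 2)
```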
I think the following paragraph from a Kaggle tutorial gives a good summary:
I prefer the mutual_info_* tests because they "just work" out of the box. These are non-parametric tests which use k-nearest neighbors to measure the scale of the relationship between the predictor values and the target variable. The F-test statistics only find linear relationships, making them only appropriate for performing linear regression, and the chi-squared test requires that the data be appropriately scaled (which it rarely is) and non-negative (which it only sometimes is).
Of course, every time you are not sure, the best way is always to check the source code and spend your time figuring it out. Feel free to check the scikit-learn source code for univariate feature selection.
As a side note, I find that comparing the distributions of variables is somewhat related but actually a different topic. Here is a great post about how to check whether two distributions are similar or different.
In the past few weeks, I spent a considerable amount of time picking up Java and working on the coursework in the Coursera Specialization. There is a mini project in one of the courses, "Object Oriented Programming in Java", in which you create a GUI to display earthquakes on a map. The map is interactive, so it responds to certain user events and smartly visualizes the earthquake events.
It's a great experience for learning some important concepts in Object-Oriented Programming: inheritance, polymorphism, etc. And moreover, you make an applet! A project is the best way to understand package management and some basic paradigms of software engineering. In the end, you are asked to work on your own extension to the applet. I was really excited about this, because earthquakes are exactly what I studied in my Ph.D. (to be more precise, seismology). I built a feature that enables the user to select a specific region and download the earthquake event information in that region. Believe me, this is very helpful, because in seismology researchers are usually only interested in earthquakes in a specific region, and they have to write queries or scripts to download the event information they need. I think this is pretty cool, so I even made a demo video for it (although I know there are already some Python packages providing similar functionality, e.g., ObsPy). Feel free to check the YouTube video and my code repository for more details!
After postponing this for several days, I finally started taking a look at the transformers package and checked the tutorial about the GPT-2 model. Although I already knew that GPT-2 is very successful as a generative language model and very powerful in terms of generalizability, I was still impressed after reading the original GPT-2 paper. The goal of GPT-2 to adapt to the zero-shot setting is very ambitious, and indeed I believe that's the right direction for machine learning on NLP problems. But of course, some fine-tuning is always desired to achieve better performance.
With a certain amount of knowledge of the GPT model, I think it's a good idea to go through some examples to better understand it. The transformers package is of particular interest to me, so I started with its tutorial example on GPT-2. The example is about text generation, using TFGPT2LMHeadModel (built on top of TFTrainer). To be fair, the example is quite simple: all you need to do is load the pre-trained model and provide a short sequence of words for the model to start with. The model will then generate text using different decoding methods. I find this walk-through of the different decoding methods very informative and useful. I assume most people, like me, already know Greedy Search and Beam Search. But this tutorial also introduces other, more recent and more powerful decoding methods. To list all of them:
Greedy Search
Beam Search
Sampling
Top-K Sampling
Top-P (Nucleus) Sampling
The idea of Sampling is very straightforward: you just randomly pick the next word according to its conditional probability distribution. But there can be an issue of incoherence, because some very unlikely words can be picked, although this can be alleviated by lowering the so-called temperature. Top-K Sampling is one simple but powerful scheme for addressing this issue.
"In Top-K sampling, the K most likely next words are filtered and the probability mass is redistributed among only those K next words."
Top-P (Nucleus) Sampling takes one step further. "It chooses from the smallest possible set of words whose cumulative probability exceeds the probability p". What an elegant solution!
And of course, one can use Top-K Sampling and Top-P (Nucleus) Sampling together, which can "avoid very low ranked words while allowing for some dynamic selection". And you can always get multiple sample outputs and pick the most human-like one (cherry-picking)!
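Along the lines of the tutorial, here is a minimal sketch that combines Top-K and Top-P sampling via the transformers generate API (the prompt and parameter values are just illustrative):

```python
from transformers import GPT2Tokenizer, TFGPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = TFGPT2LMHeadModel.from_pretrained("gpt2")

input_ids = tokenizer.encode("I enjoy walking with my cute dog", return_tensors="tf")

# Sample with Top-K and Top-P combined; return several candidates to cherry-pick.
outputs = model.generate(
    input_ids,
    do_sample=True,
    max_length=50,
    top_k=50,
    top_p=0.95,
    num_return_sequences=3,
)

for i, output in enumerate(outputs):
    print(f"--- sample {i} ---")
    print(tokenizer.decode(output, skip_special_tokens=True))
```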
It was a delightful experience playing with this tutorial. I hope you will check it out too!
Transformer:
GPT-2: a small version (pytorch implementation) is available in Transformer
minGPT: a PyTorch re-implementation of GPT training
BERT (pytorch):
More...
ULMFiT (Universal Language Model Fine-Tuning method)
ELMo (Embeddings from Language Models)
BERT (Bidirectional Encoder Representations from Transformers)
GPT by OpenAI
Blog: https://openai.com/blog/language-unsupervised/
Blog: https://openai.com/blog/better-language-models/
Blog: https://openai.com/blog/openai-api/
https://www.topbots.com/ai-nlp-research-pretrained-language-models/
I encountered this wonderful blog when I recapped my knowledge of the Transformer. With extensive cartoon visualizations, this blog illustrates many important Transformer concepts in a crystal-clear way.
Self-attention:
Query, Key, and Value (see the sketch after this list)
Multi-head attention
Positional Encoding
Encoder-decoder attention
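To pin down the Query/Key/Value idea, here is a minimal numpy sketch of single-head scaled dot-product attention (the shapes are illustrative; real Transformers compute Q, K, V with learned projections):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # each query scored against each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # weighted mix of the values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 64))  # 4 tokens, d_k = 64
K = rng.normal(size=(4, 64))
V = rng.normal(size=(4, 64))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 64)
```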