Joyenjoye

Welcome to Joyenjoye 🤗

2022/12/30 Sunday

Welcome to my blog. My name is Joye. If you can read Chinese, my Chinese name is 李拙.

🤔 Want to know more details about me?

Check the About tab for my work experience and skill set.
Check the Hobbies tab for my hobbies outside work.

🤔 What can you find here?

Learnings from the day-to-day work of a data scientist in the PR and marketing industry.
Summaries of papers and methodologies in machine learning, deep learning, and NLP.

Machine Learning: Decision Tree Models

2022/06/15 Wednesday

Decision tree models can be used for both regression and classification tasks. They are the building blocks of popular ensemble models such as random forests.

Python Environment Management

2022/06/13 Monday

This post covers Python environment and package management on macOS. Specifically, it covers the following tools: venv, Anaconda, and Miniforge.

Machine Learning: Linear Models

2022/06/09 Thursday

Linear models provide simple and fast baselines for more complicated models. When the number of features is large, more complex models may struggle to beat linear models.

Machine Learning: Loss Functions

2022/06/08 Wednesday

This post covers popular loss functions used in machine learning and deep learning models.

NLP Materials

2022/04/15 Friday

This post collects some overview materials on NLP.

Machine Learning: SVM Mathematics

2022/01/08 Saturday

This post covers the mathematics behind the Support Vector Machine (SVM). Specifically, it covers the following:

Margin and Support Vector

The Support Vector Machine (SVM), also called the max-margin classifier, is a very popular supervised learning algorithm. It can handle linear or nonlinear classification, regression, and outlier detection[2]. SVMs are particularly well suited to classifying complex but small- or medium-sized datasets[1].

Machine Learning: Prepare Data for K-Means Clustering

2022/01/01 Saturday

This post covers data preprocessing steps for K-Means Clustering. Specifically, it covers the following:

Machine Learning: K-Means Mathematics

2021/12/30 Thursday

This post covers the mathematics behind K-Means. Specifically, it covers the following:

Cost Function
Optimization
Initialization
How to Choose the Number of Clusters
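
The four topics above can be tied together in a small sketch. This is a minimal, standard-library-only illustration, not the code from the post: it assumes the usual K-Means cost (mean squared distance from each point to its assigned centroid), Lloyd-style alternating optimization, and random initialization repeated a few times. The function names, the toy `points`, and the `n_init` and `seed` parameters are assumptions made for illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    """Assignment step: index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
        for p in points
    ]


def cost(points, centroids, labels):
    """Cost function: mean squared distance to the assigned centroid."""
    total = sum(squared_distance(p, centroids[l]) for p, l in zip(points, labels))
    return total / len(points)


def update(points, labels, centroids):
    """Update step: move each centroid to the mean of its assigned points."""
    new_centroids = []
    for c, old in enumerate(centroids):
        members = [p for p, l in zip(points, labels) if l == c]
        if not members:
            new_centroids.append(old)  # empty cluster: keep the old centroid
        else:
            new_centroids.append(
                tuple(sum(col) / len(members) for col in zip(*members))
            )
    return new_centroids


def kmeans(points, k, n_init=5, max_iter=100, seed=0):
    """Run K-Means from several random initializations; keep the lowest cost."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_init):
        centroids = rng.sample(points, k)  # initialization: k distinct data points
        for _ in range(max_iter):
            labels = assign(points, centroids)
            new_centroids = update(points, labels, centroids)
            if new_centroids == centroids:  # converged: centroids stopped moving
                break
            centroids = new_centroids
        labels = assign(points, centroids)
        run_cost = cost(points, centroids, labels)
        if best is None or run_cost < best[0]:
            best = (run_cost, centroids, labels)
    return best


# Toy data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
best_cost, centroids, labels = kmeans(points, k=2)
print(best_cost, centroids, labels)
```

Running the same routine for several values of `k` and comparing the resulting cost is one common starting point for choosing the number of clusters (the elbow heuristic).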

Machine Learning: K-Means Overview

2021/12/29 Wednesday

The K-Means algorithm is one of the most widely used clustering methods in practice. It is an unsupervised learning method: it learns from unlabelled rather than labelled data and tries to find the “structure” or “pattern” in the data. As a clustering algorithm, it aims to automatically group the data into coherent clusters[1]. Typical use cases include customer segmentation[2], social network analysis, and document clustering.

Machine Learning: Overview

2021/12/28 Tuesday

This post gives an overview of the different types of machine learning, along with the notation and terminology used.

Paper Summary: GloVe - Global Vectors for Word Representation

2020/04/25 Saturday

For a given word $k$, the ratio $\frac{P_{ik}}{P_{jk}}$ of the probabilities of it appearing in two different contexts $i$ and $j$ can be used to distinguish its meaning.
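
A toy calculation can make this ratio concrete. The co-occurrence counts below are made up for illustration (they are not from the paper or the post), and `cooccurrence_probability` and `probability_ratio` are hypothetical helper names. The sketch assumes $P_{ik} = X_{ik} / X_i$, where $X_{ik}$ counts how often word $k$ appears in the context of word $i$ and $X_i$ is the total count for context $i$.

```python
# Hypothetical co-occurrence counts X[i][k]: how often word k appears
# in the context of word i. The numbers are invented for illustration.
X = {
    "ice":   {"solid": 8, "gas": 1, "water": 6, "fashion": 1},
    "steam": {"solid": 1, "gas": 8, "water": 6, "fashion": 1},
}


def cooccurrence_probability(X, i, k):
    """P_ik = X_ik / X_i, where X_i is the total count in the context of i."""
    return X[i][k] / sum(X[i].values())


def probability_ratio(X, i, j, k):
    """Ratio P_ik / P_jk for a probe word k and two context words i and j."""
    return cooccurrence_probability(X, i, k) / cooccurrence_probability(X, j, k)


# Compare the contexts "ice" and "steam" for every probe word.
ratios = {k: probability_ratio(X, "ice", "steam", k) for k in X["ice"]}
for k, r in ratios.items():
    print(f"P(ice, {k}) / P(steam, {k}) = {r:.3f}")
```

A ratio well above 1 marks a word related to `ice` but not `steam`, a ratio well below 1 marks the reverse, and a ratio near 1 marks a word related to both or to neither.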

Paper Summary: XGBoost - A Scalable Tree Boosting System

2020/04/05 Sunday

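Gradient tree boosting is a highly effective machine learning method for classification problems. It is often applied to ad click-through rate prediction and in machine learning competitions.

In 2014, building on the traditional tree boosting model, the authors proposed XGBoost and released the corresponding toolkit. Because of its fast computation and strong model performance, XGBoost has been widely used in data competitions, including store sales prediction, web text classification, click-through rate prediction, and product categorization. The paper was published two years later at KDD 2016.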

Paper Summary: From Word Embeddings to Document Distance

2020/03/12 Friday

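The paper proposes the Word Mover's Distance (WMD) for measuring the distance between documents. The distance between two documents is treated as a weighted average of word-to-word distances. Word-to-word distances can be computed from the word vectors produced by a word embedding. The mapping between the words of the two documents is the decision variable, and the objective is to minimize the distance between the documents. The minimum document distance found this way is the Word Mover's Distance. This optimization problem is a special case of the Earth Mover's Distance, so the corresponding algorithms can be used to solve it.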
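
The optimization described above can be written out as follows. The notation is an assumption chosen for illustration rather than taken from the post: $d_i$ and $d'_j$ are the normalized word weights of the two documents, $x_i$ is the embedding of word $i$, $c(i, j)$ is the distance between two words, and $T_{ij} \ge 0$ is how much of word $i$ in the first document is mapped to word $j$ in the second.

$$
\begin{aligned}
\min_{T \ge 0} \quad & \sum_{i=1}^{n} \sum_{j=1}^{n} T_{ij}\, c(i, j), \qquad c(i, j) = \lVert x_i - x_j \rVert_2 \\
\text{subject to} \quad & \sum_{j=1}^{n} T_{ij} = d_i \quad \text{for all } i, \\
& \sum_{i=1}^{n} T_{ij} = d'_j \quad \text{for all } j.
\end{aligned}
$$

Because the weights of each document sum to 1, this has the form of a transportation problem, which is why solvers for the Earth Mover's Distance apply.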