2021/12/30 Thursday
Machine Learning: K-Means Mathematics
This post covers the mathematics behind K-Means. Specifically, it covers the following: Cost Function, Optimization, Initialization, How to Choose the Number of Cluster...
2021/12/29 Wednesday
The K-Means algorithm is one of the most widely used clustering methods in practice. It is categorized as unsupervised learning, which learns from unlabelled data instead of from...
2021/12/28 Tuesday
This post gives an overview of the different types of machine learning, as well as the notation and terminology used.
2022/06/15 Wednesday
Decision tree models can be used for both regression and classification tasks. They are the building blocks of popular ensemble models such as random forests.
2022/06/13 Monday
This post covers Python environment and package management with different tools on macOS. Specifically, it covers the following tools: venv, Anaconda, and Miniforge.
2022/06/09 Thursday
Linear models provide simple and fast baselines for more complicated models. When the number of features is large, it may be hard for more complex models to beat linear models.
2022/06/08 Wednesday
This post covers popular loss functions used in machine learning and deep learning models.
2022/06/16 Thursday
Bagging is a general strategy that can work with any base model, such as linear models or decision trees.
2022/01/08 Saturday
This post covers the mathematics behind the Support Vector Machine (SVM). Specifically, it covers the following: Margin and Support Vector
2022/01/06 Thursday
The Support Vector Machine (SVM), also called a max-margin classifier, is a very popular supervised learning algorithm. It can handle linear or nonlinear classification, regression as well as ou...
2022/01/01 Saturday
This post covers data preprocessing steps for K-Means Clustering. Specifically, it covers the following:
2022/12/30 Sunday
Welcome to my blog. My name is Joye. If you can read Chinese, my Chinese name is 李拙 (LI ZHUO). 🤔 Want to know more about me? Check the About tab for my work experiences ...
2022/06/13 Monday
This post covers Python environment and package management using different tools for macOS. Specifically, it discusses the following tools: venv, Anaconda, and Miniforge.
2022/04/15 Friday
This post gives an overview of NLP.
2020/04/25 Saturday
For a given word $k$, the ratio $\frac{P_{ik}}{P_{jk}}$ of its probabilities of appearing in two different contexts $i$ and $j$ can be used to distinguish its meaning.
2020/04/05 Sunday
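Gradient Tree Boosting is a very effective method for classification problems in machine learning, and is often applied to ad click-through rate prediction and machine learning competitions. In 2014, building on the traditional gradient tree boosting model, the author proposed XGBoost and released the corresponding toolkit. Thanks to its fast computation and good model performance, XGBoost is widely used in data competitions of all kinds, including store sales prediction, web text classification, click-through rate, ...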
2020/03/12 Friday
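The paper proposes the Word Mover’s Distance (WMD) for computing the distance between documents. The distance between two documents is treated as a weighted average of the word-to-word distances between them. Word-to-word distances can be computed from word vectors obtained via word embeddings; the mapping between the words of the two documents is the decision variable, and the objective is to minimize the distance between the documents. The minimum document distance found is the Word Mover’s Distance. This optimization problem is an Earth’s M...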
2020/01/30 Thursday
This project aims to scrape the location, price, layout, and other information of second-hand housing listings in Chengdu.
2019/12/25 Wednesday
In the previous section we covered how to use Scrapy's item pipelines. This section introduces how to use middleware.
2019/11/16 Sunday
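In the previous section, we covered the Scrapy framework, its installation, and its basic usage. It mentioned that the main roles of the item pipeline include cleaning and validating data, checking for and removing duplicates, and storing data in a database. This section explains how to use Scrapy's item pipelines.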
2019/11/17 Sunday
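Earlier sections mainly used Requests to scrape web pages. For small-scale crawlers, Requests meets the need effectively, but large-scale, multithreaded crawling calls for Scrapy. This section covers the basic architecture of the Scrapy crawler, its installation, and its basic usage.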
2019/08/11 Sunday
Using Taobao product data as an example, this section explains how to scrape web data with Selenium.
2019/08/11 Sunday
Using the people Joyenjoye follows as an example, this section explains how to scrape pages loaded via Ajax or JavaScript.
2019/08/11 Sunday
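Using Douban short reviews of The Little Prince as an example, this section gives a first look at web scraping from three angles: 1 data acquisition 2 page parsing 3 data storage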
2019/08/11 Sunday
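This section gives a comprehensive introduction to web crawlers in three parts: 1 the definition and application scenarios of crawlers 2 crawler basics 3 crawler protocols.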