October 2018 ~ Malaysian Developer

Sunday, October 21, 2018

Lessons learned at an R conference (ConfeRence 2018, ADAX Malaysia)


I recently attended the R conference (ConfeRence) organized at ADAX Malaysia on October 20, 2018. I have an interest in AI, ML and semantic analysis. It started in 2008, when I pursued further studies focusing on program analysis, specializing in static analysis of C overflow vulnerabilities.

Here are the key points I would like to share, based on the knowledge shared by the local experts.

Ensemble Method

I'm quite new to this and had only just heard about it. The presenter shared that there are many methods, but the common ones are Bagging, Boosting and Stacking.

Bagging

Can produce a discrete or a continuous result
Discrete (classification) - combine by voting
Continuous (regression) - combine by taking the mean
Basically, the data is randomly sampled into multiple "bags" (with replacement, so the same record can land in more than one bag), and a model is trained on each bag. Finally, you combine the results of all the models, by vote or by mean, to produce the outcome.
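The bagging steps above can be sketched in a few lines. This is only a minimal illustration in plain Python rather than R, and it is not the presenter's code: the data, the simple line-fitting model and the names fit_line and bagged_predict are all made up for the example.

```python
import random
from statistics import mean

def fit_line(points):
    """Least-squares fit of y = a*x + b; returns (a, b)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:  # every x in this bag is identical: fall back to a flat line
        return 0.0, y_bar
    a = sum((x - x_bar) * (y - y_bar) for x, y in points) / sxx
    return a, y_bar - a * x_bar

def bagged_predict(data, x_new, n_bags=25, seed=0):
    """Train one model per bag and average their predictions (regression)."""
    rng = random.Random(seed)
    predictions = []
    for _ in range(n_bags):
        # A bag is a random sample of the data, drawn with replacement.
        bag = [rng.choice(data) for _ in data]
        a, b = fit_line(bag)
        predictions.append(a * x_new + b)
    return mean(predictions)

# Made-up data: y = 2x + 1 with a small alternating error added.
data = [(x, 2 * x + 1 + (0.5 if x % 2 else -0.5)) for x in range(10)]
prediction = bagged_predict(data, 4.0)
print(round(prediction, 2))
```

For a classification problem, the last step would take the most common predicted label (a vote) instead of the mean.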


Boosting

Models are trained one after another. Each round leaves some error, and the next model is trained to correct it. This continues until the error is small enough or a defined number of models has been reached.
Two famous algorithms: GBM (Gradient Boosting Machine) and XGBoost (eXtreme Gradient Boosting)
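To make the "train on the remaining error" idea concrete, here is a small sketch of gradient boosting for regression, written in plain Python rather than R. It is a toy version under my own assumptions (one-split "stumps" as the weak models, squared error, made-up data), and not how GBM or XGBoost are actually implemented.

```python
from statistics import mean

def fit_stump(xs, residuals):
    """Find the one-split stump that best fits the residuals.
    Returns (threshold, left_value, right_value)."""
    best = None
    for t in sorted(set(xs))[:-1]:  # candidate split points
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv, rv = mean(left), mean(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    return best[1:]

def boost(xs, ys, n_rounds=50, learning_rate=0.5):
    """Start from the mean, then repeatedly fit a stump to what is still wrong."""
    base = mean(ys)
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]  # current errors
        t, lv, rv = fit_stump(xs, residuals)
        stumps.append((t, lv, rv))
        preds = [p + learning_rate * (lv if x <= t else rv)
                 for x, p in zip(xs, preds)]
    return base, learning_rate, stumps

def predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(lv if x <= t else rv for t, lv, rv in stumps)

def mse(ys, preds):
    return mean((y - p) ** 2 for y, p in zip(ys, preds))

# Made-up data: y = x squared.
xs = list(range(10))
ys = [x * x for x in xs]
model = boost(xs, ys)
error_before = mse(ys, [mean(ys)] * len(ys))          # predicting the mean only
error_after = mse(ys, [predict(model, x) for x in xs])  # after boosting
print(error_before, error_after)
```

The stopping rule here is the fixed number of rounds (n_rounds). Stopping once the error drops below a threshold would be the other option mentioned above.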


Stacking (aka Ensemble Learners)

In stacking, you still use multiple models and multiple bags.
However, in stacking the models use different algorithms, such as KNN, linear regression, decision tree and SVM.
The predictions of these models are then combined. As presented, this means taking their mean. More generally, stacking trains a final model (a meta-learner) on those predictions to decide how to combine them.
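A rough sketch of the difference between averaging the models and letting a final model decide the weights, in plain Python rather than R. Everything here is an assumption for illustration: the two base models (a fitted line and a 1-nearest-neighbour lookup), the made-up data, and the very small "meta-learner", which only learns a single blend weight.

```python
from statistics import mean

def fit_linear(train):
    """Least-squares line through (x, y) pairs, returned as a predict function."""
    xs = [x for x, _ in train]
    ys = [y for _, y in train]
    x_bar, y_bar = mean(xs), mean(ys)
    a = (sum((x - x_bar) * (y - y_bar) for x, y in train)
         / sum((x - x_bar) ** 2 for x in xs))
    return lambda x: y_bar + a * (x - x_bar)

def fit_nearest(train):
    """1-nearest-neighbour: predict the y of the closest training x."""
    return lambda x: min(train, key=lambda p: abs(p[0] - x))[1]

def fit_blend_weight(m1, m2, holdout):
    """Tiny meta-learner: the w that minimises the squared error of
    w*m1 + (1-w)*m2 on the held-out points."""
    num = sum((y - m2(x)) * (m1(x) - m2(x)) for x, y in holdout)
    den = sum((m1(x) - m2(x)) ** 2 for x, _ in holdout)
    return num / den if den else 0.5

def sse(predict, data):
    return sum((y - predict(x)) ** 2 for x, y in data)

# Made-up data, split into points for the base models and points for the blend.
points = [(x, x * x) for x in range(12)]
train, holdout = points[0::2], points[1::2]

linear, nearest = fit_linear(train), fit_nearest(train)
averaged = lambda x: (linear(x) + nearest(x)) / 2          # plain mean
w = fit_blend_weight(linear, nearest, holdout)
stacked = lambda x: w * linear(x) + (1 - w) * nearest(x)   # learned weight
print(sse(averaged, holdout), sse(stacked, holdout))
```

The learned blend can never do worse than the plain mean on the points it was fitted on, so a fair comparison would need a third, untouched set of data.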

Other ensemble methods, such as Random Forest, are also applicable.

Ensemble method in R -> You can use the SuperLearner package in RStudio

There is an issue with ensembles: a model might be highly accurate when it is trained in development, but produce different results when it is applied to real-world data. To reduce this problem (inaccuracy or over-fitting), the advice shared was to apply elastic net regularization to the models. Elastic net combines two known penalties: ridge and lasso.
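To show how ridge and lasso fit inside elastic net, here is a small sketch of the penalty term in plain Python rather than R. The function name and the example weights are made up. The formula follows one common convention, in which alpha mixes the two penalties and lam sets the overall strength; other texts scale the terms differently.

```python
def elastic_net_penalty(weights, lam, alpha):
    """Penalty added to the training loss to discourage large weights.

    alpha = 1 gives the lasso penalty (sum of absolute weights),
    alpha = 0 gives the ridge penalty (half the sum of squared weights),
    and anything in between is a mix of the two.
    """
    l1 = sum(abs(w) for w in weights)   # lasso part
    l2 = sum(w * w for w in weights)    # ridge part
    return lam * (alpha * l1 + (1 - alpha) / 2 * l2)

weights = [3.0, -4.0]  # made-up model coefficients
lasso_only = elastic_net_penalty(weights, lam=0.1, alpha=1.0)
ridge_only = elastic_net_penalty(weights, lam=0.1, alpha=0.0)
mixed = elastic_net_penalty(weights, lam=0.1, alpha=0.5)
print(lasso_only, ridge_only, mixed)
```

A model that relies on very large weights to fit the development data pays a bigger penalty, which is how this kind of regularization pushes against over-fitting.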

Text Clustering with R

Clustering in R can be done using algorithms like K-Means, Hierarchical Clustering and DBSCAN (KNN was also mentioned, although it is normally used for classification rather than clustering).
For text clustering, you need to measure distance to be able to cluster the text. For instance, what is the distance between the words 'wheel' and 'tire'?
Distance is normally measured using Euclidean distance.
For text clustering, there is a library that can be used to measure the distance with the Wu-Palmer measure. However, the library is only available for Python and not for R yet (as of the information shared on Oct 20, 2018).
A lexical database that can be used to measure distance with this algorithm is WordNet (https://wordnet.princeton.edu/)
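The two kinds of distance mentioned above can be sketched as follows, in plain Python. The word hierarchy is a tiny made-up stand-in for WordNet, so the numbers only illustrate the formula and say nothing about what a real WordNet lookup would return. Wu-Palmer similarity is 2 * depth(common ancestor) / (depth(word 1) + depth(word 2)), and here I take 1 minus the similarity as the distance.

```python
import math

def euclidean(p, q):
    """Straight-line distance between two numeric points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Hypothetical mini-hierarchy (word -> parent); real WordNet is far larger.
PARENT = {
    "entity": None,
    "vehicle_part": "entity",
    "wheel": "vehicle_part",
    "tire": "vehicle_part",
    "animal": "entity",
    "cat": "animal",
}

def path_to_root(word):
    """The word itself, then its parent, and so on up to the root."""
    path = []
    while word is not None:
        path.append(word)
        word = PARENT[word]
    return path

def depth(word):
    return len(path_to_root(word))  # the root has depth 1

def wu_palmer(a, b):
    """Wu-Palmer similarity: 1.0 for identical words, lower when the
    closest shared ancestor sits nearer the root."""
    ancestors_b = set(path_to_root(b))
    common = next(n for n in path_to_root(a) if n in ancestors_b)
    return 2 * depth(common) / (depth(a) + depth(b))

print(euclidean((0, 0), (3, 4)))
print(wu_palmer("wheel", "tire"), wu_palmer("wheel", "cat"))
print(1 - wu_palmer("wheel", "tire"))  # similarity turned into a distance
```

In this toy hierarchy 'wheel' and 'tire' share a close parent, so they come out more similar (and therefore closer) than 'wheel' and 'cat', which only meet at the root.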

Churn Prediction using SNA - Social Network Analysis

Two tools are available and should be used: igraph and R Markdown.
The challenge in prediction is how to minimize the loss caused by the algorithm implemented.

Some references:

https://en.wikipedia.org/wiki/Ensemble_learning
https://en.wikipedia.org/wiki/Cluster_analysis
WordNet English (Lexical English Language) https://wordnet.princeton.edu/
WordNet Bahasa Melayu (Lexical Malay Language) - http://wn-msa.sourceforge.net/index.eng.html
https://en.wikipedia.org/wiki/Semantic_analysis_(linguistics)
https://en.wikipedia.org/wiki/Latent_semantic_analysis
https://blog.thedigitalgroup.com/words-similarityrelatedness-using-wupalmer-algorithm
https://en.wikipedia.org/wiki/Euclidean_distance
https://www.researchgate.net/publication/310572659_A_modification_of_Wu_and_Palmer_Semantic_Similarity_Measure

Thanks to the team who shared their knowledge. Check them out at https://www.facebook.com/groups/MalaysiaRUserGroup/
