My experience on my daily works... helping others ease each other

Sunday, October 21, 2018

Lessons learned at an R conference (ConfeRence 2018, ADAX Malaysia)

I recently attended an R conference (ConfeRence) organized at ADAX Malaysia on October 20, 2018. I have an interest in AI, ML and semantic analysis. It started in 2008, when I pursued further studies focusing on program analysis, specializing in static analysis of C overflow vulnerabilities.

Here are the key points I would like to share, based on the knowledge shared by the local experts.

Ensemble Method

I'm quite new to this and had only just heard about it. The presenter shared that there are many methods, but the common ones are Bagging, Boosting and Stacking.

Bagging


  1. Can produce a Discrete or Continuous Result
  2. Discrete (classification): the result is decided by voting
  3. Continuous (regression): the result is the mean
  4. Basically, the data is put into multiple "bags" by random selection, and you train a model on each bag using one or various models. Finally, you combine the results (the mean, or a vote for classification) to produce the outcome.
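
The bagging steps above can be sketched in plain Python. This is only a minimal illustration under my own assumptions: the toy data, the function names and the choice of a 1-nearest-neighbour model per bag are all hypothetical and were not part of the talk.

```python
import random
from collections import Counter
from statistics import mean


def bootstrap_sample(rows, rng):
    """Fill one 'bag' by drawing len(rows) rows at random, with replacement."""
    return [rng.choice(rows) for _ in rows]


def train_1nn(bag):
    """Toy model: predict the target of the closest x value seen in the bag."""
    def predict(x):
        _, nearest_y = min(bag, key=lambda row: abs(row[0] - x))
        return nearest_y
    return predict


def bagged_models(rows, n_bags, seed=0):
    """Train one model per bag; the seed only makes the sketch repeatable."""
    rng = random.Random(seed)
    return [train_1nn(bootstrap_sample(rows, rng)) for _ in range(n_bags)]


def predict_continuous(models, x):
    """Regression: the final result is the mean of all model outputs."""
    return mean(model(x) for model in models)


def predict_discrete(models, x):
    """Classification: the final result is the majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: (x, target) pairs.
regression_rows = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
class_rows = [(1.0, "low"), (2.0, "low"), (8.0, "high"), (9.0, "high")]

reg_models = bagged_models(regression_rows, n_bags=25)
cls_models = bagged_models(class_rows, n_bags=25)

print(predict_continuous(reg_models, 2.5))
print(predict_discrete(cls_models, 8.5))
```

The only difference between the two cases is the last step: the same bags and models are used, and only the way the outputs are combined changes.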


Boosting


  1. After each round of training, you measure the error of the result and continue training on the data, focusing on that error, until there is no error left on the latest bag or until a defined number of models has been reached.
  2. Two well-known algorithms are GBM (Gradient Boosting Machine) and XGBoost (eXtreme Gradient Boosting).
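
To make the idea of "keep training on the remaining error" more concrete, here is a small sketch of boosting with a squared-error loss. It is not the GBM or XGBoost implementation; the one-split "stump" model, the learning rate and the toy data are my own assumptions for illustration.

```python
from statistics import mean


def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the current errors."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keep both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = mean(left), mean(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_models=20, learning_rate=0.5, tolerance=1e-6):
    """Add stumps one by one; each new stump is trained on what is still wrong."""
    base = mean(ys)
    stumps = []
    predictions = [base] * len(xs)
    for _ in range(n_models):  # stop at a defined number of models ...
        residuals = [y - p for y, p in zip(ys, predictions)]
        if max(abs(r) for r in residuals) < tolerance:
            break  # ... or when there is practically no error left
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x)
                       for x, p in zip(xs, predictions)]

    def predict(x):
        return base + learning_rate * sum(stump(x) for stump in stumps)
    return predict


def mse(model, xs, ys):
    """Mean squared error of a model on the given data."""
    return mean((y - model(x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical toy data with a jump between x = 3 and x = 4.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]

for n in (0, 1, 20):
    print(n, "models, training error:", mse(boost(xs, ys, n_models=n), xs, ys))
```

With zero models the prediction is just the mean of the data, and each added model lowers the training error further. This is also why boosting can over-fit if the number of models is not limited.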



Stacking (aka Ensemble Learners)


  1. In stacking, you still use various models and a number of bags.
  2. However, in stacking, you use different algorithms, such as KNN, LinReg, Decision Tree and SVM.
  3. The results from training each bag with a different algorithm are then combined, for example by taking their mean (in the more general form of stacking, a meta-model is trained to combine them).
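
A rough sketch of the stacking idea, assuming only two base algorithms (a linear regression and a KNN regressor) and hypothetical toy data. It compares the plain mean described above with a very small "meta-model" that learns one blending weight on held-out data; all function names are my own and this is not how SuperLearner is implemented.

```python
from statistics import mean


def fit_linear(rows):
    """Base algorithm 1: simple least-squares line through the points."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in rows)
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x


def fit_knn(rows, k=2):
    """Base algorithm 2: mean target of the k nearest training points."""
    def predict(x):
        nearest = sorted(rows, key=lambda row: abs(row[0] - x))[:k]
        return mean(y for _, y in nearest)
    return predict


def fit_blend_weight(model_a, model_b, holdout):
    """Meta-step: least-squares weight w for w*a(x) + (1-w)*b(x)."""
    num = sum((model_a(x) - model_b(x)) * (y - model_b(x)) for x, y in holdout)
    den = sum((model_a(x) - model_b(x)) ** 2 for x, _ in holdout)
    return num / den if den else 0.5


def average(train):
    """Combine the two algorithms with a plain mean."""
    a, b = fit_linear(train), fit_knn(train)
    return lambda x: (a(x) + b(x)) / 2


def stack(train, holdout):
    """Combine the two algorithms with a weight learned on held-out rows."""
    a, b = fit_linear(train), fit_knn(train)
    w = fit_blend_weight(a, b, holdout)
    return lambda x: w * a(x) + (1 - w) * b(x)


def sse(model, rows):
    """Sum of squared errors of a model on the given rows."""
    return sum((y - model(x)) ** 2 for x, y in rows)


# Hypothetical toy data, split into training rows and held-out rows.
train = [(1, 1.1), (2, 1.9), (3, 3.2), (5, 4.8), (6, 6.1)]
holdout = [(4, 4.0), (7, 7.2), (0, 0.1)]

averaged = average(train)
stacked = stack(train, holdout)
print("mean of models:", sse(averaged, holdout))
print("learned weight:", sse(stacked, holdout))
```

The learned weight can never do worse than the plain mean on the held-out rows it was fitted on, so a fair comparison would need a third, unseen set of rows.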



Other ensemble methods, such as Random Forest, are also applicable.

Ensemble methods in R: you can use the SuperLearner package in RStudio.

There is an issue with ensembles: a model may be highly accurate on the data it was trained on in development, but produce different results when applied to real-world data. To reduce this inaccuracy (the over-fitting problem), the presenter recommended applying elastic net regularization to all models. Elastic net combines two known penalties, ridge and lasso.
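
To show how elastic net relates to ridge and lasso, here is a small sketch of the penalty term that is added to a model's loss. The parameter names (`strength`, `l1_ratio`) and the example weights are my own assumptions; real libraries use their own names and scaling.

```python
def elastic_net_penalty(weights, strength, l1_ratio):
    """Mix of the lasso penalty (sum of |w|) and the ridge penalty (sum of w^2).

    l1_ratio = 1.0 gives pure lasso, l1_ratio = 0.0 gives pure ridge,
    anything in between is elastic net.
    """
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return strength * (l1_ratio * l1 + 0.5 * (1.0 - l1_ratio) * l2_squared)


def penalized_loss(errors, weights, strength, l1_ratio):
    """Mean squared error plus the penalty: large weights now cost something."""
    mse = sum(e * e for e in errors) / len(errors)
    return mse + elastic_net_penalty(weights, strength, l1_ratio)


weights = [3.0, -4.0]  # hypothetical model coefficients
print("lasso only :", elastic_net_penalty(weights, strength=0.1, l1_ratio=1.0))
print("ridge only :", elastic_net_penalty(weights, strength=0.1, l1_ratio=0.0))
print("elastic net:", elastic_net_penalty(weights, strength=0.1, l1_ratio=0.5))
```

Because the penalty grows with the size of the coefficients, a model that fits the training data only by using very large weights is discouraged, which is how this kind of regularization helps against over-fitting.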

Text Clustering with R


  1. Clustering in R can be done using algorithms such as K-Means, Hierarchical Clustering and DBSCAN (KNN was also mentioned, although it is normally used for classification rather than clustering).
  2. For text clustering, you need to measure distance before you can cluster the text. For instance, what is the distance between the words 'wheel' and 'tire'?
  3. Distance is normally measured using Euclidean distance.
  4. For text clustering, there is a library that can measure this distance using the Wu-Palmer measure. However, the library is only available for Python and not yet for R (as of the information shared on Oct 20, 2018).
  5. A lexical database that can be used to measure distance with this algorithm is WordNet (https://wordnet.princeton.edu/)
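
The two kinds of distance mentioned above can be sketched as follows. The small word hierarchy is a hypothetical stand-in that I made up for illustration; it is not taken from WordNet, and the function names are my own. The Wu-Palmer formula used is 2 x depth(common ancestor) / (depth(a) + depth(b)), with the root at depth 1.

```python
import math

# Hypothetical mini hierarchy: each word points to its parent concept.
PARENT = {
    "artifact": "entity",
    "vehicle_part": "artifact",
    "wheel": "vehicle_part",
    "tire": "vehicle_part",
    "animal": "entity",
    "dog": "animal",
}


def path_to_root(word, parent):
    """List the word and all its ancestors, from the word up to the root."""
    path = [word]
    while word in parent:
        word = parent[word]
        path.append(word)
    return path


def wu_palmer(a, b, parent):
    """Wu-Palmer similarity: closer to 1.0 means the words are more related."""
    ancestors_b = set(path_to_root(b, parent))
    # The first shared node on the way up from a is the deepest common ancestor.
    common = next(node for node in path_to_root(a, parent) if node in ancestors_b)

    def depth(node):
        return len(path_to_root(node, parent))  # the root has depth 1

    return 2 * depth(common) / (depth(a) + depth(b))


# A similarity can be turned into a distance for clustering as 1 - similarity.
print("wheel vs tire:", wu_palmer("wheel", "tire", PARENT))
print("wheel vs dog :", wu_palmer("wheel", "dog", PARENT))

# Euclidean distance works on numbers, e.g. word counts of two short texts.
counts_a = (1, 0, 2)  # hypothetical counts of three words in text A
counts_b = (0, 1, 2)  # hypothetical counts of the same words in text B
print("euclidean    :", math.dist(counts_a, counts_b))
```

Euclidean distance needs the text to be turned into numbers first, while Wu-Palmer works directly on the position of the words in a hierarchy such as WordNet.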

Churn Prediction using SNA - Social Network Analysis

Two libraries are available and should be used: igraph and R Markdown.
The challenge in prediction: how to minimize the loss caused by the algorithm that is implemented.

Some references:


  1. https://en.wikipedia.org/wiki/Ensemble_learning
  2. https://en.wikipedia.org/wiki/Cluster_analysis
  3. WordNet English (Lexical English Language) https://wordnet.princeton.edu/
  4. WordNet Bahasa Melayu (Lexical Malay Language) - http://wn-msa.sourceforge.net/index.eng.html
  5. https://en.wikipedia.org/wiki/Semantic_analysis_(linguistics)
  6. https://en.wikipedia.org/wiki/Latent_semantic_analysis
  7. https://blog.thedigitalgroup.com/words-similarityrelatedness-using-wupalmer-algorithm
  8. https://en.wikipedia.org/wiki/Euclidean_distance
  9. https://www.researchgate.net/publication/310572659_A_modification_of_Wu_and_Palmer_Semantic_Similarity_Measure


Thanks to the team who shared their knowledge. Check them out at https://www.facebook.com/groups/MalaysiaRUserGroup/

