DS Capstone Survival Guide

Prerequisites

Some things you should know before you start.

  • While it is not required, basic familiarity with Unix command-line tools can be extremely useful. If names like grep, sed or wc don't mean anything to you, it is a good idea to change that.
  • You don't have to be an expert in Natural Language Processing, but understanding basic concepts and having some experience with analyzing unstructured data will give you a serious advantage. If you're familiar with terms like tokenization, n-gram or Markov chain, you're good to go.
  • You don't need a computer science degree to finish this course, but it is useful to understand basic data structures and algorithms. If you are familiar with Big O notation and you can analyze the time and memory complexity of operations like for(i in 1:n) {foo <- c(i, foo)} or which(foo == round(runif(1, 1, n))), you should be fine.
  • While most of the heavy lifting can be handled by Shiny, some practical knowledge of front-end technologies can make your life much easier.
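
To make the complexity point above concrete, here is a minimal sketch comparing the growing-vector loop from the list with a preallocated version. The helper names grow_vector and prealloc_vector are made up for illustration, and the timings will differ from machine to machine.

```r
# Growing a vector with c() copies it on every iteration: O(n^2) overall.
grow_vector <- function(n) {
  foo <- c()
  for (i in seq_len(n)) {
    foo <- c(i, foo)  # builds a new vector of length i each time
  }
  foo
}

# Preallocating and filling in place is O(n) and gives the same result.
prealloc_vector <- function(n) {
  foo <- integer(n)
  for (i in seq_len(n)) {
    foo[n - i + 1] <- i  # same reversed order as c(i, foo)
  }
  foo
}

n <- 2e4
t_grow <- system.time(a <- grow_vector(n))[["elapsed"]]
t_pre  <- system.time(b <- prealloc_vector(n))[["elapsed"]]
stopifnot(identical(a, b))
cat("grow:", t_grow, "s; preallocated:", t_pre, "s\n")
```

The second expression from the list, which(foo == x), scans the whole vector, so each lookup costs O(n). That is fine once, but it adds up quickly inside a loop.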

Development environment

  • Keep your development environment as close as possible to the target platform. At the time of writing, shinyapps.io uses Ubuntu 12.04 with the en_US.UTF-8 locale. You can create a similar environment using tools like Docker, Vagrant or VirtualBox.
  • Create a reproducible R environment (Packrat is your friend) for your project. Dealing with broken dependencies is a painful and time-consuming process.
  • If you run a memory- or CPU-intensive task, try to avoid RStudio.

Use of Unix command-line tools

  • Most operations involving identifying 'unique' words or n-grams, and counting them, can take hours in R and just a few seconds or minutes using Unix/Linux pipes.
  • If you work on a Windows machine, keep in mind that you can use Git Bash for Unix/Linux command-line tools.
  • Using Linux/Unix tools does not mean you have to give up the ideals of reproducible research: the R function system() allows you to call OS commands. If you have a Windows machine, you can solve this problem by using cloud services such as Domino, which accept Linux/Unix commands.
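
As a rough illustration of calling a Unix pipeline from R with system(), the sketch below counts word frequencies with tr, sort and uniq. It assumes a Unix-like shell with these tools on the PATH (on Windows, for example, R started from Git Bash). The helper name count_words_unix is hypothetical, and the letter-only tokenization is a simplification rather than a recommendation.

```r
# Count word frequencies with a Unix pipeline called from R.
# `count_words_unix` is a hypothetical helper, not part of any package.
count_words_unix <- function(path) {
  # tr splits the text into one word per line and lowercases it,
  # sort | uniq -c counts duplicates, sort -rn orders by frequency.
  cmd <- sprintf(
    "tr -cs '[:alpha:]' '\\n' < %s | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -rn",
    shQuote(path)
  )
  out <- trimws(system(cmd, intern = TRUE))
  out <- out[grepl("^[0-9]+ +[^ ]+$", out)]  # keep only "<count> <word>" lines
  data.frame(
    word  = sub("^[0-9]+ +", "", out),
    count = as.integer(sub(" .*$", "", out)),
    stringsAsFactors = FALSE
  )
}

tmp <- tempfile(fileext = ".txt")
writeLines(c("the cat sat on the mat", "The dog sat"), tmp)
counts <- count_words_unix(tmp)
print(counts)
```

Keeping the shell command inside an R script means the whole analysis can still be rerun from a single entry point.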

Languages and libraries

  • Some libraries are more equal than others. Even if a library looks like a great fit, it doesn't mean it can handle the amount of data you have to process.
  • JVM-based libraries (RWeka, OpenNLP) can provide some very useful functions, but they come at a price. In a restricted environment like ShinyApps, that can be a deal breaker.
  • Some libraries which have proven to be useful: stringi, Hadleyverse tools, data.table (in particular) and hash.
  • Most likely more trouble than they're worth: tm, RWeka. These are good packages, but simply not well suited to the Capstone-specific workload.
  • While you can use any tools you want to process input data and build your model, the final project must be deployed on shinyapps.io. Running external code on ShinyApps is possible, but it may violate the Terms of Use and makes debugging even harder. Keep that in mind before you decide to use your-favorite-language.

Common problems and troubleshooting

  • If you don't know where to start or you are stuck, be sure to check the fora. To quote @LaurentFranckx, they can make the difference between passing and failing.
  • If you ask a question regarding issues with your code, it is wise to provide an SSCCE (short, self-contained, correct example) and all the additional information that can help diagnose the problem (error messages, warnings, sessionInfo). Describing what you have tried is always welcome.
  • Contrary to popular belief, ShinyApps is not a devious trap created only to make your application fail, but it is not a forgiving platform either. Even if your application works fine on your own machine with lots of spare resources, it doesn't mean it will behave the same way after deployment. The takeaway here: deploy early and often.
  • If your problem concerns a deployed application, be sure to provide the application log and any related errors visible in the browser console (Firefox, Chrome). You can use the logging package to create custom log messages.
  • If you don't understand R memory management, you will most likely fail. You can use common monitoring tools like htop, or more advanced software like Munin, to monitor your application. If you see unexpected memory spikes, use lineprof to find the source of the problem.
  • Keep the Shiny application directory clean. Uploading things like a saved R workspace may have unexpected results.
  • Double-check submitted links, especially if you use the Rich Text Editor. What you see is not always what you get.
  • Locale matters. en_US.UTF-8 is usually the best choice. If in doubt, use Sys.getlocale and Sys.setlocale.
  • Input file encoding matters as well. Be sure to understand how to use the encoding parameter when you create connections or use readLines. If you still encounter problems (like embedded nulls), you can always read and encode raw data.
  • Handle exceptions and prepare a fallback strategy. If there is an easy way to break your application, you can be sure reviewers will find it.
  • Deploying an application on ShinyApps requires a current version of the shinyapps package. If you see a message that the shinyapps package is out of date, simply repeat the installation procedure.
  • Consider using shiny::showReactLog to track reactive dependencies.
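
The points about encoding and exception handling above can be combined in one small helper. This is only a sketch: read_text_safely is a hypothetical name, the default source encoding is an assumption that should match your actual input, and silently dropping invalid bytes may not be what you want for every corpus.

```r
# Read a text file as raw bytes, drop embedded nulls, and return UTF-8 lines.
# `read_text_safely` is a hypothetical helper; `from` is an assumed encoding.
read_text_safely <- function(path, from = "UTF-8") {
  tryCatch({
    if (!file.exists(path)) stop("file does not exist")
    bytes <- readBin(path, what = "raw", n = file.info(path)$size)
    bytes <- bytes[bytes != as.raw(0)]                        # strip embedded nulls
    text <- rawToChar(bytes)
    text <- iconv(text, from = from, to = "UTF-8", sub = "")  # drop invalid bytes
    strsplit(text, "\r?\n")[[1]]                              # split into lines
  }, error = function(e) {
    # Fallback strategy: report the problem and return empty input
    message("Could not read ", path, ": ", conditionMessage(e))
    character(0)
  })
}

tmp <- tempfile(fileext = ".txt")
writeBin(c(charToRaw("hello\n"), as.raw(0), charToRaw("world\n")), tmp)
lines <- read_text_safely(tmp)
print(lines)
```

Returning character(0) on failure lets the calling code decide what to show the user instead of crashing the whole application.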

ShinyApps limits

  • Manage shiny hours carefully. Archive any apps you don't need. Set the Instance Idle Timeout for a short period, particularly during your testing. Stop the app in the shiny console when you finish a testing session.
  • Size of an application deployed using deployApp is limited to 1GB.
  • The free tier is restricted to the medium instance size. In practice, that means 512MB of RAM at your disposal: a few orders of magnitude more than the Apollo Guidance Computer, but most likely less than your smartphone. Use it wisely.
  • Changes to the file system won't persist beyond a single session.
  • Considering the ShinyApps architecture, it is best to assume the worst-case scenario, in which every user has to go through the whole startup process (including operations executed outside shinyServer) from scratch.
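
One simple way to keep an eye on the memory limit above is to measure your model object before you deploy. The sketch below is illustrative only: fits_budget is a hypothetical helper, the 0.5 safety fraction is an arbitrary assumption, and object.size gives an estimate that ignores the working memory R needs on top of the object itself.

```r
# Report the in-memory size of an object and compare it with a RAM budget.
# `fits_budget` and the default safety fraction are illustrative assumptions.
fits_budget <- function(obj, budget_mb = 512, fraction = 0.5) {
  size_mb <- as.numeric(utils::object.size(obj)) / 1024^2
  list(size_mb = size_mb, ok = size_mb <= budget_mb * fraction)
}

# A stand-in for an n-gram table; replace with your real model object.
model <- data.frame(
  ngram = sprintf("w%06d", 1:1e5),
  count = 1:1e5,
  stringsAsFactors = FALSE
)
res <- fits_budget(model)
cat(sprintf("model uses %.1f MB; within budget: %s\n", res$size_mb, res$ok))
```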

Time management and deadlines

  • Don't wait until the last moment to deploy your projects. If anything can go wrong, it will happen the day before the deadline. Just ask anyone who finished this specialization.
  • Grading weeks tend to be quite intensive, and you have to be prepared to deal with some unexpected issues. Long story short: don't leave town.

Slow or unreliable internet connections

  • If you have a slow internet connection, redeploying your app can be a time-consuming process. Since your data most likely won't change as often as your code, you can create a data-only package, keep it on GitHub and let ShinyApps handle the dependencies. If you're not sure how to do it, take a look at devtools::create, devtools::use_data and devtools::install_github. Be sure to check the guide on working with large files.
  • Seeing the cryptic Error in headers[["Content-Type"]] : subscript out of bounds during deployment may suggest some kind of connection problem. If you're sure there is nothing wrong with your app, please try again using a different network.

Issues uploading to RPubs

People may get an SSL error when uploading to RPubs. The cause is either software that needs updating or a machine that needs to be configured to use SSL uploads (normally by creating an .Rprofile file). Let's start with the update.

SSL errors caused by outdated software

A lot of SSL-related software has needed updating over the past year. If you are getting an SSL error when uploading to RPubs, the first steps are:

  • Update RStudio to the latest version
  • In the new version of RStudio, update the bitops and RCurl packages (the markdown package should update when you start using .Rmd documents, but if you are following an unusual path you may need to update the markdown package). Remember, the safe way to update is to unload the package (which you can do by taking the tick out of the box beside the package in the packages tab) then reinstall.
  • In the new version, with bitops and RCurl loaded, try knitting and publishing to RPubs.

If you still have no luck, then (particularly if you are using Windows) you will need to use the command options(rpubs.upload.method = "internal") to have Internet Explorer handle the HTTPS transfer. If this works and you want to make the change permanent, put it in an .Rprofile file.

Creating an .Rprofile file can be a bit tricky, as files whose names start with a . are normally invisible on most computer systems. I suggest using RStudio, with the startup directory set to the folder containing the .Rmd file. (Technically, R uses a multistage process of reading in settings, and the help for Startup explains the other possible places.)

  • Use the File menu to make a New File -> Text File
  • Put in options(rpubs.upload.method = "internal") and no other text at all
  • Use the Save command and save it with the name .Rprofile
  • Quit RStudio, restart RStudio, make sure your startup directory is set to the folder, then try to publish again. For other places the command could go, see ?Startup
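
The manual steps above can also be done from the R console, which sidesteps the hidden-file problem. This is a minimal sketch: add_rpubs_option is a hypothetical helper name, and it assumes the project folder is the directory R starts in.

```r
# Add the RPubs upload option to a project-level .Rprofile.
# `add_rpubs_option` is a hypothetical helper, not part of any package.
add_rpubs_option <- function(dir = getwd()) {
  profile <- file.path(dir, ".Rprofile")
  line <- 'options(rpubs.upload.method = "internal")'
  existing <- if (file.exists(profile)) readLines(profile, warn = FALSE) else character(0)
  if (!(line %in% existing)) {
    writeLines(c(existing, line), profile)  # keep earlier settings, add ours once
  }
  invisible(profile)
}

# Usage (run from the folder that holds your .Rmd file, then restart R):
# add_rpubs_option()
```

The function keeps whatever is already in the file and adds the option only once, so it is safe to run repeatedly.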

You can do the equivalent in Windows from the console (hat tip to Iman Tang) without actually creating the .Rprofile file.

install.packages("markdown")  
library(markdown)  
# title and htmlFile are placeholders for your document title and knitted HTML file  
rpubsUpload(title, htmlFile, id = NULL, properties = list(), method = getOption("rpubs.upload.method", "internal"))

Please note that the last argument also uses getOption("rpubs.upload.method", "internal"), as others have mentioned.

The rpubsUpload function returns a "continueURL". You can then open this "continueURL" in a browser and finish publishing in RPubs. For details, search for "Upload an HTML file to RPubs" in the help section of RStudio.

General remarks

  • Be civil and try to have fun.
  • Word clouds are evil and therefore should be forbidden.
  • Using typical NLP pipelines (tokenization -> punctuation removal -> stop words removal -> stemming) is not the best approach. It will bite you sooner or later.
  • Carefully review the Grading Rubrics.
  • Document and explain your design choices and never assume that the reviewers are familiar with some proprietary product x.

Useful resources

  • Other

Creative Commons License
DS Capstone Survival Guide by Maciej Szymkiewicz and Authors is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
