- Prerequisites
- Development environment
- Use of Unix command-line tools
- Languages and libraries
- Common problems and troubleshooting
- ShinyApps limits
- Time management and deadlines
- Slow or unreliable internet connections
- Uploading to Rpubs issues
- Useful resources
- General remarks
Some things you should know before you start.
- While it is not required, basic familiarity with Unix command-line tools can be extremely useful. If names like `grep`, `sed` or `wc` don't mean anything to you, it is a good idea to change that.
- You don't have to be an expert in Natural Language Processing, but understanding basic concepts and some experience with analyzing unstructured data will give you a serious advantage. If you're familiar with terms like tokenization, n-gram or Markov chain, you're good to go.
- You don't need a computer science degree to finish this course, but it is useful to understand basic data structures and algorithms. If you are familiar with Big O notation and you can analyze the time and memory complexity of operations like `for(i in 1:n) {foo <- c(i, foo)}` or `which(foo == round(runif(1, 1, n)))`, you should be fine.
- While most of the heavy lifting can be handled by Shiny, some practical knowledge of front-end technologies can make your life much easier.
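
To make the complexity remark concrete, here is a minimal sketch comparing the vector-growing loop quoted above with a preallocated version. The function names `grow` and `prealloc` are made up for illustration, and timings will vary by machine, so none are shown.

```r
# Growing a vector by prepending reallocates and copies it on every
# iteration, so the loop as a whole does O(n^2) work.
grow <- function(n) {
  foo <- c()
  for (i in 1:n) foo <- c(i, foo)
  foo
}

# Preallocating once and filling in place touches each element once: O(n).
prealloc <- function(n) {
  foo <- integer(n)
  for (i in 1:n) foo[n - i + 1] <- i
  foo
}

n <- 5000
# Both versions build the same vector (n, n - 1, ..., 1).
stopifnot(identical(grow(n), prealloc(n)))

# Compare elapsed times on your own machine.
print(system.time(grow(n)))
print(system.time(prealloc(n)))

# The second quoted expression is a single linear scan, O(n) per call.
foo <- prealloc(n)
hits <- which(foo == round(runif(1, 1, n)))
```

The gap between the two timings widens quickly as `n` grows, which is the kind of behaviour Big O analysis lets you predict before running anything.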
- Keep your development environment as close as possible to the target platform. At this point shinyapps.io is using Ubuntu 12.04 with the `en_US.UTF-8` locale. You can create a similar environment using tools like Docker, Vagrant or VirtualBox.
- Create a reproducible R environment (Packrat is your friend) for your project. Dealing with broken dependencies is a painful and time-consuming process.
- If you execute a memory- or CPU-intensive task, try to avoid RStudio.
- Most operations involving identifying 'unique' words or n-grams, and counting them, can take hours in R and just a few seconds or minutes using Unix/Linux pipes.
- If you work on a Windows machine, keep in mind that you can use Git Bash for Unix/Linux command-line tools.
- Using Linux/Unix does not mean you have to give up the ideals of reproducible research: the R function `system()` allows you to call OS commands; if you have a Windows machine, you can solve this problem by using cloud services such as Domino, which accept Linux/Unix commands.
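
As a rough sketch of how a pipe can be called from R with `system()`, the snippet below counts word frequencies in a tiny made-up corpus. It assumes a Unix-like shell with `tr`, `sort` and `uniq` on the path (for example Linux, macOS or Git Bash); the sample text and variable names are illustrative only.

```r
# Write a tiny sample corpus to a temporary file (illustrative data only).
corpus <- tempfile(fileext = ".txt")
writeLines(c("the cat sat", "the dog sat", "the end"), corpus)

# One word per line, sort, count duplicates, then order by frequency.
cmd <- sprintf(
  "tr -s ' ' '\\n' < %s | sort | uniq -c | sort -rn",
  shQuote(corpus)
)

# intern = TRUE captures the command output as a character vector,
# so the whole step stays inside a reproducible R script.
counts <- system(cmd, intern = TRUE)
print(trimws(counts))
```

Keeping the command string in the script, rather than running it by hand in a terminal, is what preserves reproducibility.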
- Some libraries are more equal than others. Even if a library looks like a great fit, it doesn't mean it can handle the amount of data you have to process.
- JVM-based libraries (RWeka, OpenNLP) can provide some very useful functions, but they come at a price. In a restricted environment like ShinyApps it can be a deal breaker.
- Some libraries which have proven to be useful: `stringi`, Hadleyverse tools (`data.table` in particular), `hash`.
- Most likely more trouble than it's worth: `tm`, `RWeka`. These are good packages but simply not well suited to the Capstone-specific workload.
- While you can use any tools you want to process input data and build your model, the final project must be deployed on shinyapps.io. Running external code on ShinyApps is possible, but may violate the Terms of Use and makes debugging even harder. Keep that in mind before you decide to use your-favorite-language.
- If you don't know where to start or you are stuck, be sure to check the forums. To quote @LaurentFranckx, they can make the difference between passing and failing.
- If you ask a question regarding issues with your code, it is wise to provide an SSCCE and all the additional information which can be useful (error messages, warnings, `sessionInfo`) to diagnose the problem. Describing what you have tried is always welcome.
- Contrary to popular belief, ShinyApps is not a devious trap created only to make your application fail, but it is not a forgiving platform either. Even if your application works fine on your own machine with lots of spare resources, it doesn't mean it will behave the same way after deployment. The take-away message here is: deploy early and often.
- If your problem concerns a deployed application, be sure to provide an application log and related errors visible in a browser console (Firefox, Chrome). You can use the logging package to create custom log messages.
- If you don't understand R memory management you will most likely fail. You can use common monitoring tools like htop, or more advanced software like Munin, to monitor your application. If you see unexpected memory spikes, use lineprof to find the source of the problem.
- Keep the Shiny application directory clean. Uploading things like a saved R workspace may have unexpected results.
- Double check submitted links, especially if you use the Rich Text Editor. What you see is not always what you get.
- Locale matters. `en_US.UTF-8` is usually the best choice. If in doubt, use `Sys.getlocale` and `Sys.setlocale`.
- Input file encoding matters as well. Be sure to understand how to use the encoding parameter when you create connections or use `readLines`. If you still encounter problems (like embedded nulls) you can always read and encode raw data.
- Handle exceptions and prepare a fallback strategy. If there is an easy way to break your application, you can be sure reviewers will find it.
- Deploying an application on ShinyApps requires a current version of the `shinyapps` package. If you see `shinyapps package out of date`, simply repeat the installation procedure.
- Consider using `shiny::showReactLog` to track reactive dependencies.
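
The locale and encoding advice above can be sketched as follows. This is a minimal illustration that assumes a UTF-8 input file. The sample file is generated on the fly with an embedded null byte to mimic a damaged input, and all variable names are hypothetical.

```r
# Check which character-type locale the current session uses.
print(Sys.getlocale("LC_CTYPE"))

# Build a small UTF-8 file containing an embedded null (illustrative data).
path <- tempfile(fileext = ".txt")
writeBin(
  c(charToRaw("caf"), as.raw(c(0xc3, 0xa9)), as.raw(0x00),
    charToRaw(" au lait\n")),
  path
)

# Option 1: let readLines mark the strings as UTF-8 and skip nulls.
lines_rl <- readLines(path, encoding = "UTF-8", skipNul = TRUE)

# Option 2: read raw bytes, drop the nulls, then declare the encoding.
bytes <- readBin(path, what = "raw", n = file.info(path)$size)
bytes <- bytes[bytes != as.raw(0)]
text <- rawToChar(bytes)
Encoding(text) <- "UTF-8"
lines_raw <- strsplit(text, "\n", fixed = TRUE)[[1]]

print(lines_raw)
```

The raw route is slower but gives you full control over which bytes reach the parser, which helps when `readLines` alone keeps producing warnings.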
- Manage shiny hours carefully. Archive any apps you don't need. Set the Instance Idle Timeout for a short period, particularly during your testing. Stop the app in the shiny console when you finish a testing session.
- The size of an application deployed using `deployApp` is limited to 1GB.
- The free tier is restricted to the medium instance size. In practice it means 512MB RAM at your disposal: a few orders of magnitude more than the Apollo Guidance Computer, but most likely less than your smartphone. Use it wisely.
- Changes in a file system won't persist beyond a single session.
- Considering the ShinyApps architecture, it is best to assume a worst-case scenario in which every user has to go through the whole startup process (including operations executed outside `shinyServer`) from scratch.
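
One way to keep an eye on the 512MB limit before deploying is to measure how much memory your model objects occupy. The sketch below uses a small made-up table as a stand-in for a real n-gram model. The object names and the decision to keep half of the budget as headroom are assumptions, not platform rules.

```r
# Hypothetical stand-in for a model table; real n-gram tables are far larger.
model <- data.frame(
  ngram = sprintf("w%d w%d", 1:1000, 1001:2000),
  count = sample.int(100L, 1000L, replace = TRUE),
  stringsAsFactors = FALSE
)

# object.size reports an estimate in bytes; convert it to megabytes.
size_mb <- as.numeric(object.size(model)) / 1024^2

# RAM available on the free tier, as stated above.
budget_mb <- 512

# Assumed rule of thumb: leave half of the budget for R itself, Shiny,
# and temporary copies made while answering a request.
fits <- size_mb < budget_mb / 2
print(sprintf("Model uses %.2f MB; fits with headroom: %s", size_mb, fits))
```

Remember that R often copies objects when modifying them, so peak usage during a request can be well above the size of the stored model.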
- Don't wait until the last moment to deploy your projects. If anything can go wrong, it will happen the day before the deadline. Just ask anyone who finished this specialization.
- Grading weeks tend to be quite intensive and you have to be prepared to deal with some unexpected issues. Long story short: don't leave town.
- If you have a slow internet connection, redeploying your app can be a time-consuming process. Since your data most likely won't change as often as your code, you can create a data-only package, keep it on GitHub and let ShinyApps handle dependencies. If you're not sure how to do it, take a look at `devtools::create`, `devtools::use_data` and `devtools::install_github`. Be sure to check the working with large files guide.
- Seeing a cryptic `Error in headers[["Content-Type"]] : subscript out of bounds` during deployment may suggest some kind of connection problem. If you're sure there is nothing wrong with your app, please try again using a different network.
People may get an SSL error when uploading to RPubs. The cause is either outdated software or a machine that needs to be configured to use SSL uploads (normally by creating a .Rprofile file). Let's start with the update.
- SSL error due to needing updates
A lot of SSL-related software has needed updating over the past year. If you are getting an SSL error when uploading to RPubs, the first step is:
- Update RStudio to the latest version
- In the new version of RStudio, update the bitops and RCurl packages (the markdown package should update when you start using .Rmd documents, but if you are following an unusual path you may need to update the markdown package). Remember, the safe way to update is to unload the package (which you can do by taking the tick out of the box beside the package in the packages tab) then reinstall.
- In the new version, with bitops and RCurl loaded, try knitting and publishing to RPubs.
If you still have no luck, then (particularly if using Windows) you will need to use the command `options(rpubs.upload.method = "internal")` to have Internet Explorer handle the https transfer. If this works and you want to make the change permanent, put it in a .Rprofile file.
Creating a .Rprofile file can be a bit tricky, as files whose names start with a . are normally invisible on most computer systems. I suggest using RStudio, with the startup directory set to the folder with the .Rmd file in it. (Technically, R uses a multistage process of reading in settings, and the help for Startup explains the other possible places.)
- Use the File menu to make a New File -> Text File
- Put in `options(rpubs.upload.method = "internal")` and no other text at all
- Use the Save command and save it with the name .Rprofile
- Quit RStudio, restart RStudio, make sure your startup directory is set to the folder, then try to publish again. For other places the command could go see ?Startup
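
If the menu route is awkward, the same file can be written from the R console. This is a minimal sketch: `project_dir` is a placeholder that points at a temporary folder here so the example is safe to run, and you would replace it with the folder that holds your .Rmd file.

```r
# Placeholder location; replace with the folder containing your .Rmd file.
project_dir <- tempdir()
profile_path <- file.path(project_dir, ".Rprofile")

# Write the single option line and nothing else.
writeLines('options(rpubs.upload.method = "internal")', profile_path)

# Dot-files are hidden in most file browsers, so confirm from R instead.
print(file.exists(profile_path))
print(readLines(profile_path))
```

R reads this file only at startup, so a restart is still needed before the option takes effect.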
You can do the equivalent in Windows using the console (hat tip to Iman Tang for this tip) without actually creating the .Rprofile file.
install.packages("markdown")
library(markdown)
rpubsUpload(title, htmlFile, id = NULL, properties = list(), method = getOption("rpubs.upload.method", "internal"))
Please note that the last part is also ("rpubs.upload.method", "internal") as others have mentioned.
This "rpubsUpload" function returns a "continueURL". A browser can then be used to open this "continueURL" and finish the publishing in RPubs. For details, please type "Upload an HTML file to RPubs" in the help section of RStudio.
- Be civil and try to have fun.
- Word clouds are evil and therefore should be forbidden.
- Using typical NLP pipelines (tokenization -> punctuation removal -> stop words removal -> stemming) is not the best approach. It will bite you sooner or later.
- Carefully review the Grading Rubrics.
- Document and explain your design choices and never assume that the reviewers are familiar with some proprietary product x.
- Discussion lists and QA sites.
- ShinyApps Users Group - A great place to ask a question about the ShinyApps platform.
- Shiny Users Group.
- Stack Overflow.
- Books
- MOOCs
- Natural Language Processing, Columbia University.
- The Analytics Edge, MITx - Unit 5: Text Analytics in particular.
- Text Retrieval and Search Engines, University of Illinois at Urbana-Champaign.
- Introduction to Linux, LinuxFoundationX.
- Other
- awesome-R - a curated list of awesome R frameworks, packages and software.
- How to make a great R reproducible example?
- Next word prediction benchmark.
- Bitbucket - unlimited private Git repositories.
- Shinyapps.io Scaling and Performance Tuning - should give required knowledge about ShinyApps architecture.
- w3schools - enough HTML/CSS/JavaScript to keep you going.
- Software development skills for data scientists
DS Capstone Survival Guide by Maciej Szymkiewicz and Authors is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.