This Jupyter Notebook contains the data and visualizations crawled from the ICLR 2019 OpenReview webpages. All the crawled data (sorted by average rating) can be found here. Accepted papers have an average rating of 6.611, while rejected papers average 4.716. The distributions are plotted as follows.
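As a rough illustration, the summary statistics above can be recomputed from the crawled data. A minimal sketch, assuming the results were saved to a CSV with hypothetical columns decision and rating_mean (not necessarily the notebook's actual schema):

import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv('iclr2019_ratings.csv')  # hypothetical export of the crawled data
accepted = df[df['decision'].str.contains('Accept')]['rating_mean']
rejected = df[df['decision'].str.contains('Reject')]['rating_mean']
print('Accepted mean: {:.3f}'.format(accepted.mean()))
print('Rejected mean: {:.3f}'.format(rejected.mean()))

# Overlay the two rating distributions.
plt.hist(accepted, bins=20, alpha=0.5, label='accepted')
plt.hist(rejected, bins=20, alpha=0.5, label='rejected')
plt.xlabel('average rating')
plt.ylabel('number of papers')
plt.legend()
plt.show()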
- Python 3.5
- selenium
- pyvirtualdisplay (to run on a headless device)
- wordcloud
- imageio
The word clouds formed by the keywords of submissions show the hot topics, including reinforcement learning, generative adversarial networks, generative models, imitation learning, representation learning, etc.
This figure is plotted with the Python word cloud generator:
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# `keywords` is the list of keyword strings collected from all submissions.
wordcloud = WordCloud(max_font_size=64, max_words=160,
                      width=1280, height=640,
                      background_color="black").generate(' '.join(keywords))
plt.figure(figsize=(16, 8))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
The distribution of reviewer ratings centers around 5 to 6 (mean: 5.15).
You can compute how many papers are beaten by yours with

import numpy as np

def PR(rating_mean, your_rating):
    # rating_mean: the list of per-paper average ratings for all submissions
    pr = np.sum(your_rating >= np.array(rating_mean)) / len(rating_mean) * 100
    return pr

my_rating = (5 + 6 + 7) / 3  # your average rating here
print('Your paper beats {:.2f}% of submissions '
      '(well, just based on the ratings...)'.format(PR(rating_mean, my_rating)))
# ICLR 2017: accept rate 39.1% (198/507) (15 orals and 183 posters)
# ICLR 2018: accept rate 32% (314/981) (23 orals and 291 posters)
# ICLR 2019: accept rate ?% (?/1580)
The top 50 most common keywords and their frequencies.
The average reviewer ratings and the keyword frequencies suggest that, to maximize your chance of getting higher ratings, you may want to use keywords such as theory, robustness, or graph neural network.
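For reference, such keyword frequencies can be tallied with collections.Counter. A minimal sketch, assuming keywords is the same flat list of keyword strings used for the word cloud above:

from collections import Counter

# Lowercase so that capitalization variants of a keyword are merged.
counts = Counter(k.strip().lower() for k in keywords)
for keyword, freq in counts.most_common(50):
    print('{:<40s} {:d}'.format(keyword, freq))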
See How to install Selenium and ChromeDriver on Ubuntu.
To crawl data from dynamic websites such as OpenReview, a headless browser is created by
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

executable_path = '/Users/waltersun/Desktop/chromedriver'  # path to your ChromeDriver executable
options = Options()
options.add_argument("--headless")
browser = webdriver.Chrome(options=options, executable_path=executable_path)
Then, we can get the content of a webpage
browser.get(url)
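Because OpenReview renders its pages with JavaScript, the content may not be in the DOM immediately after browser.get(url) returns. One way to handle this (an illustration, not necessarily what this notebook does) is Selenium's explicit wait:

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Wait up to 10 seconds for the note content to be rendered.
WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'note_content_field')))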
To know what content we can crawl, we will need to inspect the webpage layout.
I chose to get the content by
key = browser.find_elements_by_class_name("note_content_field")
value = browser.find_elements_by_class_name("note_content_value")
The data include the abstract, keywords, TL;DR, and comments.
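The two element lists come back in matching order, so the field names and values can be paired up. A minimal sketch (building a dictionary this way is my own illustration, not necessarily how the notebook stores its data):

# Map each field name to its text value, e.g. {'Keywords:': '...', 'TL;DR:': '...'}.
fields = {k.text: v.text for k, v in zip(key, value)}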
The following content is largely borrowed from a nice post written by Christopher Su.
- Install Google Chrome for Debian/Ubuntu
sudo apt-get install libxss1 libappindicator1 libindicator7
wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
sudo dpkg -i google-chrome*.deb
sudo apt-get install -f
- Install xvfb to run Chrome on a headless device
sudo apt-get install xvfb
- Install ChromeDriver for 64-bit Linux
sudo apt-get install unzip # If you don't have unzip package
wget -N http://chromedriver.storage.googleapis.com/2.26/chromedriver_linux64.zip
unzip chromedriver_linux64.zip
chmod +x chromedriver
sudo mv -f chromedriver /usr/local/share/chromedriver
sudo ln -s /usr/local/share/chromedriver /usr/local/bin/chromedriver
sudo ln -s /usr/local/share/chromedriver /usr/bin/chromedriver
If your system is 32-bit, please find the ChromeDriver releases here and modify the above download command.
- Install Python dependencies (Selenium and pyvirtualdisplay)
pip install pyvirtualdisplay selenium
- Test your setup in Python
from pyvirtualdisplay import Display
from selenium import webdriver

# Start a virtual display so Chrome can run without a monitor.
display = Display(visible=0, size=(1024, 1024))
display.start()

browser = webdriver.Chrome()
browser.get('http://shaohua0116.github.io/')
print(browser.title)
print(browser.find_element_by_class_name('bio').text)
Collected at 2019-12-05 11:31:13.692315
Number of submissions: 1579 (withdrawn submissions: 0)