Great list! The 65 best papers in Data Science history

One of the best ways to learn about Data Science, and to keep up with new developments and ideas, is to read and study the papers on its various subjects.

Because Data Science is such a hot topic, there are tons of papers covering a huge variety of subjects. Whichever topic you are most interested in, there is plenty to read and search through.

To make your search easier, we compiled a list of the best papers in Data Science, divided into 3 big topics: General, Clustering Algorithms and Machine Learning.

Here is the list!

General Papers

  1. MapReduce: Simplified Data Processing on Large Clusters
    Authors: Jeffrey Dean and Sanjay Ghemawat
    Year: 2004
  2. Dynamo: Amazon’s Highly Available Key-value Store
    Authors: Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall and Werner Vogels
    Year: 2007
  3. Bigtable: A Distributed Storage System for Structured Data
    Authors: Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber
    Year: 2006
  4. NoSQL Databases
    Author: Christof Strauch
    Year: 2009
  5. The Pathologies of Big Data
    Author: Adam Jacobs
    Year: 2009
  6. Evolving Role of the Enterprise Data Warehouse in the Era of Big Data Analytics
    Author: Ralph Kimball
    Year: 2011
  7. Big data: The next frontier for innovation, competition, and productivity
    Authors: James Manyika, Michael Chui, Brad Brown, Jacques Bughin, Richard Dobbs, Charles Roxburgh, Angela Hung Byers
    Year: 2011
  8. Dremel: Interactive Analysis of Web-Scale Datasets
    Authors: Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, Theo Vassilakis
    Year: 2010
  9. Chord: A Scalable Peer-to-peer Lookup Protocol for Internet Applications
    Authors: Ion Stoica, Robert Morris, David Liben-Nowell, David R. Karger, M. Frans Kaashoek, Frank Dabek, Hari Balakrishnan
    Year: 2001
  10. Cassandra – A Decentralized Structured Storage System
    Authors: Avinash Lakshman and Prashant Malik
    Year: 2009
  11. Pastry: Scalable, decentralized object location and routing for large-scale peer-to-peer systems
    Authors: Antony Rowstron and Peter Druschel
    Year: 2001
  12. Interpreting the Data: Parallel Analysis with Sawzall
    Authors: Rob Pike, Sean Dorward, Robert Griesemer, Sean Quinlan
    Year: 2005
  13. RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems
    Authors: Yongqiang He, Rubao Lee, Yin Huai, Zheng Shao, Namit Jain, Xiaodong Zhang, Zhiwei Xu
    Year: 2011
  14. The Google File System
    Authors: Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung
    Year: 2003
  15. Spanner: Google’s Globally-Distributed Database
    Authors: Google Team
    Year: 2012
  16. Large-scale Incremental Processing Using Distributed Transactions and Notifications
    Authors: Daniel Peng, Frank Dabek
    Year: 2010
  17. A Relational Model of Data for Large Shared Data Banks
    Author: E. F. Codd
    Year: 1970
  18. Pasting Small Votes for Classification in Large Databases and On-Line
    Author: Leo Breiman
    Year: 1999
  19. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank
    Authors: Richard Socher, Alex Perelygin, Jean Y. Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng and Christopher Potts
    Year: 2013
  20. Megastore: Providing Scalable, Highly Available Storage for Interactive Services
    Authors: Jason Baker, Chris Bond, James C. Corbett, JJ Furman, Andrey Khorlin, James Larson, Jean-Michel Leon, Yawei Li, Alexander Lloyd, Vadim Yushprakh
    Year: 2011
  21. F1: A Distributed SQL Database That Scales
    Authors: Jeff Shute, Radek Vingralek, Bart Samwel, Ben Handy, Chad Whipkey, Eric Rollins, Mircea Oancea, Kyle Littlefield, David Menestrina, Stephan Ellner, John Cieslewicz, Ian Rae, Traian Stancescu, Himani Apte
    Year: 2013
  22. Top 10 algorithms in data mining
    Authors: Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang, Hiroshi Motoda, Geoffrey J. McLachlan, Angus Ng, Bing Liu, Philip S. Yu, Zhi-Hua Zhou, Michael Steinbach, David J. Hand, Dan Steinberg
    Year: 2007
  23. Show and Tell: A Neural Image Caption Generator
    Authors: Oriol Vinyals, Alexander Toshev, Samy Bengio and Dumitru Erhan
    Year: 2014
  24. Data Science and its Relationship to Big Data and Decision Making
    Authors: Foster Provost and Tom Fawcett
    Year: 2013
  25. Mining Contrast Subspaces
    Authors: Lei Duan, Guanting Tang, Jian Pei, James Bailey, Guozhu Dong, Akiko Campbell and Changjie Tang
    Year: 2014
  26. Experimental evidence of massive-scale emotional contagion through social networks
    Authors: Adam D. I. Kramer, Jamie E. Guillory and Jeffrey T. Hancock
    Year: 2014
  27. Preventing False Discovery in Interactive Data Analysis is Hard
    Authors: Moritz Hardt and Jonathan Ullman
    Year: 2014
  28. ClusCite: Effective Citation Recommendation by Information Network-Based Clustering
    Authors: Xiang Ren, Jialu Liu, Xiao Yu, Urvashi Khandelwal, Quanquan Gu, Lidan Wang and Jiawei Han
    Year: 2014
  29. Reducing the Sampling Complexity of Topic Models
    Authors: Aaron Q. Li, Amr Ahmed, Sujith Ravi and Alexander J. Smola
    Year: 2014
  30. LSTM: A Search Space Odyssey
    Authors: Klaus Greff, Rupesh Kumar Srivastava, Jan Koutník, Bas R. Steunebrink and Jürgen Schmidhuber
    Year: 2015
  31. Semi-Supervised Learning with Ladder Networks
    Authors: Antti Rasmus, Harri Valpola, Mikko Honkala, Mathias Berglund and Tapani Raiko
    Year: 2015
  32. Towards Neural Network-based Reasoning
    Authors: Baolin Peng, Zhengdong Lu, Hang Li and Kam-Fai Wong
    Year: 2015

Clustering Algorithms

  1. Algorithms for hierarchical clustering: An overview
    Authors: Fionn Murtagh and Pedro Contreras
    Year: 2012
  2. SLINK: An optimally efficient algorithm for the single-link cluster method
    Author: R. Sibson
    Year: 1973
  3. Optimal algorithms for complete linkage clustering in d dimensions
    Authors: Drago Krznaric and Christos Levcopoulos
    Year: 2002
  4. An efficient algorithm for a complete link method
    Author: D. Defays
    Year: 1977
  5. Robust Hierarchical Clustering
    Authors: Maria Florina Balcan and Pramod Gupta
    Year: 2014
  6. Optimal Implementations of UPGMA and Other Common Clustering Algorithms
    Authors: Ilan Gronau and Shlomo Moran
    Year: 2007
  7. An Efficient k-Means Clustering Algorithm: Analysis and Implementation
    Authors: Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman and Angela Y. Wu
    Year: 2002
  8. A K-Means Clustering Algorithm
    Authors: J. A. Hartigan and M. A. Wong
    Year: 1979
  9. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise
    Authors: Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu
    Year: 1996
  10. OPTICS: Ordering Points To Identify the Clustering Structure
    Authors: Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel and Jörg Sander
    Year: 1999
  11. BIRCH: An Efficient Data Clustering Method for Very Large Databases
    Authors: Tian Zhang, Raghu Ramakrishnan and Miron Livny
    Year: 1996
  12. CURE: An Efficient Clustering Algorithm for Large Databases
    Authors: Sudipto Guha, Rajeev Rastogi and Kyuseok Shim
    Year: 2001
  13. CLARANS: a method for clustering objects for spatial data mining
    Authors: Raymond T. Ng and Jiawei Han
    Year: 2002
  14. FCM: The Fuzzy C-Means Clustering Algorithm
    Authors: James C. Bezdek, Robert Ehrlich and William Full
    Year: 1984
  15. The Expectation Maximization Algorithm
    Author: Frank Dellaert
    Year: 2002
  16. The EM Algorithm
    Author: Xiaojin Zhu
    Year: 2007

Machine Learning

  1. Parallel Spectral Clustering in Distributed Systems
    Authors: Wen-Yen Chen, Yangqiu Song, Hongjie Bai, Chih-Jen Lin, Edward Y. Chang
    Year: 2011
  2. Learning Multiple Layers of Features from Tiny Images
    Author: Alex Krizhevsky
    Year: 2009
  3. Distributed Algorithms for Topic Models
    Authors: David Newman, Arthur Asuncion, Padhraic Smyth and Max Welling
    Year: 2009
  4. Spectral Analysis for Billion-Scale Graphs: Discoveries and Implementation
    Authors: U Kang, Brendan Meeder and Christos Faloutsos
    Year: 2011
  5. Large Language Models in Machine Translation
    Authors: Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och and Jeffrey Dean
    Year: 2007
  6. Learning using Large Datasets
    Authors: Léon Bottou and Olivier Bousquet
    Year: 2008
  7. Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks
    Authors: Samy Bengio, Oriol Vinyals, Navdeep Jaitly and Noam Shazeer
    Year: 2015
  8. Training recurrent networks online without backtracking
    Authors: Yann Ollivier and Guillaume Charpiat
    Year: 2015
  9. PEGASUS: A Peta-Scale Graph Mining System - Implementation and Observations
    Authors: U Kang, Charalampos E. Tsourakakis and Christos Faloutsos
    Year: 2009
  10. Learning Deep Architectures for AI
    Author: Yoshua Bengio
    Year: 2009
  11. Intriguing properties of neural networks
    Authors: Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow and Rob Fergus
    Year: 2014
  12. Thinking in Parallel: Some Basic Data-Parallel Algorithms and Techniques
    Author: Uzi Vishkin
    Year: 2010
  13. Pattern Recognition and Machine Learning
    Author: Christopher M. Bishop
    Year: 2006
  14. A Few Useful Things to Know about Machine Learning
    Author: Pedro Domingos
    Year: 2012
  15. Map-Reduce for Machine Learning on Multicore
    Authors: Cheng-Tao Chu, Sang Kyun Kim, Yi-An Lin, YuanYuan Yu, Gary Bradski, Andrew Y. Ng and Kunle Olukotun
    Year: 2006
  16. Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images
    Authors: Anh Nguyen, Jason Yosinski and Jeff Clune
    Year: 2014
  17. Towards Neural Network-based Reasoning
    Authors: Baolin Peng, Zhengdong Lu, Hang Li and Kam-Fai Wong
    Year: 2015

Hope you like our selection of important papers on Data Science! We think you can learn a lot about specific Data Science subjects by reading and studying these documents. We’ll keep updating and enlarging this list to make it as complete as possible.