TWiki> ANC Web>MAST (03 May 2017, Main.alouis)

Machine learning for the Analysis of Source code Text (MAST)

MAST is a reading group focusing on Machine Learning for source code and software engineering. Meetings take place Wednesdays at 4pm. More information about the group and what we do can be found on our website and GitHub repository.

Next Reading

Our next reading will be as follows:

Wednesday 3rd May 2017 at 4pm:

Introduction and planning meeting

Future Readings

Software Engineering 'Classics'

M Lam, R Sethi, J Ullman, A Aho: Compilers: Principles, Techniques, and Tools. book

S Muchnick: Advanced Compiler Design & Implementation. book

A Abran, P Bourque: SWEBOK: Guide to the software engineering Body of Knowledge. book

Hindle, A., Barr, E. T., Su, Z., Gabel, M., and Devanbu, P. On the naturalness of software. ICSE 2012, p. 837-847.

Gabel, M., and Su, Z. A study of the uniqueness of source code. FSE 2010, p. 147-156.

Bruch, M., Monperrus, M., and Mezini, M. Learning from examples to improve code completion systems. FSE 2009, p. 213-222.

Mining API Patterns

Zhong, H., Xie, T., Zhang, L., Pei, J., and Mei, H. MAPO: mining and recommending API usage patterns. ECOOP 2009, p. 318-343. paper

Buse, R. P., and Weimer, W. Synthesizing API usage examples. ICSE 2012, p. 782-792. paper

Xing, Z., and Stroulia, E. API-evolution support with diff-CatchUp. Software Engineering, IEEE Transactions on 33, 12 (2007), p. 818-836.

Nguyen et al. "Statistical learning approach for mining API usage mappings for code migration." ASE 2014. paper

Zhu, Zixiao, et al. "Mining API Usage Examples from Test Code." ICSME 2014. paper

Petrosyan, G., Robillard, M. P., & De Mori, R. Discovering Information Explaining API Types Using Text Classification. ICSE 2015. paper

Moreno, L., Bavota, G., Di Penta, M., Oliveto, R., & Marcus, A. How Can I Use This Method?. ICSE 2015. paper

Nguyen, Tam The, et al. "Learning API Usages from Bytecode: A Statistical Approach." ICSE 2016. paper

Saied, Mohamed Aymen, et al. "Could We Infer API Usage Patterns only using the Library Source Code?." paper

Mining Software

Hassan, A. E. The road ahead for mining software repositories. FoSM 2008, p. 48-57.

Gabel, M., and Su, Z. Javert: fully automatic mining of general temporal properties from dynamic traces. FSE 2008, p. 339-349.

Livshits, B., and Zimmermann, T. DynaMine: finding common error patterns by mining software revision histories. In Software Engineering Notes (2005), vol. 30, p. 296-305. paper

Negara, S., Codoban, M., Dig, D., and Johnson, R. E. Mining fine-grained code changes to detect unknown change patterns. ICSE 2014.

Moreno, L., et al. "On the Use of Stack Traces to Improve Text Retrieval-based Bug Localization." ICSME 2014. paper

Han, Shi, et al. "Performance debugging in the large via mining millions of stack traces." ICSE 2012. paper

Program Synthesis

Text processing for program synthesis.

Code Summarization

Ying, A. T., and Robillard, M. P. Code fragment summarization. FSE 2013.

Haiduc, S., Aponte, J., Moreno, L., & Marcus, A. On the use of automated text summarization techniques for summarizing source code. WCRE 2010.

Eddy, B. P., Robinson, J. A., Kraft, N. A., & Carver, J. C. Evaluating source code summarization techniques: Replication and expansion. ICPC 2013.

McBurney, P. W., Liu, C., McMillan, C., and Weninger, T. Improving Topic Model Source Code Summarization, ICPC 2014. paper

McBurney, P. W., McMillan, C. Automatic Documentation Generation via Source Code Summarization of Method Context. ICPC 2014. paper

Moreno, Laura. "Summarization of complex software artifacts." Companion Proceedings of the 36th International Conference on Software Engineering. ACM, 2014. paper

Liu, Yu, et al. "Supporting program comprehension with program summarization." ICIS 2014. paper

Bairi, R. B., Iyer, R., Ramakrishnan, G., & Bilmes, J. Summarization of Multi-Document Topic Hierarchies using Submodular Mixtures. paper

McBurney, P. W., McMillan, C., "Automatic Source Code Summarization of Context for Java Methods", TSE 2015. paper

Rodeghero, P., Liu, C., McBurney, P. W., McMillan, C., "An Eye-Tracking Study of Java Programmers and Application to Source Code Summarization", TSE 2015. paper

McBurney, P. W., McMillan, C., "An Empirical Study of the Textual Similarity between Source Code and Source Code Summaries", EMSE 2015. paper

Design Patterns

Basit, Hamid Abdul, and Stan Jarzabek. "Detecting higher-level similarity patterns in programs." In ACM SIGSOFT Software Engineering Notes, vol. 30, no. 5, pp. 156-165. ACM, 2005. paper

Gil, Joseph Yossi, and Itay Maman. "Micro patterns in Java code." In ACM SIGPLAN Notices, vol. 40, no. 10, pp. 97-116. ACM, 2005. paper

Shi, Nija, and Ronald A. Olsson. "Reverse engineering of design patterns from java source code." ASE 2006. paper

Posnett, Daryl, Christian Bird, and Premkumar Devanbu. "THEX: Mining metapatterns from java." MSR 2010. paper

Tsantalis, Nikolaos, Alexander Chatzigeorgiou, George Stephanides, and Spyros T. Halkidis. "Design pattern detection using similarity scoring." Software Engineering, IEEE Transactions on 32, no. 11 (2006): 896-909. paper

Guéhéneuc, Yann-Gaël, Houari Sahraoui, and Farouk Zaidi. "Fingerprinting design patterns." WCRE 2004. paper

Guéhéneuc, Y-G., and Giuliano Antoniol. "Demima: A multilayered approach for design pattern identification." Software Engineering, IEEE Transactions on 34, no. 5 (2008): 667-684. paper

Heuzeroth, Dirk, Thomas Holl, Gustav Hogstrom, and Welf Lowe. "Automatic design pattern detection." IWPC 2003. paper

Balanyi, Zsolt, and Rudolf Ferenc. "Mining design patterns from C++ source code." ICSM 2003. paper

Antoniol, Giuliano, Gerardo Casazza, Massimiliano Di Penta, and Roberto Fiutem. "Object-oriented design patterns recovery." Journal of Systems and Software 59, no. 2 (2001): 181-196. paper

Smith, Jason M., and David Stotts. "SPQR: Flexible automated design pattern extraction from source code." ASE 2003. paper

Dynamic Analysis

Pradel, Michael, Parker Schuh, and Koushik Sen. "TypeDevil: Dynamic type inconsistency analysis for JavaScript." Software Engineering (ICSE), 2015 IEEE/ACM 37th IEEE International Conference on. Vol. 1. IEEE, 2015. paper

Guo, Philip J., et al. "Dynamic inference of abstract types." Proceedings of the 2006 international symposium on Software testing and analysis. ACM, 2006. paper

Other Papers

  1. De Lucia, A., Di Penta, M., Oliveto, R., Panichella, A., and Panichella, S. Labeling source code with information retrieval methods: an empirical study. Empirical Software Engineering (2013), 1-38.
  2. Gruska, N., Wasylkowski, A., and Zeller, A. Learning from 6,000 projects: lightweight cross-project anomaly detection. In Proceedings of the 19th international symposium on Software testing and analysis (2010), ACM, p. 119-130.
  3. Harman, M. The current state and future of search based software engineering. In 2007 Future of Software Engineering (2007), IEEE Computer Society, p. 342-357.
  4. Maurer, P. M. Generating test data with enhanced context-free grammars. Software, IEEE 7, 4 (1990), 50-55.
  5. Menon, A., Tamuz, O., Gulwani, S., Lampson, B., and Kalai, A. A machine learning framework for programming by example. ICML 2013.
  6. Omar, C. Structured statistical syntax tree prediction. In Proceedings of the 2013 companion publication for conference on Systems, programming, & applications: software for humanity (2013), ACM, p. 113-114.
  7. Begel, Andrew, and Thomas Zimmermann. "Analyze This! 145 Questions for Data Scientists in Software Engineering." (2014). paper
  8. Nguyen, A., Piech, C., Huang, J., & Guibas, L. Codewebs: Scalable Homework Search for Massive Open Online Programming Courses. WWW 2014. paper
  9. Tu, Zhaopeng, Zhendong Su, and Prem Devanbu. "On the Localness of Software." paper
  10. Mou, L., Li, G., Jin, Z., Zhang, L., & Wang, T. (2014). TBCNN: A Tree-Based Convolutional Neural Network for Programming Language Processing. paper
  11. Campbell, Joshua Charles, Abram Hindle, and José Nelson Amaral. "Python: Where the Mutants Hide or, Corpus-based Coding Mistake Location in Dynamic Languages." paper
  12. Movshovitz-Attias, Dana, and William W. Cohen. "KB-LDA: Jointly Learning a Knowledge Base of Hierarchy, Relations, and Facts" paper
  13. Thummalapenta, Suresh, et al. "Automating test automation." ICSE 2012. paper
  14. Barr, Earl T., et al. "The Plastic Surgery Hypothesis." paper
  15. Gulwani, Sumit, and Nebojsa Jojic. "Program verification as probabilistic inference." ACM SIGPLAN Notices. Vol. 42. No. 1. ACM, 2007. paper
  16. Livshits, Benjamin, et al. "Merlin: specification inference for explicit information flow problems." ACM Sigplan Notices. Vol. 44. No. 6. ACM, 2009. paper
  17. Galenson, Joel, et al. "CodeHint: Dynamic and interactive synthesis of code snippets." ICSE 2014. paper
  18. Devanbu, P. "New Initiative: The Naturalness of Software." ICSE 2015. paper
  19. Wong, E., Liu, T., & Tan, L. CloCom: Mining Existing Source Code for Automatic Comment Generation. SANER 2015. paper
  20. Schkufza, Eric, Rahul Sharma, and Alex Aiken. "Stochastic superoptimization." ACM SIGARCH Computer Architecture News. Vol. 41. No. 1. ACM, 2013. paper
  21. Ray, Baishakhi, et al. The Uniqueness of Changes: Characteristics and Applications. Microsoft Research Technical Report, 2014. paper
  22. White, M., Vendome, C., Linares-Vásquez, M., & Poshyvanyk, D. Toward Deep Learning Software Repositories. MSR 2015. paper
  23. Drummond, Anna, et al. "Learning to Grade Student Programs in a Massive Open Online Course." ICDM 2014.
  24. Xin Zhang, Ravi Mangal, Mayur Naik, and Aditya Nori. A User-Guided Approach to Program Analysis. FSE 2015.
  25. Ray, Baishakhi, et al. "On the 'Naturalness' of Buggy Code." arXiv preprint arXiv:1506.01159 (2015). paper
  26. Nguyen, A. T., & Nguyen, T. N. Graph-based Statistical Language Model for Code. paper
  27. Chen, Yang, et al. "Taming compiler fuzzers." ACM SIGPLAN Notices. Vol. 48. No. 6. ACM, 2013. paper
  28. Nguyen, Hoan Anh, et al. "Consensus-Based Mining of API Preconditions in Big Code." short paper
  29. Petrosyan, Gayane, Martin P. Robillard, and Renato De Mori. "Discovering Information Explaining API Types Using Text Classification." paper
  30. Nguyen A. T. et al. Divide-and-Conquer Approach for Multi-phase Statistical Migration for Source Code. ASE 2015.
  31. Lam A. N. et al. Combining Deep Learning with Information Retrieval to Localize Buggy Files for Bug Reports. ASE 2015.
  32. Grigore, Radu, and Hongseok Yang. "Abstraction Refinement by a Learnt Probabilistic Model." arXiv preprint arXiv:1511.01874 (2015). paper
  33. Aggarwal, Karan, Mohammad Salameh, and Abram Hindle. Using machine translation for converting Python 2 to Python 3 code. No. e1817. PeerJ PrePrints, 2015. paper
  34. Transforming Spreadsheet Data Types using Examples. Rishabh Singh, Sumit Gulwani. paper
  35. Program Synthesis with Noise. Veselin Raychev, Pavol Bielik, Martin Vechev, Andreas Krause. paper
  36. Livshits, Benjamin, et al. "Merlin: specification inference for explicit information flow problems." PLDI 2009. paper
  37. Rahul Sharma, Aditya Nori, Alex Aiken. "Bias-variance tradeoffs in program analysis." POPL 2014. paper
  38. Rahul Sharma, Saurabh Gupta, Bharath Hariharan, Alex Aiken, Aditya Nori. "Verification as Learning Geometric Concepts." SAS, 2013. paper
  39. Rahul Sharma, Saurabh Gupta, Bharath Hariharan, Alex Aiken, Percy Liang, Aditya Nori. "A Data Driven Approach for Algebraic Loop Invariants". ESOP, 2013. paper
  40. Rahul Sharma, Aditya Nori, Alex Aiken. "Interpolants as Classifiers". CAV, 2012. paper
  41. Gulwani et al. "Autograder for programming assignments". PLDI 2013. paper
  42. Data-Driven Precondition Inference with Learned Features. Saswat Padhi, Rahul Sharma, Todd Millstein. PLDI 2016.
  43. Polymorphic Type Inference for Machine Code. Matt Noonan, Alexey Loginov, David Cok. PLDI 2016.
  44. Refinement Types for TypeScript. Panagiotis Vekris, Benjamin Cosman, Ranjit Jhala. PLDI 2016.
  45. Statistical Similarity of Binaries. Yaniv David, Nimrod Partush, Eran Yahav. PLDI 2016.
  46. Long, Fan, and Martin Rinard. "Automatic patch generation by learning correct code." Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages. ACM, 2016. link
  47. Schkufza, Eric, Rahul Sharma, and Alex Aiken. "Stochastic program optimization." Communications of the ACM 59.2 (2016): 114-122.
  48. Brun, Yuriy, and Michael D. Ernst. "Finding latent code errors via machine learning over program executions." Proceedings of the 26th International Conference on Software Engineering. IEEE Computer Society, 2004. link
  49. Exploring the limits of language modeling link
  50. Trong Duc Nguyen, Anh Tuan Nguyen, and Tien N. Nguyen, "Mapping API Elements for Code Migration with Vector Representations"
  51. Thanh Van Nguyen, Anh Tuan Nguyen, and Tien N. Nguyen, "Characterizing API Elements via Textual Descriptions in Software Documentation with Vector Representation"
  52. History Driven Program Repair. Xuan-Bach D. Le, David Lo, Claire Le Goues. link
  53. Pedro Domingos, practical fault localization AAAI 2016 link
  54. Nguyen, Anh Tuan, Hoan Anh Nguyen, and Tien N. Nguyen. "A large-scale study on repetitiveness, containment, and composability of routines in open-source projects." Proceedings of the 13th International Workshop on Mining Software Repositories. ACM, 2016.
  55. Nguyen, Trong Duc, Anh Tuan Nguyen, and Tien N. Nguyen. "Mapping API elements for code migration with vector representations." Proceedings of the 38th International Conference on Software Engineering Companion. ACM, 2016.
  56. Pham, Hung Viet, Phong Minh Vu, and Tung Thanh Nguyen. "Learning API usages from bytecode: a statistical approach." Proceedings of the 38th International Conference on Software Engineering. ACM, 2016.
  57. Liu, Hui, et al. "Nomen est omen: exploring and exploiting similarities between argument and parameter names." Proceedings of the 38th International Conference on Software Engineering. ACM, 2016.
  58. Yaghmazadeh, Navid, et al. "Synthesizing Transformations on Hierarchically Structured Data."
  59. Bunel, Rudy, et al. "Adaptive Neural Compilation." arXiv preprint arXiv:1605.07969 (2016).
  60. Gu, Xiaodong, et al. Deep API Learning.
  61. Riedel, Sebastian, Matko Bošnjak, and Tim Rocktäschel. "Programming with a Differentiable Forth Interpreter." arXiv preprint arXiv:1605.06640 (2016).
  62. Pham, Hung Viet, Phong Minh Vu, and Tung Thanh Nguyen. "Learning API usages from bytecode: a statistical approach." Proceedings of the 38th International Conference on Software Engineering. ACM, 2016.
  63. Amann, Sven, et al. "A study of Visual Studio usage in practice." Proceedings of the 23rd IEEE International Conference on Software Analysis, Evolution, and Reengineering (SANER'16). 2016.
  64. Pu, Yewen, et al. "sk_p: a neural program corrector for MOOCs." arXiv preprint arXiv:1607.02902 (2016).
  65. Christakis, Maria, and Christian Bird. "What Developers Want and Need from Program Analysis: An Empirical Study."
  66. Bastani, O., Sharma, R., Aiken, A., & Liang, P. Synthesizing Program Input Grammars. link

Past Readings

Software Engineering

Mens, T., Tourwé, T. A survey of software refactoring. Software Engineering, 30(2), 126-139, 2004. paper (28th May)

Ernst, M. D. et al. Dynamically discovering likely program invariants to support program evolution. Software Engineering, 27(2), 99-123, 2001. paper (21st May)

Harman, M., et al. Search Based Software Engineering: A Comprehensive Analysis and Review of Trends Techniques and Applications. Tech Rep. 2009. paper (14th May)

Fast, Ethan, et al. "Emergent, crowd-scale programming practice in the IDE." CHI 2014. paper

Liblit, Ben, et al. "Scalable statistical bug isolation." ACM SIGPLAN Notices. Vol. 40. No. 6, 2005. paper

Velez, Martin, et al. "A Study of 'Wheat' and 'Chaff' in Source Code." arXiv preprint arXiv:1502.01410 (2015). paper (25th March)

Little, G., & Miller, R. C. (2009). Keyword programming in Java. ASE, 16(1), 37-71. paper (8th April)

E. T. Barr et al. Automated Software Transplantation. ISSTA’15. paper (21st October)

Gulwani et al. Semi-Supervised Verified Feedback Generation. arXiv preprint. paper (30th March)

Machine Translation for Source Code

Karaivanov, S., Raychev V., and Vechev, M. "Phrase-Based Statistical Translation of Programming Languages." Onward! 2014. paper (1st October)

Nguyen et al. Lexical statistical machine translation for language migration. FSE 2013. paper (19th March)

Sahil Bhatia and Rishabh Singh. "Automated Correction for Syntax Errors in Programming Assignments using Recurrent Neural Networks." arXiv preprint arXiv:1603.06129 (2016) paper (20th April)

Oda, Yusuke, et al. "Learning to Generate Pseudo-code from Source Code using Statistical Machine Translation." paper (30th September)

Mining Source Code

Davies, Steven, and Marc Roper. "What's in a bug report?." ESEM 2014. paper (16th October)

Nguyen et al. Mining interprocedural, data-oriented usage patterns in JavaScript web applications. ICSE 2014. paper (4th June)

Nguyen et al. A study of repetitiveness of code changes in software evolution. ASE 2013. paper (6th May)

Wang, J., et al. Mining succinct and high-coverage API usage patterns from source code. MSR 2013. paper (26th Feb)

Hsiao, Chun-Hung, Michael Cafarella, and Satish Narayanasamy. "Using web corpus statistics for program analysis." OOPSLA 2014. paper (5th Nov)

Nguyen et al. "Mining Preconditions of APIs in Large-scale Code Corpus." FSE 2014. paper (22nd April)

Movshovitz-Attias, D., & Cohen, W. W. Grounded Discovery of Coordinate Term Relationships between Software Entities. arXiv preprint 2015. arXiv:1505.00277. paper (1st July)

Chen, Fuxiang, and Sunghun Kim. "Crowd Debugging." paper (16th Sep)

Code Summarization

Cortés-Coy, Luis Fernando, et al. "On Automatically Generating Commit Messages via Summarization of Source Code Changes." SCAM 14. paper (21st October)

Rodeghero, P., et al. Improving Automated Source Code Summarization via an Eye-Tracking Study of Programmers. ICSE 2014. paper (23rd April)

D. Movshovitz-Attias and W. Cohen. Natural Language Models for Predicting Programming Comments. ACL 2013. paper (9th April)

Wong, E., Yang, J., and Tan, L. AutoComment: mining question and answer sites for automatic comment generation. ASE 2013. paper (2nd April)

A. Ying and M. Robillard. "Selection and Presentation Practices for Code Example Summarization." (FSE 2014) paper (3rd December)

Language Models for Source Code

Raychev, V., Vechev, M., Yahav, E. Code Completion with Statistical Language Models. PLDI 2014. paper (30th April)

Nguyen et al. A statistical semantic language model for source code. FSE 2013. paper (19th Feb)

C. J. Maddison, D. Tarlow. Structured Generative Models of Natural Source Code. paper (12th Feb)

Raychev, V., Vechev, M., Krause, A. "Predicting Program Properties from Big Code." 2015. paper (26th Nov)

Tsarfaty, R. et al. "Semantic Parsing using Content and Context: A Case Study from Requirements Elicitation." EMNLP 2014. paper (28th Jan)

Clarke, James, et al. "Driving semantic parsing from the world's response." CoNLL 2010. paper (18th Feb)

Learning Program Embeddings to Propagate Feedback on Student Code. Chris Piech, Jonathan Huang, Andy Nguyen, Leonidas Guibas, Mehran Sahami. ICML 2015. paper (15th July)

Hill, Felix, et al. "Learning to Understand Phrases by Embedding the Dictionary." arXiv preprint arXiv:1504.00548 (2015). paper (29th July)

V. Musco et al. A Generative Model of Software Dependency Graphs to Better Understand Software Evolution. arXiv preprint. paper (11th November)

Yujia Li, Daniel Tarlow, Marc Brockschmidt, Richard Zemel. Gated Graph Sequence Neural Networks. paper (10th February)

Code Synthesis

Gulwani, S., and Marron, M. "NLyze: Interactive programming by natural language for spreadsheet data analysis and manipulation." SIGMOD, 2014. paper (9th October)

Cozzie, A. and King, S. Macho: Writing programs with natural language and examples. University of Illinois Technical Report 2012. paper (21st Jan)

Lei, Tao, et al. "From natural language specifications to program input parsers." ACL 2013. paper (11th Feb)

Perelman, Daniel, et al. "Test-driven synthesis." PLDI 2014. paper (25th Feb)

V. Le, S. Gulwani, and Z. Su. "Smartsynth: Synthesizing smartphone automation scripts from natural language." MobiSys '13. paper (2nd April)

Chris Quirk, Raymond Mooney and Michel Galley, Language to Code: Learning Semantic Parsers for If-This-Then-That Recipes. ACL 2015. paper (17th June)

Kremenek, Ted, et al. "From uncertainty to belief: Inferring the specification within." Proceedings of the 7th symposium on Operating systems design and implementation. USENIX Association, 2006. paper (12th August)

Gvero, Tihomir, and Viktor Kuncak. On Synthesizing Code from Free-Form Queries. No. EPFL-REPORT-201606. 2014. paper (7th October)

L. Mou et al. On End-to-End Program Generation from User Intention by Deep Neural Networks. paper (4th November)

He Zhu, Aditya Nori, and Suresh Jagannathan. "Learning refinement types." ICFP 2015. paper (27th January)

Ling, Wang, et al. "Latent Predictor Networks for Code Generation." arXiv preprint arXiv:1603.06744 (2016). paper (6th April)

Scott Reed, Nando de Freitas. "Neural Programmer-Interpreters." ICLR 2016 Best Paper. paper (27th April)

Topic revision: r114 - 03 May 2017 - 12:30:58 - Main.alouis