Breakout Group Details
Breakout Group 1
Details
- Topic: Deep Learning for Software Engineering
- Time: 10:45am - 11:45am
- Room: Normal Heights
- Session Lead: Prem Devanbu
- Session Scribe: Denys Poshyvanyk
Participants
Bogdan Vasilescu, Charles Sutton, Sonia Haiduc, Audris Mockus, Abram Hindle, Ranjit Jhala, Collin McMillan, Raymond Mooney, Lin Tan
Discussion Points
- What tasks matter in DL applications to software engineering?
  - Individual development tasks?
  - Tasks related to collaboration and coordination?
  - Tasks for deployment?
  - Tasks relating to learning, training, and education?
  - Any differences between tasks in open-source and commercial settings?
- What kinds of data resources are available?
  - What representations of source code matter? (token-level, source code, AST, tests, data flow; a small sketch of the token and AST views follows this list)
  - Is labeling available in sufficient quantity?
  - How to deal with label sparsity, if any? (transfer learning, distant supervision, etc.)
  - What are the limitations of alignments in software engineering? (code-English, code-tests, code-invariants, etc.)
- What deep learning architectures are of interest?
  - Transformers, GGNNs, GANs, RNNs: what are the limitations of each for software artifacts?
  - What are the training challenges of each kind of architecture?
  - Are there practical (computational, human, social, legal) limitations to deploying DL technologies in IDEs or operational environments?
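The representation question above can be made concrete with a small, self-contained sketch. It is purely illustrative (the snippet and the use of Python's standard tokenize and ast modules are our own choices, not something prescribed in the session); it contrasts a flat token sequence with a syntax tree for the same function.

```python
# Minimal sketch: two representations of the same code fragment,
# a token-level view and an AST view, using only the standard library.
import ast
import io
import tokenize

SNIPPET = "def add(a, b):\n    return a + b\n"

# Token-level representation: a flat sequence of (type, text) pairs.
tokens = [
    (tokenize.tok_name[tok.type], tok.string)
    for tok in tokenize.generate_tokens(io.StringIO(SNIPPET).readline)
    if tok.string.strip()
]
print(tokens)   # e.g. [('NAME', 'def'), ('NAME', 'add'), ('OP', '('), ...]

# AST representation: a tree whose structure encodes nesting and syntax roles.
tree = ast.parse(SNIPPET)
print(ast.dump(tree, indent=2))
```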
Discussion Notes
Breakout Group 2
Details
- Topic: Verification & Validation of Deep Learning Systems
- Time: 10:45am - 11:45am
- Room: University Heights
- Session Lead: Matthew Dwyer
- Session Scribe: Sebastian Elbaum
Participants
Koushik Sen, Aditya Thakur, Gail Kaiser, Zhenming Liu, Shiqing Ma, Xiangyu Zhang, Bo Li
Discussion Points
- What properties can be specified of DL models?
  - Output invariants, variations on robustness, relational specifications with pre-defined feature predicates, more general metamorphic properties, properties inferred from models, probabilistic properties, ... (a sampling-based robustness check is sketched after this list)
- How can verification/validation address the sparsity of the training distribution?
  - A DL model is “well defined” on an infinitesimal portion of its input space, so performing V&V over the entire input space is unnecessary and horribly inefficient. What are meaningful coverage criteria given this?
- How can verification techniques for feedforward DNNs be scaled beyond toy problems?
- Should the research community seek to shape the evolution of these techniques, e.g., by “demanding” reproducibility and direct comparison on standard benchmarks, as has been helpful for SAT and SMT?
- How do system-level safety arguments flow down to DL components?
  - Is there any difference at the requirements level between an algorithmic implementation of bounding-box detection for a pedestrian in an image and a DL implementation?
- How do techniques developed for feedforward DL models apply to DRL or RNN models?
  - How do the property specifications change, e.g., do they become temporal?
  - Can analogs of symbolic trajectories, which connect state vs. path abstractions in non-DL systems, be applied?
- Given the inherent stochasticity in their definition, are DL models more amenable to N-version approaches for correctness than deterministic systems?
  - What frameworks could be used to argue that ensembles are safer than individual networks?
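The simplest property class mentioned above, local robustness, can be illustrated with a sampling-based check. This is a sketch under our own assumptions (the predict callable, the toy linear model, and the epsilon ball are hypothetical); sampling can only falsify the property, which is exactly the gap that the verification questions above target.

```python
# Sampling-based check of a local robustness property: the predicted label
# should be stable under small L-infinity perturbations of the input.
# This falsifies rather than verifies; a formal verifier is needed for proofs.
import numpy as np

def locally_robust(predict, x, epsilon=0.01, n_samples=1000, rng=None):
    """Return a counterexample perturbation if one is found, else None."""
    rng = rng or np.random.default_rng(0)
    base_label = predict(x)
    for _ in range(n_samples):
        delta = rng.uniform(-epsilon, epsilon, size=x.shape)
        if predict(x + delta) != base_label:
            return delta      # robustness violated within the epsilon-ball
    return None               # no violation among the sampled perturbations

# Toy usage with a hypothetical linear "model":
w = np.array([1.0, -2.0])
predict = lambda x: int(w @ x > 0.0)
print(locally_robust(predict, np.array([0.5, 0.2]), epsilon=0.05))
```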
Discussion Notes
Breakout Group 3
Details
- Topic: Development & Deployment Challenges for Deep Learning Systems
- Time: 10:45am - 11:45am
- Room: Cortez 3
- Session Lead: Mike Lowry
- Session Scribe: Kevin Moran
Participants
Tim Menzies, Satish Chandra, Christian Bird, Danny Tarlow, Vijayaraghavan Murali, Nachi Nagappan, Rishabh Singh
Discussion Points
- What approaches to certification of DL models can be imported from traditional safety-critical software?
  - What are the implications for development processes, especially as they relate to continuous updates of DL models with new data?
- What approaches to certification of DL models require divergences from the certification of traditional safety-critical software?
- What are the technical challenges of deploying DL systems that are capable of adapting in situ, i.e., where deep learning is performed as part of the system's input/output processing?
- What system architectures would provide both the safety and the adaptivity needed for DL systems incorporating in-situ learning? (a minimal runtime-guard sketch follows this list)
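The architecture question above is often approached with a runtime guard that pairs an adaptive DL component with a simple, certifiable fallback (a simplex-style arrangement). The sketch below is our own illustration, not a design from the session, and every name in it (dl_controller, safe_controller, within_envelope) is a hypothetical placeholder.

```python
# Illustrative simplex-style wrapper: the adaptive DL component is used only
# while its proposed action stays inside a verified safety envelope; otherwise
# control falls back to a simple, certifiable controller.
from typing import Any, Callable

def guarded_step(state: Any,
                 dl_controller: Callable[[Any], Any],
                 safe_controller: Callable[[Any], Any],
                 within_envelope: Callable[[Any, Any], bool]) -> Any:
    """Return an action for `state`, preferring the DL controller when safe."""
    proposed = dl_controller(state)
    if within_envelope(state, proposed):
        return proposed            # adaptive behavior, inside the envelope
    return safe_controller(state)  # certified fallback otherwise

# Toy usage: keep a one-dimensional "speed" below a limit.
speed_limit = 10.0
dl_controller = lambda s: s + 2.0                # hypothetical learned policy
safe_controller = lambda s: min(s, speed_limit)  # trivially safe fallback
within_envelope = lambda s, a: a <= speed_limit
print(guarded_step(9.5, dl_controller, safe_controller, within_envelope))  # 9.5
```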
Discussion Notes
Breakout Group 4
Details
- Topic: Maintenance of Deep Learning Systems
- Time: 2:00pm - 3:00pm
- Room: Normal Heights
- Session Lead: Sebastian Elbaum
- Session Scribe: Mike Lowry
Participants
Tim Menzies, Sonia Haiduc, Audris Mockus, Abram Hindle, Aditya Thakur, Collin McMillan, Zhenming Liu, Denys Poshyvanyk
Discussion Points
- ML/DL systems incur technical debt at both the code and the data level.
  - Infrastructure code incurs significant technical debt: “a mature DL-based system may contain 95% glue code connecting different ML libraries and packages.”
- There is little or no support for evaluating data dependencies in DL systems (compared to the many existing tools for classic software, where static analysis can be used).
- DL-based systems frequently reuse pre-trained parameters from other data sets (transfer learning), which adds dependencies on that data and on other evolving models/configurations.
- ML/DL-specific bad practices, e.g., experimental code paths (analogous to dead flags in traditional software).
- Configuration management of ML/DL systems can significantly impact performance; configurations need to be tested as thoroughly as code and data.
- DL-based systems rely on rapidly improving hardware (e.g., GPUs) and software (e.g., packages), so managing dependencies becomes an issue (see the environment-capture sketch after this list).
  - This requires careful and “clever” monitoring, and can make maintenance more expensive.
- ML/DL systems depend on evolving languages, formats, and infrastructures.
  - Keeping ML/DL systems up to date requires monitoring and logging to detect changes in the underlying “plumbing and glue” code.
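One lightweight response to the dependency and “plumbing” concerns above is to record the exact software environment and a data fingerprint alongside every trained model, so later failures can be traced to changed libraries or silently modified data. The sketch below is our own illustration; the file names and the choice of recorded fields are assumptions, not recommendations from the session.

```python
# Illustrative provenance capture: record package versions and a dataset hash
# next to a trained model so environment or data drift can be detected later.
import hashlib
import json
import platform
from importlib import metadata

def dataset_fingerprint(path: str) -> str:
    """SHA-256 of the raw dataset file, used to detect silent data changes."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def capture_environment(dataset_path: str) -> dict:
    return {
        "python": platform.python_version(),
        "packages": {d.metadata["Name"]: d.version for d in metadata.distributions()},
        "dataset_sha256": dataset_fingerprint(dataset_path),
    }

# Hypothetical usage:
# manifest = capture_environment("data/train.csv")
# with open("model_manifest.json", "w") as out:
#     json.dump(manifest, out, indent=2)
```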
Discussion Notes
Breakout Group 5
Details
- Topic: Testing of Deep Learning Systems
- Time: 2:00pm - 3:00pm
- Room: University Heights
- Session Lead: Xiangyu Zhang
- Session Scribe: Matthew Dwyer
Participants
Koushik Sen, Bo Li, Nachi Nagappan, Gail Kaiser, Lin Tan, Shiqing Ma
Discussion Points
- How to test DL models beyond norm-based adversarial attacks?
  - What kinds of errors may appear in DNN models (inadequate data, incorrect data, architectural defects, problems arising from the interaction between data and architecture)?
  - How to test more practical attacks (e.g., physical attacks)?
- How much confidence can a testing framework guarantee?
  - What kinds of testing metrics would be helpful?
  - What kinds of guarantees can be achieved in black-box vs. white-box vs. grey-box settings?
- How to generate meaningful test inputs?
  - Do we need to define a DSL for generating inputs?
  - Can we leverage lessons learned from fuzzing, mutation testing, etc.?
- Once a problem is identified, how to guide debugging?
  - What do we mean by bug localization here?
- How to guide repair based on testing/debugging results?
  - Guided data augmentation?
  - How to fix architectural issues?
- Can we leverage different architectures developed for the same task?
  - Leveraging differential testing
- What kinds of system-level (i.e., high-level) properties can we test for DL?
  - Can we leverage metamorphic properties and testing? (small metamorphic and differential test sketches follow this list)
- How to do regression testing for evolving models?
- How can testing of DL be tailored when DL models are applied to SE-specific tasks?
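Two of the ideas above, metamorphic and differential testing, translate into very small test harnesses. The sketch below is our own illustration: the model_a and model_b callables and the brightness-shift relation are hypothetical stand-ins, chosen only to show the shape of the checks.

```python
# Illustrative metamorphic and differential checks for image classifiers.
# The assumed metamorphic relation: a small uniform brightness shift should
# not change the predicted class.
import numpy as np

def metamorphic_violations(model, images, shift=0.05):
    """Indices of images whose label changes under a small brightness shift."""
    return [i for i, img in enumerate(images)
            if model(img) != model(np.clip(img + shift, 0.0, 1.0))]

def differential_disagreements(model_a, model_b, images):
    """Indices where two models trained for the same task disagree."""
    return [i for i, img in enumerate(images) if model_a(img) != model_b(img)]

# Toy usage with stand-in "models" that threshold mean brightness:
images = [np.full((4, 4), 0.2), np.full((4, 4), 0.47), np.full((4, 4), 0.8)]
model_a = lambda img: int(img.mean() > 0.5)
model_b = lambda img: int(img.mean() > 0.45)
print(metamorphic_violations(model_a, images))               # [1]
print(differential_disagreements(model_a, model_b, images))  # [1]
```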
Discussion Notes
Breakout Group 6
Details
- Topic: Deep Learning for Code Generation
- Time: 2:00pm - 3:00pm
- Room: Cortez 3
- Session Lead: Rishabh Singh
- Session Scribe: Kevin Moran
Participants
Bogdan Vasilescu, Charles Sutton, Satish Chandra, Christian Bird, Danny Tarlow, Ranjit Jhala, Raymond Mooney, Vijayaraghavan Murali, Premkumar Devanbu
Discussion Points
- What applications of automated code generation seem most promising, both in the near future and in the longer term?
  - e.g., program superoptimization, code completion, repairing program bugs with small patches, end-user programming, mobile app development
- What are the boundaries of software systems in which to consider automated code generation?
  - Generating full systems code automatically is unlikely (or maybe not?); generating code in specialized domains, function-level synthesis, and end-user programming seem more plausible.
- What are good specification mechanisms for describing the programmer’s high-level intent?
  - Full specifications are probably as difficult to write as the program itself; alternative options could be partial programs, unit tests, I/O examples, natural language, or UI (a small example-checking sketch follows this list).
- What are suitable architectures for embedding the programmer’s intent and the generated code?
  - Different neural architectures for embedding examples, partial programs, natural language specifications, etc.
- Similarly, what are good architectures for generative models of code?
- Programmer-CodeGenerator collaboration
  - What might be good interface boundaries where the synthesizer and programmer can collaborate to write code more efficiently?
- What could be good challenge benchmarks for measuring progress on the ability to generate code of different complexity?
- What might be different ways to combine neural and symbolic techniques for more efficient code generation?
- How to ensure maintainability of automatically generated code?
- Generating code from scratch vs. composing pre-defined functions?
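The specification-mechanism discussion above (unit tests and I/O examples as partial specifications) can be made concrete with a tiny checker that ranks candidate programs by how many examples they satisfy, which is a common outer loop around a neural code generator. Everything in the sketch is hypothetical: the candidate functions, the example set, and the scoring rule are our own illustration.

```python
# Illustrative outer loop for example-based specification: rank candidate
# programs (e.g., sampled from a neural code generator) by how many
# input/output examples they satisfy.
from typing import Callable, List, Tuple

Example = Tuple[tuple, object]  # (arguments, expected output)

def score(candidate: Callable, examples: List[Example]) -> int:
    """Number of I/O examples the candidate satisfies (exceptions count as failures)."""
    passed = 0
    for args, expected in examples:
        try:
            if candidate(*args) == expected:
                passed += 1
        except Exception:
            pass
    return passed

# Hypothetical intent, given only as examples: "return the larger of two numbers".
examples = [((1, 2), 2), ((5, 3), 5), ((-1, -4), -1)]
candidates = [
    lambda a, b: a if a > b else b,  # correct
    lambda a, b: a + b,              # plausible-looking but wrong
    lambda a, b: b,                  # passes one example only
]
print([score(c, examples) for c in candidates])   # [3, 0, 1]
best = max(candidates, key=lambda c: score(c, examples))
print(best(10, 7))                                # 10
```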
Discussion Notes