
Breakout Group Details

Breakout Group 1

Details

  • Topic: Deep Learning for Software Engineering
  • Time: 10:45am - 11:45am
  • Room: Normal Heights
  • Session Lead: Prem Devanbu
  • Session Scribe: Denys Poshyvanyk

Participants

Bogdan Vasilescu
Charles Sutton
Sonia Haiduc
Audris Mockus
Abram Hindle
Ranjit Jhala
Collin McMillan
Raymond Mooney
Lin Tan

Discussion Points

  1. What tasks matter in DL applications to software engineering?

    • Individual development tasks?
    • Tasks related to collaboration and coordination?
    • Tasks for deployment?
    • Tasks related to learning, training, and education?
    • Any differences between tasks in open-source and commercial settings?
  2. What kinds of data resources are available?

    • What representations of source code matter (token level, raw source code, ASTs, tests, data flow)? A small sketch of two such representations follows this list.
    • Is labeling available in sufficient quantity?
    • How should label sparsity, if any, be handled (transfer learning, distant supervision, etc.)?
    • What are the limitations of alignments in software engineering (code-to-English, code-to-tests, code-to-invariants, etc.)?
  3. What deep learning architectures are of interest?

    • Transformers, GGNNs, GANs, RNNs: what are the limitations of each for software artifacts?
    • What are the training challenges of each kind of architecture?
    • Are there practical (computational, human, social, legal) limitations to deploying DL technologies in IDEs or operational environments?
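
A minimal sketch of two of the source code representations listed under point 2 (token level and AST level), using only the Python standard library. The snippet and function names are illustrative assumptions, not part of any particular pipeline.

    import ast
    import io
    import tokenize

    SOURCE = "def add(a, b):\n    return a + b\n"

    def token_representation(source):
        """Flat token-level view of the source (one common DL input format)."""
        readline = io.StringIO(source).readline
        return [tok.string for tok in tokenize.generate_tokens(readline)
                if tok.string.strip()]

    def ast_representation(source):
        """AST node types from ast.walk (a structural view of the same code)."""
        tree = ast.parse(source)
        return [type(node).__name__ for node in ast.walk(tree)]

    if __name__ == "__main__":
        print(token_representation(SOURCE))  # ['def', 'add', '(', 'a', ',', 'b', ...]
        print(ast_representation(SOURCE))    # ['Module', 'FunctionDef', ...]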

Discussion Notes


Breakout Group 2

Details

  • Topic: Verification & Validation of Deep Learning Systems
  • Time: 10:45am - 11:45am
  • Room: University Heights
  • Session Lead: Matthew Dwyer
  • Session Scribe: Sebastian Elbaum

Participants

Koushik Sen
Aditya Thakur
Gail Kaiser
Zhenming Liu
Shiqing Ma
Xiangyu Zhang
Bo Li

Discussion Points

  1. What properties of DL models can be specified?

    • Output invariants, variations on robustness, relational specifications with pre-defined feature predicates, more general metamorphic properties, properties inferred from models, probabilistic properties, ... (a sampling-based sketch of a local robustness check follows this list)
  2. How can verification/validation address the sparsity of the training distribution?

    • A DL model is “well defined” on an infinitesimal portion of its input space, so performing V&V over the entire input space is unnecessary and horribly inefficient. What are meaningful coverage criteria given this?
  3. How can verification techniques for feedforward DNNs be scaled beyond toy problems?

  4. Should the research community seek to shape the evolution of these techniques, e.g., by “demanding” reproducibility and direct comparison on standard benchmarks as has been helpful for SAT and SMT?

  5. How do system level safety arguments flow down to DL components?

    • Is there any difference at the requirements level between an algorithmic implementation of bounding-box detection for a pedestrian in an image and a DL implementation?
  6. How do techniques developed for feedforward DL models apply to DRL or RNN models?

    • How do the property specifications change, e.g., temporal?
    • Can analogs of symbolic trajectories, which connect state vs. path abstractions in non-DL systems, be applied?
  7. Given the inherent stochasticity in their definition, are DL models more amenable to N-version approaches for correctness than deterministic systems?

  8. What frameworks could be used to argue that ensembles are safer than individual networks?
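
A minimal, sampling-based sketch of the local robustness property mentioned under point 1: the model should keep its predicted label for every input within an L-infinity ball of radius eps around x. The toy linear "model" is a stand-in for a DNN, and sampling can only falsify the property, never verify it.

    import numpy as np

    def locally_robust(model, x, eps, n_samples=1000, seed=0):
        """Return False if any sampled perturbation inside the eps-ball flips
        the predicted class; True only means no counterexample was found."""
        rng = np.random.default_rng(seed)
        base_label = int(np.argmax(model(x)))
        for _ in range(n_samples):
            delta = rng.uniform(-eps, eps, size=x.shape)
            if int(np.argmax(model(x + delta))) != base_label:
                return False
        return True

    if __name__ == "__main__":
        W = np.array([[1.0, -1.0], [-0.5, 2.0]])  # toy 2-class "logits"
        model = lambda x: W @ x
        print(locally_robust(model, np.array([1.0, 0.2]), eps=0.05))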

Discussion Notes


Breakout Group 3

Details

  • Topic: Development & Deployment Challenges for Deep Learning Systems
  • Time: 10:45am - 11:45am
  • Room: Cortez 3
  • Session Lead: Mike Lowry
  • Session Scribe: Kevin Moran

Participants

Tim Menzies
Satish Chandra
Christian Bird
Danny Tarlow
Vijayaraghavan Murali
Nachi Nagappan
Rishabh Singh

Discussion Points

  1. What approaches to certification of DL models can be imported from traditional safety-critical software?

    • What are the implications in terms of development processes, especially as they relate to continuous updates of DL models with new data?
  2. What approaches to certification of DL models require divergences from certification of traditional safety-critical software?

  3. What are the technical challenges to deploying DL systems that are capable of adapting in situ, in other words where deep learning is performed as part of the system's input/output?

  4. What system architectures would provide both safety and adaptivity for DL systems incorporating in-situ learning? (A minimal safety-wrapper sketch follows this list.)
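
A minimal sketch of one architectural pattern for combining safety with in-situ adaptation: a learning component is wrapped by a runtime monitor that falls back to a simple, conventionally certified controller whenever the learned output leaves a pre-defined safe envelope. All names, bounds, and the update rule are illustrative assumptions, not a vetted design.

    def safe_envelope(command):
        """Stand-in for a verified runtime monitor (here, a bounded-output check)."""
        return -1.0 <= command <= 1.0

    def certified_fallback(observation):
        """Baseline controller assumed to be certified by traditional means."""
        return max(-1.0, min(1.0, -0.5 * observation))

    class AdaptiveController:
        """Stand-in for a DL component that keeps learning during operation."""
        def __init__(self):
            self.gain = -0.4

        def act(self, observation):
            return self.gain * observation

        def update(self, observation, error):
            self.gain -= 0.01 * error * observation  # toy in-situ adaptation step

    def control_step(controller, observation):
        proposed = controller.act(observation)
        # The monitor, not the learned component, has the final say on safety.
        return proposed if safe_envelope(proposed) else certified_fallback(observation)

    if __name__ == "__main__":
        controller = AdaptiveController()
        print(control_step(controller, observation=0.8))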

Discussion Notes


Breakout Group 4

Details

  • Topic: Maintenance of Deep Learning Systems
  • Time: 2:00pm - 3:00pm
  • Room: Normal Heights
  • Session Lead: Sebastian Elbaum
  • Session Scribe: Mike Lowry

Participants

Tim Menzies
Sonia Haiduc
Audris Mockus
Abram Hindle
Aditya Thakur
Collin McMillan
Zhenming Liu
Denys Poshyvanyk

Discussion Points

  1. Code- and data-level technical debt in ML/DL systems.

    • Infrastructure code incurs significant technical debt: “a mature DL-based system may contain 95% glue code connecting different ML libraries and packages”
  2. There is little or no support for evaluating data dependencies in DL systems (compared to the many existing tools for classic software, where static analysis can be used).

  3. DL-based systems frequently reuse pre-trained parameters from other data sets (transfer learning), which adds dependencies to the data and other evolving models/configurations.

  4. ML/DL-specific bad practices, e.g., experimental code paths (analogous to dead flags in traditional software).

  5. Configuration management of ML/DL systems can significantly impact performance; it needs testing as thoroughly as code and data (a minimal sketch of recording such a configuration follows this list).

  6. DL-based systems rely on rapidly improving hardware (e.g., GPUs) and software (e.g., packages) so managing dependencies becomes an issue.

    • This requires careful and “clever” monitoring, and maintenance potentially becomes more expensive.
  7. ML/DL systems are dependent on evolving languages, formats and infrastructures.

    • Keeping ML/DL systems up to date requires monitoring and logging to detect changes in the underlying “plumbing and glue” code.
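
A minimal sketch of recording the configuration a model was trained or deployed with (data fingerprint, hyperparameters, package versions, Python version), so that later runs can detect when the underlying “plumbing and glue” has changed. The file name and package list are illustrative assumptions.

    import hashlib
    import json
    import sys
    from importlib import metadata

    def file_sha256(path):
        """Content hash used as a lightweight data-dependency fingerprint."""
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(8192), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def build_manifest(data_path, hyperparams, packages):
        """Collect everything a later run should compare itself against."""
        versions = {}
        for name in packages:
            try:
                versions[name] = metadata.version(name)
            except metadata.PackageNotFoundError:
                versions[name] = "not installed"
        return {
            "python": sys.version.split()[0],
            "data_sha256": file_sha256(data_path),
            "hyperparams": hyperparams,
            "packages": versions,
        }

    if __name__ == "__main__":
        manifest = build_manifest("train.csv", {"lr": 1e-3, "epochs": 20}, ["numpy"])
        print(json.dumps(manifest, indent=2))  # diff this against the previous run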

Discussion Notes


Breakout Group 5

Details

  • Topic: Testing of Deep Learning Systems
  • Time: 2:00pm - 3:00pm
  • Room: University Heights
  • Session Lead: Xiangyu Zhang
  • Session Scribe: Matthew Dwyer

Participants

Koushik Sen
Bo Li
Nachi Nagappan
Gail Kaiser
Lin Tan
Shiqing Ma

Discussion Points

  1. How to test DL models beyond norm-based adversarial attacks?

    • What kinds of errors may appear in DNN models (inadequate data, incorrect data, architectural defects, problems arising from the interaction between data and architecture)?
    • How to test against more practical attacks (e.g., physical attacks)?
  2. How much confidence can a testing framework guarantee?

    • What kinds of testing metrics would be helpful?
    • What kinds of guarantees can be achieved in black-box vs. white-box vs. grey-box settings?
  3. How to generate meaningful test inputs?

    • Do we need to define a DSL for generating inputs?
    • Can we leverage lessons learned from fuzzing, mutation testing, etc.?
  4. Once a problem is identified, how to guide debugging?

    • What do we mean by bug localization here?
  5. How to guide repair based on testing/debugging results?

    • Guided data augmentation?
    • How to fix architectural issues?
  6. Can we leverage different architectures developed for the same task?

    • Leveraging differential testing (a minimal differential-testing sketch follows this list)
  7. What kinds of system-level (i.e., high-level) properties can we test for DL?

    • Can we leverage metamorphic properties and testing?
  8. How to do regression testing for evolving models?

  9. How can testing of DL be tailored to settings where DL models are applied to SE-specific tasks?
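
A minimal sketch of the differential testing mentioned under point 6: two models trained for the same task are run on generated inputs, and any input on which their predicted labels disagree is kept as a candidate bug-revealing test. The two toy linear "models" stand in for independently trained DNNs.

    import numpy as np

    def differential_test(model_a, model_b, input_dim, n_inputs=1000, seed=0):
        """Return the generated inputs on which the two models' predictions differ."""
        rng = np.random.default_rng(seed)
        disagreements = []
        for _ in range(n_inputs):
            x = rng.normal(size=input_dim)
            if int(np.argmax(model_a(x))) != int(np.argmax(model_b(x))):
                disagreements.append(x)
        return disagreements

    if __name__ == "__main__":
        W_a = np.array([[1.0, 0.1], [0.0, 1.0]])
        W_b = np.array([[1.0, 0.0], [0.1, 1.0]])
        model_a = lambda x: W_a @ x
        model_b = lambda x: W_b @ x
        found = differential_test(model_a, model_b, input_dim=2)
        print(len(found), "disagreement-inducing inputs found")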

Discussion Notes


Breakout Group 6

Details

  • Topic: Deep Learning for Code Generation
  • Time: 2:00pm - 3:00pm
  • Room: Cortez 3
  • Session Lead: Rishabh Singh
  • Session Scribe: Kevin Moran

Participants

Bogdan Vasilescu
Charles Sutton
Satish Chandra
Christian Bird
Danny Tarlow
Ranjit Jhala
Raymond Mooney
Vijayaraghavan Murali
Premkumar Devanbu

Discussion Points

  1. What applications of automated code generation seem most promising, both in the near future and in the longer term?

    • e.g., program superoptimization, code completion, repairing program bugs with small patches, end-user programming, mobile app development
  2. What are the boundaries of software systems in which to consider automated code generation?

  3. Generating full systems code automatically is unlikely (or maybe not?); more plausible targets include generating code in specialized domains, function-level synthesis, and end-user programming.

  4. What are good specification mechanisms to describe programmer’s high-level intent?

    • Full specifications are probably as difficult to write as the program itself; alternatives could be partial programs, unit tests, I/O examples, natural language, or a UI (a minimal example-based synthesis sketch follows this list).
  5. What are suitable architectures for embedding the programmer’s intent and for code generation?

    • Different neural architectures for embedding examples, partial programs, natural language specifications, etc.
  6. Similarly, what are good architectures for generative models of code?

  7. Programmer-CodeGenerator collaboration

  8. What might be good interface boundaries where the synthesizer and programmer can collaborate to write code more efficiently?

  9. What could be some good challenge benchmarks to measure progress on the ability to generate code of different complexity?

  10. What might be different ways to combine neural and symbolic techniques for more efficient code generation?

  11. How to ensure maintainability of automatically generated code?

  12. Generating code from scratch vs. composing pre-defined functions?
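
A minimal sketch of the I/O-example specification mechanism mentioned under point 4: a brute-force enumerator searches a tiny arithmetic DSL for an expression in one variable x that satisfies every given input/output example. The DSL and templates are illustrative stand-ins for the neural or symbolic code generators under discussion.

    import itertools

    TEMPLATES = {
        "x + c": lambda x, c: x + c,
        "x * c": lambda x, c: x * c,
        "x * x + c": lambda x, c: x * x + c,
    }

    def synthesize(examples, constants=range(-5, 6)):
        """Return the first template/constant pair consistent with all I/O examples."""
        for (name, fn), c in itertools.product(TEMPLATES.items(), constants):
            if all(fn(x, c) == y for x, y in examples):
                return name.replace("c", str(c))
        return None

    if __name__ == "__main__":
        # Specification by I/O examples: the hidden intent is f(x) = 2 * x.
        print(synthesize([(1, 2), (3, 6), (10, 20)]))  # -> "x * 2"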

Discussion Notes