research-article Open Access Artifacts Available / v1.1
Authors:
- Siva Kesava Reddy Kakarla, Microsoft Research, Redmond, USA
- Francis Y. Yan, Microsoft Research, Redmond, USA
- Ryan Beckett, Microsoft Research, Redmond, USA
Proceedings of the ACM on Programming Languages, Volume 8, Issue PLDI, Article No.: 155, pp 199–222. https://doi.org/10.1145/3656385
Related Artifact: Source code for article "Diffy: Data-Driven Bug Finding for Configurations", June 2024, software. https://doi.org/10.5281/zenodo.10740687
Abstract
Configuration errors remain a major cause of system failures and service outages. One promising approach to identify configuration errors automatically is to learn common usage patterns (and anti-patterns) using data-driven methods. However, existing data-driven learning approaches analyze only simple configurations (e.g., those with no hierarchical structure), identify only simple types of issues (e.g., type errors), or require extensive domain-specific tuning. In this paper, we present Diffy, the first push-button configuration analyzer that detects likely bugs in structured configurations. From example configurations, Diffy learns a common template, with "holes" that capture their variation. It then applies unsupervised learning to identify anomalous template parameters as likely bugs. We evaluate Diffy on a large cloud provider's wide-area network, an operational 5G network testbed, and MySQL configurations, demonstrating its versatility, performance, and accuracy. During Diffy's development, it caught and prevented a bug in a configuration timer value that had previously caused an outage for the cloud provider.
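The two-step idea in the abstract (learn a template with holes, then look for unusual hole values) can be illustrated with a toy sketch. This is not Diffy's algorithm: it assumes flat, whitespace-separated lines with equal token counts instead of structured configurations, and it swaps the unsupervised learner for a simple frequency threshold. The names `learn_template`, `find_anomalies`, `HOLE`, and `max_share`, as well as the sample router lines, are hypothetical and chosen only for illustration.

```python
from collections import Counter

HOLE = "<?>"  # hypothetical marker for a template position that varies


def learn_template(lines):
    """Token-wise template: literal where all examples agree, else a hole."""
    rows = [line.split() for line in lines]
    if len({len(row) for row in rows}) != 1:
        raise ValueError("toy sketch: all lines must have the same token count")
    return [col[0] if len(set(col)) == 1 else HOLE for col in zip(*rows)]


def hole_values(template, line):
    """Return the tokens of `line` that fall into the template's holes."""
    return [tok for slot, tok in zip(template, line.split()) if slot == HOLE]


def find_anomalies(lines, max_share=0.25):
    """Flag (line index, hole index, value) for values that are rare in a hole.

    Assumption: a plain frequency threshold stands in for the unsupervised
    learning step mentioned in the abstract.
    """
    template = learn_template(lines)
    params = [hole_values(template, line) for line in lines]
    anomalies = []
    for hole_idx, column in enumerate(zip(*params)):
        counts = Counter(column)
        # Holes where every value is unique (e.g. names) carry no signal here.
        if len(counts) == len(column):
            continue
        for line_idx, value in enumerate(column):
            if counts[value] / len(column) <= max_share:
                anomalies.append((line_idx, hole_idx, value))
    return template, anomalies


if __name__ == "__main__":
    # Hypothetical example configurations; the last one has an odd timer value.
    configs = [
        "router r1 keepalive 30 hold 90",
        "router r2 keepalive 30 hold 90",
        "router r3 keepalive 30 hold 90",
        "router r4 keepalive 30 hold 90",
        "router r5 keepalive 30 hold 9",
    ]
    template, anomalies = find_anomalies(configs)
    print(" ".join(template))
    print(anomalies)
```

In this sketch the router name becomes a hole that is ignored because every value is distinct, while the hold timer becomes a hole whose single deviating value is reported as a likely bug.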
Supplemental Material
Available for Download
zip
pldi24main-p42-p-archive.zip (834 KB)
The PDF in the folder is the full paper with appendices.
Index Terms
Diffy: Data-Driven Bug Finding for Configurations
Networks
Network properties
Network reliability
Software and its engineering
Software notations and tools
Software configuration management and version control systems
Published in
Proceedings of the ACM on Programming Languages Volume 8, Issue PLDI
June 2024
2198 pages
EISSN:2475-1421
DOI:10.1145/3554317
- Editor:
- Michael Hicks
Amazon, USA
Copyright © 2024 Owner/Author
This work is licensed under a Creative Commons Attribution International 4.0 License.
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 20 June 2024
Published in pacmpl Volume 8, Issue PLDI
Author Tags
- anomaly detection
- configuration bug finding
- template synthesis
Qualifiers
- research-article