Why Enterprises should Build Infrastructure for Artificial Intelligence – AI first

Artificial Intelligence - AI is bringing new levels of Automation to everything from Cars and Kiosks to Utility Grids, Healthcare, Life Sciences, and Financial Networks. But it’s easy to forget that before the enterprise can automate the world, it has to Automate itself first.

As with most complicated systems, IT Infrastructure Management is ripe for Intelligent Automation. As data loads become larger and more complex and the infrastructure itself extends beyond the Datacenter into the Cloud and the edge, the speed at which new environments are provisioned, optimized, and decommissioned will soon exceed the capabilities of even an army of human operators. That means Artificial Intelligence - AI will be needed on the ground level to handle the demands of Artificial Intelligence - AI initiatives higher up the IT stack.

Artificial Intelligence - AI begins with Infrastructure

In a classic Catch-22, however, most enterprises are running into trouble deploying Artificial Intelligence - AI on their infrastructure, in large part because they lack the tools to leverage the technology in a meaningful way. A recent survey by Run: AI shows that few Artificial Intelligence - AI algorithms and models are getting into production – less than 10% at some organizations – with many data scientists still resorting to manual access to GPUs and other elements of Data Infrastructure to get projects to the finish line.

Another study by Global Surveys showed that just 17% of AI and IT practitioners report seeing high utilization of hardware resources, with 28% reporting that much of their infrastructure remains idle for large periods of time. And this is after their organizations have poured millions of dollars into new hardware, software, and Cloud Resources, in large part to leverage Artificial Intelligence - AI, Machine Learning - ML, and Deep Learning.

If the enterprise is to successfully carry out the transformation from traditional modes of operation to fully digitized ones, Artificial Intelligence - AI will have to play a prominent role. IT consultancy Aarav Solutions points out that Artificial Intelligence - AI is invaluable when it comes to automating infrastructure support, security, resource provisioning, and a host of other activities. Its secret sauce is the capability to analyze massive data sets at high speed and with far greater accuracy than manual processes, giving decision-makers granular insight into the otherwise hidden forces affecting their operations.

A deeper look into all the interrelated functions that go into Infrastructure Management on a daily basis sparks wonder at how the enterprise has gotten this far without Artificial Intelligence - AI. XenonStack COO and CDS Jagreet Kaur Gill recently highlighted the myriad functions that can be kicked into hyper-speed with Artificial Intelligence - AI, everything from Capacity Planning and Resource Utilization to Anomaly Detection and Real-Time Root Cause Analysis. With the ability to track and manage literally millions of events at a time, Artificial Intelligence - AI will provide the foundation that allows the enterprise to maintain the scale, reliability, and dynamism of the digital economy.

Artificial Intelligence and Edge Computing

With this kind of management stack in place, says Sandeep Singh, vice president of storage marketing at HPE, it’s not too early to start talking about Artificial Intelligence - AI and Operations (AIOps) driven frameworks and fully autonomous IT operations, particularly in greenfield deployments between the Edge and the Cloud. The Edge, after all, is where much of the storage and processing of the Internet of Things - IoT, Industrial Internet of Things - IIoT, and Internet of Medical Things - IoMT data will take place. But it is also characterized by a highly dispersed physical footprint, with small, interconnected nodes pushed as close to user devices as possible. By its very nature, then, the Edge must be Autonomous. Using AIOps, organizations will be able to build self-sufficient, Real-Time Analytics and decision-making capabilities, while at the same time ensuring maximum uptime and fail-over should anything happen to disrupt operations at a given endpoint.

Looking forward, it’s clear that Artificial Intelligence - AI empowered infrastructure will be not just a competitive advantage but an operational necessity. With the amount of data generated by an increasingly connected world, plus the quickly changing nature of all the digital processes and services this entails, there is simply no way to manage these environments without AI.

Intelligence will be the driving force in enterprise operations as the decade unfolds, but just like any other technology initiative, it must be implemented from the ground up – and that process starts with infrastructure.

SOURCE: VentureBeat | Author: Arthur Cole

URL: https://venturebeat.com/2021/11/22/why-enterprises-should-build-ai-infrastructure-first/

Microsoft Ignite 2021 | Book of News

For the latest Technology and Business Innovations announced by Microsoft, we recommend that you visit the Microsoft Ignite Book of News, which is "Online" and located at: https://news.microsoft.com/ignite-november-2021-book-of-news/


Welcome everyone to Microsoft Ignite, and once again we have a book’s worth of news about Microsoft 365, Azure, Dynamics 365, Security, Power Platform, AI and much more.

Our goal with the Book of News is to provide you with a guide to all the announcements we are making, with all the detail you need. Our standing goal remains as it has always been – to make it as easy as possible for you to navigate all the latest information and provide key details on the topics you are most interested in.

Microsoft Ignite is a seminal moment for our company. We will welcome more than 100,000 global attendees across a variety of industries to experience our latest and greatest technologies while also getting a sneak peek at new products and services that will be coming in the future.

The backdrop for our news at Ignite is the Microsoft Cloud. The Microsoft Cloud powers an organization’s digital capability, while providing the safeguards necessary to keep data confidential and secure. There is no question that the past year and a half has been a catalyst for structural change in every industry, from the adoption of telehealth in healthcare, to digital wallets in financial services, to curbside pick-up and contact-less shopping in retail.

  • Digital technology will be more necessary than ever, for every organization, in every sector. The implications for IT are profound.
  • Fundamentally, we are moving into an era in which people expect their digital data to be available anywhere, at any time and on any device.
  • We have a great lineup of news and some really exciting moments planned for this year’s Ignite. I hope that you can join us.

As always, send us your feedback! We want to know how we can do better. Are you getting the information and context you need? What can we do to make the experience ever better next time?

Foreword by Frank X. Shaw

What is the Book of News?

The Microsoft Ignite Book of News is your guide to key news items that we are announcing at Microsoft Ignite. The interactive Table of Contents gives you the option to select the items you are interested in, and the translation capabilities make the Book of News more accessible globally. (Just click the Translate button above the Table of Contents to enable translations.)

We also pulled together a folder of imagery related to a few of the news items. Please take a look at the imagery here.

We hope the Book of News provides all the information, executive insight and context you need. If you have any questions or feedback regarding content in the Book of News, please email eventcom@microsoft.com.

Using Machine Learning – ML to Predict High-Impact Research

DELPHI, an artificial intelligence framework, can give an “early-alert” signal for future key technologies by learning from patterns gleaned from previous scientific publications.
MIT Media Lab

An artificial intelligence framework built by MIT researchers can give an “early-alert” signal for future high-impact technologies, by learning from patterns gleaned from previous scientific publications.

In a retrospective test of its capabilities, DELPHI, short for Dynamic Early-warning by Learning to Predict High Impact, was able to identify all pioneering papers on an experts’ list of key foundational biotechnologies, sometimes as early as the first year after their publication.

James W. Weis, a research affiliate of the MIT Media Lab, and Joseph Jacobson, a professor of media arts and sciences and head of the Media Lab’s Molecular Machines research group, also used DELPHI to highlight 50 recent scientific papers that they predict will be high impact by 2023. Topics covered by the papers include DNA nanorobots used for cancer treatment, high-energy density lithium-oxygen batteries, and chemical synthesis using deep neural networks, among others.

The researchers see DELPHI as a tool that can help humans better leverage funding for scientific research, identifying “diamond in the rough” technologies that might otherwise languish and offering a way for governments, philanthropies, and venture capital firms to more efficiently and productively support science.

“In essence, our algorithm functions by learning patterns from the history of science, and then pattern-matching on new publications to find early signals of high impact,” says Weis. “By tracking the early spread of ideas, we can predict how likely they are to go viral or spread to the broader academic community in a meaningful way.”

The paper has been published in Nature Biotechnology.

Searching for the “diamond in the rough”

The machine learning algorithm developed by Weis and Jacobson takes advantage of the vast amount of digital information that is now available with the exponential growth in scientific publication since the 1980s. But instead of using one-dimensional measures, such as the number of citations, to judge a publication’s impact, DELPHI was trained on a full time-series network of journal article metadata to reveal higher-dimensional patterns in their spread across the scientific ecosystem.

The result is a knowledge graph that contains the connections between nodes representing papers, authors, institutions, and other types of data. The strength and type of the complex connections between these nodes determine their properties, which are used in the framework. “These nodes and edges define a time-based graph that DELPHI uses to learn patterns that are predictive of high future impact,” explains Weis.

Together, these network features are used to predict scientific impact, with papers that fall in the top 5 percent of time-scaled node centrality five years after publication considered the “highly impactful” target set that DELPHI aims to identify. These top 5 percent of papers constitute 35 percent of the total impact in the graph. DELPHI can also use cutoffs of the top 1, 10, and 15 percent of time-scaled node centrality, the authors say.
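As a rough illustration of the target-set definition above, the sketch below labels the top 5 percent of papers by a score and reports the share of the total score they hold. The scores are made-up placeholders rather than DELPHI's time-scaled node centrality, the function names are hypothetical, and this is not the authors' implementation.

```python
import math


def top_fraction_labels(scores, fraction=0.05):
    """Map each paper id to True if it falls in the top `fraction` by score."""
    if not scores:
        return {}
    # Size of the target set; always keep at least one paper.
    k = max(1, math.ceil(len(scores) * fraction))
    ranked = sorted(scores, key=scores.get, reverse=True)
    top = set(ranked[:k])
    return {paper: paper in top for paper in scores}


def impact_share(scores, labels):
    """Fraction of the total score held by the labelled papers."""
    total = sum(scores.values())
    held = sum(score for paper, score in scores.items() if labels[paper])
    return held / total if total else 0.0


# Hypothetical scores for 20 papers (illustrative numbers only).
scores = {f"paper_{i}": float(i) for i in range(1, 21)}
labels = top_fraction_labels(scores, fraction=0.05)
print(sum(labels.values()), "paper(s) in the target set")
print("share of total score:", impact_share(scores, labels))
```

Changing `fraction` to 0.01, 0.10, or 0.15 mirrors the alternative cutoffs the authors mention.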

DELPHI suggests that highly impactful papers spread almost virally outside their disciplines and smaller scientific communities. Two papers can have the same number of citations, but highly impactful papers reach a broader and deeper audience. Low-impact papers, on the other hand, “aren’t really being utilized and leveraged by an expanding group of people,” says Weis.

The framework might be useful in “incentivizing teams of people to work together, even if they don’t already know each other — perhaps by directing funding toward them to come together to work on important multidisciplinary problems,” he adds.

Compared to citation number alone, DELPHI identifies over twice the number of highly impactful papers, including 60 percent of “hidden gems,” or papers that would be missed by a citation threshold.

“Advancing fundamental research is about taking lots of shots on goal and then being able to quickly double down on the best of those ideas,” says Jacobson. “This study was about seeing whether we could do that process in a more scaled way, by using the scientific community as a whole, as embedded in the academic graph, as well as being more inclusive in identifying high-impact research directions.”

The researchers were surprised at how early in some cases the “alert signal” of a highly impactful paper shows up using DELPHI. “Within one year of publication we are already identifying hidden gems that will have significant impact later on,” says Weis.

He cautions, however, that DELPHI isn’t exactly predicting the future. “We’re using machine learning to extract and quantify signals that are hidden in the dimensionality and dynamics of the data that already exist.”

Fair, efficient, and effective funding

The hope, the researchers say, is that DELPHI will offer a less-biased way to evaluate a paper’s impact, as other measures such as citations and journal impact factor number can be manipulated, as past studies have shown.

“We hope we can use this to find the most deserving research and researchers, regardless of what institutions they’re affiliated with or how connected they are,” Weis says.

As with all machine learning frameworks, however, designers and users should be alert to bias, he adds. “We need to constantly be aware of potential biases in our data and models. We want DELPHI to help find the best research in a less-biased way — so we need to be careful our models are not learning to predict future impact solely on the basis of sub-optimal metrics like h-Index, author citation count, or institutional affiliation.”

DELPHI could be a powerful tool to help scientific funding become more efficient and effective, and perhaps be used to create new classes of financial products related to science investment.

“The emerging metascience of science funding is pointing toward the need for a portfolio approach to scientific investment,” notes David Lang, executive director of the Experiment Foundation. “Weis and Jacobson have made a significant contribution to that understanding and, more importantly, its implementation with DELPHI.”

It’s something Weis has thought about a lot after his own experiences in launching venture capital funds and laboratory incubation facilities for biotechnology startups.

“I became increasingly cognizant that investors, including myself, were consistently looking for new companies in the same spots and with the same preconceptions,” he says. “There’s a giant wealth of highly-talented people and amazing technology that I started to glimpse, but that is often overlooked. I thought there must be a way to work in this space — and that machine learning could help us find and more effectively realize all this unmined potential.”

Source: Massachusetts Institute of Technology

Source URL: https://news.mit.edu/2021/using-machine-learning-predict-high-impact-research-0517?utm_campaign=Learning%20Posts&utm_content=167488607&utm_medium=social&utm_source=twitter&hss_channel=tw-3018841323

Understanding Design Docs Principles

Run Your Data Projects Effectively with The Right Design Docs

Good design docs are inseparable from a good data scientist and engineer — Vincent Tatan, Google ML Engineer

In most cases, Engineers spent 18 months contemplating and writing documents on how best to serve the customer. — Eugene Yan, Amazon Data Scientist

Last month, I wrote about the undeniable importance of design docs, describing them as the conceptual lighthouses for building and running machine learning systems.

Because design docs are so important, I would like to share my principles for creating design docs that help you execute your data projects well.


Design docs provide conceptual lighthouses to guide your data projects.

  • Design docs guide you conceptually at every step to understand your goals, impact, and execution for the benefit of stakeholders. Design docs ensure your projects land with impact.
  • Design docs save you time by surfacing implementations and alternative solutions before you execute them.
  • Design docs host discussions among teams to brainstorm best solutions and implement data projects.
  • Design docs serve as permanent artifacts to solidify your ideas for future collaborations.


Your audience is the key reason you write design docs. In every design doc, you must understand your audiences, such as:

  • Yourself: To identify learning journeys, brainstorm ideas, and plan future impactful projects.
  • Team members: To identify collaboration points, escalations, and system-specific impacts. In the design doc, you need to align your assumptions with team members’ prior knowledge.
  • Cross departments: To identify cross-departmental collaborations. Your design docs need to communicate prior knowledge and success metrics.
  • Executives: To make decisions. You need to provide solid recommendations that move the needle on high-level metrics (e.g., user adoption, revenue, and goodwill).
  • External: To foster professional reputation and network. You need to deliver solid takeaways and avoid jargon.

Finding the Right Design Docs Types for The Right Space (Context)

I would like to highlight three different contexts that require different types of design docs. I refer to these contexts as solution spaces: Architecture Space, Implementation Space, and Idea Space.

Architecture Space (Stable)

Design Docs Characteristics

  • Objectives: Document the high-level architecture of systems that solve complex problems and lead to direct user impact.
  • Main Audience: Executives, Tech Leads
  • Time: Slow and stable. Ideally, it rarely changes except due to disruptions.

Types of Design Docs

  • Architecture Doc: Document high-level system architectures with clear objectives. For example, project Google Loon aims to solve the scarcity of reliable internet infrastructure.
  • North Star Metrics: Identify critical metrics to measure success and failure for executive communications. For example, in customer-facing apps the metric will be user adoption, while in abuse-fighting apps it will be user protection.

Implementation Space (Launch)

Design Docs Characteristics

  • Objectives: To facilitate the implementation of system designs, for example data storage, ML Ops, and data privacy access. This document ensures data products are launched, scaled, and evaluated properly.
  • Main Audience: Tech Leads, Cross departments (especially up/downstream applications)
  • Time: Moderate changes

Types of Design Docs

  • System Design: Highlight system implementation flowcharts, up/downstream interactions, data storage, appeals, etc.
  • Timeline Launch Documents: Highlight progress and timeline for a system to launch. At Google, we have Standard Operating Procedures (SOP) to ensure each launch is properly maintained and scaled.
  • Privacy Documents: Manage confidential data regarding users or other sensitive agencies.

Idea Space (Experimental)

Design Docs Characteristics

  • Objectives: To experiment with minor tweaks, brainstorm ideas, and gather feedback quickly. Idea spaces allow data professionals to find ideas that deliver big impact quickly.
  • Main Audience: Everyone, including cross departments
  • Time: Highly dynamic. One-pagers are drafted, analysed, and discarded on a daily basis. Your goal is to fail quickly and move on.

Types of Design Docs

  • One-pager: Fast-moving design docs to facilitate early idea reviews. As ideas grow into proven concepts, the one-pager is promoted into two-pagers and system design docs.
  • Learning Journeys: Track learning journeys in terms of past presentations, design documents, and models launched. In big companies, learning journeys are necessary to keep track of changes that happen very quickly across cross-department and regional collaborations.
  • Pre-Execution Evaluations: What are the expected impacts if we launch a product (e.g., models or tweaks)?
  • Post-Execution Evaluations: What were the impacts of a previously launched product (e.g., models or tweaks)?
  • Pre-Mortem: What could go wrong when the product is launched?
  • Post-Mortem: What went wrong with previously launched products?

“If you don’t know what you want to achieve in your presentation, your audience never will.” — Harvey Diamond

Five Principles to Manage Your Design Docs

These are simple guides on how you can manage design docs.

  1. Start from Ideas: Always start your experiment with a one-pager (idea space). Create a thought experiment and brainstorm quickly before investing further time in the idea.
  2. Invest Time in Lower Level Spaces: The higher the space (Idea → Implementation → Architecture), the more time you should invest. Spend at most 1 week on a one-pager (idea space), 1 month on the implementation/analysis space, and 1 quarter or semester on the architecture space. Of course this depends on the scope, but you get the gist.
  3. Prioritize Promising One-Pagers: Promote and scale your one-pagers based on impact. If the idea is intended for cross-department collaboration, spend more time deliberating how your goals align with the higher-level spaces (North Star Metrics, System Design, etc.).
  4. Land Into North Star: For general success metrics, focus on ideas with the lowest time investment (e.g., small tweaks to machine learning models) and high impact on North Star Metrics. This helps you build solid foundations to land in higher spaces.
  5. Point Directly to the Golden Nugget: Your design docs need to point the audience to the golden nugget as directly as possible. The higher the space, the more direct this golden nugget should be.


In general, by knowing these principles, you will create design docs to help you:

  1. Navigate dynamic, ambiguous or not well understood projects through all conceptual spaces.
  2. Generate high impact projects that could be promotable for executive communications.
  3. Optimize time investments for the best impacts which highlight the North Star Metrics.

I hope this article helps you create design docs and run your data projects effectively.

Soli Deo Gloria.

About the Author

Vincent fights internet abuse with ML @ Google. Vincent uses advanced data analytics, machine learning, and software engineering to protect Chrome and Gmail users.

Apart from his stint at Google, Vincent is also a featured writer for Towards Data Science Medium to guide aspiring ML and data practitioners with 1M+ viewers globally.

During his free time, Vincent studies for an ML master’s degree at Georgia Tech and trains for triathlons and cycling trips.

Lastly, please reach out to Vincent via LinkedIn, Medium, or his YouTube channel.


Source: Towards Data Science

Source Twitter: @TDataScience

Source URL: https://towardsdatascience.com/understanding-design-docs-principles-for-achieving-data-scientists-53e6d5ad6f7e

How to Write Better with The Why, What, How Framework

Here’s a story from the early days of Amazon Web Services: Before writing any code, engineers spent 18 months contemplating and writing documents on how best to serve the customer. Amazon believes this is the fastest way to work—thinking deeply about what the customer needs before executing on that rigorously refined vision.

Similarly, as a data scientist, though I solve problems via code, a lot of the work happens before writing any code. Such work takes the form of thinking and/via writing documents. This is especially so in Amazon, which is famous for its writing culture.

This post (and the next) answers the most voted-for question on the topic poll:

How to write design documents for data science/machine learning projects?

I’ll start by sharing three documents I’ve written: one-pagers, design documents, and after action reviews. Then, I’ll reveal the framework I use to structure most of my writing, including this post. In the next post, we’ll discuss design docs.

One-pagers, design docs, after-action reviews

I usually write three types of documents when building/operating a system. The first two help to get alignment and feedback; the last is used to reflect—all three assist with thinking deeply and improving outcomes.

One-pagers: I use these to achieve alignment with business/product stakeholders. Also used as background memos for quarterly/yearly prioritization. In a single page, they should allow readers to quickly understand the problem, expected outcomes, proposed solution, and high-level approach. Extremely useful to reference when you’re deep in the weeds of a project, or encounter scope creep.

Design docs: I use these to get feedback from fellow scientists and engineers. They help identify design issues early in the process. Furthermore, you can iterate on design docs more rapidly than on systems, especially if said systems are already in production. They usually cover methodology and system design, and include experiment results and technical benchmarks (if available).

Design docs are more commonly seen in engineering projects; not so much for data science/machine learning. Nonetheless, I’ve found them invaluable for building better ML systems and products.

After-action reviews: I use these to reflect after shipping a project, or after a major error. If it’s a project review, we cover what went well (and not so well), follow-up actions, and how to do better next time. It’s like a scrum retrospective, except with more time to think and written as a document. The knowledge can then be shared with other teams.

If it’s an error review (e.g., the system goes down), we diagnose the root cause and identify follow-up actions to prevent recurrence. Nowhere do we blame individuals. The intent is to discuss what we can do better and share the (sometimes painful) lessons with the greater organization. Amazon calls these Correction of Errors; here’s what it looks like.

Writing framework: Why, What, How, (Who)

The Why-What-How framework is so simple that it sounds like a reading/writing lesson for first graders. Nonetheless, it guides most, if not all, of my work documents. My writing on this site also follows it (the other format being lists like this and this).

Why: Start by explaining Why the document is important. This is often framed around the problem or opportunity we want to address, and the expected benefits. We might also answer the question of Why now?

Think of this as the hook for your document. After reading the Why, readers should feel compelled to blaze through the rest of your doc (and hopefully commit to your proposal). In resource-strapped environments (e.g., start-ups), this section convinces decision-makers to invest resources into your idea.

Thus, it’s critical that—after reading this section—your audience understands the problem and context. Describe it simply in their terms: customer benefits, business gains, productivity improvements. Contrast the two Whys below; which is better suited for a business audience?

“We need to procure GPU clusters for distributed training of SOTA deep learning models that will improve nDCG@10 by 20%.”

“We need to invest in infrastructure to improve customer recommendations, with an expected conversion and revenue uplift of 5%.”

The first one might be a tad exaggerated, but I’ve seen Whys that start like that. 🤦‍♂️ It’s a great way to lose the audience from the get-go.

What: After the audience is convinced we should solve the problem, share what a good solution looks like. What are the expected outcomes and ways to measure them?

One way to frame What is via measures of success and constraints. Measures of success define what a good (or bad) solution looks like; constraints define what solutions can (and cannot) do. Together, they enable readers to evaluate and decide on proposals, make trade-offs, and provide feedback.

Another way of framing What is via requirements. Business requirements specify the expected customer experience, uplift to business metrics (success measures), and budget (constraints). They might also be framed as product or functional requirements. Technical requirements specify throughput, latency, security, privacy, etc., usually as constraints.

How: Finally, explain How you’ll achieve the Why and What. This includes methodology, high-level design, tech decisions, etc. It’s also useful to add how you’re not implementing it (i.e., out of scope).

The depth of this section depends on the document. For one-pagers, it could be a paragraph or two on deliverables, with details in the appendix. For design docs, you may want to include a system context diagram, tech decisions (e.g., centralized vs. distributed, EC2 vs. EMR vs. SageMaker), offline experiment results (e.g., hit rate, nDCG), and benchmarks (e.g., throughput, latency, instance count).
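For readers unfamiliar with the offline metrics named above, here is a minimal sketch of hit rate and nDCG at a cutoff k, assuming binary relevance and a recommendation list without duplicates. The function names and the toy item ids are illustrative assumptions, not code from any particular team.

```python
import math


def hit_rate_at_k(recommended, relevant, k=10):
    """Return 1.0 if any of the top-k recommended items is relevant, else 0.0."""
    return 1.0 if any(item in relevant for item in recommended[:k]) else 0.0


def ndcg_at_k(recommended, relevant, k=10):
    """Binary-relevance nDCG@k: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 2)  # rank is 0-based, so position 1 uses log2(2)
        for rank, item in enumerate(recommended[:k])
        if item in relevant
    )
    # The ideal ranking places every relevant item at the top of the list.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg else 0.0


# Toy example: the single relevant item is ranked second.
recommended = ["item_b", "item_a", "item_c"]
relevant = {"item_a"}
print(hit_rate_at_k(recommended, relevant, k=10))
print(ndcg_at_k(recommended, relevant, k=10))
```

In a design doc, these numbers would typically be averaged over many users or sessions and reported next to the serving benchmarks.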

Having a solid Why and What provides context and makes this section easier to write. It also makes it easier for readers to evaluate and give feedback on your idea. Conversely, poorly articulated intent and requirements make it difficult to spot a good solution even when it’s in front of us.

(Who): While writing docs, we should keep our audience in mind. Although Who may not show up as a section in the doc, it’ll influence how it turns out (topics, depth, language).

A document for business leaders will (and should!) look very different from a document for engineers. Different audiences will focus on different aspects: customer pain points, business outcomes, ROI vs. technical requirements, design choices, API specifications.

Writing with your Who in mind makes for more productive discussions and feedback. We don’t ask business leaders for feedback on infra choices, and we don’t ask devops engineers for guidance on business strategy.

How to use the framework to structure your docs

Here are some examples of using Why-What-How to structure a one-pager, design doc, after-action review, and my writing on this site.

One-Pager
  • Why: Problem or opportunity; hypothesized benefits
  • What: Success metrics; constraints
  • How: Deliverables; define out-of-scope

Design Doc
  • Why: Why the problem is important; expected ROI
  • What: Business / product requirements; technical requirements & constraints
  • How: Methodology & system design; diagrams, experiment results, tech choices, integration

After-action Review
  • Why: Context of incident; root cause analysis (5 Whys)
  • What: Tangible & intangible impact; estimates (e.g., downtime, $)
  • How: Follow-up actions & owners

Writing on this site
  • Why: Why reading the post is important (e.g., anecdotes)
  • What: The topic being discussed (e.g., documents we write at work)
  • How: The insight being shared (e.g., Why-What-How, examples)

One-pager example

Why: Our data science team (in an e-commerce company) is challenged to help customers discover products easier. Senior leaders hypothesize that better product discovery will improve customer engagement and business outcomes.

What: First-order metrics are engagement (e.g., CTR) and revenue (e.g., conversion, revenue per session). Second-order metrics include app usage (e.g., daily active users) and retention (e.g., monthly active users). Constraints are set via a budget and timeline.
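To make the first-order metrics concrete, the sketch below computes CTR, conversion, and revenue per session from aggregate counts. The definitions used here (clicks over impressions, orders over sessions, revenue over sessions) are common conventions but are assumptions on my part, as are the `SessionStats` fields and the sample numbers.

```python
from dataclasses import dataclass


@dataclass
class SessionStats:
    """Aggregate counts for one reporting period (hypothetical schema)."""
    impressions: int
    clicks: int
    sessions: int
    orders: int
    revenue: float


def safe_div(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def first_order_metrics(stats):
    """Compute engagement and revenue metrics from aggregate counts."""
    return {
        "ctr": safe_div(stats.clicks, stats.impressions),
        "conversion": safe_div(stats.orders, stats.sessions),
        "revenue_per_session": safe_div(stats.revenue, stats.sessions),
    }


# Illustrative numbers only.
stats = SessionStats(impressions=1000, clicks=50, sessions=200, orders=10, revenue=500.0)
print(first_order_metrics(stats))
```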

How: The team considered several online (e.g., search, recommendations) and offline (e.g., targeted emails, push notifications) approaches. Their analysis showed the majority of customer activity occurs on product pages. Thus, an item-to-item (i2i) recommender—on product pages—is hypothesized to yield the greatest ROI.

Appendix: Breakdown of inbound channels and site activity, overview of the various approaches, detailed explanation on recommendation systems.

Design document example

Why: Currently, our product pages lack a way for users to discover similar products. To address this, we are building an i2i recommender to improve product discoverability and customer engagement.

What: Business requirements are similar to those specified in the one-pager, albeit with greater detail. We collaborated with the web and mobile app teams to define technical requirements such as throughput (> 1,000 requests per second), latency (<150ms at p99), and availability (99% uptime). Our constraints include cost (<10% of revenue generated, with an absolute threshold) and integration points.

How: This will be the meatiest section of the design doc. We’ll share the methodology and high-level design, including system-context-diagrams, tech choices, initial offline evaluation metrics (for ML), and address aspects of throughput, latency, cost, security, data privacy, integration, etc.

Appendix: Trade-offs, what was considered but excluded, API specs, UI, etc.

After-action review example

Context: During a peak sales day (11/11), the i2i recommender was not visible on product pages for a period of time. This was discovered by category managers inspecting their products’ discounts.

Why (5 Whys): The spike in traffic led to increased latency (>150ms) when serving recommendations. The increased latency led to the recommender widget timing out—and not being shown—on product pages. While autoscaling was enabled, it hit the instance quotas and could not scale beyond that. Though we conducted load tests at 3x normal traffic, these were insufficient as peak traffic was 30x normal traffic. In addition, it was not discovered earlier because our alarms didn’t account for results not being displayed.

What: Customer experience was unaffected as product pages continued to load within expected latency. Nonetheless, not serving recommendations led to loss of expected revenue. Based on revenue attributed to the recommender during the rest of the day, the estimated loss is $x.

How: We will take these follow-up actions to prevent a repeated incident and detect similar issues earlier. These are their respective owners.

Appendix: Timeline of incident, overall learnings and recommendations.

Personal writing example

Why: Why is writing documents important? Share anecdote. Mention it’s highly voted-for.

What: What documents do I write? Share some examples.

How: Explain the Why-What-How approach and share examples of how I use it.

Writing docs is expensive, but cheap

Writing documents costs money. They take time to write, review, and iterate on, and this is time that could have been spent on implementation.

Nonetheless, writing is a cheap way to ensure we solve the right problems in the right way. Documents save money by helping teams avoid rabbit holes or building systems that aren’t used. They also help align stakeholders, improve initial ideas, and scale knowledge.

If the problem is ambiguous, the proposed solution contentious, the effort required high (> 3-6 months), and/or consensus is required across multiple teams, starting with a document will save effort in the medium to long term.

So before you start your next project, write a document using Why-What-How. Here’s more detail about one-pagers (and other things I do before starting a project).

Source: Eugene Yan

Source URL: https://eugeneyan.com/writing/writing-docs-why-what-how/

Eugene Yan designs, builds, and operates machine learning systems that serve customers at scale. He's currently an Applied Scientist at Amazon. Previously, he led the data science teams at Lazada and uCare.ai. He also writes & speaks about effective data science, data/ML systems, and career growth.

How to Write Design Docs for Machine Learning Systems

Design documents come in all shapes and sizes. But IMHO, they have the same purpose—to help the author think deeply about the problem and solution, and get feedback.

Thinking deeply comes with writing the design doc. To propose a good design, you have to research and understand the problem space. Then, communicating what you’ve learned via a document with different levels of detail forces you to clarify and organize your thoughts. Good writing does not come without good thinking.

“Full sentences are harder to write. They have verbs. The paragraphs have topic sentences. There is no way to write a six-page, narratively structured memo and not have clear thinking.” — Jeff Bezos

Distributing and getting feedback on design docs is also easier. They tend to be detailed, standalone documents that reviewers can read and provide comments on asynchronously. Contrast this with PowerPoint presentations, which require a presenter and the audience in the same room (or now, in the same Zoom).

Is it a must to write a design doc? Of course not. But not writing one incurs the risk of building the wrong thing, or something that was requested but ends up unused. I’ve also observed costly projects halted due to design flaws discovered late in the project, because of an ill-defined problem statement or a tech choice that doesn’t scale. In hindsight, such waste could have been mitigated by investing time into writing and reviewing a design doc.

We’ll go over pointers on what to cover in design docs for machine learning systems—these pointers will guide the thinking process. My design docs tend to be structured via the Why, What, How framework shared last week (please skim it if you’ve not read it yet). Then, I’ll share how I get feedback via a two-step review process.

  • The Why and What of design docs
  • The How of design docs
  • Methodology: How to solve the problem with data and ML
  • Implementation: How to build and operate the system
  • Alternatives considered and rejected
  • Reviewing design docs in two stages

A simple template, available for the low price of free: ml-design-docs

The Why and What of design docs

A design doc should start by addressing the Whys and Whats.

Why should we solve this problem? Why now? Explain the motivation for your proposal and convince readers of its importance. What is the customer or business benefit? If you’re building a replacement system, explain why improvements to the existing system will not work as well. If there are alternatives, explain why your proposed system is better.

What are the success criteria? These are often framed as business goals, such as increased customer engagement, revenue, or reduced cost. They can also be framed as operational goals or new capabilities (e.g., the ability to roll back models, serve features in real time, etc.).

What are the requirements and constraints? Functional requirements are those that must be met to deliver the project. Describe them from the customer’s point of view: how will the customer experience it and/or benefit? For machine learning, each application will have its own requirements, such as:

  • Recommendations: Proportion of items or customers with >5 recommended items
  • Fraud detection: Upper bound on the proportion or count of false positives
  • Automated classification: Threshold on proportion or count of low-confidence predictions that require human review and approval
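As a rough illustration of how the recommendations requirement above could be checked, here is a minimal sketch that computes the proportion of items with more than five recommended items. The `recs` mapping and the `coverage` function are hypothetical names used only for this example.

```python
def coverage(recs: dict[str, list[str]], min_recs: int = 5) -> float:
    """Proportion of items that have more than `min_recs` recommendations."""
    if not recs:
        return 0.0
    covered = sum(1 for items in recs.values() if len(items) > min_recs)
    return covered / len(recs)


# Hypothetical recommender output: item -> recommended items.
recs = {
    "item_a": ["b", "c", "d", "e", "f", "g"],       # 6 recommendations
    "item_b": ["a", "c"],                           # 2 recommendations
    "item_c": ["a", "b", "d", "e", "f", "g", "h"],  # 7 recommendations
    "item_d": [],                                   # none
}

print(coverage(recs))  # 0.5 (two of the four items have more than 5)
```

A design doc could then state the requirement as a threshold on this number, with the threshold itself agreed with stakeholders.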

Non-functional/technical requirements define the quality of your system and determine how the system should be implemented. Usually, customers won’t notice them unless they’re not met (e.g., exceptionally high latency). Most systems will consider a similar set of requirements such as throughput, latency, security, data privacy, costs, etc.

What is in-scope vs out-of-scope? Some problems can be too big to solve all at once. To ship—and get feedback from customers—in a reasonable amount of time, we might need to chop it down to size. Be upfront about what’s out of scope. We might also need to take on tech debt to meet time and budget constraints. This is fine. Nonetheless, be deliberate about it and have a plan to pay off tech debt as soon as possible.

What are our assumptions? Make explicit your assumptions and understanding of the environment. For example, if building a recsys, how many products and users do you have? What is the expected number of requests per second? This guides how you frame the problem. It can be hard to apply reinforcement learning to large discrete action spaces (i.e., a large number of products) whereas simple approximate nearest neighbors scale well.

The How of design docs

Addressing the How in a design doc can look very different for each ML system. That said, here’s a list of things to consider in a design doc, split into two sections (methodology and implementation). These should serve as a checklist/reference and are not meant to be exhaustive. Remember, the aim of the design doc is to help you think and get feedback. Thus, write whatever is necessary to achieve this goal.

Methodology: How to solve problems with data and ML

This section is similar to the methods section in machine learning papers. A couple of key points I usually cover are:

Problem statement. Declare how you’ll frame the problem. In machine learning, the same problem can have vastly different approaches. If it’s a recommender system, are you taking a content-based or collaborative filtering approach? Will it be an item-to-item or user-to-item recommender? Is your system focused on candidate generation or ranking? Being specific helps narrow down your search space and simplifies the rest of the design doc.

Also, be clear about the problem you’re solving. For example, recommendation systems often involve solving a surrogate problem: the Netflix Challenge assumes that accurately predicting user ratings leads to effective movie recommendations. Other labels include the probability of a video being played and the number of minutes watched. The choice of your surrogate learning problem will have an outsized impact on A/B test results.

As another example, consider fraud detection. This can be solved via unsupervised or supervised approaches. An unsupervised approach won’t need labels and can adopt techniques such as outlier detection via isolation forests or identifying fraud networks via graph clustering. A supervised approach will need to consider label acquisition and how to balance precision (favoring it leaves more fraud uncaught) against recall (favoring it raises more false alarms).
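To make the precision/recall trade-off concrete, here is a small sketch in plain Python. The labels, scores, and thresholds are made up for illustration; the point is only that moving the decision threshold trades one metric against the other.

```python
def precision_recall(y_true: list[int], y_pred: list[int]) -> tuple[float, float]:
    """Precision and recall for binary labels (1 = fraud, 0 = legitimate)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def at_threshold(scores: list[float], threshold: float) -> list[int]:
    """Flag a transaction as fraud when its score reaches the threshold."""
    return [1 if s >= threshold else 0 for s in scores]


# Made-up labels and model scores for eight transactions.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1, 0.05]

# Strict threshold: no false alarms, but two of three frauds go uncaught.
strict = precision_recall(labels, at_threshold(scores, 0.8))
# Lenient threshold: all fraud caught, at the cost of one false alarm.
lenient = precision_recall(labels, at_threshold(scores, 0.35))

print(strict)   # precision 1.0, recall about 0.33
print(lenient)  # precision 0.75, recall 1.0
```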

Data. Describe the data and entities your ML model will be trained on. Commonly used data include customer (e.g., demographics), customer events (e.g., clicks, purchases), and items (e.g., metadata, text description, images). If you’re using customer data, pay attention to the aspects of data privacy and security (covered under implementation).

Techniques. Outline the machine learning techniques you’ll try/tried. Include baselines for comparison. This section may also include details on how you’ll clean and prepare the data, as well as your feature engineering approach. While not necessary, it’s a good idea to provide sufficient detail so that readers can implement/reproduce your work.

Validation and experimentation. Explain how you’ll evaluate models offline. (IMHO, you won’t go wrong using a time-based split most of the time.) Note the difference between leave-one-last, temporal, random, and user-based splits. Explain your choice of evaluation metric(s) and why you think they are good proxy metrics for production conditions. If you’ve conducted experiments with validation results, include them.
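Here is a minimal sketch contrasting two of the splits mentioned above, temporal and leave-one-last. The event log and function names are hypothetical, and real pipelines would typically do this on a dataframe or in SQL.

```python
from collections import defaultdict

# Hypothetical interaction log: (user, item, timestamp).
events = [
    ("u1", "a", 1), ("u1", "b", 4), ("u1", "c", 7),
    ("u2", "a", 2), ("u2", "d", 5),
    ("u3", "b", 3), ("u3", "c", 6), ("u3", "e", 8),
]


def temporal_split(events, cutoff):
    """Train on events before `cutoff`; validate on events at or after it."""
    train = [e for e in events if e[2] < cutoff]
    valid = [e for e in events if e[2] >= cutoff]
    return train, valid


def leave_one_last_split(events):
    """Hold out each user's most recent event; train on the rest."""
    by_user = defaultdict(list)
    for event in sorted(events, key=lambda e: e[2]):
        by_user[event[0]].append(event)
    train, valid = [], []
    for user_events in by_user.values():
        train.extend(user_events[:-1])
        valid.append(user_events[-1])
    return train, valid


train, valid = temporal_split(events, cutoff=6)
print(len(train), len(valid))  # 5 3

lol_train, lol_valid = leave_one_last_split(events)
print(lol_valid)  # each user's latest event
```

Note that in the leave-one-last split, u2's held-out event (timestamp 5) is earlier than u3's training event at timestamp 6, so the model trains on data from after a validation event. The temporal split avoids this.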

If you’re conducting an A/B test, specify if treatment and control groups will be split by customers or sessions. Indicate the metrics you’ll monitor and distinguish between success and guardrail metrics. Success metrics measure the extent of the desired outcome (e.g., increased clicks, conversion, etc.). Guardrail metrics protect the overall customer experience and prevent deterioration of the system: they ensure the outcome is at least neutral (to the customer) and cannot get worse no matter how success metrics improve. (As much as possible, the offline and online metrics should be correlated, but I’ve found this more of an art than science.)
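One common way to implement the split is deterministic hashing of the unit ID. The sketch below is an assumption for illustration, not something prescribed above, and `assign_group` and the experiment name are hypothetical.

```python
import hashlib


def assign_group(unit_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically assign a unit (customer ID or session ID) to a group."""
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode()).hexdigest()
    # Map the first 32 bits of the hash to a value in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "treatment" if bucket < treatment_share else "control"


# Splitting by customer: the same customer always lands in the same group.
first = assign_group("customer-42", "i2i-recs-v1")
second = assign_group("customer-42", "i2i-recs-v1")
print(first == second)  # True
```

Passing a customer ID keeps the experience consistent across that customer's sessions, while passing a session ID lets the same customer land in different groups on different visits. Which is appropriate depends on the experiment, so it is worth stating in the doc.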

Human-in-the-loop. Indicate how human intervention can be incorporated into your system. I’ve had category managers implement rules to prevent certain product categories (e.g., adult toys, lingerie, weapons) from appearing on the home page. Conversely, customers might want to exclude themselves from recommendations (e.g., they get recommendations they don’t want seen on their home page). If it’s an automated fraud detection/loan approval system, we might also want dollar value thresholds that trigger mandatory human review and approval.
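The two kinds of rules described above could be sketched as follows. The category list mirrors the examples in the text; the dollar threshold, class, and function names are hypothetical placeholders.

```python
from dataclasses import dataclass

# Categories that category managers have excluded from the home page.
BLOCKED_HOME_PAGE_CATEGORIES = {"adult toys", "lingerie", "weapons"}

# Hypothetical threshold; the real value would come from the business.
REVIEW_THRESHOLD_USD = 10_000


@dataclass
class Product:
    product_id: str
    category: str


def filter_home_page(candidates: list[Product]) -> list[Product]:
    """Drop recommended products whose category is blocked on the home page."""
    return [p for p in candidates if p.category not in BLOCKED_HOME_PAGE_CATEGORIES]


def needs_human_review(amount_usd: float, threshold: float = REVIEW_THRESHOLD_USD) -> bool:
    """Require human review at or above the threshold, whatever the model says."""
    return amount_usd >= threshold


candidates = [Product("p1", "books"), Product("p2", "weapons"), Product("p3", "toys")]
print([p.product_id for p in filter_home_page(candidates)])  # ['p1', 'p3']
print(needs_human_review(25_000))  # True
```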

Implementation: How to build and operate the system

This section lists the non-functional/technical requirements and is more engineering-heavy; it’s not necessary to address all of them. If in doubt, consult engineers for help.

High-level design. It’s a good idea to start with a diagram providing a high-level view. System-context diagrams and data-flow diagrams work well. In ML systems, some key components are data stores, pipelines (e.g., data preparation, feature engineering, training), and serving. Show how components interact with one another. I often use data-flow diagrams to show how raw data is transformed and used to train models, as well as the input and output of my model in serving.

Infra + scalability. Briefly list the infra options and your final choice. Will it run on-premise, in the cloud, or a mix of both (e.g., data processing and training on-premise for data security, model serving in the cloud for scalability)? If you work in big tech with many different compute and hosting options, try to narrow down your search space early. Also, consider how your choice of infra will impact scalability: it’s easier to scale a cloud-based system than to add server racks.

Performance (throughput + latency). Address requirements on throughput (i.e., requests per second) and latency (e.g., x ms @ p99) and list how performance can be improved (e.g., pre-computation, caching). If additional throughput is required (e.g., to handle peak sales days), will you scale vertically (i.e., bigger machines) or horizontally (i.e., more machines of the same size)? Your ability to do this will be tied to your choice of infra.
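For readers less familiar with the "x ms @ p99" notation, here is a small sketch of a nearest-rank percentile over latency samples. The sample values are made up, and the 150ms limit is borrowed from the design doc example earlier. Monitoring tools may compute percentiles slightly differently (e.g., with interpolation).

```python
import math


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(math.ceil(pct * len(ordered) / 100), 1)
    return ordered[rank - 1]


# Made-up load-test samples: 98 fast requests and 2 slow ones.
latencies_ms = [40.0] * 98 + [400.0] * 2

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
print(p50, p99)   # 40.0 400.0
print(p99 < 150)  # False
```

The median looks healthy here, but the p99 shows that the slowest 2% of requests would breach a 150ms requirement. This is why latency requirements are usually stated at a high percentile.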

Security. Specify how you’ll secure your application and authenticate users and incoming requests. If your application endpoint is publicly accessible, you might want to plan for a denial-of-service attack. Organizations with centralized security teams might have an internal certification process that you can undergo to identify and patch risks.

Data privacy. Indicate how you’ll protect and ensure the privacy of customer data. Will your ML model learn on personally identifiable information (PII)? If so, detail how this PII will be stored, processed, and used in your model. Also, address how your system will comply with data retention and deletion policies such as GDPR. (I’ve built systems, in healthcare and human resources, where the PII was considered so sensitive that we declined to receive it, let alone use it.)

Monitoring + alarms. Operating a system without monitoring is like driving at night without headlights: the lack of visibility is unnerving. Detail how you’ll monitor your system performance (e.g., throughput, latency, error rate, etc.). Monitoring can be done server-side (e.g., model endpoint) or client-side (e.g., consumer), with the latter including network latency. Also list the alarms that will trigger human intervention (e.g., on-call).
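As a sketch of what an alarm list might translate to, the snippet below evaluates one monitoring window against a few thresholds. It includes a check for empty responses, since alarms that only watch errors and latency can miss cases where nothing is displayed. All names and threshold values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    requests: int
    errors: int
    empty_responses: int  # requests that returned nothing to display
    p99_latency_ms: float


def triggered_alarms(
    stats: WindowStats,
    max_error_rate: float = 0.01,   # hypothetical thresholds
    max_empty_rate: float = 0.05,
    max_p99_ms: float = 150.0,
) -> list[str]:
    """Return the names of alarms that fire for this monitoring window."""
    if stats.requests == 0:
        return ["no_traffic"]
    alarms = []
    if stats.errors / stats.requests > max_error_rate:
        alarms.append("error_rate")
    if stats.empty_responses / stats.requests > max_empty_rate:
        alarms.append("empty_results")
    if stats.p99_latency_ms > max_p99_ms:
        alarms.append("p99_latency")
    return alarms


window = WindowStats(requests=1000, errors=5, empty_responses=200, p99_latency_ms=180.0)
print(triggered_alarms(window))  # ['empty_results', 'p99_latency']
```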

Cost. This will be a key concern for decision-makers who hold the purse strings. It won’t make sense if the cost of operating your system exceeds the revenue it generates. This should include labour cost—how many engineers and scientists do you need to build the system, and for how long? If your system runs in the cloud, estimate the number of instances required for data processing (e.g., EMR clusters), and model training and serving (e.g., GPU instances, AWS Lambda).
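A back-of-the-envelope estimate is often enough for a design doc. The sketch below shows the arithmetic only: every number in it is a made-up placeholder, the function names are hypothetical, and the 10% cost share is borrowed from the design doc example earlier.

```python
HOURS_PER_MONTH = 730  # approximate average number of hours in a month


def infra_cost(instances: int, hourly_rate_usd: float, hours: float = HOURS_PER_MONTH) -> float:
    """Cost of running `instances` machines for `hours` at a given hourly rate."""
    return instances * hourly_rate_usd * hours


def labour_cost(headcount: int, months: int, monthly_cost_per_person_usd: float) -> float:
    """One-off cost of the people needed to build the system."""
    return headcount * months * monthly_cost_per_person_usd


def within_cost_share(monthly_cost: float, monthly_revenue: float, max_share: float = 0.10) -> bool:
    """Check the operating cost against a cap expressed as a share of revenue."""
    return monthly_cost < max_share * monthly_revenue


# Placeholder numbers, for illustration only.
serving = infra_cost(instances=4, hourly_rate_usd=1.0)             # always-on serving
training = infra_cost(instances=1, hourly_rate_usd=3.0, hours=50)  # periodic retraining
monthly_total = serving + training

print(monthly_total)                             # 3070.0
print(within_cost_share(monthly_total, 50_000))  # True
print(labour_cost(2, 3, 15_000))                 # 90000
```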

Integration points. Define how downstream services will use and interact with your endpoint. Share what the API specification looks like, and the expected input and output data. Keeping the API generic enough ensures extensibility to other consuming services (i.e., higher adoption of your system).
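One lightweight way to share the expected input and output is to write them down as typed structures. The sketch below uses Python dataclasses for a hypothetical recommender endpoint. The field names are assumptions for illustration, not a prescribed schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class RecommendationRequest:
    item_id: str
    num_results: int = 10
    # Optional, generic context so other consuming services can reuse the API.
    context: dict[str, str] = field(default_factory=dict)


@dataclass
class RecommendedItem:
    item_id: str
    score: float


@dataclass
class RecommendationResponse:
    item_id: str
    recommendations: list[RecommendedItem]
    model_version: str


request = RecommendationRequest(item_id="sku-123", context={"page": "product"})
response = RecommendationResponse(
    item_id=request.item_id,
    recommendations=[RecommendedItem("sku-456", 0.91)],
    model_version="v1",
)

# Serialize to JSON, as a downstream service would receive it.
payload = json.dumps(asdict(response))
print(payload)
```

Keeping `context` as a free-form mapping is one design choice that keeps the API generic; a stricter schema trades some of that flexibility for validation.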

Risks and uncertainties. Risks are the known unknowns; uncertainties are the unknown unknowns. Call them out to the best of your ability. This allows reviewers to help spot design flaws and rabbit holes, and provide feedback on how to avoid/address them.

Other stuff. There’s a non-exhaustive list of other concerns that might be relevant to your system. This includes ops strategy (e.g., monitoring, on-call), model rollbacks, quality assurance, extensibility, and model footprint and power consumption (if used in mobile apps). Address them if they are key to your system.

Alternatives considered and rejected

It’s useful to include a section on alternatives you’ve considered but rejected. List their pros and cons as well as the rationale for your decision. Your decision will be based on your assumptions about the environment and the requirements, so it’s good to document it. If the environment changes, this section can help you reconsider past decisions.

This section helps you dive into the ambiguous, hidden choices and the implicit decisions made while designing your system. Being transparent allows others to check your blind spots and correct invalid assumptions. The aim is to suggest improvements to your design early, saving you from making bad or unnecessarily difficult design choices.

A disclaimer on design doc templates

Reviewing design docs in two stages

I find it helpful to conduct reviews in two stages: pre-review and review.

Pre-review involves quickly iterating and seeking feedback from a small group (often as part of the writing process). At this stage, the design doc might be a tad rough around the edges, with open questions and paths to explore. Nonetheless, my reviewers express a preference for being involved early as the raw and fluid state (of the design) allows them to provide feedback that meaningfully shapes the direction of the system. This is the stage where mentors and seniors can help to narrow the search space and simplify the design.

At this stage, the document will likely be low resolution and lacking in details. This is a feature, not a bug, and allows for quickly brainstorming and iterating through alternatives. Mine looks like an outline of the eventual design doc, with most of the details and feedback in bullet form. Much of it doesn’t make it into the final design doc.

I tend to conduct pre-reviews one-on-one in a casual setting, usually with individual team members or mentors. If you’re doing this, be clear that it’s the pre-review phase. I’ve caused unnecessary concern when a pre-reviewer thought he was reading the final design doc (when it was just the first iteration).

The review will be more formal and involve a larger audience of senior technical folk and decision-makers. Be clear what you want from the review. What risks/uncertainties need to be addressed? What decisions need to be made? What help do you need? If you’ve done your pre-review well, it shouldn’t be too common to make major design changes at this stage.

At this stage, the design doc should have the necessary details and be in structured prose. Quantified estimates (e.g., throughput, latency, cost) and offline experiment results (e.g., hit@10, nDCG) will be very helpful. Diagrams are a must. Questions asked by pre-reviewers can be addressed in the appendix via a FAQ section.
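For reviewers less familiar with the offline metrics mentioned above, here is a minimal sketch of hit@k and nDCG@k for a single user, assuming binary relevance. The function names are hypothetical, and in practice these would be averaged over many users.

```python
import math


def hit_at_k(ranked: list[str], relevant: set[str], k: int = 10) -> float:
    """1.0 if any relevant item appears in the top k, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0


def ndcg_at_k(ranked: list[str], relevant: set[str], k: int = 10) -> float:
    """Binary-relevance nDCG@k: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(position + 2)
        for position, item in enumerate(ranked[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(position + 2) for position in range(ideal_hits))
    return dcg / idcg if idcg else 0.0


ranked = ["a", "b", "c", "d"]
print(hit_at_k(ranked, {"b"}))   # 1.0
print(ndcg_at_k(ranked, {"a"}))  # 1.0, the relevant item is ranked first
print(ndcg_at_k(ranked, {"b"}))  # about 0.63, the relevant item is ranked second
```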

Scott wrote a post that includes suggestions on how to conduct meetings (at Amazon) such as “having the right people in the room” and “checking your ego at the door”. I think much of it applies to design doc reviews as well so I’ll refer you to his post.


Writing design docs is overhead. Minor changes (e.g., adding a feature column) or low-effort tasks (e.g., a few days) shouldn’t need a design doc—the cost of writing a full design doc will outweigh the benefits. Alternatively, prototyping can be a feasible approach for smaller systems.

Nonetheless, it can be useful to write a design doc when:

  • The problem and/or solution is ambiguous or not well understood (e.g., blockchain)
  • The impact is high (e.g., customer-facing, downstream impact on other services)
  • The implementation effort is high (e.g., multiple teams for a few months)

Whether you’re writing your first or 20th design doc for a machine learning system, I hope this write-up will be helpful for you. Did I miss anything? Reach out @eugeneyan!

Source: Eugene Yan

Source URL: https://eugeneyan.com/writing/ml-design-docs/#the-why-and-what-of-design-docs
