<!DOCTYPE html>
<html lang="en" xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta charset="utf-8">
    <title>The Nutanix Bible</title>
    <meta name="description" content="The Nutanix Bible - A detailed narrative of the Nutanix architecture, how the software and features work and how to leverage it for maximum performance."/>
    <meta name="viewport" content="width=device-width, initial-scale=1.0"/>
    <meta name="keywords" content="nutanix, nutanix bible,nutanix architecture, prism, acropolis,nutanix openstack, webscale"/>
    <meta name="robots" content="index, follow"/>

    <!-- Open Graph data -->
    <meta property="og:title" content="The Nutanix Bible - NutanixBible.com"/>
    <meta property="og:locale" content="en_US"/>
    <meta property="og:type" content="website"/>
    <meta property="og:description" content="The Nutanix Bible - A detailed narrative of the Nutanix architecture, how the software and features work and how to leverage it for maximum performance."/>
    <meta property="og:url" content="http://NutanixBible.com"/>
    <meta property="og:site_name" content="NutanixBible.com"/>
    <meta property="og:image" content="http://nutanixbible.com/assets/Bible.png"/>

    <!-- Twitter Card data -->
    <meta name="twitter:card" content="summary"/>
    <meta name="twitter:url" content="http://NutanixBible.com"/>
    <meta name="twitter:description" content="The Nutanix Bible - A detailed narrative of the Nutanix architecture, how the software and features work and how to leverage it for maximum performance."/>
    <meta name="twitter:title" content="The Nutanix Bible - NutanixBible.com"/>
    <meta name="twitter:site" content="@StevenPoitras"/>
    <meta name="twitter:domain" content="NutanixBible.com"/>
    <meta name="twitter:image:src" content="http://nutanixbible.com/assets/Bible.png"/>
    <meta name="twitter:creator" content="@StevenPoitras"/>

    <!-- Google+ data -->
    <meta itemprop="name" content="The Nutanix Bible - NutanixBible.com">
    <meta itemprop="description" content="The Nutanix Bible - A detailed narrative of the Nutanix architecture, how the software and features work and how to leverage it for maximum performance.">
    <meta itemprop="image" content="http://nutanixbible.com/assets/Bible.png">

    <script>
      (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
      (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
      m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
      })(window,document,'script','//www.google-analytics.com/analytics.js','ga');

      ga('create', 'UA-66778923-1', 'auto');
      ga('send', 'pageview');
    </script>

    <link rel="stylesheet" type="text/css" href="css/nutanixbible.css">
  </head>
  <body data-type="book">
    <!-- Google Tag Manager -->
    <noscript><iframe src="//www.googletagmanager.com/ns.html?id=GTM-TR9PVL" height="0" width="0" style="display:none;visibility:hidden"></iframe></noscript>
    <script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':
    new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],
    j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src=
    '//www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);
    })(window,document,'script','dataLayer','GTM-TR9PVL');</script>
    <!-- End Google Tag Manager -->
<div class="container">
<section data-type="titlepage" class="page-title" id="the-nutanix-bible-L02Ia">
    <img src="assets/Bible.svg" alt="" class="biblesvg">
    <h1>The Nutanix Bible</h1>

    <p class="author">by Steven Poitras</p>

</section>

<section data-type="copyright-page" class="page-title" id="id-7ANIl">

<img src="assets/ornament1.svg" alt="" class="ornament">
<p class="small"><b>Copyright (c) 2016:</b> The Nutanix Bible and NutanixBible.com, 2016. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts and links may be used, provided that full and clear credit is given to Steven Poitras and NutanixBible.com with appropriate and specific direction to the original content.</p>

<p>
  Have feedback? Find a typo?  Send feedback to <a href="mailto:biblefeedback@nutanix.com?Subject=Nutanix%20Bible%20Feedback!">biblefeedback@nutanix.com</a>!
</p>

<br>

<p>
  Localized versions available:
</p>

<div class="localization">
  <a href="http://nutanixbible.jp/" target="_blank">
    <img src="assets/flag-japanese.svg" alt="Japanese" class="japanese">
  </a>
  <a href="http://www.virtual-space.co.kr/nutanix-works.html" target="_blank">
    <img src="assets/flag-korean.svg" alt="Korean" class="korean">
  </a>
  <a href="http://nutanix.ru/" target="_blank">
    <img src="assets/flag-russian.svg" alt="Russian" class="russian">
  </a>
  <a href="http://go.nutanix.com/rs/031-GVQ-112/images/Nutanix%20Bible[CN].pdf" target="_blank">
    <img src="assets/flag-chinese.svg" alt="Chinese" class="Chinese">
  </a>
</div>

</section>

<!-- START: Ken Chen 11-17-2015-->
<div id="nav-icon"><div></div></div>
<div class="nav-title">
  Table of Contents
  <div id="nav-close-button"></div>
</div>
<!-- END: Ken Chen 11-17-2015-->

<nav data-type="toc">
</nav>

<section data-type="preface" class="preface" id="foreword-7kBIw">
<h1 id="anchor-foreword-1">Foreword</h1>


<figure class="small" id="id-wntQsz">
<img alt="Dheeraj Pandey" class="iimagesv2dheeraj_pandeyjpg" src="imagesv2/Dheeraj_Pandey.jpg" style="width:60%; max-width:218px; horizontal-align:middle">
<figcaption><span class="label">Figure 1-1. </span>
<p class="sign">Dheeraj Pandey, CEO, Nutanix</p>
</figcaption>
</figure>

<blockquote>
<p>I am honored to write a foreword for this book that we've come to call "The Nutanix Bible." First and foremost, let me address the name of the book, which to some would seem not fully inclusive vis-à-vis their own faiths, or to others who are agnostic or atheist. There is a Merriam Webster meaning of the word "bible" that is not literally about scriptures: "a publication that is preeminent especially in authoritativeness or wide readership". And that is how you should interpret its roots. It started being written by one of the most humble yet knowledgeable employees at Nutanix, Steven Poitras, our first Solution Architect who continues to be authoritative on the subject without wielding his "early employee" primogeniture. Knowledge to him was not power -- the act of sharing that knowledge is what makes him eminently powerful in this company. Steve epitomizes culture in this company -- by helping everyone else out with his authority on the subject, by helping them automate their chores in PowerShell or Python, by building insightful reference architectures (that are beautifully balanced in both content and form), by being a real-time buddy to anyone needing help on Yammer or Twitter, by being transparent with engineers on the need to self-reflect and self-improve, and by being ambitious.</p>

<p>When he came forward to write a blog, his big dream was to lead with transparency, and to build advocates in the field who would be empowered to make design trade-offs based on this transparency. It is rare for companies to open up on design and architecture as much as Steve has with his blog. Most open source companies -- who at the surface might seem transparent because their code is open source -- never talk in-depth about design, and "how it works" under the hood. When our competitors know about our product or design weaknesses, it makes us stronger -- because there is very little to hide, and everything to gain when something gets critiqued under a crosshair. A public admonition of a feature trade-off or a design decision drives the entire company on Yammer in quick time, and before long, we've a conclusion on whether it is a genuine weakness or a true strength that someone is fear-mongering on. Nutanix Bible, in essence, protects us from drinking our own kool aid. That is the power of an honest discourse with our customers and partners.</p>

<p>This ever-improving artifact, beyond being authoritative, is also enjoying wide readership across the world. Architects, managers, and CIOs alike, have stopped me in conference hallways to talk about how refreshingly lucid the writing style is, with some painfully detailed illustrations, visio diagrams, and pictorials. Steve has taken time to tell the web-scale story, without taking shortcuts. Democratizing our distributed architecture was not going to be easy in a world where most IT practitioners have been buried in dealing with the "urgent". The Bible bridges the gap between IT and DevOps, because it attempts to explain computer science and software engineering trade-offs in very simple terms. We hope that in the coming 3-5 years, IT will speak a language that helps them get closer to the DevOps' web-scale jargon.</p>

<p>With this first edition, we are converting Steve's blog into a book. The day we stop adding to this book is the beginning of the end of this company. I expect each and every one of you to keep reminding us of what brought us this far: truth, the whole truth, and nothing but the truth, will set you free (from complacency and hubris).</p>

<p>Keep us honest.</p>
</blockquote>

<p>&nbsp;</p>

<p class="sign">--Dheeraj Pandey, CEO, Nutanix</p>

<p>&nbsp;</p>

<figure id="id-aztlFk"><img alt="Stuart Miniman" class="iimagesv2stujpg" src="imagesv2/Stu.jpg" style="width:80%; max-width:218px; horizontal-align:middle">
<figcaption><span class="label">Figure 1-2. </span>
<p class="sign">Stuart Miniman, Principal Research Contributor, Wikibon</p>
</figcaption>
</figure>

<blockquote>
<p>Users today are constantly barraged by new technologies. There is no limit of new opportunities for IT to change to a "new and better way", but the adoption of new technology and more importantly, the change of operations and processes is difficult. Even the huge growth of open source technologies has been hampered by lack of adequate documentation. Wikibon was founded on the principle that the community can help with this problem and in that spirit, The Nutanix Bible, which started as a blog post by Steve Poitras, has become a valuable reference point for IT practitioners that want to learn about hyperconvergence and web-scale principles or to dig deep into Nutanix and hypervisor architectures. The concepts that Steve has written about are advanced software engineering problems that some of the smartest engineers in the industry have designed a solution for. The book explains these technologies in a way that is understandable to IT generalists without compromising the technical veracity.</p>

<p>The concepts of distributed systems and software-led infrastructure are critical for IT practitioners to understand. I encourage both Nutanix customers and everyone who wants to understand these trends to read the book. The technologies discussed here power some of the largest datacenters in the world.</p>
</blockquote>

<p>&nbsp;</p>

<p class="sign">--Stuart Miniman, Principal Research Contributor, Wikibon</p>

<h2>Introduction</h2>

<figure id="id-ZptOIk"><img alt="Steven Poitras" src="imagesv2/poitras_pic.jpg" style="width:80%; max-width:500px; horizontal-align:middle">

<figcaption><span class="label">Figure 1-3. </span>
<p class="sign">Steven Poitras, Principal Solutions Architect, Nutanix</p>
</figcaption>
</figure>

<blockquote>
<p>Welcome to The Nutanix Bible!&nbsp; I work with the Nutanix platform on a daily basis – trying to find issues, push its limits as well as administer it for my production benchmarking lab.&nbsp; This item is being produced to serve as a living document outlining tips and tricks used every day by myself and a variety of engineers here at Nutanix.</p>

<p>NOTE: What you see here is an under the covers look at how things work.&nbsp; With that said, all topics discussed are abstracted by Nutanix and knowledge isn't required to successfully operate a Nutanix environment!</p>

<p>Enjoy!</p>
</blockquote>

<p>&nbsp;</p>

<p class="sign">--Steven Poitras, Principal Solutions Architect, Nutanix</p>

</section>

<div data-type="part" id="a-brief-lesson-in-history-6qVi1">
<h1><span class="label">Part I. </span>A Brief Lesson in History</h1>

<p>A brief look at the history of infrastructure and what has led us to where we are today.</p>

<section data-type="chapter" id="the-evolution-of-the-datacenter-R5INu4">
<h2>The Evolution of the Datacenter</h2>

<p>The datacenter has evolved significantly over the last several decades. The following sections will examine each era in detail.&nbsp;&nbsp;</p>

<section data-type="sect1" id="the-era-of-the-mainframe-NYI5u8uq">
<h3>The Era of the Mainframe</h3>

<p>The mainframe ruled for many years and laid the core foundation of where we are today. It allowed companies to leverage the following key characteristics:</p>

<ul>
	<li>Natively converged CPU, main memory, and storage</li>
	<li>Engineered internal redundancy</li>
</ul>

<p>But the mainframe also introduced the following issues:</p>

<ul>
	<li>The high costs of procuring infrastructure</li>
	<li>Inherent complexity</li>
	<li>A lack of flexibility and highly siloed environments</li>
</ul>
</section>

<section data-type="sect1" id="the-move-to-stand-alone-servers-22IlTzuZ">
<h3>The Move to Stand-Alone Servers</h3>

<p>With mainframes, it was very difficult for organizations within a business to leverage these capabilities, which partly led to the entrance of pizza boxes or stand-alone servers. Key characteristics of stand-alone servers included:</p>

<ul>
	<li>CPU, main memory, and DAS storage</li>
	<li>Higher flexibility than the mainframe</li>
	<li>Accessed over the network</li>
</ul>

<p>These stand-alone servers introduced more issues:</p>

<ul>
	<li>Increased number of silos</li>
	<li>Low or unequal resource utilization</li>
	<li>The server became a single point of failure (SPOF) for both compute AND storage</li>
</ul>
</section>

<section data-type="sect1" id="centralized-storage-3jIvSMu5">
<h3>Centralized Storage</h3>

<p>Businesses always need to make money and data is a key piece of that puzzle. With direct-attached storage (DAS), organizations either needed more space than was locally available, or data high availability (HA) where a server failure wouldn’t cause data unavailability.</p>

<p>Centralized storage replaced both the mainframe and the stand-alone server with sharable, larger pools of storage that also provided data protection. Key characteristics of centralized storage included:</p>

<ul>
	<li>Pooled storage resources led to better storage utilization</li>
	<li>Centralized data protection via RAID eliminated the chance that server loss caused data loss</li>
	<li>Storage operations were performed over the network</li>
</ul>

<p>Issues with centralized storage included:</p>

<ul>
	<li>They were potentially more expensive; however, data is more valuable than the hardware</li>
	<li>Increased complexity (SAN Fabric, WWPNs, RAID groups, volumes, spindle counts, etc.)</li>
	<li>They required another management tool / team</li>
</ul>
</section>

<section data-type="sect1" id="the-introduction-of-virtualization-PgIzHQum">
<h3>The Introduction of Virtualization</h3>

<p>At this point in time, compute utilization was low and resource efficiency was impacting the bottom line. Virtualization was then introduced and enabled multiple workloads and operating systems (OSs) to run as virtual machines (VMs) on a single piece of hardware. Virtualization enabled businesses to increase utilization of their pizza boxes, but also increased the number of silos and the impacts of an outage. Key characteristics of virtualization included:</p>

<ul>
	<li>Abstracting the OS from hardware (VM)</li>
	<li>Very efficient compute utilization led to workload consolidation</li>
</ul>

<p>Issues with virtualization included:</p>

<ul>
	<li>An increase in the number of silos and management complexity</li>
	<li>A lack of VM high-availability, so if a compute node failed the impact was much larger</li>
	<li>A lack of pooled resources</li>
	<li>The need for another management tool / team</li>
</ul>
</section>

<section data-type="sect1" id="virtualization-matures-lkInFEuj">
<h3>Virtualization Matures</h3>

<p>The hypervisor became a very efficient and feature-filled solution. With the advent of tools including VMware vMotion, HA, and DRS, users obtained the ability to provide VM high availability and migrate compute workloads dynamically. The only caveat was the reliance on centralized storage, causing the two paths to merge. The downside was the increased load on the storage array, as VM sprawl led to contention for storage I/O. Key characteristics included:</p>

<ul>
	<li>Clustering led to pooled compute resources</li>
	<li>The ability to dynamically migrate workloads between compute nodes (DRS / vMotion)</li>
	<li>The introduction of VM high availability (HA) in the case of a compute node failure</li>
	<li>A requirement for centralized storage</li>
</ul>

<p>Issues included:</p>

<ul>
	<li>Higher demand on storage due to VM sprawl</li>
	<li>Requirements to scale out more arrays creating more silos and more complexity</li>
	<li>Higher $ / GB due to requirement of an array</li>
	<li>The possibility of resource contention on array</li>
	<li>It made storage configuration much more complex due to the necessity to ensure:
	<ul>
		<li>VM to datastore / LUN ratios</li>
		<li>Spindle count to facilitate I/O requirements</li>
	</ul>
	</li>
</ul>
</section>

<section data-type="sect1" id="solid-state-disks-ssds-M2I9tquP">
<h3>Solid State Disks (SSDs)</h3>

<p>SSDs helped alleviate this I/O bottleneck by providing much higher I/O performance without the need for tons of disk enclosures.&nbsp; However, given the extreme advances in performance, the controllers and network had not yet evolved to handle the vast I/O available. Key characteristics of SSDs included:</p>

<ul>
	<li>Much higher I/O characteristics than traditional HDD</li>
	<li>Essentially eliminated seek times</li>
</ul>

<p>SSD issues included:</p>

<ul>
	<li>The bottleneck shifted from storage I/O on disk to the controller / network</li>
	<li>Silos still remained</li>
	<li>Array configuration complexity still remained</li>
</ul>
</section>

<section data-type="sect1">
<h3>In Comes Cloud</h3>
<p>
  The term cloud can be very ambiguous by definition.  Simply put, it's the ability to consume and leverage a service hosted somewhere and provided by someone else.
</p>

<p>
  With the introduction of cloud, the perspectives of IT, the business, and end-users have shifted.
</p>

<p>
  Core pillars of any cloud service:
</p>
<ul>
  <li>
    Self-service  / On-demand
    <ul>
      <li>
        Rapid time to value (TTV) / little barrier to entry
      </li>
    </ul>
  </li>
  <li>
    Service and SLA focus
    <ul>
      <li>
        Contractual guarantees around uptime / availability / performance
      </li>
    </ul>
  </li>
  <li>
    Fractional consumption model
    <ul>
      <li>
        Pay for what you use (some services are free)
      </li>
    </ul>
  </li>
</ul>

<h5>Cloud Classifications</h5>
<p>
  Most general classifications of cloud fall into three main buckets (starting at the highest level and moving downward):
</p>

<ul>
  <li>
    Software as a Service (SaaS)
    <ul>
      <li>
        Any software / service consumed via a simple URL
      </li>
      <li>
        Examples: Workday, Salesforce.com, Google search, etc.
      </li>
    </ul>
  </li>
  <li>
    Platform as a Service (PaaS)
    <ul>
      <li>
        Development and deployment platform
      </li>
      <li>
        Examples: Amazon Elastic Beanstalk / Relational Database Service (RDS), Google App Engine, etc.
      </li>
    </ul>
  </li>
  <li>
    Infrastructure as a Service (IaaS)
    <ul>
      <li>
        VMs/Containers/NFV as a service
      </li>
      <li>
        Examples: Amazon EC2/ECS, Microsoft Azure, Google Compute Engine (GCE), etc.
      </li>
    </ul>
  </li>
</ul>

<h5>Shift in IT focus</h5>
<p>
  Cloud poses an interesting dilemma for IT. They can embrace it, or they can try to provide an alternative.  They want to keep the data internal, but need to allow for the self-service, rapid nature of cloud.
</p>

<p>
  This shift forces IT to act more as a legitimate service provider to their end-users (company employees).
</p>

</section>
</section>

<section data-type="chapter" id="the-importance-of-latency-13I4Ta">
<h2>The Importance of Latency</h2>

<p>The table below characterizes the various latencies for specific types of I/O:</p>

<table>
  <tr>
    <th>Item</th>
    <th>Latency</th>
    <th>Comments</th>
  </tr>
  <tr>
    <td>L1 cache reference</td>
    <td>0.5 ns</td>
    <td></td>
  </tr>
  <tr>
    <td>Branch Mispredict</td>
    <td>5 ns</td>
    <td></td>
  </tr>
  <tr>
    <td>L2 cache reference</td>
    <td>7 ns</td>
    <td>14x L1 cache</td>
  </tr>
  <tr>
    <td>Mutex lock/unlock</td>
    <td>25 ns</td>
    <td></td>
  </tr>
  <tr>
    <td>Main memory reference</td>
    <td>100 ns</td>
    <td>~14x L2 cache, 200x L1 cache</td>
  </tr>
  <tr>
    <td>Compress 1KB with Zippy</td>
    <td>3,000 ns</td>
    <td></td>
  </tr>
  <tr>
    <td>Send 1KB over 1Gbps network</td>
    <td>10,000 ns</td>
    <td>0.01 ms</td>
  </tr>
  <tr>
    <td>Read 4K randomly from SSD</td>
    <td>150,000 ns</td>
    <td>0.15 ms</td>
  </tr>
  <tr>
    <td>Read 1MB sequentially from memory</td>
    <td>250,000 ns</td>
    <td>0.25 ms</td>
  </tr>
  <tr>
    <td>Round trip within datacenter</td>
    <td>500,000 ns</td>
    <td>0.5 ms</td>
  </tr>
  <tr>
    <td>Read 1MB sequentially from SSD</td>
    <td>1,000,000 ns</td>
    <td>1 ms, 4x memory</td>
  </tr>
  <tr>
    <td>Disk seek</td>
    <td>10,000,000 ns</td>
    <td>10 ms, 20x datacenter round trip</td>
  </tr>
  <tr>
    <td>Read 1MB sequentially from disk</td>
    <td>20,000,000 ns</td>
    <td>20 ms, 80x memory, 20x SSD</td>
  </tr>
  <tr>
    <td>Send packet CA -&gt; Netherlands -&gt; CA</td>
    <td>150,000,000 ns</td>
    <td>150 ms</td>
  </tr>
</table>

<p><em>(credit: Jeff Dean, https://gist.github.com/jboner/2841832)</em></p>

<p>The table above shows that the CPU can access its caches at anywhere from ~0.5-7ns (L1 vs. L2). For main memory, these accesses occur at ~100ns, whereas a local 4K SSD read is ~150,000ns or 0.15ms.</p>
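<p>To make the relative costs concrete, the ratios in the Comments column can be recomputed from the raw numbers. The following is a minimal sketch: the values are copied from the table above, while the names <code>LATENCY_NS</code>, <code>ratio</code>, and <code>ns_to_ms</code> are illustrative and not part of any Nutanix tooling.</p>

```python
# Latencies in nanoseconds, copied from a subset of the table above.
LATENCY_NS = {
    "L1 cache reference": 0.5,
    "L2 cache reference": 7,
    "Main memory reference": 100,
    "Read 4K randomly from SSD": 150_000,
    "Round trip within datacenter": 500_000,
    "Disk seek": 10_000_000,
}


def ratio(slower: str, faster: str) -> float:
    """Return how many times longer `slower` takes than `faster`."""
    return LATENCY_NS[slower] / LATENCY_NS[faster]


def ns_to_ms(ns: float) -> float:
    """Convert nanoseconds to milliseconds."""
    return ns / 1_000_000


if __name__ == "__main__":
    baseline = "L1 cache reference"
    for name, ns in LATENCY_NS.items():
        # Express every entry as a multiple of an L1 cache reference.
        print(f"{name:30s} {ns_to_ms(ns):>10.6f} ms  {ratio(name, baseline):>14,.0f}x L1")
```

<p>Expressing everything as a multiple of one baseline (here, an L1 cache reference) makes it easier to see how quickly costs grow once an I/O leaves the CPU and memory.</p>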

<p>If we take a typical enterprise-class SSD (in this case the Intel S3700 - <a href="http://download.intel.com/newsroom/kits/ssd/pdfs/Intel_SSD_DC_S3700_Product_Specification.pdf">SPEC</a>), this device is capable of the following:</p>

<ul>
	<li>Random I/O performance:
	<ul>
		<li>Random 4K Reads: Up to 75,000 IOPS</li>
		<li>Random 4K Writes: Up to 36,000 IOPS</li>
	</ul>
	</li>
	<li>Sequential bandwidth:
	<ul>
		<li>Sustained Sequential Read: Up to 500MB/s</li>
		<li>Sustained Sequential Write: Up to 460MB/s</li>
	</ul>
	</li>
	<li>Latency:
	<ul>
		<li>Read: 50us</li>
		<li>Write: 65us</li>
	</ul>
	</li>
</ul>

<section data-type="sect1" id="looking-at-the-bandwidth-QMIXtzTn">
<h3>Looking at the Bandwidth</h3>

<p>For traditional storage, there are a few main types of media for I/O:</p>

<ul>
	<li>Fibre Channel (FC)
	<ul>
		<li>4-, 8-, and 16-Gb</li>
	</ul>
	</li>
	<li>Ethernet (including FCoE)
	<ul>
		<li>1-, 10-Gb, (40-Gb IB), etc.</li>
	</ul>
	</li>
</ul>

<p>For the calculation below, we are using the 500MB/s Read and 460MB/s Write BW available from the Intel S3700.</p>

<p>The calculation is done as follows:</p>

<p>numSSD = ROUNDUP((numConnections * connBW (in GB/s))/ ssdBW (R or W))</p>

<p><i>NOTE:&nbsp;</i><em>Numbers were rounded up as a partial SSD isn’t possible. This also does not account for the necessary CPU required to handle all of the I/O and assumes unlimited controller CPU power.</em></p>

<table>
	<tbody>
		<tr>
			<th colspan="2" rowspan="1">Network BW</th>
			<th colspan="2" rowspan="1">SSDs required to saturate network BW</th>
		</tr>
		<tr>
			<th>Controller Connectivity</th>
			<th>Available Network BW</th>
			<th>Read I/O</th>
			<th>Write I/O</th>
		</tr>
		<tr>
			<td>Dual 4Gb FC</td>
			<td>8Gb == 1GB</td>
			<td>2</td>
			<td>3</td>
		</tr>
		<tr>
			<td>Dual 8Gb FC</td>
			<td>16Gb == 2GB</td>
			<td>4</td>
			<td>5</td>
		</tr>
		<tr>
			<td>Dual 16Gb FC</td>
			<td>32Gb == 4GB</td>
			<td>8</td>
			<td>9</td>
		</tr>
		<tr>
			<td>Dual 1Gb ETH</td>
			<td>2Gb == 0.25GB</td>
			<td>1</td>
			<td>1</td>
		</tr>
		<tr>
			<td>Dual 10Gb ETH</td>
			<td>20Gb == 2.5GB</td>
			<td>5</td>
			<td>6</td>
		</tr>
	</tbody>
</table>

<p>As the table shows, if you wanted to leverage the theoretical maximum performance an SSD could offer, the network can become a bottleneck with anywhere from 1 to 9 SSDs depending on the type of networking leveraged.</p>
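<p>The calculation above can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes 8 bits per byte and 1 GB = 1000 MB (matching the conversions in the table), and the function names are hypothetical.</p>

```python
import math

# Sustained sequential bandwidth of the Intel S3700 in MB/s (from the spec above).
SSD_READ_MBPS = 500
SSD_WRITE_MBPS = 460


def network_bw_mbps(num_connections: int, conn_gbit: float) -> float:
    """Aggregate link bandwidth in MB/s (assumes 8 bits per byte, 1 GB = 1000 MB)."""
    return num_connections * conn_gbit / 8 * 1000


def ssds_to_saturate(num_connections: int, conn_gbit: float, ssd_mbps: float) -> int:
    """numSSD = ROUNDUP((numConnections * connBW) / ssdBW)."""
    return math.ceil(network_bw_mbps(num_connections, conn_gbit) / ssd_mbps)


if __name__ == "__main__":
    links = {
        "Dual 4Gb FC": (2, 4),
        "Dual 8Gb FC": (2, 8),
        "Dual 16Gb FC": (2, 16),
        "Dual 1Gb ETH": (2, 1),
        "Dual 10Gb ETH": (2, 10),
    }
    for name, (count, gbit) in links.items():
        reads = ssds_to_saturate(count, gbit, SSD_READ_MBPS)
        writes = ssds_to_saturate(count, gbit, SSD_WRITE_MBPS)
        print(f"{name:14s} read: {reads}  write: {writes}")
```

<p>As in the table, the sketch ignores protocol overhead and the controller CPU needed to drive the I/O, so real-world numbers would differ.</p>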
</section>

<section data-type="sect1" id="the-impact-to-memory-latency-J5IMf8Ty">
<h3>The Impact to Memory Latency</h3>

<p>Given that typical main memory latency is ~100ns (will vary), we can perform the following calculations:</p>

<ul>
	<li>Local memory read latency = 100ns + [OS / hypervisor overhead]</li>
	<li>Network memory read latency = 100ns + NW RTT latency + [2 x OS / hypervisor overhead]</li>
</ul>

<p>If we assume a typical network RTT of ~0.5ms (will vary by switch vendor), which is ~500,000ns, that would come down to:</p>

<ul>
	<li>Network memory read latency = 100ns + 500,000ns + [2 x OS / hypervisor overhead]</li>
</ul>

<p>If we theoretically assume a very fast network with a 10,000ns RTT:</p>

<ul>
	<li>Network memory read latency = 100ns + 10,000ns + [2 x OS / hypervisor overhead]</li>
</ul>

<p>What that means is even with a theoretically fast network, there is a 10,000% overhead when compared to a non-network memory access. With a slow network this can be upwards of a 500,000% latency overhead.</p>

<p>In order to alleviate this overhead, server side caching technologies are introduced.</p>
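<p>The overhead percentages quoted above follow directly from the formulas. Here is a minimal sketch, assuming the ~100ns local read latency used in this section and treating OS / hypervisor overhead as a separate, optional term (it is left at zero when computing the percentages); the function names are illustrative.</p>

```python
LOCAL_MEMORY_NS = 100  # Typical main memory latency assumed in this section.


def network_read_ns(rtt_ns: float, overhead_ns: float = 0) -> float:
    """Network memory read latency = 100ns + NW RTT + [2 x OS / hypervisor overhead]."""
    return LOCAL_MEMORY_NS + rtt_ns + 2 * overhead_ns


def overhead_percent(rtt_ns: float) -> float:
    """Latency added by the network RTT, as a percentage of a local memory read."""
    return rtt_ns / LOCAL_MEMORY_NS * 100


if __name__ == "__main__":
    for label, rtt in (("fast network", 10_000), ("typical network", 500_000)):
        print(f"{label}: {network_read_ns(rtt):,.0f} ns total, "
              f"{overhead_percent(rtt):,.0f}% overhead")
```

<p>Even before adding any OS / hypervisor overhead, the network round trip dominates the total, which is why keeping reads local matters so much.</p>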
</section>
</section>

<section data-type="chapter" id="book-of-web-scale-NYIQSn">
<h2>Book of Web-Scale</h2>

<p class="definition"><strong>web·scale - /web ' skãl/ - noun - computing architecture</strong>
<br>
a new architectural approach to infrastructure and computing.</p>

<p>This section will present some of the core concepts behind “Web-scale” infrastructure and why we leverage them. Before I get started, I want to clearly state that Web-scale doesn’t mean you need to be “web-scale” (e.g. Google, Facebook, or Microsoft).&nbsp; These constructs are applicable and beneficial at any scale (3-nodes or thousands of nodes).</p>

<p>Historical challenges included:</p>

<ul>
	<li>Complexity, complexity, complexity</li>
	<li>Desire for incremental based growth</li>
	<li>The need to be agile</li>
</ul>

<p>There are a few key constructs used when talking about “Web-scale” infrastructure:</p>

<ul>
	<li>Hyper-convergence</li>
	<li>Software defined intelligence</li>
	<li>Distributed autonomous systems</li>
	<li>Incremental and linear scale out</li>
</ul>

<p>Other related items:</p>

<ul>
	<li>API-based automation and rich analytics</li>
	<li>Self-healing</li>
</ul>

<p>The following sections will provide a technical perspective on what they actually mean.</p>

<section data-type="sect1" id="hyper-convergence-ONIRcvSY">
<h3>Hyper-Convergence</h3>

<p>There are differing opinions on what hyper-convergence actually is.&nbsp; It also varies based on the scope of components (e.g. virtualization, networking, etc.). However, the core concept comes down to the following: natively combining two or more components into a single unit. ‘Natively’ is the key word here. In order to be the most effective, the components must be natively integrated and not just bundled together. In the case of Nutanix, we natively converge compute + storage to form a single node used in our appliance.&nbsp; For others, this might be converging storage with the network, etc. What it really means:</p>

<ul>
	<li>Natively integrating two or more components into a single unit which can be easily scaled</li>
</ul>

<p>Benefits include:</p>

<ul>
	<li>Single unit to scale</li>
	<li>Localized I/O</li>
	<li>Eliminates traditional compute / storage silos by converging them</li>
</ul>
</section>

<section data-type="sect1" id="software-defined-intelligence-nrIRIWSn">
<h3>Software-Defined Intelligence</h3>

<p>Software-defined intelligence is taking the core logic from normally proprietary or specialized hardware (e.g. ASIC / FPGA) and doing it in software on commodity hardware. For Nutanix, we take the traditional storage logic (e.g. RAID, deduplication, compression, etc.) and put that into software that runs in each of the Nutanix Controller VMs (CVM) on standard x86 hardware. What it really means:</p>

<ul>
	<li>Pulling key logic from hardware and doing it in software on commodity hardware</li>
</ul>

<p>Benefits include:</p>

<ul>
	<li>Rapid release cycles</li>
	<li>Elimination of proprietary hardware reliance</li>
	<li>Utilization of commodity hardware for better economics</li>
</ul>
</section>

<section data-type="sect1" id="distributed-autonomous-systems-b1IeU4Sb">
<h3>Distributed Autonomous Systems</h3>

<p>Distributed autonomous systems involve moving away from the traditional concept of having a single unit responsible for doing something and distributing that role among all nodes within the cluster.&nbsp; You can think of this as creating a purely distributed system. Traditionally, vendors have assumed that hardware will be reliable, which, in most cases can be true.&nbsp; However, core to distributed systems is the idea that hardware will eventually fail and handling that fault in an elegant and non-disruptive way is key.</p>

<p>These distributed systems are designed to accommodate and remediate failure, to form something that is self-healing and autonomous.&nbsp; In the event of a component failure, the system will transparently handle and remediate the failure, continuing to operate as expected. Alerting will make the user aware, but rather than being a critical time-sensitive item, any remediation (e.g. replace a failed node) can be done on the admin’s schedule.&nbsp; Another way to put it is fail in-place (rebuild without replace). For items where a “master” is needed, an election process is utilized; in the event this master fails, a new master is elected.&nbsp; To distribute the processing of tasks, MapReduce concepts are leveraged. What it really means:</p>

<ul>
	<li>Distributing roles and responsibilities to all nodes within the system</li>
	<li>Utilizing concepts like MapReduce to perform distributed processing of tasks</li>
	<li>Using an election process in the case where a “master” is needed</li>
</ul>

<p>Benefits include:</p>

<ul>
	<li>Eliminates any single points of failure (SPOF)</li>
	<li>Distributes workload to eliminate any bottlenecks</li>
</ul>
</section>

<section data-type="sect1" id="incremental-and-linear-scale-out-rkIZhxSN">
<h3>Incremental and Linear Scale Out</h3>

<p>Incremental and linear scale out relates to the ability to start with a certain set of resources and as needed scale them out while linearly increasing the performance of the system.&nbsp; All of the constructs mentioned above are critical enablers in making this a reality. For example, traditionally you’d have three layers of components for running virtual workloads: servers, storage, and network – all of which are scaled independently.&nbsp; When you scale out the number of servers, you’re not scaling out your storage performance. With a hyper-converged platform like Nutanix, when you scale out with new node(s) you’re scaling out:</p>

<ul>
	<li>The number of hypervisor / compute nodes</li>
	<li>The number of storage controllers</li>
	<li>The compute and storage performance / capacity</li>
	<li>The number of nodes participating in cluster wide operations</li>
</ul>

<p>What it really means:</p>

<ul>
	<li>The ability to incrementally scale storage / compute with linear increases to performance / ability</li>
</ul>

<p>Benefits include:</p>

<ul>
	<li>The ability to start small and scale</li>
	<li>Uniform and consistent performance at any scale</li>
</ul>
</section>
</section>

<section data-type="sect1" id="making-sense-of-it-all-22IWHQ">
<h3>Making Sense of It All</h3>

<p>In summary:</p>

<ol>
	<li>Inefficient compute utilization led to the move to virtualization</li>
	<li>Features including vMotion, HA, and DRS led to the requirement of centralized storage</li>
	<li>VM sprawl led to increased load and contention on storage</li>
	<li>SSDs came in to alleviate the issues but changed the bottleneck to the network / controllers</li>
	<li>Cache / memory accesses over the network face large overheads, minimizing their benefits</li>
	<li>Array configuration complexity still remains the same</li>
	<li>Server side caches were introduced to alleviate the load on the array / impact of the network; however, they introduce another component to the solution</li>
	<li>Locality helps alleviate the bottlenecks / overheads traditionally faced when going over the network</li>
	<li>Shifts the focus from infrastructure to ease of management and simplifying the stack</li>
	<li>The birth of the Web-Scale world!</li>
</ol>
</section>
</div>

<div data-type="part" id="book-of-prism-7gEiv">
<h1><span class="label">Part II. </span>Book of Prism</h1>

<p class="definition"><strong>prism - /'prizɘm/ - noun - control plane</strong>
<br>
one-click management and interface for datacenter operations.</p>

<section data-type="chapter" id="design-methodology-and-iterations-13IRuV">
<h2>Design Methodology and Iterations</h2>

<p>
	Building a beautiful, empathetic and intuitive product is core to the Nutanix platform and something we take very seriously.  This section will cover our design methodology and how we iterate on it.  More coming here soon!
</p>

<p>
	In the meantime feel free to check out this great post on our design methodology and iterations by our Product Design Lead, Jeremy Sallee (who also designed this) - <a href="http://salleedesign.com/stuff/sdwip/blog/nutanix-case-study/">http://salleedesign.com/stuff/sdwip/blog/nutanix-case-study/</a>
</p>

<p>
  You can download the Nutanix Visio stencils here: <a href="http://www.visiocafe.com/nutanix.htm">http://www.visiocafe.com/nutanix.htm</a>
</p>

</section>

<section data-type="chapter" id="architecture-NYIVT0">
<h2>Architecture</h2>

<p>Prism is a distributed resource management platform which allows users to manage and monitor objects and services across their Nutanix environment.</p>

<p>These capabilities are broken down into two key categories:</p>

<ul>
	<li>Interfaces
	<ul>
		<li>HTML5 UI, REST API, CLI, PowerShell CMDlets, etc.</li>
	</ul>
	</li>
	<li>Management
	<ul>
		<li>Policy definition and compliance, service design and status, analytics and monitoring</li>
	</ul>
	</li>
</ul>

<p>The figure highlights an image illustrating the conceptual nature of Prism as part of the Nutanix platform:</p>

<figure id="id-XWtxHVTW"><img alt="High-Level Prism Architecture" class="iimagesv2arch_prismpng" src="imagesv2/arch_prism.png">
<figcaption><span class="label">Figure 5-1. </span>High-Level Prism Architecture</figcaption>
</figure>

<p>Prism is broken down into two main components:</p>

<ul>
	<li>Prism Central (PC)
	<ul>
		<li>Multi-cluster manager responsible for managing multiple Acropolis Clusters to provide a single, centralized management interface. &nbsp;Prism Central is an optional software appliance (VM) which can be deployed in addition to the Acropolis Cluster (can run on it).</li>
		<li>1-to-many cluster manager</li>
	</ul>
	</li>
	<li>Prism Element (PE)
	<ul>
		<li>Localized cluster manager responsible for local cluster management and operations. &nbsp;Every Acropolis Cluster has Prism Element built-in.</li>
		<li>1-to-1 cluster manager</li>
	</ul>
	</li>
</ul>

<p>The figure shows an image illustrating the conceptual relationship between Prism Central and Prism Element:</p>

<figure id="id-zmt2i4Tx"><img alt="Prism Architecture" class="iimagesv2prism_arch2png" src="imagesv2/prism_arch2.png">
<figcaption><span class="label">Figure 5-2. </span>Prism Architecture</figcaption>
</figure>

<div data-type="note" class="note" id="pro-tip-05i5cRT9"><h6>Note</h6>
<h5>Pro tip</h5>

<p>For larger or distributed deployments (e.g. more than one cluster or multiple sites) it is recommended to use Prism Central to simplify operations and provide a single management UI for all clusters / sites.</p>
</div>

<h3>Prism Services</h3>

<p>A Prism service runs on every CVM with an elected Prism Leader which is responsible for handling HTTP requests.&nbsp; Similar to other components which have a Master, if the Prism Leader fails, a new one will be elected. &nbsp;When a CVM which is not the Prism Leader receives an HTTP request, it will permanently redirect the request to the current Prism Leader using HTTP response status code 301.</p>

<p>Here we show a conceptual view of the Prism services and how HTTP request(s) are handled:</p>

<figure id="id-DktNCvTn"><img alt="Prism Services - Request Handling" class="iimagesv2prism_services3png" src="imagesv2/prism_services3.png">
<figcaption><span class="label">Figure 5-3. </span>Prism Services - Request Handling</figcaption>
</figure>
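<p>The request handling described above can be modeled as a small pure function. This is a sketch for illustration only: the 'handle_request' name, the IP addresses and the '/console/' path are hypothetical, and the redirect target assumes the HTTPS port on which Prism listens (9440).</p>

```python
from typing import Dict, Tuple

PRISM_HTTPS_PORT = 9440  # HTTPS port Prism listens on


def handle_request(cvm_ip: str, leader_ip: str, path: str) -> Tuple[int, Dict[str, str]]:
    """Return (status_code, headers) for a request arriving at cvm_ip."""
    if cvm_ip == leader_ip:
        # The Prism Leader serves the request itself.
        return 200, {}
    # Any other CVM answers with a permanent redirect to the leader.
    location = f"https://{leader_ip}:{PRISM_HTTPS_PORT}{path}"
    return 301, {"Location": location}


# Hypothetical addresses and path, for illustration only.
leader = "10.0.0.11"
print(handle_request("10.0.0.11", leader, "/console/"))
print(handle_request("10.0.0.12", leader, "/console/"))
```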

<div data-type="note" class="note" id="prism-ports-53iysDTA"><h6>Note</h6>
<h3>Prism ports</h3>

<p>Prism listens on ports 80 and 9440. If HTTP traffic comes in on port 80, it is redirected to HTTPS on port 9440.</p>
</div>

<p>When using the cluster external IP (recommended),&nbsp;it will always be hosted by the current Prism Leader. &nbsp;In the event of a Prism Leader failure the cluster IP will be assumed by the newly elected Prism Leader and a gratuitous ARP (gARP) will be used to clean any stale ARP cache entries. &nbsp;In this scenario any time the cluster IP is used to access Prism, no redirection is necessary as that will already be the Prism Leader.</p>

<div data-type="note" class="note" id="pro-tip-1RiATNTX"><h6>Note</h6>
<h5>Pro tip</h5>

<p>You can determine the current Prism leader by running 'curl localhost:2019/prism/leader' on any CVM.</p>
</div>
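<p>If you prefer to query the leader from a script rather than with curl, a minimal Python equivalent might look like the following. It must be run on a CVM, and it makes no assumption about the format of the response; it simply returns the body as text. The 'fetch_prism_leader' helper name and the timeout value are arbitrary choices.</p>

```python
import urllib.request

LEADER_URL = "http://localhost:2019/prism/leader"  # URL from the tip above


def fetch_prism_leader(url: str = LEADER_URL, timeout: float = 5.0) -> str:
    """Return the raw response body; its format is not assumed here."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    try:
        print(fetch_prism_leader())
    except OSError as exc:  # URLError and timeouts are subclasses of OSError
        print(f"could not query Prism leader: {exc}")
```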
</section>

<section data-type="chapter" id="navigation-22IzS5">
<h2>Navigation</h2>

<p>Prism is fairly straightforward and simple to use; however, we'll cover some of the main pages and basic usage.</p>

<p>Prism Central (if deployed) can be accessed using the IP address specified during configuration or corresponding DNS entry. &nbsp;Prism Element can be accessed via Prism Central (by clicking on a specific cluster) or by navigating to any Nutanix CVM or cluster IP (preferred).</p>

<p>Once the page has loaded you will be greeted with the Login page, where you will use your Prism or Active Directory credentials to log in.</p>

<figure id="id-XWtpS0SW"><img alt="" class="iimagesv2prismprism_loginpng" src="imagesv2/Prism/prism_login.png">
<figcaption><span class="label">Figure 6-1. </span>Prism Login Page</figcaption>
</figure>

<p>Upon successful login you will be sent to the dashboard page which will provide overview information for managed cluster(s) in Prism Central or the local cluster in Prism Element.</p>

<p>Prism Central and Prism Element will be covered in more detail in the following sections.</p>

<section data-type="sect1" id="prism-central-4aI3tqSe">
<h3>Prism Central</h3>

<p>Prism Central contains the following main pages:</p>

<ul>
	<li>Home Page
	<ul>
		<li>Environment wide monitoring dashboard including detailed information on service status, capacity planning, performance, tasks, etc. &nbsp;To get further information on any of them you can click on the item of interest.</li>
	</ul>
	</li>
	<li>Explore Page
	<ul>
		<li>Management and monitoring of services, cluster, VMs and hosts</li>
	</ul>
	</li>
	<li>Analysis Page
	<ul>
		<li>Detailed performance analysis for cluster and managed objects with event correlation</li>
	</ul>
	</li>
	<li>Alerts
	<ul>
		<li>Environment wide alerts</li>
	</ul>
	</li>
</ul>

<p>The figure shows a sample Prism Central dashboard where multiple clusters can be monitored / managed:</p>

<figure class="large" id="id-0OtpSxt2SW"><img alt="Prism Central - Dashboard" class="iimagesv2prismpc_dashboard2png" src="imagesv2/Prism/PC_dashboard2.png">
<figcaption><span class="label">Figure 6-2. </span>Prism Central - Dashboard</figcaption>
</figure>

<p>From here you can monitor the overall status of your environment, and dive deeper if there are any alerts or items of interest.</p>

<div data-type="note" class="note" id="pro-tip-G9iVFptNS9"><h6>Note</h6>
<h5>Pro tip</h5>

<p>If everything is green, go back to doing something else :)</p>
</div>
</section>

<section data-type="sect1" id="prism-element-BNIWfdSz">
<h3>Prism Element</h3>

<p>Prism Element contains the following main pages:</p>

<ul>
	<li>Home Page
	<ul>
		<li>Local cluster monitoring dashboard including detailed information on alerts, capacity, performance, health, tasks, etc. &nbsp;To get further information on any of them you can click on the item of interest.</li>
	</ul>
	</li>
	<li>Health Page
	<ul>
		<li>Environment, hardware and managed object health and state information. &nbsp;Includes NCC health check status as well.</li>
	</ul>
	</li>
	<li>VM Page
	<ul>
		<li>Full VM management, monitoring and CRUD (Acropolis)</li>
		<li>VM monitoring (non-Acropolis)</li>
	</ul>
	</li>
	<li>Storage Page
	<ul>
		<li>Container management, monitoring and CRUD</li>
	</ul>
	</li>
	<li>Hardware
	<ul>
		<li>Server, disk and network management,&nbsp;monitoring and health. &nbsp;Includes cluster expansion as well as node and disk removal.</li>
	</ul>
	</li>
	<li>Data Protection
	<ul>
		<li>DR, Cloud Connect and Metro Availability configuration. &nbsp;Management of PD objects,&nbsp;snapshots, replication and restore.</li>
	</ul>
	</li>
	<li>Analysis
	<ul>
		<li>Detailed performance analysis for cluster and managed objects with event correlation</li>
	</ul>
	</li>
	<li>Alerts
	<ul>
		<li>Local cluster and environment alerts</li>
	</ul>
	</li>
</ul>

<p>The home page will provide detailed information on alerts, service status, capacity, performance, tasks, and much more. &nbsp;To get further information on any of them you can click on the item of interest.</p>

<p>The figure shows a sample Prism Element dashboard where local cluster details are displayed:</p>

<figure class="large" id="id-DktkHEfaSr"><img alt="Prism Element - Dashboard" class="iimagesv2prismpe_dashboardpng" src="imagesv2/Prism/PE_dashboard.png">
<figcaption><span class="label">Figure 6-3. </span>Prism Element - Dashboard</figcaption>
</figure>

<div data-type="note" class="note" id="keyboard-shortcuts-53i0FXfDS9"><h6>Note</h6>
<h3>Keyboard Shortcuts</h3>

<p>Accessibility and ease of use are critical constructs in Prism. &nbsp;To simplify things for the end-user, a set of shortcuts has been added to allow users to do everything from their keyboard.</p>

<p>The following characterizes some of the key shortcuts:</p>

<p>Change view (page context aware):</p>

<ul>
	<li>O - Overview View</li>
	<li>D - Diagram View</li>
	<li>T - Table View</li>
</ul>

<p>Activities and Events:</p>

<ul>
	<li>A - Alerts</li>
	<li>P - Tasks</li>
</ul>

<p>Drop down and Menus (Navigate selection using arrow keys):</p>

<ul>
	<li>M - Menu drop-down</li>
	<li>S - Settings (gear icon)</li>
	<li>F - Search bar</li>
	<li>U - User drop down</li>
	<li>H - Help</li>
</ul>
</div>
</section>
</section>

<section data-type="chapter" id="usage-and-troubleshooting-3jIxHa">
<h2>Usage and Troubleshooting</h2>

<p>In the following sections we'll cover some of the typical Prism uses as well as some common troubleshooting scenarios.</p>

<section data-type="sect1" id="nutanix-software-upgrade-lkIBu8H5">
<h3>Nutanix Software Upgrade</h3>

<p>Performing a Nutanix software upgrade is a very simple and non-disruptive process.</p>

<p>To begin, start by logging into Prism and clicking on the gear icon on the top right (settings) or by pressing 'S' and selecting 'Upgrade Software':</p>

<figure id="id-d0t0TNunHd"><img alt="Prism - Settings - Upgrade Software" class="iimagesv2prismupgradeupgrade_1png" src="imagesv2/Prism/upgrade/upgrade_1.png">
<figcaption><span class="label">Figure 7-1. </span>Prism - Settings - Upgrade Software</figcaption>
</figure>

<p>This will launch the 'Upgrade Software' dialog box and will show your current software version and if there are any upgrade versions available. &nbsp;It is also possible to manually upload a NOS binary file.</p>

<p>You can then download the upgrade version from the cloud or upload the version manually:</p>

<figure id="id-89tyFxuAHl"><img alt="Upgrade Software - Main" class="iimagesv2prismupgradeupgrade_2png" src="imagesv2/Prism/upgrade/upgrade_2.png">
<figcaption><span class="label">Figure 7-2. </span>Upgrade Software - Main</figcaption>
</figure>

<p>It will then upload the upgrade software onto the Nutanix CVMs:</p>

<figure id="id-0OtVfvu9HW"><img alt="Upgrade Software - Upload" class="iimagesv2prismupgradeupgrade_3png" src="imagesv2/Prism/upgrade/upgrade_3.png">
<figcaption><span class="label">Figure 7-3. </span>Upgrade Software - Upload</figcaption>
</figure>

<p>After the software is loaded click on 'Upgrade' to start the upgrade process:</p>

<figure id="id-DktQcOunHr"><img alt="Upgrade Software - Upgrade Validation" class="iimagesv2prismupgradeupgrade_4png" src="imagesv2/Prism/upgrade/upgrade_4.png">
<figcaption><span class="label">Figure 7-4. </span>Upgrade Software - Upgrade Validation</figcaption>
</figure>

<p>You'll then be prompted with a confirmation box:</p>

<figure id="id-GRtGUyubH9"><img alt="Upgrade Software - Confirm Upgrade" class="iimagesv2prismupgradeupgrade_5png" src="imagesv2/Prism/upgrade/upgrade_5.png">
<figcaption><span class="label">Figure 7-5. </span>Upgrade Software - Confirm Upgrade</figcaption>
</figure>

<p>The upgrade will start with pre-upgrade checks then start upgrading the software in a rolling manner:</p>

<figure id="id-RBtOCvuWHR"><img alt="Upgrade Software - Execution" class="iimagesv2prismupgradeupgrade_6png" src="imagesv2/Prism/upgrade/upgrade_6.png">
<figcaption><span class="label">Figure 7-6. </span>Upgrade Software - Execution</figcaption>
</figure>

<p>Once the upgrade is complete you'll see an updated status and have access to all of the new features:</p>

<figure id="id-NMtXu8unHl"><img alt="Upgrade Software - Complete" class="iimagesv2prismupgradeupgrade_7png" src="imagesv2/Prism/upgrade/upgrade_7.png">
<figcaption><span class="label">Figure 7-7. </span>Upgrade Software - Complete</figcaption>
</figure>

<div data-type="note" class="note" id="note-P0i1TQuRHz"><h6>Note</h6>
<h5>Note</h5>

<p>Your Prism session will briefly disconnect during the upgrade when the current Prism Leader is upgraded. &nbsp;All VMs and services running remain unaffected.</p>
</div>
</section>

<section data-type="sect1" id="hypervisor-upgrade-M2I5TWHq">
<h3>Hypervisor Upgrade</h3>

<p>Similar to Nutanix software upgrades, hypervisor upgrades can be fully automated in a rolling manner via Prism.</p>

<p>To begin, follow the same steps as above to launch the 'Upgrade Software' dialog box and select 'Hypervisor'.</p>

<p>You can then download the hypervisor upgrade version from the cloud or upload the version manually:</p>

<figure id="id-zmtoS4TxHM"><img alt="Upgrade Hypervisor - Main" class="iimagesv2prismupgradehyp_upgrade_1png" src="imagesv2/Prism/upgrade/hyp_upgrade_1.png">
<figcaption><span class="label">Figure 7-8. </span>Upgrade Hypervisor - Main</figcaption>
</figure>

<p>It will then load the upgrade software onto the Hypervisors. &nbsp;After the software is loaded click on 'Upgrade' to start the upgrade process:</p>

<figure id="id-xbtnFYTqHQ"><img alt="Upgrade Hypervisor - Upgrade Validation" class="iimagesv2prismupgradehyp_upgrade_2png" src="imagesv2/Prism/upgrade/hyp_upgrade_2.png">
<figcaption><span class="label">Figure 7-9. </span>Upgrade Hypervisor - Upgrade Validation</figcaption>
</figure>

<p>You'll then be prompted with a confirmation box:</p>

<figure id="id-A0t1fYT9Hg"><img alt="Upgrade Hypervisor - Confirm Upgrade" class="iimagesv2prismupgradehyp_upgrade_3png" src="imagesv2/Prism/upgrade/hyp_upgrade_3.png">
<figcaption><span class="label">Figure 7-10. </span>Upgrade Hypervisor - Confirm Upgrade</figcaption>
</figure>

<p>The system will then go through host pre-upgrade checks and upload the hypervisor upgrade to the cluster:</p>

<figure id="id-k1tzcqT5HX"><img alt="Upgrade Hypervisor - Pre-upgrade Checks" class="iimagesv2prismupgradehyp_upgrade_4png" src="imagesv2/Prism/upgrade/hyp_upgrade_4.png">
<figcaption><span class="label">Figure 7-11. </span>Upgrade Hypervisor - Pre-upgrade Checks</figcaption>
</figure>

<p>Once the pre-upgrade checks are complete the rolling hypervisor upgrade will then proceed:</p>

<figure id="id-5WtYUDTAH9"><img alt="Upgrade Hypervisor - Execution" class="iimagesv2prismupgradehyp_upgrade_5png" src="imagesv2/Prism/upgrade/hyp_upgrade_5.png">
<figcaption><span class="label">Figure 7-12. </span>Upgrade Hypervisor - Execution</figcaption>
</figure>

<p>Similar to the rolling nature of the Nutanix software upgrades, each host will be upgraded in a rolling manner with zero impact to running VMs. &nbsp;VMs will be live-migrated off the current host, the host will be upgraded, and then rebooted. &nbsp;This process will iterate through each host until all hosts in the cluster are upgraded.</p>
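<p>The rolling sequence described above (migrate VMs off, upgrade, reboot, move on to the next host) can be simulated with a short Python sketch. This is a toy model, not the actual upgrade logic: the 'rolling_upgrade' function, the host and VM names, the version strings and the round-robin placement of migrated VMs are all assumptions made for illustration.</p>

```python
def rolling_upgrade(hosts, target_version):
    """Upgrade hosts one at a time, moving VMs off each host first.

    hosts: dict mapping host name -> {"version": str, "vms": list of VM names}
    Returns the ordered log of steps taken.
    """
    log = []
    names = list(hosts)
    for name in names:
        host = hosts[name]
        others = [n for n in names if n != name]
        if not others:
            raise ValueError("need at least two hosts to migrate VMs")
        # 1. Live-migrate VMs off the host (round-robin across the others).
        for i, vm in enumerate(host["vms"]):
            dest = others[i % len(others)]
            hosts[dest]["vms"].append(vm)
            log.append(f"migrate {vm}: {name} -> {dest}")
        host["vms"] = []
        # 2. Upgrade the now-empty host, then 3. reboot it.
        host["version"] = target_version
        log.append(f"upgrade {name} to {target_version}")
        log.append(f"reboot {name}")
    return log


# Hypothetical three-host cluster.
hosts = {
    "host-1": {"version": "v1", "vms": ["vm-a", "vm-b"]},
    "host-2": {"version": "v1", "vms": ["vm-c"]},
    "host-3": {"version": "v1", "vms": []},
}
for step in rolling_upgrade(hosts, "v2"):
    print(step)
```

<p>Only one host is ever out of service at a time, and every VM is always running somewhere, which is why the process has no impact on workloads.</p>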

<div data-type="note" class="note" id="pro-tip-21iyCmTqHR"><h6>Note</h6>
<h5>Pro tip</h5>

<p>You can also get cluster wide upgrade status from any Nutanix CVM by running 'host_upgrade --status'. &nbsp;The detailed per host status is logged to ~/data/logs/host_upgrade.out on each CVM.</p>
</div>

<p>Once the upgrade is complete you'll see an updated status and have access to all of the new features:</p>

<figure id="id-2btAumTqHR"><img alt="Upgrade Hypervisor - Complete" class="iimagesv2prismupgradehyp_upgrade_6png" src="imagesv2/Prism/upgrade/hyp_upgrade_6.png">
<figcaption><span class="label">Figure 7-13. </span>Upgrade Hypervisor - Complete</figcaption>
</figure>

</section>

<section data-type="sect1" id="cluster-expansion-add-node-QMI5SwHj">
<h3>Cluster Expansion (add node)</h3>

<p>
	The ability to dynamically scale the Acropolis cluster is core to its functionality.  To scale an Acropolis cluster, rack / stack / cable the nodes and power them on.  Once the nodes are powered up they will be discoverable by the current cluster using mDNS.
</p>

<p>
	The figure shows an example 7 node cluster with 1 node which has been discovered:
</p>

<figure id="id-zmt9TOSxHM"><img alt="Add Node - Discovery" src="imagesv2/Prism/addnode/expand_1.png">
<figcaption><span class="label">Figure 7-14. </span>Add Node - Discovery</figcaption>
</figure>

<p>
	Multiple nodes can be discovered and added to the cluster concurrently.
</p>

<p>
	Once the nodes have been discovered you can begin the expansion by clicking 'Expand Cluster' on the upper right hand corner of the 'Hardware' page:
</p>

<figure id="id-0Ot0FaS9HW"><img alt="Hardware Page - Expand Cluster" src="imagesv2/Prism/addnode/expand_2a.png">
<figcaption><span class="label">Figure 7-15. </span>Hardware Page - Expand Cluster</figcaption>
</figure>

<p>
	You can also begin the cluster expansion process from any page by clicking on the gear icon:
</p>

<figure id="id-DktrfgSnHr"><img alt="Gear Menu - Expand Cluster" src="imagesv2/Prism/addnode/expand_2b.png">
<figcaption><span class="label">Figure 7-16. </span>Gear Menu - Expand Cluster</figcaption>
</figure>

<p>
	This launches the expand cluster menu where you can select the node(s) to add and specify IP addresses for the components:
</p>

<figure id="id-GRtMcVSbH9"><img alt="Expand Cluster - Host Selection" src="imagesv2/Prism/addnode/expand_3.png">
<figcaption><span class="label">Figure 7-17. </span>Expand Cluster - Host Selection</figcaption>
</figure>

<p>
	After the hosts have been selected you'll be prompted to upload a hypervisor image which will be used to image the nodes being added:
</p>

<figure id="id-RBtnU4SWHR"><img alt="Expand Cluster - Host Configuration" src="imagesv2/Prism/addnode/expand_4.png">
<figcaption><span class="label">Figure 7-18. </span>Expand Cluster - Host Configuration</figcaption>
</figure>

<p>
	After the upload is completed you can click on 'Expand Cluster' to begin the imaging and expansion process:
</p>

<figure id="id-NMtbCnSnHl"><img alt="Expand Cluster - Execution" src="imagesv2/Prism/addnode/expand_5.png">
<figcaption><span class="label">Figure 7-19. </span>Expand Cluster - Execution</figcaption>
</figure>

<p>
	The job will then be submitted and the corresponding task item will appear:
</p>

<figure id="id-3rtEuXSeHB"><img alt="Expand Cluster - Execution" src="imagesv2/Prism/addnode/expand_6.png">
<figcaption><span class="label">Figure 7-20. </span>Expand Cluster - Execution</figcaption>
</figure>

<p>
	Detailed task status can be viewed by expanding the task(s):
</p>

<figure id="id-lNtDSzS5HV"><img alt="Expand Cluster - Execution" src="imagesv2/Prism/addnode/expand_7.png">
<figcaption><span class="label">Figure 7-21. </span>Expand Cluster - Execution</figcaption>
</figure>

<p>
	After the imaging and add node process has been completed you'll see the updated cluster size and resources:
</p>

<figure id="id-Q4t9FESjHm"><img alt="Expand Cluster - Execution" src="imagesv2/Prism/addnode/expand_9.png">
<figcaption><span class="label">Figure 7-22. </span>Expand Cluster - Execution</figcaption>
</figure>

</section>

<section data-type="sect1" id="capacity-planning-J5IrHVHl">
<h3>Capacity Planning</h3>

<p>To view detailed capacity planning information, click on a specific cluster under the 'cluster runway' section in Prism Central:</p>

<figure class="large" id="id-zmt8uyHxHM"><img alt="Prism Central - Capacity Planning" class="iimagesv2prismpc_capplannerpng" src="imagesv2/Prism/pc_capplanner.png">
<figcaption><span class="label">Figure 7-23. </span>Prism Central - Capacity Planning</figcaption>
</figure>

<p>This view provides detailed information on cluster runway and identifies the most constrained resource (limiting resource). &nbsp;You can also get detailed information on what the top consumers are, as well as some potential options to free up additional capacity or ideal node types for cluster expansion.</p>

<figure id="id-xbtOSXHqHQ"><img alt="Prism Central - Capacity Planning - Recommendations" class="iimagesv2prismpc_recommendationpng" src="imagesv2/Prism/pc_recommendation.png">
<figcaption><span class="label">Figure 7-24. </span>Prism Central - Capacity Planning - Recommendations</figcaption>
</figure>
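<p>To make the notions of "runway" and "limiting resource" concrete, here is a small Python sketch. It is not how Prism Central computes runway; it assumes a simple constant daily growth per resource, and the resource names and figures are made up for illustration.</p>

```python
def runway_days(capacity, used, growth_per_day):
    """Days until `used` reaches `capacity` at a constant daily growth."""
    if growth_per_day <= 0:
        return float("inf")  # no growth means no projected exhaustion
    return max(capacity - used, 0) / growth_per_day


def limiting_resource(resources):
    """Return (name, days) for the resource with the shortest runway."""
    runways = {name: runway_days(**r) for name, r in resources.items()}
    name = min(runways, key=runways.get)
    return name, runways[name]


# Made-up figures for illustration only.
resources = {
    "cpu_ghz": {"capacity": 200.0, "used": 120.0, "growth_per_day": 0.5},
    "memory_gib": {"capacity": 1024.0, "used": 768.0, "growth_per_day": 4.0},
    "storage_tib": {"capacity": 40.0, "used": 22.0, "growth_per_day": 0.05},
}
name, days = limiting_resource(resources)
print(f"limiting resource: {name}, about {days:.0f} days of runway")
```

<p>The cluster's overall runway is bounded by whichever resource runs out first, which is why the limiting resource is the one to watch when planning an expansion.</p>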
</section>

<p>The HTML5 UI is a key part of Prism, providing a simple, easy-to-use management interface. &nbsp;However, another core ability is the set of APIs available for automation. &nbsp;All functionality exposed through the Prism UI is also exposed through a full set of REST APIs to allow for the ability to programmatically interface with the Nutanix platform. &nbsp;This allows customers and partners to enable automation, 3rd-party tools, or even create their own UI.</p>

<p>The following section covers these interfaces and provides some example usage.</p>
</section>

<section data-type="chapter" id="apis-and-interfaces-PgIAF1">
<h2>APIs and Interfaces</h2>

<p>Core to any dynamic or “software-defined” environment, Nutanix provides a vast array of interfaces allowing for simple programmability and interfacing. Here are the main interfaces:</p>

<ul>
	<li>REST API</li>
	<li>CLI - ACLI &amp; NCLI</li>
	<li>Scripting interfaces</li>
</ul>

<p>Core to this is the REST API which exposes every capability and data point of the Prism UI and allows for orchestration or automation tools to easily drive Nutanix action.&nbsp; This enables tools like Saltstack, Puppet, vRealize Operations, System Center Orchestrator, Ansible, etc.&nbsp;to easily create custom workflows for Nutanix. Also, this means that any third-party developer could create their own custom UI and pull in Nutanix data via REST.</p>

<p>The following figure shows a small snippet of the Nutanix REST API explorer which allows developers to interact with the API and see expected data formats:</p>

<figure class="large" id="id-Vrt4HjF5"><img alt="Prism REST API Explorer" class="iimagesv2restapipng" src="imagesv2/RestAPI.png">
<figcaption><span class="label">Figure 8-1. </span>Prism REST API Explorer</figcaption>
</figure>

<p>Operations can be expanded to display details and examples of the REST call:</p>

<figure class="large" id="id-89tJtRFA"><img alt="Prism REST API Sample Call" class="iimagesv2restapi2png" src="imagesv2/RestAPI2.png">
<figcaption><span class="label">Figure 8-2. </span>Prism REST API Sample Call</figcaption>
</figure>

<div data-type="note" class="note" id="api-authentication-schemes-Ami1fVF9"><h6>Note</h6>
<h5>API Authentication Scheme(s)</h5>

<p>As of 4.5.x basic authentication over HTTPS is leveraged for client and HTTP call authentication.</p>
</div>
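<p>As a rough illustration of basic authentication over HTTPS, the following Python sketch builds (but does not send) an authenticated request using only the standard library. The cluster address, credentials and the '/api/example' path are placeholders; take the real resource paths from the REST API explorer shown above.</p>

```python
import base64
import urllib.request


def basic_auth_header(username: str, password: str) -> str:
    """Build the value of an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")


def build_request(cluster_ip: str, path: str, username: str, password: str):
    """Prepare (but do not send) an authenticated HTTPS request."""
    url = f"https://{cluster_ip}:9440{path}"
    req = urllib.request.Request(url)
    req.add_header("Authorization", basic_auth_header(username, password))
    req.add_header("Accept", "application/json")
    return req


# Hypothetical cluster address, credentials and API path.
req = build_request("10.0.0.10", "/api/example", "admin", "secret")
print(req.full_url)
print(req.get_header("Authorization"))
```

<p>Basic authentication only encodes the credentials, it does not encrypt them, which is why it is used over HTTPS.</p>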

<section data-type="sect1" id="acli-b1IPigFB">
<h3>ACLI</h3>

<p>The Acropolis CLI (ACLI) is the CLI for managing the Acropolis portion of the Nutanix product. &nbsp;These capabilities were enabled in releases after 4.1.2.</p>

<p>NOTE: All of these actions can be performed via the HTML5 GUI and REST API.&nbsp; I just use these commands as part of my scripting to automate tasks.</p>

<h5>Enter ACLI shell</h5>

<p class="codedescription">Description: Enter ACLI shell (run from any CVM)</p>

<p class="codetext">acli</p>

<p>OR</p>

 <p class="codedescription">Description: Execute ACLI command via Linux shell</p>

<p class="codetext">acli &lt;Command&gt;</p>

<h5>Output ACLI response in json format</h5>

 <p class="codedescription">Description: Output the ACLI response in JSON format</p>

<p class="codetext">acli -o json</p>

<h5>List hosts</h5>

 <p class="codedescription">Description: Lists Acropolis nodes in the cluster.</p>

<p class="codetext">host.list</p>
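<p>When scripting, the shell form and the JSON output option can be combined. The Python sketch below wraps that pattern; it has to run on a CVM where 'acli' is available, the helper names are hypothetical, and the structure of the returned JSON is deliberately not assumed.</p>

```python
import json
import shlex
import subprocess


def build_acli_command(command: str, as_json: bool = True):
    """Return the argv list for running one ACLI command from the shell."""
    argv = ["acli"]
    if as_json:
        argv += ["-o", "json"]
    return argv + shlex.split(command)


def run_acli(command: str):
    """Run an ACLI command on a CVM and return the parsed JSON output."""
    result = subprocess.run(
        build_acli_command(command), capture_output=True, text=True, check=True
    )
    return json.loads(result.stdout)


# Only builds the command line; call run_acli("host.list") on a CVM to execute it.
print(build_acli_command("host.list"))
```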

<h5>Create network</h5>

 <p class="codedescription">Description: Create network based on VLAN</p>

<p class="codetext">net.create &lt;TYPE&gt;.&lt;ID&gt;[.&lt;VSWITCH&gt;] ip_config=&lt;A.B.C.D&gt;/&lt;NN&gt;</p>

<p class="codetext">Example: net.create vlan.133 ip_config=10.1.1.1/24</p>

<h5>List network(s)</h5>

 <p class="codedescription">Description: List networks</p>

<p class="codetext">net.list</p>

<h5>Create DHCP scope</h5>

 <p class="codedescription">Description: Create dhcp scope</p>

<p class="codetext">net.add_dhcp_pool &lt;NET NAME&gt; start=&lt;START IP A.B.C.D&gt; end=&lt;END IP W.X.Y.Z&gt;</p>

<p class="note">Note: .254 is reserved and used by the Acropolis DHCP server if an address for the Acropolis DHCP server wasn’t set during network creation</p>

<p class="codetext">Example: net.add_dhcp_pool vlan.100 start=10.1.1.100 end=10.1.1.200</p>
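<p>Before creating a pool it can be useful to sanity-check the range against the network's 'ip_config'. The following Python sketch does this with the standard 'ipaddress' module. The 'validate_dhcp_pool' helper is hypothetical, and it assumes the reserved address is the network address plus 254, which matches the ".254" note above for a /24 network.</p>

```python
import ipaddress


def validate_dhcp_pool(ip_config: str, start: str, end: str) -> None:
    """Raise ValueError if the pool does not fit the network in ip_config."""
    network = ipaddress.ip_interface(ip_config).network
    first = ipaddress.ip_address(start)
    last = ipaddress.ip_address(end)
    if first not in network or last not in network:
        raise ValueError(f"pool {start}-{end} is outside {network}")
    if first > last:
        raise ValueError("pool start must not be after pool end")
    # Assumption: the reserved DHCP server address is network address + 254.
    reserved = network.network_address + 254
    if reserved in network and first <= reserved <= last:
        raise ValueError(f"pool includes reserved address {reserved}")


# Values taken from the examples above.
validate_dhcp_pool("10.1.1.1/24", "10.1.1.100", "10.1.1.200")
print("pool ok")
```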

<h5>Get an existing network's details</h5>

 <p class="codedescription">Description: Get a network's properties</p>

<p class="codetext">net.get &lt;NET NAME&gt;</p>

<p class="codetext">Example: net.get vlan.133</p>

<h5>Get an existing network's VMs</h5>

 <p class="codedescription">Description: Get a network's VMs and details including VM name / UUID, MAC address and IP</p>

<p class="codetext">net.list_vms &lt;NET NAME&gt;</p>

<p class="codetext">Example: net.list_vms vlan.133</p>

<h5>Configure DHCP DNS servers for network</h5>

<p class="codedescription">Description: Set DHCP DNS</p>

<p class="codetext">net.update_dhcp_dns &lt;NET NAME&gt; servers=&lt;COMMA SEPARATED DNS IPs&gt; domains=&lt;COMMA SEPARATED DOMAINS&gt;</p>

<p class="codetext">Example: net.update_dhcp_dns vlan.100 servers=10.1.1.1,10.1.1.2 domains=splab.com</p>

<h5>Create Virtual Machine</h5>

 <p class="codedescription">Description: Create VM</p>

<p class="codetext">vm.create &lt;COMMA SEPARATED VM NAMES&gt; memory=&lt;NUM MEM MB&gt; num_vcpus=&lt;NUM VCPU&gt; num_cores_per_vcpu=&lt;NUM CORES&gt; ha_priority=&lt;PRIORITY INT&gt;</p>

<p class="codetext">Example: vm.create testVM memory=2G num_vcpus=2</p>

<h5>Bulk Create Virtual Machine</h5>

 <p class="codedescription">Description: Create bulk VM</p>

<p class="codetext">vm.create &nbsp;&lt;CLONE PREFIX&gt;[&lt;STARTING INT&gt;..&lt;END INT&gt;]&nbsp;memory=&lt;NUM MEM MB&gt; num_vcpus=&lt;NUM VCPU&gt; num_cores_per_vcpu=&lt;NUM CORES&gt; ha_priority=&lt;PRIORITY INT&gt;</p>

<p class="codetext">Example: vm.create testVM[000..999]&nbsp;memory=2G num_vcpus=2</p>
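<p>To preview which names a range pattern such as 'testVM[000..999]' stands for, the expansion can be mimicked in Python. This sketch only illustrates the naming and is not ACLI's own parser; in particular it assumes the zero padding of the starting integer is preserved in the generated names.</p>

```python
import re

_RANGE = re.compile(r"^(?P<prefix>.*)\[(?P<start>\d+)\.\.(?P<end>\d+)\]$")


def expand_names(pattern: str):
    """Expand 'name[000..002]' into ['name000', 'name001', 'name002']."""
    match = _RANGE.match(pattern)
    if match is None:
        return [pattern]  # plain name, nothing to expand
    start, end = match["start"], match["end"]
    width = len(start)  # assumption: keep the zero padding used in the pattern
    return [
        f"{match['prefix']}{i:0{width}d}"
        for i in range(int(start), int(end) + 1)
    ]


names = expand_names("testVM[000..999]")
print(len(names), names[0], names[-1])
```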

<h5>Clone VM from existing</h5>

<p class="codedescription">Description: Create clone of existing VM</p>

<p class="codetext">vm.clone &lt;CLONE NAME(S)&gt; clone_from_vm=&lt;SOURCE VM NAME&gt;</p>

<p class="codetext">Example: vm.clone testClone clone_from_vm=MYBASEVM</p>

<h5>Bulk Clone VM from existing</h5>

 <p class="codedescription">Description: Create bulk clones of existing VM</p>

<p class="codetext">vm.clone &lt;CLONE PREFIX&gt;[&lt;STARTING INT&gt;..&lt;END INT&gt;] clone_from_vm=&lt;SOURCE VM NAME&gt;</p>

<p class="codetext">Example: vm.clone testClone[001..999]&nbsp;clone_from_vm=MYBASEVM</p>

<h5>Create disk and add to VM</h5>

<p class="codedescription">Description: Create disk for OS</p>

<p class="codetext">vm.disk_create &lt;VM NAME&gt; create_size=&lt;Size and qualifier, e.g. 500G&gt; container=&lt;CONTAINER NAME&gt;</p>

<p class="codetext">Example: vm.disk_create testVM create_size=500G container=default</p>

<h5>Add NIC to VM</h5>

 <p class="codedescription">Description: Create and add NIC</p>

<p class="codetext">vm.nic_create &lt;VM NAME&gt; network=&lt;NETWORK NAME&gt; model=&lt;MODEL&gt;</p>

<p class="codetext">Example: vm.nic_create testVM network=vlan.100</p>

<h5>Set VM’s boot device to disk</h5>

 <p class="codedescription">Description: Set a VM boot device</p>

<p>Set to boot from specific disk id</p>

<p class="codetext">vm.update_boot_device &lt;VM NAME&gt; disk_addr=&lt;DISK BUS&gt;</p>

<p class="codetext">Example: vm.update_boot_device testVM disk_addr=scsi.0</p>

<h5>Set VM’s boot device to CDrom</h5>

<p>Set to boot from CDrom</p>

<p class="codetext">vm.update_boot_device &lt;VM NAME&gt; disk_addr=&lt;CDROM BUS&gt;</p>

<p class="codetext">Example: vm.update_boot_device testVM disk_addr=ide.0</p>

<h5>Mount ISO to CDrom</h5>

 <p class="codedescription">Description: Mount ISO to VM cdrom</p>

<p>Steps:</p>

<p>1. Upload ISOs to container</p>

<p>2. Enable whitelist for client IPs</p>

<p>3. Upload ISOs to share</p>

<p>Create CDrom with ISO</p>

<p class="codetext">vm.disk_create &lt;VM NAME&gt; clone_nfs_file=&lt;PATH TO ISO&gt; cdrom=true</p>

<p class="codetext">Example: vm.disk_create testVM clone_nfs_file=/default/ISOs/myfile.iso cdrom=true</p>

<p>If a CDrom is already created just mount it</p>

<p class="codetext">vm.disk_update &lt;VM NAME&gt; &lt;CDROM BUS&gt; clone_nfs_file=&lt;PATH TO ISO&gt;</p>

<p class="codetext">Example: vm.disk_update atestVM1 ide.0 clone_nfs_file=/default/ISOs/myfile.iso</p>

<h5>Detach ISO from CDrom</h5>

 <p class="codedescription">Description: Remove ISO from CDrom</p>

<p class="codetext">vm.disk_update &lt;VM NAME&gt; &lt;CDROM BUS&gt; empty=true</p>

<h5>Power on VM(s)</h5>

 <p class="codedescription">Description: Power on VM(s)</p>

<p class="codetext">vm.on &lt;VM NAME(S)&gt;</p>

<p class="codetext">Example: vm.on testVM</p>

<p>Power on all VMs</p>

<p class="codetext">Example: vm.on *</p>

<p>Power on range of VMs</p>

<p class="codetext">Example: vm.on testVM[01..99]</p>
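<p>The individual commands above are typically chained when building a VM from a script. The Python sketch below only generates the command strings in order, using the syntax shown in this section. The 'vm_build_commands' helper is hypothetical, and it assumes the first disk created lands at 'scsi.0', as in the boot device example.</p>

```python
def vm_build_commands(name, memory, vcpus, disk_size, container, network):
    """Return the ACLI commands, in order, to build and start one VM."""
    return [
        f"vm.create {name} memory={memory} num_vcpus={vcpus}",
        f"vm.disk_create {name} create_size={disk_size} container={container}",
        f"vm.nic_create {name} network={network}",
        # Assumption: the disk created above is attached at scsi.0.
        f"vm.update_boot_device {name} disk_addr=scsi.0",
        f"vm.on {name}",
    ]


commands = vm_build_commands("testVM", "2G", 2, "500G", "default", "vlan.100")
for cmd in commands:
    print("acli " + cmd)
```

<p>Generating the commands separately from running them makes it easy to review the plan before anything is executed against the cluster.</p>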
</section>

<section data-type="sect1" id="ncli-rkIAcbFJ">
<h3>NCLI</h3>

<p>NOTE: All of these actions can be performed via the HTML5 GUI and REST API.&nbsp; I just use these commands as part of my scripting to automate tasks.</p>

<h5>Add subnet to NFS whitelist</h5>

<p class="codedescription">Description: Adds a particular subnet to the NFS whitelist</p>

<p class="codetext">ncli cluster add-to-nfs-whitelist ip-subnet-masks=10.2.0.0/255.255.0.0</p>
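
<p>The command above expects the subnet in address/netmask form rather than CIDR notation. A minimal Python sketch, assuming you start from a CIDR string, converts it with the standard ipaddress module; the function names are hypothetical and only the ncli arguments shown above are used.</p>

```python
import ipaddress

def to_whitelist_entry(cidr):
    """Convert CIDR notation (e.g. 10.2.0.0/16) to address/netmask form."""
    # strict=True (the default) rejects inputs with host bits set, e.g. 10.2.0.1/16
    network = ipaddress.ip_network(cidr, strict=True)
    return network.with_netmask

def whitelist_command(cidr):
    """Build the ncli command line for the given subnet."""
    return f"ncli cluster add-to-nfs-whitelist ip-subnet-masks={to_whitelist_entry(cidr)}"

print(whitelist_command("10.2.0.0/16"))
```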

<h5>Display Nutanix Version</h5>

 <p class="codedescription">Description: Displays the current version of the Nutanix software</p>

<p class="codetext">ncli cluster version</p>

<h5>Display hidden NCLI options</h5>

 <p class="codedescription">Description: Displays the hidden ncli commands/options</p>

<p class="codetext">ncli helpsys listall hidden=true [detailed=false|true]</p>

<h5>List Storage Pools</h5>

 <p class="codedescription">Description: Displays the existing storage pools</p>

<p class="codetext">ncli sp ls</p>

<h5>List containers</h5>

 <p class="codedescription">Description: Displays the existing containers</p>

<p class="codetext">ncli ctr ls</p>

<h5>Create container</h5>

<p class="codedescription">Description: Creates a new container</p>

<p class="codetext">ncli ctr create name=&lt;NAME&gt; sp-name=&lt;SP NAME&gt;</p>

<h5>List VMs</h5>

 <p class="codedescription">Description: Displays the existing VMs</p>

<p class="codetext">ncli vm ls</p>

<h5>List public keys</h5>

 <p class="codedescription">Description: Displays the existing public keys</p>

<p class="codetext">ncli cluster list-public-keys</p>

<h5>Add public key</h5>

 <p class="codedescription">Description: Adds a public key for cluster access</p>

<p>SCP public key to CVM</p>

<p>Add public key to cluster</p>

<p class="codetext">ncli cluster add-public-key name=myPK file-path=~/mykey.pub</p>

<h5>Remove public key</h5>

 <p class="codedescription">Description: Removes a public key for cluster access</p>

<p class="codetext">ncli cluster remove-public-keys name=myPK</p>

<h5>Create protection domain</h5>

 <p class="codedescription">Description: Creates a protection domain</p>

<p class="codetext">ncli pd create name=&lt;NAME&gt;</p>

<h5>Create remote site</h5>

 <p class="codedescription">Description: Create a remote site for replication</p>

<p class="codetext">ncli remote-site create name=&lt;NAME&gt; address-list=&lt;Remote Cluster IP&gt;</p>

<h5>Create protection domain for all VMs in container</h5>

 <p class="codedescription">Description: Protect all VMs in the specified container</p>

<p class="codetext">ncli pd protect name=&lt;PD NAME&gt; ctr-id=&lt;Container ID&gt; cg-name=&lt;NAME&gt;</p>

<h5>Create protection domain with specified VMs</h5>

 <p class="codedescription">Description: Protect the VMs specified</p>

<p class="codetext">ncli pd protect name=&lt;PD NAME&gt; vm-names=&lt;VM Name(s)&gt; cg-name=&lt;NAME&gt;</p>

<h5>Create protection domain for DSF files (aka vDisk)</h5>

 <p class="codedescription">Description: Protect the DSF Files specified</p>

<p class="codetext">ncli pd protect name=&lt;PD NAME&gt; files=&lt;File Name(s)&gt; cg-name=&lt;NAME&gt;</p>

<h5>Create snapshot of protection domain</h5>

 <p class="codedescription">Description: Create a one-time snapshot of the protection domain</p>

<p class="codetext">ncli pd add-one-time-snapshot name=&lt;PD NAME&gt; retention-time=&lt;seconds&gt;</p>
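
<p>Because retention-time is given in seconds, it is easy to mistype a large value. The sketch below derives the value from a human-readable duration and builds the command string. The helper names and the protection domain name myPD are hypothetical.</p>

```python
from datetime import timedelta

def retention_seconds(**duration):
    """Convert a duration such as days=7 or hours=36 into whole seconds."""
    return int(timedelta(**duration).total_seconds())

def one_time_snapshot_command(pd_name, **duration):
    """Build the ncli command for a one-time snapshot with the given retention."""
    seconds = retention_seconds(**duration)
    return f"ncli pd add-one-time-snapshot name={pd_name} retention-time={seconds}"

# myPD is a placeholder protection domain name
print(one_time_snapshot_command("myPD", days=7))
```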

<h5>Create snapshot and replication schedule to remote site</h5>

 <p class="codedescription">Description: Create a recurring snapshot schedule and replication to n remote sites</p>

<p class="codetext">ncli pd set-schedule name=&lt;PD NAME&gt; interval=&lt;seconds&gt; retention-policy=&lt;POLICY&gt; remote-sites=&lt;REMOTE SITE NAME&gt;</p>

<h5>List replication status</h5>

 <p class="codedescription">Description: Monitor replication status</p>

<p class="codetext">ncli pd list-replication-status</p>

<h5>Migrate protection domain to remote site</h5>

 <p class="codedescription">Description: Fail-over a protection domain to a remote site</p>

<p class="codetext">ncli pd migrate name=&lt;PD NAME&gt; remote-site=&lt;REMOTE SITE NAME&gt;</p>

<h5>Activate protection domain</h5>

 <p class="codedescription">Description: Activate a protection domain at a remote site</p>

<p class="codetext">ncli pd activate name=&lt;PD NAME&gt;</p>

<h5>Enable DSF Shadow Clones</h5>

 <p class="codedescription">Description: Enables the DSF Shadow Clone feature</p>

<p class="codetext">ncli cluster edit-params enable-shadow-clones=true</p>

<h5>Enable Dedup for vDisk</h5>

 <p class="codedescription">Description: Enables fingerprinting and/or on disk dedup for a specific vDisk</p>

<p class="codetext">ncli vdisk edit name=&lt;VDISK NAME&gt; fingerprint-on-write=&lt;true/false&gt; on-disk-dedup=&lt;true/false&gt;</p>
</section>

<section data-type="sect1" id="PowerShell-cmdlets-mkIgIjFr">
<h3>PowerShell CMDlets</h3>

<p>The below will cover the Nutanix PowerShell CMDlets, how to use them and some general background on Windows PowerShell.</p>

<h5>Basics</h5>

<p>Windows PowerShell is a powerful shell (hence the name ;P) and scripting language built on the .NET framework.&nbsp; It is a very simple-to-use language and is built to be intuitive and interactive.&nbsp; Within PowerShell there are a few key constructs/items:</p>

<h5>CMDlets</h5>

<p>CMDlets are commands or .NET classes which perform a particular operation.&nbsp; They usually conform to the Getter/Setter methodology and typically use a &lt;Verb&gt;-&lt;Noun&gt; based structure.&nbsp; For example: Get-Process, Set-Partition, etc.</p>

<h5>Piping or Pipelining</h5>

<p>Piping is an important construct in PowerShell (similar to its use in Linux) and can greatly simplify things when used correctly.&nbsp; With piping you’re essentially taking the output of one section of the pipeline and using that as input to the next section of the pipeline.&nbsp; The pipeline can be as long as required (assuming there remains output which is being fed to the next section of the pipe). A very simple example could be getting the current services, finding those that match a particular trait or filter and then sorting them:</p>

<p class="codetext">Get-Service | where {$_.Status -eq "Running"} | Sort-Object Name</p>

<p>Piping can also be used in place of for-each, for example:</p>

<p class="codetext"># For each item in my array
<br>
$myArray | %{
<br>
&nbsp; # Do something
<br>
}</p>

<h5>Key Object Types</h5>

<p>Below are a few of the key object types in PowerShell.&nbsp; You can easily get the object type by using the .getType() method, for example: $someVariable.getType() will return the object's type.</p>

<h5>Variable</h5>

<p class="codetext">$myVariable = "foo"</p>

<p class="note">Note: You can also set a variable to the output of a series or pipeline of commands:</p>

<p class="codetext">$myVar2 = (Get-Service | where {$_.Status -eq "Running"})</p>

<p>In this example the commands inside the parentheses will be evaluated first, then the variable will be set to the result.</p>

<h5>Array</h5>

<p class="codetext">$myArray = @("Value","Value")</p>

<p class="note">Note: You can also have an array of arrays, hash tables or custom objects</p>

<h5>Hash Table</h5>

<p class="codetext">$myHash = @{"Key1" = "Value";"Key2" = "Value"}</p>

<h5>Useful commands</h5>

<p>Get the help content for a particular CMDlet (similar to a man page in Linux)</p>

<p class="codetext">Get-Help &lt;CMDlet Name&gt;</p>

<p class="codetext">Example: Get-Help Get-Process</p>

<p>List properties and methods of a command or object</p>

<p class="codetext">&lt;Some expression or object&gt; | Get-Member</p>

<p class="codetext">Example: $someObject | Get-Member</p>

<h5>Core Nutanix CMDlets and Usage</h5>

<h5>Download Nutanix CMDlets Installer</h5>

<p>The Nutanix CMDlets can be downloaded directly from the Prism UI (post 4.0.1) and can be found on the drop down in the upper right hand corner:</p>

<figure class="small" id="id-pltZS0I8FO"><img alt="Prism CMDlets Installer Link" class="iimagesv2cmdlets_dlpng" src="imagesv2/cmdlets_dl.png">
<figcaption><span class="label">Figure 8-3. </span>Prism CMDlets Installer Link</figcaption>
</figure>

<h5>Load Nutanix Snap-in</h5>

<p>Check if the snap-in is loaded and, if not, load it</p>

<p class="codetext">if ( (Get-PSSnapin -Name NutanixCmdletsPSSnapin -ErrorAction SilentlyContinue) -eq $null )
<br>
{
<br>
&nbsp;&nbsp;&nbsp; Add-PsSnapin NutanixCmdletsPSSnapin
<br>
}</p>

<h5>List Nutanix CMDlets</h5>

<p class="codetext">Get-Command | Where-Object{$_.PSSnapin.Name -eq "NutanixCmdletsPSSnapin"}</p>

<h5>Connect to an Acropolis Cluster</h5>

<p class="codetext">Connect-NutanixCluster -Server $server -UserName "myuser" -Password "myuser" -AcceptInvalidSSLCerts</p>

<p>Or prompt the user for the password instead of hard-coding it</p>

<p class="codetext">Connect-NutanixCluster -Server $server -UserName "myuser" -Password (Read-Host "Password: ") -AcceptInvalidSSLCerts</p>

<h5>Get Nutanix VMs matching a certain search string</h5>

<p>Set to variable</p>

<p class="codetext">$searchString = "myVM"
<br>
$vms = Get-NTNXVM | where {$_.vmName -match $searchString}</p>

<p>Interactive</p>

<p class="codetext">Get-NTNXVM | where {$_.vmName -match "myString"}</p>

<p>Interactive and formatted</p>

<p class="codetext">Get-NTNXVM | where {$_.vmName -match "myString"} | ft</p>

<h5>Get Nutanix vDisks</h5>

<p>Set to variable</p>

<p class="codetext">$vdisks = Get-NTNXVDisk</p>

<p>Interactive</p>

<p class="codetext">Get-NTNXVDisk</p>

<p>Interactive and formatted</p>

<p class="codetext">Get-NTNXVDisk | ft</p>

<h5>Get Nutanix Containers</h5>

<p>Set to variable</p>

<p class="codetext">$containers = Get-NTNXContainer</p>

<p>Interactive</p>

<p class="codetext">Get-NTNXContainer</p>

<p>Interactive and formatted</p>

<p class="codetext">Get-NTNXContainer | ft</p>

<h5>Get Nutanix Protection Domains</h5>

<p>Set to variable</p>

<p class="codetext">$pds = Get-NTNXProtectionDomain</p>

<p>Interactive</p>

<p class="codetext">Get-NTNXProtectionDomain</p>

<p>Interactive and formatted</p>

<p class="codetext">Get-NTNXProtectionDomain | ft</p>

<h5>Get Nutanix Consistency Groups</h5>

<p>Set to variable</p>

<p class="codetext">$cgs = Get-NTNXProtectionDomainConsistencyGroup</p>

<p>Interactive</p>

<p class="codetext">Get-NTNXProtectionDomainConsistencyGroup</p>

<p>Interactive and formatted</p>

<p class="codetext">Get-NTNXProtectionDomainConsistencyGroup | ft</p>

<h5>Resources and Scripts:</h5>

<ul>
	<li>Nutanix Github - <a href="https://github.com/nutanix/Automation" target="_blank">https://github.com/nutanix/Automation</a></li>
	<li>Manually Fingerprint vDisks - <a href="http://bit.ly/1syOqch" target="_blank">http://bit.ly/1syOqch</a></li>
	<li>vDisk Report - <a href="http://bit.ly/1r34MIT" target="_blank">http://bit.ly/1r34MIT</a></li>
	<li>Protection Domain Report - <a href="http://bit.ly/1r34MIT" target="_blank">http://bit.ly/1r34MIT</a></li>
	<li>Ordered PD Restore - <a href="http://bit.ly/1pyolrb" target="_blank">http://bit.ly/1pyolrb</a></li>
</ul>

<p>You can find more scripts on the Nutanix Github located at <a href="https://github.com/nutanix" target="_blank">https://github.com/nutanix</a></p>
</section>
</section>

<section data-type="chapter" id="integrations-lkIJt8">
<h2>Integrations</h2>

<section data-type="sect1" id="openstack-M2IosAtq">
<h3>OpenStack</h3>

<p><a href="https://www.openstack.org/">OpenStack</a> is an open source platform for managing and building clouds. &nbsp;It is primarily broken into the front-end (dashboard and API) and infrastructure services (compute, storage, etc.).</p>

<p>The OpenStack and Nutanix solution is composed of a few main components:</p>

<ul>
	<li>OpenStack Controller (OSC)
		<ul>
			<li>An existing, or newly provisioned VM or host hosting the OpenStack UI, API and services.  Handles all OpenStack API calls.  In an Acropolis OVM deployment this can be co-located with the Acropolis OpenStack Drivers.</li>
		</ul>
	</li>
  <li>Acropolis OpenStack Driver
    <ul>
      <li>Responsible for taking OpenStack RPCs from the OpenStack Controller and translating them into native Acropolis API calls.  This can be deployed on the OpenStack Controller, the OVM (pre-installed), or on a new VM.</li>
    </ul>
  </li>
  <li>Acropolis OpenStack Services VM (OVM)
		<ul>
			<li>VM with Acropolis drivers that is responsible for taking OpenStack RPCs from the OpenStack Controller and translating them into native Acropolis API calls.</li>
		</ul>
	</li>
</ul>

<p>The OpenStack Controller can be an existing VM / host, or deployed as part of the OpenStack on Nutanix solution.  The Acropolis OVM is a helper VM which is deployed as part of the Nutanix OpenStack solution.</p>

<p>The client communicates with the OpenStack Controller using their expected methods (Web UI / HTTP, SDK, CLI or API) and the OpenStack controller communicates with the Acropolis OVM which translates the requests into native Acropolis REST API calls using the OpenStack Driver.</p>

<p>The figure shows a high-level overview of the communication:</p>

<figure id="id-0OtMtbsQtW"><img alt="OpenStack + Acropolis OpenStack Driver" src="imagesv2/openstack_overview2.png">
<figcaption><span class="label">Figure 9-1. </span>OpenStack + Acropolis OpenStack Driver</figcaption>
</figure>

<div data-type="note" class="note" id="supported-openstack-controllers-kpi1fysgtX">
<h5>Supported OpenStack Controllers</h5>

<p>The current solution (as of 4.5.1) requires an OpenStack Controller on version Kilo or later.</p>
</div>

<p>The table shows a high-level conceptual role mapping:</p>

<table>
  <tr>
    <th>Item</th>
    <th>Role</th>
    <th>OpenStack Controller</th>
    <th>Acropolis OVM</th>
    <th>Acropolis Cluster</th>
    <th>Prism</th>
  </tr>
  <tr>
    <td>Tenant Dashboard</td>
    <td>User interface and API</td>
    <td>X</td>
    <td></td>
    <td></td>
    <td></td>
  </tr>
  <tr>
    <td>Admin Dashboard</td>
    <td>Infra monitoring and ops</td>
    <td>X</td>
    <td></td>
    <td></td>
    <td>X</td>
  </tr>
  <tr>
    <td>Orchestration</td>
    <td>Object CRUD and lifecycle management</td>
    <td>X</td>
    <td></td>
    <td></td>
    <td></td>
  </tr>
  <tr>
    <td>Quotas</td>
    <td>Resource controls and limits</td>
    <td>X</td>
    <td></td>
    <td></td>
    <td></td>
  </tr>
  <tr>
    <td>Users, Groups and Roles</td>
    <td>Role based access control (RBAC)</td>
    <td>X</td>
    <td></td>
    <td></td>
    <td></td>
  </tr>
  <tr>
    <td>SSO</td>
    <td>Single sign-on</td>
    <td>X</td>
    <td></td>
    <td></td>
    <td></td>
  </tr>
  <tr>
    <td>Platform Integration</td>
    <td>OpenStack to Nutanix integration</td>
    <td></td>
    <td>X</td>
    <td></td>
    <td></td>
  </tr>
  <tr>
    <td>Infrastructure Services</td>
    <td>Target infrastructure (compute, storage, network)</td>
    <td></td>
    <td></td>
    <td>X</td>
    <td></td>
  </tr>
</table>

<section data-type="sect2" id="openstack-components-vlIPIGs4tE">
<h4>OpenStack Components</h4>

<p>OpenStack is composed of a set of components which are responsible for serving various infrastructure functions.  Some of these functions will be hosted by the OpenStack Controller and some will be hosted by the Acropolis OVM.</p>

<p>The table shows the core OpenStack components and role mapping:</p>

<table>
  <tr>
    <th>Component</th>
    <th>Role</th>
    <th>OpenStack Controller</th>
    <th>Acropolis OVM</th>
  </tr>
  <tr>
    <td>Keystone</td>
    <td>Identity service</td>
    <td>X</td>
    <td></td>
  </tr>
  <tr>
    <td>Horizon</td>
    <td>Dashboard and UI</td>
    <td>X</td>
    <td></td>
  </tr>
  <tr>
    <td>Nova</td>
    <td>Compute</td>
    <td></td>
    <td>X</td>
  </tr>
  <tr>
    <td>Swift</td>
    <td>Object storage</td>
    <td>X</td>
    <td>X</td>
  </tr>
  <tr>
    <td>Cinder</td>
    <td>Block storage</td>
    <td></td>
    <td>X</td>
  </tr>
  <tr>
    <td>Glance</td>
    <td>Image service</td>
    <td>X</td>
    <td>X</td>
  </tr>
  <tr>
    <td>Neutron</td>
    <td>Networking</td>
    <td></td>
    <td>X</td>
  </tr>
	<tr>
		<td>Heat</td>
		<td>Orchestration</td>
		<td>X</td>
		<td></td>
	</tr>
	<tr>
    <td>Others</td>
    <td>All other components</td>
    <td>X</td>
    <td></td>
  </tr>
</table>

<p>The figure shows a more detailed view of the OpenStack components and communication:</p>

<figure id="id-2btWHXIosBtD"><img alt="OpenStack + Nutanix API Communication" class="iimagesv2openstack_commarch2png" src="imagesv2/openstack_commarch4.png">
<figcaption><span class="label">Figure 9-2. </span>OpenStack + Nutanix API Communication</figcaption>
</figure>

<p>
	In the following sections we will go through some of the main OpenStack components and how they are integrated into the Nutanix platform.
</p>

<h5>Nova</h5>
<p>
Nova is the compute engine and scheduler for the OpenStack platform.  In the Nutanix OpenStack solution each Acropolis OVM acts as a compute host and every Acropolis Cluster will act as a single hypervisor host eligible for scheduling OpenStack instances. The Acropolis OVM runs the Nova-compute service.
</p>

<p>
You can view the Nova services using the OpenStack portal under 'Admin'-&gt;'System'-&gt;'System Information'-&gt;'Compute Services'.
 </p>

 <p>The figure shows the Nova services, host and state:</p>

<figure class="large" id="id-J2tNIjI8sAty"><img alt="OpenStack Nova Services" src="imagesv2/openstack_nova_services.png">
<figcaption><span class="label">Figure 9-3. </span>OpenStack Nova Services</figcaption>
</figure>

<p>
	The Nova scheduler decides which compute host (i.e. Acropolis OVM) to place the instances on, based upon the selected availability zone.  These requests will be sent to the selected Acropolis OVM which will forward the request to the target host's (i.e. Acropolis cluster) Acropolis scheduler.  The Acropolis scheduler will then determine optimal node placement within the cluster.  Individual nodes within a cluster are not exposed to OpenStack.
</p>
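
<p>The two-level placement described above can be pictured with a small toy model in Python. This is only a sketch and not Nova or Acropolis code: every class and function name is hypothetical, each OVM is tied to a single cluster for simplicity, and the "most free memory" rule merely stands in for whatever the Acropolis scheduler actually considers optimal.</p>

```python
from dataclasses import dataclass, field

@dataclass
class AcropolisCluster:
    name: str
    free_memory_gb: dict = field(default_factory=dict)  # node name -> free GB

    def place(self, memory_gb):
        # Stand-in for the Acropolis scheduler (assumed heuristic):
        # choose the node with the most free memory that can fit the VM.
        candidates = {
            node: free for node, free in self.free_memory_gb.items()
            if free >= memory_gb
        }
        if not candidates:
            raise RuntimeError(f"no node in {self.name} can fit {memory_gb} GB")
        node = max(candidates, key=candidates.get)
        self.free_memory_gb[node] -= memory_gb
        return node

@dataclass
class Ovm:
    name: str
    availability_zone: str
    cluster: AcropolisCluster

def schedule_instance(ovms, availability_zone, memory_gb):
    # Stand-in for the Nova scheduler: it only sees OVMs (compute hosts)
    # and selects one by availability zone.
    for ovm in ovms:
        if ovm.availability_zone == availability_zone:
            # The node choice happens inside the cluster, hidden from OpenStack.
            node = ovm.cluster.place(memory_gb)
            return ovm.name, ovm.cluster.name, node
    raise LookupError(f"no compute host in availability zone {availability_zone}")

cluster = AcropolisCluster("cluster-a", {"node-1": 64, "node-2": 128})
ovms = [Ovm("ovm-1", "US-West-1", cluster)]
print(schedule_instance(ovms, "US-West-1", 32))
```

<p>The point of the model is the split in responsibility: the outer function never inspects individual nodes, mirroring the fact that they are not exposed to OpenStack.</p>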

<p>
	You can view the compute and hypervisor hosts using the OpenStack portal under 'Admin'-&gt;'System'-&gt;'Hypervisors'.
</p>

<p>The figure shows the Acropolis OVM as the compute host:</p>

<figure id="id-n3tOsPIos3te"><img alt="OpenStack Compute Host" src="imagesv2/openstack_nova_computehost.png">
<figcaption><span class="label">Figure 9-4. </span>OpenStack Compute Host</figcaption>
</figure>

<p>The figure shows the Acropolis cluster as the hypervisor host:</p>

<figure class="large" id="id-rQtPT2I9sqtD"><img alt="OpenStack Hypervisor Host" src="imagesv2/openstack_nova_hypervisorhost.png">
<figcaption><span class="label">Figure 9-5. </span>OpenStack Hypervisor Host</figcaption>
</figure>

<p>
	As you can see from the previous image the full cluster resources are seen in a single hypervisor host.
</p>

<h5>Swift</h5>
<p>
	Swift is an object store used to store and retrieve files.  This is currently only leveraged for backup / restore of snapshots and images.
</p>

<h5>Cinder</h5>
<p>
	Cinder is OpenStack's volume component for exposing iSCSI targets.  Cinder leverages the Acropolis Volumes API in the Nutanix solution.  These volumes are attached to the instance(s) directly as block devices (as compared to in-guest).
</p>

<p>
	You can view the Cinder services using the OpenStack portal under 'Admin'-&gt;'System'-&gt;'System Information'-&gt;'Block Storage Services'.
</p>

<p>The figure shows the Cinder services, host and state:</p>

<figure class="large" id="id-YatRIzIDs3tG"><img alt="OpenStack Cinder Services" src="imagesv2/openstack_cinder_services.png">
<figcaption><span class="label">Figure 9-6. </span>OpenStack Cinder Services</figcaption>
</figure>

<h5>Glance / Image Repo</h5>
<p>
	Glance is the image store for OpenStack and shows the available images for provisioning.  Images can include ISOs, disks, and snapshots.
</p>
<p>
	The Image Repo is the repository storing available images published by Glance.  These can be located within the Nutanix environment or on an external source.  When the images are hosted on the Nutanix platform, they will be published to the OpenStack controller via Glance on the OVM.  In cases where the Image Repo exists only on an external source, Glance will be hosted by the OpenStack Controller and the Image Cache will be leveraged on the Acropolis Cluster(s).
</p>
<p>
	 Glance is enabled on a per-cluster basis and will always exist with the Image Repo.  When Glance is enabled on multiple clusters the Image Repo will span those clusters and images created via the OpenStack Portal will be propagated to all clusters running Glance.  Those clusters not hosting Glance will cache the images locally using the Image Cache.
</p>

<div data-type="note" class="note" id="pro-tip-oZibumIWs2tJ"><h6>Note</h6>
<h5>Pro tip</h5>

<p>For larger deployments Glance should run on at least two Acropolis Clusters per site.  This will provide Image Repo HA in the case of a cluster outage and ensure the images will always be available when not in the Image Cache.</p>
</div>

<p>
	When external sources host the Image Repo / Glance, Nova will be responsible for handling data movement from the external source to the target Acropolis Cluster(s).  In this case the Image Cache will be leveraged on the target Acropolis Cluster(s) to cache the image locally for any subsequent provisioning requests for the image.
</p>

<h5>Neutron</h5>
<p>
	Neutron is the networking component of OpenStack and responsible for network configuration.  The Acropolis OVM allows network CRUD operations to be performed by the OpenStack portal and will then make the required changes in Acropolis.
</p>

<p>
	You can view the Neutron services using the OpenStack portal under 'Admin'-&gt;'System'-&gt;'System Information'-&gt;'Network Agents'.
</p>

<p>The figure shows the Neutron services, host and state:</p>

<figure class="large" id="id-VrtXf2INsZtD"><img alt="OpenStack Neutron Services" src="imagesv2/openstack_neutron_services.png">
<figcaption><span class="label">Figure 9-7. </span>OpenStack Neutron Services</figcaption>
</figure>
<p>
	Neutron will assign IP addresses to instances when they are booted.  In this case Acropolis will receive a desired IP address for the VM which will be allocated.  When the VM performs a DHCP request the Acropolis Master will respond to the DHCP request on a private VXLAN as usual with Acropolis Hypervisor.
 </p>

<div data-type="note" class="note" id="supported-network-types-05iAcpIrsEtB"><h6>Note</h6>
<h5>Supported Network Types</h5>

<p>Currently only Local and VLAN network types are supported.</p>
</div>

<p>The Keystone and Horizon components run in an OpenStack Controller which interfaces with the Acropolis OVM. The OVM(s) have an OpenStack Driver which is responsible for translating the OpenStack API calls into native Acropolis API calls.</p>
</section>
<section data-type="sect2" id="design-and-deployment-jrIJU1swtX">
<h4>Design and Deployment</h4>

<p>
	For large scale cloud deployments it is important to leverage a delivery topology that will be distributed and meet the requirements of the end-users while providing flexibility and locality.
</p>

<p>
	OpenStack leverages the following high-level constructs which are defined below:
</p>

<ul>
	<li>
		Region
		<ul>
			<li>
				A geographic landmass or area where multiple Availability Zones (sites) are located.  These can include regions like US-Northwest or US-West.
			</li>
		</ul>
	</li>
	<li>
		Availability Zone (AZ)
		<ul>
			<li>
				A specific site or datacenter location where cloud services are hosted.  These can include sites like US-Northwest-1 or US-West-1.
			</li>
		</ul>
	</li>
	<li>
		Host Aggregate
		<ul>
			<li>
				A group of compute hosts; this can be a row, an aisle, or equivalent to the site / AZ.
			</li>
		</ul>
	</li>
	<li>
		Compute Host
		<ul>
			<li>
				An Acropolis OVM which is running the nova-compute service.
			</li>
		</ul>
	</li>
	<li>
		Hypervisor Host
		<ul>
			<li>
				An Acropolis Cluster (seen as a single host).
			</li>
		</ul>
	</li>
</ul>

<p>
	The figure shows the high-level relationship of the constructs:
</p>

<figure id="id-3rtxHwUOsdtk"><img alt="OpenStack - Deployment Layout" src="imagesv2/openstack_regions.png">
<figcaption><span class="label">Figure 9-8. </span>OpenStack - Deployment Layout</figcaption>
</figure>
<p>
	The figure shows an example application of the constructs:
</p>

<figure id="id-lNtJtJUrsjtv"><img alt="OpenStack - Deployment Layout - Example" src="imagesv2/openstack_regions_example.png">
<figcaption><span class="label">Figure 9-9. </span>OpenStack - Deployment Layout - Example</figcaption>
</figure>
<p>
	You can view and manage hosts, host aggregates and availability zones using the OpenStack portal under 'Admin'-&gt;'System'-&gt;'Host Aggregates'.
</p>

<p>The figure shows the host aggregates, availability zones and hosts:</p>

<figure class="large" id="id-J2t5cvU8sAty"><img alt="OpenStack Host Aggregates and Availability Zones" src="imagesv2/openstack_hostagg_az.png">
<figcaption><span class="label">Figure 9-10. </span>OpenStack Host Aggregates and Availability Zones</figcaption>
</figure>

<section data-type="sect3" id="services-design-and-scaling-okI9IMUWs2tJ">
<h4>Services Design and Scaling</h4>

<p>For larger deployments it is recommended to have multiple Acropolis OVMs connected to the OpenStack Controller, abstracted by a load balancer. This allows for HA of the OVMs as well as distribution of transactions. The OVM(s) don't contain any state information, which allows them to be scaled.</p>

<p>The figure shows an example of scaling OVMs for a single site:</p>

<figure id="id-n3tkTPIzUBsNt1"><img alt="OpenStack - OVM Load Balancing" src="imagesv2/openstack_svmha2.png">
<figcaption><span class="label">Figure 9-11. </span>OpenStack - OVM Load Balancing</figcaption>
</figure>

<p>
	One method to achieve this for the OVM(s) is using Keepalived and HAProxy.
</p>

<p>For environments spanning multiple sites the OpenStack Controller will talk to multiple Acropolis OVMs across sites.</p>

<p>The figure shows an example of the deployment across multiple sites:</p>
<figure id="id-WmtatqIzUesDtk"><img alt="OpenStack - Multi-Site" src="imagesv2/openstack_siteha2.png">
<figcaption><span class="label">Figure 9-12. </span>OpenStack - Multi-Site</figcaption>
</figure>
</section>
<h4>Deployment</h4>
<p>
	The OVM can be deployed as a standalone RPM on a CentOS / Red Hat distro or as a full VM.  The Acropolis OVM can be deployed on any platform (Nutanix or non-Nutanix) as long as it has network connectivity to the OpenStack Controller and Nutanix Cluster(s).
</p>
<p>
	The VM(s) for the Acropolis OVM can be deployed on a Nutanix AHV cluster using the following steps.  If the OVM is already deployed you can skip past the VM creation steps.  You can use the full OVM image or use an existing CentOS / Red Hat VM image.
</p>
<p>
	First we will import the provided Acropolis OVM disk image to the Acropolis cluster.  This can be done by copying the disk image over using SCP or by specifying a URL to copy the file from.  We will cover importing this using the Images API.  Note: It is possible to deploy this VM anywhere, not necessarily on an Acropolis cluster.
</p>

<p>
	To import the disk image using Images API, run the following command:
</p>
<p class="codetext">image.create &lt;IMAGE_NAME&gt; source_url=&lt;SOURCE_URL&gt; container=&lt;CONTAINER_NAME&gt;</p>
<p>Next create the Acropolis VM for the OVM by running the following ACLI commands on any CVM:</p>
<p class="codetext">vm.create &lt;VM_NAME&gt; num_vcpus=2 memory=16G<br>
vm.disk_create &lt;VM_NAME&gt; clone_from_image=&lt;IMAGE_NAME&gt;<br>
vm.nic_create &lt;VM_NAME&gt; network=&lt;NETWORK_NAME&gt;<br>
vm.on &lt;VM_NAME&gt;
</p>
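
<p>When several OVMs are needed, the same four ACLI commands have to be repeated per VM. A minimal Python sketch, assuming you only want to generate the command text to paste or pipe into ACLI, is shown below. The function name and the example VM, image and network names are hypothetical; the sizing defaults are the ones used above.</p>

```python
def ovm_acli_commands(vm_name, image_name, network_name, vcpus=2, memory="16G"):
    """Return the ACLI command sequence that creates and powers on one OVM."""
    return [
        f"vm.create {vm_name} num_vcpus={vcpus} memory={memory}",
        f"vm.disk_create {vm_name} clone_from_image={image_name}",
        f"vm.nic_create {vm_name} network={network_name}",
        f"vm.on {vm_name}",
    ]

# Placeholder names: ovm-01/ovm-02, ovm-image and vlan.0 are illustrative only
for index in (1, 2):
    for command in ovm_acli_commands(f"ovm-{index:02d}", "ovm-image", "vlan.0"):
        print(command)
```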

<p>Once the VM(s) have been created and powered on, SSH to the OVM(s) using the provided credentials.</p>

<div data-type="note" class="note"><h6>Note</h6>
<h5>OVMCTL Help</h5>

<p>Help text can be displayed by running the following command on the OVM:</p>
  <p class="codetext">ovmctl --help</p>

</div>

<p>
  The OVM supports two deployment modes:
</p>

<ul>
  <li>
    OVM-allinone
    <ul>
      <li>
        OVM includes all Acropolis drivers and OpenStack controller
      </li>
    </ul>
  </li>
  <li>
    OVM-services
    <ul>
      <li>
        OVM includes all Acropolis drivers and communicates with external/remote OpenStack controller
      </li>
    </ul>
  </li>
</ul>

<p>
  Both deployment modes will be covered in the following sections.  You can use either mode and also switch between modes.
</p>

<h5>OVM-allinone</h5>
<p>
  The following steps cover the OVM-allinone deployment.  Start by SSHing to the OVM(s) to run the following commands.
</p>

<p class="codetext">
  # Register OpenStack Driver service <br>
	ovmctl --add ovm --name &lt;OVM_NAME&gt; --ip &lt;OVM_IP&gt; --netmask &lt;NET_MASK&gt; --gateway &lt;DEFAULT_GW&gt; --domain &lt;DOMAIN&gt; --nameserver &lt;DNS&gt;
</p>

<p class="codetext">
# Register OpenStack Controller <br>
ovmctl --add controller --name &lt;OVM_NAME&gt; --ip &lt;OVM_IP&gt;
</p>

<p class="codetext">
# Register Acropolis Cluster(s) (run for each cluster to add) <br>
ovmctl --add cluster --name &lt;CLUSTER_NAME&gt; --ip &lt;CLUSTER_IP&gt; --username &lt;PRISM_USER&gt; --password &lt;PRISM_PASSWORD&gt;
		<br>
		<br>
The following values are used as defaults: <br>
    <span style="margin-left:2em">Number of VCPUs per core = 4</span><br>
    <span style="margin-left:2em">Container name = default</span><br>
    <span style="margin-left:2em">Image cache = disabled, Image cache URL = None</span><br>
</p>
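
<p>Since the cluster registration has to be run once per cluster, it can help to generate the command lines from a list. The Python sketch below builds shell-quoted ovmctl invocations using only the flags shown above. The function name and the sample cluster values are hypothetical, and in practice the password would come from a prompt or secret store rather than a literal.</p>

```python
import shlex

def add_cluster_command(name, ip, username, password):
    """Build a shell-safe 'ovmctl --add cluster' command line."""
    argv = [
        "ovmctl", "--add", "cluster",
        "--name", name,
        "--ip", ip,
        "--username", username,
        "--password", password,
    ]
    # shlex.quote protects values containing spaces or shell metacharacters
    return " ".join(shlex.quote(argument) for argument in argv)

# Placeholder cluster details for illustration only
clusters = [
    ("cluster-a", "10.0.0.10", "admin", "example password"),
    ("cluster-b", "10.0.1.10", "admin", "example password"),
]
for cluster in clusters:
    print(add_cluster_command(*cluster))
```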

<p>
  Next we'll verify the configuration using the following command:
</p>

<p class="codetext">
  ovmctl --show
</p>

<p>
  At this point everything should be up and running, enjoy.
</p>

<h5>OVM-services</h5>
<p>
  The following steps cover the OVM-services deployment.  Start by SSHing to the OVM(s) to run the following commands.
</p>

<p class="codetext">
  # Register OpenStack Driver service <br>
	ovmctl --add ovm --name &lt;OVM_NAME&gt; --ip &lt;OVM_IP&gt;
</p>

<p class="codetext">
# Register OpenStack Controller <br>
ovmctl --add controller --name &lt;OS_CONTROLLER_NAME&gt; --ip &lt;OS_CONTROLLER_IP&gt; --username &lt;OS_CONTROLLER_USERNAME&gt; --password &lt;OS_CONTROLLER_PASSWORD&gt;
			<br>
			<br>
 The following values are used as defaults:<br>
      <span style="margin-left:2em">Authentication: auth_strategy = keystone, auth_region = RegionOne</span><br>
      <span style="margin-left:4em">auth_tenant = services, auth_password = admin</span><br>
      <span style="margin-left:2em">Database: db_{nova,cinder,glance,neutron} = mysql, db_{nova,cinder,glance,neutron}_password = admin</span><br>
      <span style="margin-left:2em">RPC: rpc_backend = rabbit, rpc_username = guest, rpc_password = guest</span>
</p>

<p class="codetext">
# Register Acropolis Cluster(s) (run for each cluster to add) <br>
ovmctl --add cluster --name &lt;CLUSTER_NAME&gt; --ip &lt;CLUSTER_IP&gt; --username &lt;PRISM_USER&gt; --password &lt;PRISM_PASSWORD&gt;
		<br>
		<br>
The following values are used as defaults: <br>
    <span style="margin-left:2em">Number of VCPUs per core = 4</span><br>
    <span style="margin-left:2em">Container name = default</span><br>
    <span style="margin-left:2em">Image cache = disabled, Image cache URL = None</span><br>
</p>

<p>
  If non-default passwords were used for the OpenStack controller deployment, we'll need to update those:
</p>

<p class="codetext">
# Update controller passwords (if non-default are used)<br>
ovmctl --update controller --name &lt;OS_CONTROLLER_NAME&gt; --auth_nova_password &lt;&gt; --auth_glance_password &lt;&gt; --auth_neutron_password &lt;&gt; --auth_cinder_password &lt;&gt; --db_nova_password &lt;&gt; --db_glance_password &lt;&gt; --db_neutron_password &lt;&gt; --db_cinder_password &lt;&gt;
</p>

<p>
  Next we'll verify the configuration using the following command:
</p>

<p class="codetext">
  ovmctl --show
</p>

<p>
	Now that the OVM has been configured, we'll configure the OpenStack Controller to know about the Glance and Neutron endpoints.
</p>

<p>
  Log in to the OpenStack controller and enter the keystonerc_admin source:
</p>

<p class="codetext">
  # enter keystonerc_admin <br />
  source ./keystonerc_admin
</p>

<p>
  First we will delete the existing endpoint for Glance that is pointing to the controller:
</p>

<p class="codetext">
 # Find old Glance endpoint id (port 9292) <br>
 keystone endpoint-list <br>
 # Remove old keystone endpoint for Glance <br>
 keystone endpoint-delete &lt;GLANCE_ENDPOINT_ID&gt;
</p>

<p>
  Next we will create the new Glance endpoint that will point to the OVM:
</p>

<p class="codetext">
  # Find Glance service id<br>
  keystone service-list | grep glance<br>
  # Will look similar to the following:<br>
| 9e539e8dee264dd9a086677427434982 |   glance   |      image      |<br>
<br>
  # Add Keystone endpoint for Glance <br>
  keystone endpoint-create \ <br>
  --service-id &lt;GLANCE_SERVICE_ID&gt; \ <br>
  --publicurl http://&lt;OVM_IP&gt;:9292 \ <br>
  --internalurl http://&lt;OVM_IP&gt;:9292 \ <br>
  --region &lt;REGION_NAME&gt; \ <br>
  --adminurl http://&lt;OVM_IP&gt;:9292
</p>

<p>
  Next we will delete the existing endpoint for Neutron that is pointing to the controller:
</p>

<p class="codetext">
# Find old Neutron endpoint id (port 9696) <br />
keystone endpoint-list <br>
# Remove old keystone endpoint for Neutron <br>
keystone endpoint-delete &lt;NEUTRON_ENDPOINT_ID&gt;
</p>

<p>
  Next we will create the new Neutron endpoint that will point to the OVM:
</p>

<p class="codetext">
  # Find Neutron service id<br>
  keystone service-list | grep neutron<br>
  # Will look similar to the following:<br>
| f4c4266142c742a78b330f8bafe5e49e |  neutron   |     network     |<br>
<br>
# Add Keystone endpoint for Neutron <br>
keystone endpoint-create \ <br>
--service-id &lt;NEUTRON_SERVICE_ID&gt; \ <br>
--publicurl http://&lt;OVM_IP&gt;:9696 \ <br>
--internalurl http://&lt;OVM_IP&gt;:9696 \ <br>
--region &lt;REGION_NAME&gt; \ <br>
--adminurl http://&lt;OVM_IP&gt;:9696
</p>
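<p>
  When scripting the endpoint changes above, the service ID can be pulled out of the table that <code>keystone service-list</code> prints. The following is a minimal sketch, not part of the Nutanix or OpenStack tooling: the helper name <code>find_service_id</code> is hypothetical, and the sample text simply reuses the two example rows shown above.
</p>

```python
def find_service_id(table_text, service_name):
    """Return the ID column of the row whose name column equals service_name."""
    for line in table_text.splitlines():
        # Data rows look like: | <id> | <name> | <type> |
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if len(cells) >= 2 and cells[1] == service_name:
            return cells[0]
    return None  # no matching row (border and blank lines are skipped)


# Sample rows copied from the example output above.
sample = """
| 9e539e8dee264dd9a086677427434982 |   glance   |      image      |
| f4c4266142c742a78b330f8bafe5e49e |  neutron   |     network     |
"""

print(find_service_id(sample, "glance"))
print(find_service_id(sample, "neutron"))
```

<p>
  The returned value is what would be passed as --service-id to keystone endpoint-create.
</p>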

<p>
	After the endpoints have been created, we will update the Nova and Cinder configuration files to use the Acropolis OVM IP as the Glance host.</p>
<p>
	First we will edit nova.conf, which is located at /etc/nova/nova.conf, and update the following lines:
</p>
<p class="codetext">
	[glance] <br>
	... <br>
  # Default glance hostname or IP address (string value) <br>
  host=&lt;OVM_IP&gt; <br>
  <br>
  # Default glance port (integer value) <br>
  port=9292 <br>
	... <br>
  # A list of the glance api servers available to nova. Prefix <br>
  # with https:// for ssl-based glance api servers. <br>
  # ([hostname|ip]:port) (list value) <br>
  api_servers=&lt;OVM_IP&gt;:9292
</p>
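<p>
  After editing the file, it can be useful to confirm that every Glance setting points at the OVM. The sketch below is one possible check using Python's standard configparser module; the function name <code>glance_points_to</code> and the address 192.0.2.10 are placeholders for illustration, and the sample text stands in for the real /etc/nova/nova.conf.
</p>

```python
import configparser


def glance_points_to(conf_text, ovm_ip, port=9292):
    """Return True if the [glance] section targets ovm_ip:port everywhere."""
    # strict=False tolerates repeated options, which can appear in OpenStack configs.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)
    if not parser.has_section("glance"):
        return False
    glance = parser["glance"]
    return (
        glance.get("host") == ovm_ip
        and glance.get("port") == str(port)
        and glance.get("api_servers") == f"{ovm_ip}:{port}"
    )


# Placeholder content; 192.0.2.10 is a documentation address, not a real OVM.
sample_conf = """
[DEFAULT]
debug=False

[glance]
# Default glance hostname or IP address (string value)
host=192.0.2.10
# Default glance port (integer value)
port=9292
api_servers=192.0.2.10:9292
"""

print(glance_points_to(sample_conf, "192.0.2.10"))
```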

<p>
  Now we will disable nova-compute on the OpenStack controller (if not already):
</p>
<p class="codetext">
systemctl disable openstack-nova-compute.service<br>
systemctl stop openstack-nova-compute.service<br>
service openstack-nova-compute stop<br>
</p>
<p>
	Next we will edit cinder.conf, which is located at /etc/cinder/cinder.conf, and update the following items:
</p>
<p class="codetext">
	# Default glance host name or IP (string value) <br>
  glance_host=&lt;OVM_IP&gt; <br>
  # Default glance port (integer value) <br>
  glance_port=9292 <br>
  # A list of the glance API servers available to cinder <br>
  # ([hostname|ip]:port) (list value) <br>
  glance_api_servers=$glance_host:$glance_port <br>
</p>
<p>
  We will also comment out lvm enabled backends as those will not be leveraged:
</p>
<p class="codetext">
  # Comment out the following lines in cinder.conf<br>
#enabled_backends=lvm<br>
#[lvm]<br>
#iscsi_helper=lioadm<br>
#volume_group=cinder-volumes<br>
#iscsi_ip_address=<br>
#volume_driver=cinder.volume.drivers.lvm.LVMVolumeDriver<br>
#volumes_dir=/var/lib/cinder/volumes<br>
#iscsi_protocol=iscsi<br>
#volume_backend_name=lvm<br>
</p>
<p>
  Now we will disable cinder volume on the OpenStack controller (if not already):
</p>
<p class="codetext">
  systemctl disable openstack-cinder-volume.service<br>
  systemctl stop openstack-cinder-volume.service<br>
  service openstack-cinder-volume stop<br>
</p>
<p>
  Now we will disable glance-image on the OpenStack controller (if not already):
</p>
<p class="codetext">
  systemctl disable openstack-glance-api.service <br>
systemctl disable openstack-glance-registry.service<br>
systemctl stop openstack-glance-api.service <br>
systemctl stop openstack-glance-registry.service<br>
service openstack-glance-api stop<br>
service openstack-glance-registry stop<br>
</p>

<p>
	After the files have been edited, we will restart the Nova and Cinder services so they pick up the new configuration settings.  The services can be restarted with the following commands or by running the scripts which are available for download.
</p>

<p class="codetext">
	# Restart Nova services<br>
	service openstack-nova-api restart<br>
	service openstack-nova-consoleauth restart<br>
	service openstack-nova-scheduler restart<br>
	service openstack-nova-conductor restart<br>
	service openstack-nova-cert restart<br>
	service openstack-nova-novncproxy restart<br>
	<br>
	# OR you can also use the script which can be downloaded as part of the helper tools:<br>
	~/openstack/commands/nova-restart <br>
	<br>
	# Restart Cinder <br>
	service openstack-cinder-api restart<br>
	service openstack-cinder-scheduler restart<br>
	service openstack-cinder-backup restart <br>
	<br>
	# OR you can also use the script which can be downloaded as part of the helper tools:<br>
	~/openstack/commands/cinder-restart
</p>

<!-- TODO - add iptables rules -->

</section>

<section data-type="sect2" id="troubleshooting-andamp-advanced-administration-yAIkhjswtx">
<h4>Troubleshooting &amp; Advanced Administration</h4>

<section data-type="sect3" id="key-log-locations-eqIlsmhJsJtN">
<h5>Key log locations</h5>
<table>
  <tr>
    <th>Component</th>
    <th>Key Log Location(s)</th>
  </tr>
  <tr>
    <td>Keystone</td>
    <td>/var/log/keystone/keystone.log</td>
  </tr>
  <tr>
    <td>Horizon</td>
    <td>/var/log/horizon/horizon.log</td>
  </tr>
  <tr>
    <td>Nova</td>
    <td>/var/log/nova/nova-api.log<br>/var/log/nova/nova-scheduler.log<br>/var/log/nova/nova-compute.log*</td>
  </tr>
  <tr>
    <td>Swift</td>
    <td>/var/log/swift/swift.log</td>
  </tr>
  <tr>
    <td>Cinder</td>
    <td>/var/log/cinder/api.log<br>/var/log/cinder/scheduler.log<br>/var/log/cinder/volume.log</td>
  </tr>
  <tr>
    <td>Glance</td>
    <td>/var/log/glance/api.log<br>/var/log/glance/registry.log</td>
  </tr>
  <tr>
    <td>Neutron</td>
    <td>/var/log/neutron/server.log<br>/var/log/neutron/dhcp-agent.log*<br>/var/log/neutron/l3-agent.log*<br>/var/log/neutron/metadata-agent.log*<br>/var/log/neutron/openvswitch-agent.log*</td>
  </tr>
</table>
<p>
	Logs marked with * are on the Acropolis OVM only.
</p>

<div data-type="note" class="note" id="pro-tip-l9ioTesBhEsBtZ"><h6>Note</h6>
<h5>Pro tip</h5>

<p>Check NTP if a service is seen as state 'down' in OpenStack Manager (Admin UI or CLI) even though the service is running in the OVM.  Many services have a requirement for time to be in sync between the OpenStack Controller and Acropolis OVM.</p>
</div>

</section>

<section data-type="sect3" id="command-reference-wrIWughosjtE">
<h5>Command Reference</h5>

<p>Load Keystone source (perform before running other commands)</p>

<p class="codetext">source keystonerc_admin</p>

<p>List Keystone services</p>

<p class="codetext">keystone service-list</p>

<p>List Keystone endpoints</p>

<p class="codetext">keystone endpoint-list</p>

<p>Create Keystone endpoint</p>

<p class="codetext">
	keystone endpoint-create \ <br>
  --service-id=&lt;SERVICE_ID&gt; \ <br>
  --publicurl=http://&lt;IP:PORT&gt; \ <br>
  --internalurl=http://&lt;IP:PORT&gt; \ <br>
  --region=&lt;REGION_NAME&gt; \ <br>
  --adminurl=http://&lt;IP:PORT&gt; <br>
</p>

<p>List Nova instances</p>

<p class="codetext">nova list</p>

<p>Show instance details</p>

<p class="codetext">nova show &lt;INSTANCE_NAME&gt;</p>

<p>List Nova hypervisor hosts</p>

<p class="codetext">nova hypervisor-list</p>

<p>Show hypervisor host details</p>

<p class="codetext">nova hypervisor-show &lt;HOST_ID&gt;</p>

<p>List Glance images</p>

<p class="codetext">glance image-list</p>

<p>Show Glance image details</p>

<p class="codetext">glance image-show &lt;IMAGE_ID&gt;</p>

</section>
</section>
</section>
</section>
</div>

<div data-type="part" id="book-of-acropolis-KEaio">
<h1><span class="label">Part III. </span>Book of Acropolis</h1>

<p class="definition"><strong>a·crop·o·lis  -  /ɘ ' kräpɘlis/  -  noun  -  data plane</strong>
<br>
storage, compute and virtualization platform.
</p>

<section data-type="chapter" id="architecture-NYI5ul">
<h2>Architecture</h2>
<p>Acropolis is a distributed multi-resource manager,&nbsp;orchestration platform and data plane.</p>

<p>It is broken down into three main components:</p>

<ul>
	<li>Distributed Storage Fabric (DSF)
	<ul>
		<li>This is at the core and birth of the Nutanix platform and expands upon the Nutanix Distributed Filesystem (NDFS).&nbsp; NDFS has now evolved from a distributed system pooling storage resources into a much larger and more capable storage platform.</li>
	</ul>
	</li>
	<li>App Mobility Fabric (AMF)
	<ul>
		<li>Hypervisors abstracted the OS from hardware, and the AMF abstracts workloads (VMs, Storage, Containers, etc.) from the hypervisor.&nbsp; This will provide the ability to dynamically move the workloads between hypervisors, clouds, as well as provide the ability for Nutanix nodes to change hypervisors.</li>
	</ul>
	</li>
	<li>Hypervisor
	<ul>
		<li>A multi-purpose hypervisor based upon the CentOS KVM hypervisor.</li>
	</ul>
	</li>
</ul>

<p>Building upon the distributed nature of everything Nutanix does, we’re expanding this into the virtualization and resource management space.&nbsp; Acropolis is a back-end service that allows for workload and resource management, provisioning, and operations.&nbsp; Its goal is to abstract the facilitating resource (e.g., hypervisor, on-premise, cloud, etc.) from the workloads running, while providing a single “platform” to operate.&nbsp;</p>

<p>This gives workloads the ability to seamlessly move between hypervisors, cloud providers, and platforms.</p>

<p>The following figure illustrates the conceptual nature of Acropolis at various layers:</p>

<figure id="id-d0tvtNuD"><img alt="High-level Acropolis Architecture" class="iimagesv2arch_acropolispng" src="imagesv2/arch_acropolis.png">
<figcaption><span class="label">Figure 10-1. </span>High-level Acropolis Architecture</figcaption>
</figure>

<div data-type="note" class="note" id="supported-hypervisors-for-vm-management-8Mirfxuk"><h6>Note</h6>
<h5>Supported Hypervisors for VM Management</h5>

<p>Currently, the only fully supported hypervisor for VM management is Acropolis Hypervisor; however, this may expand in the future. &nbsp;The Volumes API and read-only operations are still supported on all hypervisors.</p>
</div>

<section data-type="sect1" id="converged-platform-ONIRcmuA">
<h3>Converged Platform</h3>

<p>For a video explanation you can watch the following video: <a href="https://youtu.be/OPYA5-V0yRo">LINK</a></p>

<div class="video-container"><iframe allowfullscreen frameborder="0" src="https://www.youtube.com/embed//OPYA5-V0yRo"></iframe></div>

<p>The Nutanix solution is a converged storage + compute solution which leverages local components and creates a distributed platform for virtualization, also known as a virtual computing platform. The solution is a bundled hardware + software appliance which houses 2 (6000/7000 series) or 4 nodes (1000/2000/3000/3050 series) in a 2U footprint.</p>

<p>Each node runs an industry-standard hypervisor (ESXi, KVM, Hyper-V currently) and the Nutanix Controller VM (CVM).&nbsp; The Nutanix CVM is what runs the Nutanix software and serves all of the I/O operations for the hypervisor and all VMs running on that host.&nbsp; For the Nutanix units running VMware vSphere, the SCSI controller, which manages the SSD and HDD devices, is directly passed to the CVM leveraging VMDirectPath (Intel VT-d).&nbsp; In the case of Hyper-V, the storage devices are passed through to the CVM.</p>

<p>The following figure provides an example of what a typical node logically looks like:</p>

<figure id="id-GRtVF5czu1"><img alt="Converged Platform" class="iimagesv2converged_platformpng" src="imagesv2/converged_platform.png">
<figcaption><span class="label">Figure 10-3. </span>Converged Platform</figcaption>
</figure>
</section>

<section data-type="sect1" id="software-defined-nrIRIyux">
<h3>Software-Defined</h3>

<p>As mentioned above (likely numerous times), the Nutanix platform is a software-based solution which ships as a bundled software + hardware appliance.&nbsp; The controller VM is where the vast majority of the Nutanix software and logic sits and was designed from the beginning to be an extensible and pluggable architecture. A key benefit to being software-defined and not relying upon any hardware offloads or constructs is around extensibility.&nbsp; As with any product life cycle,&nbsp;advancements and new features will always be introduced.&nbsp;</p>

<p>By not relying on any custom ASIC/FPGA or hardware capabilities, Nutanix can develop and deploy these new features through a simple software update.&nbsp; This means that a new feature (e.g., deduplication) can be deployed by upgrading the current version of the Nutanix software.&nbsp; This also allows newer generation features to be deployed on legacy hardware models. For example, say you’re running a workload on an older version of Nutanix software on a prior generation hardware platform (e.g., 2400).&nbsp; The running software version doesn’t provide deduplication capabilities which your workload could benefit greatly from.&nbsp; To get these features, you perform a rolling upgrade of the Nutanix software version while the workload is running, and you now have deduplication.&nbsp; It’s really that easy.</p>

<p>Similar to features, the ability to create new “adapters” or interfaces into DSF is another key capability.&nbsp; When the product first shipped, it solely supported iSCSI for I/O from the hypervisor; this has now grown to include NFS and SMB.&nbsp; In the future, there is the ability to create new adapters for various workloads and hypervisors (HDFS, etc.).&nbsp; And again, all of this can be deployed via a software update. This is contrary to most legacy infrastructures, where a hardware upgrade or software purchase is normally required to get the “latest and greatest” features.&nbsp; With Nutanix, it’s different. Since all features are deployed in software, they can run on any hardware platform, any hypervisor, and be deployed through simple software upgrades.</p>

<p>The following figure shows a logical representation of what this software-defined controller framework looks like:</p>

<figure id="id-GRtnH4Izu1"><img alt="Software-Defined Controller Framework" class="iimagesv2software_defined_controllerpng" src="imagesv2/software_defined_controller.png">
<figcaption><span class="label">Figure 10-4. </span>Software-Defined Controller Framework</figcaption>
</figure>
</section>

<section data-type="sect1" id="cluster-components-b1IeUMu5">
<h3>Cluster Components</h3>

<p>For a visual explanation you can watch the following video:&nbsp;<a href="https://youtu.be/3v5RI_IbfV4">LINK</a></p>

<div class="video-container"><iframe allowfullscreen frameborder="0" src="https://www.youtube.com/embed//3v5RI_IbfV4"></iframe></div>

<p>The Nutanix platform is composed of the following high-level components:</p>

<figure id="id-GRtkSgUzu1"><img alt="Nutanix Cluster Components" class="iimagesv2cluster_componentspng" src="imagesv2/cluster_components.png">
<figcaption><span class="label">Figure 10-5. </span>Nutanix Cluster Components</figcaption>
</figure>

<h5>Cassandra</h5>

<ul>
	<li>Key Role: Distributed metadata store</li>
	<li>Description: Cassandra stores and manages all of the cluster metadata in a distributed ring-like manner based upon a heavily modified Apache Cassandra.&nbsp; The Paxos algorithm is utilized to enforce strict consistency.&nbsp; This service runs on every node in the cluster.&nbsp; Cassandra is accessed via an interface called Medusa.</li>
</ul>

<h5>Zookeeper</h5>

<ul>
	<li>Key Role: Cluster configuration manager</li>
	<li>Description: Zookeeper stores all of the cluster configuration including hosts, IPs, state, etc. and is based upon Apache Zookeeper.&nbsp; This service runs on three nodes in the cluster, one of which is elected as a leader.&nbsp; The leader receives all requests and forwards them to its peers.&nbsp; If the leader fails to respond, a new leader is automatically elected.&nbsp;&nbsp; Zookeeper is accessed via an interface called Zeus.</li>
</ul>

<h5>Stargate</h5>

<ul>
	<li>Key Role: Data I/O manager</li>
	<li>Description: Stargate is responsible for all data management and I/O operations and is the main interface from the hypervisor (via NFS, iSCSI, or SMB).&nbsp; This service runs on every node in the cluster in order to serve localized I/O.</li>
</ul>

<h5>Curator</h5>

<ul>
	<li>Key Role: MapReduce cluster management and cleanup</li>
	<li>Description: Curator is responsible for managing and distributing tasks throughout the cluster, including disk balancing, proactive scrubbing, and many more items.&nbsp; Curator runs on every node and is controlled by an elected Curator Master who is responsible for the task and job delegation.&nbsp; There are two scan types for Curator, a full scan which occurs around every 6 hours and a partial scan which occurs every hour.</li>
</ul>

<h5>Prism</h5>

<ul>
	<li>Key Role: UI and API</li>
	<li>Description: Prism is the management gateway for components and administrators to configure and monitor the Nutanix cluster.&nbsp; This includes Ncli, the HTML5 UI, and REST API.&nbsp; Prism runs on every node in the cluster and uses an elected leader like all components in the cluster.</li>
</ul>

<h5>Genesis</h5>

<ul>
	<li>Key Role: Cluster component &amp; service manager</li>
	<li>Description:&nbsp; Genesis is a process which runs on each node and is responsible for any services interactions (start/stop/etc.) as well as for the initial configuration.&nbsp; Genesis is a process which runs independently of the cluster and does not require the cluster to be configured/running.&nbsp; The only requirement for Genesis to be running is that Zookeeper is up and running.&nbsp; The cluster_init and cluster_status pages are displayed by the Genesis process.</li>
</ul>

<h5>Chronos</h5>

<ul>
	<li>Key Role: Job and task scheduler</li>
	<li>Description: Chronos is responsible for taking the jobs and tasks resulting from a Curator scan and scheduling/throttling tasks among nodes.&nbsp; Chronos runs on every node and is controlled by an elected Chronos Master that is responsible for the task and job delegation and runs on the same node as the Curator Master.</li>
</ul>

<h5>Cerebro</h5>

<ul>
	<li>Key Role: Replication/DR manager</li>
	<li>Description: Cerebro is responsible for the replication and DR capabilities of DSF.&nbsp; This includes the scheduling of snapshots, the replication to remote sites, and the site migration/failover.&nbsp; Cerebro runs on every node in the Nutanix cluster and all nodes participate in replication to remote clusters/sites.</li>
</ul>

<h5>Pithos</h5>

<ul>
	<li>Key Role: vDisk configuration manager</li>
	<li>Description: Pithos is responsible for vDisk (DSF file) configuration data.&nbsp; Pithos runs on every node and is built on top of Cassandra.</li>
</ul>
</section>

<section data-type="sect1" id="acropolis-services-BNIDiAuZ">
<h3>Acropolis Services</h3>

<p>An Acropolis Slave runs on every CVM with an elected Acropolis Master which is responsible for task scheduling, execution, IPAM, etc.&nbsp; Similar to other components which have a Master, if the Acropolis Master fails, a new one will be elected.</p>

<p>The role breakdown for each can be seen below:</p>

<ul>
	<li>Acropolis Master
	<ul>
		<li>Task scheduling &amp; execution</li>
		<li>Stat collection / publishing</li>
		<li>Network Controller (for hypervisor)</li>
		<li>VNC proxy (for hypervisor)</li>
		<li>HA (for hypervisor)</li>
	</ul>
	</li>
	<li>Acropolis Slave
	<ul>
		<li>Stat collection / publishing</li>
		<li>VNC proxy (for hypervisor)</li>
	</ul>
	</li>
</ul>

<p>Here we show a conceptual view of the Acropolis Master / Slave relationship:</p>

<figure id="id-DktkHGijuw"><img alt="Acropolis Services" class="iimagesv2acrop_componentspng image" src="imagesv2/acrop_components.png">
<figcaption><span class="label">Figure 10-2. </span>Acropolis Services</figcaption>
</figure>
</section>

<section data-type="sect1" id="drive-breakdown-rkIZh9uO">
<h3>Drive Breakdown</h3>

<p>In this section, I’ll cover how the various storage devices (SSD / HDD) are broken down, partitioned, and utilized by the Nutanix platform. NOTE: All of the capacities used are in Base2 Gibibyte (GiB) instead of the Base10 Gigabyte (GB).&nbsp; Formatting of the drives with a filesystem and associated overheads has also been taken into account.</p>

<h5>SSD Devices</h5>

<p>SSD devices store a few key items which are explained in greater detail above:</p>

<ul>
	<li>Nutanix Home (CVM core)</li>
	<li>Cassandra (metadata storage)</li>
	<li>OpLog (persistent write buffer)</li>
	<li>Unified Cache (SSD cache portion)</li>
	<li>Extent Store (persistent storage)</li>
</ul>

<p>The following figure shows an example of the storage breakdown for a Nutanix node’s SSD(s):</p>

<figure id="id-1ntnFGhwuQ"><img alt="SSD Drive Breakdown" class="iimagesv2drive_ssdpng" src="imagesv2/drive_ssd.png">
<figcaption><span class="label">Figure 10-6. </span>SSD Drive Breakdown</figcaption>
</figure>

<p>NOTE: The sizing for OpLog is done dynamically as of release 4.0.1 which will allow the extent store portion to grow dynamically.&nbsp; The values used are assuming a completely utilized OpLog.&nbsp; Graphics and proportions aren’t drawn to scale.&nbsp; When evaluating the Remaining GiB capacities, do so from the top down.&nbsp; For example, the Remaining GiB to be used for the OpLog calculation would be after Nutanix Home and Cassandra have been subtracted from the formatted SSD capacity.</p>

<p>
  Nutanix Home is mirrored across the first two SSDs to ensure availability.  Cassandra is on the first SSD by default; if that SSD fails, the CVM will be restarted and Cassandra storage will then be on the second SSD.
</p>

<p>Most models ship with 1 or 2 SSDs, however the same construct applies for models shipping with more SSD devices. For example, if we apply this to an example 3060 or 6060 node which has 2 x 400GB SSDs, this would give us 100GiB of OpLog, 40GiB of Unified Cache, and ~440GiB of Extent Store SSD capacity per node.</p>

<h5>HDD Devices</h5>

<p>Since HDD devices are primarily used for bulk storage, their breakdown is much simpler:</p>

<ul>
	<li>Curator Reservation (Curator storage)</li>
	<li>Extent Store (persistent storage)</li>
</ul>

<figure id="id-MltqUBhxu1"><img alt="HDD Drive Breakdown" class="iimagesv2drive_hddpng" src="imagesv2/drive_hdd.png">
<figcaption><span class="label">Figure 10-7. </span>HDD Drive Breakdown</figcaption>
</figure>

<p>For example, if we apply this to an example 3060 node which has 4 x 1TB HDDs, this would give us 80GiB reserved for Curator and ~3.4TiB of Extent Store HDD capacity per node.</p>

<p>NOTE: the above values are accurate as of 4.0.1 and may vary by release.</p>
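<p>Because the capacities in this section are quoted in Base2 units while drives are marketed in Base10 units, a quick conversion helps when sanity-checking the figures. The sketch below covers only the raw unit conversion for the two example configurations (2 x 400GB SSD and 4 x 1TB HDD); it does not model filesystem formatting overhead or the reservations shown in the figures, so the results are upper bounds rather than the usable values quoted above. The function names are illustrative.</p>

```python
GB = 10**9    # Base10 gigabyte, in bytes
TB = 10**12   # Base10 terabyte, in bytes
GIB = 2**30   # Base2 gibibyte, in bytes
TIB = 2**40   # Base2 tebibyte, in bytes


def gb_to_gib(gigabytes):
    """Convert Base10 GB to Base2 GiB."""
    return gigabytes * GB / GIB


def tb_to_tib(terabytes):
    """Convert Base10 TB to Base2 TiB."""
    return terabytes * TB / TIB


ssd_raw_gib = gb_to_gib(2 * 400)  # two 400GB SSDs, before any overhead
hdd_raw_tib = tb_to_tib(4 * 1)    # four 1TB HDDs, before any overhead

print(f"SSD raw: {ssd_raw_gib:.1f} GiB")
print(f"HDD raw: {hdd_raw_tib:.2f} TiB")
```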
</section>
</section>

<section data-type="chapter" id="distributed-storage-fabric-22IlTD">
<h2>Distributed Storage Fabric</h2>

<p>Together, a group of Nutanix nodes forms a distributed platform called the Acropolis Distributed Storage Fabric (DSF).&nbsp; DSF appears to the hypervisor like any centralized storage array, however all of the I/Os are handled locally to provide the highest performance.&nbsp; More detail on how these nodes form a distributed system can be found in the next section.</p>

<p>The following figure shows an example of how these Nutanix nodes form DSF:</p>

<figure class="large" id="id-oPtkTBTr"><img alt="Distributed Storage Fabric Overview" class="iimagesv2dsf_overviewpng" src="imagesv2/dsf_overview.png">
<figcaption><span class="label">Figure 11-1. </span>Distributed Storage Fabric Overview</figcaption>
</figure>

<section data-type="sect1" id="data-structure-M2IySRTw">
<h3>Data Structure</h3>

<p>The Acropolis Distributed Storage Fabric is composed of the following high-level structs:</p>

<h5>Storage Pool</h5>

<ul>
	<li>Key Role: Group of physical devices</li>
	<li>Description: A storage pool is a group of physical storage devices including PCIe SSD, SSD, and HDD devices for the cluster.&nbsp; The storage pool can span multiple Nutanix nodes and is expanded as the cluster scales.&nbsp; In most configurations, only a single storage pool is leveraged.</li>
</ul>

<h5>Container</h5>

<ul>
	<li>Key Role: Group of VMs/files</li>
	<li>Description: A container is a logical segmentation of the Storage Pool and contains a group of VMs or files (vDisks).&nbsp; Some configuration options (e.g., RF) are configured at the container level but are applied at the individual VM/file level.&nbsp; Containers typically have a 1 to 1 mapping with a datastore (in the case of NFS/SMB).</li>
</ul>

<h5>vDisk</h5>

<ul>
	<li>Key Role: vDisk</li>
	<li>Description: A vDisk is any file over 512KB on DSF including .vmdks and VM hard disks.&nbsp; vDisks are composed of extents which are grouped and stored on disk as an extent group.</li>
</ul>

<p>The following figure shows how these map between DSF and the hypervisor:</p>

<figure id="id-DktyigS1Tw"><img alt="High-level Filesystem Breakdown" class="iimagesv2data_structure_1png" src="imagesv2/data_structure_1.png">
<figcaption><span class="label">Figure 11-2. </span>High-level Filesystem Breakdown</figcaption>
</figure>

<h5>Extent</h5>

<ul>
	<li>Key Role: Logically contiguous data</li>
	<li>Description: An extent is a 1MB piece of logically contiguous data which consists of n number of contiguous blocks (varies depending on guest OS block size).&nbsp; Extents are written/read/modified on a sub-extent basis (aka slice) for granularity and efficiency.&nbsp; An extent’s slice may be trimmed when moving into the cache depending on the amount of data being read/cached.</li>
</ul>

<h5>Extent Group</h5>

<ul>
	<li>Key Role: Physically contiguous stored data</li>
	<li>Description: An extent group is a 1MB or 4MB piece of physically contiguous stored data.&nbsp; This data is stored as a file on the storage device owned by the CVM.&nbsp; Extents are dynamically distributed among extent groups to provide data striping across nodes/disks to improve performance.&nbsp; NOTE: as of 4.0, extent groups can now be either 1MB or 4MB depending on dedupe.</li>
</ul>

<p>The following figure shows how these structs relate between the various file systems:&nbsp;</p>

<figure id="id-NMtEsnSYT8"><img alt="Low-level Filesystem Breakdown" class="iimagesv2data_structure_2png" src="imagesv2/data_structure_2.png">
<figcaption><span class="label">Figure 11-3. </span>Low-level Filesystem Breakdown</figcaption>
</figure>

<p>Here is another graphical representation of how these units are related:</p>

<figure id="id-3rtoTXSlTb"><img alt="Graphical Filesystem Breakdown" class="iimagesv2data_structure_3png" src="imagesv2/data_structure_3.png">
<figcaption><span class="label">Figure 11-4. </span>Graphical Filesystem Breakdown</figcaption>
</figure>
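<p>The hierarchy described in this section (storage pool, container, vDisk, extent) can be modeled in a few lines. This is a conceptual sketch only and not how DSF is implemented: the class and method names are invented for illustration, and the 1MB extent size is interpreted as 2**20 bytes, which is an assumption.</p>

```python
from dataclasses import dataclass, field

EXTENT_SIZE = 1024 * 1024  # assumption: the 1MB extent is 2**20 bytes


@dataclass
class VDisk:
    name: str
    size_bytes: int

    def extent_for_offset(self, offset):
        """Map a logical byte offset to (extent index, offset within extent)."""
        if not 0 <= offset < self.size_bytes:
            raise ValueError("offset is outside the vDisk")
        return divmod(offset, EXTENT_SIZE)


@dataclass
class Container:
    name: str
    replication_factor: int  # configured here, applied per VM/file
    vdisks: list = field(default_factory=list)


@dataclass
class StoragePool:
    name: str
    containers: list = field(default_factory=list)


# Hypothetical names and sizes, purely for illustration.
vdisk = VDisk("vm1.vmdk", 10 * 1024**3)
container = Container("default", replication_factor=2, vdisks=[vdisk])
pool = StoragePool("sp1", containers=[container])

print(vdisk.extent_for_offset(5 * EXTENT_SIZE + 4096))
```

<p>A read at that offset lands 4096 bytes into the sixth extent (index 5) of the vDisk; which extent group and physical disk hold that extent is a separate mapping not modeled here.</p>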
</section>

<section data-type="sect1" id="io-path-overview-QMIBHzTQ">
<h3>I/O Path and Cache</h3>

<p>For a visual explanation, you can watch the following video: <a href="https://youtu.be/SULqVPVXefY">LINK</a></p>

<div class="video-container"><iframe allowfullscreen frameborder="0" src="https://www.youtube.com/embed//SULqVPVXefY"></iframe></div>

<p>The Nutanix I/O path is composed of the following high-level components:</p>

<figure id="id-89t0S2H9TZ"><img alt="DSF I/O Path" class="iimagesv2io_path_basepng" src="imagesv2/io_path_base.png">
<figcaption><span class="label">Figure 11-5. </span>DSF I/O Path</figcaption>
</figure>

<h5>OpLog</h5>

<ul>
	<li>Key Role: Persistent write buffer</li>
	<li>Description: The OpLog is similar to a filesystem journal and is built as a staging area to handle bursts of random writes, coalesce them, and then sequentially drain the data to the extent store.&nbsp; Upon a write, the OpLog is synchronously replicated to the OpLogs of n other CVMs before the write is acknowledged, for data availability purposes.&nbsp; All CVM OpLogs partake in the replication and are dynamically chosen based upon load.&nbsp; The OpLog is stored on the SSD tier on the CVM to provide extremely fast write I/O performance, especially for random I/O workloads.&nbsp; For sequential workloads, the OpLog is bypassed and the writes go directly to the extent store.&nbsp; If data is currently sitting in the OpLog and has not been drained, all read requests for it will be fulfilled directly from the OpLog until it has been drained, after which they will be served by the extent store/unified cache.&nbsp; For containers where fingerprinting (aka Dedupe) has been enabled, all write I/Os will be fingerprinted using a hashing scheme allowing them to be deduplicated based upon fingerprint in the unified cache.</li>
</ul>

<div data-type="note" class="note"><h6>Note</h6>
<h5>Per-vDisk OpLog Sizing</h5>

<p>
  The OpLog is a shared resource; however, allocation is done on a per-vDisk basis to ensure each vDisk has an equal opportunity to leverage it.  This is implemented through a per-vDisk OpLog limit (the maximum amount of data per vDisk in the OpLog).  VMs with multiple vDisks will be able to leverage the per-vDisk limit times the number of vDisks.
</p>

<p>The per-vDisk OpLog limit is currently 6GB (as of 4.6), up from 2GB in prior versions.</p>

<p>
  This is controlled by the following Gflag: vdisk_distributed_oplog_max_dirty_MB.
</p>

</div>
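<p>As a worked example of the per-vDisk limit, the arithmetic below computes the OpLog ceiling for a VM from its vDisk count. It is a sketch based only on the figures quoted in the note (6GB per vDisk as of 4.6); the function name is illustrative, and the actual limit may differ if the Gflag has been changed.</p>

```python
PER_VDISK_OPLOG_LIMIT_GB = 6  # as of 4.6 per the note above; 2 in prior versions


def vm_oplog_ceiling_gb(num_vdisks, per_vdisk_limit_gb=PER_VDISK_OPLOG_LIMIT_GB):
    """Upper bound on OpLog data for a VM: per-vDisk limit times vDisk count."""
    if num_vdisks < 0:
        raise ValueError("num_vdisks must be non-negative")
    return num_vdisks * per_vdisk_limit_gb


print(vm_oplog_ceiling_gb(1))  # single-vDisk VM
print(vm_oplog_ceiling_gb(4))  # VM with four vDisks
print(vm_oplog_ceiling_gb(4, per_vdisk_limit_gb=2))  # same VM on a pre-4.6 limit
```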

<h5>Extent Store</h5>

<ul>
	<li>Key Role: Persistent data storage</li>
	<li>Description: The Extent Store is the persistent bulk storage of DSF and spans SSD and HDD and is extensible to facilitate additional devices/tiers.&nbsp; Data entering the extent store is either being A) drained from the OpLog or B) is sequential in nature and has bypassed the OpLog directly.&nbsp; Nutanix ILM will determine tier placement dynamically based upon I/O patterns and will move data between tiers.</li>
</ul>

<div data-type="note" class="note"><h6>Note</h6>
<h5>Sequential Write Characterization</h5>

<p>Write IO is deemed as sequential when there is more than 1.5MB of outstanding write IO to a vDisk (as of 4.6).  IOs meeting this will bypass the OpLog and go directly to the Extent Store since they are already large chunks of aligned data and won't benefit from coalescing.</p>

<p>
  This is controlled by the following Gflag: vdisk_distributed_oplog_skip_min_outstanding_write_bytes.
</p>

<p>
  All other IOs, including those which can be large (e.g. &gt;64K) will still be handled by the OpLog.
</p>
</div>
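<p>The routing rule in the note above reduces to a single threshold comparison. The following sketch expresses it under two assumptions: that 1.5MB means 1.5 x 2**20 bytes, and that "more than" is a strict comparison. The function and constant names are invented for illustration; the real behavior is governed by the Gflag named above.</p>

```python
SEQUENTIAL_THRESHOLD_BYTES = int(1.5 * 1024 * 1024)  # assumption: 1.5MB in Base2


def route_write(outstanding_write_bytes, threshold=SEQUENTIAL_THRESHOLD_BYTES):
    """Return the write destination for a vDisk given its outstanding write IO."""
    if outstanding_write_bytes > threshold:
        return "extent_store"  # treated as sequential: bypass the OpLog
    return "oplog"             # everything else is staged in the OpLog


print(route_write(64 * 1024))        # a single large-ish IO still goes to the OpLog
print(route_write(2 * 1024 * 1024))  # 2MB outstanding is treated as sequential
```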

<h5>Unified Cache</h5>

<ul>
	<li>Key Role: Dynamic read cache</li>
	<li>Description: The Unified Cache is a deduplicated read cache which spans both the CVM’s memory and SSD.&nbsp; Upon a read request of data not in the cache (or based upon a particular fingerprint), the data will be placed into the single-touch pool of the Unified Cache which completely sits in memory, where it will use LRU until it is evicted from the cache.&nbsp; Any subsequent read request will “move” (no data is actually moved, just cache metadata) the data into the memory portion of the multi-touch pool, which consists of both memory and SSD.&nbsp; From here there are two LRU cycles, one for the in-memory piece upon which eviction will move the data to the SSD section of the multi-touch pool where a new LRU counter is assigned.&nbsp; Any read request for data in the multi-touch pool will cause the data to go to the peak of the multi-touch pool where it will be given a new LRU counter.</li>
</ul>
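<p>The pool transitions described above can be modeled with three LRU-ordered maps. The following is a minimal sketch, not the actual Unified Cache implementation: the class name, capacities (counted in entries rather than bytes), and the <code>load</code> callback are all assumptions made for illustration.</p>

```python
from collections import OrderedDict


class UnifiedCacheSketch:
    """Toy model: single-touch pool plus a multi-touch pool split into memory and SSD."""

    def __init__(self, single_cap, multi_mem_cap, multi_ssd_cap):
        self.single = OrderedDict()     # in memory, first touch
        self.multi_mem = OrderedDict()  # in memory, touched more than once
        self.multi_ssd = OrderedDict()  # on SSD, evicted from multi-touch memory
        self.caps = (single_cap, multi_mem_cap, multi_ssd_cap)

    def _promote(self, key, value):
        # Place the entry at the top of the multi-touch pool (fresh LRU position).
        self.multi_mem[key] = value
        self.multi_mem.move_to_end(key)
        if len(self.multi_mem) > self.caps[1]:
            old_key, old_value = self.multi_mem.popitem(last=False)
            self.multi_ssd[old_key] = old_value  # memory eviction lands on SSD
            if len(self.multi_ssd) > self.caps[2]:
                self.multi_ssd.popitem(last=False)  # SSD eviction leaves the cache

    def read(self, key, load):
        for pool in (self.multi_mem, self.multi_ssd, self.single):
            if key in pool:
                value = pool.pop(key)
                self._promote(key, value)
                return value
        value = load(key)  # cache miss: fetch from the backing store
        self.single[key] = value
        if len(self.single) > self.caps[0]:
            self.single.popitem(last=False)
        return value

    def location(self, key):
        pools = (("single_touch", self.single),
                 ("multi_touch_memory", self.multi_mem),
                 ("multi_touch_ssd", self.multi_ssd))
        for name, pool in pools:
            if key in pool:
                return name
        return None
```

<p>A first read lands in the single-touch pool, a second read promotes the entry to multi-touch memory, and pressure on multi-touch memory pushes the least recently used entry to the SSD portion.</p>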

<p>The following figure shows a high-level overview of the Unified Cache:</p>

<figure id="id-RBtnUjHQT1"><img alt="DSF Unified Cache" class="iimagesv2content_cachepng" src="imagesv2/content_cache.png">
<figcaption><span class="label">Figure 11-6. </span>DSF Unified Cache</figcaption>
</figure>

<div data-type="note" class="note" id="cache-granularity-and-logic-21igh5H0T3"><h6>Note</h6>
<h5>Cache Granularity and Logic</h5>

<p>Data is brought into the cache at a 4K granularity and all caching is done in real time (i.e., there is no delay or batch process to pull data into the cache).</p>

<p>
  Each CVM has its own local cache that it manages for the vDisk(s) it is hosting (e.g. VM(s) running on the same node).  When a vDisk is cloned (e.g. new clones, snapshots, etc.) each new vDisk has its own block map and the original vDisk is marked as immutable.  This allows us to ensure that each CVM can have its own cached copy of the base vDisk with cache coherency.
</p>

<p>
  In the event of an overwrite, that will be re-directed to a new extent in the VM's own block map.  This ensures that there will not be any cache corruption.
</p>
</div>

<h5>Extent Cache</h5>

<ul>
	<li>Key Role: In-memory read cache</li>
	<li>Description: The Extent Cache is an in-memory read cache that is completely in the CVM’s memory.&nbsp; This will store non-fingerprinted extents for containers where fingerprinting and deduplication are disabled.&nbsp; As of version 3.5, this is separate from the Unified Cache; however, the two are merged as of version 4.5.</li>
</ul>
</section>

<section data-type="sect1" id="scalable-metadata-4aI3tRTg">
<h3>Scalable Metadata</h3>

<p>For a visual explanation, you can watch the following video: <a href="https://youtu.be/MlQczJhQI3U">LINK</a></p>

<div class="video-container"><iframe allowfullscreen frameborder="0" src="https://www.youtube.com/embed//MlQczJhQI3U"></iframe></div>

<p>Metadata is at the core of any intelligent system and is even more critical for any filesystem or storage array.&nbsp; In terms of DSF, there are a few key requirements that are critical for its success: it has to be right 100% of the time (known as “strictly consistent”), it has to be scalable, and it has to perform at massive scale.&nbsp; As mentioned in the architecture section above, DSF utilizes a “ring-like” structure as a key-value store which stores essential metadata as well as other platform data (e.g., stats, etc.). In order to ensure metadata availability and redundancy, an RF is utilized among an odd number of nodes (e.g., 3, 5, etc.). Upon a metadata write or update, the row is written to a node in the ring and then replicated to n number of peers (where n is dependent on cluster size).&nbsp; A majority of nodes must agree before anything is committed, which is enforced using the Paxos algorithm.&nbsp; This ensures strict consistency for all data and metadata stored as part of the platform.</p>

<p>The following figure shows an example of a metadata insert/update for a 4-node cluster:</p>

<figure id="id-A0tOHyt8Tx"><img alt="Cassandra Ring Structure" class="iimagesv2metadata_1png" src="imagesv2/metadata_1.png">
<figcaption><span class="label">Figure 11-7. </span>Cassandra Ring Structure</figcaption>
</figure>
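<p>The majority requirement can be illustrated with a little arithmetic. This sketch covers only the quorum count, not the Paxos protocol itself, and the function names are hypothetical. It also shows why an odd number of replicas is used: going from 3 to 4 replicas raises the majority from 2 to 3 without tolerating any additional failures.</p>

```python
def majority(replicas: int) -> int:
    """Smallest number of replicas that forms a strict majority."""
    return replicas // 2 + 1


def is_committed(acks: int, replicas: int) -> bool:
    """A write is committed only once a majority of replicas has agreed."""
    return acks >= majority(replicas)


def tolerated_failures(replicas: int) -> int:
    """Replicas that can be lost while a majority can still be reached."""
    return replicas - majority(replicas)
```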

<p>Performance at scale is another important requirement for DSF metadata.&nbsp; Contrary to traditional dual-controller or “master” models, each Nutanix node is responsible for a subset of the overall platform’s metadata.&nbsp; This eliminates the traditional bottlenecks by allowing metadata to be served and manipulated by all nodes in the cluster.&nbsp; A consistent hashing scheme is utilized to minimize the redistribution of keys during cluster size modifications (also known as&nbsp;“add/remove node”). When the cluster scales (e.g., from 4 to 8 nodes), the nodes are inserted throughout the ring between nodes for “block awareness” and reliability.</p>

<p>The following figure shows an example of the metadata “ring” and how it scales:</p>

<figure class="large" id="id-GRtofpt0T1"><img alt="Cassandra Scale Out" class="iimagesv2metadata_2png" src="imagesv2/metadata_2.png">
<figcaption><span class="label">Figure 11-8. </span>Cassandra Scale Out</figcaption>
</figure>
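<p>To illustrate why consistent hashing limits key movement when a node is added, the following is a generic sketch of a hash ring. It is not the modified Cassandra implementation: the <code>Ring</code> class, the SHA-256 token function, and the one-token-per-node layout are assumptions chosen for brevity.</p>

```python
import bisect
import hashlib


def _token(value: str) -> int:
    """Map a node name or key onto the ring (hash choice is an assumption)."""
    return int(hashlib.sha256(value.encode()).hexdigest(), 16)


class Ring:
    def __init__(self, nodes):
        self._tokens = sorted((_token(n), n) for n in nodes)

    def add(self, node):
        bisect.insort(self._tokens, (_token(node), node))

    def owner(self, key):
        # Walk clockwise to the first node token at or after the key's token.
        idx = bisect.bisect_right(self._tokens, (_token(key), ""))
        return self._tokens[idx % len(self._tokens)][1]


ring = Ring(["node-1", "node-2", "node-3", "node-4"])
keys = [f"vdisk-{i}" for i in range(200)]
before = {k: ring.owner(k) for k in keys}
ring.add("node-5")
after = {k: ring.owner(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

<p>Every key in <code>moved</code> now belongs to the newly added node; keys owned by the other nodes stay where they were, which is the property that keeps redistribution small.</p>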

<p>&nbsp;</p>
</section>

<section data-type="sect1" id="data-protection-J5I3F8Tw">
<h3>Data Protection</h3>

<p>For a visual explanation, you can watch the following video: <a href="https://youtu.be/OWhdo81yTpk">LINK</a></p>

<div class="video-container"><iframe allowfullscreen frameborder="0" src="https://www.youtube.com/embed//OWhdo81yTpk"></iframe></div>

<p>The Nutanix platform currently uses a resiliency factor, also known as a replication factor (RF), and checksum to ensure data redundancy and availability in the case of a node or disk failure or corruption.&nbsp; As explained above, the OpLog acts as a staging area to absorb incoming writes onto a low-latency SSD tier.&nbsp; Upon being written to the local OpLog, the data is synchronously replicated to the OpLog of one or two other Nutanix CVMs (dependent on RF) before being acknowledged (Ack) as a successful write to the host.&nbsp; This ensures that the data exists in at least two or three independent locations and is fault tolerant. NOTE: For RF3, a minimum of 5 nodes is required since metadata will be RF5.&nbsp;</p>

<p>Data RF is configured via Prism and is done at the container level. All nodes participate in OpLog replication to eliminate any “hot nodes”, ensuring linear performance at scale.&nbsp; While the data is being written, a checksum is computed and stored as part of its metadata. Data is then asynchronously drained to the extent store where the RF is implicitly maintained.&nbsp; In the case of a node or disk failure, the data is then re-replicated among all nodes in the cluster to maintain the RF.&nbsp; Any time the data is read, the checksum is computed to ensure the data is valid.&nbsp; In the event where the checksum and data don’t match, the replica of the data will be read and will replace the non-valid copy.</p>
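<p>The validate-on-read behavior described above can be sketched as follows. The text does not say which checksum algorithm is used, so CRC32 from Python's <code>zlib</code> module stands in here, and the function names and in-memory replica list are hypothetical.</p>

```python
import zlib


def checksum(data: bytes) -> int:
    """Stand-in checksum (CRC32); the real algorithm is not specified in the text."""
    return zlib.crc32(data)


def read_with_repair(replicas: list, stored_checksum: int) -> bytes:
    """Return valid data and overwrite any corrupt replica with a good copy."""
    good = next(
        (r for r in replicas if checksum(bytes(r)) == stored_checksum), None
    )
    if good is None:
        raise IOError("no valid replica available")
    for i, replica in enumerate(replicas):
        if checksum(bytes(replica)) != stored_checksum:
            replicas[i] = bytearray(good)  # replace the non-valid copy
    return bytes(good)
```

<p>With a corrupted first replica and an intact second one, the read returns the intact data and the first replica is rewritten from it.</p>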

<p>
  Data is also continuously monitored to ensure integrity even when active I/O isn't occurring.  Stargate's scrubber operation will continuously scan through extent groups and perform checksum validation when disks aren't heavily utilized.  This protects against things like bit rot or corrupted sectors.
</p>

<p>The following figure shows an example of what this logically looks like:&nbsp;</p>

<figure id="id-A0tbFVF8Tx"><img alt="DSF Data Protection" class="iimagesv2data_protectionpng fse fs" src="imagesv2/data_protection.png">
<figcaption><span class="label">Figure 11-9. </span>DSF Data Protection</figcaption>
</figure>
</section>

<section data-type="sect1" id="availability-domains-WnIbC8TM">
<h3>Availability Domains</h3>

<p>For a visual explanation, you can watch the following video: <a href="https://youtu.be/LDaNY9AJDn8">LINK</a></p>

<div class="video-container"><iframe allowfullscreen frameborder="0" src="https://www.youtube.com/embed//LDaNY9AJDn8"></iframe></div>

<p>Availability Domains (aka node/block/rack awareness) are a key construct for distributed systems to abide by for determining component and data placement.&nbsp; DSF is currently node and block aware; however, this will increase to rack aware as supported cluster sizes grow in the future.&nbsp; Nutanix refers to a “block” as the chassis which contains either one, two, or four server “nodes”. NOTE: A minimum of 3 blocks must be utilized for block awareness to be activated; otherwise, the system defaults to node awareness.&nbsp;</p>

<p>It is recommended to utilize uniformly populated blocks to ensure block awareness is enabled.&nbsp; Common scenarios and the awareness level utilized can be found at the bottom of this section.&nbsp; The 3-block requirement exists to ensure quorum. For example, a 3450 would be a block which holds 4 nodes.&nbsp; The reason for distributing roles or data across blocks is to ensure that if a block fails or needs maintenance the system can continue to run without interruption.&nbsp; NOTE: Within a block, the redundant PSU and fans are the only shared components.</p>

<p>Awareness can be broken into a few key focus areas:</p>

<ul>
	<li>Data (The VM data)</li>
	<li>Metadata (Cassandra)</li>
	<li>Configuration Data (Zookeeper)</li>
</ul>

<h5>Data</h5>

<p>With DSF, data replicas will be written to other blocks in the cluster to ensure that in the case of a block failure or planned downtime, the data remains available.&nbsp; This is true for both RF2 and RF3 scenarios, as well as in the case of a block failure. An easy comparison would be “node awareness”, where a replica would need to be replicated to another node which will provide protection in the case of a node failure.&nbsp; Block awareness further enhances this by providing data availability assurances in the case of block outages.</p>

<p>The following figure shows how the replica placement would work in a 3-block deployment:</p>

<figure id="id-lNtMiOCATe"><img alt="Block Aware Replica Placement" class="iimagesv2avail_dom_1png" src="imagesv2/avail_dom_1.png">
<figcaption><span class="label">Figure 11-25. </span>Block Aware Replica Placement</figcaption>
</figure>

<p>In the case of a block failure, block awareness will be maintained and the affected data will be re-replicated to other blocks within the cluster:</p>

<figure id="id-Q4t4IaC5TX"><img alt="Block Failure Replica Placement" class="iimagesv2avail_dom_2png" src="imagesv2/avail_dom_2.png">
<figcaption><span class="label">Figure 11-26. </span>Block Failure Replica Placement</figcaption>
</figure>
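<p>Block-aware replica placement can be sketched as choosing replica targets so that no two copies share a block. The following is a simplified illustration only: the function name and the node-to-block mapping are hypothetical, and the deterministic "first eligible node" choice stands in for whatever selection logic the platform actually applies.</p>

```python
def place_replicas(local_node, node_to_block, rf):
    """Pick rf nodes (including the local one) that all sit in different blocks."""
    chosen = [local_node]
    used_blocks = {node_to_block[local_node]}
    for node in sorted(node_to_block):
        if len(chosen) == rf:
            break
        if node_to_block[node] not in used_blocks:
            chosen.append(node)
            used_blocks.add(node_to_block[node])
    if len(chosen) != rf:
        raise ValueError("not enough blocks for block-aware placement")
    return chosen


# Hypothetical 3-block topology: block A and B hold two nodes, block C holds one.
topology = {"A1": "A", "A2": "A", "B1": "B", "B2": "B", "C1": "C"}
```

<p>With this topology, <code>place_replicas("A1", topology, 2)</code> keeps the primary copy on A1 and puts the replica in block B, while an RF of 3 also uses block C.</p>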

<h5>Awareness Conditions and Tolerance</h5>
<p>Below we break down some common scenarios and the level of tolerance:</p>

<table>
  <tr>
    <th></th>
    <th></th>
    <th colspan="2">Simultaneous Failure Tolerance</th>
  </tr>
  <tr>
    <th>Number of Blocks</th>
    <th>Awareness Type</th>
    <th>Cluster FT1</th>
    <th>Cluster FT2</th>
  </tr>
  <tr>
    <td>&lt;3</td>
    <td>NODE</td>
    <td>SINGLE NODE</td>
    <td>DUAL NODE</td>
  </tr>
  <tr>
    <td>3-5</td>
    <td>NODE+BLOCK</td>
    <td>SINGLE BLOCK <br/>(up to 4 nodes)</td>
    <td>SINGLE BLOCK <br/>(up to 4 nodes)</td>
  </tr>
  <tr>
    <td>5+</td>
    <td>NODE+BLOCK</td>
    <td>SINGLE BLOCK <br/>(up to 4 nodes)</td>
    <td>DUAL BLOCK <br/>(up to 8 nodes)</td>
  </tr>
</table>

<p>
  As of Acropolis base software version 4.5, block awareness is best effort and doesn't have strict requirements for enabling.  This was done to ensure clusters with skewed storage resources (e.g. storage heavy nodes) don't disable the feature.  With that stated, it is still a best practice to have uniform blocks to minimize any storage skew.
</p>

<p>
  Prior to 4.5, the following conditions had to be met:
</p>
<ul>
  <li>If SSD <strong>or</strong> HDD tier variance between blocks is &gt; max variance: <strong>NODE</strong> awareness</li>
  <li>If SSD and HDD tier variance between blocks is &lt; max variance: <strong>BLOCK + NODE</strong> awareness</li>
</ul>

<p>Max tier variance is calculated as: 100 / (RF+1)</p>
<ul>
	<li>E.g., 33% for RF2 or 25% for RF3</li>
</ul>
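<p>The pre-4.5 rule above can be expressed as a short calculation. This is a sketch with hypothetical function names; the text does not say what happens when the variance exactly equals the maximum, so the sketch assumes that case still qualifies for block awareness.</p>

```python
def max_tier_variance_pct(rf: int) -> float:
    """Max tier variance as given in the text: 100 / (RF + 1)."""
    return 100 / (rf + 1)


def awareness_pre_4_5(ssd_variance_pct: float, hdd_variance_pct: float, rf: int) -> str:
    """Awareness level for pre-4.5 releases based on tier variance between blocks."""
    limit = max_tier_variance_pct(rf)
    if ssd_variance_pct > limit or hdd_variance_pct > limit:
        return "NODE"
    # Assumption: variance exactly equal to the limit is treated as acceptable.
    return "BLOCK + NODE"
```

<p>For RF2 the limit is roughly 33%, so an HDD tier variance of 40% drops the cluster to node awareness even if the SSD tier is uniform.</p>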

<h5>Metadata</h5>

<p>As mentioned in the Scalable Metadata section above, Nutanix leverages a heavily modified Cassandra platform to store metadata and other essential information.&nbsp; Cassandra leverages a ring-like structure and replicates to n number of peers within the ring to ensure data consistency and availability.</p>

<p>The following figure shows an example of the Cassandra ring for a 12-node cluster:</p>

<figure id="id-n3tDuwC1TN"><img alt="12 Node Cassandra Ring" class="iimagesv2avail_dom_3png fse fs image" src="imagesv2/avail_dom_3.png" style="width: 50%; height: 50%">
<figcaption><span class="label">Figure 11-27. </span>12 Node Cassandra Ring</figcaption>
</figure>

<p>Cassandra peer replication iterates through nodes in a clockwise manner throughout the ring.&nbsp; With block awareness, the peers are distributed among the blocks to ensure no two peers are on the same block.</p>

<p>The following figure shows an example node layout translating the ring above into the block based layout:</p>

<figure id="id-m2tmHdCoTN"><img alt="Cassandra Node Block Aware Placement" class="iimagesv2avail_dom_4png" src="imagesv2/avail_dom_4.png">
<figcaption><span class="label">Figure 11-28. </span>Cassandra Node Block Aware Placement</figcaption>
</figure>

<p>With this block-aware nature, in the event of a block failure there will still be at least two copies of the data (with Metadata RF3 – In larger clusters RF5 can be leveraged).</p>

<p>The following figure shows an example of the replication topology of all the nodes that form the ring (yes – it’s a little busy):</p>

<figure id="id-jRtRfmCeTa"><img alt="Full Cassandra Node Block Aware Placement" class="iimagesv2avail_dom_5png" src="imagesv2/avail_dom_5.png">
<figcaption><span class="label">Figure 11-29. </span>Full Cassandra Node Block Aware Placement</figcaption>
</figure>

<div data-type="note" class="note" id="metadata-awareness-conditions-wDioinCwTQ"><h6>Note</h6>
<h5>Metadata Awareness Conditions</h5>
<p>Below we break down some common scenarios and what level of awareness will be utilized:</p>

<ul>
	<li>FT1 (Data RF2 / Metadata RF3) will be block aware if:
		<ul>
			<li>
				&gt;= 3 blocks
			</li>
			<li>Let X be the number of nodes in the block with max nodes. Then, the remaining blocks should have at least 2X nodes.
				<ul>
					<li>
						Example: 4 blocks with 2,3,4,2 nodes per block respectively.
						<ul>
							<li>
								The max node block has 4 nodes which means the other 3 blocks should have 2x4 (8) nodes.  In this case it <strong>WOULD NOT</strong> be block aware as the remaining blocks only have 7 nodes.
							</li>
						</ul>
					</li>
					<br>
					<li>
						Example: 4 blocks with 3,3,4,3 nodes per block respectively.
						<ul>
							<li>
								The max node block has 4 nodes which means the other 3 blocks should have 2x4==8 nodes.  In this case it <strong>WOULD</strong> be block aware as the remaining blocks have 9 nodes which is above our minimum.
							</li>
						</ul>
					</li>
				</ul>
			</li>
		</ul>
	</li>
	<li>FT2 (Data RF3 / Metadata RF5) will be block aware if:
		<ul>
			<li>
				&gt;= 5 blocks
			</li>
			<li>Let X be the number of nodes in the block with max nodes. Then, the remaining blocks should have at least 4X nodes.
				<ul>
					<li>
						Example: 6 blocks with 2,3,4,2,3,3 nodes per block respectively.
						<ul>
							<li>
							The max node block has 4 nodes which means the other 5 blocks should have 4x4==16 nodes.  In this case it <strong>WOULD NOT</strong> be block aware as the remaining blocks only have 13 nodes.
							</li>
						</ul>
						</li>
					<br>
					<li>
						Example: 6 blocks with 2,4,4,4,4,4 nodes per block respectively.
						<ul>
							<li>
								The max node block has 4 nodes which means the other 5 blocks should have 4x4==16 nodes.  In this case it <strong>WOULD</strong> be block aware as the remaining blocks have 18 nodes which is above our minimum.
							</li>
						</ul>
					</li>
				</ul>
			</li>
		</ul>
	</li>
</ul>
</div>
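<p>The node-count rule from the note above can be checked mechanically. In this sketch the function name is hypothetical, the minimum block count is passed in by the caller, and the multiplier is 2 for FT1 and 4 for FT2 as stated in the note.</p>

```python
def metadata_block_aware(nodes_per_block, min_blocks, multiplier):
    """Apply the rule: nodes outside the largest block must be >= multiplier * X."""
    if len(nodes_per_block) < min_blocks:
        return False
    largest = max(nodes_per_block)                # X in the note
    remaining = sum(nodes_per_block) - largest    # nodes in all other blocks
    return remaining >= multiplier * largest


# The four examples from the note (FT1 uses 2X, FT2 uses 4X).
ft1_uneven = metadata_block_aware([2, 3, 4, 2], min_blocks=3, multiplier=2)
ft1_even = metadata_block_aware([3, 3, 4, 3], min_blocks=3, multiplier=2)
ft2_uneven = metadata_block_aware([2, 3, 4, 2, 3, 3], min_blocks=5, multiplier=4)
ft2_even = metadata_block_aware([2, 4, 4, 4, 4, 4], min_blocks=5, multiplier=4)
```

<p>The results match the note: the two uneven layouts are not block aware (7 and 13 remaining nodes), while the other two are (9 and 18 remaining nodes).</p>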

<h5>Configuration Data</h5>

<p>Nutanix leverages Zookeeper to store essential configuration data for the cluster.&nbsp; This role is also distributed in a block-aware manner to ensure availability in the case of a block failure.</p>

<p>The following figure shows an example layout showing 3 Zookeeper nodes distributed in a block-aware manner:</p>

<figure id="id-qMtEh2CyTQ"><img alt="Zookeeper Block Aware Placement" class="iimagesv2avail_dom_6png" src="imagesv2/avail_dom_6.png">
<figcaption><span class="label">Figure 11-30. </span>Zookeeper Block Aware Placement</figcaption>
</figure>

<p>In the event of a block outage, meaning one of the Zookeeper nodes will be gone, the Zookeeper role would be transferred to another node in the cluster as shown below:</p>

<figure id="id-EGt4soCoTo"><img alt="Zookeeper Placement Block Failure" class="iimagesv2avail_dom_7png" src="imagesv2/avail_dom_7.png">
<figcaption><span class="label">Figure 11-31. </span>Zookeeper Placement Block Failure</figcaption>
</figure>

<p>When the block comes back online, the Zookeeper role would be transferred back to maintain block awareness.</p>

<p>NOTE: Prior to 4.5, this migration was not automatic and had to be done manually.</p>

</section>

<section data-type="sect1" id="data-path-resiliency-BNIWfkTZ">
<h3>Data Path Resiliency</h3>

<p>For a visual explanation, you can watch the following video: <a href="https://youtu.be/SJIb_mTdMPg">LINK</a></p>

<div class="video-container"><iframe allowfullscreen frameborder="0" src="https://www.youtube.com/embed//SJIb_mTdMPg"></iframe></div>

<p>Reliability and resiliency are key, if not the most important, concepts within DSF or any primary storage platform.&nbsp;</p>

<p>Contrary to traditional architectures which are built around the idea that hardware will be reliable, Nutanix takes a different approach: it expects hardware will eventually fail.&nbsp; By doing so, the system is designed to handle these failures in an elegant and non-disruptive manner.</p>

<p>NOTE: That doesn’t mean the hardware quality isn’t there; it is just a concept shift.&nbsp; The Nutanix hardware and QA teams put the hardware through an exhaustive qualification and vetting process.</p>

<p><em>Potential levels of failure</em></p>

<p>Being a distributed system, DSF is built to handle component, service, and CVM failures, which can be characterized on a few levels:</p>

<ul>
	<li>Disk Failure</li>
	<li>CVM “Failure”</li>
	<li>Node Failure</li>
</ul>

<h5>Disk Failure</h5>

<p>A disk failure can be characterized as just that: a disk which has either been removed, had a die failure, or is experiencing I/O errors and has been proactively removed.</p>

<p>VM impact:</p>

<ul>
	<li>HA event: <strong>No</strong></li>
	<li>Failed I/Os: <strong>No</strong></li>
	<li>Latency: <strong>No impact</strong></li>
</ul>

<p>In the event of a disk failure, a Curator scan (MapReduce Framework) will occur immediately. &nbsp;It will scan the metadata (Cassandra) to find the data previously hosted on the failed disk and the nodes / disks hosting the replicas.</p>

<p>Once it has found the data that needs to be “re-replicated”, it will distribute the replication tasks to the nodes throughout the cluster.&nbsp;</p>

<p>An important thing to highlight here is that, because Nutanix distributes data and replicas across all nodes / CVMs / disks, all nodes / CVMs / disks will participate in the re-replication.&nbsp;</p>

<p>This substantially reduces the time required for re-protection, as the power of the full cluster can be utilized; the larger the cluster, the faster the re-protection.</p>

<h5>CVM “Failure”</h5>

<p>A CVM “failure” can be characterized as a CVM power action causing the CVM to be temporarily unavailable.&nbsp; The system is designed to handle these transparently.&nbsp; In the event of a failure, I/Os will be re-directed to other CVMs within the cluster.&nbsp; The mechanism for this will vary by hypervisor.&nbsp;</p>

<p>The rolling upgrade process actually leverages this capability as it will upgrade one CVM at a time, iterating through the cluster.</p>

<p>VM impact:</p>

<ul>
	<li>HA event: <strong>No</strong></li>
	<li>Failed I/Os: <strong>No</strong></li>
	<li>Latency: <strong>Potentially higher given I/Os over the network</strong></li>
</ul>

<p>In the event of a CVM “failure”, the I/O which was previously being served from the down CVM will be forwarded to other CVMs throughout the cluster.&nbsp; ESXi and Hyper-V handle this via a process called CVM Autopathing, which leverages HA.py (like “happy”) to modify the routes so that traffic going to the internal address (192.168.5.2) is forwarded to the external IP of other CVMs throughout the cluster.&nbsp; This enables the datastore to remain intact; only the CVM responsible for serving the I/Os is remote.</p>

<p>Once the local CVM comes back up and is stable, the route would be removed and the local CVM would take over all new I/Os.</p>

<p>In the case of KVM, iSCSI multi-pathing is leveraged where the primary path is the local CVM and the two other paths would be remote.&nbsp; In the event where the primary path fails, one of the other paths will become active.</p>

<p>Similar to Autopathing with ESXi and Hyper-V, when the local CVM comes back online, it’ll take over as the primary path.</p>
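<p>Both mechanisms follow the same pattern: prefer the local CVM and fall back to a remote one only while the local CVM is unavailable. A minimal sketch of that selection logic is shown here; it does not model HA.py route changes or iSCSI multi-pathing themselves, and the function name and health callback are hypothetical.</p>

```python
def active_path(paths, is_healthy):
    """Return the first healthy CVM; paths is ordered with the local CVM first."""
    for cvm in paths:
        if is_healthy(cvm):
            return cvm
    raise RuntimeError("no healthy CVM path available")


# Hypothetical example: the local CVM goes down, then comes back.
paths = ["local-cvm", "remote-cvm-1", "remote-cvm-2"]
down = {"local-cvm"}
during_failure = active_path(paths, lambda cvm: cvm not in down)
down.clear()
after_recovery = active_path(paths, lambda cvm: cvm not in down)
```

<p>While the local CVM is down, I/O is served by a remote CVM; once it is back, the local CVM is selected again.</p>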

<h5>Node Failure</h5>

<p>VM Impact:</p>

<ul>
	<li>HA event: <strong>Yes</strong></li>
	<li>Failed I/Os: <strong>No</strong></li>
	<li>Latency: <strong>No impact</strong></li>
</ul>

<p>In the event of a node failure, a VM HA event will occur restarting the VMs on other nodes throughout the virtualization cluster.&nbsp; Once restarted, the VMs will continue to perform I/Os as usual which will be handled by their local CVMs.</p>

<p>Similar to the disk failure scenario above, a Curator scan will find the data previously hosted on the node and its respective replicas.&nbsp; The same process will then take place to re-protect the data, just for the full node (all associated disks).</p>

<p>In the event where the node remains down for a prolonged period of time (30 minutes as of 4.6), the down CVM will be removed from the metadata ring.&nbsp; It will be joined back into the ring after it has been up and stable for a duration of time.</p>
</section>

<section data-type="sect1">
<h3>Capacity Optimization</h3>

<p>
  The Nutanix platform incorporates a wide range of storage optimization technologies that work in concert to make efficient use of available capacity for any workload. These technologies are intelligent and adaptive to workload characteristics, eliminating the need for manual configuration and fine-tuning.
</p>

<p>
  The following optimizations are leveraged:
</p>

<ul>
  <li>
    Erasure Coding
  </li>
  <li>
    Compression
  </li>
  <li>
    Deduplication
  </li>
</ul>

<p>
  More detail on how each of these features works can be found in the following sections.
</p>

<p>
  The following table describes, at a high level, which optimizations are applicable to which workloads:
</p>

<table>
  <tr>
    <th>Data Transform</th>
    <th>Application(s)</th>
    <th>Comments</th>
  </tr>
  <tr>
    <td>Erasure Coding</td>
    <td>Most</td>
    <td>Provides higher availability with reduced overhead compared to traditional RF.  No impact to normal write or read I/O performance.  Does have some read overhead in the case of a disk / node / block failure where data must be decoded.</td>
  </tr>
  <tr>
    <td>Inline<br>Compression</td>
    <td>All</td>
    <td>No impact to random I/O, helps increase storage tier utilization.  Benefits large or sequential I/O performance by reducing data to replicate and read from disk.</td>
  </tr>
  <tr>
    <td>Offline<br>Compression</td>
    <td>None</td>
    <td>Because inline compression compresses only large or sequential writes inline and handles random or small I/Os post-process, it should be used instead.</td>
  </tr>
  <tr>
    <td>Perf Tier<br>Dedup</td>
    <td>P2V/V2V, Hyper-V (ODX), Cross-container clones</td>
    <td>Greater cache efficiency for data which wasn't cloned or created using efficient Acropolis clones</td>
  </tr>
  <tr>
    <td>Capacity Tier<br>Dedup</td>
    <td>Same as perf tier dedup</td>
    <td>Benefits of above with reduced overhead on disk</td>
  </tr>
</table>

<section data-type="sect2" id="erasure-coding-nrIjcoTx">
<h4>Erasure Coding</h4>

<p>The Nutanix platform leverages a replication factor (RF) for data protection and availability.&nbsp; This method provides the highest degree of availability because it does not require reading from more than one storage location or data re-computation on failure.&nbsp; However, this does come at the cost of storage resources as full copies are required.&nbsp;</p>

<p>To balance availability against the amount of storage required, DSF provides the ability to encode data using erasure codes (EC).</p>

<p>Similar to the concept of RAID (levels 4, 5, 6, etc.) where parity is calculated, EC encodes a strip of data blocks on different nodes and calculates parity.&nbsp; In the event of a host and/or disk failure, the parity can be leveraged to calculate any missing data blocks (decoding).&nbsp; In the case of DSF, the data block is an extent group and each data block must be on a different node and belong to a different vDisk.</p>
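<p>The idea of rebuilding a missing data block from parity can be shown with single-parity XOR, the same principle used by RAID 4 and 5. This is a toy sketch: the text does not specify which code DSF uses, the function name is hypothetical, and the blocks are short byte strings of equal length rather than extent groups.</p>

```python
def xor_blocks(blocks):
    """XOR equal-length blocks together, byte by byte."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


# A 4/1 strip: four data blocks (each on a different node) plus one parity block.
data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity = xor_blocks(data)

# Simulate losing the node that held data[2] and decode it from the survivors.
survivors = [data[0], data[1], data[3], parity]
rebuilt = xor_blocks(survivors)
```

<p>Because XOR is its own inverse, combining the three surviving data blocks with the parity block yields the lost block. Tolerating two simultaneous failures (e.g. a 4/2 strip) requires a stronger code than plain XOR.</p>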

<p>The number of data and parity blocks in a strip is configurable based upon the desired failures to tolerate.&nbsp; The configuration is commonly referred to as &lt;number of data blocks&gt;/&lt;number of parity blocks&gt;.</p>

<p>For example, “RF2 like” availability (e.g., N+1) could consist of 3 or 4 data blocks and 1 parity block in a strip (e.g., 3/1 or 4/1).&nbsp; “RF3 like” availability (e.g. N+2) could consist of 3 or 4 data blocks and 2 parity blocks in a strip (e.g. 3/2 or 4/2).</p>

<div data-type="note" class="note" id="pro-tip-1RinF8cdTQ"><h6>Note</h6>
<h5>Pro tip</h5>

<p>You can override the default strip size (4/1 for “RF2 like” or 4/2 for “RF3 like”) via NCLI ‘ctr [create / edit] … erasure-code=&lt;N&gt;/&lt;K&gt;’ where N is the number of data blocks and K is the number of parity blocks.</p>
</div>

<p>The expected overhead can be calculated as &lt;# parity blocks&gt; / &lt;# data blocks&gt;.&nbsp; For example, a 4/1 strip has a 25% overhead or 1.25X compared to the 2X of RF2.&nbsp; A 4/2 strip has a 50% overhead or 1.5X compared to the 3X of RF3.</p>

<p>The following table characterizes the encoded strip sizes and example overheads:</p>

<table>
  <tr>
    <th></th>
    <th colspan="2"><center>FT1 (RF2 equiv.)</center></th>
    <th colspan="2"><center>FT2 (RF3 equiv.)</center></th>
  </tr>
  <tr>
    <td>Cluster Size<br>(nodes)</td>
    <td>EC Strip Size<br>(data/parity blocks)</td>
    <td>EC Overhead<br>(vs. 2X of RF2)</td>
    <td>EC Strip Size<br>(data/parity)</td>
    <td>EC Overhead<br>(vs. 3X of RF3)</td>
  </tr>
  <tr>
    <td>4</td>
    <td>2/1</td>
    <td>1.5X</td>
    <td>N/A</td>
    <td>N/A</td>
  </tr>
  <tr>
    <td>5</td>
    <td>3/1</td>
    <td>1.33X</td>
    <td>N/A</td>
    <td>N/A</td>
  </tr>
  <tr>
    <td>6</td>
    <td>4/1</td>
    <td>1.25X</td>
    <td>N/A</td>
    <td>N/A</td>
  </tr>
  <tr>
    <td>7+</td>
    <td>4/1</td>
    <td>1.25X</td>
    <td>4/2</td>
    <td>1.5X</td>
  </tr>
</table>

<div data-type="note" class="note" id="pro-tip-P0iqcdcwT1"><h6>Note</h6>
<h5>Pro tip</h5>

<p>It is always recommended to have a cluster size which has at least 1 more node than the combined strip size (data + parity) to allow for rebuilding of the strips in the event of a node failure.  This eliminates any computation overhead on reads once the strips have been rebuilt (automated via Curator).  For example, a 4/1 strip should have at least 6 nodes in the cluster.  The previous table follows this best practice.</p>
</div>
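<p>The overhead formula and the cluster sizing recommendation above can be worked through in a few lines. The function names are hypothetical; the arithmetic follows the text (overhead = parity blocks / data blocks, and at least one more node than data + parity blocks).</p>

```python
def ec_overhead(data_blocks: int, parity_blocks: int) -> float:
    """Overhead as a fraction: <# parity blocks> / <# data blocks>."""
    return parity_blocks / data_blocks


def ec_storage_multiplier(data_blocks: int, parity_blocks: int) -> float:
    """Raw capacity consumed per unit of data, comparable to 2X (RF2) or 3X (RF3)."""
    return 1 + ec_overhead(data_blocks, parity_blocks)


def recommended_min_nodes(data_blocks: int, parity_blocks: int) -> int:
    """One node more than the combined strip size, to leave room for rebuilds."""
    return data_blocks + parity_blocks + 1
```

<p>A 4/1 strip gives a multiplier of 1.25 with a recommended minimum of 6 nodes, and a 4/2 strip gives 1.5 with a minimum of 7 nodes, matching the table above.</p>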

<p>The encoding is done post-process and leverages the Curator MapReduce framework for task distribution.&nbsp; Since this is a post-process framework, the traditional write I/O path is unaffected.</p>

<p>A normal environment using RF would look like the following:</p>

<figure id="id-lNtGhwcATe"><img alt="Typical DSF RF Data Layout" class="iimagesv2ec_1png" src="imagesv2/ec_1.png">
<figcaption><span class="label">Figure 11-10. </span>Typical DSF RF Data Layout</figcaption>
</figure>

<p>In this scenario, we have a mix of both RF2 and RF3 data whose primary copies are local and replicas are distributed to other nodes throughout the cluster.</p>

<p>When a Curator full scan runs, it will find extent groups which are eligible to be encoded.  Eligible extent groups must be "write-cold", meaning they haven't been written to for &gt; 1 hour. After the eligible candidates are found, the encoding tasks will be distributed and throttled via Chronos.</p>

<p>The following figure shows an example 4/1 and 3/2 strip:</p>

<figure id="id-4mtATYc8Tk"><img alt="DSF Encoded Strip - Pre-savings" class="iimagesv2ec_2png" src="imagesv2/ec_2.png">
<figcaption><span class="label">Figure 11-11. </span>DSF Encoded Strip - Pre-savings</figcaption>
</figure>

<p>Once the data has been successfully encoded (strips and parity calculation), the replica extent groups are then removed.</p>

<p>The following figure shows the environment after EC has run with the storage savings:</p>

<figure id="id-n3tMFNc1TN"><img alt="DSF Encoded Strip - Post-savings" class="iimagesv2ec_3png" src="imagesv2/ec_3.png">
<figcaption><span class="label">Figure 11-12. </span>DSF Encoded Strip - Post-savings</figcaption>
</figure>

<div data-type="note" class="note" id="pro-tip-mEiEtYcoTN"><h6>Note</h6>
<h5>Pro tip</h5>

<p>Erasure Coding pairs perfectly with inline compression which will add to the storage savings.  I leverage inline compression + EC in my environments.</p>
</div>
</section>

<section data-type="sect2" id="compression-ONIqiMTA">
<h4>Compression</h4>

<p>For a visual explanation, you can watch the following video: <a href="https://youtu.be/ERDqOCzDcQY">LINK</a></p>

<div class="video-container"><iframe allowfullscreen frameborder="0" src="https://www.youtube.com/embed//ERDqOCzDcQY"></iframe></div>

<p>The Nutanix Capacity Optimization Engine (COE) is responsible for performing data transformations to increase data efficiency on disk.&nbsp; Currently compression is one of the key features of the COE to perform data optimization. DSF provides both inline and offline flavors of compression to best suit the customer’s needs and type of data.&nbsp;</p>

<p>Inline compression will compress sequential streams of data or large I/O sizes in memory before it is written to disk, while offline compression will initially write the data as normal (in an un-compressed state) and then leverage the Curator framework to compress the data cluster wide. When inline compression is enabled but the I/Os are random in nature, the data will be written un-compressed in the OpLog, coalesced, and then compressed in memory before being written to the Extent Store. The Google Snappy compression library is leveraged which provides good compression ratios with minimal computational overhead and extremely fast compression / decompression rates.</p>
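<p>The compress-before-write and decompress-on-read steps can be illustrated with a round trip. Snappy is not part of the Python standard library, so this sketch substitutes <code>zlib</code> at its fastest level purely as a stand-in; the function names are hypothetical and the ratios will differ from what Snappy achieves.</p>

```python
import zlib


def compress_for_extent_store(data: bytes) -> bytes:
    """Compress in memory before the data is written (zlib stands in for Snappy)."""
    return zlib.compress(data, 1)  # level 1 favors speed over ratio


def decompress_for_read(blob: bytes) -> bytes:
    """Decompress in memory before the read I/O is served."""
    return zlib.decompress(blob)


# Highly repetitive sample data, similar in spirit to a large sequential write.
original = b"nutanix" * 1000
stored = compress_for_extent_store(original)
restored = decompress_for_read(stored)
```

<p>The stored form is smaller than the original for compressible data, and the read path returns exactly the bytes that were written.</p>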

<p>The following figure shows an example of how inline compression interacts with the DSF write I/O path:</p>

<figure id="id-GRtVFZi0T1"><img alt="Inline Compression I/O Path" class="iimagesv2compression_1png" src="imagesv2/compression_1.png">
<figcaption><span class="label">Figure 11-13. </span>Inline Compression I/O Path</figcaption>
</figure>

<div data-type="note" class="note" id="pro-tip-1RiZtvidTQ"><h6>Note</h6>
<h5>Pro tip</h5>

<p>Almost always use inline compression (compression delay = 0) as it will only compress larger / sequential writes and not impact random write performance.</p>

<p>
	Inline compression also pairs perfectly with erasure coding.
</p>

</div>

<p>For offline compression, all new write I/O is written in an un-compressed state and follows the normal DSF I/O path.&nbsp; After the compression delay (configurable) is met and the data has become cold (down-migrated to the HDD tier via ILM), the data is eligible to become compressed. Offline compression uses the Curator MapReduce framework and all nodes will perform compression tasks.&nbsp; Compression tasks will be throttled by Chronos.</p>

<p>The following figure shows an example of how offline compression interacts with the DSF write I/O path:</p>

<figure id="id-NMt1cmiYT8"><img alt="Offline Compression I/O Path" class="iimagesv2compression_2png" src="imagesv2/compression_2.png">
<figcaption><span class="label">Figure 11-14. </span>Offline Compression I/O Path</figcaption>
</figure>

<p>For read I/O, the data is first decompressed in memory and then the I/O is served.&nbsp; For data that is heavily accessed, the data will become decompressed in the HDD tier and can then leverage ILM to move up to the SSD tier as well as be stored in the cache.</p>

<p>The following figure shows an example of how decompression interacts with the DSF I/O path during read:</p>

<figure id="id-PJtrhjiwT1"><img alt="Decompression I/O Path" class="iimagesv2compression_3png" src="imagesv2/compression_3.png">
<figcaption><span class="label">Figure 11-15. </span>Decompression I/O Path</figcaption>
</figure>

<p>You can view the current compression rates via Prism on the Storage &gt; Dashboard page.</p>
</section>

<section data-type="sect2" id="elastic-dedupe-engine-b1I4I5T5">
<h4>Elastic Dedupe Engine</h4>

<p>For a visual explanation, you can watch the following video: <a href="https://youtu.be/C-rp13cDpNw">LINK</a></p>

<div class="video-container"><iframe allowfullscreen frameborder="0" src="https://www.youtube.com/embed//C-rp13cDpNw"></iframe></div>

<p>The Elastic Dedupe Engine is a software-based feature of DSF which allows for data deduplication in the capacity (HDD) and performance (SSD/Memory) tiers.&nbsp; Streams of data are fingerprinted during ingest using a SHA-1 hash at a 16K granularity.&nbsp; This fingerprint is only done on data ingest and is then stored persistently as part of the written block’s metadata.&nbsp; NOTE: Initially a 4K granularity was used for fingerprinting, however after testing 16K offered the best blend of deduplication with reduced metadata overhead.&nbsp; Deduplicated data is pulled into the unified cache at a 4K granularity.</p>
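
<p>As a rough illustration of fingerprinting at a 16K granularity, the following sketch hashes each 16K chunk of a stream with SHA-1 using Python's standard <code>hashlib</code>. The function name and return shape are assumptions made for this example; how DSF actually stores fingerprints in metadata is not shown.</p>

```python
import hashlib

FINGERPRINT_CHUNK = 16 * 1024  # 16K granularity, per the text


def fingerprint_stream(data: bytes) -> list:
    """Return (offset, sha1_hex) for each 16K chunk of an ingested stream."""
    return [
        (off, hashlib.sha1(data[off:off + FINGERPRINT_CHUNK]).hexdigest())
        for off in range(0, len(data), FINGERPRINT_CHUNK)
    ]
```

<p>Two chunks with identical content produce the same fingerprint, which is what later allows duplicates to be identified without re-reading the data.</p>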

<p>Contrary to traditional approaches which utilize background scans requiring the data to be re-read, Nutanix performs the fingerprinting inline on ingest.&nbsp; For duplicate data that can be deduplicated in the capacity tier, the data does not need to be scanned or re-read; the duplicate copies can simply be removed.</p>

<p>
	To make the metadata overhead more efficient, fingerprint refcounts are monitored to track dedupability.  Fingerprints with low refcounts will be discarded to minimize the metadata overhead. To minimize fragmentation, full extents will be preferred for capacity tier deduplication.
</p>
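
<p>The refcount-based pruning can be pictured as follows. The cutoff of 2 is purely hypothetical (the text only says "low refcounts"), and the function name is made up for this sketch.</p>

```python
from collections import Counter

MIN_REFCOUNT = 2  # hypothetical cutoff; the text does not state the real value


def prune_fingerprints(fingerprints, min_refcount=MIN_REFCOUNT):
    """Keep only fingerprints referenced often enough to be worth tracking."""
    refcounts = Counter(fingerprints)
    return {fp: n for fp, n in refcounts.items() if n >= min_refcount}
```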

<div data-type="note" class="note" id="pro-tip-N8iNFMIYT8"><h6>Note</h6>
<h5>Pro tip</h5>

<p>Use performance tier deduplication on your base images (you can manually fingerprint them using vdisk_manipulator) to take advantage of the unified cache.</p>

<p>Use capacity tier deduplication for P2V / V2V,&nbsp;when using Hyper-V since ODX does a full data copy, or when doing cross-container clones (not usually recommended as a single container is preferred).</p>

<p>In most other cases compression will yield the highest capacity savings and should be used instead.</p>
</div>

<p>The following figure shows an example of how the Elastic Dedupe Engine scales and handles local VM I/O requests:</p>

<figure class="large" id="id-NMt2fMIYT8"><img alt="Elastic Dedupe Engine - Scale" class="iimagesv2dedup_1png" src="imagesv2/dedup_1.png">
<figcaption><span class="label">Figure 11-16. </span>Elastic Dedupe Engine - Scale</figcaption>
</figure>

<p>Fingerprinting is done during ingest for data with an I/O size of 64K or greater.&nbsp; Intel acceleration is leveraged for the SHA-1 computation, which results in very minimal CPU overhead.&nbsp; In cases where fingerprinting is not done during ingest (e.g., smaller I/O sizes), fingerprinting can be done as a background process. The Elastic Deduplication Engine spans both the capacity disk tier (HDD) and the performance tier (SSD/Memory).&nbsp; As duplicate data is determined, based upon multiple copies of the same fingerprints, a background process will remove the duplicate data using the DSF MapReduce framework (Curator). For data that is being read, the data will be pulled into the DSF Unified Cache which is a multi-tier/pool cache.&nbsp; Any subsequent requests for data having the same fingerprint will be pulled directly from the cache.&nbsp; To learn more about the Unified Cache and pool structure, please refer to the 'Unified Cache' sub-section in the I/O path overview.</p>

<div data-type="note" class="note" id="fingerprinted-vdisk-offsets-l9ijcWIATe"><h6>Note</h6>
<h5>Fingerprinted vDisk Offsets</h5>
<p>
	Prior to 4.5 only the first 12GB of a vDisk was eligible to be fingerprinted.  This was done to maintain a smaller metadata footprint and since the OS is normally the most common data.  As of 4.5 this has increased to 24GB due to higher metadata efficiencies.
</p>
</div>
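
<p>Putting the 64K ingest threshold and the vDisk offset limit together, the decision of when a write gets fingerprinted might look like the sketch below. The function names, the version tuple, and the treatment of "GB" as 2<sup>30</sup> bytes are assumptions for illustration only.</p>

```python
INLINE_FINGERPRINT_MIN_IO = 64 * 1024  # 64K, per the text
GIB = 1024 ** 3                        # assumes "GB" in the text means GiB


def fingerprint_limit_bytes(version):
    """First 12GB of a vDisk prior to 4.5, first 24GB as of 4.5."""
    return 24 * GIB if version >= (4, 5) else 12 * GIB


def fingerprint_action(io_size, vdisk_offset, version=(4, 5)):
    """Return "inline", "background", or "skip" for a single write."""
    if vdisk_offset >= fingerprint_limit_bytes(version):
        return "skip"          # beyond the fingerprint-eligible region
    if io_size >= INLINE_FINGERPRINT_MIN_IO:
        return "inline"        # fingerprinted during ingest
    return "background"        # smaller I/O can be fingerprinted later
```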

<p>The following figure shows an example of how the Elastic Dedupe Engine interacts with the DSF I/O path:</p>

<figure id="id-lNtAUWIATe"><img alt="EDE I/O Path" class="iimagesv2dedup_2png" src="imagesv2/dedup_2.png">
<figcaption><span class="label">Figure 11-17. </span>EDE I/O Path</figcaption>
</figure>

<p>You can view the current deduplication rates via Prism on the Storage &gt; Dashboard page.</p>

<div data-type="note" class="note" id="dedup-compression-42ikCZI8Tk"><h6>Note</h6>
<h5>Dedup + Compression</h5>

<p>
	As of 4.5 both deduplication and compression can be enabled on the same container.  However, unless the data is dedupable (conditions explained earlier in this section), stick with compression.
</p>

</div>
</section>

</section>

<section data-type="sect1" id="storage-tiering-and-prioritization-rkIqUgTO">
<h3>Storage Tiering and Prioritization</h3>

<p>The Disk Balancing section describes how storage capacity is pooled among all nodes in a Nutanix cluster and how ILM is used to keep hot data local.&nbsp; A similar concept applies to disk tiering, in which the cluster’s SSD and HDD tiers are cluster-wide and DSF ILM is responsible for triggering data movement events. A local node’s SSD tier is always the highest priority tier for all I/O generated by VMs running on that node, however all of the cluster’s SSD resources are made available to all nodes within the cluster.&nbsp; The SSD tier will always offer the highest performance and is a very important thing to manage for hybrid arrays.</p>

<p>The tier prioritization can be classified at a high-level by the following:</p>

<figure id="id-GRtgTgU0T1"><img alt="DSF Tier Prioritization" class="iimagesv2tiering_1png" src="imagesv2/tiering_1.png">
<figcaption><span class="label">Figure 11-18. </span>DSF Tier Prioritization</figcaption>
</figure>

<p>Specific types of resources (e.g. SSD, HDD, etc.) are pooled together and form a cluster-wide storage tier.&nbsp; This means that any node within the cluster can leverage the full tier capacity, regardless of whether it is local or not.</p>

<p>The following figure shows a high level example of what this pooled tiering looks like:</p>

<figure class="large" id="id-1ntnF4UdTQ"><img alt="DSF Cluster-wide Tiering" class="iimagesv2tiering_2png" src="imagesv2/tiering_2.png">
<figcaption><span class="label">Figure 11-19. </span>DSF Cluster-wide Tiering</figcaption>
</figure>

<p>A common question is what happens when a local node’s SSD becomes full?&nbsp; As mentioned in the Disk Balancing section, a key concept is trying to keep uniform utilization of devices within disk tiers.&nbsp; In the case where a local node’s SSD utilization is high, disk balancing will kick in to move the coldest data on the local SSDs to the other SSDs throughout the cluster.&nbsp; This will free up space on the local SSD to allow the local node to write to SSD locally instead of going over the network.&nbsp; A key point to mention is that all CVMs and SSDs are used for this remote I/O to eliminate any potential bottlenecks and offset some of the penalty of performing I/O over the network.</p>

<figure class="large" id="id-2btrfvU0T3"><img alt="DSF Cluster-wide Tier Balancing" class="iimagesv2tiering_3png" src="imagesv2/tiering_3.png">
<figcaption><span class="label">Figure 11-20. </span>DSF Cluster-wide Tier Balancing</figcaption>
</figure>

<p>The other case is when the overall tier utilization breaches a specific threshold [curator_tier_usage_ilm_threshold_percent (Default=75)] where DSF ILM will kick in and as part of a Curator job will down-migrate data from the SSD tier to the HDD tier.&nbsp; This will bring utilization within the threshold mentioned above or free up space by the following amount [curator_tier_free_up_percent_by_ilm (Default=15)], whichever is greater. The data for down-migration is chosen using last access time. In the case where the SSD tier utilization is 95%, 20% of the data in the SSD tier will be moved to the HDD tier (95% –&gt; 75%).&nbsp;</p>

<p>However, if the utilization was 80%, only 15% of the data would be moved to the HDD tier using the minimum tier free up amount.</p>
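
<p>The arithmetic behind these two examples is simply the larger of "amount over the threshold" and "minimum free-up amount". A minimal sketch, reusing the two defaults quoted above as parameter defaults (the function name itself is made up for this example):</p>

```python
def ilm_down_migrate_percent(ssd_utilization, threshold=75.0, min_free_up=15.0):
    """Percent of the SSD tier to down-migrate once the threshold is breached.

    threshold   ~ curator_tier_usage_ilm_threshold_percent (Default=75)
    min_free_up ~ curator_tier_free_up_percent_by_ilm (Default=15)
    """
    if ssd_utilization <= threshold:
        return 0.0  # threshold not breached, nothing to move
    # Whichever is greater: getting back to the threshold, or the minimum amount.
    return max(ssd_utilization - threshold, min_free_up)
```

<p>With the defaults, 95% utilization yields 20 (95% down to 75%), while 80% utilization yields the 15% minimum.</p>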

<figure class="large" id="id-lNtzIJUATe"><img alt="DSF Tier ILM" class="iimagesv2tiering_4png" src="imagesv2/tiering_4.png">
<figcaption><span class="label">Figure 11-21. </span>DSF Tier ILM</figcaption>
</figure>

<p>DSF ILM will constantly monitor the I/O patterns and (down/up) migrate data as necessary as well as bring the hottest data local regardless of tier.</p>
</section>

<section data-type="sect1" id="disk-balancing-mkIOhPTq">
<h3>Disk Balancing</h3>

<p>For a visual explanation, you can watch the following video: <a href="https://youtu.be/atbkgrmpVNo">LINK</a></p>

<div class="video-container"><iframe allowfullscreen frameborder="0" src="https://www.youtube.com/embed//atbkgrmpVNo"></iframe></div>

<p>DSF is designed to be a very dynamic platform which can react to various workloads as well as allow heterogeneous node types: compute heavy (3050, etc.) and storage heavy (60X0, etc.) to be mixed in a single cluster.&nbsp; Ensuring uniform distribution of data is an important item when mixing nodes with larger storage capacities. DSF has a native feature, called disk balancing, which is used to ensure uniform distribution of data throughout the cluster.&nbsp; Disk balancing works on a node’s utilization of its local storage capacity and is integrated with DSF ILM.&nbsp; Its goal is to keep utilization uniform among nodes once the utilization has breached a certain threshold.</p>

<p>The following figure shows an example of a mixed cluster (3050 + 6050) in an “unbalanced” state:</p>

<figure class="large" id="id-1ntzHGhdTQ"><img alt="Disk Balancing - Unbalanced State" class="iimagesv2disk_balancing_1png" src="imagesv2/disk_balancing_1.png">
<figcaption><span class="label">Figure 11-22. </span>Disk Balancing - Unbalanced State</figcaption>
</figure>

<p>Disk balancing leverages the DSF Curator framework and is run as a scheduled process as well as when a threshold has been breached (e.g., local node capacity utilization &gt; n %).&nbsp; In the case where the data is not balanced, Curator will determine which data needs to be moved and will distribute the tasks to nodes in the cluster. In the case where the node types are homogeneous (e.g., 3050), utilization should be fairly uniform. However, if there are certain VMs running on a node which are writing much more data than others, there can become a skew in the per node capacity utilization.&nbsp; In this case, disk balancing would run and move the coldest data on that node to other nodes in the cluster. In the case where the node types are heterogeneous (e.g., 3050 + 6020/50/70), or where a node may be used in a “storage only” mode (not running any VMs), there will likely be a requirement to move data.</p>
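
<p>To make the "move the coldest data off the over-utilized node" idea concrete, here is a small planning sketch. It is not Curator's algorithm: the <code>Node</code> type, the greedy choice of the least-utilized target, and the single threshold parameter are all assumptions chosen to keep the example short.</p>

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    capacity: int
    # (last_access_ts, size) per extent group; a smaller timestamp means colder
    egroups: list = field(default_factory=list)

    @property
    def utilization(self):
        used = sum(size for _, size in self.egroups)
        return 100.0 * used / self.capacity


def plan_moves(nodes, threshold_pct):
    """Return (source, target, size) moves; mutates the nodes to reflect the plan."""
    moves = []
    for src in nodes:
        others = [n for n in nodes if n is not src]
        if not others:
            continue
        src.egroups.sort()  # coldest data first
        while src.utilization > threshold_pct and src.egroups:
            dst = min(others, key=lambda n: n.utilization)
            if dst.utilization >= src.utilization:
                break  # no less-utilized node to move data to
            egroup = src.egroups.pop(0)
            dst.egroups.append(egroup)
            moves.append((src.name, dst.name, egroup[1]))
    return moves
```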

<p>The following figure shows an example of the mixed cluster in a “balanced” state after disk balancing has been run:</p>

<figure class="large" id="id-3rt1fehlTb"><img alt="Disk Balancing - Balanced State" class="iimagesv2disk_balancing_2png" src="imagesv2/disk_balancing_2.png">
<figcaption><span class="label">Figure 11-23. </span>Disk Balancing - Balanced State</figcaption>
</figure>

<p>In some scenarios, customers might run some nodes in a “storage-only” state where only the CVM will run on the node whose primary purpose is bulk storage capacity.&nbsp; In this case, the full node's memory can be added to the CVM to provide a much larger read cache.</p>

<p>The following figure shows an example of how a storage only node would look in a mixed cluster with disk balancing moving data to it from the active VM nodes:</p>

<figure class="large" id="id-MltQIBhpT1"><img alt="Disk Balancing - Storage Only Node" class="iimagesv2disk_balancing_3png" src="imagesv2/disk_balancing_3.png">
<figcaption><span class="label">Figure 11-24. </span>Disk Balancing - Storage Only Node</figcaption>
</figure>
</section>

<section data-type="sect1" id="snapshots-and-clones-vlInsNTB">
<h3>Snapshots and Clones</h3>

<p>For a visual explanation, you can watch the following video: <a href="https://youtu.be/uK5wWR44UYE">LINK</a></p>

<div class="video-container"><iframe allowfullscreen frameborder="0" src="https://www.youtube.com/embed//uK5wWR44UYE"></iframe></div>

<p>DSF provides native support for offloaded snapshots and clones which can be leveraged via VAAI, ODX, ncli, REST, Prism, etc.&nbsp; Both the snapshots and clones leverage the redirect-on-write algorithm which is the most effective and efficient. As explained in the Data Structure section above, a virtual machine consists of files (vmdk/vhdx) which are vDisks on the Nutanix platform.&nbsp;</p>

<p>A vDisk is composed of extents, which are logically contiguous chunks of data. Extents are stored within extent groups, which are physically contiguous data stored as files on the storage devices. When a snapshot or clone is taken, the base vDisk is marked immutable and another vDisk is created as read/write.&nbsp; At this point, both vDisks have the same block map, which is a metadata mapping of the vDisk to its corresponding extents. Contrary to traditional approaches which require traversal of the snapshot chain (which can add read latency), each vDisk has its own block map.&nbsp; This eliminates any of the overhead normally seen by large snapshot chain depths and allows you to take continuous snapshots without any performance impact.</p>

<p>The following figure shows an example of how this works when a snapshot is taken (NOTE: I need to give some credit to NTAP as a base for these diagrams, as I thought their representation was the clearest):</p>

<figure class="large" id="id-3rteF9slTb"><img alt="Example Snapshot Block Map" class="iimagesv2snap_1png" src="imagesv2/snap_1.png">
<figcaption><span class="label">Figure 11-32. </span>Example Snapshot Block Map</figcaption>
</figure>

<p>The same method applies when a snapshot or clone of a previously snapped or cloned vDisk is performed:</p>

<figure class="large" id="id-lNt3fNsATe"><img alt="Multi-snap Block Map and New Write" class="iimagesv2snap_2png" src="imagesv2/snap_2.png">
<figcaption><span class="label">Figure 11-33. </span>Multi-snap Block Map and New Write</figcaption>
</figure>

<p>The same methods are used for both snapshots and/or clones of a VM or vDisk(s).&nbsp; When a VM or vDisk is cloned, the current block map is locked and the clones are created.&nbsp; These updates are metadata only, so no I/O actually takes place.&nbsp; The same method applies for clones of clones; essentially the previously cloned VM acts as the “Base vDisk” and upon cloning, that block map is locked and two “clones” are created: one for the VM being cloned and another for the new clone.&nbsp;</p>

<p>They both inherit the prior block map and any new writes/updates would take place on their individual block maps.</p>

<figure id="id-J2tNIms9T9"><img alt="Multi-Clone Block Maps" class="iimagesv2snap_3png" src="imagesv2/snap_3.png">
<figcaption><span class="label">Figure 11-34. </span>Multi-Clone Block Maps</figcaption>
</figure>

<p>As mentioned previously, each VM/vDisk has its own individual block map.&nbsp; So in the above example, all of the clones from the base VM would now own their block map and any write/update would occur there.&nbsp;</p>

<p>The following figure shows an example of what this looks like:</p>

<figure id="id-OQtmCNsjTJ"><img alt="Clone Block Maps - New Write" class="iimagesv2snap_4png" src="imagesv2/snap_4.png">
<figcaption><span class="label">Figure 11-35. </span>Clone Block Maps - New Write</figcaption>
</figure>

<p>Any subsequent clones or snapshots of a VM/vDisk would cause the original block map to be locked and would create a new one for R/W access.</p>
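
<p>The block map behavior described in this section can be modeled with a toy example. The <code>VDisk</code> class and the <code>snapshot</code> / <code>clone</code> helpers below are hypothetical and only mimic the metadata semantics (lock the base, hand each child its own copy of the block map); no real extent I/O is modeled.</p>

```python
class VDisk:
    def __init__(self, name, block_map=None):
        self.name = name
        self.block_map = dict(block_map or {})  # each vDisk owns its block map
        self.immutable = False

    def write(self, block, extent_id):
        if self.immutable:
            raise PermissionError(f"{self.name} is locked (snapshotted or cloned)")
        # Redirect-on-write: point the block at a new extent; the extent that
        # the base vDisk references is left untouched.
        self.block_map[block] = extent_id

    def read(self, block):
        # No snapshot chain traversal: the answer is in this vDisk's own map.
        return self.block_map.get(block)


def snapshot(base, child_name):
    """Lock the base vDisk and return a new read/write vDisk with the same map."""
    base.immutable = True
    return VDisk(child_name, base.block_map)


def clone(base, names):
    """Lock the base vDisk and create one read/write vDisk per name."""
    base.immutable = True
    return [VDisk(n, base.block_map) for n in names]
```

<p>Both operations only copy metadata, which is why the text notes that no data I/O takes place when a snapshot or clone is created.</p>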
</section>

<section data-type="sect1" id="replication-and-multi-site-disaster-recovery-jrIOuxT2">
<h3>Replication and DR</h3>

<p>For a visual explanation, you can watch the following video: <a href="https://youtu.be/AoKwKI7CXIM">LINK</a></p>

<div class="video-container"><iframe allowfullscreen frameborder="0" src="https://www.youtube.com/embed//AoKwKI7CXIM"></iframe></div>

<p>Nutanix provides native DR and replication capabilities, which build upon the same features explained in the Snapshots &amp; Clones section.&nbsp; Cerebro is the component responsible for managing the DR and replication in DSF.&nbsp; Cerebro runs on every node and a Cerebro master is elected (similar to NFS master) and is responsible for managing replication tasks.&nbsp; In the event the CVM acting as Cerebro master fails, another is elected and assumes the role.&nbsp; The Cerebro page can be found on &lt;CVM IP&gt;:2020. The DR function can be broken down into a few key focus areas:</p>

<ul>
	<li>Replication Topologies</li>
	<li>Implementation Constructs</li>
	<li>Replication Lifecycle</li>
	<li>Global Deduplication</li>
</ul>

<h5>Replication Topologies</h5>

<p>Traditionally, there are a few key replication topologies: site to site, hub and spoke, and full and/or partial mesh.&nbsp; Contrary to traditional solutions which only allow for site to site or hub and spoke, Nutanix provides a full mesh or flexible many-to-many model.</p>

<figure id="id-lNtJtGuATe"><img alt="Example Replication Topologies" class="iimagesv2dr_1png" src="imagesv2/dr_1.png">
<figcaption><span class="label">Figure 11-36. </span>Example Replication Topologies</figcaption>
</figure>

<p>Essentially, this allows the admin to determine a replication capability that meets their company's needs.</p>

<h5>Implementation Constructs</h5>

<p>Within Nutanix DR, there are a few key constructs which are explained below:</p>

<h6>Remote Site</h6>

<ul>
	<li>Key Role: A remote Nutanix cluster</li>
	<li>Description: A remote Nutanix cluster which can be leveraged as a target for backup or DR purposes.</li>
</ul>

<div data-type="note" class="note" id="pro-tip-bOi5hDuyTp"><h6>Note</h6>
<h5>Pro tip</h5>

<p>Ensure the target site has ample capacity (compute/storage) to handle a full site failure.&nbsp; In certain cases replication/DR between racks within a single site can also make sense.</p>
</div>

<h6>Protection Domain (PD)</h6>

<ul>
	<li>Key Role: Macro group of VMs and/or files to protect</li>
	<li>Description: A group of VMs and/or files to be replicated together on a desired schedule.&nbsp; A PD can protect a full container or you can select individual VMs and/or files</li>
</ul>

<div data-type="note" class="note" id="pro-tip-W1i5uRuRT2"><h6>Note</h6>
<h5>Pro tip</h5>

<p>Create multiple PDs for various service tiers driven by a desired RPO/RTO.&nbsp; For file distribution (e.g. golden images, ISOs, etc.) you can create a PD with the files to replicate.</p>
</div>

<h6>Consistency Group (CG)</h6>

<ul>
	<li>Key Role: Subset of VMs/files in PD to be crash-consistent</li>
	<li>Description: VMs and/or files which are part of a Protection Domain which need to be snapshotted in a crash-consistent manner.&nbsp; This ensures that when VMs/files are recovered, they come up in a consistent state.&nbsp; A protection domain can have multiple consistency groups.</li>
</ul>

<div data-type="note" class="note" id="pro-tip-yGioHpuoTz"><h6>Note</h6>
<h5>Pro tip</h5>

<p>Group dependent application or service VMs in a consistency group to ensure they are recovered in a consistent state (e.g. App and DB)</p>
</div>

<h6>Replication Schedule</h6>

<ul>
	<li>Key Role: Snapshot and replication schedule</li>
	<li>Description: Snapshot and replication schedule for VMs in a particular PD and CG</li>
</ul>

<div data-type="note" class="note" id="pro-tip-Y2iwfquqTR"><h6>Note</h6>
<h5>Pro tip</h5>

<p>The snapshot schedule should be equal to your desired RPO</p>
</div>

<h6>Retention Policy</h6>

<ul>
	<li>Key Role: Number of local and remote snapshots to keep</li>
	<li>Description: The retention policy defines the number of local and remote snapshots to retain.&nbsp; NOTE: A remote site must be configured for a remote retention/replication policy to be configured.</li>
</ul>

<div data-type="note" class="note" id="pro-tip-EaiQIYuoTo"><h6>Note</h6>
<h5>Pro tip</h5>

<p>The retention policy should equal the number of restore points required per VM/file</p>
</div>

<p>The following figure shows a logical representation of the relationship between a PD, CG, and VM/Files for a single site:</p>

<figure id="id-EGtOhYuoTo"><img alt="DR Construct Mapping" class="iimagesv2dr_2png" src="imagesv2/dr_2.png">
<figcaption><span class="label">Figure 11-37. </span>DR Construct Mapping</figcaption>
</figure>

<p>It’s important to mention that a full container can be protected for simplicity.  However, the platform provides the ability to protect down to the granularity of a single VM and/or file level.</p>
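
<p>One way to keep the relationship between these constructs straight is to model them as plain data types. The classes and field names below are illustrative assumptions, not the Nutanix API; they simply encode the rules stated above (a PD contains one or more CGs, and remote retention requires a configured remote site).</p>

```python
from dataclasses import dataclass, field


@dataclass
class ConsistencyGroup:
    name: str
    entities: list  # VMs and/or files snapshotted together (crash-consistent)


@dataclass
class ProtectionDomain:
    name: str
    consistency_groups: list = field(default_factory=list)
    snapshot_interval_min: int = 60   # hypothetical default; should match the RPO
    local_retention: int = 1          # number of local snapshots to keep
    remote_retention: dict = field(default_factory=dict)  # remote site -> count

    def protected_entities(self):
        return [e for cg in self.consistency_groups for e in cg.entities]

    def validate(self, configured_remote_sites):
        """A remote retention policy needs its remote site to be configured."""
        missing = set(self.remote_retention) - set(configured_remote_sites)
        if missing:
            raise ValueError(f"remote site(s) not configured: {sorted(missing)}")
```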

<h5>Replication Lifecycle</h5>

<p>Nutanix replication leverages the Cerebro service mentioned above.&nbsp; The Cerebro service is broken into a “Cerebro Master”, which is a dynamically elected CVM, and Cerebro Slaves, which run on every CVM.&nbsp; In the event where the CVM acting as the “Cerebro Master” fails, a new “Master” is elected.</p>

<p>The Cerebro Master is responsible for managing task delegation to the local Cerebro Slaves as well as coordinating with remote Cerebro Master(s) when remote replication is occurring.</p>

<p>During a replication, the Cerebro Master will figure out which data needs to be replicated, and delegate the replication tasks to the Cerebro Slaves which will then tell Stargate which data to replicate and to where.</p>
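
<p>The delegation step can be pictured as the master splitting the list of changed extents across the available slaves. The round-robin split below is an assumption made for the sake of a short example; the text does not describe how Cerebro actually distributes the work.</p>

```python
from itertools import cycle


def delegate_replication(changed_extents, slaves):
    """Assign each extent that needs replication to a slave (round-robin sketch)."""
    if not slaves:
        raise ValueError("at least one Cerebro slave is required")
    plan = {slave: [] for slave in slaves}
    for extent, slave in zip(changed_extents, cycle(slaves)):
        plan[slave].append(extent)
    return plan
```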

<p>
  Replicated data is protected at multiple layers throughout the process.  Extent reads on the source are checksummed to ensure consistency for source data (similar to how any DSF read occurs) and the new extent(s) are checksummed at the target (similar to any DSF write).  TCP provides consistency on the network layer.
</p>

<p>The following figure shows a representation of this architecture:</p>

<figure id="id-d0tPFVu8TW"><img alt="Replication Architecture" class="iimagesv2dr_3png" src="imagesv2/dr_3.png">
<figcaption><span class="label">Figure 11-38. </span>Replication Architecture</figcaption>
</figure>

<p>It is also possible to configure a remote site with a proxy which will be used as a bridgehead for all coordination and replication traffic coming from a cluster.</p>

<div data-type="note" class="note" id="pro-tip-xkiRf1uDTz"><h6>Note</h6>
<h5>Pro tip</h5>

<p>When using a remote site configured with a proxy, always utilize the cluster IP as that will always be hosted by the Prism Leader and available, even if CVM(s) go down.</p>
</div>

<p>The following figure shows a representation of the replication architecture using a proxy:</p>

<figure id="id-xbtMc1uDTz"><img alt="Replication Architecture - Proxy" class="iimagesv2dr_4png" src="imagesv2/dr_4.png">
<figcaption><span class="label">Figure 11-39. </span>Replication Architecture - Proxy</figcaption>
</figure>

<p>In certain scenarios, it is also possible to configure a remote site using a SSH tunnel where all traffic will flow between two CVMs.</p>

<div data-type="note" class="note" id="note-kpirUduBT3"><h6>Note</h6>
<h5>Note</h5>
<p>This should only be used for non-production scenarios and the cluster IPs should be used to ensure availability.</p>
</div>

<p>The following figure shows a representation of the replication architecture using a SSH tunnel:</p>

<figure id="id-k1tvCduBT3"><img alt="Replication Architecture - SSH Tunnel" class="iimagesv2dr_5png" src="imagesv2/dr_5.png">
<figcaption><span class="label">Figure 11-40. </span>Replication Architecture - SSH Tunnel</figcaption>
</figure>

<h5>Global Deduplication</h5>

<p>As explained in the Elastic Deduplication Engine section above, DSF has the ability to deduplicate data by just updating metadata pointers. The same concept is applied to the DR and replication feature.&nbsp; Before sending data over the wire, DSF will query the remote site and check whether or not the fingerprint(s) already exist on the target (meaning the data already exists).&nbsp; If so, no data will be shipped over the wire and only a metadata update will occur. For data which doesn’t exist on the target, the data will be compressed and sent to the target site.&nbsp; At this point, the data exists on both sites and is usable for deduplication.</p>
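
<p>The "check the fingerprint first, ship only what is missing" flow can be sketched as follows. The target is modeled as a plain dictionary keyed by SHA-1 fingerprint, and zlib stands in for whatever compression is applied on the wire; both are assumptions for illustration and not the actual replication protocol.</p>

```python
import hashlib
import zlib


def replicate_chunks(chunks, target_store):
    """Send only chunks whose fingerprint the target does not already hold.

    target_store: dict mapping fingerprint -> compressed payload (toy model).
    Returns counters for shipped vs. metadata-only chunks and bytes sent.
    """
    stats = {"shipped": 0, "metadata_only": 0, "bytes_sent": 0}
    for chunk in chunks:
        fp = hashlib.sha1(chunk).hexdigest()
        if fp in target_store:
            stats["metadata_only"] += 1  # data already on target: pointer update only
            continue
        payload = zlib.compress(chunk)   # compress before sending over the wire
        target_store[fp] = payload
        stats["shipped"] += 1
        stats["bytes_sent"] += len(payload)
    return stats
```

<p>Running the same replication a second time ships nothing, since every fingerprint is already present on the target.</p>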

<p>The following figure shows an example three site deployment where each site contains one or more protection domains (PD):</p>

<figure id="id-1nteRSkudTQ"><img alt="Replication Deduplication" class="iimagesv2dr_6png" src="imagesv2/dr_6.png">
<figcaption><span class="label">Figure 11-41. </span>Replication Deduplication</figcaption>
</figure>

<div data-type="note" class="note" id="note-3Nim3HyulTb"><h6>Note</h6>
<h5>Note</h5>

<p>Fingerprinting must be enabled on the source and target container / vstore for replication deduplication to occur.</p>
</div>
</section>

<section data-type="sect1" id="cloud-connect-yAI5T1Td">
<h3>Cloud Connect</h3>

<p>Building upon the native DR / replication capabilities of DSF, Cloud Connect extends this capability into cloud providers (currently Amazon Web Services, Microsoft Azure).&nbsp; NOTE: This feature is currently limited to just backup / replication.</p>

<p>Much like creating a remote site for native DR / replication, you simply create a “cloud remote site”.&nbsp; When a new cloud remote site is created, Nutanix will automatically spin up a single node Nutanix cluster in EC2 (currently m1.xlarge) or Azure Virtual Machines (currently D3) to be used as the endpoint.</p>

<p>The cloud instance is based upon the same Acropolis code-base leveraged for locally running clusters.&nbsp; This means that all of the native replication capabilities (e.g., global deduplication, delta based replications, etc.) can be leveraged.</p>

<p>In the case where multiple Nutanix clusters are leveraging Cloud Connect, they can either A) share the same instance running in the region, or B) spin up a new instance.</p>

<p>
  Storage for cloud instances is done using a "cloud disk" which is a logical disk backed by S3 (AWS) or BlobStore (Azure).  Data is stored as the usual egroups which are files on the object stores.
</p>

<p>The following figure shows a logical representation of a “remote site” used for Cloud Connect:</p>

<figure id="id-lNtnF1TATe"><img alt="Cloud Connect - Region" class="iimagesv2cloudconn_1png" src="imagesv2/cloudconn_1.png">
<figcaption><span class="label">Figure 11-42. </span>Cloud Connect Region</figcaption>
</figure>

<p>Since a cloud based remote site is similar to any other Nutanix remote site, a cluster can replicate to multiple regions if higher availability is required (e.g., data availability in the case of a full region outage):</p>

<figure id="id-Q4tgfVT5TX"><img alt="Cloud Connect - Multi-region" class="iimagesv2cloudconn_2png" src="imagesv2/cloudconn_2.png">
<figcaption><span class="label">Figure 11-43. </span>Cloud Connect Multi-region</figcaption>
</figure>

<p>The same replication / retention policies are leveraged for data replicated using Cloud Connect.&nbsp; As data / snapshots become stale, or expire, the cloud cluster will clean up data as necessary.</p>

<p>If replication isn’t frequently occurring (e.g., daily or weekly), the platform can be configured to power up the cloud instance(s) prior to a scheduled replication and down after a replication has completed.</p>

<p>Data that is replicated to any cloud region can also be pulled down and restored to any existing, or newly created Nutanix cluster which has the cloud remote site(s) configured:</p>

<figure id="id-OQtXU1TjTJ"><img alt="Cloud Connect - Restore" class="iimagesv2cloudconn_3png" src="imagesv2/cloudconn_3.png">
<figcaption><span class="label">Figure 11-44. </span>Cloud Connect - Restore</figcaption>
</figure>
</section>

<section data-type="sect1" id="metro-availability-eqInSNTE">
<h3>Metro Availability</h3>

<p>Nutanix provides native “stretch clustering” capabilities which allow for a compute and storage cluster to span multiple physical sites.&nbsp; In these deployments, the compute cluster spans two locations and has access to a shared pool of storage.</p>

<p>This expands the VM HA domain from a single site to two sites, providing a near-zero RTO and an RPO of 0.</p>

<p>In this deployment, each site has its own Nutanix cluster, however the containers are “stretched” by synchronously replicating to the remote site before acknowledging writes.</p>

<p>The following figure shows a high-level design of what this architecture looks like:</p>

<figure id="id-lNt4HySATe"><img alt="Metro Availability - Normal State" class="iimagesv2metro_1png" src="imagesv2/metro_1.png">
<figcaption><span class="label">Figure 11-45. </span>Metro Availability - Normal State</figcaption>
</figure>
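
<p>The key property of the stretched container is that a write is only acknowledged after it has been committed on both sites. The toy model below shows just that ordering; the <code>Site</code> class and <code>metro_write</code> function are hypothetical, and failure handling (site or link loss) is deliberately left out.</p>

```python
class Site:
    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def write(self, key, data):
        self.blocks[key] = data


def metro_write(local, remote, key, data):
    """Commit locally, replicate synchronously, and only then acknowledge."""
    local.write(key, data)
    remote.write(key, data)  # must complete before the write is acknowledged
    return "ack"
```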

<p>In the event of a site failure, an HA event will occur where the VMs can be restarted on the other site.</p>

<p>The following figure shows an example site failure:</p>

<figure id="id-J2tMfaS9T9"><img alt="Metro Availability - Site Failure" class="iimagesv2metro_2png" src="imagesv2/metro_2.png">
<figcaption><span class="label">Figure 11-46. </span>Metro Availability - Site Failure</figcaption>
</figure>

<p>In the event where there is a link failure between the two sites, each cluster will operate independently.&nbsp; Once the link comes back up, the sites will be re-synchronized (deltas-only) and synchronous replication will resume.</p>

<p>The following figure shows an example link failure:</p>

<figure id="id-OQt3IxSjTJ"><img alt="Metro Availability - Link Failure" class="iimagesv2metro_3png" src="imagesv2/metro_3.png">
<figcaption><span class="label">Figure 11-47. </span>Metro Availability - Link Failure</figcaption>
</figure>
</section>

<section data-type="sect1" id="volumes-api-wrIRHzTv">
<h3>Volumes API</h3>

<p>The Acropolis Volumes API exposes back-end DSF storage to external consumers (guest OS, physical hosts, containers, etc.) via iSCSI (today).</p>

<p>This allows any operating system to access DSF and leverage its storage capabilities.&nbsp; In this deployment scenario, the OS is talking directly to Nutanix bypassing any hypervisor.&nbsp;</p>

<p>Core use-cases for the Volumes API:</p>

<ul>
	<li>Shared Disks
	<ul>
		<li>Oracle RAC, Microsoft Failover Clustering, etc.</li>
	</ul>
	</li>
	<li>Disks as first-class entities
	<ul>
		<li>Where execution contexts are ephemeral and data is critical</li>
		<li>Containers, OpenStack, etc.</li>
	</ul>
	</li>
	<li>Guest-initiated iSCSI
	<ul>
		<li>Bare-metal consumers</li>
		<li>Exchange on vSphere (for Microsoft Support)</li>
	</ul>
	</li>
</ul>

<p>The following entities compose the Volumes API:</p>

<ul>
	<li><strong>Volume Group:</strong> iSCSI target and group of disk devices allowing for centralized management, snapshotting, and policy application</li>
	<li><strong>Disks:</strong> Storage devices in the Volume Group (seen as LUNs for the iSCSI target)</li>
	<li><strong>Attachment:</strong> Allowing a specified initiator IQN access to the volume group</li>
</ul>

<p>NOTE: On the backend, a VG’s disk is just a vDisk on DSF.</p>

<p>To use the Volumes API, the following process is leveraged:</p>

<ol>
	<li>Create new Volume Group</li>
	<li>Add disk(s) to Volume Group</li>
	<li>Attach an initiator IQN to the Volume Group</li>
</ol>

<div data-type="example" id="create-volume-group-bOiXc9HyTp">
<h5><span class="label">Example 11-1. </span>Create Volume Group</h5>

<p># Create VG</p>

<p class="codetext">vg.create &lt;VG Name&gt;</p>

<p># Add disk(s) to VG</p>

<p class="codetext">vg.disk_create &lt;VG Name&gt; container=&lt;CTR Name&gt; create_size=&lt;Disk size, e.g. 500G&gt;</p>

<p># Attach initiator IQN to VG</p>

<p class="codetext">vg.attach_external &lt;VG Name&gt; &lt;Initiator IQN&gt;</p>
</div>
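<p>The three-step workflow in Example 11-1 can be scripted. The following is a minimal sketch, not an official tool: the helper name <code>volume_group_commands</code> and the sample VG, container and IQN values are hypothetical, and it only builds the command strings shown above (it does not execute them).</p>

```python
def volume_group_commands(vg_name, container, disk_sizes, initiator_iqn):
    """Return the commands for the create / add disk / attach workflow, in order.

    Only the three commands shown in Example 11-1 are generated; all
    argument values are supplied by the caller.
    """
    # Step 1: create the Volume Group
    commands = [f"vg.create {vg_name}"]
    # Step 2: add one disk per requested size (e.g. "500G")
    for size in disk_sizes:
        commands.append(
            f"vg.disk_create {vg_name} container={container} create_size={size}"
        )
    # Step 3: allow the initiator IQN to access the Volume Group
    commands.append(f"vg.attach_external {vg_name} {initiator_iqn}")
    return commands


if __name__ == "__main__":
    # Hypothetical names, for illustration only
    for line in volume_group_commands(
        "vg-example", "ctr-example", ["500G", "500G"],
        "iqn.2016-01.com.example:host1",
    ):
        print(line)
```

<p>Generating the commands as plain strings keeps the sketch independent of how they are eventually run (interactively or via a script on the CVM).</p>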

<p>The following figure shows an example with a VM running on Nutanix, with its OS hosted on the normal Nutanix storage, mounting the volumes directly:</p>

<figure id="id-b8teU9HyTp"><img alt="Volume API - Example" class="iimagesv2volapi_1png" src="imagesv2/volapi_1.png">
<figcaption><span class="label">Figure 11-48. </span>Volume API - Example</figcaption>
</figure>

<p>In Windows deployments, iSCSI multi-pathing can be configured leveraging the Windows MPIO feature.&nbsp; It is recommended to leverage the ‘Failover only’ policy (default) to ensure vDisk ownership doesn’t change.</p>

<figure id="id-m2tkC1HoTN"><img alt="MPIO Example - Normal State" class="iimagesv2volapi_2png" src="imagesv2/volapi_2.png">
<figcaption><span class="label">Figure 11-49. </span>MPIO Example - Normal State</figcaption>
</figure>

<p>In the event there are multiple disk devices, each disk will have an active path to the local CVM:</p>

<figure id="id-vOtYuDH5Tx"><img alt="MPIO Example - Multi-disk" class="iimagesv2volapi_3png" src="imagesv2/volapi_3.png">
<figcaption><span class="label">Figure 11-50. </span>MPIO Example - Multi-disk</figcaption>
</figure>

<p>In the event where the active CVM goes down, another path would become active and I/Os would resume:</p>

<figure id="id-yrt8SWHoTz"><img alt="MPIO Example - Path Failure" class="iimagesv2volapi_4png" src="imagesv2/volapi_4.png">
<figcaption><span class="label">Figure 11-51. </span>MPIO Example - Path Failure</figcaption>
</figure>

<p>In our testing, we’ve seen MPIO failover take ~15-16 seconds to complete, which is well within the Windows disk I/O timeout (default is 60 seconds).</p>
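<p>The failover behavior above can be modeled in a few lines. This is a simplified sketch of a ‘Failover only’ style policy, not the Windows MPIO implementation: the function names and CVM labels are hypothetical, and the 60 second default is the value quoted in the text.</p>

```python
WINDOWS_DISK_TIMEOUT_S = 60  # default Windows disk I/O timeout quoted in the text


def active_path(paths, healthy):
    """Return the first healthy path in preference order, or None.

    paths   -- CVM names in preference order (local CVM first)
    healthy -- set of CVM names that are currently reachable
    Only one path carries I/O at a time; the others are standby.
    """
    for cvm in paths:
        if cvm in healthy:
            return cvm
    return None


def survives_failover(failover_seconds, timeout_s=WINDOWS_DISK_TIMEOUT_S):
    """True if a path failover finishes before the disk I/O timeout expires."""
    return failover_seconds < timeout_s
```

<p>With the observed ~15-16 second failover, <code>survives_failover(16)</code> evaluates to true under the default timeout, which matches the statement that I/Os resume rather than fail.</p>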

<p>If RAID or LVM is desired, the attached disk devices can be put into a dynamic or logical disk:</p>

<figure id="id-Yat4tjHqTR"><img alt="RAID / LVM Example - Single-path" class="iimagesv2volapi_5png" src="imagesv2/volapi_5.png">
<figcaption><span class="label">Figure 11-52. </span>RAID / LVM Example - Single-path</figcaption>
</figure>

<p>In the event where the local CVM is under heavy utilization, it is possible to have active paths to other CVMs.&nbsp; This balances the I/O load across multiple CVMs; however, the primary I/O takes a performance hit because it must traverse the network:</p>

<figure id="id-gktniNHoTP"><img alt="RAID / LVM Example - Multi-path" class="iimagesv2volapi_6png" src="imagesv2/volapi_6.png">
<figcaption><span class="label">Figure 11-53. </span>RAID / LVM Example - Multi-path</figcaption>
</figure>
</section>

<section data-type="sect1" id="networking-and-io-YWI5FoTZ">
<h3>Networking and I/O</h3>

<p>For a visual explanation, you can watch the following video: <a href="https://youtu.be/Bz37Eu_TgxY">LINK</a></p>

<div class="video-container"><iframe allowfullscreen frameborder="0" src="https://www.youtube.com/embed//Bz37Eu_TgxY"></iframe></div>

<p>The Nutanix platform does not leverage any backplane for inter-node communication and relies only on a standard 10GbE network.&nbsp; All storage I/O for VMs running on a Nutanix node is handled by the hypervisor on a dedicated private network.&nbsp; The I/O request will be handled by the hypervisor, which will then forward the request to the private IP on the local CVM.&nbsp; The CVM will then perform the remote replication with other Nutanix nodes using its external IP over the public 10GbE network. Read requests will be served completely locally in most cases and never touch the 10GbE network. This means that the only traffic touching the public 10GbE network will be DSF remote replication traffic and VM network I/O.&nbsp; There will, however, be cases where the CVM will forward requests to other CVMs in the cluster, such as when a CVM is down or the data is remote.&nbsp; Also, cluster-wide tasks, such as disk balancing, will temporarily generate I/O on the 10GbE network.</p>

<p>The following figure shows an example of how the VM’s I/O path interacts with the private and public 10GbE network:</p>

<figure class="large" id="id-Q4tBHoF5TX"><img alt="DSF Networking" class="iimagesv2net_iopng" src="imagesv2/net_io.png">
<figcaption><span class="label">Figure 11-54. </span>DSF Networking</figcaption>
</figure>
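<p>As a rough illustration of the I/O path described in this section, the sketch below classifies which networks a single storage I/O touches. It is a simplified model under stated assumptions: the function name is hypothetical, the CVM-down case and background tasks such as disk balancing are not modeled, and every write is assumed to be replicated remotely.</p>

```python
def networks_touched(op, data_is_local=True):
    """Return the networks a single VM storage I/O touches.

    op            -- "read" or "write"
    data_is_local -- whether the requested data sits on the local node

    Every I/O first goes from the hypervisor to the local CVM over the
    private network. Writes add remote replication over the public
    10GbE network; reads only use it when the data is remote.
    """
    if op not in ("read", "write"):
        raise ValueError(f"unknown operation: {op!r}")
    touched = ["private"]
    if op == "write" or not data_is_local:
        touched.append("public")
    return touched
```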

<p>&nbsp;</p>
</section>

<section data-type="sect1" id="data-locality-qeI8t8Td">
<h3>Data Locality</h3>

<p>For a visual explanation, you can watch the following video: <a href="https://youtu.be/ocLD5nBbUTU">LINK</a></p>

<div class="video-container"><iframe allowfullscreen frameborder="0" src="https://www.youtube.com/embed//ocLD5nBbUTU"></iframe></div>

<p>Being a converged (compute+storage) platform, I/O and data locality are critical to cluster and VM performance with Nutanix.&nbsp; As explained above in the I/O path, all read/write IOs are served by the local Controller VM (CVM) which is on each hypervisor adjacent to normal VMs.&nbsp; A VM’s data is served locally from the CVM and sits on local disks under the CVM’s control.&nbsp; When a VM is moved from one hypervisor node to another (or during an HA event), the newly migrated VM’s data will be served by the now local CVM. When reading old data (stored on the now remote node/CVM), the I/O will be forwarded by the local CVM to the remote CVM.&nbsp; All write I/Os will occur locally right away.&nbsp; DSF will detect the I/Os are occurring from a different node and will migrate the data locally in the background, allowing for all read I/Os to now be served locally.&nbsp; The data will only be migrated on a read so as not to flood the network.</p>

<p>
  Data locality occurs in two main flavors:
</p>

<ul>
  <li>
    Cache Locality
    <ul>
      <li>
        vDisk data stored locally in the Unified Cache.  vDisk extent(s) may be remote to the node.
      </li>
    </ul>
  </li>
  <li>
    Extent Locality
    <ul>
      <li>
        vDisk extents local on the same node as the VM
      </li>
    </ul>
  </li>
</ul>

<p>The following figure shows an example of how data will “follow” the VM as it moves between hypervisor nodes:</p>

<figure class="large" id="id-J2trh3t9T9"><img alt="Data Locality" class="iimagesv2data_locality2png" src="imagesv2/data_locality2.png">
<figcaption><span class="label">Figure 11-55. </span>Data Locality</figcaption>
</figure>

<div data-type="note" class="note" id="thresholds-for-data-migration-Oxi4FOtjTJ"><h6>Note</h6>
<h5>Thresholds for Data Migration</h5>

<p>
  Cache locality occurs in real time and will be determined based upon vDisk ownership.  When a vDisk / VM moves from one node to another the "ownership" of those vDisk(s) will transfer to the now local CVM.  Once the ownership has transferred the data can be cached locally in the Unified Cache.  In the interim the cache will be wherever the ownership is held (the now remote host).  The previously hosting Stargate will relinquish the vDisk token when it sees remote I/Os for 300+ seconds, at which point the token will be taken by the local Stargate.  Cache coherence is enforced as ownership is required to cache the vDisk data.
</p>

<p>Extent locality is a sampled operation and an extent group will be migrated when the following occurs:
"3 touches for random or 10 touches for sequential within a 10 minute window where multiple reads every 10 second sampling count as a single touch".
</p>
</div>
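<p>The extent locality rule quoted above can be sketched as a touch counter. This is an illustrative model only, not the Stargate implementation: the names are hypothetical, sampling intervals are assumed to be aligned to fixed 10 second buckets, and the window is assumed to be the trailing 10 minutes.</p>

```python
SAMPLE_SECONDS = 10    # reads inside one sampling interval count as one touch
WINDOW_SECONDS = 600   # 10 minute window
TOUCH_THRESHOLD = {"random": 3, "sequential": 10}


def count_touches(read_times, now):
    """Count touches in the trailing window ending at `now`.

    read_times -- timestamps (seconds) of remote reads of one extent group
    Multiple reads that fall in the same 10 second bucket are one touch.
    """
    window_start = now - WINDOW_SECONDS
    buckets = {
        int(t // SAMPLE_SECONDS)
        for t in read_times
        if window_start < t <= now
    }
    return len(buckets)


def should_migrate(read_times, now, pattern):
    """True once the touch count reaches the threshold for the I/O pattern."""
    return count_touches(read_times, now) >= TOUCH_THRESHOLD[pattern]
```

<p>For example, three reads issued within the same 10 second interval count as a single touch and do not trigger a migration, while reads spread over three separate intervals do for a random pattern.</p>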

</section>

<section data-type="sect1" id="shadow-clones-gnIEf8T1">
<h3>Shadow Clones</h3>

<p>For a visual explanation, you can watch the following video: <a href="https://youtu.be/oqfFDMYQFJg">LINK</a></p>

<div class="video-container"><iframe allowfullscreen frameborder="0" src="https://www.youtube.com/embed//oqfFDMYQFJg"></iframe></div>

<p>The Acropolis Distributed Storage Fabric has a feature called ‘Shadow Clones’, which allows for distributed caching of particular vDisks or VM data which is in a ‘multi-reader’ scenario.&nbsp; A great example of this is a VDI deployment, where many ‘linked clones’ will be forwarding read requests to a central master or ‘Base VM’.&nbsp; In the case of VMware View, this is called the replica disk and is read by all linked clones, and in XenDesktop, this is called the MCS Master VM.&nbsp; This will also work in any other multi-reader scenario (e.g., deployment servers, repositories, etc.). Data or I/O locality is critical for the highest possible VM performance and is a key construct of DSF.&nbsp;</p>

<p>With Shadow Clones, DSF will monitor vDisk access trends similar to what it does for data locality.&nbsp; However, in the case there are requests occurring from more than two remote CVMs (as well as the local CVM), and all of the requests are read I/O, the vDisk will be marked as immutable.&nbsp; Once the disk has been marked as immutable, the vDisk can then be cached locally by each CVM making read requests to it (aka Shadow Clones of the base vDisk). This will allow VMs on each node to read the Base VM’s vDisk locally. In the case of VDI, this means the replica disk can be cached by each node and all read requests for the base will be served locally.&nbsp; NOTE:&nbsp; The data will only be migrated on a read so as not to flood the network and to allow for efficient cache utilization.&nbsp; In the case where the Base VM is modified, the Shadow Clones will be dropped and the process will start over.&nbsp; Shadow clones are enabled by default (as of 4.0.2) and can be enabled/disabled using the following NCLI command: ncli cluster edit-params enable-shadow-clones=&lt;true/false&gt;.</p>

<p>The following figure shows an example of how Shadow Clones work and allow for distributed caching:</p>

<figure class="large" id="id-BAtyFEfJTv"><img alt="Shadow Clones" class="iimagesv2shadow_clonepng" src="imagesv2/shadow_clone.png">
<figcaption><span class="label">Figure 11-56. </span>Shadow Clones</figcaption>
</figure>
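<p>The condition under which a vDisk is marked immutable can be expressed as a small predicate. This is a sketch of the rule as stated in this section (more than two remote CVMs plus the local CVM, all read I/O); the function name and the request representation are hypothetical and not part of DSF.</p>

```python
def should_mark_immutable(requests, local_cvm):
    """Decide whether a vDisk qualifies for Shadow Clones.

    requests  -- iterable of (cvm_name, op) tuples, op is "read" or "write"
    local_cvm -- name of the CVM that is local to the vDisk

    The vDisk qualifies only if every request is a read, the local CVM
    is among the readers, and more than two remote CVMs are reading.
    """
    requests = list(requests)
    if not requests:
        return False
    # Any write disqualifies the vDisk
    if any(op != "read" for _, op in requests):
        return False
    sources = {cvm for cvm, _ in requests}
    remote_sources = sources - {local_cvm}
    return local_cvm in sources and len(remote_sources) > 2
```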
</section>

<section data-type="sect1" id="storage-layers-and-monitoring-EqIjiATz">
<h3>Storage Layers and Monitoring</h3>

<p>The Nutanix platform monitors storage at multiple layers throughout the stack, ranging from the VM/Guest OS all the way down to the physical disk devices.&nbsp; Knowing the various tiers and how they relate is important when monitoring the solution and gives you full visibility into how the ops relate. The following figure shows the layers at which operations are monitored and their relative granularity, each of which is explained below:</p>

<figure id="id-Q4tMubi5TX"><img alt="Storage Layers" class="iimagesv2storage_layerspng" src="imagesv2/storage_layers.png">
<figcaption><span class="label">Figure 11-57. </span>Storage Layers</figcaption>
</figure>

<p>&nbsp;</p>

<h5>Virtual Machine Layer</h5>

<ul>
	<li>Key Role: Metrics reported by the hypervisor for the VM</li>
	<li>Description: Virtual Machine or guest level metrics are pulled directly from the hypervisor and represent the performance the VM is seeing; they are indicative of the I/O performance the application is seeing.</li>
	<li>When to use: When troubleshooting or looking for VM level detail</li>
</ul>

<h5>Hypervisor Layer</h5>

<ul>
	<li>Key Role: Metrics reported by the Hypervisor(s)</li>
	<li>Description: Hypervisor level metrics are pulled directly from the hypervisor and represent the most accurate metrics the hypervisor(s) are seeing.&nbsp; This data can be viewed for one or more hypervisor node(s) or the aggregate cluster.&nbsp; This layer will provide the most accurate data in terms of what performance the platform is seeing and should be leveraged in most cases.&nbsp; In certain scenarios the hypervisor may combine or split operations coming from VMs, which can explain differences between the metrics reported by the VM and the hypervisor.&nbsp; These numbers will also include cache hits served by the Nutanix CVMs.</li>
	<li>When to use: Most common cases as this will provide the most detailed and valuable metrics.</li>
</ul>

<h5>Controller Layer</h5>

<ul>
	<li>Key Role: Metrics reported by the Nutanix Controller(s)</li>
	<li>Description: Controller level metrics are pulled directly from the Nutanix Controller VMs (e.g.,&nbsp;Stargate 2009 page) and represent what the Nutanix front-end is seeing from NFS/SMB/iSCSI or any back-end operations (e.g., ILM, disk balancing, etc.).&nbsp; This data can be viewed for one or more Controller VM(s) or the aggregate cluster.&nbsp; The metrics seen by the Controller Layer should normally match those seen by the hypervisor layer; however, they will also include any backend operations (e.g., ILM, disk balancing). These numbers will also include cache hits served by memory.&nbsp; In certain cases, metrics like IOPS might not match, as the NFS / SMB / iSCSI client might split a large I/O into multiple smaller I/Os.&nbsp; However, metrics like bandwidth should match.</li>
	<li>When to use: Similar to the hypervisor layer, can be used to show how much backend operation is taking place.</li>
</ul>

<h5>Disk Layer</h5>

<ul>
	<li>Key Role: Metrics reported by the Disk Device(s)</li>
	<li>Description: Disk level metrics are pulled directly from the physical disk devices (via the CVM) and represent what the back-end is seeing.&nbsp; This includes data hitting the OpLog or Extent Store where an I/O is performed on the disk.&nbsp; This data can be viewed for one or more disk(s), the disk(s) for a particular node, or the aggregate disks in the cluster.&nbsp; In common cases, it is expected that the disk ops should match the number of incoming writes as well as reads not served from the memory portion of the cache.&nbsp; Any reads being served by the memory portion of the cache will not be counted here as the op is not hitting the disk device.</li>
	<li>When to use: When looking to see how many ops are served from cache or hitting the disks.</li>
</ul>

<div data-type="note" class="note" id="metric-and-stat-retention-yGiRUBioTz"><h6>Note</h6>
<h5>Metric and Stat Retention</h5>

<p>Metrics and time series data are stored locally for 90 days in Prism Element.  For Prism Central and Insights, data can be stored indefinitely (assuming capacity is available).</p>
</div>
</section>
<section data-type="sect1">
<h3>Nutanix Guest Tools (NGT)</h3>
<p>
  Nutanix Guest Tools (NGT) is a software based in-guest agent framework which enables advanced VM management functionality through the Nutanix Platform.
</p>

<p>
  The solution is composed of the NGT installer which is installed on the VMs and the Guest Tools Framework which is used for coordination between the agent and Nutanix platform.
</p>

<p>
  The NGT installer contains the following components:
</p>

<ul>
  <li>
    Guest Agent Service
  </li>
  <li>
    Self-service Restore (SSR) aka File-level Restore (FLR) CLI
  </li>
  <li>
    VM Mobility Drivers (VirtIO drivers for AHV)
  </li>
  <li>
    VSS Agent and Hardware Provider for Windows VMs
  </li>
  <li>
    App Consistent snapshot support for Linux VMs (via scripts to quiesce)
  </li>
</ul>

<p>
  This framework is composed of a few high-level components:
</p>

<ul>
  <li>
    Guest Tools Service
    <ul>
      <li>
        Gateway between the Acropolis and Nutanix services and the Guest Agent.  Distributed across CVMs within the cluster with an elected NGT Master which runs on the current Prism Leader (hosting cluster vIP)
      </li>
    </ul>
  </li>
  <li>
    Guest Agent
    <ul>
      <li>
        Agent and associated services deployed in the VM's OS as part of the NGT installation process.  Handles any local functions (e.g. VSS, Self-service Restore (SSR), etc.) and interacts with the Guest Tools Service.
      </li>
    </ul>
  </li>
</ul>

<p>
  The figure shows the high-level mapping of the components:
</p>
<figure><img alt="Guest Tools Mapping" src="imagesv2/ngt_1.png">
<figcaption><span class="label">Figure. </span>Guest Tools Mapping</figcaption>
</figure>

<h5>Guest Tools Service</h5>
<p>
  The Guest Tools Service is composed of two main roles:
</p>

<ul>
  <li>
    NGT Master
  <ul>
    <li>
      Handles requests coming from NGT Proxy and interfaces with Acropolis components.  A single NGT Master is dynamically elected per cluster; in the event the current master fails a new one will be elected.  The service listens internally on port 2073.
    </li>
  </ul>
  </li>
  <li>
    NGT Proxy
  <ul>
    <li>
      Runs on every CVM and will forward requests to the NGT Master to perform the desired activity.  The CVM currently acting as the Prism Leader (hosting the VIP) will be the active CVM handling communication from the Guest Agent. Listens externally on port 2074.
    </li>
  </ul>
  </li>
</ul>

<p>
  The figure shows the high-level mapping of the roles:
</p>
<figure><img alt="Guest Tools Service" src="imagesv2/ngt_2.png">
<figcaption><span class="label">Figure. </span>Guest Tools Service</figcaption>
</figure>

<h5>Guest Agent</h5>
<p>
  The Guest Agent is composed of the following high-level components as mentioned prior:
</p>

<figure><img alt="Guest Agent" src="imagesv2/ngt_3.png">
<figcaption><span class="label">Figure. </span>Guest Agent</figcaption>
</figure>


<h5>Communication and Security</h5>
<p>The Guest Agent Service communicates with Guest Tools Service via the Nutanix Cluster IP using SSL.  For deployments where the Nutanix cluster components and UVMs are on a different network (hopefully all), ensure that the following are possible:</p>
<ul>
  <li>
    Ensure routed communication from UVM network(s) to Cluster IP, OR
  </li>
  <li>
    Create a firewall rule (and associated NAT) from UVM network(s) allowing communication with the Cluster IP on port 2074 (preferred)
  </li>
</ul>
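<p>Reachability of the Cluster IP on port 2074 can be verified from a UVM before installing NGT. The following is a minimal sketch that only tests whether a TCP connection can be opened; it does not perform the SSL handshake or speak the NGT protocol. The function name is hypothetical and the Cluster IP must be supplied by the caller.</p>

```python
import socket

NGT_PROXY_PORT = 2074  # external NGT Proxy port quoted in the text


def can_reach(host, port=NGT_PROXY_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    A False result means the connection was refused, timed out, or the
    host could not be resolved or routed to.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

<p>A false result from a UVM typically points at a missing route or firewall rule between the UVM network and the Cluster IP.</p>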

<p>
  The Guest Tools Service acts as a Certificate Authority (CA) and is responsible for generating certificate pairs for each NGT enabled UVM.  This certificate is embedded into the ISO which is configured for the UVM and used as part of the NGT deployment process.  These certificates are installed inside the UVM as part of the installation process.
</p>

<h5>NGT Agent Installation</h5>
<p>
  NGT Agent installation can be performed via Prism or CLI/Scripts (ncli/REST/PowerShell).
</p>

<p>
  To install NGT via Prism, navigate to the 'VM' page, select a VM to install NGT on and click 'Enable NGT':
</p>
<figure><img alt="Enable NGT for VM" src="imagesv2/Prism/ngt/ngt_deploy_1.png">
<figcaption><span class="label">Figure. </span>Enable NGT for VM</figcaption>
</figure>

<p>
  Click 'Yes' at the prompt to continue with NGT installation:
</p>
<figure><img alt="Enable NGT Prompt" src="imagesv2/Prism/ngt/ngt_deploy_2.png">
<figcaption><span class="label">Figure. </span>Enable NGT Prompt</figcaption>
</figure>

<p>
  The VM must have a CD-ROM as the generated installer containing the software and unique certificate will be mounted there as shown:
</p>
<figure><img alt="Enable NGT - CD-ROM" src="imagesv2/Prism/ngt/ngt_deploy_3a.png">
<figcaption><span class="label">Figure. </span>Enable NGT - CD-ROM</figcaption>
</figure>

<p>
  The NGT installer CD-ROM will be visible in the OS:
</p>
<figure><img alt="Enable NGT - CD-ROM in OS" src="imagesv2/Prism/ngt/ngt_deploy_3b.png">
<figcaption><span class="label">Figure. </span>Enable NGT - CD-ROM in OS</figcaption>
</figure>

<p>
  Double click on the CD to begin the installation process.
</p>

<p>
  Follow the prompts and accept the licenses to complete the installation:
</p>
<figure><img alt="Enable NGT - Installer" src="imagesv2/Prism/ngt/ngt_deploy_4.png">
<figcaption><span class="label">Figure. </span>Enable NGT - Installer</figcaption>
</figure>

<p>
  As part of the installation process Python, PyWin and the Nutanix Mobility (cross-hypervisor compatibility) drivers will also be installed.
</p>

<p>
  After the installation has been completed, a reboot will be required.
</p>

<p>
  After successful installation and reboot, you will see the following items visible in 'Programs and Features':
</p>
<figure><img alt="Enable NGT - Installed Programs" src="imagesv2/Prism/ngt/ngt_deploy_5a.png">
<figcaption><span class="label">Figure. </span>Enable NGT - Installed Programs</figcaption>
</figure>

<p>
  Services for the NGT Agent and VSS Hardware Provider will also be running:
</p>
<figure><img alt="Enable NGT - Services" src="imagesv2/Prism/ngt/ngt_deploy_5b.png">
<figcaption><span class="label">Figure. </span>Enable NGT - Services</figcaption>
</figure>

<p>
  NGT is now installed and can be leveraged.
</p>

<div data-type="note" class="note"><h6>Note</h6>
<h5>Bulk NGT Deployment</h5>

<p>Rather than installing NGT on each individual VM, it is possible to embed and deploy NGT in your base image.</p>

<p>
  Use the following process to leverage NGT inside a base image:
</p>
<ol>
  <li>
    Install NGT on master VM and ensure communication
  </li>
  <li>
    Clone VMs from master VM
  </li>
  <li>
    Mount NGT ISO on each clone (required to get new certificate pair)
    <ul>
      <li>
        Example: ncli ngt mount vm-id=&lt;CLONE_ID&gt; OR via Prism
      </li>
    </ul>
    <ul>
      <li>
        Automated way coming soon :)
      </li>
    </ul>
  </li>
  <li>
    Power on clones
  </li>
</ol>

<p>
  When the cloned VM is booted it will detect the new NGT ISO and copy relevant configuration files and new certificates and will start communicating with the Guest Tools Service.
</p>
</div>
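<p>When many clones are created from the master VM, the per-clone mount step can be generated rather than typed by hand. This sketch only builds the <code>ncli ngt mount</code> command shown in the note for each clone ID; the helper name and the sample IDs are hypothetical, and the commands are not executed.</p>

```python
def ngt_mount_commands(clone_ids):
    """Return one 'ncli ngt mount' command per clone VM ID.

    Each clone needs the NGT ISO mounted so it receives its own
    certificate pair on first boot.
    """
    return [f"ncli ngt mount vm-id={vm_id}" for vm_id in clone_ids]


if __name__ == "__main__":
    # Hypothetical clone IDs, for illustration only
    for command in ngt_mount_commands(["clone-id-1", "clone-id-2"]):
        print(command)
```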

</section>

<section data-type="sect1">
<h3>File Services</h3>
<p>The File Services feature allows users to leverage the Nutanix platform as a highly available file server.  This allows for a single namespace where users can store home directories and files.</p>

<p>This feature is composed of a few high-level constructs:</p>

<ul>
  <li>
    File Server
    <ul>
      <li>
        High-level namespace.  Each file server will have its own set of File Services VMs (FSVM) which are deployed
      </li>
    </ul>
  </li>
  <li>
    Share
    <ul>
      <li>
        Share exposed to users.  A file server can have multiple shares (e.g. departmental shares, etc.)
      </li>
    </ul>
  </li>
  <li>
    Folder
    <ul>
      <li>
        Folders for file storage.  Folders are sharded across FSVMs
      </li>
    </ul>
  </li>
</ul>

<p>
  The figure shows the high-level mapping of the constructs:
</p>
<figure><img alt="File Services Mapping" src="imagesv2/fs_1.png">
<figcaption><span class="label">Figure. </span>File Services Mapping</figcaption>
</figure>

<p>
  The file services feature follows the same methodology for distribution as the Nutanix platform to ensure availability and scale.  A minimum of 3 FSVMs will be deployed as part of the File Server deployment.
</p>

<p>
  The figure shows a detailed view of the components:
</p>
<figure><img alt="File Services Detail" src="imagesv2/fs_2.png">
<figcaption><span class="label">Figure. </span>File Services Detail</figcaption>
</figure>

<div data-type="note" class="note"><h6>Note</h6>
<h5>Supported Protocols</h5>

<p>As of 4.6, SMB (up to version 2.1) is the only supported protocol for client communication with file services.</p>
</div>

<p>
  The File Services VMs run as agent VMs on the platform and are transparently deployed as part of the configuration process.
</p>

<p>
  The figure shows a detailed view of FSVMs on the Acropolis platform:
</p>
<figure><img alt="File Services Detail" src="imagesv2/fs_2b.png">
<figcaption><span class="label">Figure. </span>FSVM Deployment Arch</figcaption>
</figure>

<h5>Authentication and Authorization</h5>
<p>
  The File Services feature is fully integrated into Microsoft Active Directory (AD) and DNS.  This allows all of the secure and established authentication and authorization capabilities of AD to be leveraged.  All share permission, user, and group management is done using the traditional Windows MMC for file management.
</p>

<p>
  As part of the installation process the following AD / DNS objects will be created:
</p>

<ul>
  <li>
    AD Computer Account for File Server
  </li>
  <li>
    AD Service Principal Name (SPN) for File Server and each FSVM
  </li>
  <li>
    DNS entry for File Server pointing to all FSVM(s)
  </li>
  <li>
    DNS entry for each FSVM
  </li>
</ul>

<div data-type="note" class="note"><h6>Note</h6>
<h5>AD Privileges for File Server Creation</h5>

<p>A user account with domain admin or equivalent privileges must be used to deploy the File Services feature, as AD and DNS objects are created.</p>
</div>

<h5>High-Availability (HA)</h5>

<p>
  Each FSVM leverages the Acropolis Volumes API for its data storage which is accessed via in-guest iSCSI.  This allows any FSVM to connect to any iSCSI target in the event of a FSVM outage.
</p>

<p>
  The figure shows a high-level overview of the FSVM storage:
</p>
<figure><img alt="FSVM Storage" src="imagesv2/fs_3.png">
<figcaption><span class="label">Figure. </span>FSVM Storage</figcaption>
</figure>

<p>
  To provide for path availability DM-MPIO is leveraged within the FSVM which will have the active path set to the local CVM by default:
</p>
<figure><img alt="FSVM MPIO" src="imagesv2/fs_4a.png">
<figcaption><span class="label">Figure. </span>FSVM MPIO</figcaption>
</figure>

<p>
  In the event where the local CVM becomes unavailable (e.g. active path down), DM-MPIO will activate one of the failover paths to a remote CVM, which will then take over I/O.
</p>
<figure><img alt="FSVM MPIO Failover" src="imagesv2/fs_4b.png">
<figcaption><span class="label">Figure. </span>FSVM MPIO Failover</figcaption>
</figure>

<p>
  When the local CVM comes back and is healthy it will be marked as the active path to provide for local IO.
</p>

<p>
  In a normal operating environment each FSVM will be communicating with its own VG for data storage with passive connections to the others.  Each FSVM will have an IP which clients use to communicate with the FSVM as part of the DFS referral process.  Clients do not need to know each individual FSVM's IP as the DFS referral process will connect them to the correct IP hosting their folder(s).
</p>

<figure><img alt="FSVM Normal Operation" src="imagesv2/fs_5.png">
<figcaption><span class="label">Figure. </span>FSVM Normal Operation</figcaption>
</figure>

<p>
  In the event of a FSVM "failure" (e.g. maintenance, power off, etc.) the VG and IP of the failed FSVM will be taken over by another FSVM to ensure client availability.
</p>

<p>
  The figure shows the transfer of the failed FSVM's IP and VG:
</p>

<figure><img alt="FSVM Failure Scenario" src="imagesv2/fs_6.png">
<figcaption><span class="label">Figure. </span>FSVM Failure Scenario</figcaption>
</figure>

<p>
  When the failed FSVM comes back and is stable, it will re-take its IP and VG and continue to serve client IO.
</p>
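<p>The takeover and fail-back behavior described above can be modeled as follows. This is an illustrative sketch only: the function names are hypothetical, and choosing the least-loaded surviving FSVM as the takeover target is an assumption made for the example (the text only states that another FSVM takes over).</p>

```python
def fail(serving, failed):
    """Return a new map after `failed` goes down.

    serving -- dict mapping FSVM name to the list of (VG, IP) pairs it serves
    The failed FSVM's VG and IP move to the least-loaded survivor
    (an assumption of this sketch).
    """
    survivors = {fsvm: list(res) for fsvm, res in serving.items() if fsvm != failed}
    if not survivors:
        raise RuntimeError("no surviving FSVM to take over")
    target = min(survivors, key=lambda f: (len(survivors[f]), f))
    survivors[target].extend(serving[failed])
    return survivors


def recover(serving, recovered, home):
    """Return a new map after `recovered` is back and stable.

    home -- dict mapping each (VG, IP) pair to the FSVM that owns it
    The recovered FSVM re-takes exactly the pairs it owns.
    """
    new = {
        fsvm: [r for r in res if home[r] != recovered]
        for fsvm, res in serving.items()
    }
    new[recovered] = [r for r in home if home[r] == recovered]
    return new
```

<p>Keeping a separate <code>home</code> map makes fail-back straightforward: whichever FSVM temporarily served a VG and IP, they return to their original owner once it is healthy.</p>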

</section>
<!--END of Acroplis DSF Section -->
</section>

<section data-type="chapter" id="application-mobility-fabric-coming-soon-3jIvS1">
<h2>Application Mobility Fabric - coming soon!</h2>

<p>More coming soon!</p>
</section>

<section data-type="chapter" id="acropolis-hypervisor-PgIzH9">
<h2>Acropolis Hypervisor (AHV)</h2>

<section data-type="sect1" id="node-architecture-lkIms8h3">
<h3>Node Architecture</h3>

<p>In Acropolis Hypervisor deployments, the Controller VM (CVM) runs as a VM and disks are presented using PCI passthrough.&nbsp; This allows the full PCI controller (and attached devices) to be passed through directly to the CVM and bypass the hypervisor.&nbsp; Acropolis Hypervisor is based upon CentOS KVM.</p>

<figure id="id-ZptDuNsvHk"><img alt="Acropolis Hypervisor Node" class="iimagesv2acrop_nodepng" src="imagesv2/acrop_node.png">
<figcaption><span class="label">Figure 13-1. </span>Acropolis Hypervisor Node</figcaption>
</figure>

<p>The Acropolis Hypervisor is built upon the CentOS KVM foundation and extends its base functionality to include features like HA, live migration, etc.&nbsp;</p>

<p>Acropolis Hypervisor is validated as part of the Microsoft Server Virtualization Validation Program and is validated to run Microsoft OS and applications.</p>
</section>

<section data-type="sect1" id="kvm-architecture-M2I0uWHw">
<h3>KVM Architecture</h3>

<p>Within KVM there are a few main components:</p>

<ul>
	<li>KVM-kmod
	<ul>
		<li>KVM kernel module</li>
	</ul>
	</li>
	<li>Libvirtd
	<ul>
		<li>An API, daemon and management tool for managing KVM and QEMU.&nbsp; Communication between Acropolis and KVM / QEMU occurs through libvirtd.</li>
	</ul>
	</li>
	<li>Qemu-kvm
	<ul>
		<li>A machine emulator and virtualizer that runs in userspace for every Virtual Machine (domain).&nbsp; In the Acropolis Hypervisor it is used for hardware-assisted virtualization and VMs run as HVMs.</li>
	</ul>
	</li>
</ul>

<p>The following figure shows the relationship between the various components:</p>

<figure id="id-zmtoSEuxHl"><img alt="KVM Component Relationship" class="iimagesv2kvm_overviewpng" src="imagesv2/kvm_overview.png">
<figcaption><span class="label">Figure 13-2. </span>KVM Component Relationship</figcaption>
</figure>

<p>Communication between Acropolis and KVM occurs via Libvirt.&nbsp;</p>

<div data-type="note" class="note" id="processor-generation-compatibility-AmibF4u9Hx"><h6>Note</h6>
<h5>Processor generation compatibility</h5>

<p>Similar to VMware's Enhanced vMotion Compatibility (EVC), which allows VMs to move between different processor generations, Acropolis Hypervisor will determine the lowest processor generation in the cluster and constrain all QEMU domains to that level.  This allows mixing of processor generations within an AHV cluster and ensures the ability to live migrate between hosts.</p>
</div>
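<p>The baseline selection described in the note amounts to taking the minimum over the hosts' processor generations. The sketch below illustrates the idea only: the function name, host names and generation labels are hypothetical placeholders, and the ordering of generations is supplied by the caller rather than detected from hardware.</p>

```python
def cluster_cpu_baseline(host_generations, generation_order):
    """Return the lowest processor generation present in the cluster.

    host_generations -- dict mapping host name to its generation label
    generation_order -- list of labels from oldest to newest
    Raises KeyError if a host reports a label missing from the order.
    """
    rank = {label: index for index, label in enumerate(generation_order)}
    return min(host_generations.values(), key=rank.__getitem__)


if __name__ == "__main__":
    order = ["gen1", "gen2", "gen3"]  # placeholder labels, oldest first
    hosts = {"host-1": "gen3", "host-2": "gen2", "host-3": "gen3"}
    print(cluster_cpu_baseline(hosts, order))
```

<p>All VMs would then be constrained to the returned level, which is what allows live migration between any pair of hosts in a mixed cluster.</p>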

</section>

<section data-type="sect1" id="configuration-maximums-and-scalability-QMImTwHQ">
<h3>Configuration Maximums and Scalability</h3>

<p>The following configuration maximums and scalability limits are applicable:</p>

<ul>
	<li>Maximum cluster size: <strong>N/A – same as Nutanix cluster size</strong></li>
	<li>Maximum vCPUs per VM: <strong>Number of physical cores per host</strong></li>
	<li>Maximum memory per VM: <strong>2TB</strong></li>
	<li>Maximum VMs per host: <strong>N/A – Limited by memory</strong></li>
	<li>Maximum VMs per cluster: <strong>N/A – Limited by memory</strong></li>
</ul>
</section>

<section data-type="sect1" id="networking-J5IbSVHw">
<h3>Networking</h3>

<p>Acropolis Hypervisor leverages Open vSwitch (OVS) for all VM networking.&nbsp; VM networking is configured through Prism / ACLI and each VM nic is connected into a tap interface.</p>

<p>The following figure shows a conceptual diagram of the OVS architecture:</p>

<figure id="id-89t5TmSAHZ"><img alt="Open vSwitch Network Overview" class="iimagesv2acrop_netpng" src="imagesv2/acrop_net.png">
<figcaption><span class="label">Figure 13-3. </span>Open vSwitch Network Overview</figcaption>
</figure>

<h4>VM NIC Types</h4>
<p>
  AHV supports the following VM network interface types:
</p>
<ul>
  <li>
    Access (default)
  </li>
  <li>
    Trunk (4.6 and above)
  </li>
</ul>
<p>
  By default VM nics will be created as Access interfaces (similar to what you'd see with a VM nic on a port group), however it is possible to expose a trunked interface up to the VM's OS.
</p>

<p>
  A trunked interface can be added with the following command:
</p>

<p class="codetext">vm.nic_create &lt;VM_NAME&gt; vlan_mode=kTrunked trunked_networks=&lt;ALLOWED_VLANS&gt; network=&lt;NATIVE_VLAN&gt;</p>

<p>
  Example:
</p>

<p class="codetext">vm.nic_create fooVM vlan_mode=kTrunked trunked_networks=10,20,30 network=vlan.10</p>
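<p>For scripted deployments, the trunked NIC command can be assembled from its parts. This is a small sketch that only formats the <code>vm.nic_create</code> command shown above; the helper name is hypothetical and the command is not executed.</p>

```python
def trunk_nic_command(vm_name, allowed_vlans, native_network):
    """Build the vm.nic_create command for a trunked VM NIC.

    allowed_vlans  -- iterable of VLAN IDs allowed on the trunk
    native_network -- name of the network used as the native VLAN
    """
    vlans = ",".join(str(vlan) for vlan in allowed_vlans)
    return (
        f"vm.nic_create {vm_name} vlan_mode=kTrunked "
        f"trunked_networks={vlans} network={native_network}"
    )
```

<p>Calling <code>trunk_nic_command("fooVM", [10, 20, 30], "vlan.10")</code> reproduces the example command above.</p>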

</section>

<section data-type="sect1" id="how-it-works-4aIRHwHg">
<h3>How It Works</h3>

<section data-type="sect2" id="iscsi-multi-pathing-BNIZsRHzHv">
<h4>iSCSI Multi-pathing</h4>

<p>On each KVM host there is an iSCSI redirector daemon running which checks Stargate health throughout the cluster using NOP OUT commands.</p>

<p>QEMU is configured with the iSCSI redirector as the iSCSI target portal.&nbsp; Upon a login request, the redirector will perform an iSCSI login redirect to a healthy Stargate (preferably the local one).</p>

<figure id="id-0OtzTbs9HWH3"><img alt="iSCSI Multi-pathing - Normal State" class="iimagesv2iscsi_mp_1png" src="imagesv2/iscsi_mp_1.png">
<figcaption><span class="label">Figure 13-4. </span>iSCSI Multi-pathing - Normal State</figcaption>
</figure>

<p>If the active Stargate goes down (and thus fails to respond to the NOP OUT command), the iSCSI redirector will mark the local Stargate as unhealthy.&nbsp; When QEMU retries the iSCSI login, the redirector will redirect the login to another healthy Stargate.</p>

<figure id="id-DktkHYsnHrHZ"><img alt="iSCSI Multi-pathing - Local CVM Down" class="iimagesv2iscsi_mp_2png" src="imagesv2/iscsi_mp_2.png">
<figcaption><span class="label">Figure 13-5. </span>iSCSI Multi-pathing - Local CVM Down</figcaption>
</figure>

<p>Once the local Stargate comes back up (and begins responding to the NOP OUT commands), the iSCSI redirector will perform a TCP kill to kill all connections to remote Stargates.&nbsp; QEMU will then attempt an iSCSI login again and will be redirected to the local Stargate.</p>

<figure id="id-GRtbtxsbH9H8"><img alt="iSCSI Multi-pathing - Local CVM Back Up" class="iimagesv2iscsi_mp_3png" src="imagesv2/iscsi_mp_3.png">
<figcaption><span class="label">Figure 13-6. </span>iSCSI Multi-pathing - Local CVM Back Up</figcaption>
</figure>
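<p>The redirect behavior described in this section can be sketched as a small decision function. This is a simplified model, not the redirector's actual code: the function names, the health map and the rule for choosing among remote Stargates (first healthy one by name) are all assumptions made for illustration.</p>

```python
def pick_stargate(local, health):
    """Return the Stargate a login should be redirected to.

    health maps a Stargate name to True if it answers NOP OUT, else False.
    """
    # Prefer the local Stargate whenever it is healthy.
    if health.get(local, False):
        return local
    # Otherwise fall back to any healthy remote Stargate (assumed ordering).
    for name in sorted(health):
        if health[name]:
            return name
    raise RuntimeError("no healthy Stargate available")


def connections_to_kill(local, health, active):
    """Return remote connections to drop once the local Stargate is healthy."""
    if not health.get(local, False):
        return []
    return [name for name in active if name != local]


health = {"cvm-a": True, "cvm-b": True, "cvm-c": True}
print(pick_stargate("cvm-a", health))                   # cvm-a (normal state)
health["cvm-a"] = False
print(pick_stargate("cvm-a", health))                   # cvm-b (local CVM down)
health["cvm-a"] = True
print(connections_to_kill("cvm-a", health, ["cvm-b"]))  # ['cvm-b'] (fail back)
```

<p>Dropping the remote connections forces QEMU to log in again, at which point the first function returns the local Stargate.</p>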
</section>

<section data-type="sect2" id="ip-address-management-ONIyujHgHJ">
<h4>IP Address Management</h4>

<p>The Acropolis IP address management (IPAM) solution provides the ability to establish a DHCP scope and assign addresses to VMs.&nbsp; This leverages VXLAN and OpenFlow rules to intercept the DHCP request and respond with a DHCP response.</p>

<p>Here we show an example DHCP request using the Nutanix IPAM solution where the Acropolis Master is running locally:</p>

<figure id="id-A0tmT4u9HgHo"><img alt="IPAM - Local Acropolis Master" class="iimagesv2acrop_ipam_1png" src="imagesv2/acrop_ipam_1.png">
<figcaption><span class="label">Figure 13-7. </span>IPAM - Local Acropolis Master</figcaption>
</figure>

<p>If the Acropolis Master is running remotely, the same VXLAN tunnel will be leveraged to handle the request over the network.&nbsp;</p>

<figure id="id-k1tjH0u5HXHG"><img alt="IPAM - Remote Acropolis Master" class="iimagesv2acrop_ipam_2png" src="imagesv2/acrop_ipam_2.png">
<figcaption><span class="label">Figure 13-8. </span>IPAM - Remote Acropolis Master</figcaption>
</figure>

<p>Traditional DHCP / IPAM solutions can also be leveraged in an ‘unmanaged’ network scenario.</p>
</section>

<section data-type="sect2" id="vm-high-availability-ha-nrIkT2HrHN">
<h4>VM High Availability (HA)</h4>
<p>
	Acropolis hypervisor VM HA is a feature built to ensure VM availability in the event of a host or block outage.  In the event of a host failure the VMs previously running on that host will be restarted on other healthy nodes throughout the cluster.  The Acropolis Master is responsible for restarting the VM(s) on the healthy host(s).
</p>

<p>
	The Acropolis Master tracks host health by monitoring its connections to the libvirt on all cluster hosts:
</p>

<figure id="id-DktlTvTnHrHZ"><img alt="HA - Host Monitoring" src="imagesv2/ha_hostmonitoring.png">
<figcaption><span class="label">Figure 13-9. </span>HA - Host Monitoring</figcaption>
</figure>

<p>
	In the event the Acropolis Master becomes partitioned, isolated or fails, a new Acropolis Master will be elected on the healthy portion of the cluster.  If a cluster becomes partitioned (e.g. X nodes can't talk to the other Y nodes), the side with quorum will remain up and VM(s) will be restarted on those hosts.
</p>
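<p>To make the partition case concrete, here is a minimal sketch that picks the surviving side. It assumes quorum means a strict majority of all cluster nodes; the text above does not define quorum, so treat that rule and the function name as assumptions.</p>

```python
def side_with_quorum(partitions):
    """Return the partition holding a strict majority of nodes, or None.

    partitions is a list of node-name lists, one list per isolated side.
    """
    total = sum(len(side) for side in partitions)
    for side in partitions:
        # Assumed rule: quorum requires more than half of all nodes.
        if 2 * len(side) > total:
            return side
    return None


print(side_with_quorum([["n1", "n2", "n3"], ["n4", "n5"]]))  # ['n1', 'n2', 'n3']
print(side_with_quorum([["n1", "n2"], ["n3", "n4"]]))        # None
```

<p>Under this assumption an even split leaves no side with quorum, which is one reason odd node counts are often preferred in majority-based designs.</p>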

<div data-type="note" class="note" id="default-vm-restart-policy-RDiqH9TWHRHG"><h6>Note</h6>
<h5>Default VM restart policy</h5>
<p>
	By default any AHV cluster will do its best to restart VM(s) in the event of a host failure.  In this mode, when a host becomes unavailable, the previously running VMs will be restarted on the remaining healthy hosts if possible.  Since this is best effort (meaning resources aren't reserved) the ability to restart all VMs will be dependent on available AHV resources.
</p>
</div>

<p>
	There are two main types of resource reservations for HA:
</p>

<ul>
	<li>
		Reserve Hosts
		<ul>
			<li>
				Reserve X number of hosts where X is the number of host failures to tolerate (e.g. 1, 2, etc.)
			</li>
			<li>
				This is the default when all hosts have the same amount of RAM
			</li>
		</ul>
	</li>
	<li>
		Reserve Segments
		<ul>
			<li>
				Reserve Y resources across N hosts in the cluster.  This will be a function of the cluster FT level, the size of the running VMs and the number of nodes in the cluster.
			</li>
			<li>
				This is the default when some hosts have different amounts of RAM
			</li>
		</ul>
	</li>
</ul>
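<p>The default selection described in the list above can be expressed as a short check on per-host RAM. This is a sketch only; the function name and the returned labels are hypothetical and do not correspond to ACLI values.</p>

```python
def default_reservation_type(host_ram_gb):
    """Pick the default HA reservation type from per-host RAM sizes (GB)."""
    if not host_ram_gb:
        raise ValueError("at least one host is required")
    # Identical RAM on every host means whole hosts can be reserved;
    # mixed RAM sizes fall back to segment-based reservation.
    if len(set(host_ram_gb)) == 1:
        return "reserve_hosts"
    return "reserve_segments"


print(default_reservation_type([256, 256, 256, 256]))  # reserve_hosts
print(default_reservation_type([256, 256, 512, 512]))  # reserve_segments
```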

<div data-type="note" class="note" id="pro-tip-21irfmTqHRHb"><h6>Note</h6>
<h5>Pro tip</h5>
<p>
	Use reserve hosts when:
</p>
<ul>
	<li>
		You have homogeneous clusters (all hosts <strong>DO</strong> have the same amount of RAM)
	</li>
	<li>
		Consolidation ratio is higher priority than performance
	</li>
</ul>
<p>
	Use reserve segments when:
</p>
<ul>
	<li>
		You have heterogeneous clusters (all hosts <strong>DO NOT</strong> have the same amount of RAM)
	</li>
	<li>
		Performance is higher priority than consolidation ratio
	</li>
</ul>
</div>

<p>
	I'll cover both reservation options in the following sections.
</p>

<h5>Reserve Hosts</h5>

<p>
	  By default the number of failures to tolerate will be the same as the cluster FT level (i.e. 1 for FT1 aka RF2, 2 for FT2 aka RF3, etc.).  It is possible to override this via acli.
</p>

<div data-type="note" class="note" id="pro-tip-MziqURTqHbHX"><h6>Note</h6>
<h5>Pro tip</h5>
<p>
	You can override or manually set the number of reserved failover hosts with the following ACLI command:
</p>
<p class="codetext">acli ha.update num_reserved_hosts=&lt;NUM_RESERVED&gt;</p>
</div>

<p>
	The figure shows an example scenario with a reserved host:
</p>

<figure id="id-MltWCRTqHbHX"><img alt="HA - Reserved Host" src="imagesv2/ha_reservedhost1.png">
<figcaption><span class="label">Figure 13-10. </span>HA - Reserved Host</figcaption>
</figure>

<p>
	In the event of a host failure VM(s) will be restarted on the reserved host(s):
</p>

<figure id="id-J2tDu8TlHMH5"><img alt="HA - Reserved Host - Fail Over" src="imagesv2/ha_reservedhost2.png">
<figcaption><span class="label">Figure 13-11. </span>HA - Reserved Host - Fail Over</figcaption>
</figure>

<p>
	If the failed host comes back the VM(s) will be live migrated back to the original host to minimize any data movement for data locality:
</p>

<figure id="id-BAtzSkTzHxH0"><img alt="HA - Reserved Host - Fail Back" src="imagesv2/ha_reservedhost3.png">
<figcaption><span class="label">Figure 13-12. </span>HA - Reserved Host - Fail Back</figcaption>
</figure>

<h5>Reserve Segments</h5>
<p>
	Reserve segments distributes the resource reservation across all hosts in a cluster.  In this scenario, each host will share a portion of the reservation for HA.  This ensures the overall cluster has enough failover capacity to restart VM(s) in the event of a host failure.
</p>

<div data-type="note" class="note" id="pro-tip-mEiEtPTrHdHM"><h6>Note</h6>
<h5>Pro tip</h5>

<p>Keep your hosts balanced when using segment based reservation.  This will give the highest utilization and ensure not too many segments are reserved.</p>
</div>

<p>
	The figure shows an example scenario with reserved segments:
</p>

<figure id="id-m2tqiPTrHdHM"><img alt="HA - Reserved Segment" src="imagesv2/ha_reservedsegment1.png">
<figcaption><span class="label">Figure 13-13. </span>HA - Reserved Segment</figcaption>
</figure>

<p>
	In the event of a host failure VM(s) will be restarted throughout the cluster on the remaining healthy hosts:
</p>

<figure id="id-vOt4INT9HEH5"><img alt="HA - Reserved Segment - Fail Over" src="imagesv2/ha_reservedsegment2.png">
<figcaption><span class="label">Figure 13-14. </span>HA - Reserved Segment - Fail Over</figcaption>
</figure>

<div data-type="note" class="note" id="reserved-segments-calculation-eriXUNTQh3H8"><h6>Note</h6>
<h5>Reserved segment(s) calculation</h5>
<p>
	The system will automatically calculate the total number of reserved segments and the per-host reservation.  The following text gives some insight into how this is calculated.
</p>

<p>
	Acropolis HA uses fixed-size segments to reserve enough space for a successful VM restart in case of host failure.  The segment size corresponds to the largest VM in the system.  The distinctive feature of Acropolis HA is the ability to pack multiple smaller VMs into a single fixed-size segment.  In a cluster with VMs of varying size, a single segment can accommodate multiple VMs, thus reducing the fragmentation inherent to any fixed-size segment scheme.
</p>

<p>
	The most efficient placement of VMs (fewest segments reserved) is an instance of the bin-packing problem, a well-known problem in computer science.  Finding the optimal solution is NP-hard (exponential), but heuristic solutions can come close to optimal for the common case.  Nutanix will continue improving its placement algorithms.  We expect to have 0.25 extra overhead for the common case in future versions.  Today, the fragmentation overhead varies between 0.5 and 1, giving a total overhead of 1.5-2 per configured host failure.
</p>

<p>
	When using a segment-based reservation there are a few key constructs that come into play:
</p>

<ul>
	<li>
		Segment size = Largest running VM's memory footprint (GB)
	</li>
	<li>
		Most loaded host = Host running VMs with most memory (GB)
	</li>
	<li>
		Fragmentation overhead = 0.5 - 1
	</li>
</ul>

<p>
	Based upon these inputs you can calculate the expected number of reserved segments:
</p>

<ul>
	<li>
		Reserved segments = (Most loaded host / Segment size) x (1 + Fragmentation overhead)
	</li>
</ul>
</div>
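<p>The reserved segment formula above can be worked through with a few lines of code. This is a sketch of the arithmetic only: the function name and the sample values (a 64 GB largest VM and a most loaded host running 256 GB of VM memory) are made up for illustration, and the result is left unrounded because the text does not say how the system rounds.</p>

```python
def reserved_segments(most_loaded_host_gb, segment_size_gb, fragmentation_overhead):
    """Reserved segments = (most loaded host / segment size) x (1 + overhead)."""
    if not segment_size_gb > 0:
        raise ValueError("segment size must be positive")
    return (most_loaded_host_gb / segment_size_gb) * (1 + fragmentation_overhead)


# Hypothetical inputs: largest VM is 64 GB, busiest host runs 256 GB of VMs.
low = reserved_segments(256, 64, 0.5)
high = reserved_segments(256, 64, 1.0)
print(low, high)  # 6.0 8.0
```

<p>With the fragmentation overhead ranging from 0.5 to 1, the estimate for this example falls between 6 and 8 segments of 64 GB each.</p>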

</section>
</section>

<section data-type="sect1" id="administration-BNIyFRHZ">
<h3>Administration</h3>

<p>More coming soon!</p>
</section>

<section data-type="sect1" id="important-pages-ONIPtjHA">
<h3>Important Pages</h3>

<p>More coming soon!</p>
</section>

<section data-type="sect1" id="command-reference-nrIef2Hx">
<h3>Command Reference</h3>

<h5>Enable 10GbE links only on OVS</h5>

<p class="codedescription">Description: Enable 10g only on bond0 for local host</p>

<p class="codetext">manage_ovs --interfaces 10g update_uplinks</p>
<p class="codedescription">Description: Enable 10g only on bond0 for full cluster</p>
<p class="codetext">
allssh "manage_ovs --interfaces 10g update_uplinks"
</p>

<h5>Show OVS uplinks</h5>

 <p class="codedescription">Description: Show ovs uplinks for local host</p>

<p class="codetext">manage_ovs show_uplinks</p>

 <p class="codedescription">Description: Show ovs uplinks for full cluster</p>

<p class="codetext">allssh "manage_ovs show_uplinks"</p>

<h5>Show OVS interfaces</h5>

 <p class="codedescription">Description: Show ovs interfaces for local host</p>

<p class="codetext">manage_ovs show_interfaces</p>

<p class="codedescription">Description: Show ovs interfaces for full cluster</p>

<p class="codetext">allssh "manage_ovs show_interfaces"</p>

<h5>Show OVS switch information</h5>

 <p class="codedescription">Description: Show switch information</p>

<p class="codetext">ovs-vsctl show</p>

<h5>List OVS bridges</h5>

 <p class="codedescription">Description: List bridges</p>

<p class="codetext">ovs-vsctl list br</p>

<h5>Show OVS bridge information</h5>

 <p class="codedescription">Description: Show OVS port information</p>

<p class="codetext">ovs-vsctl list port br0
<br>
ovs-vsctl list port &lt;bond&gt;</p>

<h5>Show OVS interface information</h5>

 <p class="codedescription">Description: Show interface information</p>

<p class="codetext">ovs-vsctl list interface br0</p>

<h5>Show ports / interfaces on bridge</h5>

<p class="codedescription">Description: Show ports on a bridge</p>

<p class="codetext">ovs-vsctl list-ports br0</p>

<p class="codedescription">Description: Show ifaces on a bridge</p>

<p class="codetext">ovs-vsctl list-ifaces br0</p>

<h5>Create OVS bridge</h5>

<p class="codedescription">Description: Create bridge</p>

<p class="codetext">ovs-vsctl add-br &lt;bridge&gt;</p>

<h5>Add ports to bridge</h5>

 <p class="codedescription">Description: Add port to bridge</p>

<p class="codetext">ovs-vsctl add-port &lt;bridge&gt; &lt;port&gt;</p>

<p class="codedescription">Description: Add bond port to bridge</p>

<p class="codetext">ovs-vsctl add-bond &lt;bridge&gt; &lt;port&gt; &lt;iface&gt;</p>

<h5>Show OVS bond details</h5>

<p class="codedescription">Description: Show bond details</p>

<p class="codetext">ovs-appctl bond/show &lt;bond&gt;</p>

<p>Example:</p>

<p class="codetext">ovs-appctl bond/show bond0</p>

<h5>Set bond mode and configure LACP on bond</h5>

<p class="codedescription">Description: Enable LACP on ports</p>

<p class="codetext">ovs-vsctl set port &lt;bond&gt; lacp=&lt;active/passive&gt;</p>

<p class="codedescription">Description: Enable on all hosts for bond0</p>

<p class="codetext">for i in `hostips`;do echo $i; ssh $i "source /etc/profile &gt; /dev/null 2&gt;&amp;1; ovs-vsctl set port bond0 lacp=active";done</p>

<h5>Show LACP details on bond</h5>

<p class="codedescription">Description: Show LACP details</p>

<p class="codetext">ovs-appctl lacp/show &lt;bond&gt;</p>

<h5>Set bond mode</h5>

<p class="codedescription">Description: Set bond mode on ports</p>

<p class="codetext">ovs-vsctl set port &lt;bond&gt; bond_mode=&lt;active-backup, balance-slb, balance-tcp&gt;</p>

<h5>Show OpenFlow information</h5>

<p class="codedescription">Description: Show OVS openflow details</p>

<p class="codetext">ovs-ofctl show br0</p>

<p class="codedescription">Description: Show OpenFlow rules</p>

<p class="codetext">ovs-ofctl dump-flows br0</p>

<h5>Get QEMU PIDs and top information</h5>

<p class="codedescription">Description: Get QEMU PIDs</p>

<p class="codetext">ps aux | grep qemu | awk '{print $2}'</p>

<p class="codedescription">Description: Get top metrics for specific PID</p>

<p class="codetext">top -p &lt;PID&gt;</p>

<h5>Get active Stargate for QEMU processes</h5>

<p class="codedescription">Description: Get active Stargates for storage I/O for each QEMU process</p>

<p class="codetext">netstat -np | egrep "tcp.*qemu"</p>
</section>

<section data-type="sect1" id="metrics-and-thresholds-b1IPiXH5">
<h3>Metrics and Thresholds</h3>

<p>More coming soon!</p>
</section>

<section data-type="sect1" id="troubleshooting-andamp-advanced-administration-rkIAcyHO">
<h3>Troubleshooting &amp; Advanced Administration</h3>

<h5>Check iSCSI Redirector Logs</h5>

<p class="codedescription">Description: Check iSCSI Redirector Logs for all hosts</p>

<p class="codetext">for i in `hostips`; do echo $i; ssh root@$i cat /var/log/iscsi_redirector;done</p>

<p>Example for single host</p>

<p class="codetext">ssh root@&lt;HOST IP&gt;
<br>
cat /var/log/iscsi_redirector</p>

<h5>Monitor CPU steal (stolen CPU)</h5>

<p class="codedescription">Description: Monitor CPU steal time (stolen CPU)</p>

<p>Launch top and look for %st (bold below)</p>

<p class="codetext">Cpu(s):&nbsp; 0.0%us, 0.0%sy,&nbsp; 0.0%ni, 96.4%id,&nbsp; 0.0%wa,&nbsp; 0.0%hi,&nbsp; 0.1%si,&nbsp; <strong>0.0%st</strong></p>

<h5>Monitor VM network resource stats</h5>

<p class="codedescription">Description: Monitor VM resource stats</p>

<p>Launch virt-top</p>

<p class="codetext">virt-top</p>

<p>Go to networking page</p>

<p>2 – Networking</p>
</section>
</section>

<section data-type="chapter" id="administration-lkInFl">
<h2>Administration</h2>

<section data-type="sect1" id="important-pages-M2IosXFw">
<h3>Important Pages</h3>

<p>These advanced Nutanix pages, separate from the standard user interface, allow you to monitor detailed stats and metrics.&nbsp; The URLs are formatted as follows: http://&lt;Nutanix CVM IP/DNS&gt;:&lt;Port/path (mentioned below)&gt;&nbsp; Example: http://MyCVM-A:2009&nbsp; NOTE: if you’re on a different subnet, iptables will need to be disabled on the CVM to access the pages.</p>
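<p>For reference, the URL format can be sketched as a small lookup. The port/path values are the ones listed in this section; the dictionary keys and the function name are hypothetical labels chosen for this example.</p>

```python
# Port/path values as listed in this section; the keys are illustrative labels.
PAGES = {
    "stargate": "2009",
    "stargate_latency": "2009/latency",
    "stargate_vdisk_stats": "2009/vdisk_stats",
    "curator": "2010",
    "chronos": "2011",
    "cerebro": "2020",
    "acropolis": "2030",
    "acropolis_tasks": "2030/tasks",
}


def page_url(cvm, page):
    """Build http://CVM:port/path for one of the pages above."""
    return f"http://{cvm}:{PAGES[page]}"


print(page_url("MyCVM-A", "stargate"))          # http://MyCVM-A:2009
print(page_url("MyCVM-A", "stargate_latency"))  # http://MyCVM-A:2009/latency
```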

<h5>2009 Page</h5>

<p>This is a Stargate page used to monitor the back end storage system and should only be used by advanced users.&nbsp; I’ll have a post that explains the 2009 pages and things to look for.</p>

<h5>2009/latency Page</h5>

<p>This is a Stargate page used to monitor the back end latency.</p>

<h5>2009/vdisk_stats Page</h5>

<p>This is a Stargate page used to show various vDisk stats including histograms of I/O sizes, latency, write hits (e.g., OpLog, eStore), read hits (cache, SSD, HDD, etc.) and more.</p>

<h5>2009/h/traces Page</h5>

<p>This is the Stargate page used to monitor activity traces for operations.</p>

<h5>2009/h/vars Page</h5>

<p>This is the Stargate page used to monitor various counters.</p>

<h5>2010 Page</h5>

<p>This is the Curator page which is used for monitoring Curator runs.</p>

<h5>2010/master/control Page</h5>

<p>This is the Curator control page which is used to manually start Curator jobs.</p>

<h5>2011 Page</h5>

<p>This is the Chronos page which monitors jobs and tasks scheduled by Curator.</p>

<h5>2020 Page</h5>

<p>This is the Cerebro page which monitors the protection domains, replication status and DR.</p>

<h5>2020/h/traces Page</h5>

<p>This is the Cerebro page used to monitor activity traces for PD operations and replication.</p>

<h5>2030 Page</h5>

<p>This is the main Acropolis page and shows details about the environment hosts, any currently running tasks and networking details.</p>

<h5>2030/sched Page</h5>

<p>This is an Acropolis page used to show information about VM and resource scheduling used for placement decisions.&nbsp; This page shows the available host resources and VMs running on each host.</p>

<h5>2030/tasks Page</h5>

<p>This is an Acropolis page used to show information about Acropolis tasks and their state.&nbsp; You can click on the task UUID to get detailed JSON about the task.</p>

<h5>2030/vms Page</h5>

<p>This is an Acropolis page used to show information about Acropolis VMs and details about them.&nbsp; You can click on the VM Name to connect to the console.</p>
</section>

<section data-type="sect1" id="cluster-commands-QMIMupFQ">
<h3>Cluster Commands</h3>

<h5>Check cluster status</h5>

<p class="codedescription">Description: Check cluster status from the CLI</p>

<p class="codetext">cluster status</p>

<h5>Check local CVM service status</h5>

<p class="codedescription">Description: Check a single CVM's service status from the CLI</p>

<p class="codetext">genesis status</p>

<h5>Nutanix cluster upgrade</h5>

<p class="codedescription">Description: Perform rolling (aka "live") cluster upgrade from the CLI</p>

<p>Upload upgrade package to ~/tmp/ on one CVM</p>

<p>Untar package</p>

<p class="codetext">tar xzvf ~/tmp/nutanix*</p>

<p>Perform upgrade</p>

<p class="codetext">~/tmp/install/bin/cluster -i ~/tmp/install upgrade</p>

<p>Check status</p>

<p class="codetext">upgrade_status</p>

<h5>Node(s) upgrade</h5>

<p class="codedescription">Description: Perform upgrade of specified node(s) to the current cluster's version</p>

<p>From any CVM running the desired version run the following command:</p>

<p class="codetext">cluster -u &lt;NODE_IP(s)&gt; upgrade_node</p>

<h5>Hypervisor upgrade status</h5>

<p class="codedescription">Description: Check hypervisor upgrade status from the CLI on any CVM</p>

<p class="codetext">host_upgrade --status</p>

<p>Detailed logs (on every CVM)</p>

<p class="codetext">~/data/logs/host_upgrade.out</p>

<h5>Restart cluster service from CLI</h5>

<p class="codedescription">Description: Restart a single cluster service from the CLI</p>

<p>Stop service</p>

<p class="codetext">cluster stop &lt;Service Name&gt;</p>

<p>Start stopped services</p>

<p class="codetext">cluster start&nbsp; #NOTE: This will start all stopped services</p>

<h5>Start cluster service from CLI</h5>

<p class="codedescription">Description: Start stopped cluster services from the CLI</p>

<p>Start stopped services</p>

<p class="codetext">cluster start&nbsp; #NOTE: This will start all stopped services</p>

<p>OR</p>

<p>Start single service</p>

<p class="codetext">cluster start &lt;Service Name&gt;</p>

<h5>Restart local service from CLI</h5>

<p class="codedescription">Description: Restart a single cluster service from the CLI</p>

<p>Stop Service</p>

<p class="codetext">genesis stop &lt;Service Name&gt;</p>

<p>Start Service</p>

<p class="codetext">cluster start</p>

<h5>Start local service from CLI</h5>

<p class="codedescription">Description: Start stopped cluster services from the CLI</p>

<p class="codetext">cluster start #NOTE: This will start all stopped services</p>

<h5>Cluster add node from cmdline</h5>

<p class="codedescription">Description: Perform cluster add-node from CLI</p>

<p class="codetext">ncli cluster discover-nodes | egrep "Uuid" | awk '{print $4}' | xargs -I UUID ncli cluster add-node node-uuid=UUID</p>

<h5>Find number of vDisks</h5>

<p class="codedescription">Description: Displays the number of vDisks</p>

<p class="codetext">vdisk_config_printer | grep vdisk_id | wc -l</p>

<h5>Find cluster id</h5>

<p class="codedescription">Description: Find the cluster ID for the current cluster</p>

<p class="codetext">zeus_config_printer | grep cluster_id</p>

<h5>Open port</h5>

<p class="codedescription">Description: Enable port through IPtables</p>

<p class="codetext">sudo vi /etc/sysconfig/iptables
<br>
-A INPUT -m state --state NEW -m tcp -p tcp --dport &lt;PORT&gt; -j ACCEPT
<br>
sudo service iptables restart</p>

<h5>Check for Shadow Clones</h5>

<p class="codedescription">Description: Displays the shadow clones in the following format:&nbsp; name#id@svm_id</p>

<p class="codetext">vdisk_config_printer | grep '#'</p>

<h5>Reset Latency Page Stats</h5>

<p class="codedescription">Description: Reset the Latency Page (&lt;CVM IP&gt;:2009/latency) counters</p>

<p class="codetext">allssh "wget 127.0.0.1:2009/latency/reset"</p>

<h5>Start Curator scan from CLI</h5>

<p class="codedescription">Description: Starts a Curator full scan from the CLI</p>

<p class="codetext">allssh "wget -O - \"http://localhost:2010/master/api/client/StartCuratorTasks?task_type=2\";"</p>

<h5>Compact ring</h5>

<p class="codedescription">Description: Compact the metadata ring</p>

<p class="codetext">allssh "nodetool -h localhost compact"</p>

<h5>Find NOS version</h5>

<p class="codedescription">Description: Find the NOS version (NOTE: can also be done using NCLI)</p>

<p class="codetext">allssh "cat /etc/nutanix/release_version"</p>

<h5>Find CVM version</h5>

<p class="codedescription">Description: Find the CVM image version</p>

<p class="codetext">allssh "cat /etc/nutanix/svm-version"</p>

<h5>Manually fingerprint vDisk(s)</h5>

<p class="codedescription">Description: Create fingerprints for a particular vDisk (For dedupe)&nbsp; NOTE: dedupe must be enabled on the container</p>

<p class="codetext">vdisk_manipulator --vdisk_id=&lt;vDisk ID&gt; --operation=add_fingerprints</p>

<h5>Echo Factory_Config.json for all cluster nodes</h5>

<p class="codedescription">Description: Echoes the factory_config.json for all nodes in the cluster</p>

<p class="codetext">allssh "cat /etc/nutanix/factory_config.json"</p>

<h5>Upgrade a single Nutanix node’s NOS version</h5>

<p class="codedescription">Description: Upgrade a single node's NOS version to match that of the cluster</p>

<p class="codetext">~/cluster/bin/cluster -u &lt;NEW_NODE_IP&gt; upgrade_node</p>

<h5>List files (vDisk) on DSF</h5>

<p class="codedescription">Description: List files and associated information for vDisks stored on DSF</p>

<p class="codetext">nfs_ls</p>

<p>Get help text</p>

<p class="codetext">nfs_ls --help</p>

<h5>Install Nutanix Cluster Check (NCC)</h5>

<p class="codedescription">Description: Installs the Nutanix Cluster Check (NCC) health script to test for potential issues and cluster health</p>

<p>Download NCC from the Nutanix Support Portal (portal.nutanix.com)</p>

<p>SCP .tar.gz to the /home/nutanix directory</p>

<p>Untar NCC .tar.gz</p>

<p class="codetext">tar xzmf &lt;ncc .tar.gz file name&gt; --recursive-unlink</p>

<p>Run install script</p>

<p class="codetext">./ncc/bin/install.sh -f &lt;ncc .tar.gz file name&gt;</p>

<p>Create links</p>

<p class="codetext">source ~/ncc/ncc_completion.bash
<br>
echo "source ~/ncc/ncc_completion.bash" &gt;&gt; ~/.bashrc</p>

<h5>Run Nutanix Cluster Check (NCC)</h5>

<p class="codedescription">Description: Runs the Nutanix Cluster Check (NCC) health script to test for potential issues and cluster health.&nbsp; This is a great first step when troubleshooting any cluster issues.</p>

<p>Make sure NCC is installed (steps above)</p>

<p>Run NCC health checks</p>

<p class="codetext">ncc health_checks run_all</p>

<h5>List tasks using progress monitor cli</h5>

<p class="codetext">progress_monitor_cli -fetchall</p>

<h5>Remove task using progress monitor cli</h5>

<p class="codetext">progress_monitor_cli --entity_id=&lt;ENTITY_ID&gt; --operation=&lt;OPERATION&gt; --entity_type=&lt;ENTITY_TYPE&gt; --delete<br />
# NOTE: operation and entity_type should be all lowercase with k removed from the beginning</p>
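<p>The naming rule in the note above can be sketched as follows. The helper names and the sample values (<code>kExampleOp</code>, <code>kExampleEntity</code>) are hypothetical placeholders, not real operation or entity types; the code only formats the command string shown above.</p>

```python
def normalize_enum_name(name):
    """Drop a leading 'k' and lowercase the rest, per the note above."""
    if name.startswith("k"):
        name = name[1:]
    return name.lower()


def build_delete_command(entity_id, operation, entity_type):
    """Format the progress_monitor_cli delete command for one task."""
    return (
        f"progress_monitor_cli --entity_id={entity_id} "
        f"--operation={normalize_enum_name(operation)} "
        f"--entity_type={normalize_enum_name(entity_type)} --delete"
    )


# Placeholder values only; substitute the ones reported by -fetchall.
print(build_delete_command("1234", "kExampleOp", "kExampleEntity"))
```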
</section>

<section data-type="sect1" id="metrics-and-thresholds-J5IvTbFw">
<h3>Metrics and Thresholds</h3>

<p>The following section will cover specific metrics and thresholds on the Nutanix back end.&nbsp; More updates to these coming shortly!</p>
</section>

<section data-type="sect1" id="gflags-4aI2S1Fg">
<h3>Gflags</h3>

<p>More coming soon!</p>
</section>

<section data-type="sect1" id="troubleshooting-andamp-advanced-administration-BNI4HzFZ">
<h3>Troubleshooting &amp; Advanced Administration</h3>

<h5>Find Acropolis logs</h5>

<p class="codedescription">Description: Find Acropolis logs for the cluster</p>

<p class="codetext">allssh "cat ~/data/logs/Acropolis.log"</p>

<h5>Find cluster error logs</h5>

<p class="codedescription">Description: Find ERROR logs for the cluster</p>

<p class="codetext">allssh "cat ~/data/logs/&lt;COMPONENT NAME or *&gt;.ERROR"</p>

<p>Example for Stargate</p>

<p class="codetext">allssh "cat ~/data/logs/Stargate.ERROR"</p>

<h5>Find cluster fatal logs</h5>

<p class="codedescription">Description: Find FATAL logs for the cluster</p>

<p class="codetext">allssh "cat ~/data/logs/&lt;COMPONENT NAME or *&gt;.FATAL"</p>

<p>Example for Stargate</p>

<p class="codetext">allssh "cat ~/data/logs/Stargate.FATAL"</p>

<section data-type="sect2" id="using-the-2009-page-stargate-gnIoCmh3FP">
<h4>Using the 2009 Page (Stargate)</h4>

<p>In most cases Prism should be able to give you all of the information and data points you require. &nbsp;However,&nbsp;in certain scenarios, or if you want some more detailed data you can leverage the Stargate aka 2009 page. &nbsp;The 2009 page can be viewed by navigating to &lt;CVM IP&gt;:2009.</p>

<div data-type="note" class="note" id="accessing-back-end-pages-JZiaukClH9F5"><h6>Note</h6>
<h5>Accessing back-end pages</h5>

<p>If you're on a different network segment (L2 subnet)&nbsp;you'll need to add a rule in IP tables to access any of the back-end pages.</p>
</div>

<p>At the top of the page are the overview details, which show various details about the cluster:</p>

<figure id="id-J2tbSkClH9F5"><img alt="2009 Page - Stargate Overview" class="iimagesv22009pagesstargate_overview2png" src="imagesv2/2009Pages/stargate_overview2.png">
<figcaption><span class="label">Figure 14-1. </span>2009 Page - Stargate Overview</figcaption>
</figure>

<p>In this section there are two key areas I look out for.  The first is the I/O queues, which show the number of admitted / outstanding operations.</p>

<p>The figure shows the queues portion of the overview section:</p>

<figure id="id-OQtPtBCgHJF9"><img alt="2009 Page - Stargate Overview - Queues" class="iimagesv22009pagesstargate_io_queuespng" src="imagesv2/2009Pages/stargate_io_queues.png">
<figcaption><span class="label">Figure 14-2. </span>2009 Page - Stargate Overview - Queues</figcaption>
</figure>

<p>The second portion is the unified cache details, which show information on cache sizes and hit rates.</p>

<p>The figure shows the unified cache portion of the overview section:</p>

<figure id="id-rQtAcNCJHDF9"><img alt="2009 Page - Stargate Overview - Unified Cache" class="iimagesv22009pagesstargate_contentcache2png" src="imagesv2/2009Pages/stargate_contentCache2.png">
<figcaption><span class="label">Figure 14-3. </span>2009 Page - Stargate Overview - Unified Cache</figcaption>
</figure>

<div data-type="note" class="note" id="pro-tip-vAiPIXC9HxF5"><h6>Note</h6>
<h5>Pro tip</h5>

<p>For the best possible read performance on a read-heavy workload, the hit rates should ideally be 80-90% or higher.</p>
</div>

<p>NOTE: these values are per Stargate / CVM</p>

<p>The next section is the 'Cluster State' which shows details on the various Stargates in the cluster and their disk usages.</p>

<p>The figure shows the Stargates and disk utilization&nbsp;(available/total):</p>

<figure class="large" id="id-yrtysdCvHzFk"><img alt="2009 Page - Cluster State - Disk Usage" class="iimagesv22009pagesstargate_diskutilpng" src="imagesv2/2009Pages/stargate_diskUtil.png">
<figcaption><span class="label">Figure 14-4. </span>2009 Page - Cluster State - Disk Usage</figcaption>
</figure>

<p>The next section is the 'NFS Slave' section which will show various details and stats per vDisk.</p>

<p>The figure shows the vDisks and various I/O details:</p>

<figure class="large" id="id-YatzSeCwHRFm"><img alt="2009 Page - NFS Slave - vDisk Stats" class="iimagesv22009pagesstargate_vdiskstatpng" src="imagesv2/2009Pages/stargate_vdiskStat.png">
<figcaption><span class="label">Figure 14-5. </span>2009 Page - NFS Slave - vDisk Stats</figcaption>
</figure>

<div data-type="note" class="note" id="pro-tip-EaigHoCDHoFE"><h6>Note</h6>
<h5>Pro tip</h5>

<p>When looking at any potential performance issues I always look at the following:</p>

<ol>
	<li>Avg. latency</li>
	<li>Avg. op size</li>
	<li>Avg. outstanding</li>
</ol>

<p>For more specific details the vdisk_stats page holds a plethora of information.</p>
</div>
</section>
<section data-type="sect2" id="using-the-2009vdisk_stats-page-EqIdsRHzFo">
<h4>Using the 2009/vdisk_stats Page</h4>

<p>The 2009 vdisk_stats page is a detailed page which provides even further data points per vDisk. &nbsp;This page includes details and histograms for items like randomness, latency, I/O sizes and working set sizes.</p>

<p>You can navigate to the vdisk_stats page by clicking on the 'vDisk Id' in the left hand column.</p>

<p>The figure shows the section and hyperlinked vDisk Id:</p>

<figure class="large" id="id-4mt2S0seHkFY"><img alt="2009 Page - Hosted vDisks" class="iimagesv22009pagesstargate_hostedvdiskbriefpng" src="imagesv2/2009Pages/stargate_hostedVdiskBrief.png">
<figcaption><span class="label">Figure 14-6. </span>2009 Page - Hosted vDisks</figcaption>
</figure>

<p>This will bring you to the vdisk_stats page which will give you the detailed vDisk stats. &nbsp;NOTE: These values are real-time and can be updated by refreshing the page.</p>

<p>The first key area is the 'Ops and Randomness' section which will show a breakdown of whether the I/O patterns are random or sequential in nature.</p>

<p>The figure shows the 'Ops and Randomness' section:</p>

<figure id="id-b8tRf2sBHpFJ"><img alt="2009 Page - vDisk Stats - Ops and Randomness" class="iimagesv22009pagesstargate_opsrandomnesspng" src="imagesv2/2009Pages/stargate_opsRandomness.png">
<figcaption><span class="label">Figure 14-7. </span>2009 Page - vDisk Stats - Ops and Randomness</figcaption>
</figure>
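<p>To make the random versus sequential distinction concrete, here is a minimal sketch that classifies a trace of I/Os. &nbsp;It assumes an op is 'sequential' when it starts exactly where the previous op ended; the function name and the trace format are hypothetical, and this is not necessarily how Stargate computes the values shown on the page.</p>

```python
from typing import Iterable, Tuple


def random_io_percent(ops: Iterable[Tuple[int, int]]) -> float:
    """Return the percentage of ops that do not start where the previous op ended.

    Each op is an (offset_bytes, length_bytes) pair. The first op has no
    predecessor, so it is counted as random (an assumption of this sketch).
    """
    total = 0
    random_ops = 0
    expected_next = None  # offset at which a sequential op would start
    for offset, length in ops:
        total += 1
        if offset != expected_next:
            random_ops += 1
        expected_next = offset + length
    return 100.0 * random_ops / total if total else 0.0
```

<p>For example, three back-to-back 4K ops followed by a jump to a distant offset would be reported as 50% random (the first op and the jump).</p>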

<p>The next area shows a histogram of the frontend read and write I/O latency&nbsp;(aka the latency the VM / OS sees).</p>

<p>The figure shows the 'Frontend Read Latency' histogram:</p>

<figure class="large" id="id-WmtNI9sGH2Fm"><img alt="2009 Page - vDisk Stats - Frontend Read Latency" class="iimagesv22009pagesstargate_readlat_fepng" src="imagesv2/2009Pages/stargate_readLat_FE.png">
<figcaption><span class="label">Figure 14-8. </span>2009 Page - vDisk Stats - Frontend Read Latency</figcaption>
</figure>

<p>The figure shows the 'Frontend Write Latency' histogram:</p>

<figure class="large" id="id-jRtph0smHaFr"><img alt="2009 Page - vDisk Stats - Frontend Write Latency" class="iimagesv22009pagesstargate_writelat_fepng" src="imagesv2/2009Pages/stargate_writeLat_FE.png">
<figcaption><span class="label">Figure 14-9. </span>2009 Page - vDisk Stats - Frontend Write Latency</figcaption>
</figure>
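<p>If you want to build a similar latency histogram from your own samples, the bucketing can be sketched as follows. &nbsp;The bucket boundaries below are hypothetical and may not match the ones used on the 2009 page.</p>

```python
from bisect import bisect_right
from typing import List, Sequence

# Hypothetical bucket boundaries in microseconds (not taken from the 2009 page).
BUCKET_EDGES_US = [1000, 2000, 5000, 10000, 20000, 50000, 100000]


def latency_histogram(latencies_us: Sequence[int],
                      edges: Sequence[int] = BUCKET_EDGES_US) -> List[int]:
    """Count samples per bucket.

    Bucket 0 holds latencies below edges[0], bucket i holds
    edges[i-1] <= latency < edges[i], and the last bucket holds
    everything at or above edges[-1].
    """
    counts = [0] * (len(edges) + 1)
    for latency in latencies_us:
        counts[bisect_right(edges, latency)] += 1
    return counts
```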

<p>The next key area is the I/O size distribution which shows a histogram of the read and write I/O sizes.</p>

<p>The figure shows the 'Read Size Distribution' histogram:</p>

<figure class="large" id="id-wntMupsQHQF1"><img alt="2009 Page - vDisk Stats - Read I/O Size" class="iimagesv22009pagesstargate_readsizepng" src="imagesv2/2009Pages/stargate_readSize.png">
<figcaption><span class="label">Figure 14-10. </span>2009 Page - vDisk Stats - Read I/O Size</figcaption>
</figure>

<p>The figure shows the 'Write Size Distribution'&nbsp;histogram:</p>

<figure class="large" id="id-qMt9Sys5HQFB"><img alt="2009 Page - vDisk Stats - Write I/O Size" class="iimagesv22009pagesstargate_writesizepng" src="imagesv2/2009Pages/stargate_writeSize.png">
<figcaption><span class="label">Figure 14-11. </span>2009 Page - vDisk Stats - Write I/O Size</figcaption>
</figure>

<p>The next key area is the 'Working Set Size' section which provides insight on working set sizes for the last 2 minutes and 1 hour. &nbsp;This is broken down for both read and write I/O.</p>

<p>The figure shows the 'Working Set Sizes' table:</p>

<figure id="id-aztmtms9HWFZ"><img alt="2009 Page - vDisk Stats - Working Set" class="iimagesv22009pagesstargate_workingsetpng" src="imagesv2/2009Pages/stargate_workingSet.png">
<figcaption><span class="label">Figure 14-12. </span>2009 Page - vDisk Stats - Working Set</figcaption>
</figure>
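<p>Conceptually, a working set size is the amount of unique data touched within a time window. &nbsp;The following is a rough sketch of that idea, assuming a 4K tracking granularity and a simple (timestamp, offset, length) trace; both are illustrative assumptions rather than details of the actual implementation.</p>

```python
from typing import Iterable, Tuple

BLOCK_SIZE = 4096  # hypothetical tracking granularity in bytes


def working_set_bytes(ops: Iterable[Tuple[float, int, int]],
                      now: float,
                      window_s: float,
                      block_size: int = BLOCK_SIZE) -> int:
    """Return the unique bytes touched in [now - window_s, now].

    Each op is (timestamp_s, offset_bytes, length_bytes). Bytes are
    counted at block granularity, so a partial block counts as a full one.
    """
    blocks = set()
    for ts, offset, length in ops:
        if now - window_s <= ts <= now and length > 0:
            first = offset // block_size
            last = (offset + length - 1) // block_size
            blocks.update(range(first, last + 1))
    return len(blocks) * block_size
```

<p>Calling it with window_s=120 and window_s=3600 would approximate the 2 minute and 1 hour views; run it separately over read and write traces to mirror the breakdown on the page.</p>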

<p>The 'Read Source' section provides details on which tier or location the read I/Os are being served from.</p>

<p>The figure shows the 'Read Source' details:</p>

<figure id="id-oPtmcRsRH3Fm"><img alt="2009 Page - vDisk Stats - Read Source" class="iimagesv22009pagesstargate_readsourcepng" src="imagesv2/2009Pages/stargate_readSource.png">
<figcaption><span class="label">Figure 14-13. </span>2009 Page - vDisk Stats - Read Source</figcaption>
</figure>

<div data-type="note" class="note" id="pro-tip-dbi8IEsnHWF4"><h6>Note</h6>
<h5>Pro tip</h5>

<p>If you're seeing high read latency, take a look at the read source for the vDisk to see where the I/Os are being served from. &nbsp;In most cases high latency could be caused by reads coming from HDD (Estore HDD).</p>
</div>

<p>The 'Write Destination' section shows where new write I/Os are landing.</p>

<p>The figure shows the 'Write Destination' table:</p>

<figure id="id-Vrt9CNs5HzFk"><img alt="2009 Page - vDisk Stats - Write Destination" class="iimagesv22009pagesstargate_writedestpng" src="imagesv2/2009Pages/stargate_writeDest.png">
<figcaption><span class="label">Figure 14-14. </span>2009 Page - vDisk Stats - Write Destination</figcaption>
</figure>

<div data-type="note" class="note" id="pro-tip-xkiVsosqHzF5"><h6>Note</h6>
<h5>Pro tip</h5>

<p>Random I/Os will be written to the Oplog, sequential I/Os will bypass the Oplog and be directly written to the Extent Store (Estore).</p>
</div>

<p>Another interesting data point is what data is being up-migrated from HDD to SSD via ILM. &nbsp;The 'Extent Group Up-Migration' table shows data that has been up-migrated in the last 300, 3,600 and 86,400 seconds.</p>

<p>The figure shows the 'Extent Group Up-Migration' table:</p>

<figure id="id-0Ot9SZs9HJF3"><img alt="2009 Page - vDisk Stats - Extent Group Up-Migration" class="iimagesv22009pagesstargate_egroupilmpng" src="imagesv2/2009Pages/stargate_eGroupILM.png">
<figcaption><span class="label">Figure 14-15. </span>2009 Page - vDisk Stats - Extent Group Up-Migration</figcaption>
</figure>
</section>
<section data-type="sect2" id="using-the-2010-page-curator-aOIJu5HnFW">
<h4>Using the 2010 Page (Curator)</h4>

<p>The 2010 page is a detailed page for monitoring the Curator MapReduce framework. &nbsp;This page provides details on jobs, scans, and associated tasks.&nbsp;</p>

<p>You can reach the Curator page by navigating to http://&lt;CVM IP&gt;:2010. &nbsp;NOTE: if you're not on the Curator Master, click on the IP hyperlink after 'Curator Master: '.</p>

<p>The top of the page will show various details about the Curator Master including uptime, build version, etc.</p>

<p>The next section is the 'Curator Nodes' table which shows various details about the nodes in the cluster, the roles, and health status. &nbsp;These will be the nodes Curator leverages for the distributed processing and delegation of tasks.</p>

<p>The figure shows the 'Curator Nodes' table:</p>

<figure id="id-n3t5FrurHNFB"><img alt="2010 Page - Curator Nodes" class="iimagesv22010pagescurator_nodes2png" src="imagesv2/2010Pages/curator_nodes2.png">
<figcaption><span class="label">Figure 14-16. </span>2010 Page - Curator Nodes</figcaption>
</figure>

<p>The next section is the 'Curator Jobs' table which shows the completed or currently running jobs. &nbsp;</p>

<p>There are two main types of jobs: a partial scan, which is eligible to run every 60 minutes, and a full scan, which is eligible to run every 6 hours. &nbsp;NOTE: the timing will be variable based upon utilization and other activities.</p>

<p>These scans will run on their periodic schedules; however, they can also be triggered by certain cluster events.</p>

<p>Here are some of the reasons for a job's execution:</p>

<ul>
	<li>Periodic (normal state)</li>
	<li>Disk / Node / Block failure</li>
	<li>ILM Imbalance</li>
	<li>Disk / Tier Imbalance</li>
</ul>
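<p>The eligibility logic described above can be sketched roughly as follows. &nbsp;The intervals come from the text; the function names and the 'triggered' flag (standing in for events such as a disk failure or tier imbalance) are hypothetical, and the real scheduler also factors in utilization.</p>

```python
from datetime import datetime, timedelta

# Intervals from the text; actual timing varies with cluster activity.
SCAN_INTERVALS = {
    "partial": timedelta(minutes=60),
    "full": timedelta(hours=6),
}


def next_eligible(scan_type: str, last_run: datetime) -> datetime:
    """Return the earliest time the periodic scan may run again."""
    return last_run + SCAN_INTERVALS[scan_type]


def is_eligible(scan_type: str, last_run: datetime, now: datetime,
                triggered: bool = False) -> bool:
    """A scan may run when its period has elapsed or an event triggered it."""
    return triggered or now >= next_eligible(scan_type, last_run)
```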

<p>The figure shows the 'Curator Jobs' table:</p>

<figure id="id-yrtkhpuvHzFk"><img alt="2010 Page - Curator Jobs" class="iimagesv22010pagescurator_jobs2png" src="imagesv2/2010Pages/curator_jobs2.png">
<figcaption><span class="label">Figure 14-17. </span>2010 Page - Curator Jobs</figcaption>
</figure>

<p>The table shows some of the high-level activities performed by each job:</p>

<table border="1" cellpadding="1" cellspacing="1" style="width: 100%;">
	<thead>
		<tr>
			<th scope="col"><strong>Activity</strong></th>
			<th scope="col"><strong>Full Scan</strong></th>
			<th scope="col"><strong>Partial Scan</strong></th>
		</tr>
	</thead>
	<tbody>
		<tr>
			<th scope="row">ILM</th>
			<td>X</td>
			<td>X</td>
		</tr>
		<tr>
			<th scope="row">Disk Balancing</th>
			<td>X</td>
			<td>X</td>
		</tr>
		<tr>
			<th scope="row">Compression</th>
			<td>X</td>
			<td>X</td>
		</tr>
		<tr>
			<th scope="row">Deduplication</th>
			<td>X</td>
			<td>&nbsp;</td>
		</tr>
		<tr>
			<th scope="row">Erasure Coding</th>
			<td>X</td>
			<td>&nbsp;</td>
		</tr>
		<tr>
			<th scope="row">Garbage Cleanup</th>
			<td>X</td>
			<td>&nbsp;</td>
		</tr>
	</tbody>
</table>

<p>Clicking on the 'Execution id' will bring you to the job details page which displays various job stats as well as generated tasks.</p>

<p>The table at the top of the page will show various details on the job including the type, reason, tasks and duration.</p>

<p>The next section is the 'Background Task Stats' table which displays various details on the type of tasks, quantity generated and priority.</p>

<p>The figure shows the job details table:</p>

<figure id="id-aztPFgu9HWFZ"><img alt="2010 Page - Curator Job - Details" class="iimagesv22010pagesjob_details2png" src="imagesv2/2010Pages/job_details2.png">
<figcaption><span class="label">Figure 14-18. </span>2010 Page - Curator Job - Details</figcaption>
</figure>

<p>The figure shows the 'Background Task Stats' table:</p>

<figure id="id-9otmfbuNH3Fx"><img alt="2010 Page - Curator Job - Tasks" class="iimagesv22010pagesjob_tasks2png" src="imagesv2/2010Pages/job_tasks2.png">
<figcaption><span class="label">Figure 14-19. </span>2010 Page - Curator Job - Tasks</figcaption>
</figure>

<p>The next section is the 'MapReduce Jobs' table which shows the actual MapReduce jobs started by each Curator job. &nbsp;Partial scans will have a single MapReduce Job, full scans will have four MapReduce Jobs.</p>

<p>The figure shows the 'MapReduce Jobs' table:</p>

<figure class="large" id="id-ZptlI1uvHkFQ"><img alt="2010 Page - MapReduce Jobs" class="iimagesv22010pagescurator_mrjobs2png" src="imagesv2/2010Pages/curator_mrjobs2.png">
<figcaption><span class="label">Figure 14-20. </span>2010 Page - MapReduce Jobs</figcaption>
</figure>

<p>Clicking on the 'Job id' will bring you to the MapReduce job details page which displays the task status,&nbsp;various counters and details about the MapReduce job.</p>

<p>The figure shows a sample of some of the job counters:</p>

<figure id="id-zmt5CAuxHlFm"><img alt="2010 Page - MapReduce Job - Counters" class="iimagesv22010pagesjob_counters2png" src="imagesv2/2010Pages/job_counters2.png">
<figcaption><span class="label">Figure 14-21. </span>2010 Page - MapReduce Job - Counters</figcaption>
</figure>

<p>The next section on the main page is the 'Queued Curator Jobs' and 'Last Successful Curator Scans' section. These tables show when the periodic scans are eligible to run and the last successful scan's details.</p>

<p>The figure shows the&nbsp;'Queued Curator Jobs' and 'Last Successful Curator Scans' section:</p>

<figure class="large" id="id-0OtyTyu9HJF3"><img alt="2010 Page - Queued and Successful Scans" class="iimagesv22010pagescurator_queue_lastsuccessful2png" src="imagesv2/2010Pages/curator_queue_lastsuccessful2.png">
<figcaption><span class="label">Figure 14-22. </span>2010 Page - Queued and Successful Scans</figcaption>
</figure>
</section>
</section>
</section>
</div>

<div data-type="part" id="book-of-vsphere-7aBig">
<h1><span class="label">Part IV. </span>Book of vSphere</h1>

<section data-type="chapter" id="architecture-NYIMsy">
<h2>Architecture</h2>

<section data-type="sect1" id="node-architecture-22IRsYsQ">
<h3>Node Architecture</h3>

<p>In ESXi deployments, the Controller VM (CVM) runs as a VM and disks are presented using VMDirectPath I/O.&nbsp; This allows the full PCI controller (and attached devices) to be passed through directly to the CVM and bypass the hypervisor.</p>

<figure id="id-9otduEsDsX"><img alt="ESXi Node Architecture" class="iimagesv2esx_nodepng" src="imagesv2/esx_node.png">
<figcaption><span class="label">Figure 15-1. </span>ESXi Node Architecture</figcaption>
</figure>
</section>

<section data-type="sect1" id="configuration-maximums-and-scalability-3jIAuksQ">
<h3>Configuration Maximums and Scalability</h3>

<p>The following configuration maximums and scalability limits are applicable:</p>

<ul>
	<li>Maximum cluster size: <strong>64</strong></li>
	<li>Maximum vCPUs per VM: <strong>128</strong></li>
	<li>Maximum memory per VM: <strong>4TB</strong></li>
	<li>Maximum VMs per host: <strong>1,024</strong></li>
	<li>Maximum VMs per cluster: <strong>8,000 (2,048 per datastore if HA is enabled)</strong></li>
</ul>

<p>NOTE: As of vSphere 6.0</p>
</section>

<section data-type="sect1">
<h3>Networking</h3>

<p>Each ESXi host has a local vSwitch which is used for intra-host communication between the Nutanix CVM and host.  For external communication and VMs a standard vSwitch (default) or dvSwitch is leveraged.</p>

<p>
  The local vSwitch (vSwitchNutanix) is for local communication between the Nutanix CVM and ESXi host.  The  host has a vmkernel interface  on this vSwitch (vmk1 - 192.168.5.1) and the CVM has an interface bound to a port group on this internal switch (svm-iscsi-pg - 192.168.5.2).  This is the primary storage communication path.
</p>

<p>The external vSwitch can be a standard vSwitch or a dvSwitch.  This will host the external interfaces for the ESXi host and CVM as well as the port groups leveraged by VMs on the host.  The external vmkernel interface is leveraged for host management, vMotion, etc.  The external CVM interface is used for communication to other Nutanix CVMs.  As many port groups can be created as required assuming the VLANs are enabled on the trunk.</p>

<p>The following figure shows a conceptual diagram of the vSwitch architecture:</p>

<figure><img alt="ESXi vSwitch Network Overview" src="imagesv2/esxi_net.png">
<figcaption><span class="label">Figure. </span>ESXi vSwitch Network Overview</figcaption>
</figure>

<div data-type="note" class="note" id="pro-tip-vAiPIXC9HxF5"><h6>Note</h6>
<h5>Uplink and Teaming policy</h5>

<p>It is recommended to have dual ToR switches and uplinks across both switches for switch HA.  By default the system will have uplink interfaces in active/passive mode.  Upstream switch architectures that are capable of active/active uplink interfaces (e.g. vPC, MLAG, etc.) can be leveraged for additional network throughput.</p>
</div>
</section>

</section>

<section data-type="chapter" id="how-it-works-22IBuk">
<h2>How It Works</h2>

<section data-type="sect1" id="array-offloads-vaai-3jImsMuQ">
<h3>Array Offloads – VAAI</h3>

<p>The Nutanix platform supports the vStorage APIs for Array Integration (VAAI),&nbsp;which allow the hypervisor to offload certain tasks to the array.&nbsp; This is much more efficient as the hypervisor doesn’t need to be the 'man in the middle'. Nutanix currently supports the VAAI primitives for NAS, including the ‘full file clone’, ‘fast file clone’, and ‘reserve space’ primitives.&nbsp; Here’s a good article explaining the various primitives: http://cormachogan.com/2012/11/08/vaai-comparison-block-versus-nas/.&nbsp;</p>

<p>For both the full and fast file clones, a DSF 'fast clone' is done, meaning a writable snapshot (using re-direct on write) for each clone that is created.&nbsp; Each of these clones has its own block map, meaning that chain depth isn’t anything to worry about. The following will determine whether or not VAAI will be used for specific scenarios:</p>

<ul>
	<li>Clone VM with Snapshot –&gt; VAAI will NOT be used</li>
	<li>Clone VM without Snapshot which is Powered Off –&gt; VAAI WILL be used</li>
	<li>Clone VM to a different Datastore/Container –&gt; VAAI will NOT be used</li>
	<li>Clone VM which is Powered On&nbsp; –&gt; VAAI will NOT be used</li>
</ul>
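<p>The clone scenarios above can be condensed into a small predicate. &nbsp;This is only a sketch: the function and parameter names are hypothetical, and combining the individual rules with a logical AND is my reading of the list rather than a documented algorithm.</p>

```python
def vaai_used_for_clone(has_snapshot: bool, powered_on: bool,
                        same_datastore: bool) -> bool:
    """Return True when the listed rules suggest the clone is offloaded via VAAI.

    Per the list above, VAAI is not used if the source VM has a snapshot,
    is powered on, or is being cloned to a different datastore/container.
    """
    return not has_snapshot and not powered_on and same_datastore
```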

<p>These scenarios apply to VMware View:</p>

<ul>
	<li>View Full Clone (Template with Snapshot) –&gt; VAAI will NOT be used</li>
	<li>View Full Clone (Template w/o Snapshot) –&gt; VAAI WILL be used</li>
	<li>View Linked Clone (VCAI) –&gt; VAAI WILL be used</li>
</ul>

<p>You can validate VAAI operations are taking place by using the ‘NFS Adapter’ Activity Traces page.</p>
</section>

<section data-type="sect1" id="cvm-autopathing-aka-hapy-PgIJuQub">
<h3>CVM Autopathing aka Ha.py</h3>

<p>In this section, I’ll cover how CVM 'failures' are handled (I’ll cover how we handle component failures in a future update).&nbsp; A CVM 'failure' could include a user powering down the CVM, a CVM rolling upgrade, or any event which might bring down the CVM. DSF has a feature called autopathing: when a local CVM becomes unavailable, the I/Os are transparently handled by other CVMs in the cluster. The hypervisor and CVM communicate using a private 192.168.5.0 network on a dedicated vSwitch (more on this above).&nbsp; This means that all storage I/Os go to the internal IP address of the CVM (192.168.5.2).&nbsp; The external IP address of the CVM is used for remote replication and for CVM communication.</p>

<p>The following figure shows an example of what this looks like:</p>

<figure id="id-ZptjTxupuG"><img alt="ESXi Host Networking" class="iimagesv2esx_hapy_1png" src="imagesv2/esx_hapy_1.png">
<figcaption><span class="label">Figure 16-1. </span>ESXi Host Networking</figcaption>
</figure>

<p>In the event of a local CVM failure, the local 192.168.5.2 addresses previously hosted by the local CVM are unavailable.&nbsp; DSF will automatically detect this outage and will redirect these I/Os to another CVM in the cluster over 10GbE.&nbsp; The re-routing is done transparently to the hypervisor and VMs running on the host.&nbsp; This means that even if a CVM is powered down, the VMs will still continue to be able to perform I/Os to DSF. Once the local CVM is back up and available, traffic will then seamlessly be transferred back and served by the local CVM.</p>

<p>The following figure shows a graphical representation of how this looks for a failed CVM:</p>

<figure class="large" id="id-zmtMFEuPuR"><img alt="ESXi Host Networking - Local CVM Down" class="iimagesv2esx_hapy_2png" src="imagesv2/esx_hapy_2.png">
<figcaption><span class="label">Figure 16-2. </span>ESXi Host Networking - Local CVM Down</figcaption>
</figure>
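<p>The path selection behavior can be pictured with the following sketch. &nbsp;It is purely conceptual: the function name, the health map and picking the first healthy remote CVM in sorted order are assumptions for illustration, not the actual ha.py logic.</p>

```python
from typing import Dict, Optional

LOCAL_CVM = "192.168.5.2"  # internal CVM address described in the text


def select_io_target(local_cvm_up: bool,
                     remote_cvms: Dict[str, bool]) -> Optional[str]:
    """Pick the address that should serve storage I/O for this host.

    Prefer the local CVM; if it is down, fall back to a healthy remote CVM
    (keyed by its external IP). Returns None if nothing is available.
    """
    if local_cvm_up:
        return LOCAL_CVM
    for ip in sorted(remote_cvms):  # arbitrary but deterministic choice
        if remote_cvms[ip]:
            return ip
    return None
```

<p>Once the local CVM reports healthy again, the same function returns the local address, which mirrors traffic being handed back to the local CVM.</p>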
</section>
</section>

<section data-type="chapter" id="administration-3jI3Tg">
<h2>Administration</h2>

<section data-type="sect1" id="important-pages-PgIys4Tb">
<h3>Important Pages</h3>

<p>More coming soon!</p>
</section>

<section data-type="sect1" id="command-reference-lkIBuxTJ">
<h3>Command Reference</h3>

<h5>ESXi cluster upgrade</h5>

 <p class="codedescription">Description: Perform an automated upgrade of ESXi hosts using the CLI and custom offline bundle
<br>
# Upload upgrade offline bundle to a Nutanix CVM
<br>
# Log in to Nutanix CVM
<br>
# Perform upgrade</p>

<p class="codetext">cluster --md5sum=&lt;bundle_checksum&gt; --bundle=&lt;/path/to/offline_bundle&gt; host_upgrade</p>

<p># Example</p>

<p class="codetext">cluster --md5sum=bff0b5558ad226ad395f6a4dc2b28597 --bundle=/tmp/VMware-ESXi-5.5.0-1331820-depot.zip host_upgrade</p>
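<p>If you need to generate the value passed to --md5sum yourself, one way is a short script such as the one below. &nbsp;The helper name is hypothetical; it only uses Python's standard hashlib module and reads the file in chunks so large bundles don't need to fit in memory.</p>

```python
import hashlib


def bundle_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of the file at path, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns empty bytes (EOF)
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

<p>The returned string can then be supplied as the &lt;bundle_checksum&gt; argument in the command above.</p>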

<h5>Restart ESXi host services</h5>

 <p class="codedescription">Description: Restart each ESXi host's services in an incremental manner</p>

<p class="codetext">for i in `hostips`;do ssh root@$i "services.sh restart";done</p>

<h5>Display ESXi host nics in ‘Up’ state</h5>

 <p class="codedescription">Description: Display the ESXi host's nics which are in an 'Up' state</p>

<p class="codetext">for i in `hostips`;do echo $i &amp;&amp; ssh root@$i esxcfg-nics -l | grep Up;done</p>

<h5>Display ESXi host 10GbE nics and status</h5>

 <p class="codedescription">Description: Display the ESXi host's 10GbE nics and status</p>

<p class="codetext">for i in `hostips`;do echo $i &amp;&amp; ssh root@$i esxcfg-nics -l | grep ixgbe;done</p>

<h5>Display ESXi host active adapters</h5>

 <p class="codedescription">Description: Display the ESXi host's active, standby and unused adapters</p>

<p class="codetext">for i in `hostips`;do echo $i &amp;&amp;&nbsp; ssh root@$i "esxcli network vswitch standard policy failover get --vswitch-name vSwitch0";done</p>

<h5>Display ESXi host routing tables</h5>

 <p class="codedescription">Description: Display the ESXi host's routing tables</p>

<p class="codetext">for i in `hostips`;do ssh root@$i 'esxcfg-route -l';done</p>

<h5>Check if VAAI is enabled on datastore</h5>

 <p class="codedescription">Description: Check whether or not VAAI is enabled/supported for a datastore</p>

<p class="codetext">vmkfstools -Ph /vmfs/volumes/&lt;Datastore Name&gt;</p>

<h5>Set VIB acceptance level to community supported</h5>

 <p class="codedescription">Description: Set the vib acceptance level to CommunitySupported allowing for 3rd party vibs to be installed</p>

<p class="codetext">esxcli software acceptance set --level CommunitySupported</p>

<h5>Install VIB</h5>

 <p class="codedescription">Description: Install a vib without checking the signature</p>

<p class="codetext">esxcli software vib install --viburl=/&lt;VIB directory&gt;/&lt;VIB name&gt; --no-sig-check</p>

<p># OR</p>

<p class="codetext">esxcli software vib install --depot=/&lt;VIB directory&gt;/&lt;VIB name&gt; --no-sig-check</p>

<h5>Check ESXi ramdisk space</h5>

 <p class="codedescription">Description: Check free space of ESXi ramdisk</p>

<p class="codetext">for i in `hostips`;do echo $i; ssh root@$i 'vdf -h';done</p>

<h5>Clear pynfs logs</h5>

 <p class="codedescription">Description: Clears the pynfs logs on each ESXi host</p>

<p class="codetext">for i in `hostips`;do echo $i; ssh root@$i '&gt; /pynfs/pynfs.log';done</p>
</section>

<section data-type="sect1" id="metrics-and-thresholds-M2I5TRTv">
<h3>Metrics and Thresholds</h3>

<p>More coming soon!</p>
</section>
</section>

<section data-type="sect1" id="troubleshooting-andamp-advanced-administration-PgIPS5">
<h3>Troubleshooting &amp; Advanced Administration</h3>

<p>More coming soon!</p>
</section>
</div>

<div data-type="part" id="book-of-hyper-v-7pdi1">
<h1><span class="label">Part V. </span>Book of Hyper-V</h1>

<section data-type="chapter" id="architecture-22IRsV">
<h2>Architecture</h2>

<section data-type="sect1" id="node-architecture-3jImsksn">
<h3>Node Architecture</h3>

<p>In Hyper-V deployments, the Controller VM (CVM) runs as a VM and disks are presented using disk passthrough.</p>

<figure id="id-oPtruYsWsB"><img alt="Hyper-V Node Architecture" class="iimagesv2hyperv_nodepng" src="imagesv2/hyperv_node.png">
<figcaption><span class="label">Figure 18-1. </span>Hyper-V Node Architecture</figcaption>
</figure>
</section>

<section data-type="sect1" id="configuration-maximums-and-scalability-PgIJu8s2">
<h3>Configuration Maximums and Scalability</h3>

<p>The following configuration maximums and scalability limits are applicable:</p>

<ul>
	<li>Maximum cluster size: <strong>64</strong></li>
	<li>Maximum vCPUs per VM: <strong>64</strong></li>
	<li>Maximum memory per VM: <strong>1TB</strong></li>
	<li>Maximum VMs per host: <strong>1,024</strong></li>
	<li>Maximum VMs per cluster: <strong>8,000</strong></li>
</ul>

<p>NOTE: As of Hyper-V 2012 R2</p>
</section>

<section data-type="sect1">
<h3>Networking</h3>

<p>Each Hyper-V host has an internal-only virtual switch which is used for intra-host communication between the Nutanix CVM and host.  For external communication and VMs an external virtual switch (default) or logical switch is leveraged.</p>

<p>
  The internal switch (InternalSwitch) is for local communication between the Nutanix CVM and Hyper-V host.  The host has a virtual ethernet interface (vEth) on this internal switch (192.168.5.1) and the CVM has a vEth on this internal switch (192.168.5.2).  This is the primary storage communication path.
</p>

<p>The external vSwitch can be a standard virtual switch or a logical switch.  This will host the external interfaces for the Hyper-V host and CVM as well as the logical and VM networks leveraged by VMs on the host.  The external vEth interface is leveraged for host management, live migration, etc.  The external CVM interface is used for communication to other Nutanix CVMs.  As many logical and VM networks can be created as required assuming the VLANs are enabled on the trunk.</p>

<p>The following figure shows a conceptual diagram of the virtual switch architecture:</p>

<figure><img alt="Hyper-V Virtual Switch Network Overview" src="imagesv2/hyperv_net.png">
<figcaption><span class="label">Figure. </span>Hyper-V Virtual Switch Network Overview</figcaption>
</figure>

<div data-type="note" class="note"><h6>Note</h6>
<h5>Uplink and Teaming policy</h5>

<p>It is recommended to have dual ToR switches and uplinks across both switches for switch HA.  By default the system will have the LBFO team in switch independent mode which doesn't require any special configuration.</p>
</div>
</section>
</section>

<section data-type="chapter" id="how-it-works-3jIAul">
<h2>How It Works</h2>

<section data-type="sect1" id="array-offloads-odx-PgIysQu2">
<h3>Array Offloads – ODX</h3>

<p>The Nutanix platform supports the Microsoft Offloaded Data Transfers (ODX), which allow the hypervisor to offload certain tasks to the array.&nbsp; This is much more efficient as the hypervisor doesn’t need to be the 'man in the middle'. Nutanix currently supports the ODX primitives for SMB, which include full copy and zeroing operations.&nbsp; However, contrary to VAAI which has a 'fast file' clone operation (using writable snapshots), the ODX primitives do not have an equivalent and perform a full copy.&nbsp; Given this, it is more efficient to rely on the native DSF clones which can currently be invoked via nCLI, REST, or PowerShell CMDlets. Currently ODX IS invoked for the following operations:</p>

<ul>
	<li>In VM or VM to VM file copy on DSF SMB share</li>
	<li>SMB share file copy</li>
	<li>Deploy the template from the SCVMM Library (DSF SMB share)</li>
</ul>

<p>NOTE: Shares must be added to the SCVMM cluster using short names (e.g., not FQDN).&nbsp; An easy way to force this is to add an entry into the hosts file for the cluster (e.g. 10.10.10.10&nbsp;&nbsp;&nbsp;&nbsp; nutanix-130).</p>

<p>ODX is NOT invoked for the following operations:</p>

<ul>
	<li>Clone VM through SCVMM</li>
	<li>Deploy template from SCVMM Library (non-DSF SMB Share)</li>
	<li>XenDesktop Clone Deployment</li>
</ul>

<p>You can validate ODX operations are taking place by using the ‘NFS Adapter’ Activity Traces page (yes, I said NFS, even though this is being performed via SMB).&nbsp; The operation activity shown will be ‘NfsSlaveVaaiCopyDataOp’ when copying a vDisk and ‘NfsSlaveVaaiWriteZerosOp’ when zeroing out a disk.</p>
</section>
</section>

<section data-type="chapter" id="administration-PgIYTe">
<h2>Administration</h2>

<section data-type="sect1" id="important-pages-lkImsxTa">
<h3>Important Pages</h3>

<p>More coming soon!</p>
</section>

<section data-type="sect1" id="command-reference-M2I0uRT0">
<h3>Command Reference</h3>

<h5>Execute command on multiple remote hosts</h5>

<p class="codedescription">Description: Execute a PowerShell command or script block on one or many remote hosts</p>

<p class="codetext">$targetServers = "Host1","Host2","Etc"
<br>
Invoke-Command -ComputerName&nbsp; $targetServers {
<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;COMMAND or SCRIPT BLOCK&gt;
<br>
}</p>

<h5>Check available VMQ Offloads</h5>

<p class="codedescription">Description: Display the available number of VMQ offloads for a particular host</p>

<p class="codetext">gwmi -Namespace "root\virtualization\v2" -Class Msvm_VirtualEthernetSwitch | select elementname, MaxVMQOffloads</p>

<h5>Disable VMQ for VMs matching a specific prefix</h5>

<p class="codedescription">Description: Disable VMQ for specific VMs</p>

<p class="codetext">$vmPrefix = "myVMs"
<br>
Get-VM | Where {$_.Name -match $vmPrefix} | Get-VMNetworkAdapter | Set-VMNetworkAdapter -VmqWeight 0</p>

<h5>Enable VMQ for VMs matching a certain prefix</h5>

<p class="codedescription">Description: Enable VMQ for specific VMs</p>

<p class="codetext">$vmPrefix = "myVMs"
<br>
Get-VM | Where {$_.Name -match $vmPrefix} | Get-VMNetworkAdapter | Set-VMNetworkAdapter -VmqWeight 1</p>

<h5>Power-On VMs matching a certain prefix</h5>

<p class="codedescription">Description: Power-On VMs matching a certain prefix</p>

<p class="codetext">$vmPrefix = "myVMs"
<br>
Get-VM | Where {$_.Name -match $vmPrefix -and $_.StatusString -eq "Stopped"} | Start-VM</p>

<h5>Shutdown VMs matching a certain prefix</h5>

<p class="codedescription">Description: Shutdown VMs matching a certain prefix</p>

<p class="codetext">$vmPrefix = "myVMs"
<br>
Get-VM | Where {$_.Name -match $vmPrefix -and $_.StatusString -eq "Running"} | Shutdown-VM -RunAsynchronously</p>

<h5>Stop VMs matching a certain prefix</h5>

<p class="codedescription">Description: Stop VMs matching a certain prefix</p>

<p class="codetext">$vmPrefix = "myVMs"
<br>
Get-VM | Where {$_.Name -match $vmPrefix} | Stop-VM</p>

<h5>Get Hyper-V host RSS settings</h5>

<p class="codedescription">Description: Get Hyper-V host RSS (receive side scaling) settings</p>

<p class="codetext">Get-NetAdapterRss</p>

<h5>Check Winsh and WinRM connectivity</h5>

<p class="codedescription">Description: Check Winsh and WinRM connectivity / status by performing a sample query which should return the computer system object not an error</p>

<p class="codetext">allssh 'winsh "get-wmiobject win32_computersystem"'</p>
</section>

<section data-type="sect1" id="metrics-and-thresholds-QMImTzTx">
<h3>Metrics and Thresholds</h3>

<p>More coming soon!</p>
</section>
</section>

<section data-type="sect1" id="troubleshooting-andamp-advanced-administration-lkI8SA">
<h3>Troubleshooting &amp; Advanced Administration</h3>

<p>More coming soon!</p>
</section>
</div>

<section data-type="afterword" id="afterword-622I2">
<h1>Afterword</h1>

<p>Thank you for reading The Nutanix Bible!&nbsp; Stay tuned for many more upcoming updates and enjoy the Nutanix platform!</p>
</section>

    </div>
    <!-- START: Ken Chen 11-11-2015-->
    <script src="js/menu.js"></script>
    <!-- END: Ken Chen 11-11-2015-->
  </body>
</html>