Capacity management is an ongoing
process, because no implementation remains static with regard to content and
usage. You need to plan for growth and change, so that your SharePoint Server
2013–based environment can continue to deliver an effective business solution.
Capacity Planning is only one
part of the capacity management cycle. It is the initial set of activities that
brings the design architect to the point where there is an initial architecture
that the architect believes will best serve the SharePoint Server 2013
deployment.
The capacity
management model includes additional steps to help you validate and tune the
initial architecture, and provides a feedback loop for re-planning and
optimizing the production environment until it can support design goals with
optimal choices of hardware, topology, and configuration.
Capacity
management versus capacity planning
Capacity management extends the
concept of capacity planning to express a cyclical approach in which the
capacity of a SharePoint Server 2013 deployment is continually monitored and
optimized to accommodate changing conditions and requirements.
SharePoint Server 2013 offers
increased flexibility and can be configured to sustain usage scenarios in a
wide variety of different scale points. There is no single deployment
architecture. Therefore, system designers and administrators must understand
the requirements for their specific environments.
SharePoint Server 2013
capacity management model
·
Step 1: Model. Modeling is the process by which you decide the key solutions that you want your environment to support, and establish all important metrics and parameters. The output of the modeling exercise should be a list of all the key data that you must have to design your environment.
o Understand your expected workload and
dataset.
o Set farm performance and reliability
targets.
o Analyze the SharePoint Server 2013
IIS logs.
·
Step 2: Design. Once you have collected the data from Step 1, you can design your farm. Outputs are detailed data architecture and physical and logical topologies.
o Determine your starting point
architecture.
o Select your hardware.
·
Step 3: Pilot, Test, and Optimize. If you have designed a new deployment, you must deploy a pilot environment for testing against your workload and expected usage characteristics. For an existing farm, testing is advised when major changes are being made to the infrastructure, but regular optimization based on monitoring results might be necessary to maintain performance targets. The output from this phase is an analysis of test results against targets, and an optimized architecture able to sustain established performance and capacity targets.
o Pilot - Deploy a pilot environment.
o Test - Test against latency and throughput targets.
o Optimize - Gather test results and make any required changes to the farm resources and topology.
·
Step 4: Deploy. This step describes implementing the farm, or deploying changes to an existing farm. Output for a new design is a completed deployment to live production, including all content and user migrations. Output for an existing farm is revised farm maps and updates to maintenance plans.
·
Step 5: Monitor and maintain. This step describes how to set up monitoring, and how to predict and identify bottlenecks and perform regular maintenance and bottleneck mitigation activities.
Glossary
Terms
The following specialized terms are used in
SharePoint Server 2013 capacity management documentation.
·
RPS - Requests per second. The number of requests received by a farm or server in one second. This is a common measurement of server and farm load. The number of requests processed by a farm is greater than the number of page loads and end-user interactions. This is because each page contains several components, each of which creates one or more requests when the page is loaded. Some requests are lighter than other requests with regard to transaction costs. In our lab tests and case study documents, we remove 401 requests and responses (authentication handshakes) from the requests that were used to calculate RPS because they have insignificant impact on farm resources. A sketch that derives RPS from IIS logs in this way appears after this glossary.
·
Peak hours - The time or times during the day when load
on the farm is at its maximum.
·
Peak load - The average maximum daily load on the farm,
measured in RPS.
·
Load spike - Transient load peaks that fall outside
usual peak hours. These can be caused by unplanned increases in user traffic,
decreased farm throughput because of administrative operations, or combinations
of such factors.
·
Scale up - To scale up means to add resources, such as processors or memory, to a server.
·
Scale out - To scale out means to add more servers to a farm.
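Because RPS figures prominently in the rest of this documentation, it helps to see how one can be derived. The following sketch approximates average and peak RPS from an IIS W3C log, excluding 401 responses as described above. The log path and the position of the sc-status field are assumptions; check the #Fields header of your own logs and adjust the index.

```powershell
# Rough RPS estimate from an IIS W3C log, excluding 401 (auth handshake)
# responses. Path and field positions are examples, not defaults you can rely on.
$logPath = 'C:\inetpub\logs\LogFiles\W3SVC1\u_ex140101.log'

$perSecond = @{}
foreach ($line in (Get-Content $logPath)) {
    if ($line.StartsWith('#')) { continue }        # skip W3C header lines
    $fields = $line -split ' '
    if ($fields[11] -eq '401') { continue }        # assumes sc-status at index 11
    $key = $fields[0] + ' ' + $fields[1]           # date + time = one-second bucket
    if ($perSecond.ContainsKey($key)) { $perSecond[$key]++ } else { $perSecond[$key] = 1 }
}

# Note: the average is taken over seconds that saw at least one request.
$stats = $perSecond.Values | Measure-Object -Average -Maximum
'Average RPS: {0:N1}  Peak RPS: {1}' -f $stats.Average, $stats.Maximum
```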
Four
fundamentals of performance
Capacity management focuses on the following four major aspects of sizing your solution:
·
Latency For the purposes of capacity management, latency
is defined as the duration between the time that a user initiates an action,
such as clicking a hyperlink, and the time until the last byte is transmitted
to the client application or web browser.
·
Throughput Throughput is defined as the number of concurrent requests
that a server or server farm can process.
·
Data scale Data scale is defined as the
content size and data corpus that the system can host. The structure and
distribution of the content databases has a significant effect on the time that
is required for the system to process requests (latency) and the number of
concurrent requests it can serve (throughput).
·
Reliability Reliability is a measurement of
the ability of the system to meet the targets set for the latency and
throughput over time.
Oversizing
versus undersizing
Oversizing describes an approach to farm design
in which targets are achieved without full utilization of hardware, and the
resources in the SharePoint Server 2013 farm are significantly and consistently
underutilized. In an oversized deployment, memory, CPU, and other indicators on
the farm's resources show that it can well serve the demand with fewer
resources. The downsides of oversizing are increased hardware and maintenance expenditures, and greater power and space demands.
Undersizing describes an approach to farm design in which performance and capacity targets are not achievable because hardware resources in the SharePoint Server 2013 farm are over-utilized. Undersizing a farm is sometimes
done to reduce hardware costs, but generally results in high latency leading to
a poor user experience, low satisfaction, frequent escalations, high support
costs, and unnecessary spending for troubleshooting and tuning the environment.
Operational
states: Green Zone and Red Zone
When we describe the load of a
production system, we refer to two major operational states: the “Green Zone” state in which the system is
operating under the normal, expected load range, and the “Red Zone” state, which is a state in which
the farm experiences very high transient resource demand that can only be
sustained for limited periods until failures and other performance and
reliability issues occur.
Green
Zone - This
is the state at which the server or farm is operating under normal load
conditions, up to expected daily peak loads. A farm operating in this range
should be able to sustain response times and latency within acceptable
parameters.
·
Server Latency (Average
Response Time): < 0.5 Seconds
·
CPU
Utilization (% Processor Time): 50 percent or less.
This will allow for spikes in the farm caused by services such as Search
Crawls and user profile sync and leave you enough headroom in CPU Utilization
before you enter into the Red Zone.
·
Average
Memory (Available Mbytes): > 4 GB
Red
Zone -
The operating range in which load is greater than normal peak load, but can
still service requests for a limited period. This state is characterized by
greater than normal latency and possible failures caused by saturation of
system bottlenecks.
·
Server
Latency (Average Response Time): >1.0 Second
·
CPU
Utilization (% Processor Time): 75 percent or more
·
Average
Memory (Available Mbytes): < 2 GB
·
Average
Page Load (Average Page Time): > 3.0 Seconds
The ultimate goal of farm design is
to deploy an environment that can consistently support Red Zone load without
service failure and within acceptable latency and throughput targets.
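As a rough operational check of which zone a farm server is in, the counters named above can be sampled with PowerShell. This is a minimal sketch that reuses the Green Zone and Red Zone thresholds listed in this section; the sampling interval and count are arbitrary choices.

```powershell
# Sample CPU and available memory for one minute, then compare the averages
# against the Green Zone / Red Zone thresholds described above.
$counters = '\Processor(_Total)\% Processor Time', '\Memory\Available MBytes'
$samples  = Get-Counter -Counter $counters -SampleInterval 5 -MaxSamples 12

$cpu = ($samples.CounterSamples | Where-Object { $_.Path -like '*processor time*' } |
        Measure-Object -Property CookedValue -Average).Average
$mem = ($samples.CounterSamples | Where-Object { $_.Path -like '*available mbytes*' } |
        Measure-Object -Property CookedValue -Average).Average

if     ($cpu -le 50 -and $mem -gt 4096) { "Green Zone: CPU $([int]$cpu)%, $([int]$mem) MB free" }
elseif ($cpu -ge 75 -or  $mem -lt 2048) { "Red Zone: CPU $([int]$cpu)%, $([int]$mem) MB free" }
else                                    { "Between zones: CPU $([int]$cpu)%, $([int]$mem) MB free" }
```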
Software boundaries and limits
Boundaries: Static limits that cannot be exceeded by design
An example of a boundary is the 2 GB document size limit;
you cannot configure SharePoint Server 2013 to store documents that are larger
than 2 GB. This is a built-in absolute value, and cannot be exceeded by design.
Thresholds: Configurable limits that can be exceeded to accommodate specific
requirements
The default value of certain thresholds can only be exceeded up to an absolute maximum value. A good example is the document size limit. By default, the document size threshold is set to 50 MB, but it can be changed to support the maximum boundary of 2 GB.
Supported limits: Configurable limits that have been set by default to a
tested value
Some supported limits are configurable parameters that
are set by default to the recommended value, while other supported limits
relate to parameters that are not represented by a configurable value.
Deployment key differentiators
Each SharePoint Server 2013 deployment has a key
set of characteristics that will make it unique and different from other farms.
These key differentiators can be described by these four major categories:
·
Specification Describes the farm's hardware, and the farm topology
and configuration.
·
Workload Describes
the demand on the farm, including the number of users and the usage characteristics.
·
Dataset Describes content sizes and distribution.
·
Health and performance Describes the
farm's performance against latency and throughput targets.
Sizing
When you have a good understanding of capacity planning and
management, you can apply your knowledge to system sizing.
Sizing is the term used to describe the selection and configuration of
appropriate data architecture, logical and physical topology, and hardware for
a solution platform.
There is a
range of capacity management and usage considerations that affect how you
should determine the most appropriate hardware and configuration options.
Capacity
Planning
Capacity planning is only one part of the capacity management cycle: the initial set of activities that brings the design architect to an initial architecture that the architect believes will best serve the SharePoint Server 2013 deployment. To undertake effective capacity management for your environment, follow the steps described in this section. Each step requires certain information for successful execution, and has a set of deliverables that you will use in the subsequent step.
Steps
Step
1: Model
Modeling your SharePoint Server
2013-based environment begins with analyzing your existing solutions and
estimating the expected demand and targets for the deployment you are planning
to set up. You start by gathering information about your user base, data requirements, and latency and throughput targets, and by documenting the SharePoint Server features you want to deploy.
Expected workload and dataset
Proper sizing of a SharePoint Server
2013 implementation requires that you study and understand the demand
characteristics that your solution is expected to handle.
Understanding the demand requires
that you be able to describe both the workload characteristics such as number
of users and the most frequently used operations, and dataset characteristics
such as content size and content distribution.
The workload and dataset characteristics together determine the load on the system. Refer to the RPS worksheet to determine the farm capacity.
Workload
Workload describes the demand that the system will need to sustain: the user base and usage characteristics. The following table provides some key metrics that are helpful in determining your workload. You can use this table to record these metrics as you collect them.
| Workload characteristics | Value |
| --- | --- |
| Average daily RPS | 15 |
| Average RPS at peak time | 30 |
| Total number of unique users per day | 4,000 |
| Average daily concurrent users | 1,200 |
| Peak concurrent users at peak time | 2,400 |
| Total number of requests per hour | 120,000 |

| Expected workload distribution | No. of requests per day | % |
| --- | --- | --- |
| Web Browser - Search Crawl | 40,000 | 30 |
| Web Browser - General Collaboration Interaction | 20,000 | |
| Web Browser - Social Interaction | 40,000 | |
| Web Browser - General Interaction | 5,000 | |
| Web Browser - Office Web Apps | 10,000 | |
| Office Clients | | |
| OneNote Client | | |
| SharePoint Workspace | | |
| Outlook RSS Sync | 5,000 | |
| Outlook Social Connector | | |
| Other interactions (Custom Applications/Web services) | | |
Terms
·
Concurrent users – It is most common to measure the
concurrency of operations executed on the server farm as the number of distinct
users generating requests in a given time frame. The key metrics are the daily
average and the concurrent users at peak load.
·
Requests per second (RPS) – RPS is a commonly used indicator
used to describe the demand on the server farm expressed in the number of
requests processed by the farm per second, but with no differentiation between
the type or size of requests. Every organization's user base generates system
load at a rate that is dependent on the organization's unique usage
characteristics.
·
Total daily requests – Total daily requests is a good
indicator of the overall load the system will need to handle. It is most common
to measure all requests except authentication handshake requests (HTTP status
401) over a 24 hour period.
·
Total daily users - Total users is another key
indicator of the overall load the system will need to handle. This measurement
is the actual number of unique users in a 24 hour period, not the total number
of employees in the organization.
·
Workload Distribution – Understanding the distribution of the requests based on the client applications that are interacting with the farm can help predict the expected trend and load changes after migrating to SharePoint Server 2013. As users transition to more recent client versions such as Office 2013 and start using the new capabilities, new load patterns are expected, and RPS and total requests are expected to grow. For each client, we can describe the number of distinct users using it in a time frame of a day, and the amount of total requests that the client or feature generates on the server.
Dataset
Dataset describes the volume of content stored in
the system and how it can be distributed in the data store. The following table
provides some key metrics that are helpful in determining your dataset. You can
use this table to record these metrics as you collect them.
| Object | Value |
| --- | --- |
| DB size (in GB) | |
| Number of content DBs | |
| Number of site collections | |
| Number of web apps | |
| Number of sites | |
| Search index size (# of items) | |
| Number of docs | |
| Number of lists | |
| Average size of sites | |
| Largest site size | |
| Number of user profiles | |
Terms
·
Content size – Understanding the size of the
content that you expect to store in the SharePoint Server 2013 system is
important for planning and architecting the system storage, and also for
properly sizing the Search solution that will crawl and index this content. The
content size is described in total disk space. If you are migrating content
from an existing deployment you might find it simple to identify the total size
that you will move; while planning you should leave room for growth over time
based on the predicted trend.
·
Total
number of documents – Other than the data corpus size, it is important to track
the overall number of items. The system reacts differently if 100 GB of data is
composed of 50 files of 2 GB each versus 100,000 files of 1 KB each. In large
deployments, the less stress there is on a single item, document or area of
documents, the better performance will be. Widely distributed content, like multiple smaller files across many sites and site collections, is easier to serve than a single large document library with very large files.
·
Maximum site collection size – It is important to identify what is
the biggest unit of content that you will store in SharePoint Server 2013;
usually it is an organizational need that prevents you from splitting that unit
of content. Average size of all site collections and the estimated total number
of site collections are additional indicators that will help you identify your
preferred data architecture.
·
Service applications data characteristics – In addition to analyzing the
storage needs for the content store, you should analyze and estimate the sizes
of other SharePoint Server 2013 stores, including:
o Total size of the Search index
o The profile database total size based on the number of users in the profile store
o The social database total size based on the expected number of tags, colleagues, and activities
o The metadata store size
o The size of the usage database
o The size of the Web Analytics database
Setting
Farm Performance and Reliability Targets
A properly designed SharePoint Server
solution should be able to achieve "four nines" (99.99%) of uptime
with sub-second server responsiveness.
The indicators used to describe the performance and reliability of the farm can include:
·
Server availability – Usually described by the percent of
overall uptime of the system. You should track any unexpected downtime and
compare the overall availability to the organizational target you set. The targets are commonly described by a number of nines (for example, 99%, 99.9%, and 99.99%).
·
Server responsiveness – The time it takes the farm to serve
requests is a good indicator to track the health of the farm. This indicator is
usually named server side latency, and it is common to use the average or
median (the 50th percentile) latency of the daily requests being served. The
targets are commonly described in sub-seconds or seconds. Note that if your organization has a target to serve pages from SharePoint Server 2013 in less than two seconds, then the server-side goal needs to be sub-second, to leave time for the page to reach the client over the network and to render in the browser. In general, longer server response times are an indication of an unhealthy farm, because they usually have an impact on throughput; RPS can rarely keep up if most requests spend more than a second on the server.
·
Server spikiness – Another good server-side latency indicator worth tracking is the behavior of the slowest 5% of all requests. Slower requests are usually the requests that hit the system when it is under higher load or, even more commonly, requests that are affected by less frequent activities that occur while users interact with the system; a healthy system is one that keeps its slowest requests under control as well. The target here is similar to server responsiveness, but to achieve sub-second responses on server spikiness, you will need to build the system with plenty of spare resources to handle spikes in load. A sketch that computes median and 95th-percentile latency from IIS logs appears after this list.
·
System resource utilization – Other common indicators used to
track the health of the system are a collection of system counters that
indicate the health of each server in the farm topology. The most frequently
used indicators to track are % CPU utilization and available memory; however, there are several additional counters that can help identify an unhealthy system.
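Both server responsiveness and server spikiness can be tracked from the time-taken field in IIS logs. A minimal sketch, assuming W3C logging with time-taken enabled (when present, it is the last field by default) and a hypothetical log path:

```powershell
# Median (50th percentile) and 95th percentile server response times from an
# IIS W3C log; time-taken is assumed to be the last field, in milliseconds.
$logPath = 'C:\inetpub\logs\LogFiles\W3SVC1\u_ex140101.log'

$times = Get-Content $logPath |
    Where-Object { -not $_.StartsWith('#') } |
    ForEach-Object { [int](($_ -split ' ')[-1]) } |
    Sort-Object

$median = $times[[int][math]::Floor($times.Count * 0.50)]
$p95    = $times[[int][math]::Floor($times.Count * 0.95)]
"Median: $median ms; 95th percentile: $p95 ms (the slowest 5% start here)"
```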
Step
2: Design
Now that you have finished collecting
some facts or estimates on the solution you need to deliver, you are ready to
start the next step of designing a proposed architecture that you predict will
be able to sustain the expected demand.
By the end of this step you should have a design for your physical topology and a layout for your logical topology.
The hardware specifications and the number of machines you lay out are tightly related. To handle a specific load, there are several solutions you can choose to deploy: it is common either to use a small set of strong machines (scale up) or a larger set of smaller machines (scale out). Each solution has its advantages and disadvantages in capacity, redundancy, power, cost, space, and other considerations.
Use the following table to record
your design parameters.
| Role | Type (Standard or virtual) | # of machines | Procs | RAM (GB) | IOPS need | Disk size OS+Log | Data drive |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Web servers | Virtual | 4 | 4 cores | 8 | N/A | 400 GB | N/A |
| Content database server | Standard | 1 cluster | 4 quad-core 2.33 GHz | 48 | 2k | 400 GB | 20 disks of 300 GB @ 15K RPM |
| Application servers | Virtual | 4 | 4 cores | 16 | N/A | 400 GB | N/A |
| Search Crawl Target Web server | Virtual | 1 | 4 cores | 8 | N/A | 400 GB | N/A |
| Search Query server | Standard | 2 | 2 quad-core 2.33 GHz | 32 | N/A | 400 GB | 500 GB |
| Search Crawler server | Standard | 2 | 2 quad-core 2.33 GHz | 16 | 400 | 400 GB | N/A |
| Search Crawl database server | Standard | 1 cluster | 4 quad-core 2.33 GHz | 48 | 4k (tuned for read) | 100 GB | 16 disks of 150 GB @ 15K RPM |
| Search Property Store database + Administration database server | Standard | 1 cluster | 4 quad-core 2.33 GHz | 48 | 2k (tuned for write) | 100 GB | 16 disks of 150 GB @ 15K RPM |
Step
3: Pilot, Test and Optimize
The testing and optimization stage is
an extremely important component of effective capacity management. You should
test new architectures before you deploy them to production and you should
conduct acceptance testing while following monitoring best practices, to ensure that the architectures you design achieve the performance and
capacity targets. This allows you to identify and optimize potential
bottlenecks before they affect users in a live deployment. If you are upgrading
from an Office SharePoint Server 2007 environment and plan to make
architectural changes, or are estimating user load of the new SharePoint Server
features, then testing is particularly important to make sure that your new SharePoint Server-based environment will meet performance and capacity targets.
Once you have tested your
environment, you can analyze the test results to determine what changes must be
made in order to achieve the performance and capacity targets you established
in Step 1: Model.
These are the recommended sub steps
that you should follow for pre-production:
·
Populate
the storage with the dataset or part of the dataset that you've identified in Step 1: Model.
·
Stress
the system with synthetic load that represents the workload you've identified
in Step 1: Model.
·
Run
tests, analyze results, and optimize your architecture.
·
Deploy
your optimized architecture in your data center, and roll out a pilot with a
smaller set of users.
·
Analyze
the pilot results, identify potential bottlenecks, and optimize the
architecture. Retest if it is required.
·
Deploy
to the production environment.
Test
Testing is a critical factor in
establishing the ability of your system design to support your workload and
usage characteristics.
·
Create
a test plan
·
Create
the test environment
·
Create
tests and tools
Deploy
the pilot environment
Before you deploy SharePoint Server 2013 to a production environment, it is important that you first deploy a pilot environment and thoroughly test the farm to make sure that it can meet capacity and performance targets for your expected peak load. We recommend that the pilot environment first be tested with synthetic load, especially for large deployments, and then stressed by a small set of live users and live content.
The benefit of analyzing a pilot environment by using a small set of live users
is the opportunity to validate some assumptions you made about the usage
characteristics and the content growth before you go fully into production.
Optimize
If you cannot meet your capacity and
performance targets by scaling your farm hardware or making changes to the
topology, you may have to consider revising your solution. For example, if your
initial requirements were for a single farm for collaboration, Search and
Social, you may have to federate some services such as search to a dedicated
services farm, or split the workload across more farms. One alternative is to
deploy a dedicated farm for social and another for team collaboration.
Step
4: Deploy
Once you have executed your final
round of tests and confirmed that the architecture you have selected can
achieve the performance and capacity targets you established in Step 1: Model, you can deploy your SharePoint
Server 2013-based environment to production.
The appropriate rollout strategy will
vary depending on the environment and situation. While SharePoint Server
deployment generally is outside the scope of this document, there are certain
suggested activities that may come out of the capacity planning exercise. Here are some examples:
·
Deploying a new SharePoint Server farm: The capacity planning exercise should
have guided and confirmed plans for a design and deployment of SharePoint
Server 2013. In this case, the rollout will be the first broad deployment of
SharePoint Server 2013. It will require moving or rebuilding the servers and
services that were used during the capacity planning exercises into production.
This is the most straightforward scenario because there are not any upgrades or modifications needed to an existing farm.
·
Upgrading an Office SharePoint Server 2007 farm to SharePoint
Server 2013: The capacity planning exercise should
have validated the design for a farm that can meet existing demands and scale
up to meet increased demand and usage of a SharePoint Server 2013 farm. Part of
the capacity planning exercise should have included test migrations to validate
how long the upgrade process will take, whether any custom code must be
modified or replaced, whether any third-party tools have to be updated, and so on. At the conclusion of capacity planning you should have a validated design, an understanding of how much time the upgrade will take, and a plan for how best to work through the upgrade process – for example, an in-place upgrade, or migrating content databases into a new farm. If you're doing an in-place upgrade, then during capacity planning you may have found that additional or upgraded hardware will be needed, along with considerations for downtime. Part of the output from the planning exercise should be a list of the hardware changes that are needed and a detailed plan to deploy the hardware
changes to the farm first. Once the hardware platform that was validated during
capacity planning is in place, you can move forward with the process of
upgrading to SharePoint Server 2013.
·
Improving the performance of an existing SharePoint Server 2013
farm: The capacity planning exercise should have helped you to identify
the bottlenecks in your current implementation, plan ways to reduce or
eliminate those bottlenecks, and validate an improved implementation that meets
your business requirements for SharePoint Server services. There are different
ways in which performance issues could have been resolved, from something as
easy as reallocating services across existing hardware, upgrading existing
hardware, or adding additional hardware and adding additional services to it.
The different approaches should be tested and validated during the capacity
planning exercise, and then a deployment plan designed depending on the results
of that testing.
Step
5: Monitor and Maintain
To maintain system performance, you
must monitor your server to identify potential bottlenecks. Before you can
monitor effectively, you must understand the key indicators that will tell you
if a specific part of your farm requires attention, and know how to interpret
these indicators. If you find that your farm is operating outside the targets
you have defined, you can adjust your farm by adding or removing hardware
resources, changing your topology, or changing how data is stored.
Storage
and SQL Server capacity planning and configuration
Storage
requirements and best practices:
Gather
storage and SQL Server space and I/O requirements
Several SharePoint Server 2013 architectural
factors influence storage design. The key factors are: the amount of content,
enabled features, deployed service applications, number of farms, and
availability requirements.
Before you start to plan storage, you should
understand the databases that SharePoint Server 2013 can use.
Databases used by SharePoint 2013
The databases that are installed with SharePoint
2013 depend on the features that are being used in the environment. All
SharePoint 2013 environments rely on the SQL Server system databases. This
section provides a summary of the databases installed with SharePoint 2013.
SharePoint Foundation 2013 databases:

· Configuration
· Central Administration content
· Content (one or more)
· App Management Service
· Business Data Connectivity
· Search service application: Search administration; Analytics Reporting (one or more); Crawl (one or more); Link (one or more)
· Secure Store Service
· Subscription Settings Service Application (if it is enabled through Windows PowerShell)
· Usage and Health Data Collection Service
· Word Conversion Service

SharePoint Server 2013 databases (in addition to the SharePoint Foundation 2013 databases):

· Machine Translation Services
· Managed Metadata Service
· PerformancePoint Services
· PowerPivot Service (Power Pivot for SharePoint 2013)
· Project Server Service
· State Service
· User Profile Service application: Profile; Synchronization; Social tagging
· Word Automation services
If you are integrating further with SQL Server,
your environment may also include additional databases, as in the following scenarios:
·
SQL Server 2012 Power Pivot for
SharePoint 2013 can be used in a SharePoint Server 2013 environment that
includes SQL Server 2008 R2 Enterprise Edition and SQL Server Analysis
Services. If in use, you must also plan to support the Power Pivot application
database, and the additional load on the system.
·
The SQL Server 2008 R2 Reporting
Services (SSRS) plug-in can be used with any SharePoint 2013 environment. If
you are using the plug-in, plan to support the two SQL Server 2008 R2 Reporting
Services databases and the additional load that is required for SQL Server 2008
R2 Reporting Services.
Understand SQL Server and IOPS
On any server that hosts a SQL Server instance, it
is very important that the server achieve the fastest response possible from
the I/O subsystem.
More and faster disks or arrays provide sufficient
I/O operations per second (IOPS) while maintaining low latency and queuing on
all disks.
You cannot add other types of resources, such as CPU or memory, to compensate for slow response from the I/O subsystem, and a slow I/O subsystem can influence and cause issues throughout the farm. Plan for minimal latency before deployment, and monitor your existing systems.
Before you deploy a new farm, we recommend that you
benchmark the I/O subsystem by using the SQLIO disk subsystem benchmark tool.
Note that this tool works on all Windows Server versions with all versions of
SQL Server. For more information, see SQLIO Disk Subsystem Benchmark Tool.
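The exact SQLIO parameters depend on what you want to measure. The following invocations are illustrative only, not a prescribed benchmark: 8 KB random I/O roughly approximates SQL Server OLTP access patterns, and the test file path is a placeholder.

```powershell
# Illustrative SQLIO runs against the volume that will host SQL Server files:
# 8 KB random reads, then writes; 4 threads; 8 outstanding I/Os; 120 seconds;
# with latency reporting (-LS). See the SQLIO documentation for all options.
.\sqlio.exe -kR -frandom -b8 -t4 -o8 -s120 -LS E:\sqliotest.dat
.\sqlio.exe -kW -frandom -b8 -t4 -o8 -s120 -LS E:\sqliotest.dat
```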
Stress testing also provides valuable information
for SQL Server. For information about how to use the SQLIOSim utility and SQLIO
for stress testing, see the TechNet video Stress testing using SQLIOSIM and
SQLIO and How to use the SQLIOSim utility to simulate SQL
Server activity on a disk subsystem.
For detailed information about how to analyze IOPS
requirements from a SQL Server perspective, see Analyzing I/O Characteristics and Sizing Storage
Systems for SQL Server Database Applications.
Estimate core storage and IOPS needs
Configuration and content storage and IOPS are the
base layer that you must plan for in every SharePoint Server 2013 deployment.
Configuration storage and IOPS
Storage requirements for the Configuration database
and the Central Administration content database are not large. We recommend
that you allocate 2 GB for the Configuration database and 1 GB for
the Central Administration content database. Over time, the Configuration
database may grow beyond 1 GB. It does not grow quickly — it grows by
approximately 40 MB for each 50,000 site collections.
Transaction logs for the Configuration database can
be large. Therefore, we recommend that you change the recovery model for the
database from full to simple.
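A one-line way to make that change, shown as a sketch: it assumes the default Configuration database name (SharePoint_Config), a placeholder instance name, and the SQLPS module that ships with SQL Server.

```powershell
# Switch the Configuration database from the full to the simple recovery model.
Import-Module SQLPS -DisableNameChecking
Invoke-Sqlcmd -ServerInstance 'SQL01' `
    -Query 'ALTER DATABASE [SharePoint_Config] SET RECOVERY SIMPLE;'
```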
IOPS requirements for the Configuration database
and Central Administration content database are minimal.
Content storage and IOPS
Estimating the storage and IOPS required for
content databases is not a precise activity. In testing and explaining the
following information, we intend to help you derive estimates to use to
determine the initial size of your deployment. However, when your environment
is running, we expect that you'll revisit your capacity needs based on the data
from your live environment.
Formula to estimate content database
storage
The following process describes how to
approximately estimate the storage required for content databases, without
considering log files:
1. Use the following formula to estimate the size of
your content databases:
Database size =
((D × V) × S) +
(10 KB × (L +
(V × D)))
The value, 10 KB, in the formula is a constant that
approximately estimates the amount of metadata required by SharePoint Server
2013. If your system requires significant use of metadata, you may want to
increase this constant.
2. Calculate the expected number of documents. This
value is known as D in the
formula.
How you calculate the number of documents will be
determined by the features that you are using. For example, for My Sites or
collaboration sites, we recommend that you calculate the expected number of
documents per user and multiply by the number of users. For records management
or content publishing sites, you may calculate the number of documents that are
managed and generated by a process.
If you are migrating from a current system, it may
be easier to extrapolate your current growth rate and usage. If you are
creating a new system, review your existing file shares or other repositories
and estimate based on that usage rate.
3. Estimate the average size of the documents that
you'll be storing. This value is known as S in the
formula. It may be worthwhile to estimate averages for different types or
groups of sites. The average file size for My Sites, media repositories, and
different department portals can vary significantly.
4. Estimate the number of list items in the
environment. This value is known as L in the
formula.
List items are more difficult to estimate than
documents. We generally use an estimate of three times the number of documents
(D), but this will vary based on how you expect
to use your sites.
5. Determine the approximate number of versions.
Estimate the average number of versions any document in a library will have.
This value will usually be much lower than the maximum allowed number of
versions. This value is known as V in the
formula.
The value of V must
be above zero.
As an example, use this formula and the characteristics
in the following table to estimate the required storage space for data files in
a content database for a collaboration environment. The result is that you need
approximately 105 GB.
| Input | Value |
| --- | --- |
| Number of documents (D) | 200,000 (calculated by assuming 10,000 users times 20 documents) |
| Average size of documents (S) | 250 KB |
| List items (L) | 600,000 |
| Number of non-current versions (V) | 2 (assuming that the maximum versions allowed is 10) |
Database size = ((200,000 × 2) × 250 KB) + (10 KB × (600,000 + (200,000 × 2))) = 110,000,000 KB, or approximately 105 GB
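The formula is easy to script for what-if comparisons. A minimal sketch (the function name is ours; the KB-to-GB conversion matches the worked example above):

```powershell
# Content database size estimate: ((D x V) x S) + (10 KB x (L + (V x D)))
function Get-ContentDbSizeEstimate {
    param(
        [long]$Documents,     # D: expected number of documents
        [int] $Versions,      # V: average number of versions (must be above zero)
        [long]$AvgDocSizeKB,  # S: average document size, in KB
        [long]$ListItems      # L: expected number of list items
    )
    $sizeKB = (($Documents * $Versions) * $AvgDocSizeKB) +
              (10 * ($ListItems + ($Versions * $Documents)))
    [math]::Round($sizeKB / 1MB, 1)   # KB -> GB (1MB = 1,048,576)
}

# The worked example from the table above: returns roughly 105 (GB).
Get-ContentDbSizeEstimate -Documents 200000 -Versions 2 -AvgDocSizeKB 250 -ListItems 600000
```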
Efficient File I/O in SharePoint Server 2013 is a
storage method in which a file is split into pieces that are stored and updated
separately. These pieces are streamed together when a user requests the file.
This increases the I/O performance but it normally does not increase the file
size. However, small files can see a small increase in the disk storage that is
required.
Features that influence the size of
content databases
The following SharePoint Server 2013 features can significantly affect the size of content databases:
·
Recycle bins – Until a document is fully deleted from both the first stage and second stage recycle bin, it occupies space in a content database. Calculate how many documents are deleted each month to determine the effect of recycle bins on the size of content databases.
·
Auditing – Audit data can quickly compound and use large amounts of space in a content database, especially if view auditing is turned on. Rather than letting audit data grow without constraint, we recommend that you enable auditing only on the events that are important to meet regulatory needs or internal controls. Use the following guidelines to estimate the space that you must reserve for auditing data:
o Estimate the number of new auditing entries for a
site, and multiply this number by 2 KB (entries generally are limited to
4 KB, with an average size of about 1 KB).
o Based on the space that you want to allocate,
determine the number of days of audit logs you want to keep.
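One reading of those guidelines as arithmetic; the example numbers are ours, not prescriptive values.

```powershell
# Space to reserve for audit data: entries per day x 2 KB x days retained.
$entriesPerDay = 20000   # estimated new audit entries per day (example value)
$daysRetained  = 90      # audit log retention period (example value)
$reserveGB = ($entriesPerDay * 2KB * $daysRetained) / 1GB
'Reserve about {0:N1} GB for auditing data' -f $reserveGB
```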
Estimate content database IOPS
requirements
IOPS requirements for content databases vary
significantly based on how your environment is being used, available disk
space, and the number of servers that you have. In general, we recommend that
you compare the predicted workload in your environment to one of the solutions
that we tested.
In tests, we found that the content databases tend
to range from 0.05 IOPS/GB to around 0.2 IOPS/GB. We also found that a best
practice is to increase the top-end to 0.5 IOPS/GB. This is more than necessary
and can be much more than you'll need in your environment. Note that if you use mirroring, the mirrored content databases generate much more I/O than the primary content databases alone; mirrored content databases are never lightweight.
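Applied to a database of known size, those ranges translate directly into an IOPS budget; the 400 GB figure below is an arbitrary example.

```powershell
# Turn the tested IOPS/GB ranges into a target for one content database.
$dbSizeGB = 400
'Low estimate:  {0} IOPS' -f ($dbSizeGB * 0.05)   # 0.05 IOPS/GB
'High estimate: {0} IOPS' -f ($dbSizeGB * 0.2)    # 0.2 IOPS/GB
'Safe top end:  {0} IOPS' -f ($dbSizeGB * 0.5)    # 0.5 IOPS/GB best practice
```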
Estimate service application storage
needs and IOPS
After you estimate content storage and IOPS needs,
you must determine the storage and IOPS required by the service applications
that are being used in your environment.
SharePoint Server 2013 service
application storage and IOPS requirements
To estimate the storage requirements for the
service applications in the system, you must first be aware of the service
applications and how you'll use them. Service applications that are available
in SharePoint Server 2013 and that have databases are listed in the following
tables. The storage and IOPS data for all of the service applications in SharePoint Server 2013 remains the same as in SharePoint Server 2010, except for the Search service application and User Profile service application.
Search service application storage
and IOPS requirements
| Database | Scaling | Disk IOPS (10M items) | Disk IOPS (100M items) | Disk size (10M items) | Disk size (100M items) |
| --- | --- | --- | --- | --- | --- |
| Crawl | One DB per 20M items. SQL IOPS: 10 per 1 DPS | Medium/High | Medium | 15 GB, 2 GB log | 110 GB, 50 GB log |
| Link | One DB per 60M items. SQL IOPS: 10 per 1M items | Medium | Medium | 10 GB, 0.1 GB log | 80 GB, 5 GB log |
| Analytics Reporting | Split when reaching 100-300 GB | Medium | Medium | Usage dependent | Usage dependent |
Service application storage
requirements and IOPS recommendations
| Service application | Size estimation recommendation |
| --- | --- |
| User Profile | The User Profile service application is associated with three databases: Profile, Synchronization, and Social Tagging. |
| Managed Metadata Service | The Managed Metadata service application has one database. The size of the database is affected by the number of content types and keywords used in the system. Many environments will include multiple instances of the Managed Metadata service application. |
| Secure Store Service | The size of the Secure Store service application database is determined by the number of credentials in the store and the number of entries in the audit table. We recommend that you allocate 5 MB for each 1,000 credentials. It has minimal IOPS. |
| State Service | The State service application has one database. We recommend that you allocate 1 GB for it. It has minimal IOPS. |
| Word Automation Services | The Word Automation Services service application has one database. We recommend that you allocate 1 GB for it. This database is small and significant growth is unlikely. It has minimal IOPS. |
| PerformancePoint Services | The PerformancePoint service application has one database. We recommend that you allocate 1 GB for it. This database is small and significant growth is unlikely. It has minimal IOPS. |
| Business Data Connectivity | The Business Data Connectivity service application has one database. This database is small and significant growth is unlikely. It has minimal IOPS. |
| App Management | The App Management service application has one database. This database is small and significant growth is unlikely. It has minimal IOPS. |
| Power Pivot | The Power Pivot service application has one database. This database is small and has no significant I/O impact. We recommend that you use the same IOPS as the SharePoint content database. Note that content databases have significantly higher I/O requirements than the Power Pivot service application database. |
Determine availability needs
Availability is the degree to which a SharePoint Server 2013 environment is perceived by users to be available. An available system is a
system that is resilient — that is, incidents that affect service occur
infrequently, and timely and effective action is taken when they do occur. Availability
requirements can significantly increase your storage needs.
Choose
SQL Server version and edition
We recommend that you consider running your
environment on the Enterprise Edition of SQL Server 2008 R2 with SP1 or SQL
Server 2012 to take advantage of the additional performance, availability,
security, and management capabilities that these versions provide.
In particular, you should consider your need for the following features:
·
Backup
compression Backup compression
can speed up any SharePoint backup, and is available in SQL Server 2008 R2 with
Service Pack 1 (SP1) or SQL Server 2012 Enterprise Edition or Standard edition.
By setting the compression option in your backup script, or by configuring the
server that is running SQL Server to compress by default, you can significantly
reduce the size of your database backups and shipped logs.
·
Transparent
data encryption If your security
requirements include the need for transparent data encryption, you must use SQL
Server Enterprise Edition.
·
Content
deployment If you plan to use
the content deployment feature, consider SQL Server Enterprise Edition so that
the system can take advantage of database snapshots.
·
Remote
BLOB storage If you want to take
advantage of remote BLOB storage to a database or location outside the files
associated with each content database, you must use SQL Server 2008 R2 with SP1
or SQL Server 2012 Enterprise Edition.
·
Resource
governor Resource Governor
is a technology introduced in SQL Server 2008 to enable you to manage SQL
Server workloads and resources by specifying limits on resource consumption by
incoming requests. Resource Governor enables you to differentiate workloads and
allocate CPU and memory as they are requested, based on the limits that you
specify. It is available only in SQL Server 2008 R2 with SP1 Enterprise
edition. For more information about how to use Resource Governor, see the SQL Server documentation.
·
We recommend that you use Resource
Governor with SharePoint Server 2013 to:
o Limit the amount of SQL Server resources that the
web servers targeted by the search crawl component consume. As a best practice,
we recommend limiting the crawl component to 10 percent CPU when the system is
under load.
o Monitor how many resources are consumed by each
database in the system — for example, you can use Resource Governor to
help you determine the best placement of databases among computers that are
running SQL Server.
·
Power
Pivot for SharePoint 2013 Enables
users to share and collaborate on user-generated data models and analysis in
Excel and in the browser while automatically refreshing those analyses. It is
part of SQL Server 2008 R2 Analysis Services (SSAS) and SQL Server 2012 SP1
Analysis Services (SSAS) Enterprise Edition.
Design
storage architecture based on capacity and I/O requirements
The storage architecture and disk types that you
select for your environment can affect system performance.
Choose
a storage architecture
SharePoint Server 2013 supports Direct Attached
Storage (DAS), Storage Area Network (SAN), and Network Attached Storage (NAS)
storage architectures, although NAS is only supported for use with content
databases that are configured to use remote BLOB storage. Your choice depends
on factors within your business solution and your existing infrastructure.
Any storage architecture must support your
availability needs and perform adequately in IOPS and latency. To be supported,
the system must consistently return the first byte of data within
20 milliseconds (ms).
Direct Attached Storage (DAS)
DAS is a digital storage system that is directly
attached to a server or workstation, without a storage network in between. DAS
physical disk types include Serial Attached SCSI (SAS) and Serial ATA (SATA).
In general, we recommend that you choose a DAS
architecture when a shared storage platform can't guarantee a response time of
20 ms and sufficient capacity for average and peak IOPS.
Storage Area Network (SAN)
SAN is an architecture to attach remote computer
storage devices (such as disk arrays and tape libraries) to servers in such a
way that the devices appear as locally attached to the operating system (for
example, block storage).
In general, we recommend that you choose a SAN when
the benefits of shared storage are important to your organization.
The benefits of shared storage include the
following:
·
Easier to reallocate disk storage
between servers.
·
Can serve multiple servers.
·
No limitations on the number of disks
that can be accessed.
Network Attached Storage (NAS)
A NAS unit is a self-contained computer that is
connected to a network. Its sole purpose is to supply file-based data storage
services to other devices on the network. The operating system and other
software on the NAS unit provide the functionality of data storage, file
systems, and access to files, and the management of these functionalities (for
example, file storage).
NAS is only supported for use with content
databases that are configured to use remote BLOB storage (RBS). Any network
storage architecture must respond to a ping within 1 ms and must return
the first byte of data within 20 ms. This restriction does not apply to
the local SQL Server FILESTREAM provider, because it only stores data locally
on the same server.
Some confusion exists about the Internet Small Computer System Interface (iSCSI) and whether it is a NAS protocol. If you access iSCSI storage through the Common Internet File System (CIFS), it behaves as NAS, which means that you can't use that storage with content databases unless they are configured to use RBS. If, however, you access the iSCSI storage as a locally attached hard disk (block storage), it is considered a SAN architecture, and you can use it with content databases whether or not they use RBS.
Choose
disk types
The disk types that you use in the system can
affect reliability and performance. All else being equal, larger drives
increase mean seek time. SharePoint Server 2013 supports the following types of
drives:
·
Small
Computer System Interface (SCSI)
·
Serial
Advanced Technology Attachment (SATA)
·
Serial-attached
SCSI (SAS)
·
Fibre
Channel (FC)
·
Integrated
Device Electronics (IDE)
·
Solid
State Drive (SSD) or Flash Disk
Choose
RAID types
RAID (Redundant Array of Independent Disks) is
often used to both improve the performance characteristics of individual disks
(by striping data across several disks) and to provide protection from
individual disk failures.
All RAID types are supported for SharePoint Server
2013. However, we recommend that you use RAID 10 or a vendor-specific RAID
solution that has equivalent performance.
When you configure a RAID array, make sure that you
align the file system to the offset that is supplied by the vendor.
Estimate
memory requirements
The memory that is required for SharePoint Server
2013 is directly related to the size of the content databases that you are
hosting on a server that is running SQL Server.
As you add service applications and features, your
requirements are likely to increase. The following table gives guidelines for
how much memory we recommend.
| Combined size of content databases | RAM recommended for computer running SQL Server |
| --- | --- |
| Minimum for small production deployments | 8 GB |
| Minimum for medium production deployments | 16 GB |
| Recommendation for up to 2 terabytes | 32 GB |
| Recommendation for the range of 2 terabytes to 5 terabytes | 64 GB |
| Recommendation for more than 5 terabytes | Additional RAM over 64 GB can improve SQL Server caching speed |
Note: These values are higher than those recommended as the minimum values for SQL Server because of the distribution of data required for a SharePoint Server 2013 environment.
Other factors that may influence the memory that is
required include the following:
·
The use of SQL Server mirroring.
·
The frequent use of files larger than
15 megabytes (MB).
Understand
network topology requirements
Plan the network connections within and between
farms. We recommend that you use a network that has low latency.
The following list provides some best practices and
recommendations:
·
All servers in the farm should have LAN bandwidth and latency to the server that is running SQL Server. Latency should be no greater than 1 millisecond (a quick latency check appears after this list).
·
We do not recommend a wide area network (WAN) topology in which a server that is running SQL Server is deployed remotely from other components of the farm over a network that has latency greater than 1 ms, because this topology has not been tested.
·
Plan for an adequate WAN network if you plan to use the SQL Server AlwaysOn implementation suite, mirroring, log shipping, or failover clustering to keep a remote site up to date.
·
We recommend that web servers and
application servers have two network adapters: one network adapter to handle
user traffic and the other to handle communication with the servers that are
running SQL Server.
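As referenced in the first bullet above, here is a quick way to sanity-check latency from a web or application server to the computer running SQL Server. Ping resolution is about a millisecond, so treat this as a coarse check; 'SQL01' is a placeholder name.

```powershell
# Average round-trip time to the SQL Server computer over 20 pings.
$avg = (Test-Connection -ComputerName 'SQL01' -Count 20 |
        Measure-Object -Property ResponseTime -Average).Average
"Average latency to SQL01: $avg ms (target: no greater than 1 ms)"
```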
Configure
SQL Server
The following sections describe how to plan to
configure SQL Server for SharePoint Server 2013.
Estimate
how many servers are required
In general, SharePoint Server 2013 is designed to
take advantage of SQL Server scale out. For example, SharePoint Server
2013 may perform better with many medium-size servers that are running SQL
Server than with only a few large servers.
Always put SQL Server on a dedicated server that is
not running any other farm roles or hosting databases for any other
application. The only exception to this recommendation is if you deploy the
system on a stand-alone server for a development or a non-performance oriented
test environment.
The following is general guidance for when to
deploy an additional server that will run a SQL Server instance:
·
Add an additional database server
when you have more than four web servers that are running at capacity.
·
Add an additional database server
when your current server has reached its effective resource limits of RAM, CPU,
disk IO throughput, disk capacity, or network throughput.
To promote secure credential storage when you are
running the Secure Store service application, we recommend that the Secure
Store database be hosted on a separate database instance where access is
limited to one administrator.
Configure
storage and memory
On the server that is running SQL Server 2008 R2 with
SP1 or SQL Server 2012, we recommend that the L2 cache per CPU have a minimum of 2 MB to improve memory performance.
Follow vendor storage configuration
recommendations
For optimal performance when you configure a
physical storage array, adhere to the hardware configuration recommendations
supplied by the storage vendor instead of relying on the default values of the
operating system.
If you do not have guidance from your vendor, we
recommend that you use the DiskPart.exe disk configuration utility to configure
storage for SQL Server 2008.
We recommend using the Windows PowerShell storage
cmdlets that are available for Windows Server 2012.
Provide as many resources as possible
Ensure that the SQL Server I/O channels to the
disks are not shared by other applications, such as the paging file and
Internet Information Services (IIS) logs.
Provide as much bus bandwidth as possible. Greater
bus bandwidth helps improve reliability and performance. Consider that the disk
is not the only user of bus bandwidth — for example, you must also account
for network access.
Set
SQL Server options
The following SQL Server settings and options
should be configured before you deploy SharePoint Server.
·
Do not enable auto-create statistics
on a server that hosts SQL Server and supports SharePoint Server. SharePoint
Server configures the required settings upon provisioning and upgrade.
Auto-create statistics can significantly change the execution plan of a query
from one instance of SQL Server to another instance of SQL Server. Therefore,
to provide consistent support for all customers, SharePoint Server provides
coded hints for queries as needed to provide the best performance across
all scenarios.
·
To ensure optimal performance, we strongly recommend that you set max degree of parallelism (MAXDOP) to 1 on SQL Server instances that host SharePoint Server 2013 databases.
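A sketch of that setting via sp_configure, assuming the SQLPS module and a placeholder instance name. MAXDOP is an advanced option, so 'show advanced options' is enabled first.

```powershell
# Set max degree of parallelism to 1 on the instance hosting SharePoint databases.
Import-Module SQLPS -DisableNameChecking
Invoke-Sqlcmd -ServerInstance 'SQL01' -Query @"
EXEC sp_configure 'show advanced options', 1; RECONFIGURE;
EXEC sp_configure 'max degree of parallelism', 1; RECONFIGURE;
"@
```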
Configure
databases
The following guidance describes best practices to
plan for as you configure each database in your environment.
Separate and
prioritize your data among disks
Ideally, you should place the tempdb database,
content databases, Usage database, search databases, and SQL Server 2008 R2
with SP1 and SQL Server 2012 transaction logs on separate physical hard disks.
The following list provides some best practices and
recommendations for prioritizing data:
·
When you prioritize data among faster
disks, use the following ranking:
1. Tempdb data files and transaction logs
2. Database transaction log files
3. Search databases, except for the Search
administration database
4. Database data files
In a heavily read-oriented portal site, prioritize
data over logs.
·
Testing and customer data show that
SharePoint Server 2013 farm performance can be significantly impeded by
insufficient disk I/O for tempdb. To avoid this issue, allocate dedicated disks
for tempdb. If a high workload is projected or monitored — that is, the
average read action or the average write action requires more than
20 ms — you might have to ease the bottleneck by either separating the
files across disks or by replacing the disks with faster disks.
·
For best performance, place the tempdb on a RAID 10 array. The number of tempdb data files should equal the number of core CPUs, and the tempdb data files should be set at an equal size. Count dual core processors as two CPUs for this purpose. Count each processor that supports hyper-threading as a single CPU. For more information, see Optimizing tempdb Performance. A sketch that generates these files appears after this list.
·
Separate database data and
transaction log files across different disks. If files must share disks because
the files are too small to warrant a whole disk or stripe, or you have a
shortage of disk space, put files that have different usage patterns on the same
disk to minimize concurrent access requests.
·
Consult your storage hardware vendor
for information about how to configure all logs and the search databases for
write optimization for your particular storage solution.
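As referenced above, here is a sketch that generates the recommended tempdb layout: one equally sized data file per core CPU. The file path, sizes, and instance name are placeholders; review the generated statements before running them.

```powershell
# Build ALTER DATABASE statements that add tempdb data files until there is
# one per core CPU. File 1 (tempdev) already exists; resize it to match.
$cores = (Get-WmiObject Win32_Processor |
          Measure-Object -Property NumberOfCores -Sum).Sum

$statements = @("ALTER DATABASE tempdb MODIFY FILE (NAME = tempdev, SIZE = 4GB);")
for ($i = 2; $i -le $cores; $i++) {
    $statements += "ALTER DATABASE tempdb ADD FILE (NAME = tempdev$i, " +
                   "FILENAME = 'T:\TempDb\tempdev$i.ndf', SIZE = 4GB);"
}
$statements   # review, then run via Invoke-Sqlcmd -ServerInstance 'SQL01'
```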
Use multiple data
files for content databases
Follow these recommendations for best performance:
·
Only create files in the primary
filegroup for the database.
·
Distribute the files across separate
disks.
·
The number of data files should be
less than or equal to the number of core CPUs. Count dual core processors as
two CPUs for this purpose. Count each processor that supports hyper-threading
as a single CPU.
·
Create data files of equal size.
Although you can use the backup and recovery tools that are built into SharePoint Server 2013 to back up and recover multiple data files when you overwrite in the same location, the tools can't restore multiple data files to a different location. For this reason, we strongly recommend that when you use multiple data files for a content database, you use the SQL Server backup and recovery tools.
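For example, a full SQL Server backup of a content database might look like this minimal sketch (the database name, backup path, and instance name are placeholders):
# Sketch: full backup of a content database with the SQL Server tools.
# Database name, backup path, and instance name are placeholders.
Invoke-Sqlcmd -ServerInstance "SQLSERVER01" -Query @"
BACKUP DATABASE [WSS_Content]
TO DISK = 'E:\Backups\WSS_Content.bak'
WITH INIT, CHECKSUM;
"@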
Limit content
database size to improve manageability
Plan for database sizing that will improve manageability, performance, and ease of upgrade for your environment.
To help ensure system performance, we recommend that you limit the size of content databases to 200 GB, except when specific usage scenarios and conditions support larger sizes.
We generally recommend that a site collection not exceed 100 GB unless it is the only site collection in the database. This ensures that you can use the SharePoint Server 2013 granular backup tools to move a site collection to another database if you need to.
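To watch databases against these guidelines, a minimal PowerShell sketch (the 160 GB alert threshold is an arbitrary 80 percent of the 200 GB guideline):
# Sketch: report content databases approaching the 200 GB guideline.
Get-SPContentDatabase | ForEach-Object {
    $gb = [math]::Round($_.DiskSizeRequired / 1GB, 1)
    if ($gb -gt 160) {
        "$($_.Name): $gb GB - consider splitting or archiving"
    }
}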
Proactively manage
the growth of data and log files
We recommend that you proactively manage the growth
of data and log files by considering the following recommendations:
·
As much as possible, pre-grow all
data and log files to their expected final size.
·
We recommend that you enable auto growth for safety reasons, but do not rely on the default auto growth settings. Consider the following guidelines when you configure auto growth:
o When you plan content databases that exceed the
recommended size (200 GB), set the database auto growth value to a fixed
number of megabytes instead of to a percentage. This will reduce the frequency
with which SQL Server increases the size of a file. Increasing file size is a
blocking action that involves filling the new space with empty pages.
o If the calculated size of the content database is not expected to reach the recommended maximum size of 200 GB within the next year, set the database to the maximum size it is predicted to reach in a year (plus a 20 percent margin for error) by using the ALTER DATABASE MAXSIZE property. Periodically review this setting to make sure that it is still an appropriate value, based on past growth rates. A sketch follows this list.
·
Maintain a level of at least
25 percent available space across disks to allow for growth and peak usage
patterns. If you are managing growth by adding disks to a RAID array or
allocating more storage, monitor disk size closely to avoid running out of
space.
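As the sketch referenced in the guidelines above: setting a fixed-megabyte growth increment and a one-year MAXSIZE cap in one statement. The database name, logical file name, and values are placeholders.
# Sketch: fixed-MB auto growth and a MAXSIZE cap for a content database.
# Database name, logical file name, and values are placeholders.
Invoke-Sqlcmd -ServerInstance "SQLSERVER01" -Query @"
ALTER DATABASE [WSS_Content]
MODIFY FILE (NAME = 'WSS_Content', FILEGROWTH = 1024MB, MAXSIZE = 240GB);
"@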
Validate
and monitor storage and SQL Server performance
Test that your performance and backup solution on
your hardware enables you to meet your service level agreements (SLAs). In
particular, test the I/O subsystem of the computer that is running SQL Server
to make sure that performance is satisfactory.
Test the backup solution that you are using to make
sure that it can back up the system within the available maintenance window. If
the backup solution can't meet the SLAs your business requires, consider using
an incremental backup solution such as System Center 2012 - Data Protection
Manager (DPM) with Service Pack 1 (SP1).
It is important to track the following resource
components of a server that is running SQL Server: CPU, memory, cache/hit
ratio, and I/O subsystem. When one or more of the components seems slow or
overburdened, analyze the appropriate strategy based on the current and
projected workload.
The following section lists the performance
counters that we recommend that you use to monitor the performance of the SQL
Server databases that are running in your SharePoint Server 2013 environment.
Also listed are approximate healthy values for each counter.
SQL
Server counters to monitor
Monitor the following SQL Server counters to ensure the health of your servers:
·
General Statistics This object provides counters to monitor general server-wide activity, such as the number of current connections and the number of users connecting and disconnecting per second from computers that are running an instance of SQL Server. Consider monitoring the following counter:
o User Connections This counter shows the number of user connections on your computer that is running SQL Server. If you see this number increase by 500 percent from your baseline, you may see a performance reduction.
·
Databases This object provides counters to
monitor bulk copy operations, backup and restore throughput, and transaction
log activities. Monitor transactions and the transaction log to determine how
much user activity is occurring in the database and how full the transaction
log is becoming. The amount of user activity can determine the performance of
the database and affect log size, locking, and replication. Monitoring
low-level log activity to gauge user activity and resource usage can help you
identify performance bottlenecks. Consider monitoring the following counter:
o Transactions/sec This counter shows the number of transactions per second on a given database or on the entire server. Use this number to establish a baseline and to help you troubleshoot issues.
·
Locks This object provides information
about SQL Server locks on individual resource types. Consider monitoring the
following counters:
o Average Wait Time (ms) This counter shows the average
amount of wait time for each lock request that resulted in a wait.
o Lock Wait Time (ms) This counter shows the wait time
for locks in the last second.
o Lock waits/sec This counter shows the number of
locks per second that couldn't be satisfied immediately and had to wait for
resources.
o Number of Deadlocks/sec This counter shows the number of deadlocks per second on the computer that is running SQL Server. This number should typically be 0.
·
Latches This object provides counters to
monitor internal SQL Server resource locks called latches. Monitoring the
latches to determine user activity and resource usage can help you identify
performance bottlenecks. Consider monitoring the following counters:
o Average Latch Wait Time (ms) This counter shows the average
latch wait time for latch requests that had to wait.
o Latch Waits/sec This counter shows the number of
latch requests that couldn't be granted immediately.
·
SQL
Statistics This object
provides counters to monitor compilation and the type of requests sent to an
instance of SQL Server. Monitoring the number of query compilations and
recompilations and the number of batches received by an instance of SQL Server
gives you an indication of how quickly SQL Server is processing user queries
and how effectively the query optimizer is processing the queries. Consider
monitoring the following counters:
o SQL Compilations/sec This counter indicates the number
of times the compile code path is entered per second.
o SQL Re-Compilations/sec This counter indicates the number of statement recompiles per second.
·
Buffer
Manager This object
provides counters to monitor how SQL Server uses memory to store data pages,
internal data structures, and the procedure cache, and also counters to monitor
the physical I/O as SQL Server reads and writes database pages. Consider
monitoring the following counter:
o Buffer Cache Hit Ratio This counter shows the percentage of pages that were found in the buffer cache without having to read from disk. The ratio is the total number of cache hits divided by the total number of cache lookups over the last few thousand page accesses. Because reading from the cache is much less expensive than reading from disk, you want this ratio to be high. Generally, you can increase the buffer cache hit ratio by increasing the memory available to SQL Server.
·
Plan
Cache This object
provides counters to monitor how SQL Server uses memory to store objects such
as stored procedures, unprepared and prepared Transact-SQL statements, and
triggers. Consider monitoring the following counter:
o Cache Hit Ratio This counter indicates the ratio between cache hits and lookups for plans.
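A quick way to spot-check several of these counters from PowerShell is Get-Counter. A minimal sketch, assuming a default SQL Server instance (named instances expose the counter sets as MSSQL$InstanceName: instead of SQLServer:):
# Sketch: sample key SQL Server counters every 15 seconds, four times.
Get-Counter -SampleInterval 15 -MaxSamples 4 -Counter @(
    '\SQLServer:General Statistics\User Connections',
    '\SQLServer:Locks(_Total)\Number of Deadlocks/sec',
    '\SQLServer:Buffer Manager\Buffer cache hit ratio'
)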
Physical
server counters to monitor
Monitor the following counters to ensure the health of your computers that are running SQL Server:
·
Processor:
% Processor Time: _Total This
counter shows the percentage of time that the processor is executing
application or operating system processes other than Idle. On the computer that
is running SQL Server, this counter should be kept between 50 percent and
75 percent. In case of constant overloading, investigate whether there is
abnormal process activity or if the server needs additional CPUs.
·
System:
Processor Queue Length This
counter shows the number of threads in the processor queue. Monitor this
counter to make sure that it remains less than two times the number of core
CPUs.
·
Memory:
Available Mbytes This counter shows
the physical memory, in megabytes, available to processes running on the
computer. Monitor this counter to make sure that you maintain a level of at
least 20 percent of the total available physical RAM.
·
Memory:
Pages/sec This counter shows
the rate at which pages are read from or written to disk to resolve hard page
faults. Monitor this counter to make sure that it remains under 100.
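These counters can be captured on a schedule for baselining. A minimal sketch that samples them once a minute for an hour and saves a binary log (the output path is a placeholder):
# Sketch: baseline the physical counters once a minute for an hour.
Get-Counter -SampleInterval 60 -MaxSamples 60 -Counter @(
    '\Processor(_Total)\% Processor Time',
    '\System\Processor Queue Length',
    '\Memory\Available MBytes',
    '\Memory\Pages/sec'
) | Export-Counter -Path 'C:\PerfLogs\sql-baseline.blg' -FileFormat blg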
Disk
counters to monitor
Monitor the following counters to ensure the health
of disks. Note that the following values represent values measured over
time — not values that occur during a sudden spike and not values that are
based on a single measurement.
·
Physical Disk: % Disk Time: DataDrive This counter shows the percentage of elapsed time that the selected disk drive is busy servicing read or write requests. It is a general indicator of how busy the disk is. If the PhysicalDisk: % Disk Time counter is high (more than 90 percent), check the PhysicalDisk: Current Disk Queue Length counter to see how many system requests are waiting for disk access. The number of waiting I/O requests should be sustained at no more than 1.5 to 2 times the number of spindles that make up the physical disk.
·
Logical
Disk: Disk Transfers/sec This
counter shows the rate at which read and write operations are performed on the
disk. Use this counter to monitor growth trends and forecast appropriately.
·
Logical
Disk: Disk Read Bytes/sec and Logical
Disk: Disk Write Bytes/sec These counters show
the rate at which bytes are transferred from the disk during read or write
operations.
·
Logical
Disk: Avg. Disk Bytes/Read This
counter shows the average number of bytes transferred from the disk during read
operations. This value can reflect disk latency — larger read operations
can result in slightly increased latency.
·
Logical
Disk: Avg. Disk Bytes/Write This
counter shows the average number of bytes transferred to the disk during write
operations. This value can reflect disk latency — larger write operations
can result in slightly increased latency.
·
Logical
Disk: Current Disk Queue Length This
counter shows the number of requests outstanding on the disk at the time that
the performance data is collected. For this counter, lower values are better.
Values greater than 2 per disk may indicate a bottleneck and should be
investigated. This means that a value of up to 8 may be acceptable for a
logical unit (LUN) made up of 4 disks. Bottlenecks can create a backlog that
can spread beyond the current server that is accessing the disk and result in
long wait times for users. Possible solutions to a bottleneck are to add more
disks to the RAID array, replace existing disks with faster disks, or move some
data to other disks.
·
Logical
Disk: Avg. Disk Queue Length This
counter shows the average number of both read and write requests that were
queued for the selected disk during the sample interval. The rule is that there
should be two or fewer outstanding read and write requests per spindle. But
this can be difficult to measure because of storage virtualization and
differences in RAID levels between configurations. Look for larger than average
disk queue lengths in combination with larger than average disk latencies. This
combination can indicate that the storage array cache is being overused or that
spindle sharing with other applications is affecting performance.
·
Logical Disk: Avg. Disk sec/Read and Logical Disk: Avg. Disk sec/Write These counters show the average time, in seconds, of a read or write operation to the disk. Monitor these counters to make sure that they remain below 85 percent of the disk capacity. Disk access time increases exponentially if read or write operations exceed 85 percent of disk capacity. To determine the specific capacity for your hardware, refer to the vendor documentation or use the SQLIO Disk Subsystem Benchmark Tool to calculate it (a sample invocation follows this list). For more information, see SQLIO Disk Subsystem Benchmark Tool.
o Logical Disk: Avg. Disk sec/Read This counter shows the average
time, in seconds, of a read operation from the disk. On a well-tuned system,
ideal values are from 1 through 5 ms for logs (ideally 1 ms on a
cached array), and from 4 through 20 ms for data (ideally less than
10 ms). Higher latencies can occur during peak times. However, if high
values occur regularly, you should investigate the cause.
o Logical Disk: Avg. Disk sec/Write This counter shows the average
time, in seconds, of a write operation to the disk. On a well-tuned system,
ideal values are from 1 through 5 ms for logs (ideally 1 ms on a
cached array), and from 4 through 20 ms for data (ideally less than
10 ms). Higher latencies can occur during peak times. However, if high
values occur regularly, you should investigate the cause.
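As the sample invocation referenced above, a hedged SQLIO run. The flags shown (-kR read test, -s duration in seconds, -frandom random access, -o outstanding I/Os, -b block size in KB, -LS system latency timing) and the test file path are assumptions to verify against the tool's documentation:
# Sketch: 120-second random-read latency test against a data volume.
# Verify flags and prepare the test file per the SQLIO documentation.
.\sqlio.exe -kR -s120 -frandom -o8 -b64 -LS F:\MSSQL\testfile.dat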
When you are using RAID configurations with the Logical
Disk: Avg. Disk Bytes/Read or Logical Disk: Avg. Disk Bytes/Write counters, use the formulas listed in
the following table to determine the rate of input and output on the disk.
RAID level | Formula
RAID 0 | I/Os per disk = (reads + writes) / number of disks
RAID 1 | I/Os per disk = [reads + (2 × writes)] / 2
RAID 5 | I/Os per disk = [reads + (4 × writes)] / number of disks
RAID 10 | I/Os per disk = [reads + (2 × writes)] / number of disks
For example, suppose that you have a RAID 1 system with two physical disks, and your counters are at the values shown in the following table.
Counter | Value
Avg. Disk sec/Read | 80
Avg. Disk sec/Write | 70
Avg. Disk Queue Length | 5
The I/O value per disk can be calculated as follows: (80 + (2 × 70)) / 2 = 110
The per-disk queue length can be calculated as follows: 5 / 2 = 2.5, which exceeds the guideline of two outstanding requests per spindle.
In this situation, you have a borderline I/O bottleneck.
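The RAID formulas above translate directly into a small helper; a minimal sketch:
# Sketch: per-disk I/O load from measured read/write rates, using the
# RAID formulas in the table above.
function Get-IOPerDisk {
    param([double]$Reads, [double]$Writes, [int]$Disks, [string]$RaidLevel)
    switch ($RaidLevel) {
        'RAID0'  { ($Reads + $Writes) / $Disks }
        'RAID1'  { ($Reads + (2 * $Writes)) / 2 }
        'RAID5'  { ($Reads + (4 * $Writes)) / $Disks }
        'RAID10' { ($Reads + (2 * $Writes)) / $Disks }
    }
}
# The worked example above: RAID 1, two disks, 80 reads/sec, 70 writes/sec.
Get-IOPerDisk -Reads 80 -Writes 70 -Disks 2 -RaidLevel 'RAID1'   # 110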
Other monitoring
tools
You can also monitor disk latency and analyze
trends by using the sys.dm_io_virtual_file_stats dynamic management view in SQL
Server 2008.
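For example, a minimal sketch that reports average read and write latency per database from that view (the instance name is a placeholder):
# Sketch: average I/O latency per database from the DMV.
Invoke-Sqlcmd -ServerInstance "SQLSERVER01" -Query @"
SELECT DB_NAME(vfs.database_id) AS database_name,
       vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)  AS avg_read_ms,
       vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
ORDER BY avg_write_ms DESC;
"@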
Monitoring
and maintaining SharePoint Server 2013
To maintain
SharePoint Server 2013 system performance, you must monitor your server to
identify potential bottlenecks.
Configuring monitoring
The following settings can be changed to collect more monitoring data while you troubleshoot. For normal operations, revert each setting as described in the Notes column.
Setting | Value | Notes
Event Log Flooding Protection | Disabled | The default value is Enabled. It can be disabled to collect as much monitoring data as possible. For normal operations, it should be enabled.
Timer job schedule: Microsoft SharePoint Foundation Usage Data Import | 5 minutes | The default value is 30 minutes. Lowering this setting imports the data into the usage database more frequently, and is especially useful when troubleshooting. For normal operations, it should be 30 minutes.
Diagnostic providers: Enable all diagnostic providers | Enabled | The default value is Disabled, except for the "Search Health Monitoring - Trace Events" provider. These providers collect health data for various features and components. For normal operations, you may want to revert to the default.
Diagnostic providers: "job-diagnostics-performance-counter-wfe-provider" and "job-diagnostics-performance-counter-sql-provider" schedule intervals | 1 minute | The default value is 5 minutes. Lowering this setting polls data more frequently, and is especially useful when troubleshooting. For normal operations, it should be 5 minutes.
Miscellaneous: Enable stack tracing for content requests | Enabled | The default value is Disabled. Enabling this setting allows diagnosis of content request failures by using the process stack trace. For normal operations, it should be disabled.
Miscellaneous: Enable the Developer Dashboard | Enabled | The default value is Disabled. Enabling this setting allows diagnosis of slow pages or other problems by using the Developer Dashboard. For normal operations, and as soon as troubleshooting is no longer necessary, it should be disabled.
Usage data collection: Content Import Usage, Content Export Usage, Page Requests, Feature Use, Search Query Use, Site Inventory Usage, Timer Jobs, Rating Usage | Enabled | Enabling the logging of this set of counters allows you to collect more usage data across the environment and to better understand its traffic patterns.
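Two of these settings can also be changed from PowerShell. A minimal sketch, assuming the usage data import job's internal name is job-usage-log-file-import and that the Developer Dashboard supports the OnDemand display level (verify both in your farm):
# Sketch: import usage data every 5 minutes while troubleshooting.
Get-SPTimerJob job-usage-log-file-import |
    Set-SPTimerJob -Schedule "every 5 minutes between 0 and 59"

# Sketch: turn on the Developer Dashboard on demand; set it back to
# "Off" when troubleshooting is finished.
$dds = [Microsoft.SharePoint.Administration.SPWebService]::ContentService.DeveloperDashboardSettings
$dds.DisplayLevel = "OnDemand"
$dds.Update()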
Performance counters
If you are using the usage database, you can add the performance counters that help you monitor and evaluate your farm's performance to the usage database, so that they are logged automatically at a specific interval (by default, every 30 minutes). You can then query the usage database to retrieve these counters and graph the results over time. Here's an example of how to use the Add-SPDiagnosticsPerformanceCounter PowerShell cmdlet to add the % Processor Time counter to the usage database. This only has to be run on one of the web servers:
Add-SPDiagnosticsPerformanceCounter -Category "Processor" -Counter "% Processor Time" -Instance "_Total" -WebFrontEnd
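To review what has been registered, the matching Get cmdlet can be used. A minimal sketch; the -WebFrontEnd switch mirrors the Add cmdlet above, but confirm the parameter set in your environment:
# Sketch: list the counters currently collected into the usage database
# from the web servers; use -DatabaseServer for SQL Server counters.
Get-SPDiagnosticsPerformanceCounter -WebFrontEnd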
There are several generic performance counters that
you should monitor for any server system. The following table outlines these
performance counters.
Performance Counter | Description
Processor | Monitor processor usage to ensure that it does not remain consistently high (over 80 percent); sustained high usage means the system cannot handle sudden surges of activity, and a single component failure can cascade and overload the remaining components. For example, if you have three web servers, keep the average CPU across all servers under 60 percent so that if one fails, the other two can absorb the additional load.
Network Interface | Monitor the rate at which data is sent and received via the network interface card. This should remain below 50 percent of network capacity.
Disks and Cache | There are several logical disk options that you should monitor regularly. Available disk space is important in any capacity study, but you should also review the time that the disk is idle. Depending on the types of applications or services running on your servers, you may also review disk read and write times. Extended queuing for write or read functions will affect performance. The cache has a major effect on read and write operations. Monitor for increased cache failures.
Memory and Paging File | Monitor how much physical memory is available for allocation. Insufficient memory leads to excessive use of the page file and an increase in the number of page faults per second.
System counters
The following table provides information about system objects and counters that you can add to the set of counters monitored in the usage database by using the Add-SPDiagnosticsPerformanceCounter cmdlet on a web server.
Objects and Counters | Description
Processor: % Processor Time | Shows processor usage over time. If this is consistently too high, you may find that performance is adversely affected. Remember to count "Total" in multiprocessor systems. You can also measure the utilization on each processor to ensure balanced performance between cores.
Disk: Avg. Disk Queue Length | Shows the average number of read and write requests that were queued for the selected disk during the sample interval. A bigger disk queue length may not be a problem as long as disk reads/writes are not suffering and the system is working in a steady state without expanding queuing.
Disk: Avg. Disk Read Queue Length | The average number of read requests that are queued.
Disk: Avg. Disk Write Queue Length | The average number of write requests that are queued.
Disk: Disk Reads/sec | The number of reads to disk per second.
Disk: Disk Writes/sec | The number of writes to disk per second.
Memory: Available Mbytes | Shows how much physical memory is available for allocation. Insufficient memory leads to excessive use of the page file and an increase in the number of page faults per second.
Memory: Cache Faults/sec | Shows the rate at which faults occur when a page is sought in the file system cache and is not found. This may be a soft fault, when the page is found in memory, or a hard fault, when the page is on disk. The effective use of the cache for read and write operations can have a significant effect on server performance. Monitor for increased cache failures, indicated by a reduction in Async Fast Reads/sec or Read Aheads/sec.
Memory: Pages/sec | Shows the rate at which pages are read from or written to disk to resolve hard page faults. If this increases, it indicates system-wide performance problems.
Paging File: % Used and % Used Peak | The server paging file, also known as the swap file, holds "virtual" memory addresses on disk. Page faults occur when a process has to stop and wait while required "virtual" resources are retrieved from disk into memory. These will be more frequent if the physical memory is insufficient.
NIC: Total Bytes/sec | The rate at which data is sent and received via the network interface card. You may have to investigate further if this rate is over 40-50 percent of network capacity. To fine-tune your investigation, monitor Bytes Received/sec and Bytes Sent/sec.
Process: Working Set | Indicates the current size (in bytes) of the working set for a given process. This memory is reserved for the process, even if it is not being used.
Process: % Processor Time | Indicates the percentage of processor time that is used by a given process.
Process: Thread Count (_Total) | The current number of threads.
ASP.NET: Requests Total | The total number of requests since the service was started.
ASP.NET: Requests Queued | SharePoint Foundation 2013 provides the building blocks for HTML pages that are rendered in the user browser over HTTP. This counter shows the number of requests waiting to be processed.
ASP.NET: Request Wait Time | The number of milliseconds that the most recent request waited in the queue for processing. As the number of wait events increases, users will experience decreased page-rendering performance.
ASP.NET: Requests Rejected | The total number of requests not executed because of insufficient server resources to process them. This counter represents the number of requests that return a 503 HTTP status code, which indicates that the server is too busy.
ASP.NET: Requests Executing (_Total) | The number of requests currently executing.
ASP.NET: Requests/Sec (_Total) | The number of requests executed per second. This represents the current throughput of the application. Under constant load, this number should remain within a certain range, barring other server work (such as garbage collection, cache cleanup threads, external server tools, and so on).
.NET CLR Memory: # Gen 0 Collections | Displays the number of times generation 0 objects (that is, the youngest, most recently allocated objects) have been reclaimed by garbage collection since the application started. This number is useful as a ratio of #Gen 0 : #Gen 1 : #Gen 2 to make sure that the number of Gen 2 collections does not greatly exceed Gen 0 collections, optimally by a factor of 2.
.NET CLR Memory: # Gen 1 Collections | Displays the number of times generation 1 objects have been reclaimed by garbage collection since the application started.
.NET CLR Memory: # Gen 2 Collections | Displays the number of times generation 2 objects have been reclaimed by garbage collection since the application started. The counter is incremented at the end of a generation 2 garbage collection (also known as a full garbage collection).
.NET CLR Memory: % Time in GC | Displays the percentage of elapsed time that was spent performing a garbage collection since the last garbage collection cycle. This counter usually indicates the work done by the garbage collector to collect and compact memory on behalf of the application. It is updated only at the end of every garbage collection, and its value reflects the last observed value rather than an average. It should be under 5 percent in normal operation.
SQL Server counters
The following table provides information on SQL
Server objects and counters.
Objects and Counters | Description
General Statistics | This object provides counters to monitor general server-wide activity, such as the number of current connections and the number of users connecting and disconnecting per second from computers that are running an instance of SQL Server.
User Connections | This counter shows the number of user connections on your instance of SQL Server. If you see this number increase by 500 percent from your baseline, you may see a performance reduction.
Databases | This object provides counters to monitor bulk copy operations, backup and restore throughput, and transaction log activities. Monitor transactions and the transaction log to determine how much user activity is occurring in the database and how full the transaction log is becoming. The amount of user activity can determine the performance of the database and affect log size, locking, and replication. Monitoring low-level log activity to gauge user activity and resource usage can help you identify performance bottlenecks.
Transactions/sec | This counter shows the number of transactions per second on a given database or on the whole SQL Server instance. Use this number to establish a baseline and to help you troubleshoot issues.
Locks | This object provides information about SQL Server locks on individual resource types.
Number of Deadlocks/sec | This counter shows the number of deadlocks per second on the computer that is running SQL Server. This should typically be 0.
Average Wait Time (ms) | This counter shows the average amount of wait time for each lock request that resulted in a wait.
Lock Wait Time (ms) | This counter shows the total wait time for locks in the last second.
Lock Waits/sec | This counter shows the number of locks per second that could not be satisfied immediately and had to wait for resources.
Latches | This object provides counters to monitor internal SQL Server resource locks called latches. Monitoring the latches to determine user activity and resource usage can help you identify performance bottlenecks.
Average Latch Wait Time (ms) | This counter shows the average latch wait time for latch requests that had to wait.
Latch Waits/sec | This counter shows the number of latch requests per second that could not be granted immediately.
SQL Statistics | This object provides counters to monitor compilation and the type of requests sent to an instance of SQL Server. Monitoring the number of query compilations and recompilations and the number of batches received gives you an indication of how quickly SQL Server is processing user queries and how effectively the query optimizer is processing them.
SQL Compilations/sec | This counter indicates the number of times the compile code path is entered per second.
SQL Re-Compilations/sec | This counter indicates the number of times statement recompiles are triggered per second.
Plan Cache | This object provides counters to monitor how SQL Server uses memory to store objects such as stored procedures, impromptu and prepared Transact-SQL statements, and triggers.
Cache Hit Ratio | This counter indicates the ratio between cache hits and lookups for plans.
Buffer Cache | This object provides counters to monitor how SQL Server uses memory to store data pages, internal data structures, and the procedure cache, as well as counters to monitor the physical I/O as SQL Server reads and writes database pages.
Buffer Cache Hit Ratio | This counter shows the percentage of pages found in the buffer cache without having to read from disk. The ratio is the total number of cache hits divided by the total number of cache lookups since an instance of SQL Server was started.
Removing
bottlenecks
Use the following table to identify common bottlenecks and the options for resolving them.
Objects and Counters | Problem | Resolution options
Processor: % Processor Time | Over 75-85% | Upgrade the processor; increase the number of processors; add additional servers
Disk: Avg. Disk Queue Length | Gradually increasing; the system is not in a steady state and the queue is backing up | Increase the number or speed of disks; change the array configuration to striped; move some data to an alternative server
Disk: % Idle Time | Less than 90% | Increase the number of disks; move data to an alternative disk or server
Disk: % Free Space | Less than 30% | Increase the number of disks; move data to an alternative disk or server
Memory: Available Mbytes | Less than 2 GB on a web server | Add memory
Memory: Cache Faults/sec | Greater than 1 | Add memory; increase cache speed or size if possible; move data to an alternative disk or server
Memory: Pages/sec | Greater than 10 | Add memory
Paging File: % Used and % Used Peak | The server paging file, sometimes called the swap file, holds "virtual" memory addresses on disk. Page faults occur when a process has to stop and wait while required "virtual" resources are retrieved from disk into memory. These will be more frequent if the physical memory is inadequate. | Add memory
NIC: Total Bytes/sec | Over 40-50% of network capacity. This is the rate at which data is sent and received via the network interface card. | Investigate further by monitoring Bytes Received/sec and Bytes Sent/sec; reassess the network interface card speed; check the number, size, and usage of memory buffers
Process: Working Set | Greater than 80% of total memory | Add memory
Process: % Processor Time | Over 75-85% | Increase the number of processors; redistribute the workload to additional servers
ASP.NET: Application Pool Recycles | Several per day, causing intermittent slowness | Make sure that you have not implemented settings that automatically recycle the application pool unnecessarily throughout the day
ASP.NET: Requests Queued | Hundreds or thousands of requests queued | Implement additional web servers. The default maximum for this counter is 5,000; you can change this setting in the Machine.config file.
ASP.NET: Request Wait Time | As the number of wait events increases, users will experience degraded page-rendering performance | Implement additional web servers
ASP.NET: Requests Rejected | Greater than 0 | Implement additional web servers
Capacity
Planning Worksheet (Medium / Large Farm)
Discuss the following topics with the customer before you prepare capacity planning worksheets.
Topic for discussion | Customer's reply
How many SharePoint users are in your environment? | 40,000
How many concurrent users are in your environment? | 5,000
User growth per year? | 10%
How many departments/divisions? | 20
Environment setup (intranet/internet/extranet) | Intranet
Single location or distributed servers? | Single location
Topology | WFE: 4, load balancer; app servers: 3; DB servers: 3, plus search DB: 1 (medium topology)
Requests per user | 40
Requests per second | 2,000
Peak hours | 9 AM - 5 PM
System availability | 99.9%
Total number of farms/servers | Farm - 1, Servers
Total number of web applications/site collections | 10 / 50
Total number of site collections/sites | 50 / 5,000
Do you already have a source of authentication for provisioning users? | AD, ADFS (multiple AD) / federation server
WFE configuration details (per server) | Users/throughput: 40,000 / 2,500
App server configuration details (per server) | Service architectures: BCS / Search / Excel / PPS
DB server configuration details (per server) | Content DB / Search / User Profile
Forefront server configuration details (per server) |
Federation server configuration details (per server) |
Average size of documents | Database size = ((D × V) × S) + (10 KB × (L + (V × D)))
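The last row is the standard content database sizing estimate from Microsoft's storage planning guidance, where D is the expected number of documents, V the average number of versions per document, S the average document size, L the number of list items, and 10 KB a rough per-item metadata allowance. A minimal sketch with hypothetical input values:
# Sketch: estimate content database size. All input values are hypothetical.
$D = 200000      # number of documents
$V = 2           # average number of versions per document
$S = 250KB       # average document size
$L = 600000      # number of list items
$estimatedBytes = (($D * $V) * $S) + (10KB * ($L + ($V * $D)))
"Estimated content database size: {0:N1} GB" -f ($estimatedBytes / 1GB)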