Field Issues to Avoid; Enhancements to Use
BroadSoft Technical Summit, June 2009

2009 BroadSoft®, Inc. Proprietary and confidential; do not copy, duplicate, or distribute

Agenda
• Issues to Avoid
  – DNS Delays
  – Call Looping & Fan-out
  – Overload
• Enhancements to Use
  – Platform Enhancements
  – BroadWorks Hardware Support Policy

Avoidable Field Issue – Analysis/Recommendations
• Top three trouble ticket root causes that can generally be avoided
  – DNS Delays
  – Call Looping & Fan-out
  – Overload

DNS Delays

DNS Delay Impacts
• When a route or contact is defined as a domain name, the server needs to resolve the name
  – Call in limbo until DNS resolution completed or timed out
  – Call Processing thread that the call is running in is blocked
    • Alarms = bwThreadDelayDetected
  – Severe thread delays can trigger overload condition
    • Alarms = bwOverloadZoneTransition
  – Severe delays across all Call Half Input Adapter threads can bring down the Execution Server process
    • Alarms = bwForcedExitDueToHungThread, bwCallPThreadAutoRestart, bwPMExecutionServerRestarted
• DNS Delays = Trouble for You and Your Customers

What Requires DNS Resolution?
• Application Server access device addresses that are FQDNs
• Network Server 302 response returned routes that are FQDNs
• SIP headers used in response/request routing that are FQDNs
  – Request-URI
  – Via header
  – Contact header

DNS Problem Types
• DNS delay issues can be caused by a number of factors
  – Connectivity Issues: DNS server completely unreachable for a period of time
  – Slow Response Time: DNS server is reachable, but the DNS application is introducing delays
    • BIND is not a carrier grade DNS
  – Non-Authoritative Lookups: Delays incurred when going out to other DNS servers to resolve FQDNs not owned by the provider
    • URL dialing: a user dialing a non-existent FQDN can take seconds to resolve

Is There Anything I Can Do?
• Release 14SP3 Application Server includes a number of DNS client enhancements that:
  – Apply lookup time limits to mitigate the impact of DNS delays on the system
  – Optionally disable DNS lookups in URL dialing
  – Better control DNS querying and response caching
  – Better monitor DNS performance
• Some of these require configuration to be activated

14SP3+ DNS Lookup Algorithm
• DNS information is pulled from /etc/resolv.conf on BroadWorks startup
  – nameserver: DNS server list. Lookup starts with the first server in the list (unless the rotate option is used) and route advances to the next in the list on no-response conditions
  – domain: Local domain name. Appended to the contact for additional lookups if a lookup returns a “No such name” response
  – options: Optional parameters
    • retrans: response wait time (Default = 1 sec)
    • retry: number of query attempts to a nameserver before advancing to the next server (Default = 2)
    • rotate: load balance across all listed nameservers

    bwadmin@IHApp$ more /etc/resolv.conf
    domain eng.broadsoft.com
    nameserver 192.168.2.40
    nameserver 10.2.1.1
    options retrans:1 retry:2 rotate
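
To make the retrans/retry/nameserver interaction concrete, here is a minimal Python sketch that parses a resolv.conf like the one on the slide and estimates how long one lookup could block if no server answers. The function names are hypothetical, and the worst-case formula (retrans × retry × number of servers) is an assumption about how the timers combine, not a statement of the BroadWorks DNS client's exact behavior.

```python
SAMPLE_RESOLV_CONF = """\
domain eng.broadsoft.com
nameserver 192.168.2.40
nameserver 10.2.1.1
options retrans:1 retry:2 rotate
"""


def parse_resolv_conf(text):
    """Collect nameserver, domain, and options entries from resolv.conf text."""
    # Defaults are the ones quoted on the slide (retrans = 1 sec, retry = 2).
    conf = {"nameservers": [], "domain": None,
            "retrans": 1, "retry": 2, "rotate": False}
    for raw in text.splitlines():
        fields = raw.split("#", 1)[0].split()
        if not fields:
            continue
        key, values = fields[0], fields[1:]
        if key == "nameserver" and values:
            conf["nameservers"].append(values[0])
        elif key == "domain" and values:
            conf["domain"] = values[0]
        elif key == "options":
            for opt in values:
                name, _, val = opt.partition(":")
                if name in ("retrans", "retry") and val.isdigit():
                    conf[name] = int(val)
                elif name == "rotate":
                    conf["rotate"] = True
    return conf


def worst_case_wait_secs(conf):
    """Assumed upper bound: every attempt to every listed server times out."""
    return conf["retrans"] * conf["retry"] * len(conf["nameservers"])


conf = parse_resolv_conf(SAMPLE_RESOLV_CONF)
print(conf["nameservers"], worst_case_wait_secs(conf))
```

Under these assumptions the sample file allows a single lookup to wait several seconds when both servers are unreachable, which is why an unbounded lookup inside a call processing thread is risky.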

14SP3+ DNS Lookup Algorithm
• Query type performed varies depending on configuration:
  1. AS_CLI/Interface/SIP> supportDnsSrv parameter
  2. Contact port provided or not, e.g., device was configured with port = 5060, or no port (null)
  3. Transport unspecified, UDP or TCP

  supportDnsSrv = False
    Port = ANY,  Transport = ANY:          A record
    Port = Null, Transport = TCP:          A record
    Port = Null, Transport = UDP:          A record
    Port = Null, Transport = Unspecified:  A record
  supportDnsSrv = True
    Port = ANY,  Transport = ANY:          A record
    Port = Null, Transport = TCP:          _sip._tcp SRV; if no match, A record
    Port = Null, Transport = UDP:          _sip._udp SRV; if no match, A record
    Port = Null, Transport = Unspecified:  _sip._tcp SRV; if no match, _sip._udp SRV; if no match, A record

14SP3+ DNS Query Properties
• Additional DNS configuration properties defined in /usr/local/broadworks/bw_base/conf/appserver.properties
  – Configurable via AS_CLI/System/StartupParam>

  Property: bw.nameservice.cachePolicy
    Description: Enumeration {NEVER, CONFIGURED, HONOR_DNS}. When set to HONOR_DNS, the DNS client uses the response’s ttl value. Default is “CONFIGURED”.
  Property: bw.nameservice.cacheTtlSecs
    Description: Amount of time (in seconds) a successfully looked up record is cached if cachePolicy = CONFIGURED. Default is “86400”.
  Property: bw.nameservice.nCachePolicy
    Description: Enumeration {NEVER, CONFIGURED, HONOR_DNS}. When set to HONOR_DNS, the DNS client uses the “minimum” value of the response’s SOA record. Default is “CONFIGURED”.
  Property: bw.nameservice.nCacheTtlSecs
    Description: Amount of time (in seconds) a looked up record with negative response is kept in negative cache if nCachePolicy = CONFIGURED. Default is “600”.
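
The query-type selection described in the lookup algorithm slide can be sketched as a small decision function. This is an illustration only: the function name and return shape are hypothetical, and it assumes that the "Port = ANY" column means "a port was provided", so that SRV queries are attempted only when supportDnsSrv is true and no port is configured.

```python
def dns_query_plan(support_dns_srv, port, transport):
    """Return the ordered list of (record_type, name_prefix) lookups to try.

    port is an int or None (no port configured); transport is "tcp", "udp",
    or None (unspecified). Each later entry is tried only if the earlier
    one finds no match.
    """
    a_only = [("A", "")]
    # Assumption: an explicit port, or supportDnsSrv = False, means A record only.
    if not support_dns_srv or port is not None:
        return a_only
    if transport == "tcp":
        return [("SRV", "_sip._tcp."), ("A", "")]
    if transport == "udp":
        return [("SRV", "_sip._udp."), ("A", "")]
    # Transport unspecified: TCP SRV first, then UDP SRV, then A.
    return [("SRV", "_sip._tcp."), ("SRV", "_sip._udp."), ("A", "")]


for step in dns_query_plan(True, None, None):
    print(step)
```

Note that the unspecified-transport case can cost up to three sequential queries, so configuring a port or a transport on the device shortens the lookup path.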

14SP3+ DNS Query Properties
• Additional DNS configuration properties defined in /usr/local/broadworks/bw_base/conf/appserver.properties
  – Configurable via AS_CLI/System/StartupParam>

  Property: bw.nameservice.unreachableServerLingerSecs
    Description: Minimum time interval (in seconds) for which no DNS request is sent to a server detected as unreachable. Default is “60”.
  Property: bw.nameservice.useAdditionalSrvRrs
    Description: Boolean indicating that A lookups resulting from SRV lookups should use, when populated, the pre-resolved A resource records from the additional RR section of the SRV lookup response. Local caching must be enabled for this to have effect. Default is “true”.
  Property: bw.nameservice.denyTimeBoundedDuplicateLookups
    Description: Boolean indicating if duplicate lookup with same name and type from two time-bounded threads are allowed or not. Default is “true”.

14SP3+ DNS Configuration Options
• Local Name File: Can populate local records directly on the server that will be loaded into cache on startup
  – /usr/local/broadworks/bw_base/conf/namedefs
  – DNS Client lookup order is local cache, /etc/hosts, DNS

    bwadmin@IHApp$ more /usr/local/broadworks/bw_base/conf/namedefs
    #
    _sip._udp.ns.lab.broadsoft.com SRV 1 99 5060 ns1.lab.broadsoft.com
    _sip._udp.ns.lab.broadsoft.com SRV 2 99 5060 ns2.lab.broadsoft.com
    ns1.lab.broadsoft.com IN A 192.168.1.91
    ns2.lab.broadsoft.com IN A 192.168.2.61
    vm1.lab.broadsoft.com IN A 192.168.1.107
    vm2.lab.broadsoft.com IN A 192.168.2.119
    _pop3._tcp.lab.broadsoft.com SRV 0 0 110 vm1.lab.broadsoft.com
    _pop3._tcp.lab.broadsoft.com SRV 1 0 110 vm2.lab.broadsoft.com
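
As a way to sanity-check a namedefs file before loading it, the following Python sketch parses entries of the two shapes shown in the example (SRV lines with priority, weight, port, and target, and "IN A" lines). The record grammar is inferred from the sample only; the parser name and the treatment of "#" as a comment marker are assumptions, and the real file format may accept more than this.

```python
SAMPLE_NAMEDEFS = """\
#
_sip._udp.ns.lab.broadsoft.com SRV 1 99 5060 ns1.lab.broadsoft.com
_sip._udp.ns.lab.broadsoft.com SRV 2 99 5060 ns2.lab.broadsoft.com
ns1.lab.broadsoft.com IN A 192.168.1.91
ns2.lab.broadsoft.com IN A 192.168.2.61
"""


def parse_namedefs(text):
    """Group namedefs-style lines into A and SRV records keyed by name."""
    records = {"A": {}, "SRV": {}}
    for raw in text.splitlines():
        fields = raw.split("#", 1)[0].split()  # assumption: '#' starts a comment
        if not fields:
            continue
        name = fields[0]
        if len(fields) == 4 and fields[1:3] == ["IN", "A"]:
            records["A"].setdefault(name, []).append(fields[3])
        elif len(fields) == 6 and fields[1] == "SRV":
            priority, weight, port = (int(v) for v in fields[2:5])
            records["SRV"].setdefault(name, []).append(
                (priority, weight, port, fields[5]))
        else:
            raise ValueError("unrecognized namedefs line: " + raw)
    for targets in records["SRV"].values():
        targets.sort()  # lowest priority value first
    return records


def unresolved_srv_targets(records):
    """List SRV targets that have no local A record in the same file."""
    return sorted({target
                   for entries in records["SRV"].values()
                   for (_, _, _, target) in entries
                   if target not in records["A"]})


recs = parse_namedefs(SAMPLE_NAMEDEFS)
print(unresolved_srv_targets(recs))
```

The second helper flags SRV targets without a matching local A record, since such a target would still need an external DNS query.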

14SP3+ DNS Configuration Options

    AS_CLI/System/CallP/DNS> get
      enableNameLookupForURLDialing = false
      enableNameLookupTimeout = true
      nameLookupTimeoutMilliseconds = 500

• URL Dialing Lookup: Ability to disable DNS lookups for user dialed URLs (e.g., [email protected])
  – Generally, a URL call goes to the NS; if the UrlDialing policy is hit, the NS returns MADDR=Domain and the AS looks up the domain
    • UrlDialing can be disabled on the NS
  – Bad domains can take seconds to resolve
  – Lookup controlled on the AS by the enableNameLookupForURLDialing parameter (Default = true)
  – If disabled, a URL call that requires a lookup will get treatment

14SP3+ DNS Configuration Options

    AS_CLI/System/CallP/DNS> get
      enableNameLookupForURLDialing = false
      enableNameLookupTimeout = true
      nameLookupTimeoutMilliseconds = 500

• Time Limit on CallP Lookup: Ability to configure a time bound on DNS lookups within the CallP thread
  – If enabled, any CallP thread DNS lookup that takes longer than nameLookupTimeoutMilliseconds will result in that call being sent to treatment
  – Lookup will still continue in the background and any result (positive or negative) will be cached for further use
  – Default setting is false for no time limit

Additional 14SP3+ DNS Tool

    AS_CLI/ASDiagnostic/DNS> ?
      0) clearAllCache : Clear DNS cache
      1) clearCache : Clear a single entry from BroadWorks DNS cache
      2) lookup : Lookup name using DNS client
      3) reload : Reload BroadWorks DNS client static entries from configuration files

• clearAllCache/clearCache: Flush all entries or a single entry from the AS DNS cache without requiring a server stop
• reload: Dynamically read the local namedefs file to update cache
• lookup: Perform the same lookup the application will and see where it is pulling the result from (local file, /etc/hosts, DNS)
  – Need to specify the query type and the proper prefix for SRV, e.g.,

    AS_CLI/ASDiagnostic/DNS> lookup _sip._udp.ns.eng.broadsoft.com SRV
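
The time-bounded lookup behavior described above (give up after the configured limit, let the lookup finish in the background, cache whatever comes back) can be sketched in Python as follows. This illustrates the pattern only and is not the Application Server's implementation; the class name, the injected `resolve` callable, and the thread-pool sizing are all assumptions.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as LookupTimeout


class TimeBoundedResolver:
    """Bound the caller's wait on a blocking lookup; cache late results."""

    def __init__(self, resolve, timeout_ms=500, workers=4):
        self._resolve = resolve              # blocking: name -> address or None
        self._timeout = timeout_ms / 1000.0
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._cache = {}
        self._lock = threading.Lock()

    def _lookup_and_cache(self, name):
        result = self._resolve(name)
        with self._lock:
            self._cache[name] = result       # positive or negative result
        return result

    def lookup(self, name):
        """Return (answered_in_time, address_or_None)."""
        with self._lock:
            if name in self._cache:
                return True, self._cache[name]
        future = self._pool.submit(self._lookup_and_cache, name)
        try:
            return True, future.result(timeout=self._timeout)
        except LookupTimeout:
            # The caller would send the call to treatment here; the worker
            # thread keeps running and fills the cache when it finishes.
            return False, None
```

A caller that gets `(False, None)` is released after the time limit, while a later call for the same name is served from the cache once the background lookup has completed.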

Identifying DNS Delays
• DNS Specific SNMP Traps
  – bwSipUnrecognisedDomainName: Severity = Medium
    • Generated when a lookup returns a “No such Name” response to a query
    • Trap will include the unresolved domain
    • Unresolved domain will be added to negative cache
  – bwDnsServerUnreachable: Severity = Low
    • Generated when a DNS server does not return a response within the /etc/resolv.conf retrans & retry parameters
    • DNS server is considered out-of-service for the bw.nameservice.unreachableServerLingerSecs time period
    • DNS client will route advance and use the next DNS server
  – bwDnsAllServersUnreachable: Severity = High
    • The last available DNS server is unreachable and all others are still in the “unreachable linger” state

Identifying DNS Delays
• DNS Specific SNMP PMs
  – broadworks/executionServer/dnsModule/dnsStats/

    AS_CLI/Monitoring/PM/ApplicationServer> get -r
    ------------------------------------------------
    broadworks/executionServer/dnsModule/dnsStats/
    ------------------------------------------------
    *bwDnsQueryTimeMax               501
    *bwDnsQueryTimeMaxTimestampMSB   289
    *bwDnsQueryTimeMaxTimestampLSB   310639013
    *bwDnsQueryTimeAvg               23
    bwDnsStatsQueriesTable:
      (1) bwDnsStatsQueryIndex
      (2) bwDnsStatsQueryType
      (3) bwDnsStatsQueries
      (4) bwDnsStatsQueryTimeouts

      (1)  (2)    (3)  (4)
      1    A      10   0
      2    PTR    0    0
      3    SRV    34   8
      4    NAPTR  0    0

Identifying DNS Delays
• DNS Specific SNMP PMs
  – bwDnsQueryTimeMaxTimestampMSB/LSB: Time Stamp of the longest query time (excluding timeouts).
    • Need to be decoded
      » [(MSB * 2^32) + LSB] = Unix time in milliseconds since 1970
      » To get local server time, drop the rightmost 3 digits and use the following
      » $ perl -e 'require "ctime.pl"; print &ctime(RESULT);'
    • Example:
      » *bwDnsQueryTimeMaxTimestampMSB   289
      » *bwDnsQueryTimeMaxTimestampLSB   310639013

      [(289 * 2^32) + 310639013] = 1241556187557

      IHApp$ perl -e 'require "ctime.pl"; print &ctime(1241556187);'
      Tue May 5 16:43:07 US/Eastern 2009

Identifying DNS Delays
• bwThreadDelayDetected SNMP trap
  – If a Call Half Input Adapter thread is delayed for more than 2.5 seconds, the XSOutputXX.log file in /var/broadworks/logs/appserver will capture a thread dump
    • You can quickly verify if this thread delay was related to DNS by searching for the presence of the following strings
      » $ grep "Inet4AddressImpl" XSOutput*.log
      » $ grep "DNS.ExtendedResolver" XSOutput*.log
    • You should not see this occur if the recommended DNS configuration is implemented

DNS Design Considerations
• Ideally DNS should not be used for call processing within the core
  – Network core elements tend to be static
    • They should be defined as IP addresses or, if an FQDN is absolutely required, defined locally on the AS in the namedefs file
  – DNS does have a place on the access side
    • There is a big difference between a phone that has a problem resolving an address and an AS running 60 calls/sec that can’t resolve an address
• If DNS is required within the core for Call Processing
  – Should use the local namedefs file if possible
  – If external DNS is required, it should be on dedicated DNS infrastructure, not overlaid on data DNS
  – If “wide-open” URL dialing is not required, the DNS server should not forward to the root servers for domains not “owned” by the server
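
The MSB/LSB decoding shown with the Perl one-liner can also be done in a few lines of Python. This is a minimal sketch; the function name is hypothetical, and it reports the time in UTC rather than the server's local time zone.

```python
from datetime import datetime, timezone


def decode_pm_timestamp(msb, lsb):
    """Combine the MSB/LSB counters into Unix milliseconds and a UTC datetime."""
    millis = (msb << 32) + lsb  # same as (MSB * 2^32) + LSB
    # Dropping the rightmost 3 digits converts milliseconds to seconds.
    return millis, datetime.fromtimestamp(millis // 1000, tz=timezone.utc)


millis, when_utc = decode_pm_timestamp(289, 310639013)
print(millis)                # 1241556187557
print(when_utc.isoformat())  # 2009-05-05T20:43:07+00:00
```

20:43:07 UTC corresponds to the 16:43:07 US/Eastern result printed by the Perl example on the slide.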

Identifying DNS Delays
• DNS Specific SNMP PMs
  – bwDnsQueryTimeAvg: Average response time from the DNS servers
    • Average latency added to each call requiring DNS resolution
  – bwDnsStatsQueryTimeouts: Per record-type DNS server timeout count
    • Means the server did not respond within the /etc/resolv.conf retry/retrans parameters
  – bwDnsQueryTimeMax: Longest query time since last reset

DNS Recommended Configuration
1. Use the local namedefs file whenever possible
   • If you just have a small number of non-IP contacts to resolve, define them locally to ensure instant response time
2. If URL dialing outside the group is not required, set enableNameLookupForURLDialing to false
   • Protects against users URL dialing to “bad” domains
   • Alternatively, do not assign the UrlDialing policy on the NS
3. enableNameLookupTimeout should be set to true and nameLookupTimeoutMilliseconds left at the default of 500 msec
   • Ensures that the call processing thread will not be delayed more than 500 msec waiting on DNS
4. The /etc/resolv.conf retrans and retry parameters should be increased to retrans:1 retry:5
   • Since call processing is protected via the DNS time limit, the DNS server timeout should be adjusted upwards to avoid flagging a DNS server as “unreachable” due to a long response

DNS on the Network Server
• The Network Server has default DNS functionality that most people neither realize exists nor use
  – For every INVITE received by the NS, the NS does a reverse (if IP) or forward (if FQDN) lookup on the @host portion of the originator’s URI (From:, RPID, PAI) to see if it matches a known network element
  – Most customers do not need that since they define NS routingNEs and hostingNEs for strict string match
• This unnecessary DNS lookup can be disabled via CLI

    NS_CLI/Interface/SIP> get
      useDNSLookup = false

Call Looping & Fan-out

Call Looping & Fan-out Types
• Excessive calling
  – Large number of originations or terminations associated with a given user
    • May be malicious or not (e.g., autodialer)
• Redirection Looping
  – A calls B, who is FWD to C, who is FWD back to A
• Redirection Chaining
  – A calls B, who is FWD to C, who is FWD to D, who is FWD to ….
• Excessive Call Fan-out
  – Incoming call hits a user with SIMRING to 10 numbers, and each of those numbers has SIMRING to 10 other numbers, and so on
    • Depending on the fan-out depth, this can result in 100s of new calls starting almost instantaneously, impacting server performance

Redirection Information
• In general, redirection information should be passed in SIP INVITEs using either the Diversion or History-Info header
  – Header will include a list of redirecting parties and the reason
• In practice, Diversion/History-Info header information can be lost, especially when traversing network boundaries (VoIP → TDM → Wireless → TDM → VoIP)
  – Can’t be relied on to make protection decisions
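
The fan-out arithmetic behind the "100s of new calls" remark in the Excessive Call Fan-out bullet is a geometric sum. The short Python sketch below works it through, assuming (as a simplification) that every destination at every level forks to the same number of further destinations; the function name is hypothetical.

```python
def fan_out_calls(width, depth):
    """Total new call legs when each leg forks to `width` destinations,
    repeated for `depth` levels (uniform fan-out is an assumption)."""
    return sum(width ** level for level in range(1, depth + 1))


for depth in (1, 2, 3):
    print(depth, fan_out_calls(10, depth))
```

With SIMRING to 10 numbers, one level creates 10 legs, two levels create 110, and three levels create 1110, which is why even a shallow depth limit matters.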

Is Looping/Fan-out Protection Available?
• BroadWorks provides a full range of Call Policy call limits protection that can be applied at the system level, and customized at any of the sub levels (Enterprise/Group/User)
  – Functionality introduced in 14SP1 under feature ID 33339 (with subset patched back in 13.0)
  – 33339 needs to be activated and configured even if protection was enabled in 13.0 patch back

Simultaneous Call Protection
• defaultMaxNumberSimultaneousCalls
  – Protects against excessive calling situations
    • Count triggered on INVITE
    • Does not require 33339 activation
    • Separate parameter for video call control
  – When maximum simultaneous calls hit:
    • Redirect to VM on termination
    • 403 Forbidden returned on originations
  – bwUserExceededMaxSimultaneousCalls informational severity trap generated

    AS_CLI/SubscriberMgmt/Policy/CallProcessing/CallLimits> get
      defaultMaxNumberSimultaneousCalls = 10
      defaultUseMaxNumberSimultaneousCalls = true
      defaultMaxNumberSimultaneousVideoCalls = 5
      defaultUseMaxNumberSimultaneousVideoCalls = true

SIP Redirection Header Based Protection
• Automatic Loop Detection
  – Automatically detect redirection loops using SIP Diversion/History-Info header information
    • On a terminating INVITE to a user, if the Diversion/History-Info header contains the user’s number, block any redirection and “short-circuit” terminating to the user
    • On redirection based on a user’s service, if the redirect-to number is present in the received Diversion/History-Info header, deny the redirection
  – Functionality enabled by default (cannot be disabled)
  – bwForwardDestinationLoop informational severity trap generated identifying calling and called party
  – One issue identified (EV87594): Loop Detection triggers on user configured CLID (recommend you apply the patch)

SIP Redirection Header Based Protection
• defaultMaxRedirectionDepth
  – Protects against redirection chaining by counting the number of redirections in the Diversion/History-Info header and “short-circuits” the call to the user if number > MaxRedirectionDepth
  – Functionality enabled by default (cannot be disabled); replaced the old MaxHops system parameter
  – bwUserExceededMaxRedirectionDepth informational severity trap generated identifying calling and called party
  – AS_CLI/Interface/SIP maxForwardingHops has nothing to do with forwarding loops; it controls max SIP message forwarding through proxies

    AS_CLI/SubscriberMgmt/Policy/CallProcessing/CallLimits> get
      defaultMaxRedirectionDepth = 10
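
The two header-based checks above (loop detection and maximum redirection depth) can be sketched as one decision function over the list of redirecting parties carried in the Diversion/History-Info header. This is a simplified model, not BroadWorks code: the function name, the plain list-of-numbers representation of the header, and the returned reason strings are all assumptions.

```python
def redirection_decision(redirecting_parties, user, redirect_to, max_depth=10):
    """Decide whether `user` may redirect the call to `redirect_to`.

    redirecting_parties: numbers already listed in the received
    Diversion/History-Info header, oldest first (simplified model).
    Returns (allowed, reason).
    """
    # Loop detection: the user already redirected this call once.
    if user in redirecting_parties:
        return False, "loop: user already in redirection history"
    # Loop detection: the target already redirected this call.
    if redirect_to in redirecting_parties:
        return False, "loop: redirect-to number already in redirection history"
    # Chaining protection: number of redirections > MaxRedirectionDepth.
    if len(redirecting_parties) > max_depth:
        return False, "depth: redirection count exceeds maximum"
    return True, "ok"


print(redirection_decision(["A", "B"], "C", "A"))
```

When the decision is negative, the behavior described on the slides is to stop redirecting and terminate the call to the user. As the Redirection Information slide notes, this check only works while the header survives every network hop.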

SIP Redirection Header Based Protection
• defaultMaxFindMeFollowMeDepth
  – Protects against redirection chaining of forking services like SIMRING by counting the number of reason=follow-me redirections in the Diversion/History-Info header and “short-circuits” the call to the user if number > MaxFindMeFollowMeDepth
  – Controlled by feature 33339, disabled by default
  – bwUserExceededMaxFindMeFollowMeDepth informational severity trap generated identifying calling and called party

    AS_CLI/SubscriberMgmt/Policy/CallProcessing/CallLimits> get
      defaultUseMaxFindMeFollowMeDepth = true
      defaultMaxFindMeFollowMeDepth = 3

Service Call Count Based Protection
• defaultMaxNumberConcurrentRedirectedCalls
  – Protects redirection services from forwarding loops when Redirection information is not being preserved
    • Diversion/History-Info header lost
  – Counts the number of concurrent redirections from all redirecting services and “short-circuits” the call to the user if number > MaxNumberConcurrentRedirectedCalls
  – Controlled by feature 33339, disabled by default
  – bwUserExceededMaxConcurrentRedirectedCalls informational severity trap generated identifying calling and called party

    AS_CLI/SubscriberMgmt/Policy/CallProcessing/CallLimits> get
      defaultUseMaxNumberConcurrentRedirectedCalls = true
      defaultMaxNumberConcurrentRedirectedCalls = 3

Service Call Count Based Protection
• defaultMaxNumberConcurrentFindMeFollowMeInvocations
  – Protects redirection services (e.g., SIMRING, Sequential Ringing, Remote Office) from forwarding loops when Redirection information is not being preserved
    • Diversion/History-Info header lost
  – Counts the number of concurrent redirections from all “follow-me” redirection services and will block the redirection if number > MaxNumberConcurrentFindMeFollowMeInvocations
  – Controlled by feature 33339, disabled by default
  – bwUserExceededMaxFindMeFollowMeInvocations informational severity trap generated identifying calling and called party

    AS_CLI/SubscriberMgmt/Policy/CallProcessing/CallLimits> get
      defaultUseMaxNumberConcurrentFindMeFollowMeInvocations = true
      defaultMaxNumberConcurrentFindMeFollowMeInvocations = 3

Virtual Subscribers
• Currently, the looping and excessive calling protections outlined do not apply to virtual subscribers (e.g., call centers, voice portal)
  – Exception: Auto Attendant nesting/looping protection added in 14SP9 (65983)
    • Can configure the maximum number of re-entries into Auto Attendants to cap the number of loops/nestings

    AS_CLI/Service/AutoAttendant> get
      maxReentryForSameCall = 50

Recommended System Settings

    AS_CLI/SubscriberMgmt/Policy/CallProcessing/CallLimits> get
      defaultMaxNumberSimultaneousCalls = 10
      defaultUseMaxNumberSimultaneousCalls = true
      defaultMaxNumberSimultaneousVideoCalls = 5
      defaultUseMaxNumberSimultaneousVideoCalls = true
      defaultMaxCallTimeForAnsweredCallsInMinutes = 600
      defaultUseMaxCallTimeForAnsweredCalls = false
      defaultMaxCallTimeForUnansweredCallsInMinutes = 2
      defaultUseMaxCallTimeForUnansweredCalls = false
      defaultUseMaxNumberConcurrentRedirectedCalls = true
      defaultMaxNumberConcurrentRedirectedCalls = 10
      defaultUseMaxFindMeFollowMeDepth = true
      defaultMaxFindMeFollowMeDepth = 3
      defaultMaxRedirectionDepth = 10
      defaultUseMaxNumberConcurrentFindMeFollowMeInvocations = true
      defaultMaxNumberConcurrentFindMeFollowMeInvocations = 3

Overload

Overload Controls Key Points
• BroadWorks provides extensive overload protection on the Application Server and Network Server
• Intelligent throttling for conventional overload
  – Based on processing delays for the main queues and memory consumption
  – Goal: maximize traffic while protecting the system
    • Priority given to existing calls over new calls
    • Emergency Calls have configurable priority
    • Call throttling more aggressive as overload increases
• Aggressive throttling in extreme overload
  – Based on maximum queue size and encoder/decoder queue delay

Conventional Overload Controls
• Overload Controls specific to traffic type
  – Call Processing related traffic
    • SIP: INVITE, SUBSCRIBE, NOTIFY, etc.
    • MGCP: All
  – Non-Call Processing related traffic
    • SIP: REGISTER, MESSAGE, OPTIONS
    • MGCP: None
• Triggers and actions based on traffic type
  – e.g., a REGISTER storm would throttle Non-Call Processing traffic without triggering a Call Processing overload condition
  – Configurable resulting actions
    • SIP: Ignore, 302 Moved Temporarily, or 503 Service Unavailable
    • MGCP: Ignore, 409 Processing Overload
• Normal (green) zone plus 2 overload zones (yellow and red)
  – Increasing level of traffic throttling if condition deteriorates

Extreme Overload Controls
• Protection invoked at the low-level queues
  – SIP and MGCP, Decoder and Encoder queues
  – Limit placed on the overall size of each queue
  – Discard based on configurable time in queue
• Age or queue size based message discard
  – Stale messages discarded from the queue
  – Newer messages added to the queue
  – System protection is key

BroadWorks Queue Architecture
[Diagram]
• Primary Call Processing Queues and Threads
  – MGCP Port 2427: MGCP DecodeQ, MGCP EncodeQ
  – SIP Port 5060: SIP – CallP DecodeQ, SIP – Non-CallP DecodeQ, SIP EncodeQ
  – Call Half Input Adapter, Non-CallP Input Adapter
• Background activity Queues and Threads
  – Voice Mail Input Adapter, CallP ThreadedDBAccesQueue, Accounting Output Adapter, Worker Thread Queue

Overload Controls Invocation Points
[Diagram: same queue architecture, annotated with the invocation points]
• Conventional Overload Controls invoked on Input Adapters
• Extreme Overload Controls invoked on Decoder and Encoder queues
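
The age-or-size discard described for the extreme overload controls can be illustrated with a small bounded queue. This Python sketch is a toy model under stated assumptions: the class name is hypothetical, time is passed in explicitly to keep the example deterministic, and evicting the oldest entry when the queue is full is one reading of "stale messages discarded, newer messages added".

```python
from collections import deque


class BoundedAgeQueue:
    """Queue with a size cap and a maximum message age (toy model)."""

    def __init__(self, max_size, max_age_ms):
        self.max_size = max_size
        self.max_age_ms = max_age_ms
        self._items = deque()   # entries are (enqueue_time_ms, message)
        self.discarded = 0

    def _drop_stale(self, now_ms):
        # Oldest entries sit at the left end of the deque.
        while self._items and now_ms - self._items[0][0] > self.max_age_ms:
            self._items.popleft()
            self.discarded += 1

    def put(self, message, now_ms):
        self._drop_stale(now_ms)
        if len(self._items) >= self.max_size:
            # Assumption: make room by discarding the oldest message.
            self._items.popleft()
            self.discarded += 1
        self._items.append((now_ms, message))

    def get(self, now_ms):
        self._drop_stale(now_ms)
        return self._items.popleft()[1] if self._items else None


q = BoundedAgeQueue(max_size=2, max_age_ms=3000)
q.put("a", 0)
q.put("b", 100)
q.put("c", 200)          # queue full, "a" is discarded
print(q.get(300), q.discarded)
```

The design favors fresh traffic: a message that has waited too long is likely to have been retransmitted by the sender already, so processing it would only add load.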

Conventional Overload State Transition
• Transition from Green to Yellow to Red based on configurable criteria
• Separate overload controls for Call Processing related traffic and Non-Call Processing related traffic
• Orderly, configurable back-off to eliminate ping-pong effect between zones
[Diagram: for both Engineered CallP Capacity and Engineered Non-CallP Capacity, thresholds marked Criteria to Enter Yellow, Criteria to Leave Yellow, Criteria to Enter Red, Criteria to Leave Red]

Conventional Overload Actions
• Call Processing overload
  – Call Related
    • Yellow: 50% new calls actioned, 0% existing calls actioned
    • Red: 100% new calls actioned, 0% existing calls actioned
  – Non-Call Related
    • Yellow: Forced to Yellow
    • Red: Forced to Red
  – Stale Messages
    • Yellow: All queue timers halved
    • Red: All queue timers halved
  – Subscriber Rollbacks
    • Yellow: Suspended
    • Red: Suspended
  – SIP Retries
    • Yellow: T1 is doubled, 5 second session completion
    • Red: T1 is quadrupled, 5 second session completion
  – Logs
    • Yellow: Protocol level debug only
    • Red: All debug logs suspended
  – Misc. maintenance actions
    • Yellow: PM Reporting suspended, Maintenance Scripts suspended, Access Device Monitoring suspended, IP device reset suspended
    • Red: PM Reporting suspended, Maintenance Scripts suspended, Access Device Monitoring suspended, IP device reset suspended
• Non-Call Processing overload
  – Call Related
    • Yellow: No Actions
    • Red: No Actions
  – Non-Call Related
    • Yellow: 100% OPTIONS discarded, 50% new REGISTER actioned
    • Red: 100% OPTIONS discarded, 100% REGISTER actioned
  – REGISTER expiration
    • Yellow: Expiration 2x the time since green zone; new sessions completed in 5 secs
    • Red: Expiration 2x the time since green zone; no new sessions will be created
  – Stale Messages
    • Yellow: Non-CallP queue timers halved
    • Red: Non-CallP queue timers halved

Overload Control Configuration
• Overload Controls require configuration
  – See BroadWorks System Configuration Document

    AS_CLI/System/OverloadControls> get
      enabled = false
      mgcpOverloadAction = error
      sipOverloadAction = error
      percentMemoryInUseToEnterYellow = 85
      percentMemoryInUseToEnterRed = 90
      percentMemoryInUseToLeaveYellow = 85
      percentMemoryInUseToLeaveRed = 90
      allowEmergencyCallsInOverload = true
      maxPacketAgeInMsecs = 3000
      maxPacketAgeDuringOverloadInMsecs = 1500

    AS_CLI/System/OverloadControls/CallP> get
      sampleSize = 100
      minTimeInZoneInMsecs = 30000
      delayInMsecsToEnterYellow = 1150
      delayInMsecsToEnterRed = 1350
      delayInMsecsToLeaveYellow = 1050
      delayInMsecsToLeaveRed = 1250

    AS_CLI/System/OverloadControls/NonCallP> get
      sampleSize = 100
      minTimeInZoneInMsecs = 120000
      delayInMsecsToEnterYellow = 1001
      delayInMsecsToEnterRed = 1201
      delayInMsecsToLeaveYellow = 1000
      delayInMsecsToLeaveRed = 1200
      callpQDelayInMsecsToEnterYellow = 700
      callpQDelayInMsecsToEnterRed = 750
      callpQDelayInMsecsToLeaveYellow = 600
      callpQDelayInMsecsToLeaveRed = 701

So What Is The Problem?
Sounds like turning on Overload Controls on the AS/NS is a good thing and I should do it ASAP
• On a well running system, Overload Controls will provide you protection against externally driven events
  – Traffic spikes, Registration storms
• On poorly performing systems, Overload Controls will potentially trigger as a result of the poor performance
  – Primary queue delays that are the result of things like DNS delays or lack of CPU resources are not distinguished from delays that are the result of excessive external traffic

Gauging Server Performance
How do I know if my server is performing poorly?
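
The separate enter/leave thresholds in the CallP configuration above form a hysteresis band, which is what prevents ping-ponging between zones. The Python sketch below models that with the CallP delay values from the CLI output as defaults. It is a simplified illustration: the class name is hypothetical, it ignores sampleSize, minTimeInZoneInMsecs, and the memory criteria, and the choices to allow a direct green-to-red jump and to back off one zone at a time are assumptions.

```python
class OverloadZoneTracker:
    """Track green/yellow/red using separate enter and leave thresholds."""

    def __init__(self, enter_yellow=1150, enter_red=1350,
                 leave_yellow=1050, leave_red=1250):
        self.enter_yellow = enter_yellow
        self.enter_red = enter_red
        self.leave_yellow = leave_yellow
        self.leave_red = leave_red
        self.zone = "green"

    def update(self, delay_ms):
        """Feed one measured queue delay (msec); return the resulting zone."""
        if self.zone == "green":
            if delay_ms >= self.enter_red:       # assumed: may skip yellow
                self.zone = "red"
            elif delay_ms >= self.enter_yellow:
                self.zone = "yellow"
        elif self.zone == "yellow":
            if delay_ms >= self.enter_red:
                self.zone = "red"
            elif delay_ms < self.leave_yellow:
                self.zone = "green"
        else:  # red: back off one zone at a time (assumed "orderly" back-off)
            if delay_ms < self.leave_red:
                self.zone = "yellow"
        return self.zone


tracker = OverloadZoneTracker()
print([tracker.update(d) for d in (1000, 1200, 1100, 1400, 1300, 1200, 1000)])
```

A delay of 1100 msec keeps the tracker in yellow even though it is below the enter threshold of 1150, because the zone is only left once the delay drops under 1050.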
• Performance problems are going to express themselves in delays, which can be monitored
  – bwThreadDelayDetected traps: this trap is generated whenever a Call Half Input Adapter thread is delayed by >= 2.5 sec
    • If you are getting these traps frequently, you have problems
  – bwSipStatsMaxSetupSignalDelay gauge: this gauge tracks the max call setup time and encompasses all delays
    • Good indicator of the general performance of the system
    • EMS tracks and resets every 15 min
    • If all readings 1 sec, < 2 sec, you could 2x all OC delay thresholds

Determining The Trigger
• Alarms will be generated whenever a server transitions from one zone to another
  – bwCallOverloadZoneTransition
  – bwNonCallOverloadZoneTransition
• Identifying the trigger that caused overload is more art than science
  – Internal Driver: DNS delay, lack of resource (CPU), bad DB query
    • Generally requires BroadSoft support in analyzing the XSOutputXX.log file (thread dump analysis)
  – External Driver: Call Looping scenario, traffic flood
    • Generally need to parse through the log files to see what occurred before/during/after the event
    • New bwTrafficParser script can help

bwTrafficParser Script
• The bwTrafficParser script extracts traffic information from the XSLog files, providing an analysis of traffic flow
  – Will identify traffic based on SIP method, source/destination IP/port, internal session ID, DN/userid
  – Script will be sourced in release 16, patched in 14SP9, and available for download on BroadSoft Xchange for earlier releases

    bwadmin@IHApp$ ./bwTrafficParser
    Must specify one or more log files.
    Usage: bwTrafficParser f -p -r -t
    List of possible option:
      -f : prints full output (instead of top 10)
      -m : prints out log file entries matching regular expression
      -p : narrows in address information by port
      -r : displays rate-based information
      -t : prints top items instead of top 10
    The script parses up to 5 log files. When using the -r option the files should be specified in time order.

Enhancements to Use
• Platform Enhancements
• BroadWorks Hardware Support Policy
Platform Related Enhancements
• New BroadWorks platform related enhancements added since Release 14SP6
  – Pre-Upgrade Validation Tool
  – Installation Patch Bundle (14SP7)
  – Configurable MS interface (14SP8)
  – Configurable TimesTen Replication Port (14SP9)
  – New OS Support
    • RHEL 5.x support (14SP9)
    • Solaris 10 x86_64 support (15.0)
  – EMS Threshold Rework (16.0)
  – Improved TimesTen DB Migration Time (16.0)

Pre-Upgrade Validation Tool
• New tool that should be run prior to any upgrade; it will
  – Validate supported upgrade paths
  – Check system configuration attributes
    • Disk space, system variables, ssh configuration, etc.
• Release/OS independent tool
  – Download the latest version for the target release from BroadSoft Xchange
    • e.g.: bw-preUpgradeCheck-Rel_15-95158.bin is for any source release looking to upgrade to any R15 release

Installation Patch Bundle (14SP7)
• Installing a new BroadWorks version now requires a valid Installation Patch (IP) bundle
  – IP bundle is updated regularly as new patch-worthy install/upgrade issues are identified
  – Customer should always get the latest IP
  – IP bundle is installed at the same time the release .bin file is run
    • # ./AS_Rel_15.0_1.285.Linux-x86_64.bin –patch /bw/install/IP.as.15.0.285.ip20080616.Linuxx86_64.tar.gz
• IP bundles can also include (and automatically install) critical Application Patches (AP)
  – Ensures that patches deemed critical are present at upgrade
Configurable MS Interface (14SP8)
• AS uses the publicIPAddress in the HTTP URL when signaling the MS to play media files, not the address defined for media via the config-network script
  – New appserver.properties parameter (bw.http.mediaif) allows the MS apache interface to be bound to any of the available AS interfaces
    • Defaults to the AS public address
    • Controlled via MS interface settings as part of the config-network script

Configurable TimesTen Replication Port (14SP9)
• By default, TimesTen selects a random TCP port (>32K) to use for replication
  – A number of customers have expressed that this is a security concern
• New functionality allows for static port allocation
  – New installs prompt the user for the replication port setting
  – Existing installs can move to a fixed port via CLI config

    AS_CLI/System/Peering> get
      portNumber = random

New OS Support
• RHEL 5.x Linux Support
  – Available release 14SP9+
• Solaris 10 x86_64 Support
  – Available release 15.0+
  – Means Solaris can be used on any Intel x64 based server (e.g., IBM)
EMS Threshold Rework (16.0)
• EMS incorporates monitored object thresholds that can produce alarms or health summary changes when surpassed
  – Based on field data, thresholds have been revisited
  – Thresholds automatically selected based on automatic server size discovery
    • Based on server resources (e.g., CPUs, Memory), correct thresholds selected
  – Manual threshold modification simplified
  – A number of monitored objects dropped from the health summary check
    • Focus on the core basics
• New monitored objects added to better match performance/growth monitoring recommendations
  – XS JAVA Heap
  – Database Size
  – True AS user count (not including virtual subscribers)

Improved TimesTen DB Migration Time (16.0)
• As part of a BroadWorks upgrade, a number of database backups/restores are performed; for a large database, these restores can take an extremely long time
  – e.g.: 60+ minutes for an 84K DB on a SUN T2000
• Upgrade enhanced to optimize the number of DB restores
  – Eliminates unnecessary restores when the TimesTen database version has not changed
  – Significant upgrade time savings (hours)

BroadWorks Hardware Support Policy
Hardware/OS Support Policy: History
• Pre-Release 13: SUN SPARC-based servers only
  – Required Solaris SPARC operating system
  – Limited subset of servers supported (e.g., v24x, V44x, T2000)
• Release 13: IBM Intel Xeon-based server support added
  – Required Linux operating system
  – Limited subset of servers supported (x336, x3550, HS20, HS21)
• Release 15: SUN Intel Xeon-based servers added
  – Solaris x64 operating system
  – Limited subset of servers supported (e.g., x4150, x4250, x6250)

Hardware/OS Support Policy: Evolution
• Restricted server/vendor support is a frustration to customers
  – Slow turnaround in getting new servers added to the “supported” list
  – Some customer-preferred vendors completely shut out (e.g., HP)
• Supported server list continuously growing over time
  – Since BroadSoft rarely deprecates servers, the number of servers requiring periodic performance testing has increased year by year
• Ubiquitous OS support increases deployment combinations
  – Intel-based x64 servers can run either Solaris or Linux
• Above issues have led BroadSoft to loosen its hardware and OS support policies

New Hardware Support Policy
• BroadSoft platform support focuses on 2 processor types
  – Intel-based Xeon family
  – UltraSPARC family
• Hardware still categorized by server size (e.g., Small, Medium, Large, Large – High Performance), which maps back to capacity numbers
  – Server sizes are CPU specific (Intel Small and UltraSPARC Small have different capacity numbers)
  – Server sizes map to minimum resource requirements (CPU, Memory, HDD)
• Introduction of new hardware classification with different support connotations
  – Preferred, Supported, Compatible, Legacy, Lab
Preferred Server Category
• List of Sun and IBM platforms using the preferred Intel Xeon-based CPU, which architecturally is best suited for BroadWorks Call Processing applications
  – Currently focused on the Xeon 5400 family of CPUs, but the list will continue to evolve
    • Xeon 5500 family coming soon
  – BroadSoft validates these platforms in the lab and continues to do so from release to release
  – BroadSoft performs performance benchmarking on these servers to ensure that they perform to the rated numbers provided in the BroadWorks System Capacity Planner
• These are the servers that BroadSoft recommends the customer use

Supported Server Category
• List of Sun platforms using the “supported” UltraSPARC-based CPU
  – Currently no plan to validate any next-gen UltraSPARCs
  – BroadSoft validates these platforms in the lab and continues to do so from release to release
  – BroadSoft performs performance benchmarking on these servers to ensure that they perform to the rated numbers provided in the BroadWorks System Capacity Planner
• The UltraSPARC processor family (i.e., T1 and T2) does not fit well with the BroadWorks application and, as such, is considered a supported, but not preferred, CPU by BroadSoft

Compatible Server Category
• The Compatible category applies to any Intel Xeon-based platform that is not on the Preferred platforms list
  – Compatible servers can be used in lieu of Preferred servers
    • Must meet or surpass the Preferred platform minimum CPU speed and the minimum hard disk drive (HDD) requirement, and be equipped with the required amount of memory for a given server size
  – BroadSoft does not provide any guarantee that there will be no platform interaction between our application and the Compatible category server
    • Although unlikely, there is the possibility that certain platform-level operations (e.g., installation, upgrades, patching, licensing) might experience issues
    • BroadSoft can provide a BroadWorks Platform Compatibility Test Plan that can be used to validate basic functionality on the Compatible server

Compatible Server Category
  – BroadSoft does not provide any guidance on capacity
    • The Compatible server is not part of the regular performance benchmarking/validation process
      » In general, since the Compatible server has the same hardware footprint as the Preferred server for a given server size, the Compatible server capacity should track to the Preferred server
    • BroadSoft’s performance validation scripts and databases are available to any customer wanting to perform performance benchmarking of their Compatible server
    • Customers can engage BroadSoft Professional Service to perform functional and performance validation of Compatible servers

Legacy Server Category
• Any server that was once considered Preferred or Supported but is no longer documented, and has not been officially deprecated
  – Examples of valid Legacy Servers: IBM x336, IBM HS20
• Legacy category servers can still be used with BroadWorks and are basically equivalent to a Compatible server type from a support perspective
• BroadSoft does not provide any guidance on capacity for Legacy category servers
  – BroadSoft no longer actively performs performance validation on Legacy servers
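The hardware rule for the Compatible category (Intel Xeon-based, and meeting or surpassing the Preferred platform minimums for CPU speed, HDD, and memory at a given server size) amounts to a field-by-field comparison. The Python sketch below illustrates that reading only. ServerSpec, meets_compatible_minimums, and every number in it are hypothetical; the real minimums come from the BroadWorks Recommended Hardware Guide, not from this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerSpec:
    """Hypothetical description of a server for one server size."""
    cpu_family: str   # e.g. "Intel Xeon" or "UltraSPARC"
    cpu_ghz: float    # CPU clock speed
    memory_gb: int    # installed memory
    hdd_gb: int       # hard disk capacity


def meets_compatible_minimums(candidate: ServerSpec, preferred_minimum: ServerSpec) -> bool:
    """True if an Intel Xeon candidate meets or surpasses every Preferred minimum."""
    return (
        candidate.cpu_family == "Intel Xeon"
        and candidate.cpu_ghz >= preferred_minimum.cpu_ghz
        and candidate.memory_gb >= preferred_minimum.memory_gb
        and candidate.hdd_gb >= preferred_minimum.hdd_gb
    )


if __name__ == "__main__":
    # Placeholder minimums for illustration; not BroadSoft's published values.
    minimum = ServerSpec("Intel Xeon", cpu_ghz=2.5, memory_gb=8, hdd_gb=146)
    candidate = ServerSpec("Intel Xeon", cpu_ghz=3.0, memory_gb=16, hdd_gb=300)
    print("candidate qualifies:", meets_compatible_minimums(candidate, minimum))
```

Passing this check says nothing about capacity or platform-level behavior; those are still validated with the compatibility test plan and benchmarking options described above.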
Lab Server Category
• Any UltraSPARC-based or Intel Xeon 5000 based platform can be considered a Lab category server as long as it meets the lab server minimum requirements
  – Ideally, lab servers should be of the same platform type used in production (that is, a server from the Preferred, Supported, or Compatible category)
• Lab category servers are supported with the following caveats:
  – BroadSoft does not provide any guarantee that there will be no platform interaction between BroadSoft’s application and the Lab category server
    • Although unlikely, there is the possibility that certain platform-level operations (for example, installation, upgrades, patching, licensing) might experience issues
  – BroadSoft does not provide any guidance on capacity for Lab category servers

Which OS can I use?
• We now support Solaris SPARC, Solaris x64 and Linux
  – Solaris SPARC is only applicable to SUN SPARC-based servers
  – Solaris x64 and Linux can run on ANY x64 server (SUN, IBM, or Compatible)
• From a BroadSoft perspective, our general preferred OS is Solaris, but it really is a customer choice as to which to use based on support
  – Linux RHEL 4 WS: Rel 13, 14, 15
  – Linux RHEL 5 Server: Rel 14sp9+
  – Solaris 10 x64: Rel 15.0+
  – Solaris 9 SPARC: Rel 13, 14, 15
  – Solaris 10 SPARC: Rel 14, 15
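The OS/release list above can be treated as a small lookup table. The Python sketch below encodes it that way so a planned release can be checked against each OS. The names SUPPORT and operating_systems_for are hypothetical, releases are modeled as (major, service pack) pairs, and reading "+" as open-ended is an assumption rather than a statement of BroadSoft policy.

```python
from typing import Callable, Dict, List, Tuple

# (major release, service pack), e.g. (14, 9) stands for 14sp9.
Release = Tuple[int, int]

# One rule per OS, transcribed from the list above.
SUPPORT: Dict[str, Callable[[Release], bool]] = {
    "Linux RHEL 4 WS": lambda r: r[0] in (13, 14, 15),
    "Linux RHEL 5 Server": lambda r: r >= (14, 9),   # "14sp9+", assumed open-ended
    "Solaris 10 x64": lambda r: r >= (15, 0),        # "15.0+", assumed open-ended
    "Solaris 9 SPARC": lambda r: r[0] in (13, 14, 15),
    "Solaris 10 SPARC": lambda r: r[0] in (14, 15),
}


def operating_systems_for(release: Release) -> List[str]:
    """Return the operating systems whose rule admits the given release."""
    return [name for name, admits in SUPPORT.items() if admits(release)]


if __name__ == "__main__":
    for release in [(13, 0), (14, 9), (15, 0)]:
        print(release, "->", ", ".join(operating_systems_for(release)))
```

The table covers only release support; the hardware side of the choice (SPARC versus x64) still follows the two sub-bullets at the top of the slide.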
Hardware Deprecation Policy
• In general, BroadSoft does not deprecate supported hardware unless there is a functional reason
  – Only the SunFire V12x and Netra 12x have been officially deprecated
  – Any hardware deprecations would be accompanied by a BroadSoft Support Alert
• BroadSoft’s approach is natural evolution
  – As hardware ages, customers will naturally replace it with newer hardware and port the BroadWorks instance
  – Hardware platforms evolve within BroadSoft by moving from the Preferred or Supported category to the Legacy category

Documentation Changes
• BroadWorks Recommended Hardware Guide has been updated to reflect the policy change
  – Document slimmed down to under 20 pages
  – All references to part numbers removed
    • Impossible to keep up with current and valid info
    • Vendor/Distributor is the proper place for BoM creation based on our minimum requirements

Corporate Headquarters
220 Perry Parkway
Gaithersburg, Maryland 20877
Tel. +1 301.944.9770
www.broadsoft.com