Monday, June 3, 2019
SSIS Is An In Memory Pipeline Computer Science Essay
Since SSIS is an in-memory pipeline, one has to ensure that its operations occur in memory for performance benefits. To check whether your package is staying within memory limits, review the SSIS performance counter Buffers spooled. It has an initial value of 0. Any value above 0 is an indication that the engine has started swapping buffers to disk.

Capacity planning to understand resource utilization

To understand resource utilization, it is important to monitor the CPU, memory, I/O and network utilization of the SSIS package.

CPU

It is important to understand how much CPU is being used by SSIS and how much is being used by SQL Server overall while Integration Services is running. The latter point is especially important if you have SSIS and SQL Server on the same box, because if there is resource contention, SQL Server will surely win, which results in disk spilling from Integration Services and a slower transformation speed.

The performance counter to monitor is Process / % Processor Time (Total). Measure this counter for both sqlservr.exe and dtexec.exe. If SSIS is not close to 100% CPU load, this indicates one of the following:

Application contention: for example, SQL Server takes more processor resources, making them unavailable to SSIS.
Hardware contention: probably suboptimal disk I/O or not enough memory to handle the amount of data to be processed.
Design limitation: the SSIS design does not make use of parallelism, and/or the package has too many single-threaded tasks.

Network

SSIS moves data as fast as your network is able to handle it.
Hence, it is important to understand your network topology and ensure that the path between the source and the destination has both low latency and high throughput. The following performance counters can help you tune the topology:

Network Interface / Current Bandwidth: provides an estimate of the current bandwidth.
Network Interface / Bytes Total/Sec: the rate at which bytes are sent and received on each network adapter.
Network Interface / Transfers/Sec: how many network transfers per second are occurring. If the number is close to 40,000 IOPs, get another NIC card and use teaming between the NIC cards.

Input / Output (I/O)

A good SSIS package should hit the disk only when it reads from the sources and writes back to the target. But if the I/O is slow, reading and especially writing can create a bottleneck. So it is important to understand that an I/O system is specified not only by its size (such as 1 TB or 2 TB) but also by its sustainable speed (such as 20,000 IOPs).

Memory

The key counters to monitor memory for SSIS and SQL Server are as follows:

Process / Private Bytes (DTEXEC.EXE): the amount of memory currently used by Integration Services that cannot be shared with other processes.
Process / Working Set (DTEXEC.EXE): the amount of memory allocated by Integration Services.
SQL Server: Memory Manager / Total Server Memory: the amount of memory allocated for SQL Server. This counter is the best indicator of the total memory used by SQL Server, because SQL Server has another way to allocate memory using the AWE API.
Memory / Page Reads/sec: the total memory pressure on the system. If this consistently goes above 500, it is an indication that the system is under memory pressure.

Baseline Source System Extract Speed

It is important to understand the source system and the speed at which data can be extracted from it.
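As a rough illustration, an extract-speed baseline boils down to rows read divided by elapsed seconds. The sketch below assumes you already have a row count and start/end timestamps from a test run; the function name and the numbers are made up for the example.

```python
from datetime import datetime


def rows_per_second(row_count: int, started: datetime, finished: datetime) -> float:
    """Return the extract rate as rows read per elapsed second."""
    elapsed = (finished - started).total_seconds()
    if elapsed <= 0:
        raise ValueError("finished must be later than started")
    return row_count / elapsed


# Hypothetical test run: 1,200,000 rows read in 40 seconds.
started = datetime(2019, 6, 3, 10, 0, 0)
finished = datetime(2019, 6, 3, 10, 0, 40)
baseline = rows_per_second(1_200_000, started, finished)
print(f"{baseline:,.0f} rows/sec")
```

Recording this figure before tuning gives you a number to compare against after each change to drivers, connections or network hardware.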
Measure the speed of the source system by creating a simple package that reads data from some source into a Row Count destination. Execute this package from the command line and measure the time it takes to complete the task. You can measure the time taken using the Integration Services log output. The formula to use is:

Rows/Sec = RowCount / Time

Based on this value, you can judge the maximum number of rows per second that can be read from the source. To increase the Rows/Sec figure, you can do one of the following:

Improve drivers and driver configurations: ensure you are using up-to-date driver configurations for the network, data source and disk I/O.
Start multiple connections: to overcome driver limitations, you can start multiple connections to your data source. If the source can handle many concurrent connections, throughput will increase if you start several extracts at once. If concurrency causes locking or blocking issues, consider partitioning the source and having your packages read from different partitions to spread the load more evenly.
Use multiple NIC cards: if the network is the bottleneck and you have ensured you are using gigabit network cards and routers, a potential solution is to use multiple NIC cards per server.

Optimize SQL data source, Lookup transformations and Destination

Here are some optimization tips that you can implement in your SSIS packages:

Use NOLOCK or TABLOCK hints to remove locking overhead.
Refrain from using SELECT * in SQL queries. Name each column in the SELECT clause for which data needs to be retrieved.
If possible, perform datetime conversions at the source or target databases.
In SQL Server 2008 Integration Services, there is a new feature: the shared lookup cache.
During the use of parallel pipelines, the shared lookup cache provides a high-speed, shared cache.
If Integration Services and SQL Server run on the same box, use the SQL Server destination instead of OLE DB.
A commit size of 0 is fastest on heap bulk targets. If you cannot use 0, use the highest possible commit size to reduce the overhead of multiple-batch writing. A commit size of 0 is bad when inserting into a B-tree, because all incoming rows must be sorted at once into the target B-tree, and if memory is limited, there is a likelihood of spilling. Batch size = 0 is ideal for inserting into a heap. Note that a commit size of 0 might cause the running package to stop responding if the OLE DB destination and another data flow component are updating the same source table. To ensure that the package does not stop, set the Maximum insert commit size option to 2147483647.
Use a commit size of
Heap inserts are typically faster than inserts into a clustered index. This means it is recommended to drop and rebuild all the indexes if a large part of the destination table is being changed.
Use partitions and the partition SWITCH command. In other words, load a work table that contains a single partition and SWITCH it into the main table after the indexes are built, and then put the constraints on.

Network tuning

Packet size is the main property of the network that needs to be looked at when making decisions about network tuning. By default this value is set to 4,096 bytes. As noted for the SqlConnection.PacketSize property in the .NET Framework Class Library, increasing the packet size improves performance because fewer network read and write operations are required to transfer a large data set.
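To make the packet-size argument concrete, here is a minimal arithmetic sketch. It assumes a hypothetical 1 GiB transfer and compares the 4,096-byte default with 32,767 bytes, which is taken here as the upper bound of SqlConnection.PacketSize; check your own driver's documentation before relying on that figure. Protocol overhead is ignored.

```python
def packets_needed(total_bytes: int, packet_size: int) -> int:
    """Number of packets needed to move total_bytes, ignoring protocol overhead."""
    if packet_size <= 0:
        raise ValueError("packet_size must be positive")
    # Integer ceiling division: a partially filled last packet still counts.
    return (total_bytes + packet_size - 1) // packet_size


DATA_SET_BYTES = 1024 ** 3   # hypothetical 1 GiB data set
DEFAULT_PACKET = 4096        # default packet size mentioned in the text
LARGE_PACKET = 32767         # assumed maximum; verify for your driver

default_count = packets_needed(DATA_SET_BYTES, DEFAULT_PACKET)
large_count = packets_needed(DATA_SET_BYTES, LARGE_PACKET)
print(default_count, large_count)
```

Under these assumptions the larger packet size needs roughly one eighth as many packets for the same data set, which is where the reduction in network read and write operations comes from.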
If your system is transactional in nature, lowering the value will improve performance.

Another network tuning technique is to use network affinity at the operating system level to increase performance at high throughputs.

Use data types wisely

The following are some best practices related to the use of data types:

Define data types as narrowly as possible.
Do not perform excessive casting of data types. Match your data types to the source or destination and explicitly specify data type casting.
Take care of precision when using the money, float and decimal data types. The money data type is always faster than decimal and has fewer precision considerations than float.

Change the design

The following are some best practices related to SSIS design:

Do not SORT within Integration Services unless absolutely necessary. To sort the data, Integration Services allocates memory for the entire data set that needs to be transformed. Preferably, presort the data beforehand. Another way is to use an ORDER BY clause to sort large data sets in the database.
There are times when using Transact-SQL will be faster than processing the data in SSIS. Generally, all set-based operations perform faster in Transact-SQL because the problem can be transformed into a relational algebra formulation that SQL Server is optimized to resolve.
Set-based UPDATE statements are more efficient than row-by-row OLE DB calls.
Aggregation statements such as GROUP BY and SUM are also calculated faster using T-SQL than with in-memory calculations by a pipeline.
Delta detection is a technique where you change existing rows in the target table instead of reloading the table. To perform delta detection, you can use a change detection mechanism such as the new SQL Server 2008 Change Data Capture (CDC) functionality.
As a rule of thumb, if the target table has changed by more than 10%, it is often faster to simply reload than to perform the delta detection.

Partition the problem

For ETL design, partition source data into smaller chunks of equal size. Here are some more partitioning tips:

Use partitioning on your target table. Multiple instances of the same package can then be executed in parallel to insert data into different partitions of the same table. Use the SWITCH statement during partitioning. It not only increases parallel load speed, but also allows efficient transfer of data.
As implied above, the package should have a parameter that specifies which partition it should work on.

Minimize logged operations

If possible, use minimally logged operations when inserting data into your target SQL Server database. When data is inserted into a database in fully logged mode, the log grows quickly, because each row that is written to the database is also written to the log. Therefore, consider the following when designing SSIS packages:

Try to perform data flows in bulk mode instead of row by row. This helps minimize the number of entries in the log file, which eventually results in less disk I/O and hence improves performance.
If you need to delete data, organize the data in such a way that you can use TRUNCATE instead of DELETE. The latter places an entry in the log file for each row that is deleted. The former deletes all the data and puts just one entry in the log file.
If partitions need to be moved around, use the SWITCH statement. This is a minimally logged operation.
If you use DML statements along with your INSERT statements, minimal logging is suppressed.

Schedule and distribute it correctly

A good way to handle execution is to create a priority queue for your package and then execute multiple instances of the same package (with different partition parameter values). This queue can be a simple SQL Server table.
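As a rough illustration of a table-backed priority queue, here is a minimal Python sketch that uses an in-memory SQLite table as a stand-in for the SQL Server table. The table name work_queue, its columns and all helper functions are hypothetical, and the sketch runs a single worker only.

```python
import sqlite3


def create_queue(conn, chunks):
    """Create the hypothetical queue table and load (chunk_id, priority) pairs."""
    conn.execute(
        "CREATE TABLE work_queue ("
        " chunk_id INTEGER PRIMARY KEY,"
        " priority INTEGER NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'pending')"
    )
    conn.executemany(
        "INSERT INTO work_queue (chunk_id, priority) VALUES (?, ?)", chunks
    )
    conn.commit()


def claim_next_chunk(conn):
    """Claim the highest-priority pending chunk, or return None if none is left."""
    row = conn.execute(
        "SELECT chunk_id FROM work_queue WHERE status = 'pending'"
        " ORDER BY priority DESC, chunk_id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    conn.execute(
        "UPDATE work_queue SET status = 'running' WHERE chunk_id = ?", (row[0],)
    )
    conn.commit()
    return row[0]


def mark_done(conn, chunk_id):
    """Record that the chunk has been fully processed."""
    conn.execute(
        "UPDATE work_queue SET status = 'done' WHERE chunk_id = ?", (chunk_id,)
    )
    conn.commit()


def run_worker(conn, process_chunk):
    """Keep claiming and processing chunks until the queue is empty."""
    processed = []
    while True:
        chunk_id = claim_next_chunk(conn)
        if chunk_id is None:
            break  # nothing left to do, so the worker exits
        process_chunk(chunk_id)
        mark_done(conn, chunk_id)
        processed.append(chunk_id)
    return processed


conn = sqlite3.connect(":memory:")
create_queue(conn, [(1, 10), (2, 30), (3, 20)])
order = run_worker(conn, lambda chunk_id: None)  # placeholder for the real work
print(order)
```

With several workers running at once, the claim step would have to be atomic so that two workers never pick the same chunk; in SQL Server that would typically live in a stored procedure with suitable locking, which this single-worker sketch does not attempt to show.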
A simple loop in the control flow should be part of each package to:

Pick a relevant chunk from the queue. Relevant means that the chunk has not already been processed and that all chunks it depends on have already been executed.
Exit the package if no item is returned from the queue.
Perform the work required on the chunk.
Mark the chunk as done in the queue.
Return to the start of the loop.

Picking an item from the queue and marking it as done can be implemented as a stored procedure. Once you have the queue in place, you can simply start multiple copies of DTEXEC to increase parallelism.

Keep it simple

Unnecessary use of components should be avoided. Here is one way to avoid it:

Step 1: Declare the variable varServerDate.
Step 2: Use an Execute SQL Task in the control flow to execute a SQL query that gets the server datetime and stores it in the variable.
Step 3: Use the data flow task and insert/update the database with the server datetime from the variable varServerDate.

This sequence is advisable only in cases where the time difference between Step 2 and Step 3 really matters. If it does not matter, just use the getdate() function at Step 3, as shown below:

create table Table1(t_ID int, t_date datetime)
Insert into Table1(t_ID, t_date) values(1, getdate())

Executing a child package multiple times from a parent with different parameter values

When executing a child package from a master package, the parameters that are passed from the master package should be configured in the child package. Use the Parent Package Configuration option in the child package to implement this feature. To use this option, you need to specify the name of the Parent Package Variable that is passed to the child package. If you need to call the same child package multiple times (each time with a different parameter value), declare the parent package variables (with the same name as given in the child package) with a scope limited to the Execute Package Tasks.
SSIS allows variables with the same name to be declared with scopes limited to different tasks, all inside the same package.

SQL Job with many atomic steps

For the SQL job that calls the SSIS packages, create multiple steps, each performing a small task, rather than one step that performs all the tasks. With one big step, the transaction log grows too big, and if a rollback takes place, it may take the full processing capacity of the server.

Avoid unnecessary typecasts

Avoid unnecessary typecasts. For example, the flat file connection manager uses the string (DT_STR) data type for all columns by default. You will have to change it manually if you need the actual data type. It is always a good option to change it at the source level itself to avoid unnecessary type casting.

Transactions

Usually, ETL processes handle large volumes of data. In such scenarios, do not attempt a transaction on the whole package logic. SSIS does support transactions, and using them is advisable, just not around the entire package logic.

Distributed transaction that spans multiple tasks

The control flow of an SSIS package threads together various control tasks. In SSIS it is possible to set a transaction that spans multiple tasks using the same connection. To enable this, set the RetainSameConnection property of the Connection Manager to True.

Limit the package name to a maximum of 100 characters

When an SSIS package with a package name exceeding 100 characters is deployed to SQL Server, the name is trimmed to 100 characters, which may cause an execution failure.

SELECT * FROM

Do not pass any unnecessary columns from the source to the destination. With the OLE DB connection manager source, using the Table or View data access mode is equivalent to SELECT * FROM tablename, which will fetch all the columns. Use SQL Command to fetch only the required columns and pass those to the destination.

Excel source and 64-bit runtime

The Excel Source or Excel Connection Manager works only with the 32-bit runtime.
Whenever a package that uses the Excel Source is enabled for the 64-bit runtime (by default, this is enabled), it will fail on a production server using the 64-bit runtime. Go to the solution property pages, open Debugging, and set Run64BitRuntime to FALSE.

On failure of a component, stop / continue the execution with the next component

When a component fails, the FailParentOnFailure property can be used either to stop the package execution or to continue with the next component in the sequence container. The constraint value connecting the components in the sequence should be set to Completion, and the FailParentOnFailure property should be set to FALSE.

Protection

To avoid most package deployment errors when moving from one system to another, set the package protection level to DontSaveSensitive.

Copy pasting script component

Once you copy-paste a script component and execute the package, it may fail. As a workaround, open the script editor of the pasted script component, save the script and then execute the package.

Configuration filter: use as a filter

As a best practice, use the package name as the configuration filter for all configuration items that are specific to a package. This is especially useful when there are many packages with package-specific configuration items. Use a generic name for configuration items that are general to many packages.

Optimal use of configuration records

Avoid recording the same configuration item under different filter / object names. For example, there should be only one configuration record if two packages use the same connection string. This can be achieved by using the same name for the connection manager in both packages.
This is quite useful when porting from one environment to another (such as UAT to Prod).

Pulling high volume data

The process of pulling high volume data is represented in the following flowchart.

The recommendation is to consider dropping all indexes from the target tables, if possible, before inserting data, especially when the insert volume is high.

Effect of OLE DB Destination settings

Certain OLE DB destination settings affect the performance of the data transfer. Let's look at some of them:

Data Access Mode: this setting provides the fast load option, which internally uses a BULK INSERT statement to upload data into the destination table.
Keep Identity: by default this setting is unchecked, which means the destination table (if it has an identity column) will generate identity values on its own. If you check this setting, the data flow engine ensures that the source identity values are preserved and the same values are inserted into the destination table.
Keep NULLs: by default this setting is unchecked, which means that if a NULL value comes from the source for a column, the default value is inserted instead (if a default constraint is defined on the target column). If you check this option, the default constraint on the destination column is ignored and the NULL from the source column is inserted into the destination column.
Table Lock: by default this setting is checked, and the recommendation is to leave it checked unless the same table is being used by some other process at the same time.
Check Constraints: by default this setting is checked, and the recommendation is to uncheck it if you are sure the incoming data will not violate the constraints of the destination table. This setting indicates that the data flow pipeline engine will validate the incoming data against the constraints of the target table.
The performance of the data load can be improved by unchecking the Check Constraints option.

Effects of Rows per Batch and Maximum Insert Commit Size settings

Rows per batch: the default value for this setting is -1, which means all incoming rows are treated as a single batch. If required, you can change this to a positive integer to break the incoming rows into multiple batches. The positive integer represents the total number of rows in a batch.
Maximum insert commit size: the default value for this setting is 2147483647, which means all incoming rows are committed once on successful completion. If required, you can change this to any other positive integer, in which case a commit is done after that number of records. This might add overhead for the data flow engine, which has to commit several times, but on the other hand it relieves the pressure on the transaction log and keeps tempdb from growing tremendously, especially during high volume data transfers.

The above two settings are mainly aimed at improving the performance of tempdb and the transaction log.

Avoid Synchronous/Asynchronous transformations

While executing a package, the SSIS runtime engine executes every task other than the data flow task in the defined sequence. On encountering a data flow task, its execution is taken over by the data flow pipeline engine. The data flow pipeline engine then breaks the execution of the data flow task into one or more execution trees. It may also execute these trees in parallel to achieve high performance.

To make things a bit clearer, here is what an execution tree means. An execution tree starts at a source or an asynchronous transformation and ends at a destination or at the first asynchronous transformation in the hierarchy. Each tree has a set of allocated buffers, and the scope of these buffers is tied to that tree.
In addition, every execution tree is allocated an OS thread (worker thread), and unlike buffers, this thread may be shared with other execution trees.

A synchronous transformation gets a record, processes it and passes it to the next transformation or destination in the sequence. The processing of a record does not depend on the other incoming rows. Since a synchronous transformation outputs the same number of rows as its input, it does not require new buffers to be created and hence is faster. For example, in the Derived Column transformation, a new column is added to each incoming row without adding any additional records to the output.

In the case of an asynchronous transformation, the number of output rows can differ from the input, which requires new buffers to be created. Since an output is dependent on one or more records, it is called a blocking transformation. It might be partially or fully blocking. For example, the Sort transformation is a fully blocking transformation, as it requires all the incoming rows to arrive before processing.

Since an asynchronous transformation requires additional buffers, it performs slower than a synchronous transformation. Hence asynchronous transformations should be avoided wherever possible. For example, instead of using the Sort transformation to get sorted results, use an ORDER BY clause in the source itself.

Implement Parallel Execution in SSIS

SQL Server Integration Services (SSIS) allows parallel execution in two different ways, controlled by the two properties described below.

MaxConcurrentExecutables: this property defines how many tasks (executables) can run simultaneously. It defaults to -1, which translates to the number of processors plus 2. If hyper-threading is turned on in your box, it is the logical processors rather than the physically present processors that are counted.

For example, suppose we have a package with 3 Data Flow Tasks where every task has 10 flows of the form OLE DB Source - SQL Server Destination.
To execute all 3 Data Flow Tasks simultaneously, set the value of MaxConcurrentExecutables to 3.

The second property, EngineThreads, controls whether all 10 flows in each individual Data Flow Task get started concurrently.

EngineThreads: this property defines how many work threads the scheduler will create and run in parallel. The default value for this property is 5.

In the above example, if we set EngineThreads to 10 on all 3 Data Flow Tasks, all 30 flows will start at the same time.

One thing to be clear about is that EngineThreads governs both source threads (for source components) and work threads (for transformation and destination components). Source and work threads are both engine threads created by the Data Flow's scheduler. Looking back at the above example, setting a value of 10 for EngineThreads means up to 10 source threads and 10 work threads.

In SSIS, threads are not affinitized to any of the processors. If the number of threads exceeds the number of available processors, throughput might suffer because of an excessive number of context switches.

Package restart without losing pipeline data

SSIS has a useful feature called Checkpoint. This feature allows your package to start from the last point of failure on the next execution. You can save a lot of time by enabling this feature to start the package execution from the task that failed in the last execution. To enable this feature for your package, set values for three properties: CheckpointFileName, CheckpointUsage and SaveCheckpoints.
Apart from this, you should also set the FailPackageOnFailure property to TRUE for all tasks that you want to be considered in restarting.

With this in place, when such a task fails, the package fails, the information is captured in the checkpoint file, and on the subsequent execution the package starts from that task.

It is very important to note that you can enable a task, including a data flow task, to participate in checkpoints, but checkpoints do not apply inside the data flow task. Consider a scenario where you have a data flow task for which you have set FailPackageOnFailure to TRUE so that it participates in checkpoints. Assume that inside the data flow task there are five transformations in sequence and the execution fails at the 5th transformation (assuming the earlier 4 transformations completed successfully). On the following execution, the execution will start from the data flow task, and the first 4 transformations will run again before reaching the 5th one.

The following points are worth noting:

For Loop and Foreach Loop containers do not honor Checkpoint.
Checkpoint is enabled only at the control flow level and not at the data level, so regardless of the checkpoint, the package will execute the control flow/data flow from the start in the case of a restart.
If the package fails, the checkpoint file stores all server configurations and variable values as well as the point of failure. So if the package is restarted, it takes all configuration values from the checkpoint file. After a failure, you cannot change the configuration values.

Best practices for logging

Integration Services includes logging features that write log entries when run-time events occur and can also write custom messages. Logging can capture run-time information about a package to help you audit and troubleshoot the package every time it is run.
For example, the name of the operator who ran the package and the times at which the package began and finished can be captured in the log.

Logging (or tracing the execution) is a great way of diagnosing problems that occur at run time. It is especially useful when your code does not work as expected. In addition, SSIS allows you to choose which events of a package and of its components to log, as well as the location where the log information is written (text files, SQL Server, SQL Server Profiler, Windows Event Log, or XML files).

Logging saves you from the hours of frustration you might otherwise spend finding the cause of a problem, but the story doesn't end here. It helps you identify the problem and its root cause, but it is also an overhead for SSIS that ultimately affects performance, especially if you log excessively. So the recommendation is to log in the case of errors (the OnError event of the package and containers). Enable logging on other containers only if required. You can dynamically set the value of the LoggingMode property (of a package and its executables) to enable or disable logging without modifying the package.

You can create your own custom logging, which can be used for troubleshooting, package monitoring, building an ETL operations performance dashboard, and so on.

However, the best approach is to use the built-in SSIS logging where appropriate and augment it with your own custom logging.
Custom logging can provide all the information you need as per your requirements.

Security audit and data audit are out of the scope of this document.

To help you understand which bulk load operations will be minimally logged and which will not, the following table lists the possible combinations.

Table Indexes | Rows in table | Hints | Without TF 610 | With TF 610 | Concurrent possible
Heap | Any | TABLOCK | Minimal | Minimal | Yes
Heap | Any | None | Full | Full | Yes
Heap + Index | Any | TABLOCK | Full | Depends (3) | No
Cluster | Empty | TABLOCK, ORDER (1) | Minimal | Minimal | No
Cluster | Empty | None | Full | Minimal | Yes (2)
Cluster | Any | None | Full | Minimal | Yes (2)
Cluster | Any | TABLOCK | Full | Minimal | No
Cluster + Index | Any | None | Full | Depends (3) | Yes (2)
Cluster + Index | Any | TABLOCK | Full | Depends (3) | No

(1) It is not necessary to specify the ORDER hint if you are using the INSERT ... SELECT method, but the rows need to be in the same order as the clustered index. When using BULK INSERT, it is necessary to use the ORDER hint.
(2) Concurrent loads are only possible under certain conditions. Only rows that are written to newly allocated pages are minimally logged.
(3) Depending on the plan chosen by the optimizer, the nonclustered index on the table may be either fully or minimally logged.

Best practices for error handling

There are two methods of extending the logging capability:

Build a custom log provider
Use event handlers

We can extend SSIS's event handlers for error logging. We can capture errors in the OnError event of the package and let the package handle them gracefully. We can capture the actual error using a Script Task and log it in a text file or in a SQL Server table.
You can capture error details using system variables such as System::ErrorCode, System::ErrorDescription and System::SourceDescription.

If you are using custom logging, log the error in the same table.

In some cases you may wish to ignore the error, or handle it at the container level or, in some cases, at the task level.

Event handlers can be attached to any container in the package, and an event handler will catch all events raised by that container and any of its child containers. Hence, by attaching an event handler to the package (which is the parent container), we can catch all events of that event type raised by every container in the package. This is powerful because it saves us from building event handlers for each task in the package.

A container can opt out of having its events captured by an event handler. Say you have a sequence container for which you don't find it important to capture events; you can simply switch them off using the sequence container's DisableEventHandlers property.

If you want only certain events of that sequence container to be captured by an event handler, you can control this using the System::Propagate variable.

We recommend you to use se