Stage Variables:
Too many stage variables in a Transformer increase memory consumption. Avoid looping in the Transformer unless it is necessary. Avoid the BASIC Transformer whenever another stage can do the same work.
Buffering:
Before changing the buffering policy of a job, try tuning the following first:
1) Job design
2) Configuration file
3) Disk
Performance of the job can be improved if:
1) Unnecessary columns are removed from the upstream and downstream links.
2) Removing these unnecessary columns helps reduce memory consumption.
3) Always specify the list of columns in the SELECT statement when reading from a database. This keeps unnecessary column data out of the job, which saves memory and network bandwidth.
4) Use RCP (runtime column propagation) very carefully.
5) Understand the data types before using them in the job. Profile the data before bringing it into the job.
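As a small illustration of point 3, the sketch below uses Python's built-in sqlite3 module and a hypothetical customers table (the table, its columns, and the helper name are assumptions made for this example, not part of any real job). It compares reading every column with reading only the columns the job needs.

```python
import sqlite3

def fetch_needed_columns(conn, columns, table):
    """Read only the named columns instead of using SELECT *."""
    # Identifiers are interpolated directly into the SQL text, so they
    # must come from trusted job metadata, never from user input.
    column_list = ", ".join(columns)
    cursor = conn.execute(f"SELECT {column_list} FROM {table}")
    return cursor.fetchall()

# Hypothetical in-memory table standing in for a real source database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER, name TEXT, notes TEXT, photo BLOB)"
)
conn.execute(
    "INSERT INTO customers VALUES (1, 'Ann', 'long free text', x'00ff')"
)

# SELECT * drags the wide notes and photo columns along.
all_columns = conn.execute("SELECT * FROM customers").fetchall()

# Naming the columns moves only what the job actually uses.
needed = fetch_needed_columns(conn, ["id", "name"], "customers")
```

With a wide table, the second query moves far less data over the network and through the links of the job.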
Sort operations:
Always run through the following checks before adding a sort to the design:
1) Is the sort really needed?
2) What volume of data is going to be sorted?
3) Is the data read from a database first and then sorted in the job? Can the database sort the data so that it arrives already sorted?
4) What values are set in the system for the Sort stage?
Answering these questions before applying a sort helps create a better-performing job.
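The third check can be pictured with a minimal sketch. It uses Python's built-in sqlite3 module and a hypothetical orders table (both are assumptions for illustration only) to contrast sorting inside the job with asking the database for sorted rows.

```python
import sqlite3

# Hypothetical in-memory table standing in for the source database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(3, 10.0), (1, 25.5), (2, 7.25)],
)

# Option A: read the rows unsorted, then sort them inside the job.
unsorted_rows = conn.execute("SELECT order_id, amount FROM orders").fetchall()
sorted_in_job = sorted(unsorted_rows, key=lambda row: row[0])

# Option B: let the database sort, so the job receives ordered rows.
sorted_in_db = conn.execute(
    "SELECT order_id, amount FROM orders ORDER BY order_id"
).fetchall()
```

Both options return the same rows in the same order. Which one is cheaper depends on the indexes and the load on the database, so it is worth measuring rather than assuming.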
Parallelism is not always good:
Parallelism is not always beneficial. Consider the design of the job and the configuration together. The degree of parallelism is determined by the configuration file, where you can check how many nodes are defined. Increased parallelism distributes the work across more nodes, but it also adds overhead.
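To check how many nodes a configuration file defines, a small script is enough. The sketch below is only an illustration: the sample configuration text is simplified, and the host name, node names, and paths in it are hypothetical. In a real environment the text would be read from the file that the APT_CONFIG_FILE environment variable points to.

```python
import re

# Simplified sample configuration; host name and paths are made up.
SAMPLE_CONFIG = '''
{
  node "node1"
  {
    fastname "etlhost"
    pools ""
    resource disk "/data/ds/node1" {pools ""}
    resource scratchdisk "/scratch/node1" {pools ""}
  }
  node "node2"
  {
    fastname "etlhost"
    pools ""
    resource disk "/data/ds/node2" {pools ""}
    resource scratchdisk "/scratch/node2" {pools ""}
  }
}
'''

def count_nodes(config_text):
    """Count the lines that open a node definition: node "name"."""
    pattern = r'^\s*node\s+"[^"]+"'
    return len(re.findall(pattern, config_text, flags=re.MULTILINE))

node_count = count_nodes(SAMPLE_CONFIG)
```

For the sample above the function counts two nodes, which would be the degree of parallelism of a job run with that file.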
To get the maximum performance from a job:
Start the job design with a small set of data and then increase the volume. The best-performing job comes only from experimenting with the design, for example by trying different partitioning methods.
Point to remember while partitioning the data:
Make sure the partitions hold roughly equal amounts of data. Uneven partitions make the job slower, because the job finishes only when the largest partition has been processed.
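One rough way to see whether a partitioning key spreads the data evenly is to count the rows per partition on a sample. The sketch below is an assumption-laden illustration: it uses a CRC32 hash modulo the partition count as a stand-in for a hash partitioner, which is not how the engine itself assigns rows, and the function names are hypothetical.

```python
import zlib

def partition_counts(keys, num_partitions):
    """Count how many keys land in each partition under a simple hash."""
    counts = [0] * num_partitions
    for key in keys:
        # CRC32 is deterministic across runs, unlike Python's hash().
        partition = zlib.crc32(str(key).encode("utf-8")) % num_partitions
        counts[partition] += 1
    return counts

def skew_ratio(counts):
    """Largest partition divided by the average size (1.0 means even)."""
    average = sum(counts) / len(counts)
    return max(counts) / average if average else 0.0

# A key with a single distinct value puts every row in one partition.
single_value = partition_counts(["US"] * 100, 4)
worst_case = skew_ratio(single_value)
```

A ratio close to 1.0 means the partitions are balanced. A ratio near the number of partitions, as in the single-value case here, means one partition is doing almost all of the work.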
While configuring the SMP environment:
Designers should leave some processors free for operating system processes.
Following factors affect the job design:
1. Communication between the operators should be optimized.
2. The nodes should handle the amount of data given to the operators efficiently. This means the data must be partitioned correctly.
Datasets
Datasets are the best choice for storing intermediate results. A dataset keeps the partitioning and the sort order, if one was set. This avoids re-partitioning and makes the job more robust.
Degree of Parallelism
The degree of parallelism is determined by the configuration file, specifically by the number of nodes. The more nodes there are, the more resources are used. More nodes can therefore improve the performance of the job, but consider the resource consumption on the server as well. A larger number of nodes brings a larger number of processes, which can exhaust the server. ulimit settings can prevent parallel jobs from running and processes from executing. Setting these limits appropriately helps the server and the jobs run better.
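The limits in effect for the current user can be inspected before a job is blamed for a failure. The sketch below uses Python's standard resource module, which is available on Unix-like systems only. The choice of which limits to report is an assumption made for this example, not a list taken from the product documentation.

```python
import resource

# Limits that commonly matter when a job starts many processes
# (this selection is illustrative).
LIMITS_TO_CHECK = {
    "max user processes": resource.RLIMIT_NPROC,
    "open files": resource.RLIMIT_NOFILE,
    "data segment size": resource.RLIMIT_DATA,
    "stack size": resource.RLIMIT_STACK,
}

def format_limit(value):
    """Render a limit value, showing 'unlimited' for RLIM_INFINITY."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)

def current_limits():
    """Return {label: (soft, hard)} for the limits listed above."""
    report = {}
    for label, limit_id in LIMITS_TO_CHECK.items():
        soft, hard = resource.getrlimit(limit_id)
        report[label] = (format_limit(soft), format_limit(hard))
    return report

for label, (soft, hard) in current_limits().items():
    print(f"{label}: soft={soft} hard={hard}")
```

Run it as the user that executes the jobs, because limits are set per user and per session.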
Scratch Area
The scratch area is disk space that jobs use when the data to be read is larger than the buffer size. Keep the scratch area maintained, with enough free space, for the smooth operation of the jobs.
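Part of keeping the scratch area maintained is checking its free space before a large run. The sketch below is a minimal illustration using Python's standard shutil.disk_usage. The function name is hypothetical, and the system temporary directory only stands in for a real scratch directory.

```python
import shutil
import tempfile

def scratch_has_room(path, required_bytes):
    """Return True if the file system holding path has enough free space."""
    usage = shutil.disk_usage(path)
    return usage.free >= required_bytes

# Stand-in path; a real check would use the scratch directory of each node.
scratch_dir = tempfile.gettempdir()
two_gib = 2 * 1024 ** 3

if not scratch_has_room(scratch_dir, two_gib):
    print(f"Less than 2 GiB free under {scratch_dir}")
```

A check like this can run before the job starts, so a full scratch area is reported early instead of surfacing as a failure in the middle of a sort.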
You cannot fit everything in memory
Not everything fits into memory, so correct stage selection is essential when reading or sorting large amounts of data. Never use the Lookup stage where a large record set would need to be stored in virtual datasets.
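A back-of-the-envelope size estimate helps with this decision. The sketch below is a simplification under stated assumptions: the function names, the memory budget, and the idea of falling back to a Join stage for large reference data are illustrative choices, not rules taken from the text above.

```python
def estimate_reference_bytes(row_count, avg_row_bytes):
    """Rough size of the reference data if it were held in memory."""
    return row_count * avg_row_bytes

def choose_stage(row_count, avg_row_bytes, memory_budget_bytes):
    """Suggest 'Lookup' when the reference data fits the budget, else 'Join'."""
    needed = estimate_reference_bytes(row_count, avg_row_bytes)
    return "Lookup" if needed <= memory_budget_bytes else "Join"

# Hypothetical budget of 512 MiB for reference data.
budget = 512 * 1024 ** 2

small_reference = choose_stage(10_000, 200, budget)
large_reference = choose_stage(500_000_000, 200, budget)
```

Ten thousand rows of about 200 bytes fit the budget easily, while five hundred million rows of the same width need roughly 100 GB and do not.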
Stored Procedure
Avoid calling a stored procedure on a per-row basis.