
% !TeX root = ../TUDthesis.tex
% Add the above to each chapter to make compiling the PDF easier in some editors.

\chapter{Introduction}
\label{chapter:Introduction}

\section{Motivation}
\label{section:Motivation}

The main reasons for writing parallel programs are better performance~\cite{performance_Hollingsworth} and scalability. Parallel programs exploit the power of multiple processors in a distributed system. A large program is divided into smaller pieces that execute simultaneously on several processors, which reduces the overall execution time. Multiple processors can run the same program in parallel by contributing their computing time and memory.

Correctness is our main focus during the development of a parallel program. Once it is established, we turn to the performance of the program. The performance of a parallel program is complex and cannot be captured by a single number. Instead, it is estimated through several parameters, such as speedup, efficiency, and scalability. We can measure and analyze different performance metrics\footnote{A performance metric is any statistic about a parallel computation designed to help us improve the running time of parallel programs~\cite{performance_Hollingsworth}.} in different ways. By analyzing these metrics we can find the barriers to better performance, in other words the performance bottlenecks of the program, and decide which areas to optimize first to improve overall performance.

The main purpose of performance measurement is to optimize and improve the performance of the program. Performance optimization generally follows a few basic steps, shown in~\autoref{fig:Perform_Cycle}.

\begin{figure}[htpb]
\centering
\includegraphics[width=0.5\linewidth]{figures/Performance_Cycle.png}
\caption{Performance Optimization Cycle~\cite{manualScorep}}
\label{fig:Perform_Cycle}
\end{figure}

The process always begins with the original application in its unoptimized state.
This application needs to be instrumented\footnote{Instrumentation enables the measurement of performance data by inserting special measurement calls into the application code at specific important points.}. Instrumentation can be done manually\footnote{The user manually inserts measurement code.}, automatically by the compiler, or by linking against pre-instrumented libraries.

The instrumented application is then executed, and the additional commands introduced during the instrumentation phase collect the data required to evaluate the performance properties of the code. These data can be stored either as a profile\footnote{A profile shows the amount of time a program spends in each function and the amount of time spent communicating or waiting for communication with other processes.} or as event traces\footnote{A trace-based system generates a file that records the significant events in the run of a program in detail.}. The additional instructions added for instrumentation require extra run time and storage space, so the measurement itself affects the performance of the instrumented code. Whether this effect is significant depends on the structure of the code under investigation. In most cases the changes are small, so the measured performance can be considered approximately the same as that of the uninstrumented code. However, certain constructs, such as regions of very short duration that are executed frequently, are likely to suffer from significant changes. It is therefore advisable not to measure such regions.

The next step is the analysis of the data obtained in the measurement phase. Normally this is done after the execution of the instrumented application has ended. If the collected data are event traces, a more detailed investigation is possible than with profiles.
In particular, one can then also look at dependencies between events happening on different processes.

The optimization cycle continues with the presentation of the analysis results as a report. It is important to remove information that is irrelevant for code optimization from the measured data. This reduces complexity and simplifies the evaluation of the data for the user. However, care must be taken not to present the results in too abstract a fashion, which would hide important facts from the user.

The user then evaluates the performance of the code with the help of the performance report. Either the application behaves well enough, in which case the user leaves the optimization cycle with the optimized version of the code as the final state, or the user proceeds to identify weaknesses that need to be addressed and the potential for improving the code.

In the latter case, the source code is changed according to the outcome of the previous step. This yields an improved application that can again be instrumented and re-enter the optimization cycle.

In parallel computing there are many metrics by which we can measure the performance of a program. These metrics are derived from various measured parameters. Some major factors that influence the performance of parallel programs are network locality, the bandwidth between host processors, and load imbalance among processors.

In distributed and parallel systems, memory addresses are distributed across the processors. Each process holds part of the data, but no process holds the complete data. A processor has faster access to memory locations mapped locally than to memory locations mapped on other processors.
There may also be differences in the access time to memory locations mapped to different processors, because most communication networks for parallel computers are multi-stage networks in which packets from a given processor may pass through a different number of stages to reach different processors, and each stage adds some latency to the communication. Only one process can access a given piece of data at a time; different processes cannot access the same data simultaneously. In the case of remote data, a process has to request the data, which causes data and messages to be transferred. If processes use their own data most of the time, there is less communication overhead. This idea is called the network data locality of a parallel program.

Distributed systems have become very common. Programs use the distributed processing power and memory of the system. Because a program uses memory that is mapped across the distributed system, it needs to access remotely mapped memory, and processors transmit requests and data over communication channels. Communication behavior and network data locality therefore play a vital role in parallel programs. Both vary with the size of the network and the system, so analyzing network data locality and observing communication behavior in a scalable system is an effective way to assess performance.

For our research we will therefore measure metrics related to network data locality. We will use existing tools and measurement techniques to collect and analyze these metrics.

Since we are measuring communication metrics, the amount of data transferred is the most important parameter we want to measure. We will measure this value and then compare it with other important parameters.

Memory allocation indicates the amount of data used by a process: the more memory is allocated, the more data the process uses. We will measure memory allocation and compare it with the data transferred between processes.
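One way to write this comparison down, as a sketch with assumed notation (the symbols $M_p$, $B_p$, and $R_{\mathrm{mem},p}$ are introduced here only for illustration), is the ratio
\begin{equation}
  R_{\mathrm{mem},p} = \frac{M_p}{B_p},
\end{equation}
where $M_p$ denotes the number of bytes allocated by process $p$ and $B_p$ the number of bytes that process $p$ exchanges with other processes.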
A higher ratio means that more data is used locally.

Each instruction is a command that the process executes. We will compare the number of executed instructions with the data exchanged between processes (peers). A higher ratio means that more instructions are executed and less data is transferred, in other words, that more data is used locally.

Load and store instructions are closely related to memory access and cache behavior. A process first looks in local memory; if the data is not found there, it requests the remote data. We will compare the number of load/store instructions with the data exchanged between processes (peers). This indicates how often local memory is accessed relative to the number of bytes transferred.

\section{Approach}
\label{section:Approach}

We will analyze network data locality to understand communication behavior, so we have to measure communication metrics related to data locality. In practice, however, a parallel computing system can have a very large number of processors or nodes (possibly millions or billions). The volume and complexity of performance data grow as the number of processors and the computational power increase~\cite{performance_Hollingsworth}. For large applications this would be too much data to store and analyze. We therefore need a tool that stores a summary of the data, which can then be navigated and analyzed.

Many performance measurement tools exist that analyze performance by measuring different metrics. These tools collect, interpret, and present information about the interactions among concurrently executing processes~\cite{performance_CQY}. They focus on where bottlenecks may be expected and where the code offers room for further improvement. To do so, they collect various metrics about applications, operating systems, and hardware. We can then analyze the collected metrics and assess the performance of parallel programs.
By measuring performance we can find performance bottlenecks\footnote{Bottlenecks in the code, such as computational bottlenecks (slow serial performance) and communication bottlenecks.}, understand communication behavior, and increase scalability. We can then focus on optimizing the bottlenecks of the program to improve its run time.

Different tools use different measurement methods. Some of them are the HPM Toolkit~\cite{performance_HPM}, Vampir~\cite{paper_Vampir}, and TAU~\cite{paper_Tau}. They measure different properties, such as function calls, hardware counters, communication, I/O behavior, and memory allocation. Measuring the performance of a parallel program is more complex than simply measuring its execution time. This is because of the large number of variables that can influence a program's behavior, among them the number of processors, the size of the data, interprocessor communication limits, and the available memory.

These tools focus on different aspects, provide specialized features, and can be used in different combinations. At the same time, they have many similarities and overlapping functionality, which leads to redundancy in basic functionality. We therefore need a tool that removes this redundancy and presents the data as a summary that we can view and analyze.

To measure performance we have to collect relevant performance metrics and associate them with source code regions. We can gather the data through instrumentation\footnote{Instrumentation is the insertion of extra code into a parallel application to measure some value.}.

Score-P~\cite{paper_scoreP} is a scalable performance measurement infrastructure for parallel codes. It provides a measurement infrastructure for profiling, event trace recording, and online analysis of High Performance Computing (HPC) applications by instrumenting the source code automatically.
The resulting profile and trace files can be viewed and analyzed with Scalasca~\cite{paper_Scalasca} and CUBE~\cite{paper_Cube}. By default, Score-P stores data as profile files, which we can explore with CUBE~\cite{paper_Cube, cube_home}. To measure instruction-related metrics we need a tool that can count hardware events. The Performance API (PAPI)~\cite{paper_Papi, Papi_home} is a standard application programming interface (API) for accessing the hardware performance counters available on most modern microprocessors. PAPI hardware counters can be used in combination with Score-P, so for our research we will use Score-P with PAPI support. CUBE is installed with Score-P by default.

To measure performance metrics we need benchmark applications that support parallel execution. We will run LULESH~\cite{paper_lulesh1, paper_lulesh2, Lulesh_home}, MILC~\cite{Milc_home}, and Kripke~\cite{Kripke_home} for our research.

To compile large applications and run them on a large number of processors we need a high performance computer or cluster. TU Darmstadt operates such a cluster, the Lichtenberg High Performance Computer~\cite{Lichtenberg_cluster}, which serves scientific research for many scientists from different universities. We will compile and run our applications on this cluster so that they run faster and in parallel across multiple cores.

\section{Organization}
\label{section:Organization}

We have divided this thesis into six chapters. In~\autoref{chapter:Measurement} we discuss the tools (Score-P, CUBE, and PAPI) that we have used for our research and the reasons for choosing them.
Then in~\autoref{chapter:MeasurementTechniques} we discuss how we approach data collection to measure performance and how we configure our tools for it.

In~\autoref{chapter:ExperimentEnvironment} we discuss how to configure the application and the system for our experiments to obtain the desired metrics. We also briefly describe the Lichtenberg cluster, on which we compile and run the benchmark applications.

Then in~\autoref{chapter:Experiments} we discuss the benchmark programs that we use for our experiments and the reasons for choosing them. In these experiments we have used LULESH, MILC, and Kripke.

In~\autoref{chapter:Results_Discuss} we discuss the results of our experiments for the different benchmark applications with different network and problem sizes, and the reasons for the behavior we observe. Finally, in~\autoref{chapter:Conclusion} we draw our conclusions.
%\pagebreak