Construction of the Economic Environment Index

The sampling procedure

The primary survey, conducted to capture user experiences for the Economic Environment Index (EEI), is a crucial part of the EEI methodology because it is the data source for several variables. The survey covered both households and business firms in all districts of Tamil Nadu. A sample of 100 households and 50 business firms per district was used for the study; the sample size was set for a 95% confidence level and a 10% confidence interval. Business firms here span a diverse range of organizations, from firms listed on a stock exchange to unregistered vendors. For the survey of businesses, a stratified random sampling technique was used.

  • Sample size for household survey: 3200
  • Sample size for business firms: 1600
The sampling design and the survey were executed by the Centre for Monitoring Indian Economy.
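
The household sample size can be checked against the standard formula for estimating a proportion, n = z² · p(1 − p) / e². The sketch below is an illustration only: reading the "10% confidence interval" as a ±10% margin of error, taking p = 0.5 (the most conservative case), and the function name `sample_size` are assumptions that the text above does not state.

```python
import math

def sample_size(z: float, margin: float, p: float = 0.5) -> int:
    """Minimum sample size for estimating a proportion in a large population.

    z      : z-score of the confidence level (1.96 for 95%)
    margin : margin of error as a fraction (assumed reading of "confidence interval")
    p      : assumed population proportion; 0.5 gives the largest required sample
    """
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)

n_min = sample_size(z=1.96, margin=0.10)
print(n_min)  # 97
```

Under these assumptions the minimum is 97 respondents, which is consistent with the 100 households surveyed per district. A total of 3200 households at 100 per district implies 32 districts.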

Transformation of variables for imputation

Prior to analyzing the missing values, the skewness [1] of each variable was tested in order to check the normality of the observations. Variables with a skewness greater than 2 were given a logarithmic transformation (or a square root transformation wherever necessary). These variables were transformed back to their original scale after the imputation process.
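
The skewness check and transformation can be sketched as follows. This is a minimal illustration, not the procedure actually run for EEI: the moment-based skewness formula, the function names and the sample data are assumptions, and the logarithm is only valid for strictly positive values.

```python
import math

def skewness(values):
    """Moment-based skewness: third central moment over the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def transform_if_skewed(values, threshold=2.0):
    """Return (possibly log-transformed values, inverse function) for one variable.

    The inverse is kept so the variable can be transformed back after imputation.
    Assumes all values are strictly positive when the log is applied.
    """
    if skewness(values) > threshold:
        return [math.log(v) for v in values], math.exp
    return list(values), (lambda v: v)

raw = [1, 1, 1, 1, 1, 1, 1, 1, 1, 100]   # hypothetical, strongly right-skewed variable
logged, inverse = transform_if_skewed(raw)
restored = [inverse(v) for v in logged]   # back-transformation after imputation
```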

Imputing Missing Values

Missing values in any dataset have to be dealt with carefully in order to avoid biased estimates and invalid conclusions. In the dataset collated for EEI, a few missing points were observed which could not be ignored, as this would have adversely affected the soundness of the EEI ranks. The EM algorithm was used to deal with these missing values. The EM method alternates between two steps: first, the expectation (E) step, which calculates the conditional expectation of the complete-data log likelihood given the observed data and the current parameter estimates; second, the maximization (M) step, which finds the parameter estimates that maximize the expected complete-data log likelihood from the E-step. The two steps are repeated until the estimates converge. The EM algorithm was used for imputation on the assumption that the data are missing completely at random (MCAR). To check this assumption, Little’s MCAR test was computed using SPSS.
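
The E and M steps can be sketched for data assumed to follow a multivariate normal distribution. This is a minimal illustration written with NumPy, not the SPSS routine used for EEI: the function name `em_impute`, the convergence rule and the sample data are assumptions, and every row is assumed to have at least one observed value.

```python
import numpy as np

def em_impute(X, n_iter=100, tol=1e-6):
    """EM imputation under an assumed multivariate normal model; NaN marks missing."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    n, p = X.shape
    mu = np.nanmean(X, axis=0)                 # starting values: column means
    filled = np.where(missing, mu, X)
    sigma = np.cov(filled, rowvar=False, bias=True)
    for _ in range(n_iter):
        C = np.zeros((p, p))                   # conditional covariance of imputed cells
        # E-step: replace each missing cell by its conditional expectation
        for i in range(n):
            m = missing[i]
            if not m.any():
                continue
            o = ~m
            B = sigma[np.ix_(m, o)] @ np.linalg.pinv(sigma[np.ix_(o, o)])
            filled[i, m] = mu[m] + B @ (X[i, o] - mu[o])
            C[np.ix_(m, m)] += sigma[np.ix_(m, m)] - B @ sigma[np.ix_(o, m)]
        # M-step: re-estimate the mean vector and covariance matrix
        new_mu = filled.mean(axis=0)
        d = filled - new_mu
        new_sigma = (d.T @ d + C) / n
        converged = (np.abs(new_mu - mu).max() < tol
                     and np.abs(new_sigma - sigma).max() < tol)
        mu, sigma = new_mu, new_sigma
        if converged:
            break
    return filled, mu, sigma

# Hypothetical data: two variables, one missing observation
data = np.array([[1.0, 2.0], [2.0, 4.1], [3.0, np.nan], [4.0, 8.2], [5.0, 9.9]])
completed, mu_hat, sigma_hat = em_impute(data)
```

Observed values are left untouched; only the NaN cells are filled, using the relationship between variables estimated from the whole dataset.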

Outlier Analysis

After the missing value analysis, the variables were transformed back to their original form. The next step, before Principal Component Analysis, was to examine the dataset for extreme values. For example, a particular variable might have a value which is particularly high or low in comparison to the rest of the observations. Such values can bias the estimates and dominate the aggregation. To limit their influence, the Winsorization [2] process was applied. For every variable, the values exceeding the 97.5th percentile were lowered to the 97.5th percentile and the values smaller than the 2.5th percentile were raised to the 2.5th percentile. Hence, the extreme observations within a variable were pulled in to lie within the 2.5th to 97.5th percentile bounds rather than being discarded.
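
The Winsorization step can be sketched with NumPy as below. The function name and the sample data are hypothetical, and `np.percentile` uses linear interpolation by default, which may differ slightly from the percentile definition used for EEI.

```python
import numpy as np

def winsorize(values, lower_pct=2.5, upper_pct=97.5):
    """Pull values outside the given percentiles in to the percentile bounds."""
    values = np.asarray(values, dtype=float)
    lo, hi = np.percentile(values, [lower_pct, upper_pct])
    return np.clip(values, lo, hi)

# Hypothetical variable: 39 ordinary observations and one extreme value
scores = np.append(np.arange(1.0, 40.0), 500.0)
clipped = winsorize(scores)
```

No observation is dropped: the array keeps its length, and only the values beyond the two bounds change.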

Normalization of Values

Any composite indicator is made up of a large number of variables, each with its own measurement unit. Aggregating variables with different measurement units would render the final index meaningless. Thus, one must normalize the data prior to aggregation. The raw data were normalized by rescaling each observation relative to the best and worst values observed for its variable.

Here, the best value is the most favorable observation of a variable and the worst is the least favorable; which of the maximum and minimum plays each role depends on the nature of the variable. If the variable has a positive influence on the index, the best value is the highest number in the observed set of values and the worst is the lowest. If the variable has a negative influence, the best value is the lowest number in the observed set and the worst is the highest. This exercise results in all normalized values lying between 0 and 1.
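
The best/worst scaling can be sketched as follows, assuming the usual min-max form (value − worst) / (best − worst), which gives 1 to the best observation and 0 to the worst. The function name `normalize` and the two sample variables are hypothetical.

```python
def normalize(values, positive=True):
    """Scale one variable to [0, 1] so that 1 is always the best observation.

    positive=True  : higher is better (best = maximum, worst = minimum)
    positive=False : lower is better  (best = minimum, worst = maximum)
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("variable is constant; it cannot be normalized")
    span = hi - lo
    if positive:
        return [(v - lo) / span for v in values]
    return [(hi - v) / span for v in values]

literacy = [60.0, 75.0, 90.0]     # hypothetical positive-influence variable
crime_rate = [10.0, 25.0, 40.0]   # hypothetical negative-influence variable
print(normalize(literacy))                    # [0.0, 0.5, 1.0]
print(normalize(crime_rate, positive=False))  # [1.0, 0.5, 0.0]
```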

Weighting and Aggregation

EEI employs Principal Component Analysis (PCA), which reduces a set of correlated variables to a smaller set of uncorrelated principal components. PCA assigns weights to variables according to the variation in the observed dataset. For instance, if variable X has a greater variance than variable Y, then Y will receive less weight than X in the construction of the index.

PCA is a multivariate statistical technique used to reduce the number of variables in a data set to a smaller number of ‘dimensions’. In mathematical terms, from an initial set of n correlated variables, PCA creates uncorrelated indices or components, where each component is a linear weighted combination of the initial variables [3]. In simpler words, PCA reduces a set of observed variables to a smaller set of variables called principal components, which can be used for subsequent analysis. To understand principal components in more detail, consider a set of variables X1, X2, …, Xn. If PCA produces principal components P1 and P2 for this set of variables, they are given by the following equations.

P1 = a11 X1 + a12 X2 + … + a1n Xn

P2 = a21 X1 + a22 X2 + … + a2n Xn

where a1n is the weight (regression coefficient) of observed variable n in the first principal component, and Xn is the subject’s score on observed variable n.

The Principal Component Analysis gives us the principal components and the eigenvectors of the covariance matrix (correlation matrix in case of non-standardized data) of the variables. The eigenvalue corresponding to each eigenvector gives the variance of that principal component. PCA generates a number of uncorrelated principal components, ordered by the proportion of the variation in the data set that each explains: the first component explains the largest share, and each subsequent component explains an additional but smaller proportion. The number of principal components required depends on the degree of correlation among the variables: the higher the correlation, the fewer components are required.
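
The eigen-decomposition described above can be sketched with NumPy. This is an illustration under assumptions, not the SPSS procedure used for EEI: the function name `pca`, the choice of the correlation matrix and the randomly generated data are all hypothetical.

```python
import numpy as np

def pca(data):
    """PCA on the correlation matrix of the columns of `data`.

    Returns eigenvalues (component variances, largest first), eigenvectors
    (one column of weights per component), the share of variance explained
    by each component, and the component scores of each observation.
    """
    X = np.asarray(data, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardized variables
    corr = np.corrcoef(X, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(corr)   # returned in ascending order
    order = np.argsort(eigenvalues)[::-1]              # reorder: largest variance first
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    explained = eigenvalues / eigenvalues.sum()
    scores = Z @ eigenvectors                          # P_k = a_k1 X_1 + ... + a_kn X_n
    return eigenvalues, eigenvectors, explained, scores

# Hypothetical data: 32 observations, two highly correlated variables and one unrelated
rng = np.random.default_rng(0)
base = rng.normal(size=32)
data = np.column_stack([base,
                        2 * base + rng.normal(scale=0.1, size=32),
                        rng.normal(size=32)])
eigenvalues, eigenvectors, explained, scores = pca(data)
```

Because two of the three hypothetical variables are almost perfectly correlated, most of the variation is captured by fewer than three components, which illustrates the point about correlation and the number of components required.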

For the construction of EEI, PCA is run twice: first to generate weights for the variables within each sub-index, and then to assign weights to the sub-indices themselves. The final score of a district is aggregated on the basis of the weights generated by the PCA.
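
The two-stage aggregation can be sketched as below. The text does not say how the PCA output is converted into weights, so the rule used here (squared loadings of the first principal component, which sum to 1) is an assumption, as are the function names, the sub-index names and the randomly generated data. Each sub-index is assumed to contain at least two non-constant variables.

```python
import numpy as np

def pca_weights(data):
    """Weights from the squared loadings of the first principal component.

    ASSUMED weighting rule, for illustration only. The eigenvector has unit
    length, so the squared loadings are non-negative and sum to 1.
    """
    corr = np.corrcoef(np.asarray(data, dtype=float), rowvar=False)
    _, eigenvectors = np.linalg.eigh(corr)   # eigenvalues in ascending order
    first_component = eigenvectors[:, -1]    # component with the largest variance
    return first_component ** 2

def two_stage_index(sub_indices):
    """sub_indices maps a name to an (n_districts, n_variables) array of normalized values."""
    # Stage 1: weight the variables within each sub-index
    stage1 = np.column_stack([np.asarray(v) @ pca_weights(v)
                              for v in sub_indices.values()])
    # Stage 2: weight the sub-index scores themselves
    return stage1 @ pca_weights(stage1)

# Hypothetical normalized data for 32 districts and three sub-indices
rng = np.random.default_rng(1)
sub_indices = {
    "infrastructure": rng.uniform(size=(32, 3)),
    "finance": rng.uniform(size=(32, 4)),
    "governance": rng.uniform(size=(32, 2)),
}
index = two_stage_index(sub_indices)
ranks = (-index).argsort().argsort() + 1   # rank 1 = highest score
```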

  • 1. Skewness is a measure of the asymmetry of a probability distribution. A normal distribution has no skewness, as it is symmetric with its mean equal to both its median and its mode.
  • 2. Winsorization is a technique to correct for outliers in a dataset. It moves extreme values towards the centre of the distribution: instead of being deleted or discarded, an extreme value is replaced by the value at a certain percentile.
  • 3. http://heapol.oxfordjournals.org/cgi/content/full/21/6/459#SEC3