Abstract

This whitepaper highlights cutting-edge approaches to analyzing, detecting, and preventing fraud with modern technologies such as Big Data and Hadoop. It presents the principles of fraud management and demonstrates an application of Big Data through a sample use case: credit card default detection using Big Data technologies.

Introduction

Fraud is a major concern across all industries. Name the industry (banking, insurance, government, healthcare, or retail, for example) and you’ll find fraud.

In today’s interconnected world, the sheer volume and complexity of transactions make it harder than ever to find fraud.

Traditional approaches to fraud prevention aren’t particularly efficient. For example, improper payments are often managed by analysts who audit a very small sample of claims and request medical documentation from targeted submitters. The industry term for this model is "pay and chase": claims are accepted and paid out, and post-payment review processes then look for intentional or unintentional overpayments.

Although the sheer volume of transactions makes fraud harder to spot, the same volume can help create better predictive fraud models, an area where Hadoop and Big Data shine.

How is Fraud Detection Done?

Because of the limitations of traditional technologies, fraud models are typically built by sampling the data and using the sample to build a set of fraud-prediction and fraud-detection models. Contrast this with a Hadoop-anchored fraud department that uses the full data set, with no sampling, to build its models, and the difference becomes clear.
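One way to see the difference is a back-of-the-envelope calculation. The sketch below is illustrative only: the fraud rate and sample sizes are assumed values, not figures from this whitepaper, and the class and method names are hypothetical. It estimates how many fraudulent records a simple random sample is expected to contain, and the probability that the sample contains none at all, assuming each record is fraudulent independently with the same probability.

```java
// Illustrative sketch: the fraud rate and sample sizes are assumptions, not data from this paper.
class SamplingMissSketch {

    // Expected number of fraudulent records in a random sample of sampleSize records.
    static double expectedFraudInSample(double fraudRate, long sampleSize) {
        return fraudRate * sampleSize;
    }

    // Probability that the sample contains no fraudulent record: (1 - p)^n,
    // assuming independent records with the same fraud probability p.
    static double probabilityOfNoFraud(double fraudRate, long sampleSize) {
        return Math.pow(1.0 - fraudRate, sampleSize);
    }

    public static void main(String[] args) {
        double fraudRate = 0.001; // assumed: 1 fraudulent record per 1,000
        for (long n : new long[] {100, 1_000, 10_000}) {
            System.out.printf("sample=%d expectedFraud=%.1f pNoFraud=%.3f%n",
                    n, expectedFraudInSample(fraudRate, n), probabilityOfNoFraud(fraudRate, n));
        }
    }
}
```

Under these assumed numbers, a sample of 1,000 records is expected to contain about one fraudulent record and has roughly a 37% chance of containing none, which leaves a model built on that sample very little to learn from.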
For creating fraud-detection models, Hadoop is well suited to:

- Handle volume: process the full data set, with no data sampling.
- Manage new varieties of data: data coming from different sources and in different formats.
- Maintain an agile environment: enable different kinds of analysis and changes to existing models.

The Limitations of Sampling

Faced with expensive hardware and a high commitment of time and RAM, people tried to make the analytics workload more manageable by analyzing only a sample of the data. While sampling is a good idea in theory, in practice it is often an unreliable tactic. Finding a statistically significant sample can be challenging for sparse or skewed data sets, which are quite common. This leads to poorly judged samples, which can introduce outliers and anomalous data points and, in turn, bias the results of the analysis.

Best Practices in Fraud Management

A best-practice fraud management approach is integrated from end to end.

Combating Fraud with the Technology Available Today: Big Data and Hadoop

Step 1. Create an enterprise-wide view of patterns and perpetrators.
Step 2. Prevent and detect fraud in an enterprise-wide context.
Step 3. Investigate and resolve fraud in an integrated environment.

The figure below shows how Hadoop can be integrated within an enterprise and used to build fraud patterns and models and to run analytics on the full data set rather than on a sample.

Figure 1: Hadoop in Enterprise

A best-practice fraud management system is integrated from end to end, from data management to analysis (using multiple analytical techniques), alert generation and management, and case management. Within the enterprise, Hadoop can be used in several roles:

- Hadoop as a queryable archive in support of an enterprise data warehouse.
- Hadoop as a data transformation engine.
- Hadoop as a data processing engine.
- Hadoop to add discovery and sandbox capabilities to a modern-day analytics ecosystem.
Fraud Models and Hadoop

As in most Hadoop use cases, Hadoop helps businesses break through the glass ceiling on the volume and variety of data that can be incorporated into decision analytics. The more data we have, the better our models can be.

Mixing non-traditional forms of data with a set of historical transactions can make fraud models even more robust.

Organizations can move away from market-segment modelling and toward modelling at the level of the individual transaction or person. Quite simply, making a forecast based on a segment is helpful, but making a decision based on particular information about an individual transaction is better. To do this, we work with a larger set of data than is conventionally possible in the traditional approach.

If the data used to identify or bolster new fraud-detection models isn’t available at a moment’s notice, then by the time we discover new patterns it could be too late to prevent damage.

Evaluate the benefit to the business not only of building more comprehensive models with more types of data, but also of being able to refresh and enhance those models faster than ever. Traditional technologies aren’t as agile, whereas Hadoop makes it easy to introduce new variables into the model.

Traditional Statistical Analysis and Hadoop

Traditional statistical analysis applications come with powerful tools for generating workflows. These applications use intuitive graphical user interfaces that allow for better data visualization. Hadoop follows a similar pattern to these tools for generating statistical analysis workflows.

As Figure 2 shows, during the final data exploration and visualization step, users can export to human-readable formats (JSON/CSV) or take advantage of visualization tools.

Figure 2: Generalized statistical analysis workflow with Hadoop

Applications of Big Data: Credit Card Default Detection

This project corresponds to a real-life scenario found in credit card companies.
Customers who own credit cards are expected to pay off their balances monthly. Some customers nevertheless default (do not pay), which creates financial difficulties for the bank. Banks want to know which customers are likely to default in the future so that they can take the necessary actions (such as closing the card or reducing the spending limit). This problem involves a specific bank that wants to analyze its customers’ payment patterns and narrow them down to the cases most likely to default. The problem comes with a dataset containing information about credit card customers for the past 6 months and a set of questions from the bank. Our project is to analyze the data and answer these questions using Big Data technologies.

The dataset is in the file creditcarddefault.csv. The file contains the following columns:

| Column Name | Description |
|---|---|
| CUSTID | Unique customer ID |
| LIMIT_BAL | Maximum spending limit for the customer |
| SEX | Sex of the customer. Some records use M and F; others use 1 (Male) and 2 (Female) |
| EDUCATION | Education level of the customer: 1 (Graduate), 2 (University), 3 (High School), or 4 (Others) |
| MARRIAGE | Marital status of the customer: 1 (Single), 2 (Married), or 3 (Others) |
| AGE | Age of the customer |
| PAY_1 to PAY_6 | Payment status for the last 6 months, one column per month. Each value is the number of months (delay) the customer took to pay that month’s bill |
| BILL_AMT1 to BILL_AMT6 | The billed amount for the credit card for each of the last 6 months |
| PAY_AMT1 to PAY_AMT6 | The actual amount the customer paid for each of the last 6 months |
| DEFAULTED | Whether the customer defaulted in the 7th month: 0 (did not default) or 1 (defaulted) |

Figure 3: creditcarddefault.csv dataset

The bank wants answers to the following business questions:

01: Is there a clear distinction between males and females when it comes to the pattern of defaulting?
Does one sex default more than the other? The bank wants a report like the following, showing the percentage defaulted for both males and females.

| SEX_NAME | TOTAL | DEFAULTS | PER_DEFAULT |
|---|---|---|---|
| Female | 591.00 | 218.00 | 37.00 |
| Male | 409.00 | 185.00 | 45.00 |

02: How do marital status and level of education affect the level of defaulting? Does one category of customers default more than the others? Produce a report that looks like the following.

| MARR_DESC | ED_STR | TOTAL | DEFAULTS | PER_DEFAULT |
|---|---|---|---|---|
| Married | Graduate | 268.00 | 69.00 | 26.00 |
| Married | High School | 55.00 | 24.00 | 44.00 |
| Married | Others | 4.00 | 2.00 | 50.00 |
| Married | University | 243.00 | 65.00 | 27.00 |
| Others | Graduate | 4.00 | 4.00 | 100.00 |
| Others | High School | 8.00 | 6.00 | 75.00 |
| Single | Graduate | 123.00 | 71.00 | 58.00 |
| Single | High School | 87.00 | 52.00 | 60.00 |

03: Does the average payment delay over the previous 6 months give any indication that the customer will default in the future? Produce a report that looks like the following.

| AVG_PAY_DUR | TOTAL | DEFAULTS | PER_DEFAULT |
|---|---|---|---|
| 0.00 | 356.00 | 141.00 | 40.00 |
| 1.00 | 552.00 | 218.00 | 39.00 |
| 2.00 | 85.00 | 41.00 | 48.00 |
| 3.00 | 4.00 | 2.00 | 50.00 |
| 4.00 | 3.00 | 1.00 | 33.00 |

Here we use a Big Data technology, Apache Spark, to address the bank’s challenges, answer the three business questions above, and generate the reports.
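The aggregation behind the first report can be sketched in plain Java, without Spark, to make the logic explicit. This is a minimal sketch and not the application shown in Figure 4: the class name, method names, and the tiny in-memory data set are hypothetical, and rounding the percentage to the nearest whole number is an assumption that happens to agree with the sample report (218 of 591 is about 37%). It first normalizes the mixed SEX encodings described in the dataset (M/F and 1/2), then counts totals and defaults per group.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch: names and sample rows are hypothetical, not taken from the paper's application.
class DefaultReportSketch {

    // Map the mixed SEX encodings (M/F or 1/2) to a single label.
    static String sexName(String raw) {
        String v = raw.trim().toUpperCase();
        if (v.equals("1") || v.equals("M")) return "Male";
        if (v.equals("2") || v.equals("F")) return "Female";
        return "Unknown";
    }

    // Each input row is {SEX, DEFAULTED}. The result maps a sex label to
    // {total customers, number of defaults, percent defaulted (rounded)}.
    static Map<String, double[]> defaultsBySex(String[][] rows) {
        Map<String, double[]> report = new TreeMap<>();
        for (String[] row : rows) {
            double[] agg = report.computeIfAbsent(sexName(row[0]), k -> new double[3]);
            agg[0] += 1;                          // total
            agg[1] += Double.parseDouble(row[1]); // DEFAULTED is 0 or 1
        }
        for (double[] agg : report.values()) {
            agg[2] = Math.round(agg[1] * 100.0 / agg[0]); // assumed rounding rule
        }
        return report;
    }

    public static void main(String[] args) {
        String[][] rows = {
            {"M", "1"}, {"1", "0"}, {"F", "0"}, {"2", "1"}, {"F", "0"}, {"2", "0"}
        };
        defaultsBySex(rows).forEach((sex, agg) ->
            System.out.printf("%-6s total=%.0f defaults=%.0f perDefault=%.0f%n",
                    sex, agg[0], agg[1], agg[2]));
    }
}
```

The second and third reports follow the same pattern with a different grouping key: the pair of marital status and education level for question 02, and the average payment delay for question 03.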
Sample Java Apache Spark application:

Figure 4: Apache Spark Java Application

Output of the application showing answers to the bank’s three questions:

Applications of Big Data - Credit Card Default Detection
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Raw Data :
+------+---------+---+---------+--------+---+-----+-----+-----+-----+-----+-----+---------+---------+---------+---------+---------+---------+--------+--------+--------+--------+--------+--------+---------+
|CUSTID|LIMIT_BAL|SEX|EDUCATION|MARRIAGE|AGE|PAY_1|PAY_2|PAY_3|PAY_4|PAY_5|PAY_6|BILL_AMT1|BILL_AMT2|BILL_AMT3|BILL_AMT4|BILL_AMT5|BILL_AMT6|PAY_AMT1|PAY_AMT2|PAY_AMT3|PAY_AMT4|PAY_AMT5|PAY_AMT6|DEFAULTED|
+------+---------+---+---------+--------+---+-----+-----+-----+-----+-----+-----+---------+---------+---------+---------+---------+---------+--------+--------+--------+--------+--------+--------+---------+
|   530|    20000|  2|        2|       2| 21|   -1|   -1|    2|    2|   -2|   -2|        0|        0|        0|        0|        0|        0|       0|       0|       0|       0|  162000|       0|        0|
|    38|    60000|  2|        2|       2| 22|    0|    0|    0|    0|   -2|   -2|        0|        0|        0|        0|        0|        0|       0|       0|       0|       0|       0|    1576|        0|
|    43|    10000|  1|        2|       2| 22|    0|    0|    0|    0|   -2|   -2|        0|        0|        0|        0|        0|        0|       0|       0|       0|       0|       0|    1500|        0|
|    47|    20000|  2|        1|       2| 22|    0|    0|    2|   -1|    0|   -1|     1131|      291|      582|      291|        0|      291|     291|     582|       0|       0|  130291|     651|        0|
|    70|    20000|  1|        4|       2| 22|    2|    0|    0|    0|   -1|   -1|     1692|    13250|      433|     1831|        0|     2891|   13250|     433|    1831|       0|    2891|  153504|        0|
+------+---------+---+---------+--------+---+-----+-----+-----+-----+-----+-----+---------+---------+---------+---------+---------+---------+--------+--------+--------+--------+--------+--------+---------+
only showing top 5 rows

root
 |-- CUSTID: string (nullable = true)
 |-- LIMIT_BAL: string (nullable = true)
 |-- SEX: string (nullable = true)
 |-- EDUCATION: string (nullable = true)
 |-- MARRIAGE: string (nullable = true)
 |-- AGE: string (nullable = true)
 |-- PAY_1: string (nullable = true)
 |-- PAY_2: string (nullable = true)
 |-- PAY_3: string (nullable = true)
 |-- PAY_4: string (nullable = true)
 |-- PAY_5: string (nullable = true)
 |-- PAY_6: string (nullable = true)
 |-- BILL_AMT1: string (nullable = true)
 |-- BILL_AMT2: string (nullable = true)
 |-- BILL_AMT3: string (nullable = true)
 |-- BILL_AMT4: string (nullable = true)
 |-- BILL_AMT5: string (nullable = true)
 |-- BILL_AMT6: string (nullable = true)
 |-- PAY_AMT1: string (nullable = true)
 |-- PAY_AMT2: string (nullable = true)
 |-- PAY_AMT3: string (nullable = true)
 |-- PAY_AMT4: string (nullable = true)
 |-- PAY_AMT5: string (nullable = true)
 |-- PAY_AMT6: string (nullable = true)
 |-- DEFAULTED: string (nullable = true)

Transformed Data :
+------+--------+---+---------+--------+----+---------+------------------+-----------------+---------+---------+
|CustId|LimitBal|Sex|Education|Marriage| Age|AvgPayDur|        AvgBillAmt|        AvgPayAmt|  PerPaid|Defaulted|
+------+--------+---+---------+--------+----+---------+------------------+-----------------+---------+---------+
| 530.0| 20000.0|2.0|      2.0|     2.0|20.0|      2.0|               0.0|          27000.0|2700000.0|      0.0|
|  43.0| 10000.0|1.0|      2.0|     2.0|20.0|      1.0|               0.0|            250.0|  25000.0|      0.0|
|  70.0| 20000.0|1.0|      4.0|     2.0|20.0|      1.0|            3349.5|          28651.5|    850.0|      0.0|
|  99.0| 50000.0|2.0|      3.0|     1.0|20.0|      0.0|117.83333333333333|            829.5|    700.0|      0.0|
| 135.0| 30000.0|2.0|      2.0|     2.0|20.0|      1.0|61.333333333333336|359.8333333333333|    575.0|      0.0|
+------+--------+---+---------+--------+----+---------+------------------+-----------------+---------+---------+
only showing top 5 rows

Transformed and Joined Data :
+------+--------+---+---------+--------+----+---------+------------------+-----------------+---------+---------+-------+-----------+------------+
|CustId|LimitBal|Sex|Education|Marriage| Age|AvgPayDur|        AvgBillAmt|        AvgPayAmt|  PerPaid|Defaulted|sexName|    eduName|marriageName|
+------+--------+---+---------+--------+----+---------+------------------+-----------------+---------+---------+-------+-----------+------------+
| 530.0| 20000.0|2.0|      2.0|     2.0|20.0|      2.0|               0.0|          27000.0|2700000.0|      0.0| Female| University|     Married|
|  43.0| 10000.0|1.0|      2.0|     2.0|20.0|      1.0|               0.0|            250.0|  25000.0|      0.0|   Male| University|     Married|
|  70.0| 20000.0|1.0|      4.0|     2.0|20.0|      1.0|            3349.5|          28651.5|    850.0|      0.0|   Male|     Others|     Married|
|  99.0| 50000.0|2.0|      3.0|     1.0|20.0|      0.0|117.83333333333333|            829.5|    700.0|      0.0| Female|High School|      Single|
| 135.0| 30000.0|2.0|      2.0|     2.0|20.0|      1.0|61.333333333333336|359.8333333333333|    575.0|      0.0| Female| University|     Married|
+------+--------+---+---------+--------+----+---------+------------------+-----------------+---------+---------+-------+-----------+------------+
only showing top 5 rows

Solution for PR#01 :
Stage 18:=================================================>    (182 + 2) / 199
+-------+-----+--------+----------+
|sexName|Total|Defaults|PerDefault|
+-------+-----+--------+----------+
| Female|  591|   218.0|      37.0|
|   Male|  409|   185.0|      45.0|
+-------+-----+--------+----------+

Solution for PR#02 :
Stage 24:==================================>                   (126 + 2) / 200
+------------+-----------+-----+--------+----------+
|marriageName|    eduName|Total|Defaults|PerDefault|
+------------+-----------+-----+--------+----------+
|     Married|   Graduate|  268|    69.0|      26.0|
|     Married|High School|   55|    24.0|      44.0|
|     Married|     Others|    4|     2.0|      50.0|
|     Married| University|  243|    65.0|      27.0|
|      Others|   Graduate|    4|     4.0|     100.0|
|      Others|High School|    8|     6.0|      75.0|
|      Others| University|    7|     3.0|      43.0|
|      Single|   Graduate|  123|    71.0|      58.0|
|      Single|High School|   87|    52.0|      60.0|
|      Single|     Others|    3|     2.0|      67.0|
|      Single| University|  198|   105.0|      53.0|
+------------+-----------+-----+--------+----------+

Solution for PR#03 :
+---------+-----+--------+----------+
|AvgPayDur|Total|Defaults|PerDefault|
+---------+-----+--------+----------+
|      0.0|  277|   115.0|      42.0|
|      1.0|  631|   244.0|      39.0|
|      2.0|   82|    39.0|      48.0|
|      3.0|    7|     4.0|      57.0|
|      4.0|    3|     1.0|      33.0|
+---------+-----+--------+----------+

Figure 5: Console output of Apache Spark Java Application

Closing Thoughts

Fraud is a major concern across all industries. Many organizations spend a lot of money and effort on preventing fraud. With the power of modern technologies such as Big Data and Hadoop, analyzing, detecting, and preventing fraud has moved to the next level.

Organizations can continue using their existing IT infrastructure and leverage Big Data Hadoop technologies for real-time fraud analysis.

Organizations can truly be agile while handling Data in Motion, Data at Rest, and Data in Many Forms with Big Data Hadoop technologies.

References:

https://en.wikipedia.org/wiki/Data_analysis_techniques_for_fraud_detection
https://blog.codecentric.de/en/2017/09/data-science-fraud-detection/
https://mapr.com/blog/real-time-credit-card-fraud-detection-apache-spark-and-event-streaming/
https://blogs.technet.microsoft.com/machinelearning/2017/06/28/using-azure-data-lake-and-r-for-fraud-detection/

Related Fraud Detection Solutions:

https://gallery.cortanaintelligence.com/Experiment/Online-Fraud-Detection-Step-1-of-5-Generate-tagged-data-2
https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-real-time-fraud-detection

Nisum Solution

At Nisum, we are passionate about leveraging cutting-edge technologies to create successes for our clients. We take a vendor- and tool-agnostic approach in our consulting services, with the sole goal of enhancing the ROI of your investments.
We use a variety of methods, surveys, and market research to assess the current state and readiness of our clients’ digital measurement infrastructure, as well as potential future requirements, before recommending custom solutions that fit these requirements, budgets, and timelines over both the short and long term. We have robust experience working with several vendors listed in this whitepaper, which includes overseeing the implementation and maintenance of their digital analytics solutions and improving their adoption.

We have strong capabilities in providing 24/7 client support using Agile methodologies and leveraging our consultants across offices globally in the United States, Chile, and India for end-to-end solutions.

Who We Are

Nisum enables transformation for industry-leading brands: we know how to build strong emotional bonds between B2C clients and customers via smart technology solutions. Nisum is a global consulting firm headquartered in Southern California. Founded in 2000 with the customer-centric motto, Building Success Together®, we’ve grown to over 900 consultants across the United States, India, and Chile. Our philosophy and deep technical expertise result in integrated solutions that deliver real and measurable growth.

Whether you’re a hot startup or a major global brand, our approach is the same: forge the most powerful connection possible between people, processes, and products to achieve unparalleled success. At the intersection of business and technology, Nisum has everything you need to grow your organization. From Strategic IT Planning, Agile Enablement, and Business Process Engineering to Application Development, Test Automation, and DevOps, Nisum has you covered. We specialize in building adaptable back-end systems such as Order Management, Inventory, and eCommerce to facilitate true omnichannel success for our customers.