What does the process of data mining begin with?

What is the core step of data mining?

Model building. The core step of data mining is model building: during model building you need to choose an appropriate algorithm, optimize the model's parameters, and evaluate the resulting model.
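A minimal sketch of these three activities, assuming scikit-learn as the tool (the answer above does not name one) and a built-in toy dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# 1. Choose an appropriate algorithm (here: a support vector classifier).
model = SVC()

# 2. Optimize the model's parameters with a cross-validated grid search.
search = GridSearchCV(model, param_grid={"C": [0.1, 1, 10]}, cv=5)
search.fit(X_train, y_train)

# 3. Evaluate the tuned model on held-out data.
y_pred = search.best_estimator_.predict(X_test)
print("best parameters:", search.best_params_)
print("test accuracy:", accuracy_score(y_test, y_pred))
```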

What is data mining, and what is the process of data mining?

1.1 The Rise of Data Mining

1.1.1 Data Abundance and Knowledge Scarcity

Information is reprocessed through deeper inductive analysis to understand the patterns it contains and thereby obtain more useful information, i.e., knowledge. On the basis of a large accumulation of knowledge, principles and laws are summarized, forming what we call wisdom.

The current awkward situation: “rich in data” but “poor in knowledge”

1.1.2 From Data to Knowledge

The formation of the data warehouse: as data volumes grow and data sources multiply with incompatible formats, it becomes necessary, in order to access the information needed for decision making, to integrate the entire organization’s data and store it together in a unified form. This gave rise to the data warehouse (DW).

OLAP (On-Line Analytical Processing) tools: in response to the accelerated pace of change in the marketplace, OLAP was proposed as a repeatable analytical tool capable of conducting real-time analysis and generating corresponding reports. OLAP allows users to interactively browse the contents of the data warehouse and perform multidimensional analysis of the data therein.

OLAP analysis is premised on the user already having a preconception or hypothesis about some kind of knowledge hidden in the data; it is a user-directed process of information analysis and knowledge discovery.

Intelligent automatic analysis tools: to adapt to a rapidly changing market environment, intelligent automatic tools based on computers and information technology are needed to help mine the various kinds of knowledge hidden in the data. Such tools can generate their own hypotheses, test or verify them against the data in the data warehouse (or large databases), and then return the most valuable results to the user.

In addition, such tools should be able to cope with the many characteristics of real-world data (large volume, noisy, incomplete, dynamic, sparse, heterogeneous, nonlinear, etc.).

1.1.3 The Emergence of Data Mining (DM)

In 1995, the concept of data mining (Data Mining) was proposed at a computer conference held in the United States.

The whole knowledge discovery process consists of a number of important steps (data mining is only one of them):

1) Data Cleaning: removing data noise and data that are obviously irrelevant to the mining topic

2) Data Integration: combining relevant data from multiple data sources into a single data source

3) Data Mining: the core step, in which algorithms are applied to extract patterns from the prepared data

4) Knowledge Representation: uses visualization and knowledge expression techniques to present the mined knowledge to users
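A hedged sketch of the cleaning and integration steps using pandas; the data sources and column names are invented for illustration:

```python
import pandas as pd

# Two hypothetical data sources with overlapping customer records.
crm = pd.DataFrame({"customer_id": [1, 2, 2, 3],
                    "age": [34, 29, None, 41]})
orders = pd.DataFrame({"customer_id": [1, 2, 3],
                       "total_spent": [120.0, 85.5, 400.0]})

# Data cleaning: drop duplicate records and fill a missing value (simple noise handling).
crm = crm.drop_duplicates(subset="customer_id", keep="last")
crm["age"] = crm["age"].fillna(crm["age"].median())

# Data integration: combine the two sources into a single dataset for mining.
dataset = crm.merge(orders, on="customer_id", how="inner")
print(dataset)
```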

1.1.4 Business Problems Solved by Data Mining (Cases)

Customer Behavior Analysis

Customer Churn Analysis

Cross-Selling

Fraud Detection

Risk Management

Customer Segmentation

Advertising Targeting

Market and Trend Analysis

Which stages can data mining generally be divided into?

Data mining can generally be divided into the following stages:

Defining the problem: clearly define the business problem and determine the purpose of data mining.

Data Preparation: this includes data selection – extracting the target dataset for mining from large databases and data warehouses – and data preprocessing – checking the integrity and consistency of the data, removing noise, filling in missing values, and deleting invalid data.

Data Mining: select an appropriate algorithm according to the intended function and the characteristics of the data, and mine the cleaned and transformed dataset.

Results Analysis: interpret and evaluate the results of data mining and convert them into knowledge that end users can understand.

What are the common methods used in data mining? What is the basic process?

Classification algorithms: classify data into different categories based on existing data features, e.g., decision trees, Naive Bayes, support vector machines, etc.

Clustering algorithms: group data by similarity, e.g., K-Means clustering, hierarchical clustering, and other algorithms.

Association rule mining: discover correlations between items in a dataset, e.g., the Apriori algorithm.

Predictive modeling: use patterns in historical data to forecast future trends, e.g., regression analysis, time series analysis, etc.

The basic process of data mining includes selecting the dataset, data preprocessing, feature selection, model selection, model evaluation, and model application. Among these, data preprocessing is the most important step; it includes data cleaning, data transformation, data normalization, and so on.
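A hedged sketch of the preprocessing step described above (cleaning, transformation, normalization), assuming pandas and scikit-learn; the columns and values are made up:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"income": [3000, 4500, None, 12000],
                   "visits": [2, 7, 3, 40]})

# Data cleaning: fill the missing income with the column median.
df["income"] = df["income"].fillna(df["income"].median())

# Data transformation: log-transform the skewed "visits" column.
df["log_visits"] = np.log1p(df["visits"])

# Data normalization: rescale features to zero mean and unit variance.
scaled = StandardScaler().fit_transform(df[["income", "log_visits"]])
print(scaled)
```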

Commonly used tools for implementing data mining include the R language, Python, SQL Server Analysis Services, and so on; they provide visual presentation of data mining results and implementations of a variety of data analysis algorithms.

If data is integrated in the cloud, ETL tools such as DataX and ETLCloud can be chosen. These focus mainly on data extraction, transformation, and loading; although they can also perform simple data preprocessing and cleaning according to user needs, complex data mining processes still need to be carried out with professional data mining tools.

What are the techniques of data mining?

Many newcomers today want to enter the field of big data development; the following introduces data mining technology starting from its basic concepts.

I. The basic concept of data mining technology

With the development of computer technology, all industries have begun to use computers and the corresponding information technology for management and operation, which has greatly increased enterprises' ability to generate, collect, store, and process data, and the volume of data keeps growing. Enterprise data is in fact an accumulation of enterprise experience; once it accumulates to a certain extent, it inevitably reflects regularities. To an enterprise, a mountain of data amounts to a huge treasure trove. Against this background, there is an urgent need for a new generation of computing technologies and tools to mine the treasures in databases and turn them into useful knowledge that can guide the enterprise's technical and business decisions and keep it competitive. On the other hand, over the past decade or so, computers and information technology have also made great progress, producing many new concepts and technologies, such as higher-performance computers and operating systems, the Internet, data warehouses, and neural networks. With both the market demand and the technical basis in place, data mining technology, or KDD (Knowledge Discovery in Databases), came into being.

Data mining aims to extract, from large amounts of incomplete, noisy, fuzzy, and random data, information and knowledge that is implicit, not known in advance, but potentially useful. Many other terms carry a similar meaning, such as knowledge discovery in databases (KDD), data analysis, data fusion, and decision support.

The following introduces ten data mining analysis methods:

1. Memory-Based Reasoning (MBR)

The central idea of memory-based reasoning is to use known cases to predict certain attributes of future cases, usually by finding the most similar cases for comparison.

There are two main elements in memory-based reasoning: the distance function and the combination function. The distance function identifies the most similar cases, while the combination function combines the attributes of those similar cases to make a prediction. One advantage of memory-based reasoning is that it can handle a wide range of data types, which do not need to satisfy particular assumptions. Another advantage is its ability to learn: it acquires knowledge about new cases by learning from old ones. A more critical point is that it needs a large amount of historical data in order to make good predictions. In addition, memory-based reasoning is time-consuming, and it is not easy to find the best distance function and combination function. Its applications include fraud detection, customer response prediction, medical treatment, response classification, and so on.
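A minimal, self-contained sketch of memory-based reasoning with an explicit distance function and combination function; the historical cases are invented:

```python
import math

history = [
    # (attributes, known outcome) -- e.g. (age, monthly_spend) -> responded to offer?
    ((25, 120.0), "yes"),
    ((47, 30.0), "no"),
    ((31, 95.0), "yes"),
    ((52, 20.0), "no"),
]

def distance(a, b):
    """Distance function: measure how similar two cases are (Euclidean)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def combine(labels):
    """Combination function: majority vote over the neighbours' outcomes."""
    return max(set(labels), key=labels.count)

def predict(new_case, k=3):
    # Find the k most similar historical cases and combine their outcomes.
    neighbours = sorted(history, key=lambda case: distance(case[0], new_case))[:k]
    return combine([label for _, label in neighbours])

print(predict((29, 110.0)))   # expected to resemble the "yes" cases
```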

2. Market Basket Analysis

The main purpose of market basket analysis is to find out which items tend to be bought together. In business it is used to understand, from their buying behavior, what kinds of customers buy which products and why, and to find the relevant association rules, so that by mining these rules an enterprise can profit and build a competitive advantage. For example, a retailer can use this analysis to change the arrangement of products on shelves or to design product bundles that attract customers.

The basic process of market basket analysis consists of the following three steps:

(1) Selecting the right items: “right” here means right for the enterprise, which must pick the genuinely useful items out of hundreds or thousands.

(2) Mining association rules by exploring the co-occurrence matrix (a minimal sketch follows this list).

(3) Overcoming practical limitations: the more items are selected, the more resources and time the computation consumes (growing exponentially), so techniques must be applied to minimize the cost in resources and time.
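To make step (2) concrete, here is a hedged sketch that counts item co-occurrences across a few invented baskets and reports simple association rules; the thresholds are arbitrary:

```python
from itertools import combinations
from collections import Counter

baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "butter"},
]

# Build the co-occurrence counts: how often each item and each pair appears.
pair_counts = Counter()
item_counts = Counter()
for basket in baskets:
    item_counts.update(basket)
    pair_counts.update(combinations(sorted(basket), 2))

# Report rules "A => B" whose support and confidence exceed (arbitrary) thresholds.
n = len(baskets)
for (a, b), count in pair_counts.items():
    support = count / n
    confidence = count / item_counts[a]
    if support >= 0.5 and confidence >= 0.6:
        print(f"{a} => {b}  support={support:.2f}  confidence={confidence:.2f}")
```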

Market basket analysis techniques can be applied to the following problems:

(1) For credit card purchases, predicting what customers are likely to buy in the future.

(2) In the telecommunications and financial services industries, designing different combinations of services to maximize profits.

(3) In the insurance industry, detecting and preventing potentially unusual insurance combinations.

(4) For patients, serving as a basis for judging whether a combination of treatments will lead to complications.

3. Decision Trees

Decision trees are very powerful for classification and prediction. They express knowledge in the form of rules, and these rules take the form of a series of questions; by asking questions repeatedly, the desired result is eventually derived. A typical decision tree has a root at the top and a number of leaves at the bottom; it splits the records into subsets, each of which may be described by a simple rule. A decision tree may also take different forms, such as a binary tree, a ternary tree, or a hybrid decision tree.
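A brief sketch, assuming scikit-learn, that induces a decision tree and prints it as the “series of questions” described above:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Each path from the root to a leaf is one rule (a chain of questions).
print(export_text(tree, feature_names=load_iris().feature_names))
```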

4. Genetic Algorithms

Genetic algorithms imitate the process of cellular evolution, in which cells are selected, replicated, mated, and mutated to produce new and better cells. A genetic algorithm works in a similar way: a pattern must be established in advance, a sequence of operations analogous to producing new cells generates offspring, and a fitness function judges how well each offspring matches the pattern; in the end only the best fits survive, and the procedure continues until the function converges to an optimal solution. Genetic algorithms perform well on clustering problems and can generally be used to complement memory-based reasoning and neural network applications.
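A minimal genetic-algorithm sketch in plain Python illustrating selection, replication, mating, and mutation against a fitness function; the bit-string “pattern” and the parameters are arbitrary choices for illustration:

```python
import random

GENES, POP, GENERATIONS, MUTATION = 20, 30, 60, 0.02

def fitness(individual):
    return sum(individual)                      # how well the individual fits the pattern

def crossover(a, b):
    cut = random.randint(1, GENES - 1)          # mate: splice two parents together
    return a[:cut] + b[cut:]

def mutate(individual):
    return [bit ^ 1 if random.random() < MUTATION else bit for bit in individual]

population = [[random.randint(0, 1) for _ in range(GENES)] for _ in range(POP)]
for _ in range(GENERATIONS):
    # Selection: keep the fitter half of the population as parents.
    population.sort(key=fitness, reverse=True)
    parents = population[: POP // 2]
    # Replication + mating + mutation produce the next generation.
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(POP - len(parents))]
    population = parents + children

print("best fitness:", fitness(max(population, key=fitness)), "of", GENES)
```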

5. Cluster Analysis (Cluster Detection)

This technique covers a broad range of methods: genetic algorithms, neural networks, and statistical cluster analysis can all perform it. Its goal is to find previously unknown similarities in the data, and in many analyses cluster detection is used first, as a starting point for further research.
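A hedged clustering sketch using scikit-learn's K-Means (one of several tools that could be used); the two groups of points are synthetic:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two synthetic groups of points; the algorithm is not told which is which.
points = np.vstack([rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
                    rng.normal(loc=5.0, scale=0.5, size=(50, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print("cluster centres:\n", kmeans.cluster_centers_)
print("first ten labels:", kmeans.labels_[:10])
```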

6. Link Analysis

Link analysis is based on graph theory in mathematics. It builds models from the relationships between records, taking relationships as its main subject: the relationships between people, between things, or between people and things have given rise to a considerable number of applications. For example, a telecommunications provider can use link analysis to collect the times and frequencies of customers' phone calls, infer customers' usage preferences, and propose plans that benefit the company. Beyond the telecom industry, more and more marketing companies also use link analysis for research that benefits their businesses.
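A hedged sketch of link analysis using the networkx library (an assumed tool choice): invented call records become a weighted graph, and a centrality measure highlights the most connected customers:

```python
import networkx as nx

# Hypothetical call records between customers.
calls = [("alice", "bob"), ("alice", "carol"), ("bob", "carol"),
         ("carol", "dave"), ("dave", "erin"), ("alice", "bob")]

G = nx.Graph()
for caller, callee in calls:
    # Accumulate call frequency as the weight of the edge between two customers.
    if G.has_edge(caller, callee):
        G[caller][callee]["weight"] += 1
    else:
        G.add_edge(caller, callee, weight=1)

# Degree centrality: who is connected to the most other customers?
print(nx.degree_centrality(G))
```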

7. OLAP Analysis (On-Line Analytic Processing)

Strictly speaking, OLAP is not a data mining technique in its own right, but online analytic processing tools let users understand more clearly the meaning hidden in the data. Like visualization techniques, they present results as charts or graphs, which are much more user-friendly, and they likewise support the goal of turning data into information.
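A small illustration of the OLAP idea using a pandas pivot table as a stand-in for a real OLAP tool; the sales records are made up:

```python
import pandas as pd

sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South", "South"],
    "quarter": ["Q1", "Q2", "Q1", "Q2", "Q2"],
    "revenue": [100, 120, 90, 130, 40],
})

# Two dimensions (region x quarter), one measure (total revenue) -- the kind of
# cross-tab an OLAP cube lets a user browse interactively.
cube = sales.pivot_table(index="region", columns="quarter",
                         values="revenue", aggfunc="sum")
print(cube)
```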

8. Neural Networks

A neural network is an iterative learning method: a series of examples is given to the network to learn from, so that it can summarize a distinguishable pattern. When faced with a new example, the neural network applies what it has learned and derives a new result; this is a form of machine learning. Data mining problems can also be approached with neural learning, which produces quite accurate results and can be used for prediction.
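A hedged sketch, assuming scikit-learn's MLPClassifier, showing the repeated-learning-then-generalize behaviour described above on a built-in dataset:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# The network learns repeatedly from the training examples.
net = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
net.fit(X_train, y_train)

# It then generalizes to examples it has never seen.
print("accuracy on unseen examples:", net.score(X_test, y_test))
```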

9. Discriminant Analysis

Discriminant analysis is an appropriate technique when the dependent variable is qualitative (categorical) and the independent variables (predictors) are quantitative (metric); it is usually applied to classification problems. If the dependent variable consists of two groups, it is called two-group discriminant analysis; if it consists of multiple groups, it is called multiple discriminant analysis (MDA). Its purposes include the following (a brief sketch appears after this list):

(1) Finding linear combinations of predictor variables that maximize the ratio of between-group variation to within-group variation, each of which is uncorrelated with any previously obtained linear combination.

(2) Check whether the centers of gravity of the groups differ.

(3) Find out which predictor variables have the greatest differentiating power.

(4) Assign a new subject to a group based on the value of that subject’s predictor variable.
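A brief sketch of two-group discriminant analysis with scikit-learn's LinearDiscriminantAnalysis; the quantitative predictors and group labels are synthetic:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# Two synthetic groups with different centres.
group_a = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(40, 2))
group_b = rng.normal(loc=[3.0, 2.0], scale=1.0, size=(40, 2))
X = np.vstack([group_a, group_b])
y = np.array([0] * 40 + [1] * 40)

lda = LinearDiscriminantAnalysis().fit(X, y)
print("discriminant weights:", lda.coef_)        # the linear combination of predictors
print("assign a new subject:", lda.predict([[2.5, 1.5]]))
```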

10. Logistic Regression Analysis

When the groups in a discriminant analysis do not satisfy the assumption of a normal distribution, logistic regression analysis is a good alternative. Logistic regression does not predict whether an event will occur, but rather the probability (odds) that it occurs. It assumes an S-shaped relationship between the independent variable and that probability: when the independent variable is very small, the probability is close to zero; as the independent variable increases, the probability rises along the curve; once it has increased to a certain point, the slope of the curve begins to decrease, so the probability always lies between 0 and 1.
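A short sketch of the S-shaped (sigmoid) curve and a logistic regression fit, assuming scikit-learn; the data are synthetic:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    # The probability stays between 0 and 1 along an S-shaped curve.
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(np.array([-5.0, 0.0, 5.0])))     # ~[0.007, 0.5, 0.993]

# Fit on a one-dimensional predictor with a binary outcome.
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
model = LogisticRegression().fit(X, y)
print(model.predict_proba([[4.5]]))            # probability of each outcome, not a hard label
```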