Web Data Acquisition (Real-time Acquisition and Analysis of Web Content)
In today's information age, data on the Internet is growing explosively, and it contains a wealth of valuable information. Web data acquisition has become an important technique for obtaining and analyzing it. In this article, we will introduce the basic concepts, operational steps, and some commonly used tools and techniques of web data collection.
I. The Concept of Web Data Acquisition
Web data acquisition, as the name suggests, means using a program to automatically fetch the contents of web pages on the Internet and save them as structured data. The data can include text, images, videos, and other formats, and can be used for various purposes such as market research, public opinion analysis, and data mining.
II. The Operational Steps of Web Data Collection
1. Determine the collection target: first, be clear about which web page data you need to collect. The target can be all the pages of a specific website or the search-result pages for specific keywords.
2. Choose a collection tool: choose an appropriate tool for the target. Commonly used tools include Python's BeautifulSoup, the Scrapy framework, and some software specialized in web data collection.
3. Write the collection program: using the chosen tool, write the corresponding collection program. The program's main task is to simulate browser behavior, automatically visit the web pages, and extract the required data.
4. Run the collection program: run the program you wrote to start collecting web data. While it runs, you can set parameters as needed, such as crawl depth and crawl speed.
5. Process and analyze the data: collected web data is usually messy and needs to be cleaned and organized. You can use Python's data-processing libraries, such as Pandas and NumPy, to clean, deduplicate, and summarize the data, and then perform analysis and mining as needed.
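Step 5 above can be sketched with Pandas. The rows below are hypothetical scraped records (not from any real site); `str.strip` and `drop_duplicates` illustrate the cleaning and deduplication mentioned above.

```python
import pandas as pd

# Hypothetical rows scraped from a product-listing page
rows = [
    {"title": "  iPhone 15  ", "price": "5999"},
    {"title": "iPhone 15", "price": "5999"},  # duplicate once whitespace is stripped
    {"title": "Galaxy S24", "price": "4999"},
]

df = pd.DataFrame(rows)
df["title"] = df["title"].str.strip()      # clean stray whitespace
df["price"] = df["price"].astype(int)      # normalize types
df = df.drop_duplicates(subset="title")    # deduplicate

print(df.to_dict("records"))
```

After cleaning, the duplicate row collapses and the data is ready for analysis.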
III. Commonly Used Web Data Collection Tools and Techniques
1. BeautifulSoup: a commonly used web-page parsing library in Python that makes it easy to extract data from HTML or XML files. It provides a clean API that makes data extraction straightforward.
2. Scrapy framework: a powerful Python crawler framework that can be used to collect large-scale web data efficiently. Built on an asynchronous networking engine, it can handle large numbers of web requests quickly and can be extended for distributed crawling.
3. Use proxy IPs: to avoid being blocked by a website, proxy IPs can be used during collection. A proxy IP hides your real IP address, making the collection less conspicuous.
4. Abide by the website's rules: when collecting web data, follow the website's rules. Don't send excessive requests, so as not to burden the site or even get banned.
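A minimal sketch of points 1 and 3: BeautifulSoup parsing a hypothetical HTML snippet, plus the `proxies` mapping that Requests would use. The HTML and the proxy address are made-up placeholders, not a real page or server.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, standing in for a downloaded page
html = """
<ul class="news">
  <li><a href="/a/1">Headline one</a></li>
  <li><a href="/a/2">Headline two</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
links = [(a.get_text(), a["href"]) for a in soup.select("ul.news a")]
print(links)  # [('Headline one', '/a/1'), ('Headline two', '/a/2')]

# With Requests, a proxy IP is passed as a dict (address is a placeholder):
proxies = {"http": "http://127.0.0.1:8888", "https": "http://127.0.0.1:8888"}
# requests.get(url, proxies=proxies, timeout=10)  # would route through the proxy
```

The commented-out `requests.get` line shows where the proxy dict would be used; it is left unsent here.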
How to crawl tens of thousands of product records from the Jingdong mobile app: this tool can help you
Charles is a network packet-capture tool. We can use it to capture and analyze an App's traffic, obtaining the content of every network request and response that occurs while the App is running, much like what the Network panel of a browser's developer tools shows on the Web side.
Compared with Fiddler, Charles is more powerful and has better cross-platform support, so we choose Charles as the main mobile packet-capture tool for analyzing mobile App traffic and assisting with App data capture.
Objectives of this section
In this section, we take the Jingdong App as an example: we use Charles to capture the App's network packets and then inspect the specific Request and Response contents, in order to learn how to use Charles.
Preparation
Please make sure that Charles is installed correctly and its proxy service is turned on, that the phone and Charles are on the same LAN, and that the Charles proxy and the Charles CA certificate are set up on the phone.
The Principle
First, Charles runs on our own PC. When Charles starts, it opens a proxy service on port 8888 of the PC; this service is actually an HTTP/HTTPS proxy.
Make sure the phone and PC are on the same LAN. We can either use a phone emulator connected via a virtual network, or use a real phone and the PC connected over the same wireless network.
Set the phone's proxy to Charles's proxy address. Packets sent by the phone then flow through Charles, which forwards them to the real server; packets returned by the server are forwarded back to the phone by Charles. Acting as a man-in-the-middle, Charles can capture all the traffic, so every HTTP request and response can be recorded. Charles can also modify requests and responses.
Charles's interface in its initial state is shown below.
Charles continuously listens for network packets from the PC and the phone. Captured packets are displayed in the list on the left, and as time passes, more and more packets are captured and listed.
As the figure shows, the left side lists the request sites that Charles has captured. Clicking any entry shows the details of the corresponding request, including the Request, the Response, and so on.
Next, clear Charles's captures by clicking the broom button on the left, which removes all currently captured requests. Then click the second button, the recording button, and make sure it is switched on; this indicates that Charles is listening to the App's network traffic, as shown below.
Then open the Jingdong mobile App. Note that the Charles proxy and CA certificate must be configured in advance, otherwise nothing will be captured.
Open any product, such as an iPhone, and then open its review page, as shown below.
Keep pulling up to load more reviews. You can see that Charles captures all the network requests made by the Jingdong App during this process, as shown below.
An api.m.jd.com link appears in the list on the left and keeps blinking, so it is likely that the App's current request for review data has been captured by Charles. Click to expand it, then continue to scroll up to refresh the reviews. As we scroll, another network request is recorded here, and this new packet is presumably the review request.
To verify this, click one of the entries to view its details and switch to the Contents tab. Here we find some JSON data; checking the results, there is a commentData field whose content matches the reviews we see in the App, as shown below.
At this point, we can be sure that this request corresponds to the interface for fetching product reviews. In this way we have successfully captured the request and response that occur during the pull-up refresh.
Now let's analyze the details of this request and response. First, go back to the Overview tab: the top shows the request URL, followed by the response Status Code, the request Method, and so on, as shown below.
This result is similar in form to what we originally captured on the Web with the browser's developer tools.
Next, click on the Contents tab to see the details of the request and response.
The top half shows the Request information and the bottom half shows the Response information. For the Request, we switch to the Headers tab to see its headers; for the Response, we switch to the JSON Text tab to see its body, which has been formatted, as shown in the following figure.
Since this is a POST request, we also need to look at the POST form data; switch to the Form tab to view it, as shown below.
In this way we have successfully captured the request and response of the review interface in the App and can view the JSON data returned in the Response.
Other Apps can be analyzed in the same way. If we can work out the pattern of the request URL and parameters directly, we can simulate the requests in a program and crawl the data in batches.
Charles also has a powerful feature: it can modify a captured request and send the modified version. When you click the Edit button at the top, an extra entry marked with an edit icon appears in the list on the left, indicating that the request for this link is being modified by us, as shown in the following figure.
We can remove a field from the Form, for example select the partner field and click Remove; this modifies the form data carried by the original request. Then click the Execute button at the bottom to send the modified request, as shown in the following figure.
You can see that the result of this interface call appears again in the list on the left, and the content is unchanged, as shown in the following figure.
So deleting the partner field from the form has no effect, which means this field is irrelevant.
With this feature we can conveniently use Charles for debugging: by modifying parameters and interfaces and testing the responses of different requests, we can find out which parameters are required and which are not, and what patterns the parameters follow, finally arriving at the simplest form of the interface and parameters for a program to call.
That is the process of analyzing an App's requests by capturing packets with Charles. With Charles, we successfully captured the network packets flowing through the App, obtained the raw data, and also modified an original request and re-sent it for interface testing.
Once we know the details of the request and response, if we can work out the pattern of the request URL and parameters, we can simulate them with a program and crawl the data in batches!
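As a hedged sketch of "simulating with a program": the endpoint path and form fields below are made-up stand-ins for whatever Charles showed in the Form tab, not the real review interface. We only build and inspect the request here, without sending it over the network.

```python
import requests

# Hypothetical endpoint and form fields, as if copied from Charles's Form tab
url = "https://api.m.jd.com/comments"          # placeholder path, not the real interface
form = {"productId": "100000000001", "page": "1"}

req = requests.Request(
    "POST", url, data=form,
    headers={"User-Agent": "Mozilla/5.0"},
)
prepared = req.prepare()   # exactly what would go on the wire

print(prepared.method, prepared.url)
print(prepared.body)       # URL-encoded form body
# A session could then send it: requests.Session().send(prepared)
```

Preparing the request without sending it makes it easy to check that the simulated parameters match what Charles captured before crawling in batches.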
Mobile-app crawling is very interesting, and there is a great deal of data that can be crawled, though of course there is still much to learn. Later I will also write up some interesting real-world cases for you.
How to capture image information on a web page
You can simply take a screenshot: browsers have a screenshot function. On a phone, open the image and long-press it; a prompt to save the image will pop up. Tap it to save, and you can then find the image in your phone's gallery app.
How do I crawl specific data from a web page?
Web pages can be crawled using crawler techniques; here are some common approaches:
1. Use Python's Requests library to request the page, then use the BeautifulSoup library to parse it and extract the target data.
2. Use the Selenium library to simulate browser operations, locating specific elements via CSS selectors or XPath to extract the target data.
3. Use the Scrapy crawler framework: define extraction rules in the spider script to crawl pages and extract the target data automatically.
It is important to note that when crawling, you should comply with the website's robots protocol and avoid crawling too frequently, so as not to burden the site. You also need to make sure the data is used in a way that complies with regulations and ethics.
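The robots-protocol check mentioned above can be done with Python's standard library. The rules below are a made-up example of a robots.txt, parsed inline rather than fetched from a real site.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt, supplied as lines instead of being downloaded
rules = [
    "User-agent: *",
    "Disallow: /private/",
    "Allow: /",
]

rp = RobotFileParser()
rp.parse(rules)

# Check whether a given URL may be crawled before requesting it
ok_public = rp.can_fetch("my-crawler", "https://example.com/articles/1")
ok_private = rp.can_fetch("my-crawler", "https://example.com/private/x")
print(ok_public, ok_private)  # True False
```

In a real crawler you would call `rp.set_url(...)` and `rp.read()` to fetch the site's actual robots.txt first.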
How to “crawl data”?
First of all, crawlers are divided into those that crawl mobile App data and those that crawl website data. The main methods are the same; only the details differ a little.
Taking website data crawling as an example:
1. Use the Network panel of the browser's developer tools to analyze the relevant data interfaces, or view the page source and write regular expressions to match the relevant data.
2. Based on the analysis in step 1, use a scripting language to simulate the requests and extract the key data. This may involve more than one request interface, and often requires handling data signing and encryption, which means finding and analyzing the algorithms in the corresponding JS files.
Crawling a website roughly follows the two steps above, though there are many details, such as simulating the request headers, request method, and request body. Crawling mobile App data additionally involves packet-capture analysis, software decompilation, and so on; relatively speaking, App crawling is a bit more complex.
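Step 1's "write a regular expression to match the data" can look like the sketch below. The HTML is a hypothetical snippet; in practice a proper parser is usually more robust than a regex, but for simple, regular markup this works.

```python
import re

# Hypothetical page source containing the target data
html = '<span class="price">5999</span><span class="price">4999</span>'

# Capture the text inside every <span class="price"> tag
prices = re.findall(r'<span class="price">(\d+)</span>', html)
print(prices)  # ['5999', '4999']
```

`re.findall` returns every captured group, giving all matched prices at once.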
How to extract the links on a web page from a phone
Try the lmcjl online tool: it can grab all the links on a page. Enter the domain name and click to grab the links.