Parsing HTML tables to a file

Question

I need help with parsing. There is a site that lists many companies: https://www.investing.com/equities/ . If you open any of them, say Alphabet, then open the "Financials" tab and then "Balance Sheet" (https://www.investing.com/equities/google-inc-c-balance-sheet), a table appears. Every company has such a table. How can I parse these tables by changing only the link to the company?

The data must be saved to a CSV file.

This is what I have so far, but I do not understand how to get the output into the proper shape and write it to a file:

    from bs4 import BeautifulSoup
    from urllib.request import Request, urlopen
    import pandas as pd

    data = []
    site = "https://www.investing.com/equities/ebay-inc-balance-sheet"
    hdr = {'User-Agent': 'Mozilla/5.0'}
    req = Request(site, headers=hdr)
    page = urlopen(req)
    soup = BeautifulSoup(page, "lxml")

    table = soup.find("table", {"class": "genTbl reportTbl"})
    td_list = []
    line = soup.find("table", {"class": "genTbl reportTbl"}).find('tbody').find_all('tr')
    for row in line:
        cols = row.find_all('td')
        cols = [ele.text.strip() for ele in cols]
        data.append([ele for ele in cols if ele])
    print(data)
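One possible way to get the scraped rows into a file, sketched here with the standard `csv` module rather than pandas: skip the empty rows and write the rest. The function name `write_rows`, the sample rows, and the temporary output path are illustrative assumptions and not part of the question's code.

```python
import csv
import os
import tempfile


def write_rows(rows, path):
    """Write the non-empty rows to `path` as CSV; return how many were written."""
    kept = [row for row in rows if row]  # rows without any text cells are skipped
    with open(path, "w", newline="", encoding="utf-8") as fh:
        csv.writer(fh).writerows(kept)
    return len(kept)


# Hypothetical sample shaped like the `data` list built in the question
sample = [
    ["Total Current Assets", "135676", "129702", "124157", "123761"],
    [],
    ["Cash & Equivalents", "16701", "13443", "14148", "12658"],
]

# A temporary directory keeps the sketch independent of any fixed path
out_path = os.path.join(tempfile.mkdtemp(), "balance_sheet.csv")
written = write_rows(sample, out_path)
print(written, "rows written to", out_path)
```

In the question's script, `write_rows(data, "some_file.csv")` would take the place of `print(data)`.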

Accepted Answer · 2019-02-21T12:15:31

Try this:

    import requests
    import pandas as pd

    url = 'https://www.investing.com/equities/google-inc-c-balance-sheet'
    header = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
        "X-Requested-With": "XMLHttpRequest"
    }

    r = requests.get(url, headers=header)

    # period labels: columns of the first table matching "Period Ending", minus the first column
    periods = pd.read_html(r.text, match=r'Period\s+Ending.*')[0].columns[1:]
    # all tables whose text matches "Total ..."
    dfs = pd.read_html(r.text, match=r'Total\s+.*')

    # write each table to its own CSV file, using the period labels as column names
    for i, df in enumerate(dfs):
        (df.set_index(df.columns[0])
           .set_axis(periods, axis=1, inplace=False)
           .to_csv(rf'c:\temp\tab_{i:02d}.csv'))

I get an error: ValueError: Length mismatch: Expected axis has 4 elements, new values have 45 elements
No, as I wrote above, the error is ValueError: Length mismatch: Expected axis has 4 elements, new values have 45 elements
@Shared, I ran all the code from the answer "as is" under Pandas 0.24.1 and everything works. Which version of Pandas do you have?
I updated pandas and 4 tables appeared, but they have no separators.
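The `Length mismatch` error quoted in the comments is what pandas raises when the number of new column labels differs from the number of columns being relabelled. A minimal sketch of a guard that renames the columns only when the counts agree follows; the helper name `relabel_value_columns` and the small sample table are assumptions made for illustration, not code from the answer.

```python
import pandas as pd


def relabel_value_columns(df, labels):
    """Index df by its first column and rename the remaining columns
    to `labels`, but only when the number of labels matches."""
    out = df.set_index(df.columns[0])
    labels = list(labels)
    if len(labels) == out.shape[1]:
        out.columns = labels
    return out


# Hypothetical stand-in for one table returned by pd.read_html
table = pd.DataFrame({
    0: ["Total Inventory", "Prepaid Expenses"],
    1: [1107, 3723],
    2: [1212, 3775],
})

ok = relabel_value_columns(table, ["201831/12", "201830/09"])  # counts match
skipped = relabel_value_columns(table, ["only-one-label"])     # counts differ
print(list(ok.columns), list(skipped.columns))
```

When the counts differ, the table keeps its original column names instead of raising, which makes it easier to see which table did not match the expected layout.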

Answer 2 · 2019-02-21T13:14:30

Another solution:

    import requests
    import pandas as pd

    url = 'https://www.investing.com/equities/google-inc-c-balance-sheet'
    header = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
        "X-Requested-With": "XMLHttpRequest"
    }

    r = requests.get(url, headers=header)

    df = pd.read_html(r.text, match=r'Total\s+.*')[0]
    df = df.rename(columns={'Period Ending:': 'Name'})
    # keep only the rows whose name is shorter than 100 characters
    res = df.loc[df['Name'].str.len() < 100]

    # save to CSV
    res.to_csv(r'c:\path\to\result.csv', sep=',', index=False)
    # save to Excel
    res.to_excel(r'c:\path\to\result.xlsx', index=False)

Result:

    In [50]: pd.set_option('display.max_rows', 20)

    In [51]: res
    Out[51]:
                                            Name  201831/12  201830/09  201830/06  201831/03
    0                       Total Current Assets     135676     129702     124157     123761
    2            Cash and Short Term Investments     109140     106416     102254     102885
    3                                       Cash          -          -          -          -
    4                         Cash & Equivalents      16701      13443      14148      12658
    5                     Short Term Investments      92439      92973      88106      90227
    6                     Total Receivables, Net      21193      18067      17244      16814
    7          Accounts Receivables - Trade, Net      20838      17897      17043      16777
    8                            Total Inventory       1107       1212        698        636
    9                           Prepaid Expenses       3723       3775       3540       3240
    10               Other Current Assets, Total        513        232        421        186
    ..                                       ...        ...        ...        ...        ...
    43                       Common Stock, Total        0.7        0.7        0.7       0.69
    44                Additional Paid-In Capital    45048.3    43110.3    42242.3   41486.31
    45   Retained Earnings (Accumulated Deficit)     134885     128405     121282     120008
    46                   Treasury Stock - Common          -          -          -          -
    47                       ESOP Debt Guarantee          -          -          -          -
    48                    Unrealized Gain (Loss)       -688       -279       -188        -34
    49                       Other Equity, Total      -1618      -1397      -1337       -636
    50  Total Liabilities & Shareholders' Equity     232792     221538     211610     206935
    51           Total Common Shares Outstanding     695.56     695.96     695.95     694.95
    52        Total Preferred Shares Outstanding          -          -          -          -

    [48 rows x 5 columns]

Thank you for the help! But the file still has no separators: all the text ends up in one cell.
I opened the file in Excel and added a picture to the question showing what the data looks like.
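A frequent reason for a CSV opening in Excel with every row in a single cell is a regional setting under which Excel expects a semicolon, not a comma, between fields. Whether that applies here is an assumption, since the thread does not confirm it. Under that assumption, a minimal sketch is to pass `sep=';'` to `to_csv`; the two-row sample frame and the temporary path below are illustrative only.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row sample using figures from the result shown above
res = pd.DataFrame({
    "Name": ["Total Current Assets", "Total Inventory"],
    "201831/12": [135676, 1107],
    "201830/09": [129702, 1212],
})

out_path = os.path.join(tempfile.mkdtemp(), "result.csv")

# sep=';' writes semicolon-delimited fields;
# utf-8-sig prepends a BOM, which helps Excel recognise the file as UTF-8
res.to_csv(out_path, sep=";", index=False, encoding="utf-8-sig")

# Read the header line back to check the delimiter that was written
with open(out_path, encoding="utf-8-sig") as fh:
    first_line = fh.readline().strip()
print(first_line)
```

If the same regional setting also uses a decimal comma, `to_csv` accepts `decimal=','` as well. Writing with `to_excel`, as in the answer above, avoids the delimiter question entirely.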
