How to Scrape Anything from Any Website

For example, headlines or the full text of an article.

Today's project will serve as the foundation for many of our future programs. We will learn how to collect whatever data we need from websites.

We have a working project built on Markov chains. A Markov chain is a simple algorithm that analyzes which words follow which in a given text and produces new text based on the old one. It looks like what neural networks do, but in reality it just picks words one after another and combines them with no understanding of their meaning.
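
As a rough illustration of the idea (this is not the code from our earlier project), a word-level chain can be stored as a dictionary that maps each word to the words seen right after it. The names build_chain and generate, and the sample sentence, are made up for this sketch.

```python
import random
from collections import defaultdict


def build_chain(text):
    """Map each word to the list of words that follow it in the text."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return dict(chain)


def generate(chain, start, length, rng=random):
    """Walk the chain from `start`, picking a random follower at each step."""
    result = [start]
    for _ in range(length - 1):
        followers = chain.get(result[-1])
        if not followers:  # dead end: nothing ever followed this word
            break
        result.append(rng.choice(followers))
    return " ".join(result)


chain = build_chain("the cat sat on the mat and the cat ran")
print(chain["the"])               # every word that followed "the" in the sample
print(generate(chain, "the", 6))  # a random walk of at most six words
```

The more text the chain is built from, the more varied the output, which is why we want a large body of articles rather than a single page.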

For our first Markov chain projects we downloaded a book of Chekhov's short stories. The program analyzes how Chekhov's words combine and produces text in Chekhov's style (meaningless, though).

But what if we want text not in the style of Chekhov but in the style of the Code magazine? Or in the style of some media outlet with foreign-agent status? Or a generator of articles in the style of some blogger?

The solution is to write a program that visits our articles on the site and pulls out all the meaningful text. The only thing it needs is a list of links to the articles, and we already collected those when we built the fortune-telling project based on Code articles.

How it works

The program will run in Python on a local machine. The algorithm:

  1. Import the libraries we need.
  2. Prepare a variable holding the list of pages to visit.
  3. Open the first page and get its source code.
  4. Specify where to take the text from.
  5. Save the extracted text to a separate file.
  6. Repeat until we have gone through every page in the list.
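
The steps above can be sketched as one small function. To keep the sketch runnable without a network connection, the download and extraction steps are passed in as functions; collect_text, fake_fetch, fake_extract and the example.com addresses are hypothetical and exist only for this illustration.

```python
import io


def collect_text(urls, fetch, extract, out):
    """Visit every URL, pull out the text we need, and write it to `out`.

    fetch(url) returns the page source as a string;
    extract(source) returns the text we want to keep.
    Both are supplied by the caller, so they can be swapped freely.
    """
    count = 0
    for url in urls:              # step 6: repeat for every page
        source = fetch(url)       # step 3: get the page source
        text = extract(source)    # step 4: take only what we need
        out.write(text + "\n")    # step 5: save it
        count += 1
    return count


# Stand-ins for the real network and parsing code, for demonstration only.
pages = {
    "https://example.com/a": "<title>First</title>",
    "https://example.com/b": "<title>Second</title>",
}


def fake_fetch(url):
    return pages[url]


def fake_extract(source):
    return source.replace("<title>", "").replace("</title>", "")


buffer = io.StringIO()            # an in-memory "file"
collect_text(list(pages), fake_fetch, fake_extract, buffer)
print(buffer.getvalue())
```

In the real program the two stand-ins are replaced by a download over the network and by an HTML parser.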

 👉 The key to projects like this is knowing how the page content is structured and understanding exactly where, and in which tags, the data you need is located.

To keep things simple, we will start with a program that collects page titles. Once we are comfortable, we will build something more complex.

Examining the page source

Before parsing (collecting) anything from a page, we need to find out where it lives and what encoding it is in. We know that all Code articles are built from the same template, so looking at how one of them is structured is enough to understand them all.

Let's look at the source code of any of our articles. We care about two things: the page encoding and the <title> tag. We need to make sure the article's name is actually written in that tag.

First, the encoding. Near the top of the source there is a line that declares the page's charset, and on our pages it says UTF-8. Let's remember that.
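
If you would rather not read the source by eye, the same check can be roughly automated. This is a minimal sketch that assumes the page declares its encoding with a meta charset tag or an http-equiv Content-Type tag; find_declared_charset is a hypothetical helper name and the sample page is made up.

```python
import re

# Matches charset=utf-8, charset="utf-8" and charset='utf-8', ignoring case.
CHARSET_RE = re.compile(rb"""charset\s*=\s*["']?([\w-]+)""", re.IGNORECASE)


def find_declared_charset(raw_html, default=None):
    """Return the charset named near the top of the page, if there is one.

    The HTML standard asks for the declaration to appear within the
    first 1024 bytes, so only that part of the page is searched.
    """
    match = CHARSET_RE.search(raw_html[:1024])
    if match is None:
        return default
    return match.group(1).decode("ascii").lower()


sample = b'<!DOCTYPE html><html><head><meta charset="UTF-8"><title>x</title>'
print(find_declared_charset(sample))
```

The function works on raw bytes on purpose: at this point we do not yet know how to turn them into text.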

Now scroll further down the source and find the <title> tag, which is responsible for the page title. Make sure it is there and that it contains the article's name.
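
A quick way to double-check the title from code, before any third-party library is installed, is Python's built-in html.parser module. This is only a sketch of the check, not the method used later in the article; TitleGrabber, get_title and the sample markup are invented for illustration.

```python
from html.parser import HTMLParser


class TitleGrabber(HTMLParser):
    """Collect the text found between <title> and </title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        # Only the first <title> on the page is of interest.
        if tag == "title" and self.title is None:
            self.in_title = True
            self.title = ""

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def get_title(source):
    """Return the page title, or None when the page has no <title> tag."""
    parser = TitleGrabber()
    parser.feed(source)
    parser.close()
    return parser.title.strip() if parser.title is not None else None


sample = "<html><head><title> How to scrape a site </title></head><body></body></html>"
print(get_title(sample))
```

Returning None for a missing tag makes it easy to spot pages that do not follow the template.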

Libraries we will use

The project needs two libraries: urllib and BeautifulSoup.

The first one handles access to pages by their address. We only need one call from it, urlopen(...).read(): it goes to the given address and returns the full source code of the page.
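
One detail worth keeping in mind: read() hands back raw bytes, not text, so they have to be decoded with the same encoding the page was written in. The snippet below fakes a downloaded page with an in-memory byte string (no network involved) to show what happens when the encodings do and do not match. The sample string and the cp1251 comparison are illustrative choices.

```python
# Pretend these bytes came back from urlopen(...).read().
raw = "Привет, Код!".encode("utf-8")

print(type(raw).__name__)     # bytes, not str

# Decoding with the encoding the page declares gives the original text.
right = raw.decode("utf-8")

# Decoding with a mismatched single-byte encoding gives unreadable text.
# errors="replace" keeps the demo from failing on bytes cp1251 does not define.
wrong = raw.decode("cp1251", errors="replace")

print(right)
print(wrong)
print(right == wrong)
```

This is why the page encoding was worth checking before writing any code.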

The second one, BeautifulSoup, is a class from the bs4 package (published on PyPI as beautifulsoup4), which already contains everything needed to parse HTML source and work with tags. To install it, open a terminal and run:

pip3 install beautifulsoup4

Writing the code

First, import the libraries we need:

# import urlopen from the urllib.request module
from urllib.request import urlopen

# import the BeautifulSoup class from bs4
from bs4 import BeautifulSoup

Now let's declare the list of pages to visit and take titles from. We already put such a list together for the fortune-telling project based on Code articles, so we will simply take it from there and adapt it for Python:

url = [

"https://thecode.media/is-not-defined-jquery/",

"https://thecode.media/arduino-projects-2/",

"https://thecode.media/10-raspberry/",

"https://thecode.media/easy-css/",

"https://thecode.media/to-be-front/",

"https://thecode.media/cryptex/",

"https://thecode.media/ali-coders/",

"https://thecode.media/po-glandy/",

"https://thecode.media/megaexcel/",

"https://thecode.media/chat-bot-generators/",

"https://thecode.media/wifi/",

"https://thecode.media/andri-oxa/",

"https://thecode.media/free-hosting/",

"https://thecode.media/hotwheels/",

"https://thecode.media/do-not-disturb/",

"https://thecode.media/dyno-ai/",

"https://thecode.media/snake-ai/",

"https://thecode.media/leet/",

"https://thecode.media/ninja/",

"https://thecode.media/supergirl/",

"https://thecode.media/vpn/",

"https://thecode.media/what-is-wordpress/",

"https://thecode.media/hardware/",

"https://thecode.media/division/",

"https://thecode.media/nuggets/",

"https://thecode.media/binary-notation/",

"https://thecode.media/bootstrap/",

"https://thecode.media/chat-bot/",

"https://thecode.media/myadblock3000/",

"https://thecode.media/trello/",

"https://thecode.media/python-time/",

"https://thecode.media/editor/",

"https://thecode.media/timer/",

"https://thecode.media/intro-bootstrap/",

"https://thecode.media/php-form/",

"https://thecode.media/hr-quiz/",

"https://thecode.media/c-sharp/",

"https://thecode.media/showtime/",

"https://thecode.media/uchtel-rasskazhi/",

"https://thecode.media/sshhhh/",

"https://thecode.media/marry-me-python/",

"https://thecode.media/haters-gonna-code/",

"https://thecode.media/speed-css/",

"https://thecode.media/fired/",

"https://thecode.media/zabuhal/",

"https://thecode.media/est-tri-shkatulki/",

"https://thecode.media/milk-that/",

"https://thecode.media/binary-mouse/",

"https://thecode.media/bowling/",

"https://thecode.media/dealership/",

"https://thecode.media/best-seller/",

"https://thecode.media/hr/",

"https://thecode.media/no-comments/",

"https://thecode.media/drakoni-yajca/",

"https://thecode.media/who-is-who/",

"https://thecode.media/get-a-room/",

"https://thecode.media/alps/",

"https://thecode.media/handshake/",

"https://thecode.media/choose-life/",

"https://thecode.media/high-voltage/",

"https://thecode.media/spy/",

"https://thecode.media/squirrelrrel/",

"https://thecode.media/so-agile/",

"https://thecode.media/wedding/",

"https://thecode.media/supper/",

"https://thecode.media/le-tarakan/",

"https://thecode.media/batareyki-besyat/",

"https://thecode.media/dr_jekyll/",

"https://thecode.media/everybody_lies/",

"https://thecode.media/electrician/",

"https://thecode.media/einstein/",

"https://thecode.media/bugz/",

"https://thecode.media/needforspeed/",

"https://thecode.media/be-smart/",

"https://thecode.media/bot-online/",

"https://thecode.media/microb/",

"https://thecode.media/jquery/",

"https://thecode.media/split-screen/",

"https://thecode.media/calculus/",

"https://thecode.media/big-data-sales/",

"https://thecode.media/ambient/",

"https://thecode.media/fatality/",

"https://thecode.media/biggest-loser/",

"https://thecode.media/wifi/",

"https://thecode.media/nosock/",

"https://thecode.media/variables/",

"https://thecode.media/start_python/",

"https://thecode.media/i-gonna-code/",

"https://thecode.media/sigi-est/",

"https://thecode.media/nes-game/",

"https://thecode.media/live-view/",

"https://thecode.media/remote/",

"https://thecode.media/arduino-code/",

"https://thecode.media/horses/",

"https://thecode.media/runinstein/",

"https://thecode.media/wp-template/",

"https://thecode.media/tilda/",

"https://thecode.media/todo/",

"https://thecode.media/telebot/",

"https://thecode.media/summator-2/",

"https://thecode.media/get-rich-coding/",

"https://thecode.media/content-manager/",

"https://thecode.media/vzrosly-stal/",

"https://thecode.media/py-install/",

"https://thecode.media/quantum/",

"https://thecode.media/dns/",

"https://thecode.media/practicum/",

"https://thecode.media/react/",

"https://thecode.media/1september/",

"https://thecode.media/summator/",

"https://thecode.media/vds/",

"https://thecode.media/made-in-china/",

"https://thecode.media/bar/",

"https://thecode.media/zodiac/",

"https://thecode.media/crc32/",

"https://thecode.media/css-links/",

"https://thecode.media/oop_battle/",

"https://thecode.media/be-combo/",

"https://thecode.media/unity/",

"https://thecode.media/data-science/",

"https://thecode.media/junior/",

"https://thecode.media/qc/",

"https://thecode.media/be-middle/",

"https://thecode.media/senior/",

"https://thecode.media/teamlead/",

"https://thecode.media/frontend/",

"https://thecode.media/lift/",

"https://thecode.media/be-fuzzy/",

"https://thecode.media/best-2020/",

"https://thecode.media/git/",

"https://thecode.media/stt-cloud/",

"https://thecode.media/matrix-pills/",

"https://thecode.media/na-stile/",

"https://thecode.media/no-coffee/",

"https://thecode.media/framelibs/",

"https://thecode.media/children/",

"https://thecode.media/balls-possibly/",

"https://thecode.media/le-meduza/",

"https://thecode.media/electricity/",

"https://thecode.media/tailored-swift/",

"https://thecode.media/objective/",

"https://thecode.media/host/",

"https://thecode.media/go-public/",

"https://thecode.media/how-internet-works-1/",

"https://thecode.media/domain/",

"https://thecode.media/this-is-object/",

"https://thecode.media/ole-ole-ole/",

"https://thecode.media/thousand/",

"https://thecode.media/average/",

"https://thecode.media/stt-python/",

"https://thecode.media/ping-pong/",

"https://thecode.media/pygames/",

"https://thecode.media/odobreno/",

"https://thecode.media/qwerty123/",

"https://thecode.media/neurocorrector/",

"https://thecode.media/neuro-cam/",

"https://thecode.media/10-jquery/",

"https://thecode.media/repeat/",

"https://thecode.media/assembler/",

"https://thecode.media/sublime-one-love/",

"https://thecode.media/zloy/",

"https://thecode.media/mariya-ivanovna/",

"https://thecode.media/ruby/",

"https://thecode.media/electron-password/",

"https://thecode.media/plane/",

"https://thecode.media/glitch/",

"https://thecode.media/security/",

"https://thecode.media/stupid-2019/",

"https://thecode.media/jquery-search/",

"https://thecode.media/pimp-my-pass/",

"https://thecode.media/text-ultimate/",

"https://thecode.media/hurry/",

"https://thecode.media/siri/",

"https://thecode.media/zero-cool/",

"https://thecode.media/small-talk/",

"https://thecode.media/die-hard/",

"https://thecode.media/le-piton/",

"https://thecode.media/hr-code/",

"https://thecode.media/nano-code/",

"https://thecode.media/the_question/",

"https://thecode.media/godlike/",

"https://thecode.media/be-logic/",

"https://thecode.media/snake-js/",

"https://thecode.media/be-mobile/",

"https://thecode.media/baboolya/",

"https://thecode.media/timelag/",

"https://thecode.media/doors/",

"https://thecode.media/phone-code/",

"https://thecode.media/snake-arduino/",

"https://thecode.media/css-intro/",

"https://thecode.media/le-timer/",

"https://thecode.media/oop_battle/",

"https://thecode.media/good-morning/",

"https://thecode.media/study-bot/",

"https://thecode.media/python-bot/",

"https://thecode.media/robot-quiz/",

"https://thecode.media/hacking-quiz/",

"https://thecode.media/lulz-quiz/",

"https://thecode.media/hard-quiz/",

"https://thecode.media/torrent/",

"https://thecode.media/travel/",

"https://thecode.media/le-snob/",

"https://thecode.media/no-spagetti/",

"https://thecode.media/house/",

"https://thecode.media/cryptorush/",

"https://thecode.media/coronarelax/",

"https://thecode.media/pure/",

"https://thecode.media/c-cpp/",

"https://thecode.media/machine-loving/",

"https://thecode.media/orwell/",

"https://thecode.media/darknet/",

"https://thecode.media/ai/",

"https://thecode.media/oop-class/",

"https://thecode.media/cookie/",

"https://thecode.media/malware/",

"https://thecode.media/ftp/",

"https://thecode.media/html/",

"https://thecode.media/java/",

"https://thecode.media/php-haters/",

"https://thecode.media/tor/",

"https://thecode.media/crack-safe/",

"https://thecode.media/epidemic/",

"https://thecode.media/hash-brown/",

"https://thecode.media/java-js/",

"https://thecode.media/js-types/",

"https://thecode.media/losers/",

"https://thecode.media/ssl/",

"https://thecode.media/uncaughtsyntaxerror-unexpected-identifier/",

"https://thecode.media/uncaughtsyntaxerror-unexpected-token/",

"https://thecode.media/uncaughttyperrror-cannot-read-property/",

"https://thecode.media/mobile-dev/",

"https://thecode.media/verevka/",

"https://thecode.media/speed/",

"https://thecode.media/buckwheat/",

"https://thecode.media/distance/",

"https://thecode.media/node-js/",

"https://thecode.media/pascal/",

"https://thecode.media/ill-be-clean/",

"https://thecode.media/to-be-back/",

"https://thecode.media/replaceable/",

"https://thecode.media/code-review/",

"https://thecode.media/gasoline/",

"https://thecode.media/to-be-test/",

"https://thecode.media/scala/",

"https://thecode.media/row-power/",

"https://thecode.media/percent/",

"https://thecode.media/things/",

"https://thecode.media/prof-newsletter/",

"https://thecode.media/backend/",

"https://thecode.media/immortal-pong/",

"https://thecode.media/blind/",

"https://thecode.media/go-faster/",

"https://thecode.media/cpp/",

"https://thecode.media/uncaught-syntaxerror-unexpected-end-of-input/",

"https://thecode.media/stress-quiz/",

"https://thecode.media/secret-pong/",

"https://thecode.media/override/",

"https://thecode.media/whg/",

"https://thecode.media/profit/",

"https://thecode.media/memas/",

"https://thecode.media/digital-sound/",

"https://thecode.media/api/",

"https://thecode.media/be-math-2/",

"https://thecode.media/backup/",

"https://thecode.media/backup-master/",

"https://thecode.media/glvrd/",

"https://thecode.media/id/",

"https://thecode.media/uncaught-syntaxerror-missing-after-argument-list/",

"https://thecode.media/ex-startup/",

"https://thecode.media/doom-everywhere/",

"https://thecode.media/template-one/",

"https://thecode.media/david-roganov/",

"https://thecode.media/spacex/",

"https://thecode.media/webstorm/",

"https://thecode.media/json/",

"https://thecode.media/treger/",

"https://thecode.media/ya-blitz/",

"https://thecode.media/radius/",

"https://thecode.media/xhr/",

"https://thecode.media/treger2/",

"https://thecode.media/raidemption/",

"https://thecode.media/chief-technical-officer/",

"https://thecode.media/summary/",

"https://thecode.media/ex-wallpaper/",

"https://thecode.media/soap/",

"https://thecode.media/decompose/",

"https://thecode.media/desc/",

"https://thecode.media/sprint/",

"https://thecode.media/bye-or-die/",

"https://thecode.media/who-win/",

"https://thecode.media/vladimir-olokhtonov/",

"https://thecode.media/lossless/",

"https://thecode.media/parse/",

"https://thecode.media/typeerror-is-not-an-abject/",

"https://thecode.media/backup-me/",

"https://thecode.media/stress-test/",

"https://thecode.media/syntaxerror-missing-formal-parameter/",

"https://thecode.media/start-fast/",

"https://thecode.media/halkechev/",

"https://thecode.media/halkechev2/",

"https://thecode.media/le-design/",

"https://thecode.media/syntaxerror-missing-after-property-id/",

"https://thecode.media/attrb-mthd/",

"https://thecode.media/headphones/",

"https://thecode.media/active-noise-cancelling/",

"https://thecode.media/remote-work-quiz/",

"https://thecode.media/garbage/",

"https://thecode.media/ubuntu-linux/",

"https://thecode.media/trie/",

"https://thecode.media/func/",

"https://thecode.media/laravel/",

"https://thecode.media/save-json/",

"https://thecode.media/syntaxerror-missing-after-formal-parameters/",

"https://thecode.media/recursion/",

"https://thecode.media/haskell/",

"https://thecode.media/gen/",

"https://thecode.media/db/",

"https://thecode.media/boosting/",

"https://thecode.media/pavel-sviridov/",

"https://thecode.media/mnogo/",

"https://thecode.media/sokr/",

"https://thecode.media/dbsm/",

"https://thecode.media/pik-balmera/",

"https://thecode.media/kanban/",

"https://thecode.media/check-list/",

"https://thecode.media/text-quiz/",

"https://thecode.media/mysql/",

"https://thecode.media/mysql/",

"https://thecode.media/rust/",

"https://thecode.media/manage-this/",

"https://thecode.media/altshuller/",

"https://thecode.media/interview/",

"https://thecode.media/fotorama/",

"https://thecode.media/tetris/",

"https://thecode.media/ai-tetris/",

"https://thecode.media/scrum/",

"https://thecode.media/speed-two/",

"https://thecode.media/quick-share/",

"https://thecode.media/stack/",

"https://thecode.media/mobile-first/",

"https://thecode.media/nosql/",

"https://thecode.media/narazves/",

"https://thecode.media/oop-class-2/",

"https://thecode.media/design-first/",

"https://thecode.media/arcanoid/",

"https://thecode.media/donut/",

"https://thecode.media/casino/",

"https://thecode.media/heap/",

"https://thecode.media/rust/",

"https://thecode.media/float/",

"https://thecode.media/markdown/",

"https://thecode.media/books/",

"https://thecode.media/daniil-popov/",

"https://thecode.media/android-developer/",

"https://thecode.media/symbols/",

"https://thecode.media/oauth/",

"https://thecode.media/kotlin/",

"https://thecode.media/todo/",

"https://thecode.media/plotly/",

"https://thecode.media/no-digit-code/",

"https://thecode.media/asymmetric/",

"https://thecode.media/qi/",

"https://thecode.media/vernam/",

"https://thecode.media/vernam-js/",

"https://thecode.media/shtykov/",

"https://thecode.media/memory/",

"https://thecode.media/ark/",

"https://thecode.media/7-oshibok-na-sobesedovanii/",

"https://thecode.media/dh/",

"https://thecode.media/typescript/",

"https://thecode.media/stark/",

"https://thecode.media/crypto/",

"https://thecode.media/zapusk-2/",

"https://thecode.media/fingerprint/",

"https://thecode.media/puzzle/",

"https://thecode.media/python-time-2/",

"https://thecode.media/no-chance/",

"https://thecode.media/lossy/",

"https://thecode.media/1wire/",

"https://thecode.media/pasha-flipper/",

"https://thecode.media/perl/",

"https://thecode.media/alexey-vasilev/",

"https://thecode.media/viasat/",

"https://thecode.media/podcast/",

"https://thecode.media/copy-ya-ru/",

"https://thecode.media/mircrosd/",

"https://thecode.media/bash/",

"https://thecode.media/rotation/",

"https://thecode.media/css-grid/",

"https://thecode.media/train/",

"https://thecode.media/grid-2/",

"https://thecode.media/za-proezd/",

"https://thecode.media/grid-3/",

"https://thecode.media/david/",

"https://thecode.media/alien-vs-predator/",

"https://thecode.media/grid-portfolio/",

"https://thecode.media/podcast-lavka/",

"https://thecode.media/it-start-2/",

"https://thecode.media/anastasiya-nikulina/",

"https://thecode.media/linter/",

"https://thecode.media/bomberman/",

"https://thecode.media/5-linters/",

"https://thecode.media/lineynaya-algebra-vektory/",

"https://thecode.media/no-nda/",

"https://thecode.media/code-swap/",

"https://thecode.media/how-to-start/",

"https://thecode.media/referenceerror-invalid-left-hand-side-in-assignment/",

"https://thecode.media/vectors-operations/",

"https://thecode.media/oop-abstract/",

"https://thecode.media/leonov/",

"https://thecode.media/lucky-strike/",

"https://thecode.media/browser/",

"https://thecode.media/2020/",

"https://thecode.media/3d-stars/",

"https://thecode.media/anna-leonova/",

"https://thecode.media/cold-fusion/",

"https://thecode.media/normalize/",

"https://thecode.media/hotkey/",

"https://thecode.media/oven/",

"https://thecode.media/vim/",

"https://thecode.media/draw/",

"https://thecode.media/visual-studio-code/",

"https://thecode.media/tetris-2/",

"https://thecode.media/start-now/",

"https://thecode.media/lapsha-1/",

"https://thecode.media/cubism/",

"https://thecode.media/lapsha-2/",

"https://thecode.media/zerocode/",

"https://thecode.media/cat/",

"https://thecode.media/static/",

"https://thecode.media/komm/",

"https://thecode.media/path-js/",

"https://thecode.media/cloudly/",

"https://thecode.media/haters-gonna-code-2/",

"https://thecode.media/csp/",

"https://thecode.media/mitin-says-no/",

"https://thecode.media/hire-js/",

"https://thecode.media/tableau/",

"https://thecode.media/impossible/",

"https://thecode.media/csp-on/",

"https://thecode.media/maze/",

"https://thecode.media/mix/",

"https://thecode.media/le-beton/",

"https://thecode.media/3d-print/",

"https://thecode.media/5-and-a-half/",

"https://thecode.media/lineynaya-zavisimost-vektorov/",

"https://thecode.media/fast-m1/",

"https://thecode.media/ninja-run/",

"https://thecode.media/matrix-101/",

"https://thecode.media/arm-x86/",

"https://thecode.media/piano-js/",

"https://thecode.media/10-swift/",

"https://thecode.media/travel-plane/",

"https://thecode.media/extention/",

"https://thecode.media/obratnaya-matritsa/",

"https://thecode.media/svg/",

"https://thecode.media/freelance/",

"https://thecode.media/brat-2/",

"https://thecode.media/angular/",

"https://thecode.media/rgb/",

"https://thecode.media/10-go/",

"https://thecode.media/coffee/",

]

Now we loop over every element of this list, putting the libraries to work. Note the line where we get the page source: we immediately decode it using the encoding we identified in the previous step:

# open the text file that the titles will be appended to
# (UTF-8 is set explicitly so the titles are saved the same way on any system)
file = open("zag.txt", "a", encoding="utf-8")

# go through every address in the list
for x in url:
    # get the source code of the next page and decode it as UTF-8
    html_code = str(urlopen(x).read(), 'utf-8')
    # hand the page source over to the library for parsing
    soup = BeautifulSoup(html_code, "html.parser")

    # find the page title with the find() method
    s = soup.find('title').text

    # print it to the screen
    print(s)

    # save the title to the file and move to a new line
    file.write(s + '\n')

# close the file
file.close()
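
A list this long, assembled by hand, can pick up small defects such as stray spaces inside the quotes or repeated addresses (wifi and mysql, for instance, appear twice above). A possible pre-processing step is sketched below, under the assumption that the order should be kept and exact duplicates dropped; clean_urls is a hypothetical helper, not part of urllib or bs4.

```python
def clean_urls(urls):
    """Strip stray whitespace and drop repeated addresses, keeping the order."""
    seen = set()
    cleaned = []
    for raw in urls:
        address = raw.strip()
        if not address or address in seen:
            continue
        seen.add(address)
        cleaned.append(address)
    return cleaned


sample = [
    "https://thecode.media/wifi/",
    "https://thecode.media/vpn/ ",   # stray trailing space
    "https://thecode.media/wifi/",   # repeated entry
]
print(clean_urls(sample))
```

Writing `for x in clean_urls(url):` instead of `for x in url:` would apply it to the loop. Wrapping the urlopen call in try/except for urllib.error.URLError is another common safeguard, so that one unreachable page does not stop the whole run.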

Run it and look at the result: the program prints each page title and appends it to zag.txt.

# import urlopen from the urllib.request module
from urllib.request import urlopen

# import the BeautifulSoup class from bs4
from bs4 import BeautifulSoup

url = [
"https://thecode.media/is-not-defined-jquery/",
"https://thecode.media/arduino-projects-2/",
"https://thecode.media/10-raspberry/",
"https://thecode.media/easy-css/",
"https://thecode.media/to-be-front/",
"https://thecode.media/cryptex/",
"https://thecode.media/ali-coders/",
"https://thecode.media/po-glandy/",
"https://thecode.media/megaexcel/",
"https://thecode.media/chat-bot-generators/",
"https://thecode.media/wifi/",
"https://thecode.media/andri-oxa/",
"https://thecode.media/free-hosting/",
"https://thecode.media/hotwheels/",
"https://thecode.media/do-not-disturb/",
"https://thecode.media/dyno-ai/",
"https://thecode.media/snake-ai/",
"https://thecode.media/leet/",
"https://thecode.media/ninja/",
"https://thecode.media/supergirl/",
"https://thecode.media/vpn/",
"https://thecode.media/what-is-wordpress/",
"https://thecode.media/hardware/",
"https://thecode.media/division/",
"https://thecode.media/nuggets/",
"https://thecode.media/binary-notation/",
"https://thecode.media/bootstrap/",
"https://thecode.media/chat-bot/",
"https://thecode.media/myadblock3000/",
"https://thecode.media/trello/",
"https://thecode.media/python-time/",
"https://thecode.media/editor/",
"https://thecode.media/timer/",
"https://thecode.media/intro-bootstrap/",
"https://thecode.media/php-form/",
"https://thecode.media/hr-quiz/",
"https://thecode.media/c-sharp/",
"https://thecode.media/showtime/",
"https://thecode.media/uchtel-rasskazhi/",
"https://thecode.media/sshhhh/",
"https://thecode.media/marry-me-python/",
"https://thecode.media/haters-gonna-code/",
"https://thecode.media/speed-css/",
"https://thecode.media/fired/",
"https://thecode.media/zabuhal/",
"https://thecode.media/est-tri-shkatulki/",
"https://thecode.media/milk-that/",
"https://thecode.media/binary-mouse/",
"https://thecode.media/bowling/",
"https://thecode.media/dealership/",
"https://thecode.media/best-seller/",
"https://thecode.media/hr/",
"https://thecode.media/no-comments/",
"https://thecode.media/drakoni-yajca/",
"https://thecode.media/who-is-who/",
"https://thecode.media/get-a-room/",
"https://thecode.media/alps/",
"https://thecode.media/handshake/",
"https://thecode.media/choose-life/",
"https://thecode.media/high-voltage/",
"https://thecode.media/spy/",
"https://thecode.media/squirrelrrel/",
"https://thecode.media/so-agile/",
"https://thecode.media/wedding/",
"https://thecode.media/supper/",
"https://thecode.media/le-tarakan/",
"https://thecode.media/batareyki-besyat/",
"https://thecode.media/dr_jekyll/",
"https://thecode.media/everybody_lies/",
"https://thecode.media/electrician/",
"https://thecode.media/einstein/",
"https://thecode.media/bugz/",
"https://thecode.media/needforspeed/",
"https://thecode.media/be-smart/",
"https://thecode.media/bot-online/",
"https://thecode.media/microb/",
"https://thecode.media/jquery/",
"https://thecode.media/split-screen/",
"https://thecode.media/calculus/",
"https://thecode.media/big-data-sales/",
"https://thecode.media/ambient/",
"https://thecode.media/fatality/",
"https://thecode.media/biggest-loser/",
"https://thecode.media/wifi/",
"https://thecode.media/nosock/",
"https://thecode.media/variables/",
"https://thecode.media/start_python/",
"https://thecode.media/i-gonna-code/",
"https://thecode.media/sigi-est/",
"https://thecode.media/nes-game/",
"https://thecode.media/live-view/",
"https://thecode.media/remote/",
"https://thecode.media/arduino-code/",
"https://thecode.media/horses/",
"https://thecode.media/runinstein/",
"https://thecode.media/wp-template/",
"https://thecode.media/tilda/",
"https://thecode.media/todo/",
"https://thecode.media/telebot/",
"https://thecode.media/summator-2/",
"https://thecode.media/get-rich-coding/",
"https://thecode.media/content-manager/",
"https://thecode.media/vzrosly-stal/",
"https://thecode.media/py-install/",
"https://thecode.media/quantum/",
"https://thecode.media/dns/",
"https://thecode.media/practicum/",
"https://thecode.media/react/",
"https://thecode.media/1september/",
"https://thecode.media/summator/",
"https://thecode.media/vds/",
"https://thecode.media/made-in-china/",
"https://thecode.media/bar/",
"https://thecode.media/zodiac/",
"https://thecode.media/crc32/",
"https://thecode.media/css-links/",
"https://thecode.media/oop_battle/",
"https://thecode.media/be-combo/",
"https://thecode.media/unity/",
"https://thecode.media/data-science/",
"https://thecode.media/junior/",
"https://thecode.media/qc/",
"https://thecode.media/be-middle/",
"https://thecode.media/senior/",
"https://thecode.media/teamlead/",
"https://thecode.media/frontend/",
"https://thecode.media/lift/",
"https://thecode.media/be-fuzzy/",
"https://thecode.media/best-2020/",
"https://thecode.media/git/",
"https://thecode.media/stt-cloud/",
"https://thecode.media/matrix-pills/",
"https://thecode.media/na-stile/",
"https://thecode.media/no-coffee/",
"https://thecode.media/framelibs/",
"https://thecode.media/children/",
"https://thecode.media/balls-possibly/",
"https://thecode.media/le-meduza/",
"https://thecode.media/electricity/",
"https://thecode.media/tailored-swift/",
"https://thecode.media/objective/",
"https://thecode.media/host/",
"https://thecode.media/go-public/",
"https://thecode.media/how-internet-works-1/",
"https://thecode.media/domain/",
"https://thecode.media/this-is-object/",
"https://thecode.media/ole-ole-ole/",
"https://thecode.media/thousand/",
"https://thecode.media/average/",
"https://thecode.media/stt-python/",
"https://thecode.media/ping-pong/",
"https://thecode.media/pygames/",
"https://thecode.media/odobreno/",
"https://thecode.media/qwerty123/",
"https://thecode.media/neurocorrector/",
"https://thecode.media/neuro-cam/",
"https://thecode.media/10-jquery/",
"https://thecode.media/repeat/",
"https://thecode.media/assembler/",
"https://thecode.media/sublime-one-love/",
"https://thecode.media/zloy/",
"https://thecode.media/mariya-ivanovna/",
"https://thecode.media/ruby/",
"https://thecode.media/electron-password/",
"https://thecode.media/plane/",
"https://thecode.media/glitch/",
"https://thecode.media/security/",
"https://thecode.media/stupid-2019/",
"https://thecode.media/jquery-search/",
"https://thecode.media/pimp-my-pass/",
"https://thecode.media/text-ultimate/",
"https://thecode.media/hurry/",
"https://thecode.media/siri/",
"https://thecode.media/zero-cool/",
"https://thecode.media/small-talk/",
"https://thecode.media/die-hard/",
"https://thecode.media/le-piton/",
"https://thecode.media/hr-code/",
"https://thecode.media/nano-code/",
"https://thecode.media/the_question/",
"https://thecode.media/godlike/",
"https://thecode.media/be-logic/",
"https://thecode.media/snake-js/",
"https://thecode.media/be-mobile/",
"https://thecode.media/baboolya/",
"https://thecode.media/timelag/",
"https://thecode.media/doors/",
"https://thecode.media/phone-code/",
"https://thecode.media/snake-arduino/",
"https://thecode.media/css-intro/",
"https://thecode.media/le-timer/",
"https://thecode.media/oop_battle/",
"https://thecode.media/good-morning/",
"https://thecode.media/study-bot/",
"https://thecode.media/python-bot/",
"https://thecode.media/robot-quiz/",
"https://thecode.media/hacking-quiz/",
"https://thecode.media/lulz-quiz/",
"https://thecode.media/hard-quiz/",
"https://thecode.media/torrent/",
"https://thecode.media/travel/",
"https://thecode.media/le-snob/",
"https://thecode.media/no-spagetti/",
"https://thecode.media/house/",
"https://thecode.media/cryptorush/",
"https://thecode.media/coronarelax/",
"https://thecode.media/pure/",
"https://thecode.media/c-cpp/",
"https://thecode.media/machine-loving/",
"https://thecode.media/orwell/",
"https://thecode.media/darknet/",
"https://thecode.media/ai/",
"https://thecode.media/oop-class/",
"https://thecode.media/cookie/",
"https://thecode.media/malware/",
"https://thecode.media/ftp/",
"https://thecode.media/html/",
"https://thecode.media/java/",
"https://thecode.media/php-haters/",
"https://thecode.media/tor/",
"https://thecode.media/crack-safe/",
"https://thecode.media/epidemic/",
"https://thecode.media/hash-brown/",
"https://thecode.media/java-js/",
"https://thecode.media/js-types/",
"https://thecode.media/losers/",
"https://thecode.media/ssl/",
"https://thecode.media/uncaughtsyntaxerror-unexpected-identifier/",
"https://thecode.media/uncaughtsyntaxerror-unexpected-token/",
"https://thecode.media/uncaughttyperrror-cannot-read-property/",
"https://thecode.media/mobile-dev/",
"https://thecode.media/verevka/",
"https://thecode.media/speed/",
"https://thecode.media/buckwheat/",
"https://thecode.media/distance/",
"https://thecode.media/node-js/",
"https://thecode.media/pascal/",
"https://thecode.media/ill-be-clean/",
"https://thecode.media/to-be-back/",
"https://thecode.media/replaceable/",
"https://thecode.media/code-review/",
"https://thecode.media/gasoline/",
"https://thecode.media/to-be-test/",
"https://thecode.media/scala/",
"https://thecode.media/row-power/",
"https://thecode.media/percent/",
"https://thecode.media/things/",
"https://thecode.media/prof-newsletter/",
"https://thecode.media/backend/",
"https://thecode.media/immortal-pong/",
"https://thecode.media/blind/",
"https://thecode.media/go-faster/",
"https://thecode.media/cpp/",
"https://thecode.media/uncaught-syntaxerror-unexpected-end-of-input/",
"https://thecode.media/stress-quiz/",
"https://thecode.media/secret-pong/",
"https://thecode.media/override/",
"https://thecode.media/whg/",
"https://thecode.media/profit/",
"https://thecode.media/memas/",
"https://thecode.media/digital-sound/",
"https://thecode.media/api/",
"https://thecode.media/be-math-2/",
"https://thecode.media/backup/",
"https://thecode.media/backup-master/",
"https://thecode.media/glvrd/",
"https://thecode.media/id/",
"https://thecode.media/uncaught-syntaxerror-missing-after-argument-list/",
"https://thecode.media/ex-startup/",
"https://thecode.media/doom-everywhere/",
"https://thecode.media/template-one/",
"https://thecode.media/david-roganov/",
"https://thecode.media/spacex/",
"https://thecode.media/webstorm/",
"https://thecode.media/json/",
"https://thecode.media/treger/",
"https://thecode.media/ya-blitz/",
"https://thecode.media/radius/",
"https://thecode.media/xhr/",
"https://thecode.media/treger2/",
"https://thecode.media/raidemption/",
"https://thecode.media/chief-technical-officer/",
"https://thecode.media/summary/",
"https://thecode.media/ex-wallpaper/",
"https://thecode.media/soap/",
"https://thecode.media/decompose/",
"https://thecode.media/desc/",
"https://thecode.media/sprint/",
"https://thecode.media/bye-or-die/",
"https://thecode.media/who-win/",
"https://thecode.media/vladimir-olokhtonov/",
"https://thecode.media/lossless/",
"https://thecode.media/parse/",
"https://thecode.media/typeerror-is-not-an-abject/",
"https://thecode.media/backup-me/",
"https://thecode.media/stress-test/",
"https://thecode.media/syntaxerror-missing-formal-parameter/",
"https://thecode.media/start-fast/",
"https://thecode.media/halkechev/",
"https://thecode.media/halkechev2/",
"https://thecode.media/le-design/",
"https://thecode.media/syntaxerror-missing-after-property-id/",
"https://thecode.media/attrb-mthd/",
"https://thecode.media/headphones/",
"https://thecode.media/active-noise-cancelling/",
"https://thecode.media/remote-work-quiz/",
"https://thecode.media/garbage/",
"https://thecode.media/ubuntu-linux/",
"https://thecode.media/trie/",
"https://thecode.media/func/",
"https://thecode.media/laravel/",
"https://thecode.media/save-json/",
"https://thecode.media/syntaxerror-missing-after-formal-parameters/",
"https://thecode.media/recursion/",
"https://thecode.media/haskell/",
"https://thecode.media/gen/",
"https://thecode.media/db/",
"https://thecode.media/boosting/",
"https://thecode.media/pavel-sviridov/",
"https://thecode.media/mnogo/",
"https://thecode.media/sokr/",
"https://thecode.media/dbsm/",
"https://thecode.media/pik-balmera/",
"https://thecode.media/kanban/",
"https://thecode.media/check-list/",
"https://thecode.media/text-quiz/",
"https://thecode.media/mysql/",
"https://thecode.media/mysql/",
"https://thecode.media/rust/",
"https://thecode.media/manage-this/",
"https://thecode.media/altshuller/",
"https://thecode.media/interview/",
"https://thecode.media/fotorama/",
"https://thecode.media/tetris/",
"https://thecode.media/ai-tetris/",
"https://thecode.media/scrum/",
"https://thecode.media/speed-two/",
"https://thecode.media/quick-share/",
"https://thecode.media/stack/",
"https://thecode.media/mobile-first/",
"https://thecode.media/nosql/",
"https://thecode.media/narazves/",
"https://thecode.media/oop-class-2/",
"https://thecode.media/design-first/",
"https://thecode.media/arcanoid/",
"https://thecode.media/donut/",
"https://thecode.media/casino/",
"https://thecode.media/heap/",
"https://thecode.media/rust/",
"https://thecode.media/float/",
"https://thecode.media/markdown/",
"https://thecode.media/books/",
"https://thecode.media/daniil-popov/",
"https://thecode.media/android-developer/",
"https://thecode.media/symbols/",
"https://thecode.media/oauth/",
"https://thecode.media/kotlin/",
"https://thecode.media/todo/",
"https://thecode.media/plotly/",
"https://thecode.media/no-digit-code/",
"https://thecode.media/asymmetric/",
"https://thecode.media/qi/",
"https://thecode.media/vernam/",
"https://thecode.media/vernam-js/",
"https://thecode.media/shtykov/",
"https://thecode.media/memory/",
"https://thecode.media/ark/",
"https://thecode.media/7-oshibok-na-sobesedovanii/",
"https://thecode.media/dh/",
"https://thecode.media/typescript/",
"https://thecode.media/stark/",
"https://thecode.media/crypto/",
"https://thecode.media/zapusk-2/",
"https://thecode.media/fingerprint/",
"https://thecode.media/puzzle/",
"https://thecode.media/python-time-2/",
"https://thecode.media/no-chance/",
"https://thecode.media/lossy/",
"https://thecode.media/1wire/",
"https://thecode.media/pasha-flipper/",
"https://thecode.media/perl/",
"https://thecode.media/alexey-vasilev/",
"https://thecode.media/viasat/",
"https://thecode.media/podcast/",
"https://thecode.media/copy-ya-ru/",
"https://thecode.media/mircrosd/",
"https://thecode.media/bash/",
"https://thecode.media/rotation/",
"https://thecode.media/css-grid/",
"https://thecode.media/train/",
"https://thecode.media/grid-2/",
"https://thecode.media/za-proezd/",
"https://thecode.media/grid-3/",
"https://thecode.media/david/",
"https://thecode.media/alien-vs-predator/",
"https://thecode.media/grid-portfolio/",
"https://thecode.media/podcast-lavka/",
"https://thecode.media/it-start-2/",
"https://thecode.media/anastasiya-nikulina/",
"https://thecode.media/linter/",
"https://thecode.media/bomberman/",
"https://thecode.media/5-linters/",
"https://thecode.media/lineynaya-algebra-vektory/",
"https://thecode.media/no-nda/",
"https://thecode.media/code-swap/",
"https://thecode.media/how-to-start/",
"https://thecode.media/referenceerror-invalid-left-hand-side-in-assignment/",
"https://thecode.media/vectors-operations/",
"https://thecode.media/oop-abstract/",
"https://thecode.media/leonov/",
"https://thecode.media/lucky-strike/",
"https://thecode.media/browser/",
"https://thecode.media/2020/",
"https://thecode.media/3d-stars/",
"https://thecode.media/anna-leonova/",
"https://thecode.media/cold-fusion/",
"https://thecode.media/normalize/",
"https://thecode.media/hotkey/",
"https://thecode.media/oven/",
"https://thecode.media/vim/",
"https://thecode.media/draw/",
"https://thecode.media/visual-studio-code/",
"https://thecode.media/tetris-2/",
"https://thecode.media/start-now/",
"https://thecode.media/lapsha-1/",
"https://thecode.media/cubism/",
"https://thecode.media/lapsha-2/",
"https://thecode.media/zerocode/",
"https://thecode.media/cat/",
"https://thecode.media/static/",
"https://thecode.media/komm/",
"https://thecode.media/path-js/",
"https://thecode.media/cloudly/",
"https://thecode.media/haters-gonna-code-2/",
"https://thecode.media/csp/",
"https://thecode.media/mitin-says-no/",
"https://thecode.media/hire-js/",
"https://thecode.media/tableau/",
"https://thecode.media/impossible/",
"https://thecode.media/csp-on/",
"https://thecode.media/maze/",
"https://thecode.media/mix/",
"https://thecode.media/le-beton/",
"https://thecode.media/3d-print/",
"https://thecode.media/5-and-a-half/",
"https://thecode.media/lineynaya-zavisimost-vektorov/",
"https://thecode.media/fast-m1/",
"https://thecode.media/ninja-run/",
"https://thecode.media/matrix-101/",
"https://thecode.media/arm-x86/",
"https://thecode.media/piano-js/",
"https://thecode.media/10-swift/",
"https://thecode.media/travel-plane/",
"https://thecode.media/extention/",
"https://thecode.media/obratnaya-matritsa/",
"https://thecode.media/svg/",
"https://thecode.media/freelance/",
"https://thecode.media/brat-2/",
"https://thecode.media/angular/",
"https://thecode.media/rgb/",
"https://thecode.media/10-go/",
"https://thecode.media/coffee/",
]

# open the text file where we will append the titles
# (UTF-8 is set explicitly so Cyrillic titles are saved the same way on any system)
file = open("zag.txt", "a", encoding="utf-8")

# go through every address in the list
for x in url:
    # get the source code of the next page in the list
    html_code = str(urlopen(x).read(), 'utf-8')
    # hand the page source to the BeautifulSoup library for parsing
    soup = BeautifulSoup(html_code, "html.parser")

    # find the page title with the find() method
    s = soup.find('title').text

    # print it to the screen
    print(s)

    # save the title to the file, followed by a period and a space as a separator
    file.write(s + '. ')

# close the file
file.close()
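
The loop above stops with an error as soon as a single page fails to load, and the list contains repeats (for example, https://thecode.media/mysql/ and https://thecode.media/rust/ each appear twice), so those titles are saved twice. Here is a minimal sketch of two helpers that could soften both problems. The names `unique_in_order` and `fetch_html` and the 10-second timeout are assumptions made for illustration; they are not part of the original program.

```python
from urllib.request import urlopen


def unique_in_order(urls):
    """Drop repeated addresses while keeping the original order."""
    # dict keys are unique and remember insertion order
    return list(dict.fromkeys(urls))


def fetch_html(address, timeout=10):
    """Return the page source as text, or None if the page cannot be loaded."""
    try:
        with urlopen(address, timeout=timeout) as response:
            return response.read().decode("utf-8")
    except (OSError, ValueError) as error:
        # OSError covers network and HTTP errors (URLError, timeouts);
        # ValueError covers malformed addresses and text that is not valid UTF-8
        print(f"Skipping {address}: {error}")
        return None
```

In the main loop this would mean iterating over `unique_in_order(url)`, calling `fetch_html(x)` instead of `urlopen(x).read()`, and moving on to the next address with `continue` when the result is `None`.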

What's next

The logical next step is a Markov chain program that generates headlines for Kod articles based on our old headlines.
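
To give a rough idea of that next step, here is a minimal sketch of a word-level Markov chain. It is not the code of the follow-up project: the function names `build_chain` and `generate` are made up for illustration, and the sample string only stands in for the contents of zag.txt.

```python
import random
from collections import defaultdict


def build_chain(text):
    """Map each word to the list of words that follow it in the text."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain


def generate(chain, start, max_words=10):
    """Walk the chain from `start`, picking a random follower at each step."""
    result = [start]
    # stop at the length limit or when the last word has no known followers
    while len(result) < max_words and result[-1] in chain:
        result.append(random.choice(chain[result[-1]]))
    return " ".join(result)


# stand-in for the saved titles; in practice the text would be read from zag.txt
sample = "How to build a site. How to build a game. How to test a site."
chain = build_chain(sample)
print(generate(chain, "How"))
```

The output changes from run to run because each next word is chosen at random from the words that followed the current one in the source text.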

Text: Mikhail Polyanin

Editing: Maxim Ilyakhov

Illustration: Danya Berkovsky

Proofreading: Irina Mikheeva

Layout: Maria Dronova

Social media: Oleg Veshkurtsev
