We talk with Oleg Tarasenko and Tze Yiing about crawling the web using Elixir. Oleg created the Crawly project to help solve this problem, and Tze Yiing joined him as a contributor and maintainer. We cover how Elixir is well suited to orchestrating crawling, how to deal with login pages, understanding the legal concerns, building a codeless scraper, and much more!
Show Notes online - http://podcast.thinkingelixir.com/31
Elixir Community News
- https://dashbit.co/blog/ten-years-ish-of-elixir – January 9th marked 10 years since the first commit to the Elixir repository
- https://github.com/elixir-lang/elixir/commit/337c3f2d569a42ebd5fcab6fef18c5e012f9be5b – First commit on the repository
- https://twitter.com/josevalim/status/1349010127270129670 – Jose Valim reveals that his secret project is called 'Nx'
- https://remote.com/blog/welcoming-elixir-creator-jose-valim – Jose Valim joins Remote as a Technical Advisor
- https://twitter.com/josevalim/status/1347858475267854336 – ExUnit will catch the SIGQUIT signal from CTRL+\ and show the tests that were running
- https://github.com/elixir-lang/elixir/blob/master/lib/mix/lib/mix/tasks/test.ex#L34 – ExUnit will print how much time the test suite spent on async tests vs sync tests
- https://twitter.com/fhunleth/status/1348092050487570433 – Nerves support on the M1 is looking good
- https://www.youtube.com/playlist?list=PLqj39LCvnOWZl_Pb0Y7wGWijKbTvL4gJg – ElixirConf 2020 videos have all been publicly released!
Do you have some Elixir news to share? Tell us at @ThinkingElixir (https://twitter.com/ThinkingElixir) or email at [email protected]
Discussion Resources
- https://oltarasenko.medium.com/web-scraping-with-elixir-and-crawly-extracting-data-behind-authentication-a52584e9cf13
- https://oltarasenko.medium.com/using-elixir-and-crawly-for-price-monitoring-7364d345fc64 – Using Elixir for price monitoring
- https://hex.pm/packages/crawly
- https://github.com/oltarasenko/crawly
- https://www.erlang-solutions.com/blog/web-scraping-with-elixir.html – Oleg's older web scraping with Elixir article
- https://www.erlang-solutions.com/blog/how-to-build-a-machine-learning-project-in-elixir.html – Building a machine learning project with Elixir, TensorFlow and Crawly
- https://oltarasenko.medium.com/what-is-web-scraping-and-why-you-might-want-to-use-it-a0e4b621f6d0 – What is web scraping, and why you might want to use it?
- https://www.pillowskin.com – Ziinc's project using scraping and aggregation
- https://www.tensorflow.org/
- https://oltarasenko.medium.com/the-unofficial-guide-to-extracting-google-search-results-in-2021-with-elixir-7a6ef80d0f5b
- https://scrapy.org/
- https://github.com/fredwu/crawler
- https://www.eff.org/deeplinks/2019/09/victory-ruling-hiq-v-linkedin-protects-scraping-public-data – EFF legal interpretation of the LinkedIn vs HiQ scraping case
- https://github.com/scrapinghub/splash/
- https://www.joinhoney.com/
- https://hexdocs.pm/crawly/readme.html#quickstart – Crawly quickstart guide
- https://hexdocs.pm/crawly/tutorial.html – Crawly tutorial
- https://github.com/oltarasenko/crawly_ui – Crawly UI project
- http://crawlyui.com/ – Crawly UI project page
- Data is the new gold
- https://t.me/elixir_crawly – Crawly Telegram group
Guest Information
- https://github.com/oltarasenko – Oleg on GitHub
- https://oltarasenko.medium.com/ – Oleg's Blog
- https://twitter.com/tzeyiing – Lee TzeYiing on Twitter
- https://github.com/Ziinc – Lee TzeYiing on GitHub
- https://www.tzeyiing.com – Lee TzeYiing's Blog
Find us online
- Message the show - @ThinkingElixir (https://twitter.com/ThinkingElixir)
- Email the show - [email protected]
- Mark Ericksen - @brainlid (https://twitter.com/brainlid)
- David Bernheisel - @bernheisel (https://twitter.com/bernheisel)
- Cade Ward - @cadebward (https://twitter.com/cadebward)