HackerNews new jobs

A website that provides clear, effortless and deep insights into fresh and recurring job opportunities in the Hacker News "Who is hiring" threads.

For example, it is useful if you are looking for a job and want to focus on fresh opportunities, or if you want to research the ad history of a specific company.

You are welcome to use, self-host, fork and modify, and contribute to this project.

Demo

https://hackernews-new-jobs.arm1.nemanjamitic.com

Screenshots

hackernews-new-jobs-2x-50-percent.mp4

Features

  • Separate companies for every month in history into: 1. First time companies (first job ad ever), 2. New companies (no job ad in the previous month), 3. Old companies (had a job ad in the previous month)
  • Display the separated companies in a multi-line chart
  • Job ad posts are linked and sorted by original creation time
  • For every company in a month, find every job ad in history, link them and sort them chronologically
  • Display companies grouped by number of job ads in the previous 12 months in a bar chart
  • Search job ads from a company
  • Cron job for parsing a new month, running every day between the 2nd and 15th of every month
  • Cron job that respects Algolia API rate limiting to parse and seed the database for the entire history (since 2015-06, when post titles adopted uniform | separated formatting)
  • SQLite database for fast querying and Keyv caching for fast reads and page loads
  • HTTP response caching with Keyv to reduce load on the Algolia API and mitigate rate limiting
  • Docker deployments with x86 and ARM images, from a local machine and GitHub Actions
  • Next.js App Router app with SSR and Shadcn components
  • Responsive design and dark theme support
  • Plausible analytics
  • Winston logging
  • Clean, separated types, constants, utils and config
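
The three company categories from the first feature can be expressed compactly. The following is a minimal sketch, assuming an in-memory map from month name to the set of companies that posted that month and no gaps between months. The function classifyCompanies and its input shape are hypothetical; the app itself works with SQL queries.

```typescript
// Hypothetical in-memory model: month name ('YYYY-MM') -> companies that posted that month.
type MonthlyCompanies = Map<string, Set<string>>;

interface Classified {
  firstTime: string[]; // never posted in any earlier month
  newCompanies: string[]; // posted before, but not in the previous month
  old: string[]; // also posted in the previous month
}

function classifyCompanies(byMonth: MonthlyCompanies, month: string): Classified {
  // 'YYYY-MM' strings sort chronologically with a plain string sort.
  const months = [...byMonth.keys()].sort();
  const index = months.indexOf(month);
  if (index === -1) throw new Error(`Unknown month: ${month}`);

  // Assumption: the entry right before `month` is the previous calendar month.
  const previousName = months[index - 1];
  const previous =
    previousName !== undefined ? byMonth.get(previousName) ?? new Set<string>() : new Set<string>();

  // Every company seen in any earlier month.
  const seenBefore = new Set<string>();
  for (const earlier of months.slice(0, index)) {
    for (const company of byMonth.get(earlier) ?? []) seenBefore.add(company);
  }

  const result: Classified = { firstTime: [], newCompanies: [], old: [] };
  for (const company of [...(byMonth.get(month) ?? [])].sort()) {
    if (previous.has(company)) result.old.push(company);
    else if (seenBefore.has(company)) result.newCompanies.push(company);
    else result.firstTime.push(company);
  }
  return result;
}
```

A company that posted in September and November but not in October would land in newCompanies for November.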

Installation and running

node -v
# v22.9.0

# set environment variables
cp .env.example .env

# install dependencies
yarn install

# run in dev mode, visit http://localhost:3000/
yarn dev

# build
yarn build

# run in prod mode
yarn standalone

# or
yarn cp
node .next/standalone/server.js

# Docker

# build image
yarn dc:build

# run
yarn dc:up

Deployment

Environment variables

Create .env file.

cp .env.example .env

Set environment variables.

# .env

# for metadata
SITE_HOSTNAME=my-domain.com

# protect endpoints
API_SECRET=long-string

# plausible analytics
PLAUSIBLE_SERVER_URL=https://plausible.my-domain.com
PLAUSIBLE_DOMAIN=my-domain.com

GitHub Actions

Set the following GitHub secrets and use the build-push-docker.yml and deploy-docker.yml GitHub Actions workflows to build, push and deploy the Docker image to the remote server.

DOCKER_USERNAME
DOCKER_PASSWORD

REMOTE_HOST
REMOTE_USERNAME
REMOTE_KEY_ED25519
REMOTE_PORT

Copy the seeded database to the remote server.

# replace <vars>
scp ./data/database/hn-new-jobs-database-dev.sqlite3 <user@server>:~/<your-path-on-server>/hn-new-jobs/data/database/hn-new-jobs-database-prod.sqlite3

# e.g.
scp ./data/database/hn-new-jobs-database-dev.sqlite3 arm1:~/traefik-proxy/apps/hn-new-jobs/data/database/hn-new-jobs-database-prod.sqlite3

Deploy from local machine

# build and push x86 and arm images
yarn docker:build:push:all

# edit package.json script, replace <vars>
"deploy:docker:pi": "bash scripts/deploy-docker.sh <user@server> '~/<your-path-on-server>/hn-new-jobs' <your-image-name:latest>"

# deploy to remote server
yarn deploy:docker:pi

Implementation details

Data source

Initially the app scraped the original https://news.ycombinator.com thread pages, and the docs/old-code/scraper folder still contains the complete scraper implementation. It uses Axios to fetch HTML, but I soon discovered that their server has very strict rate limiting, and the scraper was getting a lot of unpredictable ETIMEDOUT and ECONNRESET abrupt connection terminations. It is easy to distinguish Axios from a browser, so I considered moving to Playwright until I discovered the Algolia API https://hn.algolia.com/api.

As stated in the docs, it has a generous rate limit of 10000 requests/hour, which is about 2.78 requests/second. Although in practice I experienced stricter rate limits than this, it is still more than enough to easily collect the data, especially since you can receive up to 1000 items in a single request. This means pagination is almost never needed, although it is implemented: "Who is hiring" threads have up to 700 comments, so an entire thread can be fetched with a single API request.
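
The arithmetic above and the shape of a single-request fetch can be sketched as follows. This is a minimal illustration, not the code from this repository: the tags and hitsPerPage parameters follow the public HN Search API docs, while the function names and the story id used in the usage note are made up for the example.

```typescript
// Documented limit quoted above: 10000 requests per hour.
const REQUESTS_PER_HOUR = 10_000;
const MS_PER_HOUR = 60 * 60 * 1000;

// Minimum spacing between requests that stays within the documented limit.
const minDelayMs = MS_PER_HOUR / REQUESTS_PER_HOUR; // 360 ms, about 2.78 requests/second

// Maximum page size mentioned above.
const HITS_PER_PAGE = 1000;

// Builds a search URL for the comments of one thread (story).
// The function name is illustrative, not taken from the repository.
function buildThreadCommentsUrl(storyId: number, page = 0): string {
  const url = new URL("https://hn.algolia.com/api/v1/search");
  url.searchParams.set("tags", `comment,story_${storyId}`);
  url.searchParams.set("hitsPerPage", String(HITS_PER_PAGE));
  url.searchParams.set("page", String(page));
  return url.toString();
}

// Number of requests needed for a thread with the given comment count.
function pagesNeeded(commentCount: number): number {
  return Math.max(1, Math.ceil(commentCount / HITS_PER_PAGE));
}
```

With 700 comments, pagesNeeded(700) evaluates to 1, which is why a whole thread usually fits into one call such as buildThreadCommentsUrl(123).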

Company names are extracted from a comment title with a simple Regex on the | character. This posting convention was enforced around 2015, so the database is seeded starting from the month '2015-06'. You can see all of this in constants/algolia.ts.
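
A minimal sketch of this kind of extraction, assuming titles follow the "Company | Role | Location" convention. The regex and the function name below are illustrative and may differ from what constants/algolia.ts actually contains.

```typescript
// Assumed title convention: "Company | Role | Location | ...".
// Captures the text before the first `|` on the first line, without surrounding whitespace.
const COMPANY_NAME_REGEX = /^\s*([^|\n]+?)\s*\|/;

// Returns the company name, or null when the comment does not follow the convention.
function extractCompanyName(commentText: string): string | null {
  const match = COMPANY_NAME_REGEX.exec(commentText);
  return match?.[1] ?? null;
}
```

Comments without a | separator return null, so they can be skipped instead of polluting the company list.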

Database

Threads are modeled with Thread and Comment models, although they are named Month and Company because those words make more sense in the context of the app logic. The Company table is both Company and Comment (job ad) at the same time, which is why, for example, a self join is used to extract all ads for a company. The Month primary key is a name string in 'YYYY-MM' format, which is sortable.
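
To illustrate what the self join computes, here is an in-memory equivalent. It is only a sketch: the row shape and field names are hypothetical (the real columns are defined in the schema file), and the app does this in SQL rather than in TypeScript.

```typescript
// Hypothetical row shape: one row is both a company and one of its job ads.
interface CompanyRow {
  name: string; // company name, shared by all ads of that company
  monthName: string; // 'YYYY-MM', references the month (thread)
  commentId: number; // id of the job ad comment
  createdAt: string; // ISO 8601 timestamp of the comment
}

// For every company that posted in `month`, collect all of its ads across history, oldest first.
function adsByCompany(rows: CompanyRow[], month: string): Map<string, CompanyRow[]> {
  const result = new Map<string, CompanyRow[]>();
  for (const current of rows) {
    if (current.monthName !== month || result.has(current.name)) continue;
    const ads = rows
      .filter((row) => row.name === current.name) // the "join" on company name
      .sort((a, b) => (a.createdAt < b.createdAt ? -1 : a.createdAt > b.createdAt ? 1 : 0));
    result.set(current.name, ads);
  }
  return result;
}
```

Timestamps in the same ISO 8601 format compare correctly as plain strings, for the same reason the 'YYYY-MM' month key is sortable.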

An important implementation detail is that the database connection is created through a Singleton factory function and in this way removed from the global scope. This allows the Next.js app to be built without requiring a database connection at build time, which significantly simplifies the build process. See this Github discussion. The same Singleton factory in utils/singleton.ts is reused for both Keyv and Axios instances.
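
The idea can be sketched in a few lines. This is a generic illustration of a lazy Singleton factory under assumed names, not a copy of utils/singleton.ts; the fake connection stands in for a real database handle.

```typescript
// Wraps a factory so the instance is created lazily, on first use, and then reused.
function createSingleton<T>(factory: () => T): () => T {
  let instance: T | undefined;
  let created = false;
  return () => {
    if (!created) {
      instance = factory();
      created = true;
    }
    return instance as T;
  };
}

// Illustrative stand-in for a database connection.
interface FakeConnection {
  id: number;
}

let connectionsOpened = 0;

// Nothing is opened here, at module load (build) time; only the getter is defined.
const getDb = createSingleton<FakeConnection>(() => {
  connectionsOpened += 1;
  return { id: connectionsOpened };
});
```

Because the connection is opened only when getDb() is first called at request time, importing the module during a build does not touch the database.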

You can check the schema in modules/database/schema.ts.

The database has just a single insert query, run when a new month is parsed by the scheduled cron job or by calling the seeding job. The seeding job is mostly not needed because the seeded database for the range 2015-06 to 2024-11 is provided and committed to Git in the data/database/backup folder.

There are a number of select queries inside modules/database/select to extract all the wanted data, some of them more complex. I decided not to use an ORM this time and to stick to a lightweight solution with the better-sqlite3 library.

Database ER diagram

Caching

As stated in the previous section, the database is predominantly read-only, which allowed for easy caching of query responses. This improved SSR page load time from 1200 ms to 400 ms. There is just a single invalidation event, when a new month is parsed; you can see it in modules/parser/calls.ts.

The Keyv library with KeyvFile is used for caching both HTTP requests and database queries into .json files. Invalidating a cache entry does not seem to remove it from the .json file, so disk space is not reclaimed. This requires more research and is added to the Todo list.

Inside libs/keyv.ts there is a cacheDatabaseWrapper() function that accepts a database query function and returns a cached version of it.
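
A wrapper of this kind could look roughly like the sketch below. To keep the example self-contained it swaps in a synchronous in-memory Map for the file-backed, asynchronous Keyv store, so the names and signatures here are assumptions rather than the actual contents of libs/keyv.ts.

```typescript
// In-memory stand-in for the Keyv store (the real store is asynchronous and file-backed).
const cacheStore = new Map<string, unknown>();

// Accepts a query function and returns a version that caches results per argument list.
// `key` namespaces the entries so that different queries do not collide.
function cacheDatabaseWrapperSketch<Args extends unknown[], Result>(
  key: string,
  query: (...args: Args) => Result
): (...args: Args) => Result {
  return (...args: Args): Result => {
    const cacheKey = `${key}:${JSON.stringify(args)}`;
    if (cacheStore.has(cacheKey)) {
      return cacheStore.get(cacheKey) as Result;
    }
    const result = query(...args);
    cacheStore.set(cacheKey, result);
    return result;
  };
}

// The single invalidation event: drop all entries after a new month is parsed.
function invalidateCache(): void {
  cacheStore.clear();
}
```

Since the data changes only when a new month is parsed, clearing the whole store at that point is enough and no per-entry expiry is needed.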

Keyv instances are also removed from the global scope with the Singleton factory function.

Scheduler

The scheduler is needed to parse a new month every day between the 2nd and 15th of every month at 9:00 am Belgrade time ('0 9 2-15 * *'). There is additional logic in libs/datetime.ts to skip the first two days if they fall on a weekend.
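
The schedule and the weekend rule can be sketched as below. Two assumptions are made loudly here: "the first two days" is read as the 2nd and 3rd of the month, and UTC is used instead of Belgrade time to keep the example deterministic. The function names are illustrative and the real logic in libs/datetime.ts may differ.

```typescript
// Cron expression from the text: 09:00 every day from the 2nd to the 15th of each month.
const PARSE_NEW_MONTH_CRON = "0 9 2-15 * *";

function isWeekend(date: Date): boolean {
  const day = date.getUTCDay(); // 0 = Sunday, 6 = Saturday
  return day === 0 || day === 6;
}

// Assumption: "first two days" means the 2nd and 3rd, the first two days of the cron window.
function shouldParseNewMonth(date: Date): boolean {
  const dayOfMonth = date.getUTCDate();
  if (dayOfMonth < 2 || dayOfMonth > 15) return false; // outside the cron window
  const isFirstTwoDays = dayOfMonth === 2 || dayOfMonth === 3;
  return !(isFirstTwoDays && isWeekend(date));
}
```

For example, 2 November 2024 was a Saturday, so under these assumptions the first parsing run of that month would happen on Monday the 4th.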

The scheduler is also useful for seeding the database for the entire history while avoiding Algolia rate limits.

The initial idea was to add a cron task inside the Docker image itself. After careful research I discovered that the crond daemon must run as the root user or it will produce a setpgid: Operation not permitted error, see this Github issue. This was unacceptable because the Next.js app needs to run as a non-root user for security reasons and for easier management of file permissions in Docker bind mount volumes (database, cache, log files).

The alternative requires exposing the scripts as API endpoints and running the crond daemon in a separate Docker image, which is impractical and greatly complicates deployment. There are also 3rd party binaries to run scheduled tasks in Docker, mostly written in Go, such as aptible/supercronic.

Fortunately there are also a few Node.js packages to run scheduled tasks. I picked node-cron/node-cron for simplicity and practical reasons.

Scheduled task definitions are in modules/scheduler/main.ts. An important note is that they are dynamically imported and invoked in instrumentation.ts, which guarantees that they are scheduled exactly once and there are no duplicate tasks. See the Next.js docs about this.

The scripts themselves are also exposed (and left unused) as API endpoints in app/api/parser/[script]/route.ts. They can also be executed manually as tsx scripts in dev mode from modules/parser/main.ts.

Server side rendering

Currently there are 115 months in history, and it is not practical to prerender that many pages. For performance and SEO reasons, server-side rendering is preferred to client-side rendering. The month is used as a dynamic param and validated by Regex and a database lookup, then the cached database query is read and the page is generated on the server at request time. This is true for all 3 pages. Scroll position is stored in localStorage on the Select control's onChange event, then read, restored and cleared on page load. For the search page, the query param is passed as a dynamic route param, as described in the Next.js docs.
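
The two-step validation of the month param can be sketched as follows. The regex and function name are illustrative, and a Set of known months stands in for the actual database lookup.

```typescript
// 'YYYY-MM' with a month between 01 and 12.
const MONTH_PARAM_REGEX = /^\d{4}-(0[1-9]|1[0-2])$/;

// `knownMonths` stands in for the database lookup of already parsed months.
function isValidMonthParam(param: string, knownMonths: ReadonlySet<string>): boolean {
  // Cheap format check first, so malformed input never reaches the lookup.
  return MONTH_PARAM_REGEX.test(param) && knownMonths.has(param);
}
```

A page handler would typically respond with a not found page when this check fails, instead of running the query with unchecked input.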

Docker

There are configurations provided to build Docker images both locally and in GitHub Actions, for both linux/amd64 and linux/arm64 platforms. There are also a dev docker-compose.yml and a prod docker-compose.prod.yml file. Two important things to note:

  • The data folder, which contains the files for the SQLite database, cache and logs, is exposed to Docker as a bind mount. To avoid file permission issues, the current user is passed in both the dev and prod docker-compose.yml.
  • There are two environment variables required at build time: SITE_HOSTNAME for Next.js metadata and PLAUSIBLE_SERVER_URL for the custom Plausible domain. Both are passed into the Dockerfile as build args, and you need to provide them whenever you build Docker images (GitHub Actions, docker-compose.yml, yarn scripts). If you don't, the app will still run but you will have broken SEO and analytics.

Plausible analytics

It uses the 4lejandrito/next-plausible package, whose configuration is documented in its Readme file. The provider is configured in components/base-head.tsx and the API proxy in next.config.mjs. It requires two environment variables: PLAUSIBLE_DOMAIN for your website domain and PLAUSIBLE_SERVER_URL for your self-hosted Plausible instance. The proxy is needed so that script.js is loaded from your own domain instead of the remote Plausible server, which helps mitigate adblockers. An important note is that the variable used in next.config.mjs needs to be provided at build time.

Logging

Winston is configured in libs/winston.ts. The dev logger logs to the console, and the prod logger logs to both the console and an html file that should be publicly available. The prod logger is disabled because so far I haven't found a simple, elegant way to rotate a single log file, a simple and obvious feature that is somehow unavailable in Winston. Currently logging is used only for the scheduled parsing of a new month, and more logging should be added.

A separate function in utils/pretty-print.ts is used to log the initial configuration and environment variables on app start.

Todo

  • Handle not found exceptions in database select queries
  • Clear Keyv cache files, not just invalidate
  • Winston rotate single file
  • Plausible incorrectly detects server's location as user's country in some cases

License

MIT license: License