ūüöÄ Wrangling Space Data

ūüöÄ Wrangling Space Data
  • Space and rockets are cool!
  • AI is cool!
  • Data engineering is cool!

So what could be cooler than combining the them!?

Companies like Glint Solar, DHI GRAS, and Lumen Energy use satellite images combined with artificial intelligence to make a positive impact on the world.

Using  Convolutional Neural Networks (CNNs) and other techniques, they can  evaluate if a location is suitable for a solar panel installation, even  if the site is on the water in the case of Glint Solar! DHI GRAS has a  wide range of remote sensing capabilities applied on wind farms, urban  planning, and much more.

ūüõįÔłŹ Bigearth.net and Sentinel 2

Bigearth.net is  a dataset that contains pre-labeled satellite images from the Sentinel 2  satellite and is freely available. Kudos to G. Sumbul, M. Charfuelan,  B. Demir, and V. Markl for the work they put into this!

Sentinel 2 takes photos of the earth in 13 spectral bands.  Each of the bands has properties that can be more or less relevant for a  given application. One such application is measuring the amount of  vegetation. In this case, the sensors that capture wavelengths between  704.1 and 782.8 nanometers are suitable.

Bigearth.net is available as a TensorFlow dataset. But you only have the choice of getting the red, green, blue bands; or all the bands.

Out  of nerdy curiosity, I decided to build an Apache Beam pipeline that  would allow the user to create a TFRecord data set with an arbitrary  combination of bands.

ūüŹÜ The Goal

A data pipeline  generally has one or more systems that consume the results it produces. I  decided that the destination of the data should be a binary  classification CNN. I wanted the model to have a practical application  and settled on a target CNN that can differentiate between images with  or without water in them.

A model that can find water in satellite  images is applicable for both floating solar and wind farms. Maybe even  to find water on other heavenly bodies!?

ūüĆć Prepping the Data

Before  the actual data-wrangling could begin, I needed to make it available  for the pipeline. I did that with the following steps:

  1. Download the dataset
  2. Unpack and upload into a Google Could Storage Bucket
  3. Create a BigQuery table with the image locations and metadata

I downloaded the full dataset to my laptop and streamed the decompressed content to my bucket.

I also used Apache Beam to get the data into BigQuery,  The gist of it looks like this:

Now  I could query the image names, acquisition dates, and labels to my  heart's content. One main reason for using BigQuery is making  re-creatable datasets using the FARM_FINGERPRINT hash function.

Example of the data in the BigQuery table.

ūüē≥ÔłŹ The pipeline

Before  the CNN can learn from the satellite images, we have to transform them  into a suitable format. They also need to be labeled in a way that makes  sense for the application of the resulting model. Using the  pre-existing labels, I derived a new binary label: `has_water` by making  two views:


  labels FROM {TABLE}
WHERE ('Water bodies' in UNNEST(labels) OR 'Sea and ocean' in UNNEST(labels))


  labels FROM {TABLE}
WHERE NOT ('Water bodies' in UNNEST(labels) OR 'Sea and ocean' in UNNEST(labels))

ūüí™ Flex Template

To make the pipeline as reusable as possible, I implemented it as a Dataflow Flex Template.

Flex  Templates are pipelines wrapped in a Docker container. The pipeline can  be invoked with the `gcloud dataflow flex-template run` command. Custom  parameters can be passed by using the `--parameters` flag.

Running  the following will create a dataset with a 9:1 training/evaluation  split containing the red, green, and blue spectral bands.

gcloud dataflow flex-template run "sentinel-2-tf-records`date +%Y%m%d-%H%M%S`" \
--template-file-gcs-location "$TEMPLATE_PATH" \
--region "europe-west3" \ 
--parameters ^~^output_dir="$OUTPUT_BUCKET"~split="9,1"~bands="B02,B03,B04"

ūüôŹ Thanks for reading